Issue37224
Created on 2019-06-11 00:58 by vstinner, last changed 2019-12-01 04:26 by aeros.
| Pull Requests | |||
|---|---|---|---|
| URL | Status | Linked | Edit |
| PR 16293 | open | eric.snow, 2019-11-22 23:56 | |
| Messages (30) | |||
|---|---|---|---|
| msg345162 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2019-06-11 00:58 | |
https://buildbot.python.org/all/#/builders/224/builds/5 0:46:02 load avg: 1.53 [259/423/1] test__xxsubinterpreters failed -- running: test_httplib (1 min 6 sec), test_mmap (16 min 42 sec) beginning 6 repetitions 123456 ..Exception in thread Thread-28: Traceback (most recent call last): File "D:\buildarea\3.8.ware-win81-release.refleak\build\lib\threading.py", line 923, in _bootstrap_inner self.run() File "D:\buildarea\3.8.ware-win81-release.refleak\build\lib\threading.py", line 865, in run self._target(*self._args, **self._kwargs) File "D:\buildarea\3.8.ware-win81-release.refleak\build\lib\test\test__xxsubinterpreters.py", line 51, in run interpreters.run_string(interp, dedent(f""" RuntimeError: unrecognized interpreter ID 182 test test__xxsubinterpreters failed -- Traceback (most recent call last): File "D:\buildarea\3.8.ware-win81-release.refleak\build\lib\test\test__xxsubinterpreters.py", line 492, in test_subinterpreter self.assertTrue(interpreters.is_running(interp)) AssertionError: False is not true (...) FAIL: test_still_running (test.test__xxsubinterpreters.DestroyTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "D:\buildarea\3.8.ware-win81-release.refleak\build\lib\test\test__xxsubinterpreters.py", line 765, in test_still_running interpreters.destroy(interp) AssertionError: RuntimeError not raised |
|||
| msg345163 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2019-06-11 00:58 | |
See bpo-33356. |
|||
| msg345219 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2019-06-11 10:55 | |
Fail on x86 Gentoo Refleaks 3.x: https://buildbot.python.org/all/#/builders/1/builds/618 3:15:59 load avg: 6.93 [339/423/1] test__xxsubinterpreters failed -- running: test_multiprocessing_spawn (19 min 36 sec) beginning 6 repetitions 123456 .Exception in thread Thread-17: Traceback (most recent call last): File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/threading.py", line 923, in _bootstrap_inner self.run() File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/threading.py", line 865, in run self._target(*self._args, **self._kwargs) File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/test__xxsubinterpreters.py", line 51, in run interpreters.run_string(interp, dedent(f""" RuntimeError: unrecognized interpreter ID 108 test test__xxsubinterpreters failed -- Traceback (most recent call last): File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/test__xxsubinterpreters.py", line 765, in test_still_running interpreters.destroy(interp) AssertionError: RuntimeError not raised (...) test_after_destroying (test.test__xxsubinterpreters.ListAllTests) ... ok test_initial (test.test__xxsubinterpreters.ListAllTests) ... ok test_SystemExit (test.test__xxsubinterpreters.RunStringTests) ... ok test_already_running (test.test__xxsubinterpreters.RunStringTests) ... spam Exception in thread Thread-31: Traceback (most recent call last): File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/threading.py", line 923, in _bootstrap_inner self.run() File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/threading.py", line 865, in run self._target(*self._args, **self._kwargs) File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/test__xxsubinterpreters.py", line 51, in run interpreters.run_string(interp, dedent(f""" RuntimeError: unrecognized interpreter ID 190 FAIL test_bad_id (test.test__xxsubinterpreters.RunStringTests) ... ok test_bad_script (test.test__xxsubinterpreters.RunStringTests) ... ok test_bytes_for_script (test.test__xxsubinterpreters.RunStringTests) ... ok test_create_thread (test.test__xxsubinterpreters.RunStringTests) ... ok test_does_not_exist (test.test__xxsubinterpreters.RunStringTests) ... ok test_error_id (test.test__xxsubinterpreters.RunStringTests) ... ok test_execution_namespace_is_main (test.test__xxsubinterpreters.RunStringTests) ... ok test_failure (test.test__xxsubinterpreters.RunStringTests) ... ok test_fork (test.test__xxsubinterpreters.RunStringTests) ... ok test_in_thread (test.test__xxsubinterpreters.RunStringTests) ... ok test_invalid_syntax (test.test__xxsubinterpreters.RunStringTests) ... ok test_main_reused (test.test__xxsubinterpreters.RunStringTests) ... ok test_shared_overwrites (test.test__xxsubinterpreters.RunStringTests) ... ok test_shared_overwrites_default_vars (test.test__xxsubinterpreters.RunStringTests) ... ok test_still_running_at_exit (test.test__xxsubinterpreters.RunStringTests) ... skipped 'blocking forever' test_success (test.test__xxsubinterpreters.RunStringTests) ... ok test_sys_exit (test.test__xxsubinterpreters.RunStringTests) ... ok test_with_shared (test.test__xxsubinterpreters.RunStringTests) ... ok test_bytes (test.test__xxsubinterpreters.ShareableTypeTests) ... ok test_int (test.test__xxsubinterpreters.ShareableTypeTests) ... ok test_non_shareable_int (test.test__xxsubinterpreters.ShareableTypeTests) ... ok test_singletons (test.test__xxsubinterpreters.ShareableTypeTests) ... ok test_types (test.test__xxsubinterpreters.ShareableTypeTests) ... ok ====================================================================== FAIL: test_already_running (test.test__xxsubinterpreters.RunStringTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.refleak/build/Lib/test/test__xxsubinterpreters.py", line 855, in test_already_running interpreters.run_string(self.id, 'print("spam")') AssertionError: RuntimeError not raised |
|||
| msg345415 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2019-06-12 21:52 | |
Failure on AppVeyor on my PR 14029 (which is unrelated): test_subinterpreter (test.test__xxsubinterpreters.IsRunningTests) ... Exception in thread Thread-8: Traceback (most recent call last): File "C:\projects\cpython\lib\threading.py", line 923, in _bootstrap_inner self.run() File "C:\projects\cpython\lib\threading.py", line 865, in run self._target(*self._args, **self._kwargs) File "C:\projects\cpython\lib\test\test__xxsubinterpreters.py", line 51, in run interpreters.run_string(interp, dedent(f""" RuntimeError: unrecognized interpreter ID 46 FAIL FAIL: test_subinterpreter (test.test__xxsubinterpreters.IsRunningTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\projects\cpython\lib\test\test__xxsubinterpreters.py", line 492, in test_subinterpreter self.assertTrue(interpreters.is_running(interp)) AssertionError: False is not true test__xxsubinterpreters passed when run again. |
|||
| msg347486 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2019-07-08 07:46 | |
The test still fails times to times: https://buildbot.python.org/all/#/builders/80/builds/644 FAIL: test_subinterpreter (test.test__xxsubinterpreters.IsRunningTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "D:\buildarea\3.x.ware-win81-release.refleak\build\lib\test\test__xxsubinterpreters.py", line 492, in test_subinterpreter self.assertTrue(interpreters.is_running(interp)) AssertionError: False is not true |
|||
| msg352508 - (view) | Author: Kyle Stanley (aeros) * ![]() |
Date: 2019-09-16 05:15 | |
Upon digging through Modules/_xxsubinterpretersmodule.c, I noticed that on line 2059, `PyInterpreterState_Delete(interp);` is commented out (https://github.com/python/cpython/blob/bf169915ecdd42329726104278eb723a7dda2736/Modules/_xxsubinterpretersmodule.c#L2059). This was done when _xxsubinterpretersmodule.c was first added by Eric Snow (https://github.com/python/cpython/blob/bf169915ecdd42329726104278eb723a7dda2736/Modules/_xxsubinterpretersmodule.c#L2059), so it seems to have been done intentionally but I don't understand why. Is this because `Py_EndInterpreter()` is supposed to shutdown the interpreter, so `PyInterpreterState_Delete()` isn't needed? If so, that still doesn't particularly explain why it was commented out. Perhaps Eric can elaborate. |
|||
| msg352509 - (view) | Author: Kyle Stanley (aeros) * ![]() |
Date: 2019-09-16 05:17 | |
Note that I'm not particularly experienced with the c-api, I'm just trying to take a stab at understanding why the buildbot failure is occuring. |
|||
| msg352510 - (view) | Author: Kyle Stanley (aeros) * ![]() |
Date: 2019-09-16 05:20 | |
Clarification: In the above comment when I was referring to the commit made by Eric Snow, I meant to link to https://github.com/python/cpython/commit/7f8bfc9b9a8381ddb768421b5dd5cbd970266190. |
|||
| msg352514 - (view) | Author: Kyle Stanley (aeros) * ![]() |
Date: 2019-09-16 06:48 | |
Upon further reading of the documentation and some experimentation, it would definitely not make sense to call `PyInterpreterState_Delete` here (since we're only closing the sub-interpreter in the current thread), so that's not the issue. I still have no idea as to why it's commented out there, but that's besides the point.
I think I may have found the potential cause of the test failure though. Looking into `test_still_running()`, it's clear that the intention is for a RuntimeError to be raised if `interpreters.destroy()` is called on a currently running interpreter:
```
def test_still_running(self):
main, = interpreters.list_all()
interp = interpreters.create()
with _running(interp):
with self.assertRaises(RuntimeError):
interpreters.destroy(interp)
self.assertTrue(interpreters.is_running(interp))
```
However, within interp_destroy (https://github.com/python/cpython/blob/56a45142e70a1ccf3233d43cb60c47255252e89a/Modules/_xxsubinterpretersmodule.c#L2024), it is apparent that the RuntimeError is ONLY raised when attempting to destroy the current interpreter:
```
if (interp == current) {
PyErr_SetString(PyExc_RuntimeError,
"cannot destroy the current interpreter");
return NULL;
}
```
When attempting to destroy a running interpreter, NULL is returned without raising the RuntimeError:
```
if (_ensure_not_running(interp) < 0) {
return NULL;
}
```
This was my first guess at a solution:
```
if (_ensure_not_running(interp) < 0) {
PyErr_SetString(PyExc_RuntimeError,
"cannot destroy a running interpreter")
return NULL;
}
```
But, within `_ensure_not_running()` (https://github.com/python/cpython/blob/56a45142e70a1ccf3233d43cb60c47255252e89a/Modules/_xxsubinterpretersmodule.c#L1842), a RuntimeError is supposed to be raised from:
```
if (is_running) {
PyErr_Format(PyExc_RuntimeError, "interpreter already running");
return -1;
}
```
Initially, I was unsure of the difference between `PyErr_SetString()` and `PyErr_Format()`, so I referred to the documentation. `PyErr_Format()` is similar, but it also returns NULL. If I'm not mistaken doesn't mean that the `return -1` within the `if (is_running)` bock is effectively ignored? If so, this would potentially explain the problem...
I think the appropriate solution would be to replace `PyErr_Format()` with `PyErr_SetString()` within `_ensure_not_running()`. Also, I think it would be more useful to additionally raise the RuntimeError within `interp_destroy()` if the interpreter is running, to provide a more useful and specific error message. "cannot destroy a running interpreter" is far more useful for debugging purposes than a more generic "interpreter already running".
I plan on opening a PR to make these changes in the next few days. Let me know what you think Victor.
|
|||
| msg352516 - (view) | Author: Kyle Stanley (aeros) * ![]() |
Date: 2019-09-16 06:52 | |
Clarification: "If I'm not mistaken doesn't mean that the `return -1` within ..." Should instead be: "If I'm not mistaken doesn't this mean that the `return -1` within ..." (I am really looking forward to moving issue over to GitHub at some point. It's nice to be able to edit comments and properly format code blocks...) |
|||
| msg352592 - (view) | Author: Kyle Stanley (aeros) * ![]() |
Date: 2019-09-17 00:11 | |
Upon further consideration, I don't think this will address the issue. If the RuntimeError was not being raised, this failure would be consistent rather than intermittent. I think I have have gotten a bit mixed up, even if the return value of PyErr_Format is NULL, it would not implicitly return NULL upon being called and exit the function early (like a macro could) . I'm not experienced with programming in C, I only started learning it more recently (Python, Java, and C# have been my primary languages) when I started contributing to CPython in June. Instead, I suspect this is likely a concurrency issue, where multiple threads are trying to access the same sub-interpreter during ``interp_destroy()``. The solution will likely involve finding the correct place for a lock. I'll continue to work on this and see if I can find a solution. |
|||
| msg352835 - (view) | Author: Kyle Stanley (aeros) * ![]() |
Date: 2019-09-20 05:03 | |
Is there a currently reliable way of accessing the GIL functions within the sub-interpreters, without causing deadlock issues? I was trying to follow the advice in the documentation (https://docs.python.org/3/c-api/init.html?highlight=global%20interpreter%20lock#bugs-and-caveats). "It is highly recommended that you don’t switch sub-interpreters between a pair of matching PyGILState_Ensure() and PyGILState_Release() calls." But it seemed that any attempt to use any of the PyGIL* calls within ``interp_destroy()`` in a meaningful way resulted in a deadlock, even if it was done away from the sub-interpreter switching. My next idea would be to add a conditional check to see if the current thread has ownership of the GIL, and using ``PyEval_RestoreThread()`` to acquire it if it doesn't. This would be followed by releasing the GIL with ``PyThreadState_Get()`` at the end of the function. I'll try experimenting with that idea next. |
|||
| msg354658 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2019-10-14 21:24 | |
I marked bpo-33356 as a duplicate of this issue: * Windows10 3.x: test_already_running() failed once, but passed when run again: ... * AMD64 FreeBSD 10-STABLE Non-Debug 3.x: ... |
|||
| msg354659 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2019-10-14 21:26 | |
I closed bpo-38154 as a duplicate of this issue: * AMD64 FreeBSD CURRENT Shared 3.x: ... Kyle Stanley proposed a fix: PR 16293. |
|||
| msg354661 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2019-10-14 21:27 | |
AMD64 FreeBSD CURRENT Shared 3.x: https://buildbot.python.org/all/#/builders/168/builds/1630 FAIL: test_still_running (test.test__xxsubinterpreters.DestroyTests) 1:07:09 load avg: 8.18 Re-running failed tests in verbose mode 1:07:09 load avg: 8.18 Re-running test__xxsubinterpreters in verbose mode FAIL: test_still_running (test.test__xxsubinterpreters.DestroyTests) FAIL: test_subinterpreter (test.test__xxsubinterpreters.IsRunningTests) FAIL: test_already_running (test.test__xxsubinterpreters.RunStringTests) |
|||
| msg354662 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2019-10-14 21:28 | |
AMD64 FreeBSD CURRENT Shared 3.8: https://buildbot.python.org/all/#/builders/212/builds/310 FAIL: test_still_running (test.test__xxsubinterpreters.DestroyTests) FAIL: test_run_coroutine_threadsafe_with_timeout (test.test_asyncio.test_tasks.RunCoroutineThreadsafeTests) 1:01:44 load avg: 7.72 Re-running failed tests in verbose mode 1:01:44 load avg: 7.72 Re-running test__xxsubinterpreters in verbose mode FAIL: test_still_running (test.test__xxsubinterpreters.DestroyTests) 1:02:01 load avg: 7.58 Re-running test_asyncio in verbose mode |
|||
| msg354677 - (view) | Author: Kyle Stanley (aeros) * ![]() |
Date: 2019-10-14 23:00 | |
> Kyle Stanley proposed a fix: PR 16293. Note that I'm not confident about the fix I proposed in GH-16293, it was more of an idea to fix the intermittent failures more than anything. It definitely needs review from someone knowledgeable about sub-interpreters and the PyInterpreterState API. This isn't an area that I'm experienced in, but I am interested in it. |
|||
| msg354876 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2019-10-18 11:06 | |
The race condition can be reproduced on Linux using a lot of parallel test worker processes. For example, I reproduce it on my laptop (8 CPUs) using: $ ./python -m test test__xxsubinterpreters -F -j30 ... 0:00:26 load avg: 17.86 [ 24] test__xxsubinterpreters passed 0:00:26 load avg: 17.86 [ 25] test__xxsubinterpreters passed 0:00:26 load avg: 17.86 [ 26] test__xxsubinterpreters passed 0:00:27 load avg: 17.86 [ 27/1] test__xxsubinterpreters failed Exception in thread Thread-8: Traceback (most recent call last): File "/home/vstinner/python/master/Lib/threading.py", line 944, in _bootstrap_inner self.run() File "/home/vstinner/python/master/Lib/threading.py", line 882, in run self._target(*self._args, **self._kwargs) File "/home/vstinner/python/master/Lib/test/test__xxsubinterpreters.py", line 51, in run interpreters.run_string(interp, dedent(f""" RuntimeError: unrecognized interpreter ID 46 test test__xxsubinterpreters failed -- Traceback (most recent call last): File "/home/vstinner/python/master/Lib/test/test__xxsubinterpreters.py", line 492, in test_subinterpreter self.assertTrue(interpreters.is_running(interp)) AssertionError: False is not true |
|||
| msg355221 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2019-10-23 14:42 | |
AMD64 FreeBSD Shared 3.x: https://buildbot.python.org/all/#/builders/371/builds/7 test_subinterpreter (test.test__xxsubinterpreters.IsRunningTests) ... Exception in thread Thread-8: Traceback (most recent call last): File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/threading.py", line 944, in _bootstrap_inner self.run() File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/threading.py", line 882, in run self._target(*self._args, **self._kwargs) File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/test/test__xxsubinterpreters.py", line 51, in run interpreters.run_string(interp, dedent(f""" RuntimeError: unrecognized interpreter ID 46 FAIL test_already_running (test.test__xxsubinterpreters.RunStringTests) ... Exception in thread Thread-9: Traceback (most recent call last): File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/threading.py", line 944, in _bootstrap_inner self.run() File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/threading.py", line 882, in run self._target(*self._args, **self._kwargs) File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/test/test__xxsubinterpreters.py", line 51, in run interpreters.run_string(interp, dedent(f""" RuntimeError: interpreter already running FAIL ====================================================================== FAIL: test_subinterpreter (test.test__xxsubinterpreters.IsRunningTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/test/test__xxsubinterpreters.py", line 492, in test_subinterpreter self.assertTrue(interpreters.is_running(interp)) AssertionError: False is not true ====================================================================== FAIL: test_already_running (test.test__xxsubinterpreters.RunStringTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/home/buildbot/python/3.x.koobs-freebsd-564d/build/Lib/test/test__xxsubinterpreters.py", line 853, in test_already_running interpreters.run_string(self.id, 'print("spam")') AssertionError: RuntimeError not raised |
|||
| msg355520 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2019-10-28 10:52 | |
See also bpo-33868: test__xxsubinterpreters fails randomly since at least 2018-06-15. |
|||
| msg356631 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2019-11-14 22:27 | |
AMD64 Windows8.1 Refleaks 3.8: https://buildbot.python.org/all/#/builders/224/builds/151 0:54:23 load avg: 5.62 [306/423/3] test__xxsubinterpreters failed -- running: test_asyncio (4 min 26 sec), test_zipfile (6 min 6 sec), test_compileall (6 min 21 sec) beginning 6 repetitions 123456 ..Exception in thread Thread-27: Traceback (most recent call last): File "D:\buildarea\3.8.ware-win81-release.refleak\build\lib\threading.py", line 932, in _bootstrap_inner self.run() File "D:\buildarea\3.8.ware-win81-release.refleak\build\lib\threading.py", line 870, in run self._target(*self._args, **self._kwargs) File "D:\buildarea\3.8.ware-win81-release.refleak\build\lib\test\test__xxsubinterpreters.py", line 51, in run interpreters.run_string(interp, dedent(f""" RuntimeError: unrecognized interpreter ID 175 test test__xxsubinterpreters failed -- Traceback (most recent call last): File "D:\buildarea\3.8.ware-win81-release.refleak\build\lib\test\test__xxsubinterpreters.py", line 763, in test_still_running interpreters.destroy(interp) AssertionError: RuntimeError not raised FAIL: test_subinterpreter (test.test__xxsubinterpreters.IsRunningTests) |
|||
| msg357347 - (view) | Author: Eric Snow (eric.snow) * ![]() |
Date: 2019-11-22 23:57 | |
Sorry I haven't gotten back to you sooner, Kyle. Thanks for working on this. I'm looking at your PR right now. |
|||
| msg357349 - (view) | Author: Eric Snow (eric.snow) * ![]() |
Date: 2019-11-23 00:17 | |
BTW, Kyle, your problem-solving approach on this is on-track. Don't get discouraged. This stuff is tricky. :) |
|||
| msg357350 - (view) | Author: Eric Snow (eric.snow) * ![]() |
Date: 2019-11-23 00:30 | |
Thus far these are the failures we've seen: * not running when we expect it to be running: * interpreters.is_running(interp) * interpreters.run_string(interp, ...) * interpreters.destroy(interp) * can't find the interpreter even though we expect it to exist * interpreters.run_string(interp, ...) * finds it running when we expect it to not be running * interpreters.run_string(interp, ...) Except for the last one (which might be a separate issue), they all look like they could be explained by the same thing: the subinterpreter stopped (or went away) prematurely. That could be related to the code in _xxsubinterpretersmodule.c or it could be the cleanup code that makes sure interpreters get cleaned up at the end of tests (e.g. running too soon). Either way I expect the fix will be in the module code and not the tests. Regarding "is_running()", notice that it relies almost entirely on "frame->f_executing". That might not be enough (or maybe the behavior there changed). That would be worth checking out. @aeros, feel free too keep investigating. I'd be glad to help you out. Otherwise I'll dive into this probably next week. |
|||
| msg357361 - (view) | Author: Kyle Stanley (aeros) * ![]() |
Date: 2019-11-23 06:46 | |
> Sorry I haven't gotten back to you sooner, Kyle. Thanks for working on this. I'm looking at your PR right now. > BTW, Kyle, your problem-solving approach on this is on-track. Don't get discouraged. This stuff is tricky. :) No problem at all, and thank you for the words of encouragement. (: > Regarding "is_running()", notice that it relies almost entirely on "frame->f_executing". That might not be enough (or maybe the behavior there changed). That would be worth checking out. Hmm, that's an interesting point that I hadn't considered. > @aeros, feel free too keep investigating. I'd be glad to help you out. Otherwise I'll dive into this probably next week. Sounds good, I'll do some further digging around, particularly anywhere that interacts with PyFrameObject's `f_executing` field. I think it's possible that there's a non-obvious issue with `_is_running()`, where it works correctly 99% of the time. That seems to be a significant commonality between the different areas where the intermittent failures are occurring, they all directly or indirectly call `_is_running()`. Also, thanks again Eric for the PR review. Looking back it after Eric's analysis and having a couple of months to think it over, I don't think that GH-16293 is the correct solution. It seems unlikely that this is being caused by a lack of proper GIL acquisition, as that would likely be causing far more consistent build failures. I'm thinking that it's more likely to be an issue with either: 1) Subinterpreters occasionally disappearing due to premature cleanup (as Eric suggested) 2) _is_running() being incorrect a small percentage of the time |
|||
| msg357364 - (view) | Author: Kyle Stanley (aeros) * ![]() |
Date: 2019-11-23 07:58 | |
So, I was finally able to replicate a failure in test_still_running locally, it required using a rather ridiculous number of parallel workers: $ ./python -m test test__xxsubinterpreters -j200 -F ... Exception in thread Thread-7: Traceback (most recent call last): File "/home/aeros/repos/aeros-cpython/Lib/threading.py", line 944, in _bootstrap_inner self.run() File "/home/aeros/repos/aeros-cpython/Lib/threading.py", line 882, in run self._target(*self._args, **self._kwargs) File "/home/aeros/repos/aeros-cpython/Lib/test/test__xxsubinterpreters.py", line 51, in run interpreters.run_string(interp, dedent(f""" RuntimeError: unrecognized interpreter ID 39 test test__xxsubinterpreters failed -- Traceback (most recent call last): File "/home/aeros/repos/aeros-cpython/Lib/test/test__xxsubinterpreters.py", line 766, in test_still_running interpreters.destroy(interp) AssertionError: RuntimeError not raised ... == Tests result: FAILURE == 94 tests OK. 1 test failed: test__xxsubinterpreters Total duration: 1 min 49 sec Tests result: FAILURE OS: Arch Linux x86_64 Kernel: 5.3.11 CPU: Intel i5-4460 I was able to consistently reproduce the above failure using 200 parallel workers, even without `-f`. I have a few different theories that I'd like to test for fixing the failure, I'll report back if any of them yield positive results. |
|||
| msg357365 - (view) | Author: Kyle Stanley (aeros) * ![]() |
Date: 2019-11-23 08:01 | |
> I was able to consistently reproduce the above failure using 200 parallel workers, even without `-f`. Oops, I didn't mean without passing `-F`, as this would result in only a single test being ran. I meant without letting it repeat multiple times, as in the test failure would consistently occur in the first batch of 200 parallel workers. |
|||
| msg357378 - (view) | Author: Kyle Stanley (aeros) * ![]() |
Date: 2019-11-23 12:58 | |
> So, I was finally able to replicate a failure in test_still_running locally, it required using a rather ridiculous number of parallel workers I forgot to mention that I was able to replicate the above failure on the latest commit to the 3.8 branch. I was _unable_ to replicate the failure on the latest commit to master, even when scaling up the number of parallel workers (the max I tested was 300 workers and ~3k iterations of test__xxsubinterpreters). |
|||
| msg357671 - (view) | Author: Kyle Stanley (aeros) * ![]() |
Date: 2019-12-01 04:23 | |
> Regarding "is_running()", notice that it relies almost entirely on "frame->f_executing". That might not be enough (or maybe the behavior there changed). That would be worth checking out. Based on the above hint, I was able to make some progress on a potential solution. Thanks Eric. Instead of only checking "frame->f_executing", I changed "_is_running()" to also check the "finalizing" field of PyInterpreterState. The "finalizing" field is set to 1 in "Py_EndInterpreter()", so this ensures that an interpreter in the process of being destroyed is considered "running", so that operations (such as running scripts, destroying the interpreter, etc) can't occur during finalization. I had to add a private function to the C-API in order to access "interp->finalizing" from Modules/_xxsubinterpretersmodule.c due to the struct for PyInterpreterState being internal only. The above fix seems to completely remove the test failure that occurs in "interpreters.destroy(interp)" in "test_already_running" after running it several times, but I'm able to consistently reproduce the following: Exception in thread Thread-8: Traceback (most recent call last): File "/home/aeros/repos/aeros-cpython/Lib/threading.py", line 932, in _bootstrap_inner self.run() File "/home/aeros/repos/aeros-cpython/Lib/threading.py", line 870, in run self._target(*self._args, **self._kwargs) File "/home/aeros/repos/aeros-cpython/Lib/test/test__xxsubinterpreters.py", line 51, in run interpreters.run_string(interp, dedent(f""" RuntimeError: unrecognized interpreter ID 46 test test__xxsubinterpreters failed -- Traceback (most recent call last): File "/home/aeros/repos/aeros-cpython/Lib/test/test__xxsubinterpreters.py", line 492, in test_subinterpreter self.assertTrue(interpreters.is_running(interp)) AssertionError: False is not true I have a few ideas that I'd like to test out for fixing this failure, and if any of them produce positive results I'll report back. Since the failures are still consistently occurring, I have not yer revised GH-16293. I'll do that when/if I come up with a more thorough solution. |
|||
| msg357672 - (view) | Author: Kyle Stanley (aeros) * ![]() |
Date: 2019-12-01 04:26 | |
> so that operations (such as running scripts, destroying the interpreter, etc) can't occur during finalization Clarification: by "destroying the interpreter" I am specifically referring to calling `interp_destroy()` after finalization has already started. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2019-12-01 04:26:21 | aeros | set | messages: + msg357672 |
| 2019-12-01 04:23:09 | aeros | set | messages: + msg357671 |
| 2019-11-23 12:58:33 | aeros | set | messages: + msg357378 |
| 2019-11-23 08:01:39 | aeros | set | messages: + msg357365 |
| 2019-11-23 07:58:49 | aeros | set | messages: + msg357364 |
| 2019-11-23 06:46:39 | aeros | set | messages: + msg357361 |
| 2019-11-23 00:30:36 | eric.snow | set | messages: + msg357350 |
| 2019-11-23 00:17:20 | eric.snow | set | messages: + msg357349 |
| 2019-11-22 23:57:36 | eric.snow | set | messages: + msg357347 |
| 2019-11-22 23:56:40 | eric.snow | set | keywords:
+ patch stage: patch review pull_requests: + pull_request16843 |
| 2019-11-14 22:27:18 | vstinner | set | messages: + msg356631 |
| 2019-10-28 10:52:32 | vstinner | set | messages: + msg355520 |
| 2019-10-23 14:42:33 | vstinner | set | messages: + msg355221 |
| 2019-10-18 11:06:24 | vstinner | set | messages: + msg354876 |
| 2019-10-14 23:00:08 | aeros | set | messages: + msg354677 |
| 2019-10-14 21:28:12 | vstinner | set | messages: + msg354662 |
| 2019-10-14 21:27:17 | vstinner | set | messages: + msg354661 |
| 2019-10-14 21:26:36 | vstinner | set | title: test__xxsubinterpreters failed on AMD64 Windows8.1 Refleaks 3.8 -> test__xxsubinterpreters fails randomly |
| 2019-10-14 21:26:24 | vstinner | link | issue38154 superseder |
| 2019-10-14 21:26:18 | vstinner | set | messages: + msg354659 |
| 2019-10-14 21:24:46 | vstinner | set | messages: + msg354658 |
| 2019-09-20 05:03:25 | aeros | set | messages: + msg352835 |
| 2019-09-17 00:11:09 | aeros | set | messages: + msg352592 |
| 2019-09-16 06:52:07 | aeros | set | messages: + msg352516 |
| 2019-09-16 06:48:15 | aeros | set | messages: + msg352514 |
| 2019-09-16 05:20:01 | aeros | set | messages: + msg352510 |
| 2019-09-16 05:17:13 | aeros | set | messages: + msg352509 |
| 2019-09-16 05:15:02 | aeros | set | nosy:
+ aeros messages: + msg352508 |
| 2019-07-08 07:46:36 | vstinner | set | messages: + msg347486 |
| 2019-06-12 21:52:44 | vstinner | set | messages: + msg345415 |
| 2019-06-11 10:55:23 | vstinner | set | messages: + msg345219 |
| 2019-06-11 00:58:20 | vstinner | set | messages: + msg345163 |
| 2019-06-11 00:58:01 | vstinner | create | |

