gh-114058: The Tier2 Optimizer #114059
base: main
Conversation
|
Please give me some time to write out the proper docs explaining the abstract IR this uses. |
|
All tests run with uops on now pass, except for the following two:
|
Maybe we should just fix the test? This sort of seems to be kicking the can down the road, since eventually tier 2 will be on by default. There's a |
|
1% slower on macOS (other platforms aren't building right now). 8% reduction in traces executed, but 3% increase in uops executed. PGO failure on Windows: PGO failure on Linux: Both look like recursion-related issues. The Windows one may be fixed on |
|
Thanks Brandt. Seems like the slowdown is due to Like Mark said though, benchmark results aren't too important at the moment. Function inlining would be the most important optimization, and it's missing from this PR; it will be added in a future PR. |
|
Nevertheless it would be wise to dig deeper into what goes on with bm_nbody. It may be an important canary. :-) |
|
The super deep recursive stuff is failing on Windows x64 and I can reproduce it on my own machine. I have no clue how to fix it. |
Okay, but that ought to be fixed on main separately, and there needs to be an issue for it. I know there are all sorts of issues with deep recursion, and especially on Windows, and we need to track it. Mark has some thoughts. |
|
Here's another long comment.

The emitter

The abstract interpreter emits a "brand new" uop instruction stream. This goes into Instructions are written into the write buffer using

emitter->writebuffer[emitter->curr_i++] = inst;

This function is the only place that writes to the buffer. Where is it called?

Emitting constants

There's a helper function for emitting constants, becomes this output: A later peephole optimizer stage (in (FWIW, perhaps we could move this peephole work to

Emitting other stuff

The rest is pretty straightforward, after each impure instruction a copy of that instruction is emitted using |
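As a rough illustration of the "only one place writes to the buffer" idea above, here is a minimal sketch in C. The struct layouts and the `emit_i` name are assumptions made for this example, not the PR's actual definitions; only the `emitter->writebuffer[emitter->curr_i++] = inst;` statement comes from the comment itself. The bounds check is also an assumption about how such a function would likely guard the buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real uop and emitter structs in the PR differ. */
typedef struct {
    uint16_t opcode;
    uint16_t oparg;
    uint64_t operand;
} uop_inst;

typedef struct {
    uop_inst *writebuffer;  /* output trace being built */
    int curr_i;             /* index of the next free slot */
    int capacity;           /* total number of slots in writebuffer */
} uop_emitter;

/* Single choke point for writes (hypothetical name). Returns false when
 * the buffer is full so the caller can abandon optimization instead of
 * overflowing the buffer. */
static bool
emit_i(uop_emitter *emitter, uop_inst inst)
{
    assert(emitter->curr_i >= 0);
    if (emitter->curr_i >= emitter->capacity) {
        return false;
    }
    emitter->writebuffer[emitter->curr_i++] = inst;
    return true;
}
```

Routing every write through one function keeps the overflow check in a single place, which is presumably why the comment asks where it is called.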
|
A while ago, @markshannon asked:
With what I now know about the interpreter's structure (which I believe has been simplified since Mark wrote that comment), I think the hand-written version of this would be pretty straightforward:
case _EXIT_INIT_CHECK:
    _Py_UOpsSymType *value = PEEK(1);
    if (sym_is_type(value, NONE_TYPE)) {
        new_inst.opcode = _POP_TOP;
    }
    stack_pointer--;
    break;
The only missing piece is that currently we don't track
if (tp == &_PyNone_Type) {
    sym_set_type(sym, PYNONE_TYPE, 0);
}
else
(and add |
Python/optimizer_analysis.c
op_is_zappable(int opcode)
{
    switch(opcode) {
        case _SET_IP:
I think this is missing _NOP. It seems a fine opcode to back over in the peepholer. :-)
Interestingly, after I do this, I believe we won't need to run the peepholer more than once. (I put in assert(peephole_attempts == 0 || done); after the call and it didn't fail a thing.) So that would simplify things a bit.
I also have a more radical idea: we can move the two things that the peepholer currently does (_SHRINK_STACK and _CHECK_PEP_523) to hand-written cases in uop_abstract_interpret_single_inst(). Then we don't need the peepholer at all any more.
Okay, so that doesn't work for _SHRINK_STACK, because it's being generated by uop_abstract_interpret_single_inst(), not consumed. But it could probably be moved to emit_const() quite easily.
The advantage of doing it on the fly instead of in a separate peephole pass is that we reduce the risk that we'll run out of buffer space -- the current code occasionally emits more instructions than the input (namely whenever it emits _SHRINK_STACK), which potentially (if we get very close to the limit) could cause the optimizer to fail even if, using a longer buffer, it would have succeeded and produced a shorter result. There would still be worst-case scenarios where we'd end up with many non-zappable _SHRINK_STACK uops, but in most cases the instant zapping would free up output space.
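To make the "instant zapping" idea concrete, here is a minimal sketch in C of cancelling a `_SHRINK_STACK` against preceding pushes at emit time rather than in a later pass. Everything here is an assumption for illustration: the opcode enum, the struct layouts, the `emit_shrink_stack` name, the choice of `_SET_IP` and `_NOP` as the zappable set, and the treatment of a constant load as a side-effect-free push. The real optimizer may do this differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical opcode set, for illustration only. */
enum { OP_NOP, OP_SET_IP, OP_LOAD_CONST, OP_SHRINK_STACK, OP_CALL };

typedef struct { int opcode; int oparg; } uop_inst;
typedef struct { uop_inst *buf; int len; int cap; } uop_emitter;

/* Uops the zapper may step over: assumed not to touch the stack. */
static bool
op_is_zappable(int opcode)
{
    return opcode == OP_SET_IP || opcode == OP_NOP;
}

/* Emit "shrink the stack by n", first cancelling it against trailing
 * side-effect-free pushes already in the buffer. Returns false only if
 * a uop still has to be written and the buffer is full. */
static bool
emit_shrink_stack(uop_emitter *e, int n)
{
    int scan = e->len - 1;
    while (n > 0) {
        /* Step backwards over uops that do not touch the stack. */
        while (scan >= 0 && op_is_zappable(e->buf[scan].opcode)) {
            scan--;
        }
        if (scan < 0 || e->buf[scan].opcode != OP_LOAD_CONST) {
            break;  /* The value came from something we cannot remove. */
        }
        /* Delete the push by sliding the tail down one slot. */
        memmove(&e->buf[scan], &e->buf[scan + 1],
                (size_t)(e->len - scan - 1) * sizeof(uop_inst));
        e->len--;
        scan--;
        n--;
    }
    if (n == 0) {
        return true;  /* Fully cancelled: nothing left to emit. */
    }
    if (e->len >= e->cap) {
        return false;
    }
    e->buf[e->len++] = (uop_inst){ OP_SHRINK_STACK, n };
    return true;
}
```

In this sketch each cancelled push frees an output slot immediately, so the buffer only grows when a shrink cannot be fully cancelled, which is the worst case mentioned above.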
Right now, the peepholer actually only needs to run once. I let it run a few times just to be sure.
My intuition for why it can eliminate all _SHRINK_STACK on the first run is that we are operating on a bytecode stack machine IR, which operates forwards.
But see my counter-example of a walrus: A + (X := B*C).
Either way it seems we both agree that _SHRINK_STACK is a temporary crutch.
Co-Authored-By: Guido van Rossum <gvanrossum@users.noreply.github.com>
Yeah, extending this for this specific case is quite trivial. In the long run, we probably want to express that using a separate bytecodes_abstract.c DSL file, because handwriting all that adds up in the end. |
Thanks for the quick response.
|
BTW the Windows stack overflow failures are being tracked at #114797. Apparently they exist on main as well. |
Thanks for indulging me. :-)
Closes #114058
This PR turns on the optimizer for all uops. The tier 2 uops optimizer contains a few parts: the abstract interpreter, the IR, and the codegen.
The abstract interpreter does the following:
Function inlining is left out for a future PR, as it's the most complex.
After analysing the bytecode and doing all of the above, it emits optimized uops and passes them to the executor.
When uops is enabled, **this passes the entire CPython test suite**. One significant milestone is that this is able to analyse and abstract interpret all CPython uops that we currently support. The other significant milestone is that this generates code that passes CPython's test suite.
Refleak tests will fail as well, as they need a design overhaul.
The design of this PR is here https://github.com/Fidget-Spinner/cpython_optimization_notes/blob/main/3.13/uops_optimizer.md
High level discussion here faster-cpython/ideas#648.
0-2% faster on Linux, 3% faster on macOS ARM64