gh-116477: Improve performance of range for the single argument case#116478
gh-116477: Improve performance of range for the single argument case#116478eendebakpt wants to merge 27 commits intopython:mainfrom
Conversation
f3e6ab9 to
bf88016
Compare
serhiy-storchaka
left a comment
There was a problem hiding this comment.
In general LGTM. But please remove redundant spaces.
Misc/NEWS.d/next/Core and Builtins/2024-03-07-22-38-07.gh-issue-116477.AqwRAV.rst
Outdated
Show resolved
Hide resolved
…e-116477.AqwRAV.rst Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Objects/rangeobject.c
Outdated
| lstop = PyLong_AsLong(r->stop); | ||
| if (lstop == -1 && PyErr_Occurred()) { | ||
|
|
||
| if (r->start == _PyLong_GetZero() && r->step == _PyLong_GetOne() ) { |
There was a problem hiding this comment.
How large is overhead of two PyLong_AsLong() calls for 0 and 1? Is it noticeable?
Run something like:
./python -m timeit -s 'r = range(10)' 'for i in r: break'
AFAIK it is the fastest code that is affected by the iterator creation time.
There was a problem hiding this comment.
The overhead for the two PyLong_AsLong calls (plus some bits and pieces) are from the first benchmark above:
range(1): Mean +- std dev: [main_pgo] 44.4 ns +- 1.0 ns -> [pr_pgo] 34.7 ns +- 4.0 ns: 1.28x faster
So about 10 ns.
The last benchmark from the first comment:
for loop: Mean +- std dev: [main_pgo] 295 ns +- 9 ns -> [pr_pgo] 267 ns +- 5 ns: 1.10x faster
executes:
def g():
x=0
for ii in range(10):
x += 1
So a small for loop with minimal work becomes 10% faster. (timings may vary, my system is noisy)
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Co-authored-by: Erlend E. Aasland <erlend.aasland@protonmail.com>
erlend-aasland
left a comment
There was a problem hiding this comment.
Looks good to me, but I'd like Serhiy's thumbs up before landing.
erlend-aasland
left a comment
There was a problem hiding this comment.
I'll merge this later at the sprints. Thanks!
serhiy-storchaka
left a comment
There was a problem hiding this comment.
I am not convinced that this change is worth to do, but I will not object against committing it if other core developer likes it.
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
|
@eendebakpt, are the benchmarks up to date? |
|
Here is a set of benchmarks for the PR rebased to main: Benchmark script(compiled with PGO, executed with There is some noise in the data, but there are enough datapoints to make the trends clear. Most relevant for real applications are the |
|
So, if we exclude the iter optimisation, there will be no slowdowns (unlesss I misread the benchmark)? |
Both optimizations use the same check (two pointer comparisons to Two runs of benchmarks of main vs. the updated PR |
erlend-aasland
left a comment
There was a problem hiding this comment.
LGTM. @serhiy-storchaka, any objections?
serhiy-storchaka
left a comment
There was a problem hiding this comment.
Please repeat benchmarks with the current code.
|
New benchmarks (Ubuntu, with the new tail calling interpreter, non-PGO): |
|
Pieter, can you fix the linting CI? Let's land this! |
I understand your concern. For example, we did not optimise |
Benchmark
The fast paths eliminate several calls to
PyLong_AsLongAndOverflowand the construction of a new Python integer withPyLong_FromLong.The fast path for the range object is in
compute_range_length. That implies a small performance penalty (two pointer comparisons) for the cases wherestart != 0orstep != 1. We could move the fast path torange_from_array(thecase 1:part), but that leads to a bit more duplicated code.