PERF: avoid temporary allocation in np.gradient for strided axes #31505
karbalaie wants to merge 2 commits into
slice4_t = tuple(slice4)

if f.strides[axis] != f.itemsize:
    np.subtract(f[slice4_t], f[slice2_t], out=out[slice1_t])
Is there a reason not to simply always take this path?
Yes, I tested it, and my local benchmarks showed that for contiguous differentiation axes the np.subtract(out=...) path was sometimes slower. For example, for the last axis of C-contiguous arrays and the first axis of F-contiguous arrays, the current method was as fast as or faster. That is why I suggested distinguishing between those two cases.
Here are some additional benchmark results for always using the np.subtract(out=...) path:
C-order float64, shape (1000, 1000), axis=1:
current: 0.03968
always-out: 0.05526
ratio: 0.72x
F-order float64, shape (1000, 1000), axis=0:
current: 0.03669
always-out: 0.05070
ratio: 0.72x
Sorry to not have had time to look earlier, but I now did a simple test. Are you sure you didn't flip the results? I ask since I seem to find the reverse, that using np.subtract in place is always faster:
import numpy as np
shape = (1000, 1000)
dtype = float
rng = np.random.default_rng(42)
arr = rng.random(shape).astype(dtype)
%timeit np.gradient(arr, 0.01, axis=1)
# current PR: 3.38 ms ± 2.75 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# always subtract route: 2 ms ± 2.92 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So, I think this can be simpler:
if uniform_spacing:
- out[tuple(slice1)] = (f[tuple(slice4)] - f[tuple(slice2)]) / (2. * ax_dx)
+ out_slice = out[tuple(slice1)]
+ np.subtract(f[tuple(slice4)], f[tuple(slice2)], out=out_slice)
+ out_slice /= (2. * ax_dx)
Yeah, seems strange to me as well, can you please follow up on this? There could in theory be differences if a contiguous fast-path isn't taken or casts are involved, but it isn't clear to me that that should be the case.
Plus since you avoid a copy it still seems a bit surprising.
The PR otherwise seems nice since it is a simple re-organization, but branching is a bit annoying.
Let me do another test after work today. Maybe I mixed something up.
PR summary
I investigated np.gradient and found that the central difference quotient is calculated in a single line in _function_base_impl.py:
out[tuple(slice1)] = (f[tuple(slice4)] - f[tuple(slice2)]) / (2. * ax_dx)
So it first calculates the difference f[slice4_t] - f[slice2_t] and creates a temporary array for that result, then divides by 2 * dx and saves the final result in out. This costs unnecessary memory and memory bandwidth, especially for large arrays. In some cases it is faster to write the difference directly into out and then scale out in place.
Explanation
For default NumPy arrays (C-order), the last axis is contiguous in memory. For Fortran-order (F-order) arrays, the first axis is contiguous in memory. My tests showed that the existing method is as fast or sometimes even faster for contiguous axes. But for non-contiguous, strided axes, the newly proposed method is faster.
Therefore, I kept the existing method for contiguous axes and use the new method only for strided axes.
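To illustrate the distinction, here is a small sketch of the stride check from the diff. The helper name `axis_is_contiguous` is hypothetical and only wraps the comparison `f.strides[axis] == f.itemsize`.

```python
import numpy as np


def axis_is_contiguous(f, axis):
    # True when stepping along `axis` moves exactly one item in memory.
    return f.strides[axis] == f.itemsize


c_arr = np.zeros((1000, 1000), dtype=np.float64)  # C-order by default
f_arr = np.asfortranarray(c_arr)                  # F-order copy

print(c_arr.strides)  # (8000, 8): the last axis is contiguous
print(f_arr.strides)  # (8, 8000): the first axis is contiguous

# Axes for which the branch in this PR takes the np.subtract(out=...) path:
print([ax for ax in range(c_arr.ndim) if not axis_is_contiguous(c_arr, ax)])  # [0]
print([ax for ax in range(f_arr.ndim) if not axis_is_contiguous(f_arr, ax)])  # [1]
```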
Results
float32, C-order, (1000, 1000), axis=0:
196 µs -> 137 µs
float64, C-order, (1000, 1000), axis=0:
377 µs -> 273 µs
float32, F-order, (1000, 1000), axis=1:
196 µs -> 137 µs
float64, F-order, (1000, 1000), axis=1:
379 µs -> 270 µs
The added ASV benchmarks showed a performance improvement of roughly 25-40% for the new method on strided-axis cases (tested on macOS/arm64, Apple M4 Max).
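For anyone who wants to check the strided-axis cases locally, a rough timing sketch follows. It is not the ASV benchmark added in this PR; the helper `time_gradient` and its parameters are assumptions made for illustration, and absolute numbers will differ by machine.

```python
import timeit

import numpy as np


def time_gradient(order, axis, dtype=np.float64, shape=(1000, 1000),
                  number=10, repeat=3):
    # Best-of-`repeat` time in seconds for one np.gradient call.
    rng = np.random.default_rng(42)
    arr = np.asarray(rng.random(shape), dtype=dtype, order=order)
    timer = timeit.Timer(lambda: np.gradient(arr, 0.01, axis=axis))
    return min(timer.repeat(repeat=repeat, number=number)) / number


# Strided-axis cases: axis 0 of a C-order array, axis 1 of an F-order array.
for order, axis in (("C", 0), ("F", 1)):
    for dtype in (np.float32, np.float64):
        t = time_gradient(order, axis, dtype)
        print(f"{np.dtype(dtype).name}, {order}-order, axis={axis}: {t * 1e6:.0f} µs")
```

Running this once on main and once on the PR branch gives a before/after pair per case.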
ASV
"SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED."
Tests
python -m pytest numpy/lib/tests/test_function_base.py -k gradient
"28 passed, 1468 deselected"
First time committer introduction
I have used NumPy for over a decade in physics. My main focus in corporate research is NLP and Riemannian optimization. I have wanted to contribute to the NumPy community for a long time, as I have benefited from it for so long.
AI Disclosure
ChatGPT reviewed the benchmark script and helped me rewrite the PR title.