PERF: avoid temporary allocation in np.gradient for strided axes #31505

Open
karbalaie wants to merge 2 commits into numpy:main from karbalaie:perf-gradient-strided-out

@karbalaie

PR summary

I investigated np.gradient and found that the central difference quotient is computed in a single line in _function_base_impl.py:

out[slice1_t] = (f[slice4_t] - f[slice2_t]) / (2. * ax_dx) 

This first computes the difference f[slice4_t] - f[slice2_t] into a temporary array, then divides by 2 * dx and stores the final result in out. That costs unnecessary memory and memory bandwidth, especially for large arrays. In some cases it is faster to compute the difference directly into out and then scale out in place.
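A minimal sketch of the two forms, written outside of NumPy's source. The helper names (`_slices`, `interior_one_line`, `interior_in_place`) are hypothetical and only cover the interior points with uniform spacing; they are not the actual code in `_function_base_impl.py`.

```python
import numpy as np

def _slices(ndim, axis):
    """Index tuples for interior points, left neighbours, right neighbours."""
    def at(s):
        idx = [slice(None)] * ndim
        idx[axis] = s
        return tuple(idx)
    return at(slice(1, -1)), at(slice(None, -2)), at(slice(2, None))

def interior_one_line(f, dx, axis):
    """One-line form: the difference lands in a temporary before the division."""
    inner, left, right = _slices(f.ndim, axis)
    out = np.empty_like(f)
    out[inner] = (f[right] - f[left]) / (2.0 * dx)
    return out[inner]

def interior_in_place(f, dx, axis):
    """Two-step form: write the difference into out, then scale it in place."""
    inner, left, right = _slices(f.ndim, axis)
    out = np.empty_like(f)
    view = out[inner]                      # a view, so writes go into out
    np.subtract(f[right], f[left], out=view)
    view /= 2.0 * dx
    return view

rng = np.random.default_rng(0)
f = rng.random((50, 40))
a = interior_one_line(f, 0.01, axis=0)
b = interior_in_place(f, 0.01, axis=0)
print(np.allclose(a, b))
```

Both helpers should agree with the interior rows of `np.gradient(f, 0.01, axis=0)`; only the number of intermediate arrays differs.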

Explanation

For default NumPy arrays (C order), the last axis is contiguous in memory. For Fortran-order (F-order) arrays, the first axis is contiguous in memory. My tests showed that the existing method is as good as, or sometimes even better than, the new one for contiguous axes. But for non-contiguous, strided axes, the proposed method is faster.
Therefore, I kept the existing method for contiguous axes and use the new method only for strided axes.
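To illustrate which axes count as strided here, the following sketch prints the strides of a C-order and an F-order array and applies the same comparison used in the diff (`f.strides[axis] != f.itemsize`). The helper name `axis_is_contiguous` is made up for this example.

```python
import numpy as np

def axis_is_contiguous(f, axis):
    """True when neighbouring elements along `axis` are adjacent in memory."""
    return f.strides[axis] == f.itemsize

c_arr = np.zeros((1000, 1000), dtype=np.float64)   # C order by default
f_arr = np.asfortranarray(c_arr)                   # same data, F order

print(c_arr.strides)   # (8000, 8): axis 1 is contiguous, axis 0 is strided
print(f_arr.strides)   # (8, 8000): axis 0 is contiguous, axis 1 is strided

for name, arr in [("C", c_arr), ("F", f_arr)]:
    for axis in (0, 1):
        print(name, axis, axis_is_contiguous(arr, axis))
```

This matches the benchmark cases in the results: axis=0 of a C-order array and axis=1 of an F-order array are the strided ones.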

Results

float32, C-order, (1000, 1000), axis=0:
196 µs -> 137 µs

float64, C-order, (1000, 1000), axis=0:
377 µs -> 273 µs

float32, F-order, (1000, 1000), axis=1:
196 µs -> 137 µs

float64, F-order, (1000, 1000), axis=1:
379 µs -> 270 µs

The added ASV benchmarks showed a performance improvement of roughly 25-40% for the new method on strided-axis cases (tested on macOS/arm64, Apple M4 Max).

ASV
"SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED."

Tests
python -m pytest numpy/lib/tests/test_function_base.py -k gradient
"28 passed, 1468 deselected"

First time committer introduction

I have used NumPy for over a decade in physics. My main focus in corporate research is NLP and Riemannian optimization. I have wanted to contribute to the NumPy community for a long time, as I have benefited from it for so long.

AI Disclosure

ChatGPT reviewed the benchmark script and helped me rewrite the PR title.

slice4_t = tuple(slice4)

if f.strides[axis] != f.itemsize:
    np.subtract(f[slice4_t], f[slice2_t], out=out[slice1_t])
Contributor

Is there a reason not to simply always take this path?

Author

@karbalaie karbalaie May 25, 2026

Yes, I tested it, and the local benchmarks showed that for contiguous differentiation axes the np.subtract(out=...) path was sometimes slower. For example, for the last axis of C-contiguous arrays and the first axis of F-contiguous arrays, the current method was as fast or faster. That is why I suggested distinguishing between those two cases.

Here are some additional benchmark results for always using the np.subtract(out=...) path:

C-order float64, shape (1000, 1000), axis=1:
current:     0.03968
always-out:  0.05526
ratio:       0.72x

F-order float64, shape (1000, 1000), axis=0:
current:     0.03669
always-out:  0.05070
ratio:       0.72x
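A comparison like the one above could be reproduced with a small `timeit` harness along these lines. This is only a sketch: it times the interior central difference in isolation rather than the full `np.gradient` call, and every helper name (`make_case`, `one_line`, `always_out`, `best_time`) is an assumption of this example, not the script actually used for the numbers in this thread.

```python
import timeit
import numpy as np

def make_case(order, axis, shape=(1000, 1000), dtype=np.float64):
    """Build input, output, and index tuples (interior, left, right) for one axis."""
    rng = np.random.default_rng(42)
    f = np.array(rng.random(shape), dtype=dtype, order=order)
    out = np.empty_like(f)
    def at(s):
        idx = [slice(None)] * f.ndim
        idx[axis] = s
        return tuple(idx)
    return f, out, at(slice(1, -1)), at(slice(None, -2)), at(slice(2, None))

def one_line(f, out, inner, left, right, dx=0.01):
    # Existing form: temporaries for the difference and the quotient.
    out[inner] = (f[right] - f[left]) / (2.0 * dx)

def always_out(f, out, inner, left, right, dx=0.01):
    # Proposed form: subtract straight into out, then scale in place.
    view = out[inner]
    np.subtract(f[right], f[left], out=view)
    view /= 2.0 * dx

def best_time(func, case, number=20, repeat=5):
    """Best per-call time in seconds over `repeat` rounds of `number` calls."""
    timings = timeit.repeat(lambda: func(*case), number=number, repeat=repeat)
    return min(timings) / number

for order, axis in [("C", 1), ("F", 0), ("C", 0), ("F", 1)]:
    case = make_case(order, axis)
    t_one = best_time(one_line, case)
    t_out = best_time(always_out, case)
    print(f"{order}-order axis={axis}: one-line {t_one * 1e3:.3f} ms, "
          f"always-out {t_out * 1e3:.3f} ms")
```

Timings depend heavily on the machine, the NumPy build, and the array size, so the printed values are only meaningful relative to each other on the same system.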

Author

@mhvk does that answer your question?

Contributor

Sorry not to have had time to look earlier, but I have now done a simple test. Are you sure you didn't flip the results? I ask since I seem to find the reverse, that using np.subtract in place is always faster:

import numpy as np

shape = (1000, 1000)
dtype = float
rng = np.random.default_rng(42)
arr = rng.random(shape).astype(dtype)
%timeit np.gradient(arr, 0.01, axis=1)
# current PR: 3.38 ms ± 2.75 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# always subtract route: 2 ms ± 2.92 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So, I think this can be simpler:

         if uniform_spacing:
-            out[tuple(slice1)] = (f[tuple(slice4)] - f[tuple(slice2)]) / (2. * ax_dx)
+            out_slice = out[tuple(slice1)]
+            np.subtract(f[tuple(slice4)], f[tuple(slice2)], out=out_slice)
+            out_slice /= (2. * ax_dx)

Member

Yeah, this seems strange to me as well, can you please follow up on it? There could in theory be differences if a contiguous fast path isn't taken or casts are involved, but it isn't clear to me that this should be the case.
Plus since you avoid a copy it still seems a bit surprising.

The PR otherwise seems nice since it is a simple re-organization, but branching is a bit annoying.

Author

Let me do another test after work today. Maybe I mixed something up.
