PERF: avoid temporary allocation in np.gradient for strided axes by karbalaie · Pull Request #31505 · numpy/numpy

karbalaie · 2026-05-25T15:29:44Z

PR summary

I investigated np.gradient and found that the central difference quotient is computed in a single line in _function_base_impl.py:

out[slice1_t] = (f[slice4_t] - f[slice2_t]) / (2. * ax_dx)

This first computes the difference f[slice4_t] - f[slice2_t] into a temporary array, then divides by 2 * dx and stores the final result in out. The temporary costs extra memory and memory bandwidth, especially for large arrays. In some cases it is faster to write the difference directly into out and then scale out in place.
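A minimal sketch of the two formulations on a small 2-D array, to make the difference concrete. The index tuples `interior`, `left` and `right` are illustrative stand-ins for slice1_t, slice2_t and slice4_t along axis 0; they and the output names are assumptions for this example, not code from the PR.

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.random((6, 5))
dx = 0.5

# Hypothetical stand-ins for slice1_t, slice2_t and slice4_t along axis 0.
interior = (slice(1, -1), slice(None))
left = (slice(None, -2), slice(None))
right = (slice(2, None), slice(None))

# One-line form: the difference is materialised as a temporary array
# before it is divided and copied into the output.
one_line = np.zeros_like(f)
one_line[interior] = (f[right] - f[left]) / (2. * dx)

# Two-step form: write the difference straight into a view of the output,
# then scale that view in place.
in_place = np.zeros_like(f)
out_view = in_place[interior]
np.subtract(f[right], f[left], out=out_view)
out_view /= (2. * dx)

# Both forms should agree with each other and with np.gradient's interior points.
same = np.allclose(one_line, in_place)
matches_gradient = np.allclose(in_place[interior],
                               np.gradient(f, dx, axis=0)[interior])
```

Only the interior points are compared here; the boundary rows are handled separately by np.gradient and are not touched by either form.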

Explanation

For default NumPy arrays (C-order), the last axis is contiguous in memory. For Fortran-order (F-order) arrays, the first axis is contiguous in memory. My tests showed that the existing method is as good as, or sometimes even better than, the new one for contiguous axes. But for non-contiguous, strided axes, the proposed method is faster.
Therefore, I kept the existing method for contiguous axes and use the new method only for strided axes.
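As a small illustration of which axes count as strided under this rule, the sketch below inspects the strides of a C-order and an F-order array. The helper name `is_strided_axis` is hypothetical; its body mirrors the condition `f.strides[axis] != f.itemsize` used in the diff.

```python
import numpy as np

def is_strided_axis(f, axis):
    # Hypothetical helper: an axis is treated as strided when one step
    # along it moves by something other than a single item in memory.
    return f.strides[axis] != f.itemsize

c_arr = np.zeros((1000, 1000), dtype=np.float64)  # C-order: last axis contiguous
f_arr = np.asfortranarray(c_arr)                  # F-order: first axis contiguous

c_strides = c_arr.strides  # (8000, 8)
f_strides = f_arr.strides  # (8, 8000)

c_flags = [is_strided_axis(c_arr, ax) for ax in (0, 1)]  # [True, False]
f_flags = [is_strided_axis(f_arr, ax) for ax in (0, 1)]  # [False, True]
```

This matches the benchmark cases listed under Results: axis=0 for C-order and axis=1 for F-order are the strided ones.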

Results

float32, C-order, (1000, 1000), axis=0:
196 µs -> 137 µs

float64, C-order, (1000, 1000), axis=0:
377 µs -> 273 µs

float32, F-order, (1000, 1000), axis=1:
196 µs -> 137 µs

float64, F-order, (1000, 1000), axis=1:
379 µs -> 270 µs

The added ASV benchmarks showed an improvement of roughly 25-40% for the new method on strided-axis cases (tested on macOS/arm64, Apple M4 Max).
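For readers who want to try a comparison locally, here is a rough timing sketch using only `timeit`. It is not the ASV benchmark added in this PR: the function names are hypothetical, and it times only the interior central-difference update rather than the whole of np.gradient, so absolute numbers will differ from those above.

```python
import timeit
import numpy as np

def interior_slices(ndim, axis):
    """Return (interior, left, right) index tuples for a central difference."""
    interior = [slice(None)] * ndim
    left = [slice(None)] * ndim
    right = [slice(None)] * ndim
    interior[axis] = slice(1, -1)
    left[axis] = slice(None, -2)
    right[axis] = slice(2, None)
    return tuple(interior), tuple(left), tuple(right)

def central_one_line(f, out, dx, axis):
    # Existing formulation: difference, division, then assignment.
    mid, lo, hi = interior_slices(f.ndim, axis)
    out[mid] = (f[hi] - f[lo]) / (2. * dx)

def central_subtract_out(f, out, dx, axis):
    # Proposed formulation: subtract into the output view, scale in place.
    mid, lo, hi = interior_slices(f.ndim, axis)
    view = out[mid]
    np.subtract(f[hi], f[lo], out=view)
    view /= (2. * dx)

def best_times(dtype, order, axis, n=1000, number=20, repeat=5):
    """Best per-call time in seconds for each variant (hypothetical helper)."""
    rng = np.random.default_rng(0)
    f = np.asarray(rng.random((n, n)), dtype=dtype, order=order)
    results = {}
    for fn in (central_one_line, central_subtract_out):
        out = np.empty_like(f)  # keeps the memory layout of f
        timer = timeit.Timer(lambda: fn(f, out, 0.01, axis))
        results[fn.__name__] = min(timer.repeat(repeat=repeat, number=number)) / number
    return results

if __name__ == "__main__":
    cases = [(np.float32, "C", 0), (np.float64, "C", 0),
             (np.float32, "F", 1), (np.float64, "F", 1)]
    for dtype, order, axis in cases:
        print(np.dtype(dtype).name, order, f"axis={axis}", best_times(dtype, order, axis))
```

Taking the minimum over several repeats reduces noise from other processes; results will still vary with platform, NumPy build and array size.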

ASV
"SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED."

Tests
python -m pytest numpy/lib/tests/test_function_base.py -k gradient
"28 passed, 1468 deselected"

First time committer introduction

I have used NumPy for over a decade in physics. My main focus in corporate research is on NLP and Riemannian optimization. I have wanted to contribute to the NumPy community for a long time, as I have benefited from it for so long.

AI Disclosure

ChatGPT reviewed the benchmark script and helped me rewrite the PR title.

mhvk · 2026-05-25T15:48:13Z

+            slice4_t = tuple(slice4)
+
+            if f.strides[axis] != f.itemsize:
+                np.subtract(f[slice4_t], f[slice2_t], out=out[slice1_t])


Is there a reason not to simply always take this path?

Yes, I tested it, and the local benchmarks showed that for contiguous differentiation axes the np.subtract(out=...) path was sometimes slower: for the last axis of C-contiguous arrays and the first axis of F-contiguous arrays, the current method is as fast as or faster. That is why I suggested distinguishing between those two cases.

Here are some additional benchmark results for always using the np.subtract(out=...) path:

C-order float64, shape (1000, 1000), axis=1: current: 0.03968 always-out: 0.05526 ratio: 0.72x
F-order float64, shape (1000, 1000), axis=0: current: 0.03669 always-out: 0.05070 ratio: 0.72x

@mhvk does that answer your question?

Sorry not to have had time to look earlier, but I have now done a simple test. Are you sure you didn't flip the results? I ask since I seem to find the reverse, that using np.subtract in place is always faster:

shape = (1000, 1000)
dtype = float
rng = np.random.default_rng(42)
arr = rng.random(shape).astype(dtype)
%timeit np.gradient(arr, 0.01, axis=1)
# current PR: 3.38 ms ± 2.75 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# always subtract route: 2 ms ± 2.92 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So, I think this can be simpler:

 if uniform_spacing:
-    out[tuple(slice1)] = (f[tuple(slice4)] - f[tuple(slice2)]) / (2. * ax_dx)
+    out_slice = out[tuple(slice1)]
+    np.subtract(f[tuple(slice4)], f[tuple(slice2)], out=out_slice)
+    out_slice /= (2. * ax_dx)

Yeah, this seems strange to me as well; can you please follow up on it? There could in theory be differences if a contiguous fast path isn't taken or casts are involved, but it isn't clear to me that this should be the case here.
Plus, since you avoid a copy, it still seems a bit surprising.
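One way to start such a follow-up is to inspect the operands of the contiguous-axis case directly. The sketch below is only a diagnostic under simple assumptions (C-order float64 input, axis=1, illustrative variable names); it does not explain the timing discrepancy, it only shows what the ufunc is handed.

```python
import numpy as np

f = np.zeros((1000, 1000), dtype=np.float64)  # C-order, differentiate along axis=1
out = np.empty_like(f)

# Operand views for the interior central difference along the last axis.
hi, lo, dst = f[:, 2:], f[:, :-2], out[:, 1:-1]

# The inner stride equals the item size for every view ...
inner_is_unit = all(v.strides[-1] == v.itemsize for v in (hi, lo, dst))

# ... but none of the views is a single contiguous block, because each row
# skips two elements of the parent array.
views_contiguous = [v.flags.c_contiguous for v in (hi, lo, dst)]

# Inputs and output share a dtype, so no cast is needed for this case.
needs_cast = np.result_type(hi, lo) != dst.dtype
```

Here `inner_is_unit` is True, `views_contiguous` is all False, and `needs_cast` is False, so casts can be ruled out for plain float64 input.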

The PR otherwise seems nice since it is a simple re-organization, but branching is a bit annoying.

Let me do another test after work today. Maybe I mixed something up.

karbalaie added 2 commits May 25, 2026 15:37

PERF: avoid temporary allocation in np.gradient for strided axes

9164eda

BENCH: add np.gradient strided-axis benchmark

e507e72

mhvk reviewed May 25, 2026

karbalaie wants to merge 2 commits into numpy:main from karbalaie:perf-gradient-strided-out

karbalaie commented May 25, 2026

mhvk May 25, 2026

karbalaie May 25, 2026 • edited

karbalaie May 29, 2026

mhvk Jun 2, 2026

seberg Jun 3, 2026

karbalaie Jun 3, 2026


3 participants
