Fix violinplot crash on empty datasets (#31700) #31707
|
Hi @story645! The GitHub UI threw an error when I tried to accept the commit suggestion directly (I don't know why the apply-suggestion button didn't work for me), so I pulled the branch and applied both of your changes manually. _axes.py now uses the stricter > 0 constraint, and cbook.py has been updated to use the exact same 'append up here and mutate below' pattern as boxplot_stats. Let me know if everything looks good to go now! |
Co-authored-by: Tim Hoffmann <2836374+timhoffm@users.noreply.github.com>
|
Hi @timhoffm, great catch with the [np.nan, np.nan] edge case! I've updated cbook.py so the NaN and inf stripping logic happens before the len(x) == 0 bailout check. The empty dataset dictionary now safely populates and avoids the crash even if the array initially contained only NaNs. Also, just a heads-up: it looks like the AppVeyor Windows check failed right at the end of its run due to a Windows temp file PermissionError during teardown, but the actual pytest suite passed. Let me know if everything else looks good to go, and what to do next! |
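The ordering described in this comment (strip non-finite values first, then test for emptiness) can be sketched roughly as below. This is a pure-Python stand-in and not the cbook.py code: the helper names `finite_values` and `is_effectively_empty` are hypothetical, and `math.isfinite` plays the role of the NumPy mask used in the PR.

```python
import math


def finite_values(data):
    """Return only the finite entries of data (drops NaN and +/-inf)."""
    return [v for v in data if math.isfinite(v)]


def is_effectively_empty(data):
    """Strip non-finite values first, then check whether anything is left."""
    return len(finite_values(data)) == 0


# An all-NaN dataset only looks non-empty until it has been filtered.
print(is_effectively_empty([math.nan, math.nan]))
print(finite_values([1.0, math.inf, 2.0]))
```

If the emptiness check ran before the filtering step, the all-NaN input would pass it and still reach the min/max calculation with nothing left to reduce.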
    # note tricksiness, append up here and then mutate below
    vpstats.append(stats)
|
This is unconventional and makes the code harder to reason about. Instead, put the calculation into an else block:
    if len(x) == 0:
        # empty stats
    else:
        # calculate stats
    vpstats.append(stats)
|
The way boxplot does it is to put a continue at the end of the empty case. Might make sense here too?
|
Hi @story645! I actually just refactored this loop into the strict if/else block that @timhoffm suggested above, since it lets us avoid the early append and the continue statement entirely.
The code is pushed and the linters are perfectly green! Let me know if you are both happy with this if/else structure, or if you'd prefer I switch it to the continue pattern!
|
I think the continue pattern is better b/c then you don't have a giant indent block for the else that you need to keep track of. That's presumably why it's used in boxplots
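For readers comparing the two loop shapes under discussion, here is a minimal sketch that puts them side by side. It is a simplified pure-Python stand-in with hypothetical helper names (`_empty_stats`, `_full_stats`, `stats_with_continue`, `stats_with_else`), not the actual violin_stats or boxplot_stats code, and it tracks only three statistics.

```python
import math
import statistics


def _empty_stats():
    """Placeholder stats for a dataset with no finite values."""
    return {"mean": math.nan, "min": math.nan, "max": math.nan}


def _full_stats(x):
    """Stats for a non-empty list of finite values."""
    return {"mean": statistics.fmean(x), "min": min(x), "max": max(x)}


def stats_with_continue(datasets):
    """Empty case appends and then skips the regular path with continue."""
    out = []
    for x in datasets:
        x = [v for v in x if math.isfinite(v)]
        if len(x) == 0:
            out.append(_empty_stats())
            continue
        out.append(_full_stats(x))
    return out


def stats_with_else(datasets):
    """Empty and regular cases sit in parallel if / else branches."""
    out = []
    for x in datasets:
        x = [v for v in x if math.isfinite(v)]
        if len(x) == 0:
            out.append(_empty_stats())
        else:
            out.append(_full_stats(x))
    return out


data = [[], [1.0, 3.0], [math.nan]]
a = stats_with_continue(data)
b = stats_with_else(data)
print(a[1] == b[1])
```

Both functions return the same results, so the choice between them is about readability: the continue form keeps the regular path at the outer indentation level, while the if / else form makes the two cases visually parallel.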
|
I see this differently: The "early return" block is almost as long as the regular block, because the majority of the work is identical: configuring stats values and appending to vpstats. IMHO it's beneficial for readability to reflect this parallelism in an if / else block with equal indentation for both cases.
On a more general note, the code is a bit fragmented and cluttered with extra variables. Directly appending a dict literal would be much cleaner:
    for (x, quantile) in zip(X, quantiles):
        x = np.asarray(x)
        x = x[~(np.isnan(x) | np.isinf(x))]
        if len(x) == 0:
            vpstats.append({
                'vals': np.array([]),
                'coords': np.array([]),
                'mean': np.nan,
                'median': np.nan,
                'min': np.nan,
                'max': np.nan,
                'quantiles': np.array([]),
            })
        else:
            min_val = np.min(x)
            max_val = np.max(x)
            coords = np.linspace(min_val, max_val, points)
            vpstats.append({
                'vals': method(x, coords),
                'coords': coords,
                'mean': np.mean(x),
                'median': np.median(x),
                'min': min_val,
                'max': max_val,
                'quantiles': np.atleast_1d(np.percentile(x, 100 * quantile))
            })
But I'm not going to fight over this.
|
    max_val = np.max(x)
    quantile_val = np.percentile(x, 100 * q)
    x = np.asarray(x)
    x = x[~(np.isnan(x) | np.isinf(x))]
|
@rahulrathnavel sorry for not being precise. I meant documenting in the docstring (Parameter X) not a code comment.
|
Hi @timhoffm, that makes total sense! No worries at all, I misread what you meant earlier; that was a weak interpretation on my side.
I have removed the inline code comment and moved the explanation up into the public docstring for parameter X in violin_stats so users know that NaN and infinite values are automatically stripped.
The code is pushed and the CI checks are running now. Let me know if the wording looks good to you! Eagerly waiting to hear from you!
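To make the docstring-versus-comment distinction concrete, a numpydoc-style entry for parameter X might read roughly as below. The wording is illustrative only and is not the text merged in this PR; `violin_stats_sketch` is a hypothetical pure-Python stand-in that just counts the values surviving the filter.

```python
import math


def violin_stats_sketch(X):
    """
    Count the finite values in each dataset (hypothetical stand-in).

    Parameters
    ----------
    X : list of sequences of float
        Sample data, one sequence per violin. NaN and infinite values
        are removed from each dataset before any statistics are
        computed, so a dataset holding only such values is treated as
        empty.

    Returns
    -------
    list of int
        Number of finite values kept per dataset.
    """
    return [sum(1 for v in x if math.isfinite(v)) for x in X]


print(violin_stats_sketch([[1.0, math.nan], [math.inf]]))
```

Placing the note under Parameters means it shows up in the rendered API documentation and in help(), where an inline comment would only be visible to someone reading the source.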
|
Hi @timhoffm and @story645, thank you both so much for talking through the design and for the phenomenal mentorship! I have implemented the dict literal snippet exactly as requested, added the documentation comment for the NaN-stripping logic, and the CI checks are now 100% green! 😁 Since this is one of my very first open-source contributions, I really appreciate your patience and guidance in helping me get the code structure and standards just right. I am super excited and looking forward to seeing this merged! Let me know if you need absolutely anything else from me. |
Co-authored-by: Tim Hoffmann <2836374+timhoffm@users.noreply.github.com>
|
Thanks so much for the thorough, guided review and approval of one of my first PRs here, @timhoffm! |
|
@scottshambaugh thanks. I have pushed a commit that updates |
Co-authored-by: Tim Hoffmann <2836374+timhoffm@users.noreply.github.com>
Co-authored-by: Scott Shambaugh <14363975+scottshambaugh@users.noreply.github.com>
😄 To be honest, that's the best thing I have heard so far today. |
|
Congrats on your first contribution to matplotlib @rahulrathnavel! We hope to see you again. |
|
Is this both a bug fix and new feature? The linked issue seems to suggest the former, so wondering if this can go into 3.11? |
I think more bugfix than new feature since boxplot already works on empty datasets. |
|
@meeseeksdev backport to v3.11.x |
PR summary
closes #31700
This PR fixes a bug where passing an empty dataset to violinplot causes it to crash (ValueError: zero-size array to reduction operation minimum), whereas boxplot handles the exact same scenario gracefully by simply drawing nothing.
Reasoning for this implementation:
I updated cbook.violin_stats to check if the input dataset is empty. If it is, it bypasses the min/max/KDE math operations and returns an empty stats dictionary for that specific dataset. I also added a safeguard in axes.violin to prevent width-scaling calculations on empty density arrays.
This allows violinplot to safely skip rendering violins for empty datasets, perfectly mirroring the resilient behavior of boxplot. I have also included a regression test to ensure this remains fixed.
AI Disclosure
I used an AI assistant strictly to help navigate the codebase, locate the specific statistics functions in cbook.py and _axes.py, and draft the boilerplate for the pytest. The core logic was manually reviewed, applied, and tested locally to ensure complete compliance with Matplotlib's standards.
PR checklist
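As a rough illustration of the width-scaling safeguard described in the PR summary above, the sketch below skips normalization when there is no positive peak to divide by. It is a pure-Python stand-in under stated assumptions: `scale_density` is a hypothetical helper, the half-width normalization is assumed for illustration, and the exact condition used in _axes.py may differ (the conversation only mentions a "> 0" constraint).

```python
def scale_density(vals, width):
    """Scale density values so the peak equals half the violin width.

    Assumption: an empty or all-zero density is returned unchanged
    instead of being divided by its (missing or zero) peak.
    """
    peak = max(vals, default=0.0)  # default avoids an error on empty input
    if peak > 0:
        return [0.5 * width * v / peak for v in vals]
    return list(vals)


print(scale_density([], 0.5))
print(scale_density([1.0, 2.0], 1.0))
```

With a guard of this kind, a dataset that produced an empty stats dictionary contributes an empty outline, so nothing is drawn for it and the remaining violins render as usual.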