Skip to content

Conversation

@tdakic
Copy link

@tdakic tdakic commented Feb 25, 2021

Closes #18435.

arr = np.fromstring("\n", sep=" ") now retuns []
@charris charris changed the title Bug fix for 18435 BUG: Make arr = np.fromstring("\n", sep=" ") return []. Feb 25, 2021
* fromstr always returns 0 for basic dtypes; s points to the end of the
* parsed string. If s is not changed an error occurred or the end was
* parsed string. If s is not changed an error occu5trred or the end was
* reached.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also use a comma after "changed".

Copy link
Author

@tdakic tdakic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, a typo :-(

Copy link
Author

@tdakic tdakic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

@tdakic
Copy link
Author

tdakic commented Mar 3, 2021

With this fix, print(np.fromstring(", 1,2,3", sep= ",",dtype =np.int64)) will print [1 2 3]. Should it?

Base automatically changed from master to main March 4, 2021 02:05
@charris
Copy link
Member

charris commented Mar 26, 2021

close/reopen

@charris charris closed this Mar 26, 2021
@charris charris reopened this Mar 26, 2021
@charris charris self-assigned this Mar 26, 2021
@charris
Copy link
Member

charris commented Mar 26, 2021

Looking at this, I'm not convinced the result should be []. Apparently white space (\n) is interpreted as -1. For instance

In [17]: fromstring(' , ',sep=',')                                              
Out[17]: array([-1., -1.])

That may be a problem, but it is a different problem. I'm not yet clear where that value comes from. @seberg Thoughts?

This raises a FutureWarning with other separators.

n [7]: fromstring(',',sep=',')                                                 
<ipython-input-7-5aea430bb5d9>:1: DeprecationWarning: string or file could not be read to its end due to unmatched data; this will raise a ValueError in the future.
  fromstring(',',sep=',')
Out[7]: array([], dtype=float64)

@seberg
Copy link
Member

seberg commented Mar 27, 2021

Uh, that is strange... It seems you are right and whitespace is parsed as a number and that number is -1.... That explains how the -1. can be so persistent.

It basically drops out of this:

/*
* NumPyOS_ascii_strtod_plain:
*
* PyOS_ascii_strtod work-alike, with no enhanced features,
* for forward compatibility with Python >= 2.7
*/
static double
NumPyOS_ascii_strtod_plain(const char *s, char** endptr)
{
double result;
NPY_ALLOW_C_API_DEF;
NPY_ALLOW_C_API;
result = PyOS_string_to_double(s, endptr, NULL);
if (PyErr_Occurred()) {
if (endptr) {
*endptr = (char*)s;
}
PyErr_Clear();
}
NPY_DISABLE_C_API;
return result;
}

Now, that should indicate that an error occurred, by resetting the end-pointer to the start position on error, which then should trigger the error return path here:

char *e = *s;
int r = dtype->f->fromstr(*s, dptr, &e, dtype);
/*
* fromstr always returns 0 for basic dtypes; s points to the end of the
* parsed string. If s is not changed an error occurred or the end was
* reached.
*/
if (*s == e || r < 0) {

Which then has a negative return value indicating an error in the read.

All of that seems a bit strange and every single level of that code looks like it could be smoothened a bit more. Maybe we should try and see if we can do a proper deprecation of that -1 if the -1 works for reliably for everything that cannot be read.

The bug seems to be that call to string_is_fully_read which probably does the right thing for the skip function, but not for the parse function. Since it checks start >= end and equality holds here.

The easier fix should probably to properly propagate a -1 return value when not successfully reading from dtype->f->fromstr (or even lower). The only question is whether we want to try a deprecation there, I am not sure how easy it is.

Long story short, the -1. is definitely an accident (I wonder if it has always been there, or an artifact of fixing another bug). But if we want to gently "deprecate" it, I guess it might be a bit tricky. If we are happy to not be gentle about it, I expect the string_is_fully_read might just be unnecessary after an unsuccessfull value parsing attempt. (worst case, we have to check for a leading \0 maybe, in case the string is fully empty.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Needs decision

Development

Successfully merging this pull request may close these issues.

np.fromstring("\n", sep=" ") returns non-empty array

3 participants