bpo-31690: Make "a", "L" and "u" inline flags in regular expressions local. #3885

Merged
serhiy-storchaka merged 10 commits into python:master from serhiy-storchaka:re-local-aLu on Oct 24, 2017

Conversation

@serhiy-storchaka
Member

@serhiy-storchaka serhiy-storchaka commented Oct 4, 2017

Member

@warsaw warsaw left a comment

Questions and suggestions for rewording some of the documentation.

and :const:`re.X` (verbose), for the part of the expression.
(The flags are described in :ref:`contents-of-module-re`.)
The letters ``'a'``, ``'L'`` and ``'u'`` can't be combined or follow
``'-'``. Instead using the one of them temporary removes other flags.

@warsaw

warsaw Oct 5, 2017
Member

The last sentence doesn't parse quite right. How about "Using one of these flags temporarily removes other flags." ...?

Also, what flags are "temporarily removed" and what does it mean to temporarily remove a flag?

I guess I'm looking for more rationale as to why some flags can appear after the dash but others can't.

@serhiy-storchaka

serhiy-storchaka Oct 5, 2017
Author Member

The last sentence doesn't parse quite right. How about "Using one of these flags temporarily removes other flags." ...?

LGTM.

Also, what flags are "temporarily removed" and what does it mean to temporarily remove a flag?

I meant the flags ASCII, LOCALE and UNICODE, specified by the letters 'a', 'L' and 'u'. Each of these flags affects the meaning of \w and case-insensitive matching. They are mutually exclusive. The UNICODE and LOCALE flags do not affect the part inside (?a:...). And you can nest different flags: (?a: ascii matching (?u: unicode matching )).

I guess I'm looking for more rationale as to why some flags can appear after the dash but others can't.

Because one, and only one, of these flags should be set to avoid ambiguity. If you remove the flag that is set, you have to add another flag. Thus you would need to write (?a-u:...) to switch from Unicode matching to ASCII matching in a string pattern. This looks cumbersome.
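
For illustration, a minimal sketch of the behaviour described in this comment, assuming Python 3.7 or later (where this change landed). The sample strings and the `is_rejected` helper are made up for the example and are not part of the PR. It shows that `(?a:...)` applies ASCII matching only inside the group, that such groups can be nested, and that combining the letters or putting one after `-` is refused.

```python
import re

# \w is ASCII-only inside (?a:...) and Unicode-aware outside it.
pat = re.compile(r'(?a:\w+)-\w+')
ascii_then_unicode = pat.fullmatch('abc-\xe9t\xe9')  # matches
unicode_first = pat.fullmatch('\xe9t\xe9-abc')       # None: '\xe9' is not ASCII

# Nested groups: the inner (?u:...) switches back to Unicode matching.
nested = re.compile(r'(?a:\w(?u:\w)\w)')
nested_ok = nested.fullmatch('a\xe9a')               # matches
nested_fail = nested.fullmatch('\xe9aa')             # None: first \w is ASCII-only


def is_rejected(pattern):
    """Hypothetical helper: True if re.compile() raises re.error."""
    try:
        re.compile(pattern)
    except re.error:
        return True
    return False


combined_rejected = is_rejected(r'(?au:\w)')   # 'a' and 'u' combined
negated_rejected = is_rejected(r'(?a-u:\w)')   # 'u' following '-'
```

Only one of the three letters can be in effect at a time, so naming one in a group is enough to replace whichever was active outside it.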

@warsaw

warsaw Oct 5, 2017
Member

Thanks for the details! I agree with the functionality. What about this for documentation:

"The letters 'a', 'L' and 'u' are mutually exclusive when used as inline flags, so they can't be combined or follow '-'. Instead, when one of them appears in an inline group, it overrides any of the other two letters in the enclosing group. This override is only in effect for the narrow inline group, and the original flags are restored outside of the group."

@serhiy-storchaka

serhiy-storchaka Oct 6, 2017
Author Member

There are more details still. 'u' can be used only in Unicode patterns, and 'L' can be used only in byte patterns. Therefore, for each of these letters there is only one alternative letter in each type of pattern ('u'<->'a' in Unicode patterns and 'a'<->'L' in byte patterns).

In the current PR, (?a:...) disables Unicode matching in Unicode patterns, and (?u:...) restores Unicode matching in Unicode patterns. (?L:...) enables locale-dependent matching in byte patterns, and (?a:...) disables locale-dependent matching in byte patterns (the default).

We could use (?-a:...) for restoring Unicode matching in Unicode patterns and (?-L:...) for disabling locale-dependent matching in byte patterns, but using different flags for switching to ASCII matching looks weird to me.

In the future we will either add support for locale-aware matching in Unicode patterns (there is a patch in bpo-22407), or remove locale-aware matching altogether.

Does all this affect your thought about the wording?
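
As a rough sketch of the pairing described above, assuming Python 3.7 or later: the `compiles` helper and the `results` dictionary are hypothetical names used only for this example. It checks which scoped letters each pattern type accepts.

```python
import re


def compiles(pattern):
    """Hypothetical helper: True if re.compile() accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


results = {
    "str (?u:...)": compiles(r'(?u:\w)'),     # 'u' is allowed in a str pattern
    "str (?a:...)": compiles(r'(?a:\w)'),     # 'a' is allowed in a str pattern
    "str (?L:...)": compiles(r'(?L:\w)'),     # 'L' is refused in a str pattern
    "bytes (?L:...)": compiles(br'(?L:\w)'),  # 'L' is allowed in a bytes pattern
    "bytes (?a:...)": compiles(br'(?a:\w)'),  # 'a' is allowed in a bytes pattern
    "bytes (?u:...)": compiles(br'(?u:\w)'),  # 'u' is refused in a bytes pattern
}
```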


.. versionadded:: 3.6

.. versionchanged:: 3.7
   The letters ``'a'``, ``'L'`` and ``'u'`` can be used in a scope.

@warsaw

warsaw Oct 5, 2017
Member

I don't think "in a scope" is quite the right choice of words. Maybe "...can be used as inline flags." ...?

@serhiy-storchaka

serhiy-storchaka Oct 5, 2017
Author Member

These flags can already be used as inline flags: (?a), (?L). But then they are not scoped and affect the whole expression.

Inline flags are flags inlined in a pattern string instead of being specified as a separate compile() argument.
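
To make the distinction concrete, here is a small sketch, assuming Python 3.7 or later; the sample text is invented for the example. It compares a global inline flag, a scoped flag, and a flag passed as a separate argument.

```python
import re

text = 'a\xe9'  # an ASCII letter followed by a non-ASCII letter

# Global inline flag: (?a) applies to the entire expression,
# so the second \w cannot match '\xe9'.
global_match = re.fullmatch(r'(?a)\w\w', text)

# Scoped flag: only the \w inside the group is ASCII-only.
scoped_match = re.fullmatch(r'(?a:\w)\w', text)

# Flag passed as a separate argument: same effect as the global inline flag.
arg_match = re.fullmatch(r'\w\w', text, re.ASCII)
```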

@warsaw

warsaw Oct 5, 2017
Member

Maybe "group" instead of "scope"?

--

The flags :const:`re.ASCII`, :const:`re.LOCALE` and :const:`re.UNICODE`
can be set for the part of a regular expression.

@warsaw

warsaw Oct 5, 2017
Member

How about "... can be used as inline regular expression flags." ...?

@serhiy-storchaka

serhiy-storchaka Oct 5, 2017
Author Member

They could already be used as inline regular expression flags before, but they affected the whole expression.

@warsaw

warsaw Oct 5, 2017
Member

How about "...can be used within the scope of a group"?

@@ -13,7 +13,7 @@

# update when constants are added or removed

-MAGIC = 20170530
+MAGIC = 20171005

@warsaw

warsaw Oct 5, 2017
Member

This occurred to me during the previous discussion regarding optimization efforts for re.compile(). I meant to bring it up on python-dev at the time but got distracted. I'll mention it here and maybe we need to discuss further. I think we might need to be as careful about regex magic numbers and bytecodes as we are for Python bytecodes.

The reason is that compiled regular expressions can be pickled.

Python 3.7.0a1+ (heads/debughook-dirty:28bd8e477d, Oct  4 2017, 23:21:36) 
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.37)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> x = re.compile('foo')
>>> from pickle import dumps
>>> dumps(x)
b'\x80\x03cre\n_compile\nq\x00X\x03\x00\x00\x00fooq\x01K \x86q\x02Rq\x03.'

If I pickle a regex using Python 3.7.0 and the format changes for 3.7.1, then my pickled regexps can no longer be unpickled, right? I haven't actually analyzed the regex pickle format, so I could be wrong about that, but if I'm right, then I think that's a problem.

@serhiy-storchaka

serhiy-storchaka Oct 5, 2017
Author Member

The regex pickle format contains just the text pattern and the integer flags. When it is unpickled, re.compile() is simply called. The format stays compatible as long as the syntax of regular expressions stays compatible.
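
A quick round-trip sketch of this point, using only the standard `pickle` and `re` modules; the pattern and the variable names are chosen for illustration. The restored object carries the same source pattern and flags, because unpickling recompiles from them.

```python
import pickle
import re

pat = re.compile(r'foo\d+', re.IGNORECASE)

# The pickle stores the pattern text and flags, not compiled regex bytecode.
restored = pickle.loads(pickle.dumps(pat))

same_source = restored.pattern == pat.pattern
same_flags = restored.flags == pat.flags
still_matches = restored.fullmatch('FOO42') is not None
```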

@@ -1470,11 +1470,11 @@ def test_ascii_and_unicode_flag(self):
         self.assertIsNone(pat.match(b'\xe0'))
         # Incompatibilities
         self.assertRaises(ValueError, re.compile, br'\w', re.UNICODE)
-        self.assertRaises(ValueError, re.compile, br'(?u)\w')
+        self.assertRaises(re.error, re.compile, br'(?u)\w')

@warsaw

warsaw Oct 5, 2017
Member

Why the change in exception raised? Isn't that an API break?

@serhiy-storchaka

serhiy-storchaka Oct 5, 2017
Author Member

I consider this a minor implementation detail. re.error looks more correct to me here. ValueError relates to flags specified as a separate argument; re.error relates to inline flags.

@warsaw

warsaw Oct 5, 2017
Member

I'm concerned that this may break some code. For example, in Mailman we sometimes have regular expression strings provided by a site admin or list owner. We have to check to see if they are valid by trying to compile them and catching any exceptions. So it seems like if this changes, those checks would fail.

@serhiy-storchaka

serhiy-storchaka Oct 6, 2017
Author Member

re.error is the exception raised when an error in a regular expression is found. If you try to compile an arbitrary string as a regular expression, you should catch re.error. You should also catch ValueError (raised only when incompatible flags are specified as a separate argument) and OverflowError (raised only when too-large integers are specified in {m,n}). And I think this is all. No new exception is added. If you have different handlers for re.error and ValueError, this change can be visible to you.
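
A minimal sketch of such a validity check, following the list of exceptions given in this comment. `is_valid_regex` is a hypothetical helper written for this example, not code from Mailman or from the PR.

```python
import re


def is_valid_regex(pattern, flags=0):
    """Hypothetical helper: True if the pattern compiles, False otherwise."""
    try:
        re.compile(pattern, flags)
    except (re.error, ValueError, OverflowError):
        return False
    return True


checks = [
    is_valid_regex(r'\w+'),                      # valid pattern
    is_valid_regex(r'('),                        # re.error: syntax error
    is_valid_regex(br'(?u)\w'),                  # 'u' inline flag in a bytes pattern
    is_valid_regex(br'\w', re.UNICODE),          # ValueError: incompatible flag argument
    is_valid_regex(r'a{99999999999999999999}'),  # OverflowError: repeat count too large
]
```

Catching all three exception types in one handler keeps the check unaffected by whether a bytes pattern with an inline `u` flag raises ValueError or re.error.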

@@ -0,0 +1,2 @@
Allow to set the flags re.ASCII, re.LOCALE and re.UNICODE for the part of a
regular expression.

@warsaw

warsaw Oct 5, 2017
Member

How about: "Allow the flags re.ASCII, re.LOCALE, and re.UNICODE to be used as inline flags for regular expressions."

@warsaw

warsaw Oct 22, 2017
Member

I didn't see a response. What do you think about this suggestion?

@serhiy-storchaka

serhiy-storchaka Oct 23, 2017
Author Member

Sorry, I answered a similar comment in another place. "Inline flags" are not the correct words here. All these flags can already be used as inline flags ("(?a)", "(?L)", "(?u)"). But these inline flags affect the entire regular expression. What is new in this PR is allowing them to be set only for a part of the RE: "(?a:...)" etc.

@warsaw

warsaw Oct 23, 2017
Member

Maybe we can call these "group flags"? Thus:

Allow the flags re.ASCII, re.LOCALE, and re.UNICODE to be used as group flags for regular expressions.

?

@serhiy-storchaka

serhiy-storchaka Oct 23, 2017
Author Member

I have never encountered such an expression, but on the other hand, there seems to be no established terminology. I don't much like "group flags" because it contains "group", but this group is non-capturing and is not counted among the groups returned by "match.group()" and other APIs. But if it looks unambiguous to you, I'll use your wording.

@warsaw

warsaw Oct 24, 2017
Member

The original entry felt awkwardly worded to me. "Group flags" seems like a natural description. Ultimately, this is just a blurb entry so it's not that critical.

Member Author

@serhiy-storchaka serhiy-storchaka left a comment

Before tweaking the wording further, I want to apply other documentation changes (#3907).

@warsaw
Member

@warsaw warsaw commented Oct 6, 2017

Does all this affect your thought about the wording?

Yes, but I'm not sure how ;) Or maybe, I'm not sure how to turn your excellent explanation into something more concise. Let's see how your other documentation changes work out first.

@serhiy-storchaka
Member Author

@serhiy-storchaka serhiy-storchaka commented Oct 22, 2017

Could you please take a look at the updated PR, @warsaw?

@warsaw
warsaw approved these changes Oct 22, 2017
Member

@warsaw warsaw left a comment

I have a couple of questions, and noticed one spelling error. Other than that, it looks great, so I'll go ahead and approve it, conditional on at least fixing the spelling mistake.

(default). In byte pattern ``(?L:...)`` switches to locale depending
matching, and ``(?a:...)`` switches to ASCII-only matching (default).
This override is only in effect for the narrow inline group, and the
original matchin mode is restored outside of the group.

@warsaw

warsaw Oct 22, 2017
Member

s/matchin/matching/

:const:`re.L` (locale dependent), :const:`re.M` (multi-line),
:const:`re.S` (dot matches all), :const:`re.U` (Unicode matching),
and :const:`re.X` (verbose), for the part of the expression.
(The flags are described in :ref:`contents-of-module-re`.)

@warsaw

warsaw Oct 22, 2017
Member

Maybe use a bullet list here? I'll leave that to you.

@serhiy-storchaka

serhiy-storchaka Oct 23, 2017
Author Member

This will take too much space. Maybe later I'll add a table somewhere.

@warsaw
warsaw approved these changes Oct 24, 2017
Member

@warsaw warsaw left a comment

Thanks for taking the time to work with me on this change. It's an important and useful improvement! I'm going to approve this PR, and leave it up to you whether to reword the NEWS file blurb or not, based on my last comment.

@serhiy-storchaka serhiy-storchaka merged commit 3557b05 into python:master Oct 24, 2017
3 checks passed
@bedevere-bot
bedevere/issue-number Issue number 31690 found
Details
@bedevere-bot
bedevere/news News entry found in Misc/NEWS.d
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@serhiy-storchaka serhiy-storchaka deleted the serhiy-storchaka:re-local-aLu branch Oct 24, 2017
@serhiy-storchaka
Member Author

@serhiy-storchaka serhiy-storchaka commented Oct 24, 2017

Thank you @warsaw for your review and help with the documentation.
