bpo-31690: Make "a", "L" and "u" inline flags in regular expressions local. #3885
Conversation
|
Questions and suggestions for rewording some of the documentation. |
| and :const:`re.X` (verbose), for the part of the expression. | ||
| (The flags are described in :ref:`contents-of-module-re`.) | ||
| The letters ``'a'``, ``'L'`` and ``'u'`` can't be combined or follow | ||
| ``'-'``. Instead using the one of them temporary removes other flags. |
warsaw
Oct 5, 2017
Member
The last sentence doesn't parse quite right. How about "Using one of these flags temporarily removes other flags." ...?
Also, what flags are "temporarily removed" and what does it mean to temporarily remove a flag?
I guess I'm looking for more rationale as to why some flags can appear after the dash but others can't.
The last sentence doesn't parse quite right. How about "Using one of these flags temporarily removes other flags." ...?
Also, what flags are "temporarily removed" and what does it mean to temporarily remove a flag?
I guess I'm looking for more rationale as to why some flags can appear after the dash but others can't.
serhiy-storchaka
Oct 5, 2017
Author
Member
The last sentence doesn't parse quite right. How about "Using one of these flags temporarily removes other flags." ...?
LGTM.
Also, what flags are "temporarily removed" and what does it mean to temporarily remove a flag?
I meant the flags ASCII, LOCAL and UNICODE specified by letters 'a', 'L' and 'u'. Every of these flags affects the meaning of \w and case-insensitive matching. They are mutually exclusive. The UNICODE and LOCALE flags not affect the part inside (?a:...). And you can nest different flags: (?a: ascii matching (?u: unicode matching )).
I guess I'm looking for more rationale as to why some flags can appear after the dash but others can't.
Because the one, and only one of these flags should be set for unambiguity. If you remove the flag that is set, you should add other flag. Thus you would need to write (?a-u:...) for switching from Unicode matching to ASCII matching in a string pattern. This looks cumbersome.
The last sentence doesn't parse quite right. How about "Using one of these flags temporarily removes other flags." ...?
LGTM.
Also, what flags are "temporarily removed" and what does it mean to temporarily remove a flag?
I meant the flags ASCII, LOCAL and UNICODE specified by letters 'a', 'L' and 'u'. Every of these flags affects the meaning of \w and case-insensitive matching. They are mutually exclusive. The UNICODE and LOCALE flags not affect the part inside (?a:...). And you can nest different flags: (?a: ascii matching (?u: unicode matching )).
I guess I'm looking for more rationale as to why some flags can appear after the dash but others can't.
Because the one, and only one of these flags should be set for unambiguity. If you remove the flag that is set, you should add other flag. Thus you would need to write (?a-u:...) for switching from Unicode matching to ASCII matching in a string pattern. This looks cumbersome.
warsaw
Oct 5, 2017
Member
Thanks for the details! I agree with the functionality. What about this for documentation:
"The letters 'a', 'L' and 'u' are mutually exclusive when used as inline flags, so they can't be combined or follow '-'. Instead, when one of them appears in an inline group, it overrides any of the other two letters in the enclosing group. This override is only in effect for the narrow inline group, and the original flags are restored outside of the group."
Thanks for the details! I agree with the functionality. What about this for documentation:
"The letters 'a', 'L' and 'u' are mutually exclusive when used as inline flags, so they can't be combined or follow '-'. Instead, when one of them appears in an inline group, it overrides any of the other two letters in the enclosing group. This override is only in effect for the narrow inline group, and the original flags are restored outside of the group."
serhiy-storchaka
Oct 6, 2017
Author
Member
There are yet more details. 'u' can be used only in Unicode patterns, and 'L' can be used only in byte pattern. Therefore for any of these letters there is only one other alternate letter in every type of patterns ('u'<->'a' in Unicode patterns and 'a'<->'L' in byte patterns).
In the current PR (?a:...) is used for disabling Unicode matching in Unicode patterns, and (?u:...) is used for restoring Unicode matching in Unicode patterns. (?L:...) is used for enabling locale-depending matching in byte patterns, and (?a:...) is used for disabling locale-depending matching in byte patterns (default).
We could use (?-a:...) for restoring Unicode matching in Unicode patterns and (?-L:...) for disabling locale-depending matching in byte patterns, but using different flags for switching to ASCII matching looks weird to me.
In future we either add support of the locale-aware matching in Unicode patterns (there is a patch in bpo-22407), or remove the locale-aware matching at all.
Does all this affect your thought about the wording?
There are yet more details. 'u' can be used only in Unicode patterns, and 'L' can be used only in byte pattern. Therefore for any of these letters there is only one other alternate letter in every type of patterns ('u'<->'a' in Unicode patterns and 'a'<->'L' in byte patterns).
In the current PR (?a:...) is used for disabling Unicode matching in Unicode patterns, and (?u:...) is used for restoring Unicode matching in Unicode patterns. (?L:...) is used for enabling locale-depending matching in byte patterns, and (?a:...) is used for disabling locale-depending matching in byte patterns (default).
We could use (?-a:...) for restoring Unicode matching in Unicode patterns and (?-L:...) for disabling locale-depending matching in byte patterns, but using different flags for switching to ASCII matching looks weird to me.
In future we either add support of the locale-aware matching in Unicode patterns (there is a patch in bpo-22407), or remove the locale-aware matching at all.
Does all this affect your thought about the wording?
|
|
||
| .. versionadded:: 3.6 | ||
|
|
||
| .. versionchanged:: 3.7 | ||
| The letters ``'a'``, ``'L'`` and ``'u'`` can be used in a scope. | ||
|
|
warsaw
Oct 5, 2017
Member
I don't think "in a scope" is quite the right choice of words. Maybe "...can be used as inline flags." ...?
I don't think "in a scope" is quite the right choice of words. Maybe "...can be used as inline flags." ...?
serhiy-storchaka
Oct 5, 2017
Author
Member
These flags already can be used as inline flags: (?a), (?L). But they are not scoped and affect the whole expression.
Inline flags are flags inlined in a pattern string instead of specified as a separate compile() argument.
These flags already can be used as inline flags: (?a), (?L). But they are not scoped and affect the whole expression.
Inline flags are flags inlined in a pattern string instead of specified as a separate compile() argument.
warsaw
Oct 5, 2017
Member
Maybe "group" instead of "scope"?
Maybe "group" instead of "scope"?
| -- | ||
|
|
||
| The flags :const:`re.ASCII`, :const:`re.LOCALE` and :const:`re.UNICODE` | ||
| can be set for the part of a regular expression. |
warsaw
Oct 5, 2017
Member
How about "... can be used as inline regular expression flags." ...?
How about "... can be used as inline regular expression flags." ...?
serhiy-storchaka
Oct 5, 2017
Author
Member
They can be used as inline regular expression flags before, but affect the whole expression.
They can be used as inline regular expression flags before, but affect the whole expression.
warsaw
Oct 5, 2017
Member
How about "...can be used within the scope of a group"?
How about "...can be used within the scope of a group"?
| @@ -13,7 +13,7 @@ | |||
|
|
|||
| # update when constants are added or removed | |||
|
|
|||
| MAGIC = 20170530 | |||
| MAGIC = 20171005 | |||
warsaw
Oct 5, 2017
Member
This occurred to me during the previous discussion regarding optimization efforts for re.compile(). I meant to bring it up on python-dev at the time but got distracted. I'll mention it here and maybe we need to discuss further. I think we might need to be as careful about regex magic numbers and bytecodes as we are for Python bytecodes.
The reason is that compiled regular expressions can be pickled.
Python 3.7.0a1+ (heads/debughook-dirty:28bd8e477d, Oct 4 2017, 23:21:36)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.37)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> x = re.compile('foo')
>>> from pickle import dumps
>>> dumps(x)
b'\x80\x03cre\n_compile\nq\x00X\x03\x00\x00\x00fooq\x01K \x86q\x02Rq\x03.'
If I pickle a regex using Python 3.7.0 and the format changes for 3.7.1, then my compiled regexps will not be unpicklable, right? I haven't actually analyzed the regex pickle format, so I could be wrong about that, but if I'm right, then I think that's a problem.
This occurred to me during the previous discussion regarding optimization efforts for re.compile(). I meant to bring it up on python-dev at the time but got distracted. I'll mention it here and maybe we need to discuss further. I think we might need to be as careful about regex magic numbers and bytecodes as we are for Python bytecodes.
The reason is that compiled regular expressions can be pickled.
Python 3.7.0a1+ (heads/debughook-dirty:28bd8e477d, Oct 4 2017, 23:21:36)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.37)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> x = re.compile('foo')
>>> from pickle import dumps
>>> dumps(x)
b'\x80\x03cre\n_compile\nq\x00X\x03\x00\x00\x00fooq\x01K \x86q\x02Rq\x03.'
If I pickle a regex using Python 3.7.0 and the format changes for 3.7.1, then my compiled regexps will not be unpicklable, right? I haven't actually analyzed the regex pickle format, so I could be wrong about that, but if I'm right, then I think that's a problem.
serhiy-storchaka
Oct 5, 2017
Author
Member
The regex pickle format just contains a text pattern and integer flags. When unpickled re.compile() just is called. This format is compatible while the syntax of regular expressions is compatible.
The regex pickle format just contains a text pattern and integer flags. When unpickled re.compile() just is called. This format is compatible while the syntax of regular expressions is compatible.
| @@ -1470,11 +1470,11 @@ def test_ascii_and_unicode_flag(self): | |||
| self.assertIsNone(pat.match(b'\xe0')) | |||
| # Incompatibilities | |||
| self.assertRaises(ValueError, re.compile, br'\w', re.UNICODE) | |||
| self.assertRaises(ValueError, re.compile, br'(?u)\w') | |||
| self.assertRaises(re.error, re.compile, br'(?u)\w') | |||
warsaw
Oct 5, 2017
Member
Why the change in exception raised? Isn't that an API break?
Why the change in exception raised? Isn't that an API break?
serhiy-storchaka
Oct 5, 2017
Author
Member
I consider this as minor implementation detail. re.error looks more correct to me here. ValueError is related to flags specified as a separate argument, re.error is related to inline flags.
I consider this as minor implementation detail. re.error looks more correct to me here. ValueError is related to flags specified as a separate argument, re.error is related to inline flags.
warsaw
Oct 5, 2017
Member
I'm concerned that this may break some code. For example, in Mailman we sometimes have regular expression strings provided by a site admin or list owner. We have to check to see if they are valid by trying to compile them and catching any exceptions. So it seems like if this changes, those checks would fail.
I'm concerned that this may break some code. For example, in Mailman we sometimes have regular expression strings provided by a site admin or list owner. We have to check to see if they are valid by trying to compile them and catching any exceptions. So it seems like if this changes, those checks would fail.
serhiy-storchaka
Oct 6, 2017
Author
Member
re.error is an exception raised when an error in a regular expression is found. If you try to compile arbitrary string regular expression you should catch re.error. You should also catch ValueError (only raised when specify incompatible flags as a separate argument) and OverflowError (only raised if specify too large integers in {m,n}). And I think this is all. Now new exception added. If you have different handlers for re.error and ValueError, this change can be visible to you.
re.error is an exception raised when an error in a regular expression is found. If you try to compile arbitrary string regular expression you should catch re.error. You should also catch ValueError (only raised when specify incompatible flags as a separate argument) and OverflowError (only raised if specify too large integers in {m,n}). And I think this is all. Now new exception added. If you have different handlers for re.error and ValueError, this change can be visible to you.
| @@ -0,0 +1,2 @@ | |||
| Allow to set the flags re.ASCII, re.LOCALE and re.UNICODE for the part of a | |||
| regular expression. | |||
warsaw
Oct 5, 2017
Member
How about: "Allow the flags re.ASCII, re.LOCALE, and re.UNICODE to be used as inline flags for regular expressions."
How about: "Allow the flags re.ASCII, re.LOCALE, and re.UNICODE to be used as inline flags for regular expressions."
warsaw
Oct 22, 2017
Member
I didn't see a response. What do you think about this suggestion?
I didn't see a response. What do you think about this suggestion?
serhiy-storchaka
Oct 23, 2017
Author
Member
Sorry, I answered similar comment in other place. "inline flags" are not correct words here. All these flags already can be used as inline flags ("(?a)", "("L)", "(?u)"). But these inline flags affect the entire regular expression. What is new in this PR it is allowing them to be set only for the part of the RE: "(?a:...)" etc.
Sorry, I answered similar comment in other place. "inline flags" are not correct words here. All these flags already can be used as inline flags ("(?a)", "("L)", "(?u)"). But these inline flags affect the entire regular expression. What is new in this PR it is allowing them to be set only for the part of the RE: "(?a:...)" etc.
warsaw
Oct 23, 2017
Member
Maybe we can call these "group flags"? Thus:
Allow the flags re.ASCII, re.LOCALE, and re.UNICODE to be used as group flags for regular expressions.
?
Maybe we can call these "group flags"? Thus:
Allow the flags re.ASCII, re.LOCALE, and re.UNICODE to be used as group flags for regular expressions.
?
serhiy-storchaka
Oct 23, 2017
Author
Member
I have never encountered such an expression, but on the other hand, there seems to be no established terminology. I'm not very like "group flags" because it contains "group", but this group is non-capturing and it is not counted in the groups returned by "match.group()" and other API. But if it looks unambiguous to you, I'll use your wording.
I have never encountered such an expression, but on the other hand, there seems to be no established terminology. I'm not very like "group flags" because it contains "group", but this group is non-capturing and it is not counted in the groups returned by "match.group()" and other API. But if it looks unambiguous to you, I'll use your wording.
warsaw
Oct 24, 2017
Member
The original entry felt awkwardly worded to me. "Group flags" seems like a natural description. Ultimately, this is just a blurb entry so it's not that critical.
The original entry felt awkwardly worded to me. "Group flags" seems like a natural description. Ultimately, this is just a blurb entry so it's not that critical.
|
First than tweak the wording more, I want to apply other documentation changes (#3907). |
| and :const:`re.X` (verbose), for the part of the expression. | ||
| (The flags are described in :ref:`contents-of-module-re`.) | ||
| The letters ``'a'``, ``'L'`` and ``'u'`` can't be combined or follow | ||
| ``'-'``. Instead using the one of them temporary removes other flags. |
serhiy-storchaka
Oct 6, 2017
Author
Member
There are yet more details. 'u' can be used only in Unicode patterns, and 'L' can be used only in byte pattern. Therefore for any of these letters there is only one other alternate letter in every type of patterns ('u'<->'a' in Unicode patterns and 'a'<->'L' in byte patterns).
In the current PR (?a:...) is used for disabling Unicode matching in Unicode patterns, and (?u:...) is used for restoring Unicode matching in Unicode patterns. (?L:...) is used for enabling locale-depending matching in byte patterns, and (?a:...) is used for disabling locale-depending matching in byte patterns (default).
We could use (?-a:...) for restoring Unicode matching in Unicode patterns and (?-L:...) for disabling locale-depending matching in byte patterns, but using different flags for switching to ASCII matching looks weird to me.
In future we either add support of the locale-aware matching in Unicode patterns (there is a patch in bpo-22407), or remove the locale-aware matching at all.
Does all this affect your thought about the wording?
There are yet more details. 'u' can be used only in Unicode patterns, and 'L' can be used only in byte pattern. Therefore for any of these letters there is only one other alternate letter in every type of patterns ('u'<->'a' in Unicode patterns and 'a'<->'L' in byte patterns).
In the current PR (?a:...) is used for disabling Unicode matching in Unicode patterns, and (?u:...) is used for restoring Unicode matching in Unicode patterns. (?L:...) is used for enabling locale-depending matching in byte patterns, and (?a:...) is used for disabling locale-depending matching in byte patterns (default).
We could use (?-a:...) for restoring Unicode matching in Unicode patterns and (?-L:...) for disabling locale-depending matching in byte patterns, but using different flags for switching to ASCII matching looks weird to me.
In future we either add support of the locale-aware matching in Unicode patterns (there is a patch in bpo-22407), or remove the locale-aware matching at all.
Does all this affect your thought about the wording?
| @@ -1470,11 +1470,11 @@ def test_ascii_and_unicode_flag(self): | |||
| self.assertIsNone(pat.match(b'\xe0')) | |||
| # Incompatibilities | |||
| self.assertRaises(ValueError, re.compile, br'\w', re.UNICODE) | |||
| self.assertRaises(ValueError, re.compile, br'(?u)\w') | |||
| self.assertRaises(re.error, re.compile, br'(?u)\w') | |||
serhiy-storchaka
Oct 6, 2017
Author
Member
re.error is an exception raised when an error in a regular expression is found. If you try to compile arbitrary string regular expression you should catch re.error. You should also catch ValueError (only raised when specify incompatible flags as a separate argument) and OverflowError (only raised if specify too large integers in {m,n}). And I think this is all. Now new exception added. If you have different handlers for re.error and ValueError, this change can be visible to you.
re.error is an exception raised when an error in a regular expression is found. If you try to compile arbitrary string regular expression you should catch re.error. You should also catch ValueError (only raised when specify incompatible flags as a separate argument) and OverflowError (only raised if specify too large integers in {m,n}). And I think this is all. Now new exception added. If you have different handlers for re.error and ValueError, this change can be visible to you.
Yes, but I'm not sure how ;) Or maybe, I'm not sure how to turn your excellent explanation into something more concise. Let's see how your other documentation changes work out first. |
|
Could you please take a look on the updated PR @warsaw? |
|
I have a couple of questions, and noticed one spelling error. Other than that, it looks great, so I'll go ahead and approve it, conditional on at least fixing the spelling mistake. |
| (default). In byte pattern ``(?L:...)`` switches to locale depending | ||
| matching, and ``(?a:...)`` switches to ASCII-only matching (default). | ||
| This override is only in effect for the narrow inline group, and the | ||
| original matchin mode is restored outside of the group. |
warsaw
Oct 22, 2017
Member
s/matchin/matching/
s/matchin/matching/
| :const:`re.L` (locale dependent), :const:`re.M` (multi-line), | ||
| :const:`re.S` (dot matches all), :const:`re.U` (Unicode matching), | ||
| and :const:`re.X` (verbose), for the part of the expression. | ||
| (The flags are described in :ref:`contents-of-module-re`.) |
warsaw
Oct 22, 2017
Member
Maybe use a bullet list here? I'll leave that to you.
Maybe use a bullet list here? I'll leave that to you.
serhiy-storchaka
Oct 23, 2017
Author
Member
This will take too much space. Maybe later I'll add a table somewhere.
This will take too much space. Maybe later I'll add a table somewhere.
| @@ -0,0 +1,2 @@ | |||
| Allow to set the flags re.ASCII, re.LOCALE and re.UNICODE for the part of a | |||
| regular expression. | |||
warsaw
Oct 22, 2017
Member
I didn't see a response. What do you think about this suggestion?
I didn't see a response. What do you think about this suggestion?
|
Thanks for taking the time to work with me on this change. It's an important and useful improvement! I'm going to approve this PR, and leave it up to you whether to reword the NEWS file blurb or not, based on my last comment. |
|
Thank you @warsaw for your review and help with the documentation. |
https://bugs.python.org/issue31690