third_party_pyyaml

mirror of https://gitee.com/openharmony/third_party_pyyaml synced 2024-11-27 04:10:36 +00:00

History

Anish Athalye 0716ae21a1 Fix reader for Unicode code points over 0xFFFF (#351)	2019-12-20 20:38:46 +01:00

This patch fixes the handling of inputs with Unicode code points over 0xFFFF when running on a Python 2 that does not have UCS-4 support (which certain distributions still ship, e.g. macOS).

When Python is compiled without UCS-4 support, it uses UCS-2. In this situation, non-BMP Unicode characters, which have code points over 0xFFFF, are represented as surrogate pairs. For example, if we take u'\U0001f3d4', it will be represented as the surrogate pair u'\ud83c\udfd4'. This can be seen by running, for example:

    [i for i in u'\U0001f3d4']

In PyYAML, the reader uses a function `check_printable` to validate inputs, making sure that they only contain printable characters. Prior to this patch, on UCS-2 builds, it incorrectly identified surrogate pairs as non-printable.

It would be fairly natural to write a regular expression that captures strings that contain only printable characters, as opposed to non-printable characters (as identified by the old code, so not excluding surrogate pairs):

    PRINTABLE = re.compile(u'^[\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]*$')

Adding support for surrogate pairs to this would be straightforward, adding the option of having a surrogate high followed by a surrogate low (`[\uD800-\uDBFF][\uDC00-\uDFFF]`):

    PRINTABLE = re.compile(u'^(?:[\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$')

Then, this regex could be used as follows:

    def check_printable(self, data):
        if not self.PRINTABLE.match(data):
            raise ReaderError(...)

However, matching printable strings, rather than searching for non-printable characters as the code currently does, would have the disadvantage of not identifying the culprit character (we wouldn't get the position and the actual non-printable character from a lack of a regex match).

Instead, we can modify the NON_PRINTABLE regex to allow legal surrogate pairs. We do this by removing surrogate pairs from the existing character set and adding the following options for illegal uses of surrogate code points:

- Surrogate low that doesn't follow a surrogate high (either a surrogate low at the start of a string, or a surrogate low that follows a character that's not a surrogate high):

      (?:^|[^\uD800-\uDBFF])[\uDC00-\uDFFF]

- Surrogate high that isn't followed by a surrogate low (either a surrogate high at the end of a string, or a surrogate high that is followed by a character that's not a surrogate low):

      [\uD800-\uDBFF](?:[^\uDC00-\uDFFF]|$)

The behavior of this modified regex should match the one that is used when Python is built with UCS-4 support.
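The approach described in the commit message can be tried out with a minimal sketch. It is not the code in reader.py: it runs on Python 3, where there are no UCS-2 builds, so the surrogate code units are written out explicitly to imitate one. The pattern is assembled from the fragments quoted in the message, on the assumption that "removing surrogate pairs from the existing character set" leaves the range `\xA0-\uFFFD`. The names `to_surrogate_pair` and `find_non_printable` are hypothetical helpers introduced only for illustration.

```python
import re


def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# Assumed reconstruction of the modified pattern, built from the pieces in the
# commit message. Raw strings are used so that the re module itself interprets
# the \xXX and \uXXXX escapes.
NON_PRINTABLE = re.compile(
    # Any character outside the printable set (surrogates are allowed here).
    r'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uFFFD]'
    # Surrogate low that does not follow a surrogate high.
    r'|(?:^|[^\uD800-\uDBFF])[\uDC00-\uDFFF]'
    # Surrogate high that is not followed by a surrogate low.
    r'|[\uD800-\uDBFF](?:[^\uDC00-\uDFFF]|$)'
)


def find_non_printable(data):
    """Return (start, matched_text) for the first offending span, or None."""
    match = NON_PRINTABLE.search(data)
    if match is None:
        return None
    return match.start(), match.group()
```

Under these assumptions, `to_surrogate_pair(0x1F3D4)` gives `(0xD83C, 0xDFD4)`, the pair from the commit message. `find_non_printable('\ud83c\udfd4')` returns `None` because the pair is well formed, while a lone `'\udfd4'` is reported at position 0. Because searching for the offending span yields a match object, the position and the culprit text are available for an error message, which is the advantage the message gives for keeping a NON_PRINTABLE search over a PRINTABLE match.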
__init__.py	Use `is` instead of equality for comparing with None	2019-12-04 00:04:05 +01:00
composer.py	Fix typos	2017-08-08 06:05:28 -05:00
constructor.py	Allow add_multi_constructor with None (#358)	2019-12-07 22:40:48 +01:00
cyaml.py	Make default_flow_style=False	2019-03-08 09:09:48 -08:00
dumper.py	Make default_flow_style=False	2019-03-08 09:09:48 -08:00
emitter.py	Fix logic for quoting special characters (#276)	2019-11-18 11:59:54 +01:00
error.py	scanner: use infinitive verb after auxiliary word could	2015-04-04 13:25:24 -03:00
events.py	scanner: use infinitive verb after auxiliary word could	2015-04-04 13:25:24 -03:00
loader.py	fix typos and stylistic nit	2019-12-03 23:58:55 +01:00
nodes.py	scanner: use infinitive verb after auxiliary word could	2015-04-04 13:25:24 -03:00
parser.py	scanner: use infinitive verb after auxiliary word could	2015-04-04 13:25:24 -03:00
reader.py	Fix reader for Unicode code points over 0xFFFF (#351)	2019-12-20 20:38:46 +01:00
representer.py	Make default_flow_style=False	2019-03-08 09:09:48 -08:00
resolver.py	Adding an implicit resolver to a derived loader should not affect the base loader (fixes issue #57).	2016-08-25 17:42:41 -05:00
scanner.py	Fix up small typo	2019-12-04 00:31:05 +01:00
serializer.py	scanner: use infinitive verb after auxiliary word could	2015-04-04 13:25:24 -03:00
tokens.py	scanner: use infinitive verb after auxiliary word could	2015-04-04 13:25:24 -03:00