This patch fixes the handling of inputs with Unicode code points over
0xFFFF when running on a Python 2 that does not have UCS-4 support
(which certain distributions still ship, e.g. macOS).
When Python is compiled without UCS-4 support, it uses UCS-2. In this
situation, non-BMP Unicode characters, which have code points over
0xFFFF, are represented as surrogate pairs. For example, if we take
u'\U0001f3d4', it will be represented as the surrogate pair
u'\ud83c\udfd4'. This can be seen by running, for example:
[i for i in u'\U0001f3d4']
In PyYAML, the reader uses a function `check_printable` to validate
inputs, making sure that they only contain printable characters. Prior
to this patch, on UCS-2 builds, it incorrectly identified surrogate
pairs as non-printable.
It would be fairly natural to write a regular expression that captures
strings that contain only *printable* characters, as opposed to
*non-printable* characters (as identified by the old code, so not
excluding surrogate pairs):
PRINTABLE = re.compile(u'^[\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]*$')
Adding support for surrogate pairs to this would be straightforward,
adding the option of having a surrogate high followed by a surrogate low
(`[\uD800-\uDBFF][\uDC00-\uDFFF]`):
PRINTABLE = re.compile(u'^(?:[\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$')
Then, this regex could be used as follows:
def check_printable(self, data):
if not self.PRINTABLE.match(data):
raise ReaderError(...)
However, matching printable strings, rather than searching for
non-printable characters as the code currently does, would have the
disadvantage of not identifying the culprit character (we wouldn't get
the position and the actual non-printable character from a lack of a
regex match).
Instead, we can modify the NON_PRINTABLE regex to allow legal surrogate
pairs. We do this by removing surrogate pairs from the existing
character set and adding the following options for illegal uses of
surrogate code points:
- Surrogate low that doesn't follow a surrogate high (either a surrogate
low at the start of a string, or a surrogate low that follows a
character that's not a surrogate high):
(?:^|[^\uD800-\uDBFF])[\uDC00-\uDFFF]
- Surrogate high that isn't followed by a surrogate low (either a
surrogate high at the end of a string, or a surrogate high that is
followed by a character that's not a surrogate low):
[\uD800-\uDBFF](?:[^\uDC00-\uDFFF]|$)
The behavior of this modified regex should match the one that is used
when Python is built with UCS-4 support.
When someone writes a subclass of the YAMLObject class, the constructors
will now be added to all 3 (non-safe) loaders.
Furthermore, we support the class variable `yaml_loader` being a list,
offering more control of which loaders are affected.
To support safe_load in your custom class you could add this:
yaml_loader = yaml.SafeLoader
yaml_loader = yaml.YAMLObject.yaml_loader
yaml_loader.append(yaml.SafeLoader)
* Fix logic for quoting special characters
* Remove has_ucs4 from condition
on systems with `sys.maxunicode <= 0xffff` the comparison
(u'\U00010000' <= ch < u'\U0010ffff') can't be true anyway I think
The `load` and `load_all` methods will issue a warning when they are
called without the 'Loader=' parameter. The warning will point to a URL
that is always up to date with the latest information on the usage of
`load`.
There are several ways to stop the warning:
* Use `full_load(input)` - sugar for `yaml.load(input, FullLoader)`
* FullLoader is the new safe but complete loader class
* Use `safe_load(input)` - sugar for `yaml.load(input, SafeLoader)`
* Make sure your input YAML consists of the 'safe' subset
* Use `unsafe_load(input)` - sugar for `yaml.load(input, UnsafeLoader)`
* Make sure your input YAML consists of the 'safe' subset
* Use `yaml.load(input, Loader=yaml.<loader>)`
* Or shorter `yaml.load(input, yaml.<loader>)`
* Where '<loader>' can be:
* FullLoader - safe, complete Python YAML loading
* SafeLoader - safe, partial Python YAML loading
* UnsafeLoader - more explicit name for the old, unsafe 'Loader' class
* yaml.warnings({'YAMLLoadWarning': False})
* Use this when you use third party modules that use `yaml.load(input)`
* Only do this if input is trusted
The above `load()` expressions all have `load_all()` counterparts.
You can get the original unsafe behavior with:
* `yaml.unsafe_load(input)`
* `yaml.load(input, Loader=yaml.UnsafeLoader)`
In a future release, `yaml.load(input)` will raise an exception.
The new loader called FullLoader is almost entirely complete as
Loader/UnsafeLoader but it does it avoids all known code execution
paths. It is the preferred YAML loader, and the current default for
`yaml.load(input)` when you get the warning.
Here are some of the exploits that can be triggered with UnsafeLoader
but not with FullLoader:
```
python -c 'import os, yaml; yaml.full_load("!!python/object/new:os.system [echo EXPLOIT!]")'`
python -c 'import yaml; print yaml.full_load("!!python/object/new:abs [-5]")'
python -c 'import yaml; yaml.full_load("!!python/object/new:eval [exit(5)]")' ; echo $?
python -c 'import yaml; yaml.full_load("!!python/object/new:exit [5]")' ; echo $?
This is the first release under new maintainership. A bunch of things
involving resource URLs and copyright details needed updating; in
addition to the normal version and changelog updates.
From the Psyco website:
> 12 March 2012
>
> Psyco is unmaintained and dead. Please look at PyPy for the
> state-of-the-art in JIT compilers for Python.
http://psyco.sourceforge.net/
Change yaml.load/yaml.dump to be yaml.safe_load/yaml.safe_dump, introduced yaml.danger_dump/yaml.danger_load, and the same for various other classes.
(python2 only at this moment)
Refs #5