mirror of
https://gitee.com/openharmony/third_party_rust_regex
synced 2025-04-07 20:51:33 +00:00

This commit fixes a bug where matching (?-u:\B) (that is, "not an ASCII word boundary") in the NFA engines could produce match positions at invalid UTF-8 sequence boundaries. The specific problem is that determining whether (?-u:\B) matches or not relies on knowing whether we must report matches only at UTF-8 boundaries, and this wasn't actually being taken into account. (Instead, we prefer to enforce this invariant in the compiler, so that the matching engines mostly don't have to care about it.) But of course, the zero-width assertions are kind of a special case all around, so we need to handle ASCII word boundaries differently depending on whether we require valid UTF-8. This bug was noticed because the DFA actually handles this correctly (by encoding ASCII word boundaries into the state machine itself, which in turn guarantees the valid UTF-8 invariant) while the NFAs don't, leading to an inconsistency. Fix #241.