Pavel Labath f85485df31 Resubmit r325107 (case folding DJB hash)
The issue was that the hash function was generating different results depending
on the signedness of char on the host platform. This commit fixes the issue by
explicitly using an unsigned char type to prevent sign extension and
adds some extra tests.
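
The sign-extension problem can be illustrated with a minimal sketch. The
function names below are hypothetical and are not taken from the patch; the
second function only simulates a host where plain char is signed.

```cpp
#include <cstdint>
#include <string>

// Classic DJB hash step (H = H * 33 + byte), reading each byte as an
// unsigned char so the result does not depend on the host's char signedness.
inline uint32_t djbHashUnsigned(const std::string &S, uint32_t H = 5381) {
  for (char C : S)
    H = H * 33 + static_cast<unsigned char>(C);
  return H;
}

// Same loop, but forcing the byte through signed char to simulate a host
// where plain char is signed: bytes >= 0x80 are sign-extended to negative
// values before being added.
inline uint32_t djbHashSignedChar(const std::string &S, uint32_t H = 5381) {
  for (char C : S) {
    int32_t Extended = static_cast<signed char>(C);
    H = H * 33 + static_cast<uint32_t>(Extended);
  }
  return H;
}
```

For pure ASCII input both functions agree. For the single byte 0xE9 the
unsigned variant yields 177806, while the sign-extending variant yields 177550.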

The original commit message was:

This patch implements a variant of the DJB hash function which folds the
input according to the algorithm in the DWARF 5 specification (Section
6.1.1.4.5), which in turn references the Unicode Standard (Section 5.18,
"Case Mappings").

To achieve this, I have added a llvm::sys::unicode::foldCharSimple
function, which performs this mapping. The implementation of this
function was generated from the CaseFolding.txt file from the Unicode
spec using a python script (which is also included in this patch). The
script tries to optimize the function by coalescing adjacent mappings
with the same shift and stride (terms I made up). Theoretically, it
could be made a bit smarter and merge adjacent blocks that were
interrupted by only one or two characters with exceptional mapping, but
this would save only a couple of branches, while it would greatly
complicate the implementation, so I deemed it was not worth it.
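
To make "shift" and "stride" concrete, here is a rough sketch of what a
coalesced mapping could look like. The FoldBlock struct and foldWithBlocks
function are hypothetical, the table holds only two blocks, and a table is used
here purely for brevity; the generated function in the patch is described as a
sequence of branches.

```cpp
#include <cstdint>

// One coalesced block: every Stride-th code point in [First, Last],
// starting at First, folds to itself plus Shift. (Hypothetical layout.)
struct FoldBlock {
  int First;
  int Last;
  int Stride;
  int Shift;
};

inline int foldWithBlocks(int C) {
  static const FoldBlock Blocks[] = {
      {0x0041, 0x005A, 1, 32}, // 'A'..'Z' -> 'a'..'z'
      {0x0100, 0x012E, 2, 1},  // U+0100, U+0102, ..., U+012E -> next code point
  };
  for (const FoldBlock &B : Blocks) {
    if (C < B.First || C > B.Last)
      continue;
    // Inside the block: only code points that sit on the stride are mapped.
    if ((C - B.First) % B.Stride == 0)
      return C + B.Shift;
    return C;
  }
  return C; // No block matched: the character folds to itself.
}
```

With these two blocks, foldWithBlocks(0x0100) returns 0x0101, while
foldWithBlocks(0x0101) is returned unchanged because it is not on the stride.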

Since we assume that the vast majority of the input characters will be
US-ASCII, the folding hash function has a fast-path for handling these,
and only whips out the full decode+fold+encode logic if we encounter a
character outside of this range. It might be possible to implement the
folding directly on UTF-8 sequences, but this would also bring a lot of
complexity for the few cases where we will actually need to process
non-ASCII characters.
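
The fast-path idea can be sketched roughly as follows. This is not the code
from the patch: all names are hypothetical, the input is assumed to be valid
UTF-8, and foldSketch is a stand-in that covers only the Latin-1 capitals
instead of the full folding table.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Stand-in for the real folding function (assumption: only U+00C0..U+00DE,
// excluding U+00D7 MULTIPLICATION SIGN, are folded here).
inline uint32_t foldSketch(uint32_t C) {
  if (C >= 0xC0 && C <= 0xDE && C != 0xD7)
    return C + 32;
  return C;
}

// Decodes one code point starting at S[I] and advances I past it.
// Assumes well-formed UTF-8; performs no validation.
inline uint32_t decodeOne(const std::string &S, size_t &I) {
  unsigned char B = S[I++];
  int Extra = B >= 0xF0 ? 3 : B >= 0xE0 ? 2 : B >= 0xC0 ? 1 : 0;
  uint32_t C = Extra == 0 ? B : B & (0x3F >> Extra);
  while (Extra-- > 0 && I < S.size())
    C = (C << 6) | (static_cast<unsigned char>(S[I++]) & 0x3F);
  return C;
}

// Re-encodes C as UTF-8 and feeds each resulting byte into the DJB hash.
inline void hashCodePoint(uint32_t C, uint32_t &H) {
  auto Add = [&H](uint32_t Byte) { H = H * 33 + Byte; };
  if (C < 0x80) {
    Add(C);
  } else if (C < 0x800) {
    Add(0xC0 | (C >> 6));
    Add(0x80 | (C & 0x3F));
  } else if (C < 0x10000) {
    Add(0xE0 | (C >> 12));
    Add(0x80 | ((C >> 6) & 0x3F));
    Add(0x80 | (C & 0x3F));
  } else {
    Add(0xF0 | (C >> 18));
    Add(0x80 | ((C >> 12) & 0x3F));
    Add(0x80 | ((C >> 6) & 0x3F));
    Add(0x80 | (C & 0x3F));
  }
}

inline uint32_t caseFoldingDjbSketch(const std::string &S, uint32_t H = 5381) {
  size_t I = 0;
  while (I < S.size()) {
    unsigned char B = S[I];
    if (B < 0x80) {
      // Fast path: US-ASCII is a single byte, so fold and hash it directly.
      if (B >= 'A' && B <= 'Z')
        B += 'a' - 'A';
      H = H * 33 + B;
      ++I;
      continue;
    }
    // Slow path: decode, fold, re-encode, and hash the folded bytes.
    hashCodePoint(foldSketch(decodeOne(S, I)), H);
  }
  return H;
}
```

Under these assumptions, "FOO" and "foo" hash identically, as do the UTF-8
encodings of U+00C9 and U+00E9.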

Reviewers: JDevlieghere, aprantl, probinson, dblaikie

Subscribers: mgorny, hintonda, echristo, clayborg, vleschuk, llvm-commits

Differential Revision: https://reviews.llvm.org/D42740

llvm-svn: 325732
2018-02-21 22:36:31 +00:00

//===- llvm/Support/Unicode.h - Unicode character properties -*- C++ -*-=====//
//
// The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
// This file defines functions that allow querying certain properties of Unicode
// characters.
//
//===----------------------------------------------------------------------===//
#ifndef LLVM_SUPPORT_UNICODE_H
#define LLVM_SUPPORT_UNICODE_H
namespace llvm {
class StringRef;
namespace sys {
namespace unicode {
enum ColumnWidthErrors {
  ErrorInvalidUTF8 = -2,
  ErrorNonPrintableCharacter = -1
};
/// Determines if a character is likely to be displayed correctly on the
/// terminal. An exact implementation would depend on the specific terminal, so
/// we define semantics suitable for the generic case of a terminal capable of
/// displaying Unicode characters.
///
/// All characters from the Unicode code point range are considered printable
/// except for:
/// * C0 and C1 control character ranges;
/// * default ignorable code points as per 5.21 of
/// http://www.unicode.org/versions/Unicode6.2.0/UnicodeStandard-6.2.pdf
/// except for U+00AD SOFT HYPHEN, as it's actually displayed on most
/// terminals;
/// * format characters (category = Cf);
/// * surrogates (category = Cs);
/// * unassigned characters (category = Cn).
/// \return true if the character is considered printable.
bool isPrintable(int UCS);
/// Gets the number of positions the UTF8-encoded \p Text is likely to occupy
/// when output on a terminal ("character width"). This depends on the
/// implementation of the terminal, and there's no standard definition of
/// character width.
///
/// The implementation defines it in a way that is expected to be compatible
/// with a generic Unicode-capable terminal.
///
/// \return Character width:
/// * ErrorNonPrintableCharacter (-1) if \p Text contains non-printable
/// characters (as identified by isPrintable);
/// * 0 for each non-spacing and enclosing combining mark;
/// * 2 for each CJK character excluding halfwidth forms;
/// * 1 for each of the remaining characters.
int columnWidthUTF8(StringRef Text);
/// Fold input Unicode character according to the Simple Unicode case folding
/// rules.
int foldCharSimple(int C);
} // namespace unicode
} // namespace sys
} // namespace llvm
#endif