As obvious from the paper's title this is an LWG issue and thus retroactively
applied to C++20. This change may the output for certain code points:
1 Considers 8477 extra codepoints as having a width 2 (as of Unicode 15)
(mostly Tangut Ideographs)
2 Change the width of 85 unassigned code points from 2 to 1
3 Change the width of 8 codepoints (in the range U+3248 CIRCLED NUMBER
TEN ON BLACK SQUARE ... U+324F CIRCLED NUMBER EIGHTY ON BLACK
SQUARE) from 2 to 1, because it seems questionable to make an exception
for those without input from Unicode
Note that libc++ already uses Unicode 15, while the Standard requires Unicode 12.
(The last time I checked MSVC STL used Unicode 14.)
So in practice the only notable change is item 3.
Implements
P2675 LWG3780: The Paper
format's width estimation is too approximate and not forward compatible
Benchmark before these changes
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
BM_ascii_text<char> 3928 ns 3928 ns 178131
BM_unicode_text<char> 75231 ns 75230 ns 9158
BM_cyrillic_text<char> 59837 ns 59834 ns 11529
BM_japanese_text<char> 39842 ns 39832 ns 17501
BM_emoji_text<char> 3931 ns 3930 ns 177750
BM_ascii_text<wchar_t> 4024 ns 4024 ns 174190
BM_unicode_text<wchar_t> 63756 ns 63751 ns 11136
BM_cyrillic_text<wchar_t> 44639 ns 44638 ns 15597
BM_japanese_text<wchar_t> 34425 ns 34424 ns 20283
BM_emoji_text<wchar_t> 3937 ns 3937 ns 177684
Benchmark after these changes
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
BM_ascii_text<char> 3914 ns 3913 ns 178814
BM_unicode_text<char> 70380 ns 70378 ns 9694
BM_cyrillic_text<char> 51889 ns 51877 ns 13488
BM_japanese_text<char> 41707 ns 41705 ns 16723
BM_emoji_text<char> 3908 ns 3907 ns 177912
BM_ascii_text<wchar_t> 3949 ns 3948 ns 177525
BM_unicode_text<wchar_t> 64591 ns 64587 ns 10649
BM_cyrillic_text<wchar_t> 44089 ns 44078 ns 15721
BM_japanese_text<wchar_t> 39369 ns 39367 ns 17779
BM_emoji_text<wchar_t> 3936 ns 3934 ns 177821
Benchmarks without "if(__code_point < (__entries[0] >> 14))"
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
BM_ascii_text<char> 3922 ns 3922 ns 178587
BM_unicode_text<char> 94474 ns 94474 ns 7351
BM_cyrillic_text<char> 69202 ns 69200 ns 10157
BM_japanese_text<char> 42735 ns 42692 ns 16382
BM_emoji_text<char> 3920 ns 3919 ns 178704
BM_ascii_text<wchar_t> 3951 ns 3950 ns 177224
BM_unicode_text<wchar_t> 81003 ns 80988 ns 8668
BM_cyrillic_text<wchar_t> 57020 ns 57018 ns 12048
BM_japanese_text<wchar_t> 39695 ns 39687 ns 17582
BM_emoji_text<wchar_t> 3977 ns 3976 ns 176479
This optimization does carry its weight for the Unicode and Cyrillic
test. For the Japanese tests the gains are minor and for emoji it seems
to have no effect.
Reviewed By: ldionne, tahonermann, #libc
Differential Revision: https://reviews.llvm.org/D144499
This adds support for the new code points in the Extended Grapheme
Cluster algorithm. The algorithm itself has remained unchanged.
The width estimation still follows the rules of the Standard.
@cor3ntin filed
LWG3780 format's width estimation is too approximate and not forward compatible
to improve the estimate.
Reviewed By: ldionne, #libc
Differential Revision: https://reviews.llvm.org/D134106
This changes makes it easier to update the Unicode data files used for
the Extended Graphme Clustering as added in D126971.
Reviewed By: ldionne, #libc
Differential Revision: https://reviews.llvm.org/D129668