Commit Graph

160 Commits

Author SHA1 Message Date
itrofimow
51e91b64d0
[libc++abi] Implement __cxa_init_primary_exception and use it to optimize std::make_exception_ptr (#65534)
This patch implements __cxa_init_primary_exception, an extension to the 
Itanium C++ ABI. This extension is already present in both libsupc++ and 
libcxxrt. This patch also starts making use of this function in 
std::make_exception_ptr: instead of going through a full throw/catch 
cycle, we are now able to initialize an exception directly, thus making 
std::make_exception_ptr around 30x faster.
2024-01-22 10:12:41 -05:00
ZijunZhaoCCK
fdd089b500
[libc++] Implement ranges::contains (#65148)
Differential Revision: https://reviews.llvm.org/D159232
```
Running ./ranges_contains.libcxx.out
Run on (10 X 24.121 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x10)
  L1 Instruction 128 KiB (x10)
  L2 Unified 4096 KiB (x5)
Load Average: 3.37, 6.77, 5.27
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
bm_contains_char/16             1.88 ns         1.87 ns    371607095
bm_contains_char/256            7.48 ns         7.47 ns     93292285
bm_contains_char/4096           99.7 ns         99.6 ns      7013185
bm_contains_char/65536          1296 ns         1294 ns       540436
bm_contains_char/1048576       23887 ns        23860 ns        29302
bm_contains_char/16777216     389420 ns       389095 ns         1796
bm_contains_int/16              7.14 ns         7.14 ns     97776288
bm_contains_int/256             90.4 ns         90.3 ns      7558089
bm_contains_int/4096            1294 ns         1290 ns       543052
bm_contains_int/65536          20482 ns        20443 ns        34334
bm_contains_int/1048576       328817 ns       327965 ns         2147
bm_contains_int/16777216     5246279 ns      5239361 ns          133
bm_contains_bool/16             2.19 ns         2.19 ns    322565780
bm_contains_bool/256            3.42 ns         3.41 ns    205025467
bm_contains_bool/4096           22.1 ns         22.1 ns     31780479
bm_contains_bool/65536           333 ns          332 ns      2106606
bm_contains_bool/1048576        5126 ns         5119 ns       135901
bm_contains_bool/16777216      81656 ns        81574 ns         8569
```

---------

Co-authored-by: Nathan Gauër <brioche@google.com>
2023-12-19 16:34:19 -08:00
Nikolas Klauser
f7407411a1
[libc++] Optimize std::find for segmented iterators (#67224)
```
--------------------------------------------------------------------------
Benchmark                                              old             new
--------------------------------------------------------------------------
bm_find<std::deque<char>>/1                        6.06 ns         10.6 ns
bm_find<std::deque<char>>/2                        15.5 ns         10.6 ns
bm_find<std::deque<char>>/3                        19.0 ns         10.6 ns
bm_find<std::deque<char>>/4                        20.8 ns         10.6 ns
bm_find<std::deque<char>>/5                        22.0 ns         10.6 ns
bm_find<std::deque<char>>/6                        23.0 ns         10.5 ns
bm_find<std::deque<char>>/7                        24.8 ns         10.7 ns
bm_find<std::deque<char>>/8                        25.7 ns         10.6 ns
bm_find<std::deque<char>>/16                       28.3 ns         10.6 ns
bm_find<std::deque<char>>/64                       44.2 ns         27.0 ns
bm_find<std::deque<char>>/512                       133 ns         37.6 ns
bm_find<std::deque<char>>/4096                      867 ns         53.1 ns
bm_find<std::deque<char>>/32768                    6838 ns          160 ns
bm_find<std::deque<char>>/262144                  52897 ns         1495 ns
bm_find<std::deque<char>>/1048576                215621 ns         6077 ns
bm_find<std::deque<short>>/1                       6.03 ns         6.28 ns
bm_find<std::deque<short>>/2                       15.8 ns         15.8 ns
bm_find<std::deque<short>>/3                       20.5 ns         20.3 ns
bm_find<std::deque<short>>/4                       21.0 ns         21.0 ns
bm_find<std::deque<short>>/5                       23.0 ns         22.1 ns
bm_find<std::deque<short>>/6                       22.6 ns         23.0 ns
bm_find<std::deque<short>>/7                       23.4 ns         23.7 ns
bm_find<std::deque<short>>/8                       24.4 ns         24.9 ns
bm_find<std::deque<short>>/16                      26.6 ns         27.2 ns
bm_find<std::deque<short>>/64                      43.2 ns         40.9 ns
bm_find<std::deque<short>>/512                      124 ns         90.7 ns
bm_find<std::deque<short>>/4096                     845 ns          525 ns
bm_find<std::deque<short>>/32768                   7273 ns         3194 ns
bm_find<std::deque<short>>/262144                 53710 ns        24385 ns
bm_find<std::deque<short>>/1048576               216086 ns        96195 ns
bm_find<std::deque<int>>/1                         6.03 ns         10.3 ns
bm_find<std::deque<int>>/2                         15.6 ns         10.3 ns
bm_find<std::deque<int>>/3                         19.1 ns         10.3 ns
bm_find<std::deque<int>>/4                         22.3 ns         10.3 ns
bm_find<std::deque<int>>/5                         23.5 ns         10.4 ns
bm_find<std::deque<int>>/6                         23.1 ns         10.3 ns
bm_find<std::deque<int>>/7                         23.7 ns         10.2 ns
bm_find<std::deque<int>>/8                         24.5 ns         10.2 ns
bm_find<std::deque<int>>/16                        27.9 ns         26.6 ns
bm_find<std::deque<int>>/64                        42.6 ns         32.2 ns
bm_find<std::deque<int>>/512                        123 ns         43.0 ns
bm_find<std::deque<int>>/4096                       874 ns         93.5 ns
bm_find<std::deque<int>>/32768                     7031 ns          751 ns
bm_find<std::deque<int>>/262144                   57723 ns         6169 ns
bm_find<std::deque<int>>/1048576                 230867 ns        35851 ns
bm_ranges_find<std::deque<char>>/1                 5.97 ns         10.6 ns
bm_ranges_find<std::deque<char>>/2                 16.0 ns         10.5 ns
bm_ranges_find<std::deque<char>>/3                 19.5 ns         10.5 ns
bm_ranges_find<std::deque<char>>/4                 21.1 ns         10.6 ns
bm_ranges_find<std::deque<char>>/5                 22.8 ns         10.5 ns
bm_ranges_find<std::deque<char>>/6                 22.8 ns         10.6 ns
bm_ranges_find<std::deque<char>>/7                 23.4 ns         10.8 ns
bm_ranges_find<std::deque<char>>/8                 24.1 ns         10.5 ns
bm_ranges_find<std::deque<char>>/16                26.9 ns         10.6 ns
bm_ranges_find<std::deque<char>>/64                50.2 ns         27.2 ns
bm_ranges_find<std::deque<char>>/512                126 ns         38.3 ns
bm_ranges_find<std::deque<char>>/4096               868 ns         53.8 ns
bm_ranges_find<std::deque<char>>/32768             6695 ns          161 ns
bm_ranges_find<std::deque<char>>/262144           54411 ns         1497 ns
bm_ranges_find<std::deque<char>>/1048576         241699 ns         6042 ns
bm_ranges_find<std::deque<short>>/1                6.39 ns         6.31 ns
bm_ranges_find<std::deque<short>>/2                15.8 ns         15.9 ns
bm_ranges_find<std::deque<short>>/3                19.0 ns         19.8 ns
bm_ranges_find<std::deque<short>>/4                20.8 ns         20.9 ns
bm_ranges_find<std::deque<short>>/5                21.8 ns         22.1 ns
bm_ranges_find<std::deque<short>>/6                23.0 ns         23.0 ns
bm_ranges_find<std::deque<short>>/7                23.2 ns         23.9 ns
bm_ranges_find<std::deque<short>>/8                23.7 ns         24.4 ns
bm_ranges_find<std::deque<short>>/16               26.6 ns         26.8 ns
bm_ranges_find<std::deque<short>>/64               43.4 ns         39.7 ns
bm_ranges_find<std::deque<short>>/512               131 ns         90.5 ns
bm_ranges_find<std::deque<short>>/4096              851 ns          523 ns
bm_ranges_find<std::deque<short>>/32768            7370 ns         3166 ns
bm_ranges_find<std::deque<short>>/262144          60778 ns        24814 ns
bm_ranges_find<std::deque<short>>/1048576        229288 ns        99273 ns
bm_ranges_find<std::deque<int>>/1                  6.43 ns         10.2 ns
bm_ranges_find<std::deque<int>>/2                  16.6 ns         10.2 ns
bm_ranges_find<std::deque<int>>/3                  19.6 ns         10.2 ns
bm_ranges_find<std::deque<int>>/4                  21.0 ns         10.2 ns
bm_ranges_find<std::deque<int>>/5                  21.9 ns         10.4 ns
bm_ranges_find<std::deque<int>>/6                  22.7 ns         10.2 ns
bm_ranges_find<std::deque<int>>/7                  23.9 ns         10.2 ns
bm_ranges_find<std::deque<int>>/8                  23.8 ns         10.2 ns
bm_ranges_find<std::deque<int>>/16                 27.2 ns         27.1 ns
bm_ranges_find<std::deque<int>>/64                 42.4 ns         32.4 ns
bm_ranges_find<std::deque<int>>/512                 122 ns         43.0 ns
bm_ranges_find<std::deque<int>>/4096                895 ns         93.7 ns
bm_ranges_find<std::deque<int>>/32768              6890 ns          756 ns
bm_ranges_find<std::deque<int>>/262144            54025 ns         6102 ns
bm_ranges_find<std::deque<int>>/1048576          221558 ns        32783 ns
```
2023-12-15 17:10:16 +01:00
Hui
36dd743212
[libc++][test] add more benchmarks for stop_token (#69590) 2023-12-02 16:14:38 +00:00
Louis Dionne
1a5af34e6f
[libc++] Speed up classic locale (take 2) (#73533)
Locale objects use atomic reference counting, which may be very
expensive in parallel applications. The classic locale is used by
default by all streams and can be very contended. But it's never
destroyed, so the reference counting is also completely pointless on the
classic locale. Currently ~70% of time in the parallel stringstream
benchmarks is spent in locale ctor/dtor. And the execution radically
slows down with more threads.

Avoid reference counting on the classic locale. With this change
parallel benchmarks start to scale with threads.

This is a re-application of f8afc53d64 (aka PR #72112) which was
reverted in 4e0c48b907 because it broke the sanitizer builds due
to an initialization order fiasco. This issue has now been fixed by
ensuring that the locale is constinit'ed.

Co-authored-by: Dmitry Vyukov <dvyukov@google.com>
2023-11-29 09:31:05 -05:00
Louis Dionne
ac8c9f1e39
[libc++] Properly guard std::filesystem with >= C++17 (#72701)
<filesystem> is a C++17 addition. In C++11 and C++14 modes, we actually
have all the code for <filesystem> but it is hidden behind a non-inline
namespace __fs so it is not accessible. Instead of doing this unusual
dance, just guard the code for filesystem behind a classic C++17 check
like we normally do.
2023-11-28 18:41:59 -05:00
Kirill Stoimenov
4e0c48b907 Revert "[libc++] Speed up classic locale (#72112)"
Looks like it broke the ASAN build: https://lab.llvm.org/buildbot/#/builders/168/builds/17053/steps/9/logs/stdio

This reverts commit f8afc53d64.
2023-11-27 20:21:58 +00:00
Dmitry Vyukov
f8afc53d64
[libc++] Speed up classic locale (#72112)
Locale objects use atomic reference counting, which may be very
expensive in parallel applications. The classic locale is used by
default by all streams and can be very contended. But it's never
destroyed, so the reference counting is also completely pointless on the
classic locale. Currently ~70% of time in the parallel stringstream
benchmarks is spent in locale ctor/dtor. And the execution radically
slows down with more threads.

Avoid reference counting on the classic locale. With this change
parallel benchmarks start to scale with threads.

Co-authored-by: Louis Dionne <ldionne.2@gmail.com>

```
                              │   baseline   │    optimized                            │
                              │    sec/op    │    sec/op      vs base                  │
Istream_numbers/0/threads:1      4.672µ ± 0%   4.419µ ± 0%     -5.42% (p=0.000 n=30+39)
Istream_numbers/0/threads:72   539.817µ ± 0%   9.842µ ± 1%    -98.18% (p=0.000 n=30+40)
Istream_numbers/1/threads:1      4.890µ ± 0%   4.750µ ± 0%     -2.85% (p=0.000 n=30+40)
Istream_numbers/1/threads:72     66.44µ ± 1%   10.14µ ± 1%    -84.74% (p=0.000 n=30+40)
Istream_numbers/2/threads:1      4.888µ ± 0%   4.746µ ± 0%     -2.92% (p=0.000 n=30+40)
Istream_numbers/2/threads:72     494.8µ ± 0%   410.2µ ± 1%    -17.11% (p=0.000 n=30+40)
Istream_numbers/3/threads:1      4.697µ ± 0%   4.695µ ± 5%          ~ (p=0.391 n=30+37)
Istream_numbers/3/threads:72     421.5µ ± 7%   421.9µ ± 9%          ~ (p=0.665 n=30)
Ostream_number/0/threads:1       183.0n ± 0%   141.0n ± 2%    -22.95% (p=0.000 n=30)
Ostream_number/0/threads:72    24196.5n ± 1%   343.5n ± 3%    -98.58% (p=0.000 n=30)
Ostream_number/1/threads:1       250.0n ± 0%   196.0n ± 2%    -21.60% (p=0.000 n=30)
Ostream_number/1/threads:72    16260.5n ± 0%   407.0n ± 2%    -97.50% (p=0.000 n=30)
Ostream_number/2/threads:1       254.0n ± 0%   196.0n ± 1%    -22.83% (p=0.000 n=30)
Ostream_number/2/threads:72      28.49µ ± 1%   18.89µ ± 5%    -33.72% (p=0.000 n=30)
Ostream_number/3/threads:1       185.0n ± 0%   185.0n ± 0%      0.00% (p=0.017 n=30)
Ostream_number/3/threads:72      19.38µ ± 4%   19.33µ ± 5%          ~ (p=0.425 n=30)
```
2023-11-27 07:00:21 +01:00
Nikolas Klauser
c81bfc61da [libc++] Optimize for_each for segmented iterators
```
---------------------------------------------------
Benchmark                       old             new
---------------------------------------------------
bm_for_each/1               3.00 ns         2.98 ns
bm_for_each/2               4.53 ns         4.57 ns
bm_for_each/3               5.82 ns         5.82 ns
bm_for_each/4               6.94 ns         6.91 ns
bm_for_each/5               7.55 ns         7.75 ns
bm_for_each/6               7.06 ns         7.45 ns
bm_for_each/7               6.69 ns         7.14 ns
bm_for_each/8               6.86 ns         4.06 ns
bm_for_each/16              11.5 ns         5.73 ns
bm_for_each/64              43.7 ns         4.06 ns
bm_for_each/512              356 ns         7.98 ns
bm_for_each/4096            2787 ns         53.6 ns
bm_for_each/32768          20836 ns          438 ns
bm_for_each/262144        195362 ns         4945 ns
bm_for_each/1048576       685482 ns        19822 ns
```

Reviewed By: ldionne, Mordante, #libc

Spies: bgraur, sberg, arichardson, libcxx-commits

Differential Revision: https://reviews.llvm.org/D151274
2023-11-14 23:55:24 +01:00
Hui
511236e074
[libc++][test] Add stop_token benchmark (#69117)
This is transforming the `stop_token` benchmark that Lewis Baker had
created into Google
Bench
https://reviews.llvm.org/D154702
2023-10-16 21:49:37 +01:00
Louis Dionne
36bb134ac7
[libc++] Use -nostdlib++ on GCC unconditionally (#68832)
We support GCC 13, which supports the flag. This allows simplifying the
CMake logic around the use of -nostdlib++. Note that there are other
places where we don't assume -nostdlib++ yet in the build, but this
patch is intentionally trying to be small because this part of our CMake
is pretty tricky.
2023-10-13 11:53:57 -07:00
Nikolas Klauser
a9138cdb36 [libc++] Optimize ranges::count for __bit_iterators
```
---------------------------------------------------------------
Benchmark                                    old            new
---------------------------------------------------------------
bm_vector_bool_count/1                   1.92 ns        1.92 ns
bm_vector_bool_count/2                   1.92 ns        1.92 ns
bm_vector_bool_count/3                   1.92 ns        1.92 ns
bm_vector_bool_count/4                   1.92 ns        1.92 ns
bm_vector_bool_count/5                   1.92 ns        1.92 ns
bm_vector_bool_count/6                   1.92 ns        1.92 ns
bm_vector_bool_count/7                   1.92 ns        1.92 ns
bm_vector_bool_count/8                   1.92 ns        1.92 ns
bm_vector_bool_count/16                  1.92 ns        1.92 ns
bm_vector_bool_count/64                  2.24 ns        2.25 ns
bm_vector_bool_count/512                 3.19 ns        3.20 ns
bm_vector_bool_count/4096                14.1 ns        12.3 ns
bm_vector_bool_count/32768               84.0 ns        83.6 ns
bm_vector_bool_count/262144               664 ns         661 ns
bm_vector_bool_count/1048576             2623 ns        2628 ns
bm_vector_bool_ranges_count/1            1.07 ns        1.92 ns
bm_vector_bool_ranges_count/2            1.65 ns        1.92 ns
bm_vector_bool_ranges_count/3            2.27 ns        1.92 ns
bm_vector_bool_ranges_count/4            2.68 ns        1.92 ns
bm_vector_bool_ranges_count/5            3.33 ns        1.92 ns
bm_vector_bool_ranges_count/6            3.99 ns        1.92 ns
bm_vector_bool_ranges_count/7            4.67 ns        1.92 ns
bm_vector_bool_ranges_count/8            5.19 ns        1.92 ns
bm_vector_bool_ranges_count/16           11.1 ns        1.92 ns
bm_vector_bool_ranges_count/64           52.2 ns        2.24 ns
bm_vector_bool_ranges_count/512           452 ns        3.20 ns
bm_vector_bool_ranges_count/4096         3577 ns        12.1 ns
bm_vector_bool_ranges_count/32768       28725 ns        83.7 ns
bm_vector_bool_ranges_count/262144     229676 ns         662 ns
bm_vector_bool_ranges_count/1048576    905574 ns        2625 ns
```

Reviewed By: #libc, ldionne

Spies: arichardson, ldionne, libcxx-commits

Differential Revision: https://reviews.llvm.org/D156956
2023-10-06 22:58:41 +02:00
Martijn Vels
6fe4e033f0 [libc++] Optimize vector push_back to avoid continuous load and store of end pointer
Credits: this change is based on analysis and a proof of concept by
gerbens@google.com.

Before, the compiler loses track of end as 'this' and other references
possibly escape beyond the compiler's scope. This can be see in the
generated assembly:

     16.28 │200c80:   mov     %r15d,(%rax)
     60.87 │200c83:   add     $0x4,%rax
           │200c87:   mov     %rax,-0x38(%rbp)
      0.03 │200c8b: → jmpq    200d4e
      ...
      ...
      1.69 │200d4e:   cmp     %r15d,%r12d
           │200d51: → je      200c40
     16.34 │200d57:   inc     %r15d
      0.05 │200d5a:   mov     -0x38(%rbp),%rax
      3.27 │200d5e:   mov     -0x30(%rbp),%r13
      1.47 │200d62:   cmp     %r13,%rax
           │200d65: → jne     200c80

We fix this by always explicitly storing the loaded local and pointer
back at the end of push back. This generates some slight source 'noise',
but creates nice and compact fast path code, i.e.:

     32.64 │200760:   mov    %r14d,(%r12)
      9.97 │200764:   add    $0x4,%r12
      6.97 │200768:   mov    %r12,-0x38(%rbp)
     32.17 │20076c:   add    $0x1,%r14d
      2.36 │200770:   cmp    %r14d,%ebx
           │200773: → je     200730
      8.98 │200775:   mov    -0x30(%rbp),%r13
      6.75 │200779:   cmp    %r13,%r12
           │20077c: → jne    200760

Now there is a single store for the push_back value (as before), and a
single store for the end without a reload (dependency).

For fully local vectors, (i.e., not referenced elsewhere), the capacity
load and store inside the loop could also be removed, but this requires
more substantial refactoring inside vector.

Differential Revision: https://reviews.llvm.org/D80588
2023-10-02 09:12:37 -04:00
Konstantin Varlamov
dd788af74a
Reapply "[libc++][ranges] Add benchmarks for the from_range constructors of vector and deque." (#67753)
This reverts commit 10edd5d943 and guards
against older versions of GCC to work around the problem.
2023-09-29 10:27:20 -04:00
Aaron Ballman
10edd5d943 Revert "[libc++][ranges] Add benchmarks for the from_range constructors of vector and deque."
This reverts commit 390ac82317.

It broke the sphinx publish bots for our documentation:
https://lab.llvm.org/buildbot/#/builders/242/builds/1130

because that machine has GCC 9.4.0 which does not know about C++23
2023-09-25 10:40:51 -04:00
Jake Egan
1abff08e9e
[AIX] cxx_std_23 is currently not a known feature to ibm-clang (#66952)
Temporary workaround for the following CMake error:
```
CMake Error in /llvm/libcxx/benchmarks/CMakeLists.txt:
The compiler feature "cxx_std_23" is not known to CXX compiler
"IBMClang"
```
2023-09-21 10:34:07 -04:00
Louis Dionne
8bbcc6d548
[libc++] Add test coverage for unordered containers comparison (#66692)
This patch is a melting pot of changes picked up from
https://llvm.org/D61878. It adds a few tests checking corner cases of
unordered containers comparison and adds benchmarks for a few
unordered_set operations.
2023-09-21 05:11:49 -04:00
Zijun Zhao
0218ea4aaa [libc++] Implement ranges::ends_with
Reviewed By: #libc, var-const

Differential Revision: https://reviews.llvm.org/D150831
2023-09-18 11:56:10 -07:00
Dmitry Ilvokhin
3df9c2984b [libc++] Fix minor warnings in libcxx benchmarks
This fixes some unused variable and "comparison of integers of different
signs" warnings in the benchmarks.

Differential Revision: https://reviews.llvm.org/D156613
2023-09-13 17:05:56 -04:00
Louis Dionne
d671126ad0 [libc++][NFC] Remove stray #if 1 that was probably a debugging leftover 2023-09-12 17:19:28 -04:00
Sirui Mu
679c0b48d7 [libc++abi] Refactor around __dynamic_cast
This commit contains refactorings around __dynamic_cast without changing
its behavior. Some important changes include:

- Refactor __dynamic_cast into various small helper functions;
- Move dynamic_cast_stress.pass.cpp to libcxx/benchmarks and refactor
  it into a benchmark. The benchmark performance numbers are updated
  as well.

Differential Revision: https://reviews.llvm.org/D138006
2023-09-08 11:47:24 -04:00
varconst
390ac82317 [libc++][ranges] Add benchmarks for the from_range constructors of vector and deque.
Differential Revision: https://reviews.llvm.org/D150747
2023-09-05 15:57:45 -07:00
Nikolas Klauser
68b1035965 [libc++][PSTL] Add a __parallel_sort implementation to libdispatch
Reviewed By: #libc, ldionne

Spies: ldionne, libcxx-commits

Differential Revision: https://reviews.llvm.org/D155136
2023-08-15 12:20:40 -07:00
Edoardo Sanguineti
66bb6b4fda [libc++] Optimize internal function in <system_error>
In the event the internal function __init is called with an empty string the code will take unnecessary extra steps, in addition, the code generated might be overall greater because, to my understanding, when initializing a string with an empty `const char*` "" (like in this case), the compiler might be unable to deduce the string is indeed empty at compile time and more code is generated.

The goal of this patch is to make a new internal function that will accept just an error code skipping the empty string argument. It should skip the unnecessary steps and in the event `if (ec)` is `false`, it will return an empty string using the correct ctor, avoiding any extra code generation issues.

After the conversation about this patch matured in the libcxx channel on the LLVM Discord server, the patch was analyzed quickly with "Compiler Explorer" and other tools and it was discovered that it does indeed reduce the amount of code generated when using the latest stable clang version (16) which in turn produces faster code.

This patch targets LLVM 18 as it will break the ABI by addressing https://github.com/llvm/llvm-project/issues/63985

Benchmark tests run on other machines as well show in the best case, that the new version without the extra string as an argument performs 10 times faster.
On the buildkite CI run it shows the new code takes less CPU time as well.
In conclusion, the new code should also just appear cleaner because there are fewer checks to do when there is no message.

Reviewed By: #libc, philnik

Spies: emaste, nemanjai, philnik, libcxx-commits

Differential Revision: https://reviews.llvm.org/D155820
2023-08-11 13:08:45 -07:00
Nikolas Klauser
8670b53e11 [libc++] Optimize ranges::find for vector<bool>
Benchmark results:
```
----------------------------------------------------------------
Benchmark                                    old             new
----------------------------------------------------------------
bm_vector_bool_ranges_find/1             5.64 ns         6.08 ns
bm_vector_bool_ranges_find/2             16.5 ns         6.03 ns
bm_vector_bool_ranges_find/3             20.3 ns         6.07 ns
bm_vector_bool_ranges_find/4             22.2 ns         6.08 ns
bm_vector_bool_ranges_find/5             23.5 ns         6.05 ns
bm_vector_bool_ranges_find/6             24.4 ns         6.10 ns
bm_vector_bool_ranges_find/7             26.7 ns         6.10 ns
bm_vector_bool_ranges_find/8             25.0 ns         6.08 ns
bm_vector_bool_ranges_find/16            27.9 ns         6.07 ns
bm_vector_bool_ranges_find/64            44.5 ns         5.35 ns
bm_vector_bool_ranges_find/512            243 ns         25.7 ns
bm_vector_bool_ranges_find/4096          1858 ns         35.6 ns
bm_vector_bool_ranges_find/32768        15461 ns         93.5 ns
bm_vector_bool_ranges_find/262144      126462 ns          571 ns
bm_vector_bool_ranges_find/1048576     497736 ns         2272 ns
```

Reviewed By: #libc, Mordante

Spies: var-const, Mordante, libcxx-commits

Differential Revision: https://reviews.llvm.org/D156039
2023-08-01 10:28:25 -07:00
Dmitry Ilvokhin
78addb2c32 Use hash value checks optimizations consistently
There are couple of optimizations of `__hash_table::find` which are applicable
to other places like `__hash_table::__node_insert_unique_prepare` and
`__hash_table::__emplace_unique_key_args`.

```
for (__nd = __nd->__next_; __nd != nullptr &&
    (__nd->__hash() == __hash
    // ^^^^^^^^^^^^^^^^^^^^^^
    //         (1)
      || std::__constrain_hash(__nd->__hash(), __bc) == __chash);
                                               __nd = __nd->__next_)
{
    if ((__nd->__hash() == __hash)
    // ^^^^^^^^^^^^^^^^^^^^^^^^^^
    //           (2)
        && key_eq()(__nd->__upcast()->__value_, __k))
        return iterator(__nd, this);
}
```

(1): We can avoid expensive modulo operations from `std::__constrain_hash` if
hashes matched. This one is from commit 6a411472e3.
(2): We can avoid `key_eq` calls if hashes didn't match. Commit:
318d35a7bc.

Both of them are applicable for insert and emplace methods.

Results of unordered_set_operations benchmark:

```
Comparing /tmp/main to /tmp/hashtable-hash-value-optimization
Benchmark                                                                 Time             CPU      Time Old      Time New       CPU Old       CPU New
------------------------------------------------------------------------------------------------------------------------------------------------------
BM_Hash/uint32_random_std_hash/1024                                    -0.0127         -0.0127          1511          1492          1508          1489
BM_Hash/uint32_random_custom_hash/1024                                 +0.0012         +0.0013          1370          1371          1367          1369
BM_Hash/uint32_top_std_hash/1024                                       -0.0027         -0.0028          1502          1497          1498          1494
BM_Hash/uint32_top_custom_hash/1024                                    +0.0033         +0.0032          1368          1373          1365          1370
BM_InsertValue/unordered_set_uint32/1024                               +0.0267         +0.0266         36421         37392         36350         37318
BM_InsertValue/unordered_set_uint32_sorted/1024                        +0.0230         +0.0229         28247         28897         28193         28837
BM_InsertValue/unordered_set_top_bits_uint32/1024                      +0.0492         +0.0491         31012         32539         30952         32472
BM_InsertValueRehash/unordered_set_top_bits_uint32/1024                +0.0523         +0.0520         62905         66197         62780         66043
BM_InsertValue/unordered_set_string/1024                               -0.0252         -0.0253        300762        293189        299805        292221
BM_InsertValueRehash/unordered_set_string/1024                         -0.0932         -0.0920        332924        301882        331276        300810
BM_InsertValue/unordered_set_prefixed_string/1024                      -0.0578         -0.0577        314239        296072        313222        295137
BM_InsertValueRehash/unordered_set_prefixed_string/1024                -0.0986         -0.0985        336091        302950        334982        301995
BM_Find/unordered_set_random_uint64/1024                               -0.1416         -0.1417         16075         13798         16041         13769
BM_FindRehash/unordered_set_random_uint64/1024                         -0.0105         -0.0105          5900          5838          5889          5827
BM_Find/unordered_set_sorted_uint64/1024                               +0.0014         +0.0014          2813          2817          2807          2811
BM_FindRehash/unordered_set_sorted_uint64/1024                         -0.0247         -0.0249          5863          5718          5851          5706
BM_Find/unordered_set_sorted_uint128/1024                              +0.0113         +0.0112         15570         15746         15539         15713
BM_FindRehash/unordered_set_sorted_uint128/1024                        +0.0438         +0.0441          6917          7220          6902          7206
BM_Find/unordered_set_sorted_uint32/1024                               -0.0020         -0.0020          3098          3091          3092          3085
BM_FindRehash/unordered_set_sorted_uint32/1024                         +0.0570         +0.0569          5377          5684          5368          5673
BM_Find/unordered_set_sorted_large_uint64/1024                         +0.0081         +0.0081          3594          3623          3587          3616
BM_FindRehash/unordered_set_sorted_large_uint64/1024                   -0.0542         -0.0540          6154          5820          6140          5808
BM_Find/unordered_set_top_bits_uint64/1024                             -0.0061         -0.0061         10440         10377         10417         10353
BM_FindRehash/unordered_set_top_bits_uint64/1024                       +0.0131         +0.0128          5852          5928          5840          5914
BM_Find/unordered_set_string/1024                                      -0.0352         -0.0349        189037        182384        188389        181809
BM_FindRehash/unordered_set_string/1024                                -0.0309         -0.0311        180718        175142        180141        174532
BM_Find/unordered_set_prefixed_string/1024                             -0.0559         -0.0557        190853        180177        190251        179659
BM_FindRehash/unordered_set_prefixed_string/1024                       -0.0563         -0.0561        182396        172136        181797        171602
BM_Rehash/unordered_set_string_arg/1024                                -0.0244         -0.0241         27052         26393         26989         26339
BM_Rehash/unordered_set_int_arg/1024                                   -0.0410         -0.0410         19582         18779         19539         18738
BM_InsertDuplicate/unordered_set_int/1024                              +0.0023         +0.0025         12168         12196         12142         12173
BM_InsertDuplicate/unordered_set_string/1024                           -0.0505         -0.0504        189238        179683        188648        179133
BM_InsertDuplicate/unordered_set_prefixed_string/1024                  -0.0989         -0.0987        198893        179222        198263        178702
BM_EmplaceDuplicate/unordered_set_int/1024                             -0.0175         -0.0173         12674         12452         12646         12427
BM_EmplaceDuplicate/unordered_set_string/1024                          -0.0559         -0.0557        190104        179481        189492        178934
BM_EmplaceDuplicate/unordered_set_prefixed_string/1024                 -0.1111         -0.1110        201233        178870        200608        178341
BM_InsertDuplicate/unordered_set_int_insert_arg/1024                   -0.0747         -0.0745         12993         12022         12964         11997
BM_InsertDuplicate/unordered_set_string_insert_arg/1024                -0.0584         -0.0583        191489        180311        190864        179731
BM_EmplaceDuplicate/unordered_set_int_insert_arg/1024                  -0.0807         -0.0804         35946         33047         35866         32982
BM_EmplaceDuplicate/unordered_set_string_arg/1024                      -0.0312         -0.0310        321623        311601        320559        310637
OVERALL_GEOMEAN                                                        -0.0276         -0.0275             0             0             0             0
```

Time differences looks more like noise to me. But if we want to have this
optimizations in `find`, we probably want them in `insert` and `emplace` as
well.

Reviewed By: #libc, Mordante

Differential Revision: https://reviews.llvm.org/D140779
2023-07-04 21:01:08 +02:00
Louis Dionne
5aa03b648b [libc++][NFC] Apply clang-format on large parts of the code base
This commit does a pass of clang-format over files in libc++ that
don't require major changes to conform to our style guide, or for
which we're not overly concerned about conflicting with in-flight
patches or hindering the git blame.

This roughly covers:
- benchmarks
- range algorithms
- concepts
- type traits

I did a manual verification of all the changes, and in particular I
applied clang-format on/off annotations in a few places where the
result was less readable after than before. This was not necessary
in a lot of places, however I did find that clang-format had pretty
bad taste when it comes to formatting concepts.

Differential Revision: https://reviews.llvm.org/D153140
2023-06-19 11:19:51 -04:00
Mark de Wever
f26b43fa5c [libc++] Removes CMake work-arounds.
CMake older than 3.20.0 is no longer supported.
This removes work-arounds for no longer supported versions.

Reviewed By: #libc, jloser, philnik

Differential Revision: https://reviews.llvm.org/D152099
2023-06-06 08:10:31 +02:00
Nikolas Klauser
d965960fcf Revert "[libc++] Optimize for_each for segmented iterators"
This reverts commit b1dc43aa3a.
2023-06-05 10:00:02 -07:00
Nikolas Klauser
b1dc43aa3a [libc++] Optimize for_each for segmented iterators
```
---------------------------------------------------
Benchmark                       old             new
---------------------------------------------------
bm_for_each/1               3.00 ns         2.98 ns
bm_for_each/2               4.53 ns         4.57 ns
bm_for_each/3               5.82 ns         5.82 ns
bm_for_each/4               6.94 ns         6.91 ns
bm_for_each/5               7.55 ns         7.75 ns
bm_for_each/6               7.06 ns         7.45 ns
bm_for_each/7               6.69 ns         7.14 ns
bm_for_each/8               6.86 ns         4.06 ns
bm_for_each/16              11.5 ns         5.73 ns
bm_for_each/64              43.7 ns         4.06 ns
bm_for_each/512              356 ns         7.98 ns
bm_for_each/4096            2787 ns         53.6 ns
bm_for_each/32768          20836 ns          438 ns
bm_for_each/262144        195362 ns         4945 ns
bm_for_each/1048576       685482 ns        19822 ns
```

Reviewed By: ldionne, Mordante, #libc

Spies: arichardson, libcxx-commits

Differential Revision: https://reviews.llvm.org/D151274
2023-05-31 18:15:25 -07:00
Nikolas Klauser
1fd08edd58 [libc++] Forward to std::{,w}memchr in std::find
Reviewed By: #libc, ldionne

Spies: Mordante, libcxx-commits, ldionne, mikhail.ramalho

Differential Revision: https://reviews.llvm.org/D144394
2023-05-25 07:59:50 -07:00
Tobias Hieta
7bfaa0f09d
[NFC][Py Reformat] Reformat python files in libcxx/libcxxabi
This is an ongoing series of commits that are reformatting our
Python code.

Reformatting is done with `black`.

If you end up having problems merging this commit because you
have made changes to a python file, the best way to handle that
is to run git checkout --ours <yourfile> and then reformat it
with black.

If you run into any problems, post to discourse about it and
we will try to help.

RFC Thread below:

https://discourse.llvm.org/t/rfc-document-and-standardize-python-code-style

Reviewed By: #libc, kwk, Mordante

Differential Revision: https://reviews.llvm.org/D150763
2023-05-25 11:15:34 +02:00
AdityaK
63a2b206fa [libc++, std::vector] call the optimized version of __uninitialized_allocator_copy for trivial types
See: https://github.com/llvm/llvm-project/issues/61987

Fix suggested by: @philnik and @var-const

Reviewers: philnik, ldionne, EricWF, var-const

Differential Revision: https://reviews.llvm.org/D147741

Testing:
ninja check-cxx check-clang check-llvm

Benchmark Testcases (BM_CopyConstruct, and BM_Assignment) added.

performance improvement:

Run on (8 X 4800 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 1280 KiB (x4)
  L3 Unified 12288 KiB (x1)
Load Average: 1.66, 3.02, 2.43

Comparing build-runtimes-base/libcxx/benchmarks/vector_operations.libcxx.out to build-runtimes/libcxx/benchmarks/vector_operations.libcxx.out
Benchmark                                                   Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------
BM_ConstructSize/vector_byte/5140480                     +0.0362         +0.0362        116906        121132        116902        121131
BM_CopyConstruct/vector_int/5140480                      -0.4563         -0.4577       1755224        954241       1755330        951987
BM_Assignment/vector_int/5140480                         -0.0222         -0.0220        990045        968095        989917        968125
BM_ConstructSizeValue/vector_byte/5140480                +0.0308         +0.0307        116970        120567        116977        120573
BM_ConstructIterIter/vector_char/1024                    -0.0831         -0.0831            19            17            19            17
BM_ConstructIterIter/vector_size_t/1024                  +0.0129         +0.0131            88            89            88            89
BM_ConstructIterIter/vector_string/1024                  -0.0064         -0.0018         54455         54109         54208         54112
OVERALL_GEOMEAN                                          -0.0845         -0.0842             0             0             0             0

FYI, the perf improvements for BM_CopyConstruct due to this patch is mostly subsumed by the https://reviews.llvm.org/D149826. However this patch still adds value by converting copy to memmove (the second testcase).

Before the patch:

```
define linkonce_odr dso_local void @_ZNSt3__16vectorIiNS_9allocatorIiEEE18__construct_at_endIPiS5_EEvT_T0_m(ptr noundef nonnull align 8 dereferenceable(24) %0, ptr noundef %1, ptr noundef %2, i64 noundef %3) local_unnamed_addr #4 comdat align 2 {
  %5 = getelementptr inbounds %"class.std::__1::vector", ptr %0, i64 0, i32 1
  %6 = load ptr, ptr %5, align 8, !tbaa !12
  %7 = icmp eq ptr %1, %2
  br i1 %7, label %16, label %8

8:                                                ; preds = %4, %8
  %9 = phi ptr [ %13, %8 ], [ %1, %4 ]
  %10 = phi ptr [ %14, %8 ], [ %6, %4 ]
  %11 = icmp ne ptr %10, null
  tail call void @llvm.assume(i1 %11)
  %12 = load i32, ptr %9, align 4, !tbaa !14
  store i32 %12, ptr %10, align 4, !tbaa !14
  %13 = getelementptr inbounds i32, ptr %9, i64 1
  %14 = getelementptr inbounds i32, ptr %10, i64 1
  %15 = icmp eq ptr %13, %2
  br i1 %15, label %16, label %8, !llvm.loop !16

16:                                               ; preds = %8, %4
  %17 = phi ptr [ %6, %4 ], [ %14, %8 ]
  store ptr %17, ptr %5, align 8, !tbaa !12
  ret void
}
```

After the patch:
```
define linkonce_odr dso_local void @_ZNSt3__16vectorIiNS_9allocatorIiEEE18__construct_at_endIPiS5_EEvT_T0_m(ptr noundef nonnull align 8 dereferenceable(24) %0, ptr noundef %1, ptr noundef %2, i64 noundef %3) local_unnamed_addr #4 comdat align 2 {
  %5 = getelementptr inbounds %"class.std::__1::vector", ptr %0, i64 0, i32 1
  %6 = load ptr, ptr %5, align 8, !tbaa !12
  %7 = ptrtoint ptr %2 to i64
  %8 = ptrtoint ptr %1 to i64
  %9 = sub i64 %7, %8
  %10 = ashr exact i64 %9, 2
  tail call void @llvm.memmove.p0.p0.i64(ptr align 4 %6, ptr align 4 %1, i64 %9, i1 false)
  %11 = getelementptr inbounds i32, ptr %6, i64 %10
  store ptr %11, ptr %5, align 8, !tbaa !12
  ret void
}
```

This is due to the optimized version of uninitialized_allocator_copy function.
2023-05-24 13:40:53 -07:00
Sirui Mu
c9d475c937 [libc++abi] Improve performance of __dynamic_cast
The original `__dynamic_cast` implementation does not use the ABI-provided `src2dst_offset` parameter which helps improve performance on the hot paths. This patch improves the performance on the hot paths in `__dynamic_cast` by leveraging hints provided by the `src2dst_offset` parameter. This patch also includes a performance benchmark suite for the `__dynamic_cast` implementation.

Reviewed By: philnik, ldionne, #libc, #libc_abi, avogelsgesang

Spies: mikhail.ramalho, avogelsgesang, xingxue, libcxx-commits

Differential Revision: https://reviews.llvm.org/D138005
2023-03-19 10:04:34 +01:00
Nikolas Klauser
aff3cdc604 [libc++] Optimize std::ranges::{min, max} for types that are cheap to copy
Don't forward to `min_element` for small types that are trivially copyable, and instead use a naive loop that keeps track of the smallest element (as opposed to an iterator to the smallest element). This allows the compiler to vectorize the loop in some cases.

Reviewed By: #libc, ldionne

Spies: ldionne, libcxx-commits

Differential Revision: https://reviews.llvm.org/D143596
2023-03-11 16:28:24 +01:00
Nikolas Klauser
b4ecfd3c46 [libc++] Forward to std::memcmp for trivially comparable types in equal
Reviewed By: #libc, ldionne

Spies: ldionne, Mordante, libcxx-commits

Differential Revision: https://reviews.llvm.org/D139554
2023-02-21 17:11:21 +01:00
Adrian Vogelsgesang
2a06757a20 [libc++][spaceship] Implement lexicographical_compare_three_way
The implementation makes use of the freedom added by LWG 3410. We have
two variants of this algorithm:
* a fast path for random access iterators: This fast path computes the
  maximum number of loop iterations up-front and does not compare the
  iterators against their limits on every loop iteration.
* A basic implementation for all other iterators: This implementation
  compares the iterators against their limits in every loop iteration.
  However, it still takes advantage of the freedom added by LWG 3410 to
  avoid unnecessary additional iterator comparisons, as originally
  specified by P1614R2.

https://godbolt.org/z/7xbMEen5e shows the benefit of the fast path:
The hot loop generated of `lexicographical_compare_three_way3` is
more tight than for `lexicographical_compare_three_way1`. The added
benchmark illustrates how this leads to a 30% - 50% performance
improvement on integer vectors.

Implements part of P1614R2 "The Mothership has Landed"

Fixes LWG 3410 and LWG 3350

Differential Revision: https://reviews.llvm.org/D131395
2023-02-12 14:51:08 -08:00
Nikolas Klauser
21f4232dd9 [libc++] Enable segmented iterator optimizations for join_view::iterator
Reviewed By: ldionne, #libc

Spies: libcxx-commits

Differential Revision: https://reviews.llvm.org/D138413
2023-01-20 07:55:58 +01:00
Nikolas Klauser
c90801457f [libc++] Refactor deque::iterator algorithm optimizations
This has multiple benefits:
- The optimizations are also performed for the `ranges::` versions of the algorithms
- Code duplication is reduced
- it is simpler to add this optimization for other segmented iterators,
  like `ranges::join_view::iterator`
- Algorithm code is removed from `<deque>`

Reviewed By: ldionne, huixie90, #libc

Spies: mstorsjo, sstefan1, EricWF, libcxx-commits, mgorny

Differential Revision: https://reviews.llvm.org/D132505
2023-01-19 20:11:43 +01:00
Nikolas Klauser
549a5fd0b7 [libc++] Make pmr::monotonic_buffer_resource bump down
Bumping down is significantly faster than bumping up. This is ABI breaking, but the ABI of `pmr::monotonic_buffer_resource` was only stabilized in this release cycle, so we can still change it.
For a more detailed explanation why bumping down is better, see https://fitzgeraldnick.com/2019/11/01/always-bump-downwards.html.

Reviewed By: ldionne, #libc

Spies: libcxx-commits

Differential Revision: https://reviews.llvm.org/D141435
2023-01-12 18:36:45 +01:00
Louis Dionne
355e0ce3c5 [libc++] Extend check for non-ASCII characters to src/, test/ and benchmarks/
Differential Revision: https://reviews.llvm.org/D132180
2022-08-23 18:36:38 -04:00
Mark de Wever
857a78c04d [libc++] Implements Unicode grapheme clustering
This implements the Grapheme clustering as required by
P1868R2 width: clarifying units of width and precision in std::format

This was omitted in the initial patch, but the paper was marked as completed. This really completes the paper.

Reviewed By: ldionne, #libc

Differential Revision: https://reviews.llvm.org/D126971
2022-07-20 18:38:32 +02:00
Louis Dionne
8711fcae27 [libc++] Treat incomplete features just like other experimental features
In particular remove the ability to expel incomplete features from the
library at configure-time, since this can now be done through the
_LIBCPP_ENABLE_EXPERIMENTAL macro.

Also, never provide symbols related to incomplete features inside the
dylib, instead provide them in c++experimental.a (this changes the
symbols list, but not for any configuration that should have shipped).

Differential Revision: https://reviews.llvm.org/D128928
2022-07-19 10:50:20 -04:00
Mark de Wever
f338f416ba [libc++][format] Adds integral formatter benchmarks.
This is a preparation to look at possible performance improvements.

Reviewed By: #libc, ldionne

Differential Revision: https://reviews.llvm.org/D129421
2022-07-12 19:17:57 +02:00
Ivan Trofimov
3085e42f80 [libc++] Don't call key_eq in unordered_map/set rehashing routine
As of now containers key_eq might get called when rehashing happens, which is redundant for unique keys containers.

Reviewed By: #libc, philnik, Mordante

Differential Revision: https://reviews.llvm.org/D128021
2022-07-10 11:44:12 +02:00
Konstantin Varlamov
c945bd0da6 [libc++][ranges] Implement modifying heap algorithms:
- `ranges::make_heap`;
- `ranges::push_heap`;
- `ranges::pop_heap`;
- `ranges::sort_heap`.

Differential Revision: https://reviews.llvm.org/D128115
2022-07-08 13:48:41 -07:00
Konstantin Varlamov
94c7b89fe5 [libc++][ranges] Implement ranges::stable_sort.
Differential Revision: https://reviews.llvm.org/D127834
2022-07-01 16:34:26 -07:00
Louis Dionne
303c4c37ea [libc++] Don't force -O2 when building the benchmarks
The optimization level used when building the benchmarks should
match the optimization level of the current build. Otherwise, we
can end up mixing an -O3 or -O0 optimized dylib with benchmarks
built with -O2, which is really misleading.

Differential Revision: https://reviews.llvm.org/D127987
2022-06-17 17:34:02 -04:00
Konstantin Varlamov
ff3989e6ae [libc++][ranges] Implement ranges::sort.
Differential Revision: https://reviews.llvm.org/D127557
2022-06-16 15:21:06 -07:00
Mark de Wever
aed5ddf8d0 [libc++][format] Implement format-string.
Implements the compile-time checking of the formatting arguments.

Completes:
- P2216 std::format improvements

Reviewed By: #libc, ldionne

Differential Revision: https://reviews.llvm.org/D121530
2022-06-11 15:25:56 +02:00