llvm-capstone

mirror of https://github.com/capstone-engine/llvm-capstone.git synced 2024-10-07 10:54:01 +00:00

Author	SHA1	Message	Date
lntue	c80d68a676	[libc] Add float.h header. (#78737)	2024-01-19 12:04:34 -05:00
Nishant Mittal	0504e93288	[libc][math] Implement nan(f|l) functions (#76690) Specification: https://en.cppreference.com/w/c/numeric/math/nan	2024-01-05 08:23:23 -05:00
Nishant Mittal	0c49fc4c68	[libc][math] Implement nexttoward functions (#72763) Implements the `nexttoward`, `nexttowardf` and `nexttowardl` functions. Also raises the exceptions required by the standard in the `nextafter` functions. cc: @lntue	2023-11-21 09:02:51 -05:00
lntue	bc7a3bd864	[libc][math] Implement powf function correctly rounded to all rounding modes. (#71188) We compute `pow(x, y)` using the formula ``` pow(x, y) = x^y = 2^(y * log2(x)) ``` We follow similar steps as in `log2f(x)` and `exp2f(x)`, by breaking down into `hi + mid + lo` parts, in which `hi` parts are computed using the exponent field directly, `mid` parts use look-up tables, and `lo` parts are approximated by polynomials. We add fast paths for common use cases: ``` pow(2, y) = exp2(y) pow(10, y) = exp10(y) pow(x, 2) = x * x pow(x, 1/2) = sqrt(x) pow(x, -1/2) = rsqrt(x) - to be added ```	2023-11-06 16:54:25 -05:00
lntue	da28593d71	[libc][math] Implement double precision expm1 function correctly rounded for all rounding modes. (#67048) Implementing expm1 function for double precision based on exp function algorithm: - Reduce x = log(2) * (hi + mid1 + mid2) + lo, where: * hi is an integer * mid1 * 2^6 is an integer * mid2 * 2^12 is an integer * |lo| < 2^-13 + 2^-30 - Then exp(x) - 1 = 2^hi * 2^mid1 * 2^mid2 * exp(lo) - 1 ~ 2^hi * (2^mid1 * 2^mid2 * (1 + lo * P(lo)) - 2^(-hi) ) - In the fast pass, P(lo) is a degree-3 Taylor polynomial of (e^lo - 1) / lo evaluated in double precision - If the Ziv accuracy test fails, we use a degree-6 Taylor polynomial of (e^lo - 1) / lo in double-double precision - If the Ziv accuracy test still fails, we re-evaluate everything in 128-bit precision.	2023-09-28 16:43:15 -04:00
Tue Ly	76bb278ebb	[libc][math] Implement double precision exp10 function correctly rounded for all rounding modes. Implement double precision exp10 function correctly rounded for all rounding modes. Using the same algorithm as double precision exp (https://reviews.llvm.org/D158551) and exp2 (https://reviews.llvm.org/D158812) functions. Reviewed By: zimmermann6 Differential Revision: https://reviews.llvm.org/D159143	2023-08-30 08:43:50 -04:00
Tue Ly	8ca614aa22	[libc][math] Implement double precision exp2 function correctly rounded for all rounding modes. Implement double precision exp2 function correctly rounded for all rounding modes. Using the same algorithm as double precision exp function in https://reviews.llvm.org/D158551. Reviewed By: zimmermann6 Differential Revision: https://reviews.llvm.org/D158812	2023-08-25 10:15:08 -04:00
Tue Ly	434bf16084	[libc][math] Implement double precision exp function correctly rounded for all rounding modes. Implement double precision exp function correctly rounded for all rounding modes. Using 4 stages: - Range reduction: reduce to `exp(x) = 2^hi * 2^mid1 * 2^mid2 * exp(lo)`. - Use 64 + 64 LUT for 2^mid1 and 2^mid2, and use cubic Taylor polynomial to approximate `(exp(lo) - 1) / lo` in double precision. Relative error in this step is bounded by 1.5 * 2^-63. - If the rounding test fails, use degree-6 Taylor polynomial to approximate `exp(lo)` in double-double precision. Relative error in this step is bounded by 2^-99. - If the rounding test still fails, use degree-7 Taylor polynomial to compute `exp(lo)` in ~128-bit precision. Reviewed By: zimmermann6 Differential Revision: https://reviews.llvm.org/D158551	2023-08-24 10:17:17 -04:00
Tue Ly	f320fefc4a	[libc][math] Implement erff function correctly rounded to all rounding modes. Implement correctly rounded `erff` functions. For `x >= 4`, `erff(x) = 1` for `FE_TONEAREST` or `FE_UPWARD`, `0x1.ffffep-1` for `FE_DOWNWARD` or `FE_TOWARDZERO`. For `0 <= x < 4`, we divide into 32 sub-intervals of length `1/8`, and use a degree-15 odd polynomial to approximate `erff(x)` in each sub-interval: ``` erff(x) ~ x * (c0 + c1 * x^2 + c2 * x^4 + ... + c7 * x^14). ``` For `x < 0`, we can use the same formula as above, since the odd part is factored out. Performance tested with `perf.sh` tool from the CORE-MATH project on AMD Ryzen 9 5900X: Reciprocal throughput (clock cycles / op) ``` $ ./perf.sh erff --path2 GNU libc version: 2.35 GNU libc release: stable -- CORE-MATH reciprocal throughput -- with -march=native (with FMA instructions) [####################] 100 % Ntrial = 20 ; Min = 11.790 + 0.182 clc/call; Median-Min = 0.154 clc/call; Max = 12.255 clc/call; -- CORE-MATH reciprocal throughput -- with -march=x86-64-v2 (without FMA instructions) [####################] 100 % Ntrial = 20 ; Min = 14.205 + 0.151 clc/call; Median-Min = 0.159 clc/call; Max = 15.893 clc/call; -- System LIBC reciprocal throughput -- [####################] 100 % Ntrial = 20 ; Min = 45.519 + 0.445 clc/call; Median-Min = 0.552 clc/call; Max = 46.345 clc/call; -- LIBC reciprocal throughput -- with -mavx2 -mfma (with FMA instructions) [####################] 100 % Ntrial = 20 ; Min = 9.595 + 0.214 clc/call; Median-Min = 0.220 clc/call; Max = 9.887 clc/call; -- LIBC reciprocal throughput -- with -msse4.2 (without FMA instructions) [####################] 100 % Ntrial = 20 ; Min = 10.223 + 0.190 clc/call; Median-Min = 0.222 clc/call; Max = 10.474 clc/call; ``` and latency (clock cycles / op): ``` $ ./perf.sh erff --path2 GNU libc version: 2.35 GNU libc release: stable -- CORE-MATH latency -- with -march=native (with FMA instructions) [####################] 100 % Ntrial = 20 ; Min = 38.566 + 0.391 
clc/call; Median-Min = 0.503 clc/call; Max = 39.170 clc/call; -- CORE-MATH latency -- with -march=x86-64-v2 (without FMA instructions) [####################] 100 % Ntrial = 20 ; Min = 43.223 + 0.667 clc/call; Median-Min = 0.680 clc/call; Max = 43.913 clc/call; -- System LIBC latency -- [####################] 100 % Ntrial = 20 ; Min = 111.613 + 1.267 clc/call; Median-Min = 1.696 clc/call; Max = 113.444 clc/call; -- LIBC latency -- with -mavx2 -mfma (with FMA instructions) [####################] 100 % Ntrial = 20 ; Min = 40.138 + 0.410 clc/call; Median-Min = 0.536 clc/call; Max = 40.729 clc/call; -- LIBC latency -- with -msse4.2 (without FMA instructions) [####################] 100 % Ntrial = 20 ; Min = 44.858 + 0.872 clc/call; Median-Min = 0.814 clc/call; Max = 46.019 clc/call; ``` Reviewed By: michaelrj Differential Revision: https://reviews.llvm.org/D153683	2023-06-28 13:58:37 -04:00
Tue Ly	b91e78da37	[libc][math] Implement double precision log1p correctly rounded to all rounding modes. Implement double precision log1p function correctly rounded to all rounding modes. Performance - For `0.5 <= x <= 2`, the fast pass hitting rate is about 99.93%. - Benchmarks with `./perf.sh` tool from the CORE-MATH project, unit is (CPU clocks / call). - Reciprocal throughput from CORE-MATH's perf tool on Ryzen 5900X: ``` $ ./perf.sh log1p GNU libc version: 2.35 GNU libc release: stable -- CORE-MATH reciprocal throughput -- with FMA [####################] 100 % Ntrial = 20 ; Min = 39.792 + 1.011 clc/call; Median-Min = 0.940 clc/call; Max = 41.373 clc/call; -- CORE-MATH reciprocal throughput -- without FMA (-march=x86-64-v2) [####################] 100 % Ntrial = 20 ; Min = 87.285 + 1.135 clc/call; Median-Min = 1.299 clc/call; Max = 89.715 clc/call; -- System LIBC reciprocal throughput -- [####################] 100 % Ntrial = 20 ; Min = 20.666 + 0.123 clc/call; Median-Min = 0.125 clc/call; Max = 20.828 clc/call; -- LIBC reciprocal throughput -- with FMA [####################] 100 % Ntrial = 20 ; Min = 20.928 + 0.771 clc/call; Median-Min = 0.725 clc/call; Max = 22.767 clc/call; -- LIBC reciprocal throughput -- without FMA [####################] 100 % Ntrial = 20 ; Min = 31.461 + 0.528 clc/call; Median-Min = 0.602 clc/call; Max = 36.809 clc/call; ``` - Latency from CORE-MATH's perf tool on Ryzen 5900X: ``` $ ./perf.sh log1p --latency GNU libc version: 2.35 GNU libc release: stable -- CORE-MATH latency -- with FMA [####################] 100 % Ntrial = 20 ; Min = 77.875 + 0.062 clc/call; Median-Min = 0.051 clc/call; Max = 78.003 clc/call; -- CORE-MATH latency -- without FMA (-march=x86-64-v2) [####################] 100 % Ntrial = 20 ; Min = 101.958 + 1.202 clc/call; Median-Min = 1.325 clc/call; Max = 104.452 clc/call; -- System LIBC latency -- [####################] 100 % Ntrial = 20 ; Min = 60.581 + 1.443 clc/call; Median-Min = 1.611 clc/call; Max = 62.285 clc/call; 
-- LIBC latency -- with FMA [####################] 100 % Ntrial = 20 ; Min = 48.817 + 1.108 clc/call; Median-Min = 1.300 clc/call; Max = 50.282 clc/call; -- LIBC latency -- without FMA [####################] 100 % Ntrial = 20 ; Min = 61.121 + 0.599 clc/call; Median-Min = 0.761 clc/call; Max = 62.020 clc/call; ``` - Accurate pass latency: ``` $ ./perf.sh log1p --latency --simple_stat GNU libc version: 2.35 GNU libc release: stable -- CORE-MATH latency -- with FMA 760.444 -- CORE-MATH latency -- without FMA (-march=x86-64-v2) 827.880 -- LIBC latency -- with FMA 711.837 -- LIBC latency -- without FMA 764.317 ``` Reviewed By: zimmermann6 Differential Revision: https://reviews.llvm.org/D151049	2023-05-23 11:04:04 -04:00
Tue Ly	111d274841	[libc][math] Implement double precision log2 function correctly rounded to all rounding modes. Implement double precision log2 function correctly rounded to all rounding modes. See https://reviews.llvm.org/D150014 for a more detail description of the algorithm. Performance - For `0.5 <= x <= 2`, the fast pass hitting rate is about 99.91%. - Reciprocal throughput from CORE-MATH's perf tool on Ryzen 5900X: ``` $ ./perf.sh log2 GNU libc version: 2.35 GNU libc release: stable -- CORE-MATH reciprocal throughput -- with FMA [####################] 100 % Ntrial = 20 ; Min = 15.458 + 0.204 clc/call; Median-Min = 0.224 clc/call; Max = 15.867 clc/call; -- CORE-MATH reciprocal throughput -- without FMA (-march=x86-64-v2) [####################] 100 % Ntrial = 20 ; Min = 23.711 + 0.524 clc/call; Median-Min = 0.443 clc/call; Max = 25.307 clc/call; -- System LIBC reciprocal throughput -- [####################] 100 % Ntrial = 20 ; Min = 14.807 + 0.199 clc/call; Median-Min = 0.211 clc/call; Max = 15.137 clc/call; -- LIBC reciprocal throughput -- with FMA [####################] 100 % Ntrial = 20 ; Min = 17.666 + 0.274 clc/call; Median-Min = 0.298 clc/call; Max = 18.531 clc/call; -- LIBC reciprocal throughput -- without FMA [####################] 100 % Ntrial = 20 ; Min = 26.534 + 0.418 clc/call; Median-Min = 0.462 clc/call; Max = 27.327 clc/call; ``` - Latency from CORE-MATH's perf tool on Ryzen 5900X: ``` $ ./perf.sh log2 --latency GNU libc version: 2.35 GNU libc release: stable -- CORE-MATH latency -- with FMA [####################] 100 % Ntrial = 20 ; Min = 46.048 + 1.643 clc/call; Median-Min = 1.694 clc/call; Max = 48.018 clc/call; -- CORE-MATH latency -- without FMA (-march=x86-64-v2) [####################] 100 % Ntrial = 20 ; Min = 62.333 + 0.138 clc/call; Median-Min = 0.119 clc/call; Max = 62.583 clc/call; -- System LIBC latency -- [####################] 100 % Ntrial = 20 ; Min = 45.206 + 1.503 clc/call; Median-Min = 1.467 clc/call; Max = 47.229 clc/call; -- 
LIBC latency -- with FMA [####################] 100 % Ntrial = 20 ; Min = 43.042 + 0.454 clc/call; Median-Min = 0.484 clc/call; Max = 43.912 clc/call; -- LIBC latency -- without FMA [####################] 100 % Ntrial = 20 ; Min = 57.016 + 1.636 clc/call; Median-Min = 1.655 clc/call; Max = 58.816 clc/call; ``` - Accurate pass latency: ``` $ ./perf.sh log2 --latency --simple_stat GNU libc version: 2.35 GNU libc release: stable -- CORE-MATH latency -- with FMA 177.632 -- CORE-MATH latency -- without FMA (-march=x86-64-v2) 231.332 -- LIBC latency -- with FMA 459.751 -- LIBC latency -- without FMA 463.850 ``` Reviewed By: zimmermann6 Differential Revision: https://reviews.llvm.org/D150374	2023-05-23 10:49:30 -04:00
Tue Ly	a68bbf42fa	[libc][math] Implement double precision log function correctly rounded to all rounding modes. Implement double precision log function correctly rounded to all rounding modes. See https://reviews.llvm.org/D150014 for a more detail description of the algorithm. Performance - For `0.5 <= x <= 2`, the fast pass hitting rate is about 99.93%. - Reciprocal throughput from CORE-MATH's perf tool on Ryzen 5900X: ``` $ ./perf.sh log GNU libc version: 2.35 GNU libc release: stable -- CORE-MATH reciprocal throughput -- with FMA [####################] 100 % Ntrial = 20 ; Min = 17.465 + 0.596 clc/call; Median-Min = 0.602 clc/call; Max = 18.389 clc/call; -- CORE-MATH reciprocal throughput -- without FMA (-march=x86-64-v2) [####################] 100 % Ntrial = 20 ; Min = 54.961 + 2.606 clc/call; Median-Min = 2.180 clc/call; Max = 59.583 clc/call; -- System LIBC reciprocal throughput -- [####################] 100 % Ntrial = 20 ; Min = 12.608 + 0.276 clc/call; Median-Min = 0.359 clc/call; Max = 13.147 clc/call; -- LIBC reciprocal throughput -- with FMA [####################] 100 % Ntrial = 20 ; Min = 20.952 + 0.468 clc/call; Median-Min = 0.602 clc/call; Max = 21.881 clc/call; -- LIBC reciprocal throughput -- without FMA [####################] 100 % Ntrial = 20 ; Min = 18.569 + 0.552 clc/call; Median-Min = 0.601 clc/call; Max = 19.259 clc/call; ``` - Latency from CORE-MATH's perf tool on Ryzen 5900X: ``` $ ./perf.sh log --latency GNU libc version: 2.35 GNU libc release: stable -- CORE-MATH latency -- with FMA [####################] 100 % Ntrial = 20 ; Min = 48.431 + 0.699 clc/call; Median-Min = 0.073 clc/call; Max = 51.269 clc/call; -- CORE-MATH latency -- without FMA (-march=x86-64-v2) [####################] 100 % Ntrial = 20 ; Min = 64.865 + 3.235 clc/call; Median-Min = 3.475 clc/call; Max = 71.788 clc/call; -- System LIBC latency -- [####################] 100 % Ntrial = 20 ; Min = 42.151 + 2.090 clc/call; Median-Min = 2.270 clc/call; Max = 44.773 clc/call; -- LIBC 
latency -- with FMA [####################] 100 % Ntrial = 20 ; Min = 35.266 + 0.479 clc/call; Median-Min = 0.373 clc/call; Max = 36.798 clc/call; -- LIBC latency -- without FMA [####################] 100 % Ntrial = 20 ; Min = 48.518 + 0.484 clc/call; Median-Min = 0.500 clc/call; Max = 49.896 clc/call; ``` - Accurate pass latency: ``` $ ./perf.sh log --latency --simple_stat GNU libc version: 2.35 GNU libc release: stable -- CORE-MATH latency -- with FMA 598.306 -- CORE-MATH latency -- without FMA (-march=x86-64-v2) 632.925 -- LIBC latency -- with FMA 455.632 -- LIBC latency -- without FMA 488.564 ``` Reviewed By: zimmermann6 Differential Revision: https://reviews.llvm.org/D150131	2023-05-23 10:35:15 -04:00
Caslyn Tonelli	718729e997	[libc] Add memmem implementation Introduce the `memmem` libc string function. `memmem_implementation` performs shared logic for `strstr`, `strcasestr`, and `memmem`; essentially reconfiguring what was the `strstr_implementation` to support length parameters. Differential Revision: https://reviews.llvm.org/D147822	2023-04-11 20:49:25 +00:00
Caslyn Tonelli	bc2b161408	[libc] Add strchrnul implementation Introduce strchrnul implementation and unit tests. Submitting on behalf of Caslyn@ Differential Revision: https://reviews.llvm.org/D147346	2023-04-03 11:08:28 -07:00
Siva Chandra Reddy	71825a889a	[libc][NFC] Add string.h header to various platform headers.txt.	2023-03-13 15:34:58 +00:00
Tue Ly	3735d209a2	[libc][Obvious] Add errno entrypoint for macOS ARM64.	2023-03-06 00:06:03 -05:00
Michael Jones	effd56b0a0	[libc] add basic Intel MacOS configuration The config is based on the ARM MacOS config, but with fenv and math functions disabled. This should unblock this bug: https://github.com/llvm/llvm-project/issues/60910 Reviewed By: sivachandra Differential Revision: https://reviews.llvm.org/D145099	2023-03-01 15:33:16 -08:00
Renyi Chen	6cb14adbfa	[libc][math] Implement scalbn, scalbnf, scalbnl. Implement scalbn via `fputil::ldexp` for the `FLT_RADIX==2` case; "unimplemented" otherwise. Reviewed By: lntue, sivachandra Differential Revision: https://reviews.llvm.org/D143116	2023-02-09 05:55:34 +00:00
Tue Ly	9b30f6b6d7	[libc][math] Implement acoshf function correctly rounded to all rounding modes. Implement acoshf function correctly rounded to all rounding modes. Reviewed By: zimmermann6 Differential Revision: https://reviews.llvm.org/D142781	2023-02-01 11:35:15 -05:00
Tue Ly	46b15fd19e	[libc][math] Implement asinhf function correctly rounded for all rounding modes. Implement asinhf function correctly rounded for all rounding modes. Reviewed By: zimmermann6 Differential Revision: https://reviews.llvm.org/D142681	2023-01-27 11:12:27 -05:00
Alex Brachet	741021de32	[libc] Implement strcasestr Differential Revision: https://reviews.llvm.org/D142518	2023-01-25 17:58:13 +00:00
Alex Brachet	e9d571d3b6	[libc] Implement str{,n}casecmp Differential Revision: https://reviews.llvm.org/D141236	2023-01-11 05:38:33 +00:00
Tue Ly	5814b7b279	[libc][math] Implement log10 function correctly rounded for all rounding modes Implement double precision log10 function correctly rounded for all rounding modes. This implementation currently needs FMA instructions for correctness. Use 2 passes: Fast pass: - 1 step range reduction with a lookup table of `2^7 = 128` elements to reduce the ranges to `[-2^-7, 2^-7]`. - Use a degree-7 minimax polynomial generated by Sollya, evaluated using a mix of double-double and double precision. - Apply Ziv's test for accuracy. Accurate pass: - Apply 5 more range reduction steps to reduce the ranges further to `[-2^-27, 2^-27]`. - Use a degree-4 minimax polynomial generated by Sollya, evaluated using 192-bit precision. - By the result of Lefevre (add quote), this is more than enough for correct rounding to all rounding modes. In progress: adding detailed documentation about the algorithm. Depends on: https://reviews.llvm.org/D136799 Reviewed By: zimmermann6 Differential Revision: https://reviews.llvm.org/D139846	2023-01-08 17:41:54 -05:00
Guillaume Chatelet	436c8f4420	[reland][libc] Add bcopy Differential Revision: https://reviews.llvm.org/D138994	2022-12-01 10:07:04 +00:00
Guillaume Chatelet	c5fe7eb216	Revert D138994 "[libc] Add bcopy" Broke build bot This reverts commit `186a15f7a9`.	2022-12-01 09:55:36 +00:00
Guillaume Chatelet	186a15f7a9	[libc] Add bcopy Differential Revision: https://reviews.llvm.org/D138994	2022-12-01 09:52:10 +00:00
Tue Ly	a752460d73	[libc][math] Implement exp10f function correctly rounded to all rounding modes. Implement exp10f function correctly rounded to all rounding modes. Algorithm: perform range reduction to reduce ``` 10^x = 2^(hi + mid) * 10^lo ``` where: ``` hi is an integer, 0 <= mid * 2^5 < 2^5 -log10(2) / 2^6 <= lo <= log10(2) / 2^6 ``` Then `2^mid` is stored in a table of 32 entries and the product `2^hi * 2^mid` is performed by adding `hi` into the exponent field of `2^mid`. `10^lo` is then approximated by a degree-5 minimax polynomial generated by Sollya with: ``` > P = fpminimax((10^x - 1)/x, 4, [|D...|], [-log10(2)/64, log10(2)/64]); ``` Performance benchmark using perf tool from the CORE-MATH project on Ryzen 1700: ``` $ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh exp10f GNU libc version: 2.35 GNU libc release: stable CORE-MATH reciprocal throughput : 10.215 System LIBC reciprocal throughput : 7.944 LIBC reciprocal throughput : 38.538 LIBC reciprocal throughput : 12.175 (with `-msse4.2` flag) LIBC reciprocal throughput : 9.862 (with `-mfma` flag) $ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh exp10f --latency GNU libc version: 2.35 GNU libc release: stable CORE-MATH latency : 40.744 System LIBC latency : 37.546 BEFORE LIBC latency : 48.989 LIBC latency : 44.486 (with `-msse4.2` flag) LIBC latency : 40.221 (with `-mfma` flag) ``` This patch relies on https://reviews.llvm.org/D134002 Reviewed By: orex, zimmermann6 Differential Revision: https://reviews.llvm.org/D134104	2022-09-19 10:01:40 -04:00
Tue Ly	463dcc8749	[libc][math] Implement acosf function correctly rounded for all rounding modes. Implement acosf function correctly rounded for all rounding modes. We perform range reduction as follows: - When `|x| < 2^(-10)`, we use cubic Taylor polynomial: ``` acos(x) = pi/2 - asin(x) ~ pi/2 - x - x^3 / 6. ``` - When `2^(-10) <= |x| <= 0.5`, we use the same approximation that is used for `asinf(x)` when `|x| <= 0.5`: ``` acos(x) = pi/2 - asin(x) ~ pi/2 - x - x^3 * P(x^2). ``` - When `0.5 < x <= 1`, we use the double angle formula: `cos(2y) = 1 - 2 * sin^2 (y)` to reduce to: ``` acos(x) = 2 * asin( sqrt( (1 - x)/2 ) ) ``` - When `-1 <= x < -0.5`, we reduce to the positive case above using the formula: ``` acos(x) = pi - acos(-x) ``` Performance benchmark using perf tool from the CORE-MATH project on Ryzen 1700: ``` $ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh acosf GNU libc version: 2.35 GNU libc release: stable CORE-MATH reciprocal throughput : 28.613 System LIBC reciprocal throughput : 29.204 LIBC reciprocal throughput : 24.271 $ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh asinf --latency GNU libc version: 2.35 GNU libc release: stable CORE-MATH latency : 55.554 System LIBC latency : 76.879 LIBC latency : 62.118 ``` Reviewed By: orex, zimmermann6 Differential Revision: https://reviews.llvm.org/D133550	2022-09-09 09:55:30 -04:00
Tue Ly	e2f065c2a3	[libc][math] Implement asinf function correctly rounded for all rounding modes. Implement asinf function correctly rounded for all rounding modes. For `|x| <= 0.5`, we approximate `asin(x)` by ``` asin(x) = x * P(x^2) ``` where `P(X^2) = Q(X)` is a degree-20 minimax even polynomial approximating `asin(x)/x` on `[0, 0.5]` generated by Sollya with: ``` > Q = fpminimax(asin(x)/x, [|0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20|], [|1, D...|], [0, 0.5]); ``` When `|x| > 0.5`, we perform range reduction as follows: Assume further that `0.5 < x <= 1`, and let: ``` y = asin(x) ``` We will use the double angle formula: ``` cos(2X) = 1 - 2 sin^2(X) ``` and the complement angle identity: ``` x = sin(y) = cos(pi/2 - y) = 1 - 2 sin^2 (pi/4 - y/2) ``` So: ``` sin(pi/4 - y/2) = sqrt( (1 - x)/2 ) ``` And hence: ``` pi/4 - y/2 = asin( sqrt( (1 - x)/2 ) ) ``` Equivalently: ``` asin(x) = y = pi/2 - 2 * asin( sqrt( (1 - x)/2 ) ) ``` Let `u = (1 - x)/2`, then ``` asin(x) = pi/2 - 2 * asin( sqrt(u) ) ``` Moreover, since `0.5 < x <= 1`, ``` 0 <= u < 1/4, and 0 <= sqrt(u) < 0.5. ``` And hence we can reuse the same polynomial approximation of `asin(x)` when `|x| <= 0.5`: ``` asin(x) = pi/2 - 2 * sqrt(u) * P(u). ``` Performance benchmark using `perf` tool from the CORE-MATH project on Ryzen 1700: ``` $ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh asinf CORE-MATH reciprocal throughput : 23.418 System LIBC reciprocal throughput : 27.310 LIBC reciprocal throughput : 22.741 $ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh asinf --latency GNU libc version: 2.35 GNU libc release: stable CORE-MATH latency : 58.884 System LIBC latency : 62.055 LIBC latency : 62.037 ``` Reviewed By: orex, zimmermann6 Differential Revision: https://reviews.llvm.org/D133400	2022-09-07 19:27:47 -04:00
Kirill Okhotnikov	77e1d9beed	[libc][math] Added atanf function. Performance by core-math (core-math/glibc 2.31/current llvm-14): 28.879/20.843/20.15 Differential Revision: https://reviews.llvm.org/D132842	2022-08-30 22:39:54 +02:00
Kirill Okhotnikov	6c1fc7e430	[libc][math] Added atanhf function. Performance by core-math (core-math/glibc 2.31/current llvm-14): 10.845/43.174/13.467 The review is done on top of D132809. Differential Revision: https://reviews.llvm.org/D132811	2022-08-30 22:39:54 +02:00
Tue Ly	82d6e77048	[libc] Implement tanf function correctly rounded for all rounding modes. Implement tanf function correctly rounded for all rounding modes. We use the range reduction that is shared with `sinf`, `cosf`, and `sincosf`: ``` k = round(x * 32/pi) and y = x * (32/pi) - k. ``` Then we use the tangent of sum formula: ``` tan(x) = tan((k + y)* pi/32) = tan((k mod 32) * pi / 32 + y * pi/32) = (tan((k mod 32) * pi/32) + tan(y * pi/32)) / (1 - tan((k mod 32) * pi/32) * tan(y * pi/32)) ``` We need to make a further reduction when `k mod 32 >= 16` due to the pole at `pi/2` of `tan(x)` function: ``` if (k mod 32 >= 16): k = k - 31, y = y - 1.0 ``` And to compute the final result, we store `tan(k * pi/32)` for `k = -15..15` in a table of 32 double values, and evaluate `tan(y * pi/32)` with a degree-11 minimax odd polynomial generated by Sollya with: ``` > P = fpminimax(tan(y * pi/32)/y, [|0, 2, 4, 6, 8, 10|], [|D...|], [0, 1.5]); ``` Performance benchmark using `perf` tool from the CORE-MATH project on Ryzen 1700: ``` $ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh tanf CORE-MATH reciprocal throughput : 18.586 System LIBC reciprocal throughput : 50.068 LIBC reciprocal throughput : 33.823 LIBC reciprocal throughput : 25.161 (with `-msse4.2` flag) LIBC reciprocal throughput : 19.157 (with `-mfma` flag) $ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh tanf --latency GNU libc version: 2.31 GNU libc release: stable CORE-MATH latency : 55.630 System LIBC latency : 106.264 LIBC latency : 96.060 LIBC latency : 90.727 (with `-msse4.2` flag) LIBC latency : 82.361 (with `-mfma` flag) ``` Reviewed By: orex Differential Revision: https://reviews.llvm.org/D131715	2022-08-12 09:21:05 -04:00
Kirill Okhotnikov	5ef987c985	[libc][math] Added tanhf function. Correct rounding function. Performance ~2x faster than glibc analog. Performance (llvm 12 intel): ``` CORE_MATH_PERF_MODE=rdtsc PERF_ARGS='' ./perf.sh tanhf GNU libc version: 2.31 GNU libc release: stable 13.279 37.492 18.145 CORE_MATH_PERF_MODE=rdtsc PERF_ARGS='--latency' ./perf.sh tanhf GNU libc version: 2.31 GNU libc release: stable 40.658 109.582 66.568 ``` Differential Revision: https://reviews.llvm.org/D130780	2022-08-01 22:43:00 +02:00
Kirill Okhotnikov	a7f55f0805	[libc][math] Added sinhf function. Differential Revision: https://reviews.llvm.org/D129278	2022-07-29 17:20:53 +02:00
Kirill Okhotnikov	fcb9d7e2cf	[libc][math] Added coshf function. Differential Revision: https://reviews.llvm.org/D129275	2022-07-29 16:57:28 +02:00
Alex Brachet	c179bcc151	[libc] Add imaxabs Differential Revision: https://reviews.llvm.org/D129517	2022-07-11 21:28:21 +00:00
Kirill Okhotnikov	b8e8012aa2	[libc][math] fmod/fmodf implementation. This is an implementation of the remainder function fmod from the standard libm. I developed the underlying algorithm myself, although it has probably been invented before. Some features of the implementation: 1. The code is written in more-or-less modern C++. 2. One general implementation for both float and double precision numbers. 3. Platform/architecture-dependent and platform-independent code and tests are split. 4. Tests cover 100% of the code for both float and double numbers. Test cases with NaN/Inf etc. are copied from glibc. 5. The new implementation is in general 2-4 times faster for “regular” x,y values. It can be 20 times faster when x/y is huge, but can also be 2 times slower in the double denormalized range (according to the perf tests provided). 6. Two different implementations of the division loop are provided. On some platforms division can be a very time-consuming operation; depending on the platform, it can be 3-10 times slower than multiplication. Performance tests: The test is based on the core-math project (https://gitlab.inria.fr/core-math/core-math). At Tue Ly's suggestion I took the hypot function and used it as a template for fmod, preserving all test cases. `./check.sh <--special|--worst> fmodf` passed. `CORE_MATH_PERF_MODE=rdtsc ./perf.sh fmodf` results are ``` GNU libc version: 2.35 GNU libc release: stable 21.166 <-- FPU 51.031 <-- current glibc 37.659 <-- this fmod version. ```	2022-06-24 23:09:14 +02:00
Alex Brachet	b1183305f8	[libc] Add strlcat Differential Revision: https://reviews.llvm.org/D125978	2022-05-19 21:48:39 +00:00
Alex Brachet	fc2c8b2371	[libc] Add strlcpy Differential Revision: https://reviews.llvm.org/D125806	2022-05-18 17:45:05 +00:00
Tue Ly	0f031daea8	[libc] Initial support for darwin-aarch64. Add initial support for darwin-aarch64 (macOS M1). Some differences compared to linux-aarch64: - `math.h` defines `math_errhandling` via the compiler builtin `__math_errhandling()`, but Apple Clang 13.0.0 on M1 does not support the `__math_errhandling()` builtin as a macro function or a constexpr function. - `math.h` defines `UNDERFLOW` and `OVERFLOW` macros. - Besides the 5 usual floating point exceptions: `FE_INEXACT`, `FE_UNDERFLOW`, `FE_OVERFLOW`, `FE_DIVBYZERO`, and `FE_INVALID`, `fenv.h` also has another floating point exception: `FE_FLUSHTOZERO`. The corresponding trap for `FE_FLUSHTOZERO` in the control register is at a different location compared to the status register. - The `FE_FLUSHTOZERO` exception flag cannot be raised with the default CPU floating point operation mode. Reviewed By: sivachandra Differential Revision: https://reviews.llvm.org/D120914	2022-03-10 09:26:09 -05:00
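
The range reduction described in the `asinf` and `acosf` entries above rests on the identity `asin(x) = pi/2 - 2 * asin(sqrt((1 - x)/2))` for `0.5 < x <= 1`. The following is a minimal Python sketch of that reduction, not the LLVM libc code: the function names are hypothetical, and `math.asin` stands in for the minimax polynomial `t * P(t^2)` on `[0, 0.5]`.

```python
import math


def _asin_small(t: float) -> float:
    """Stand-in for t * P(t^2); only ever called with 0 <= t <= 0.5."""
    assert 0.0 <= t <= 0.5
    return math.asin(t)  # assumption: replaces the Sollya-generated polynomial


def asin_reduced(x: float) -> float:
    """asin(x) using only evaluations on [0, 0.5] (hypothetical helper)."""
    sign = -1.0 if x < 0 else 1.0  # asin is odd, so handle |x| and restore the sign
    ax = abs(x)
    if ax <= 0.5:
        return sign * _asin_small(ax)
    u = (1.0 - ax) / 2.0  # 0 <= u < 1/4, hence 0 <= sqrt(u) < 0.5
    return sign * (math.pi / 2 - 2.0 * _asin_small(math.sqrt(u)))


def acos_reduced(x: float) -> float:
    """acos(x) using the same small-argument asin (hypothetical helper)."""
    if abs(x) <= 0.5:
        return math.pi / 2 - asin_reduced(x)
    if x > 0.5:
        return 2.0 * _asin_small(math.sqrt((1.0 - x) / 2.0))
    return math.pi - acos_reduced(-x)  # -1 <= x < -0.5
```

Inputs outside `[-1, 1]` make `math.sqrt` raise `ValueError`, which is enough for a sketch; a real implementation would return NaN and set the appropriate floating point exception.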

40 Commits
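
The double precision `exp` entry in the log above describes reducing to `exp(x) = 2^hi * 2^mid1 * 2^mid2 * exp(lo)` with two 64-entry tables and a cubic Taylor polynomial for `(exp(lo) - 1) / lo`. The sketch below illustrates only that first stage in Python, under stated assumptions: the names `exp_sketch`, `T1`, and `T2` are hypothetical, the tables are filled with ordinary floating point powers rather than correctly rounded constants, and there is no rounding test or higher-precision fallback.

```python
import math

LN2 = math.log(2.0)

# Assumed table layout: T1[i] = 2^(i/64), T2[j] = 2^(j/4096), 64 entries each.
T1 = [2.0 ** (i / 64) for i in range(64)]
T2 = [2.0 ** (j / 4096) for j in range(64)]


def exp_sketch(x: float) -> float:
    """First-stage exp(x) via hi/mid1/mid2/lo reduction (illustrative only)."""
    # k counts steps of 2^-12 in the base-2 exponent: x ~ k * ln(2) / 2^12.
    k = round(x * 4096 / LN2)
    hi = k >> 12          # integer part (floor, also for negative k)
    i = (k >> 6) & 63     # mid1 = i / 64
    j = k & 63            # mid2 = j / 4096
    # Remainder after removing hi + mid1 + mid2; |lo| is about 2^-13 at most.
    # A real implementation would use extra-precision constants here.
    lo = x - k * (LN2 / 4096)
    # Cubic Taylor polynomial of (e^lo - 1) / lo.
    p = 1.0 + lo * (0.5 + lo * (1.0 / 6.0 + lo * (1.0 / 24.0)))
    # 2^hi is applied by adjusting the exponent, as the commit message describes.
    return math.ldexp(T1[i] * T2[j] * (1.0 + lo * p), hi)
```

Because `lo` is computed in plain double precision, this sketch loses a few bits for large `|x|`; it is meant to show the decomposition, not to reproduce the correctly rounded result.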