summaryrefslogtreecommitdiffstats
path: root/lib/msun/src/s_cbrt.c
Commit message (Collapse)AuthorAgeFilesLines
* Implement the long double version for the cube root function, cbrtl.kargl2011-03-121-0/+4
| | | | | | | | The algorithm uses Newton's iterations with a crude estimate of the cube root to converge to a result. Reviewed by: bde Approved by: das
* Fix typos - remove duplicate "the".brucec2011-02-211-1/+1
| | | | | | PR: bin/154928 Submitted by: Eitan Adler <lists at eitanadler.com> MFC after: 3 days
* s/rcsid/__FBSDID/das2008-02-221-3/+2
|
* Extract the high and low words together. With gcc-3.4 on uniformlybde2005-12-201-8/+6
| | | | | | | | | | | | | | | | | | | | distributed non-large args, this saves about 14 of 134 cycles for Athlon64s and about 5 of 199 cycles for AthlonXPs. Moved the check for x == 0 inside the check for subnormals. With gcc-3.4 on uniformly distributed non-large args, this saves another 5 cycles on Athlon64s and loses 1 cycle on AthlonXPs. Use INSERT_WORDS() and not SET_HIGH_WORD() when converting the first approximation from bits to double. With gcc-3.4 on uniformly distributed non-large args, this saves another 4 cycles on both Athlon64s and and AthlonXPs. Accessing doubles as 2 words may be an optimization on old CPUs, but on current CPUs it tends to cause extra operations and pipeline stalls, especially for writes, even when only 1 of the words needs to be accessed. Removed an unused variable.
* Use a minimax polynomial approximation instead of a Pade rationalbde2005-12-191-26/+21
| | | | | | | | | | | | | | | | | function approximation for the second step. The polynomial has degree 2 for cbrtf() and 4 for cbrt(). These degrees are minimal for the final accuracy to be essentially the same as before (slightly smaller). Adjust the rounding between steps 2 and 3 to match. Unfortunately, for cbrt(), this breaks the claimed accuracy slightly although incorrect rounding doesn't. Claim less accuracy since its not worth pessimizing the polynomial or relying on exhaustive testing to get insignificantly more accuracy. This saves about 30 cycles on Athlons (mainly by avoiding 2 divisions) so it gives an overall optimization in the 10-25% range (a larger percentage for float precision, especially in 32-bit mode, since other overheads are more dominant for double precision, surprisingly more in 32-bit mode).
* Fixed code to match comments and the algorithm:bde2005-12-181-11/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - in preparing for the third approximation, actually make t larger in magnitude than cbrt(x). After chopping, t must be incremented by 2 ulps to make it larger, not 1 ulp since chopping can reduce it by almost 1 ulp and it might already be up to half a different-sized-ulp smaller than cbrt(x). I have not found any cases where this is essential, but the think-time error bound depends on it. The relative smallness of the different-sized-ulp limited the bug. If there are cases where this is essential, then the final error bound would be 5/6+epsilon instead of of 4/6+epsilon ulps (still < 1). - in preparing for the third approximation, round more carefully (but still sloppily to avoid branches) so that the claimed error bound of 0.667 ulps is satisfied in all cases tested for cbrt() and remains satisfied in all cases for cbrtf(). There isn't enough spare precision for very sloppy rounding to work: - in cbrt(), even with the inadequate increment, the actual error was 0.6685 in some cases, and correcting the increment increased this a little. The fix uses sloppy rounding to 25 bits instead of very sloppy rounding to 21 bits, and starts using uint64_t instead of 2 words for bit manipulation so that rounding more bits is not much costly. - in cbrtf(), the 0.667 bound was already satisfied even with the inadequate increment, but change the code to almost match cbrt() anyway. There is not enough spare precision in the Newton approximation to double the inadequate increment without exceeding the 0.667 bound, and no spare precision to avoid this problem as in cbrt(). The fix is to round using an increment of 2 smaller-ulps before chopping so that an increment of 1 ulp is enough. In cbrt(), we essentially do the same, but move the chop point so that the increment of 1 is not needed. Fixed comments to match code: - in cbrt(), the second approximation is good to 25 bits, not quite 26 bits. - in cbrt(), don't claim that the second approximation may be implemented in single precision. Single precision cannot handle the full exponent range without minor but pessimal changes to renormalize, and although single precision is enough, 25 bit precision is now claimed and used. Added comments about some of the magic for the error bound 4/6+epsilon. I still don't understand why it is 4/6+ and not 6/6+ ulps. Indent comments at the right of code more consistently.
* Added comments about the apparently-magic rational function used inbde2005-12-151-1/+15
| | | | | | | | | | | | | | | | | | | | the second step of approximating cbrt(x). It turns out to be neither very magic not nor very good. It is just the (2,2) Pade approximation to 1/cbrt(r) at r = 1, arranged in a strange way to use fewer operations at a cost of replacing 4 multiplications by 1 division, which is an especially bad tradeoff on machines where some of the multiplications can be done in parallel. A Remez rational approximation would give at least 2 more bits of accuracy, but the (2,2) Pade approximation already gives 6 more bits than needed. (Changed the comment which essentially says that it gives 3 more bits.) Lower order Pade approximations are not quite accurate enough for double precision but are plenty for float precision. A lower order Remez rational approximation might be enough for double precision too. However, rational approximations inherently require an extra division, and polynomial approximations work well for 1/cbrt(r) at r = 1, so I plan to switch to using the latter. There are some technical complications that tend to cost a division in another way.
* Optimize by not doing excessive conversions for handling the sign bit.bde2005-12-131-7/+5
| | | | | | | | | | | | | | | | | | | This gives an optimization of between 9 and 22% on Athlons (largest for cbrt() on amd64 -- from 205 to 159 cycles). We extracted the sign bit and worked with |x|, and restored the sign bit as the last step. We avoided branches to a fault by using accesses to FP values as bits to clear and restore the sign bit. Avoiding branches is usually good, but the bit access macros are not so good (especially for setting FP values), and here they always caused pipeline stalls on Athlons. Even using branches would be faster except on args that give perfect branch misprediction, since only mispredicted branches cause stalls, but it possible to avoid touching the sign bit in FP values at all (except to preserve it in conversions from bits to FP not related to the sign bit). Do this. The results are identical except in 2 of the 3 unsupported rounding modes, since all the approximations use odd rational functions so they work right on strictly negative values, and the special case of -0 doesn't use an approximation.
* Fixed some especially horrible style bugs (indentation that is neitherbde2005-12-131-6/+7
| | | | KNF nor fdlibmNF combined with multiple statements per line).
* Added comments about the magic behindbde2005-12-111-10/+22
| | | | | | | <cbrt(x) in bits> ~= <x in bits>/3 + BIAS. Keep the large comments only in the double version as usual. Fixed some style bugs (mainly grammar and spelling errors in comments).
* Fix formatting, this is hard to explain, so I'll show one example.alfred2002-05-281-1/+2
| | | | | | | | | | - float ynf(int n, float x) /* wrapper ynf */ +float +ynf(int n, float x) /* wrapper ynf */ This is because the __STDC__ stuff was indented. Reviewed by: md5
* Assume __STDC__, remove non-__STDC__ code.alfred2002-05-281-13/+0
| | | | Reviewed by: md5
* $Id$ -> $FreeBSD$peter1999-08-281-1/+1
|
* Revert $FreeBSD$ to $Id$peter1997-02-221-1/+1
|
* Make the long-awaited change from $Id$ to $FreeBSD$jkh1997-01-141-1/+1
| | | | | | | | This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
* Remove trailing whitespace.rgrimes1995-05-301-7/+7
|
* J.T. Conklin's latest version of the Sun math library.jkh1994-08-191-0/+93
-- Begin comments from J.T. Conklin: The most significant improvement is the addition of "float" versions of the math functions that take float arguments, return floats, and do all operations in floating point. This doesn't help (performance) much on the i386, but they are still nice to have. The float versions were orginally done by Cygnus' Ian Taylor when fdlibm was integrated into the libm we support for embedded systems. I gave Ian a copy of my libm as a starting point since I had already fixed a lot of bugs & problems in Sun's original code. After he was done, I cleaned it up a bit and integrated the changes back into my libm. -- End comments Reviewed by: jkh Submitted by: jtc
OpenPOWER on IntegriCloud