path: root/lib
Commit message | Author | Age | Files | Lines
* Minor cleanups and optimizations: | bde | 2005-11-24 | 1 | -11/+5
  - Remove dead code that I forgot to remove in the previous commit.
  - Calculate the sum of the lower terms of the polynomial (divided by x**5) in a single expression (sum of odd terms) + (sum of even terms) with parentheses to control grouping. This is clearer and happens to give better instruction scheduling for a tiny optimization (an average of about 0.5 cycles/call on Athlons).
  - Calculate the final sum in a single expression with parentheses to control grouping too. Change the grouping from first_term + (second_term + sum_of_lower_terms) to (first_term + second_term) + sum_of_lower_terms. Normally the first grouping must be used for accuracy, but extra precision makes any grouping give a correct result so we can group for efficiency. This is a larger optimization (average 3-4 cycles/call or 5%).
  - Use parentheses to indicate that the C order of left to right evaluation is what is wanted (for efficiency) in a multiplication too.

  The old fdlibm code has several optimizations related to these. Two involve doing an extra operation that can be done almost in parallel on some superscalar machines but are pessimizations on sequential machines. Others involve statement ordering or expression grouping. All of these except the ordering for combining the sums of the odd and even terms seem to be ideal for Athlons, but parallelism is still limited so all of these optimizations combined together with the ones in this commit save only ~6-8 cycles (~10%).

  On an AXP, tanf() on uniformly distributed args in [-2pi, 2pi] now takes 39-59 cycles. I don't know of any more optimizations for tanf() short of writing it all in asm with very MD instruction scheduling. Hardware fsin takes 122-138 cycles. Most of the optimizations for tanf() don't work very well for tan[l](). fdlibm tan() now takes 145-365 cycles.
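
The grouping described in this commit can be sketched in C roughly as follows. This is a minimal illustration, not the committed code: the function name `tan_poly_sketch` and the table `T` are hypothetical, and the coefficients are plain Taylor coefficients of tan(x) rather than the minimax coefficients the real kernel uses. The lower terms are summed as two interleaved partial sums, and the final sum is grouped as (first_term + second_term) + sum_of_lower_terms, which relies on the calculation being done in double precision.

```c
/* Taylor coefficients of tan(x) for x**3, x**5, ..., x**13 (illustrative
 * only; the real kernel uses minimax coefficients). */
static const double T[] = {
    1.0 / 3.0,
    2.0 / 15.0,
    17.0 / 315.0,
    62.0 / 2835.0,
    1382.0 / 155925.0,
    21844.0 / 6081075.0
};

/* Approximates tan(x) for small |x|, returning float from double math. */
float
tan_poly_sketch(double x)
{
    double z = x * x;   /* x**2 */
    double w = z * z;   /* x**4 */
    double s = z * x;   /* x**3 */

    /* Lower terms divided by x**5: two independent partial sums, so the
     * two chains of multiply-adds can overlap in the pipeline. */
    double r = (T[1] + w * (T[3] + w * T[5])) + z * (T[2] + w * T[4]);

    /* (first_term + second_term) + sum_of_lower_terms, with the
     * multiplication also grouped explicitly left to right. */
    return (float)((x + s * T[0]) + (s * z) * r);
}
```

With float-only arithmetic the grouping x + (s*T[0] + ...) would normally be needed so that the small terms are added together before meeting the large one; in double precision either grouping rounds to the same float in almost all cases, which is what lets the commit choose the faster one.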
* Fix prototype. | ru | 2005-11-24 | 1 | -1/+1
* Fix prototypes. | ru | 2005-11-24 | 1 | -6/+6
* Fix prototypes. | ru | 2005-11-24 | 1 | -6/+6
* Fix prototypes. | ru | 2005-11-24 | 1 | -5/+5
* Fix prototype. | ru | 2005-11-24 | 1 | -1/+1
* Fix prototype. | ru | 2005-11-24 | 1 | -1/+1
* Fix prototypes. | ru | 2005-11-24 | 1 | -8/+8
* Fix prototypes. | ru | 2005-11-24 | 4 | -14/+14
* s/5.5/6.0/ in HISTORY section. | joel | 2005-11-24 | 3 | -3/+3
  Discussed with: ru
* Make SYNOPSIS compile. | ru | 2005-11-24 | 1 | -1/+1
  Attn peter@: this manpage wasn't synced with your code changes.
* Fix prototypes. | ru | 2005-11-24 | 1 | -3/+3
  Attn davidxu@: most likely, the description should also be tweaked after your undocumented changes that changed these prototypes.
* Fix prototypes. | ru | 2005-11-24 | 1 | -2/+2
* Keep up with const poisoning in uuid.h,v 1.3. | ru | 2005-11-24 | 1 | -6/+6
* Fix prototype. | ru | 2005-11-24 | 1 | -2/+2
* Optimized by eliminating the special case for 0.67434 <= |x| < pi/4. | bde | 2005-11-24 | 1 | -16/+7
  A single polynomial approximation for tan(x) works in infinite precision up to |x| < pi/2, but in finite precision, to restrict the accumulated roundoff error to < 1 ulp, |x| must be restricted to less than about sqrt(0.5/((1.5+1.5)/3)) ~= 0.707. We restricted it a bit more to give a safety margin including some slop for optimizations.

  Now that we use double precision for the calculations, the accumulated roundoff error is in double-precision ulps so it can easily be made almost 2**29 times smaller than a single-precision ulp. Near x = pi/4 its maximum is about 0.5+(1.5+1.5)*x**2/3 ~= 1.117 double-precision ulps.

  The minimax polynomial needs to be different to work for the larger interval. I didn't increase its degree; the old degree is just large enough to keep the final error less than 1 ulp and increasing the degree would be a pessimization. The maximum error is now ~0.80 ulps instead of ~0.53 ulps.

  The speedup from this optimization for uniformly distributed args in [-2pi, 2pi] is 28-43% on Athlons, depending on how badly gcc selected and scheduled the instructions in the old version. The old version has some int-to-float conversions that are apparently difficult to schedule well, but gcc-3.3 somehow did everything ~10 cycles or ~10% faster than gcc-3.4, with the difference especially large on AXPs. On A64s, the problem seems to be related to documented penalties for moving single precision data to undead xmm registers. With this version, the speed in cycles is almost independent of the Athlon and gcc version despite the large differences in instruction selection to use the FPU on AXPs and SSE on A64s.
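
The two numbers quoted in this message follow from the same error estimate. Reading the bound off the two expressions in the text (the symbol E is introduced here only for illustration), the accumulated roundoff in ulps is modelled as:

```latex
\begin{align*}
E(x) &\approx 0.5 + \frac{(1.5 + 1.5)\,x^{2}}{3} = 0.5 + x^{2}, \\
E(x) < 1 &\iff |x| < \sqrt{\frac{0.5}{(1.5 + 1.5)/3}} = \sqrt{0.5} \approx 0.707, \\
E(\pi/4) &\approx 0.5 + (\pi/4)^{2} \approx 0.5 + 0.617 = 1.117 .
\end{align*}
```

In single precision the first condition is what forced the special case. In double precision the 1.117 is measured in double-precision ulps; since double carries 53 significand bits against 24 for float, that is roughly 1.117 / 2**29, or about 2e-9 single-precision ulps, so the restriction on |x| is no longer needed.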
* Fix prototype. | ru | 2005-11-23 | 1 | -1/+3
* Fix prototype. | ru | 2005-11-23 | 2 | -2/+2
* Fix prototypes. | ru | 2005-11-23 | 3 | -4/+4
* There's no longer^Wyet <sys/capability.h>. | ru | 2005-11-23 | 1 | -1/+1
* Fix inet6_opt_get_val() prototype. | ru | 2005-11-23 | 1 | -1/+1
* Make SYNOPSIS compile. | ru | 2005-11-23 | 1 | -3/+3
* Make SYNOPSIS compile after imp@'s changes. | ru | 2005-11-23 | 2 | -11/+11
* Make SYNOPSIS compile. | ru | 2005-11-23 | 1 | -1/+1
* Use only double precision for "kernel" tanf (except for returning float). | bde | 2005-11-23 | 3 | -29/+20
  This is a minor interface change. The function is renamed from __kernel_tanf() to __kernel_tandf() so that misuses of it will cause link errors and not crashes.

  This version is a routine translation with no special optimizations for accuracy or efficiency. It gives an unimportant increase in accuracy, from ~0.9 ulps to 0.5285 ulps. Almost all of the error is from the minimax polynomial (~0.03 ulps) and the final rounding step (< 0.5 ulps). It gives strange differences in efficiency in the -5 to +10% range, with -O1 fairly consistently becoming faster and -O2 slower on AXP and A64 with gcc-3.3 and gcc-3.4.
* Add missing includes. | ru | 2005-11-23 | 1 | -1/+3
* Simplified setting up args for __kernel_rem_pio2(). We already have x | bde | 2005-11-23 | 1 | -17/+9
  with a 24-bit fraction, so we don't need a loop to split it into up to 3 terms with 24-bit fractions.
* Quick fix for stack buffer overrun in rev.1.13. Oops. The prec == 1 | bde | 2005-11-23 | 1 | -4/+4
  arg to __kernel_rem_pio2() gives 53-bit (double) precision, not single precision and/or the array dimension like I thought. prec == 2 is used in e_rem_pio2.c for double precision although it is documented to be for 64-bit (extended) precision, and I just reduced it by 1 thinking that this would give the value suitable for 24-bit (float) precision. Reducing it 1 more to the documented value for float precision doesn't actually work (it gives errors of ~0.75 ulps in the reduced arg, but errors of much less than 0.5 ulps are needed; the bug seems to be in kernel_rem_pio2.c). Keep using a value 1 larger than the documented value but supply an array large enough to hold the extra unused result from this.

  The bug can also be fixed quickly by increasing init_jk[0] in k_rem_pio2.c from 2 to 3. This gives behaviour identical to using prec == 1 except it doesn't create the extra result. It isn't clear how the precision bug affects higher precisions. 113-bit (quad) is the largest precision, so there is no way to use a large precision to fix it.
* Tidy up markup and fix two bugs. | ru | 2005-11-21 | 1 | -77/+93
* Mess up the "kernel" float trig function .c files with ifdefs so that | bde | 2005-11-21 | 6 | -0/+25
  they can be #included in other .c files to give inline functions, and use them to inline the functions in most callers (not in e_lgammaf_r.c). __kernel_tanf() is too large and complicated for gcc to inline very well.

  On Athlons, this gives a speed increase under favourable pipeline conditions of about 10% overall (larger for AXP, smaller for A64). E.g., on AXP, sinf() on uniformly distributed args in [-2Pi, 2Pi] now takes 30-56 cycles; it used to take 45-61 cycles; hardware fsin takes 65-129.
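
The ifdef arrangement described here might look roughly like the sketch below. All names (`INLINE_KERNEL_COS_SKETCH`, `k_cos_sketch.c`, `kernel_cos_sketch`, `cos_small_sketch`) are hypothetical, the polynomial is a short Taylor series rather than the real kernel, and the two "files" are shown in one block so that it compiles on its own.

```c
/* A caller that wants the inline version defines this macro before
 * pulling in the kernel source. */
#define INLINE_KERNEL_COS_SKETCH

/* ---- contents of a hypothetical k_cos_sketch.c ---- */
#ifdef INLINE_KERNEL_COS_SKETCH
static inline   /* only when #included by a caller */
#endif
float
kernel_cos_sketch(float x)
{
    float z = x * x;

    /* Taylor terms through x**6; illustrative, not a minimax polynomial. */
    return 1.0f + z * (-0.5f + z * (1.0f / 24 + z * (-1.0f / 720)));
}
/* ---- end of k_cos_sketch.c ---- */

/* The caller, which would normally reach the code above through
 * #include "k_cos_sketch.c" rather than having it pasted in. */
float
cos_small_sketch(float x)
{
    return kernel_cos_sketch(x);
}
```

Compiled on its own without the macro, the kernel file still produces an ordinary external function, so callers that do not inline it (such as the e_lgammaf_r.c case mentioned above) keep working unchanged.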
* Use double precision to simplify and optimize a long division. | bde | 2005-11-21 | 1 | -15/+1
  On Athlons, this gives a speedup of 10-20% for tanf() on uniformly distributed args in [-2Pi, 2Pi]. (It only directly applies for 43% of the args and gives a 16-20% speedup for these (more for AXP than A64) and this gives an overall speedup of 10-12% which is all that it should; however, it gives an overall speedup of 17-20% with gcc-3.3 on AXP-A64 by mysteriously affecting cases where it isn't executed.)

  I originally intended to use double precision for all internals of float trig functions and will probably still do this, but benchmarking showed that converting to double precision and back is a pessimization in cases where a simple float precision calculation works, so it may be optimal to switch precisions only when using extra precision is much simpler.
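
As a rough illustration of why extra precision shortens this kind of code, the sketch below computes -1/(x + r) for a float x and a small float correction r. The function name and signature are hypothetical, and the commit's actual expression is not shown in this log; the point is only that a sum of two floats held in double is typically exact, so a single double division can replace a hand-written multi-step float division.

```c
/* Returns -1/(x + r) rounded to float, doing the sum and the division
 * in double so that no manual splitting into high and low parts is
 * needed to keep the result accurate. */
float
neg_recip_sketch(float x, float r)
{
    return (float)(-1.0 / ((double)x + (double)r));
}
```

The conversions to double and back are the cost mentioned in the second paragraph of the message, which is why the switch pays off only where the float-only alternative is much longer.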
* Restored a cleanup in rev.1.9 that was lost in rev.1.10. | bde | 2005-11-20 | 1 | -2/+2
* Do not explicitly state how many bytes an argument list can be in the | simon | 2005-11-19 | 1 | -1/+0
  description of E2BIG, since it's now larger on some platforms.
  MFC after: 3 days
* o Include <sys/time.h> | marcel | 2005-11-19 | 2 | -26/+28
  o Make this ILP32/LP64 clean: cast pointers to long
  o Code conditional upon DEBUG must also be conditional upon _LIBC_R_
* o Include <string.h> | marcel | 2005-11-19 | 2 | -6/+10
  o Make this ILP32/LP64 clean: cast pointers to long.
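
A minimal sketch of what "cast pointers to long" usually means in debug output is given below. The helper name `format_ptr_sketch` is hypothetical, and `unsigned long` is used here only because the `%lx` conversion expects an unsigned argument. On ILP32 and LP64 platforms a long is as wide as a pointer, so the same format string works on both; on platforms where that does not hold, `uintptr_t` from <stdint.h> would be the safer type.

```c
#include <stdio.h>

/* Formats a pointer as hex without assuming it fits in an int. */
int
format_ptr_sketch(char *buf, size_t len, const void *p)
{
    return snprintf(buf, len, "%lx", (unsigned long)p);
}
```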
* Fix typo: s/_LIBC_R/_LIBC_R_/ | marcel | 2005-11-19 | 2 | -2/+2
* Moved all the optimizations for |x| <= 9pi/2 from | bde | 2005-11-19 | 4 | -67/+105
  __ieee754_rem_pio2f() to its 3 callers and manually inline them.

  On Athlons, with favourable compiler flags and optimizations and favourable pipeline conditions, this gives a speedup of 30-40 cycles for cosf(), sinf() and tanf() on the range pi/4 < |x| <= 9pi/4, so these functions are now significantly faster than the hardware trig functions in many cases. E.g., in a benchmark with uniformly distributed x in [-2pi, 2pi], A64 hardware fcos took 72-129 cycles and cosf() took 37-55 cycles. Out-of-order execution is needed to get both of these times. The optimizations in this commit apparently work more by removing 1 serialization point than by reducing latency.
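
The idea of handling small arguments directly in the caller can be sketched as below for cosine. Everything here is an assumption made for illustration: the names are hypothetical, the kernels are short Taylor polynomials standing in for the real kernel functions, and the range tests are done on the value rather than on the bit pattern. Each branch subtracts a fixed multiple of pi/2 and applies the matching trig identity, so no general reduction call is made for |x| <= 9pi/4.

```c
/* Hypothetical stand-ins for the kernel functions: Taylor polynomials
 * that are adequate for |y| <= pi/4 (the real kernels are minimax). */
static double
k_sin_sketch(double y)
{
    double z = y * y;

    return y * (1.0 + z * (-1.0 / 6 + z * (1.0 / 120 + z * (-1.0 / 5040
        + z * (1.0 / 362880 + z * (-1.0 / 39916800))))));
}

static double
k_cos_sketch(double y)
{
    double z = y * y;

    return 1.0 + z * (-0.5 + z * (1.0 / 24 + z * (-1.0 / 720
        + z * (1.0 / 40320 + z * (-1.0 / 3628800)))));
}

static const double pio2 = 1.57079632679489661923;  /* pi/2 as a double */

/*
 * Stores cos(x) in *out and returns 1 when |x| <= 9*pi/4.  Returns 0
 * when the caller should fall back to a general argument reduction.
 */
int
cosf_small_sketch(float x, float *out)
{
    double ax = x < 0 ? -(double)x : (double)x;   /* cos is even */
    double pio4 = 0.5 * pio2;

    if (ax <= pio4)
        *out = (float)k_cos_sketch(ax);
    else if (ax <= 3 * pio4)
        *out = (float)k_sin_sketch(pio2 - ax);        /* sin(pi/2 - x) */
    else if (ax <= 5 * pio4)
        *out = (float)-k_cos_sketch(ax - 2 * pio2);   /* -cos(x - pi) */
    else if (ax <= 7 * pio4)
        *out = (float)k_sin_sketch(ax - 3 * pio2);    /* sin(x - 3pi/2) */
    else if (ax <= 9 * pio4)
        *out = (float)k_cos_sketch(ax - 4 * pio2);    /* cos(x - 2pi) */
    else
        return 0;
    return 1;
}
```

Because each branch is a short straight-line computation, the compiler and the out-of-order hardware can overlap it with surrounding work, which fits the message's remark that the gain comes mostly from removing a serialization point.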
* Document CLOCK_UPTIME which returns the current uptime in SI seconds. | andre | 2005-11-18 | 1 | -1/+3
  At the moment it is just an alias for CLOCK_MONOTONIC which reports the same number.
  Sponsored by: TCP/IP Optimization Fundraise 2005
* Fix markup, grammar and spelling. | ru | 2005-11-18 | 1 | -32/+44
* Fix up markup. | ru | 2005-11-18 | 1 | -7/+9
* Fix up markup etc. in recently born manpage. | ru | 2005-11-18 | 5 | -152/+316
* Removed an unused declaration which was so old that it wasn't a prototype | bde | 2005-11-18 | 1 | -4/+6
  and thus just broke building at any nonzero WARNS level.

  Fixed nearby style bugs.
* -mdoc sweep. | ru | 2005-11-17 | 31 | -59/+79
* Minor cleanups: | bde | 2005-11-17 | 3 | -24/+21
  s_cosf.c and s_sinf.c: Use a non-bogus magic constant for the threshold of pi/4. It was 2 ulps smaller than pi/4 rounded down, but its value is not critical so it should be the result of natural rounding.

  s_cosf.c and s_tanf.c: Use a literal 0.0 instead of an unnecessary variable initialized to [(float)]0.0. Let the function prototype convert to 0.0F.

  Improved wording in some comments.

  Attempted to improve indentation of comments.
* Rearranged the optimizations for special cases to reduce the average | bde | 2005-11-17 | 1 | -42/+53
  number of branches.

  Use a non-bogus magic constant for the threshold of pi/4. It was 2 ulps smaller than pi/4 rounded down, but its value is not critical so it should be the result of natural rounding. Use "<=" comparisons with rounded-down thresholds for all small multiples of pi/4.

  Cleaned up previous commit:
  - use static const variables instead of expressions for multiples of pi/2 to ensure that they are evaluated at compile time. gcc currently evaluates them at compile time but C99 compilers are not required to do so. We want compile time evaluation for optimization and don't care about side effects.
  - use M_PI_2 instead of a magic constant for pi/2. We need magic constants related to pi/2 elsewhere but not here since we just want pi/2 rounded to double and even prefer it to be rounded in the default rounding mode. We can depend on the compiler being C99ish enough to round M_PI_2 correctly just as much as we depended on it handling hex constants correctly. This also fixes a harmless rounding error in the hex constant.
  - keep using expressions n*<value for pi/2> in the initializers for the static const variables. 2*M_PI_2 and 4*M_PI_2 are obviously rounded in the same way as the corresponding infinite precision expressions for multiples of pi/2, and 3*M_PI_2 happens to be rounded like this, so we don't need magic constants for the multiples.
  - fixed and/or updated some comments.
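
The constants described in the list above could be declared roughly as follows. The variable names and the helper `reduce_small_sketch` are hypothetical, the helper compares values rather than bit patterns, and the fallback definition of `M_PI_2` is included only for compilers whose <math.h> hides it in strict modes.

```c
#include <math.h>

#ifndef M_PI_2
#define M_PI_2 1.57079632679489661923   /* fallback if <math.h> hides it */
#endif

/* Multiples of pi/2 as static const variables, so that the products are
 * evaluated at compile time rather than on each call. */
static const double c1pio2 = 1 * M_PI_2;
static const double c2pio2 = 2 * M_PI_2;
static const double c3pio2 = 3 * M_PI_2;
static const double c4pio2 = 4 * M_PI_2;

/*
 * For 0 <= ax <= 9*pi/4, subtracts the nearest multiple n of pi/2 and
 * stores n; stores -1 and returns ax unchanged for larger arguments.
 */
double
reduce_small_sketch(double ax, int *n)
{
    if (ax <= 0.5 * c1pio2) { *n = 0; return ax; }
    if (ax <= 1.5 * c1pio2) { *n = 1; return ax - c1pio2; }
    if (ax <= 2.5 * c1pio2) { *n = 2; return ax - c2pio2; }
    if (ax <= 3.5 * c1pio2) { *n = 3; return ax - c3pio2; }
    if (ax <= 4.5 * c1pio2) { *n = 4; return ax - c4pio2; }
    *n = -1;
    return ax;
}
```

Multiplying by 2 or 4 only changes the exponent, so those products are exact; the message's observation is that the product with 3 also happens to land on the correctly rounded value, which is what makes separate magic constants unnecessary.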
* KAME's getipnodebyaddr() code also honors the MULTI_PTRS_ARE_ALIASES | ume | 2005-11-15 | 1 | -0/+1
  define, but res_config.h was not included in libc/net/name6.c, so getipnodebyaddr() ignored multiple PTRs.
  PR: kern/88241
  Submitted by: Dan Lukes <dan__at__obluda.cz>
  MFC after: 3 days
* Add symlinks for kvm access methods for memstat(3). | rwatson | 2005-11-13 | 1 | -0/+3
  MFC after: 3 days
* Fixed some magic numbers. | bde | 2005-11-13 | 1 | -6/+6
  The threshold for not being tiny was too small. Use the usual 2**-12 threshold. This change is not just an optimization, since the general code that we fell into has accuracy problems even for tiny x. Avoiding it fixes 2*1366 args with errors of more than 1 ulp, with a maximum error of 1.167 ulps.

  The magic number 22 is log(DBL_EPSILON)/2 plus slop. This is bogus for float precision. Use 9 (~log(FLT_EPSILON)/2 plus less slop than for double precision). The code for handling the interval [2**-28, 9_was_22] has accuracy problems even for [9, 22], so this change happens to fix errors of more than 1 ulp in about 2*17000 cases. It leaves such errors in about 2*1074000 cases, with a max error of 1.242 ulps.

  The threshold for switching from returning exp(x)/2 to returning exp(x/2)^2/2 was a little smaller than necessary. As for coshf(), this was not quite harmless since the exp(x/2)^2/2 case is inaccurate, and fixing it avoids accuracy problems in 2*6 cases, leaving problems in 2*19997 cases.

  Fixed naming errors in pseudo-code in comments.
* Fixed some magic numbers. | bde | 2005-11-13 | 1 | -6/+6
  The threshold for not being tiny was confusing and too small. Use the usual 2**-12 threshold and simplify the algorithm slightly so that this threshold works (now use the threshold for sinhf() instead of one for 1+expm1()). This is just a small optimization.

  The magic number 22 is log(DBL_EPSILON)/2 plus slop. This is bogus for float precision. Use 9 (~log(FLT_EPSILON)/2 plus less slop than for double precision).

  The threshold for switching from returning exp(x)/2 to returning exp(x/2)^2/2 was a little smaller than necessary. This was not quite harmless since the exp(x/2)^2/2 case is inaccurate. Fixing it happens to avoid accuracy problems for 2*6 of the 2*151 args that were handled by the exp(x)/2 case. This leaves accuracy problems for about 2*19997 args near the overflow threshold (~89); the maximum error there is 2.5029 ulps.

  There are also accuracy problems for args in +-[0.5*ln2, 9] -- 2*188885 args with errors of more than 1 ulp, with a maximum error of 1.384 ulps.

  Fixed a syntax error and naming errors in pseudo-code in comments.
* Improved comments for the minimax polynomial. | bde | 2005-11-12 | 1 | -10/+11
  Removed an unused variable.

  Fixed some wrong comments and some nearby misformatting.