author bde <bde@FreeBSD.org> 2005-11-24 13:48:40 +0000
committer bde <bde@FreeBSD.org> 2005-11-24 13:48:40 +0000
commit 441700048348a276281af93e88d0e3cd79077b25
tree 6346f26984626e621544309f9a3d3257a65294fb /lib/msun
parent f6e0fe26535937569b564a74c6e113aa03a5a502
Minor cleanups and optimizations:
- Remove dead code that I forgot to remove in the previous commit.
- Calculate the sum of the lower terms of the polynomial (divided by x**5) in a single expression (sum of odd terms) + (sum of even terms) with parentheses to control grouping. This is clearer and happens to give better instruction scheduling for a tiny optimization (an average of about 0.5 cycles/call on Athlons).
- Calculate the final sum in a single expression with parentheses to control grouping too. Change the grouping from first_term + (second_term + sum_of_lower_terms) to (first_term + second_term) + sum_of_lower_terms. Normally the first grouping must be used for accuracy, but extra precision makes any grouping give a correct result so we can group for efficiency. This is a larger optimization (average 3-4 cycles/call or 5%).
- Use parentheses to indicate that the C order of left to right evaluation is what is wanted (for efficiency) in a multiplication too.

The old fdlibm code has several optimizations related to these. Two involve doing an extra operation that can be done almost in parallel on some superscalar machines but are pessimizations on sequential machines. Others involve statement ordering or expression grouping. All of these except the ordering for combining the sums of the odd and even terms seem to be ideal for Athlons, but parallelism is still limited so all of these optimizations combined together with the ones in this commit save only ~6-8 cycles (~10%).

On an AXP, tanf() on uniformly distributed args in [-2pi, 2pi] now takes 39-59 cycles. I don't know of any more optimizations for tanf() short of writing it all in asm with very MD instruction scheduling. Hardware fsin takes 122-138 cycles. Most of the optimizations for tanf() don't work very well for tan[l](). fdlibm tan() now takes 145-365 cycles.
Diffstat (limited to 'lib/msun')
-rw-r--r-- lib/msun/src/k_tanf.c 16
1 files changed, 5 insertions, 11 deletions
diff --git a/lib/msun/src/k_tanf.c b/lib/msun/src/k_tanf.c
index 263ec96..850248e 100644
--- a/lib/msun/src/k_tanf.c
+++ b/lib/msun/src/k_tanf.c
@@ -38,23 +38,17 @@ extern inline
float
__kernel_tandf(double x, int iy)
{
- double z,r,v,w,s;
- int32_t ix,hx;
+ double z,r,w,s;
- GET_FLOAT_WORD(hx,x);
- ix = hx&0x7fffffff;
z = x*x;
w = z*z;
/* Break x^5*(T[1]+x^2*T[2]+...) into
* x^5*(T[1]+x^4*T[3]+x^8*T[5]) +
* x^5*(x^2*(T[2]+x^4*T[4]))
*/
- r = T[1]+w*(T[3]+w*T[5]);
- v = z*(T[2]+w*T[4]);
+ r = T[1]+w*(T[3]+w*T[5]) + z*(T[2]+w*T[4]);
s = z*x;
- r = z*s*(r+v);
- r += T[0]*s;
- w = x+r;
- if(iy==1) return w;
- else return -1.0/w;
+ r = (x+s*T[0])+(s*z)*r;
+ if(iy==1) return r;
+ else return -1.0/r;
}
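
The regrouping described in the commit message can be tried outside libm with a sketch like the one below. It is illustrative only: the names kernel_tan_old() and kernel_tan_new() are hypothetical, and the T[] values are the plain Taylor coefficients of tan(x), assumed here as stand-ins because the tuned table in k_tanf.c is not part of this diff. With those coefficients the result is only trustworthy for small |x|.

```c
#include <math.h>

/*
 * Assumption: Taylor coefficients of tan(x) for x^3 .. x^13, standing in
 * for the tuned T[] table of k_tanf.c, which this diff does not show.
 */
static const double T[] = {
	1.0 / 3.0,
	2.0 / 15.0,
	17.0 / 315.0,
	62.0 / 2835.0,
	1382.0 / 155925.0,
	21844.0 / 6081075.0,
};

/* Old grouping: first_term + (second_term + sum_of_lower_terms). */
static float
kernel_tan_old(double x, int iy)
{
	double z, r, v, w, s;

	z = x * x;
	w = z * z;
	r = T[1] + w * (T[3] + w * T[5]);	/* odd-indexed terms */
	v = z * (T[2] + w * T[4]);		/* even-indexed terms */
	s = z * x;
	r = z * s * (r + v);			/* x^5 * (lower terms) */
	r += T[0] * s;				/* add second term x^3*T[0] */
	w = x + r;				/* add first term last */
	return (iy == 1 ? w : -1.0 / w);
}

/* New grouping: (first_term + second_term) + sum_of_lower_terms. */
static float
kernel_tan_new(double x, int iy)
{
	double z, r, w, s;

	z = x * x;
	w = z * z;
	/* Odd and even sums combined in one parenthesized expression. */
	r = (T[1] + w * (T[3] + w * T[5])) + z * (T[2] + w * T[4]);
	s = z * x;
	/* Double arithmetic carries enough extra precision for a float result. */
	r = (x + s * T[0]) + (s * z) * r;
	return (iy == 1 ? r : -1.0 / r);
}
```

Both functions evaluate in double and round to float only on return, which is the "extra precision" the commit message relies on when it reorders the final sum. Timing differences between the two depend on the compiler and CPU and are not demonstrated by this sketch.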