| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
| |
are workarounds for various symptoms of the problem described in clang
bugs 3929, 8100, 8241, 10409, and 12958.
The regression tests did their job: they failed, someone brought it
up on the mailing lists, and then the issue got ignored for 6 months.
Oops. There may still be some regressions for functions we don't have
test coverage for yet.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This has the side effect of confusing gcc-4.2.1's optimizer into more
often doing the right thing. When it does the wrong thing here, it
seems to be mainly making too many copies of x with dependency chains.
This effect is tiny on amd64, but in some cases on i386 it is enormous.
E.g., on i386 (A64) with -O1, the current version of exp2() should
take about 50 cycles, but took 83 cycles before this change and 66
cycles after this change. exp2f() with -O1 only speeded up from 51
to 47 cycles. (exp2f() should take about 40 cycles, on an Athlon in
either i386 or amd64 mode, and now takes 42 on amd64). exp2l() with
-O1 slowed down from 155 cycles to 123 for some args; this is unimportant
since the i386 exp2l() is a fake; the wrong thing for it seems to
involve branch misprediction.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
faster on all machines tested (old Celeron (P2), A64 (amd64 and i386)
and ia64) except on ia64 when compiled with -O1. It takes 2 more
multiplications, so it will be slower on old machines. The speedup
is about 8 cycles = 17% on A64 (amd64 and i386) with best CFLAGS
and some parallelism in the caller.
Move the evaluation of 2**k up a bit so that it doesn't compete too
much with the new polynomial evaluation. Unlike the previous
optimization, this rearrangement cannot change the result, so compilers
and CPU schedulers can do it, but they don't do it quite right yet.
This saves a whole 1 or 2 cycles on A64.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
exp2(i/TBLSIZE) * p(z) instead of only for the final multiplication
and addition. This fixes the code to match the comment that the maximum
error is 0.5010 ulps (except on machines that evaluate float expressions
in extra precision, e.g., i386's, where the evaluation was already
in extra precision).
Fix and expand the comment about use of double precision.
The relative roundoff error from evaluating p(z) in non-extra precision
was about 16 times larger than in exp2() because the interval length
is 16 times smaller. Its maximum was at least P1 * (1.0 ulps) *
max(|z|) ~= log(2) * 1.0 * 1/32 ~= 0.0217 ulps (1.0 ulps from the
addition in (1 + P1*z) with a cancelation error when z ~= -1/32). The
actual final maximum was 0.5313 ulps, of which 0.0303 ulps must have
come from the additional roundoff error in p(z). I can't explain why
the additional roundoff error was almost 3/2 times larger than the rough
estimate.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
exponent bits of the reduced result, construct 2**k (hopefully in
parallel with the construction of the reduced result) and multiply by
it. This tends to be much faster if the construction of 2**k is
actually in parallel, and might be faster even with no parallelism
since adjustment of the exponent requires a read-modify-wrtite at an
unfortunate time for pipelines.
In some cases involving exp2* on amd64 (A64), this change saves about
40 cycles or 30%. I think it is inherently only about 12 cycles faster
in these cases and the rest of the speedup is from partly-accidentally
avoiding compiler pessimizations (the construction of 2**k is now
manually scheduled for good results, and -O2 doesn't always mess this
up). In most cases on amd64 (A64) and i386 (A64) the speedup is about
20 cycles. The worst case that I found is expf on ia64 where this
change is a pessimization of about 10 cycles or 5%. The manual
scheduling for plain exp[f] is harder and not as tuned.
This change ld128/s_exp2l.c has not been tested.
|
|
|
|
|
| |
float precision. This fixes some double rounding problems for
subnormals and simplifies things a bit.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
variable hack for exp2f() only.
The volatile variable had a surprisingly large cost for exp2f() -- 19
cycles or 15% on i386 in the worst case observed. This is only partly
explained by there being several references to the variable, only one
of which benefited from it being volatile. Arches that have working
assignment are likely to benefit even more from not having any volatile
variable.
exp2() now has a chance of working with extra precision on i386.
exp2() has even more references to the variable, so it would have been
pessimized more by simply declaring the variable as volatile. Even
the temporary volatile variable for STRICT_ASSIGN costs 5-10% on i386,
(A64) so I will change STRICT_ASSIGN() to do an ordinary assignment
until i386 defaults to extra precision.
|
|
|
|
|
| |
exception when they're supposed to. Previously, gcc -O2 was optimizing
away the statement that generated it.
|
|
|