author     markm <markm@FreeBSD.org>    1996-10-20 08:49:26 +0000
committer  markm <markm@FreeBSD.org>    1996-10-20 08:49:26 +0000
commit     6ec01646dc55b6fa688ed5906e0d52556d174da2
tree       ecc4c214d76efa8e1b2fb33ac2f0aab18ea03ebf /contrib/libgmp/PROJECTS
Clean import of libgmp 2.0.2, with only the non-x86 bits removed.
BMakefiles and other bits will follow.
Requested by: Andrey Chernov
Made world by: Chuck Robey
Diffstat (limited to 'contrib/libgmp/PROJECTS')
-rw-r--r--  contrib/libgmp/PROJECTS  270
1 file changed, 270 insertions(+), 0 deletions(-)
diff --git a/contrib/libgmp/PROJECTS b/contrib/libgmp/PROJECTS
new file mode 100644
index 0000000..75016bd
--- /dev/null
+++ b/contrib/libgmp/PROJECTS
@@ -0,0 +1,270 @@

IDEAS ABOUT THINGS TO WORK ON

* mpq_cmp: Maybe the most sensible thing to do would be to multiply the,
  say, 4 most significant limbs of each operand and compare them.  If that
  is not sufficient, do the same for 8 limbs, etc.

* Write mpi, the Multiple Precision Interval Arithmetic layer.

* Write `mpX_eval' that takes lambda-like expressions and a list of
  operands.

* As a general rule, recognize special operand values in mpz and mpf, and
  use shortcuts for speed.  Examples: Recognize (small or all) 2^n in
  multiplication and division.  Recognize small bases in mpz_pow_ui.

* Implement lazy allocation?  mpz->d == 0 would mean no allocation has
  been made yet.

* Maybe store one-limb numbers according to Per Bothner's idea:

    struct {
      mp_ptr d;
      union {
        mp_limb val;   /* if (d == NULL). */
        mp_size size;  /* Length of data array, if (d != NULL). */
      } u;
    };

  Problem: We can't normalize to that format unless we free the space
  pointed to by d, and therefore small values will not be stored in a
  canonical way.

* Document the complexity of all functions.

* Add predicate functions mpz_fits_signedlong_p, mpz_fits_unsignedlong_p,
  mpz_fits_signedint_p, etc.

* mpz_floor (mpz, mpq), mpz_trunc (mpz, mpq), mpz_round (mpz, mpq).

* Better random number generators.  There should be fast (like
  mpz_random), very good (mpz_veryrandom), and special purpose (like
  mpz_random2) generators.  Sizes in *bits*, not in limbs.

* It'd be possible to have an interface "s = add(a,b)" with automatic GC.
  If the mpz_xinit routine remembers the address of the variable, we could
  walk-and-mark the list of remembered variables, and free the space
  occupied by the remembered variables that didn't get marked.  Fairly
  standard.

* Improve speed for non-gcc compilers by defining umul_ppmm, udiv_qrnnd,
  etc, to call __umul_ppmm, __udiv_qrnnd.  A typical definition for
  umul_ppmm would be

    #define umul_ppmm(ph, pl, m0, m1) \
      {unsigned long __ph; (pl) = __umul_ppmm (&__ph, (m0), (m1)); (ph) = __ph;}

  In order to maintain just one version of longlong.h (gmp and gcc), this
  has to be done outside of longlong.h.

Bennet Yee at CMU proposes:
* mpz_{put,get}_raw for memory oriented I/O like the other *_raw functions.
* A function mpfatal that is called for exceptions.  Let the user override
  a default definition.

* Make all computational mpz_* functions return a signed int indicating
  whether the result was zero, positive, or negative?

* Implement mpz_cmpabs, mpz_xor, mpz_to_double, mpz_to_si, mpz_lcm,
  mpz_dpb, mpz_ldb, various bit string operations.  Also mpz_@_si for most
  @??

* Add macros for looping efficiently over a number's limbs:

    MPZ_LOOP_OVER_LIMBS_INCREASING(num, limb)
      { user code manipulating limb }
    MPZ_LOOP_OVER_LIMBS_DECREASING(num, limb)
      { user code manipulating limb }

Brian Beuning proposes:
  1. An array of small primes
  3. A function to factor a mpz_t.  [How do we return the factors?  Maybe
     we just return one arbitrary factor?  In the latter case, we have to
     use a data structure that records the state of the factoring
     routine.]
  4. A routine to look for "small" divisors of an mpz_t
  5. A 'multiply mod n' routine based on Montgomery's algorithm (see the
     sketch below).
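As a rough illustration of item 5, here is a one-limb sketch of a
'multiply mod n' routine using Montgomery's method with R = 2^32.  It is
only a sketch under stated assumptions (odd n < 2^31, values kept in
Montgomery form a*R mod n); none of these names are GMP API.

    #include <stdio.h>
    #include <stdint.h>

    /* n odd: compute n^{-1} mod 2^32 by Newton; inv = n is correct to 3
       bits, and each pass doubles the number of correct bits.  */
    static uint32_t
    inv32 (uint32_t n)
    {
      uint32_t inv = n;
      for (int i = 0; i < 4; i++)
        inv *= 2 - n * inv;
      return inv;
    }

    /* Return a*b/R mod n, assuming odd n < 2^31 (so the 64-bit sum below
       cannot overflow) and ninv = -n^{-1} mod 2^32.  The low 32 bits of
       t + m*n are zero by construction, so the shift is exact.  */
    static uint32_t
    mont_mul (uint32_t a, uint32_t b, uint32_t n, uint32_t ninv)
    {
      uint64_t t = (uint64_t) a * b;
      uint32_t m = (uint32_t) t * ninv;
      uint64_t u = (t + (uint64_t) m * n) >> 32;
      return u >= n ? (uint32_t) (u - n) : (uint32_t) u;
    }

    int
    main (void)
    {
      uint32_t n = 1000003, a = 123456, b = 654321;
      uint32_t ninv = -inv32 (n);
      uint32_t am = (uint32_t) (((uint64_t) a << 32) % n);  /* to Montgomery form */
      uint32_t bm = (uint32_t) (((uint64_t) b << 32) % n);
      uint32_t p = mont_mul (mont_mul (am, bm, n, ninv), 1, n, ninv);
      /* Both lines print the same residue a*b mod n.  */
      printf ("%u\n", p);
      printf ("%u\n", (uint32_t) (((uint64_t) a * b) % n));
      return 0;
    }

The point of the method is that the reduction uses only multiplications
and shifts; division by n survives only in the one-time conversion into
and out of Montgomery form.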
Doug Lea proposes:
  1. A way to find out if an integer fits into a signed int, and if so, a
     way to convert it out.
  2. Similarly for double precision float conversion.
  3. A function to convert the ratio of two integers to a double.  This
     can be useful for mixed mode operations with integers, rationals, and
     doubles.

Elliptic curve method description in the chapter `Algorithms in Number
Theory' in the Handbook of Theoretical Computer Science, Elsevier,
Amsterdam, 1990.  Also in Carl Pomerance's lecture notes on Cryptology and
Computational Number Theory, 1990.

* Harald Kirsh suggests:
    mpq_set_str (MP_RAT *r, char *numerator, char *denominator).

* New function: mpq_get_ifstr (int_str, frac_str, base,
  precision_in_some_way, rational_number).  Convert RATIONAL_NUMBER to a
  string in BASE and put the integer part in INT_STR and the fraction part
  in FRAC_STR.  (This function would do a division of the numerator and
  the denominator.)

* Should mpz_powm* handle negative exponents?

* udiv_qrnnd: If the denominator is normalized, the n0 argument has very
  little effect on the quotient.  Maybe we can assume it is 0, and
  compensate at a later stage?

* Better sqrt: First calculate the reciprocal square root, then multiply
  by the operand to get the square root.  The reciprocal square root can
  be obtained through Newton-Raphson without division.  To compute
  sqrt(A), the iteration is

    x[i+1] = x[i] * (3 - A*x[i]^2) / 2,

  and the final result can then be computed without division as

    sqrt(A) = A * x[n].

* Newton-Raphson using multiplication: We get twice as many correct digits
  in each iteration.  So if we square x(k) as part of the iteration, the
  result will have the leading digits in common with the entire result
  from iteration k-1.  A _mpn_mul_lowpart could help us take advantage of
  this.

* Peter Montgomery: If 0 <= a, b < p < 2^31 and I want a modular product
  a*b modulo p and the long long type is unavailable, then I can write

    typedef signed long slong;
    typedef unsigned long ulong;
    slong a, b, p, quot, rem;

    quot = (slong) (0.5 + (double) a * (double) b / (double) p);
    rem = (slong) ((ulong) a * (ulong) b - (ulong) p * (ulong) quot);
    if (rem < 0) { rem += p; quot--; }

* Speed up modular arithmetic, using Montgomery's method or my
  pre-inversion method.  In either case, special arithmetic calls would be
  needed: mpz_mmmul, mpz_mmadd, mpz_mmsub, plus some kind of
  initialization functions.  Better yet: Write a new mpr layer.

* mpz_powm* should not use division to reduce the result in the loop, but
  instead pre-compute the reciprocal of the MOD argument and do
  reduced_val = val - val*reciprocal(MOD)*MOD, or use Montgomery's method.

* mpz_mod_2expplussi -- to reduce a bignum modulo (2**n)+s.

* It would be quite an important feature never to allocate more memory
  than really necessary for a result.  Sometimes we can achieve this
  cheaply, by deferring reallocation until the result size is known.

* New macro in longlong.h: shift_rhl, which extracts a word by shifting
  two words as a unit.  (Supported by i386, i860, HP-PA, POWER, 29k.)
  Useful for shifting multiple precision numbers.

* The installation procedure should make a test run of multiplication to
  decide the threshold values for switching between the available
  algorithms.

* Fast output conversion of x to base B (a toy sketch follows below):
  1. Find n such that B^n > x.
  2. Set y to (x*2^m)/(B^n), where m is large enough to make 2^m ~~ B^n.
  3. Multiply the low half of y by B^(n/2), and recursively convert the
     result.  Truncate the low half of y and convert that recursively.
  Complexity: O(M(n)log(n))+O(D(n))!
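A minimal sketch of the divide-and-conquer structure behind this
conversion, using unsigned long long in place of a bignum.  The real
speedup comes from step 2, replacing the division by a power of B with a
multiplication by a precomputed reciprocal; this toy uses plain division
just to show the recursion, and none of the names are GMP API.

    #include <stdio.h>

    /* Convert x to ndigits digits in base B (B <= 16): split x around
       B^(ndigits/2) and convert the two halves independently.  */
    static void
    to_base (unsigned long long x, unsigned B, unsigned ndigits, char *out)
    {
      if (ndigits == 1)
        {
          out[0] = "0123456789abcdef"[x % B];
          return;
        }
      unsigned half = ndigits / 2;
      unsigned long long pow = 1;
      for (unsigned i = 0; i < half; i++)
        pow *= B;
      to_base (x / pow, B, ndigits - half, out);           /* high part */
      to_base (x % pow, B, half, out + (ndigits - half));  /* low part */
    }

    int
    main (void)
    {
      char buf[17] = { 0 };
      to_base (3735928559ULL, 16, 16, buf);
      printf ("%s\n", buf);   /* prints 00000000deadbeef */
      return 0;
    }

The recursion has depth O(log n), which is where the log factor in the
stated bound comes from.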
* Improve division using Newton-Raphson.  Check out "Newton Iteration and
  Integer Division" by Stephen Tate in "Synthesis of Parallel Algorithms",
  Morgan Kaufmann, 1993 ("beware of some errors"...).

* Improve the implementation of Karatsuba's algorithm.  For most operand
  sizes, we can reduce the number of operations by splitting differently.
  (A toy instance of the underlying identity appears after the FFT notes
  below.)

* Faster multiplication: The best approach is to first implement
  Toom-Cook.  People report that it beats Karatsuba's algorithm already at
  about 100 limbs.  FFT would probably never beat a well-written Toom-Cook
  (not even for millions of bits).

FFT:
{
  * Multiplication could be done with Montgomery's method combined with
    the "three primes" method described in Lipson.  Maybe this would be
    faster than Nussbaumer's method with 3 (simple) moduli?

  * Maybe the modular tricks below are not needed: We are using very
    special numbers, Fermat numbers with a small base and a large
    exponent, and maybe it's possible to just subtract and add?

  * Modify Nussbaumer's convolution algorithm to use 3 words for each
    coefficient, calculating in 3 relatively prime moduli (e.g.
    0xffffffff, 0x100000000, and 0x7fff on a 32-bit computer).  Both the
    basic operations and CRR would be very fast with such numbers.

  * Optimize the Schoenhage-Strassen multiplication algorithm.  Take
    advantage of the real-valued input to save half of the operations and
    half of the memory.  Use a recursive FFT with large base cases, since
    recursive FFT has better memory locality.  A normal FFT gets 100%
    cache misses for large enough operands.

  * In the 3-prime convolution method, it might sometimes be a win to use
    2, 3, or 5 primes.  Imagine that using 3 primes would require a
    transform length of 2^n.  But 2 primes might still sometimes give us
    correct results with that same transform length, or 5 primes might
    allow us to decrease the transform size to 2^(n-1).

  To optimize a floating-point based complex FFT we have to think of:

  1. The normal implementation accesses all input exactly once for each of
     the log(n) passes.  This means that we will get 0% cache hits when n
     exceeds our cache size.  Remedy: Reorganize the computation into
     partial passes, maybe similar to a standard recursive FFT
     implementation.  Use a large `base case' to make any extra overhead
     of this organization negligible.

  2. Use base-4, base-8 and base-16 FFT instead of just radix-2.  This can
     reduce the number of operations by 2x.

  3. Inputs are real-valued.  According to Knuth's "Seminumerical
     Algorithms", exercise 4.6.4-14, we can save half the memory and half
     the operations if we take advantage of that.

  4. Maybe make it possible to write the innermost loop in assembly, since
     that could win us another 2x speedup.  (If we write our FFT to avoid
     cache misses (see #1 above) it might be logical to write the `base
     case' in assembly.)

  5. Avoid multiplication by 1, i, -1, -i.  Similarly, optimize
     multiplication by (+-\/2 +- i\/2).

  6. Put as many bits as possible in each double (but don't waste time if
     that doesn't make the transform size smaller).

  7. For n > some large number, we will get accuracy problems because of
     the limited precision of our floating point arithmetic.  This can
     easily be solved by using the Karatsuba trick a few times until our
     operands become small enough.

  8. Precompute the roots of unity and store them in a vector.
}
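For reference, here is a toy instance of Karatsuba's identity, multiplying
two 32-bit numbers via their 16-bit halves with three multiplications
instead of four; an mpn-level version splits limb vectors the same way and
recurses.  A sketch only, not GMP code.

    #include <stdio.h>
    #include <stdint.h>

    int
    main (void)
    {
      uint32_t a = 0x9abcdef1, b = 0x87654321;
      uint64_t a1 = a >> 16, a0 = a & 0xffff;
      uint64_t b1 = b >> 16, b0 = b & 0xffff;

      uint64_t hi  = a1 * b1;
      uint64_t lo  = a0 * b0;
      /* (a1+a0)*(b1+b0) - hi - lo = a1*b0 + a0*b1, saving one multiply. */
      uint64_t mid = (a1 + a0) * (b1 + b0) - hi - lo;

      uint64_t prod = (hi << 32) + (mid << 16) + lo;
      printf ("%d\n", prod == (uint64_t) a * b);   /* prints 1 */
      return 0;
    }

Note that with limb vectors the sum a1+a0 can carry beyond a limb, which
is one reason real implementations often prefer the subtractive variant
built around (a1-a0)*(b1-b0).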
* When a division result is going to be just one limb (i.e. nsize-dsize is
  small), normalization could be done in the division loop.

* Never allocate temporary space for a source param that overlaps with a
  destination param needing reallocation.  Instead malloc a new block for
  the destination (and free the source before returning to the caller).

* Parallel addition.  Since each processor has to tell the next processor
  that it is ready, we can use simplified synchronization, and actually
  write it in C.  Each processor (apart from the least significant one)
  does this:

    while (*svar != my_number)
      ;
    *svar = my_number + 1;

  The least significant processor does this:

    *svar = my_number + 1;  /* i.e., *svar = 1 */

  Before starting the addition, one processor has to store 0 in *svar.

  Other things to think about for parallel addition: To avoid false
  (cache-line) sharing, allocate blocks on cache-line boundaries.


Local Variables:
mode: text
fill-column: 77
fill-prefix: " "
version-control: never
End: