summaryrefslogtreecommitdiffstats
path: root/contrib/libgmp/PROJECTS
diff options
context:
space:
mode:
authormarkm <markm@FreeBSD.org>1996-10-20 08:49:26 +0000
committermarkm <markm@FreeBSD.org>1996-10-20 08:49:26 +0000
commit6ec01646dc55b6fa688ed5906e0d52556d174da2 (patch)
treeecc4c214d76efa8e1b2fb33ac2f0aab18ea03ebf /contrib/libgmp/PROJECTS
downloadFreeBSD-src-6ec01646dc55b6fa688ed5906e0d52556d174da2.zip
FreeBSD-src-6ec01646dc55b6fa688ed5906e0d52556d174da2.tar.gz
Clean import of libgmp 2.0.2, with only the non-x86 bits removed.
BMakefiles and other bits will follow. Requested by: Andrey Chernov Made world by: Chuck Robey
Diffstat (limited to 'contrib/libgmp/PROJECTS')
-rw-r--r--contrib/libgmp/PROJECTS270
1 files changed, 270 insertions, 0 deletions
diff --git a/contrib/libgmp/PROJECTS b/contrib/libgmp/PROJECTS
new file mode 100644
index 0000000..75016bd
--- /dev/null
+++ b/contrib/libgmp/PROJECTS
@@ -0,0 +1,270 @@
+IDEAS ABOUT THINGS TO WORK ON
+
+* mpq_cmp: Maybe the most sensible thing to do would be to multiply the, say,
+ 4 most significant limbs of each operand and compare them. If that is not
+ sufficient, do the same for 8 limbs, etc.
+
+* Write mpi, the Multiple Precision Interval Arithmetic layer.
+
+* Write `mpX_eval' that take lambda-like expressions and a list of operands.
+
+* As a general rule, recognize special operand values in mpz and mpf, and
+ use shortcuts for speed. Examples: Recognize (small or all) 2^n in
+ multiplication and division. Recognize small bases in mpz_pow_ui.
+
+* Implement lazy allocation? mpz->d == 0 would mean no allocation made yet.
+
+* Maybe store one-limb numbers according to Per Bothner's idea:
+ struct {
+ mp_ptr d;
+ union {
+ mp_limb val; /* if (d == NULL). */
+ mp_size size; /* Length of data array, if (d != NULL). */
+ } u;
+ };
+ Problem: We can't normalize to that format unless we free the space
+ pointed to by d, and therefore small values will not be stored in a
+ canonical way.
+
+* Document complexity of all functions.
+
+* Add predicate functions mpz_fits_signedlong_p, mpz_fits_unsignedlong_p,
+ mpz_fits_signedint_p, etc.
+
+ mpz_floor (mpz, mpq), mpz_trunc (mpz, mpq), mpz_round (mpz, mpq).
+
+* Better random number generators. There should be fast (like mpz_random),
+ very good (mpz_veryrandom), and special purpose (like mpz_random2). Sizes
+ in *bits*, not in limbs.
+
+* It'd be possible to have an interface "s = add(a,b)" with automatic GC.
+ If the mpz_xinit routine remembers the address of the variable we could
+ walk-and-mark the list of remembered variables, and free the space
+ occupied by the remembered variables that didn't get marked. Fairly
+ standard.
+
+* Improve speed for non-gcc compilers by defining umul_ppmm, udiv_qrnnd,
+ etc, to call __umul_ppmm, __udiv_qrnnd. A typical definition for
+ umul_ppmm would be
+ #define umul_ppmm(ph,pl,m0,m1) \
+ {unsigned long __ph; (pl) = __umul_ppmm (&__ph, (m0), (m1)); (ph) = __ph;}
+ In order to maintain just one version of longlong.h (gmp and gcc), this
+ has to be done outside of longlong.h.
+
+Bennet Yee at CMU proposes:
+* mpz_{put,get}_raw for memory oriented I/O like other *_raw functions.
+* A function mpfatal that is called for exceptions. Let the user override
+ a default definition.
+
+* Make all computation mpz_* functions return a signed int indicating if the
+ result was zero, positive, or negative?
+
+* Implement mpz_cmpabs, mpz_xor, mpz_to_double, mpz_to_si, mpz_lcm, mpz_dpb,
+ mpz_ldb, various bit string operations. Also mpz_@_si for most @??
+
+* Add macros for looping efficiently over a number's limbs:
+ MPZ_LOOP_OVER_LIMBS_INCREASING(num,limb)
+ { user code manipulating limb}
+ MPZ_LOOP_OVER_LIMBS_DECREASING(num,limb)
+ { user code manipulating limb}
+
+Brian Beuning proposes:
+ 1. An array of small primes
+ 3. A function to factor a mpz_t. [How do we return the factors? Maybe
+ we just return one arbitrary factor? In the latter case, we have to
+ use a data structure that records the state of the factoring routine.]
+ 4. A routine to look for "small" divisors of an mpz_t
+ 5. A 'multiply mod n' routine based on Montgomery's algorithm.
+
+Dough Lea proposes:
+ 1. A way to find out if an integer fits into a signed int, and if so, a
+ way to convert it out.
+ 2. Similarly for double precision float conversion.
+ 3. A function to convert the ratio of two integers to a double. This
+ can be useful for mixed mode operations with integers, rationals, and
+ doubles.
+
+Elliptic curve method description in the Chapter `Algorithms in Number
+Theory' in the Handbook of Theoretical Computer Science, Elsevier,
+Amsterdam, 1990. Also in Carl Pomerance's lecture notes on Cryptology and
+Computational Number Theory, 1990.
+
+* Harald Kirsh suggests:
+ mpq_set_str (MP_RAT *r, char *numerator, char *denominator).
+
+* New function: mpq_get_ifstr (int_str, frac_str, base,
+ precision_in_som_way, rational_number). Convert RATIONAL_NUMBER to a
+ string in BASE and put the integer part in INT_STR and the fraction part
+ in FRAC_STR. (This function would do a division of the numerator and the
+ denominator.)
+
+* Should mpz_powm* handle negative exponents?
+
+* udiv_qrnnd: If the denominator is normalized, the n0 argument has very
+ little effect on the quotient. Maybe we can assume it is 0, and
+ compensate at a later stage?
+
+* Better sqrt: First calculate the reciprocal square root, then multiply by
+ the operand to get the square root. The reciprocal square root can be
+ obtained through Newton-Raphson without division. To compute sqrt(A), the
+ iteration is,
+
+ 2
+ x = x (3 - A x )/2.
+ i+1 i i
+
+ The final result can be computed without division using,
+
+ sqrt(A) = A x .
+ n
+
+* Newton-Raphson using multiplication: We get twice as many correct digits
+ in each iteration. So if we square x(k) as part of the iteration, the
+ result will have the leading digits in common with the entire result from
+ iteration k-1. A _mpn_mul_lowpart could help us take advantage of this.
+
+* Peter Montgomery: If 0 <= a, b < p < 2^31 and I want a modular product
+ a*b modulo p and the long long type is unavailable, then I can write
+
+ typedef signed long slong;
+ typedef unsigned long ulong;
+ slong a, b, p, quot, rem;
+
+ quot = (slong) (0.5 + (double)a * (double)b / (double)p);
+ rem = (slong)((ulong)a * (ulong)b - (ulong)p * (ulong)quot);
+ if (rem < 0} {rem += p; quot--;}
+
+* Speed modulo arithmetic, using Montgomery's method or my pre-inversion
+ method. In either case, special arithmetic calls would be needed,
+ mpz_mmmul, mpz_mmadd, mpz_mmsub, plus some kind of initialization
+ functions. Better yet: Write a new mpr layer.
+
+* mpz_powm* should not use division to reduce the result in the loop, but
+ instead pre-compute the reciprocal of the MOD argument and do reduced_val
+ = val-val*reciprocal(MOD)*MOD, or use Montgomery's method.
+
+* mpz_mod_2expplussi -- to reduce a bignum modulo (2**n)+s
+
+* It would be a quite important feature never to allocate more memory than
+ really necessary for a result. Sometimes we can achieve this cheaply, by
+ deferring reallocation until the result size is known.
+
+* New macro in longlong.h: shift_rhl that extracts a word by shifting two
+ words as a unit. (Supported by i386, i860, HP-PA, POWER, 29k.) Useful
+ for shifting multiple precision numbers.
+
+* The installation procedure should make a test run of multiplication to
+ decide the threshold values for algorithm switching between the available
+ methods.
+
+* Fast output conversion of x to base B:
+ 1. Find n, such that (B^n > x).
+ 2. Set y to (x*2^m)/(B^n), where m large enough to make 2^n ~~ B^n
+ 3. Multiply the low half of y by B^(n/2), and recursively convert the
+ result. Truncate the low half of y and convert that recursively.
+ Complexity: O(M(n)log(n))+O(D(n))!
+
+* Improve division using Newton-Raphson. Check out "Newton Iteration and
+ Integer Division" by Stephen Tate in "Synthesis of Parallel Algorithms",
+ Morgan Kaufmann, 1993 ("beware of some errors"...)
+
+* Improve implementation of Karatsuba's algorithm. For most operand sizes,
+ we can reduce the number of operations by splitting differently.
+
+* Faster multiplication: The best approach is to first implement Toom-Cook.
+ People report that it beats Karatsuba's algorithm already at about 100
+ limbs. FFT would probably never beat a well-written Toom-Cook (not even for
+ millions of bits).
+
+FFT:
+{
+ * Multiplication could be done with Montgomery's method combined with
+ the "three primes" method described in Lipson. Maybe this would be
+ faster than to Nussbaumer's method with 3 (simple) moduli?
+
+ * Maybe the modular tricks below are not needed: We are using very
+ special numbers, Fermat numbers with a small base and a large exponent,
+ and maybe it's possible to just subtract and add?
+
+ * Modify Nussbaumer's convolution algorithm, to use 3 words for each
+ coefficient, calculating in 3 relatively prime moduli (e.g.
+ 0xffffffff, 0x100000000, and 0x7fff on a 32-bit computer). Both all
+ operations and CRR would be very fast with such numbers.
+
+ * Optimize the Schoenhage-Stassen multiplication algorithm. Take advantage
+ of the real valued input to save half of the operations and half of the
+ memory. Use recursive FFT with large base cases, since recursive FFT has
+ better memory locality. A normal FFT get 100% cache misses for large
+ enough operands.
+
+ * In the 3-prime convolution method, it might sometimes be a win to use 2,
+ 3, or 5 primes. Imagine that using 3 primes would require a transform
+ length of 2^n. But 2 primes might still sometimes give us correct
+ results with that same transform length, or 5 primes might allow us to
+ decrease the transform size to 2^(n-1).
+
+ To optimize floating-point based complex FFT we have to think of:
+
+ 1. The normal implementation accesses all input exactly once for each of
+ the log(n) passes. This means that we will get 0% cache hit when n >
+ our cache. Remedy: Reorganize computation to compute partial passes,
+ maybe similar to a standard recursive FFT implementation. Use a large
+ `base case' to make any extra overhead of this organization negligible.
+
+ 2. Use base-4, base-8 and base-16 FFT instead of just radix-2. This can
+ reduce the number of operations by 2x.
+
+ 3. Inputs are real-valued. According to Knuth's "Seminumerical
+ Algorithms", exercise 4.6.4-14, we can save half the memory and half
+ the operations if we take advantage of that.
+
+ 4. Maybe make it possible to write the innermost loop in assembly, since
+ that could win us another 2x speedup. (If we write our FFT to avoid
+ cache-miss (see #1 above) it might be logical to write the `base case'
+ in assembly.)
+
+ 5. Avoid multiplication by 1, i, -1, -i. Similarly, optimize
+ multiplication by (+-\/2 +- i\/2).
+
+ 6. Put as many bits as possible in each double (but don't waste time if
+ that doesn't make the transform size become smaller).
+
+ 7. For n > some large number, we will get accuracy problems because of the
+ limited precision of our floating point arithmetic. This can easily be
+ solved by using the Karatsuba trick a few times until our operands
+ become small enough.
+
+ 8. Precompute the roots-of-unity and store them in a vector.
+}
+
+* When a division result is going to be just one limb, (i.e. nsize-dsize is
+ small) normalization could be done in the division loop.
+
+* Never allocate temporary space for a source param that overlaps with a
+ destination param needing reallocation. Instead malloc a new block for
+ the destination (and free the source before returning to the caller).
+
+* Parallel addition. Since each processors have to tell it is ready to the
+ next processor, we can use simplified synchronization, and actually write
+ it in C: For each processor (apart from the least significant):
+
+ while (*svar != my_number)
+ ;
+ *svar = my_number + 1;
+
+ The least significant processor does this:
+
+ *svar = my_number + 1; /* i.e., *svar = 1 */
+
+ Before starting the addition, one processor has to store 0 in *svar.
+
+ Other things to think about for parallel addition: To avoid false
+ (cache-line) sharing, allocate blocks on cache-line boundaries.
+
+
+Local Variables:
+mode: text
+fill-column: 77
+fill-prefix: " "
+version-control: never
+End:
OpenPOWER on IntegriCloud