author     markm <markm@FreeBSD.org>    1996-10-20 08:49:26 +0000
committer  markm <markm@FreeBSD.org>    1996-10-20 08:49:26 +0000
commit     6ec01646dc55b6fa688ed5906e0d52556d174da2
tree       ecc4c214d76efa8e1b2fb33ac2f0aab18ea03ebf /contrib/libgmp/PROJECTS
Clean import of libgmp 2.0.2, with only the non-x86 bits removed.
BMakefiles and other bits will follow.
Requested by: Andrey Chernov
Made world by: Chuck Robey
Diffstat (limited to 'contrib/libgmp/PROJECTS')
-rw-r--r--  contrib/libgmp/PROJECTS  270
1 file changed, 270 insertions(+), 0 deletions(-)
diff --git a/contrib/libgmp/PROJECTS b/contrib/libgmp/PROJECTS
new file mode 100644
index 0000000..75016bd
--- /dev/null
+++ b/contrib/libgmp/PROJECTS
@@ -0,0 +1,270 @@

IDEAS ABOUT THINGS TO WORK ON

* mpq_cmp: Maybe the most sensible thing to do would be to multiply the,
  say, 4 most significant limbs of each operand and compare them.  If that
  is not sufficient, do the same for 8 limbs, etc.

* Write mpi, the Multiple Precision Interval Arithmetic layer.

* Write `mpX_eval' that takes lambda-like expressions and a list of
  operands.

* As a general rule, recognize special operand values in mpz and mpf, and
  use shortcuts for speed.  Examples: Recognize (small or all) 2^n in
  multiplication and division.  Recognize small bases in mpz_pow_ui.

* Implement lazy allocation?  mpz->d == 0 would mean no allocation has
  been made yet.

* Maybe store one-limb numbers according to Per Bothner's idea:

    struct {
      mp_ptr d;
      union {
        mp_limb val;   /* if (d == NULL). */
        mp_size size;  /* Length of data array, if (d != NULL). */
      } u;
    };

  Problem: We can't normalize to that format unless we free the space
  pointed to by d, and therefore small values will not be stored in a
  canonical way.

* Document the complexity of all functions.

* Add predicate functions mpz_fits_signedlong_p, mpz_fits_unsignedlong_p,
  mpz_fits_signedint_p, etc.

* mpz_floor (mpz, mpq), mpz_trunc (mpz, mpq), mpz_round (mpz, mpq).

* Better random number generators.  There should be fast (like
  mpz_random), very good (mpz_veryrandom), and special purpose (like
  mpz_random2) generators.  Sizes in *bits*, not in limbs.

* It'd be possible to have an interface "s = add(a,b)" with automatic GC.
  If the mpz_xinit routine remembers the address of the variable, we could
  walk-and-mark the list of remembered variables, and free the space
  occupied by the remembered variables that didn't get marked.  Fairly
  standard.

* Improve speed for non-gcc compilers by defining umul_ppmm, udiv_qrnnd,
  etc, to call __umul_ppmm, __udiv_qrnnd.  A typical definition for
  umul_ppmm would be

    #define umul_ppmm(ph, pl, m0, m1) \
      {unsigned long __ph; (pl) = __umul_ppmm (&__ph, (m0), (m1)); (ph) = __ph;}

  In order to maintain just one version of longlong.h (gmp and gcc), this
  has to be done outside of longlong.h.

Bennet Yee at CMU proposes:
* mpz_{put,get}_raw for memory oriented I/O like the other *_raw functions.
* A function mpfatal that is called for exceptions.  Let the user override
  a default definition.

* Make all computational mpz_* functions return a signed int indicating
  whether the result was zero, positive, or negative?

* Implement mpz_cmpabs, mpz_xor, mpz_to_double, mpz_to_si, mpz_lcm,
  mpz_dpb, mpz_ldb, various bit string operations.  Also mpz_@_si for most
  @??

* Add macros for looping efficiently over a number's limbs:

    MPZ_LOOP_OVER_LIMBS_INCREASING(num, limb)
      { user code manipulating limb }
    MPZ_LOOP_OVER_LIMBS_DECREASING(num, limb)
      { user code manipulating limb }

Brian Beuning proposes:
  1. An array of small primes
  3. A function to factor a mpz_t.  [How do we return the factors?  Maybe
     we just return one arbitrary factor?  In the latter case, we have to
     use a data structure that records the state of the factoring
     routine.]
  4. A routine to look for "small" divisors of an mpz_t
  5. A 'multiply mod n' routine based on Montgomery's algorithm (see the
     sketch below).
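As a rough illustration of item 5, here is a one-limb sketch of a
'multiply mod n' routine using Montgomery's method with R = 2^32.  It is
only a sketch under stated assumptions (odd n < 2^31, values kept in
Montgomery form a*R mod n); none of these names are GMP API.

    #include <stdio.h>
    #include <stdint.h>

    /* n odd: compute n^{-1} mod 2^32 by Newton; inv = n is correct to 3
       bits, and each pass doubles the number of correct bits.  */
    static uint32_t
    inv32 (uint32_t n)
    {
      uint32_t inv = n;
      for (int i = 0; i < 4; i++)
        inv *= 2 - n * inv;
      return inv;
    }

    /* Return a*b/R mod n, assuming odd n < 2^31 (so the 64-bit sum below
       cannot overflow) and ninv = -n^{-1} mod 2^32.  The low 32 bits of
       t + m*n are zero by construction, so the shift is exact.  */
    static uint32_t
    mont_mul (uint32_t a, uint32_t b, uint32_t n, uint32_t ninv)
    {
      uint64_t t = (uint64_t) a * b;
      uint32_t m = (uint32_t) t * ninv;
      uint64_t u = (t + (uint64_t) m * n) >> 32;
      return u >= n ? (uint32_t) (u - n) : (uint32_t) u;
    }

    int
    main (void)
    {
      uint32_t n = 1000003, a = 123456, b = 654321;
      uint32_t ninv = -inv32 (n);
      uint32_t am = (uint32_t) (((uint64_t) a << 32) % n);  /* to Montgomery form */
      uint32_t bm = (uint32_t) (((uint64_t) b << 32) % n);
      uint32_t p = mont_mul (mont_mul (am, bm, n, ninv), 1, n, ninv);
      /* Both lines print the same residue a*b mod n.  */
      printf ("%u\n", p);
      printf ("%u\n", (uint32_t) (((uint64_t) a * b) % n));
      return 0;
    }

The point of the method is that the reduction uses only multiplications
and shifts; division by n survives only in the one-time conversion into
and out of Montgomery form.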
Doug Lea proposes:
  1. A way to find out if an integer fits into a signed int, and if so, a
     way to convert it out.
  2. Similarly for double precision float conversion.
  3. A function to convert the ratio of two integers to a double.  This
     can be useful for mixed mode operations with integers, rationals, and
     doubles.

Elliptic curve method description in the chapter `Algorithms in Number
Theory' in the Handbook of Theoretical Computer Science, Elsevier,
Amsterdam, 1990.  Also in Carl Pomerance's lecture notes on Cryptology and
Computational Number Theory, 1990.

* Harald Kirsh suggests:
    mpq_set_str (MP_RAT *r, char *numerator, char *denominator).

* New function: mpq_get_ifstr (int_str, frac_str, base,
  precision_in_some_way, rational_number).  Convert RATIONAL_NUMBER to a
  string in BASE and put the integer part in INT_STR and the fraction part
  in FRAC_STR.  (This function would do a division of the numerator and
  the denominator.)

* Should mpz_powm* handle negative exponents?

* udiv_qrnnd: If the denominator is normalized, the n0 argument has very
  little effect on the quotient.  Maybe we can assume it is 0, and
  compensate at a later stage?

* Better sqrt: First calculate the reciprocal square root, then multiply
  by the operand to get the square root.  The reciprocal square root can
  be obtained through Newton-Raphson without division.  To compute
  sqrt(A), the iteration is

    x[i+1] = x[i] * (3 - A*x[i]^2) / 2,

  and the final result can then be computed without division as

    sqrt(A) = A * x[n].

* Newton-Raphson using multiplication: We get twice as many correct digits
  in each iteration.  So if we square x(k) as part of the iteration, the
  result will have the leading digits in common with the entire result
  from iteration k-1.  A _mpn_mul_lowpart could help us take advantage of
  this.

* Peter Montgomery: If 0 <= a, b < p < 2^31 and I want a modular product
  a*b modulo p and the long long type is unavailable, then I can write

    typedef signed long slong;
    typedef unsigned long ulong;
    slong a, b, p, quot, rem;

    quot = (slong) (0.5 + (double) a * (double) b / (double) p);
    rem = (slong) ((ulong) a * (ulong) b - (ulong) p * (ulong) quot);
    if (rem < 0) { rem += p; quot--; }

* Speed up modular arithmetic, using Montgomery's method or my
  pre-inversion method.  In either case, special arithmetic calls would be
  needed: mpz_mmmul, mpz_mmadd, mpz_mmsub, plus some kind of
  initialization functions.  Better yet: Write a new mpr layer.

* mpz_powm* should not use division to reduce the result in the loop, but
  instead pre-compute the reciprocal of the MOD argument and do
  reduced_val = val - val*reciprocal(MOD)*MOD, or use Montgomery's method.

* mpz_mod_2expplussi -- to reduce a bignum modulo (2**n)+s.

* It would be quite an important feature never to allocate more memory
  than really necessary for a result.  Sometimes we can achieve this
  cheaply, by deferring reallocation until the result size is known.

* New macro in longlong.h: shift_rhl, which extracts a word by shifting
  two words as a unit.  (Supported by i386, i860, HP-PA, POWER, 29k.)
  Useful for shifting multiple precision numbers.

* The installation procedure should make a test run of multiplication to
  decide the threshold values for switching between the available
  algorithms.

* Fast output conversion of x to base B (a toy sketch follows below):
  1. Find n such that B^n > x.
  2. Set y to (x*2^m)/(B^n), where m is large enough to make 2^m ~~ B^n.
  3. Multiply the low half of y by B^(n/2), and recursively convert the
     result.  Truncate the low half of y and convert that recursively.
  Complexity: O(M(n)log(n))+O(D(n))!
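A minimal sketch of the divide-and-conquer structure behind this
conversion, using unsigned long long in place of a bignum.  The real
speedup comes from step 2, replacing the division by a power of B with a
multiplication by a precomputed reciprocal; this toy uses plain division
just to show the recursion, and none of the names are GMP API.

    #include <stdio.h>

    /* Convert x to ndigits digits in base B (B <= 16): split x around
       B^(ndigits/2) and convert the two halves independently.  */
    static void
    to_base (unsigned long long x, unsigned B, unsigned ndigits, char *out)
    {
      if (ndigits == 1)
        {
          out[0] = "0123456789abcdef"[x % B];
          return;
        }
      unsigned half = ndigits / 2;
      unsigned long long pow = 1;
      for (unsigned i = 0; i < half; i++)
        pow *= B;
      to_base (x / pow, B, ndigits - half, out);           /* high part */
      to_base (x % pow, B, half, out + (ndigits - half));  /* low part */
    }

    int
    main (void)
    {
      char buf[17] = { 0 };
      to_base (3735928559ULL, 16, 16, buf);
      printf ("%s\n", buf);   /* prints 00000000deadbeef */
      return 0;
    }

The recursion has depth O(log n), which is where the log factor in the
stated bound comes from.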
* Improve division using Newton-Raphson.  Check out "Newton Iteration and
  Integer Division" by Stephen Tate in "Synthesis of Parallel Algorithms",
  Morgan Kaufmann, 1993 ("beware of some errors"...).

* Improve the implementation of Karatsuba's algorithm.  For most operand
  sizes, we can reduce the number of operations by splitting differently.
  (A toy instance of the underlying identity appears after the FFT notes
  below.)

* Faster multiplication: The best approach is to first implement
  Toom-Cook.  People report that it beats Karatsuba's algorithm already at
  about 100 limbs.  FFT would probably never beat a well-written Toom-Cook
  (not even for millions of bits).

FFT:
{
  * Multiplication could be done with Montgomery's method combined with
    the "three primes" method described in Lipson.  Maybe this would be
    faster than Nussbaumer's method with 3 (simple) moduli?

  * Maybe the modular tricks below are not needed: We are using very
    special numbers, Fermat numbers with a small base and a large
    exponent, and maybe it's possible to just subtract and add?

  * Modify Nussbaumer's convolution algorithm to use 3 words for each
    coefficient, calculating in 3 relatively prime moduli (e.g.
    0xffffffff, 0x100000000, and 0x7fff on a 32-bit computer).  Both the
    basic operations and CRR would be very fast with such numbers.

  * Optimize the Schoenhage-Strassen multiplication algorithm.  Take
    advantage of the real-valued input to save half of the operations and
    half of the memory.  Use a recursive FFT with large base cases, since
    recursive FFT has better memory locality.  A normal FFT gets 100%
    cache misses for large enough operands.

  * In the 3-prime convolution method, it might sometimes be a win to use
    2, 3, or 5 primes.  Imagine that using 3 primes would require a
    transform length of 2^n.  But 2 primes might still sometimes give us
    correct results with that same transform length, or 5 primes might
    allow us to decrease the transform size to 2^(n-1).

  To optimize a floating-point based complex FFT we have to think of:

  1. The normal implementation accesses all input exactly once for each of
     the log(n) passes.  This means that we will get 0% cache hits when n
     exceeds our cache size.  Remedy: Reorganize the computation into
     partial passes, maybe similar to a standard recursive FFT
     implementation.  Use a large `base case' to make any extra overhead
     of this organization negligible.

  2. Use base-4, base-8 and base-16 FFT instead of just radix-2.  This can
     reduce the number of operations by 2x.

  3. Inputs are real-valued.  According to Knuth's "Seminumerical
     Algorithms", exercise 4.6.4-14, we can save half the memory and half
     the operations if we take advantage of that.

  4. Maybe make it possible to write the innermost loop in assembly, since
     that could win us another 2x speedup.  (If we write our FFT to avoid
     cache misses (see #1 above) it might be logical to write the `base
     case' in assembly.)

  5. Avoid multiplication by 1, i, -1, -i.  Similarly, optimize
     multiplication by (+-\/2 +- i\/2).

  6. Put as many bits as possible in each double (but don't waste time if
     that doesn't make the transform size smaller).

  7. For n > some large number, we will get accuracy problems because of
     the limited precision of our floating point arithmetic.  This can
     easily be solved by using the Karatsuba trick a few times until our
     operands become small enough.

  8. Precompute the roots of unity and store them in a vector.
}
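For reference, here is a toy instance of Karatsuba's identity, multiplying
two 32-bit numbers via their 16-bit halves with three multiplications
instead of four; an mpn-level version splits limb vectors the same way and
recurses.  A sketch only, not GMP code.

    #include <stdio.h>
    #include <stdint.h>

    int
    main (void)
    {
      uint32_t a = 0x9abcdef1, b = 0x87654321;
      uint64_t a1 = a >> 16, a0 = a & 0xffff;
      uint64_t b1 = b >> 16, b0 = b & 0xffff;

      uint64_t hi  = a1 * b1;
      uint64_t lo  = a0 * b0;
      /* (a1+a0)*(b1+b0) - hi - lo = a1*b0 + a0*b1, saving one multiply. */
      uint64_t mid = (a1 + a0) * (b1 + b0) - hi - lo;

      uint64_t prod = (hi << 32) + (mid << 16) + lo;
      printf ("%d\n", prod == (uint64_t) a * b);   /* prints 1 */
      return 0;
    }

Note that with limb vectors the sum a1+a0 can carry beyond a limb, which
is one reason real implementations often prefer the subtractive variant
built around (a1-a0)*(b1-b0).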
* When a division result is going to be just one limb (i.e. nsize-dsize is
  small), normalization could be done in the division loop.

* Never allocate temporary space for a source param that overlaps with a
  destination param needing reallocation.  Instead malloc a new block for
  the destination (and free the source before returning to the caller).

* Parallel addition.  Since each processor has to tell the next processor
  that it is ready, we can use simplified synchronization, and actually
  write it in C.  Each processor (apart from the least significant one)
  does this:

    while (*svar != my_number)
      ;
    *svar = my_number + 1;

  The least significant processor does this:

    *svar = my_number + 1;  /* i.e., *svar = 1 */

  Before starting the addition, one processor has to store 0 in *svar.

  Other things to think about for parallel addition: To avoid false
  (cache-line) sharing, allocate blocks on cache-line boundaries.


Local Variables:
mode: text
fill-column: 77
fill-prefix: " "
version-control: never
End: