diff options
Diffstat (limited to 'lib/libc/softfloat/softfloat.txt')
-rw-r--r-- | lib/libc/softfloat/softfloat.txt | 373 |
1 files changed, 373 insertions, 0 deletions
diff --git a/lib/libc/softfloat/softfloat.txt b/lib/libc/softfloat/softfloat.txt new file mode 100644 index 0000000..bd63324 --- /dev/null +++ b/lib/libc/softfloat/softfloat.txt @@ -0,0 +1,373 @@ +$NetBSD: softfloat.txt,v 1.1 2000/06/06 08:15:10 bjh21 Exp $ +$FreeBSD$ + +SoftFloat Release 2a General Documentation + +John R. Hauser +1998 December 13 + + +------------------------------------------------------------------------------- +Introduction + +SoftFloat is a software implementation of floating-point that conforms to +the IEC/IEEE Standard for Binary Floating-Point Arithmetic. As many as four +formats are supported: single precision, double precision, extended double +precision, and quadruple precision. All operations required by the standard +are implemented, except for conversions to and from decimal. + +This document gives information about the types defined and the routines +implemented by SoftFloat. It does not attempt to define or explain the +IEC/IEEE Floating-Point Standard. Details about the standard are available +elsewhere. + + +------------------------------------------------------------------------------- +Limitations + +SoftFloat is written in C and is designed to work with other C code. The +SoftFloat header files assume an ISO/ANSI-style C compiler. No attempt +has been made to accomodate compilers that are not ISO-conformant. In +particular, the distributed header files will not be acceptable to any +compiler that does not recognize function prototypes. + +Support for the extended double-precision and quadruple-precision formats +depends on a C compiler that implements 64-bit integer arithmetic. If the +largest integer format supported by the C compiler is 32 bits, SoftFloat is +limited to only single and double precisions. When that is the case, all +references in this document to the extended double precision, quadruple +precision, and 64-bit integers should be ignored. + + +------------------------------------------------------------------------------- +Contents + + Introduction + Limitations + Contents + Legal Notice + Types and Functions + Rounding Modes + Extended Double-Precision Rounding Precision + Exceptions and Exception Flags + Function Details + Conversion Functions + Standard Arithmetic Functions + Remainder Functions + Round-to-Integer Functions + Comparison Functions + Signaling NaN Test Functions + Raise-Exception Function + Contact Information + + + +------------------------------------------------------------------------------- +Legal Notice + +SoftFloat was written by John R. Hauser. This work was made possible in +part by the International Computer Science Institute, located at Suite 600, +1947 Center Street, Berkeley, California 94704. Funding was partially +provided by the National Science Foundation under grant MIP-9311980. The +original version of this code was written as part of a project to build +a fixed-point vector processor in collaboration with the University of +California at Berkeley, overseen by Profs. Nelson Morgan and John Wawrzynek. + +THIS SOFTWARE IS DISTRIBUTED AS IS, FOR FREE. Although reasonable effort +has been made to avoid it, THIS SOFTWARE MAY CONTAIN FAULTS THAT WILL AT +TIMES RESULT IN INCORRECT BEHAVIOR. USE OF THIS SOFTWARE IS RESTRICTED TO +PERSONS AND ORGANIZATIONS WHO CAN AND WILL TAKE FULL RESPONSIBILITY FOR ANY +AND ALL LOSSES, COSTS, OR OTHER PROBLEMS ARISING FROM ITS USE. + + +------------------------------------------------------------------------------- +Types and Functions + +When 64-bit integers are supported by the compiler, the `softfloat.h' header +file defines four types: `float32' (single precision), `float64' (double +precision), `floatx80' (extended double precision), and `float128' +(quadruple precision). The `float32' and `float64' types are defined in +terms of 32-bit and 64-bit integer types, respectively, while the `float128' +type is defined as a structure of two 64-bit integers, taking into account +the byte order of the particular machine being used. The `floatx80' type +is defined as a structure containing one 16-bit and one 64-bit integer, with +the machine's byte order again determining the order of the `high' and `low' +fields. + +When 64-bit integers are _not_ supported by the compiler, the `softfloat.h' +header file defines only two types: `float32' and `float64'. Because +ISO/ANSI C guarantees at least one built-in integer type of 32 bits, +the `float32' type is identified with an appropriate integer type. The +`float64' type is defined as a structure of two 32-bit integers, with the +machine's byte order determining the order of the fields. + +In either case, the types in `softfloat.h' are defined such that if a system +implements the usual C `float' and `double' types according to the IEC/IEEE +Standard, then the `float32' and `float64' types should be indistinguishable +in memory from the native `float' and `double' types. (On the other hand, +when `float32' or `float64' values are placed in processor registers by +the compiler, the type of registers used may differ from those used for the +native `float' and `double' types.) + +SoftFloat implements the following arithmetic operations: + +-- Conversions among all the floating-point formats, and also between + integers (32-bit and 64-bit) and any of the floating-point formats. + +-- The usual add, subtract, multiply, divide, and square root operations + for all floating-point formats. + +-- For each format, the floating-point remainder operation defined by the + IEC/IEEE Standard. + +-- For each floating-point format, a ``round to integer'' operation that + rounds to the nearest integer value in the same format. (The floating- + point formats can hold integer values, of course.) + +-- Comparisons between two values in the same floating-point format. + +The only functions required by the IEC/IEEE Standard that are not provided +are conversions to and from decimal. + + +------------------------------------------------------------------------------- +Rounding Modes + +All four rounding modes prescribed by the IEC/IEEE Standard are implemented +for all operations that require rounding. The rounding mode is selected +by the global variable `float_rounding_mode'. This variable may be set +to one of the values `float_round_nearest_even', `float_round_to_zero', +`float_round_down', or `float_round_up'. The rounding mode is initialized +to nearest/even. + + +------------------------------------------------------------------------------- +Extended Double-Precision Rounding Precision + +For extended double precision (`floatx80') only, the rounding precision +of the standard arithmetic operations is controlled by the global variable +`floatx80_rounding_precision'. The operations affected are: + + floatx80_add floatx80_sub floatx80_mul floatx80_div floatx80_sqrt + +When `floatx80_rounding_precision' is set to its default value of 80, these +operations are rounded (as usual) to the full precision of the extended +double-precision format. Setting `floatx80_rounding_precision' to 32 +or to 64 causes the operations listed to be rounded to reduced precision +equivalent to single precision (`float32') or to double precision +(`float64'), respectively. When rounding to reduced precision, additional +bits in the result significand beyond the rounding point are set to zero. +The consequences of setting `floatx80_rounding_precision' to a value other +than 32, 64, or 80 is not specified. Operations other than the ones listed +above are not affected by `floatx80_rounding_precision'. + + +------------------------------------------------------------------------------- +Exceptions and Exception Flags + +All five exception flags required by the IEC/IEEE Standard are +implemented. Each flag is stored as a unique bit in the global variable +`float_exception_flags'. The positions of the exception flag bits within +this variable are determined by the bit masks `float_flag_inexact', +`float_flag_underflow', `float_flag_overflow', `float_flag_divbyzero', and +`float_flag_invalid'. The exception flags variable is initialized to all 0, +meaning no exceptions. + +An individual exception flag can be cleared with the statement + + float_exception_flags &= ~ float_flag_<exception>; + +where `<exception>' is the appropriate name. To raise a floating-point +exception, the SoftFloat function `float_raise' should be used (see below). + +In the terminology of the IEC/IEEE Standard, SoftFloat can detect tininess +for underflow either before or after rounding. The choice is made by +the global variable `float_detect_tininess', which can be set to either +`float_tininess_before_rounding' or `float_tininess_after_rounding'. +Detecting tininess after rounding is better because it results in fewer +spurious underflow signals. The other option is provided for compatibility +with some systems. Like most systems, SoftFloat always detects loss of +accuracy for underflow as an inexact result. + + +------------------------------------------------------------------------------- +Function Details + +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - +Conversion Functions + +All conversions among the floating-point formats are supported, as are all +conversions between a floating-point format and 32-bit and 64-bit signed +integers. The complete set of conversion functions is: + + int32_to_float32 int64_to_float32 + int32_to_float64 int64_to_float32 + int32_to_floatx80 int64_to_floatx80 + int32_to_float128 int64_to_float128 + + float32_to_int32 float32_to_int64 + float32_to_int32 float64_to_int64 + floatx80_to_int32 floatx80_to_int64 + float128_to_int32 float128_to_int64 + + float32_to_float64 float32_to_floatx80 float32_to_float128 + float64_to_float32 float64_to_floatx80 float64_to_float128 + floatx80_to_float32 floatx80_to_float64 floatx80_to_float128 + float128_to_float32 float128_to_float64 float128_to_floatx80 + +Each conversion function takes one operand of the appropriate type and +returns one result. Conversions from a smaller to a larger floating-point +format are always exact and so require no rounding. Conversions from 32-bit +integers to double precision and larger formats are also exact, and likewise +for conversions from 64-bit integers to extended double and quadruple +precisions. + +Conversions from floating-point to integer raise the invalid exception if +the source value cannot be rounded to a representable integer of the desired +size (32 or 64 bits). If the floating-point operand is a NaN, the largest +positive integer is returned. Otherwise, if the conversion overflows, the +largest integer with the same sign as the operand is returned. + +On conversions to integer, if the floating-point operand is not already an +integer value, the operand is rounded according to the current rounding +mode as specified by `float_rounding_mode'. Because C (and perhaps other +languages) require that conversions to integers be rounded toward zero, the +following functions are provided for improved speed and convenience: + + float32_to_int32_round_to_zero float32_to_int64_round_to_zero + float64_to_int32_round_to_zero float64_to_int64_round_to_zero + floatx80_to_int32_round_to_zero floatx80_to_int64_round_to_zero + float128_to_int32_round_to_zero float128_to_int64_round_to_zero + +These variant functions ignore `float_rounding_mode' and always round toward +zero. + +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - +Standard Arithmetic Functions + +The following standard arithmetic functions are provided: + + float32_add float32_sub float32_mul float32_div float32_sqrt + float64_add float64_sub float64_mul float64_div float64_sqrt + floatx80_add floatx80_sub floatx80_mul floatx80_div floatx80_sqrt + float128_add float128_sub float128_mul float128_div float128_sqrt + +Each function takes two operands, except for `sqrt' which takes only one. +The operands and result are all of the same type. + +Rounding of the extended double-precision (`floatx80') functions is affected +by the `floatx80_rounding_precision' variable, as explained above in the +section _Extended_Double-Precision_Rounding_Precision_. + +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - +Remainder Functions + +For each format, SoftFloat implements the remainder function according to +the IEC/IEEE Standard. The remainder functions are: + + float32_rem + float64_rem + floatx80_rem + float128_rem + +Each remainder function takes two operands. The operands and result are all +of the same type. Given operands x and y, the remainder functions return +the value x - n*y, where n is the integer closest to x/y. If x/y is exactly +halfway between two integers, n is the even integer closest to x/y. The +remainder functions are always exact and so require no rounding. + +Depending on the relative magnitudes of the operands, the remainder +functions can take considerably longer to execute than the other SoftFloat +functions. This is inherent in the remainder operation itself and is not a +flaw in the SoftFloat implementation. + +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - +Round-to-Integer Functions + +For each format, SoftFloat implements the round-to-integer function +specified by the IEC/IEEE Standard. The functions are: + + float32_round_to_int + float64_round_to_int + floatx80_round_to_int + float128_round_to_int + +Each function takes a single floating-point operand and returns a result of +the same type. (Note that the result is not an integer type.) The operand +is rounded to an exact integer according to the current rounding mode, and +the resulting integer value is returned in the same floating-point format. + +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - +Comparison Functions + +The following floating-point comparison functions are provided: + + float32_eq float32_le float32_lt + float64_eq float64_le float64_lt + floatx80_eq floatx80_le floatx80_lt + float128_eq float128_le float128_lt + +Each function takes two operands of the same type and returns a 1 or 0 +representing either _true_ or _false_. The abbreviation `eq' stands for +``equal'' (=); `le' stands for ``less than or equal'' (<=); and `lt' stands +for ``less than'' (<). + +The standard greater-than (>), greater-than-or-equal (>=), and not-equal +(!=) functions are easily obtained using the functions provided. The +not-equal function is just the logical complement of the equal function. +The greater-than-or-equal function is identical to the less-than-or-equal +function with the operands reversed; and the greater-than function can be +obtained from the less-than function in the same way. + +The IEC/IEEE Standard specifies that the less-than-or-equal and less-than +functions raise the invalid exception if either input is any kind of NaN. +The equal functions, on the other hand, are defined not to raise the invalid +exception on quiet NaNs. For completeness, SoftFloat provides the following +additional functions: + + float32_eq_signaling float32_le_quiet float32_lt_quiet + float64_eq_signaling float64_le_quiet float64_lt_quiet + floatx80_eq_signaling floatx80_le_quiet floatx80_lt_quiet + float128_eq_signaling float128_le_quiet float128_lt_quiet + +The `signaling' equal functions are identical to the standard functions +except that the invalid exception is raised for any NaN input. Likewise, +the `quiet' comparison functions are identical to their counterparts except +that the invalid exception is not raised for quiet NaNs. + +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - +Signaling NaN Test Functions + +The following functions test whether a floating-point value is a signaling +NaN: + + float32_is_signaling_nan + float64_is_signaling_nan + floatx80_is_signaling_nan + float128_is_signaling_nan + +The functions take one operand and return 1 if the operand is a signaling +NaN and 0 otherwise. + +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - +Raise-Exception Function + +SoftFloat provides a function for raising floating-point exceptions: + + float_raise + +The function takes a mask indicating the set of exceptions to raise. No +result is returned. In addition to setting the specified exception flags, +this function may cause a trap or abort appropriate for the current system. + +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + + +------------------------------------------------------------------------------- +Contact Information + +At the time of this writing, the most up-to-date information about +SoftFloat and the latest release can be found at the Web page `http:// +HTTP.CS.Berkeley.EDU/~jhauser/arithmetic/SoftFloat.html'. + + |