Floating-point arithmetic
{{redirect|Floating point}}{{Short description|Computer approximation for real numbers}}{{Use dmy dates|date=May 2019|cs1-dates=y}}
[[File:Z3 Deutsches Museum.JPG|thumb|200px|An early electromechanical programmable computer, the Z3, included floating-point arithmetic (replica on display at Deutsches Museum in Munich).]]
In computing, floating-point arithmetic (FP) is arithmetic that represents subsets of real numbers using an integer with a fixed precision, called the significand, scaled by an integer exponent of a fixed base. Numbers of this form are called floating-point numbers.{{rp|3}} (Sterbenz, Pat H. ''Floating-Point Computation''. Prentice-Hall, Englewood Cliffs, NJ, 1974. ISBN 0-13-322495-3.)
For example, 12.345 is a floating-point number in base ten with five digits of precision:

12.345 = \underbrace{12345}_\text{significand} \times \underbrace{10}_\text{base}\!\!\!\!\!\!\overbrace{{}^{-3}}^\text{exponent}

However, unlike 12.345, 12.3456 is not a floating-point number in base ten with five digits of precision: it needs six digits of precision; the nearest floating-point number with only five digits is 12.346. In practice, most floating-point systems use base two, though base ten (decimal floating point) is also common.

Floating-point arithmetic operations, such as addition and division, approximate the corresponding real number arithmetic operations by rounding any result that is not a floating-point number itself to a nearby floating-point number.{{rp|22}}{{rp|10}} For example, in a floating-point arithmetic with five base-ten digits of precision, the sum 12.345 + 1.0001 = 13.3451 might be rounded to 13.345.

The term floating point refers to the fact that the number's radix point can "float" anywhere to the left, right, or between the significant digits of the number. This position is indicated by the exponent, so floating point can be considered a form of scientific notation.

A floating-point system can be used to represent, with a fixed number of digits, numbers of very different orders of magnitude, such as the number of meters between galaxies or between protons in an atom. For this reason, floating-point arithmetic is often used in systems that must handle very small and very large real numbers quickly. The result of this dynamic range is that the numbers that can be represented are not uniformly spaced; the difference between two consecutive representable numbers varies with their exponent.

[[File:A number line representing single-precision floating point's numbers and numbers that it cannot display.png|500x500px|thumb|right|Single-precision floating-point numbers on a number line: the green lines mark representable values.]]
[[File:FloatingPointPrecisionAugmented.png|500x500px|thumb|right|Augmented version of the above, showing both signs of representable values.]]

Over the years, a variety of floating-point representations have been used in computers. In 1985, the IEEE 754 Standard for Floating-Point Arithmetic was established, and since the 1990s, the most commonly encountered representations are those defined by the IEEE.

The speed of floating-point operations, commonly measured in terms of FLOPS, is an important characteristic of a computer system, especially for applications that involve intensive mathematical calculations. A floating-point unit (FPU, colloquially a math coprocessor) is a part of a computer system specially designed to carry out operations on floating-point numbers.

Overview

Floating-point numbers

A number representation specifies some way of encoding a number, usually as a string of digits. There are several mechanisms by which strings of digits can represent numbers. In standard mathematical notation, the digit string can be of any length, and the location of the radix point is indicated by placing an explicit "point" character (dot or comma) there. If the radix point is not specified, then the string implicitly represents an integer and the unstated radix point would be off the right-hand end of the string, next to the least significant digit. In fixed-point systems, a position in the string is specified for the radix point.
So a fixed-point scheme might use a string of 8 decimal digits with the decimal point in the middle, whereby "00012345" would represent 0001.2345.

In scientific notation, the given number is scaled by a power of 10, so that it lies within a specific range, typically between 1 and 10, with the radix point appearing immediately after the first digit. As a power of ten, the scaling factor is then indicated separately at the end of the number. For example, the orbital period of Jupiter's moon Io is {{val|152853.5047|fmt=commas}} seconds, a value that would be represented in standard-form scientific notation as {{val|1.528535047|e=5|fmt=commas}} seconds.

Floating-point representation is similar in concept to scientific notation. Logically, a floating-point number consists of a signed digit string of a given length in a given base (the significand) and a signed integer exponent that modifies the magnitude of the number. In a binary format with {{mvar|p}} significand bits, the value represented is
\left(\sum_{n=0}^{p-1} \text{bit}_n \times 2^{-n}\right) \times 2^e

where {{mvar|p}} is the precision ({{val|24}} in this example), {{mvar|n}} is the position of the bit of the significand from the left (starting at {{val|0}} and finishing at {{val|23}} here) and {{mvar|e}} is the exponent ({{val|1}} in this example).

{{anchor|Hidden bit}}It can be required that the most significant digit of the significand of a non-zero number be non-zero (except when the corresponding exponent would be smaller than the minimum one). This process is called normalization. For binary formats (which use only the digits {{val|0}} and {{val|1}}), this non-zero digit is necessarily {{val|1}}. Therefore, it does not need to be represented in memory, allowing the format to have one more bit of precision. This rule is variously called the leading bit convention, the implicit bit convention, the hidden bit convention, or the assumed bit convention.

For the single-precision approximation of π discussed below, the significand bits 110010010000111111011011 with exponent {{mvar|e}} = 1 give

\begin{align}
&\left(1 \times 2^{-0} + 1 \times 2^{-1} + 0 \times 2^{-2} + 0 \times 2^{-3} + 1 \times 2^{-4} + \cdots + 1 \times 2^{-23}\right) \times 2^1 \\
\approx{} &1.5707964 \times 2 \\
\approx{} &3.1415928
\end{align}

Alternatives to floating-point numbers

The floating-point representation is by far the most common way of representing in computers an approximation to real numbers. However, there are alternatives, such as fixed-point representation, arbitrary-precision arithmetic, and rational arithmetic.
History

{{see also|IEEE 754#History}}
[[File:Quevedo 1917.jpg|thumb|upright=0.7|Leonardo Torres Quevedo]]
In 1914, the Spanish engineer Leonardo Torres Quevedo published Essays on Automatics (Torres Quevedo, Leonardo. "Automática: Complemento de la Teoría de las Máquinas", pp. 575–583, Revista de Obras Públicas, 19 November 1914), where he designed a special-purpose electromechanical calculator based on Charles Babbage's Analytical Engine and described a way to store floating-point numbers in a consistent manner. He stated that numbers will be stored in exponential format as n × 10^m, and offered three rules by which consistent manipulation of floating-point numbers by machines could be implemented. For Torres, "n will always be the same number of digits (e.g. six), the first digit of n will be of order of tenths, the second of hundredths, etc, and one will write each quantity in the form: n; m." The format he proposed shows the need for a fixed-sized significand as is presently used for floating-point data, fixing the location of the decimal point in the significand so that each representation was unique, and how to format such numbers by specifying a syntax to be used that could be entered through a typewriter, as was the case of his Electromechanical Arithmometer in 1920. (Ronald T. Kneusel. Numbers and Computers, Springer, pp. 84–85, 2017. {{ISBN|978-3319505084}}){{Sfn|Randell|1982|pp=6, 11–13}} (Randell, Brian. "Digital Computers, History of Origins", p. 545, in Encyclopedia of Computer Science, January 2003.)

[[File:Konrad Zuse (1992).jpg|thumb|upright=0.7|right|Konrad Zuse, architect of the Z3 computer, which uses a 22-bit binary floating-point representation]]
In 1938, Konrad Zuse of Berlin completed the Z1, the first binary, programmable mechanical computer; it uses a 24-bit binary floating-point number representation with a 7-bit signed exponent, a 17-bit significand (including one implicit bit), and a sign bit. The more reliable relay-based Z3, completed in 1941, has representations for both positive and negative infinities; in particular, it implements defined operations with infinity, such as 1/∞ = 0, and it stops on undefined operations, such as 0 × ∞.

Zuse also proposed, but did not complete, carefully rounded floating-point arithmetic that includes ±∞ and NaN representations, anticipating features of the IEEE Standard by four decades. In contrast, von Neumann recommended against floating-point numbers for the 1951 IAS machine, arguing that fixed-point arithmetic is preferable.

The first commercial computer with floating-point hardware was Zuse's Z4 computer, designed in 1942–1945. In 1946, Bell Laboratories introduced the Model V, which implemented decimal floating-point numbers.

The Pilot ACE has binary floating-point arithmetic, and it became operational in 1950 at the National Physical Laboratory, UK. Thirty-three were later sold commercially as the English Electric DEUCE. The arithmetic is actually implemented in software, but with a one megahertz clock rate, the speed of floating-point and fixed-point operations in this machine were initially faster than those of many competing computers. The mass-produced IBM 704 followed in 1954; it introduced the use of a biased exponent.
For many decades after that, floating-point hardware was typically an optional feature, and computers that had it were said to be "scientific computers", or to have "scientific computation" (SC) capability (see also Extensions for Scientific Computation (XSC)). It was not until the launch of the Intel i486 in 1989 that general-purpose personal computers had floating-point capability in hardware as a standard feature.

The UNIVAC 1100/2200 series, introduced in 1962, supported two floating-point representations: a 36-bit single-precision format and a 72-bit double-precision format.
Range of floating-point numbers

A floating-point number consists of two fixed-point components, whose range depends exclusively on the number of bits or digits in their representation. Whereas the components depend linearly on their range, the floating-point range depends linearly on the significand range and exponentially on the range of the exponent component, which gives the number a far wider range.

On a typical computer system, a double-precision (64-bit) binary floating-point number has a coefficient of 53 bits (including 1 implied bit), an exponent of 11 bits, and 1 sign bit. Since 2^10 = 1024, the complete range of the positive normal floating-point numbers in this format is from 2^−1022 ≈ 2 × 10^−308 to approximately 2^1024 ≈ 2 × 10^308.

The number of normal floating-point numbers in a system (B, P, L, U), where B is the base of the system, P is the precision of the significand (in base B), L is the smallest exponent, and U is the largest exponent, is 2(B − 1)B^{P−1}(U − L + 1). There is a smallest positive normal floating-point number,
Underflow level = UFL = B^L,
which has a 1 as the leading digit and 0 for the remaining digits of the significand, and the smallest possible value for the exponent. There is a largest floating-point number,
Overflow level = OFL = \left(1 - B^{-P}\right)\left(B^{U + 1}\right),
which has B − 1 as the value for each digit of the significand and the largest possible value for the exponent. In addition, there are representable values strictly between −UFL and UFL: namely, positive and negative zeros, as well as subnormal numbers.

IEEE 754: floating point in modern computers {{anchor|IEEE 754}}

{{Floating-point}}
The IEEE standardized the computer representation for binary floating-point numbers in IEEE 754 (a.k.a. IEC 60559) in 1985. This first standard is followed by almost all modern machines. It was revised in 2008. IBM mainframes support IBM's own hexadecimal floating point format and IEEE 754-2008 decimal floating point in addition to the IEEE 754 binary format. The Cray T90 series had an IEEE version, but the SV1 still uses Cray floating-point format.{{Citation needed|date=July 2020}}

The standard provides for many closely related formats, differing in only a few details. Five of these formats are called basic formats, and others are termed extended precision formats and extendable precision format. Three formats are especially widely used in computer hardware and languages: single precision (binary32), double precision (binary64), and double extended precision.{{Citation needed|reason=Possibly wrong for double extended: OK for hardware, but for languages? Note that in C, long double may not correspond to double extended (see 32-bit ARM and PowerPC).|date=July 2020}}
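As a quick check of the normal range discussed above, the limits of the binary32 and binary64 formats are exposed in C through <float.h>; a minimal assumed sketch, not part of the original article:

#include <stdio.h>
#include <float.h>

int main(void) {
    /* Smallest positive normal and largest finite values (UFL and OFL) */
    printf("float:  UFL = %g, OFL = %g\n", FLT_MIN, FLT_MAX);
    /* float:  UFL = 1.17549e-38, OFL = 3.40282e+38 */
    printf("double: UFL = %g, OFL = %g\n", DBL_MIN, DBL_MAX);
    /* double: UFL = 2.22507e-308, OFL = 1.79769e+308 */
}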
Internal representation

Floating-point numbers are typically packed into a computer datum as the sign bit, the exponent field, and the significand or mantissa, from left to right. For the IEEE 754 binary formats (basic and extended) which have extant hardware implementations, they are apportioned as follows:

{| class="wikitable" style="text-align:right"
!Type !!Sign !!Exponent !!Trailing significand field !!Total bits !!Exponent bias !!Bits precision !!Number of decimal digits
|-
|[[Half precision|Half (IEEE 754-2008)]] ||1 ||5 ||10 ||16 ||15 ||11 ||~3.3
|-
|[[Single precision|Single]] ||1 ||8 ||23 ||32 ||127 ||24 ||~7.2
|-
|[[Double precision|Double]] ||1 ||11 ||52 ||64 ||1023 ||53 ||~15.9
|-
|[[Extended precision#x86 extended precision format|x86 extended precision]] ||1 ||15 ||64 ||80 ||16383 ||64 ||~19.2
|-
|[[Quadruple-precision floating-point format|Quad]] ||1 ||15 ||112 ||128 ||16383 ||113 ||~34.0
|}
For example, the value π, rounded to 24 bits of precision, has:

- sign = 0 ; e = 1 ; s = 110010010000111111011011 (including the hidden bit)

The sum of the exponent bias (127) and the exponent (1) is 128, so in the single-precision format this is represented as

- 0 10000000 10010010000111111011011 (excluding the hidden bit) = 40490FDB as a hexadecimal number.
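A small C sketch (assumed here, not from the article) confirming this encoding: it stores single-precision π, prints its bit pattern, and rebuilds the value from the sign, exponent, and significand fields using the formula given earlier.

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <math.h>

int main(void) {
    float f = 3.14159265f;            /* rounds to the nearest binary32 value */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* reinterpret the 32-bit pattern */
    printf("%08X\n", bits);           /* prints 40490FDB */

    int sign      = bits >> 31;                        /* 1 sign bit */
    int e         = (int)((bits >> 23) & 0xFF) - 127;  /* unbias the exponent */
    uint32_t frac = bits & 0x7FFFFFu;                  /* 23 stored bits */

    double s = 1.0;                   /* hidden leading 1 */
    for (int n = 1; n <= 23; n++)
        if (frac & (1u << (23 - n)))
            s += ldexp(1.0, -n);      /* add bit_n * 2^-n */

    printf("%.7f\n", ldexp(sign ? -s : s, e));  /* prints 3.1415927 */
}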
Other notable floating-point formats
In addition to the widely used IEEE 754 standard formats, other floating-point formats are used, or have been used, in certain domain-specific areas.

- The Microsoft Binary Format (MBF) was developed for the Microsoft BASIC language products, including Microsoft's first product, Altair BASIC (1975), TRS-80 LEVEL II, CP/M's MBASIC, the IBM PC 5150's BASICA, MS-DOS's GW-BASIC, and QuickBASIC prior to version 4.00. QuickBASIC versions 4.00 and 4.50 switched to the IEEE 754-1985 format but can revert to the MBF format using the /MBF command option. MBF was designed and developed on a simulated Intel 8080 by Monte Davidoff, a dormmate of Bill Gates, during spring of 1975 for the MITS Altair 8800. The initial release of July 1975 supported only a single-precision (32-bit) format because of the cost of the MITS Altair 8800's 4 kilobytes of memory. In December 1975, the 8-kilobyte version added a double-precision (64-bit) format. A single-precision (40-bit) variant format was adopted for other CPUs, notably the MOS 6502 (Apple II, Commodore PET, Atari), Motorola 6800 (MITS Altair 680) and Motorola 6809 (TRS-80 Color Computer). All Microsoft language products from 1975 through 1987 used the Microsoft Binary Format, until Microsoft adopted the IEEE 754 standard format in all its products starting in 1988. MBF consists of the MBF single-precision format (32 bits, "6-digit BASIC"), the MBF extended-precision format (40 bits, "9-digit BASIC"), and the MBF double-precision format (64 bits); each of them is represented with an 8-bit exponent, followed by a sign bit, followed by a significand of respectively 23, 31, and 55 bits.
- The Bfloat16 format requires the same amount of memory (16 bits) as the IEEE 754 half-precision format, but allocates 8 bits to the exponent instead of 5, thus providing the same range as an IEEE 754 single-precision number. The tradeoff is reduced precision, as the trailing significand field is reduced from 10 to 7 bits. This format is mainly used in the training of machine learning models, where range is more valuable than precision. Many machine learning accelerators provide hardware support for this format.
- The TensorFloat-32 format combines the 8 bits of exponent of the Bfloat16 with the 10 bits of trailing significand field of half-precision formats, resulting in a size of 19 bits. This format was introduced by Nvidia, which provides hardware support for it in the Tensor Cores of its GPUs based on the Nvidia Ampere architecture. The drawback of this format is its size, which is not a power of 2. However, according to Nvidia, this format should only be used internally by hardware to speed up computations, while inputs and outputs should be stored in the 32-bit single-precision IEEE 754 format.
- The Hopper architecture GPUs provide two FP8 formats: one with the same numerical range as half-precision (E5M2) and one with higher precision, but less range (E4M3).
Representable numbers, conversion and rounding {{anchor|Representable numbers}}
By their nature, all numbers expressed in floating-point format are rational numbers with a terminating expansion in the relevant base (for example, a terminating decimal expansion in base-10, or a terminating binary expansion in base-2). Irrational numbers, such as π or √2, or non-terminating rational numbers, must be approximated. The number of digits (or bits) of precision also limits the set of rational numbers that can be represented exactly. For example, the decimal number 123456789 cannot be exactly represented if only eight decimal digits of precision are available; it would be rounded to one of the two straddling representable values, 12345678 × 10^1 or 12345679 × 10^1. The same applies to non-terminating digits (.{{overline|5}} has to be rounded to either .55555555 or .55555556).

When a number is represented in some format (such as a character string) which is not a native floating-point representation supported in a computer implementation, then it will require a conversion before it can be used in that implementation. If the number can be represented exactly in the floating-point format then the conversion is exact. If there is not an exact representation then the conversion requires a choice of which floating-point number to use to represent the original value. The representation chosen will have a different value from the original, and the value thus adjusted is called the rounded value.

Whether or not a rational number has a terminating expansion depends on the base. For example, in base-10 the number 1/2 has a terminating expansion (0.5) while the number 1/3 does not (0.333...). In base-2 only rationals with denominators that are powers of 2 (such as 1/2 or 3/16) are terminating. Any rational with a denominator that has a prime factor other than 2 will have an infinite binary expansion. This means that numbers that appear to be short and exact when written in decimal format may need to be approximated when converted to binary floating-point. For example, the decimal number 0.1 is not representable in binary floating-point of any finite precision; the exact binary representation would have a "1100" sequence continuing endlessly:
e = −4; s = 1100110011001100110011001100110011...,
where, as previously, s is the significand and e is the exponent.When rounded to 24 bits this becomes
e = −4; s = 110011001100110011001101,
which is actually 0.100000001490116119384765625 in decimal.

As a further example, the real number π, represented in binary as an infinite sequence of bits, is
11.0010010000111111011010101000100010000101101000110000100011010011...
but is
11.0010010000111111011011
when approximated by rounding to a precision of 24 bits. In binary single-precision floating-point, this is represented as s = 1.10010010000111111011011 with e = 1. This has a decimal value of
3.1415927410125732421875,
whereas a more accurate approximation of the true value of π is
3.14159265358979323846264338327950...
The result of rounding differs from the true value by about 0.03 parts per million, and matches the decimal representation of π in the first 7 digits. The difference is the discretization error and is limited by the machine epsilon.

The arithmetical difference between two consecutive representable floating-point numbers which have the same exponent is called a unit in the last place (ULP). For example, if there is no representable number lying between the representable numbers 1.45a70c22hex and 1.45a70c24hex, the ULP is 2×16^−8, or 2^−31. For numbers with a base-2 exponent part of 0, i.e. numbers with an absolute value higher than or equal to 1 but lower than 2, a ULP is exactly 2^−23 or about 10^−7 in single precision, and exactly 2^−52 or about 2×10^−16 in double precision. The mandated behavior of IEEE-compliant hardware is that the result be within one-half of a ULP.
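The ULP of 1.0 can be measured directly with the C99 nextafter functions; a short assumed sketch, with values agreeing with FLT_EPSILON and DBL_EPSILON from <float.h>:

#include <stdio.h>
#include <math.h>
#include <float.h>

int main(void) {
    /* distance from 1.0 to the next representable value up */
    printf("%g\n", nextafterf(1.0f, 2.0f) - 1.0f);  /* 1.19209e-07 = 2^-23 (FLT_EPSILON) */
    printf("%g\n", nextafter(1.0, 2.0) - 1.0);      /* 2.22045e-16 = 2^-52 (DBL_EPSILON) */
}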
Rounding modes

Rounding is used when the exact result of a floating-point operation (or a conversion to floating-point format) would need more digits than there are digits in the significand. IEEE 754 requires correct rounding: that is, the rounded result is as if infinitely precise arithmetic had been used to compute the value and then rounded (although in implementation only three extra bits are needed to ensure this). There are several different rounding schemes (or rounding modes). Historically, truncation was the typical approach. Since the introduction of IEEE 754, the default method (round to nearest, ties to even, sometimes called Banker's Rounding) is more commonly used. This method rounds the ideal (infinitely precise) result of an arithmetic operation to the nearest representable value, and gives that representation as the result. In the case of a tie, the value that would make the significand end in an even digit is chosen. The IEEE 754 standard requires the same rounding to be applied to all fundamental algebraic operations, including square root and conversions, when there is a numeric (non-NaN) result. This means that the results of IEEE 754 operations are completely determined in all bits of the result, except for the representation of NaNs. ("Library" functions such as cosine and log are not mandated.)

Alternative rounding options are also available. IEEE 754 specifies the following rounding modes:

- round to nearest, where ties round to the nearest even digit in the required position (the default and by far the most common mode)
- round to nearest, where ties round away from zero (optional for binary floating-point and commonly used in decimal)
- round up (toward +∞; negative results thus round toward zero)
- round down (toward −∞; negative results thus round away from zero)
- round toward zero (truncation; it is similar to the common behavior of float-to-integer conversions, which convert −3.9 to −3 and 3.9 to 3)
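In C99 these modes can be selected at run time through <fenv.h>; a minimal sketch, assuming the platform supports the named modes:

#include <stdio.h>
#include <fenv.h>

#pragma STDC FENV_ACCESS ON

int main(void) {
    volatile double x = 1.0, y = 3.0;  /* volatile discourages constant folding */

    fesetround(FE_DOWNWARD);           /* round toward -infinity */
    printf("%.17g\n", x / y);          /* 0.33333333333333331 */

    fesetround(FE_UPWARD);             /* round toward +infinity */
    printf("%.17g\n", x / y);          /* 0.33333333333333337 */

    fesetround(FE_TONEAREST);          /* restore the default */
}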
Binary-to-decimal conversion with minimal number of digits
Converting a double-precision binary floating-point number to a decimal string is a common operation, but an algorithm producing results that are both accurate and minimal did not appear in print until 1990, with Steele and White's Dragon4. Some of the improvements since then include:

- David M. Gay's dtoa.c, a practical open-source implementation of many ideas in Dragon4.
- Grisu3, with a 4× speedup as it removes the use of bignums. Must be used with a fallback, as it fails for ~0.5% of cases.
- Errol3, an always-succeeding algorithm similar to, but slower than, Grisu3. Apparently not as good as an early-terminating Grisu with fallback.
- Ryū, an always-succeeding algorithm that is faster and simpler than Grisu3.
- Schubfach, an always-succeeding algorithm that is based on a similar idea to Ryū, developed almost simultaneously and independently. Performs better than Ryū and Grisu3 in certain benchmarks.
Decimal-to-binary conversion
The problem of parsing a decimal string into a binary FP representation is complex, with an accurate parser not appearing until Clinger's 1990 work (implemented in dtoa.c). Further work has likewise progressed in the direction of faster parsing.

Floating-point operations
For ease of presentation and understanding, decimal radix with 7-digit precision is used in the examples, as in the IEEE 754 decimal32 format. The fundamental principles are the same in any radix or precision, except that normalization is optional (it does not affect the numerical value of the result). Here, s denotes the significand and e denotes the exponent.

Addition and subtraction
A simple method to add floating-point numbers is to first represent them with the same exponent. In the example below, the second number is shifted right by three digits, and one then proceeds with the usual addition method:
  123456.7 = 1.234567 × 10^5
  101.7654 = 1.017654 × 10^2 = 0.001017654 × 10^5

Hence:

  123456.7 + 101.7654 = (1.234567 × 10^5) + (1.017654 × 10^2)
                      = (1.234567 × 10^5) + (0.001017654 × 10^5)
                      = (1.234567 + 0.001017654) × 10^5
                      = 1.235584654 × 10^5

In detail:

    e=5;  s=1.234567     (123456.7)
  + e=2;  s=1.017654     (101.7654)

    e=5;  s=1.234567
  + e=5;  s=0.001017654  (after shifting)
  --------------------
    e=5;  s=1.235584654  (true sum: 123558.4654)

This is the true result, the exact sum of the operands. It will be rounded to seven digits and then normalized if necessary. The final result is

    e=5;  s=1.235585     (final sum: 123558.5)
The lowest three digits of the second operand (654) are essentially lost. This is round-off error. In extreme cases, the sum of two non-zero numbers may be equal to one of them:

    e=5;   s=1.234567
  + e=−3;  s=9.876543

    e=5;  s=1.234567
  + e=5;  s=0.00000009876543  (after shifting)
  ------------------
    e=5;  s=1.23456709876543  (true sum)
    e=5;  s=1.234567          (after rounding and normalization)
In the above conceptual examples it would appear that a large number of extra digits would need to be provided by the adder to ensure correct rounding; however, for binary addition or subtraction using careful implementation techniques only a guard bit, a rounding bit and one extra sticky bit need to be carried beyond the precision of the operands.{{rp|218–220}}

Another problem of loss of significance occurs when approximations to two nearly equal numbers are subtracted. In the following example, e = 5; s = 1.234571 and e = 5; s = 1.234567 are approximations to the rationals 123457.1467 and 123456.659.
    e=5;  s=1.234571
  − e=5;  s=1.234567
  ------------
    e=5;  s=0.000004
    e=−1; s=4.000000  (after rounding and normalization)
The floating-point difference is computed exactly because the numbers are close; the Sterbenz lemma guarantees this, even in case of underflow when gradual underflow is supported. Despite this, the difference of the original numbers is e = −1; s = 4.877000, which differs more than 20% from the difference e = −1; s = 4.000000 of the approximations. In extreme cases, all significant digits of precision can be lost. This cancellation illustrates the danger in assuming that all of the digits of a computed result are meaningful. Dealing with the consequences of these errors is a topic in numerical analysis; see also Accuracy problems.
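Both effects are easy to reproduce in binary single precision; a small assumed C sketch, not part of the original article:

#include <stdio.h>

int main(void) {
    /* Absorption: the smaller addend is entirely lost */
    float big = 1.0e8f, small = 1.0f;
    printf("%d\n", big + small == big);  /* prints 1: the ULP at 1e8 is 8 */

    /* Cancellation: subtracting nearly equal approximations leaves
       only the inaccurate low-order digits */
    float a = 123457.1467f, b = 123456.659f;
    printf("%.6f\n", (double)(a - b));   /* prints 0.492188; the true difference is 0.4877 */
}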
Multiplication and division
To multiply, the significands are multiplied while the exponents are added, and the result is rounded and normalized.
    e=3;  s=4.734612
  × e=5;  s=5.417242
  -------------------
    e=8;  s=25.648538980104  (true product)
    e=8;  s=25.64854         (after rounding)
    e=9;  s=2.564854         (after normalization)
Similarly, division is accomplished by subtracting the divisor's exponent from the dividend's exponent, and dividing the dividend's significand by the divisor's significand.

There are no cancellation or absorption problems with multiplication or division, though small errors may accumulate as operations are performed in succession. In practice, the way these operations are carried out in digital logic can be quite complex (see Booth's multiplication algorithm and Division algorithm). For a fast, simple method, see the Horner method.
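The same split of a product into significand and exponent can be observed in binary with C's frexp and ldexp (an assumed illustration; base 2 here rather than the decimal base used above):

#include <stdio.h>
#include <math.h>

int main(void) {
    int ea, eb;
    double sa = frexp(4734.612, &ea);   /* 4734.612 = sa × 2^ea, sa in [0.5, 1) */
    double sb = frexp(541724.2, &eb);   /* 541724.2 = sb × 2^eb */

    /* significands multiply, exponents add (before normalization) */
    printf("s = %.15g, e = %d\n", sa * sb, ea + eb);
    printf("%.12g\n", ldexp(sa * sb, ea + eb));  /* 2564853898.01 */
}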
Literal syntax
Literals for floating-point numbers depend on the language. They typically use e or E to denote scientific notation. The C programming language and the IEEE 754 standard also define a hexadecimal literal syntax with a base-2 exponent instead of 10. In languages like C, when the decimal exponent is omitted, a decimal point is needed to differentiate them from integers. Other languages do not have an integer type (such as JavaScript), or allow overloading of numeric types (such as Haskell). In these cases, digit strings such as 123 may also be floating-point literals.

Examples of floating-point literals are:

- 99.9
- -5000.12
- 6.02e23
- -3e-45
- 0x1.fffffep+127 in C and IEEE 754
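For instance, in C the hexadecimal literal above denotes the largest finite binary32 value; a small assumed sketch:

#include <stdio.h>
#include <float.h>

int main(void) {
    double c = 6.02e23;             /* decimal exponent: 6.02 × 10^23 */
    float  m = 0x1.fffffep+127f;    /* C99 hex literal: (2 − 2^-23) × 2^127 */
    printf("%d\n", m == FLT_MAX);   /* prints 1 */
    printf("%g\n", c);
}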
Dealing with exceptional cases {{anchor|Floating point exception|Exception handling}}
{{further|IEEE 754#Exception handling}}
Floating-point computation in a computer can run into three kinds of problems:

- An operation can be mathematically undefined, such as ∞/∞, or division by zero.
- An operation can be legal in principle, but not supported by the specific format, for example, calculating the square root of −1 or the inverse sine of 2 (both of which result in complex numbers).
- An operation can be legal in principle, but the result can be impossible to represent in the specified format, because the exponent is too large or too small to encode in the exponent field. Such an event is called an overflow (exponent too large), underflow (exponent too small) or denormalization (precision loss).

By default, an IEEE 754 exception is recorded by raising a status flag while the computation continues with a defined default result. The standard defines the following five exceptions:
- inexact, set if the rounded (and returned) value is different from the mathematically exact result of the operation.
- underflow, set if the rounded value is tiny (as specified in IEEE 754) and inexact (or perhaps limited to cases with denormalization loss, as per the 1985 version of IEEE 754), returning a subnormal value, including the zeros.
- overflow, set if the absolute value of the rounded value is too large to be represented. An infinity or maximal finite value is returned, depending on which rounding is used.
- divide-by-zero, set if the result is infinite given finite operands, returning an infinity, either +∞ or −∞.
- invalid, set if a real-valued result cannot be returned, e.g. sqrt(−1) or 0/0, returning a quiet NaN.
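In C, these flags can be inspected through <fenv.h>; a hedged sketch assuming the default, non-trapping behavior:

#include <stdio.h>
#include <fenv.h>
#include <math.h>

#pragma STDC FENV_ACCESS ON

int main(void) {
    feclearexcept(FE_ALL_EXCEPT);
    volatile double zero = 0.0;

    double q = 1.0 / zero;    /* raises divide-by-zero, returns +inf */
    double r = sqrt(-1.0);    /* raises invalid, returns a quiet NaN */

    printf("%d %d\n", !!fetestexcept(FE_DIVBYZERO),
                      !!fetestexcept(FE_INVALID));   /* prints 1 1 */
    printf("%g %g\n", q, r);                         /* inf and nan */
}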
Accuracy problems
The fact that floating-point numbers cannot accurately represent all real numbers, and that floating-point operations cannot accurately represent true arithmetic operations, leads to many surprising situations. This is related to the finite precision with which computers generally represent numbers.

For example, the decimal numbers 0.1 and 0.01 cannot be represented exactly as binary floating-point numbers. In the IEEE 754 binary32 format with its 24-bit significand, the result of attempting to square the approximation to 0.1 is neither 0.01 nor the representable number closest to it. The decimal number 0.1 is represented in binary as {{math|1={{var|e}} = −4}}; {{math|1={{var|s}} = 110011001100110011001101}}, which is

{{block indent|1=0.100000001490116119384765625 exactly.}}

Squaring this number gives

{{block indent|1=0.010000000298023226097399174250313080847263336181640625 exactly.}}

Squaring it with rounding to the 24-bit precision gives

{{block indent|1=0.010000000707805156707763671875 exactly.}}

But the representable number closest to 0.01 is

{{block indent|1=0.009999999776482582092285156250 exactly.}}

Also, the non-representability of π (and π/2) means that an attempted computation of tan(π/2) will not yield a result of infinity, nor will it even overflow in the usual floating-point formats (assuming an accurate implementation of tan). It is simply not possible for standard floating-point hardware to attempt to compute tan(π/2), because π/2 cannot be represented exactly. This computation in C:

/* Enough digits to be sure we get the correct approximation. */
double pi = 3.1415926535897932384626433832795;
double z = tan(pi/2.0);

will give a result of 16331239353195370.0. In single precision (using the tanf function), the result will be −22877332.0.

By the same token, an attempted computation of sin(π) will not yield zero. The result will be (approximately) 0.1225{{e|−15}} in double precision, or −0.8742{{e|−7}} in single precision.

While floating-point addition and multiplication are both commutative ({{math|{{var|a}} + {{var|b}} {{=}} {{var|b}} + {{var|a}}}} and {{math|{{var|a}} × {{var|b}} {{=}} {{var|b}} × {{var|a}}}}), they are not necessarily associative. That is, {{math|({{var|a}} + {{var|b}}) + {{var|c}}}} is not necessarily equal to {{math|{{var|a}} + ({{var|b}} + {{var|c}})}}. Using 7-digit significand decimal arithmetic:
a = 1234.567, b = 45.67834, c = 0.0004

(a + b) + c:

    1234.567    (a)
  +   45.67834  (b)
  ____________
    1280.24534  rounds to 1280.245

    1280.245    (a + b)
  +    0.0004   (c)
  ____________
    1280.2454   rounds to 1280.245  ← (a + b) + c

a + (b + c):

      45.67834  (b)
  +    0.0004   (c)
  ____________
      45.67874

    1234.567    (a)
  +   45.67874  (b + c)
  ____________
    1280.24574  rounds to 1280.246  ← a + (b + c)
They are also not necessarily distributive. That is, {{math|({{var|a}} + {{var|b}}) × {{var|c}}}} may not be the same as {{math|{{var|a}} × {{var|c}} + {{var|b}} × {{var|c}}}}:
  1234.567 × 3.333333 = 4115.223
  1.234567 × 3.333333 = 4.115223
  4115.223 + 4.115223 = 4119.338

but

  1234.567 + 1.234567 = 1235.802
  1235.802 × 3.333333 = 4119.340
In addition to loss of significance, inability to represent numbers such as π and 0.1 exactly, and other slight inaccuracies, further phenomena such as cancellation, absorption, and unreliable equality testing may occur, as discussed elsewhere in this article.
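The same non-associativity appears in binary arithmetic; an assumed C sketch:

#include <stdio.h>

int main(void) {
    double a = 0.1, b = 0.2, c = 0.3;
    printf("%.17g\n", (a + b) + c);              /* 0.60000000000000009 */
    printf("%.17g\n", a + (b + c));              /* 0.59999999999999998 */
    printf("%d\n", (a + b) + c == a + (b + c));  /* prints 0 */
}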
Incidents
- On 25 February 1991, a loss of significance in a MIM-104 Patriot missile battery prevented it from intercepting an incoming Scud missile in Dhahran, Saudi Arabia, contributing to the death of 28 soldiers from the U.S. Army’s 14th Quartermaster Detachment.
Machine precision and backward error analysis
Machine precision is a quantity that characterizes the accuracy of a floating-point system, and is used in backward error analysis of floating-point algorithms. It is also known as unit roundoff or machine epsilon. Usually denoted {{math|{{var|ε}}mach}}, its value depends on the particular rounding being used.

With rounding to zero,

\epsilon_\text{mach} = B^{1-P},

whereas with rounding to nearest,

\epsilon_\text{mach} = \tfrac{1}{2} B^{1-P},

where B is the base of the system and P is the precision of the significand (in base B).

This is important since it bounds the relative error in representing any non-zero real number {{math|{{var|x}}}} within the normalized range of a floating-point system:

\left| \frac{\operatorname{fl}(x) - x}{x} \right| \le \epsilon_\text{mach}.

Backward error analysis, the theory of which was developed and popularized by James H. Wilkinson, can be used to establish that an algorithm implementing a numerical function is numerically stable. The basic approach is to show that although the calculated result, due to roundoff errors, will not be exactly correct, it is the exact solution to a nearby problem with slightly perturbed input data. If the perturbation required is small, on the order of the uncertainty in the input data, then the results are in some sense as accurate as the data "deserves". The algorithm is then defined as backward stable. Stability is a measure of the sensitivity to rounding errors of a given numerical procedure; by contrast, the condition number of a function for a given problem indicates the inherent sensitivity of the function to small perturbations in its input and is independent of the implementation used to solve the problem.

As a trivial example, consider a simple expression giving the inner product of (length two) vectors x and y, then

\begin{align}
\operatorname{fl}(x \cdot y) &= \operatorname{fl}\big(\operatorname{fl}(x_1 \cdot y_1) + \operatorname{fl}(x_2 \cdot y_2)\big), && \text{where } \operatorname{fl}() \text{ indicates correctly rounded floating-point arithmetic} \\
&= \operatorname{fl}\big((x_1 \cdot y_1)(1 + \delta_1) + (x_2 \cdot y_2)(1 + \delta_2)\big), && \text{where } \delta_n \leq \epsilon_\text{mach}, \text{ from above} \\
&= \big((x_1 \cdot y_1)(1 + \delta_1) + (x_2 \cdot y_2)(1 + \delta_2)\big)(1 + \delta_3) \\
&= (x_1 \cdot y_1)(1 + \delta_1)(1 + \delta_3) + (x_2 \cdot y_2)(1 + \delta_2)(1 + \delta_3),
\end{align}

and so

\operatorname{fl}(x \cdot y) = \hat{x} \cdot \hat{y},

where

\begin{align}
\hat{x}_1 &= x_1(1 + \delta_1); & \hat{x}_2 &= x_2(1 + \delta_2); \\
\hat{y}_1 &= y_1(1 + \delta_3); & \hat{y}_2 &= y_2(1 + \delta_3),
\end{align}

where

\delta_n \leq \epsilon_\text{mach}

by definition, which is the inner product of two slightly perturbed (on the order of ε_mach) input vectors, and so is backward stable. For more realistic examples in numerical linear algebra, see Higham 2002 and other references below.
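A classic assumed sketch for observing ε_mach of binary64 in C (on hardware that evaluates intermediates in extended precision, volatile helps the loop match DBL_EPSILON):

#include <stdio.h>
#include <float.h>

int main(void) {
    volatile double eps = 1.0;
    while (1.0 + eps / 2.0 > 1.0)   /* halve until 1 + eps/2 rounds to 1 */
        eps /= 2.0;
    printf("%g %g\n", eps, DBL_EPSILON);  /* both 2.22045e-16 = 2^-52 */
    /* the unit roundoff for round-to-nearest is eps/2 = 2^-53, as above */
}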
Minimizing the effect of accuracy problems
Although individual arithmetic operations of IEEE 754 are guaranteed accurate to within half a ULP, more complicated formulae can suffer from larger errors for a variety of reasons. The loss of accuracy can be substantial if a problem or its data are ill-conditioned, meaning that the correct result is hypersensitive to tiny perturbations in its data. However, even functions that are well-conditioned can suffer from large loss of accuracy if an algorithm numerically unstable for that data is used: apparently equivalent formulations of expressions in a programming language can differ markedly in their numerical stability. One approach to remove the risk of such loss of accuracy is the design and analysis of numerically stable algorithms, which is an aim of the branch of mathematics known as numerical analysis. Another approach that can protect against the risk of numerical instabilities is the computation of intermediate (scratch) values in an algorithm at a higher precision than the final result requires, which can remove, or reduce by orders of magnitude, such risk: IEEE 754 quadruple precision and extended precision are designed for this purpose when computing at double precision.

For example, the following algorithm is a direct implementation to compute the function {{math|{{var|A}}({{var|x}}) {{=}} ({{var|x}}−1) / (exp({{var|x}}−1) − 1)}}, which is well-conditioned at 1.0; however, it can be shown to be numerically unstable and lose up to half the significant digits carried by the arithmetic when computed near 1.0.

double A(double X)
{
    double Y, Z;  // [1]
    Y = X - 1.0;
    Z = exp(Y);
    if (Z != 1.0)
        Z = Y / (Z - 1.0);  // [2]
    return Z;
}
If, however, intermediate computations are all performed in extended precision (e.g. by setting line [1] to C99 {{code|long double}}), then up to full precision in the final double result can be maintained. Alternatively, a numerical analysis of the algorithm reveals that if the following non-obvious change to line [2] is made:

Z = log(Z) / (Z - 1.0);

then the algorithm becomes numerically stable and can compute to full double precision.

To maintain the properties of such carefully constructed numerically stable programs, careful handling by the compiler is required. Certain "optimizations" that compilers might make (for example, reordering operations) can work against the goals of well-behaved software. There is some controversy about the failings of compilers and language designs in this area: C99 is an example of a language where such optimizations are carefully specified to maintain numerical precision. See the external references at the bottom of this article.

A detailed treatment of the techniques for writing high-quality floating-point software is beyond the scope of this article, and the reader is referred to the references at the bottom of this article. Kahan suggests several rules of thumb that can substantially decrease by orders of magnitude the risk of numerical anomalies, in addition to, or in lieu of, a more careful numerical analysis. These include: as noted above, computing all expressions and intermediate results in the highest precision supported in hardware (a common rule of thumb is to carry twice the precision of the desired result, i.e. compute in double precision for a final single-precision result, or in double extended or quad precision for up to double-precision results); and rounding input data and results to only the precision required and supported by the input data (carrying excess precision in the final result beyond that required and supported by the input data can be misleading, increases storage cost and decreases speed, and the excess bits can affect convergence of numerical procedures: notably, the first form of the iterative example given below converges correctly when using this rule of thumb). Brief descriptions of several additional issues and techniques follow.

As decimal fractions can often not be exactly represented in binary floating-point, such arithmetic is at its best when it is simply being used to measure real-world quantities over a wide range of scales (such as the orbital period of a moon around Saturn or the mass of a proton), and at its worst when it is expected to model the interactions of quantities expressed as decimal strings that are expected to be exact. An example of the latter case is financial calculations. For this reason, financial software tends not to use a binary floating-point number representation. The "decimal" data type of the C# and Python programming languages, and the decimal formats of the IEEE 754-2008 standard, are designed to avoid the problems of binary floating-point representations when applied to human-entered exact decimal values, and make the arithmetic always behave as expected when numbers are printed in decimal.

Expectations from mathematics may not be realized in the field of floating-point computation. For example, it is known that (x + y)(x − y) = x² − y², and that sin²θ + cos²θ = 1; however, these facts cannot be relied on when the quantities involved are the result of floating-point computation.

The use of the equality test (if (x==y) ...) requires care when dealing with floating-point numbers. Even simple expressions like 0.6/0.2-3==0 will, on most computers, fail to be true (in IEEE 754 double precision, for example, 0.6/0.2 - 3 is approximately equal to -4.44089209850063e-16). Consequently, such tests are sometimes replaced with "fuzzy" comparisons (if (abs(x-y) < epsilon) ..., where epsilon is sufficiently small and tailored to the application).
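One possible shape for such a fuzzy comparison (the helper name and tolerance policy are illustrative, not a standard API):

#include <math.h>
#include <stdbool.h>

/* Treat x and y as equal when they differ by less than a tolerance
   scaled to their magnitude; choosing the tolerances needs analysis. */
bool approx_equal(double x, double y, double rel_tol, double abs_tol) {
    double scale = fmax(fabs(x), fabs(y));
    return fabs(x - y) <= fmax(rel_tol * scale, abs_tol);
}

/* approx_equal(0.6/0.2, 3.0, 1e-12, 0.0) is true, although
   0.6/0.2 == 3.0 is false in IEEE 754 double precision. */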
Another consequence of round-off is absorption when a small number is added to a much larger one. Using the 7-digit decimal arithmetic:

    3253.671
  +    3.141276
  -----------
    3256.812
The low three digits of the second addend (276) are effectively lost. Suppose, for example, that one needs to add many numbers, all approximately equal to 3. After 1000 of them have been added, the running sum is about 3000; the lost digits are not regained. The Kahan summation algorithm may be used to reduce the errors.

Round-off error can affect the convergence and accuracy of iterative numerical procedures. As an example, Archimedes approximated π by calculating the perimeters of polygons inscribing and circumscribing a circle, starting with hexagons, and successively doubling the number of sides. As noted above, computations may be rearranged in a way that is mathematically equivalent but less prone to error (numerical analysis). Two forms of the recurrence formula for the circumscribed polygon are:{{citation needed|reason=Not obvious formulas|date=June 2016}}
- t_0 = \frac{1}{\sqrt{3}}
- First form: t_{i+1} = \frac{\sqrt{t_i^2+1}-1}{t_i}
- Second form: t_{i+1} = \frac{t_i}{\sqrt{t_i^2+1}+1}
- \pi \sim 6 \times 2^i \times t_i, converging as i \rightarrow \infty
 i   6 × 2^i × t_i, first form     6 × 2^i × t_i, second form
--------------------------------------------------------------
0 {{Fontcolor|purple|3}}.4641016151377543863 {{Fontcolor|purple|3}}.4641016151377543863
1 {{Fontcolor|purple|3}}.2153903091734710173 {{Fontcolor|purple|3}}.2153903091734723496
2 {{Fontcolor|purple|3.1}}596599420974940120 {{Fontcolor|purple|3.1}}596599420975006733
3 {{Fontcolor|purple|3.14}}60862151314012979 {{Fontcolor|purple|3.14}}60862151314352708
4 {{Fontcolor|purple|3.14}}27145996453136334 {{Fontcolor|purple|3.14}}27145996453689225
5 {{Fontcolor|purple|3.141}}8730499801259536 {{Fontcolor|purple|3.141}}8730499798241950
6 {{Fontcolor|purple|3.141}}6627470548084133 {{Fontcolor|purple|3.141}}6627470568494473
7 {{Fontcolor|purple|3.141}}6101765997805905 {{Fontcolor|purple|3.141}}6101766046906629
8 {{Fontcolor|purple|3.14159}}70343230776862 {{Fontcolor|purple|3.14159}}70343215275928
9 {{Fontcolor|purple|3.14159}}37488171150615 {{Fontcolor|purple|3.14159}}37487713536668
10 {{Fontcolor|purple|3.141592}}9278733740748 {{Fontcolor|purple|3.141592}}9273850979885
11 {{Fontcolor|purple|3.141592}}7256228504127 {{Fontcolor|purple|3.141592}}7220386148377
12 {{Fontcolor|purple|3.1415926}}717412858693 {{Fontcolor|purple|3.1415926}}707019992125
13 {{Fontcolor|purple|3.1415926}}189011456060 {{Fontcolor|purple|3.14159265}}78678454728
14 {{Fontcolor|purple|3.1415926}}717412858693 {{Fontcolor|purple|3.14159265}}46593073709
15 {{Fontcolor|purple|3.14159}}19358822321783 {{Fontcolor|purple|3.141592653}}8571730119
16 {{Fontcolor|purple|3.1415926}}717412858693 {{Fontcolor|purple|3.141592653}}6566394222
17 {{Fontcolor|purple|3.1415}}810075796233302 {{Fontcolor|purple|3.141592653}}6065061913
18 {{Fontcolor|purple|3.1415926}}717412858693 {{Fontcolor|purple|3.1415926535}}939728836
19 {{Fontcolor|purple|3.141}}4061547378810956 {{Fontcolor|purple|3.1415926535}}908393901
20 {{Fontcolor|purple|3.14}}05434924008406305 {{Fontcolor|purple|3.1415926535}}900560168
21 {{Fontcolor|purple|3.14}}00068646912273617 {{Fontcolor|purple|3.141592653589}}8608396
22 {{Fontcolor|purple|3.1}}349453756585929919 {{Fontcolor|purple|3.141592653589}}8122118
23 {{Fontcolor|purple|3.14}}00068646912273617 {{Fontcolor|purple|3.14159265358979}}95552
24 {{Fontcolor|purple|3}}.2245152435345525443 {{Fontcolor|purple|3.14159265358979}}68907
25 {{Fontcolor|purple|3.14159265358979}}62246
26 {{Fontcolor|purple|3.14159265358979}}62246
27 {{Fontcolor|purple|3.14159265358979}}62246
28 {{Fontcolor|purple|3.14159265358979}}62246
The true value is {{Fontcolor|purple|3.14159265358979323846264338327...}}
While the two forms of the recurrence formula are clearly mathematically equivalent, the first subtracts 1 from a number extremely close to 1, leading to an increasingly problematic loss of significant digits. As the recurrence is applied repeatedly, the accuracy improves at first, but then it deteriorates. It never gets better than about 8 digits, even though 53-bit arithmetic should be capable of about 16 digits of precision. When the second form of the recurrence is used, the value converges to 15 digits of precision.
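The table can be reproduced with a short C program implementing both recurrences in double precision (an assumed sketch under the formulas as given; output matches the table up to platform rounding details):

#include <stdio.h>
#include <math.h>

int main(void) {
    double t1 = 1.0 / sqrt(3.0);  /* t_0, first form  */
    double t2 = t1;               /* t_0, second form */
    for (int i = 1; i <= 28; i++) {
        t1 = (sqrt(t1 * t1 + 1.0) - 1.0) / t1;  /* cancellation-prone */
        t2 = t2 / (sqrt(t2 * t2 + 1.0) + 1.0);  /* numerically stable */
        printf("%2d  %.19f  %.19f\n", i, 6.0 * ldexp(1.0, i) * t1,
                                         6.0 * ldexp(1.0, i) * t2);
    }
}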
“Fast math” optimization
The aforementioned lack of associativity of floating-point operations in general means that compilers cannot as effectively reorder arithmetic expressions as they could with integer and fixed-point arithmetic, presenting a roadblock in optimizations such as common subexpression elimination and auto-vectorization. The "fast math" option on many compilers (ICC, GCC, Clang, MSVC...) turns on reassociation along with unsafe assumptions such as a lack of NaN and infinite numbers in IEEE 754. Some compilers also offer more granular options to only turn on reassociation. In either case, the programmer is exposed to many of the precision pitfalls mentioned above for the portion of the program using "fast" math.

In some compilers (GCC and Clang), turning on "fast" math may cause the program to disable subnormal floats at startup, affecting the floating-point behavior of not only the generated code, but also any program using such code as a library.

In most Fortran compilers, as allowed by the ISO/IEC 1539-1:2004 Fortran standard, reassociation is the default, with breakage largely prevented by the "protect parens" setting (also on by default). This setting stops the compiler from reassociating beyond the boundaries of parentheses. Intel Fortran Compiler is a notable outlier.

A common problem in "fast" math is that subexpressions may not be optimized identically from place to place, leading to unexpected differences. One interpretation of the issue is that "fast" math as implemented currently has a poorly defined semantics. One attempt at formalizing "fast" math optimizations is seen in Icing, a verified compiler.
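A cautionary assumed example: with GCC or Clang, -ffast-math enables finite-math-only behavior, under which a NaN self-test may be folded away:

#include <stdio.h>

int main(void) {
    volatile double zero = 0.0;
    double x = zero / zero;   /* NaN at run time */
    /* Prints 1 under default options; may print 0 with -ffast-math,
       because the compiler assumes x != x is always false. */
    printf("%d\n", x != x);
}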
See also

{{div col|colwidth=20em}}
- Arbitrary-precision arithmetic
- C99 for code examples demonstrating access and use of IEEE 754 features.
- Computable number
- Coprocessor
- Decimal floating point
- Double precision
- Experimental mathematics – utilizes high-precision floating-point computations
- Fixed-point arithmetic
- Floating-point error mitigation
- FLOPS
- Gal’s accurate tables
- GNU MPFR
- Half-precision floating-point format
- IEEE 754 – Standard for Binary Floating-Point Arithmetic
- IBM Floating Point Architecture
- Kahan summation algorithm
- Microsoft Binary Format (MBF)
- Minifloat
- Q (number format) for constant resolution
- Quadruple-precision floating-point format (including double-double)
- Significant figures
- Single-precision floating-point format
Notes
References
Further reading
- Wilkinson, James Hardy (1963). ''Rounding Errors in Algebraic Processes'' (1st ed.). Englewood Cliffs, NJ: Prentice-Hall. ISBN 9780486679990. (NB. Classic influential treatise on floating-point arithmetic.)
- Wilkinson, James Hardy (1965). ''The Algebraic Eigenvalue Problem''. Monographs on Numerical Analysis (1st ed.). Oxford University Press / Clarendon Press. ISBN 9780198534037.
- Sterbenz, Pat H. (1974). ''Floating-Point Computation''. Prentice-Hall Series in Automatic Computation (1st ed.). Englewood Cliffs, NJ: Prentice-Hall. ISBN 978-0-13-322495-5.
- Golub, Gene F.; van Loan, Charles F. (1986). ''Matrix Computations'' (3rd ed.). Johns Hopkins University Press. ISBN 978-0-8018-5413-2.
- Press, William Henry; Teukolsky, Saul A.; Vetterling, William T.; Flannery, Brian P. (2007). ''Numerical Recipes: The Art of Scientific Computing'' (3rd ed.). Cambridge University Press. ISBN 978-0-521-88407-5. (NB. Edition with source code CD-ROM.)
- Knuth, Donald Ervin (1997). ''The Art of Computer Programming, Vol. 2: Seminumerical Algorithms'' (3rd ed.). Addison-Wesley. Section 4.2: Floating-Point Arithmetic, pp. 214–264. ISBN 978-0-201-89684-8.
- Blaauw, Gerrit Anne; Brooks, Frederick Phillips Jr. (1997). ''Computer Architecture: Concepts and Evolution'' (1st ed.). Addison-Wesley. ISBN 0-201-10557-8. (1213 pages) (NB. This is a single-volume edition; this work was also available in a two-volume version.)
- Savard, John J. G. (2018) [2005]. "Floating-Point Formats". quadibloc. http://www.quadibloc.com/comp/cp0201.htm
- Muller, Jean-Michel; Brunie, Nicolas; de Dinechin, Florent; Jeannerod, Claude-Pierre; Joldes, Mioara; Lefèvre, Vincent; Melquiond, Guillaume; Revol, Nathalie; Torres, Serge (2018). ''Handbook of Floating-Point Arithmetic'' (2nd ed.). Birkhäuser. doi:10.1007/978-3-319-76526-6. ISBN 978-3-319-76525-9.
External links
- "Survey of Floating-Point Formats". http://www.mrob.com/pub/math/floatformats.html (NB. This page gives a very brief summary of floating-point formats that have been used over the years.)
- Monniaux, David (May 2008). "The pitfalls of verifying floating-point computations". ACM Transactions on Programming Languages and Systems. 30 (3): 1–41. doi:10.1145/1353445.1353446. https://hal.science/hal-00128124/en/ (NB. A compendium of non-intuitive behaviors of floating point on popular architectures, with implications for program verification and testing.)
- OpenCores. (NB. This website contains open source floating-point IP cores for the implementation of floating-point operators in FPGA or ASIC devices. The project double_fpu contains verilog source code of a double-precision floating-point unit. The project fpuvhdl contains vhdl source code of a single-precision floating-point unit.)
- Fleegal, Eric (2004). "Microsoft Visual C++ Floating-Point Optimization". MSDN. https://msdn.microsoft.com/en-us/library/aa289157(v=vs.71).aspx