Is floating point math broken

Question

Consider the following code    0 1   0 2    0 3  - gt   false   0 1   0 2         - gt   0 30000000000000004   Why do these inaccuracies happen

User · Answer

It s actually pretty simple  When you have a base 10 system  like ours   it can only express fractions that use a prime factor of the base  The prime factors of 10 are 2 and 5  So 1 2  1 4  1 5  1 8  and 1 10 can all be expressed cleanly because the denominators all use prime factors of 10  In contrast  1 3  1 6  and 1 7 are all repeating decimals because their denominators use a prime factor of 3 or 7  In binary  or base 2   the only prime factor is 2  So you can only express fractions cleanly which only contain 2 as a prime factor  In binary  1 2  1 4  1 8 would all be expressed cleanly as decimals  While  1 5 or 1 10 would be repeating decimals  So 0 1 and 0 2  1 10 and 1 5  while clean decimals in a base 10 system  are repeating decimals in the base 2 system the computer is operating in  When you do math on these repeating decimals  you end up with leftovers which carry over when you convert the computer s base 2  binary  number into a more human readable base 10 number    From https   0 30000000000000004 com

User · Answer

Given that nobody has mentioned this     Some high level languages such as Python and Java come with tools to overcome binary floating point limitations  For example    Python s decimal module and Java s BigDecimal class  that represent numbers internally with decimal notation  as opposed to binary notation   Both have limited precision  so they are still error prone  however they solve most common problems with binary floating point arithmetic   Decimals are very nice when dealing with money  ten cents plus twenty cents are always exactly thirty cents    gt  gt  gt  0 1   0 2    0 3 False  gt  gt  gt  Decimal  0 1     Decimal  0 2      Decimal  0 3   True   Python s decimal module is based on IEEE standard 854-1987  Python s fractions module and Apache Common s BigFraction class  Both represent rational numbers as  numerator  denominator  pairs and they may give more accurate results than decimal floating point arithmetic    Neither of these solutions is perfect  especially if we look at performances  or if we require a very high precision   but still they solve a great number of problems with binary floating point arithmetic

User · Answer

Imagine working in base ten with  say  8 digits of accuracy   You check whether   1 3   2   3    1   and learn that this returns false   Why   Well  as real numbers we have  1 3   0 333     and 2 3   0 666      Truncating at eight decimal places  we get  0 33333333   0 66666666   0 99999999   which is  of course  different from 1 00000000 by exactly 0 00000001     The situation for binary numbers with a fixed number of bits is exactly analogous  As real numbers  we have  1 10   0 0001100110011001100     base 2   and  1 5   0 0011001100110011001     base 2   If we truncated these to  say  seven bits  then we d get  0 0001100   0 0011001   0 0100101   while on the other hand   3 10   0 01001100110011     base 2   which  truncated to seven bits  is 0 0100110  and these differ by exactly 0 0000001     The exact situation is slightly more subtle because these numbers are typically stored in scientific notation   So  for instance  instead of storing 1 10 as 0 0001100 we may store it as something like 1 10011   2 -4  depending on how many bits we ve allocated for the exponent and the mantissa   This affects how many digits of precision you get for your calculations   The upshot is that because of these rounding errors you essentially never want to use    on floating-point numbers   Instead  you can check if the absolute value of their difference is smaller than some fixed small number

User · Answer

In addition to the other correct answers  you may want to consider scaling your values to avoid problems with floating-point arithmetic    For example    var result   1 0   2 0         result     3 0 returns true       instead of   var result   0 1   0 2         result     0 3 returns false   The expression 0 1   0 2     0 3 returns false in JavaScript  but fortunately integer arithmetic in floating-point is exact  so decimal representation errors can be avoided by scaling   As a practical example  to avoid floating-point problems where accuracy is paramount  it is recommended1 to handle money as an integer representing the number of cents  2550 cents instead of 25 50 dollars      1 Douglas Crockford  JavaScript  The Good Parts  Appendix A - Awful Parts  page 105

User · Answer

No  not broken  but most decimal fractions must be approximated     Summary   Floating point arithmetic is exact  unfortunately  it doesn t match up well with our usual base-10 number representation  so it turns out we are often giving it input that is slightly off from what we wrote   Even simple numbers like 0 01  0 02  0 03  0 04     0 24 are not representable exactly as binary fractions  If you count up 0 01   02   03      not until you get to 0 25 will you get the first fraction representable in base2   If you tried that using FP  your 0 01 would have been slightly off  so the only way to add 25 of them up to a nice exact 0 25 would have required a long chain of causality involving guard bits and rounding  It s hard to predict so we throw up our hands and say  FP is inexact   but that s not really true    We constantly give the FP hardware something that seems simple in base 10 but is a repeating fraction in base 2      How did this happen    When we write in decimal  every fraction  specifically  every terminating decimal  is a rational number of the form    nbsp  nbsp  nbsp  nbsp  nbsp  nbsp  nbsp  nbsp  nbsp  nbsp  a    2n x 5m   In binary  we only get the 2n term  that is    nbsp  nbsp  nbsp  nbsp  nbsp  nbsp  nbsp  nbsp  nbsp  nbsp  a   2n  So in decimal  we can t represent 1 3  Because base 10 includes 2 as a prime factor  every number we can write as a binary fraction also can be written as a base 10 fraction  However  hardly anything we write as a base10 fraction is representable in binary  In the range from 0 01  0 02  0 03     0 99  only three numbers can be represented in our FP format  0 25  0 50  and 0 75  because they are 1 4  1 2  and 3 4  all numbers with a prime factor using only the 2n term   In base10 we can t represent 1 3  But in binary  we can t do 1 10 or 1 3   So while every binary fraction can be written in decimal  the reverse is not true  And in fact most decimal fractions repeat in binary      Dealing with it   Developers are usually instructed to do  lt  epsilon comparisons  better advice might be to round to integral values  in the C library  round   and roundf    i e   stay in the FP format  and then compare  Rounding to a specific decimal fraction length solves most problems with output   Also  on real number-crunching problems  the problems that FP was invented for on early  frightfully expensive computers  the physical constants of the universe and all other measurements are only known to a relatively small number of significant figures  so the entire problem space was  inexact  anyway  FP  accuracy  isn t a problem in this kind of application   The whole issue really arises when people try to use FP for bean counting  It does work for that  but only if you stick to integral values  which kind of defeats the point of using it  This is why we have all those decimal fraction software libraries   I love the Pizza answer by Chris  because it describes the actual problem  not just the usual handwaving about  inaccuracy   If FP were simply  inaccurate   we could fix that and would have done it decades ago  The reason we haven t is because the FP format is compact and fast and it s the best way to crunch a lot of numbers  Also  it s a legacy from the space age and arms race and early attempts to solve big problems with very slow computers using small memory systems   Sometimes  individual magnetic cores for 1-bit storage  but that s another story       Conclusion   If you are just counting beans at a bank  software solutions that use decimal string representations in the first place work perfectly well  But you can t do quantum chromodynamics or aerodynamics that way

User · Answer

Most answers here address this question in very dry  technical terms  I d like to address this in terms that normal human beings can understand   Imagine that you are trying to slice up pizzas  You have a robotic pizza cutter that can cut pizza slices exactly in half  It can halve a whole pizza  or it can halve an existing slice  but in any case  the halving is always exact   That pizza cutter has very fine movements  and if you start with a whole pizza  then halve that  and continue halving the smallest slice each time  you can do the halving 53 times before the slice is too small for even its high-precision abilities  At that point  you can no longer halve that very thin slice  but must either include or exclude it as is   Now  how would you piece all the slices in such a way that would add up to one-tenth  0 1  or one-fifth  0 2  of a pizza  Really think about it  and try working it out  You can even try to use a real pizza  if you have a mythical precision pizza cutter at hand   -     Most experienced programmers  of course  know the real answer  which is that there is no way to piece together an exact tenth or fifth of the pizza using those slices  no matter how finely you slice them  You can do a pretty good approximation  and if you add up the approximation of 0 1 with the approximation of 0 2  you get a pretty good approximation of 0 3  but it s still just that  an approximation   For double-precision numbers  which is the precision that allows you to halve your pizza 53 times   the numbers immediately less and greater than 0 1 are 0 09999999999999999167332731531132594682276248931884765625 and 0 1000000000000000055511151231257827021181583404541015625  The latter is quite a bit closer to 0 1 than the former  so a numeric parser will  given an input of 0 1  favour the latter    The difference between those two numbers is the  smallest slice  that we must decide to either include  which introduces an upward bias  or exclude  which introduces a downward bias  The technical term for that smallest slice is an ulp    In the case of 0 2  the numbers are all the same  just scaled up by a factor of 2  Again  we favour the value that s slightly higher than 0 2   Notice that in both cases  the approximations for 0 1 and 0 2 have a slight upward bias  If we add enough of these biases in  they will push the number further and further away from what we want  and in fact  in the case of 0 1   0 2  the bias is high enough that the resulting number is no longer the closest number to 0 3   In particular  0 1   0 2 is really 0 1000000000000000055511151231257827021181583404541015625   0 200000000000000011102230246251565404236316680908203125   0 3000000000000000444089209850062616169452667236328125  whereas the number closest to 0 3 is actually 0 299999999999999988897769753748434595763683319091796875     P S  Some programming languages also provide pizza cutters that can split slices into exact tenths  Although such pizza cutters are uncommon  if you do have access to one  you should use it when it s important to be able to get exactly one-tenth or one-fifth of a slice    Originally posted on Quora

User · Answer

My workaround   function add a  b  precision        var x   Math pow 10  precision    2       return  Math round a   x    Math round b   x     x      precision refers to the number of digits you want to preserve after the decimal point during addition

User · Answer

A lot of good answers have been posted  but I d like to append one more   Not all numbers can be represented via floats doubles For example  the number  0 2  will be represented as  0 200000003  in single precision in IEEE754 float point standard   Model for store real numbers under the hood represent float numbers as    Even though you can type 0 2 easily  FLT RADIX and DBL RADIX is 2  not 10 for a computer with FPU which uses  IEEE Standard for Binary Floating-Point Arithmetic  ISO IEEE Std 754-1985     So it is a bit hard to represent such numbers exactly  Even if you specify this variable explicitly without any intermediate calculation

User · Answer

Floating point numbers are represented  at the hardware level  as fractions of binary numbers  base 2   For example  the decimal fraction  0 125  has the value 1 10   2 100   5 1000 and  in the same way  the binary fraction  0 001  has the value 0 2   0 4   1 8  These two fractions have the same value  the only difference is that the first is a decimal fraction  the second is a binary fraction  Unfortunately  most decimal fractions cannot have exact representation in binary fractions  Therefore  in general  the floating point numbers you give are only approximated to binary fractions to be stored in the machine  The problem is easier to approach in base 10  Take for example  the fraction 1 3  You can approximate it to a decimal fraction  0 3  or better  0 33  or better  0 333  etc  No matter how many decimal places you write  the result is never exactly 1 3  but it is an estimate that always comes closer  Likewise  no matter how many base 2 decimal places you use  the decimal value 0 1 cannot be represented exactly as a binary fraction  In base 2  1 10 is the following periodic number  0 0001100110011001100110011001100110011001100110011      Stop at any finite amount of bits  and you ll get an approximation  For Python  on a typical machine  53 bits are used for the precision of a float  so the value stored when you enter the decimal 0 1 is the binary fraction  0 00011001100110011001100110011001100110011001100110011010  which is close  but not exactly equal  to 1 10  It s easy to forget that the stored value is an approximation of the original decimal fraction  due to the way floats are displayed in the interpreter  Python only displays a decimal approximation of the value stored in binary  If Python were to output the true decimal value of the binary approximation stored for 0 1  it would output   gt  gt  gt  0 1 0 1000000000000000055511151231257827021181583404541015625  This is a lot more decimal places than most people would expect  so Python displays a rounded value to improve readability   gt  gt  gt  0 1 0 1  It is important to understand that in reality this is an illusion  the stored value is not exactly 1 10  it is simply on the display that the stored value is rounded  This becomes evident as soon as you perform arithmetic operations with these values   gt  gt  gt  0 1   0 2 0 30000000000000004  This behavior is inherent to the very nature of the machine s floating-point representation  it is not a bug in Python  nor is it a bug in your code  You can observe the same type of behavior in all other languages   that use hardware support for calculating floating point numbers  although some languages   do not make the difference visible by default  or not in all display modes   Another surprise is inherent in this one  For example  if you try to round the value 2 675 to two decimal places  you will get  gt  gt  gt  round  2 675  2  2 67  The documentation for the round   primitive indicates that it rounds to the nearest value away from zero  Since the decimal fraction is exactly halfway between 2 67 and 2 68  you should expect to get  a binary approximation of  2 68  This is not the case  however  because when the decimal fraction 2 675 is converted to a float  it is stored by an approximation whose exact value is   2 67499999999999982236431605997495353221893310546875  Since the approximation is slightly closer to 2 67 than 2 68  the rounding is down  If you are in a situation where rounding decimal numbers halfway down matters  you should use the decimal module  By the way  the decimal module also provides a convenient way to  quot see quot  the exact value stored for any float   gt  gt  gt  from decimal import Decimal  gt  gt  gt  Decimal  2 675   gt  gt  gt  Decimal   2 67499999999999982236431605997495353221893310546875    Another consequence of the fact that 0 1 is not exactly stored in 1 10 is that the sum of ten values   of 0 1 does not give 1 0 either   gt  gt  gt  sum   0 0  gt  gt  gt  for i in range  10       sum     0 1     gt  gt  gt  sum 0 9999999999999999  The arithmetic of binary floating point numbers holds many such surprises  The problem with  quot 0 1 quot  is explained in detail below  in the section  quot Representation errors quot   See The Perils of Floating Point for a more complete list of such surprises  It is true that there is no simple answer  however do not be overly suspicious of floating virtula numbers  Errors  in Python  in floating-point number operations are due to the underlying hardware  and on most machines are no more than 1 in 2    53 per operation  This is more than necessary for most tasks  but you should keep in mind that these are not decimal operations  and every operation on floating point numbers may suffer from a new error  Although pathological cases exist  for most common use cases you will get the expected result at the end by simply rounding up to the number of decimal places you want on the display  For fine control over how floats are displayed  see String Formatting Syntax for the formatting specifications of the str format    method  This part of the answer explains in detail the example of  quot 0 1 quot  and shows how you can perform an exact analysis of this type of case on your own  We assume that you are familiar with the binary representation of floating point numbers The term Representation error means that most decimal fractions cannot be represented exactly in binary  This is the main reason why Python  or Perl  C  C     Java  Fortran  and many others  usually doesn t display the exact result in decimal   gt  gt  gt  0 1   0 2 0 30000000000000004  Why   1 10 and 2 10 are not representable exactly in binary fractions  However  all machines today  July 2010  follow the IEEE-754 standard for the arithmetic of floating point numbers  and most platforms use an  quot IEEE-754 double precision quot  to represent Python floats  Double precision IEEE-754 uses 53 bits of precision  so on reading the computer tries to convert 0 1 to the nearest fraction of the form J   2    N with J an integer of exactly 53 bits  Rewrite   1 10     J    2    N   in   J     2    N   10  remembering that J is exactly 53 bits  so gt    2    52 but  lt 2    53   the best possible value for N is 56   gt  gt  gt  2    52 4503599627370496  gt  gt  gt  2    53 9007199254740992  gt  gt  gt  2    56 10 7205759403792793  So 56 is the only possible value for N which leaves exactly 53 bits for J  The best possible value for J is therefore this quotient  rounded   gt  gt  gt  q  r   divmod  2    56  10   gt  gt  gt  r 6  Since the carry is greater than half of 10  the best approximation is obtained by rounding up   gt  gt  gt  q   1 7205759403792794  Therefore the best possible approximation for 1 10 in  quot IEEE-754 double precision quot  is this above 2    56  that is  7205759403792794 72057594037927936  Note that since the rounding was done upward  the result is actually slightly greater than 1 10  if we hadn t rounded up  the quotient would have been slightly less than 1 10  But in no case is it exactly 1 10  So the computer never  quot sees quot  1 10  what it sees is the exact fraction given above  the best approximation using the double precision floating point numbers from the  quot  quot  IEEE-754  quot    gt  gt  gt   1   2    56 7205759403792794 0  If we multiply this fraction by 10    30  we can observe the values   of its 30 decimal places of strong weight   gt  gt  gt  7205759403792794   10    30    2    56 100000000000000005551115123125L  meaning that the exact value stored in the computer is approximately equal to the decimal value 0 100000000000000005551115123125  In versions prior to Python 2 7 and Python 3 1  Python rounded these values   to 17 significant decimal places  displaying    0 10000000000000001     In current versions of Python  the displayed value is the value whose fraction is as short as possible while giving exactly the same representation when converted back to binary  simply displaying    0 1

User · Answer

Decimal numbers such as 0 1  0 2  and 0 3 are not represented exactly in binary encoded floating point types  The sum of the approximations for 0 1 and 0 2 differs from the approximation used for 0 3  hence the falsehood of 0 1   0 2    0 3 as can be seen more clearly here    include  lt stdio h gt   int main         printf  0 1   0 2    0 3 is  s n   0 1   0 2    0 3    true     false        printf  0 1 is   23f n   0 1       printf  0 2 is   23f n   0 2       printf  0 1   0 2 is   23f n   0 1   0 2       printf  0 3 is   23f n   0 3       printf  0 3 -  0 1   0 2  is  g n   0 3 -  0 1   0 2        return 0      Output   0 1   0 2    0 3 is false 0 1 is 0 10000000000000000555112 0 2 is 0 20000000000000001110223 0 1   0 2 is 0 30000000000000004440892 0 3 is 0 29999999999999998889777 0 3 -  0 1   0 2  is -5 55112e-17   For these computations to be evaluated more reliably  you would need to use a decimal-based representation for floating point values  The C Standard does not specify such types by default but as an extension described in a technical Report   The  Decimal32   Decimal64 and  Decimal128 types might be available on your system  for example  GCC supports them on selected targets  but Clang does not support them on OS nbsp X

User · Answer

Normal arithmetic is base-10  so decimals represent tenths  hundredths  etc   When you try to represent a floating-point number in binary base-2 arithmetic  you are dealing with halves  fourths  eighths  etc  In the hardware  floating points are stored as integer mantissas and exponents   Mantissa represents the significant digits   Exponent is like scientific notation but it uses a base of 2 instead of 10   For example 64 0 would be represented with a mantissa of 1 and exponent of 6   0 125 would be represented with a mantissa of 1 and an exponent of -3  Floating point decimals have to add up negative powers of 2 0 1b   0 5d 0 01b   0 25d 0 001b   0 125d 0 0001b   0 0625d 0 00001b   0 03125d  and so on  It is common to use a error delta instead of using equality operators when dealing with floating point arithmetic   Instead of if a  b       you would use delta   0 0001     or some arbitrarily small amount if a - b  gt  -delta  amp  amp  a - b  lt  delta

User · Answer

Just for fun  I played with the representation of floats  following the definitions from the Standard C99 and I wrote the code below   The code prints the binary representation of floats in 3 separated groups  SIGN EXPONENT FRACTION   and after that it prints a sum  that  when summed with enough precision  it will show the value that really exists in hardware   So when you write float x   999     the compiler will transform that number in a bit representation printed by the function xx such that the sum printed by the function yy be equal to the given number   In reality  this sum is only an approximation   For the number 999 999 999  the compiler will insert in bit representation of the float the number 1 000 000 000  After the code I attach a console session  in which I compute the sum of terms for both constants  minus PI and 999999999  that really exists in hardware  inserted there by the compiler    include  lt stdio h gt   include  lt limits h gt   void xx float  x        unsigned char i   sizeof  x  CHAR BIT-1      do           switch  i            case 31               printf  sign                  break          case 30               printf  exponent                  break          case 23               printf  fraction                  break                     char b    unsigned long long  x amp   unsigned long long 1 lt  lt i    0          printf   d    b         while  i--       printf   n       void yy float a        int sign     unsigned long long   amp a amp   unsigned long long 1 lt  lt 31        int fraction     1 lt  lt 23 -1  amp    int   amp a       int exponent    255 amp     int   amp a  gt  gt 23  -127       printf sign  positive      1    negative      1         unsigned int i   1 lt  lt 22      unsigned int j   1      do           char b  fraction amp i   0          b amp  amp  printf  1   d   c   1 lt  lt j   fraction amp  i-1              0         while  j    i gt  gt  1        printf   2  d   exponent       printf   n       void main         float x -3 14      float y 999999999      printf   lu n   sizeof x        xx  amp x       xx  amp y       yy x       yy y         Here is a console session in which I compute the real value of the float that exists in hardware   I used bc to print the sum of terms outputted by the main program   One can insert that sum in python repl or something similar also   --     terra1 stub   qemacs f c --     terra1 stub   gcc f c --     terra1 stub     a out sign 1 exponent 1 0 0 0 0 0 0 fraction 0 1 0 0 1 0 0 0 1 1 1 1 0 1 0 1 1 1 0 0 0 0 1 1 sign 0 exponent 1 0 0 1 1 1 0 fraction 0 1 1 0 1 1 1 0 0 1 1 0 1 0 1 1 0 0 1 0 1 0 0 0 negative   1 1  2   1  16   1  256   1  512   1  1024   1  2048   1  8192   1  32768   1  65536   1  131072   1  4194304   1  8388608    2 1 positive   1 1  2   1  4   1  16   1  32   1  64   1  512   1  1024   1  4096   1  16384   1  32768   1  262144   1  1048576    2 29 --     terra1 stub   bc scale 15   1 1  2   1  4   1  16   1  32   1  64   1  512   1  1024   1  4096   1  16384   1  32768   1  262144   1  1048576    2 29 999999999 999999446351872   That s it   The value of 999999999 is in fact  999999999 999999446351872   You can also check with bc that -3 14 is also perturbed   Do not forget to set a scale factor in bc   The displayed sum is what inside the hardware   The value you obtain by computing it depends on the scale you set   I did set the scale factor to 15   Mathematically  with infinite precision  it seems it is 1 000 000 000

User · Answer

My answer is quite long  so I ve split it into three sections  Since the question is about floating point mathematics  I ve put the emphasis on what the machine actually does  I ve also made it specific to double  64 bit  precision  but the argument applies equally to any floating point arithmetic   Preamble  An IEEE 754 double-precision binary floating-point format  binary64  number represents a number of the form     value    -1  s    1 m51m50   m2m1m0 2   2e-1023   in 64 bits    The first bit is the sign bit  1 if the number is negative  0 otherwise1  The next 11 bits are the exponent  which is offset by 1023  In other words  after reading the exponent bits from a double-precision number  1023 must be subtracted to obtain the power of two  The remaining 52 bits are the significand  or mantissa   In the mantissa  an  implied  1  is always2 omitted since the most significant bit of any binary value is 1    1 - IEEE 754 allows for the concept of a signed zero -  0 and -0 are treated differently  1     0  is positive infinity  1    -0  is negative infinity  For zero values  the mantissa and exponent bits are all zero  Note  zero values   0 and -0  are explicitly not classed as denormal2   2 - This is not the case for denormal numbers  which have an offset exponent of zero  and an implied 0    The range of denormal double precision numbers is dmin    x    dmax  where dmin  the smallest representable nonzero number  is 2-1023 - 51     4 94   10-324  and dmax  the largest denormal number  for which the mantissa consists entirely of 1s  is 2-1023   1 - 2-1023 - 51     2 225   10-308      Turning a double precision number to binary  Many online converters exist to convert a double precision floating point number to binary  e g  at binaryconvert com   but here is some sample C  code to obtain the IEEE 754 representation for a double precision number  I separate the three parts with colons       public static string BinaryRepresentation double value        long valueInLongType   BitConverter DoubleToInt64Bits value       string bits   Convert ToString valueInLongType  2       string leadingZeros   new string  0   64 - bits Length       string binaryRepresentation   leadingZeros   bits       string sign   binaryRepresentation 0  ToString        string exponent   binaryRepresentation Substring 1  11       string mantissa   binaryRepresentation Substring 12        return string Format   0   1   2    sign  exponent  mantissa         Getting to the point  the original question   Skip to the bottom for the TL DR version   Cato Johnston  the question asker  asked why 0 1   0 2    0 3   Written in binary  with colons separating the three parts   the IEEE 754 representations of the values are   0 1   gt  0 01111111011 1001100110011001100110011001100110011001100110011010 0 2   gt  0 01111111100 1001100110011001100110011001100110011001100110011010   Note that the mantissa is composed of recurring digits of 0011  This is key to why there is any error to the calculations - 0 1  0 2 and 0 3 cannot be represented in binary precisely in a finite number of binary bits any more than 1 9  1 3 or 1 7 can be represented precisely in decimal digits   Also note that we can decrease the power in the exponent by 52 and shift the point in the binary representation to the right by 52 places  much like 10-3   1 23    10-5   123   This then enables us to represent the binary representation as the exact value that it represents in the form a   2p  where  a  is an integer   Converting the exponents to decimal  removing the offset  and re-adding the implied 1  in square brackets   0 1 and 0 2 are   0 1   gt  2 -4    1  1001100110011001100110011001100110011001100110011010 0 2   gt  2 -3    1  1001100110011001100110011001100110011001100110011010 or 0 1   gt  2 -56   7205759403792794   0 1000000000000000055511151231257827021181583404541015625 0 2   gt  2 -55   7205759403792794   0 200000000000000011102230246251565404236316680908203125   To add two numbers  the exponent needs to be the same  i e    0 1   gt  2 -3    0 1100110011001100110011001100110011001100110011001101 0  0 2   gt  2 -3    1 1001100110011001100110011001100110011001100110011010 sum    2 -3   10 0110011001100110011001100110011001100110011001100111 or 0 1   gt  2 -55   3602879701896397    0 1000000000000000055511151231257827021181583404541015625 0 2   gt  2 -55   7205759403792794    0 200000000000000011102230246251565404236316680908203125 sum    2 -55   10808639105689191   0 3000000000000000166533453693773481063544750213623046875   Since the sum is not of the form 2n   1  bbb  we increase the exponent by one and shift the decimal  binary  point to get   sum   2 -2    1 0011001100110011001100110011001100110011001100110011 1        2 -54   5404319552844595 5   0 3000000000000000166533453693773481063544750213623046875   There are now 53 bits in the mantissa  the 53rd is in square brackets in the line above   The default rounding mode for IEEE 754 is  Round to Nearest  - i e  if a number x falls between two values a and b  the value where the least significant bit is zero is chosen   a   2 -54   5404319552844595   0 299999999999999988897769753748434595763683319091796875     2 -2    1 0011001100110011001100110011001100110011001100110011  x   2 -2    1 0011001100110011001100110011001100110011001100110011 1   b   2 -2    1 0011001100110011001100110011001100110011001100110100     2 -54   5404319552844596   0 3000000000000000444089209850062616169452667236328125   Note that a and b differ only in the last bit     0011   1      0100  In this case  the value with the least significant bit of zero is b  so the sum is   sum   2 -2    1 0011001100110011001100110011001100110011001100110100       2 -54   5404319552844596   0 3000000000000000444089209850062616169452667236328125   whereas the binary representation of 0 3 is   0 3   gt  2 -2    1 0011001100110011001100110011001100110011001100110011        2 -54   5404319552844595   0 299999999999999988897769753748434595763683319091796875   which only differs from the binary representation of the sum of 0 1 and 0 2 by 2-54   The binary representation of 0 1 and 0 2 are the most accurate representations of the numbers allowable by IEEE 754  The addition of these representation  due to the default rounding mode  results in a value which differs only in the least-significant-bit   TL DR  Writing 0 1   0 2 in a IEEE 754 binary representation  with colons separating the three parts  and comparing it to 0 3  this is  I ve put the distinct bits in square brackets    0 1   0 2   gt  0 01111111101 0011001100110011001100110011001100110011001100110 100  0 3         gt  0 01111111101 0011001100110011001100110011001100110011001100110 011    Converted back to decimal  these values are   0 1   0 2   gt  0 300000000000000044408920985006    0 3         gt  0 299999999999999988897769753748      The difference is exactly 2-54  which is  5 5511151231258    10-17 - insignificant  for many applications  when compared to the original values   Comparing the last few bits of a floating point number is inherently dangerous  as anyone who reads the famous  What Every Computer Scientist Should Know About Floating-Point Arithmetic   which covers all the major parts of this answer  will know   Most calculators use additional guard digits to get around this problem  which is how 0 1   0 2 would give 0 3  the final few bits are rounded

User · Answer

Binary floating point math is like this  In most programming languages  it is based on the IEEE 754 standard  The crux of the problem is that numbers are represented in this format as a whole number times a power of two  rational numbers  such as 0 1  which is 1 10  whose denominator is not a power of two cannot be exactly represented  For 0 1 in the standard binary64 format  the representation can be written exactly as  0 1000000000000000055511151231257827021181583404541015625 in decimal  or 0x1 999999999999ap-4 in C99 hexfloat notation   In contrast  the rational number 0 1  which is 1 10  can be written exactly as  0 1 in decimal  or 0x1 99999999999999   p-4 in an analogue of C99 hexfloat notation  where the     represents an unending sequence of 9 s   The constants 0 2 and 0 3 in your program will also be approximations to their true values   It happens that the closest double to 0 2 is larger than the rational number 0 2 but that the closest double to 0 3 is smaller than the rational number 0 3   The sum of 0 1 and 0 2 winds up being larger than the rational number 0 3 and hence disagreeing with the constant in your code  A fairly comprehensive treatment of floating-point arithmetic issues is What Every Computer Scientist Should Know About Floating-Point Arithmetic  For an easier-to-digest explanation  see floating-point-gui de  Side Note  All positional  base-N  number systems share this problem with precision Plain old decimal  base 10  numbers have the same issues  which is why numbers like 1 3 end up as 0 333333333    You ve just stumbled on a number  3 10  that happens to be easy to represent with the decimal system  but doesn t fit the binary system  It goes both ways  to some small degree  as well  1 16 is an ugly number in decimal  0 0625   but in binary it looks as neat as a 10 000th does in decimal  0 0001    - if we were in the habit of using a base-2 number system in our daily lives  you d even look at that number and instinctively understand you could arrive there by halving something  halving it again  and again and again     Of course  that s not exactly how floating-point numbers are stored in memory  they use a form of scientific notation   However  it does illustrate the point that binary floating-point precision errors tend to crop up because the  quot real world quot  numbers we are usually interested in working with are so often powers of ten - but only because we use a decimal number system day-to-day  This is also why we ll say things like 71  instead of  quot 5 out of every 7 quot   71  is an approximation  since 5 7 can t be represented exactly with any decimal number   So no  binary floating point numbers are not broken  they just happen to be as imperfect as every other base-N number system    Side Side Note  Working with Floats in Programming In practice  this problem of precision means you need to use rounding functions to round your floating point numbers off to however many decimal places you re interested in before you display them  You also need to replace equality tests with comparisons that allow some amount of tolerance  which means  Do not do if  x    y          Instead do if  abs x - y   lt  myToleranceValue           where abs is the absolute value  myToleranceValue needs to be chosen for your particular application - and it will have a lot to do with how much  quot wiggle room quot  you are prepared to allow  and what the largest number you are going to be comparing may be  due to loss of precision issues   Beware of  quot epsilon quot  style constants in your language of choice  These are not to be used as tolerance values

User · Answer

Those weird numbers appear because computers use binary base 2  number system for calculation purposes  while we use decimal base 10    There are a majority of fractional numbers that cannot be represented precisely either in binary or in decimal or both  Result - A rounded up  but precise  number results

User · Answer

Since this thread branched off a bit into a general discussion over current floating point implementations I d add that there are projects on fixing their issues   Take a look at https   posithub org  for example  which showcases a number type called posit  and its predecessor unum  that promises to offer better accuracy with fewer bits  If my understanding is correct  it also fixes the kind of problems in the question  Quite interesting project  the person behind it is a mathematician it Dr  John Gustafson  The whole thing is open source  with many actual implementations in C C    Python  Julia and C   https   hastlayer com arithmetics

User · Answer

Floating point rounding error   From What Every Computer Scientist Should Know About Floating-Point Arithmetic      Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation  Although there are infinitely many integers  in most programs the result of integer computations can be stored in 32 bits  In contrast  given any fixed number of bits  most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits  Therefore the result of a floating-point calculation must often be rounded in order to fit back into its finite representation  This rounding error is the characteristic feature of floating-point computation

User · Answer

Can I just add  people always assume this to be a computer problem  but if you count with your hands  base 10   you can t get  1 3 1 3 2 3  true unless you have infinity to add 0 333    to 0 333    so just as with the  1 10 2 10    3 10 problem in base 2  you truncate it to 0 333   0 333   0 666 and probably round it to 0 667 which would be also be technically inaccurate   Count in ternary  and thirds are not a problem though - maybe some race with 15 fingers on each hand would ask why your decimal math was broken

User · Answer

In short it s because   Floating point numbers cannot represent all decimals precisely in binary  So just like 3 10 which does not exist in base 10 precisely  it will be 3 33    recurring   in the same way 1 10 doesn t exist in binary  So what  How to deal with it  Is there any workaround  In order to offer The best solution I can say I discovered following method  parseFloat  0 1   0 2  toFixed 10     gt  Will return 0 3  Let me explain why it s the best solution  As others mentioned in above answers it s a good idea to use ready to use Javascript toFixed   function to solve the problem  But most likely you ll encounter with some problems  Imagine you are going to add up two float numbers like 0 2 and 0 7 here it is  0 2   0 7   0 8999999999999999  Your expected result was 0 9 it means you need a result with 1 digit precision in this case  So you should have used  0 2   0 7  tofixed 1  but you can t just give a certain parameter to toFixed   since it depends on the given number  for instance 0 22   0 7   0 9199999999999999  In this example you need 2 digits precision so it should be toFixed 2   so what should be the paramter to fit every given float number  You might say let it be 10 in every situation then   0 2   0 7  toFixed 10    gt  Result will be  quot 0 9000000000 quot   Damn  What are you going to do with those unwanted zeros after 9  It s the time to convert it to float to make it as you desire  parseFloat  0 2   0 7  toFixed 10     gt  Result will be 0 9  Now that you found the solution  it s better to offer it as a function like this  function floatify number              return parseFloat  number  toFixed 10                    Let s try it yourself   x000D   x000D  function floatify number          return parseFloat  number  toFixed 10            function addUp      var number1        number1   val      var number2        number2   val      var unexpectedResult   number1   number2    var expectedResult   floatify number1   number2         unexpectedResult   text unexpectedResult         expectedResult   text expectedResult     addUp    x000D  input    width  50px     expectedResult  color  green     unexpectedResult  color  red    x000D   lt script src  https   ajax googleapis com ajax libs jquery 2 1 1 jquery min js  gt  lt  script gt   lt input id  number1  value  0 2  onclick  addUp    onkeyup  addUp     gt     lt input id  number2  value  0 7  onclick  addUp    onkeyup  addUp     gt     lt p gt Expected Result   lt span id  expectedResult  gt  lt  span gt  lt  p gt   lt p gt Unexpected Result   lt span id  unexpectedResult  gt  lt  span gt  lt  p gt  x000D   x000D   x000D   You can use it this way  var x   0 2   0 7  floatify x      gt  Result  0 9  As W3SCHOOLS suggests there is another solution too  you can multiply and divide to solve the problem above  var x    0 2   10   0 1   10    10           x will be 0 3  Keep in mind that  0 2   0 1    10   10 won t work at all although it seems the same  I prefer the first solution since I can apply it as a function which converts the input float to accurate output float

User · Answer

Since Python 3 5 you can use math isclose   function for testing approximate equality    gt  gt  gt  import math  gt  gt  gt  math isclose 0 1   0 2  0 3  True  gt  gt  gt  0 1   0 2    0 3 False

User · Answer

It s broken in the exact same way decimal  base-10  notation is broken  just for base-2  To understand  think about representing 1 3 as a decimal value  It s impossible to do exactly  In the same way  1 10  decimal 0 1  cannot be represented exactly in base 2  binary  as a  quot decimal quot  value  a repeating pattern after the decimal point goes on forever  The value is not exact  and therefore you can t do exact math with it using normal floating point methods

User · Answer

I just saw this interesting issue around floating points   Consider the following results   error    2  53 1  - int float 2  53 1      gt  gt  gt   2  53 1  - int float 2  53 1   1   We can clearly see a breakpoint when 2  53 1 - all works fine until 2  53    gt  gt  gt   2  53  - int float 2  53   0     This happens because of the double-precision binary  IEEE 754 double-precision binary floating-point format  binary64  From the Wikipedia page for Double-precision floating-point format      Double-precision binary floating-point is a commonly used format on PCs  due to its wider range over single-precision floating point  in spite of its performance and bandwidth cost  As with single-precision floating-point format  it lacks precision on integer numbers when compared with an integer format of the same size  It is commonly known simply as double  The IEEE 754 standard specifies a binary64 as having          Sign bit  1 bit   Exponent  11 bits   Significant precision  53 bits  52 explicitly stored                The real value assumed by a given 64-bit double-precision datum with a given biased exponent and a 52-bit fraction is            or         Thanks to  a guest for pointing that out to me

User · Answer

2020 TypeScript Answer  Why  https   0 30000000000000004 com Library  I built an NPM library for these reasons  but here s the code  You can either import the library or copy the code below     TypeScript Type  Math Type type MathType    Add     Subtract     Multiply     Divide       Math Exact export const mathExact    mathType  MathType  numberOne number   numberTwo  number    gt         Decimal Places   let numberOneDecimalPlaces  number   0    let numberTwoDecimalPlaces  number   0        Float  Number One   if  numberOne toString   indexOf          -1           Assign Decimal Places     numberOneDecimalPlaces   numberOne toString   length - 1 - numberOne toString   indexOf                 Float  Number Two   if  numberTwo toString   indexOf          -1           Assign Decimal Places     numberTwoDecimalPlaces   numberTwo toString   length - 1 - numberTwo toString   indexOf                 Decimal Places  Equal   if  numberOneDecimalPlaces     numberTwoDecimalPlaces           Integers  Off By Decimal Places      const numberOneInteger  number   numberOne   Math pow 10  numberOneDecimalPlaces       const numberTwoInteger  number   numberTwo   Math pow 10  numberTwoDecimalPlaces           Math Type  Add     if  mathType      Add              Integer Total       const integerTotal  number   numberOneInteger   numberTwoInteger            Decimal Total  Convert Back X Amount Of Decimal Places        return integerTotal   Math pow 10  numberOneDecimalPlaces                Math Type  Subtract     else if  mathType      Subtract              Integer Total       const integerTotal  number   numberOneInteger - numberTwoInteger            Decimal Total  Convert Back X Amount Of Decimal Places        return integerTotal   Math pow 10  numberOneDecimalPlaces                Math Type  Multiply     else if  mathType      Multiply              Integer Total       const integerTotal  number   numberOneInteger   numberTwoInteger            Decimal Total  Convert Back X Amount Of Decimal Places        return integerTotal   Math pow 10  numberOneDecimalPlaces   numberTwoDecimalPlaces                Math Type  Divide     else if  mathType      Divide           return numberOneInteger   numberTwoInteger                 Decimal Places  Number One Has More   else if  numberOneDecimalPlaces  gt  numberTwoDecimalPlaces           Integers  Off By Decimal Places      const numberOneInteger  number   numberOne   Math pow 10  numberOneDecimalPlaces       const numberTwoInteger  number   numberTwo   Math pow 10  numberOneDecimalPlaces           Math Type  Add     if  mathType      Add              Integer Total       const integerTotal  number   numberOneInteger   numberTwoInteger            Decimal Total  Convert Back X Amount Of Decimal Places        return integerTotal   Math pow 10  numberOneDecimalPlaces                Math Type  Subtract     else if  mathType      Subtract              Integer Total       const integerTotal  number   numberOneInteger - numberTwoInteger            Decimal Total  Convert Back X Amount Of Decimal Places        return integerTotal   Math pow 10  numberOneDecimalPlaces                Math Type  Multiply     else if  mathType      Multiply              Integer Total       const integerTotal  number   numberOneInteger   numberTwoInteger            Decimal Total  Convert Back X Amount Of Decimal Places        return integerTotal   Math pow 10  numberOneDecimalPlaces   numberTwoDecimalPlaces    numberOneDecimalPlaces - numberTwoDecimalPlaces                 Math Type  Divide     else if  mathType      Divide           return numberOneInteger   numberTwoInteger                 Decimal Places  Number Two Has More   else if  numberOneDecimalPlaces  lt  numberTwoDecimalPlaces           Integers  Off By Decimal Places      const numberOneInteger  number   numberOne   Math pow 10  numberTwoDecimalPlaces       const numberTwoInteger  number   numberTwo   Math pow 10  numberTwoDecimalPlaces           Math Type  Add     if  mathType      Add              Integer Total       const integerTotal  number   numberOneInteger   numberTwoInteger            Decimal Total  Convert Back X Amount Of Decimal Places        return integerTotal   Math pow 10  numberTwoDecimalPlaces                Math Type  Subtract     else if  mathType      Subtract              Integer Total       const integerTotal  number   numberOneInteger - numberTwoInteger            Decimal Total  Convert Back X Amount Of Decimal Places        return integerTotal   Math pow 10  numberTwoDecimalPlaces                Math Type  Multiply     else if  mathType      Multiply              Integer Total       const integerTotal  number   numberOneInteger   numberTwoInteger            Decimal Total  Convert Back X Amount Of Decimal Places        return integerTotal   Math pow 10  numberOneDecimalPlaces   numberTwoDecimalPlaces    numberTwoDecimalPlaces - numberOneDecimalPlaces                 Math Type  Divide     else if  mathType      Divide           return numberOneInteger   numberTwoInteger                Add  mathExact  Add   1  2      3 mathExact  Add    1   2        3 mathExact  Add   1 123   2       1 323 mathExact  Add    02  1 123      1 143  Subtract  mathExact  Subtract   1  2      -1 mathExact  Subtract    1   2       - 1 mathExact  Subtract   1 123   2        923 mathExact  Subtract    02  1 123      -1 103  Multiply  mathExact  Multiply   1  2      2 mathExact  Multiply    1   2        02 mathExact  Multiply   1 123   2        2246 mathExact  Multiply    02  1 123       002246  Divide  mathExact  Divide   1  2       5 mathExact  Divide    1   2        5 mathExact  Divide   1 123   2       5 615 mathExact  Divide    02  1 123       017809439002671415

User · Answer

Many of this question s numerous duplicates ask about the effects of floating point rounding on specific numbers  In practice  it is easier to get a feeling for how it works by looking at exact results of calculations of interest rather than by just reading about it  Some languages provide ways of doing that - such as converting a float or double to BigDecimal in Java   Since this is a language-agnostic question  it needs language-agnostic tools  such as a Decimal to Floating-Point Converter   Applying it to the numbers in the question  treated as doubles    0 1 converts to 0 1000000000000000055511151231257827021181583404541015625    0 2 converts to 0 200000000000000011102230246251565404236316680908203125    0 3 converts to 0 299999999999999988897769753748434595763683319091796875  and   0 30000000000000004 converts to 0 3000000000000000444089209850062616169452667236328125   Adding the first two numbers manually or in a decimal calculator such as Full Precision Calculator  shows the exact sum of the actual inputs is 0 3000000000000000166533453693773481063544750213623046875    If it were rounded down to the equivalent of 0 3 the rounding error would be 0 0000000000000000277555756156289135105907917022705078125  Rounding up to the equivalent of 0 30000000000000004 also gives rounding error 0 0000000000000000277555756156289135105907917022705078125  The round-to-even tie breaker applies   Returning to the floating point converter  the raw hexadecimal for 0 30000000000000004 is 3fd3333333333334  which ends in an even digit and therefore is the correct result

User · Answer

Floating point numbers stored in the computer consist of two parts  an integer and an exponent that the base is taken to and multiplied by the integer part   If the computer were working in base 10  0 1 would be 1 x 10     0 2 would be 2 x 10     and 0 3 would be 3 x 10     Integer math is easy and exact  so adding 0 1   0 2 will obviously result in 0 3   Computers don t usually work in base 10  they work in base 2  You can still get exact results for some values  for example 0 5 is 1 x 2    and 0 25 is 1 x 2     and adding them results in 3 x 2     or 0 75  Exactly   The problem comes with numbers that can be represented exactly in base 10  but not in base 2  Those numbers need to be rounded to their closest equivalent  Assuming the very common IEEE 64-bit floating point format  the closest number to 0 1 is 3602879701896397 x 2 55  and the closest number to 0 2 is 7205759403792794 x 2 55  adding them together results in 10808639105689191 x 2 55  or an exact decimal value of 0 3000000000000000444089209850062616169452667236328125  Floating point numbers are generally rounded for display

User · Answer

The kind of floating-point math that can be implemented in a digital computer necessarily uses an approximation of the real numbers and operations on them   The standard version runs to over fifty pages of documentation and has a committee to deal with its errata and further refinement    This approximation is a mixture of approximations of different kinds  each of which can either be ignored or carefully accounted for due to its specific manner of deviation from exactitude  It also involves a number of explicit exceptional cases at both the hardware and software levels that most people walk right past while pretending not to notice   If you need infinite precision  using the number p  for example  instead of one of its many shorter stand-ins   you should write or use a symbolic math program instead   But if you re okay with the idea that sometimes floating-point math is fuzzy in value and logic and errors can accumulate quickly  and you can write your requirements and tests to allow for that  then your code can frequently get by with what s in your FPU

User · Answer

Did you try the duct tape solution   Try to determine when errors occur and fix them with short if statements  it s not pretty but for some problems it is the only solution and this is one of them    if   n   0 1   lt  100 0     return n   0 1 - 0 000000000000001                        else   return n   0 1   0 000000000000001          I had the same problem in a scientific simulation project in c   and I can tell you that if you ignore the butterfly effect it s gonna turn to a big fat dragon and bite you in the a

User · Answer

Some statistics related to this famous double precision question   When adding all values  a   b  using a step of 0 1  from 0 1 to 100  we have  15  chance of precision error  Note that the error could result in slightly bigger or smaller values  Here are some examples   0 1   0 2   0 30000000000000004  BIGGER  0 1   0 7   0 7999999999999999  SMALLER      1 7   1 9   3 5999999999999996  SMALLER  1 7   2 2   3 9000000000000004  BIGGER      3 2   3 6   6 800000000000001  BIGGER  3 2   4 4   7 6000000000000005  BIGGER    When subtracting all values  a - b where a   b  using a step of 0 1  from 100 to 0 1  we have  34  chance of precision error  Here are some examples   0 6 - 0 2   0 39999999999999997  SMALLER  0 5 - 0 4   0 09999999999999998  SMALLER      2 1 - 0 2   1 9000000000000001  BIGGER  2 0 - 1 9   0 10000000000000009  BIGGER      100 - 99 9   0 09999999999999432  SMALLER  100 - 99 8   0 20000000000000284  BIGGER     15  and 34  are indeed huge  so always use BigDecimal when precision is of big importance  With 2 decimal digits  step 0 01  the situation worsens a bit more  18  and 36

User · Answer

A Hardware Designer s Perspective  I believe I should add a hardware designer   s perspective to this since I design and build floating point hardware  Knowing the origin of the error may help in understanding what is happening in the software  and ultimately  I hope this helps explain the reasons for why floating point errors happen and seem to accumulate over time   1  Overview  From an engineering perspective  most floating point operations will have some element of error since the hardware that does the floating point computations is only required to have an error of less than one half of one unit in the last place  Therefore  much hardware will stop at a precision that s only necessary to yield an error of less than one half of one unit in the last place for a single operation which is especially problematic in floating point division  What constitutes a single operation depends upon how many operands the unit takes  For most  it is two  but some units take 3 or more operands  Because of this  there is no guarantee that repeated operations will result in a desirable error since the errors add up over time   2  Standards  Most processors follow the IEEE-754 standard but some use denormalized  or different standards   For example  there is a denormalized mode in IEEE-754 which allows representation of very small floating point numbers at the expense of precision  The following  however  will cover the normalized mode of IEEE-754 which is the typical mode of operation   In the IEEE-754 standard  hardware designers are allowed any value of error epsilon as long as it s less than one half of one unit in the last place  and the result only has to be less than one half of one unit in the last place for one operation  This explains why when there are repeated operations  the errors add up  For IEEE-754 double precision  this is the 54th bit  since 53 bits are used to represent the numeric part  normalized   also called the mantissa  of the floating point number  e g  the 5 3 in 5 3e5   The next sections go into more detail on the causes of hardware error on various floating point operations   3  Cause of Rounding Error in Division  The main cause of the error in floating point division is the division algorithms used to calculate the quotient  Most computer systems calculate division using multiplication by an inverse  mainly in Z X Y  Z   X    1 Y    A division is computed iteratively i e  each cycle computes some bits of the quotient until the desired precision is reached  which for IEEE-754 is anything with an error of less than one unit in the last place  The table of reciprocals of Y  1 Y  is known as the quotient selection table  QST  in the slow division  and the size in bits of the quotient selection table is usually the width of the radix  or a number of bits of the quotient computed in each iteration   plus a few guard bits  For the IEEE-754 standard  double precision  64-bit   it would be the size of the radix of the divider  plus a few guard bits k  where k gt  2  So for example  a typical Quotient Selection Table for a divider that computes 2 bits of the quotient at a time  radix 4  would be 2 2  4 bits  plus a few optional bits     3 1 Division Rounding Error  Approximation of Reciprocal  What reciprocals are in the quotient selection table depend on the division method  slow division such as SRT division  or fast division such as Goldschmidt division  each entry is modified according to the division algorithm in an attempt to yield the lowest possible error  In any case  though  all reciprocals are approximations of the actual reciprocal and introduce some element of error  Both slow division and fast division methods calculate the quotient iteratively  i e  some number of bits of the quotient are calculated each step  then the result is subtracted from the dividend  and the divider repeats the steps until the error is less than one half of one unit in the last place  Slow division methods calculate a fixed number of digits of the quotient in each step and are usually less expensive to build  and fast division methods calculate a variable number of digits per step and are usually more expensive to build  The most important part of the division methods is that most of them rely upon repeated multiplication by an approximation of a reciprocal  so they are prone to error   4  Rounding Errors in Other Operations  Truncation  Another cause of the rounding errors in all operations are the different modes of truncation of the final answer that IEEE-754 allows  There s truncate  round-towards-zero  round-to-nearest  default   round-down  and round-up  All methods introduce an element of error of less than one unit in the last place for a single operation  Over time and repeated operations  truncation also adds cumulatively to the resultant error  This truncation error is especially problematic in exponentiation  which involves some form of repeated multiplication   5  Repeated Operations  Since the hardware that does the floating point calculations only needs to yield a result with an error of less than one half of one unit in the last place for a single operation  the error will grow over repeated operations if not watched  This is the reason that in computations that require a bounded error  mathematicians use methods such as using the round-to-nearest even digit in the last place of IEEE-754  because  over time  the errors are more likely to cancel each other out  and Interval Arithmetic combined with variations of the IEEE 754 rounding modes to predict rounding errors  and correct them  Because of its low relative error compared to other rounding modes  round to nearest even digit  in the last place   is the default rounding mode of IEEE-754   Note that the default rounding mode  round-to-nearest even digit in the last place  guarantees an error of less than one half of one unit in the last place for one operation  Using the truncation  round-up  and round down alone may result in an error that is greater than one half of one unit in the last place  but less than one unit in the last place  so these modes are not recommended unless they are used in Interval Arithmetic    6  Summary  In short  the fundamental reason for the errors in floating point operations is a combination of the truncation in hardware  and the truncation of a reciprocal in the case of division  Since the IEEE-754 standard only requires an error of less than one half of one unit in the last place for a single operation  the floating point errors over repeated operations will add up unless corrected

User · Answer

Floating point rounding errors  0 1 cannot be represented as accurately in base-2 as in base-10 due to the missing prime factor of 5  Just as 1 3 takes an infinite number of digits to represent in decimal  but is  0 1  in base-3  0 1 takes an infinite number of digits in base-2 where it does not in base-10  And computers don t have an infinite amount of memory

User · Answer

Another way to look at this  Used are 64 bits to represent numbers  As consequence there is no way more than 2  64   18 446 744 073 709 551 616 different numbers can be precisely represented    However  Math says there are already infinitely many decimals between 0 and 1  IEE 754 defines an encoding to use these 64 bits efficiently for a much larger number space plus NaN and   - Infinity  so there are gaps between accurately represented numbers filled with numbers only approximated    Unfortunately 0 3 sits in a gap

[math] Is floating point math broken?

Examples related to math

Examples related to language-agnostic

Examples related to floating-point

Examples related to floating-accuracy