[floating-point] Floating point inaccuracy examples

There are basically two major pitfalls people stumble in with floating-point numbers.

  1. The problem of scale. Each FP number has an exponent which determines the overall “scale” of the number so you can represent either really small values or really larges ones, though the number of digits you can devote for that is limited. Adding two numbers of different scale will sometimes result in the smaller one being “eaten” since there is no way to fit it into the larger scale.

    PS> $a = 1; $b = 0.0000000000000000000000001
    PS> Write-Host a=$a b=$b
    a=1 b=1E-25
    PS> $a + $b
    1
    

    As an analogy for this case you could picture a large swimming pool and a teaspoon of water. Both are of very different sizes, but individually you can easily grasp how much they roughly are. Pouring the teaspoon into the swimming pool, however, will leave you still with roughly a swimming pool full of water.

    (If the people learning this have trouble with exponential notation, one can also use the values 1 and 100000000000000000000 or so.)

  2. Then there is the problem of binary vs. decimal representation. A number like 0.1 can't be represented exactly with a limited amount of binary digits. Some languages mask this, though:

    PS> "{0:N50}" -f 0.1
    0.10000000000000000000000000000000000000000000000000
    

    But you can “amplify” the representation error by repeatedly adding the numbers together:

    PS> $sum = 0; for ($i = 0; $i -lt 100; $i++) { $sum += 0.1 }; $sum
    9,99999999999998
    

    I can't think of a nice analogy to properly explain this, though. It's basically the same problem why you can represent 1/3 only approximately in decimal because to get the exact value you need to repeat the 3 indefinitely at the end of the decimal fraction.

    Similarly, binary fractions are good for representing halves, quarters, eighths, etc. but things like a tenth will yield an infinitely repeating stream of binary digits.

  3. Then there is another problem, though most people don't stumble into that, unless they're doing huge amounts of numerical stuff. But then, those already know about the problem. Since many floating-point numbers are merely approximations of the exact value this means that for a given approximation f of a real number r there can be infinitely many more real numbers r1, r2, ... which map to exactly the same approximation. Those numbers lie in a certain interval. Let's say that rmin is the minimum possible value of r that results in f and rmax the maximum possible value of r for which this holds, then you got an interval [rmin, rmax] where any number in that interval can be your actual number r.

    Now, if you perform calculations on that number—adding, subtracting, multiplying, etc.—you lose precision. Every number is just an approximation, therefore you're actually performing calculations with intervals. The result is an interval too and the approximation error only ever gets larger, thereby widening the interval. You may get back a single number from that calculation. But that's merely one number from the interval of possible results, taking into account precision of your original operands and the precision loss due to the calculation.

    That sort of thing is called Interval arithmetic and at least for me it was part of our math course at the university.