[java] How many significant digits do floats and doubles have in Java?

Does a float have 32 binary digits and a double have 64 binary digits? The documentation was too hard to make sense of.

Do all of the bits translate to significant digits? Or does the location of the decimal point take up some of the bits?

This question is related to java floating-point

The answer is


A normal math answer.

A floating-point number is implemented as some bits representing the exponent and the rest, the majority, representing the digits (in the binary system). This leads to the following situation:

With a high exponent, say 10^23, changing the least significant bit produces a large difference between two adjacent distinguishable numbers. Furthermore, because the fraction is stored in base 2, many base-10 numbers can only be approximated: 1/5 and 1/10 have endless binary expansions.
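As a rough illustration of that spacing, the following sketch measures the gap between a double and its next larger neighbour using the standard `Math.nextUp`. The class and method names (`AdjacentGap`, `gapAbove`) are made up for this example.

```java
// Sketch: measure the gap between adjacent doubles at different magnitudes.
class AdjacentGap {
    // Distance from x to the next larger representable double.
    static double gapAbove(double x) {
        return Math.nextUp(x) - x;
    }

    public static void main(String[] args) {
        System.out.println(gapAbove(1.0));    // gap near 1 is 2^-52
        System.out.println(gapAbove(1e23));   // gap near 10^23 is 2^24 = 16777216.0
        System.out.println(0.1 + 0.2 == 0.3); // false: 1/10 and 1/5 are only approximated
    }
}
```

Near 10^23 two neighbouring doubles are already millions apart, although both still carry the same number of significant binary digits.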

So in general: floating point numbers should not be used if you care about exact decimal digits. For monetary amounts and calculations on them, it is best to use BigDecimal.

For physics, floating-point doubles are adequate, floats almost never. Furthermore, the floating-point unit of a processor, the FPU, can even use a bit more precision internally.


From the Java specification:

The floating-point types are float and double, which are conceptually associated with the single-precision 32-bit and double-precision 64-bit format IEEE 754 values and operations as specified in IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Standard 754-1985 (IEEE, New York).

As it's hard to do anything with numbers without understanding IEEE754 basics, here's another link.

It's important to understand that the precision isn't uniform and that this isn't an exact storage of the numbers as is done for integers.

An example:

double a = 0.3 - 0.1;
System.out.println(a);          

prints

0.19999999999999998

If you need arbitrary precision (for example, for financial purposes) you may need BigDecimal.
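A minimal sketch of the same subtraction done with `java.math.BigDecimal`, constructed from strings so that the decimal values are taken exactly as written. The class and helper names (`ExactDecimal`, `subtract`) are only for illustration.

```java
import java.math.BigDecimal;

class ExactDecimal {
    // Subtract two decimal strings exactly using BigDecimal.
    static BigDecimal subtract(String a, String b) {
        return new BigDecimal(a).subtract(new BigDecimal(b));
    }

    public static void main(String[] args) {
        System.out.println(0.3 - 0.1);              // binary double arithmetic
        System.out.println(subtract("0.3", "0.1")); // decimal arithmetic, prints 0.2
    }
}
```

Building the values from strings matters here: `new BigDecimal(0.3)` would start from the already-rounded double.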


Look at Float.intBitsToFloat and Double.longBitsToDouble, which sort of explain how bits correspond to floating-point numbers. In particular, the bits of a normal float look something like

 s * 2^exp * 1.ABCDEFGHIJKLMNOPQRSTUVW

where A...W are 23 bits (0s and 1s) representing a fraction in binary, s is +/- 1, represented by a sign bit of 0 or 1 respectively, and exp is the exponent, stored in 8 bits with a bias of 127 (so normal floats have exponents from -126 to 127).
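To make the layout concrete, here is a small sketch that pulls the three fields out of a float with the standard `Float.floatToIntBits` and puts them back together. The class and method names are hypothetical, and the code assumes a normal (not zero, subnormal, infinite or NaN) value.

```java
class FloatFields {
    // Sign bit: 0 for positive, 1 for negative.
    static int signBit(float f) {
        return Float.floatToIntBits(f) >>> 31;
    }

    // Exponent with the bias of 127 removed (meaningful for normal numbers only).
    static int unbiasedExponent(float f) {
        return ((Float.floatToIntBits(f) >>> 23) & 0xFF) - 127;
    }

    // The 23 stored fraction bits (A...W in the layout above).
    static int fractionBits(float f) {
        return Float.floatToIntBits(f) & 0x7FFFFF;
    }

    // Rebuild a normal float from its fields: s * 2^exp * 1.fraction
    static float rebuild(float f) {
        float sign = signBit(f) == 0 ? 1f : -1f;
        float significand = 1f + fractionBits(f) / (float) (1 << 23);
        return Math.scalb(sign * significand, unbiasedExponent(f));
    }

    public static void main(String[] args) {
        float f = -0.15625f; // -1.25 * 2^-3
        System.out.println(signBit(f) + " " + unbiasedExponent(f) + " "
                + Integer.toBinaryString(fractionBits(f)));
        System.out.println(rebuild(f) == f);
    }
}
```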


float: 32 bits (4 bytes) where 23 bits are used for the mantissa (about 7 decimal digits). 8 bits are used for the exponent, so a float can “move” the decimal point to the right or to the left using those 8 bits. Doing so avoids storing lots of zeros in the mantissa as in 0.0000003 (3 × 10^-7) or 3000000 (3 × 10^7). There is 1 bit used as the sign bit.

double: 64 bits (8 bytes) where 52 bits are used for the mantissa (about 16 decimal digits). 11 bits are used for the exponent and 1 bit is the sign bit.
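One way to see those digit counts in practice is to push integers of different lengths through each type and read them back. This is only a sketch; the class and method names (`DigitLimits`, `viaFloat`, `viaDouble`) are invented for the example.

```java
class DigitLimits {
    // Round-trip an integer through float; returns the value actually stored.
    static long viaFloat(long n) {
        return (long) (float) n;
    }

    // Same round trip through double.
    static long viaDouble(long n) {
        return (long) (double) n;
    }

    public static void main(String[] args) {
        System.out.println(viaFloat(1234567L));            // 7 digits survive
        System.out.println(viaFloat(123456789L));          // 9 digits do not: 123456792
        System.out.println(viaDouble(123456789L));         // fine in a double
        System.out.println(viaDouble(12345678901234567L)); // 17 digits do not all survive
    }
}
```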

Since we are using binary (only 0 and 1), the leading bit of the mantissa is implicitly 1 for normal non-zero numbers and is not stored (both float and double use this trick).

Also, since everything is in binary (mantissa and exponents) the conversions to decimal numbers are usually not exact. Numbers like 0.5, 0.25, 0.75, 0.125 are stored exactly, but 0.1 is not. As others have said, if you need to store cents precisely, do not use float or double, use int, long, BigInteger or BigDecimal.
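The `BigDecimal(double)` constructor converts a double to its exact decimal expansion, which makes it easy to check which literals are stored exactly. The sketch below uses it for that purpose and also shows the "count whole cents in a long" approach; `ExactValue` and `exact` are illustrative names.

```java
import java.math.BigDecimal;

class ExactValue {
    // The exact decimal expansion of the binary value stored in a double.
    static String exact(double d) {
        return new BigDecimal(d).toString();
    }

    public static void main(String[] args) {
        System.out.println(exact(0.5));   // stored exactly: 0.5
        System.out.println(exact(0.125)); // stored exactly: 0.125
        System.out.println(exact(0.1));   // nearest binary fraction, slightly above 0.1

        // Money as whole cents in a long: ten times 10 cents is exactly 100 cents.
        long cents = 0;
        for (int i = 0; i < 10; i++) {
            cents += 10;
        }
        System.out.println(cents);
    }
}
```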

Sources:

http://en.wikipedia.org/wiki/Floating_point#IEEE_754:_floating_point_in_modern_computers

http://en.wikipedia.org/wiki/Binary64

http://en.wikipedia.org/wiki/Binary32


Floating point numbers are encoded using an exponential form, that is something like m * b ^ e, i.e. not like integers at all. The question you ask would be meaningful in the context of fixed point numbers. There are numerous fixed point arithmetic libraries available.

Regarding floating point arithmetic: The number of digits depends on the representation and the number system. For example, 1/3 is periodic in decimal (0.33333...) but has a finite representation in base 3, and 0.1 is finite in decimal but periodic in binary. (The reverse does not happen between base 2 and base 10: every finite binary fraction is also a finite decimal fraction.)

It is also worth mentioning that beyond a certain magnitude adjacent floating point numbers differ by more than one, i.e. value + 1 yields value, since value + 1 cannot be encoded using m * b ^ e, where m and e are fixed in length. The same effect applies at small magnitudes, where adjacent values are much closer than 1 apart: the representable values are not evenly spaced.
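A short sketch of the "value + 1 yields value" effect. The thresholds used are 2^24 for float and 2^53 for double, which follow from the 24-bit and 53-bit significands; the class and method names are made up for this example.

```java
class UnevenSpacing {
    // True when adding 1 to x is lost to rounding (float arithmetic).
    static boolean absorbsOne(float x) {
        return x + 1f == x;
    }

    // True when adding 1 to x is lost to rounding (double arithmetic).
    static boolean absorbsOne(double x) {
        return x + 1.0 == x;
    }

    public static void main(String[] args) {
        float f = 16_777_216f;                // 2^24
        double d = 9_007_199_254_740_992.0;   // 2^53
        System.out.println(absorbsOne(f));     // true
        System.out.println(absorbsOne(d));     // true
        System.out.println(absorbsOne(1000f)); // false
        System.out.println(Math.ulp(0.001f));  // gap is far below 1 for small values
    }
}
```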

Because of this there is no precision of exactly n digits as with fixed point numbers, since not every number with n decimal digits has an IEEE encoding.

There is a nearly obligatory document explaining floating point numbers which you should read: What every computer scientist should know about floating point arithmetic.


A 32-bit float has about 7 digits of precision and a 64-bit double has about 16 digits of precision.

Long answer:

Floating-point numbers have three components:

  1. A sign bit, to determine if the number is positive or negative.
  2. An exponent, to determine the magnitude of the number.
  3. A fraction, which determines how far between two exponent values the number is. This is sometimes called “the significand, mantissa, or coefficient”.

Essentially, this works out to sign * 2^exponent * (1 + fraction). The “size” of the number, its exponent, is irrelevant to us, because it only scales the value of the fraction portion. Knowing that log10(n) gives the number of digits of n,† we can determine the precision of a floating point number with log10(largest_possible_fraction). Because each bit in a float stores 2 possibilities, a binary number of n bits can store a number up to 2^n - 1 (a total of 2^n values where one of the values is zero). This gets a bit hairier, because it turns out that floating point numbers are stored with one less bit of fraction than they can use, because zeroes are represented specially and all non-zero numbers have at least one non-zero binary bit.‡

Combining this, the digits of precision for a floating point number is log10(2^n), where n is the number of bits of the floating point number’s fraction. A 32-bit float has 24 bits of fraction for ~7.22 decimal digits of precision, and a 64-bit double has 53 bits of fraction for ~15.95 decimal digits of precision.
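The arithmetic can be checked with a few lines of Java, using the identity log10(2^n) = n * log10(2). This is just a sketch of the calculation; `DecimalDigits` and `digits` are illustrative names.

```java
class DecimalDigits {
    // Decimal digits carried by a significand of the given bit width: log10(2^bits).
    static double digits(int significandBits) {
        return significandBits * Math.log10(2);
    }

    public static void main(String[] args) {
        System.out.println(digits(24)); // float: roughly 7.22
        System.out.println(digits(53)); // double: roughly 15.95
    }
}
```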

For more on floating point accuracy, you might want to read about the concept of a machine epsilon.
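As a rough sketch, machine epsilon (taken here to mean the gap between 1 and the next larger representable value) can be found by repeated halving and compared against the standard `Math.ulp(1.0)`. The class and method names are hypothetical.

```java
class MachineEpsilon {
    // Halve eps until adding half of it to 1.0 no longer changes the result.
    static double doubleEpsilon() {
        double eps = 1.0;
        while (1.0 + eps / 2 != 1.0) {
            eps /= 2;
        }
        return eps;
    }

    // Same search carried out in float arithmetic.
    static float floatEpsilon() {
        float eps = 1f;
        while (1f + eps / 2 != 1f) {
            eps /= 2;
        }
        return eps;
    }

    public static void main(String[] args) {
        System.out.println(doubleEpsilon() + " vs " + Math.ulp(1.0)); // both 2^-52
        System.out.println(floatEpsilon() + " vs " + Math.ulp(1f));   // both 2^-23
    }
}
```

Some texts define machine epsilon as half of this value, so it is worth checking which convention a given source uses.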


† For n ≥ 1 at least; for other numbers your formula will look more like ⌊log10(|n|)⌋ + 1.

‡ “This rule is variously called the leading bit convention, the implicit bit convention, or the hidden bit convention.” (Wikipedia)