What's the difference between a single precision and double precision floating point operation?

Question

What is the difference between a single precision floating point operation and a double precision floating point operation?

I'm especially interested in practical terms in relation to video game consoles. For example, does the Nintendo 64 have a 64-bit processor, and if it does, would that mean it was capable of double precision floating point operations? Can the PS3 and Xbox 360 pull off double precision floating point operations, or only single precision? And in general use, are the double precision capabilities made use of (if they exist)?

User · Answer

Okay, the basic difference at the machine level is that double precision uses twice as many bits as single. In the usual implementation, that's 32 bits for single and 64 bits for double.

But what does that mean? If we assume the IEEE standard, then a single precision number has 23 stored bits of mantissa and a maximum decimal exponent of about 38; a double precision number has 52 stored bits of mantissa and a maximum decimal exponent of about 308.

The details are at Wikipedia, as usual.
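To make the range figures above concrete, here is a minimal Python sketch. It assumes the platform's `float` is an IEEE 754 double (true for mainstream CPython builds) and uses the standard `struct` module to test whether a value fits in 32 bits. The helper name `fits_in_single` is made up for illustration.

```python
import struct
import sys

# Largest finite IEEE 754 single: all 23 fraction bits set, exponent field 254.
FLOAT32_MAX = (2 - 2**-23) * 2.0**127

def fits_in_single(x: float) -> bool:
    """Return True if x can be packed as a 32-bit float without overflowing."""
    try:
        struct.pack("<f", x)
        return True
    except OverflowError:
        return False

print(FLOAT32_MAX)              # roughly 3.4e38
print(sys.float_info.max)       # roughly 1.8e308
print(sys.float_info.mant_dig)  # 53: 52 stored bits plus the implicit leading 1
print(fits_in_single(1e38), fits_in_single(1e39))
```

A value such as 1e39 is an ordinary double but is out of range for a single, which is the "maximum exponent of about 38" limit in practice.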

User · Answer

Basically, single precision floating point arithmetic deals with 32-bit floating point numbers, whereas double precision deals with 64-bit numbers. The extra bits in double precision increase both the maximum value that can be stored and the precision (i.e. the number of significant digits).
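One way to see the difference in significant digits is to look for the first whole number each format cannot hold exactly. The sketch below assumes Python's `float` is an IEEE 754 double and emulates single precision by round-tripping through `struct`; `to_single` is a hypothetical helper, not a library function.

```python
import struct

def to_single(x: float) -> float:
    """Round x to the nearest 32-bit float and return it as a Python float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# 2**24 still fits in the 24-bit significand of a single; 2**24 + 1 does not.
print(to_single(float(2**24)) == 2**24)          # True
print(to_single(float(2**24 + 1)) == 2**24 + 1)  # False

# The same happens for doubles, but only at 2**53 + 1.
print(float(2**53) == 2**53)          # True
print(float(2**53 + 1) == 2**53 + 1)  # False
```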

User · Answer

First of all, float and double are both used to represent fractional numbers. The difference between the two comes down to how much precision they can store.

For example, say I have to store 123.456789. One type may be able to store only 123.4567, while the other may be able to store the exact 123.456789.

So, basically, we want to know how accurately the number can be stored, and that is what we call precision.

Quoting Alessandro here: "The precision indicates the number of decimal digits that are correct, i.e. without any kind of representation error or approximation. In other words, it indicates how many decimal digits one can safely use."

Float can accurately store about 7-8 significant decimal digits, while double can accurately store about 15-16. (These are significant digits in total, not just digits after the decimal point.) So double can hold roughly twice as many digits as float, which is why double is described as double the float.
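The 123.456789 example can be checked directly. This is a small sketch, assuming Python's `float` is an IEEE 754 double; single precision is emulated by round-tripping through `struct`, and `to_single` is a hypothetical helper name.

```python
import struct

def to_single(x: float) -> float:
    """Round x to the nearest 32-bit float and return it as a Python float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

value = 123.456789

# The single-precision copy drifts after about seven significant digits.
print(f"{to_single(value):.10f}")  # 123.4567871094
# The double keeps every digit that was written.
print(f"{value:.10f}")             # 123.4567890000
```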

User · Answer

Note: the Nintendo 64 does have a 64-bit processor, however:

"Many games took advantage of the chip's 32-bit processing mode as the greater data precision available with 64-bit data types is not typically required by 3D games, as well as the fact that processing 64-bit data uses twice as much RAM, cache, and bandwidth, thereby reducing the overall system performance."

From Webopedia:

"The term double precision is something of a misnomer because the precision is not really double. The word double derives from the fact that a double-precision number uses twice as many bits as a regular floating-point number. For example, if a single-precision number requires 32 bits, its double-precision counterpart will be 64 bits long.

The extra bits increase not only the precision but also the range of magnitudes that can be represented. The exact amount by which the precision and range of magnitudes are increased depends on what format the program is using to represent floating-point values. Most computers use a standard format known as the IEEE floating-point format."

The IEEE double-precision format actually has more than twice as many bits of precision as the single-precision format, as well as a much greater range.

From the IEEE standard for floating point arithmetic:

Single Precision

The IEEE single precision floating point standard representation requires a 32 bit word, which may be represented as numbered from 0 to 31, left to right.

    The first bit is the sign bit, S,
    the next eight bits are the exponent bits, 'E', and
    the final 23 bits are the fraction, 'F':

    S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF
    0 1      8 9                    31

The value V represented by the word may be determined as follows:

    If E=255 and F is nonzero, then V=NaN ("Not a number")
    If E=255 and F is zero and S is 1, then V=-Infinity
    If E=255 and F is zero and S is 0, then V=Infinity
    If 0<E<255 then V=(-1)**S * 2**(E-127) * (1.F), where "1.F" is intended to represent the binary number created by prefixing F with an implicit leading 1 and a binary point.
    If E=0 and F is nonzero, then V=(-1)**S * 2**(-126) * (0.F). These are "unnormalized" values.
    If E=0 and F is zero and S is 1, then V=-0
    If E=0 and F is zero and S is 0, then V=0

In particular,

    0 00000000 00000000000000000000000 = 0
    1 00000000 00000000000000000000000 = -0

    0 11111111 00000000000000000000000 = Infinity
    1 11111111 00000000000000000000000 = -Infinity

    0 11111111 00000100000000000000000 = NaN
    1 11111111 00100010001001010101010 = NaN

    0 10000000 00000000000000000000000 = +1 * 2**(128-127) * 1.0 = 2
    0 10000001 10100000000000000000000 = +1 * 2**(129-127) * 1.101 = 6.5
    1 10000001 10100000000000000000000 = -1 * 2**(129-127) * 1.101 = -6.5

    0 00000001 00000000000000000000000 = +1 * 2**(1-127) * 1.0 = 2**(-126)
    0 00000000 10000000000000000000000 = +1 * 2**(-126) * 0.1 = 2**(-127)
    0 00000000 00000000000000000000001 = +1 * 2**(-126) * 0.00000000000000000000001 = 2**(-149) (smallest positive value)

Double Precision

The IEEE double precision floating point standard representation requires a 64 bit word, which may be represented as numbered from 0 to 63, left to right.

    The first bit is the sign bit, S,
    the next eleven bits are the exponent bits, 'E', and
    the final 52 bits are the fraction, 'F':

    S EEEEEEEEEEE FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
    0 1        11 12                                                63

The value V represented by the word may be determined as follows:

    If E=2047 and F is nonzero, then V=NaN ("Not a number")
    If E=2047 and F is zero and S is 1, then V=-Infinity
    If E=2047 and F is zero and S is 0, then V=Infinity
    If 0<E<2047 then V=(-1)**S * 2**(E-1023) * (1.F), where "1.F" is intended to represent the binary number created by prefixing F with an implicit leading 1 and a binary point.
    If E=0 and F is nonzero, then V=(-1)**S * 2**(-1022) * (0.F). These are "unnormalized" values.
    If E=0 and F is zero and S is 1, then V=-0
    If E=0 and F is zero and S is 0, then V=0

Reference: ANSI/IEEE Standard 754-1985, Standard for Binary Floating Point Arithmetic.
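The single-precision decoding rules above translate almost line for line into code. The following Python sketch is an illustration rather than a reference implementation; the function name `decode_single` is made up, and the result is returned as a Python `float` (a double), which can hold every single-precision value exactly.

```python
import math

def decode_single(bits: int) -> float:
    """Decode a 32-bit pattern using the IEEE 754 single-precision rules."""
    s = (bits >> 31) & 0x1   # sign bit S
    e = (bits >> 23) & 0xFF  # 8 exponent bits E
    f = bits & 0x7FFFFF      # 23 fraction bits F
    sign = -1.0 if s else 1.0
    if e == 255:
        # E=255: NaN when F is nonzero, otherwise signed infinity.
        return math.nan if f else sign * math.inf
    if e == 0:
        # E=0: signed zero, or an unnormalized value 0.F * 2**(-126).
        return sign * (f / 2**23) * 2.0**-126
    # 0<E<255: normalized value 1.F * 2**(E-127).
    return sign * (1 + f / 2**23) * 2.0**(e - 127)

print(decode_single(0b0_10000001_10100000000000000000000))  # 6.5
print(decode_single(0b1_10000001_10100000000000000000000))  # -6.5
print(decode_single(0b0_00000000_00000000000000000000001) == 2.0**-149)  # True
```

The three calls reuse bit patterns from the "In particular" table, so the printed values can be compared against it directly.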

User · Answer

Everyone has explained this in great detail and there is nothing I could add further. Still, I would like to explain it in layman's terms, or plain ENGLISH.

    1.9 is less precise than 1.99
    1.99 is less precise than 1.999
    1.999 is less precise than 1.9999
    ...

A variable able to store or represent 1.9 provides less precision than one able to hold or represent 1.9999. These fractions can amount to a huge difference in large calculations.
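To illustrate how small fractions add up in a long calculation, the sketch below adds 0.1 one hundred thousand times, once rounding every intermediate result to single precision and once in double. It assumes Python's `float` is an IEEE 754 double; `to_single` and `sum_repeated` are hypothetical helper names.

```python
import struct

def to_single(x: float) -> float:
    """Round x to the nearest 32-bit float and return it as a Python float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def sum_repeated(step: float, count: int, single: bool) -> float:
    """Add step to a running total count times, optionally in single precision."""
    if single:
        step = to_single(step)
    total = 0.0
    for _ in range(count):
        total += step
        if single:
            total = to_single(total)  # round after every addition
    return total

n = 100_000
err_single = abs(sum_repeated(0.1, n, single=True) - 10_000)
err_double = abs(sum_repeated(0.1, n, single=False) - 10_000)
print(err_single, err_double)
```

On a typical machine the single-precision error is many orders of magnitude larger than the double-precision one, even though each individual rounding error is tiny.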

User · Answer

Double precision means the numbers take twice the word length to store. On a 32-bit processor, the words are all 32 bits, so doubles are 64 bits. What this means in terms of performance is that operations on double precision numbers take a little longer to execute. So you get a better range, but there is a small hit on performance. This hit is mitigated a little by hardware floating point units, but it's still there.

The N64 used a MIPS R4300i-based NEC VR4300, which is a 64-bit processor, but the processor communicates with the rest of the system over a 32-bit wide bus. So most developers used 32-bit numbers because they are faster, and most games at the time did not need the additional precision (so they used floats, not doubles).

All three systems can do single and double precision floating point operations, but they might not because of performance (although pretty much everything after the N64 used a 32-bit bus, so...).
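The storage cost is easy to measure. This sketch uses Python's standard `array` module, whose `"f"` and `"d"` type codes map to the C `float` and `double` types; it assumes a platform where those are 4 and 8 bytes, which holds on mainstream systems.

```python
from array import array

n = 1_000_000
singles = array("f", [0.0]) * n  # one million 32-bit floats
doubles = array("d", [0.0]) * n  # one million 64-bit floats

# Bytes used by the raw element data of each array.
single_bytes = singles.itemsize * len(singles)
double_bytes = doubles.itemsize * len(doubles)
print(single_bytes, double_bytes)
```

Twice the bytes per element means twice the memory traffic for the same number of values, which is the bandwidth argument made above.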

User · Answer

I read a lot of answers, but none seems to correctly explain where the word double comes from. I remember a very good explanation given by a university professor I had some years ago.

Recalling the style of VonC's answer, a single precision floating point representation uses a word of 32 bits:

    1 bit for the sign, S
    8 bits for the exponent, 'E'
    24 bits for the fraction, also called mantissa or coefficient (even though just 23 are represented). Let's call it 'M' (for mantissa; I prefer this name as "fraction" can be misunderstood).

Representation:

              S  EEEEEEEE   MMMMMMMMMMMMMMMMMMMMMMM
    bits:    31 30      23 22                     0

(Just to point out, the sign bit is the last, not the first.)

A double precision floating point representation uses a word of 64 bits:

    1 bit for the sign, S
    11 bits for the exponent, 'E'
    53 bits for the fraction / mantissa / coefficient (even though only 52 are represented), 'M'

Representation:

               S  EEEEEEEEEEE   MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
    bits:     63 62         52 51                                                  0

As you may notice, I wrote that the mantissa has, in both types, one bit more of information compared to its representation. In fact, the mantissa is a number represented without all its non-significant zeros. For example:

    0.000124 becomes 0.124 × 10^-3
    237.141 becomes 0.237141 × 10^3

This means that the mantissa will always be in the form

    0.a1 a2 ... at × β^p

where β is the base of representation. But since the fraction is a binary number, a1 will always be equal to 1, thus the fraction can be rewritten as 1.a2 a3 ... a(t+1) × 2^p, and the initial 1 can be implicitly assumed, making room for an extra bit, a(t+1).

Now, it's obviously true that the double of 32 is 64, but that's not where the word comes from.

The precision indicates the number of decimal digits that are correct, i.e. without any kind of representation error or approximation. In other words, it indicates how many decimal digits one can safely use.

With that said, it's easy to estimate the number of decimal digits which can be safely used:

    single precision: log10(2^24), which is about 7~8 decimal digits
    double precision: log10(2^53), which is about 15~16 decimal digits
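The two logarithms can be evaluated directly. A short Python sketch using only the standard `math` module; the variable names are arbitrary.

```python
import math

# Decimal digits carried by a 24-bit and a 53-bit significand.
single_digits = math.log10(2**24)
double_digits = math.log10(2**53)

print(round(single_digits, 2))  # 7.22
print(round(double_digits, 2))  # 15.95
print(round(double_digits / single_digits, 2))  # 2.21
```

The ratio comes out slightly above 2, which matches the observation that double precision carries a little more than twice the precision of single.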

User · Answer

According to IEEE 754:

    Standard for floating point storage
    32 and 64 bit standards (single precision and double precision)
    8 and 11 bit exponent respectively
    Extended formats (both mantissa and exponent) for intermediate results

User · Answer

A single precision number uses 32 bits, with the MSB being the sign bit, whereas a double precision number uses 64 bits, again with the MSB being the sign bit.

Single precision:

    SEEEEEEEEFFFFFFFFFFFFFFFFFFFFFFF    (SIGN + EXPONENT + SIGNIFICAND)

Double precision:

    SEEEEEEEEEEEFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF    (SIGN + EXPONENT + SIGNIFICAND)
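The three fields of the double-precision layout can be pulled apart with shifts and masks. This is a minimal sketch, assuming Python's `float` is an IEEE 754 double; `split_double` is a hypothetical helper name.

```python
import struct

def split_double(x: float):
    """Return the (sign, exponent, fraction) bit fields of a 64-bit double."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                  # 1 bit (the MSB)
    exponent = (bits >> 52) & 0x7FF    # 11 bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)  # 52 stored significand bits
    return sign, exponent, fraction

print(split_double(1.0))   # (0, 1023, 0)
print(split_double(-2.0))  # (1, 1024, 0)
```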

User · Answer

As to the question, "Can the PS3 and Xbox 360 pull off double precision floating point operations or only single precision, and in general use are the double precision capabilities made use of (if they exist)?"

I believe that both platforms are incapable of double precision floating point. The original Cell processor only had 32-bit floats, same with the ATI hardware which the Xbox 360 is based on (R600). The Cell got double precision floating point support later on, but I'm pretty sure the PS3 doesn't use that chippery.

