Introduction
Ever wondered why 0.1 + 0.2 ≠ 0.3 when you run it in Python? How are real numbers represented in bytes underneath? What does it even mean for a number to have a “floating” point? These are the sort of questions that popped up in my head when I was writing a simple number parser. It worked for the most part, except there was a small error in computing the exact value. I decided to deep dive into the IEEE 754 floating point specification and figure out what the issue was. Here’s what I learned about floating point numbers along the way.
Representation
Let’s start with the basics. How are these numbers represented inside a computer? There are a limited number of bytes and an infinite number of real numbers, so it should be obvious that not all numbers can be represented in these limited bytes1. Many distinct real numbers end up with the same binary representation after the reduction in precision.
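You can see this aliasing directly in Python. The second literal below is a prefix of the exact decimal value that the bit pattern for 0.1 actually stores, so both literals round to the same 64-bit float:

```python
a = 0.1
b = 0.1000000000000000055511151231257827  # many more digits, same float

# Two different decimal literals, one binary representation.
print(a == b)  # -> True
```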
A floating point number in written format can take many equivalent forms, for example:

1.5 or 15 x 10^-1 or 0.15 x 10^1 or 150 x 10^-2
Taking 9.625 as a simple example, let’s try to convert it into binary ourselves.

The integer part 9 can be converted into 1001

The fractional part 0.625 can be converted into 101

So, is the binary representation 1001101 enough?

We need a way to store the position of the decimal point as well, or that information is lost. Let’s add 4 bits at the beginning to store the decimal place. 1001.101 has its decimal point at the fourth place, so adding 0100 gives 0100 1001101.
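The digit-by-digit conversion can be sketched in a few lines. `to_binary` is a hypothetical helper (not part of any library): it converts the integer part with `bin()` and then peels off fractional bits by repeated doubling.

```python
def to_binary(value, frac_bits):
    # Split into integer and fractional parts.
    int_part, frac = int(value), value - int(value)
    bits = bin(int_part)[2:] + "."
    # Each doubling shifts the binary point one place to the right;
    # the integer digit that pops out is the next fraction bit.
    for _ in range(frac_bits):
        frac *= 2
        bit = int(frac)
        bits += str(bit)
        frac -= bit
    return bits

print(to_binary(9.625, 3))  # -> 1001.101
```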
Looks good, except that it can’t store negative numbers. Let’s add a sign bit at the beginning: 0 for positive and 1 for negative.
Well, this is more or less how the IEEE 754 standard specifies storing these numbers, along with a whole set of rules for normalization and rounding.
Let’s see how a 32-bit single precision floating point number is actually represented in binary format.
IEEE754 Single Precision Floating Point Number
As shown above, certain sets of bits are allocated to store certain information. Here’s the representation of a 32-bit floating point number.
The first bit is the sign bit and it’s 0 for positive numbers and 1 for negative numbers.
The next set of 8 bits is called the exponent. It represents the power to which the radix (or base) is raised in the number.
The rest of the 23 bits, called the significand or mantissa, are used to represent the actual significant digits of the floating point number.
Let’s see how a number is actually converted.

In the third step, we move the decimal point next to the first set bit. This process is called normalization. It ensures that we utilize all the bits in the significand to store significant digits. You wouldn’t want a number’s precision to be wasted because a lot of bits were used to store leading 0s. Additionally, you don’t need to store the first bit, as it’s always 1, so you can store the rest of the bits in the limited space that you have.
In the fourth step, what is that 127 and where did it come from? For a number, the exponent can be positive or negative.

e.g. 1.1 x 2^4 or 1.1 x 2^-4
The 8-bit exponent can store the values 0–255, and 127 is the mid-point. It’s called the bias, and it is added to the actual exponent before storing. If the stored value is between 1–126, the exponent is negative, and if it’s between 128–254, it’s positive. For 127, the exponent is 0. (The values 0 and 255 are reserved for special cases.)
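You can pull the three fields apart with Python’s `struct` module. `float32_fields` is a hypothetical helper written for this post; it reinterprets the float’s bytes as an integer and slices out the sign, biased exponent, and mantissa:

```python
import struct

def float32_fields(x):
    # Reinterpret the float32's bytes as a 32-bit unsigned integer.
    bits = struct.unpack("!I", struct.pack("!f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # stored (biased) exponent
    mantissa = bits & 0x7FFFFF       # 23-bit significand field
    return sign, exponent, mantissa

# 1.5 = 1.1 x 2^0, so the stored exponent is 0 + 127 = 127.
print(float32_fields(1.5))
# -0.25 = -1.0 x 2^-2, so the stored exponent is -2 + 127 = 125.
print(float32_fields(-0.25))
```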
Special Numbers
Infinity
Any number where the exponent is 255 (all bits set) and the significand is all zeros is represented as infinity: inf or -inf based on the sign bit. This is the value for a number that’s too big, in either direction, to represent.
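A quick sketch: constructing both infinities directly from their bit patterns with `struct` (the hex constants below encode exponent 255 with an all-zero significand).

```python
import math
import struct

pos_inf_bits = 0x7F800000  # 0 | 11111111 | 00000000000000000000000
neg_inf_bits = 0xFF800000  # 1 | 11111111 | 00000000000000000000000

# Reinterpret the raw bit patterns as float32 values.
pos_inf = struct.unpack("!f", struct.pack("!I", pos_inf_bits))[0]
neg_inf = struct.unpack("!f", struct.pack("!I", neg_inf_bits))[0]

print(pos_inf, neg_inf)        # -> inf -inf
print(math.isinf(pos_inf))     # -> True
```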
NaN
When the exponent is 255 and the significand is non-zero, it’s a NaN (Not a Number). There are two types of NaN: sNaN and qNaN, distinguished by the first bit of the significand.

qNaN: Left to the implementer how to use it. This does not raise any exception and is the default behavior.

Representation: x 11111111 1xxxxxxxxxxxxxxxxxxxxxx (first bit of the significand set)

sNaN: Usually used to signal an invalid operation and raise an exception.

Representation: x 11111111 0xxxxxxxxxxxxxxxxxxxxxx where at least one of the remaining bits is set, to distinguish it from infinity.
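The same bit-pattern trick shows a NaN in action. The constant below sets exponent 255 plus the top significand bit, which is the quiet-NaN pattern on typical platforms:

```python
import math
import struct

qnan_bits = 0x7FC00000  # 0 | 11111111 | 1000...0 (quiet NaN)
qnan = struct.unpack("!f", struct.pack("!I", qnan_bits))[0]

print(math.isnan(qnan))  # -> True
# NaN is the only value that never compares equal to itself.
print(qnan == qnan)      # -> False
```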
What’s floating in Floating Point numbers?
Well, the decimal point in the number is what’s floating.
When the integer part of the number is huge, the point floats to the left, dropping the less significant digits. When the fractional part is really small and the integer part is zero, the point floats to the right, preserving the significant digits. The position of the decimal point is not fixed; it moves to represent really small numbers and really big numbers. This is in contrast to fixed-point numbers, where the position of the decimal point is fixed and a set number of bits is allocated to both the integer and fractional parts. This limits the range of numbers that fixed-point formats can represent. Maybe we can learn about fixed-point numbers2 another time.
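For contrast, here is a minimal 16.16 fixed-point sketch (my own illustration, not a standard library): the point is pinned after 16 fractional bits, so every value is just an integer scaled by 2^16.

```python
SCALE = 1 << 16  # 16 fractional bits

def to_fixed(x):
    # Store the value as a scaled integer; the point never moves.
    return round(x * SCALE)

def from_fixed(n):
    return n / SCALE

half = to_fixed(0.5)          # 32768
print(from_fixed(half * 3))   # -> 1.5, exact in this format
```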
Distribution of Floating Point numbers
The benefit of floating point numbers is that they can represent both very small and very big numbers efficiently. With a limited set of bits, though, the representable values are not uniform across the real number line. Small numbers can be represented much more precisely than big numbers; the precision drops as we move from 0 toward the max value. Here’s a graph representing that.
From the graph, you can see that there are a lot more representable numbers in the [0–0.5] bucket than in [0.5–1.0].
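You can count this without a graph. Consecutive positive float32 values have consecutive bit patterns, so subtracting bit patterns counts the representable values between two floats (`bits_of` is a small helper written for this demonstration):

```python
import struct

def bits_of(x):
    # Reinterpret a float32's bytes as an unsigned integer.
    return struct.unpack("!I", struct.pack("!f", x))[0]

print(bits_of(0.5) - bits_of(0.0))  # floats in [0, 0.5)  -> 1056964608
print(bits_of(1.0) - bits_of(0.5))  # floats in [0.5, 1)  -> 8388608
```

Over a hundred times as many representable values sit below 0.5 as sit between 0.5 and 1.0.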
Why is 0.1 + 0.2 ≠ 0.3?
When you do 0.1 + 0.2 in a Python shell, you do not get 0.3:

>>> 0.1 + 0.2
0.30000000000000004
By default, Python uses 64 bits to store a float, i.e. a double precision floating point number.

In this representation, the exponent is stored in 11 bits (bias 1023), the mantissa in 52 bits, and the sign still in 1 bit.
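The same field-slicing trick works for doubles; `float64_fields` is a hypothetical helper analogous to the 32-bit one:

```python
import struct

def float64_fields(x):
    # Reinterpret the float64's bytes as a 64-bit unsigned integer.
    bits = struct.unpack("!Q", struct.pack("!d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF      # 11-bit biased exponent
    mantissa = bits & ((1 << 52) - 1)    # 52-bit significand field
    return sign, exponent, mantissa

# 0.1 is normalized to 1.1001100110011... x 2^-4,
# so the stored exponent is -4 + 1023 = 1019.
sign, exponent, mantissa = float64_fields(0.1)
print(exponent - 1023)  # -> -4
```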
0.1 = 00111111 10111001 10011001 10011001 10011001 10011001 10011001 10011010
0.2 = 00111111 11001001 10011001 10011001 10011001 10011001 10011001 10011010
0.3 = 00111111 11010011 00110011 00110011 00110011 00110011 00110011 00110011
While converting all the above numbers, you’ll find that the binary fractional part is repeating. 1001 repeats in 0.1 and 0.2, except at the end, which is 1010 (due to rounding). 0011 repeats in 0.3 till the end.
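You can generate these repeating expansions exactly with `fractions.Fraction`, which does rational arithmetic with no rounding (`fraction_bits` is a small helper written for this post):

```python
from fractions import Fraction

def fraction_bits(f, n):
    # First n binary fraction digits of f, via repeated doubling.
    bits = ""
    for _ in range(n):
        f *= 2
        if f >= 1:
            bits += "1"
            f -= 1
        else:
            bits += "0"
    return bits

print(fraction_bits(Fraction(1, 10), 12))  # -> 000110011001
print(fraction_bits(Fraction(3, 10), 12))  # -> 010011001100
```

Both expansions repeat forever, which is why a finite mantissa has to round them.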
Rounding numbers
Round to nearest, ties to even
This rounding rule3 is defined in the standard4 and tells us how to round a number: round to the nearest representable number, and to the even one if there is a tie. The nearness is determined by the bits which are being truncated. It is similar to how, in decimal, we’d round 2.4 down to 2, round 2.6 up to 3, and round the tie 2.5 to 2 because 2 is even.
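Python’s built-in round() uses the same round-half-to-even (“banker’s rounding”) rule for ties:

```python
# Ties go to the even neighbor; non-ties go to the nearest.
print(round(0.5))  # -> 0  (tie: 0 is even)
print(round(1.5))  # -> 2  (tie: 2 is even)
print(round(2.5))  # -> 2  (tie: 2 is even)
print(round(2.6))  # -> 3  (not a tie)
```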
In the case of 0.1, the repeating pattern is 1001. The truncated part is also 1001, which is greater than the midpoint 1000, so the last nibble 1001 is rounded up to 1010. Similarly, 0.2 is also rounded up.

In the case of 0.3, the truncated part is 0011. Since this is less than 1000, the number rounds down to the nearest representable number, leaving the last nibble 0011.
Adding 0.1 and 0.2
To add the numbers, their exponents must be made equal and then their mantissas can be added.
0.1 = 1.100110011001... x 2^-4
0.2 = 1.100110011001... x 2^-3
To make the exponents the same, we need to move the decimal point to the left for 0.1.
0.1 = 0.1100 11001100 11001100 11001100 11001100 11001100 11001101 x 2^-3 (rounded)
0.2 = 1.1001 10011001 10011001 10011001 10011001 10011001 10011010 x 2^-3
Adding them up

  0.1100 11001100 11001100 11001100 11001100 11001100 11001101 x 2^-3
+ 1.1001 10011001 10011001 10011001 10011001 10011001 10011010 x 2^-3
--------------------------------------------------------------------
 10.0110 01100110 01100110 01100110 01100110 01100110 01100111 x 2^-3
Normalizing

1.0011 00110011 00110011 00110011 00110011 00110011 00110011 1(000) x 2^-2
Since the truncated value is 1000, an exact tie, ties-to-even rounds the last byte up to 00110100.

So, the resulting number is

00111111 11010011 00110011 00110011 00110011 00110011 00110011 00110100

and this number is different from the one we get by just storing 0.3. This value is 0.30000000000000004. The rounding errors add up, resulting in a different value. The mystery should be resolved now!
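In practice, two standard-library tools sidestep this: math.isclose for approximate comparison, and decimal.Decimal for exact base-10 arithmetic.

```python
import math
from decimal import Decimal

# Approximate comparison absorbs the accumulated rounding error...
print(math.isclose(0.1 + 0.2, 0.3))  # -> True

# ...while Decimal avoids binary rounding entirely by working in base 10.
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # -> True
```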
Trying it out in python
Using the struct module in Python, you can convert a float64 to an int64 and then to binary. Here’s how:

$ python3
Python 3.13.1 (main, Dec  3 2024, 17:59:52) [Clang 16.0.0 (clang-1600.0.26.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import struct
>>> bin(struct.unpack("!q", struct.pack("!d", 0.3))[0])[2:].zfill(64)
'0011111111010011001100110011001100110011001100110011001100110011'
>>> bin(struct.unpack("!q", struct.pack("!d", 0.1 + 0.2))[0])[2:].zfill(64)
'0011111111010011001100110011001100110011001100110011001100110100'
Did you know?
Did you know that you won’t see this issue with single precision floating point numbers? You can verify it by following the same process as above.
$ python3
Python 3.13.1 (main, Dec  3 2024, 17:59:52) [Clang 16.0.0 (clang-1600.0.26.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import struct
>>> bin(struct.unpack("!i", struct.pack("!f", 0.3))[0])[2:].zfill(32)
'00111110100110011001100110011010'
>>> bin(struct.unpack("!i", struct.pack("!f", 0.1 + 0.2))[0])[2:].zfill(32)
'00111110100110011001100110011010'