Floating Point Representation

It’s exam season again, which means my build time is a bit limited. I thought I’d take the opportunity while I’m studying to write some more content for the basics area of this site. To start out I thought I’d build on my post regarding Binary Representation. With an understanding of binary we can begin to explore one of the more confusing data types: floating point. I know this topic strays a little from audio synthesis, but if you plan on working with microprocessors it’s a good thing to understand.

Binary Fractional Representation

The first piece of this puzzle is understanding how fractional numbers, that is, non-integer numbers, are represented in binary. You may remember that in binary each bit’s value can be determined by raising 2 to the power of that bit’s position. So the first bit is 2^0 = 1, the second is 2^1 = 2 and so on. For digits on the other side of the decimal point we continue the same pattern but move into negative exponents. The table below should offer some clarity.

Bit Position | Value
2^3 | 8
2^2 | 4
2^1 | 2
2^0 | 1
2^-1 | 0.5
2^-2 | 0.25
2^-3 | 0.125

This means that if we wanted to express a decimal number like 10.625, we could do so in binary by writing 1010.101. Reading off the weights, that’s 8 + 2 + 0.5 + 0.125 = 10.625.
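To make that concrete, here’s a minimal sketch in C that evaluates a binary string by summing the weight of each 1 bit. The function name binary_to_double is my own illustrative helper, not a standard routine.

    #include <stdio.h>
    #include <string.h>

    /* Illustrative helper: evaluate a binary string such as "1010.101"
       by adding the power-of-two weight of every 1 bit. */
    double binary_to_double(const char *bits)
    {
        const char *point = strchr(bits, '.');
        size_t int_len = point ? (size_t)(point - bits) : strlen(bits);
        double value = 0.0;
        double weight = 1.0;

        /* Integer part, scanned right to left: weights 2^0, 2^1, 2^2, ... */
        for (size_t i = int_len; i-- > 0; ) {
            if (bits[i] == '1')
                value += weight;
            weight *= 2.0;
        }

        /* Fractional part, scanned left to right: weights 2^-1, 2^-2, ... */
        if (point) {
            weight = 0.5;
            for (const char *p = point + 1; *p != '\0'; ++p) {
                if (*p == '1')
                    value += weight;
                weight *= 0.5;
            }
        }
        return value;
    }

    int main(void)
    {
        printf("%f\n", binary_to_double("1010.101")); /* prints 10.625000 */
        return 0;
    }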

Conversions of Fractional Numbers

You may remember that we can convert an integer from decimal to binary by repeatedly halving it and noting the remainders. There is a similar algorithm for converting fractional numbers; this time we repeatedly double the number and note whether the result reaches one. Taking the example of 10.625 from earlier, we start with the whole number part (10):

10 / 2 = 5 | remainder 0
5 / 2 = 2 | remainder 1
2 / 2 = 1 | remainder 0
1 / 2 = 0 | remainder 1

Reading from the bottom up, this gives us the binary representation of the whole number portion (1010). Next, for the fractional portion (0.625), we repeatedly double:

0.625 * 2 = 1.25 | bit 1 (drop the 1, carry on with 0.25)
0.25 * 2 = 0.5 | bit 0
0.5 * 2 = 1.0 | bit 1 (nothing left over, so we stop)

This time we read from the top down to get the fractional part in binary, .101. Putting the two together we get the answer from above: 1010.101.
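The hand algorithm translates almost directly into code. Below is a small sketch in C; print_binary and the 12-bit cap on the fraction are my own choices for illustration, the cap being there because many decimal fractions (0.1, for example) never terminate in binary.

    #include <stdio.h>

    /* Sketch of the hand algorithm: repeatedly halve the integer part
       (collecting remainders) and repeatedly double the fractional part
       (collecting the bits that cross the decimal point). */
    void print_binary(double x, int max_frac_bits)
    {
        unsigned long whole = (unsigned long)x;
        double frac = x - (double)whole;

        /* Integer part: remainders come out least significant first,
           so buffer them and print in reverse ("read from the bottom up"). */
        char buf[64];
        int n = 0;
        do {
            buf[n++] = '0' + (char)(whole % 2);
            whole /= 2;
        } while (whole > 0);
        while (n > 0)
            putchar(buf[--n]);

        /* Fractional part: each doubling that reaches 1 emits a 1 bit,
           most significant first ("read from the top down"). */
        if (frac > 0.0) {
            putchar('.');
            for (int i = 0; i < max_frac_bits && frac > 0.0; ++i) {
                frac *= 2.0;
                if (frac >= 1.0) {
                    putchar('1');
                    frac -= 1.0;
                } else {
                    putchar('0');
                }
            }
        }
        putchar('\n');
    }

    int main(void)
    {
        print_binary(10.625, 12); /* prints 1010.101 */
        return 0;
    }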

Scientific Notation in Base 2

If you’ve ever had to do math involving very large or very small numbers you’ve probably encountered scientific notation. It’s a fairly straightforward concept. If you have a number with a bunch of zeros, like 30,000,000, you know that dividing by 10 would remove one of the zeroes, and if you divided by 10 seven times you would be left with just 3. That means 3 * 10 * 10 * 10 * 10 * 10 * 10 * 10 is equivalent to 30,000,000. For brevity we just write 3 * 10^7.

This process also works for zeros on the other side of the decimal point; we simply use a negative exponent in these cases. For instance, 0.000015 can be represented as 1.5 * 10^-5. The key here is that we multiply or divide by ten because we’re in a base ten system. So guess what we use in base 2.

You may not have encountered scientific notation in binary but it works in much the same way: you can slide the decimal point back and forth by multiplying or dividing by 2. For instance, our example from earlier (1010.101) could be represented by writing 1.010101 * 2^3. This idea is at the core of floating point representation.
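If you want to see this normalization in standard C, frexp() from math.h does something very close, though note it normalizes the mantissa into the range [0.5, 1) rather than the 1.m form used above, so its exponent comes out one higher. A minimal sketch (you may need to link with -lm):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        int e;
        /* frexp() splits a double into m * 2^e with m in [0.5, 1),
           i.e. 0.1010101 * 2^4 rather than 1.010101 * 2^3. Doubling m
           and dropping e by one recovers the 1.m form. */
        double m = frexp(10.625, &e);
        printf("%.7f * 2^%d\n", m, e);            /* 0.6640625 * 2^4 */
        printf("%.6f * 2^%d\n", m * 2.0, e - 1);  /* 1.328125 * 2^3 */
        return 0;
    }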

Floating Point

So that was a lot of preamble… but what actually is floating point? At a high level, floating point is a way to represent fractional numbers digitally. Through some clever design it allows us to represent both incredibly large and incredibly small numbers with the same basic architecture.

A floating point number is made up of three parts: the sign, the exponent and the mantissa. The first, the sign bit, is the easiest to understand. If the sign bit is a 1 the number is negative; if it’s a 0 the number is positive.
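You can inspect the sign bit of any value directly with the standard signbit() macro from math.h; a quick sketch:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        /* signbit() reports whether the stored sign bit is set;
           it returns nonzero for negative values and 0 otherwise. */
        printf("%d\n", signbit(-10.625) != 0); /* 1: sign bit set */
        printf("%d\n", signbit(10.625) != 0);  /* 0: sign bit clear */
        return 0;
    }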

The Mantissa

The mantissa holds the actual digits of the number. If we go back to our example of 10.625 (1010.101), we would slide the decimal point until it sits after the first digit, giving us 1.010101. Take note of how many places we move the decimal point; we’ll need that later. Also notice that no matter which (normalized) number we do this with, we will always have a 1 on the left of the decimal point. Since we can always assume this 1 will be there, we can actually leave it out of our final representation. This leaves us with 010101, which is what you would find in the mantissa segment of a floating point number.

The Exponent

The final piece is the exponent. This records how many places the decimal point must be moved from the actual number to reach the mantissa’s form. In our example we’ve moved the decimal point 3 places to the left, but there’s a little more to the story. In order for the exponent to represent both positive and negative values a bias is added. The bias is typically half the maximum value the exponent field can hold, rounded down. That means if we have 8 bits set aside for the exponent (0-255) a bias of 127 is used. That way, even if we have a negative exponent it can still be stored as an unsigned number. Adding 127 to our exponent (3) we end up with a value of 130, which we can write in binary as 1000 0010.
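As a sanity check, here’s a tiny C sketch that derives the stored exponent field for our example. The bias of 127 is specific to single precision, and remember from earlier that frexp()’s exponent sits one above the 1.m form:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        int e;
        frexp(10.625, &e);             /* e = 4 for the 0.m form */
        int unbiased = e - 1;          /* 3: exponent of the 1.m form */
        int field    = unbiased + 127; /* 130 = 1000 0010, the stored field */
        printf("unbiased %d, stored %d\n", unbiased, field);
        return 0;
    }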

Putting It Together

Now that we’ve determined the values for our sign bit, exponent and mantissa we can put them together into a floating point number. We can do so in one of two ways: floating point numbers are divided into singles and doubles, which refer to the level of precision each has available. A single is made of 32 bits (4 bytes) while a double takes 64 bits (8 bytes). I’ll go over each of them with our current example.

Single Precision

In single precision we use 32 bits. Of these 32 bits, 1 is used for the sign, 8 for the exponent and 23 for the mantissa. Using the values we determined previously, the single precision representation of 10.625 is:

S | Exponent | Mantissa

0 | 1000 0010 | 0101 0100 0000 0000 0000 000

Notice I have added zeros to the tail end of the mantissa to fill the available bits; this doesn’t affect the number. It’s also pretty clear looking at this example that even with the lower precision of a single we can represent an incredible spectrum of numbers. Normal numbers can use exponents from -126 up to 127 (the all-zero and all-one exponent patterns are reserved for special values like zero and infinity). Just to give an idea of the scale there, that would approximately translate to a number with 38 zeros before or after the decimal point in base 10.
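If you’d like to verify this on a real machine, here is a minimal sketch in C that pulls the three fields out of a float and then rebuilds the value, restoring the hidden leading 1. The masks follow the 1/8/23 split described above; you may need to link with -lm for ldexp().

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <math.h>

    int main(void)
    {
        float f = 10.625f;
        uint32_t bits;

        /* Copy the float's bytes into an integer so the fields can be masked out. */
        memcpy(&bits, &f, sizeof bits);

        uint32_t sign     = bits >> 31;           /* 1 bit   */
        uint32_t exponent = (bits >> 23) & 0xFF;  /* 8 bits  */
        uint32_t mantissa = bits & 0x7FFFFF;      /* 23 bits */

        printf("sign     = %u\n", sign);          /* 0 */
        printf("exponent = %u (unbiased %d)\n",
               exponent, (int)exponent - 127);    /* 130 (unbiased 3) */
        printf("mantissa = 0x%06X\n", mantissa);  /* 0x2A0000 = 010 1010 0... */

        /* Rebuild the value: restore the hidden 1, then scale by 2^(e-127).
           8388608.0 is 2^23, the weight of the mantissa's leading bit. */
        double rebuilt = ldexp(1.0 + mantissa / 8388608.0, (int)exponent - 127);
        printf("rebuilt  = %f\n", rebuilt);       /* 10.625000 */
        return 0;
    }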

Double Precision

In case a single doesn’t provide the range or precision necessary, another, larger option is available. With double precision we use 64 bits to represent a number. This is divided into 1 sign bit, 11 bits for the exponent and 52 bits for the mantissa.

One important note is that since we now have more than 8 bits for the exponent we need to adjust our bias. 11 bits provides a range from 0-2047, and half of that range (rounded down) gives us a bias of 1023. This makes the exponent for our example 1026 (1023 + 3), which in binary is 1000 0000 010. By putting this together with the other results calculated earlier we get a double precision floating point representation as follows:

S | Exponent | Mantissa

0 | 1000 0000 010 | 0101 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

With 11 bits available for the exponent the available number range has grown exponentially (pun intended). Normal numbers can now use exponents anywhere from -1022 up to 1023 (again, the all-zero and all-one patterns are reserved). These numbers actually go beyond what my poor Casio calculator is capable of outputting. In base 10 they equate to approximately 307 zeroes on either side of the decimal place!
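The field extraction sketch from the single precision section carries over to doubles with 64-bit masks matching the 1/11/52 split; this is just that example widened:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void)
    {
        double d = 10.625;
        uint64_t bits;
        memcpy(&bits, &d, sizeof bits);

        uint64_t sign     = bits >> 63;                /* 1 bit   */
        uint64_t exponent = (bits >> 52) & 0x7FF;      /* 11 bits */
        uint64_t mantissa = bits & 0xFFFFFFFFFFFFFull; /* 52 bits */

        printf("sign     = %u\n", (unsigned)sign);     /* 0 */
        printf("exponent = %u (unbiased %d)\n",
               (unsigned)exponent, (int)exponent - 1023); /* 1026 (unbiased 3) */
        printf("mantissa = 0x%013llX\n",
               (unsigned long long)mantissa);          /* 0x5400000000000 */
        return 0;
    }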

In Closing

Floating point representation can seem really intimidating at first. That being said, with some practice it starts to make sense and you’ll begin to develop an intuition while working with it. As you go further with micro-controllers you will start encountering situations where you have to send and receive data, read or write registers, or make bare-metal conversions between data types. In any of these situations a strong grasp of floating point (and other common data types) will serve you well.