Floating Point Representation

It’s exam season again which means my build time is a bit limited. I thought I’d take the opportunity while I’m studying to write some more content for the basics area of this site. To start out I thought I’d build on my post regarding Binary Representation. With an understanding of binary we can begin to explore one of the more confusing data types: floating point. I know this topic is straying a little bit from audio synthesis, but if you plan on working with microprocessors it’s a good thing to understand.

Binary Fractional Representation

The first piece of this puzzle is understanding how fractional numbers, that is to say non-integer numbers, are represented in binary. You may remember that in binary each bit’s value can be determined by raising 2 to the power of that bit’s position. So the first bit is 2^0 = 1, the second is 2^1 = 2 and so on. For digits on the other side of the decimal point we continue the same pattern but move into negative exponents: the first place after the point is worth 2^-1 = 0.5, the second 2^-2 = 0.25, the third 2^-3 = 0.125, and so on.

This means that if we wanted to express a decimal number like 10.625, we could do so in binary by writing 1010.101 (8 + 2 + 0.5 + 0.125 = 10.625).
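If you’d like to double-check that sum, here’s a quick C sketch (purely illustrative, not tied to any project code) that walks the string "1010.101" and adds up the weight of each 1 bit:

```c
#include <stdio.h>

int main(void) {
    const char *bits = "1010.101";

    /* find the position of the point so we know the weight of the first digit */
    int point = 0;
    while (bits[point] != '\0' && bits[point] != '.')
        point++;

    double weight = 1.0;
    for (int i = 1; i < point; i++)   /* leftmost digit is worth 2^(point-1) */
        weight *= 2.0;

    double value = 0.0;
    for (int i = 0; bits[i] != '\0'; i++) {
        if (bits[i] == '.')
            continue;                 /* the point itself carries no value   */
        if (bits[i] == '1')
            value += weight;
        weight /= 2.0;                /* each place is worth half the last   */
    }

    printf("%s in binary = %g in decimal\n", bits, value);   /* 10.625 */
    return 0;
}
```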

Conversions of Fractional Numbers

You may remember that we can convert an integer from decimal to binary by repeatedly halving it and noting the remainders. There is a similar algorithm for converting fractional numbers; this time, however, we repeatedly double the number and note whether the result has reached one. Let’s take the example 10.625 from earlier, starting with the whole number part (10):

10 / 2 = 5, remainder 0
5 / 2 = 2, remainder 1
2 / 2 = 1, remainder 0
1 / 2 = 0, remainder 1

Reading the remainders from the bottom up, this gives us the binary representation of the whole number portion (1010). Next, for the fractional portion (0.625), we repeatedly double:

0.625 * 2 = 1.25 (note a 1, carry on with 0.25)
0.25 * 2 = 0.5 (note a 0, carry on with 0.5)
0.5 * 2 = 1.0 (note a 1, nothing left)

This time we read the noted digits from the top down to get the fractional part in binary, .101. Putting it together we get the answer from above, 1010.101.
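To make the doubling method concrete, here’s a small C sketch (again, just an illustration) that prints the fractional bits of 0.625 using exactly this procedure:

```c
#include <stdio.h>

int main(void) {
    double frac = 0.625;                  /* fractional part of 10.625 */
    printf(".");
    for (int i = 0; i < 8 && frac > 0.0; i++) {
        frac *= 2.0;                      /* double the number            */
        if (frac >= 1.0) {                /* result reached 1: bit is a 1 */
            printf("1");
            frac -= 1.0;                  /* keep only the fraction       */
        } else {
            printf("0");                  /* otherwise the bit is a 0     */
        }
    }
    printf("\n");                         /* prints .101 */
    return 0;
}
```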

Scientific Notation in Base 2

If you’ve ever had to do math involving very large or very small numbers you’ve probably encountered scientific notation. It’s a fairly straightforward concept. If you have a number with a bunch of zeros, like 30,000,000, you know that dividing by 10 would remove one of those zeroes, and if you divided by 10 seven times you would be left with just 3. That means 3 * 10 * 10 * 10 * 10 * 10 * 10 * 10 is equivalent to 30,000,000. For brevity we just write 3 * 10^7.

This process also works for zeros on the other side of the decimal point; we simply use a negative exponent in these cases. For instance, 0.000015 can be represented as 1.5 * 10^-5. The key here is that we multiply or divide by ten because we’re in a base ten system. So guess what we use in base 2.

You may not have encountered scientific notation in binary, but it works in much the same way. You can slide the point back and forth by multiplying or dividing by 2. For instance, our example from earlier (1010.101) could be represented by writing 1.010101 * 2^3. This is at the core of floating point representation.
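If you want to see this normalisation happen in code, the standard C library’s frexp function splits a number into a mantissa and a power of two (it returns the mantissa in the range 0.5 to 1, so we shift it once to match the 1.xxx convention used here). A rough sketch:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double x = 10.625;
    int e;
    double m = frexp(x, &e);   /* x = m * 2^e, with 0.5 <= m < 1 */
    m *= 2.0;                  /* move to the [1, 2) convention  */
    e -= 1;
    printf("%g = %.6f * 2^%d\n", x, m, e);   /* 10.625 = 1.328125 * 2^3 */
    return 0;
}
```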

Floating Point

So that was a lot of preamble… but what actually is floating point? At a high level, floating point is a way to represent fractional numbers digitally. Through some clever design it allows us to represent both incredibly large and incredibly small numbers with the same basic architecture.

A floating point number is made up of three parts: the sign, the exponent and the mantissa. The first, the sign bit, is the easiest to understand. If the sign bit is a 1 the number is negative; if it’s a 0, the number is positive.

The Mantissa

The mantissa holds the actual digits of the number. If we go back to our example of 10.625 (1010.101), we slide the point until only the first digit is to its left, giving us 1.010101. Take note of how many places we move the point; we’ll need that later. Also notice that regardless of the number we do this with, we will always end up with a 1 on the left of the point. Since we can always assume this 1 will be there, we can actually leave it out of our final representation. This leaves us with 010101, which is what you would find in the mantissa segment of a floating point number.

The Exponent

The final piece is the exponent. This represents how many places the point must be moved from the actual number to reach the mantissa. In our example we’ve moved the point 3 places to the left, but there’s a little more to the story. In order for the exponent to represent both positive and negative values a bias is added. The bias is roughly half the range of numbers that can be represented in the exponent field. That means if we have 8 bits set aside for the exponent (0-255), a bias of 127 is used. That way, even if we have a negative exponent it can still be stored as an unsigned number. Adding 127 to our exponent (3) we end up with a value of 130, which we can write in binary as 1000 0010.
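As a quick sanity check on the bias arithmetic, here’s a tiny C sketch that adds the bias of 127 to our exponent of 3 and prints the resulting 8-bit pattern:

```c
#include <stdio.h>

int main(void) {
    int exponent = 3;                    /* from 1.010101 * 2^3            */
    int bias     = 127;                  /* bias for an 8-bit exponent     */
    unsigned biased = exponent + bias;   /* 130                            */

    printf("biased exponent = %u = ", biased);
    for (int bit = 7; bit >= 0; bit--)   /* print the 8 bits, MSB first    */
        putchar((biased >> bit) & 1 ? '1' : '0');
    putchar('\n');                       /* prints: 10000010               */
    return 0;
}
```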

Putting It Together

Now that we’ve determined the values for our sign bit, exponent and mantissa, we can put them together into a floating point number. We can do so in one of two ways: floating point numbers are divided into either singles or doubles, which refers to the level of precision each has available. A single is made of 32 bits (4 bytes) while a double takes 64 bits (8 bytes). I will go over each of them with our current example.

Single Precision

In single precision we use 32 bits. Of these 32 bits, 1 is used for the sign, 8 for the exponent and 23 for the mantissa. Using the values we determined previously, the single precision representation of 10.625 would be:

S | Exponent | Mantissa

0 | 1000 0010 | 0101 0100 0000 0000 0000 000

Notice I have added zeros to the tail end of the mantissa to fill the available bits. This won’t affect the number. It’s also pretty clear looking at this example that even with the lower precision of a single we can represent an incredible spectrum of numbers. We can use exponents from +127 down to -126 (the all-zero and all-one exponent patterns are reserved for special values). Just to give an idea of the scale, that translates to roughly 38 zeros before or after the decimal point in base 10.
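You don’t have to take my word for this layout. Assuming your compiler stores floats in the IEEE 754 format (true on virtually every modern platform), the following C sketch copies the bytes of 10.625f into an integer and masks out the three fields:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float x = 10.625f;
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);                   /* reinterpret the 4 bytes */

    unsigned long sign     = (bits >> 31) & 0x1;      /*  1 bit                  */
    unsigned long exponent = (bits >> 23) & 0xFF;     /*  8 bits                 */
    unsigned long mantissa =  bits        & 0x7FFFFF; /* 23 bits                 */

    printf("sign     = %lu\n", sign);                            /* 0        */
    printf("exponent = %lu (biased), %ld (unbiased)\n",
           exponent, (long)exponent - 127);                      /* 130, 3   */
    printf("mantissa = 0x%06lX\n", mantissa);                    /* 0x2A0000 */
    return 0;
}
```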

Double Precision

In case a single doesn’t provide the range or precision necessary, another, larger option is available. With double precision we use 64 bits to represent a number. This is divided into 1 sign bit, 11 bits for the exponent and 52 bits for the mantissa.

One important note is that since we now have more than 8 bits for the exponent we need to adjust our bias. 11 bits provides a range from 0-2047. Half of that range gives us a bias of 1023. This makes the exponent for our example 1026 (1023+3). In binary 1026 is 1000 0000 010. By putting this together with the other results calculated earlier we get a double precision floating point representation as follows:

S | Exponent | Mantissa

0 | 1000 0000 010 | 0101 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

With 11 bits available for the exponent the available number range has grown exponentially (pun intended). We can now use exponents anywhere from +1023 down to -1022 (again, the all-zero and all-one patterns are reserved). These numbers actually go beyond what my poor Casio calculator is capable of outputting. In base 10 they equate to approximately 307 zeroes on either side of the decimal place!
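The same trick works for doubles. Here’s the 64-bit version of the sketch from the single precision section, again assuming IEEE 754 storage:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    double x = 10.625;
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);                                  /* reinterpret the 8 bytes */

    unsigned long long sign     = (bits >> 63) & 0x1;                /*  1 bit                  */
    unsigned long long exponent = (bits >> 52) & 0x7FF;              /* 11 bits                 */
    unsigned long long mantissa =  bits        & 0xFFFFFFFFFFFFFULL; /* 52 bits                 */

    printf("sign     = %llu\n", sign);                               /* 0               */
    printf("exponent = %llu (biased), %lld (unbiased)\n",
           exponent, (long long)exponent - 1023);                    /* 1026, 3         */
    printf("mantissa = 0x%013llX\n", mantissa);                      /* 0x5400000000000 */
    return 0;
}
```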

In Closing

Floating point representation can seem really intimidating at first. That being said, with some practice it starts to make sense and you’ll begin to develop an intuition while working with it. As you go further with micro-controllers you will start encountering situations where you have to send and receive data, read or write registers or make bare-metal conversions between data types. In any of these situations a strong grasp of floating point (and other common data types) will serve you well.

Binary Representation

Since I’ve been working so much lately in the digital space I thought it would be pertinent to do a quick review today. I want to spend some time on one of the most fundamental ideas in digital logic, Binary Representation. I know it’s not the most exciting topic but understanding binary numbers intuitively is critical to understanding the inner workings of digital devices.

Why Do We Care?

You’ve undoubtedly run into binary numbers in pop culture. They seem to appear any time a screenwriter wants to convey that a character “speaks computer.” But what do these zeros and ones mean? And more importantly, why do we care?

At the core, the answer is that a computer has no idea what a 7 is. Computers are made up of millions (or billions) of transistors. These transistors only have two states, high and low, or if you prefer, zero and one. This means everything you store in memory and any instructions you send to your processor need to be written as a series of these zeroes and ones. That goes for all your video files, pictures, video games and even your operating system itself. As far as your computer is concerned it’s all binary.

We can make it a long way working in high-level languages like C or Python, but inevitably there will come a time when you have to write directly to a register or transmit raw data. This is when binary will serve you. These situations are doubly likely to occur if you are working with microprocessors, as both memory and power are limited. Further, in time-sensitive situations (like audio processing) writing straight to a register is typically faster and more efficient than using high-level code.

How Does It Work?

The numbers we are familiar with are known as base 10 (or decimal) numbers. This means each digit can be one of 10 possible values (0-9). If I add one to 9, the ones digit resets to zero and the tens digit is incremented to 1 (giving you 10). When you were first learning to add numbers together you may have been taught to write the two numbers one atop the other and add each digit individually, carrying to the next digit when your answer was more than 9. This gets at the core of the base 10 system.

The binary system is no great magic trick. We simply change the base to 2. This means each digit can only hold one of two values (0 or 1). As you count upwards you start with 0. Adding one gives you 1. When you try to add another one you have to carry over to the next digit (just like adding one to 9) giving you 10. To help clarify this process I’ve written out the binary representations of the numbers 0 to 15.

Binary Representation of 0-15

It’s good to note that there are other bases commonly used as well. In computation, hexadecimal (base 16) is frequently used to make very large numbers manageable. In hexadecimal we use the letters A-F to represent 10-15.
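If you’d like to generate the table above yourself, here’s a short C sketch that prints 0 through 15 in binary and hexadecimal:

```c
#include <stdio.h>

int main(void) {
    for (int n = 0; n <= 15; n++) {
        printf("%2d = ", n);
        for (int bit = 3; bit >= 0; bit--)        /* 4 bits, MSB first */
            putchar((n >> bit) & 1 ? '1' : '0');
        printf(" = 0x%X\n", n);                   /* hexadecimal form  */
    }
    return 0;
}
```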

How Many Bits?

Notice in the previous table I used 4 digits and was able to represent numbers from 0 to 15 before I ran out of space. These digits are usually referred to as bits, and they govern how large a number you can represent. This shouldn’t be too surprising, as it’s exactly how decimal works too (2 digits can represent numbers up to 99, 4 digits up to 9999).

So how do we know how many bits we need? In decimal, each new digit is worth ten times the previous one (1, 10, 100, 1,000 and so on). In binary we can use a similar rule, except since it’s base 2 each new digit is worth double the previous one. Here I have shown the value of a one in each of the first 8 digits to illustrate this rule.

Values of Binary Digits

Additionally, in decimal you can find the maximum value you can obtain with a given number of digits using the following formula:

N = 10^n - 1

Where n is the number of digits and N is the highest value possible. We can do the same in binary by swapping the 10 for a 2:

N = 2^n - 1

Using this formula we can calculate the range of numbers available given any number of bits:

Maximum Value Based on Quantity Of Bits
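In case you’d like to generate a table like this yourself, here’s a small C sketch (just an illustration) that applies the 2^n - 1 formula to the first 16 bit widths:

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    for (int n = 1; n <= 16; n++) {
        uint32_t max = ((uint32_t)1 << n) - 1;   /* 2^n - 1 */
        printf("%2d bits -> max value %lu\n", n, (unsigned long)max);
    }
    return 0;
}
```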

Conversions

There are various methods to convert between decimal and binary. The route I have always found easiest, though, involves repeatedly dividing the number by two. Each time you divide by two you check if there is a remainder and note it (it will always be 1 if it exists). If there is no remainder (i.e. the number is even) you note a zero. When you finish, you reverse the order of the digits you have noted to see the binary representation.

Let’s try applying this algorithm to 42:

Binary Conversion of 42

We can see that the binary conversion of 42 is 101010 by reading the remainders from the bottom up. Additionally, you can verify your answer by multiplying each digit by the values determined earlier in this article (1*32 + 1*8 + 1*2 = 42). Doing this you should get back your original number.
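And here’s the repeated-halving method as a small C sketch, collecting the remainders for 42 and printing them in reverse:

```c
#include <stdio.h>

int main(void) {
    int n = 42;
    char digits[32];
    int count = 0;

    while (n > 0) {
        digits[count++] = '0' + (n % 2);  /* note the remainder (0 or 1) */
        n /= 2;                           /* then halve the number       */
    }
    for (int i = count - 1; i >= 0; i--)  /* read from the bottom up     */
        putchar(digits[i]);
    putchar('\n');                        /* prints: 101010              */
    return 0;
}
```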

Closing

Before I finish up for the day I have one final question: how high can you count on your fingers? If you answered 10 you’re not thinking with portals yet. We have 10 fingers, each of which can be either extended or folded. If we use binary counting we can reach 2^10 - 1. That’s 1023! You’ll never need a calculator again!

That’s all for me today. I hope you’ve found this refresher helpful, I’ll be back soon with further updates to my Arduino R2R DAC project.