Embedded Wednesdays: Floating Point Numbers

Last week, as a part of getting up to speed in the C language, we looked at the first data type, integers. This week we will look at floating point numbers.

Let’s start with some examples: 2,000,000.6 has a single digit after the decimal point, 3.14159265359 has a whole bunch of digits after the decimal point, and 6.022140857 × 10^23 is a really big number. A floating point number, in computing, is a number that has digits before (characteristic) and after (mantissa) the decimal point, whereas integers are whole numbers with no mantissa.

The C language provides support for three types of floating point variables: float, double, and long double. Commonly, processors will use IEEE 754 format, but C doesn’t explicitly specify the format or precision.

In IEEE 754, the float data type, also known as single precision, is a 32-bit value that gives you a range of ±1.18×10^−38 to ±3.4×10^38 and about 7 digits of precision. That means that you can only accurately represent pi as 3.141592. That's fewer digits than you might expect.

Not good enough for you? Well, the double data type, also known as double precision, is a 64-bit value with a range of ±2.23×10^−308 to ±1.80×10^308 and almost 16 digits of precision. Our value of pi can be 3.141592653589793.
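The ranges and digit counts quoted above can be read back from the compiler itself. Here is a minimal sketch, assuming a C99 compiler that uses IEEE 754 for float and double; the function name print_float_limits is made up for this example. FLT_DIG and DBL_DIG count the decimal digits that are guaranteed to survive a trip through the type, so they come out a little lower (6 and 15 on IEEE 754 systems) than the "about 7" and "almost 16" figures.

```c
#include <float.h>
#include <stdio.h>

/* Hypothetical helper: print the limits the compiler reports for float and double. */
void print_float_limits(void)
{
    /* FLT_MIN and DBL_MIN are the smallest normalized positive values,
       FLT_MAX and DBL_MAX are the largest finite values. */
    printf("float : %e to %e, %d digits\n", FLT_MIN, FLT_MAX, FLT_DIG);
    printf("double: %e to %e, %d digits\n", DBL_MIN, DBL_MAX, DBL_DIG);
}
```

Call print_float_limits() once from main on your target to see what your toolchain actually provides.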

With integers, we use the stdint.h header file to give us integer type names with known sizes. Since C doesn’t explicitly give sizes for the floating point types and there is no equivalent header file, you need to read the documentation for your compiler to figure out which keywords give you a 32-bit or a 64-bit float. Depending on your compiler, a 64-bit float could be a double or a long double. I use something like the following typedef statements, which need to change to suit the compiler in use:

typedef float float32_t;
typedef double float64_t;

and then I always use float32_t and float64_t in my code.
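One way to catch a compiler where these typedefs are wrong is to have the build check them. This is only a sketch, assuming a C11 compiler (for _Static_assert); the message strings are my own.

```c
#include <limits.h>

typedef float  float32_t;
typedef double float64_t;

/* C11 compile-time checks: the build stops if the sizes are not what the names promise. */
_Static_assert(sizeof(float32_t) * CHAR_BIT == 32, "float32_t must be 32 bits");
_Static_assert(sizeof(float64_t) * CHAR_BIT == 64, "float64_t must be 64 bits");
```

If either check fails, the compiler prints the message and you know the typedefs need adjusting for that toolchain.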

Floating Point Hardware

The ARM Cortex-M4F STM32F407 has single precision (32-bit) floating point hardware built in. It can add, subtract, and multiply floating point numbers together in a single clock cycle. Divide and square root take about 14 cycles, but still, that is awesome.

If I needed double precision, the compiler would fall back on software library routines that simulate the 64-bit calculation using 32-bit values. Slower, but still pretty quick.

Atmel's new Cortex-M7 processor, the SAM S70, has double precision floating point hardware. 64-bit floating point calculation in one cycle.

A Cortex-M3 processor does not support floating point instructions in hardware. They are always done in software libraries using 32-bit integers. This is quite a bit slower than native floating point operations.

On an 8-bit AVR-based Arduino (such as the Uno), there is no floating point built in, just 8-bit integers. You can use floats and doubles in your program, but the calculations will be simulated using 8-bit integers. It will be very slow.


Floating point constants are assumed to be doubles (64-bit). So

float32_t circumference = diameter * 3.141; 

will generate code to do the calculation as 64-bit numbers, and then reduce the precision to fit into 32 bits. This can be quite slow. You append an F to indicate that a constant is a single precision float:

float32_t circumference = diameter * 3.141F; 

That will be calculated using 32-bit values.
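You can ask the compiler which type it picked for each expression. The sketch below assumes float is 4 bytes and double is 8 bytes; the two function names are invented for illustration.

```c
#include <stddef.h>

typedef float float32_t;

/* sizeof does not evaluate its operand; it only reports the size of the type
   the compiler chose for the expression. */

/* Without the suffix, the float is promoted and the multiply is done as double. */
size_t size_with_double_constant(float32_t diameter)
{
    return sizeof(diameter * 3.141);
}

/* With the F suffix, the whole expression stays in single precision. */
size_t size_with_float_constant(float32_t diameter)
{
    return sizeof(diameter * 3.141F);
}
```

On a system with 32-bit floats and 64-bit doubles, the first function returns 8 and the second returns 4.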

To have constants in scientific notation, use an ‘e’ before the power of 10 giving numbers like:

float32_t avo = 6.022141e23F;
float32_t twoMillion = 2.e6F;

Using a lower case ‘e’ for the exponent makes it more recognizable. Compare the same constant written both ways:

float32_t deBruijn = -1.1E-12F;   /* upper case E */
float32_t deBruijn = -1.1e-12F;   /* lower case e, easier to spot */

Float Subroutine Library

Your C compiler will come with a library of math related functions (#include <math.h>), but keep in mind that a lot of these routines act on double values. On embedded processors with only single precision floating point in hardware, the double routines will calculate their answers in software. Alternative functions (like sqrtf instead of sqrt to take the square root) expect floats instead of doubles, and will run faster, especially if they map directly onto hardware instructions. The libraries do not pick the best routine to suit your program; you must check the documentation to see which routine to call.

That’s Not A Number!

The floating point system has a couple of neat things. One is something that will put a smile on the faces of math majors, and the other is seen on some badly written web pages: infinity and “not a number”.

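Positive or negative infinity is the result when the math would normally give you an infinity and is not an error, like when you divide a nonzero floating point number by zero. (The tangent of 90 degrees is the textbook example, but in practice it comes out as a very large finite number, because π/2 cannot be represented exactly in binary.)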

Far too often, I see web pages that have the text “NaN” in one of the number boxes. This means that the floating point value that has been pasted in the box is “not a number”: a special bit pattern, reserved by the format, that does not stand for any numeric value. This is the value that you get for the typical math errors of zero divided by zero and square root of negative numbers.
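Both special values can be produced and detected in a few lines. This is a sketch that assumes IEEE 754 arithmetic with the default (non-trapping) behaviour; divide32 and is_usable are hypothetical names, while isfinite, isinf, and isnan are the standard C99 macros from math.h.

```c
#include <math.h>
#include <stdbool.h>

typedef float float32_t;

/* Divide at run time; volatile stops the compiler from folding the division away. */
float32_t divide32(float32_t numerator, float32_t denominator)
{
    volatile float32_t n = numerator;
    volatile float32_t d = denominator;
    return n / d;
}

/* Hypothetical helper: true for ordinary values, false for infinity or NaN. */
bool is_usable(float32_t value)
{
    return isfinite(value);
}
```

With these, divide32(1.0F, 0.0F) gives positive infinity, divide32(-1.0F, 0.0F) gives negative infinity, and divide32(0.0F, 0.0F) gives NaN. A NaN is also the only value that is not equal to itself, so checking a result with is_usable before displaying it is one way to keep "NaN" off the screen.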

Repeating Binary Digits

Computers have a problem: since they are binary machines, they have trouble representing decimal fractions accurately. Computers can accurately handle fractions like ½, ¼, ⅛, that have powers of 2 in the denominator, and integer multiples of them. But other numbers can’t be represented accurately. As a 32-bit float, the number 1/10 is stored as 00111101110011001100110011001101 in binary. The problem is the repeating pattern of 0011, which has to be cut off (and rounded) when the bits run out. The stored value is never exactly equal to 1/10 (much like ⅓ never ends in decimal: 0.33333333333333...).
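If you want to look at the stored pattern yourself, you can copy the float into an integer and print it. A minimal sketch, assuming a 32-bit IEEE 754 float; float_bits is a name invented for this example.

```c
#include <stdint.h>
#include <string.h>

typedef float float32_t;

/* Copy the raw bytes of a float into a 32-bit integer so the bit pattern can be inspected.
   memcpy avoids the undefined behaviour of casting pointers between unrelated types. */
uint32_t float_bits(float32_t value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    return bits;
}
```

For 0.1F this returns 0x3DCCCCCD: a sign bit of 0, an exponent field of 01111011, and a fraction that repeats 1100 until the final bits are rounded up to 1101. A value like 0.5F gives 0x3F000000, with a fraction of all zeros, because it is an exact power of 2.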

To see the problem, let’s start with zero and add 1/10th, 10 times. We should get the value 1.

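#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>   /* for EXIT_SUCCESS */

typedef float float32_t;

int main(void) {
    uint8_t i;
    float32_t val;

    val = 0.F;
    /* Add one tenth, ten times over. */
    for (i = 0; i < 10; i++) {
        val += 0.1F;
    }
    /* Exact comparison: this is the test that fails. */
    if (val == 1.0F) {
        printf("yes\n");
    } else {
        printf("no %9.8f\n", val);
    }
    return EXIT_SUCCESS;
}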

The result “no 1.00000012” is printed to the screen when this runs.

This is a problem when trying to compare any floating point value with another. You cannot simply say:

if (val == 1.0F) 

The little discrepancies that have crept in will cause your comparison to fail. The alternative is to check to see if the value is close enough for what we need.

if (fabsf(1.0F - val) < 0.0001F) 

Yes, this works (fabsf comes from math.h). The comparison problem isn’t limited to 1.0; it happens with all floating point values. You cannot safely compare two floating point numbers for equality.
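The close-enough test is handy to wrap in a function. The following is a sketch rather than a universal solution: nearly_equal and sum_of_tenths are hypothetical names, and a fixed tolerance such as 0.0001 only makes sense for values of roughly the same size as 1, so pick the tolerance to suit your data.

```c
#include <math.h>
#include <stdbool.h>
#include <stdint.h>

typedef float float32_t;

/* Hypothetical helper: true when a and b differ by less than tolerance. */
bool nearly_equal(float32_t a, float32_t b, float32_t tolerance)
{
    return fabsf(a - b) < tolerance;
}

/* Start at zero and add one tenth, count times over. */
float32_t sum_of_tenths(uint8_t count)
{
    float32_t val = 0.0F;
    for (uint8_t i = 0; i < count; i++) {
        val += 0.1F;
    }
    return val;
}
```

With this, nearly_equal(sum_of_tenths(10), 1.0F, 0.0001F) is true even though the exact comparison with 1.0F fails.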

Warning Chop Chop

Converting between floating point and integer values will take you back to junior high school math class. Remember truncation? Yes, that is what C does to convert from floating point to integer. So:

int result;
result = 2.0F / 3.0F;

Gives a value of 0. The floating point value of ⅔ or 0.666 is truncated to 0 even though it is greater than ½. To round to the nearest integer, add 0.5 and then let the truncation happen (for negative values, subtract 0.5 instead).

result = (2.0F / 3.0F) + 0.5F;

Gives a value of 1. There are also rounding functions in math.h that you can use, such as roundf and lroundf.
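Because C truncates toward zero, the add-0.5 trick needs a little care with negative numbers. Here is a sketch of a rounding helper that handles both signs; round_to_int is a made-up name, and the code assumes the result fits in an int.

```c
typedef float float32_t;

/* Hypothetical helper: round to the nearest integer, with halves going away from zero.
   Assumes the result fits in an int; converting an out-of-range value is undefined in C. */
int round_to_int(float32_t value)
{
    if (value >= 0.0F) {
        return (int)(value + 0.5F);   /* 0.666 + 0.5 = 1.166, truncates to 1 */
    }
    return (int)(value - 0.5F);       /* -0.666 - 0.5 = -1.166, truncates to -1 */
}
```

Adding 0.5 to -0.666 would give -0.166, which truncates to 0 instead of -1, which is why the negative case subtracts.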

Alternatives to Floating Point

If you are working with a small processor, you may not be able to afford to use floating point calculations, since they can take a lot of time. There are alternatives:

  • If you need to do a certain calculation, but it only happens occasionally, it may be perfectly fine to use floating point calculations. A long calculation that happens once at system startup is hardly noticeable. If it happens every 5 seconds, it may appear as an annoying timing variation. The same calculation every millisecond may cause you to miss your deadlines and you’ll want to try one of the following methods.

  • If you don’t need a lot of decimal digits, you can use integers and multiply everything by, say, 100 to give you two extra digits of precision. It’s like doing your calculations in inches rather than feet (go metric!).

  • What if instead of multiplying by 100, you could multiply by 2 or by 2^16, depending on what’s needed? There is a way to do this called Q numbers. In Elecia White’s book Making Embedded Systems, chapter 9 has a section called Fake Floating-Point Numbers where she covers the usage and implementation of Q numbers. Elecia gives everything you need to add, subtract, multiply, and divide using a system of integers.
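To give a feel for the last idea, here is a rough sketch of a number format scaled by 2^16, often written Q16.16. This is my own illustration, not the code from the book; the type name, the macro, and the function names are all hypothetical, and there is no overflow checking.

```c
#include <stdint.h>

/* Hypothetical Q16.16 type: 16 integer bits and 16 fraction bits in an int32_t. */
typedef int32_t q16_16_t;

#define Q16_ONE ((q16_16_t)1 << 16)   /* the value 1.0 in Q16.16 */

/* Convert a whole number to Q16.16 by scaling it up by 2^16. */
q16_16_t q16_from_int(int16_t whole)
{
    return (q16_16_t)whole * Q16_ONE;
}

/* Addition needs no adjustment because both operands share the same scale. */
q16_16_t q16_add(q16_16_t a, q16_16_t b)
{
    return a + b;
}

/* The product carries the scale twice, so widen to 64 bits and divide one scale back out. */
q16_16_t q16_mul(q16_16_t a, q16_16_t b)
{
    return (q16_16_t)(((int64_t)a * (int64_t)b) / Q16_ONE);
}
```

For example, 1.5 is stored as Q16_ONE + (Q16_ONE / 2), and multiplying it by q16_from_int(2) gives q16_from_int(3), all with integer instructions.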

This is just a quick overview of floating point numbers with some topics to watch out for. Much more information is available on Wikipedia.

Next week I'll finish up the data types with booleans and characters.



This post is part of a series. Please see the other posts here.

The floating point coprocessor chip for the National Semiconductor NS32032 processor circa 1986.
