Embedded Wednesdays: Characters.

The last two posts covered the integer and floating point data types in the C language. This week we cover the char data type.

In C, the name char is short for character; a char has enough space to store one character. A char doesn’t handle Unicode or any of the other multi-byte character encodings; C has other extensions to support those.

Character data types are seemingly simple, but since C was designed for optimization, to save bytes and cycles, a few bad decisions were made along the way which have proved to be very expensive. Let’s take a look at characters, strings, and some problems that you need to know about.

Characters and Strings

Like integers and floats, the number of bits in a char is not fixed by the language (only a minimum of 8 bits is guaranteed), and should not be assumed. As you can see in the table below, these current production Texas Instruments processors vary drastically in their data type bit lengths. Look at the C55x Digital Signal Processors: they have 16-bit characters, a 40-bit integer type, and no way to get an 8-bit integer.

Texas Instruments' DSP and microcontroller data types and sizes. From TI's documentation.

The moral: don’t use char when you mean uint8_t or int8_t; say what you mean. Use char when you are storing characters.
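If a piece of code really does depend on the width of a char, it can ask the compiler instead of assuming. Here is a minimal sketch using the standard CHAR_BIT macro from limits.h; the two helper names are made up for illustration.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: number of bits in one char on this target.
   CHAR_BIT comes from <limits.h> and is never less than 8. */
int bits_in_char(void)
{
    return CHAR_BIT;
}

/* sizeof counts in units of char, so sizeof(char) is 1 on every
   target, even one whose char is 16 bits wide. */
size_t chars_in_char(void)
{
    return sizeof(char);
}
```

On a typical desktop compiler the first helper returns 8. On a part with 16-bit characters, such as the C55x described above, it would be expected to return 16, while the second helper still returns 1.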

C supports three char types: unsigned char, signed char, and plain char. The unsigned char and signed char types are typically the base types used to define uint8_t and int8_t, but this may vary. You could avoid using unsigned and signed char completely. The unadorned plain char type is used to store characters. It is expected by the various character handling subroutines, and if an unsigned or signed char is substituted, compilers will now complain about a type mismatch.

An array of char is known as a string. In computing there are various ways to represent a string: storing the length before the content of the string, storing the length and content in a structure, or using an end of string marker. C uses the end of string marker method, where a string is just an array of chars and the last character is followed by the null character, NUL (binary zero). "THIS" is stored as:

Index      [0]   [1]   [2]   [3]   [4]
Character  T     H     I     S     NUL
Hex value  0x54  0x48  0x49  0x53  0x00
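Because the only marker is the trailing NUL, finding the length of a string means walking the array until the NUL turns up. A minimal sketch of that walk follows; the helper names are hypothetical, and the standard library's strlen does essentially the same job.

```c
#include <stddef.h>

/* Hypothetical helper: count characters up to, but not including,
   the terminating NUL. */
size_t count_until_nul(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0') {
        n++;
    }
    return n;
}

/* Storage needed for the same string: the characters plus the NUL. */
size_t storage_needed(const char *s)
{
    return count_until_nul(s) + 1;
}
```

For "THIS" the first helper gives 4 and the second gives 5, which matches the five array locations shown above.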

In our first sample, we have four declarations, two for a single character, and two for character arrays.

void main( void) {
    char response;
    char terminator = '*';
    char inputBuffer[256];
    char prompt[] = "Enter a value:";
}

The first example allocates space for a single character. The second gives a single character and initializes it to splat. The third creates an array of 256 characters; we don’t know what it contains because it is uninitialized. In the fourth example, an array is created and initialized with a constant string but the length isn’t given. C can determine how big an array has to be and fill in the length for us, but only during compilation.
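You can ask the compiler what length it filled in by using sizeof on the array. A small sketch with the same two array declarations, wrapped in made-up helper functions so the values can be checked:

```c
#include <stddef.h>

/* Hypothetical helper: storage set aside for the 256-char buffer. */
size_t input_buffer_size(void)
{
    char inputBuffer[256];
    return sizeof inputBuffer;
}

/* Hypothetical helper: storage the compiler chose for prompt.
   That is 14 visible characters plus the terminating NUL. */
size_t prompt_size(void)
{
    char prompt[] = "Enter a value:";
    return sizeof prompt;
}
```

Here prompt_size gives 15, not 14, because the terminator takes up a location too.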

From the declaration of prompt, above, you might expect this to work:

    prompt = "abcd";

but C does not support assigning to an array, neither character arrays nor any others; you have to use strcpy (or something safer I’ll show you in a minute).

Characters

In this example we have two strings, described in the next section, and a char variable. Single chars aren’t all that interesting; you can assign values to them and compare them to other chars.

Initialization of a single character looks like this:

    char myChar = 'x';

We initialize it using a single character surrounded by single quotes. We can also assign a single character using a similar command.

#include <stdio.h>

#define MAX_LEN  80

void main( void) {
    char first[MAX_LEN] = "Embedded";
    char last[MAX_LEN] = "FM";
    char initial;
    initial = '.';
    if (initial == 'x') {
        printf("That would be really odd\n");
    }
    printf( "This is %s%c%s\n", first, initial, last);
}

This will print:

This is Embedded.FM

Strings

Strings are quite a bit more useful than single chars. Strings are given an initial value either at compile time or by using the strlcpy function. Two strings can be joined together using strlcat. Strings can be compared using strcmp.
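One thing to watch with strcmp is that it returns 0 when the strings match, which reads backwards inside an if. Comparing two strings with == compares where the arrays live, not what they contain. A minimal sketch, with a hypothetical wrapper name:

```c
#include <string.h>

/* Hypothetical helper: nonzero when both strings hold the same text.
   strcmp returns 0 for a match and a nonzero value otherwise. */
int same_text(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

With this, same_text("FM", "FM") is true and same_text("FM", "fm") is false, since the comparison is case sensitive.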

Single characters within the string can be accessed by using the “[ ]” index operator. The indexes on arrays in C start at 0. For instance, above we had char first[MAX_LEN] = "Embedded"; so first[0] would have the value 'E', and first[4] would be 'd'. Above, I mentioned that strings are terminated by the null character; it shows up at location [8] in our string, just after the final 'd' in Embedded.
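Indexing and the terminator work together nicely in a loop. As a rough sketch (the function name is made up), this steps through a string one index at a time and stops when it reaches the null:

```c
#include <stddef.h>

/* Hypothetical helper: count how often 'wanted' appears in s,
   stepping through the array by index until the NUL is reached. */
size_t count_char(const char *s, char wanted)
{
    size_t count = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (s[i] == wanted) {
            count++;
        }
    }
    return count;
}
```

For "Embedded" and 'd' the loop finds matches at indexes 4, 5, and 7, so the count is 3.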

In this next example I show some typical simple string manipulations. The important parts are:

  1. the include statement for string.h.

  2. declare three strings with two being initialized, and one holding space called a buffer.

  3. copy “Hello” into one.

  4. print it out.

  5. append “ World!” to Hello.

  6. append a new line.

  7. print it out.

  8. compare the first character with ‘H’.

  9. print out a message if the character matched.

#include <stdio.h>
#include <string.h>

#define MAX_LEN  80

void main( void) {
    char myString[MAX_LEN];
    char theWorld[] = " World!";
    char newLine[] = "\n";

    strlcpy( myString, "Hello", MAX_LEN);
    printf( "myString is %s\n", myString);
    strlcat( myString, theWorld, MAX_LEN);
    strlcat( myString, newLine, MAX_LEN);
    
    printf( "%s", myString);
    
    if (myString[0] == 'H') {
        printf("The first character is an haych\n");
    }
}

This will print:

myString is Hello
Hello World!
The first character is an haych

Other String and Character Constants

  • "" - an empty string, used to make a string buffer empty

  • '\0' - the null character

  • '\x2A' - how you specify a character in hex. In this case 0x2A = 42, the '*' splat character.

  • 'G' - the single letter G

The C language uses the backslash character in front of a few characters to change their meaning:

  • \n - this becomes the linefeed character, you can see it used in the printf commands above.

  • \r - the carriage return character

  • \a - the bell character - make sure you print out a million of these

  • \t - the tab character to line up all of your columns

  • \f - formfeed, on a printer this will spit out the sheet and start a new one.

  • \\ - in case you need to use a ‘\’ character in a string, you have to use two.

  • \" - in case you need to store a double quote

  • \' - in case you need to store a single quote
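Even though an escape takes two or more keystrokes to type, the compiler stores it as one character. A quick sketch to confirm that, assuming an ASCII character set; the helper names are made up:

```c
#include <string.h>

/* "a\tb\n" is typed with six characters between the quotes but is
   stored as four: 'a', tab, 'b', linefeed (plus the NUL, which
   strlen does not count). */
size_t escaped_length(void)
{
    return strlen("a\tb\n");
}

/* In ASCII, '\x2A', '*' and 42 are three spellings of one value. */
int splat_spellings_match(void)
{
    return ('\x2A' == '*') && ('*' == 42);
}
```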

Inside a string or character constant, each of the escape sequences above becomes a single character:

"Hello\a" - a string that says Hello and rings the bell.

Danger - Sample Code From Hell

This next section goes over a problem that plagues C, buffer overflows. It gives examples of typical C code, then dissects the code to show how buffer overflows happen.

Only the final example should be considered an example of how to do things.

To get the compiler to DO WHAT I SAID, I use the keyword volatile. Volatile tells the compiler that every read and write of a particular data item must actually be performed, which switches off the optimizations that would otherwise rearrange or remove those accesses.

strcpy

First example, the setup; declare a string and an 8-bit value, replace the contents of the string and print them out.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

void main( void) {
	volatile char myString[4] = "123";
	volatile uint8_t myInt = 42;

	strcpy( myString, "abcd");
	printf( "myInt has the value %d\n", myInt);
	printf( "myString is %s\n", myString);
}

In this example I use two variables; myString and myInt. I initialize myString to contain the string “123” and myInt is given the value 42. I copy the string “abcd” into myString. Finally I print out the value of myInt and myString.

From the top, we include stdio.h to use the printf subroutine. We then include string.h for the standard string manipulation library subroutines. Then we include stdint.h to get the integer type names.

volatile char myString[4] = "123";

 

I will be using myString as a bunch of characters, so I use the type char. myString is set up with enough space to store 4 characters, including the terminating null. Finally, we initialize it using 123 surrounded by double quotes.

Next:

strcpy( myString, "abcd");

This is how you set a string to a particular value; you have to call a subroutine to do it. Here we call the string copy routine strcpy to copy “abcd” into myString. It copies the thing on the right into the thing on the left.

There is a problem with my strcpy though: even though myString is declared to have a length of 4, and “abcd” seems to be 4 characters, it is actually 5 because strings in C are terminated by the null character (value 0). In memory this string looks like:

"abcd"

Location   [0]   [1]   [2]   [3]   myInt
Character  a     b     c     d     NUL
Hex value  0x61  0x62  0x63  0x64  0x00

Where does the null go? Well, it goes on top of myInt because, in this build, myInt was the next thing in memory. This is a classic buffer overflow. Strcpy is not smart enough to stop at the end of myString because C does not store the size of myString anywhere, so it doesn’t get checked. For speed and efficiency, all security is removed.

When you run our simple program, above, you get:

myInt has the value 0
myString is abcd
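Since C does not keep track of the size of myString, any check has to be written by hand. This is a minimal sketch, with a made-up helper name, of the test that strcpy never makes:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: nonzero when src, including its NUL,
   fits in a destination of dst_size chars. */
int fits_with_nul(const char *src, size_t dst_size)
{
    return strlen(src) < dst_size;
}
```

For the example above, fits_with_nul("123", 4) is true but fits_with_nul("abcd", 4) is false, because the fifth location is needed for the terminator.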

 

strncpy

The more modern method is to use a function called strncpy. Strncpy has a third parameter that you use to set the maximum number of characters to transfer. We would use it like this:

strncpy( myString, "abcd", 4);

This avoids smashing into myInt.

Now we get subtle. myString now contains:

 

Location   [0]   [1]   [2]   [3]   myInt
Character  a     b     c     d     *
Hex value  0x61  0x62  0x63  0x64  0x2A

 

But there is no null terminator on our string and printf is expecting a null terminated string.

printf( "myString is %s\n", myString);

Printf will merrily print out whatever is in memory until it finds a null. Since there is no null in myString, it continues to the next thing in memory, which is myInt. myInt gets converted to a character, ‘*’ (ASCII character 42, splat), and the next memory location after myInt just happens to contain a null so it stops (if the next location wasn’t a null, it would just continue on until it ran into one). This gets printed:

myInt has the value 42
myString is abcd*

This isn’t a buffer overflow, it’s just an example of how stupid the string routines are.
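Whether a buffer ended up terminated can be checked without running off its end. Here is a rough sketch using memchr, which stops after a given number of bytes. The function names and the 4-char buffer are assumptions for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: nonzero if a NUL appears somewhere in the
   first size chars of buf. memchr never looks past size, unlike
   printf with %s. */
int has_terminator(const char *buf, size_t size)
{
    return memchr(buf, '\0', size) != NULL;
}

/* Copy src into a 4-char buffer with strncpy and report whether
   the result is terminated. */
int strncpy_result_terminated(const char *src)
{
    char buf[4];
    strncpy(buf, src, sizeof buf);
    return has_terminator(buf, sizeof buf);
}
```

With "abc" the result is terminated, because strncpy pads short strings with nulls. With "abcd" it is not, which is exactly the case shown above.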

How about if we get clever to avoid problems? Let’s try to fix all of this by using the preprocessor:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define MY_STRING_LEN 4

void main( void) {
	volatile char myString[MY_STRING_LEN + 1] = "123";
	volatile uint8_t myInt = 42;
    
	strncpy( myString, "abcd", MY_STRING_LEN);
	printf( "myInt has the value %d\n", myInt);
	printf( "myString is %s\n", myString);
}

This will print:

myInt has the value 42
myString is abcd

 

Success! This code accidentally works. Because the initializer “123” is shorter than the array, C fills the remaining elements with zeros, so myString[4] already held a null before the copy. Without the initializer string the buffer isn’t zeroed, it just contains the random stuff that was in memory when the program runs, so this code only works because myString was initialized.

Being clever is a bad idea. All we have to do is remove the initializer and we’re hooped again.
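If you are stuck with strncpy, one common way to stop depending on the initializer is to write the terminator yourself after the copy. A minimal sketch, with a hypothetical helper name:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy at most dst_size - 1 characters and then
   write the terminator ourselves, so the result does not depend on
   what was in dst beforehand. */
void copy_and_terminate(char *dst, const char *src, size_t dst_size)
{
    if (dst_size == 0) {
        return;                     /* no room for even the NUL */
    }
    strncpy(dst, src, dst_size - 1);
    dst[dst_size - 1] = '\0';
}
```

This still relies on the caller passing the right size, and it is easy to forget the second line, which is a large part of why strncpy has its reputation.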

strlcpy

These days strncpy is getting a really bad name and is considered dangerous because of the lack of null problem. Instead, we should use the new, new, version of strcpy known as strlcpy. Strlcpy has the same syntax as strncpy, but it ensures that the string is left in a terminated state. The third parameter is the full size of the destination buffer, terminator included.

strlcpy( myString, "abcd", 4);

Would give us:

Location   [0]   [1]   [2]   [3]   myInt
Character  a     b     c     NUL   *
Hex value  0x61  0x62  0x63  0x00  0x2A

 

It copies as much of our string as it can, and terminates the string with a null within the length specified.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

void main( void) {
	volatile char myString[4] = "123";
	volatile uint8_t myInt = 42;

	strlcpy( myString, "abcd", 4);

	printf( "myInt has the value %d\n", myInt);
	printf( "myString is %s\n", myString);
}

This will print:

myInt has the value 42
myString is abc

Safe.

Let’s get rid of the length of 4. The 4 sucks, it is what is known as a “magic number”. Magic numbers are the numbers that get put into programs without any hints to the future programmer what they refer to. An example would be the number 31622400, which nobody recognizes, but is better replaced by:

#define SECONDS_PER_LEAP_YEAR 31622400

Defines are effectively free, they don’t take up any extra room in your compiled code, but they make your code more readable.
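A define can also show where the number comes from. In this sketch the helper defines and the function name are my own additions. It builds the same constant out of its parts so a reader can check the arithmetic.

```c
/* The same constant, spelled so the reader can check the arithmetic.
   The UL suffixes keep the multiplication in unsigned long, which
   matters on targets where int is only 16 bits. */
#define DAYS_PER_LEAP_YEAR     366UL
#define SECONDS_PER_DAY        (24UL * 60UL * 60UL)
#define SECONDS_PER_LEAP_YEAR  (DAYS_PER_LEAP_YEAR * SECONDS_PER_DAY)

/* Hypothetical helper so the value can be inspected at run time. */
unsigned long seconds_per_leap_year(void)
{
    return SECONDS_PER_LEAP_YEAR;
}
```

The compiler folds the multiplication at compile time, so this version is just as free as writing 31622400 directly.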

Assume that we figure out that we actually need 5 characters instead of 4. We would have to search through all of our code to find all of the 4s and change them to 5, except the ones that happen to be correct.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define PHRASE_LENGTH 4
#define LIFE_UNIVERSE_EVERYTHING 42
#define INITIAL_STRING "dog"

void main( void) {
    volatile char myString[PHRASE_LENGTH] = INITIAL_STRING;
    volatile uint8_t myInt = LIFE_UNIVERSE_EVERYTHING;
  
    strlcpy( myString, "Sextant", PHRASE_LENGTH);
    printf( "myInt has the value %d\n", myInt);
    printf( "myString is %s\n", myString);
}

And if we need more space, we just need to change the define:

#define PHRASE_LENGTH 8

This will print:

myInt has the value 42
myString is Sextant
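When the destination is an array that is in scope, sizeof can supply the length so the define is only mentioned at the declaration. A rough sketch follows; the set_phrase helper is made up, and it uses snprintf as a stand-in for the copy only because snprintf is standard C99 and always terminates its output.

```c
#include <stdio.h>

#define PHRASE_LENGTH 8

static char phrase[PHRASE_LENGTH];

/* Hypothetical helper: store src in phrase, truncating if needed.
   sizeof phrase follows PHRASE_LENGTH automatically, so the length
   is written in exactly one place. */
const char *set_phrase(const char *src)
{
    snprintf(phrase, sizeof phrase, "%s", src);
    return phrase;
}
```

One caution: sizeof only gives the array size when applied to the array itself. Applied to a char pointer parameter it gives the size of the pointer, so a function that receives a buffer still needs the length passed in.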

 

For something so basic and simple, chars in the C language are surprisingly tricky. The safer versions of strcat and strcpy were announced in 1999, so a lot of code was written without them, and a lot of people still don’t know about them. These functions have caused a lot of problems over the years, and best practice says that strlcat and strlcpy should be used instead.

If your compiler doesn’t have a version of strlcat and strlcpy, complain to your provider (I’m looking directly at you Microchip).
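In the meantime, a stand-in is short enough to carry around. This is a minimal sketch that follows the documented strlcpy behaviour as I understand it, not the actual BSD source; the name my_strlcpy is made up to avoid clashing with a real library version.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for strlcpy: copy at most dst_size - 1
   characters, always terminate when dst_size is nonzero, and return
   the length of src so the caller can detect truncation. */
size_t my_strlcpy(char *dst, const char *src, size_t dst_size)
{
    size_t src_len = strlen(src);

    if (dst_size != 0) {
        size_t n = (src_len < dst_size - 1) ? src_len : dst_size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return src_len;
}
```

If the return value is greater than or equal to the size you passed in, the string was truncated. Copying "abcd" into a 4-char buffer returns 4 and leaves "abc" in the buffer.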


This post is part of a series. Please see the other posts here.


Public Domain, https://commons.wikimedia.org/w/index.php?curid=264135
