Computers don’t differentiate between characters, integers, floating point numbers, objects, arrays, or code. These are simply binary patterns stored in bytes in memory. Once they leave the confines of the processor and get transmitted across the airwaves or a piece of wire, they are still just patterns of bits that are open to interpretation. Today we look at some issues in interpretation of data.

A byte of data holds a number between 0 and 255. The meaning of this number depends completely on its use. For instance, the number 42 is the answer to the ultimate question of life, the universe, and everything; it is also 7 times 6, the atomic number for molybdenum, and the character ‘*’ (known as splat in some compiler construction classes).

If I wanted to handwrite the number 42 on paper, how many characters does it take? Two: a 4 and a 2. Even though we can think of the number 42 as being a single number, it takes two characters to represent it in English. The symbols for 42 will vary depending on which language you write in. In Japanese and Chinese the symbols 四十二 (four ten two) are used.

The same thing happens on computers; we can have a byte that stores the value 42, but for me to print it out takes two characters, a 4 and a 2. I just need to send the proper information to my printer to get the 4 and 2 characters to show up in the appropriate language. Strangely, I don’t send the numbers 4 and 2.

In the early days of computing there were no rules yet, so you couldn’t just buy a printer or terminal for your computer; each one was specific to a computer model and had its own command set. To help with interoperability and make communication a little more predictable, ANSI put together the American Standard Code for Information Interchange (ASCII, pronounced ask-ee), which gives a translation between internal numbers and the commands and characters they represent.

In ASCII, the printing numeric characters have the values 48 through 57 (0 through 9 respectively). Uppercase English characters are 65 through 90 (A through Z), and lowercase are 97 through 122 (a through z). The values from 0 to 31 are commands for communications modems, computers, and printers. The values above 127 aren’t defined (since the characters being sent were only 7 bits wide, the MSB being used for error checking), but are now allocated for character graphics symbols that were popular back in the 80s. [1]

To get our 42 onto a printer, we would have to send the ASCII code for 4, followed by the code for 2. From the table, these are the values 52 and 50 respectively. How’s that for confusing? Why don’t we just send the value 42 to the printer? We would get a splat, because the ASCII value for the ‘*’ character is 42.

C provides functions like printf to convert our numbers into their ASCII display form. C also provides scanf to convert from ASCII numbers into the binary forms that we need to do calculations.
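To make the conversion concrete, here is a minimal sketch of turning 42 into the characters ‘4’ and ‘2’ and back again by hand. The helper names two_digits and parse_digits are made up for this illustration, and this is not how printf or scanf are actually implemented; it only shows the arithmetic involved, assuming an ASCII character set.

```c
#include <stdio.h>

/* Hypothetical helper: convert a value 0..99 into two ASCII digits. */
void two_digits(unsigned value, char out[3])
{
    out[0] = (char)('0' + (value / 10) % 10);  /* tens digit: 42 gives '4', code 52 */
    out[1] = (char)('0' + value % 10);         /* ones digit: 42 gives '2', code 50 */
    out[2] = '\0';                             /* terminate so it can be printed as a string */
}

/* Hypothetical helper: turn a run of ASCII digits back into a binary value. */
unsigned parse_digits(const char *s)
{
    unsigned value = 0;
    while (*s >= '0' && *s <= '9') {
        value = value * 10 + (unsigned)(*s - '0');  /* '4' - '0' is 4, and so on */
        s++;
    }
    return value;
}
```

Calling two_digits(42, buf) leaves the bytes 52 and 50 in buf, which are the two codes the printer needs.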

Which End is UP?

Sometimes data is sent in binary form: if you want to transmit a 42, you send a 42. This is very efficient; it only takes one byte to transfer one byte. The problem comes up when you need to transfer a value that is wider than 8 bits.

If you want to transmit a 32 bit value, it goes out a byte at a time, but which byte goes first? It depends. On a computer with an Intel processor you send out the least significant byte first. Many other computers send out the most significant byte first.

It looks like this:

#include <stdint.h>
uint32_t val = 0x87654321;   /* In hex to show what happens to the bytes */

0x87 | 0x65 | 0x43 | 0x21

An Intel Pentium would transmit this as 0x21, followed by 0x43, and so on, from right to left. This is known as little endian encoding. Data are transmitted least significant byte first.

The Motorola MC68000 and PowerPC would transmit this as 0x87, then 0x65, and so on, left to right. This is known as Jabba the Hutt encoding. Actually it’s not, I’m just kidding, it’s big endian. Data are transmitted most significant byte first.
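When a value is sent as raw memory, the order on the wire is simply the order in which the processor stores the bytes. A small sketch for checking which kind of machine you are on follows; the function name is_little_endian is my own invention for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if this machine stores the least
   significant byte of a multi-byte value at the lowest address. */
int is_little_endian(void)
{
    uint32_t val = 0x87654321u;
    unsigned char bytes[sizeof val];

    memcpy(bytes, &val, sizeof val);  /* copy the in-memory representation */
    return bytes[0] == 0x21;          /* 0x21 first means little endian */
}
```

On a big endian machine bytes[0] would hold 0x87 instead, and the function would return 0.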

When two computers want to communicate with each other, they have to agree on a format for the data. With a Pentium transmitting and PowerPC receiving, the Pentium (little endian) sends its data least significant byte first, and the PowerPC receives data that looks like:

uint32_t val = 0x21436587;

0x21 | 0x43 | 0x65 | 0x87

And then the PowerPC would have to swap the bytes around to get the original value back. This is known as byte swapping.
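A byte swap for a 32 bit value can be sketched with masks and shifts. The name swap32 is hypothetical; many compilers and operating systems provide their own equivalents, which I am not showing here.

```c
#include <stdint.h>

/* Hypothetical helper: reverse the order of the four bytes in v. */
uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |  /* lowest byte moves to the top */
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);   /* highest byte moves to the bottom */
}
```

Feeding it the received value 0x21436587 gives back the original 0x87654321, and swapping twice returns whatever you started with.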

Many parts of the internet’s communications use big endian network byte order format. Each remote computer then converts the data into the format it needs.
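One way to avoid caring about the local processor at all is to build the byte stream with shifts, so the code produces big endian output on any machine. This is only a sketch, and the names put_be32 and get_be32 are made up for the example.

```c
#include <stdint.h>

/* Hypothetical helper: write v into out[] most significant byte first. */
void put_be32(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}

/* Hypothetical helper: rebuild a value from four big endian bytes. */
uint32_t get_be32(const uint8_t in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```

Because the shifts work on the value rather than on its memory layout, the same source compiles to correct code on both little and big endian machines. POSIX systems also offer htonl and ntohl in arpa/inet.h for the same job.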

Escape plans

You can send commands mixed in with your data, but if your data is binary and can take any possible value, you need to do some special processing to avoid having your data trigger commands accidentally.

Assume that we are sending an unknown amount of binary data in a packet and we put a character on the end that we use to detect errors in the data transmission. We could use the ASCII characters STX and ETX (2 and 3) to delimit our data. Our data packet will look like this:

STX = 2 | byte | ETX = 3

What happens if we need to send the values 2 or 3, which just happen to be our STX and ETX characters? We’ve got a problem that will cause incorrect starting and stopping of packets, and general mayhem.

The problem data stream looks like this:

STX

ETX

We add a new rule: if you need to send the value 2 or 3 in the data, put the ASCII escape character ‘ESC’ (27) in front of it.

Of course we now have to escape the escape character as well. If we want to transmit the value 27 (which is the ESC character) we would have to send ESC ESC.

STX | ESC | ETX

When the data is received it’s scanned for ESC characters. The ESC is removed and the next character is used as data. If everything is received correctly, we can use the data.
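The escaping rules can be sketched in a few lines of C. The function names frame_encode and frame_decode are hypothetical, and I have left out the error check character because its format isn’t specified here; this is a simplified illustration, not the exact protocol I used.

```c
#include <stddef.h>
#include <stdint.h>

enum { STX = 2, ETX = 3, ESC = 27 };

/* Hypothetical helper: wrap len data bytes in STX ... ETX, escaping as needed.
   out needs room for 2 * len + 2 bytes (worst case: every byte is escaped).
   Returns the number of bytes written. */
size_t frame_encode(const uint8_t *data, size_t len, uint8_t *out)
{
    size_t n = 0;
    out[n++] = STX;
    for (size_t i = 0; i < len; i++) {
        if (data[i] == STX || data[i] == ETX || data[i] == ESC)
            out[n++] = ESC;              /* flag the next byte as plain data */
        out[n++] = data[i];
    }
    out[n++] = ETX;
    return n;
}

/* Hypothetical helper: unwrap one frame into out (room for len bytes is enough).
   Returns the payload length, or -1 if the frame is malformed. */
long frame_decode(const uint8_t *in, size_t len, uint8_t *out)
{
    size_t n = 0;
    if (len < 2 || in[0] != STX)
        return -1;
    for (size_t i = 1; i < len; i++) {
        if (in[i] == ESC) {
            if (++i >= len)
                return -1;               /* ESC with nothing after it */
            out[n++] = in[i];            /* escaped byte is always data */
        } else if (in[i] == ETX) {
            return (long)n;              /* unescaped ETX ends the packet */
        } else if (in[i] == STX) {
            return -1;                   /* unescaped STX inside a packet */
        } else {
            out[n++] = in[i];
        }
    }
    return -1;                           /* ran out of bytes before ETX */
}
```

Encoding the data bytes 1, 2, 3, 27, 42 produces STX 1 ESC 2 ESC 3 ESC ESC 42 ETX, and decoding that stream gives the original five bytes back.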

This is an example of a simple protocol that I used once, but this type of escape processing is also used in various forms wherever you have commands and data intermixed. In the printf function in C, to include the ‘%’ character in a format string, you have to use ‘%%’. HTML has issues with the < and > characters being shown on a screen; they must be replaced by the phrases ‘&lt;’ and ‘&gt;’, and since ‘&’ is a special character to HTML, you must use ‘&amp;’ if you want an ampersand on the screen.
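Both kinds of escaping are easy to try out. The sketch below uses the standard snprintf for the ‘%%’ case; html_escape and format_percent are made-up helper names, and the HTML helper only handles the three characters mentioned above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes e.g. "42%" into buf.
   The "%%" in the format string produces a single literal '%'. */
int format_percent(char *buf, size_t size, int value)
{
    return snprintf(buf, size, "%d%%", value);
}

/* Hypothetical helper: write the HTML-escaped form of in to out.
   The caller must provide room for up to 5 bytes per input character,
   plus one for the terminator. */
void html_escape(const char *in, char *out)
{
    for (; *in; in++) {
        switch (*in) {
        case '<': strcpy(out, "&lt;");  out += 4; break;
        case '>': strcpy(out, "&gt;");  out += 4; break;
        case '&': strcpy(out, "&amp;"); out += 5; break;
        default:  *out++ = *in;         break;
        }
    }
    *out = '\0';
}
```

Note that the ampersand has to be escaped for the same reason ESC did in the packet protocol: it is the character that introduces the escape itself.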

And Another Thing...

We’ve learned that computers can’t tell the difference between numbers and letters, can’t communicate with each other without a playbook, sometimes repeat themselves so they can be heard, and don’t know if they are coming or going. It is obvious that computers are, in fact, male. All of the evidence supports that conclusion.

[1] Embedded systems deal with the conversion between numbers and ASCII character codes very frequently. This habit even comes up in literature as well:

“There was one binary code that I knew most of - ASCII. Embedded software engineers often had to look at how the letters and numbers are encoded in the electronics. It was one of those things that I had done for two weeks every six months for about ten years, debugging some driver or another. Even better, it was build-able, B is one plus A, and I knew that A was 0x41 and lower case started with 0x61; without putting in too much detail that was equivalent to the binary code 0100 0001.” White, E (2007) Pony Up

This post is part of a series.