DMA - A Little Help From My Friends

This time on our journey we’re going to take a look at a very useful aspect of computer architecture: direct memory access (DMA).

DMA is not a new concept. I first learned about it in 1979 in my computer architecture class, where we discussed the DMA that IBM used in their System/360 communication channels[1]. But the first use of what became DMA was in the IBM 709 vacuum tube computer from 1958.

Think of DMA as a co-processor that is used to quickly transfer data between main memory and peripherals without the intervention of the CPU.

DMA essentially freezes the CPU, disconnecting it from the memory and I/O busses, so that specialized data-moving hardware can transfer data between memory and peripherals. When the transfer is complete, the CPU is brought back online and notified. This gives the peripherals direct access to the system memory without involving the CPU.

Tell Me Why

Why would you do this? When copying a single piece of data from one place in memory to another, the CPU normally has to arrange for the data to be transferred onto the bus, loaded into the CPU, moved out of the CPU, back onto the bus and stored back into memory. The CPU really didn’t add much value to the transaction, and it took two instructions. When you move a block of data, the CPU has to go through this same dance for every element, plus increment the source and destination pointers appropriately.

The addition of a DMA controller gives the option of freeing up the CPU for more valuable tasks while the DMA controller takes care of the data movement and pointer incrementing.

DMA is useful when dealing with data generated at very high or very low speeds. In high speed data transfers, like from USB and Ethernet, data is typically handled by complex peripherals that can deal with the blocks of data being transferred. DMA can be used to transfer the data blocks out of the peripheral into user memory space, or from user space memory to peripherals, or even memory to memory. For low speed data, the DMA controller will aggregate multiple data items and provide a signal when they are ready for processing.

What Goes On

Historically, there are three different styles of DMA: block, burst/cycle stealing, and transparent.

  • With block DMA, the processor is removed from the bus until the complete block is transferred. This would be useful for rapidly moving large blocks of data, like bringing code in from a disk controller on a page fault, but the processor is stalled for a long period.
  • Burst mode takes the processor off the bus for a very short period, transferring a handful of data, then letting the processor run again, and repeating until the transfer is complete.
  • Transparent DMA sneaks transfers in by watching the processor states and moving data when the CPU isn’t using the bus. Transparent mode is the slowest of all of the DMA methods, but doesn’t cause any CPU speed jitter.
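
To make the trade-off between the three styles concrete, here is a small Python model. It is only a sketch under my own assumptions, not how any particular controller works: the DMA moves one word per bus cycle, the function names are made up for illustration, and the CPU's bus usage is a fixed pattern that does not shift when the CPU is stalled. Each function returns the cycle at which the transfer finishes and the longest stretch the CPU is locked off the bus.

```python
def block_dma(words):
    """Block mode: the DMA holds the bus until every word has moved."""
    return words, words


def burst_dma(words, burst, gap):
    """Burst mode: move up to `burst` words, then give the CPU `gap` cycles."""
    cycle = 0
    moved = 0
    while moved < words:
        chunk = min(burst, words - moved)
        moved += chunk
        cycle += chunk
        if moved < words:
            cycle += gap  # CPU owns the bus between bursts
    return cycle, min(burst, words)


def transparent_dma(words, cpu_uses_bus, max_cycles=1_000_000):
    """Transparent mode: move a word only on cycles the CPU leaves idle.

    `cpu_uses_bus(cycle)` is a hypothetical callback that returns True
    when the CPU needs the bus on that cycle.
    """
    cycle = 0
    moved = 0
    while moved < words:
        if cycle >= max_cycles:
            raise RuntimeError("CPU never released the bus")
        if not cpu_uses_bus(cycle):
            moved += 1
        cycle += 1
    return cycle, 0  # the CPU is never locked out


if __name__ == "__main__":
    print("block      ", block_dma(64))
    print("burst      ", burst_dma(64, burst=4, gap=4))
    # Assume the CPU uses the bus on three cycles out of every four.
    print("transparent", transparent_dma(64, lambda c: c % 4 != 3))
```

With these made-up numbers, block mode finishes first but locks the CPU out for the whole transfer, burst mode takes longer but only stalls the CPU a few cycles at a time, and transparent mode never stalls the CPU but is the last to finish.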

Over the years, processor architects have adjusted how DMA is implemented, since they have more transistors to work with and can tune their approach for better performance.

The Intel 8257 DMA controller for the Intel 8085 used block mode to transfer up to 16K bytes in one single block move. On the IBM-PC, the Intel 8237 DMA controller could do burst or block mode transfers by inserting wait states (signals to the CPU to make it wait for slower memory to become ready). Now that processors are so fast compared to the busses, block mode DMA is very expensive. The cost of taking the CPU offline for thousands of bus cycles is far too high. Architectural developments like multi-level caches are used to keep the processor running while the DMA transfers happen in the background.

Modern burst mode DMA is taking cues from transparent mode: the DMA controller will delay starting a transfer until the CPU is not using the bus. For instance, if the CPU happens to be processing an interrupt request, where a large group of registers are pushed onto the stack in main memory, the DMA controller will wait until the register push is complete before taking the CPU off of the bus, then transfer a small number of bytes. But if the instruction being executed is not going to be using the bus, for instance a multiply instruction that just works on values in registers, the CPU can continue processing even though it is disconnected from the bus. The instruction execution rate, during burst DMA, ends up with a small amount of jitter as the processor gets stalled for a couple of bus cycles.

The capabilities of the DMA controller are dependent on how your chip manufacturer designed it. They typically support memory to memory, and bidirectional memory to peripheral transfers, but you should check your data sheets.

Memory to memory transfers can be useful for copying buffers in a communications stack, and especially in computer graphics, where windows and font elements are moved as large chunks.

Peripheral to memory transfers are common in dealing with peripherals that work with blocks of data, such as video camera interfaces and SD card controllers. For example, with a DCMI video interface, the CPU would program the DMA and camera interface to transfer a block of data from the camera into a buffer. While the camera is transferring its image into the controller, the camera interface breaks the image up into data packets, feeds them into a FIFO, and then requests the DMA to begin.

Memory to peripheral transfers are very interesting. Beyond a really efficient lower layer for printf, DMA could be used to stream data to a DAC, forming a waveform generator.
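
For the waveform generator idea, the memory side of the transfer is just a table of samples that the DMA controller would stream to the DAC over and over. Here is a minimal Python sketch that builds one cycle of a sine wave on the host. The 12-bit resolution and 64-sample length are my assumptions, and `sine_table` is a made-up helper name; check your DAC's data sheet for the real width and alignment.

```python
import math


def sine_table(samples, bits=12):
    """Return one sine cycle scaled to an unsigned DAC range of `bits` bits."""
    full_scale = (1 << bits) - 1   # 4095 for a 12-bit DAC
    mid = full_scale / 2           # mid-scale is the zero crossing
    return [
        round(mid + mid * math.sin(2 * math.pi * i / samples))
        for i in range(samples)
    ]


if __name__ == "__main__":
    table = sine_table(64)
    # Print as a C-style initializer that could be pasted into firmware.
    print(", ".join(str(v) for v in table))
```

With a timer pacing the DMA requests, the output frequency would be the timer rate divided by the table length.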

Peripheral to peripheral transfer is less common. ST, Intel, and SPARC processors don’t support this transfer mode. ATMEL SAM processors support peripheral to peripheral transfers, and their documentation shows an example of a transfer from an ADC to a UART, though I can see issues with the datastream getting out of sync if you drop a character. A much better example is ADC to Flash memory. You would select the Flash chip, send the write command with the target address, then use DMA to stream the ADC data to the Flash chip. Once the ADC has sent your chunk of data you get an interrupt, then you deselect the Flash.

Faster

DMA comes into its own when dealing with data transfers that are so fast that the processor cannot handle them using the other methods that are available.

In my previous post, I discussed over-run errors when receiving data from UARTs. On our embedded processors, like the Microchip PIC, STM32 or ATMEL SAM microcontrollers, we only have a one-byte input buffer.[2] If a second character comes in before the receive buffer is unloaded, the new data destroys the first. UARTs detect this situation as an over-run error.

This is a common occurrence when polling is used. The processor provides a status bit that you periodically read (poll) to see if there is a character waiting for you. Once the poll shows a character is pending, your code has to unload it before a new one comes in, or you get an over-run error and data is lost.
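
The over-run behavior described above can be modeled in a few lines of Python. This is a sketch under my own assumptions: a one-byte receive register, characters arriving at a fixed spacing, and a poll loop that comes around at a fixed period. The function name and the microsecond numbers are made up for illustration (87 µs is roughly one character time at 115200 baud with 10 bits per character).

```python
def count_overruns(num_chars, char_time_us, poll_period_us):
    """Simulate a one-byte receive register that is emptied only by polling.

    Returns (characters_received, overruns).
    """
    holding = False   # True while an unread character sits in the register
    received = 0
    overruns = 0
    next_poll = poll_period_us
    for n in range(1, num_chars + 1):
        arrival = n * char_time_us
        # Run every poll that happens before this character arrives.
        while next_poll < arrival:
            if holding:
                received += 1
                holding = False
            next_poll += poll_period_us
        if holding:
            overruns += 1   # the new character destroys the unread one
        holding = True
    if holding:
        received += 1       # a final poll picks up the last character
    return received, overruns


if __name__ == "__main__":
    print("fast poll:", count_overruns(100, char_time_us=87, poll_period_us=50))
    print("slow poll:", count_overruns(100, char_time_us=87, poll_period_us=200))
```

As long as the poll period is shorter than a character time nothing is lost. Once the poll loop gets slower than the character rate, over-runs show up immediately.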

One way to overcome the problems of polling is to use interrupts. You get an interrupt every time a character is ready to be unloaded, so there is no need to poll. But when the interrupt rate is high enough, the processor cannot field the interrupts quickly enough and data is again lost.

However, using DMA, when a character is received, the UART signals the DMA controller that it needs service. The character is transferred into memory, and the processor continues. No polling, no interrupts.

Doing some rough estimates and assuming an STM32 processor, the DMA overhead for a single UART character transfer would be two bus cycles. A bus cycle is usually two to four CPU cycles (the bus runs at ½ or ¼ of the processor speed). The STM32 executes most simple instructions in one cycle, so it takes about 4-8 instruction times to transfer one character from the UART into RAM (24-48 ns at 168 MHz). That is about the same number of instructions that would be necessary to poll the UART to see if a character is ready to be transferred (but not do the transfer, just check), and about ⅓ of what it would take to unload the character using an interrupt routine.
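
The arithmetic behind that estimate is easy to check. Here is a small Python sketch using the numbers from the paragraph above (two bus cycles per transfer, a bus running at ½ or ¼ of a 168 MHz CPU clock). The function name is mine, and these are rough estimates, not measured figures.

```python
def dma_transfer_cost(bus_cycles, cpu_cycles_per_bus_cycle, cpu_hz):
    """Return (cpu_cycles, nanoseconds) that one DMA transfer occupies."""
    cpu_cycles = bus_cycles * cpu_cycles_per_bus_cycle
    nanoseconds = cpu_cycles / cpu_hz * 1e9
    return cpu_cycles, nanoseconds


if __name__ == "__main__":
    for ratio in (2, 4):   # bus at 1/2 or 1/4 of the CPU clock
        cycles, ns = dma_transfer_cost(2, ratio, 168_000_000)
        print(f"bus at 1/{ratio} CPU speed: {cycles} CPU cycles, {ns:.1f} ns")
```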

One way or another, the character has to be unloaded from the UART. With some quick calculations, DMA has the least overhead.

Slow Down

DMA is really good at slow events as well. Polling infrequent events is a waste of computer cycles, since most of the polling will show that there is nothing to do. Interrupts are a good alternative, giving you one interrupt per event.

But if you have something like a noisy ADC where you grab a bunch of samples and average them, DMA will let you know when the whole data set is ready for processing. If you are dealing with, say, 1000 ADC scans, each 1 ms apart, you could program the DMA controller to transfer 1000 values as they become available. After 1 second, when the last transfer is done, you get an interrupt and you have an array of results to process. There would be no polling involved and only one interrupt generated when the full data set is ready.
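
Here is a toy Python model of that ADC scenario. `MockDmaChannel` is a made-up class, not any vendor's API: it stands in for a DMA channel that stores one value per peripheral request and calls a completion handler exactly once, when the programmed count runs out. The sample values are invented to give a noisy reading around 2048.

```python
class MockDmaChannel:
    """Hypothetical stand-in for a peripheral-to-memory DMA channel."""

    def __init__(self, length, on_complete):
        self.buffer = [0] * length
        self.remaining = length          # counts down like a transfer counter
        self.on_complete = on_complete   # plays the role of the interrupt

    def peripheral_request(self, value):
        """Called once per ADC conversion; the CPU is not involved."""
        if self.remaining == 0:
            return                       # transfer already finished
        self.buffer[len(self.buffer) - self.remaining] = value
        self.remaining -= 1
        if self.remaining == 0:
            self.on_complete(self.buffer)


if __name__ == "__main__":
    interrupts = []

    def transfer_done(buffer):
        # The only "interrupt": average the whole data set at once.
        interrupts.append(sum(buffer) / len(buffer))

    channel = MockDmaChannel(1000, transfer_done)
    for i in range(1000):                # 1000 scans, 1 ms apart = 1 second
        channel.peripheral_request(2047 if i % 2 == 0 else 2049)
    print("interrupts:", len(interrupts), "average:", interrupts[0])
```

The handler runs once for 1000 samples, where an interrupt-per-sample design would have run it 1000 times.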

In the meantime, DMA has used fewer cycles than interrupts would have, since you only get one interrupt for the complete data set, and it has used far less instruction time than polling.

An issue crops up when you are doing a peripheral to memory transfer of an unknown amount of data. For instance, you can set up for a DMA transfer of 1000 bytes of data from your UART, but if you only receive 999 bytes, the transfer will not complete. In this case you could also use a timer that will give you a timeout in, say, 1500 character times. In a timeout situation, you can retrieve the remaining transfer count from the DMA controller to figure out how much data is sitting in your buffer. Then terminate the DMA transfer, process your data, and set up for the next read.
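
The bookkeeping for the timeout case is simple enough to sketch in Python. The helper names are mine, and the 115200 baud rate and 10 bits per character (start, 8 data, stop) are assumptions for the example. On real hardware the remaining count comes from the DMA controller's transfer counter register.

```python
def timeout_us(char_times, baud, bits_per_char=10):
    """Convert a timeout expressed in character times to microseconds."""
    return char_times * bits_per_char * 1_000_000 / baud


def bytes_received(programmed_length, remaining_count):
    """How much data is sitting in the buffer when the timer fires."""
    return programmed_length - remaining_count


if __name__ == "__main__":
    # 1000-byte transfer, timer set for 1500 character times.
    print(f"timeout: {timeout_us(1500, 115200) / 1000:.1f} ms")
    # The DMA counter still shows 1 transfer outstanding, so 999 bytes arrived.
    print("bytes in buffer:", bytes_received(1000, 1))
```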

Carry That Weight

DMA is even more useful when multiple streams happen simultaneously. You could have multiple UARTs receiving and transmitting data, ADCs being unloaded, plus periodic checksum checks of the program memory, all happening while the CPU is taking care of the application.

All Together Now

With a little help from our DMA controller we can keep up with high speed peripherals when polling and interrupts let us down. And if we need multiple data items and don’t want to be bothered until they are all ready, DMA will bring it together. Next time we’ll look at some examples of how DMA is used.

Jack Ganssle has a great article on the theory of DMA.

[1] Hayes, John P.: Computer Architecture and Organization, 1978

[2] PCs tried to solve this issue a while back by using a UART that contained a 16-byte FIFO (and much lower baud rates). The grandchildren of these same UARTs are now housed in the “super-io” chip attached to the southbridge on PCs. With PCs, you have to start unloading the UART within 17 character times.


This post is part of a series. Please see the other posts here.


Music to program by - I got a recommendation from someone on the Boldport Slack channel. It's pretty good, but the tracks that have vocals will break your concentration.

The Intel 8237 DMA controller for the 8086 family. The CPU is on the left and the address and data busses are top and bottom.  From "The 8086 Family User's Manual", Intel 1979.
