DMA - Monster Machines Moving Massive Memory Mounds

Recently, I was experimenting with the CRC unit on my processor. A cyclic redundancy check (CRC) is an error-detecting code that is used to detect corruption of data during communication. CRCs are usually generated by some horrible binary math or table lookup algorithms, but my processor has a CRC generator in hardware. All you have to do is copy 32-bit values into it and it will generate a CRC using the same polynomial as the Ethernet checksum. I figured that it would be a cheap way to validate my configuration parameters at system boot time. I could checksum my parameters and store the CRC away. When I reboot the system, I can checksum the parameters again and compare the result with the stored value to make sure the parameters haven’t been changed by any outside forces.
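To make the boot-time check concrete, here is a minimal sketch in portable C that can be tried on a PC. It models the CRC unit in software, assuming the behaviour described in the STM32F4 reference manual (polynomial 0x04C11DB7, reset value 0xFFFFFFFF, one 32-bit word per write). The ConfigParams structure, its fields, and the function names are made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parameter block; the fields are placeholders. */
typedef struct {
  uint32_t baudRate;
  uint32_t sampleRate;
  uint32_t gain;
} ConfigParams;

/* Software model of writing one 32-bit word to the CRC data register. */
static uint32_t crcFeedWord(uint32_t crc, uint32_t word) {
  crc ^= word;
  for (int bit = 0; bit < 32; bit++) {
    crc = (crc & 0x80000000U) ? ((crc << 1) ^ 0x04C11DB7U) : (crc << 1);
  }
  return crc;
}

/* CRC over a whole number of 32-bit words, starting from the reset value. */
static uint32_t crcWords(const uint32_t *words, size_t count) {
  uint32_t crc = 0xFFFFFFFFU;
  for (size_t i = 0; i < count; i++) {
    crc = crcFeedWord(crc, words[i]);
  }
  return crc;
}

/* Checksum the parameter block; call this when the parameters are saved. */
static uint32_t paramsCrc(const ConfigParams *params) {
  return crcWords((const uint32_t *) params, sizeof(*params) / sizeof(uint32_t));
}

/* At boot: recompute the CRC and compare it with the stored one. */
static bool paramsAreValid(const ConfigParams *params, uint32_t storedCrc) {
  return paramsCrc(params) == storedCrc;
}
```

On the target, crcFeedWord would be replaced by a write to the CRC data register; the compare at boot stays the same.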

To see the results of the calculation, and to help in debugging, I wrote some code that uses DMA to support the printf function. In one of my previous posts, I explained how the C support libraries need a function written to send the output of printf and putc to your choice of serial port. By using DMA, we are potentially freeing up the processor for other things (if we were using an RTOS). But for now, it’s a simple DMA example.

I talked about DMA theory last time in this series. This time I want to show two examples of DMA in practice: UART transmission and CRC calculation.

DMA UART Transmitter Example

In this example, we program the DMA controller to transfer 8-bit characters from memory to the UART transmit data buffer. From there, many things happen in the background, inside the hardware. The UART signals the DMA controller when the transmit buffer is empty and it is able to accept another character. The DMA controller arbitrates for control of main memory and the address and data buses, moves the next character into the UART’s transmit buffer, then relinquishes control of memory and the buses. This continues until the memory block has been completely transferred, at which time the DMA controller generates an interrupt to indicate that the transfer is complete.

The function, below, is called _write. It is given a file number that can indicate which device to send the output to, but we will not be using it in this example. The function is also given a pointer to a string of characters and the length of the string. The function needs to return the number of characters written or -1 if there is an error.

For this, I’m using an STM32F407 processor, the CubeMX tool, and the gcc-based C toolchain from openstm32.org, but the concepts are going to be similar for other processors and toolchains. I started a new project in the same way as in previous articles. To set up the DMA controller, I will use Cube. I enabled the UART that I wired in this article. Next, I went into the Configuration tab and clicked on the DMA item.

Then I added a DMA request for the UART transmitter on DMA controller number 1.

I then generated code for my toolchain and added these functions. You can add them to main.c or a separate .c file that you can carry over to your various projects.

#include <stdbool.h>

static volatile bool serialComDone;
/*
 * The following two functions are used to support the stdio output functions.
 */
int _write(int file, char *outgoing, int len) {
  HAL_StatusTypeDef status;

  serialComDone = false;
  status = HAL_UART_Transmit_DMA(&huart3, (uint8_t *) outgoing, len);
  if (HAL_OK != status) {
    serialComDone = true;
    len = -1;
  }
  while (serialComDone == false) {
    /* Hang around until the DMA complete interrupt fires */
  }

  return (len);
}

void HAL_UART_TxCpltCallback(UART_HandleTypeDef* huart) {
  serialComDone = true;
}

When you use printf, it formats your data and eventually calls _write. In _write, I use a volatile boolean flag called serialComDone. This flag is cleared and the UART is started using DMA. The code now goes into a tight while loop waiting for the serialComDone flag to become true. Once the DMA transfer is complete, HAL_UART_TxCpltCallback is called, which sets serialComDone to be true. The while loop then terminates and _write returns.

On other processors, you will have to figure out how to program your DMA controller, but a lot of the concepts will be identical: set up a transfer from memory to a peripheral, moving 8-bit values into an 8-bit register, and generate an interrupt when it is finished. You may choose to poll the DMA complete bit instead.

If I were using an RTOS, I would replace the while loop with an RTOS call to pend on a semaphore. The interrupt routine would post to the semaphore. Without an RTOS, instead of having the while loop, if there are no errors, I could return directly after the HAL_UART_Transmit_DMA call. On the next call, I might pend if the previous DMA is still active.
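Here is a rough sketch of that last, non-RTOS variant, written so it can be tried on a PC. It is not the code I ran: startUartDma is a hypothetical stand-in for HAL_UART_Transmit_DMA(&huart3, ...), txCompleteHook is what HAL_UART_TxCpltCallback would call, and the buffer size is an arbitrary choice.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TX_BUFFER_SIZE 128U   /* hypothetical; size it for your longest line */

static volatile bool txBusy;              /* true while a DMA transfer is in flight */
static uint8_t txBuffer[TX_BUFFER_SIZE];  /* private copy that stays valid after we return */
static unsigned int transfersStarted;     /* only here so the sketch can be checked on a PC */

/* Hypothetical stand-in for HAL_UART_Transmit_DMA(&huart3, data, len); returns 0 on success. */
static int startUartDma(const uint8_t *data, uint16_t len) {
  (void) data;
  (void) len;
  transfersStarted++;
  return 0;
}

/* On the target, call this from HAL_UART_TxCpltCallback. */
static void txCompleteHook(void) {
  txBusy = false;
}

/* Body for _write: returns the number of characters accepted, or -1 on error. */
static int writeNonBlocking(const char *outgoing, int len) {
  if (len <= 0) {
    return (len == 0) ? 0 : -1;
  }
  if ((size_t) len > sizeof(txBuffer)) {
    len = (int) sizeof(txBuffer);   /* short write; assumes the caller retries with the rest */
  }
  while (txBusy) {
    /* Only wait here if the previous transfer is still going */
  }
  memcpy(txBuffer, outgoing, (size_t) len);
  txBusy = true;
  if (startUartDma(txBuffer, (uint16_t) len) != 0) {
    txBusy = false;
    return -1;
  }
  return len;   /* return right away; the DMA controller finishes in the background */
}
```

The copy into txBuffer is the price of returning early: the string handed to _write belongs to the stdio library, which may reuse it as soon as _write returns, so the DMA controller needs its own copy to read from.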

DMA CRC Generation Example

In this example, I will be using DMA to generate a checksum for the code that has been loaded onto my processor’s internal Flash. This method could be used in high reliability systems to periodically checksum the firmware in Flash memory, checking the result against a known value to detect corruption.

First, please look at “Using the CRC peripheral in the STM32 family”, then I'll give a more detailed example of how you can implement what they talk about to verify your internal Flash.

Using Cube, we use DMA controller number 2. Unlike DMA1, DMA2 has the ability to do memory-to-memory transfers. So click on the Configuration tab, click DMA, then DMA2. This time we will add a memory-to-memory transfer that will go from Flash memory, where our code is stored, into the memory-mapped CRC unit data register.

The CRC unit is a 32-bit device, so the transfers are done with Word size (the other options are byte (8-bit) or half-word (16-bit)). The DMA controller will iterate through our data block, incrementing the source address after each word, but each value will be deposited at the same address, so the destination memory address is not incremented. The other options are quite arcane, but feel free to read about them in the DMA chapter of the manual.

Before we see the code, I have a few notes about it.

I’ve included the UART transmitter code from above in this block. The CRC DMA transfers use the same sort of boolean flag in an interrupt routine to detect the end of the DMA transfer.

Next, the ARM architecture places the boot ROM at location 0, but ST provides a bootloader in their processors at that location. Your program gets burnt into the chip at location 0x0800 0000. Depending on the voltage on a pin called BOOT0, either the bootloader or your code will appear at location 0 when the processor starts. To checksum our code, we have to use the version stored at location 0x0800 0000. When I tried to checksum the code starting at location 0, the processor halted with an interrupt.

This processor’s DMA controller can only transfer up to 65,535 items at a time. We need to CRC 1MB of code, 32 bits (one word) at a time, which is 262,144 words. That doesn’t fit into 4 maximum-sized transfers, because the maximum is 64K minus 1, and 4 × 65,535 is only 262,140 words. Arrrg. So I decided to use 8 transfers of 32,768 words each.
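The arithmetic can be checked with a few lines of C. This is only a sketch: the macro and function names are made up, and the function looks for the smallest number of equal-sized blocks that each fit in one transfer, which is how I read my own choice of 8.

```c
#include <stdint.h>

#define FLASH_WORDS   ((1024U * 1024U) / 4U)   /* 1MB of Flash as 32-bit words */
#define DMA_MAX_ITEMS 65535U                   /* largest item count for one transfer */

/* Compile-time checks of the numbers quoted in the text (C11 _Static_assert). */
_Static_assert(FLASH_WORDS == 262144U, "1MB is 262,144 words");
_Static_assert(4U * DMA_MAX_ITEMS == 262140U, "four full transfers fall 4 words short");

/* Smallest number of equal-sized blocks such that each block fits in one transfer. */
static uint32_t equalBlockCount(uint32_t totalWords, uint32_t maxItems) {
  if ((totalWords == 0U) || (maxItems == 0U)) {
    return 0U;   /* nothing to transfer, or no usable transfer size */
  }
  uint32_t blocks = (totalWords + maxItems - 1U) / maxItems;   /* round up */
  while ((totalWords % blocks) != 0U) {
    blocks++;    /* 5, 6 and 7 do not divide 262,144 evenly */
  }
  return blocks;
}
```

For 262,144 words and a 65,535 item limit, this lands on 8 blocks of 32,768 words, because 5, 6, and 7 blocks would leave a remainder.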

Just for kicks, I decided to use this code to figure out if there is any performance difference between using DMA and just pushing the same data into the CRC unit in a tight loop. I put in code to see how many times I can CRC the Flash array in one second then print the results as well as the checksum just to make sure that it is all working properly.

The code goes like this:

  • Register a callback function that gets called when the DMA transfer is complete.

  • Take note of the current tick time.

  • Reset the CRC unit so that each pass starts from the same initial value.

  • For each of the 8 blocks

    • Clear the DMA done flag.

    • Start transferring a 32K word block into the CRC unit using DMA

    • Wait for the done flag to get set. (When the DMA transfer completes, set the done flag.)

  • Retrieve the final CRC
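The steps above can be modelled on a PC to check one property the design relies on: because the CRC unit is reset only once per pass, 8 block transfers into the same data register give the same result as one long loop over the data. This is a sketch, not the target code. The CRC model assumes the polynomial and reset value from the STM32F4 reference manual, a small array stands in for Flash, and all of the names are made up.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_WORDS  64U   /* stand-in for the 256K words of Flash */
#define MODEL_BLOCKS 8U

static uint32_t crcDataRegister;   /* models CRC->DR */

static void crcReset(void) {
  crcDataRegister = 0xFFFFFFFFU;
}

/* Models a write to CRC->DR: the new word is folded into the running CRC. */
static void crcWrite(uint32_t word) {
  uint32_t crc = crcDataRegister ^ word;
  for (int bit = 0; bit < 32; bit++) {
    crc = (crc & 0x80000000U) ? ((crc << 1) ^ 0x04C11DB7U) : (crc << 1);
  }
  crcDataRegister = crc;
}

/* Models one DMA transfer: the source address increments, the destination does not. */
static void dmaModelTransfer(const uint32_t *source, size_t items) {
  for (size_t i = 0; i < items; i++) {
    crcWrite(source[i]);
  }
}

/* One reset, then MODEL_BLOCKS back-to-back transfers. */
static uint32_t crcInBlocks(const uint32_t *data) {
  const size_t blockSize = MODEL_WORDS / MODEL_BLOCKS;
  crcReset();
  for (size_t block = 0; block < MODEL_BLOCKS; block++) {
    dmaModelTransfer(data + (block * blockSize), blockSize);
  }
  return crcDataRegister;
}

/* One reset, then a single loop over all of the data. */
static uint32_t crcInOneLoop(const uint32_t *data) {
  crcReset();
  for (size_t i = 0; i < MODEL_WORDS; i++) {
    crcWrite(data[i]);
  }
  return crcDataRegister;
}
```

Both functions should return the same value for the same data, which is the same sanity check I use on the real hardware when I compare the two printed checksums.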

The iterative method simply uses a pointer to a 32-bit word and a for loop to move 256K words into the CRC unit, then retrieves the result. Now, let’s look at this code. Note that it uses the Cube generated functionality but is in a completely separate file where Application() is called from main.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include "application.h"
#include "dma.h"
#include "crc.h"
#include "usart.h"
#include "rng.h"
#include "stm32f4xx_hal_crc.h"

/* Defines section - local macros and definitions for array sizes etc. */
/* Number of 32-bit words of code FLASH */
#define FLASH_SIZE ((1024U*1024U)/4U) 

/* The number of blocks that we will checksum */
#define FLASH_CRC_BLOCKS 8U

/* Number of words to CRC in one transfer, must be less than 65536 */ 
#define FLASH_CRC_BLOCK_SIZE (FLASH_SIZE / FLASH_CRC_BLOCKS) 

#define START_OF_FLASH 0x08000000U    /* Start address of the code FLASH */
#define ONE_SECOND 1000U              /* in millisecond ticks */

/* File global variable section. All should be declared static. */
static volatile bool crcDMAdone;
static volatile bool serialComDone;

/* Private function prototypes. */
void DMADoneCallback(DMA_HandleTypeDef* handle);

/* Public functions */
void Application(void) {
/*
 * This application demonstrates the use of DMA in two ways:
 * First, using DMA to support the low level _write function for printf.
 * Second, implementing a memory to memory DMA function to generate a checksum for the code memory.
 */
  uint8_t block;
  uint32_t i;
  uint32_t count;
  uint32_t startTime;
  uint32_t* flashPointer;
  HAL_StatusTypeDef returnStatus;

  returnStatus = HAL_DMA_RegisterCallback(&hdma_memtomem_dma2_stream0,
                     HAL_DMA_XFER_CPLT_CB_ID, DMADoneCallback);
  if (returnStatus != HAL_OK) {
    printf("Registering the DMA complete callback failed with %d\r\n", returnStatus);
  }

  FOREVER {
    count = 0U;
    startTime = HAL_GetTick();
    do {
      __HAL_CRC_DR_RESET(&hcrc);
      for (block = 0; block < FLASH_CRC_BLOCKS; block++) {
        crcDMAdone = false;
        /* Note that the count is the number of 32 bit values */
        HAL_DMA_Start_IT(&hdma_memtomem_dma2_stream0,
                         START_OF_FLASH + (block * 4U * FLASH_CRC_BLOCK_SIZE),
                         (uint32_t) &CRC->DR, FLASH_CRC_BLOCK_SIZE);
        while (crcDMAdone == false) {
          /* Wait here for the DMA to complete */
        }
      }
      count++;
    } while ((HAL_GetTick() - startTime) < ONE_SECOND);
    printf("%d DMA CRCs per second. And the CRC for this batch is %08X\r\n", 
            (int) count, (unsigned int) CRC->DR);
   
    count = 0U;
    startTime = HAL_GetTick();
    do {
      flashPointer = (uint32_t*) START_OF_FLASH;
      __HAL_CRC_DR_RESET(&hcrc);
      for (i = 0U; i < FLASH_SIZE; i++) {
        CRC->DR = *flashPointer++;
      }
      count++;
    } while ((HAL_GetTick() - startTime) < ONE_SECOND);
    printf("%d manual CRCs per second. And the CRC for this batch is %08X\r\n", 
            (int) count, (unsigned int) CRC->DR);
  }
}

void DMADoneCallback(DMA_HandleTypeDef* handle) {
  crcDMAdone = true;
}

/*
 * The following two functions are used to support the stdio output functions.
 */
int _write(int file, char *outgoing, int len) {
  HAL_StatusTypeDef status;
  
  serialComDone = false;
  status = HAL_UART_Transmit_DMA(&huart3, (uint8_t *) outgoing, len);
  if (HAL_OK != status) {
    serialComDone = true;
    len = -1;
  }
  while (serialComDone == false) {
    /* Hang around until the DMA complete interrupt fires */
  }
  return (len);
}

void HAL_UART_TxCpltCallback(UART_HandleTypeDef* huart) {
  serialComDone = true;
}

The results are in:

80 DMA CRCs per second. And the CRC for this batch is A023B174
63 manual CRCs per second. And the CRC for this batch is A023B174

On my processor, DMA gives a 27% advantage (80 versus 63 passes per second) over iterative memory assignment. I think it is because everything is done by a hardware mover that doesn’t have to execute a software loop: no pointer increments, no register loads and stores, no compares and branches.

The End

The main benefits of DMA include the ability to move relatively large amounts of data faster than by using iterative load/store instructions. It also frees up the CPU to do other tasks while transfers happen. DMA can be used to solve data loss issues when dealing with high-speed, time-sensitive peripherals like UART input. But, as we saw in the previous post, DMA also helps with slow peripherals, bundling together sparse events into a single transfer, like repetitive ADC conversions.


This post is part of a series. Please see the other posts here.


Music to program by - "The Jazz Show" for January 30, 2017 from CITR, student radio from the University of British Columbia in Vancouver. History and music, spun by a real jazz cat, Gavin Walker. This weekly 3-hour podcast punches far above its student radio weight class.

A powerful mover, but not a general purpose vehicle. By James Heilman, MD (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0) or GFDL (http://www.gnu.org/copyleft/fdl.html)], via Wikimedia Commons