Using the DMA and AXI4 Stream on Zynq US+.

When we work with an SoC or MPSoC, exchanging data between the PL and the APU, or between the PL and the RPU, is very common. To do that, the Zynq platform gives us several interfaces between the PL and the PS (both the APU and the RPU). In the case of Zynq MPSoC, these interfaces are described in chapter 35 of the Zynq US+ TRM. In this post, we will focus on the High Performance interfaces (S_AXI_HP0-3_FPD).

What makes these interfaces interesting is their direct connection to the DDR, without PS intervention. In other words, they give the PL direct access to the main memory of the system, which allows our IPs to write data to the DDR in parallel with the program running on the PS. This connection is made through the System Memory Management Unit (SMMU), which is basically a unit that translates between the AXI address map and the PS address map.

To manage this kind of transfer, we will use the AXI DMA IP from Xilinx. DMA stands for Direct Memory Access, and it allows data transfer between two memories, between a data generator (such as an ADC) and memory, or between memory and a data consumer (such as a DAC). In the project that I have designed, the APU will send a signal to the xFFT IP on the PL side, and then the xFFT will send the result of the FFT back to the APU. We can classify this example as an FFT accelerator, since the PL will compute the FFT using several multipliers in parallel. The data transfer diagram is shown in the next figure.

As we can see in the diagram, two DMAs are used. DMA_0 moves data in both directions between the DDR and the xFFT, so it will be used to send the signal and receive the result. The other one, DMA_1, will be used to configure the xFFT.

Focusing on the DMA, we can see that there are two kinds of AXI4 connection on each DMA. The AXI4 Lite interface is used to configure the DMA (set source and destination addresses, start a transfer…), and the AXI4 Stream ports carry the data itself. The S2MM (Stream to Memory Mapped) port transfers data from the xFFT to the DDR (its slave side connects to the xFFT, and its master side to the AXI interconnect), and the MM2S (Memory Mapped to Stream) port sends data from the DDR to the xFFT (its slave side connects to the AXI interconnect, and its master side to the xFFT).
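To make the role of the AXI4 Lite interface more concrete, here is a minimal register-level sketch of how a simple (non scatter-gather) transfer could be started. This is not the code used in this project, which relies on the Xilinx driver. The register offsets and bit positions are the ones I recall from the AXI DMA product guide for direct register mode, so treat them as assumptions and check them against your IP version. The function names (dma_start_mm2s, dma_start_s2mm, dma_channel_idle) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Register offsets for direct register (simple) mode. These are assumptions
 * taken from memory of the AXI DMA product guide; verify before use. */
#define MM2S_DMACR   0x00u  /* MM2S control register */
#define MM2S_DMASR   0x04u  /* MM2S status register */
#define MM2S_SA      0x18u  /* MM2S source address, lower 32 bits */
#define MM2S_SA_MSB  0x1Cu  /* MM2S source address, upper 32 bits */
#define MM2S_LENGTH  0x28u  /* MM2S length in bytes */
#define S2MM_DMACR   0x30u  /* S2MM control register */
#define S2MM_DMASR   0x34u  /* S2MM status register */
#define S2MM_DA      0x48u  /* S2MM destination address, lower 32 bits */
#define S2MM_DA_MSB  0x4Cu  /* S2MM destination address, upper 32 bits */
#define S2MM_LENGTH  0x58u  /* S2MM buffer length in bytes */

#define DMACR_RS     0x1u   /* run/stop bit in the control register */
#define DMASR_IDLE   0x2u   /* idle bit in the status register */

static void reg_write(volatile uint32_t *base, uint32_t off, uint32_t value)
{
    base[off / 4u] = value;
}

static uint32_t reg_read(volatile uint32_t *base, uint32_t off)
{
    return base[off / 4u];
}

/* DDR -> stream: set run bit, program the source address, then the length.
 * Writing a non-zero length is what launches the transfer. */
void dma_start_mm2s(volatile uint32_t *base, uint64_t src, uint32_t nbytes)
{
    assert(nbytes > 0u);
    reg_write(base, MM2S_DMACR, reg_read(base, MM2S_DMACR) | DMACR_RS);
    reg_write(base, MM2S_SA, (uint32_t)src);
    reg_write(base, MM2S_SA_MSB, (uint32_t)(src >> 32));
    reg_write(base, MM2S_LENGTH, nbytes);
}

/* Stream -> DDR: same sequence on the S2MM register set. */
void dma_start_s2mm(volatile uint32_t *base, uint64_t dst, uint32_t nbytes)
{
    assert(nbytes > 0u);
    reg_write(base, S2MM_DMACR, reg_read(base, S2MM_DMACR) | DMACR_RS);
    reg_write(base, S2MM_DA, (uint32_t)dst);
    reg_write(base, S2MM_DA_MSB, (uint32_t)(dst >> 32));
    reg_write(base, S2MM_LENGTH, nbytes);
}

/* Returns non-zero when the channel whose status register is at
 * status_off reports idle, i.e. the last transfer has completed. */
int dma_channel_idle(volatile uint32_t *base, uint32_t status_off)
{
    return (reg_read(base, status_off) & DMASR_IDLE) != 0u;
}
```

On the target, base would point to the DMA's AXI4 Lite address range. A typical order is to arm S2MM first, so the receive side is ready, then start MM2S and poll both channels until they are idle. Because base is just a pointer, the same functions can be exercised on a host PC against a plain array.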

Regarding the xFFT, the configuration is quite different from the one we used in a previous post. On that occasion, the xFFT IP was managed by RTL code that I designed, so the input and output widths could be customized to suit my own IP manager. In this case, the widths of the input and output ports must be compliant with AXI4 widths. This affects the FFT scaling because, regardless of the length of the FFT, the output width has to be 32 or 64 bits, but the FFT is cumulative, and the width of the final result depends on the number of samples computed.

To reduce the output width, xFFT allows us to apply a scaling factor at each step. This scaling factor is defined by the number of right shifts applied to the result of each step, and is coded in 2 bits, which correspond to factors of 1/1, 1/2, 1/4 and 1/8. The number of steps depends on the algorithm used to implement the FFT. With Radix-2, each step splits the computation into 2 parallel FFTs, so a 64-point FFT needs log2(64) = 6 steps (64/2, 32/2, 16/2, 8/2, 4/2, 2/2). With Radix-4, each step performs 4 parallel FFTs, so the number of steps is halved to log4(64) = 3 (64/4, 16/4, 4/4).

Radix-2 uses fewer DSP slices than Radix-4, so at this point we have to decide between saving resources with Radix-2 or improving throughput with Radix-4. The configuration I have selected is Radix-4 with scaled output, so I have to configure 3 scale factors.
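The step count and the 2-bit scale factors can be sketched in a few lines of C. This is only an illustration: the function names (fft_stages, pack_scale_schedule, total_shift) are hypothetical, and I am assuming that the first step occupies the two least significant bits of the schedule. The exact position of the schedule inside the xFFT configuration word depends on the IP settings, so check it in the FFT product guide.

```c
#include <assert.h>
#include <stdint.h>

/* Number of steps for an n_points FFT with the given radix, i.e.
 * log_radix(n_points). Returns 0 when n_points is not an exact power
 * of the radix (this sketch does not handle mixed-radix lengths). */
unsigned fft_stages(unsigned n_points, unsigned radix)
{
    assert(radix == 2u || radix == 4u);
    unsigned stages = 0u;
    while (n_points > 1u) {
        if (n_points % radix != 0u) {
            return 0u;
        }
        n_points /= radix;
        stages++;
    }
    return stages;
}

/* Pack one 2-bit shift count (0..3, meaning 1/1, 1/2, 1/4, 1/8) per step.
 * Assumption: step 0 goes in bits 1:0, step 1 in bits 3:2, and so on. */
uint32_t pack_scale_schedule(const uint8_t *shifts, unsigned stages)
{
    assert(shifts != 0 && stages <= 16u);
    uint32_t schedule = 0u;
    for (unsigned i = 0u; i < stages; i++) {
        assert(shifts[i] <= 3u);
        schedule |= (uint32_t)shifts[i] << (2u * i);
    }
    return schedule;
}

/* Total number of right shifts; the overall scaling is 1 / 2^total. */
unsigned total_shift(const uint8_t *shifts, unsigned stages)
{
    unsigned total = 0u;
    for (unsigned i = 0u; i < stages; i++) {
        total += shifts[i];
    }
    return total;
}
```

For the 64-point Radix-4 configuration, fft_stages(64, 4) gives 3, so three shift counts are needed. Choosing 2 for each of them, for example, scales the result by 1/64 overall.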

The final block design should look like the following one.

Once we have our block design complete, the bitstream generated, and the hardware exported, the next step is the software design in Vitis.

To develop the C code that manages the DMAs, I’ve used the example code provided by Xilinx, xaxidma_example_simple_poll.c. This code has to be modified to add a second DMA and the signal generation, and I have also added verification code that computes the two's complement and the modulus of each harmonic.
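The verification step can be sketched as follows. This is not the original code, only a minimal version under one assumption: each 32-bit word written by the DMA holds the real part in its lower 16 bits and the imaginary part in its upper 16 bits, both in two's complement. The function names are hypothetical, and an integer square root is used so the sketch does not depend on the math library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Interpret the lower 16 bits of raw as a two's-complement signed value. */
static int32_t from_twos_complement16(uint32_t raw)
{
    raw &= 0xFFFFu;
    return (raw & 0x8000u) ? (int32_t)raw - 0x10000 : (int32_t)raw;
}

/* Integer square root (rounded down), bit-by-bit method. */
static uint32_t isqrt_u32(uint32_t v)
{
    uint32_t result = 0u;
    uint32_t bit = 1u << 30;
    while (bit > v) {
        bit >>= 2;
    }
    while (bit != 0u) {
        if (v >= result + bit) {
            v -= result + bit;
            result = (result >> 1) + bit;
        } else {
            result >>= 1;
        }
        bit >>= 2;
    }
    return result;
}

/* Compute the modulus of every FFT bin.
 * Assumed word layout: bits 15:0 = real part, bits 31:16 = imaginary part. */
void fft_modulus(const uint32_t *words, uint32_t *abs_out, size_t n)
{
    assert(words != NULL && abs_out != NULL);
    for (size_t i = 0; i < n; i++) {
        int32_t re = from_twos_complement16(words[i]);
        int32_t im = from_twos_complement16(words[i] >> 16);
        /* Each square is at most 2^30, so the sum fits in 32 bits. */
        uint32_t power = (uint32_t)(re * re) + (uint32_t)(im * im);
        abs_out[i] = isqrt_u32(power);
    }
}
```

As a quick check, the word 0x0004FFFD decodes to a real part of -3 and an imaginary part of 4, so its modulus is 5.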

The result will appear in the abs vector, and for this code, the result shows the third harmonic and its image, with an amplitude of 1260.
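To see why a real input produces both a harmonic and its image, here is a small host-side sketch that computes a naive 64-point DFT of a cosine with 3 cycles in the window. It is a software reference, not the xFFT. The names and the input amplitude are hypothetical, and the values are unscaled, so they will not reproduce the 1260 above, which depends on the signal amplitude and the scaling schedule used. The sine and cosine of 2π/64 are hard-coded so the code does not need the math library.

```c
#include <assert.h>
#include <stddef.h>

#define DFT_LEN 64

/* cos(2*pi/64) and sin(2*pi/64), hard-coded to avoid depending on libm. */
static const double COS_STEP = 0.9951847266721969;
static const double SIN_STEP = 0.0980171403295606;

/* Fill cos_t[k] = cos(2*pi*k/64) and sin_t[k] = sin(2*pi*k/64)
 * by repeatedly rotating the point (1, 0) by one step. */
static void build_tables(double *cos_t, double *sin_t)
{
    double c = 1.0;
    double s = 0.0;
    for (int k = 0; k < DFT_LEN; k++) {
        cos_t[k] = c;
        sin_t[k] = s;
        double c_next = c * COS_STEP - s * SIN_STEP;
        s = s * COS_STEP + c * SIN_STEP;
        c = c_next;
    }
}

/* Squared modulus of every bin of the naive DFT of
 * x[n] = amplitude * cos(2*pi*cycles*n/64). */
void dft_power_of_cosine(int cycles, double amplitude, double *power)
{
    assert(cycles >= 0 && cycles < DFT_LEN && power != NULL);
    double cos_t[DFT_LEN];
    double sin_t[DFT_LEN];
    build_tables(cos_t, sin_t);
    for (int k = 0; k < DFT_LEN; k++) {
        double re = 0.0;
        double im = 0.0;
        for (int n = 0; n < DFT_LEN; n++) {
            double x = amplitude * cos_t[(cycles * n) % DFT_LEN];
            re += x * cos_t[(k * n) % DFT_LEN];
            im -= x * sin_t[(k * n) % DFT_LEN];
        }
        power[k] = re * re + im * im;
    }
}

/* Index of the largest bin, ignoring bin `skip` (pass -1 to ignore none). */
int peak_bin(const double *power, int skip)
{
    int best = -1;
    for (int k = 0; k < DFT_LEN; k++) {
        if (k != skip && (best < 0 || power[k] > power[best])) {
            best = k;
        }
    }
    return best;
}
```

Calling dft_power_of_cosine(3, 100.0, power) and then peak_bin twice, the second time skipping the first result, returns bins 3 and 61 in some order. Bin 61 is 64 - 3, the image of the third harmonic, and it appears because the spectrum of a real signal is conjugate symmetric.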

In the next post we will see how to replace the signal generation in the C code with an AXI4 Stream IP that manages Digilent’s ZMOD ADC.
