In industry we can find many different applications, and most of them, especially in recent years, require some kind of computing. Acquiring data where the application is and processing it later somewhere else, perhaps in an office or a data center, is a good option in some cases. In many other applications, though, we need to process the data and make a decision in the same place where the application runs, and we need to do it as fast as possible. This is the case, for example, of damaged fruit detection, where the computer has to take a photo, analyze whether the fruit is damaged, and make a decision before the fruit reaches the box. In applications like this, where the processing is executed next to the application itself, or in other words, on the edge, we need computers with special requirements such as small size, robustness and low power consumption. These are the requirements of edge computers.

Last Christmas I had the opportunity to use an edge computer from Advantech, in particular the EI-52 unit. This computer is very robust since it does not have any mechanical elements like fans or hard disk drives. The interfaces are distributed between the front and the back sides, and we can find everything from gigabit interfaces like HDMI or DisplayPort to ports that, although they are no longer used in consumer electronics, are widely used in industry, like COM ports. If we need more power, we can also run our application on units like the EPC-C Series, which are powerful and full of interfaces.

In this blog I talk about FPGAs, so how are devices like the EI-52 or an EPC-C related to FPGAs? Well, the key is the interfaces that these devices have, in particular the PCIe interface. From my point of view, the power of this kind of computer does not have to be in its processor, but in the interfaces that allow it to connect to other devices or even coprocessors, which execute specific tasks while the main processor is in charge of generic tasks like communications or data storage. In this article we are going to design a system with a host PC that has a generic processor, and we are going to connect to it, through the PCIe interface, an FPGA with a soft-core implemented to act as a coprocessor. The data shared between the host PC and the coprocessor will be stored in a DDR memory attached to the FPGA.

The board we will use is the LiteFury, a board in M.2 format built around an Artix-7 100T FPGA. Since there is no board file available for the LiteFury, we will need to define all the parameters of the board ourselves. During project creation, we have to select the part xc7a100tfgg484-2L.

Then we will follow all the steps to create a project, without selecting any source files at this moment. Once the project is created, our design will be based on a block design, so we need to create one. In my case, I named the block design by adding _bd to the project name, so the block design name is pci_mb_accel_bd.

Now we have an empty block design. The first block we are going to add is the xDMA, which will be in charge of the PCIe interface. The LiteFury is an M.2 format board, so it has 4 PCIe lanes. The GTP transceivers included in the Artix-7 part can reach up to 6.6 Gbps, although for the PCIe interface the maximum available speed is 5 GT/s per lane (PCIe Gen2). Regarding the DMA interface, to read and write data in memory through the DMA we need to configure this module to use an AXI Memory Mapped interface, with two read channels (H2C, host to card) and two write channels (C2H, card to host). The read and write names follow the point of view of the DMA engine: an H2C channel reads from host memory and writes to the card. The xDMA module can also be configured to use an AXI Stream interface instead, as I did in a previous post.
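As a rough check of what this link can carry: PCIe Gen2 signals at 5 GT/s per lane and uses 8b/10b encoding, so only 8 of every 10 bits on the wire are data. The sketch below computes that theoretical ceiling for a given number of lanes. The function name is mine and is not part of any Xilinx tool, and packet and driver overhead are ignored.

```bash
#!/usr/bin/env bash
# Theoretical PCIe Gen2 throughput in MB/s for a given lane count.
# Assumptions: 5 GT/s per lane, 8b/10b line encoding, no packet overhead.
pcie_gen2_max_MBps() {
  local lanes=$1
  local raw_mbps=$(( lanes * 5000 ))        # raw line rate in Mbit/s
  local data_mbps=$(( raw_mbps * 8 / 10 ))  # remove the 8b/10b encoding overhead
  echo $(( data_mbps / 8 ))                 # bits to bytes
}

pcie_gen2_max_MBps 4
```

For the four lanes of the LiteFury this gives 2000 MB/s. Real transfers will be slower, since packet headers and driver overhead are not counted here.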

Next, we need to connect the PCIe reference clock through a differential buffer. To make this connection we need to add a Utility Buffer and configure it as IBUFDS_GTE2. Finally, we need to connect a constant block configured with a value of ‘0’ to the pci_clkreq_l output.

Once the block is in the block design, we need to change the data width of the AXI interface in order to make it match the width of the memory interface.

The next module we are going to add is the Memory Interface Generator (MIG7) for the DDR memory. In general, this block is configured in the board file supplied by the manufacturer. In this case, RHS Research gives us a project where the MIG7 block is already instantiated. What I did was create a new MIG7 block, check the configuration in the provided project, and copy that configuration into this project, with one change. The project provided by RHS Research instantiates the XADC to read the temperature of the silicon, and this temperature is then connected to the MIG7 block, which uses it to adjust the delays of the DDR interface. In this project we do not need the XADC for anything else, so we will configure the MIG7 block to instantiate the XADC internally.

Regarding the pinout, to avoid copying it pin by pin, we can export a UCF file with the pin configuration from the original project. Then, in the new project, we can import the pinout configuration and verify it.

When the configuration is finished, the MIG7 block will be instantiated in the block design. Now we need to make the connections to the IOs. First, we need to connect the clock signal to an input named sys_clk. The name is important, since the MIG has its own constraints file and this pin is defined in it. Next, we need to create the interface to the DDR, which has to be named DDR. I have also connected two of the board LEDs to verify that the MIG initializes correctly.

In the image below we can see the MIG7 and the xDMA blocks with their corresponding connections. Notice that the MIG7 block has two reset inputs, aresetn and sys_rst, which are also negated, and both inputs are connected to a constant block configured with a value of ‘0’. This configuration is valid since we want the MIG7 to be initialized at startup. In this case I didn’t use a reset generator, but Vivado will add one when you click on Run Block Automation in the pop-up. You can keep the reset generator, but it will only have one input and one output connected. The design has another reset, generated by the xDMA block, but this one is connected to the AXI Interconnect, so it does not have to be connected directly to the MIG7.

Regarding the clock signals, the MIG7 block generates a 200 MHz clock (800/4) and the xDMA runs at 100 MHz, but we do not need to handle the crossing between these two clock domains ourselves, since this is managed by the AXI Interconnect.

Finally, we will add our coprocessor to the design, the MicroBlaze core. I have selected the Real-Time preset to give the core some mathematical capabilities, but any of the configurations will work. The MicroBlaze will have a local memory block, and I have added ports for the data and instruction caches (DC and IC). These interfaces are also AXI Memory Mapped, so we need to connect them to the same AXI Interconnect that is connected to the MIG7 block. This interconnect links the external DDR both to the host PC, through the PCIe interface, and to the coprocessor, which allows data to be transferred between them. I have also added a Block RAM that acts as an extension of the local memory of the MicroBlaze and is also accessible from the PCIe interface.

If we take a look at the address map, we will see that there are some unassigned segments. This is because the default addresses conflict with each other. To fix that, we only have to change the address ranges and assign them. In my case, the resulting address map is the following.

Notice that the addresses seen from the PCIe interface may be different from the addresses seen from the MicroBlaze. The block that manages this is the AXI Interconnect.

At this point the block design is complete, so we need to create the HDL wrapper and generate the bitstream. Once the design is implemented, we will export the hardware to create the .xsa file and import it into Vitis.

The next step is to create the software design. Once in Vitis, we have to create the platform project, which describes the hardware where the software will run. To create the platform we use the .xsa file generated before. This platform project has to be built in order to generate the corresponding output files.

Now we need to create an application project that will run on the platform we have created. The application project belongs to a system project, which contains the application that runs on each core. In this case, since we have just one core, the system will contain only one application. When we create the application project, Vitis asks whether we want to use a template. In general I use the Hello World template, but it uses a UART, and the hardware design we have created does not have an AXI UART block, so we cannot use it. In my case, I have used an empty project, since we only need an application that keeps the processor running. The code I used is as simple as the following.

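int main(void) {
  /* volatile keeps the compiler from optimizing the dummy write away */
  volatile int test;

  /* Keep the processor running forever */
  while (1) {
    test = 1;
  }
}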

Now we can program the board and the design is ready. Next, on the edge computer, we need to install the xDMA drivers. Since the edge computer runs an Ubuntu Server distribution, we can follow the steps described in this article to install the drivers. Once the drivers are installed, we need to reboot the edge computer.

Now, we must check if the board has been detected correctly. We can do that using the command lspci.

~$ lspci
00:00.0 Host bridge: Intel Corporation Device 9b73 (rev 03)
00:02.0 VGA compatible controller: Intel Corporation Device 9ba8 (rev 03)
00:08.0 System peripheral: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model
00:14.0 USB controller: Intel Corporation Device a3af
00:14.2 Signal processing controller: Intel Corporation Device a3b1
00:16.0 Communication controller: Intel Corporation Device a3ba
00:17.0 SATA controller: Intel Corporation Device a382
00:1c.0 PCI bridge: Intel Corporation Device a394 (rev f0)
00:1f.0 ISA bridge: Intel Corporation Device a3da
00:1f.2 Memory controller: Intel Corporation Device a3a1
00:1f.3 Audio device: Intel Corporation Device a3f0
00:1f.4 SMBus: Intel Corporation Device a3a3
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (12) I219-V
01:00.0 Serial controller: Xilinx Corporation Device 7024

We can see that the board is detected correctly. Remember that we configured the xDMA block with two channels for writing and two for reading. If we check the devices detected by the edge computer, we can see the devices xdma0_c2h_x and xdma0_h2c_x, which correspond to the different xDMA channels.

~/dma_ip_drivers/XDMA/linux-kernel/tests$ ls /dev
autofs           dri          i2c-3    loop14        mem     rtc       tty0   tty20  tty32  tty44  tty56      ttyS0   ttyS20  ttyS4    vcs2   vcsu1        xdma0_control    xdma0_events_6
block            drm_dp_aux0  i2c-4    loop2         mqueue  rtc0      tty1   tty21  tty33  tty45  tty57      ttyS1   ttyS21  ttyS5    vcs3   vcsu2        xdma0_events_0   xdma0_events_7
bsg              ecryptfs     initctl  loop3         net     sda       tty10  tty22  tty34  tty46  tty58      ttyS10  ttyS22  ttyS6    vcs4   vcsu3        xdma0_events_1   xdma0_events_8
btrfs-control    fd           input    loop4         null    sda1      tty11  tty23  tty35  tty47  tty59      ttyS11  ttyS23  ttyS7    vcs5   vcsu4        xdma0_events_10  xdma0_events_9
bus              full         kmsg     loop5         nvram   sda2      tty12  tty24  tty36  tty48  tty6       ttyS12  ttyS24  ttyS8    vcs6   vcsu5        xdma0_events_11  xdma0_h2c_0
char             fuse         kvm      loop6         port    sg0       tty13  tty25  tty37  tty49  tty60      ttyS13  ttyS25  ttyS9    vcsa   vcsu6        xdma0_events_12  xdma0_h2c_1
console          gpiochip0    log      loop7         ppp     shm       tty14  tty26  tty38  tty5   tty61      ttyS14  ttyS26  udmabuf  vcsa1  vfio         xdma0_events_13  zero
core             hpet         loop0    loop8         psaux   snapshot  tty15  tty27  tty39  tty50  tty62      ttyS15  ttyS27  uhid     vcsa2  vga_arbiter  xdma0_events_14  zfs
cpu              hugepages    loop1    loop9         ptmx    snd       tty16  tty28  tty4   tty51  tty63      ttyS16  ttyS28  uinput   vcsa3  vhci         xdma0_events_15
cpu_dma_latency  hwrng        loop10   loop-control  ptp0    stderr    tty17  tty29  tty40  tty52  tty7       ttyS17  ttyS29  urandom  vcsa4  vhost-net    xdma0_events_2
cuse             i2c-0        loop11   mapper        pts     stdin     tty18  tty3   tty41  tty53  tty8       ttyS18  ttyS3   userio   vcsa5  vhost-vsock  xdma0_events_3
disk             i2c-1        loop12   mcelog        random  stdout    tty19  tty30  tty42  tty54  tty9       ttyS19  ttyS30  vcs      vcsa6  xdma0_c2h_0  xdma0_events_4
dma_heap         i2c-2        loop13   mei0          rfkill  tty       tty2   tty31  tty43  tty55  ttyprintk  ttyS2   ttyS31  vcs1     vcsu   xdma0_c2h_1  xdma0_events_5
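These two checks can also be scripted. The following is a minimal sketch, assuming the Xilinx PCI vendor ID 10ee and the xdma0_* node names shown in the listing above. The function names are hypothetical and are not part of the driver package.

```bash
#!/usr/bin/env bash
# Count the XDMA channel nodes of one direction (h2c or c2h) in a directory.
# dev_dir defaults to /dev; it is a parameter only so the function can be
# tried against a scratch directory.
count_xdma_channels() {
  local direction=$1 dev_dir=${2:-/dev}
  local count=0 node
  for node in "$dev_dir"/xdma0_"${direction}"_*; do
    if [ -e "$node" ]; then
      count=$(( count + 1 ))
    fi
  done
  echo "$count"
}

# Print a short report. Nothing touches the system until this is called.
check_xdma() {
  # lspci -d 10ee: lists only the devices with the Xilinx vendor ID
  lspci -d 10ee:
  echo "h2c channels: $(count_xdma_channels h2c)"
  echo "c2h channels: $(count_xdma_channels c2h)"
}
```

With the design described here, calling check_xdma should report two channels in each direction, matching the xDMA configuration.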

Now we are going to execute one of the tests included with the xDMA drivers. One of them is designed to test the xDMA using the memory-mapped interface. This test sends a binary file through the PCIe interface and then reads the data back, checking that the data read is the same as the data written. We need to change the test code slightly. The test writes data starting at address 0, but in our design the DDR memory starts at address 0x80000000, so we need to increase the address offset by 2147483648, which is 0x80000000 in decimal. To do that, we need to navigate to the tests folder.

~$ cd dma_ip_drivers/XDMA/linux-kernel/tests/

Then we modify the test file by adding the line addrOffset=$(($addrOffset+2147483648)) right after the addrOffset declaration.

# Write to all enabled h2cChannels in parallel
if [ $h2cChannels -gt 0 ]; then
  # Loop over four blocks of size $transferSize and write to them (in parallel where possible)
  for ((i=0; i<=3; i++))
  do
    addrOffset=$(($transferSize * $i))
    addrOffset=$(($addrOffset+2147483648))
    curChannel=$(($i % $h2cChannels))
    echo "Info: Writing to h2c channel $curChannel at address offset $addrOffset."
    $tool_path/dma_to_device -d /dev/xdma0_h2c_${curChannel} -f data/datafile${i}_4K.bin -s $transferSize -a $addrOffset -c $transferCount &
    # If all channels have active transactions we must wait for them to complete
    if [ $(($curChannel+1)) -eq $h2cChannels ]; then
      echo "Info: Wait for current transactions to complete."
      wait
    fi
  done
fi
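The same offset can also be written in hexadecimal, which bash arithmetic accepts directly and which is easier to compare with the address map. Here is a small sketch that reproduces the offsets computed in the loop. DDR_BASE and block_offset are names I introduce for illustration and do not exist in the test script.

```bash
#!/usr/bin/env bash
DDR_BASE=0x80000000   # DDR base address as seen from the xDMA in this design

# Offset of block i for a given transfer size, as computed in the test loop.
block_offset() {
  local transfer_size=$1 i=$2
  echo $(( DDR_BASE + transfer_size * i ))
}

# Print the four offsets used by the test for 1024-byte transfers
for i in 0 1 2 3; do
  offset=$(block_offset 1024 "$i")
  printf 'block %d -> %d (0x%X)\n' "$i" "$offset" "$offset"
done
```

The decimal values printed are the ones the test reports in its "address offset" messages when it is run with a transfer size of 1024.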

To execute the test we need to pass as arguments the transfer size, the number of transfers, and the number of H2C and C2H channels to use. The result of the test should look like the following.

~/dma_ip_drivers/XDMA/linux-kernel/tests$ sudo bash ./dma_memory_mapped_test.sh 1024 1 1 1
Info: Running PCIe DMA memory mapped write read test
      transfer size:  1024
      transfer count: 1
Info: Writing to h2c channel 0 at address offset 2147483648.
Info: Wait for current transactions to complete.
/dev/xdma0_h2c_0 ** Average BW = 1024, 12.706764
Info: Writing to h2c channel 0 at address offset 2147484672.
Info: Wait for current transactions to complete.
/dev/xdma0_h2c_0 ** Average BW = 1024, 20.844784
Info: Writing to h2c channel 0 at address offset 2147485696.
Info: Wait for current transactions to complete.
/dev/xdma0_h2c_0 ** Average BW = 1024, 56.387665
Info: Writing to h2c channel 0 at address offset 2147486720.
Info: Wait for current transactions to complete.
/dev/xdma0_h2c_0 ** Average BW = 1024, 57.424854
Info: Reading from c2h channel 0 at address offset 2147483648.
Info: Wait for the current transactions to complete.
/dev/xdma0_c2h_0 ** Average BW = 1024, 51.423695
Info: Reading from c2h channel 0 at address offset 2147484672.
Info: Wait for the current transactions to complete.
/dev/xdma0_c2h_0 ** Average BW = 1024, 52.019302
Info: Reading from c2h channel 0 at address offset 2147485696.
Info: Wait for the current transactions to complete.
/dev/xdma0_c2h_0 ** Average BW = 1024, 51.305176
Info: Reading from c2h channel 0 at address offset 2147486720.
Info: Wait for the current transactions to complete.
/dev/xdma0_c2h_0 ** Average BW = 1024, 49.214207
Info: Checking data integrity.
Info: Data check passed for address range 0 - 1024.
Info: Data check passed for address range 1024 - 2048.
Info: Data check passed for address range 2048 - 3072.
Info: Data check passed for address range 3072 - 4096.
Info: All PCIe DMA memory mapped tests passed.

If we take a look at the test code, we can see that the test uses the data stored in a file.

If everything is connected properly, the data will be accessible from both the edge computer and the MicroBlaze soft-core. We can check the data by opening the Memory window in Vitis and going to address 0x80000000.
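To try this with a file of our own, we can call the same dma_to_device and dma_from_device tools that the test script uses. The following is only a sketch under some assumptions: the tools are already built and located in ../tools (set TOOL_PATH otherwise), channel 0 is used in both directions, and the DDR base is 0x80000000 as in this design. The function name xdma_roundtrip is hypothetical.

```bash
#!/usr/bin/env bash
TOOL_PATH=${TOOL_PATH:-../tools}   # assumed location of the driver tools
DDR_BASE=$(( 0x80000000 ))         # DDR base address in this design

# Write a file to the start of the DDR, read it back and compare both copies.
# Returns 0 only if both transfers succeed and the data matches.
xdma_roundtrip() {
  local infile=$1 outfile=$2
  local size
  size=$(( $(wc -c < "$infile") ))   # file size in bytes
  "$TOOL_PATH/dma_to_device" -d /dev/xdma0_h2c_0 -f "$infile" \
      -s "$size" -a "$DDR_BASE" -c 1 || return 1
  "$TOOL_PATH/dma_from_device" -d /dev/xdma0_c2h_0 -f "$outfile" \
      -s "$size" -a "$DDR_BASE" -c 1 || return 1
  cmp -s "$infile" "$outfile"
}
```

For example, after sourcing the file from the tests folder, sudo bash -c 'source ./roundtrip.sh; xdma_roundtrip data/datafile0_4K.bin /tmp/readback.bin && echo OK' would exercise it, where roundtrip.sh is whatever name the sketch was saved under. While the transfer is in the DDR, the same bytes should be visible in the Vitis Memory window.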

At this point, we have the edge computer connected through the PCIe interface to a MicroBlaze soft-core that acts as a general-purpose coprocessor where we can execute different tasks. Edge computers are very interesting devices for applications where we need a computer very close to the process, but this means that one of their requirements is the lowest possible power consumption, which leads to fanless designs that are kept as simple as possible. Low power consumption also means, in some cases, lower performance, but by adding this kind of coprocessor we can optimize the device to execute a specific task.