FFT algorithm using an FPGA and XDMA.

In an earlier post on this blog we talked (a little) about the xDMA IP from Xilinx, and how to send and receive data over PCI Express using an FPGA. On that occasion, we used the Picozed board with the FMC Carrier Gen 2. This time the board is the Litefury from RHS Research. This board is the same as the ACORN CLE-215, and is based on the Artix-7 part XC7A100T. The board has an M.2 form factor, so it has 4 PCI Express lanes, driven by GTP transceivers that support up to 6.6 Gbps. The M.2 format makes this board an excellent choice to integrate into our working PC in order to test different algorithms without connecting external hardware. The board has a connector with the JTAG pins, but we can also debug our design using Xilinx Virtual Cable, accessible through the xDMA IP.

In this post, we will use this board to execute an FFT algorithm, and then we will compare its performance with the same algorithm executed on the PC. Using a Xilinx IP to perform this kind of algorithm has an advantage, since we do not need to develop it, but it also has some disadvantages, because we cannot optimize the interfaces to improve their performance.

Unlike the previous post, in this post we will use the xDMA drivers for Linux provided by Xilinx. These drivers make it easy to access the DMA channels of the DMA/Bridge Subsystem for PCI Express.

To install the drivers, first of all, we need to clone the repository.

~$ git clone https://github.com/Xilinx/dma_ip_drivers.git

Now, in the repository folder, we need to navigate to the /xdma folder.

~$ cd dma_ip_drivers/XDMA/linux-kernel/xdma/

In this folder, following the instructions in the readme.md file, we need to execute the make install command. In some cases, the output of this command will be the following.

~/dma_ip_drivers/XDMA/linux-kernel/xdma$ sudo make install
Makefile:10: XVC_FLAGS: .
make -C /lib/modules/5.11.0-27-generic/build M=/home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma modules
make[1]: Entering directory '/usr/src/linux-headers-5.11.0-27-generic'
/home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/Makefile:10: XVC_FLAGS: .
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/libxdma.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma_cdev.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/cdev_ctrl.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/cdev_events.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/cdev_sgdma.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/cdev_xvc.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/cdev_bypass.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma_mod.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma_thread.o
  LD [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma.o
/home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/Makefile:10: XVC_FLAGS: .
  MODPOST /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/Module.symvers
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma.mod.o
  LD [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma.ko
make[1]: Leaving directory '/usr/src/linux-headers-5.11.0-27-generic'
make -C /lib/modules/5.11.0-27-generic/build M=/home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma modules_install
make[1]: Entering directory '/usr/src/linux-headers-5.11.0-27-generic'
  INSTALL /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma.ko
At main.c:160:
- SSL error:02001002:system library:fopen:No such file or directory: ../crypto/bio/bss_file.c:69
- SSL error:2006D080:BIO routines:BIO_new_file:no such file: ../crypto/bio/bss_file.c:76
sign-file: certs/signing_key.pem: No such file or directory
  DEPMOD  5.11.0-27-generic
Warning: modules_install: missing 'System.map' file. Skipping depmod.
make[1]: Leaving directory '/usr/src/linux-headers-5.11.0-27-generic'

We can see that the command returns some errors regarding SSL, and also regarding the depmod command. First, let's fix the SSL error. This error appears because Linux, in order to improve its security, expects the modules that you install to be signed, and in this case the module is not. The make command tries to sign the module but cannot find the key to do it, so we need to generate a key that allows make to sign the module and install it. To do that, we must navigate to the certs/ folder and generate the signing_key.pem file using OpenSSL. Before generating the key, we will create a config file named x509.genkey in the certs/ folder.

~$ cd /lib/modules/$(uname -r)/build/certs
~$ sudo nano x509.genkey

The content of this file must be the following.

/lib/modules/5.11.0-36-generic/build/certs$ cat x509.genkey 
[ req ]
default_bits = 4096
distinguished_name = req_distinguished_name
prompt = no
string_mask = utf8only
x509_extensions = myexts

[ req_distinguished_name ]
CN = Modules

[ myexts ]
basicConstraints=critical,CA:FALSE
keyUsage=digitalSignature
subjectKeyIdentifier=hash
authorityKeyIdentifier=keyid

Once this file is created, we can generate the key files using this file as a config file.

/lib/modules/5.11.0-36-generic/build/certs$ sudo openssl req -new -nodes -utf8 -sha512 -days 36500 -batch -x509 -config x509.genkey -outform DER -out signing_key.x509 -keyout signing_key.pem
Generating a RSA private key
.................................++++
...................................................................................................++++
writing new private key to 'signing_key.pem'
-----

Once the command is executed, we can check that the files are created in the same folder.

/lib/modules/5.11.0-36-generic/build/certs$ ls
Kconfig  Makefile  signing_key.pem  signing_key.x509  x509.genkey

Once the key files are created, we can return to the xdma folder in the downloaded repository. The second issue that the make command reported was related to the depmod command. The modules_install step tries to run depmod but skips it, so we need to run it ourselves. We can execute depmod -a after make install, or we can add it to the Makefile. If you choose this second option, the content of the Makefile will be the following.

~/dma_ip_drivers/XDMA/linux-kernel/xdma$ cat Makefile 
SHELL = /bin/bash
ifneq ($(xvc_bar_num),)
	XVC_FLAGS += -D__XVC_BAR_NUM__=$(xvc_bar_num)
endif

ifneq ($(xvc_bar_offset),)
	XVC_FLAGS += -D__XVC_BAR_OFFSET__=$(xvc_bar_offset)
endif

$(warning XVC_FLAGS: $(XVC_FLAGS).)

topdir := $(shell cd $(src)/.. && pwd)

TARGET_MODULE:=xdma

EXTRA_CFLAGS := -I$(topdir)/include $(XVC_FLAGS)
#EXTRA_CFLAGS += -D__LIBXDMA_DEBUG__
#EXTRA_CFLAGS += -DINTERNAL_TESTING

ifneq ($(KERNELRELEASE),)
	$(TARGET_MODULE)-objs := libxdma.o xdma_cdev.o cdev_ctrl.o cdev_events.o cdev_sgdma.o cdev_xvc.o cdev_bypass.o xdma_mod.o xdma_thread.o
	obj-m := $(TARGET_MODULE).o
else
	BUILDSYSTEM_DIR:=/lib/modules/$(shell uname -r)/build
	PWD:=$(shell pwd)
all :
	$(MAKE) -C $(BUILDSYSTEM_DIR) M=$(PWD) modules

clean:
	$(MAKE) -C $(BUILDSYSTEM_DIR) M=$(PWD) clean
	@/bin/rm -f *.ko modules.order *.mod.c *.o *.o.ur-safe .*.o.cmd

install: all
	$(MAKE) -C $(BUILDSYSTEM_DIR) M=$(PWD) modules_install
	depmod -a # <--- THIS LINE CHANGES
endif

Once these changes are done, executing the make install command again will return the following output.

~/dma_ip_drivers/XDMA/linux-kernel/xdma$ sudo make install 
Makefile:10: XVC_FLAGS: .
make -C /lib/modules/5.11.0-36-generic/build M=/home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma modules
make[1]: Entering directory '/usr/src/linux-headers-5.11.0-36-generic'
/home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/Makefile:10: XVC_FLAGS: .
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/libxdma.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma_cdev.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/cdev_ctrl.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/cdev_events.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/cdev_sgdma.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/cdev_xvc.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/cdev_bypass.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma_mod.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma_thread.o
  LD [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma.o
/home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/Makefile:10: XVC_FLAGS: .
  MODPOST /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/Module.symvers
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma.mod.o
  LD [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma.ko
make[1]: Leaving directory '/usr/src/linux-headers-5.11.0-36-generic'
make -C /lib/modules/5.11.0-36-generic/build M=/home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma modules_install
make[1]: Entering directory '/usr/src/linux-headers-5.11.0-36-generic'
  INSTALL /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma.ko
  DEPMOD  5.11.0-36-generic
Warning: modules_install: missing 'System.map' file. Skipping depmod.
make[1]: Leaving directory '/usr/src/linux-headers-5.11.0-36-generic'
depmod -a

Now we must navigate to the tools folder and execute the make command again.

~/dma_ip_drivers/XDMA/linux-kernel/xdma$ cd ../tools/
pablo@mark1:~/dma_ip_drivers/XDMA/linux-kernel/tools$ make
cc -c -std=c99 -o reg_rw.o reg_rw.c -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D_LARGE_FILE_SOURCE
cc -o reg_rw reg_rw.o
cc -c -std=c99 -o dma_to_device.o dma_to_device.c -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D_LARGE_FILE_SOURCE
In file included from /usr/include/assert.h:35,
                 from dma_to_device.c:13:
/usr/include/features.h:187:3: warning: #warning "_BSD_SOURCE and _SVID_SOURCE are deprecated, use _DEFAULT_SOURCE" [-Wcpp]
  187 | # warning "_BSD_SOURCE and _SVID_SOURCE are deprecated, use _DEFAULT_SOURCE"
      |   ^~~~~~~
cc -lrt -o dma_to_device dma_to_device.o -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D_LARGE_FILE_SOURCE
cc -c -std=c99 -o dma_from_device.o dma_from_device.c -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D_LARGE_FILE_SOURCE
In file included from /usr/include/assert.h:35,
                 from dma_from_device.c:13:
/usr/include/features.h:187:3: warning: #warning "_BSD_SOURCE and _SVID_SOURCE are deprecated, use _DEFAULT_SOURCE" [-Wcpp]
  187 | # warning "_BSD_SOURCE and _SVID_SOURCE are deprecated, use _DEFAULT_SOURCE"
      |   ^~~~~~~
cc -lrt -o dma_from_device dma_from_device.o -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D_LARGE_FILE_SOURCE
cc -c -std=c99 -o performance.o performance.c -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D_LARGE_FILE_SOURCE
cc -o performance performance.o -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D_LARGE_FILE_SOURCE

This command returns some warnings, but we can safely ignore them. Finally, we need to load the driver with the modprobe command.

~/dma_ip_drivers/XDMA/linux-kernel/tools$ sudo modprobe xdma

Now, we must navigate to the tests folder and execute the script load_driver.sh with root privileges.

~/dma_ip_drivers/XDMA/linux-kernel/tests$ sudo su
root@mark1:/home/pablo/dma_ip_drivers/XDMA/linux-kernel/tests# source ./load_driver.sh 
xdma                   86016  0
Loading xdma driver...
The Kernel module installed correctly and the xmda devices were recognized.
 DONE

If we check the list of device nodes in /dev, we will see a group of xdma nodes created.

~/dma_ip_drivers/XDMA/linux-kernel/tools$ ls /dev
autofs           fuse       loop10        nvram     stderr  tty21  tty38  tty54      ttyS11  ttyS28   vcs2   vcsu6            xdma0_events_4
block            gpiochip0  loop11        port      stdin   tty22  tty39  tty55      ttyS12  ttyS29   vcs3   vfio             xdma0_events_5
bsg              hpet       loop2         ppp       stdout  tty23  tty4   tty56      ttyS13  ttyS3    vcs4   vga_arbiter      xdma0_events_6
btrfs-control    hugepages  loop3         psaux     tty     tty24  tty40  tty57      ttyS14  ttyS30   vcs5   vhci             xdma0_events_7
bus              hwrng      loop4         ptmx      tty0    tty25  tty41  tty58      ttyS15  ttyS31   vcs6   vhost-net        xdma0_events_8
char             i2c-0      loop5         ptp0      tty1    tty26  tty42  tty59      ttyS16  ttyS4    vcsa   vhost-vsock      xdma0_events_9
console          i2c-1      loop6         pts       tty10   tty27  tty43  tty6       ttyS17  ttyS5    vcsa1  xdma0_c2h_0      xdma0_h2c_0
core             i2c-2      loop7         random    tty11   tty28  tty44  tty60      ttyS18  ttyS6    vcsa2  xdma0_control    xdma0_user
cpu              i2c-3      loop8         rfkill    tty12   tty29  tty45  tty61      ttyS19  ttyS7    vcsa3  xdma0_events_0   xdma0_xvc
cpu_dma_latency  i2c-4      loop9         rtc       tty13   tty3   tty46  tty62      ttyS2   ttyS8    vcsa4  xdma0_events_1   zero
cuse             initctl    loop-control  rtc0      tty14   tty30  tty47  tty63      ttyS20  ttyS9    vcsa5  xdma0_events_10  zfs
disk             input      mapper        sda       tty15   tty31  tty48  tty7       ttyS21  udmabuf  vcsa6  xdma0_events_11
dma_heap         kmsg       mcelog        sda1      tty16   tty32  tty49  tty8       ttyS22  uhid     vcsu   xdma0_events_12
dri              kvm        mei0          sda2      tty17   tty33  tty5   tty9       ttyS23  uinput   vcsu1  xdma0_events_13
drm_dp_aux0      lightnvm   mem           sg0       tty18   tty34  tty50  ttyprintk  ttyS24  urandom  vcsu2  xdma0_events_14
ecryptfs         log        mqueue        shm       tty19   tty35  tty51  ttyS0      ttyS25  userio   vcsu3  xdma0_events_15
fd               loop0      net           snapshot  tty2    tty36  tty52  ttyS1      ttyS26  vcs      vcsu4  xdma0_events_2
full             loop1      null          snd       tty20   tty37  tty53  ttyS10     ttyS27  vcs1     vcsu5  xdma0_events_3
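
Looking for the xdma entries in the whole /dev listing by eye is tedious. As a minimal sketch, the same check can be done from Python. The function names list_xdma_nodes and channels are made up for this example, and the card prefix xdma0 is taken from the listing above.

```python
import glob
import os


def list_xdma_nodes(dev_dir="/dev", card="xdma0"):
    """Return the sorted names of the device nodes that belong to one XDMA card."""
    pattern = os.path.join(dev_dir, card + "_*")
    return sorted(os.path.basename(path) for path in glob.glob(pattern))


def channels(nodes, direction):
    """Pick the DMA channel nodes for one direction, 'h2c' or 'c2h'."""
    return [name for name in nodes if "_" + direction + "_" in name]


if __name__ == "__main__":
    nodes = list_xdma_nodes()
    print("h2c channels:", channels(nodes, "h2c"))
    print("c2h channels:", channels(nodes, "c2h"))
```

If both lists come back empty, the driver is most likely not loaded or the board was not detected.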

Once the device nodes are created, we can execute the driver tests. If we execute run_test.sh, in my case this is the output.

~/dma_ip_drivers/XDMA/linux-kernel/tests$ sudo bash ./run_test.sh 
Info: Number of enabled h2c channels = 1
Info: Number of enabled c2h channels = 1
Info: The PCIe DMA core is memory mapped.
./run_test.sh: line 68: ./dma_memory_mapped_test.sh: Permission denied
Info: All tests in run_tests.sh passed.

The DMA is detected correctly, but when run_test.sh calls the script dma_memory_mapped_test.sh, it returns the message Permission denied, even though run_test.sh is executed with sudo. This usually means that the script file does not have execute permission, which sudo does not change. To work around it, we can execute dma_memory_mapped_test.sh directly through bash, passing the corresponding arguments ourselves.

./dma_memory_mapped_test.sh $transferSize $transferCount $h2cChannels $c2hChannels
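
An alternative is to fix the permission itself. The sketch below assumes that the cause is a missing execute bit on the script, which I have not confirmed on this setup. The helper names is_executable and make_executable are hypothetical.

```python
import os
import stat


def is_executable(path):
    """True if any execute bit (user, group or other) is set on the file."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))


def make_executable(path):
    """Add the execute bit for every class that already has read permission."""
    mode = os.stat(path).st_mode
    mode |= (mode & 0o444) >> 2  # copy each read bit onto its execute bit
    os.chmod(path, mode)
```

Calling make_executable("dma_memory_mapped_test.sh") in the tests folder has the same effect as chmod +x for the users that can already read the file.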

The transfer size that run_test.sh configures is 1024 and the transfer count is 1. The number of h2c and c2h channels is configured in hardware. For the example project of the Litefury board, both are configured as 1. Calling the script with these arguments returns the following output.

~/dma_ip_drivers/XDMA/linux-kernel/tests$ sudo bash ./dma_memory_mapped_test.sh 1024 1 1 1
Info: Running PCIe DMA memory mapped write read test
      transfer size:  1024
      transfer count: 1
Info: Writing to h2c channel 0 at address offset 0.
Info: Wait for current transactions to complete.
/dev/xdma0_h2c_0 ** Average BW = 1024, 21.120369
Info: Writing to h2c channel 0 at address offset 1024.
Info: Wait for current transactions to complete.
/dev/xdma0_h2c_0 ** Average BW = 1024, 24.811611
Info: Writing to h2c channel 0 at address offset 2048.
Info: Wait for current transactions to complete.
/dev/xdma0_h2c_0 ** Average BW = 1024, 56.933170
Info: Writing to h2c channel 0 at address offset 3072.
Info: Wait for current transactions to complete.
/dev/xdma0_h2c_0 ** Average BW = 1024, 59.091698
Info: Reading from c2h channel 0 at address offset 0.
Info: Wait for the current transactions to complete.
/dev/xdma0_c2h_0 ** Average BW = 1024, 23.916292
Info: Reading from c2h channel 0 at address offset 1024.
Info: Wait for the current transactions to complete.
/dev/xdma0_c2h_0 ** Average BW = 1024, 50.665478
Info: Reading from c2h channel 0 at address offset 2048.
Info: Wait for the current transactions to complete.
/dev/xdma0_c2h_0 ** Average BW = 1024, 45.241673
Info: Reading from c2h channel 0 at address offset 3072.
Info: Wait for the current transactions to complete.
/dev/xdma0_c2h_0 ** Average BW = 1024, 50.490608
Info: Checking data integrity.
Info: Data check passed for address range 0 - 1024.
Info: Data check passed for address range 1024 - 2048.
Info: Data check passed for address range 2048 - 3072.
Info: Data check passed for address range 3072 - 4096.
Info: All PCIe DMA memory mapped tests passed.
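
The figures in this output vary a lot from one transfer to the next, so it helps to collect them per channel. This is a small sketch that parses the Average BW lines exactly as they appear above. The function name parse_bandwidth is made up, and I am not assuming any unit for the reported figure.

```python
import re

# Matches lines such as: /dev/xdma0_h2c_0 ** Average BW = 1024, 21.120369
BW_LINE = re.compile(r"^(/dev/\S+) \*\* Average BW = (\d+), ([0-9.]+)$")


def parse_bandwidth(log_text):
    """Group the figure reported on each 'Average BW' line by device node."""
    results = {}
    for line in log_text.splitlines():
        match = BW_LINE.match(line.strip())
        if match:
            device, _size, value = match.groups()
            results.setdefault(device, []).append(float(value))
    return results
```

Feeding it the captured output of dma_memory_mapped_test.sh returns one list of figures for /dev/xdma0_h2c_0 and another for /dev/xdma0_c2h_0, ready to be averaged or plotted.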

Once the drivers are installed and the hardware is detected correctly, we can develop an application that uses them. As I mentioned, for this project we will use the Litefury board, which uses the four PCI Express lanes of the M.2 connector. We have to create a new project and add the DMA/Bridge Subsystem for PCI Express IP. Since the goal of the project is to connect the PCIe channel to the xFFT IP, we need to change the DMA Interface option to AXI Stream. The Maximum Link Speed depends on the part you are using; in this case, the XC7A35T-2TFGG484 can achieve up to 5.0 GT/s. The AXI Clock Frequency will be set to the maximum, 250 MHz.
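
These two settings fit together. As a quick worked check, 5.0 GT/s links use 8b/10b encoding, so each lane carries 4 Gbit/s of data, and four lanes carry 16 Gbit/s. The 64-bit AXI interface mentioned later in this post, clocked at 250 MHz, also moves 16 Gbit/s. The sketch below only does this arithmetic; the function names are made up, and the result ignores packet headers and other protocol overhead.

```python
def pcie_8b10b_gbps(lanes, gt_per_s=5.0):
    """Data rate in Gbit/s of an 8b/10b-encoded PCIe link, before protocol overhead."""
    return lanes * gt_per_s * 8 / 10


def axi_stream_gbps(width_bits, clock_mhz):
    """Peak rate in Gbit/s of an AXI Stream bus that moves one word per clock."""
    return width_bits * clock_mhz / 1000


if __name__ == "__main__":
    print("PCIe x4 at 5.0 GT/s:", pcie_8b10b_gbps(4), "Gbit/s")
    print("AXI 64 bit at 250 MHz:", axi_stream_gbps(64, 250), "Gbit/s")
```

Both values are 16 Gbit/s, that is 2 GB/s, so neither side of the bridge is an obvious bottleneck for the other.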

In the DMA tab, we will configure the DMA channels. The xFFT IP needs two AXI Stream input channels, one for the configuration and one for the data, therefore the number of Host to Card channels has to be set to two. The configuration channel of the xFFT is only written from the host to the IP, so the number of Card to Host channels can be configured as one.

Once the DMA/Bridge Subsystem for PCI Express IP is configured, we can build the rest of the components in the block design. I also added an AXI GPIO to manage the four LEDs of the board.

The xFFT IP has to be configured for the minimum latency in the FFT computation. The configuration values I have selected are the following.

Transform length: 1024
Architecture: Pipelined, Streaming I/O, which is the fastest option.
Target clock frequency: 250 MHz (AXI clock)
Scaling options: Scaled. We will need to configure the scale factor for each FFT stage through the DMA config channel.
Output ordering: Bit/Digit Reversed Order. The output of the FFT will not be in natural order, but knowing the definition of the FFT it is easy to find the position of each harmonic. If we select Natural Order for this field, the latency of the xFFT almost doubles.

With this configuration, the latency of the xFFT IP is 8.604 us.
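
Since the output arrives in reversed order, the host has to reorder it before looking for a harmonic. The sketch below assumes plain bit reversal of the bin index, so that output position k holds the bin whose index is k with its bits reversed. That is an assumption for illustration, and the function names are made up; the exact ordering should be checked against the xFFT documentation for the selected architecture.

```python
def bit_reverse(index, bits):
    """Reverse the lowest 'bits' bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def to_natural_order(samples):
    """Reorder a bit-reversed FFT output so that element k is frequency bin k."""
    n = len(samples)
    bits = n.bit_length() - 1
    if n == 0 or (1 << bits) != n:
        raise ValueError("length must be a power of two")
    # Bit reversal is its own inverse, so bin k sits at position bit_reverse(k).
    return [samples[bit_reverse(k, bits)] for k in range(n)]
```

For a 1024-point transform the index has 10 bits, so for example bin 2 is found at output position 256.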

Regarding the constraints, the Litefury board has the PCIe lanes connected in a different order than the xDMA IP expects. As the xDMA IP brings its own constraints, we need to change the processing order of the xdc files so that ours are processed first. To do that, we will create two different xdc files, one with the PCIe constraints and the other with the generic constraints. For the file that contains the PCIe constraints, we need to change the processing order to Early. We can do that from the user interface or by executing the following Tcl command.

set_property PROCESSING_ORDER EARLY [get_files /home/pablo/Desktop/pedalos/litefury_early.xdc]

The content of both xdc files is shown below.

# Early constraints
###################
set_property PACKAGE_PIN J1 [get_ports pcie_rstn]
set_property IOSTANDARD LVCMOS33 [get_ports pcie_rstn]


set_property PACKAGE_PIN G1 [get_ports {pcie_clkreq[0]}]
set_property IOSTANDARD LVCMOS33 [get_ports {pcie_clkreq[0]}]

set_property PACKAGE_PIN F6 [get_ports {pcie_refclk_p[0]}]
set_property PACKAGE_PIN E6 [get_ports {pcie_refclk_n[0]}]

# PCIe lane 0
set_property LOC GTPE2_CHANNEL_X0Y6 [get_cells {accelerator_bd_i/xdma_0/inst/accelerator_bd_xdma_0_1_pcie2_to_pcie3_>
set_property PACKAGE_PIN A10 [get_ports {pcie_mgt_rxn[0]}]
set_property PACKAGE_PIN B10 [get_ports {pcie_mgt_rxp[0]}]
set_property PACKAGE_PIN A6 [get_ports {pcie_mgt_txn[0]}]
set_property PACKAGE_PIN B6 [get_ports {pcie_mgt_txp[0]}]

# PCIe lane 1
set_property LOC GTPE2_CHANNEL_X0Y4 [get_cells {accelerator_bd_i/xdma_0/inst/accelerator_bd_xdma_0_1_pcie2_to_pcie3_>
set_property PACKAGE_PIN A8 [get_ports {pcie_mgt_rxn[1]}]
set_property PACKAGE_PIN B8 [get_ports {pcie_mgt_rxp[1]}]
set_property PACKAGE_PIN A4 [get_ports {pcie_mgt_txn[1]}]
set_property PACKAGE_PIN B4 [get_ports {pcie_mgt_txp[1]}]

# PCIe lane 2
set_property LOC GTPE2_CHANNEL_X0Y5 [get_cells {accelerator_bd_i/xdma_0/inst/accelerator_bd_xdma_0_1_pcie2_to_pcie3_>
set_property PACKAGE_PIN C11 [get_ports {pcie_mgt_rxn[2]}]
set_property PACKAGE_PIN D11 [get_ports {pcie_mgt_rxp[2]}]
set_property PACKAGE_PIN C5 [get_ports {pcie_mgt_txn[2]}]
set_property PACKAGE_PIN D5 [get_ports {pcie_mgt_txp[2]}]

# PCIe lane 3
set_property LOC GTPE2_CHANNEL_X0Y7 [get_cells {accelerator_bd_i/xdma_0/inst/accelerator_bd_xdma_0_1_pcie2_to_pcie3_>
set_property PACKAGE_PIN C9 [get_ports {pcie_mgt_rxn[3]}]
set_property PACKAGE_PIN D9 [get_ports {pcie_mgt_rxp[3]}]
set_property PACKAGE_PIN C7 [get_ports {pcie_mgt_txn[3]}]
set_property PACKAGE_PIN D7 [get_ports {pcie_mgt_txp[3]}]

And the generic constraints.

# Normal constraints

set_property PACKAGE_PIN G3 [get_ports {leds[3]}]
set_property PACKAGE_PIN H3 [get_ports {leds[2]}]
set_property PACKAGE_PIN G4 [get_ports {leds[1]}]
set_property PACKAGE_PIN H4 [get_ports {leds[0]}]
set_property IOSTANDARD LVCMOS33 [get_ports {leds[3]}]
set_property IOSTANDARD LVCMOS33 [get_ports {leds[2]}]
set_property IOSTANDARD LVCMOS33 [get_ports {leds[1]}]
set_property IOSTANDARD LVCMOS33 [get_ports {leds[0]}]

At this point, we can generate the bitstream. Because the PCIe lanes are assigned in a different order than the xDMA IP expects, Vivado will report some critical warnings. We can ignore them, because we know their cause and it is under control.

In my case, the Litefury board is connected to a computer that I use for tests and that I access remotely. This computer already has the xDMA drivers installed, as well as Python 3.8 and Jupyter Notebook. To reach the HTTP port that Jupyter opens, we need to connect to the remote computer through SSH and forward port 8080 of the local machine to localhost:8080 on the remote one. The way to do that is the following.

~$ ssh -L 8080:localhost:8080 pablo@192.168.1.138

When the SSH connection is established, we can start Jupyter Notebook on the remote computer with sudo. Also, remember to set the port to the one that is forwarded.

~$ sudo jupyter-notebook --no-browser --port=8080 --allow-root

If we check the detected devices, we can see the DMA channels that we have configured in the xDMA IP. The channels xdma0_c2h_0 and xdma0_c2h_1 are dedicated to transferring data from card to host, and xdma0_h2c_0 and xdma0_h2c_1 are dedicated to transferring data from host to card.

~$ ls /dev
autofs           i2c-0     loop-control  shm       tty23  tty44  tty8       ttyS27   vcs6           xdma0_events_0
block            i2c-1     mapper        snapshot  tty24  tty45  tty9       ttyS28   vcsa           xdma0_events_1
bsg              i2c-2     mcelog        snd       tty25  tty46  ttyprintk  ttyS29   vcsa1          xdma0_events_10
btrfs-control    i2c-3     mei0          stderr    tty26  tty47  ttyS0      ttyS3    vcsa2          xdma0_events_11
bus              i2c-4     mem           stdin     tty27  tty48  ttyS1      ttyS30   vcsa3          xdma0_events_12
char             initctl   mqueue        stdout    tty28  tty49  ttyS10     ttyS31   vcsa4          xdma0_events_13
console          input     net           tty       tty29  tty5   ttyS11     ttyS4    vcsa5          xdma0_events_14
core             kmsg      null          tty0      tty3   tty50  ttyS12     ttyS5    vcsa6          xdma0_events_15
cpu              kvm       nvram         tty1      tty30  tty51  ttyS13     ttyS6    vcsu           xdma0_events_2
cpu_dma_latency  lightnvm  port          tty10     tty31  tty52  ttyS14     ttyS7    vcsu1          xdma0_events_3
cuse             log       ppp           tty11     tty32  tty53  ttyS15     ttyS8    vcsu2          xdma0_events_4
disk             loop0     psaux         tty12     tty33  tty54  ttyS16     ttyS9    vcsu3          xdma0_events_5
dma_heap         loop1     ptmx          tty13     tty34  tty55  ttyS17     udmabuf  vcsu4          xdma0_events_6
dri              loop10    ptp0          tty14     tty35  tty56  ttyS18     uhid     vcsu5          xdma0_events_7
drm_dp_aux0      loop11    pts           tty15     tty36  tty57  ttyS19     uinput   vcsu6          xdma0_events_8
ecryptfs         loop2     random        tty16     tty37  tty58  ttyS2      urandom  vfio           xdma0_events_9
fd               loop3     rfkill        tty17     tty38  tty59  ttyS20     userio   vga_arbiter    xdma0_h2c_0
full             loop4     rtc           tty18     tty39  tty6   ttyS21     vcs      vhci           xdma0_h2c_1
fuse             loop5     rtc0          tty19     tty4   tty60  ttyS22     vcs1     vhost-net      xdma0_user
gpiochip0        loop6     sda           tty2      tty40  tty61  ttyS23     vcs2     vhost-vsock    xdma0_xvc
hpet             loop7     sda1          tty20     tty41  tty62  ttyS24     vcs3     xdma0_c2h_0    zero
hugepages        loop8     sda2          tty21     tty42  tty63  ttyS25     vcs4     xdma0_c2h_1    zfs
hwrng            loop9     sg0           tty22     tty43  tty7   ttyS26     vcs5     xdma0_control

We can also see xdma0_user, which gives access to the Master AXI4-Lite interface, xdma0_control, which gives access to the Slave AXI4-Lite interface, and xdma0_xvc, which gives access to the Xilinx Virtual Cable.

Once inside Python, we need to open the different DMA channels with the os module.

import os

# Open devices for DMA operation
xdma_axis_rd_data = os.open('/dev/xdma0_c2h_0',os.O_RDONLY)
xdma_axis_wr_data = os.open('/dev/xdma0_h2c_0',os.O_WRONLY)

xdma_axis_wr_config = os.open('/dev/xdma0_h2c_1',os.O_WRONLY)

To write data to the DMA channels, we have to consider that the AXI interface has a width of 64 bits and the xFFT IP has an interface of 32 bits, so we need to convert each value into a 64-bit packet, with the data of interest in the low 32 bits. To make these conversions, we will import the struct module of Python and pack the data with the <Q format, which corresponds to a little-endian unsigned long long (8 bytes).

import struct

# Write DMA Config channel
config = struct.pack('<Q',11)
os.pwrite(xdma_axis_wr_config,config,0)

The configuration value that we send has bit 0 set in order to select the forward FFT, plus some additional bits that configure the scaling factor. These values will depend on the amplitude of the input signal and also on the data width selected.
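
To make the value 11 less magic, here is a minimal sketch that builds the word from its fields. The layout is an assumption for illustration only: bit 0 is the forward flag, and the scaling schedule follows as 2-bit entries starting at bit 1, lowest entry first. The helper names build_config_word and pack_config are made up, and the real field layout should be checked in the xFFT product guide.

```python
import struct


def build_config_word(forward, scale_schedule):
    """Build the config word under the ASSUMED layout described in the text above."""
    word = 1 if forward else 0  # bit 0: 1 selects the forward FFT
    for position, shift in enumerate(scale_schedule):
        if not 0 <= shift <= 3:
            raise ValueError("each scaling entry must fit in 2 bits")
        word |= shift << (1 + 2 * position)
    return word


def pack_config(word):
    """Pack the word as one little-endian 64-bit value for the config channel."""
    return struct.pack("<Q", word)
```

Under that assumption, build_config_word(True, [1, 1, 0, 0, 0]) gives 11, the value written above: forward FFT, with a scaling of 1 in the first two entries and no scaling in the rest.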

Once the xFFT IP is configured through its config channel, we can generate the signal using the numpy library and package it using the struct module.

import numpy as np

# Generate Data channel
nSamples = 1024
angle = np.linspace(0,2*np.pi,nSamples, endpoint=False)
sig = np.cos(2*angle)*100
sig_int = sig.astype(int)+100

# Write DMA data channel
data = struct.pack('<1024Q', *sig_int)
os.pwrite(xdma_axis_wr_data,data,0)
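
Note that the Q format only accepts non-negative values, which is why the signal above is shifted by 100 before packing. For signed or complex input, a minimal sketch of a packing helper could look like the one below. It assumes that the real part goes in bits 0 to 15 and the imaginary part in bits 16 to 31, mirroring the way the output is decoded later in this post. The name pack_samples is made up.

```python
import struct


def pack_samples(real, imag=None):
    """Pack signed 16-bit real/imag pairs into little-endian 64-bit words."""
    if imag is None:
        imag = [0] * len(real)
    if len(imag) != len(real):
        raise ValueError("real and imag must have the same length")
    words = []
    for re_val, im_val in zip(real, imag):
        if not (-32768 <= re_val <= 32767 and -32768 <= im_val <= 32767):
            raise ValueError("sample does not fit in 16 bits")
        # Masking with 0xFFFF yields the two's complement encoding of negatives.
        words.append(((im_val & 0xFFFF) << 16) | (re_val & 0xFFFF))
    return struct.pack("<%dQ" % len(words), *words)
```

With 1024 samples the result is 8192 bytes long, the same size as the buffer written with os.pwrite above.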

These write instructions are executed immediately, and as long as the device exists, the data will be written. The read operation blocks until the xFFT has its output ready, so we can issue the read instruction in Python and it will complete as soon as the data is available. Once the data is read, it has to be unpacked and separated into its real and imaginary parts.

# Read DMA data channel
fft_data = os.pread(xdma_axis_rd_data,8192,0)
data_unpack = struct.unpack('<1024Q',fft_data)

# Decode real and imaginary parts.
data_real = []
data_imag = []
for i in data_unpack:
    real_2scompl = _2sComplement(i&0xFFFF,16) / 512
    imag_2scompl = _2sComplement((i>>16) & 0xFFFF,16) / 512
    data_real.append(real_2scompl)
    data_imag.append(imag_2scompl)

If we want to perform the FFT on the computer instead of on the FPGA, the instruction is the following.

from scipy import fft

# FFT using PC
fft_data_pc = fft.fft(sig)

Now we have two different methods to perform an FFT. To test the performance of both methods, I have written some code that performs 1024 FFTs using the fft function of the scipy package, and the same number using the FPGA connected through PCIe. The result was that the scipy function is 4 times faster than the execution on the FPGA. Does this make sense? Of course! To perform an FFT on the computer, the data is already there, so no data moves between devices. Also, the computer uses its Floating Point Unit (FPU) to compute the FFT, so no data conversion is needed in the process.

In the case of the PCIe-connected FPGA, for each FFT we first need to package the data in a format that the xFFT IP can handle. Then the data has to be sent to the FPGA through PCIe, which is a high-speed interface, but we need to consider that Python runs on an operating system that will delay the transfer depending on the number of tasks pending execution. When the xFFT IP has computed the result, we need to receive the data and undo the transformations. All of these steps reduce the performance of the FPGA path.
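
The code I used for the measurement is not listed here, but a comparison of this kind can be sketched with a small timing helper. The name time_runs is made up, and the workload in the example is only a stand-in: in the real test, the callable would be either the scipy call or a function that packs, writes, reads and unpacks one FFT through the DMA channels.

```python
import time


def time_runs(func, runs=1024):
    """Call func 'runs' times and return the average time per call in seconds."""
    start = time.perf_counter()
    for _ in range(runs):
        func()
    return (time.perf_counter() - start) / runs


if __name__ == "__main__":
    # Stand-in workload, used only to show how the helper is called.
    average = time_runs(lambda: sum(range(1000)))
    print("average time per call: %.3f us" % (average * 1e6))
```

Timing the complete round trip, including packing and unpacking, is what makes the comparison fair, since those steps are part of the cost of using the FPGA from Python.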

To benefit from connecting an FPGA to a computer through PCIe, and in most cases the benefits are many, we first need to consider what kind of algorithm we want to execute. To obtain the highest performance, the algorithm has to run mostly on the FPGA, so that the transactions that involve the computer and, above all, the operating system are as few as possible. Algorithms that match this requirement are iterative algorithms where we need to find a value, or a combination of values, from a reduced amount of data. One example is neural network training, where the training set is sent to the FPGA and the whole iterative process of training the neurons is done on the FPGA. Another application could be cryptocurrency mining, where, in the case of Bitcoin or Ethereum, a block is sent and the FPGA has to compute the hash of this block many times, with a small change each time, until it obtains a valid hash.

Besides the algorithm itself, the FPGA design has to be optimized to accept PCIe transactions, in order to reduce the changes in the format of the data. In this example, we have used the xFFT IP, which is not optimized to handle 64-bit data. In the world of hardware acceleration, the design in the accelerator is known as the kernel. In this case, the xFFT IP is the kernel. In the next post of the PCIe series, I will develop a customized kernel in order to extract all the potential and design a real hardware accelerator.
