In a previous post we talked (a little) about the xDMA IP from Xilinx and how to send and receive data over PCIe using an FPGA. On that occasion we used the PicoZed board with the FMC Carrier Gen 2. This time the board used is the LiteFury from RHS Research. This board is the same as the ACORN CLE-215 and is based on the Artix-7 part XC7A100T. The board has an M.2 form factor, so it provides four PCI Express lanes with a speed of up to 6.6 Gbps (GTP transceivers). The M.2 form factor makes this board an excellent choice to integrate into our work PC in order to test different algorithms without having to connect external hardware. The board has a connector with the JTAG pins, but we can also debug our design using the Xilinx Virtual Cable, accessible through the xDMA IP. In this post we will use this board to execute an FFT algorithm, and then we will compare its performance with the same algorithm executed on the PC. Using a Xilinx IP for this kind of algorithm has advantages, since we do not need to develop it ourselves, but it also has some disadvantages, because we cannot optimize its interfaces to improve performance.

Unlike the previous post, this time we will use the xDMA drivers for Linux provided by Xilinx. These drivers make it easy to access the DMA channels of the DMA/Bridge Subsystem for PCI Express.
In order to install the drivers, first of all we need to clone the repository.
pablo@mark1:~$ git clone https://github.com/Xilinx/dma_ip_drivers.git
Now, inside the repository folder, we need to navigate to the xdma folder.
pablo@mark1:~$ cd dma_ip_drivers/XDMA/linux-kernel/xdma/
In this folder, according to the instructions in the readme.md file, we need to execute the make install command. In some cases, the output of this command will be the following:
pablo@mark1:~/dma_ip_drivers/XDMA/linux-kernel/xdma$ sudo make install
Makefile:10: XVC_FLAGS: .
make -C /lib/modules/5.11.0-27-generic/build M=/home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma modules
make[1]: Entering directory '/usr/src/linux-headers-5.11.0-27-generic'
/home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/Makefile:10: XVC_FLAGS: .
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/libxdma.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma_cdev.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/cdev_ctrl.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/cdev_events.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/cdev_sgdma.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/cdev_xvc.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/cdev_bypass.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma_mod.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma_thread.o
  LD [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma.o
/home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/Makefile:10: XVC_FLAGS: .
  MODPOST /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/Module.symvers
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma.mod.o
  LD [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma.ko
make[1]: Leaving directory '/usr/src/linux-headers-5.11.0-27-generic'
make -C /lib/modules/5.11.0-27-generic/build M=/home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma modules_install
make[1]: Entering directory '/usr/src/linux-headers-5.11.0-27-generic'
  INSTALL /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma.ko
At main.c:160:
- SSL error:02001002:system library:fopen:No such file or directory: ../crypto/bio/bss_file.c:69
- SSL error:2006D080:BIO routines:BIO_new_file:no such file: ../crypto/bio/bss_file.c:76
sign-file: certs/signing_key.pem: No such file or directory
  DEPMOD 5.11.0-27-generic
Warning: modules_install: missing 'System.map' file. Skipping depmod.
make[1]: Leaving directory '/usr/src/linux-headers-5.11.0-27-generic'
We can see that the command returns some errors regarding SSL, and also regarding the depmod command. First, let's fix the SSL error. This error appears because Linux, in order to improve its security, requires that every module you install be signed, and these modules are not signed. The make command tries to sign the module but cannot find the key to do it, so we need to generate a key that allows the make command to sign the module and install it. To do that, we must navigate to the certs/ folder and generate the signing_key.pem file using openssl. Before generating the key, we will create a config file named x509.genkey in the certs/ folder.
pablo@mark1:~$ cd /lib/modules/$(uname -r)/build/certs
pablo@mark1:~$ sudo nano x509.genkey
The content of this file must be the following.
pablo@mark1:/lib/modules/5.11.0-36-generic/build/certs$ cat x509.genkey
[ req ]
default_bits = 4096
distinguished_name = req_distinguished_name
prompt = no
string_mask = utf8only
x509_extensions = myexts

[ req_distinguished_name ]
CN = Modules

[ myexts ]
basicConstraints=critical,CA:FALSE
keyUsage=digitalSignature
subjectKeyIdentifier=hash
authorityKeyIdentifier=keyid
Once this file is created, we can generate the key files using it as the config file.
pablo@mark1:/lib/modules/5.11.0-36-generic/build/certs$ sudo openssl req -new -nodes -utf8 -sha512 -days 36500 -batch -x509 -config x509.genkey -outform DER -out signing_key.x509 -keyout signing_key.pem
Generating a RSA private key
.................................++++
...................................................................................................++++
writing new private key to 'signing_key.pem'
-----
Once the command has finished, we can check that the files have been created in the same folder.
pablo@mark1:/lib/modules/5.11.0-36-generic/build/certs$ ls
Kconfig  Makefile  signing_key.pem  signing_key.x509  x509.genkey
When the key files are created, we can return to the xdma folder of the downloaded repository. The second issue that the make command reported concerned the depmod command. This command is included in the make install target, but it can also be executed separately. We can execute depmod -a after make install, or we can add it to the Makefile. If you choose the second option, the content of the Makefile will be the following (the change is the depmod -a line added to the install target).
pablo@mark1:~/dma_ip_drivers/XDMA/linux-kernel/xdma$ cat Makefile
SHELL = /bin/bash

ifneq ($(xvc_bar_num),)
	XVC_FLAGS += -D__XVC_BAR_NUM__=$(xvc_bar_num)
endif

ifneq ($(xvc_bar_offset),)
	XVC_FLAGS += -D__XVC_BAR_OFFSET__=$(xvc_bar_offset)
endif

$(warning XVC_FLAGS: $(XVC_FLAGS).)

topdir := $(shell cd $(src)/.. && pwd)

TARGET_MODULE:=xdma

EXTRA_CFLAGS := -I$(topdir)/include $(XVC_FLAGS)
#EXTRA_CFLAGS += -D__LIBXDMA_DEBUG__
#EXTRA_CFLAGS += -DINTERNAL_TESTING

ifneq ($(KERNELRELEASE),)
	$(TARGET_MODULE)-objs := libxdma.o xdma_cdev.o cdev_ctrl.o cdev_events.o cdev_sgdma.o cdev_xvc.o cdev_bypass.o xdma_mod.o xdma_thread.o
	obj-m := $(TARGET_MODULE).o
else
	BUILDSYSTEM_DIR:=/lib/modules/$(shell uname -r)/build
	PWD:=$(shell pwd)

all :
	$(MAKE) -C $(BUILDSYSTEM_DIR) M=$(PWD) modules

clean:
	$(MAKE) -C $(BUILDSYSTEM_DIR) M=$(PWD) clean
	@/bin/rm -f *.ko modules.order *.mod.c *.o *.o.ur-safe .*.o.cmd

install: all
	$(MAKE) -C $(BUILDSYSTEM_DIR) M=$(PWD) modules_install
	depmod -a

endif
Once these changes are done, executing the make install command again will return the following output.
pablo@mark1:~/dma_ip_drivers/XDMA/linux-kernel/xdma$ sudo make install
Makefile:10: XVC_FLAGS: .
make -C /lib/modules/5.11.0-36-generic/build M=/home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma modules
make[1]: Entering directory '/usr/src/linux-headers-5.11.0-36-generic'
/home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/Makefile:10: XVC_FLAGS: .
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/libxdma.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma_cdev.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/cdev_ctrl.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/cdev_events.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/cdev_sgdma.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/cdev_xvc.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/cdev_bypass.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma_mod.o
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma_thread.o
  LD [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma.o
/home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/Makefile:10: XVC_FLAGS: .
  MODPOST /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/Module.symvers
  CC [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma.mod.o
  LD [M]  /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma.ko
make[1]: Leaving directory '/usr/src/linux-headers-5.11.0-36-generic'
make -C /lib/modules/5.11.0-36-generic/build M=/home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma modules_install
make[1]: Entering directory '/usr/src/linux-headers-5.11.0-36-generic'
  INSTALL /home/pablo/dma_ip_drivers/XDMA/linux-kernel/xdma/xdma.ko
  DEPMOD 5.11.0-36-generic
Warning: modules_install: missing 'System.map' file. Skipping depmod.
make[1]: Leaving directory '/usr/src/linux-headers-5.11.0-36-generic'
depmod -a
Now we must navigate to the tools folder and execute the make command again.
pablo@mark1:~/dma_ip_drivers/XDMA/linux-kernel/xdma$ cd ../tools/
pablo@mark1:~/dma_ip_drivers/XDMA/linux-kernel/tools$ make
cc -c -std=c99 -o reg_rw.o reg_rw.c -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D_LARGE_FILE_SOURCE
cc -o reg_rw reg_rw.o
cc -c -std=c99 -o dma_to_device.o dma_to_device.c -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D_LARGE_FILE_SOURCE
In file included from /usr/include/assert.h:35,
                 from dma_to_device.c:13:
/usr/include/features.h:187:3: warning: #warning "_BSD_SOURCE and _SVID_SOURCE are deprecated, use _DEFAULT_SOURCE" [-Wcpp]
  187 | # warning "_BSD_SOURCE and _SVID_SOURCE are deprecated, use _DEFAULT_SOURCE"
      |   ^~~~~~~
cc -lrt -o dma_to_device dma_to_device.o -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D_LARGE_FILE_SOURCE
cc -c -std=c99 -o dma_from_device.o dma_from_device.c -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D_LARGE_FILE_SOURCE
In file included from /usr/include/assert.h:35,
                 from dma_from_device.c:13:
/usr/include/features.h:187:3: warning: #warning "_BSD_SOURCE and _SVID_SOURCE are deprecated, use _DEFAULT_SOURCE" [-Wcpp]
  187 | # warning "_BSD_SOURCE and _SVID_SOURCE are deprecated, use _DEFAULT_SOURCE"
      |   ^~~~~~~
cc -lrt -o dma_from_device dma_from_device.o -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D_LARGE_FILE_SOURCE
cc -c -std=c99 -o performance.o performance.c -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D_LARGE_FILE_SOURCE
cc -o performance performance.o -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D_LARGE_FILE_SOURCE
This command returns some warnings, but we can continue without problems. Finally, we need to load the driver with the modprobe command.
pablo@mark1:~/dma_ip_drivers/XDMA/linux-kernel/tools$ sudo modprobe xdma
Now we must navigate to the tests folder and execute the script ./load_driver.sh with sudo privileges.
pablo@mark1:~/dma_ip_drivers/XDMA/linux-kernel/tests$ sudo su
root@mark1:/home/pablo/dma_ip_drivers/XDMA/linux-kernel/tests# source ./load_driver.sh
xdma                   86016  0
Loading xdma driver...
The Kernel module installed correctly and the xmda devices were recognized.
DONE
If we check the list of device nodes in /dev, we will see a group of xdma nodes created.
pablo@mark1:~/dma_ip_drivers/XDMA/linux-kernel/tools$ ls /dev
autofs           fuse       loop10        nvram     stderr  tty21  tty38  tty54      ttyS11     ttyS28   vcs2   vcsu6          xdma0_events_4
block            gpiochip0  loop11        port      stdin   tty22  tty39  tty55      ttyS12     ttyS29   vcs3   vfio           xdma0_events_5
bsg              hpet       loop2         ppp       stdout  tty23  tty4   tty56      ttyS13     ttyS3    vcs4   vga_arbiter    xdma0_events_6
btrfs-control    hugepages  loop3         psaux     tty     tty24  tty40  tty57      ttyS14     ttyS30   vcs5   vhci           xdma0_events_7
bus              hwrng      loop4         ptmx      tty0    tty25  tty41  tty58      ttyS15     ttyS31   vcs6   vhost-net      xdma0_events_8
char             i2c-0      loop5         ptp0      tty1    tty26  tty42  tty59      ttyS16     ttyS4    vcsa   vhost-vsock    xdma0_events_9
console          i2c-1      loop6         pts       tty10   tty27  tty43  tty6       ttyS17     ttyS5    vcsa1  xdma0_c2h_0    xdma0_h2c_0
core             i2c-2      loop7         random    tty11   tty28  tty44  tty60      ttyS18     ttyS6    vcsa2  xdma0_control  xdma0_user
cpu              i2c-3      loop8         rfkill    tty12   tty29  tty45  tty61      ttyS19     ttyS7    vcsa3  xdma0_events_0 xdma0_xvc
cpu_dma_latency  i2c-4      loop9         rtc       tty13   tty3   tty46  tty62      ttyS2      ttyS8    vcsa4  xdma0_events_1 zero
cuse             initctl    loop-control  rtc0      tty14   tty30  tty47  tty63      ttyS20     ttyS9    vcsa5  xdma0_events_10 zfs
disk             input      mapper        sda       tty15   tty31  tty48  tty7       ttyS21     udmabuf  vcsa6  xdma0_events_11
dma_heap         kmsg       mcelog        sda1      tty16   tty32  tty49  tty8       ttyS22     uhid     vcsu   xdma0_events_12
dri              kvm        mei0          sda2      tty17   tty33  tty5   tty9       ttyS23     uinput   vcsu1  xdma0_events_13
drm_dp_aux0      lightnvm   mem           sg0       tty18   tty34  tty50  ttyprintk  ttyS24     urandom  vcsu2  xdma0_events_14
ecryptfs         log        mqueue        shm       tty19   tty35  tty51  ttyS0      ttyS25     userio   vcsu3  xdma0_events_15
fd               loop0      net           snapshot  tty2    tty36  tty52  ttyS1      ttyS26     vcs      vcsu4  xdma0_events_2
full             loop1      null          snd       tty20   tty37  tty53  ttyS10     ttyS27     vcs1     vcsu5  xdma0_events_3
Once the device nodes are created, we can execute the driver tests. If we execute ./run_test.sh, in my case this is the output.
pablo@mark1:~/dma_ip_drivers/XDMA/linux-kernel/tests$ sudo bash ./run_test.sh
Info: Number of enabled h2c channels = 1
Info: Number of enabled c2h channels = 1
Info: The PCIe DMA core is memory mapped.
./run_test.sh: line 68: ./dma_memory_mapped_test.sh: Permission denied
Info: All tests in run_tests.sh passed.
The DMA is detected correctly, but when the script executes dma_memory_mapped_test.sh, even though it is run with sudo, it returns the message Permission denied. To execute the dma_memory_mapped_test.sh script correctly, we can run it directly, passing the corresponding arguments.
./dma_memory_mapped_test.sh $transferSize $transferCount $h2cChannels $c2hChannels
The transfer size that run_test.sh configures is 1024 and the transfer count is 1. The numbers of h2c and c2h channels are configured in hardware; in the example project for the LiteFury board, both are configured as 1. The call to the script with the corresponding arguments returns the following output.
pablo@mark1:~/dma_ip_drivers/XDMA/linux-kernel/tests$ sudo bash ./dma_memory_mapped_test.sh 1024 1 1 1
Info: Running PCIe DMA memory mapped write read test
      transfer size:  1024
      transfer count: 1
Info: Writing to h2c channel 0 at address offset 0.
Info: Wait for current transactions to complete.
/dev/xdma0_h2c_0 ** Average BW = 1024, 21.120369
Info: Writing to h2c channel 0 at address offset 1024.
Info: Wait for current transactions to complete.
/dev/xdma0_h2c_0 ** Average BW = 1024, 24.811611
Info: Writing to h2c channel 0 at address offset 2048.
Info: Wait for current transactions to complete.
/dev/xdma0_h2c_0 ** Average BW = 1024, 56.933170
Info: Writing to h2c channel 0 at address offset 3072.
Info: Wait for current transactions to complete.
/dev/xdma0_h2c_0 ** Average BW = 1024, 59.091698
Info: Reading from c2h channel 0 at address offset 0.
Info: Wait for the current transactions to complete.
/dev/xdma0_c2h_0 ** Average BW = 1024, 23.916292
Info: Reading from c2h channel 0 at address offset 1024.
Info: Wait for the current transactions to complete.
/dev/xdma0_c2h_0 ** Average BW = 1024, 50.665478
Info: Reading from c2h channel 0 at address offset 2048.
Info: Wait for the current transactions to complete.
/dev/xdma0_c2h_0 ** Average BW = 1024, 45.241673
Info: Reading from c2h channel 0 at address offset 3072.
Info: Wait for the current transactions to complete.
/dev/xdma0_c2h_0 ** Average BW = 1024, 50.490608
Info: Checking data integrity.
Info: Data check passed for address range 0 - 1024.
Info: Data check passed for address range 1024 - 2048.
Info: Data check passed for address range 2048 - 3072.
Info: Data check passed for address range 3072 - 4096.
Info: All PCIe DMA memory mapped tests passed.
When the drivers are installed and the hardware is detected correctly, we can develop an application that uses them. As I mentioned, for this project we will use the LiteFury board, which uses the four PCIe lanes of the M.2 connector. We have to create a new project and add the DMA/Bridge Subsystem for PCI Express IP. Since the goal of the project is to connect the PCIe channel to the xFFT IP, we need to change the DMA Interface option to AXI Stream. The Maximum Link Speed will depend on the part you are using; in this case, for the XC7A100T-2FGG484, it can achieve up to 5.0 GT/s. The AXI Clock Frequency will be set to the maximum, 250 MHz.

In the DMA tab, we will configure the DMA channels. The xFFT IP needs two AXI-Stream channels, one for the configuration and one for the data, therefore the number of Host to Card channels has to be set to two. The configuration channel of the xFFT is only written from host to IP, so the number of Card to Host channels can be configured as one.

Once the DMA/Bridge Subsystem for PCI Express IP is configured, we can build the rest of the components in the block design. I also added an AXI GPIO to manage the four LEDs of the board.

The xFFT IP has to be configured to achieve the minimum latency in the FFT computation. The configuration values I have selected are the following.
- Transform length: 1024
- Architecture: Pipelined, Streaming I/O, which is the fastest option.
- Target clock frequency: 250 MHz (AXI clock)
- Scaling options: Scaled. We will need to configure the scale factor for each FFT stage through the DMA config channel.
- Output ordering: Bit/Digit Reversed Order. The output of the FFT will arrive out of order, but it is easy to find the corresponding harmonic knowing the definition of the FFT; a reordering sketch is shown below. If we select the Natural Order setting for this field, the latency of the xFFT almost doubles.
With this configuration, the latency of the xFFT IP is 8.604 µs.
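If we keep the bit/digit reversed output, we can reorder it on the host. The following is a minimal sketch assuming a plain bit-reversed permutation; for the pipelined streaming architecture the exact permutation can be a digit reversal depending on the configured radix, so check the FFT product guide (PG109) before relying on it.

import numpy as np

# Build the bit-reversed index table for an n-point transform (n a power of 2)
def bit_reverse_indices(n):
    bits = n.bit_length() - 1
    return [int(bin(i)[2:].zfill(bits)[::-1], 2) for i in range(n)]

# Reorder a bit-reversed FFT output into natural order
def to_natural_order(samples):
    idx = bit_reverse_indices(len(samples))
    return np.asarray(samples)[idx]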
Regarding the constraints, the LiteFury board connects the PCIe lanes in a different order than the xDMA IP expects. As the xDMA IP has its own constraints, we need to change the processing order of the XDC files. To do that, we will generate two different XDC files, one with the PCIe constraints and the other with the generic constraints. For the file that contains the PCIe constraints, we need to change the processing order to Early. We can do that from the user interface, or by executing the following Tcl command.
set_property PROCESSING_ORDER EARLY [get_files /home/pablo/Desktop/pedalos/litefury_early.xdc]
The content of both XDC files is shown below.
# Early constraints
###################

set_property PACKAGE_PIN J1 [get_ports pcie_rstn]
set_property IOSTANDARD LVCMOS33 [get_ports pcie_rstn]
set_property PACKAGE_PIN G1 [get_ports {pcie_clkreq[0]}]
set_property IOSTANDARD LVCMOS33 [get_ports {pcie_clkreq[0]}]
set_property PACKAGE_PIN F6 [get_ports {pcie_refclk_p[0]}]
set_property PACKAGE_PIN E6 [get_ports {pcie_refclk_n[0]}]

# PCIe lane 0
set_property LOC GTPE2_CHANNEL_X0Y6 [get_cells {accelerator_bd_i/xdma_0/inst/accelerator_bd_xdma_0_1_pcie2_to_pcie3_>
set_property PACKAGE_PIN A10 [get_ports {pcie_mgt_rxn[0]}]
set_property PACKAGE_PIN B10 [get_ports {pcie_mgt_rxp[0]}]
set_property PACKAGE_PIN A6 [get_ports {pcie_mgt_txn[0]}]
set_property PACKAGE_PIN B6 [get_ports {pcie_mgt_txp[0]}]

# PCIe lane 1
set_property LOC GTPE2_CHANNEL_X0Y4 [get_cells {accelerator_bd_i/xdma_0/inst/accelerator_bd_xdma_0_1_pcie2_to_pcie3_>
set_property PACKAGE_PIN A8 [get_ports {pcie_mgt_rxn[1]}]
set_property PACKAGE_PIN B8 [get_ports {pcie_mgt_rxp[1]}]
set_property PACKAGE_PIN A4 [get_ports {pcie_mgt_txn[1]}]
set_property PACKAGE_PIN B4 [get_ports {pcie_mgt_txp[1]}]

# PCIe lane 2
set_property LOC GTPE2_CHANNEL_X0Y5 [get_cells {accelerator_bd_i/xdma_0/inst/accelerator_bd_xdma_0_1_pcie2_to_pcie3_>
set_property PACKAGE_PIN C11 [get_ports {pcie_mgt_rxn[2]}]
set_property PACKAGE_PIN D11 [get_ports {pcie_mgt_rxp[2]}]
set_property PACKAGE_PIN C5 [get_ports {pcie_mgt_txn[2]}]
set_property PACKAGE_PIN D5 [get_ports {pcie_mgt_txp[2]}]

# PCIe lane 3
set_property LOC GTPE2_CHANNEL_X0Y7 [get_cells {accelerator_bd_i/xdma_0/inst/accelerator_bd_xdma_0_1_pcie2_to_pcie3_>
set_property PACKAGE_PIN C9 [get_ports {pcie_mgt_rxn[3]}]
set_property PACKAGE_PIN D9 [get_ports {pcie_mgt_rxp[3]}]
set_property PACKAGE_PIN C7 [get_ports {pcie_mgt_txn[3]}]
set_property PACKAGE_PIN D7 [get_ports {pcie_mgt_txp[3]}]
And the generic constraints.
# Normal constraints
set_property PACKAGE_PIN G3 [get_ports {leds[3]}]
set_property PACKAGE_PIN H3 [get_ports {leds[2]}]
set_property PACKAGE_PIN G4 [get_ports {leds[1]}]
set_property PACKAGE_PIN H4 [get_ports {leds[0]}]
set_property IOSTANDARD LVCMOS33 [get_ports {leds[3]}]
set_property IOSTANDARD LVCMOS33 [get_ports {leds[2]}]
set_property IOSTANDARD LVCMOS33 [get_ports {leds[1]}]
set_property IOSTANDARD LVCMOS33 [get_ports {leds[0]}]
At this point we can generate the bitstream. Configuring the PCIe lanes in a different order than the xDMA IP expects causes Vivado to report some critical warnings. We do not have to worry about them, because their cause is under control.
In my case, the LiteFury board is connected to a computer that I use for tests and to which I connect remotely. This computer already has the xDMA drivers installed, as well as Python 3.8 and Jupyter Notebook. In order to redirect the HTTP port that Jupyter will create, we need to access the remote computer through SSH and forward localhost:8080 of the remote machine to port 8080 of the host. The way to do that is the following.
pablo@friday:~$ ssh -L 8080:localhost:8080 pablo@192.168.1.138
When the SSH connection is established, we can open the Jupyter notebook on the remote computer as sudo. Also, remember to change the port to the one redirected on the host.
pablo@mark1:~$ sudo jupyter-notebook --no-browser --port=8080 --allow-root
If we check the detected devices, we can see the DMA channels that we configured in the xDMA IP. The channels xdma0_c2h_0 and xdma0_c2h_1 are dedicated to transferring data from card to host, and xdma0_h2c_0 and xdma0_h2c_1 are dedicated to transferring data from host to card.
pablo@mark1:~$ ls /dev
autofs           i2c-0     loop-control  shm       tty23  tty44  tty8       ttyS27  vcs6         xdma0_events_0
block            i2c-1     mapper        snapshot  tty24  tty45  tty9       ttyS28  vcsa         xdma0_events_1
bsg              i2c-2     mcelog        snd       tty25  tty46  ttyprintk  ttyS29  vcsa1        xdma0_events_10
btrfs-control    i2c-3     mei0          stderr    tty26  tty47  ttyS0      ttyS3   vcsa2        xdma0_events_11
bus              i2c-4     mem           stdin     tty27  tty48  ttyS1      ttyS30  vcsa3        xdma0_events_12
char             initctl   mqueue        stdout    tty28  tty49  ttyS10     ttyS31  vcsa4        xdma0_events_13
console          input     net           tty       tty29  tty5   ttyS11     ttyS4   vcsa5        xdma0_events_14
core             kmsg      null          tty0      tty3   tty50  ttyS12     ttyS5   vcsa6        xdma0_events_15
cpu              kvm       nvram         tty1      tty30  tty51  ttyS13     ttyS6   vcsu         xdma0_events_2
cpu_dma_latency  lightnvm  port          tty10     tty31  tty52  ttyS14     ttyS7   vcsu1        xdma0_events_3
cuse             log       ppp           tty11     tty32  tty53  ttyS15     ttyS8   vcsu2        xdma0_events_4
disk             loop0     psaux         tty12     tty33  tty54  ttyS16     ttyS9   vcsu3        xdma0_events_5
dma_heap         loop1     ptmx          tty13     tty34  tty55  ttyS17     udmabuf vcsu4        xdma0_events_6
dri              loop10    ptp0          tty14     tty35  tty56  ttyS18     uhid    vcsu5        xdma0_events_7
drm_dp_aux0      loop11    pts           tty15     tty36  tty57  ttyS19     uinput  vcsu6        xdma0_events_8
ecryptfs         loop2     random        tty16     tty37  tty58  ttyS2      urandom vfio         xdma0_events_9
fd               loop3     rfkill        tty17     tty38  tty59  ttyS20     userio  vga_arbiter  xdma0_h2c_0
full             loop4     rtc           tty18     tty39  tty6   ttyS21     vcs     vhci         xdma0_h2c_1
fuse             loop5     rtc0          tty19     tty4   tty60  ttyS22     vcs1    vhost-net    xdma0_user
gpiochip0        loop6     sda           tty2      tty40  tty61  ttyS23     vcs2    vhost-vsock  xdma0_xvc
hpet             loop7     sda1          tty20     tty41  tty62  ttyS24     vcs3    xdma0_c2h_0  zero
hugepages        loop8     sda2          tty21     tty42  tty63  ttyS25     vcs4    xdma0_c2h_1  zfs
hwrng            loop9     sg0           tty22     tty43  tty7   ttyS26     vcs5    xdma0_control
We can also see the xdma0_user channel, which allows access to the AXI4-Lite Master interface; xdma0_control, which allows access to the AXI4-Lite Slave interface; and xdma0_xvc, which gives access to the Xilinx Virtual Cable.
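For example, from Python we could drive the four board LEDs through xdma0_user. This is only a sketch: the 0x0 offset assumes the AXI GPIO was assigned address 0x0 in the block design address editor, so adjust it to your own address map.

import os
import struct

# Drive the LEDs through the AXI4-Lite Master interface.
# Offset 0x0 (the AXI GPIO data register) is an assumption; check the
# address editor of your block design.
xdma_user = os.open('/dev/xdma0_user', os.O_RDWR)
os.pwrite(xdma_user, struct.pack('<I', 0b1010), 0x0)
os.close(xdma_user)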
For the FFT application itself, we need to open connections to the different DMA channels using the os library.
import os

# Open devices for DMA operation
xdma_axis_rd_data = os.open('/dev/xdma0_c2h_0', os.O_RDONLY)
xdma_axis_wr_data = os.open('/dev/xdma0_h2c_0', os.O_WRONLY)
xdma_axis_wr_config = os.open('/dev/xdma0_h2c_1', os.O_WRONLY)
In order to write data to the DMA channels, we have to take into account that the AXI interface is 64 bits wide while the xFFT IP has a 32-bit interface, so we will need to convert the corresponding data into 64-bit packets, with the data of interest in the low 32 bits. To make these conversions, we will import Python's struct library and convert the data to the '<Q' format, which corresponds to a little-endian unsigned 64-bit integer (8 bytes).
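As an illustration, a single complex input sample could be packed like this. This is a hypothetical helper, assuming the 16-bit real part sits in bits [15:0] and the 16-bit imaginary part in bits [31:16] of each 64-bit beat, which matches the unpacking used later. With that in mind, the first thing we actually write is the configuration word:

import struct

# Hypothetical helper: place a 16-bit real and a 16-bit imaginary part in
# the low 32 bits of a little-endian 64-bit word
def pack_sample(re, im=0):
    return struct.pack('<Q', ((im & 0xFFFF) << 16) | (re & 0xFFFF))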
import struct

# Write DMA config channel
config = struct.pack('<Q', 11)
os.pwrite(xdma_axis_wr_config, config, 0)
The configuration value we send has the bit in position 0 set in order to select the forward FFT, plus some additional bits that configure the scaling factor. These values will depend on the amplitude of the input signal and the selected width.
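To make that magic number explicit, we could build the configuration word with a small helper. This is a sketch assuming the config channel layout described in PG109, with FWD_INV in bit 0 followed by the scaling schedule; verify the exact field widths for your IP customization before using it.

# Build the xFFT configuration word: forward/inverse bit plus scaling schedule
# (field layout assumed from PG109, not verified against this exact customization)
def fft_config_word(forward=True, scale_sch=0b101):
    return (scale_sch << 1) | (1 if forward else 0)

# fft_config_word(True, 0b101) == 11, the value written above
config = struct.pack('<Q', fft_config_word())
os.pwrite(xdma_axis_wr_config, config, 0)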
When the xFFT IP is configured through its config channel, we can generate the signal using the numpy library, and pack it using the struct library.
import numpy as np

# Generate the input signal for the data channel
nSamples = 1024
angle = np.linspace(0, 2*np.pi, nSamples, endpoint=False)
sig = np.cos(2*angle) * 100
sig_int = sig.astype(int) + 100

# Write DMA data channel
data = struct.pack('<1024Q', *sig_int)
os.pwrite(xdma_axis_wr_data, data, 0)
These write instructions execute immediately: as long as the device exists, the data will be written. The read operation, however, is delayed until the xFFT has the output value ready, so we can issue the read instruction in Python and it will complete as soon as the data becomes available. Once read, the data has to be unpacked and separated into its real and imaginary parts.
# Helper (not defined in the original snippet): interpret 'value' as a
# signed two's-complement number of 'bits' bits
def _2sComplement(value, bits):
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

# Read DMA data channel
fft_data = os.pread(xdma_axis_rd_data, 8192, 0)
data_unpack = struct.unpack('<1024Q', fft_data)

# Decode real and imaginary parts
data_real = []
data_imag = []
for i in data_unpack:
    real_2scompl = _2sComplement(i & 0xFFFF, 16) / 512
    imag_2scompl = _2sComplement((i >> 16) & 0xFFFF, 16) / 512
    data_real.append(real_2scompl)
    data_imag.append(imag_2scompl)
If we want to perform the FFT on the computer instead of on the FPGA, the instruction is the following.
from scipy import fft

# FFT using the PC
fft_data_pc = fft.fft(sig)
Now we have two different methods to perform an FFT. To test the performance of both, I wrote a script that performs 1024 FFTs using the fft function of the scipy package and 1024 FFTs on the FPGA connected through PCIe, and the result was that the scipy function is about 4 times faster than the execution on the FPGA. Does this make sense? Of course! To perform an FFT on the computer, the data is already there, so no data moves between devices. Also, the computer uses its Floating Point Unit (FPU) to compute the FFT, so no data conversion is needed in the process. In the case of the PCIe-connected FPGA, for each FFT we first need to pack the data into a format that the xFFT IP can handle. Then the data has to be sent to the FPGA through PCIe, which is a high-speed interface, but we have to keep in mind that Python runs on an operating system that introduces a delay in the data transfer depending on the number of tasks pending execution. When the xFFT IP has computed the result, we need to receive the data and undo the transformations. All of these steps decrease the performance of the FPGA.
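The comparison I ran looked roughly like the following sketch. This is my reconstruction, not the exact script, and it reuses the handles and buffers defined above; it simply times 1024 transforms with each method.

import time

nIter = 1024

# FFTs on the PC with scipy
t0 = time.time()
for _ in range(nIter):
    fft_data_pc = fft.fft(sig)
t_pc = time.time() - t0

# FFTs on the FPGA through the xDMA channels (write input, read result)
t0 = time.time()
for _ in range(nIter):
    os.pwrite(xdma_axis_wr_data, data, 0)
    fft_data = os.pread(xdma_axis_rd_data, 8192, 0)
t_fpga = time.time() - t0

print('PC: %.3f s, FPGA: %.3f s' % (t_pc, t_fpga))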
To obtain benefits from connecting an FPGA to a computer through PCIe, in most cases we first need to consider what kind of algorithm we want to execute. To obtain the highest performance, the algorithm has to run mostly in the FPGA, so the transactions involving the computer and, above all, the operating system have to be as few as possible. Algorithms that match this requirement are iterative algorithms where we need to find some value, or a combination of values, given a reduced amount of data. One example is neural network training, where the training set is sent to the FPGA and the whole iterative process of training the neurons runs in the FPGA. Another application could be cryptocurrency mining, where, in the case of Bitcoin or Ethereum, a block is sent and the FPGA computes the hash of this block many times, adding a transaction each time, in order to obtain a valid hash.
Besides the algorithm itself, the FPGA design has to be optimized to accept PCIe transactions, reducing the changes needed in the data format. In this example we used the xFFT IP, which is not optimized to handle 64-bit data. In the world of hardware acceleration, the design running in the accelerator is known as the kernel; in this case, the xFFT IP is the kernel. In the next post of the PCIe series I will develop a custom kernel in order to extract all the potential and design a real hardware accelerator.