Parallelizing SHA256 calculation on FPGA

A few weeks ago, I wrote an article where I developed a hash calculator on an FPGA. Specifically, I implemented an SHA-256 calculator. This module computes the hash of a string (up to 25 bytes) in 68 clock cycles.

The design leverages the parallelism of FPGAs to compute the W matrix and the recursive rounds concurrently. However, it produces only one hash every 68 clock cycles, leaving most of the FPGA underutilized during that time.

In this article we are going to improve the performance of that system by instantiating a set of hash calculators so that several hashes can be computed at the same time.

The next diagram shows the structure of the project. To optimize the design, I had to change the hash calculator module. As you may remember, the SHA-256 algorithm needs a set of precomputed values, the K matrix. In this project, that matrix is no longer inside the SHA core; it sits at the top level, where all the hash cores can access it, so only one copy of the K matrix has to be stored. In addition, the W matrix values are now initialized in parallel, which eliminates the AXI Stream interface.

FSM diagram
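To make the shared-constant idea concrete, here is a small Python model (not the HDL of the project). It builds the 64 SHA-256 round constants once and lets any number of modelled cores read the entry selected by their round index. The names `K_TABLE` and `k_lookup` are mine, and returning zero for a round index of 64 or more is an assumption, not something taken from the design.

```python
def icbrt(n: int) -> int:
    """Integer cube root: the largest r such that r**3 <= n."""
    lo, hi = 0, 1 << (n.bit_length() // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def first_primes(count: int) -> list[int]:
    """Return the first `count` prime numbers by trial division."""
    primes: list[int] = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


# SHA-256 round constants: the first 32 bits of the fractional part of the
# cube root of each of the first 64 primes. Stored once, shared by all cores.
K_TABLE = [icbrt(p << 96) & 0xFFFFFFFF for p in first_primes(64)]


def k_lookup(round_index: int) -> int:
    """Value a core would receive for the round index it presents.

    Assumption: indices outside 0..63 read as zero.
    """
    return K_TABLE[round_index] if 0 <= round_index < 64 else 0


print(hex(k_lookup(0)))   # 0x428a2f98
print(hex(k_lookup(63)))  # 0xc67178f2
```

Because the table is read-only, sharing it among the cores costs only the routing of the selected value, instead of one copy of the 64 constants per core.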

These two changes reduce the logic used by the core and improve its performance. The new SHA core is named sha256_core_pif (pif stands for parallel interface).

module sha256_core_pif (
  input wire aclk, 
  input wire aresetn, 

  /* input data channel */
  input wire [31:0] string_w0,
  input wire [31:0] string_w1,
  input wire [31:0] string_w2,
  input wire [31:0] string_w3,
  input wire [31:0] string_w4,
  input wire [31:0] string_w5,
  input wire [31:0] string_w6,
  input wire [31:0] string_w7,
  input wire [31:0] string_w8,
  input wire [31:0] string_w9,
  input wire [31:0] string_w10,
  input wire [31:0] string_w11,
  input wire [31:0] string_w12,
  input wire [31:0] string_w13,
  input wire string_dv,
  output wire string_ready,
  input wire [7:0] string_size,
  output reg [6:0] round,
  input wire [31:0] k_round,
  
  /* output data channel */
  output reg sha256_dv,
  output reg [255:0] sha256_data
);
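As a host-side illustration of what a parallel word interface carries, the following sketch pads a short message into a single SHA-256 block and splits it into sixteen 32-bit big-endian words. The padding itself is the standard SHA-256 rule. Mapping the first fourteen words to string_w0..string_w13, and deriving the last two (the bit length) from string_size, is my assumption; the article does not say whether the core or the manager inserts the 0x80 marker. The function name `pack_single_block` is hypothetical.

```python
import struct


def pack_single_block(message: bytes) -> list[int]:
    """Pad `message` into one 512-bit SHA-256 block, as sixteen 32-bit words.

    Uses the standard single-block limit of 55 bytes, which is looser than
    the limit the core itself supports.
    """
    if len(message) > 55:
        raise ValueError("message does not fit in a single SHA-256 block")
    padded = message + b"\x80"                      # mandatory '1' bit
    padded += b"\x00" * (56 - len(padded))          # zero fill up to byte 56
    padded += struct.pack(">Q", len(message) * 8)   # 64-bit length in bits
    return list(struct.unpack(">16I", padded))


words = pack_single_block(b"abc")
print([hex(w) for w in words[:2]])  # ['0x61626380', '0x0']
print(words[15])                    # 24
```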

Then, a module called SHA256_manager was added to coordinate all the cores and feed them with the appropriate input values.

The application I implemented is a simple hash cracker or password cracker. It receives a SHA-256 hash and attempts to recover the original string that generated it. This cannot be solved analytically; instead, the SHA256_manager iteratively hashes candidate strings, starting from the first printable character. It then increments the character until it reaches the last one, at which point it appends a new character and restarts the process.
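The iteration described above behaves like an odometer over the printable characters. The sketch below is a software model of that idea, not the manager's HDL. Which end of the string is incremented first is not stated here, so incrementing the first character is an assumption, and `next_candidate` is a hypothetical name.

```python
FIRST, LAST = 0x20, 0x7E  # printable ASCII: space .. '~' (95 characters)


def next_candidate(candidate: bytes) -> bytes:
    """Return the candidate that follows `candidate` in the search order."""
    chars = bytearray(candidate)
    for i in range(len(chars)):
        if chars[i] < LAST:
            chars[i] += 1          # normal step: bump this character
            return bytes(chars)
        chars[i] = FIRST           # wrap around and carry to the next one
    # Every position wrapped: restart with one more character.
    return bytes(chars) + bytes([FIRST])


print(next_candidate(b"a"))   # b'b'
print(next_candidate(b"~"))   # b'  '
print(next_candidate(b"~a"))  # b' b'
```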

There are 95 printable ASCII characters. This means the system must compute 95 hashes for strings of length 1, 95^2 = 9 025 for two-character strings, and 95^3 = 857 375 for three-character strings. In general, the number of required hashes is 95^n for strings of length n.
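Since the cracker does not know the length in advance, the figure that matters is the cumulative count over all lengths up to n. A quick check of that arithmetic (the function names are only for illustration):

```python
ALPHABET_SIZE = 95  # printable ASCII characters


def candidates_of_length(n: int) -> int:
    """Number of strings of exactly n printable characters."""
    return ALPHABET_SIZE ** n


def candidates_up_to(n: int) -> int:
    """Number of strings of length 1..n, i.e. the worst-case hash count."""
    return sum(candidates_of_length(k) for k in range(1, n + 1))


for n in range(1, 5):
    print(n, candidates_of_length(n), candidates_up_to(n))
# 1 95 95
# 2 9025 9120
# 3 857375 866495
# 4 81450625 82317120
```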

Every sha256_core_pif returns the hash it has calculated, and the SHA256_manager compares each of them with the received hash. If one of them matches, the manager sends the host computer the string that was passed to the first sha256_core_pif, together with the index of the sha256_core_pif that computed the matching hash. With these two values, the host computer can reconstruct the correct string.
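A software model of one batch may help to see why the base string plus the index of the winning core is enough. This is only a sketch: it assumes that core i hashes the base string with its first character advanced by i, and the names `run_batch` and `recover` are hypothetical. In hardware the comparisons happen concurrently; here they are a loop.

```python
import hashlib

N_CORES = 12  # number of sha256_core_pif instances in the design


def run_batch(base: bytes, target_hex: str, n_cores: int = N_CORES):
    """Return the index of the core whose candidate matches, or None."""
    for core in range(n_cores):
        # Assumption: core `core` receives the base string offset by its index.
        candidate = bytes([base[0] + core]) + base[1:]
        if hashlib.sha256(candidate).hexdigest() == target_hex:
            return core
    return None


def recover(base: bytes, winner: int) -> bytes:
    """Rebuild the matching string from the base string and the winner index."""
    return bytes([base[0] + winner]) + base[1:]


target = hashlib.sha256(b"eoi").hexdigest()
winner = run_batch(b"`oi", target)
print(winner, recover(b"`oi", winner))  # 5 b'eoi'
```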

The project uses the LiteFury board connected to a Raspberry Pi 5 over PCIe. The next diagram shows the Vivado block design.

Vivado block design

To meet the timing requirements, I needed to reduce the AXI clock speed to 62.5 MHz. With this configuration, I was able to integrate 12 sha256_core_pif modules.

DMA AXI Clock

Regarding the utilization of the FPGA, you will see that it is far from full; the limiting factor was meeting the timing requirements.

Utilization

Using 12 accelerators and a clock speed of 62.5 MHz, all the requirements were met.

Timing
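With these figures we can make a rough estimate of the throughput. The sketch below assumes that each core still needs 68 clock cycles per hash, as the original core did, and it ignores any overhead in the manager or in the PCIe link, so take the result as an upper bound and not as a measurement.

```python
CLOCK_HZ = 62.5e6       # AXI clock used in the design
CYCLES_PER_HASH = 68    # assumption: same latency as the original core
N_CORES = 12            # sha256_core_pif instances
ALPHABET_SIZE = 95      # printable ASCII characters


def hashes_per_second(n_cores: int = N_CORES) -> float:
    """Upper bound on throughput with all cores busy all the time."""
    return n_cores * CLOCK_HZ / CYCLES_PER_HASH


def seconds_to_exhaust(length: int) -> float:
    """Worst-case time to try every string of exactly `length` characters."""
    return ALPHABET_SIZE ** length / hashes_per_second()


print(f"{hashes_per_second():,.0f} hashes/s")
for n in range(1, 7):
    print(n, f"{seconds_to_exhaust(n):.3g} s")
```

Under these assumptions the 12 cores together deliver roughly 11 million hashes per second.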

On the host side, I created a Python driver to manage the LiteFury. I used the xDMA drivers from Xilinx with the modification we made in this article. The Python driver just needs to open the /dev/xdma0_user device and write the registers according to the register map of the AXI peripheral.

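import mmap
import os
import struct

# Methods of the driver class (the class header is not shown in this excerpt).
# AXI_PERIPH_OFFSET is referenced through self and is defined elsewhere in the driver.

def __init__(self, uio_path="/dev/xdma0_user", map_size=0x20000):
    # Open the xDMA user channel and map its register window into memory
    self.fd = os.open(uio_path, os.O_RDWR | os.O_SYNC)
    self.map_size = map_size
    self.m = mmap.mmap(self.fd, self.map_size, mmap.MAP_SHARED, mmap.PROT_READ | mmap.PROT_WRITE, offset=0)

def close(self):
    # Release the mapping before closing the file descriptor
    self.m.close()
    os.close(self.fd)

def write(self, addr, value):
    # Write one 32-bit register at addr, relative to the AXI peripheral
    self.m.seek(addr+self.AXI_PERIPH_OFFSET)
    self.m.write(struct.pack("<I", value))  # Little endian

def read(self, addr):
    # Read one 32-bit register at addr, relative to the AXI peripheral
    self.m.seek(addr+self.AXI_PERIPH_OFFSET)
    return struct.unpack("<I", self.m.read(4))[0]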

As I mentioned before, to obtain the final string we need to read the registers that hold the resulting string and add the index of the winning module.

def get_password(self, winner):
    pw = b''
    for addr in self.REG_R:
        word = self.read(addr)
        pw += word.to_bytes(4, 'big')
    # Add the value of the winner as integer to the resulting string
    pw_int = int.from_bytes(pw, 'big') + winner
    # Convert the result to bytes
    pw_bytes = pw_int.to_bytes(len(pw), 'big')
    # Invert the order of the result
    pw_bytes = pw_bytes[::-1]
    # ASCII decoding
    return pw_bytes.rstrip(b'\x00').decode('ascii', errors='ignore')
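To see what this arithmetic does, here is the same logic as a standalone function, fed with made-up register contents. The two register words and the winner value are hypothetical, chosen only so that the result is the string used in the test further on; the real number of result registers is defined by REG_R in the driver.

```python
def decode_password(words: list[int], winner: int) -> str:
    """Same steps as get_password, but taking the register words as input."""
    # Concatenate the 32-bit words, most significant word first
    pw = b"".join(word.to_bytes(4, "big") for word in words)
    # Adding the winner index bumps the least significant byte
    pw_int = int.from_bytes(pw, "big") + winner
    # Back to bytes, then reverse so that byte becomes the first character
    pw_bytes = pw_int.to_bytes(len(pw), "big")[::-1]
    return pw_bytes.rstrip(b"\x00").decode("ascii", errors="ignore")


# Hypothetical register contents: 0x69 = 'i', 0x6F = 'o', 0x60 = '`'
words = [0x00000000, 0x00696F60]
print(decode_password(words, winner=0))  # `oi
print(decode_password(words, winner=5))  # eoi
```

The least significant byte of the last register ends up as the first character of the string, which is why adding the winner index changes that character.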

To test the project, I created another Python script that calculates the SHA-256 of a string (this can also be done with the OpenSSL library). The calculated hash is then sent to the accelerator, which returns the original string.

~/pass_cracker/python $ python3 sha256_comp.py eoi
SHA-256 of 'eoi': 7c02b8671bb4824e1cea44af7b628e88b81495699d5e9cb0e2533af99320a81b

~/pass_cracker/python $ sudo python3 pass_cracker.py 7c02b8671bb4824e1cea44af7b628e88b81495699d5e9cb0e2533af99320a81b
Password: eoi
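The reference script is not listed in this article. A minimal equivalent, assuming only the hashlib module of the Python standard library, could look like this (the function name `sha256_hex` is mine, and the real sha256_comp.py may differ):

```python
import hashlib
import sys


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of an ASCII string as a hex string."""
    return hashlib.sha256(text.encode("ascii")).hexdigest()


if __name__ == "__main__":
    # Usage: python3 this_script.py <string>
    if len(sys.argv) == 2:
        print(f"SHA-256 of '{sys.argv[1]}': {sha256_hex(sys.argv[1])}")
```

A quick sanity check is the well-known test vector for "abc", whose digest starts with ba7816bf.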

Projects like this can be quite impressive to engineers unfamiliar with FPGAs. The ability to accelerate SHA-256 computation by performing different tasks in parallel, and even by running multiple hash calculators simultaneously, often sparks curiosity and interest in FPGA technology.

The role of FPGAs in fields like cryptography and cybersecurity is expected to grow significantly in the coming years, as increasingly faster and more flexible systems are required.

All the files of this project are shared in the controlpaths GitHub

Are you involved in a cryptography project and want to know if an FPGA could help? Contact me.
