Implementing a digital biquad filter in Verilog.

Filtering is likely the most common DSP algorithm that any embedded engineer has to design, no matter it they are developing for an STM32, or TI DSP or, of course, en FPGA. Filtering is such important because the majority of the applications are connected to the real world, so they need to capture real signals with the undesirable elements that all real signals have, i.e noise, offset… Also, signals that we need to acquire, not always will be the dominant signals we acquire, for example, for biological signals, we will acquire a high level of 50 or 60 Hz corresponding with the grid frequency. On this cases we will need a filter.

As you sure know as reader of this blog, on digital systems, we have 2 kind of filters, FIR (Finite Impulse Response) filter, which only use the value of the input now and the value on the past, and IIR (Infinite Impulse Response), which use the value of the input now and in the past, and also the value of the past outputs. FIR filters are easy to design for many applications, on this blog we did that here, but they response are limited for a low order filters. On the other hand, with IIR filters, we can achieve more aggressive responses, but they have the disadvantage that the filter can be unstable in certain cases. On this blog we already designed IIR filters (here), but on that post we used MATLAB to design the filter and HDL Coder to implement that filter. On this post, I will show how we can implement a second order filter, or second order system, on Verilog step by step.

Why a second order system?

Second order systems are one of the 2 systems with we can modeling any linear system. The other ones are the first order systems. The reason is that if we want to create a third order system, we can implement this system as a first order system in series with a second order system. In case we would to develop a 7th order filter, we will generate 3 second order filters in series with one first order filter.

For example, a 5th order system looks like the next equation.

We can rewrite this equation as two 2nd order systems in series with a first order system

This kind of decomposition system are known as sections, and for a decomposition of second order systems, we can found it on internet and literature as second order sections. Actually, on the post where I talk about the band pass filter designed on MATLAB, the final implementation of the 8th order filter was four 2nd order filters in series (4 sections). On MATLAB command window, you can use the command tf2sos to convert a high order filter on a set of second order filters. In case you don’t have a MATLAB license, the Python library Scipy has the command scipy.signal.tf2sos, which perform the same function. Besides the improvement on simplicity that second order filters allow us to achieve, when we are using a fixed-point encoding, implement high order filters can bring us stability issues, so split the filter in second order systems will improve the stability of the system.

Verilog implementation.

Before start coding, we have to think what we want of the module we will create. In my case, is very important be able to parametrize the module, this way, in case I change the width of the input, or the width of the output, that only represents a change on a parameter, without modify the code of the module since it means new tests for the module. Also, we have to think what numeric format we will use. For this kind of filters, is obvious that we will need a signed format, and the capability to manage decimal numbers. To achieve this requirements, we have to select between fixed point and floating point, and for me the decision is clear, I want a code that can be instantiated alone, without an external floating point unit, so the format will be fixed point. If we add the requirement of parameters, and the fixed point format, we will need parameters to define the width of the signals, and also the width of the decimal part, which will be related to the precision we need. Also, the precision, will be selected to obtain a response of the implemented filter as similar as possible to the designed. Now, how many width we need to parametrize? We can define one width that will be used as for inputs and outputs and for coefficients, but this way, the width of the coefficients will be select according the width of the data generator, and for some filters, which the coefficients are on the stability limit, this could represent a problem. So, at least we will define 2 different widths, one for the input and outputs, according the data drain and the data source, and other one according the resolution we will need. Another thing we have to think about widths, are the resolution of the internal operations, because in some cases, an stability problem is not related to the width of the coefficient itself, but the operation resolution, so at this point, will be interesting decoupling the width of the coefficients that will be generated on MATLAB or Python, with the width of the internal filter operations. Finally, filter parameters will looks like the next.

  parameter inout_width = 16,
  parameter inout_decimal_width = 15,
  parameter coefficient_width = 16,
  parameter coefficient_decimal_width = 15,
  parameter internal_width = 16,
  parameter internal_decimal_width = 15

Next we have to think on the interfaces. Applications where data is transferred between modules in a continuous way, AXI4-Stream will be the best option. On the inputs and output field, we will define both master and slave AXI4-Stream interfaces to acquire and send data. Although the input and output widths are parametrize, if we want to connect the module to an existing AXI4-Stream IP, this width are restricted to the width of the bus.

  input aclk,
  input resetn,

  /* slave axis interface */
  input [inout_width-1:0] s_axis_tdata,
  input s_axis_tlast,
  input s_axis_tvalid,
  output s_axis_tready,

  /* master axis interface */
  output reg [inout_width-1:0] m_axis_tdata,
  output reg m_axis_tlast,
  output reg m_axis_tvalid,
  input m_axis_tready,

Regarding the coefficients, the best to insert them is with an AXI4-LIte interface, but I want to design this module without the need of use a Zynq or Microblaze, so the easy way to insert the coefficients are as inputs.

  /* coefficients */
  input signed [pw_coefficient_width-1:0] b0,
  input signed [pw_coefficient_width-1:0] b1,
  input signed [pw_coefficient_width-1:0] b2,
  input signed [pw_coefficient_width-1:0] a1,
  input signed [pw_coefficient_width-1:0] a2

Now, with all the inputs and outputs defined, we have to think on the data flow. First, as we have different formats defined, to be able to perform operation between the coefficients and data, we have to change the format of all signals to the internal format, which is defined by the internal width and the decimal width. The integer width used is defined as a localparam, and is corresponding with the difference between the width and the decimal width. To change the format, we will fill with zeros the low side of the signal until the decimal part will be completed. For the integer part, as the format used is signed, we have to perform a sign extension, that is fill with the value of the MSb until the integer part will be completed. Notice that this works because the internal width are bigger or equal to the inout and coefficients width.

/* resize signals to internal width */
  assign input_int = { {(internal_integer_width-inout_integer_width){s_axis_tdata[inout_width-1]}},
                            {(internal_decimal_width-inout_decimal_width){1'b0}} };
  assign b0_int = { {(internal_integer_width-coefficient_integer_width){b0[coefficient_width-1]}},
                            {(internal_decimal_width-coefficient_decimal_width){1'b0}} };
  assign b1_int = { {(internal_integer_width-coefficient_integer_width){b1[coefficient_width-1]}},
                            {(internal_decimal_width-coefficient_decimal_width){1'b0}} };
  assign b2_int = { {(internal_integer_width-coefficient_integer_width){b2[coefficient_width-1]}},
                            {(internal_decimal_width-coefficient_decimal_width){1'b0}} };
  assign a1_int = { {(internal_integer_width-coefficient_integer_width){a1[coefficient_width-1]}},
                            {(internal_decimal_width-coefficient_decimal_width){1'b0}} };
  assign a2_int = { {(internal_integer_width-coefficient_integer_width){a2[coefficient_width-1]}},
                            {(internal_decimal_width-coefficient_decimal_width){1'b0}} };

Digital filter, both FIR and IIR, need to store the value of the past inputs, and IIR also the value of the past outputs, so we will need a pipeline structure to store the past values.

  /* pipeline registers */
  always @(posedge aclk)
    if (!resetn) begin
      input_pipe1 <= 0;
      input_pipe2 <= 0;
      output_pipe1 <= 0;
      output_pipe2 <= 0;
      if (s_axis_tvalid) begin
        input_pipe1 <= input_int;
        input_pipe2 <= input_pipe1;
        output_pipe1 <= output_int;
        output_pipe2 <= output_pipe1;

Now, the next is perform the filter calculations. a second order filter need to perform 5 multiplications, and remember that it does not necessary mean that the modules uses 5 DSP slices. The code I developed performs combinational multiplications. This allows 0 clock cycles delay, but limit the speed of the filter. If your timing constrains are not met, you can register the input of the multipliers, and then make a retiming to let Vivado select where is more efficient put the register.

  /* combinational multiplications */
  assign input_b0 = input_int * b0_int;
  assign input_b1 = input_pipe1 * b1_int;
  assign input_b2 = input_pipe2 * b2_int;
  assign output_a1 = output_pipe1 * a1_int;
  assign output_a2 = output_pipe2 * a2_int;

The result of the multiplications will be added, in case of the input, and substracted in the case of the outputs, to obtain the filter output. As the output of the product has a size of twice of the operands, the addition must be the same width. To use the output on the multiplications, we have to perform a shift to delete extra decimal positions. Regarding the extra integer position will be truncated on the assignment.

  assign output_2int = input_b0 + input_b1 + input_b2 - output_a1 - output_a2;
  assign output_int = output_2int >>> (internal_decimal_width);

Finally, the value of the output will be reformatted to the inout widths.

  assign m_axis_tdata = output_int >>> (internal_decimal_width-inout_decimal_width);

Regarding the AXI4-Stream management signals, as the filter behaves as bridge, with one cycle delay, management signals will pass through the filter with one cycle delay.

Once the system in implemented we can test it.

Module verifying.

In order to verify the behavior of the filter, we will configure the system as a low pass filter, with a cut frequency of 1kHz, and a sampling frequency of 100kHz (script lowpass_sos.m). Then, we perform a unitary step to the filter. To obtain the response of the implemented filter, we configure the coefficients with a format of 20 bits, and 18 decimal bits. The internal width will be the same as the coefficients. On MATLAB, also we can test the response of the quantized filter. Testbench used (axis_biquad_tb.v), logs the output data at a sampling frequency, so the output both MATLAB and XSIM must be identical, with a gain of 1000 in the case of the simulation. The first response, is made with an internal format of I2Q18, and we can see how the gain in DC is been attenuated. This is due to the resolution of the internal operations, since MATLAB perform the operations with a resolution of 32 bits.

Now, with the same resolution of the coefficients, we will change the internal width to 32 bits, with 30 bits on the decimal side, and we obtain a response identical to the obtained on the MATLAB model.

Configuring again an internal width of I2Q18, we can se how increasing the Q factor, the response of the filter remains stable and with an acceptable gain.

The use of a higher or lower resolution will depends of the application. The fact of use the double of multiplicators to achieve a DC gain almost exactly the same of the continuous model, will depend of the number of DSP Slices available on the part you are using, but in most cases, applying a gain to the transfer function of the filter will be enough. Other option would be use a lower number of multiplicators than multiplications, and use a pipeline to perform the operations. This option will be valid if the clock cycles between data valid signals are enough to perform all the operations. This technique is known as folding, and the factor between number of operations and multiplications is named folding factor.

One important thing of this post, is that we have implement a generic second order system, that can be used as filter, or as regulator, or even a plant model to simulate a plant behavior on an FPGA. This kind of system are powerful and sure on future post we will use this module to create some cool projects.

All files are available on Github.

Leave a Reply