Multiplication with FPGA DSPs

Published 27 Nov 2021 · Updated 07 Mar 2023

Welcome back to my series covering mathematics and algorithms with FPGAs. Project F is known for its practical, hands-on tutorials. So, I decided to dedicate a post to a topic usually ignored by FPGA authors: multiplication with DSPs. New to the series? Start with Numbers in Verilog.

Series Outline

Numbers in Verilog - introduction to numbers in Verilog
Vectors and Arrays - working with Verilog vectors and arrays
Multiplication with FPGA DSPs (this post) - efficient multiplication with DSPs
Fixed-Point Numbers in Verilog - precision without complexity
Division in Verilog - divided we stand
More maths to follow

DSP Blocks

FPGAs incorporate dedicated digital signal processing (DSP) blocks. As their name suggests, DSPs accelerate typical signal processing tasks, such as fast Fourier transforms (FFTs) and finite impulse response filtering (FIR). DSP blocks include multipliers, which we’ll cover in this post. Just bear in mind that DSP blocks are useful for many things beyond straight multiplication.

You can also implement multiplication directly in logic (LUTs and flip-flops), but it takes significant resources. Using dedicated DSP blocks for multiplication makes sense from a performance and logic-use perspective. Hence, even small FPGAs dedicate space to DSP blocks.

Multiplication Capabilities

Let’s take a quick look at the multiplication capabilities of a few FPGAs.

The width of a DSP multiplier depends on the FPGA architecture:

Altera Cyclone V: 27 x 27 bit
Lattice iCE40UP (SB_MAC16): 16 x 16 bit
Lattice ECP5 (sysDSP): 18 x 18 bit
Xilinx 7 Series (DSP48E1): 25 × 18 bit
Xilinx Ultrascale+ (DSP48E2): 27 x 18 bit

For example, an ECP5 multiplier takes two 18-bit inputs and generates a 36-bit product output.

While the number of multipliers depends on the specific FPGA part:

Altera Cyclone V:
- 5CSEBA6U23I7: 112 multipliers
Lattice iCE40UP
- iCE40UP5K: 8 multipliers
Lattice ECP5
- LFE5U-85: 156 multipliers
- LFE5U-25: 56 multipliers
Xilinx 7 Series
- Artix-7 A200T: 740 multipliers
- Spartan-7 S25: 80 multipliers
Xilinx Ultrascale+
- Artix AU10P: 400 multipliers

Larger FPGAs include many thousands of multipliers, but we’ve confined ourselves to parts you might find on a hobbyist-friendly board (I live in hope of an affordable Ultrascale+ board in ~~2022~~ 2023).

DSP Block References

Note: each ECP5 DSP block incorporates two multipliers and each Cyclone V DSP block can handle one 27x27, two 18x18, or three 9x9 multiplications.

Using DSPs

There are two ways to incorporate DSPs into your FPGA designs:

Explicitly instantiate the DSP primitive (SB_MAC16, DSP48E1 etc.)
Infer the DSP using regular multiplication in Verilog or VHDL

Vendor primitives allow you to fine-tune your implementation but are more complex and limit your design to a single FPGA architecture. I’d recommend starting with inference and only reaching for vendor primitives if you need to do something clever or ring the last nanosecond out of your design.

Yosys DSP Inference

If you’re using Yosys with an iCE40UP FPGA, you need to explicitly enable DSP inference using the -dsp option.

For example, to synthesize a module called top_pong.v:

yosys -p "synth_ice40 -dsp -json top_pong.json" top_pong.v

On ECP5, Yosys infers DSPs by default; you can disable them with -nodsp.

Multiplication Examples

To create simple, practical examples, I’ve combined simple maths functions with graphics output. If you have an FPGA board with video output, you can view the results on screen, but even if you don’t, you can still look at the designs and timings with your FPGA development tools. I also provide Verilator simulations of the designs.

Our examples run mathematical functions for each pixel on the screen, generating a 0 or 1 as output. The function needs to keep pace with the display signal: calculating 25 million values per second for 640x480 or 74 million for 1280x720.

Source

The SystemVerilog designs featured in this post are available from the projf-explore git repo under the open-source MIT licence: build on them to your heart’s content. The rest of the blog content is subject to standard copyright restrictions: don’t republish it without permission.

Squared

Let’s start with a module that includes just one multiplication.

This module takes the X and Y coordinates of a pixel and returns 1 if X squared is larger than the scaled Y value (explained below).

module func_squared #(
    CORDW=8,     // signed coordinate width (bits)
    Y_SCALE=256  // increase y-scale so we can see more on-screen
    ) (
    input  wire clk,
    input  wire signed [CORDW-1:0] x,
    input  wire signed [CORDW-1:0] y,
    output logic r
    );

    // v1: simple version (latency: 2 cycles)
    logic signed [2*CORDW-1:0] x_squared;
    logic signed [2*CORDW-1:0] y_scaled;
    always_ff @(posedge clk) begin
        y_scaled <= Y_SCALE * y;
        x_squared <= x*x;
        r <= (x_squared < y_scaled) ? 1 : 0;
    end
endmodule

Because we’re using integers, each pixel represents one whole number. When we square X, the value quickly becomes too large to fit on screen if we assume each vertical pixel represents 1. To make more of the function visible, we scale the Y-axis with a parameter. In this case, we’ve scaled the Y-axis so each pixel represents 256.

Here’s a sample top module to drive it; this version is for the Arty (Xilinx 7 Series). (0,0) is in the centre of the screen. We’re using 12-bit signed signals, so they comfortably fit into any DSP.

module top_graphing (
    input  wire logic clk_100m,     // 100 MHz clock
    input  wire logic btn_rst,      // reset button (active low)
    output      logic vga_hsync,    // horizontal sync
    output      logic vga_vsync,    // vertical sync
    output      logic [3:0] vga_r,  // 4-bit VGA red
    output      logic [3:0] vga_g,  // 4-bit VGA green
    output      logic [3:0] vga_b   // 4-bit VGA blue
    );

    // generate pixel clock
    logic clk_pix;
    logic clk_locked;
    clock_gen_480p clock_pix_inst (
       .clk(clk_100m),
       .rst(!btn_rst),  // reset button is active low
       .clk_pix,
       .clk_locked
    );

    // display sync signals and coordinates
    localparam CORDW = 12;
    logic hsync, vsync;
    logic [CORDW-1:0] sx, sy;
    logic de;
    display_480p #(.CORDW(CORDW)) display_inst (
        .clk_pix,
        .rst(!clk_locked),
        .sx,
        .sy,
        .hsync,
        .vsync,
        .de,
        .frame(),
        .line()
    );

    // signal when to draw mathematical function and axis
    logic draw, axes;

    // VGA output
    always_ff @(posedge clk_pix) begin
        vga_hsync <= hsync;
        vga_vsync <= vsync;
        vga_r <= !de ? 4'h0 : (axes ? 4'hC : (draw ? 4'h3 : 4'h2));
        vga_g <= !de ? 4'h0 : (axes ? 4'hC : (draw ? 4'hB : 4'h2));
        vga_b <= !de ? 4'h0 : (axes ? 4'hC : (draw ? 4'hC : 4'h2));
    end


    //
    // Graphing Logic
    //

    // function coordinates
    logic signed [CORDW-1:0] x, y;

    // adjust screen coordinates so (0,0) is at centre of screen
    localparam X_OFFS = 320;
    localparam Y_OFFS = 239;
    always_ff @(posedge clk_pix) begin
        x <= sx - X_OFFS + 3;  // latency for function (+n) and offset calculation (+1)
        y <= Y_OFFS - sy;
    end

    // draw X and Y axes
    logic x_axis, y_axis;
    always_comb begin
        x_axis = (sy == Y_OFFS);
        y_axis = (sx == X_OFFS);
        axes = x_axis || y_axis;
    end

    // function to graph
    func_squared #(.CORDW(CORDW)) func_inst (
        .clk(clk_pix),
        .x,
        .y,
        .r(draw)
    );
endmodule

Building the Demos
In the Maths Demo section of the git repo, you’ll find the design files, a makefile for iCEBreaker, a Vivado project for Arty and Nexys Video, and instructions for building the designs for dev boards and with Verilator.

And here’s what the Verilator simulation looks like:

X Squared

Vivado automatically infers DSPs for the multiplication, but it gives a warning of the form:

[DRC DPOP-2] MREG Output pipelining
Pipelining the multiplier function will improve performance and will save significant power so it is suggested whenever possible to fully pipeline this function. If this multiplier was inferred, it is suggested to describe an additional register stage after this function.

Let’s take Vivado’s advice and register the result of the multiplication before comparing it with the Y-value. This adds an additional cycle of latency, so we need to delay the Y value:

    // v2: extra pipeline stages (latency: 3 cycles)
    logic signed [2*CORDW-1:0] x_squared, x_squared_p1;
    logic signed [2*CORDW-1:0] y_scaled, y_scaled_p1;
    always_ff @(posedge clk) begin
        y_scaled_p1 <= Y_SCALE * y;
        y_scaled    <= y_scaled_p1;

        x_squared_p1 <= x*x;
        x_squared <= x_squared_p1;

        r <= (x_squared < y_scaled) ? 1 : 0;
    end

We also update the latency correction in top_graphing; otherwise our results don’t match our inputs and our graphic would be shifted to the right:

        x <= sx - X_OFFS + 4;  // latency for function (+n) and offset calculation (+1)

Let’s compare the timings and logic used for the two versions of the design on different FPGAs:

FPGA	Design	Target	Cycles	LUT	FF	DSP	WNS (ns)
XC7A35T	Squared v1	25.2 MHz	2	81	90	1	33.6
XC7A35T	Squared v2	25.2 MHz	3	84	96	1	34.1
XC7A200T	Squared v1	74.25 MHz	2	137	127	1	9.2
XC7A200T	Squared v2	74.25 MHz	3	139	135	1	8.2

Built with Vivado 2021.2 using default settings for synthesis and implementation. Xilinx FPGAs are speed grade -1 (slowest grade).

The results for XC7A35T include 640x480 VGA output, while those for XC7A200T include 1280x720p60 DVI output. Timing (represented by WNS) wasn’t an issue for 25 or 74 MHz operation, even with the first version of our design. The extra pipeline stage uses a little more logic and adds an additional cycle of latency.

Timing on the XC7A200T is actually worse for our pipelined design, but still comfortable. Vivado doesn’t optimise a design when timing is comfortable, so we shouldn’t read too much into this.

Worst Negative Slack
The unhelpfully named worst negative slack (WNS) is how Vivado reports timing. The larger this number, the better, but you need to account for the target clock speed. For example, at 100 MHz the maximum WNS is 10 ns because that’s how long a clock cycle lasts at 100 MHz. If WNS is negative, your design doesn’t meet timing and might not work.

Circle

A more complex function is the circle equation, which requires squaring both X and Y.

Again, we create two versions and compare them. I’m including both versions in one module but commented one out.

module func_circle #(
    CORDW=8,    // signed coordinate width (bits)
    RADIUS=128  // circle radius
    ) (
    input  wire clk,
    input  wire signed [CORDW-1:0] x,
    input  wire signed [CORDW-1:0] y,
    output logic r
    );

    // // v1: simple version (latency: 2 cycles)
    // logic signed [2*CORDW:0] circle;  // addition of two squares, so twice as wide
    // always_ff @(posedge clk) begin
    //     circle <= x*x + y*y;
    //     r <= (circle < RADIUS * RADIUS) ? 1 : 0;
    // end

    // v2: extra pipeline stages (latency: 4 cycles)
    logic [2*CORDW-1:0] x_squared, x_squared_p1;
    logic [2*CORDW-1:0] y_squared, y_squared_p1;
    logic signed [2*CORDW:0] circle;  // addition of two squares, so twice as wide
    always_ff @(posedge clk) begin
        x_squared_p1 <= x*x;
        x_squared <= x_squared_p1;

        y_squared_p1 <= y*y;
        y_squared <= y_squared_p1;

        circle <= x_squared + y_squared;
        r <= (circle < RADIUS * RADIUS) ? 1 : 0;
    end
endmodule

FPGA	Design	Target	Cycles	LUT	FF	DSP	WNS (ns)
XC7A35T	Circle v1	25.2 MHz	2	78	66	2	33.6
XC7A35T	Circle v2	25.2 MHz	4	102	77	2	33.9
XC7A200T	Circle v1	74.25 MHz	2	135	103	2	7.7
XC7A200T	Circle v2	74.25 MHz	4	159	114	2	9.2

As with our squared module, we’ve so much timing slack at 25 MHz that the extra registers don’t make any significant difference. At 74 MHz there is some improvement, but both versions have generous slack considering the target frequency.

NB. This is not an efficient way to draw a circle for graphics applications; use a specialist circle drawing algorithm that doesn’t require DSPs, such as the one in FPGA Shapes.

Cubed

The next example increases the complexity a little more, calculating X cubed:

module func_cubed #(
    CORDW=8,       // signed coordinate width (bits)
    Y_SCALE=16384  // increase y-scale so we can see more on-screen
    ) (
    input  wire clk,
    input  wire signed [CORDW-1:0] x,
    input  wire signed [CORDW-1:0] y,
    output logic r
    );

    // // v1: simple version (latency: 2 cycles)
    // logic signed [3*CORDW-1:0] x_cubed, y_scaled;
    // always_ff @(posedge clk) begin
    //     y_scaled <= Y_SCALE * y;
    //     x_cubed <= x*x*x;
    //     r <= (x_cubed < y_scaled) ? 1 : 0;
    // end

    // v2: extra pipeline stages (latency: 5 cycles)
    logic signed [2*CORDW-1:0] x_squared, x_squared_p1;
    logic signed [3*CORDW-1:0] x_cubed_p1, x_cubed;
    logic signed [3*CORDW-1:0] y_scaled, y_scaled_p1, y_scaled_p2, y_scaled_p3;
    always_ff @(posedge clk) begin
        y_scaled_p3 <= Y_SCALE * y;
        y_scaled_p2 <= y_scaled_p3;
        y_scaled_p1 <= y_scaled_p2;
        y_scaled    <= y_scaled_p1;

        x_squared_p1 <= x*x;
        x_squared    <= x_squared_p1;
        x_cubed_p1   <= x*x_squared;
        x_cubed      <= x_cubed_p1;

        r <= (x_cubed < y_scaled) ? 1 : 0;
    end
endmodule

In the initial version of cubed, we have two multiplications in one expression. Interestingly, if the signal width is 16, it requires three DSPs, while 12-bit signals need only two. This is another reason to infer DSPs for multiplication: synthesis tools can make non-obvious optimisations.

Adding the extra pipeline stages and registers makes a more noticeable difference:

FPGA	Design	Target	Cycles	LUT	FF	DSP	WNS (ns)
XC7A35T	Cubed v1	25.2 MHz	2	83	90	2	30.7
XC7A35T	Cubed v2	25.2 MHz	5	91	79	3	33.9
XC7A200T	Cubed v1	74.25 MHz	2	140	127	2	4.6
XC7A200T	Cubed v2	74.25 MHz	5	148	115	3	9.1

For our 25 MHz design, we’ve still got plenty of slack, but the additional pipeline stages improve performance at the cost of an extra DSP. However, the number of flip flops goes down in the pipelined version. The 74 MHz results mirror 25 MHz, though timing is starting to get tight for the non-pipelined design.

Quartic Polynomial

Let’s up the ante and graph a quartic polynomial: x⁴−x².

module func_polynomial #(
    CORDW=8,       // signed coordinate width (bits)
    Y_SCALE=2**24  // increase y-scale so we can see more on-screen
    ) (
    input  wire clk,
    input  wire signed [CORDW-1:0] x,
    input  wire signed [CORDW-1:0] y,
    output logic r
    );

    // // v1: simple version (latency: 2 cycles)
    // logic signed [4*CORDW-1:0] x_poly, y_scaled;
    // always_ff @(posedge clk) begin
    //     y_scaled <= Y_SCALE * y;
    //     x_poly <= x*x*x*x - 2**16 * x*x;
    //     r <= (x_poly < y_scaled) ? 1 : 0;
    // end

    // v2: extra pipeline stages (latency: 6 cycles)
    logic signed [2*CORDW-1:0] x_squared, x_squared_p1, x_squared_p2, x_squared_p3;
    logic signed [4*CORDW-1:0] x_fourth, x_fourth_p1, x_poly;
    logic signed [4*CORDW-1:0] y_scaled, y_scaled_p1, y_scaled_p2, y_scaled_p3, y_scaled_p4;
    always_ff @(posedge clk) begin
        y_scaled_p4 <= Y_SCALE * y;
        y_scaled_p3 <= y_scaled_p4;
        y_scaled_p2 <= y_scaled_p3;
        y_scaled_p1 <= y_scaled_p2;
        y_scaled    <= y_scaled_p1;

        x_squared_p3 <= x*x;
        x_squared_p2 <= x_squared_p3;
        x_squared_p1 <= x_squared_p2;
        x_squared    <= x_squared_p1;

        x_fourth_p1 <= x_squared_p2 * x_squared_p2;
        x_fourth    <= x_fourth_p1;

        x_poly <= x_fourth - 2**16 * x_squared;
        r <= (x_poly < y_scaled) ? 1 : 0;
    end
endmodule

FPGA	Design	Target	Cycles	LUT	FF	DSP	WNS (ns)
XC7A35T	Polynomial v1	25.2 MHz	2	130	112	6	20.9
XC7A35T	Polynomial v2	25.2 MHz	6	150	160	3	33.1
XC7A200T	Polynomial v1	74.25 MHz	2	186	149	6	- 2.7
XC7A200T	Polynomial v2	74.25 MHz	6	206	198	3	7.6

This time there’s a significant difference in timing, even at 25 MHz. For 74 MHz, the complexity has caught up with us, and v1 misses timing by quite a margin. However, the pipelined version is still comfortable.

Notably, the extra pipeline stages also halve the number of DSPs required from 6 to 3. Don’t assume that adding pipeline stages results in higher DSP usage: handling too much complex logic in one cycle can be inefficient in logic as well as time.

x⁴−x²

A Balancing Act

Implementing an algorithm in hardware is always a balancing act, never more so than when using DSPs. Adding pipeline stages and registering outputs is almost always worthwhile, but it does increase latency, which is problematic for some applications. Using narrower signals saves logic but reduces accuracy or range.

I’ll finish by giving you a few tips to take away:

Improve timing by including at least one register between each calculation step
Narrower signals can save significant DSPs and logic
Use signed values with Xilinx DSPs
Use an FSM to divide a complex algorithm into multiple steps
- We’ll see examples of using FSM with algorithms in later posts

Next Time

In our fourth installment we get an introduction to Fixed-Point Numbers in Verilog or try rendering the Mandelbrot Set in Verilog.