27 November 2021

Multiplication with FPGA DSPs

Welcome back to my series covering mathematics and algorithms with FPGAs.

I was initially going to look at real numbers in this part, but Project F is known for its practical, hands-on tutorials. So, I decided to dedicate a post to a topic usually ignored by introductory guides: multiplication with DSPs. We’ll cover real numbers in the next post.

New to the series? Start with Numbers in Verilog.

Updated 2021-12-21. Share your thoughts with @WillFlux or find me on 1BitSquared Discord.

DSP Blocks

FPGAs incorporate dedicated digital signal processing (DSP) blocks. As their name suggests, DSPs accelerate typical signal processing tasks, such as fast Fourier transforms (FFTs) and finite impulse response filtering (FIR). DSP blocks include multipliers, which we’ll cover in this post. Just bear in mind that DSP blocks are useful for many things beyond straight multiplication.

You can also implement multiplication directly in logic (LUTs and flip-flops), but it takes significant resources. Using dedicated DSP blocks for multiplication makes sense from a performance and logic-use perspective. Hence, even small FPGAs dedicate space to DSP blocks.

Multiplication Capabilities

Let’s take a quick look at the multiplication capabilities of a few FPGAs.

The width of a DSP multiplier depends on the FPGA architecture:

  • Lattice iCE40UP (SB_MAC16): 16 x 16 bit
  • Lattice ECP5 (sysDSP): 18 x 18 bit
  • Xilinx 7 Series (DSP48E1): 25 × 18 bit
  • Xilinx Ultrascale+ (DSP48E2): 27 x 18 bit

For example, an ECP5 multiplier takes two 18-bit inputs and generates a 36-bit product output.

While the number of multipliers depends on the specific FPGA part:

  • Lattice iCE40UP
    • iCE40UP5K: 8 multipliers
  • Lattice ECP5
    • LFE5U-85: 156 multipliers
    • LFE5U-25: 56 multipliers
  • Xilinx 7 Series
    • Artix-7 A200T: 740 multipliers
    • Spartan-7 S25: 80 multipliers
  • Xilinx Ultrascale+
    • Artix AU10P: 400 multipliers

Larger FPGAs include many thousands of multipliers, but we’ve confined ourselves to parts you might find on a hobbyist-friendly board (I live in hope of an affordable Ultrascale+ board in 2022).

DSP Block References

Note: each ECP5 DSP block incorporates two multipliers.

Using DSPs

There are two ways to incorporate DSPs into your FPGA designs:

  • Explicitly instantiate the DSP primitive (SB_MAC16, DSP48E1 etc.)
  • Infer the DSP using regular multiplication in Verilog or VHDL

Vendor primitives allow you to fine-tune your implementation but are more complex and limit your design to a single FPGA architecture. I’d recommend starting with inference and only reaching for vendor primitives if you need to do something clever or ring the last nanosecond out of your design.

Yosys DSP Inference

If you’re using Yosys with an iCE40UP FPGA, you need to explicitly enable DSP inference using the -dsp option.

For example, to synthesize a module called top_pong.v:

yosys -p "synth_ice40 -dsp -json top_pong.json" top_pong.v

On ECP5, Yosys infers DSPs by default; you can disable them with -nodsp.

Once I better understand Yosys iCE40UP DSP inference, I’ll add timing results for iCE40UP5K.

Multiplication Examples

To create simple, practical examples, I’ve combined simple maths functions with graphics output. If you have an FPGA board with video output, you can view the results on screen, but even if you don’t, you can still look at the designs and timings with your FPGA development tools. I also provide Verilator simulations of the designs.

Our examples run mathematical functions for each pixel on the screen, generating a 0 or 1 as output. The function needs to keep pace with the display signal: calculating 25 million values per second for 640x480 or 74 million for 1280x720.

Source

The SystemVerilog designs featured in this post are available from the projf-explore git repo under the open-source MIT licence: build on them to your heart’s content. The rest of the blog content is subject to standard copyright restrictions: don’t republish it without permission.

Squared

Let’s start with a module that includes just one multiplication.

This module takes the X and Y coordinates of a pixel and returns 1 if X squared is larger than the scaled Y value (explained below).

module func_squared #(
    CORDW=8,     // signed coordinate width (bits)
    Y_SCALE=256  // increase y-scale so we can see more on-screen
    ) (
    input  wire clk,
    input  wire signed [CORDW-1:0] x,
    input  wire signed [CORDW-1:0] y,
    output logic r
    );

    // v1: simple version (latency: 2 cycles)
    logic signed [2*CORDW-1:0] x_squared;
    logic signed [2*CORDW-1:0] y_scaled;
    always_ff @(posedge clk) begin
        y_scaled <= Y_SCALE * y;
        x_squared <= x*x;
        r <= (x_squared < y_scaled) ? 1 : 0;
    end
endmodule

Because we’re using integers, each pixel represents one whole number. When we square X, the value quickly becomes too large to fit on screen if we assume each vertical pixel represents 1. To make more of the function visible, we scale the Y-axis with a parameter. In this case, we’ve scaled the Y-axis so each pixel represents 256.

Here’s a sample top module to drive it; this version is for the Arty (Xilinx 7 Series). (0,0) is in the centre of the screen. We’re using 12-bit signed signals, so they comfortably fit into any DSP.

module top_graphing (
    input  wire logic clk_100m,     // 100 MHz clock
    input  wire logic btn_rst,      // reset button (active low)
    output      logic vga_hsync,    // horizontal sync
    output      logic vga_vsync,    // vertical sync
    output      logic [3:0] vga_r,  // 4-bit VGA red
    output      logic [3:0] vga_g,  // 4-bit VGA green
    output      logic [3:0] vga_b   // 4-bit VGA blue
    );

    // generate pixel clock
    logic clk_pix;
    logic clk_locked;
    clock_gen_480p clock_pix_inst (
       .clk(clk_100m),
       .rst(!btn_rst),  // reset button is active low
       .clk_pix,
       .clk_locked
    );

    // display sync signals and coordinates
    localparam CORDW = 12;
    logic hsync, vsync;
    logic [CORDW-1:0] sx, sy;
    logic de;
    display_480p #(.CORDW(CORDW)) display_inst (
        .clk_pix,
        .rst(!clk_locked),
        .sx,
        .sy,
        .hsync,
        .vsync,
        .de,
        .frame(),
        .line()
    );

    // signal when to draw mathematical function and axis
    logic draw, axes;

    // VGA output
    always_ff @(posedge clk_pix) begin
        vga_hsync <= hsync;
        vga_vsync <= vsync;
        vga_r <= !de ? 4'h0 : (axes ? 4'hC : (draw ? 4'h3 : 4'h2));
        vga_g <= !de ? 4'h0 : (axes ? 4'hC : (draw ? 4'hB : 4'h2));
        vga_b <= !de ? 4'h0 : (axes ? 4'hC : (draw ? 4'hC : 4'h2));
    end


    //
    // Graphing Logic
    //

    // function coordinates
    logic signed [CORDW-1:0] x, y;

    // adjust screen coordinates so (0,0) is at centre of screen
    localparam X_OFFS = 320;
    localparam Y_OFFS = 239;
    always_ff @(posedge clk_pix) begin
        x <= sx - X_OFFS + 3;  // latency for function (+n) and offset calculation (+1)
        y <= Y_OFFS - sy;
    end

    // draw X and Y axes
    logic x_axis, y_axis;
    always_comb begin
        x_axis = (sy == Y_OFFS);
        y_axis = (sx == X_OFFS);
        axes = x_axis || y_axis;
    end

    // function to graph
    func_squared #(.CORDW(CORDW)) func_inst (
        .clk(clk_pix),
        .x,
        .y,
        .r(draw)
    );
endmodule

Building the Demos
In the Maths Demo section of the git repo, you’ll find the design files, a makefile for iCEBreaker, a Vivado project for Arty and Nexys Video, and instructions for building the designs for dev boards and with Verilator.

And here’s what the Verilator simulation looks like:

X Squared

Vivado automatically infers DSPs for the multiplication, but it gives a warning of the form:

[DRC DPOP-2] MREG Output pipelining
Pipelining the multiplier function will improve performance and will save significant power so it is suggested whenever possible to fully pipeline this function. If this multiplier was inferred, it is suggested to describe an additional register stage after this function.

Let’s take Vivado’s advice and register the result of the multiplication before comparing it with the Y-value. This adds an additional cycle of latency, so we need to delay the Y value:

    // v2: extra pipeline stages (latency: 3 cycles)
    logic signed [2*CORDW-1:0] x_squared, x_squared_p1;
    logic signed [2*CORDW-1:0] y_scaled, y_scaled_p1;
    always_ff @(posedge clk) begin
        y_scaled_p1 <= Y_SCALE * y;
        y_scaled    <= y_scaled_p1;

        x_squared_p1 <= x*x;
        x_squared <= x_squared_p1;

        r <= (x_squared < y_scaled) ? 1 : 0;
    end

We also update the latency correction in top_graphing; otherwise our results don’t match our inputs and our graphic would be shifted to the right:

        x <= sx - X_OFFS + 4;  // latency for function (+n) and offset calculation (+1)

Let’s compare the timings and logic used for the two versions of the design on different FPGAs:

FPGA Design Target Cycles LUT FF DSP WNS (ns)
XC7A35T Squared v1 25.2 MHz 2 81 90 1 33.6
XC7A35T Squared v2 25.2 MHz 3 84 96 1 34.1
XC7A200T Squared v1 74.25 MHz 2 137 127 1 9.2
XC7A200T Squared v2 74.25 MHz 3 139 135 1 8.2

Built with Vivado 2021.2 using default settings for synthesis and implementation. All Xilinx FPGAs are speed grade -1 (slowest grade).

The results for XC7A35T include 640x480 VGA output, while those for XC7A200T include 1280x720p60 DVI output. Timing (represented by WNS) wasn’t an issue for 25 or 74 MHz operation, even with the first version of our design. The extra pipeline stage uses a little more logic and adds an additional cycle of latency.

Timing on the XC7A200T is actually worse for our pipelined design, but still comfortable. Vivado doesn’t optimise a design when timing is comfortable, so we shouldn’t read too much into this.

Worst Negative Slack
The unhelpfully named worst negative slack (WNS) is how Vivado reports timing. The larger this number, the better, but you need to account for the target clock speed. For example, at 100 MHz the maximum WNS is 10 ns because that’s how long a clock cycle lasts at 100 MHz. If WNS is negative, your design doesn’t meet timing and might not work.

Circle

A more complex function is the circle equation, which requires squaring both X and Y.

Again, we create two versions and compare them. I’m including both versions in one module but commented one out.

module func_circle #(
    CORDW=8,    // signed coordinate width (bits)
    RADIUS=128  // circle radius
    ) (
    input  wire clk,
    input  wire signed [CORDW-1:0] x,
    input  wire signed [CORDW-1:0] y,
    output logic r
    );

    // // v1: simple version (latency: 2 cycles)
    // logic signed [2*CORDW:0] circle;  // addition of two squares, so twice as wide
    // always_ff @(posedge clk) begin
    //     circle <= x*x + y*y;
    //     r <= (circle < RADIUS * RADIUS) ? 1 : 0;
    // end

    // v2: extra pipeline stages (latency: 4 cycles)
    logic [2*CORDW-1:0] x_squared, x_squared_p1;
    logic [2*CORDW-1:0] y_squared, y_squared_p1;
    logic signed [2*CORDW:0] circle;  // addition of two squares, so twice as wide
    always_ff @(posedge clk) begin
        x_squared_p1 <= x*x;
        x_squared <= x_squared_p1;

        y_squared_p1 <= y*y;
        y_squared <= y_squared_p1;

        circle <= x_squared + y_squared;
        r <= (circle < RADIUS * RADIUS) ? 1 : 0;
    end
endmodule
FPGA Design Target Cycles LUT FF DSP WNS (ns)
XC7A35T Circle v1 25.2 MHz 2 78 66 2 33.6
XC7A35T Circle v2 25.2 MHz 4 102 77 2 33.9
XC7A200T Circle v1 74.25 MHz 2 135 103 2 7.7
XC7A200T Circle v2 74.25 MHz 4 159 114 2 9.2

As with our squared module, we’ve so much timing slack at 25 MHz that the extra registers don’t make any significant difference. At 74 MHz there is some improvement, but both versions have generous slack considering the target frequency.

NB. This is not an efficient way to draw a circle for graphics applications; use a specialist circle drawing algorithm that doesn’t require DSPs, such as the one in FPGA Shapes.

Cubed

The next example increases the complexity a little more, calculating X cubed:

module func_cubed #(
    CORDW=8,       // signed coordinate width (bits)
    Y_SCALE=16384  // increase y-scale so we can see more on-screen
    ) (
    input  wire clk,
    input  wire signed [CORDW-1:0] x,
    input  wire signed [CORDW-1:0] y,
    output logic r
    );

    // // v1: simple version (latency: 2 cycles)
    // logic signed [3*CORDW-1:0] x_cubed, y_scaled;
    // always_ff @(posedge clk) begin
    //     y_scaled <= Y_SCALE * y;
    //     x_cubed <= x*x*x;
    //     r <= (x_cubed < y_scaled) ? 1 : 0;
    // end

    // v2: extra pipeline stages (latency: 5 cycles)
    logic signed [2*CORDW-1:0] x_squared, x_squared_p1;
    logic signed [3*CORDW-1:0] x_cubed_p1, x_cubed;
    logic signed [3*CORDW-1:0] y_scaled, y_scaled_p1, y_scaled_p2, y_scaled_p3;
    always_ff @(posedge clk) begin
        y_scaled_p3 <= Y_SCALE * y;
        y_scaled_p2 <= y_scaled_p3;
        y_scaled_p1 <= y_scaled_p2;
        y_scaled    <= y_scaled_p1;

        x_squared_p1 <= x*x;
        x_squared    <= x_squared_p1;
        x_cubed_p1   <= x*x_squared;
        x_cubed      <= x_cubed_p1;

        r <= (x_cubed < y_scaled) ? 1 : 0;
    end
endmodule

In the initial version of cubed, we have two multiplications in one expression. Interestingly, if the signal width is 16, it requires three DSPs, while 12-bit signals need only two. This is another reason to infer DSPs for multiplication: synthesis tools can make non-obvious optimisations.

Adding the extra pipeline stages and registers makes a more noticeable difference:

FPGA Design Target Cycles LUT FF DSP WNS (ns)
XC7A35T Cubed v1 25.2 MHz 2 83 90 2 30.7
XC7A35T Cubed v2 25.2 MHz 5 91 79 3 33.9
XC7A200T Cubed v1 74.25 MHz 2 140 127 2 4.6
XC7A200T Cubed v2 74.25 MHz 5 148 115 3 9.1

For our 25 MHz design, we’ve still got plenty of slack, but the additional pipeline stages improve performance at the cost of an extra DSP. However, the number of flip flops goes down in the pipelined version. The 74 MHz results mirror 25 MHz, though timing is starting to get tight for the non-pipelined design.

Quartic Polynomial

Let’s up the ante and graph a quartic polynomial: x⁴−x².

module func_polynomial #(
    CORDW=8,       // signed coordinate width (bits)
    Y_SCALE=2**24  // increase y-scale so we can see more on-screen
    ) (
    input  wire clk,
    input  wire signed [CORDW-1:0] x,
    input  wire signed [CORDW-1:0] y,
    output logic r
    );

    // // v1: simple version (latency: 2 cycles)
    // logic signed [4*CORDW-1:0] x_poly, y_scaled;
    // always_ff @(posedge clk) begin
    //     y_scaled <= Y_SCALE * y;
    //     x_poly <= x*x*x*x - 2**16 * x*x;
    //     r <= (x_poly < y_scaled) ? 1 : 0;
    // end

    // v2: extra pipeline stages (latency: 6 cycles)
    logic signed [2*CORDW-1:0] x_squared, x_squared_p1, x_squared_p2, x_squared_p3;
    logic signed [4*CORDW-1:0] x_fourth, x_fourth_p1, x_poly;
    logic signed [4*CORDW-1:0] y_scaled, y_scaled_p1, y_scaled_p2, y_scaled_p3, y_scaled_p4;
    always_ff @(posedge clk) begin
        y_scaled_p4 <= Y_SCALE * y;
        y_scaled_p3 <= y_scaled_p4;
        y_scaled_p2 <= y_scaled_p3;
        y_scaled_p1 <= y_scaled_p2;
        y_scaled    <= y_scaled_p1;

        x_squared_p3 <= x*x;
        x_squared_p2 <= x_squared_p3;
        x_squared_p1 <= x_squared_p2;
        x_squared    <= x_squared_p1;

        x_fourth_p1 <= x_squared_p2 * x_squared_p2;
        x_fourth    <= x_fourth_p1;

        x_poly <= x_fourth - 2**16 * x_squared;
        r <= (x_poly < y_scaled) ? 1 : 0;
    end
endmodule
FPGA Design Target Cycles LUT FF DSP WNS (ns)
XC7A35T Polynomial v1 25.2 MHz 2 130 112 6 20.9
XC7A35T Polynomial v2 25.2 MHz 6 150 160 3 33.1
XC7A200T Polynomial v1 74.25 MHz 2 186 149 6 - 2.7
XC7A200T Polynomial v2 74.25 MHz 6 206 198 3 7.6

This time there’s a significant difference in timing, even at 25 MHz. For 74 MHz, the complexity has caught up with us, and v1 misses timing by quite a margin. However, the pipelined version is still comfortable.

Notably, the extra pipeline stages also halve the number of DSPs required from 6 to 3. Don’t assume that adding pipeline stages results in higher DSP usage: handling too much complex logic in one cycle can be inefficient in logic as well as time.

x⁴−x²

A Balancing Act

Implementing an algorithm in hardware is always a balancing act, never more so than when using DSPs. Adding pipeline stages and registering outputs is almost always worthwhile, but it does increase latency, which is problematic for some applications. Using narrower signals saves logic but reduces accuracy or range.

I’ll finish by giving you a few tips to take away:

  • Improve timing by including at least one register between each calculation step
  • Narrower signals can save significant DSPs and logic
  • Use signed values with Xilinx DSPs
  • Use an FSM to divide a complex algorithm into multiple steps
    • We’ll see examples of using FSM with algorithms in later posts

Next Time

Stay tuned for the next part of Maths and Algorithms with FPGAs in spring 2022. In part 3, we’ll look at real numbers, including fixed point. Until then, why not check out our existing maths posts: division, square root, and sine & cosine.

Constructive feedback is always welcome. Get in touch with @WillFlux or open an issue on GitHub.

Sponsor Project F
If you like what I do, consider sponsoring me on GitHub. I use contributions to spend more time creating open-source FPGA designs and tutorials.

©2022 Will Green, Project F