1 July 2020

Division in Verilog

Division is a fundamental arithmetic operation; one we take for granted in most contexts. FPGAs are different; Verilog can’t synthesize division: we need to do it ourselves. In this how to, we’re going to look at a straightforward division algorithm for positive integers and fixed-point numbers. For integers, this method takes one cycle per bit: 32 cycles for 32-bit numbers. This post was last updated June 2021.

There are also posts on fixed-point numbers, square root, and sine & cosine.

Get in touch: GitHub Issues, 1BitSquared Discord, @WillFlux (Mastodon), @WillFlux (Twitter)

Sponsor My Work
If you like what I do, consider sponsoring me on GitHub.
I love FPGAs and want to help more people discover and use them in their projects.
My hardware designs are open source, and my blog is advert free.

Source

The SystemVerilog designs featured in this post are available from the Project F Library under the open-source MIT licence: build on them to your heart’s content. The rest of the blog content is subject to standard copyright restrictions: don’t republish it without permission.

Division Defined

Before we get to the design, it helps be familiar with some terminology.

When you divide dividend X by divisor Y you get quotient Q and remainder R:

X = Y*Q + R

Consider a trivial example: you have seven slices of apple pie to divide equally amongst three people. Each person gets two slices of pie, and there is one slice left over.

Less deliciously: given X=7 and Y=3, then Q=2 and R=1 because 7 = 3*2 + 1.

Long Division

The traditional way to divide numbers with pen and paper is long division. We move from left to right, trying to divide the shortest sequence of digits in the dividend by the divisor.

For example, let’s divide 512 by 12. We find that 12 is larger 5 (the first digit of the dividend), so we next consider 51, which 12 divides 4 times, with 3 left over. So, the first digit of our answer is 4.

Next, we take the leftover 3, and bring down the 2 from 512, to make 32. 12 fits into 32 twice, so the second digit of our answer is 2 with 8 left over. We’ve considered all the digits of the dividend, so the quotient is 42 and the remainder is 8.

This is easier to see when laid out as a calculation in columns:

    X=512  Y=12

         42    Quotient
       ————
    12 )512
        48     51: 12x4=48 + 3
        ——
         32    Take the 3 and bring down the 2 down from 512
         24    32: 12x2=24 + 8
         ——
          8    Remainder

If this doesn’t seem clear, check out the Wikipedia page on long division and try doing a few calculations yourself. Nothing beats doing hands-on examples when it comes to maths.

In Binary

For binary, we can follow the same long division process. For example, let’s divide 1110 by 11 (that’s 14 divided by 3 in decimal).

Our divisor 11 is two digits long, so we can start by considering the first two digits of the dividend, which is also 11: we record a 1 for the first digit of the quotient and move on.

The remaining digits are 10, which is smaller than 11, so we stop. We have a quotient of 100 and a remainder of 10. This is easier to see when laid out in columns:

    X=1110  Y=11

         100    Quotient
        ————
    11 )1110
        11      11x1=11 + 0
        ——
         010    Bring third 1 down from 1110
         000    11 doesn't divide 10, so it's the remainder
         ———
          10    Remainder

Doing binary division by hand is painful: each step is simple, but even moderately-sized numbers have a large number of digits, all of which are 1 and 0, so it’s easy to make a mistake. However, the very simplicity of the approach makes it straightforward to implement in Verilog.

Divide is typically slower than Multiply
Even high-end CPUs take their time with division. Intel Skylake has a latency of 42-95 cycles for signed 64-bit integer division, but only 3 cycles for multiplication. Source: agner.org.

Algorithm Implementation

We’ll take our example from above: X=1110 and Y=0011. We also need a register to accumulate (store) the intermediate calculations, and a register to record the quotient: we name them A and Q respectively.

There are four digits in the inputs, so we need four steps. For each step, we shift the left-most digit of the dividend X into A, then compare it with the divisor Y. If A is greater or equal to Y, then we subtract Y from A and add 1 to the quotient Q.

This is easiest to see by working through the example:

Inputs: X=1110  Y=0011

Step    A       X       Q       Description
——————————————————————————————————————————————————————————————————
        0000    1110    0000    Starting values.
1       0001    1100    0000    Left shift X into A. Left shift Q.
                                Is A≥Y? No: next digit...
2       0011    1000    0000    Left shift X into A. Left shift Q.
                                Is A≥Y? Yes: update quotient...
        0000    1000    0001    Subtract Y from A. Set Q[0]=1.
3       0001    0000    0010    Left shift X into A. Left shift Q.
                                Is A≥Y? No: next digit...
4       0010    0000    0100    Left shift X into A. Left shift Q.
                                Is A≥Y? No. Done.
——————————————————————————————————————————————————————————————————

Outputs: Q=0100  R=010

With this algorithm, the divisor Y can only ever fit into the accumulator A once at most, which is why we can simply subtract Y from A when A≥Y.

We never use the same digit in X and Q at the same time, so it’s possible to combine those registers together. As a digit of the dividend is shifted out, a digit of the quotient is shifted in. We’ll do this in our Verilog module.

Verilog Module

Our Verilog design uses the above algorithm, but adds a check for divide by zero and allows the width of the numbers to be configured.

Project F Library source: [div_int.sv].

module div_int #(parameter WIDTH=4) (
    input wire logic clk,
    input wire logic start,          // start signal
    output     logic busy,           // calculation in progress
    output     logic valid,          // quotient and remainder are valid
    output     logic dbz,            // divide by zero flag
    input wire logic [WIDTH-1:0] x,  // dividend
    input wire logic [WIDTH-1:0] y,  // divisor
    output     logic [WIDTH-1:0] q,  // quotient
    output     logic [WIDTH-1:0] r   // remainder
    );

    logic [WIDTH-1:0] y1;            // copy of divisor
    logic [WIDTH-1:0] q1, q1_next;   // intermediate quotient
    logic [WIDTH:0] ac, ac_next;     // accumulator (1 bit wider)
    logic [$clog2(WIDTH)-1:0] i;     // iteration counter

    always_comb begin
        if (ac >= {1'b0,y1}) begin
            ac_next = ac - y1;
            {ac_next, q1_next} = {ac_next[WIDTH-1:0], q1, 1'b1};
        end else begin
            {ac_next, q1_next} = {ac, q1} << 1;
        end
    end

    always_ff @(posedge clk) begin
        if (start) begin
            valid <= 0;
            i <= 0;
            if (y == 0) begin  // catch divide by zero
                busy <= 0;
                dbz <= 1;
            end else begin  // initialize values
                busy <= 1;
                dbz <= 0;
                y1 <= y;
                {ac, q1} <= {{WIDTH{1'b0}}, x, 1'b0};
            end
        end else if (busy) begin
            if (i == WIDTH-1) begin  // we're done
                busy <= 0;
                valid <= 1;
                q <= q1_next;
                r <= ac_next[WIDTH:1];  // undo final shift
            end else begin  // next iteration
                i <= i + 1;
                ac <= ac_next;
                q1 <= q1_next;
            end
        end
    end
endmodule

ProTip: I’ve used SystemVerilog, but you can easily adapt logic, $clog2, and always* to Verilog.

To use the module, set WIDTH to the correct number of bits and the inputs x and y to dividend and divisor respectively. To begin the calculation set start high for one clock. The valid signal indicates when the output data is valid; you can then read the results from q and r. If you divide by zero, then valid, q, and r will be zero, and the dbz flag signal will be high. The busy signal is high during calculation.

The Verilog itself is straightforward. The algorithm is in the always_comb block, though we start with the initial shift already in place. The always_ff block tests for division by zero, sets up the initial values, then runs the algorithm for the same number of iterations as the width of the numbers.

Accumulator Width
The accumulator needs to be 1 bit wider than the dividend because the remainder comes from unshifting the final ac_next. For example, when we divide eight by nine, the remainder should be eight 4'b1000, but without the wider accumulator the left-most digit would be lost, and the remainder would appear to be 4'b0000.

Testing

The following Vivado test bench exercises the module, including dividing by zero.

Project F Library source: [xc7/div_int_tb.sv].

module div_int_tb();

    parameter CLK_PERIOD = 10;  // 10 ns == 100 MHz
    parameter WIDTH = 4;

    logic clk;
    logic start;            // start signal
    logic busy;             // calculation in progress
    logic valid;            // quotient and remainder are valid
    logic dbz;              // divide by zero flag
    logic [WIDTH-1:0] x;    // dividend
    logic [WIDTH-1:0] y;    // divisor
    logic [WIDTH-1:0] q;    // quotient
    logic [WIDTH-1:0] r;    // remainder

    div_int #(.WIDTH(WIDTH)) div_int_inst (.*);

    always #(CLK_PERIOD / 2) clk = ~clk;

    initial begin
        $monitor("\t%d:\t%d /%d =%d (r =%d) (V=%b) (DBZ=%b)",
            $time, x, y, q, r, valid, dbz);
    end

    initial begin
        clk = 1;

        #100    x = 4'b0000;  // 0
                y = 4'b0010;  // 2
                start = 1;
        #10     start = 0;

        #50     x = 4'b0010;  // 2
                y = 4'b0000;  // 0
                start = 1;
        #10     start = 0;

        #50     x = 4'b0111;  // 7
                y = 4'b0010;  // 2
                start = 1;
        #10     start = 0;

        #50     x = 4'b1111;  // 15
                y = 4'b0101;  //  5
                start = 1;
        #10     start = 0;

        #50     x = 4'b0001;  // 1
                y = 4'b0001;  // 1
                start = 1;
        #10     start = 0;

        #50     x = 4'b1000;  // 8
                y = 4'b1001;  // 9
                start = 1;
        #10     start = 0;

        // ...

        #50     $finish;
    end
endmodule

The output looks like this:

  0:     x / x = x (r = x) (V=x) (DBZ=x)
100:     0 / 2 = x (r = x) (V=0) (DBZ=0)
140:     0 / 2 = 0 (r = 0) (V=1) (DBZ=0)
160:     2 / 0 = 0 (r = 0) (V=0) (DBZ=1)
220:     7 / 2 = 0 (r = 0) (V=0) (DBZ=0)
260:     7 / 2 = 3 (r = 1) (V=1) (DBZ=0)
280:    15 / 5 = 3 (r = 1) (V=0) (DBZ=0)
320:    15 / 5 = 3 (r = 0) (V=1) (DBZ=0)
340:     1 / 1 = 3 (r = 0) (V=0) (DBZ=0)
380:     1 / 1 = 1 (r = 0) (V=1) (DBZ=0)
400:     8 / 9 = 1 (r = 0) (V=0) (DBZ=0)
440:     8 / 9 = 0 (r = 8) (V=1) (DBZ=0)

NB. I have tested with a wider range of values, but only show a few here for brevity.

Fixed-Point Support

In a previous recipe, we looked at Fixed Point Numbers in Verilog, but didn’t cover division, so let’s do that now. We have two changes to make:

  1. Increasing the number of iterations
  2. Handling overflow

To account for the fractional part of the number, we need to increase the number of iterations by the number of fractional bits.

More explanation to follow…

Project F Library source: [div.sv].

module div #(
    parameter WIDTH=4,  // width of numbers in bits
    parameter FBITS=0   // fractional bits (for fixed point)
    ) (
    input wire logic clk,
    input wire logic start,          // start signal
    output     logic busy,           // calculation in progress
    output     logic valid,          // quotient and remainder are valid
    output     logic dbz,            // divide by zero flag
    output     logic ovf,            // overflow flag (fixed-point)
    input wire logic [WIDTH-1:0] x,  // dividend
    input wire logic [WIDTH-1:0] y,  // divisor
    output     logic [WIDTH-1:0] q,  // quotient
    output     logic [WIDTH-1:0] r   // remainder
    );

    // avoid negative vector width when fractional bits are not used
    localparam FBITSW = (FBITS) ? FBITS : 1;

    logic [WIDTH-1:0] y1;           // copy of divisor
    logic [WIDTH-1:0] q1, q1_next;  // intermediate quotient
    logic [WIDTH:0] ac, ac_next;    // accumulator (1 bit wider)

    localparam ITER = WIDTH+FBITS;  // iterations are dividend width + fractional bits
    logic [$clog2(ITER)-1:0] i;     // iteration counter

    always_comb begin
        if (ac >= {1'b0,y1}) begin
            ac_next = ac - y1;
            {ac_next, q1_next} = {ac_next[WIDTH-1:0], q1, 1'b1};
        end else begin
            {ac_next, q1_next} = {ac, q1} << 1;
        end
    end

    always_ff @(posedge clk) begin
        if (start) begin
            valid <= 0;
            ovf <= 0;
            i <= 0;
            if (y == 0) begin  // catch divide by zero
                busy <= 0;
                dbz <= 1;
            end else begin
                busy <= 1;
                dbz <= 0;
                y1 <= y;
                {ac, q1} <= {{WIDTH{1'b0}}, x, 1'b0};
            end
        end else if (busy) begin
            if (i == ITER-1) begin  // done
                busy <= 0;
                valid <= 1;
                q <= q1_next;
                r <= ac_next[WIDTH:1];  // undo final shift
            end else if (i == WIDTH-1 && q1_next[WIDTH-1:WIDTH-FBITSW]) begin // overflow?
                busy <= 0;
                ovf <= 1;
                q <= 0;
                r <= 0;
            end else begin  // next iteration
                i <= i + 1;
                ac <= ac_next;
                q1 <= q1_next;
            end
        end
    end
endmodule

A new Vivado test bench can exercise the module (a sample of tests shown for brevity).

Project F Library source: [xc7/div_tb.sv].

module div_tb();

    parameter CLK_PERIOD = 10;  // 10 ns == 100 MHz
    parameter WIDTH = 8;
    parameter FBITS = 4;
    parameter SF = 2.0**-4.0;  // Q4.4 scaling factor is 2^-4

    logic clk;
    logic start;            // start signal
    logic busy;             // calculation in progress
    logic valid;            // quotient and remainder are valid
    logic dbz;              // divide by zero flag
    logic ovf;              // overflow flag (fixed-point only)
    logic [WIDTH-1:0] x;    // dividend
    logic [WIDTH-1:0] y;    // divisor
    logic [WIDTH-1:0] q;    // quotient
    logic [WIDTH-1:0] r;    // remainder

    div #(.WIDTH(WIDTH), .FBITS(FBITS)) div_inst (.*);

    always #(CLK_PERIOD / 2) clk = ~clk;

    initial begin
        $monitor("\t%d:\t%f / %f = %b (%f) (r = %b) (V=%b) (DBZ=%b) (OVF=%b)",
            $time, x*SF, y*SF, q, q*SF, r, valid, dbz, ovf);
    end

    initial begin
                clk = 1;

        #100    x = 8'b0011_0000;  // 3.0
                y = 8'b0010_0000;  // 2.0
                start = 1;
        #10     start = 0;

        #120    x = 8'b0010_0000;  // 2.0
                y = 8'b0001_0110;  // 1.375 (the largest number that's ≤√2 in Q4.4)
                start = 1;
        #10     start = 0;

        #120    x = 8'b0010_0000;  // 2.0
                y = 8'b0000_0000;  // 0.0
                start = 1;
        #10     start = 0;

        #120    x = 8'b0000_0000;  // 0.0
                y = 8'b0010_0000;  // 2.0
                start = 1;
        #10     start = 0;

        #120    x = 8'b0000_0010;  // 0.125
                y = 8'b0010_0000;  // 2.0
                start = 1;
        #10     start = 0;

        #120    x = 8'b1000_0000;  // 8.0
                y = 8'b0000_0100;  // 0.25
                start = 1;
        #10     start = 0;

        #120    x = 8'b1111_1110;  // 15.875
                y = 8'b0010_0000;  // 2.0
                start = 1;
        #10     start = 0;

        #120    x = 8'b1000_0000;  // 8.0
                y = 8'b1001_0000;  // 9.0
                start = 1;
        #10     start = 0;

        // ...
    end
endmodule

The output looks like this:

   0:    0.000000 /  0.000000 = xxxxxxxx  (0.000000) (r = xxxxxxxx) (V=x) (DBZ=x) (OVF=x)
 100:    3.000000 /  2.000000 = xxxxxxxx  (0.000000) (r = xxxxxxxx) (V=0) (DBZ=0) (OVF=0)
 220:    3.000000 /  2.000000 = 00011000  (1.500000) (r = 00000000) (V=1) (DBZ=0) (OVF=0)
 230:    2.000000 /  1.375000 = 00011000  (1.500000) (r = 00000000) (V=0) (DBZ=0) (OVF=0)
 350:    2.000000 /  1.375000 = 00010111  (1.437500) (r = 00000110) (V=1) (DBZ=0) (OVF=0)
 360:    2.000000 /  0.000000 = 00010111  (1.437500) (r = 00000110) (V=0) (DBZ=1) (OVF=0)
 490:    0.000000 /  2.000000 = 00010111  (1.437500) (r = 00000110) (V=0) (DBZ=0) (OVF=0)
 610:    0.000000 /  2.000000 = 00000000  (0.000000) (r = 00000000) (V=1) (DBZ=0) (OVF=0)
 620:    0.125000 /  2.000000 = 00000000  (0.000000) (r = 00000000) (V=0) (DBZ=0) (OVF=0)
 740:    0.125000 /  2.000000 = 00000001  (0.062500) (r = 00000000) (V=1) (DBZ=0) (OVF=0)
 750:    8.000000 /  0.250000 = 00000001  (0.062500) (r = 00000000) (V=0) (DBZ=0) (OVF=0)
 830:    8.000000 /  0.250000 = 00000000  (0.000000) (r = 00000000) (V=0) (DBZ=0) (OVF=1)
 880:   15.875000 /  2.000000 = 00000000  (0.000000) (r = 00000000) (V=0) (DBZ=0) (OVF=0)
1000:   15.875000 /  2.000000 = 01111111  (7.937500) (r = 00000000) (V=1) (DBZ=0) (OVF=0)
1010:    8.000000 /  9.000000 = 01111111  (7.937500) (r = 00000000) (V=0) (DBZ=0) (OVF=0)
1130:    8.000000 /  9.000000 = 00001110  (0.875000) (r = 00100000) (V=1) (DBZ=0) (OVF=0)

Try experimenting with a wider range of test values and different fractional widths.

Signed Numbers

I plan to add support for signed numbers during 2022. In the meantime you might like to check out Square Root or Sine & Cosine in Verilog.

Get in touch: GitHub Issues, 1BitSquared Discord, @WillFlux (Mastodon), @WillFlux (Twitter)

©2022 Will Green, Project F