1 July 2020

Division in Verilog

Division is a fundamental arithmetic operation; one we take for granted in most contexts. FPGAs are different; Verilog can’t synthesize division: we need to do it ourselves. In this FPGA recipe, we’re going to look at a simple division algorithm for positive integers and fixed-point numbers. This method takes one cycle per bit: 32 cycles for 32-bit numbers.

Revised 2020-07-28. Feedback to @WillFlux is most welcome.

Division Defined

Before we get to the design, it helps be familiar with some terminology.

When you divide dividend X by divisor Y you get quotient Q and remainder R:

X = Y*Q + R

Consider a trivial example: you have seven slices of apple pie to divide equally amongst three people. Each person gets two slices of pie, and there is one slice left over.

Less deliciously: given X=7 and Y=3, then Q=2 and R=1 because 7 = 3*2 + 1.

Long Division

The traditional way to divide numbers with pen and paper is long division. We move from left to right, trying to divide the shortest sequence of digits in the dividend by the divisor.

For example, let’s divide 512 by 12. We find that 12 is larger 5 (the first digit of the dividend), so we next consider 51, which 12 divides 4 times, with 3 left over. So, the first digit of our answer is 4.

Next, we take the leftover 3, and bring down the 2 from 512, to make 32. 12 fits into 32 twice, so the second digit of our answer is 2 with 8 left over. We’ve considered all the digits of the dividend, so the quotient is 42 and the remainder is 8.

This is easier to see when laid out as a calculation in columns:

    X=512  Y=12

         42    Quotient
       ————
    12 )512
        48     51: 12x4=48 + 3
        ——
         32    Take the 3 and bring down the 2 down from 512
         24    32: 12x2=24 + 8
         ——
          8    Remainder

If this doesn’t seem clear, check out the Wikipedia page on long division and try doing a few calculations yourself. Nothing beats doing hands-on examples when it comes to maths.

In Binary

For binary, we can follow the same long division process. For example, let’s divide 1110 by 11 (that’s 14 divided by 3 in decimal).

Our divisor 11 is two digits long, so we can start by considering the first two digits of the dividend, which is also 11: we record a 1 for the first digit of the quotient and move on.

The remaining digits are 10, which is smaller than 11, so we stop. We have a quotient of 100 and a remainder of 10. This is easier to see when laid out in columns:

    X=1110  Y=11

         100    Quotient
        ————
    11 )1110
        11      11x1=11 + 0
        ——
         010    Bring third 1 down from 1110
         000    11 doesn't divide 10, so it's the remainder
         ———
          10    Remainder

Doing binary division by hand is painful: each step is simple, but even moderately-sized numbers have a large number of digits, all of which are 1 and 0, so it’s easy to make a mistake. However, the very simplicity of the approach makes it straightforward to implement in Verilog.

Quick Aside: Divide is typically much slower than Multiply
Even high-end CPUs take their time with division. Intel Skylake has a latency of 42-95 cycles for signed 64-bit integer division, but only 3 cycles for multiplication. Source: agner.org.

Algorithm Implementation

We’ll take our example from above: X=1110 and Y=0011. We also need a register to accumulate (store) the intermediate calculations, and a register to record the quotient: we name them A and Q respectively.

There are four digits in the inputs, so we need four steps. For each step, we shift the left-most digit of the dividend X into A, then compare it with the divisor Y. If A is greater or equal to Y, then we subtract Y from A and add 1 to the quotient Q.

This is easiest to see by working through the example:

Inputs: X=1110  Y=0011

Step    A       X       Q       Description
——————————————————————————————————————————————————————————————————
        0000    1110    0000    Starting values.
1       0001    1100    0000    Left shift X into A. Left shift Q.
                                Is A≥Y? No: next digit...
2       0011    1000    0000    Left shift X into A. Left shift Q.
                                Is A≥Y? Yes: update quotient...
        0000    1000    0001    Subtract Y from A. Set Q[0]=1.
3       0001    0000    0010    Left shift X into A. Left shift Q.
                                Is A≥Y? No: next digit...
4       0010    0000    0100    Left shift X into A. Left shift Q.
                                Is A≥Y? No. Done.
——————————————————————————————————————————————————————————————————

Outputs: Q=0100  R=010

With this algorithm, the divisor Y can only ever fit into the accumulator A once at most, which is why we can simply subtract Y from A when A≥Y.

We never use the same digit in X and Q at the same time, so it’s possible to combine those registers together. As a digit of the dividend is shifted out, a digit of the quotient is shifted in. We’ll do this in our Verilog module.

Verilog Module

Our Verilog design uses the above algorithm, but adds a check for divide by zero and allows the width of the numbers to be configured.

// Project F: Division
// (C)2020 Will Green, Open source hardware released under the MIT License

module div #(parameter WIDTH=4)
    (
    input wire logic clk,
    input wire logic start,          // start signal
    input wire logic [WIDTH-1:0] x,  // dividend
    input wire logic [WIDTH-1:0] y,  // divisor
    output     logic [WIDTH-1:0] q,  // quotient
    output     logic [WIDTH-1:0] r,  // remainder
    output     logic done,           // done signal
    output     logic dbz             // divide by zero flag
    );

    logic [WIDTH-1:0] y1;            // copy of divisor
    logic [WIDTH-1:0] q1, q1_next;   // intermediate quotient
    logic [WIDTH-1:0] ac, ac_next;   // accumulator
    logic [$clog2(WIDTH+1)-1:0] i;   // dividend bit counter

    always_comb begin
        if (ac >= y1) begin
            ac_next = ac - y1;
            {ac_next, q1_next} = {ac_next[WIDTH-2:0], q1, 1'b1};
        end else begin
            {ac_next, q1_next} = {ac, q1} << 1;
        end
    end

    always_ff @(posedge clk) begin
        if (start) begin
            if (y == 0) begin  // catch divide by zero
                dbz <= 1;
                done <= 1;
                q <= 0;
                r <= 0;
            end else begin  // initialize values
                dbz <= 0;
                done <= 0;
                i <= 0;
                y1 <= y;
                {ac, q1} <= {{WIDTH-1{1'b0}}, x, 1'b0};
            end
        end else if (!done) begin
            if (i == WIDTH-1) begin  // we're done
                done <= 1;
                q <= q1_next;
                r <= ac_next >> 1;  // undo final shift
            end else begin  // next iteration
                i <= i + 1;
                ac <= ac_next;
                q1 <= q1_next;
            end
        end
    end
endmodule

ProTip: I’ve used SystemVerilog, but you can easily adapt logic, $clog2, and always* to Verilog.

To use the module, set WIDTH to the correct number of bits and the inputs x and y to dividend and divisor respectively. To begin the calculation set start high for one clock. The done signal indicates when the calculation is complete, then you can read the result out of q and r. If you divide by zero, then q and r will be zero, and the dbz flag signal will be high.

The Verilog itself is straightforward. The algorithm is in the always_comb block, though we start with the initial shift already in place. The always_ff block tests for division by zero, sets up the initial values, then runs the algorithm for the same number of steps as the width of the numbers.

Testing

The following test bench exercises the module, including dividing by zero.

// Project F: Division Test Bench
// (C)2020 Will Green, Open source hardware released under the MIT License

module div_tb();

    parameter CLK_PERIOD = 10;
    parameter WIDTH = 4;

    logic clk;
    logic start;            // start signal
    logic [WIDTH-1:0] x;    // dividend
    logic [WIDTH-1:0] y;    // divisor
    logic [WIDTH-1:0] q;    // quotient
    logic [WIDTH-1:0] r;    // remainder
    logic done;             // done signal
    logic dbz;              // divide by zero flag

    div #(.WIDTH(WIDTH)) div_inst (.*);

    always #(CLK_PERIOD / 2) clk = ~clk;

    initial begin
                clk = 1;

        #100    x = 4'b0111;
                y = 4'b0010;
                start = 1;
        #10     start = 0;
        #40     $display("\t%d:\t%d /%d =%d (r =%d) (DBZ=%b)", $time, x, y, q, r, dbz);
                x = 4'b0001;
                y = 4'b0001;
                start = 1;
        #10     start = 0;
        #40     $display("\t%d:\t%d /%d =%d (r =%d) (DBZ=%b)", $time, x, y, q, r, dbz);
                x = 4'b0010;
                y = 4'b0010;
                start = 1;
        #10     start = 0;
        #40     $display("\t%d:\t%d /%d =%d (r =%d) (DBZ=%b)", $time, x, y, q, r, dbz);
                x = 4'b0010;
                y = 4'b0001;
                start = 1;
        #10     start = 0;
        #40     $display("\t%d:\t%d /%d =%d (r =%d) (DBZ=%b)", $time, x, y, q, r, dbz);
                x = 4'b0000;
                y = 4'b0010;
                start = 1;
        #10     start = 0;
        #40     $display("\t%d:\t%d /%d =%d (r =%d) (DBZ=%b)", $time, x, y, q, r, dbz);
                x = 4'b0010;
                y = 4'b0000;
                start = 1;
        #10     start = 0;
        #40     $display("\t%d:\t%d /%d =%d (r =%d) (DBZ=%b)", $time, x, y, q, r, dbz);
                x = 4'b0011;
                y = 4'b0010;
                start = 1;
        #10     start = 0;
        #40     $display("\t%d:\t%d /%d =%d (r =%d) (DBZ=%b)", $time, x, y, q, r, dbz);
                x = 4'b1111;
                y = 4'b0101;
                start = 1;
        #10     start = 0;
        #40     $display("\t%d:\t%d /%d =%d (r =%d) (DBZ=%b)", $time, x, y, q, r, dbz);
                x = 4'b1111;
                y = 4'b0010;
                start = 1;
        #10     start = 0;
        #40     $display("\t%d:\t%d /%d =%d (r =%d) (DBZ=%b)", $time, x, y, q, r, dbz);
                x = 4'b0001;
                y = 4'b1111;
                start = 1;
        #10     start = 0;
        #40     $display("\t%d:\t%d /%d =%d (r =%d) (DBZ=%b)", $time, x, y, q, r, dbz);
                x = 4'b0010;
                y = 4'b0100;
                start = 1;
        #10     start = 0;
        #40     $display("\t%d:\t%d /%d =%d (r =%d) (DBZ=%b)", $time, x, y, q, r, dbz);
                x = 4'b1101;
                y = 4'b0111;
                start = 1;
        #10     start = 0;
        #40     $display("\t%d:\t%d /%d =%d (r =%d) (DBZ=%b)", $time, x, y, q, r, dbz);
                $finish;
    end
endmodule

Fixed Point Support

In a previous recipe, we looked at Fixed Point Numbers in Verilog, but didn’t cover division, so let’s do that now. We have two changes to make:

  1. Scaling the quotient
  2. Handling overflow

NB. We don’t support signed values with this simple division algorithm.

If we use fixed-point numbers with the original module (above), the resulting quotient is not scaled correctly. For example, if we divide 8 by 4 in Q4.4 format, the result is 0.25, rather than the customary 2. To get the correct result, we shift the quotient left by the number of bits in the fractional part of the number, in this case, 4 bits:

   1000.0000        8.0000
/  0100.0000      / 4.0000
=  0000.0010      = 0.2500 (unscaled)
<< 0100           << 4 (scale)
   0010.0000      = 2.0000

Because of quotient scaling, it’s possible for overflow to occur. Consider dividing 8 by 0.25 using Q4.4 format. After scaling, we end up with the result 0 rather than the expected 32.

   1000.0000         8.0000
/  0000.0010      /  0.2500
=  0010.0000      =  4 (unscaled)
<< 0100           << 4 (scale)
   0000.0000      =  0 ?!

We could ignore this and make the module user responsible for overflow, but it’s much easier for the module to do this before the scaling occurs. To indicate overflow, we set the overflow flag ovf.

With these changes, the module looks like this:

// Project F: Division (with fixed-point support)
// (C)2020 Will Green, Open source hardware released under the MIT License

module div #(
    parameter WIDTH=4,  // width of number in bits
    parameter FBITS=0   // fractional bits
    ) (
    input wire logic clk,
    input wire logic start,          // start signal
    input wire logic [WIDTH-1:0] x,  // dividend
    input wire logic [WIDTH-1:0] y,  // divisor
    output     logic [WIDTH-1:0] q,  // quotient
    output     logic [WIDTH-1:0] r,  // remainder
    output     logic done,           // done signal
    output     logic dbz,            // divide by zero flag
    output     logic ovf             // overflow flag (fixed-point)
    );

    // avoid negative vector width when fractional bits are not used
    localparam FBITSW = (FBITS) ? FBITS : 1;

    logic [WIDTH-1:0] y1;            // copy of divisor
    logic [WIDTH-1:0] q1, q1_next;   // intermediate quotient
    logic [WIDTH-1:0] ac, ac_next;   // accumulator
    logic [$clog2(WIDTH+1)-1:0] i;   // dividend bit counter

    always_comb begin
        if (ac >= y1) begin
            ac_next = ac - y1;
            {ac_next, q1_next} = {ac_next[WIDTH-2:0], q1, 1'b1};
        end else begin
            {ac_next, q1_next} = {ac, q1} << 1;
        end
    end

    always_ff @(posedge clk) begin
        if (start) begin
            ovf <= 0;
            q <= 0;
            r <= 0;
            i <= 0;
            if (y == 0) begin  // catch divide by zero
                dbz <= 1;
                done <= 1;
            end else begin
                dbz <= 0;
                done <= 0;
                y1 <= y;
                {ac, q1} <= {{WIDTH-1{1'b0}}, x, 1'b0};
            end
        end else if (!done) begin
            if (i == WIDTH-1) begin  // done
                done <= 1;
                if (FBITS != 0 && q1_next[WIDTH-1:WIDTH-FBITSW]) begin  // catch overflow
                    ovf <= 1;
                    q <= 0;
                    r <= 0;
                end else begin
                    q <= q1_next << FBITS;  // fixed-point correction
                    r <= ac_next >> 1;      // undo final shift
                end
            end else begin  // next iteration
                i <= i + 1;
                ac <= ac_next;
                q1 <= q1_next;
            end
        end
    end
endmodule

Updating the test bench to support fixed point is left as an exercise for the reader. Don’t forget to add a definition of the overflow flag ovf to your test bench.

Note that the fixed-point version still produces an integer answer and remainder, for example:

6.500000 / 0.562500 = 11.000000 (r = 0.312500)

Future Features

The following features may be added in future, time permitting:

  1. Non-integer results for fixed-point numbers
  2. Signed numbers
  3. Formal verification

©2020 Will Green, Project F