Project F

Division in Verilog

Published · Updated

Division is a fundamental arithmetic operation we take for granted. FPGAs include dedicated hardware to perform addition, subtraction, and multiplication and will infer the necessary logic. Division is different: we need to do it ourselves. This post looks at a straightforward division algorithm for positive integers before extending it to cover fixed-point numbers and signed numbers.

New to Verilog maths? Check out my introduction to Numbers in Verilog.

Series Outline

Division Defined

Before we get to the design, it helps be familiar with some terminology.

When you divide dividend A by divisor B you get quotient Q and remainder R:

A = B*Q + R

Consider a trivial example: you have seven slices of apple pie to divide equally amongst three people. Each person gets two slices of pie, and one slice remains.

Less deliciously: given A=7 and B=3, then Q=2 and R=1 because 7 = 3*2 + 1.

Long Division

The traditional way to divide numbers with pen and paper is long division. We move from left to right, trying to divide the shortest sequence of digits in the dividend by the divisor.

For example, let’s divide 512 by 12. We find that 12 is larger 5 (the first digit of the dividend), so we next consider 51, which 12 divides 4 times, with 3 left over. So, the first digit of our answer is 4.

Next, we take the leftover 3 and bring down the 2 from 512 to make 32. 12 fits into 32 twice, so the second digit of our answer is 2 with 8 left over. We’ve considered all the digits of the dividend, so the quotient is 42 and the remainder is 8.

This is easier to see when laid out as a calculation in columns:

    A=512  B=12

         42    Quotient
       ————
    12 )512
        48     51: 12x4=48 + 3
        ——
         32    Take the 3 and bring down the 2 down from 512
         24    32: 12x2=24 + 8
         ——
          8    Remainder

If this doesn’t seem clear, check out the Wikipedia page on long division and try doing a few calculations yourself. Nothing beats doing hands-on examples when it comes to maths. On the other hand, you don’t need to be able to do long division to use these designs, so feel free to move on. :)

In Binary

For binary, we can follow the same long division process. For example, let’s divide 1110 by 11 (in decimal: 14 divided by 3).

Our divisor 11 is two digits long, so we can start by considering the first two digits of the dividend, which is also 11: we record a 1 for the first digit of the quotient and move on.

The remaining digits are 10, which is smaller than 11, so we stop. We have a quotient of 100 and a remainder of 10. This is easier to see when laid out in columns:

    A=1110  B=11

         100    Quotient
        ————
    11 )1110
        11      11x1=11 + 0
        ——
         010    Bring the third 1 down from 1110
         000    11 doesn't divide 10, so it's the remainder
         ———
          10    Remainder

Doing binary division by hand is painful: each step is simple, but even moderately-sized numbers have many digits, all of which are 1 and 0, so it’s easy to make a mistake. However, the very simplicity of the approach makes it straightforward to implement in Verilog.

Slow Divide
Even contemporary CPUs take their time with division. Intel Skylake has a latency of 42-95 cycles for signed 64-bit integer division but only 3 cycles for multiplication. Source: agner.org.

Algorithm Implementation

We’ll take our example from above: A=1110 and B=0011. We also need a register to accumulate (store) the intermediate calculations and a register to record the quotient: we name them ACC and QUO, respectively.

There are four digits in the inputs, so we need four steps. For each step, we shift the left-most digit of the dividend A into ACC, then compare it with the divisor B. If ACC is greater or equal to B, then we subtract B from ACC and add 1 to the quotient QUO.

This is easiest to see by working through the example:

Inputs: A=1110  B=0011

Step    ACC     A       QUO     Description
——————————————————————————————————————————————————————————————————————
        0000    1110    0000    Starting values.
1       0001    1100    0000    Left shift A into ACC. Left shift QUO.
                                Is ACC≥B? No. Next digit...
2       0011    1000    0000    Left shift A into ACC. Left shift QUO.
                                Is ACC≥B? Yes. Update quotient...
        0000    1000    0001    Subtract B from ACC. Set QUO[0]=1.
3       0001    0000    0010    Left shift A into ACC. Left shift QUO.
                                Is ACC≥B? No. Next digit...
4       0010    0000    0100    Left shift A into ACC. Left shift QUO.
                                Is ACC≥B? No. Done.
——————————————————————————————————————————————————————————————————————

The resulting quotient is 0100, and the remainder is 0010.

With this algorithm, the divisor B can only ever fit into the accumulator ACC once at most, which is why we can simply subtract B from ACC when ACC≥B.

We never use the same digit in A and QUO simultaneously, so it’s possible to combine those registers. As a digit of the dividend is shifted out, a digit of the quotient is shifted in. We’ll do this in our Verilog module.

Verilog Module

Our first Verilog design uses the above algorithm but adds a check for dividing by zero and allows the width of the numbers to be configured. This method takes one cycle per bit: 32 cycles for 32-bit numbers.

Unsigned integer division - divu_int.sv:

module divu_int #(parameter WIDTH=5) ( // width of numbers in bits
    input wire logic clk,              // clock
    input wire logic rst,              // reset
    input wire logic start,            // start calculation
    output     logic busy,             // calculation in progress
    output     logic done,             // calculation is complete (high for one tick)
    output     logic valid,            // result is valid
    output     logic dbz,              // divide by zero
    input wire logic [WIDTH-1:0] a,    // dividend (numerator)
    input wire logic [WIDTH-1:0] b,    // divisor (denominator)
    output     logic [WIDTH-1:0] val,  // result value: quotient
    output     logic [WIDTH-1:0] rem   // result: remainder
    );

    logic [WIDTH-1:0] b1;             // copy of divisor
    logic [WIDTH-1:0] quo, quo_next;  // intermediate quotient
    logic [WIDTH:0] acc, acc_next;    // accumulator (1 bit wider)
    logic [$clog2(WIDTH)-1:0] i;      // iteration counter

    // division algorithm iteration
    always_comb begin
        if (acc >= {1'b0, b1}) begin
            acc_next = acc - b1;
            {acc_next, quo_next} = {acc_next[WIDTH-1:0], quo, 1'b1};
        end else begin
            {acc_next, quo_next} = {acc, quo} << 1;
        end
    end

    // calculation control
    always_ff @(posedge clk) begin
        done <= 0;
        if (start) begin
            valid <= 0;
            i <= 0;
            if (b == 0) begin  // catch divide by zero
                busy <= 0;
                done <= 1;
                dbz <= 1;
            end else begin
                busy <= 1;
                dbz <= 0;
                b1 <= b;
                {acc, quo} <= {{WIDTH{1'b0}}, a, 1'b0};  // initialize calculation
            end
        end else if (busy) begin
            if (i == WIDTH-1) begin  // we're done
                busy <= 0;
                done <= 1;
                valid <= 1;
                val <= quo_next;
                rem <= acc_next[WIDTH:1];  // undo final shift
            end else begin  // next iteration
                i <= i + 1;
                acc <= acc_next;
                quo <= quo_next;
            end
        end
        if (rst) begin
            busy <= 0;
            done <= 0;
            valid <= 0;
            dbz <= 0;
            val <= 0;
            rem <= 0;
        end
    end
endmodule

To use the module, set WIDTH to the correct number of bits and the inputs a and b to dividend and divisor, respectively. To begin the calculation set start high for one clock.

The valid signal indicates when the output data is valid; you can then read the results from val and r. If you divide by zero, then valid will be zero, and the dbz flag signal will be high. The busy signal is high during calculation.

The Verilog itself is straightforward. The algorithm iteration is in the always_comb block. The always_ff block tests for division by zero, sets up the initial values, and then runs the algorithm for the same number of iterations as the width of the numbers.

Division might seem slow at one cycle per bit, but it uses little logic, so you can quickly improve throughput by adding additional instances.

Accumulator Width
The accumulator needs to be 1 bit wider than the dividend because the remainder comes from unshifting the final acc_next. For example, if we divide eight by nine using four-bit numbers, the remainder should be eight 4'b1000, but without the wider accumulator, the left-most digit would be lost, and the remainder would appear to be 4'b0000.

Testing Division

For 2023, I’ve started testing my Verilog library modules with cocotb and Icarus Verilog. You can find tests for maths lib modules in lib/maths/test. I plan to write up my experience with cocotb, but for now, you can run these tests by installing:

  • cocotb - Test bench tool for Verilog and VHDL written in Python
  • Icarus Verilog - Verilog simulation tool
  • spfpm - fixed-point Python module
  • pytest - Python test framework (optional)
  • GTKWave - waveform viewer (optional)

On Debian & Ubuntu:

apt install make python3 python3-pip iverilog gtkwave
pip install cocotb pytest spfpm

On macOS with brew:

brew install python icarus-verilog gtkwave
pip3 install cocotb pytest spfpm

Once you have the tools installed, you can run tests with the included Makefile:

cd lib/maths/test
make divu_int

Example output for 97/13 (test 8 of 14):

   111.01ns INFO     cocotb.regression      running rem_3 (8/14)
                                              Test 97/13
   126.01ns INFO     cocotb.divu_int        dut a:     01100001
   126.01ns INFO     cocotb.divu_int        dut b:     00001101
   126.01ns INFO     cocotb.divu_int        dut val:   00000111
   126.01ns INFO     cocotb.divu_int        dut rem:   00000110
   126.01ns INFO     cocotb.divu_int        model val: 00000111
   126.01ns INFO     cocotb.divu_int        model rem: 00000110
   127.01ns INFO     cocotb.regression      rem_3 passed

‘dut’ is from the device under test (our Verilog module). ‘model’ is from Python.

Our test bench exercises our maths by comparing it with the same calculation performed in Python. We use spfpm to convert the Python result into fixed-point with the correct precision (eight bits for this example). The test bench also tests for the correct operation of flags, such as done and dbz (divide by zero).

It’s straightforward to add and edit tests in the Python module test/divu_int.py.

Fixed-Point Support

In a previous part, we looked at Fixed Point Numbers in Verilog, but didn’t cover division, so let’s do that now. We have two changes to make to our division module:

  1. Divide the Remainder
  2. Handle overflow

Accounting for the fractional part of the number, we need to increase the number of iterations to divide the remainder. For example, if we have an 8-bit number with 4 fractional bits, we must perform 12 iterations. We pass the number of fractional bits as a new parameter: FBITS.

By supporting fixed-point numbers, we can divide by numbers less than one, which means our result can overflow. For example, if you divide 6 by 0.25, the result is 24, which requires five bits to store: 11000; if we only have four integer bits, we can’t handle this. We check for overflow and report it with the ovf signal.

Unsigned fixed-point division - divu.sv:

module divu #(
    parameter WIDTH=8,  // width of numbers in bits (integer and fractional)
    parameter FBITS=4   // fractional bits within WIDTH
    ) (
    input wire logic clk,    // clock
    input wire logic rst,    // reset
    input wire logic start,  // start calculation
    output     logic busy,   // calculation in progress
    output     logic done,   // calculation is complete (high for one tick)
    output     logic valid,  // result is valid
    output     logic dbz,    // divide by zero
    output     logic ovf,    // overflow
    input wire logic [WIDTH-1:0] a,   // dividend (numerator)
    input wire logic [WIDTH-1:0] b,   // divisor (denominator)
    output     logic [WIDTH-1:0] val  // result value: quotient
    );

    localparam FBITSW = (FBITS == 0) ? 1 : FBITS;  // avoid negative vector width when FBITS=0

    logic [WIDTH-1:0] b1;             // copy of divisor
    logic [WIDTH-1:0] quo, quo_next;  // intermediate quotient
    logic [WIDTH:0] acc, acc_next;    // accumulator (1 bit wider)

    localparam ITER = WIDTH + FBITS;  // iteration count: unsigned input width + fractional bits
    logic [$clog2(ITER)-1:0] i;       // iteration counter

    // division algorithm iteration
    always_comb begin
        if (acc >= {1'b0, b1}) begin
            acc_next = acc - b1;
            {acc_next, quo_next} = {acc_next[WIDTH-1:0], quo, 1'b1};
        end else begin
            {acc_next, quo_next} = {acc, quo} << 1;
        end
    end

    // calculation control
    always_ff @(posedge clk) begin
        done <= 0;
        if (start) begin
            valid <= 0;
            ovf <= 0;
            i <= 0;
            if (b == 0) begin  // catch divide by zero
                busy <= 0;
                done <= 1;
                dbz <= 1;
            end else begin
                busy <= 1;
                dbz <= 0;
                b1 <= b;
                {acc, quo} <= {{WIDTH{1'b0}}, a, 1'b0};  // initialize calculation
            end
        end else if (busy) begin
            if (i == ITER-1) begin  // done
                busy <= 0;
                done <= 1;
                valid <= 1;
                val <= quo_next;
            end else if (i == WIDTH-1 && quo_next[WIDTH-1:WIDTH-FBITSW] != 0) begin  // overflow?
                busy <= 0;
                done <= 1;
                ovf <= 1;
                val <= 0;
            end else begin  // next iteration
                i <= i + 1;
                acc <= acc_next;
                quo <= quo_next;
            end
        end
        if (rst) begin
            busy <= 0;
            done <= 0;
            valid <= 0;
            dbz <= 0;
            ovf <= 0;
            val <= 0;
        end
    end
endmodule

Notice how the division algorithm iteration remains unchanged from the integer version.

Fractional Testing

Naturally, we have a test bench for fixed-point division:

cd lib/maths/test
make divu

Example output for 8/9 (test 6 of 22):

    99.00ns INFO     cocotb.regression      running simple_6 (6/22)
                                              Test 8/9
   118.00ns INFO     cocotb.divu            dut a:     10000000
   118.00ns INFO     cocotb.divu            dut b:     10010000
   118.00ns INFO     cocotb.divu            dut val:   00001110
   118.00ns INFO     cocotb.divu                       0.875
   118.00ns INFO     cocotb.divu            model val: 00000.1110
   118.00ns INFO     cocotb.divu                       0.875
   119.01ns INFO     cocotb.regression      simple_6 passed

Our test bench includes tests for overflow and dividing by zero.

You’ll see some tests “failed as expected” because divu doesn’t handle rounding in the same way as Python. We’ll fix this in the next section.

   197.01ns INFO     cocotb.regression      running round_5 (11/22)
                                              Test 13/7
   216.01ns INFO     cocotb.divu            dut a:     11010000
   216.01ns INFO     cocotb.divu            dut b:     01110000
   216.01ns INFO     cocotb.divu            dut val:   00011101
   216.01ns INFO     cocotb.divu                       1.8125
   216.01ns INFO     cocotb.divu            model val: 00001.1110
   216.01ns INFO     cocotb.divu                       1.875
   216.01ns INFO     cocotb.regression      round_5 passed: failed as expected...

Signed Numbers

Our division algorithm doesn’t work with signed numbers, but the solution is straightforward: perform the division on the inputs’ absolute value, then adjust the sign afterwards. Because we now have more steps, we switch to a finite state machine (FSM) to control the calculation.

Signed division with Gaussian rounding - div.sv:

module div #(
    parameter WIDTH=8,  // width of numbers in bits (integer and fractional)
    parameter FBITS=4   // fractional bits within WIDTH
    ) (
    input wire logic clk,    // clock
    input wire logic rst,    // reset
    input wire logic start,  // start calculation
    output     logic busy,   // calculation in progress
    output     logic done,   // calculation is complete (high for one tick)
    output     logic valid,  // result is valid
    output     logic dbz,    // divide by zero
    output     logic ovf,    // overflow
    input wire logic signed [WIDTH-1:0] a,   // dividend (numerator)
    input wire logic signed [WIDTH-1:0] b,   // divisor (denominator)
    output     logic signed [WIDTH-1:0] val  // result value: quotient
    );

    localparam WIDTHU = WIDTH - 1;                 // unsigned widths are 1 bit narrower
    localparam FBITSW = (FBITS == 0) ? 1 : FBITS;  // avoid negative vector width when FBITS=0
    localparam SMALLEST = {1'b1, {WIDTHU{1'b0}}};  // smallest negative number

    localparam ITER = WIDTHU + FBITS;  // iteration count: unsigned input width + fractional bits
    logic [$clog2(ITER):0] i;          // iteration counter (allow ITER+1 iterations for rounding)

    logic a_sig, b_sig, sig_diff;      // signs of inputs and whether different
    logic [WIDTHU-1:0] au, bu;         // absolute version of inputs (unsigned)
    logic [WIDTHU-1:0] quo, quo_next;  // intermediate quotients (unsigned)
    logic [WIDTHU:0] acc, acc_next;    // accumulator (unsigned but 1 bit wider)

    // input signs
    always_comb begin
        a_sig = a[WIDTH-1+:1];
        b_sig = b[WIDTH-1+:1];
    end

    // division algorithm iteration
    always_comb begin
        if (acc >= {1'b0, bu}) begin
            acc_next = acc - bu;
            {acc_next, quo_next} = {acc_next[WIDTHU-1:0], quo, 1'b1};
        end else begin
            {acc_next, quo_next} = {acc, quo} << 1;
        end
    end

    // calculation state machine
    enum {IDLE, INIT, CALC, ROUND, SIGN} state;
    always_ff @(posedge clk) begin
        done <= 0;
        case (state)
            INIT: begin
                state <= CALC;
                ovf <= 0;
                i <= 0;
                {acc, quo} <= {{WIDTHU{1'b0}}, au, 1'b0};  // initialize calculation
            end
            CALC: begin
                if (i == WIDTHU-1 && quo_next[WIDTHU-1:WIDTHU-FBITSW] != 0) begin  // overflow
                    state <= IDLE;
                    busy <= 0;
                    done <= 1;
                    ovf <= 1;
                end else begin
                    if (i == ITER-1) state <= ROUND;  // calculation complete after next iteration
                    i <= i + 1;
                    acc <= acc_next;
                    quo <= quo_next;
                end
            end
            ROUND: begin  // Gaussian rounding
                state <= SIGN;
                if (quo_next[0] == 1'b1) begin  // next digit is 1, so consider rounding
                    // round up if quotient is odd or remainder is non-zero
                    if (quo[0] == 1'b1 || acc_next[WIDTHU:1] != 0) quo <= quo + 1;
                end
            end
            SIGN: begin  // adjust quotient sign if non-zero and input signs differ
                state <= IDLE;
                if (quo != 0) val <= (sig_diff) ? {1'b1, -quo} : {1'b0, quo};
                busy <= 0;
                done <= 1;
                valid <= 1;
            end
            default: begin  // IDLE
                if (start) begin
                    valid <= 0;
                    if (b == 0) begin  // divide by zero
                        state <= IDLE;
                        busy <= 0;
                        done <= 1;
                        dbz <= 1;
                        ovf <= 0;
                    end else if (a == SMALLEST || b == SMALLEST) begin  // overflow
                        state <= IDLE;
                        busy <= 0;
                        done <= 1;
                        dbz <= 0;
                        ovf <= 1;
                    end else begin
                        state <= INIT;
                        au <= (a_sig) ? -a[WIDTHU-1:0] : a[WIDTHU-1:0];  // register abs(a)
                        bu <= (b_sig) ? -b[WIDTHU-1:0] : b[WIDTHU-1:0];  // register abs(b)
                        sig_diff <= (a_sig ^ b_sig);  // register input sign difference
                        busy <= 1;
                        dbz <= 0;
                        ovf <= 0;
                    end
                end
            end
        endcase
        if (rst) begin
            state <= IDLE;
            busy <= 0;
            done <= 0;
            valid <= 0;
            dbz <= 0;
            ovf <= 0;
            val <= 0;
        end
    end
endmodule

We create unsigned versions of the inputs as au and bu and register the difference in signs as sig_diff. After the calculation, we use sig_diff to set the correct sign for the result, being careful not to change zero.

Rounding

When working with fixed-point numbers, we need to consider rounding. For example, the square root of two is: √2 ≈ 1.41421... with a finite number of bits, we need to decide how to approximate it.

The divu module truncates its answers, which is simple to implement, but doesn’t produce the nearest number to the actual value. For example, 13/7 ≈ 1.85714... but divu produces 1.8125 rather than 1.875.

Alas, there is not one-true-way to round numbers, but the default for IEEE 754 floating point numbers (also used in Python) is Gaussian rounding, also known as bankers’ rounding or “rounding half to even”. In this system, 1.5 is rounded up to 2 and 2.5 is rounded down to 2, while 3.5 is rounded up to 4 etc. Gaussian rounding avoids the positive bias that would occur if we always rounded half up.

To implement Gaussian rounding, we need to calculate the next bit of the result and look at the remainder, which is what we do in the ROUND step:

    ROUND: begin  // Gaussian rounding
        state <= SIGN;
        if (quo_next[0] == 1'b1) begin  // next digit is 1, so consider rounding
            // round up if quotient is odd or remainder is non-zero
            if (quo[0] == 1'b1 || acc_next[WIDTHU:1] != 0) quo <= quo + 1;
        end
    end

With the addition of this rounding step, our module now matches the Python model.

Testing the Final Design

Test bench for signed fixed-point division:

cd lib/maths/test
make div

Example output for -7.0625/2 (test 17 of 32):

   367.02ns INFO     cocotb.regression      running round_8 (17/32)
                                              Test -7.0625/2
   389.02ns INFO     cocotb.div             dut a:     110001111
   389.02ns INFO     cocotb.div             dut b:     000100000
   389.02ns INFO     cocotb.div             dut val:   111001000
   389.02ns INFO     cocotb.div                        -3.5
   389.02ns INFO     cocotb.div             model val: 111100.1000
   389.02ns INFO     cocotb.div                        -3.5
   390.02ns INFO     cocotb.regression      round_8 passed

What’s Next?

I have other Verilog maths posts covering square root and sine & cosine. You can also check out my FPGA & RISC-V Tutorials.

Get in touch on Mastodon, Bluesky, or X. If you enjoy my work, please sponsor me. 🙏

Apple Pie photo by Shisma under Creative Commons Attribution licence.