# Division in Verilog

Division is a fundamental arithmetic operation; one we take for granted in most contexts. FPGAs are different; Verilog can’t synthesize division: we need to do it ourselves. In this FPGA recipe, we’re going to look at a simple division algorithm for positive integers and fixed-point numbers. This method takes one cycle per bit: 32 cycles for 32-bit numbers.

Revised 2020-07-28. Feedback to @WillFlux is most welcome.

## Division Defined

Before we get to the design, it helps be familiar with some terminology.

When you divide dividend `X` by divisor `Y` you get quotient `Q` and remainder `R`:

`X = Y*Q + R`

Consider a trivial example: you have seven slices of apple pie to divide equally amongst three people. Each person gets two slices of pie, and there is one slice left over.

Less deliciously: given `X=7` and `Y=3`, then `Q=2` and `R=1` because `7 = 3*2 + 1`.

## Long Division

The traditional way to divide numbers with pen and paper is long division. We move from left to right, trying to divide the shortest sequence of digits in the dividend by the divisor.

For example, let’s divide `512` by `12`. We find that `12` is larger `5` (the first digit of the dividend), so we next consider `51`, which `12` divides `4` times, with `3` left over. So, the first digit of our answer is 4.

Next, we take the leftover `3`, and bring down the `2` from `512`, to make `32`. `12` fits into `32` twice, so the second digit of our answer is `2` with `8` left over. We’ve considered all the digits of the dividend, so the quotient is `42` and the remainder is `8`.

This is easier to see when laid out as a calculation in columns:

``````    X=512  Y=12

42    Quotient
————
12 )512
48     51: 12x4=48 + 3
——
32    Take the 3 and bring down the 2 down from 512
24    32: 12x2=24 + 8
——
8    Remainder
``````

If this doesn’t seem clear, check out the Wikipedia page on long division and try doing a few calculations yourself. Nothing beats doing hands-on examples when it comes to maths.

### In Binary

For binary, we can follow the same long division process. For example, let’s divide `1110` by `11` (that’s 14 divided by 3 in decimal).

Our divisor `11` is two digits long, so we can start by considering the first two digits of the dividend, which is also `11`: we record a `1` for the first digit of the quotient and move on.

The remaining digits are `10`, which is smaller than `11`, so we stop. We have a quotient of `100` and a remainder of `10`. This is easier to see when laid out in columns:

``````    X=1110  Y=11

100    Quotient
————
11 )1110
11      11x1=11 + 0
——
010    Bring third 1 down from 1110
000    11 doesn't divide 10, so it's the remainder
———
10    Remainder
``````

Doing binary division by hand is painful: each step is simple, but even moderately-sized numbers have a large number of digits, all of which are 1 and 0, so it’s easy to make a mistake. However, the very simplicity of the approach makes it straightforward to implement in Verilog.

Quick Aside: Divide is typically much slower than Multiply
Even high-end CPUs take their time with division. Intel Skylake has a latency of 42-95 cycles for signed 64-bit integer division, but only 3 cycles for multiplication. Source: agner.org.

## Algorithm Implementation

We’ll take our example from above: `X=1110` and `Y=0011`. We also need a register to accumulate (store) the intermediate calculations, and a register to record the quotient: we name them `A` and `Q` respectively.

There are four digits in the inputs, so we need four steps. For each step, we shift the left-most digit of the dividend `X` into `A`, then compare it with the divisor `Y`. If `A` is greater or equal to `Y`, then we subtract `Y` from `A` and add 1 to the quotient `Q`.

This is easiest to see by working through the example:

``````Inputs: X=1110  Y=0011

Step    A       X       Q       Description
——————————————————————————————————————————————————————————————————
0000    1110    0000    Starting values.
1       0001    1100    0000    Left shift X into A. Left shift Q.
Is A≥Y? No: next digit...
2       0011    1000    0000    Left shift X into A. Left shift Q.
Is A≥Y? Yes: update quotient...
0000    1000    0001    Subtract Y from A. Set Q=1.
3       0001    0000    0010    Left shift X into A. Left shift Q.
Is A≥Y? No: next digit...
4       0010    0000    0100    Left shift X into A. Left shift Q.
Is A≥Y? No. Done.
——————————————————————————————————————————————————————————————————

Outputs: Q=0100  R=010
``````

With this algorithm, the divisor `Y` can only ever fit into the accumulator `A` once at most, which is why we can simply subtract `Y` from `A` when `A≥Y`.

We never use the same digit in `X` and `Q` at the same time, so it’s possible to combine those registers together. As a digit of the dividend is shifted out, a digit of the quotient is shifted in. We’ll do this in our Verilog module.

## Verilog Module

Our Verilog design uses the above algorithm, but adds a check for divide by zero and allows the width of the numbers to be configured.

``````// Project F: Division
// (C)2020 Will Green, Open source hardware released under the MIT License

module div #(parameter WIDTH=4)
(
input wire logic clk,
input wire logic start,          // start signal
input wire logic [WIDTH-1:0] x,  // dividend
input wire logic [WIDTH-1:0] y,  // divisor
output     logic [WIDTH-1:0] q,  // quotient
output     logic [WIDTH-1:0] r,  // remainder
output     logic done,           // done signal
output     logic dbz             // divide by zero flag
);

logic [WIDTH-1:0] y1;            // copy of divisor
logic [WIDTH-1:0] q1, q1_next;   // intermediate quotient
logic [WIDTH-1:0] ac, ac_next;   // accumulator
logic [\$clog2(WIDTH+1)-1:0] i;   // dividend bit counter

always_comb begin
if (ac >= y1) begin
ac_next = ac - y1;
{ac_next, q1_next} = {ac_next[WIDTH-2:0], q1, 1'b1};
end else begin
{ac_next, q1_next} = {ac, q1} << 1;
end
end

always_ff @(posedge clk) begin
if (start) begin
if (y == 0) begin  // catch divide by zero
dbz <= 1;
done <= 1;
q <= 0;
r <= 0;
end else begin  // initialize values
dbz <= 0;
done <= 0;
i <= 0;
y1 <= y;
{ac, q1} <= {{WIDTH-1{1'b0}}, x, 1'b0};
end
end else if (!done) begin
if (i == WIDTH-1) begin  // we're done
done <= 1;
q <= q1_next;
r <= ac_next >> 1;  // undo final shift
end else begin  // next iteration
i <= i + 1;
ac <= ac_next;
q1 <= q1_next;
end
end
end
endmodule
``````

ProTip: I’ve used SystemVerilog, but you can easily adapt `logic`, `\$clog2`, and always* to Verilog.

To use the module, set `WIDTH` to the correct number of bits and the inputs `x` and `y` to dividend and divisor respectively. To begin the calculation set `start` high for one clock. The `done` signal indicates when the calculation is complete, then you can read the result out of `q` and `r`. If you divide by zero, then `q` and `r` will be zero, and the `dbz` flag signal will be high.

The Verilog itself is straightforward. The algorithm is in the `always_comb` block, though we start with the initial shift already in place. The `always_ff` block tests for division by zero, sets up the initial values, then runs the algorithm for the same number of steps as the width of the numbers.

### Testing

The following test bench exercises the module, including dividing by zero.

``````// Project F: Division Test Bench
// (C)2020 Will Green, Open source hardware released under the MIT License

module div_tb();

parameter CLK_PERIOD = 10;
parameter WIDTH = 4;

logic clk;
logic start;            // start signal
logic [WIDTH-1:0] x;    // dividend
logic [WIDTH-1:0] y;    // divisor
logic [WIDTH-1:0] q;    // quotient
logic [WIDTH-1:0] r;    // remainder
logic done;             // done signal
logic dbz;              // divide by zero flag

div #(.WIDTH(WIDTH)) div_inst (.*);

always #(CLK_PERIOD / 2) clk = ~clk;

initial begin
clk = 1;

#100    x = 4'b0111;
y = 4'b0010;
start = 1;
#10     start = 0;
#40     \$display("\t%d:\t%d /%d =%d (r =%d) (DBZ=%b)", \$time, x, y, q, r, dbz);
x = 4'b0001;
y = 4'b0001;
start = 1;
#10     start = 0;
#40     \$display("\t%d:\t%d /%d =%d (r =%d) (DBZ=%b)", \$time, x, y, q, r, dbz);
x = 4'b0010;
y = 4'b0010;
start = 1;
#10     start = 0;
#40     \$display("\t%d:\t%d /%d =%d (r =%d) (DBZ=%b)", \$time, x, y, q, r, dbz);
x = 4'b0010;
y = 4'b0001;
start = 1;
#10     start = 0;
#40     \$display("\t%d:\t%d /%d =%d (r =%d) (DBZ=%b)", \$time, x, y, q, r, dbz);
x = 4'b0000;
y = 4'b0010;
start = 1;
#10     start = 0;
#40     \$display("\t%d:\t%d /%d =%d (r =%d) (DBZ=%b)", \$time, x, y, q, r, dbz);
x = 4'b0010;
y = 4'b0000;
start = 1;
#10     start = 0;
#40     \$display("\t%d:\t%d /%d =%d (r =%d) (DBZ=%b)", \$time, x, y, q, r, dbz);
x = 4'b0011;
y = 4'b0010;
start = 1;
#10     start = 0;
#40     \$display("\t%d:\t%d /%d =%d (r =%d) (DBZ=%b)", \$time, x, y, q, r, dbz);
x = 4'b1111;
y = 4'b0101;
start = 1;
#10     start = 0;
#40     \$display("\t%d:\t%d /%d =%d (r =%d) (DBZ=%b)", \$time, x, y, q, r, dbz);
x = 4'b1111;
y = 4'b0010;
start = 1;
#10     start = 0;
#40     \$display("\t%d:\t%d /%d =%d (r =%d) (DBZ=%b)", \$time, x, y, q, r, dbz);
x = 4'b0001;
y = 4'b1111;
start = 1;
#10     start = 0;
#40     \$display("\t%d:\t%d /%d =%d (r =%d) (DBZ=%b)", \$time, x, y, q, r, dbz);
x = 4'b0010;
y = 4'b0100;
start = 1;
#10     start = 0;
#40     \$display("\t%d:\t%d /%d =%d (r =%d) (DBZ=%b)", \$time, x, y, q, r, dbz);
x = 4'b1101;
y = 4'b0111;
start = 1;
#10     start = 0;
#40     \$display("\t%d:\t%d /%d =%d (r =%d) (DBZ=%b)", \$time, x, y, q, r, dbz);
\$finish;
end
endmodule
``````

## Fixed Point Support

In a previous recipe, we looked at Fixed Point Numbers in Verilog, but didn’t cover division, so let’s do that now. We have two changes to make:

1. Scaling the quotient
2. Handling overflow

NB. We don’t support signed values with this simple division algorithm.

If we use fixed-point numbers with the original module (above), the resulting quotient is not scaled correctly. For example, if we divide `8` by `4` in Q4.4 format, the result is `0.25`, rather than the customary `2`. To get the correct result, we shift the quotient left by the number of bits in the fractional part of the number, in this case, 4 bits:

``````   1000.0000        8.0000
/  0100.0000      / 4.0000
=  0000.0010      = 0.2500 (unscaled)
<< 0100           << 4 (scale)
0010.0000      = 2.0000
``````

Because of quotient scaling, it’s possible for overflow to occur. Consider dividing `8` by `0.25` using Q4.4 format. After scaling, we end up with the result `0` rather than the expected `32`.

``````   1000.0000         8.0000
/  0000.0010      /  0.2500
=  0010.0000      =  4 (unscaled)
<< 0100           << 4 (scale)
0000.0000      =  0 ?!
``````

We could ignore this and make the module user responsible for overflow, but it’s much easier for the module to do this before the scaling occurs. To indicate overflow, we set the overflow flag `ovf`.

With these changes, the module looks like this:

``````// Project F: Division (with fixed-point support)
// (C)2020 Will Green, Open source hardware released under the MIT License

module div #(
parameter WIDTH=4,  // width of number in bits
parameter FBITS=0   // fractional bits
) (
input wire logic clk,
input wire logic start,          // start signal
input wire logic [WIDTH-1:0] x,  // dividend
input wire logic [WIDTH-1:0] y,  // divisor
output     logic [WIDTH-1:0] q,  // quotient
output     logic [WIDTH-1:0] r,  // remainder
output     logic done,           // done signal
output     logic dbz,            // divide by zero flag
output     logic ovf             // overflow flag (fixed-point)
);

// avoid negative vector width when fractional bits are not used
localparam FBITSW = (FBITS) ? FBITS : 1;

logic [WIDTH-1:0] y1;            // copy of divisor
logic [WIDTH-1:0] q1, q1_next;   // intermediate quotient
logic [WIDTH-1:0] ac, ac_next;   // accumulator
logic [\$clog2(WIDTH+1)-1:0] i;   // dividend bit counter

always_comb begin
if (ac >= y1) begin
ac_next = ac - y1;
{ac_next, q1_next} = {ac_next[WIDTH-2:0], q1, 1'b1};
end else begin
{ac_next, q1_next} = {ac, q1} << 1;
end
end

always_ff @(posedge clk) begin
if (start) begin
ovf <= 0;
q <= 0;
r <= 0;
i <= 0;
if (y == 0) begin  // catch divide by zero
dbz <= 1;
done <= 1;
end else begin
dbz <= 0;
done <= 0;
y1 <= y;
{ac, q1} <= {{WIDTH-1{1'b0}}, x, 1'b0};
end
end else if (!done) begin
if (i == WIDTH-1) begin  // done
done <= 1;
if (FBITS != 0 && q1_next[WIDTH-1:WIDTH-FBITSW]) begin  // catch overflow
ovf <= 1;
q <= 0;
r <= 0;
end else begin
q <= q1_next << FBITS;  // fixed-point correction
r <= ac_next >> 1;      // undo final shift
end
end else begin  // next iteration
i <= i + 1;
ac <= ac_next;
q1 <= q1_next;
end
end
end
endmodule
``````

Updating the test bench to support fixed point is left as an exercise for the reader. Don’t forget to add a definition of the overflow flag `ovf` to your test bench.

Note that the fixed-point version still produces an integer answer and remainder, for example:

``````6.500000 / 0.562500 = 11.000000 (r = 0.312500)
``````

## Future Features

The following features may be added in future, time permitting:

1. Non-integer results for fixed-point numbers
2. Signed numbers
3. Formal verification