# Division in Verilog

Division is a fundamental arithmetic operation; one we take for granted in most contexts. FPGAs are different; Verilog can’t synthesize division: we need to do it ourselves. In this FPGA recipe, we’re going to look at a straightforward division algorithm for positive integers and fixed-point numbers. For integers, this method takes one cycle per bit: 32 cycles for 32-bit numbers. This post was last updated June 2021.

This post is part of a series of handy recipes to solve common FPGA development problems. There are also posts on fixed-point numbers, square root, and sine & cosine.

Get in touch: GitHub Issues, 1BitSquared Discord, @WillFlux (Mastodon), @WillFlux (Twitter)

If you like what I do, consider sponsoring me on GitHub.
I love FPGAs and want to help more people discover and use them in their projects.
My hardware designs are open source, and my blog is advert free.

### Source

The SystemVerilog designs featured in this post are available from the Project F Library under the open-source MIT licence: build on them to your heart’s content. The rest of the blog content is subject to standard copyright restrictions: don’t republish it without permission.

## Division Defined

Before we get to the design, it helps be familiar with some terminology.

When you divide dividend `X` by divisor `Y` you get quotient `Q` and remainder `R`:

`X = Y*Q + R`

Consider a trivial example: you have seven slices of apple pie to divide equally amongst three people. Each person gets two slices of pie, and there is one slice left over.

Less deliciously: given `X=7` and `Y=3`, then `Q=2` and `R=1` because `7 = 3*2 + 1`.

## Long Division

The traditional way to divide numbers with pen and paper is long division. We move from left to right, trying to divide the shortest sequence of digits in the dividend by the divisor.

For example, let’s divide `512` by `12`. We find that `12` is larger `5` (the first digit of the dividend), so we next consider `51`, which `12` divides `4` times, with `3` left over. So, the first digit of our answer is 4.

Next, we take the leftover `3`, and bring down the `2` from `512`, to make `32`. `12` fits into `32` twice, so the second digit of our answer is `2` with `8` left over. We’ve considered all the digits of the dividend, so the quotient is `42` and the remainder is `8`.

This is easier to see when laid out as a calculation in columns:

``````    X=512  Y=12

42    Quotient
————
12 )512
48     51: 12x4=48 + 3
——
32    Take the 3 and bring down the 2 down from 512
24    32: 12x2=24 + 8
——
8    Remainder
``````

If this doesn’t seem clear, check out the Wikipedia page on long division and try doing a few calculations yourself. Nothing beats doing hands-on examples when it comes to maths.

### In Binary

For binary, we can follow the same long division process. For example, let’s divide `1110` by `11` (that’s 14 divided by 3 in decimal).

Our divisor `11` is two digits long, so we can start by considering the first two digits of the dividend, which is also `11`: we record a `1` for the first digit of the quotient and move on.

The remaining digits are `10`, which is smaller than `11`, so we stop. We have a quotient of `100` and a remainder of `10`. This is easier to see when laid out in columns:

``````    X=1110  Y=11

100    Quotient
————
11 )1110
11      11x1=11 + 0
——
010    Bring third 1 down from 1110
000    11 doesn't divide 10, so it's the remainder
———
10    Remainder
``````

Doing binary division by hand is painful: each step is simple, but even moderately-sized numbers have a large number of digits, all of which are 1 and 0, so it’s easy to make a mistake. However, the very simplicity of the approach makes it straightforward to implement in Verilog.

Divide is typically slower than Multiply
Even high-end CPUs take their time with division. Intel Skylake has a latency of 42-95 cycles for signed 64-bit integer division, but only 3 cycles for multiplication. Source: agner.org.

## Algorithm Implementation

We’ll take our example from above: `X=1110` and `Y=0011`. We also need a register to accumulate (store) the intermediate calculations, and a register to record the quotient: we name them `A` and `Q` respectively.

There are four digits in the inputs, so we need four steps. For each step, we shift the left-most digit of the dividend `X` into `A`, then compare it with the divisor `Y`. If `A` is greater or equal to `Y`, then we subtract `Y` from `A` and add 1 to the quotient `Q`.

This is easiest to see by working through the example:

``````Inputs: X=1110  Y=0011

Step    A       X       Q       Description
——————————————————————————————————————————————————————————————————
0000    1110    0000    Starting values.
1       0001    1100    0000    Left shift X into A. Left shift Q.
Is A≥Y? No: next digit...
2       0011    1000    0000    Left shift X into A. Left shift Q.
Is A≥Y? Yes: update quotient...
0000    1000    0001    Subtract Y from A. Set Q=1.
3       0001    0000    0010    Left shift X into A. Left shift Q.
Is A≥Y? No: next digit...
4       0010    0000    0100    Left shift X into A. Left shift Q.
Is A≥Y? No. Done.
——————————————————————————————————————————————————————————————————

Outputs: Q=0100  R=010
``````

With this algorithm, the divisor `Y` can only ever fit into the accumulator `A` once at most, which is why we can simply subtract `Y` from `A` when `A≥Y`.

We never use the same digit in `X` and `Q` at the same time, so it’s possible to combine those registers together. As a digit of the dividend is shifted out, a digit of the quotient is shifted in. We’ll do this in our Verilog module.

## Verilog Module

Our Verilog design uses the above algorithm, but adds a check for divide by zero and allows the width of the numbers to be configured.

Project F Library source: [div_int.sv].

``````module div_int #(parameter WIDTH=4) (
input wire logic clk,
input wire logic start,          // start signal
output     logic busy,           // calculation in progress
output     logic valid,          // quotient and remainder are valid
output     logic dbz,            // divide by zero flag
input wire logic [WIDTH-1:0] x,  // dividend
input wire logic [WIDTH-1:0] y,  // divisor
output     logic [WIDTH-1:0] q,  // quotient
output     logic [WIDTH-1:0] r   // remainder
);

logic [WIDTH-1:0] y1;            // copy of divisor
logic [WIDTH-1:0] q1, q1_next;   // intermediate quotient
logic [WIDTH:0] ac, ac_next;     // accumulator (1 bit wider)
logic [\$clog2(WIDTH)-1:0] i;     // iteration counter

always_comb begin
if (ac >= {1'b0,y1}) begin
ac_next = ac - y1;
{ac_next, q1_next} = {ac_next[WIDTH-1:0], q1, 1'b1};
end else begin
{ac_next, q1_next} = {ac, q1} << 1;
end
end

always_ff @(posedge clk) begin
if (start) begin
valid <= 0;
i <= 0;
if (y == 0) begin  // catch divide by zero
busy <= 0;
dbz <= 1;
end else begin  // initialize values
busy <= 1;
dbz <= 0;
y1 <= y;
{ac, q1} <= {{WIDTH{1'b0}}, x, 1'b0};
end
end else if (busy) begin
if (i == WIDTH-1) begin  // we're done
busy <= 0;
valid <= 1;
q <= q1_next;
r <= ac_next[WIDTH:1];  // undo final shift
end else begin  // next iteration
i <= i + 1;
ac <= ac_next;
q1 <= q1_next;
end
end
end
endmodule
``````

ProTip: I’ve used SystemVerilog, but you can easily adapt `logic`, `\$clog2`, and always* to Verilog.

To use the module, set `WIDTH` to the correct number of bits and the inputs `x` and `y` to dividend and divisor respectively. To begin the calculation set `start` high for one clock. The `valid` signal indicates when the output data is valid; you can then read the results from `q` and `r`. If you divide by zero, then `valid`, `q`, and `r` will be zero, and the `dbz` flag signal will be high. The `busy` signal is high during calculation.

The Verilog itself is straightforward. The algorithm is in the `always_comb` block, though we start with the initial shift already in place. The `always_ff` block tests for division by zero, sets up the initial values, then runs the algorithm for the same number of iterations as the width of the numbers.

Accumulator Width
The accumulator needs to be 1 bit wider than the dividend because the remainder comes from unshifting the final `ac_next`. For example, when we divide eight by nine, the remainder should be eight `4'b1000`, but without the wider accumulator the left-most digit would be lost, and the remainder would appear to be `4'b0000`.

### Testing

The following Vivado test bench exercises the module, including dividing by zero.

Project F Library source: [xc7/div_int_tb.sv].

``````module div_int_tb();

parameter CLK_PERIOD = 10;  // 10 ns == 100 MHz
parameter WIDTH = 4;

logic clk;
logic start;            // start signal
logic busy;             // calculation in progress
logic valid;            // quotient and remainder are valid
logic dbz;              // divide by zero flag
logic [WIDTH-1:0] x;    // dividend
logic [WIDTH-1:0] y;    // divisor
logic [WIDTH-1:0] q;    // quotient
logic [WIDTH-1:0] r;    // remainder

div_int #(.WIDTH(WIDTH)) div_int_inst (.*);

always #(CLK_PERIOD / 2) clk = ~clk;

initial begin
\$monitor("\t%d:\t%d /%d =%d (r =%d) (V=%b) (DBZ=%b)",
\$time, x, y, q, r, valid, dbz);
end

initial begin
clk = 1;

#100    x = 4'b0000;  // 0
y = 4'b0010;  // 2
start = 1;
#10     start = 0;

#50     x = 4'b0010;  // 2
y = 4'b0000;  // 0
start = 1;
#10     start = 0;

#50     x = 4'b0111;  // 7
y = 4'b0010;  // 2
start = 1;
#10     start = 0;

#50     x = 4'b1111;  // 15
y = 4'b0101;  //  5
start = 1;
#10     start = 0;

#50     x = 4'b0001;  // 1
y = 4'b0001;  // 1
start = 1;
#10     start = 0;

#50     x = 4'b1000;  // 8
y = 4'b1001;  // 9
start = 1;
#10     start = 0;

// ...

#50     \$finish;
end
endmodule
``````

The output looks like this:

``````  0:     x / x = x (r = x) (V=x) (DBZ=x)
100:     0 / 2 = x (r = x) (V=0) (DBZ=0)
140:     0 / 2 = 0 (r = 0) (V=1) (DBZ=0)
160:     2 / 0 = 0 (r = 0) (V=0) (DBZ=1)
220:     7 / 2 = 0 (r = 0) (V=0) (DBZ=0)
260:     7 / 2 = 3 (r = 1) (V=1) (DBZ=0)
280:    15 / 5 = 3 (r = 1) (V=0) (DBZ=0)
320:    15 / 5 = 3 (r = 0) (V=1) (DBZ=0)
340:     1 / 1 = 3 (r = 0) (V=0) (DBZ=0)
380:     1 / 1 = 1 (r = 0) (V=1) (DBZ=0)
400:     8 / 9 = 1 (r = 0) (V=0) (DBZ=0)
440:     8 / 9 = 0 (r = 8) (V=1) (DBZ=0)
``````

NB. I have tested with a wider range of values, but only show a few here for brevity.

## Fixed-Point Support

In a previous recipe, we looked at Fixed Point Numbers in Verilog, but didn’t cover division, so let’s do that now. We have two changes to make:

1. Increasing the number of iterations
2. Handling overflow

To account for the fractional part of the number, we need to increase the number of iterations by the number of fractional bits.

More explanation to follow…

Project F Library source: [div.sv].

``````module div #(
parameter WIDTH=4,  // width of numbers in bits
parameter FBITS=0   // fractional bits (for fixed point)
) (
input wire logic clk,
input wire logic start,          // start signal
output     logic busy,           // calculation in progress
output     logic valid,          // quotient and remainder are valid
output     logic dbz,            // divide by zero flag
output     logic ovf,            // overflow flag (fixed-point)
input wire logic [WIDTH-1:0] x,  // dividend
input wire logic [WIDTH-1:0] y,  // divisor
output     logic [WIDTH-1:0] q,  // quotient
output     logic [WIDTH-1:0] r   // remainder
);

// avoid negative vector width when fractional bits are not used
localparam FBITSW = (FBITS) ? FBITS : 1;

logic [WIDTH-1:0] y1;           // copy of divisor
logic [WIDTH-1:0] q1, q1_next;  // intermediate quotient
logic [WIDTH:0] ac, ac_next;    // accumulator (1 bit wider)

localparam ITER = WIDTH+FBITS;  // iterations are dividend width + fractional bits
logic [\$clog2(ITER)-1:0] i;     // iteration counter

always_comb begin
if (ac >= {1'b0,y1}) begin
ac_next = ac - y1;
{ac_next, q1_next} = {ac_next[WIDTH-1:0], q1, 1'b1};
end else begin
{ac_next, q1_next} = {ac, q1} << 1;
end
end

always_ff @(posedge clk) begin
if (start) begin
valid <= 0;
ovf <= 0;
i <= 0;
if (y == 0) begin  // catch divide by zero
busy <= 0;
dbz <= 1;
end else begin
busy <= 1;
dbz <= 0;
y1 <= y;
{ac, q1} <= {{WIDTH{1'b0}}, x, 1'b0};
end
end else if (busy) begin
if (i == ITER-1) begin  // done
busy <= 0;
valid <= 1;
q <= q1_next;
r <= ac_next[WIDTH:1];  // undo final shift
end else if (i == WIDTH-1 && q1_next[WIDTH-1:WIDTH-FBITSW]) begin // overflow?
busy <= 0;
ovf <= 1;
q <= 0;
r <= 0;
end else begin  // next iteration
i <= i + 1;
ac <= ac_next;
q1 <= q1_next;
end
end
end
endmodule
``````

A new Vivado test bench can exercise the module (a sample of tests shown for brevity).

Project F Library source: [xc7/div_tb.sv].

``````module div_tb();

parameter CLK_PERIOD = 10;  // 10 ns == 100 MHz
parameter WIDTH = 8;
parameter FBITS = 4;
parameter SF = 2.0**-4.0;  // Q4.4 scaling factor is 2^-4

logic clk;
logic start;            // start signal
logic busy;             // calculation in progress
logic valid;            // quotient and remainder are valid
logic dbz;              // divide by zero flag
logic ovf;              // overflow flag (fixed-point only)
logic [WIDTH-1:0] x;    // dividend
logic [WIDTH-1:0] y;    // divisor
logic [WIDTH-1:0] q;    // quotient
logic [WIDTH-1:0] r;    // remainder

div #(.WIDTH(WIDTH), .FBITS(FBITS)) div_inst (.*);

always #(CLK_PERIOD / 2) clk = ~clk;

initial begin
\$monitor("\t%d:\t%f / %f = %b (%f) (r = %b) (V=%b) (DBZ=%b) (OVF=%b)",
\$time, x*SF, y*SF, q, q*SF, r, valid, dbz, ovf);
end

initial begin
clk = 1;

#100    x = 8'b0011_0000;  // 3.0
y = 8'b0010_0000;  // 2.0
start = 1;
#10     start = 0;

#120    x = 8'b0010_0000;  // 2.0
y = 8'b0001_0110;  // 1.375 (the largest number that's ≤√2 in Q4.4)
start = 1;
#10     start = 0;

#120    x = 8'b0010_0000;  // 2.0
y = 8'b0000_0000;  // 0.0
start = 1;
#10     start = 0;

#120    x = 8'b0000_0000;  // 0.0
y = 8'b0010_0000;  // 2.0
start = 1;
#10     start = 0;

#120    x = 8'b0000_0010;  // 0.125
y = 8'b0010_0000;  // 2.0
start = 1;
#10     start = 0;

#120    x = 8'b1000_0000;  // 8.0
y = 8'b0000_0100;  // 0.25
start = 1;
#10     start = 0;

#120    x = 8'b1111_1110;  // 15.875
y = 8'b0010_0000;  // 2.0
start = 1;
#10     start = 0;

#120    x = 8'b1000_0000;  // 8.0
y = 8'b1001_0000;  // 9.0
start = 1;
#10     start = 0;

// ...
end
endmodule
``````

The output looks like this:

``````   0:    0.000000 /  0.000000 = xxxxxxxx  (0.000000) (r = xxxxxxxx) (V=x) (DBZ=x) (OVF=x)
100:    3.000000 /  2.000000 = xxxxxxxx  (0.000000) (r = xxxxxxxx) (V=0) (DBZ=0) (OVF=0)
220:    3.000000 /  2.000000 = 00011000  (1.500000) (r = 00000000) (V=1) (DBZ=0) (OVF=0)
230:    2.000000 /  1.375000 = 00011000  (1.500000) (r = 00000000) (V=0) (DBZ=0) (OVF=0)
350:    2.000000 /  1.375000 = 00010111  (1.437500) (r = 00000110) (V=1) (DBZ=0) (OVF=0)
360:    2.000000 /  0.000000 = 00010111  (1.437500) (r = 00000110) (V=0) (DBZ=1) (OVF=0)
490:    0.000000 /  2.000000 = 00010111  (1.437500) (r = 00000110) (V=0) (DBZ=0) (OVF=0)
610:    0.000000 /  2.000000 = 00000000  (0.000000) (r = 00000000) (V=1) (DBZ=0) (OVF=0)
620:    0.125000 /  2.000000 = 00000000  (0.000000) (r = 00000000) (V=0) (DBZ=0) (OVF=0)
740:    0.125000 /  2.000000 = 00000001  (0.062500) (r = 00000000) (V=1) (DBZ=0) (OVF=0)
750:    8.000000 /  0.250000 = 00000001  (0.062500) (r = 00000000) (V=0) (DBZ=0) (OVF=0)
830:    8.000000 /  0.250000 = 00000000  (0.000000) (r = 00000000) (V=0) (DBZ=0) (OVF=1)
880:   15.875000 /  2.000000 = 00000000  (0.000000) (r = 00000000) (V=0) (DBZ=0) (OVF=0)
1000:   15.875000 /  2.000000 = 01111111  (7.937500) (r = 00000000) (V=1) (DBZ=0) (OVF=0)
1010:    8.000000 /  9.000000 = 01111111  (7.937500) (r = 00000000) (V=0) (DBZ=0) (OVF=0)
1130:    8.000000 /  9.000000 = 00001110  (0.875000) (r = 00100000) (V=1) (DBZ=0) (OVF=0)
``````

Try experimenting with a wider range of test values and different fractional widths.

## Signed Numbers

I plan to add support for signed numbers during 2022. In the meantime you might like to check out Square Root or Sine & Cosine in Verilog.

Get in touch: GitHub Issues, 1BitSquared Discord, @WillFlux (Mastodon), @WillFlux (Twitter)