Division is a fundamental arithmetic operation; one we take for granted in most contexts. FPGAs are different; Verilog can’t synthesize division: we need to do it ourselves. In this how to, we’re going to look at a straightforward division algorithm for positive integers and fixed-point numbers. For integers, this method takes one cycle per bit: 32 cycles for 32-bit numbers. This post was last updated June 2021.

There are also posts on fixed-point numbers, square root, and sine & cosine.

*Get in touch: GitHub Issues, 1BitSquared Discord, @WillFlux (Mastodon), @WillFlux (Twitter)*

Sponsor My Work

If you like what I do, consider sponsoring me on GitHub.

I love FPGAs and want to help more people discover and use them in their projects.

My hardware designs are open source, and my blog is advert free.

### Source

The SystemVerilog designs featured in this post are available from the Project F Library under the open-source MIT licence: build on them to your heart’s content. The rest of the blog content is subject to standard copyright restrictions: don’t republish it without permission.

## Division Defined

Before we get to the design, it helps be familiar with some terminology.

When you divide **dividend** `X`

by **divisor** `Y`

you get **quotient** `Q`

and **remainder** `R`

:

`X = Y*Q + R`

Consider a trivial example: you have seven slices of apple pie to divide equally amongst three people. Each person gets two slices of pie, and there is one slice left over.

Less deliciously: given `X=7`

and `Y=3`

, then `Q=2`

and `R=1`

because `7 = 3*2 + 1`

.

## Long Division

The traditional way to divide numbers with pen and paper is long division. We move from left to right, trying to divide the shortest sequence of digits in the dividend by the divisor.

For example, let’s divide `512`

by `12`

. We find that `12`

is larger `5`

(the first digit of the dividend), so we next consider `51`

, which `12`

divides `4`

times, with `3`

left over. So, the first digit of our answer is 4.

Next, we take the leftover `3`

, and bring down the `2`

from `512`

, to make `32`

. `12`

fits into `32`

twice, so the second digit of our answer is `2`

with `8`

left over. We’ve considered all the digits of the dividend, so the quotient is `42`

and the remainder is `8`

.

This is easier to see when laid out as a calculation in columns:

```
X=512 Y=12
42 Quotient
————
12 )512
48 51: 12x4=48 + 3
——
32 Take the 3 and bring down the 2 down from 512
24 32: 12x2=24 + 8
——
8 Remainder
```

If this doesn’t seem clear, check out the Wikipedia page on long division and try doing a few calculations yourself. Nothing beats doing hands-on examples when it comes to maths.

### In Binary

For binary, we can follow the same long division process. For example, let’s divide `1110`

by `11`

(that’s 14 divided by 3 in decimal).

Our divisor `11`

is two digits long, so we can start by considering the first two digits of the dividend, which is also `11`

: we record a `1`

for the first digit of the quotient and move on.

The remaining digits are `10`

, which is smaller than `11`

, so we stop. We have a quotient of `100`

and a remainder of `10`

. This is easier to see when laid out in columns:

```
X=1110 Y=11
100 Quotient
————
11 )1110
11 11x1=11 + 0
——
010 Bring third 1 down from 1110
000 11 doesn't divide 10, so it's the remainder
———
10 Remainder
```

Doing binary division by hand is painful: each step is simple, but even moderately-sized numbers have a large number of digits, all of which are 1 and 0, so it’s easy to make a mistake. However, the very simplicity of the approach makes it straightforward to implement in Verilog.

Divide is typically slower than Multiply

Even high-end CPUs take their time with division. Intel Skylake has a latency of 42-95 cycles for signed 64-bit integer division, but only 3 cycles for multiplication. Source: agner.org.

## Algorithm Implementation

We’ll take our example from above: `X=1110`

and `Y=0011`

. We also need a register to accumulate (store) the intermediate calculations, and a register to record the quotient: we name them `A`

and `Q`

respectively.

There are four digits in the inputs, so we need four steps. For each step, we shift the left-most digit of the dividend `X`

into `A`

, then compare it with the divisor `Y`

. If `A`

is greater or equal to `Y`

, then we subtract `Y`

from `A`

and add 1 to the quotient `Q`

.

This is easiest to see by working through the example:

```
Inputs: X=1110 Y=0011
Step A X Q Description
——————————————————————————————————————————————————————————————————
0000 1110 0000 Starting values.
1 0001 1100 0000 Left shift X into A. Left shift Q.
Is A≥Y? No: next digit...
2 0011 1000 0000 Left shift X into A. Left shift Q.
Is A≥Y? Yes: update quotient...
0000 1000 0001 Subtract Y from A. Set Q[0]=1.
3 0001 0000 0010 Left shift X into A. Left shift Q.
Is A≥Y? No: next digit...
4 0010 0000 0100 Left shift X into A. Left shift Q.
Is A≥Y? No. Done.
——————————————————————————————————————————————————————————————————
Outputs: Q=0100 R=010
```

With this algorithm, the divisor `Y`

can only ever fit into the accumulator `A`

once at most, which is why we can simply subtract `Y`

from `A`

when `A≥Y`

.

We never use the same digit in `X`

and `Q`

at the same time, so it’s possible to combine those registers together. As a digit of the dividend is shifted out, a digit of the quotient is shifted in. We’ll do this in our Verilog module.

## Verilog Module

Our Verilog design uses the above algorithm, but adds a check for divide by zero and allows the width of the numbers to be configured.

Project F Library source: **[div_int.sv]**.

```
module div_int #(parameter WIDTH=4) (
input wire logic clk,
input wire logic start, // start signal
output logic busy, // calculation in progress
output logic valid, // quotient and remainder are valid
output logic dbz, // divide by zero flag
input wire logic [WIDTH-1:0] x, // dividend
input wire logic [WIDTH-1:0] y, // divisor
output logic [WIDTH-1:0] q, // quotient
output logic [WIDTH-1:0] r // remainder
);
logic [WIDTH-1:0] y1; // copy of divisor
logic [WIDTH-1:0] q1, q1_next; // intermediate quotient
logic [WIDTH:0] ac, ac_next; // accumulator (1 bit wider)
logic [$clog2(WIDTH)-1:0] i; // iteration counter
always_comb begin
if (ac >= {1'b0,y1}) begin
ac_next = ac - y1;
{ac_next, q1_next} = {ac_next[WIDTH-1:0], q1, 1'b1};
end else begin
{ac_next, q1_next} = {ac, q1} << 1;
end
end
always_ff @(posedge clk) begin
if (start) begin
valid <= 0;
i <= 0;
if (y == 0) begin // catch divide by zero
busy <= 0;
dbz <= 1;
end else begin // initialize values
busy <= 1;
dbz <= 0;
y1 <= y;
{ac, q1} <= {{WIDTH{1'b0}}, x, 1'b0};
end
end else if (busy) begin
if (i == WIDTH-1) begin // we're done
busy <= 0;
valid <= 1;
q <= q1_next;
r <= ac_next[WIDTH:1]; // undo final shift
end else begin // next iteration
i <= i + 1;
ac <= ac_next;
q1 <= q1_next;
end
end
end
endmodule
```

*ProTip: I’ve used SystemVerilog, but you can easily adapt logic, $clog2, and always* to Verilog.*

To use the module, set `WIDTH`

to the correct number of bits and the inputs `x`

and `y`

to dividend and divisor respectively. To begin the calculation set `start`

high for one clock. The `valid`

signal indicates when the output data is valid; you can then read the results from `q`

and `r`

. If you divide by zero, then `valid`

, `q`

, and `r`

will be zero, and the `dbz`

flag signal will be high. The `busy`

signal is high during calculation.

The Verilog itself is straightforward. The algorithm is in the `always_comb`

block, though we start with the initial shift already in place. The `always_ff`

block tests for division by zero, sets up the initial values, then runs the algorithm for the same number of iterations as the width of the numbers.

Accumulator Width

The accumulator needs to be 1 bit wider than the dividend because the remainder comes from unshifting the final`ac_next`

. For example, when we divide eight by nine, the remainder should be eight`4'b1000`

, but without the wider accumulator the left-most digit would be lost, and the remainder would appear to be`4'b0000`

.

### Testing

The following Vivado test bench exercises the module, including dividing by zero.

Project F Library source: **[xc7/div_int_tb.sv]**.

```
module div_int_tb();
parameter CLK_PERIOD = 10; // 10 ns == 100 MHz
parameter WIDTH = 4;
logic clk;
logic start; // start signal
logic busy; // calculation in progress
logic valid; // quotient and remainder are valid
logic dbz; // divide by zero flag
logic [WIDTH-1:0] x; // dividend
logic [WIDTH-1:0] y; // divisor
logic [WIDTH-1:0] q; // quotient
logic [WIDTH-1:0] r; // remainder
div_int #(.WIDTH(WIDTH)) div_int_inst (.*);
always #(CLK_PERIOD / 2) clk = ~clk;
initial begin
$monitor("\t%d:\t%d /%d =%d (r =%d) (V=%b) (DBZ=%b)",
$time, x, y, q, r, valid, dbz);
end
initial begin
clk = 1;
#100 x = 4'b0000; // 0
y = 4'b0010; // 2
start = 1;
#10 start = 0;
#50 x = 4'b0010; // 2
y = 4'b0000; // 0
start = 1;
#10 start = 0;
#50 x = 4'b0111; // 7
y = 4'b0010; // 2
start = 1;
#10 start = 0;
#50 x = 4'b1111; // 15
y = 4'b0101; // 5
start = 1;
#10 start = 0;
#50 x = 4'b0001; // 1
y = 4'b0001; // 1
start = 1;
#10 start = 0;
#50 x = 4'b1000; // 8
y = 4'b1001; // 9
start = 1;
#10 start = 0;
// ...
#50 $finish;
end
endmodule
```

The output looks like this:

```
0: x / x = x (r = x) (V=x) (DBZ=x)
100: 0 / 2 = x (r = x) (V=0) (DBZ=0)
140: 0 / 2 = 0 (r = 0) (V=1) (DBZ=0)
160: 2 / 0 = 0 (r = 0) (V=0) (DBZ=1)
220: 7 / 2 = 0 (r = 0) (V=0) (DBZ=0)
260: 7 / 2 = 3 (r = 1) (V=1) (DBZ=0)
280: 15 / 5 = 3 (r = 1) (V=0) (DBZ=0)
320: 15 / 5 = 3 (r = 0) (V=1) (DBZ=0)
340: 1 / 1 = 3 (r = 0) (V=0) (DBZ=0)
380: 1 / 1 = 1 (r = 0) (V=1) (DBZ=0)
400: 8 / 9 = 1 (r = 0) (V=0) (DBZ=0)
440: 8 / 9 = 0 (r = 8) (V=1) (DBZ=0)
```

*NB. I have tested with a wider range of values, but only show a few here for brevity.*

## Fixed-Point Support

In a previous recipe, we looked at Fixed Point Numbers in Verilog, but didn’t cover division, so let’s do that now. We have two changes to make:

- Increasing the number of iterations
- Handling overflow

To account for the fractional part of the number, we need to increase the number of iterations by the number of fractional bits.

*More explanation to follow…*

Project F Library source: **[div.sv]**.

```
module div #(
parameter WIDTH=4, // width of numbers in bits
parameter FBITS=0 // fractional bits (for fixed point)
) (
input wire logic clk,
input wire logic start, // start signal
output logic busy, // calculation in progress
output logic valid, // quotient and remainder are valid
output logic dbz, // divide by zero flag
output logic ovf, // overflow flag (fixed-point)
input wire logic [WIDTH-1:0] x, // dividend
input wire logic [WIDTH-1:0] y, // divisor
output logic [WIDTH-1:0] q, // quotient
output logic [WIDTH-1:0] r // remainder
);
// avoid negative vector width when fractional bits are not used
localparam FBITSW = (FBITS) ? FBITS : 1;
logic [WIDTH-1:0] y1; // copy of divisor
logic [WIDTH-1:0] q1, q1_next; // intermediate quotient
logic [WIDTH:0] ac, ac_next; // accumulator (1 bit wider)
localparam ITER = WIDTH+FBITS; // iterations are dividend width + fractional bits
logic [$clog2(ITER)-1:0] i; // iteration counter
always_comb begin
if (ac >= {1'b0,y1}) begin
ac_next = ac - y1;
{ac_next, q1_next} = {ac_next[WIDTH-1:0], q1, 1'b1};
end else begin
{ac_next, q1_next} = {ac, q1} << 1;
end
end
always_ff @(posedge clk) begin
if (start) begin
valid <= 0;
ovf <= 0;
i <= 0;
if (y == 0) begin // catch divide by zero
busy <= 0;
dbz <= 1;
end else begin
busy <= 1;
dbz <= 0;
y1 <= y;
{ac, q1} <= {{WIDTH{1'b0}}, x, 1'b0};
end
end else if (busy) begin
if (i == ITER-1) begin // done
busy <= 0;
valid <= 1;
q <= q1_next;
r <= ac_next[WIDTH:1]; // undo final shift
end else if (i == WIDTH-1 && q1_next[WIDTH-1:WIDTH-FBITSW]) begin // overflow?
busy <= 0;
ovf <= 1;
q <= 0;
r <= 0;
end else begin // next iteration
i <= i + 1;
ac <= ac_next;
q1 <= q1_next;
end
end
end
endmodule
```

A new Vivado test bench can exercise the module (a sample of tests shown for brevity).

Project F Library source: **[xc7/div_tb.sv]**.

```
module div_tb();
parameter CLK_PERIOD = 10; // 10 ns == 100 MHz
parameter WIDTH = 8;
parameter FBITS = 4;
parameter SF = 2.0**-4.0; // Q4.4 scaling factor is 2^-4
logic clk;
logic start; // start signal
logic busy; // calculation in progress
logic valid; // quotient and remainder are valid
logic dbz; // divide by zero flag
logic ovf; // overflow flag (fixed-point only)
logic [WIDTH-1:0] x; // dividend
logic [WIDTH-1:0] y; // divisor
logic [WIDTH-1:0] q; // quotient
logic [WIDTH-1:0] r; // remainder
div #(.WIDTH(WIDTH), .FBITS(FBITS)) div_inst (.*);
always #(CLK_PERIOD / 2) clk = ~clk;
initial begin
$monitor("\t%d:\t%f / %f = %b (%f) (r = %b) (V=%b) (DBZ=%b) (OVF=%b)",
$time, x*SF, y*SF, q, q*SF, r, valid, dbz, ovf);
end
initial begin
clk = 1;
#100 x = 8'b0011_0000; // 3.0
y = 8'b0010_0000; // 2.0
start = 1;
#10 start = 0;
#120 x = 8'b0010_0000; // 2.0
y = 8'b0001_0110; // 1.375 (the largest number that's ≤√2 in Q4.4)
start = 1;
#10 start = 0;
#120 x = 8'b0010_0000; // 2.0
y = 8'b0000_0000; // 0.0
start = 1;
#10 start = 0;
#120 x = 8'b0000_0000; // 0.0
y = 8'b0010_0000; // 2.0
start = 1;
#10 start = 0;
#120 x = 8'b0000_0010; // 0.125
y = 8'b0010_0000; // 2.0
start = 1;
#10 start = 0;
#120 x = 8'b1000_0000; // 8.0
y = 8'b0000_0100; // 0.25
start = 1;
#10 start = 0;
#120 x = 8'b1111_1110; // 15.875
y = 8'b0010_0000; // 2.0
start = 1;
#10 start = 0;
#120 x = 8'b1000_0000; // 8.0
y = 8'b1001_0000; // 9.0
start = 1;
#10 start = 0;
// ...
end
endmodule
```

The output looks like this:

```
0: 0.000000 / 0.000000 = xxxxxxxx (0.000000) (r = xxxxxxxx) (V=x) (DBZ=x) (OVF=x)
100: 3.000000 / 2.000000 = xxxxxxxx (0.000000) (r = xxxxxxxx) (V=0) (DBZ=0) (OVF=0)
220: 3.000000 / 2.000000 = 00011000 (1.500000) (r = 00000000) (V=1) (DBZ=0) (OVF=0)
230: 2.000000 / 1.375000 = 00011000 (1.500000) (r = 00000000) (V=0) (DBZ=0) (OVF=0)
350: 2.000000 / 1.375000 = 00010111 (1.437500) (r = 00000110) (V=1) (DBZ=0) (OVF=0)
360: 2.000000 / 0.000000 = 00010111 (1.437500) (r = 00000110) (V=0) (DBZ=1) (OVF=0)
490: 0.000000 / 2.000000 = 00010111 (1.437500) (r = 00000110) (V=0) (DBZ=0) (OVF=0)
610: 0.000000 / 2.000000 = 00000000 (0.000000) (r = 00000000) (V=1) (DBZ=0) (OVF=0)
620: 0.125000 / 2.000000 = 00000000 (0.000000) (r = 00000000) (V=0) (DBZ=0) (OVF=0)
740: 0.125000 / 2.000000 = 00000001 (0.062500) (r = 00000000) (V=1) (DBZ=0) (OVF=0)
750: 8.000000 / 0.250000 = 00000001 (0.062500) (r = 00000000) (V=0) (DBZ=0) (OVF=0)
830: 8.000000 / 0.250000 = 00000000 (0.000000) (r = 00000000) (V=0) (DBZ=0) (OVF=1)
880: 15.875000 / 2.000000 = 00000000 (0.000000) (r = 00000000) (V=0) (DBZ=0) (OVF=0)
1000: 15.875000 / 2.000000 = 01111111 (7.937500) (r = 00000000) (V=1) (DBZ=0) (OVF=0)
1010: 8.000000 / 9.000000 = 01111111 (7.937500) (r = 00000000) (V=0) (DBZ=0) (OVF=0)
1130: 8.000000 / 9.000000 = 00001110 (0.875000) (r = 00100000) (V=1) (DBZ=0) (OVF=0)
```

Try experimenting with a wider range of test values and different fractional widths.

## Signed Numbers

I plan to add support for signed numbers during 2022. In the meantime you might like to check out Square Root or Sine & Cosine in Verilog.

*Get in touch: GitHub Issues, 1BitSquared Discord, @WillFlux (Mastodon), @WillFlux (Twitter)*