# Division in Verilog

Division is a fundamental arithmetic operation we take for granted. FPGAs include dedicated hardware to perform addition, subtraction, and multiplication and will infer the necessary logic. Division is different: we need to do it ourselves. This post looks at a straightforward division algorithm for positive integers before extending it to cover fixed-point numbers and signed numbers.

New to Verilog maths? Check out my introduction to Numbers in Verilog.

Share your thoughts with @WillFlux on Mastodon or Twitter. If you like what I do, sponsor me. 🙏

### Series Outline

- Numbers in Verilog - introduction to numbers in Verilog
- Vectors and Arrays - working with Verilog vectors and arrays
- Multiplication with FPGA DSPs - efficient multiplication with DSPs
- Fixed-Point Numbers in Verilog - precision without complexity
- Division in Verilog (this post) - divided we stand
*More maths to follow*

## Division Defined

Before we get to the design, it helps be familiar with some terminology.

When you divide **dividend** `A`

by **divisor** `B`

you get **quotient** `Q`

and **remainder** `R`

:

`A = B*Q + R`

Consider a trivial example: you have seven slices of apple pie to divide equally amongst three people. Each person gets two slices of pie, and one slice remains.

Less deliciously: given `A=7`

and `B=3`

, then `Q=2`

and `R=1`

because `7 = 3*2 + 1`

.

## Long Division

The traditional way to divide numbers with pen and paper is long division. We move from left to right, trying to divide the shortest sequence of digits in the dividend by the divisor.

For example, let’s divide `512`

by `12`

. We find that `12`

is larger `5`

(the first digit of the dividend), so we next consider `51`

, which `12`

divides `4`

times, with `3`

left over. So, the first digit of our answer is 4.

Next, we take the leftover `3`

and bring down the `2`

from `512`

to make `32`

. `12`

fits into `32`

twice, so the second digit of our answer is `2`

with `8`

left over. We’ve considered all the digits of the dividend, so the quotient is `42`

and the remainder is `8`

.

This is easier to see when laid out as a calculation in columns:

```
A=512 B=12
42 Quotient
————
12 )512
48 51: 12x4=48 + 3
——
32 Take the 3 and bring down the 2 down from 512
24 32: 12x2=24 + 8
——
8 Remainder
```

If this doesn’t seem clear, check out the Wikipedia page on long division and try doing a few calculations yourself. Nothing beats doing hands-on examples when it comes to maths. On the other hand, you don’t need to be able to do long division to use these designs, so feel free to move on. :)

### In Binary

For binary, we can follow the same long division process. For example, let’s divide `1110`

by `11`

(in decimal: 14 divided by 3).

Our divisor `11`

is two digits long, so we can start by considering the first two digits of the dividend, which is also `11`

: we record a `1`

for the first digit of the quotient and move on.

The remaining digits are `10`

, which is smaller than `11`

, so we stop. We have a quotient of `100`

and a remainder of `10`

. This is easier to see when laid out in columns:

```
A=1110 B=11
100 Quotient
————
11 )1110
11 11x1=11 + 0
——
010 Bring the third 1 down from 1110
000 11 doesn't divide 10, so it's the remainder
———
10 Remainder
```

Doing binary division by hand is painful: each step is simple, but even moderately-sized numbers have many digits, all of which are 1 and 0, so it’s easy to make a mistake. However, the very simplicity of the approach makes it straightforward to implement in Verilog.

Slow Divide

Even contemporary CPUs take their time with division. Intel Skylake has a latency of 42-95 cycles for signed 64-bit integer division but only 3 cycles for multiplication. Source: agner.org.

## Algorithm Implementation

We’ll take our example from above: `A=1110`

and `B=0011`

. We also need a register to accumulate (store) the intermediate calculations and a register to record the quotient: we name them `ACC`

and `QUO`

, respectively.

There are four digits in the inputs, so we need four steps. For each step, we shift the left-most digit of the dividend `A`

into `ACC`

, then compare it with the divisor `B`

. If `ACC`

is greater or equal to `B`

, then we subtract `B`

from `ACC`

and add 1 to the quotient `QUO`

.

This is easiest to see by working through the example:

```
Inputs: A=1110 B=0011
Step ACC A QUO Description
——————————————————————————————————————————————————————————————————————
0000 1110 0000 Starting values.
1 0001 1100 0000 Left shift A into ACC. Left shift QUO.
Is ACC≥B? No. Next digit...
2 0011 1000 0000 Left shift A into ACC. Left shift QUO.
Is ACC≥B? Yes. Update quotient...
0000 1000 0001 Subtract B from ACC. Set QUO[0]=1.
3 0001 0000 0010 Left shift A into ACC. Left shift QUO.
Is ACC≥B? No. Next digit...
4 0010 0000 0100 Left shift A into ACC. Left shift QUO.
Is ACC≥B? No. Done.
——————————————————————————————————————————————————————————————————————
The resulting quotient is 0100, and the remainder is 0010.
```

With this algorithm, the divisor `B`

can only ever fit into the accumulator `ACC`

once at most, which is why we can simply subtract `B`

from `ACC`

when `ACC≥B`

.

We never use the same digit in `A`

and `QUO`

simultaneously, so it’s possible to combine those registers. As a digit of the dividend is shifted out, a digit of the quotient is shifted in. We’ll do this in our Verilog module.

## Verilog Module

Our first Verilog design uses the above algorithm but adds a check for dividing by zero and allows the width of the numbers to be configured. This method takes one cycle per bit: 32 cycles for 32-bit numbers.

Unsigned integer division - **divu_int.sv**:

```
module divu_int #(parameter WIDTH=5) ( // width of numbers in bits
input wire logic clk, // clock
input wire logic rst, // reset
input wire logic start, // start calculation
output logic busy, // calculation in progress
output logic done, // calculation is complete (high for one tick)
output logic valid, // result is valid
output logic dbz, // divide by zero
input wire logic [WIDTH-1:0] a, // dividend (numerator)
input wire logic [WIDTH-1:0] b, // divisor (denominator)
output logic [WIDTH-1:0] val, // result value: quotient
output logic [WIDTH-1:0] rem // result: remainder
);
logic [WIDTH-1:0] b1; // copy of divisor
logic [WIDTH-1:0] quo, quo_next; // intermediate quotient
logic [WIDTH:0] acc, acc_next; // accumulator (1 bit wider)
logic [$clog2(WIDTH)-1:0] i; // iteration counter
// division algorithm iteration
always_comb begin
if (acc >= {1'b0, b1}) begin
acc_next = acc - b1;
{acc_next, quo_next} = {acc_next[WIDTH-1:0], quo, 1'b1};
end else begin
{acc_next, quo_next} = {acc, quo} << 1;
end
end
// calculation control
always_ff @(posedge clk) begin
done <= 0;
if (start) begin
valid <= 0;
i <= 0;
if (b == 0) begin // catch divide by zero
busy <= 0;
done <= 1;
dbz <= 1;
end else begin
busy <= 1;
dbz <= 0;
b1 <= b;
{acc, quo} <= {{WIDTH{1'b0}}, a, 1'b0}; // initialize calculation
end
end else if (busy) begin
if (i == WIDTH-1) begin // we're done
busy <= 0;
done <= 1;
valid <= 1;
val <= quo_next;
rem <= acc_next[WIDTH:1]; // undo final shift
end else begin // next iteration
i <= i + 1;
acc <= acc_next;
quo <= quo_next;
end
end
if (rst) begin
busy <= 0;
done <= 0;
valid <= 0;
dbz <= 0;
val <= 0;
rem <= 0;
end
end
endmodule
```

To use the module, set `WIDTH`

to the correct number of bits and the inputs `a`

and `b`

to dividend and divisor, respectively. To begin the calculation set `start`

high for one clock.

The `valid`

signal indicates when the output data is valid; you can then read the results from `val`

and `r`

. If you divide by zero, then `valid`

will be zero, and the `dbz`

flag signal will be high. The `busy`

signal is high during calculation.

The Verilog itself is straightforward. The algorithm iteration is in the `always_comb`

block. The `always_ff`

block tests for division by zero, sets up the initial values, and then runs the algorithm for the same number of iterations as the width of the numbers.

Division might seem slow at one cycle per bit, but it uses little logic, so you can quickly improve throughput by adding additional instances.

Accumulator Width

The accumulator needs to be 1 bit wider than the dividend because the remainder comes from unshifting the final`acc_next`

. For example, if we divide eight by nine using four-bit numbers, the remainder should be eight`4'b1000`

, but without the wider accumulator, the left-most digit would be lost, and the remainder would appear to be`4'b0000`

.

### Testing Division

For 2023, I’ve started testing my Verilog library modules with cocotb and Icarus Verilog. You can find tests for maths lib modules in lib/maths/test. I plan to write up my experience with cocotb, but for now, you can run these tests by installing:

- cocotb - Test bench tool for Verilog and VHDL written in Python
- Icarus Verilog - Verilog simulation tool
- spfpm - fixed-point Python module
- pytest - Python test framework (optional)
- GTKWave - waveform viewer (optional)

On **Debian & Ubuntu**:

```
apt install make python3 python3-pip iverilog gtkwave
pip install cocotb pytest spfpm
```

On **macOS** with brew:

```
brew install python icarus-verilog gtkwave
pip3 install cocotb pytest spfpm
```

Once you have the tools installed, you can **run tests** with the included Makefile:

```
cd lib/maths/test
make divu_int
```

Example output for `97/13`

(test 8 of 14):

```
111.01ns INFO cocotb.regression running rem_3 (8/14)
Test 97/13
126.01ns INFO cocotb.divu_int dut a: 01100001
126.01ns INFO cocotb.divu_int dut b: 00001101
126.01ns INFO cocotb.divu_int dut val: 00000111
126.01ns INFO cocotb.divu_int dut rem: 00000110
126.01ns INFO cocotb.divu_int model val: 00000111
126.01ns INFO cocotb.divu_int model rem: 00000110
127.01ns INFO cocotb.regression rem_3 passed
```

*‘dut’ is from the device under test (our Verilog module). ‘model’ is from Python.*

Our test bench exercises our maths by comparing it with the same calculation performed in Python. We use **spfpm** to convert the Python result into fixed-point with the correct precision (eight bits for this example). The test bench also tests for the correct operation of flags, such as `done`

and `dbz`

(divide by zero).

It’s straightforward to add and edit tests in the Python module test/divu_int.py.

## Fixed-Point Support

In a previous part, we looked at Fixed Point Numbers in Verilog, but didn’t cover division, so let’s do that now. We have two changes to make to our division module:

- Divide the Remainder
- Handle overflow

Accounting for the fractional part of the number, we need to increase the number of iterations to divide the remainder. For example, if we have an 8-bit number with 4 fractional bits, we must perform 12 iterations. We pass the number of fractional bits as a new parameter: `FBITS`

.

By supporting fixed-point numbers, we can divide by numbers less than one, which means our result can overflow. For example, if you divide 6 by 0.25, the result is 24, which requires five bits to store: `11000`

; if we only have four integer bits, we can’t handle this. We check for overflow and report it with the `ovf`

signal.

Unsigned fixed-point division - **divu.sv**:

```
module divu #(
parameter WIDTH=8, // width of numbers in bits (integer and fractional)
parameter FBITS=4 // fractional bits within WIDTH
) (
input wire logic clk, // clock
input wire logic rst, // reset
input wire logic start, // start calculation
output logic busy, // calculation in progress
output logic done, // calculation is complete (high for one tick)
output logic valid, // result is valid
output logic dbz, // divide by zero
output logic ovf, // overflow
input wire logic [WIDTH-1:0] a, // dividend (numerator)
input wire logic [WIDTH-1:0] b, // divisor (denominator)
output logic [WIDTH-1:0] val // result value: quotient
);
localparam FBITSW = (FBITS == 0) ? 1 : FBITS; // avoid negative vector width when FBITS=0
logic [WIDTH-1:0] b1; // copy of divisor
logic [WIDTH-1:0] quo, quo_next; // intermediate quotient
logic [WIDTH:0] acc, acc_next; // accumulator (1 bit wider)
localparam ITER = WIDTH + FBITS; // iteration count: unsigned input width + fractional bits
logic [$clog2(ITER)-1:0] i; // iteration counter
// division algorithm iteration
always_comb begin
if (acc >= {1'b0, b1}) begin
acc_next = acc - b1;
{acc_next, quo_next} = {acc_next[WIDTH-1:0], quo, 1'b1};
end else begin
{acc_next, quo_next} = {acc, quo} << 1;
end
end
// calculation control
always_ff @(posedge clk) begin
done <= 0;
if (start) begin
valid <= 0;
ovf <= 0;
i <= 0;
if (b == 0) begin // catch divide by zero
busy <= 0;
done <= 1;
dbz <= 1;
end else begin
busy <= 1;
dbz <= 0;
b1 <= b;
{acc, quo} <= {{WIDTH{1'b0}}, a, 1'b0}; // initialize calculation
end
end else if (busy) begin
if (i == ITER-1) begin // done
busy <= 0;
done <= 1;
valid <= 1;
val <= quo_next;
end else if (i == WIDTH-1 && quo_next[WIDTH-1:WIDTH-FBITSW] != 0) begin // overflow?
busy <= 0;
done <= 1;
ovf <= 1;
val <= 0;
end else begin // next iteration
i <= i + 1;
acc <= acc_next;
quo <= quo_next;
end
end
if (rst) begin
busy <= 0;
done <= 0;
valid <= 0;
dbz <= 0;
ovf <= 0;
val <= 0;
end
end
endmodule
```

Notice how the *division algorithm iteration* remains unchanged from the integer version.

### Fractional Testing

Naturally, we have a test bench for fixed-point division:

```
cd lib/maths/test
make divu
```

Example output for `8/9`

(test 6 of 22):

```
99.00ns INFO cocotb.regression running simple_6 (6/22)
Test 8/9
118.00ns INFO cocotb.divu dut a: 10000000
118.00ns INFO cocotb.divu dut b: 10010000
118.00ns INFO cocotb.divu dut val: 00001110
118.00ns INFO cocotb.divu 0.875
118.00ns INFO cocotb.divu model val: 00000.1110
118.00ns INFO cocotb.divu 0.875
119.01ns INFO cocotb.regression simple_6 passed
```

Our test bench includes tests for overflow and dividing by zero.

You’ll see some tests “failed as expected” because `divu`

doesn’t handle rounding in the same way as Python. We’ll fix this in the next section.

```
197.01ns INFO cocotb.regression running round_5 (11/22)
Test 13/7
216.01ns INFO cocotb.divu dut a: 11010000
216.01ns INFO cocotb.divu dut b: 01110000
216.01ns INFO cocotb.divu dut val: 00011101
216.01ns INFO cocotb.divu 1.8125
216.01ns INFO cocotb.divu model val: 00001.1110
216.01ns INFO cocotb.divu 1.875
216.01ns INFO cocotb.regression round_5 passed: failed as expected...
```

## Signed Numbers

Our division algorithm doesn’t work with signed numbers, but the solution is straightforward: perform the division on the inputs’ absolute value, then adjust the sign afterwards. Because we now have more steps, we switch to a finite state machine (FSM) to control the calculation.

Signed division with Gaussian rounding - **div.sv**:

```
module div #(
parameter WIDTH=8, // width of numbers in bits (integer and fractional)
parameter FBITS=4 // fractional bits within WIDTH
) (
input wire logic clk, // clock
input wire logic rst, // reset
input wire logic start, // start calculation
output logic busy, // calculation in progress
output logic done, // calculation is complete (high for one tick)
output logic valid, // result is valid
output logic dbz, // divide by zero
output logic ovf, // overflow
input wire logic signed [WIDTH-1:0] a, // dividend (numerator)
input wire logic signed [WIDTH-1:0] b, // divisor (denominator)
output logic signed [WIDTH-1:0] val // result value: quotient
);
localparam WIDTHU = WIDTH - 1; // unsigned widths are 1 bit narrower
localparam FBITSW = (FBITS == 0) ? 1 : FBITS; // avoid negative vector width when FBITS=0
localparam SMALLEST = {1'b1, {WIDTHU{1'b0}}}; // smallest negative number
localparam ITER = WIDTHU + FBITS; // iteration count: unsigned input width + fractional bits
logic [$clog2(ITER):0] i; // iteration counter (allow ITER+1 iterations for rounding)
logic a_sig, b_sig, sig_diff; // signs of inputs and whether different
logic [WIDTHU-1:0] au, bu; // absolute version of inputs (unsigned)
logic [WIDTHU-1:0] quo, quo_next; // intermediate quotients (unsigned)
logic [WIDTHU:0] acc, acc_next; // accumulator (unsigned but 1 bit wider)
// input signs
always_comb begin
a_sig = a[WIDTH-1+:1];
b_sig = b[WIDTH-1+:1];
end
// division algorithm iteration
always_comb begin
if (acc >= {1'b0, bu}) begin
acc_next = acc - bu;
{acc_next, quo_next} = {acc_next[WIDTHU-1:0], quo, 1'b1};
end else begin
{acc_next, quo_next} = {acc, quo} << 1;
end
end
// calculation state machine
enum {IDLE, INIT, CALC, ROUND, SIGN} state;
always_ff @(posedge clk) begin
done <= 0;
case (state)
INIT: begin
state <= CALC;
ovf <= 0;
i <= 0;
{acc, quo} <= {{WIDTHU{1'b0}}, au, 1'b0}; // initialize calculation
end
CALC: begin
if (i == WIDTHU-1 && quo_next[WIDTHU-1:WIDTHU-FBITSW] != 0) begin // overflow
state <= IDLE;
busy <= 0;
done <= 1;
ovf <= 1;
end else begin
if (i == ITER-1) state <= ROUND; // calculation complete after next iteration
i <= i + 1;
acc <= acc_next;
quo <= quo_next;
end
end
ROUND: begin // Gaussian rounding
state <= SIGN;
if (quo_next[0] == 1'b1) begin // next digit is 1, so consider rounding
// round up if quotient is odd or remainder is non-zero
if (quo[0] == 1'b1 || acc_next[WIDTHU:1] != 0) quo <= quo + 1;
end
end
SIGN: begin // adjust quotient sign if non-zero and input signs differ
state <= IDLE;
if (quo != 0) val <= (sig_diff) ? {1'b1, -quo} : {1'b0, quo};
busy <= 0;
done <= 1;
valid <= 1;
end
default: begin // IDLE
if (start) begin
valid <= 0;
if (b == 0) begin // divide by zero
state <= IDLE;
busy <= 0;
done <= 1;
dbz <= 1;
ovf <= 0;
end else if (a == SMALLEST || b == SMALLEST) begin // overflow
state <= IDLE;
busy <= 0;
done <= 1;
dbz <= 0;
ovf <= 1;
end else begin
state <= INIT;
au <= (a_sig) ? -a[WIDTHU-1:0] : a[WIDTHU-1:0]; // register abs(a)
bu <= (b_sig) ? -b[WIDTHU-1:0] : b[WIDTHU-1:0]; // register abs(b)
sig_diff <= (a_sig ^ b_sig); // register input sign difference
busy <= 1;
dbz <= 0;
ovf <= 0;
end
end
end
endcase
if (rst) begin
state <= IDLE;
busy <= 0;
done <= 0;
valid <= 0;
dbz <= 0;
ovf <= 0;
val <= 0;
end
end
endmodule
```

We create unsigned versions of the inputs as `au`

and `bu`

and register the difference in signs as `sig_diff`

. After the calculation, we use `sig_diff`

to set the correct sign for the result, being careful not to change zero.

### Rounding

When working with fixed-point numbers, we need to consider rounding. For example, the square root of two is: `√2 ≈ 1.41421...`

with a finite number of bits, we need to decide how to approximate it.

The `divu`

module truncates its answers, which is simple to implement, but doesn’t produce the nearest number to the actual value. For example, `13/7 ≈ 1.85714...`

but `divu`

produces `1.8125`

rather than `1.875`

.

Alas, there is not one-true-way to round numbers, but the default for IEEE 754 floating point numbers (also used in Python) is **Gaussian rounding**, also known as bankers’ rounding or “rounding half to even”. In this system, 1.5 is rounded up to 2 and 2.5 is rounded down to 2, while 3.5 is rounded up to 4 etc. Gaussian rounding avoids the positive bias that would occur if we always rounded half up.

To implement Gaussian rounding, we need to calculate the next bit of the result and look at the remainder, which is what we do in the ROUND step:

```
ROUND: begin // Gaussian rounding
state <= SIGN;
if (quo_next[0] == 1'b1) begin // next digit is 1, so consider rounding
// round up if quotient is odd or remainder is non-zero
if (quo[0] == 1'b1 || acc_next[WIDTHU:1] != 0) quo <= quo + 1;
end
end
```

With the addition of this rounding step, our module now matches the Python model.

### Testing the Final Design

Test bench for signed fixed-point division:

```
cd lib/maths/test
make div
```

Example output for `-7.0625/2`

(test 17 of 32):

```
367.02ns INFO cocotb.regression running round_8 (17/32)
Test -7.0625/2
389.02ns INFO cocotb.div dut a: 110001111
389.02ns INFO cocotb.div dut b: 000100000
389.02ns INFO cocotb.div dut val: 111001000
389.02ns INFO cocotb.div -3.5
389.02ns INFO cocotb.div model val: 111100.1000
389.02ns INFO cocotb.div -3.5
390.02ns INFO cocotb.regression round_8 passed
```

## What’s Next?

Check out some other maths posts: square root and sine & cosine.

*Apple Pie photo by Shisma under Creative Commons Attribution licence.*