Sometimes you need more precision than integers can provide, but floating-point computation is not trivial (try reading IEEE 754). You could use a library or IP block, but simple fixed point maths can often get the job done with little effort. Furthermore, most FPGAs have dedicated DSP blocks that make multiplication and addition of integers fast; we can take advantage of that with a fixed-point approach.

This post is part of a series of handy recipes to solve common FPGA development problems. There is also a post on Division in Verilog.

*Revised 2020-07-01. Feedback to @WillFlux is most welcome.*

## What is a Fixed Point Number?

In a regular binary integer, the bits represent powers of two, with the least significant bit being `1`.

For example, decimal `13` is `1101` in binary: `8 + 4 + 1 = 13`.

With decimal numbers, we’re used to the idea of using a decimal separator, a point or comma, to separate integer and fractional parts. 1001 is one thousand and one, whereas 10.01 is ten and one hundredth.

We can do the same thing in binary and use the bits to represent any powers of two we like.

For example, we can think of `4.75` as being `4 + 1/2 + 1/4`.

In binary terms this can be visualized as:

```
8   4   2   1   1/2 1/4 1/8 1/16
^   ^   ^   ^   ^   ^   ^   ^
0   1   0   0   1   1   0   0
```

Or, with a handy point to mark the fractional part: `0100.1100`.

We’re choosing to interpret the value as being fixed point, but from an FPGA logic point of view, it’s just an 8-bit integer: provided we are consistent in the position of the fixed point we’ll get the expected result from mathematical operations. See later in this post if you have existing integers you need to convert to fixed point.

### Q Notation

To express the number of integer and fractional bits we use Q number format: **Qi.f** where **i** is the number of integer bits and **f** is the number of fractional bits. `0100.1100` has four integer and four fractional bits, so is Q4.4. We’ll mostly use Q4.4 in this post to keep the examples manageable.

*ProTip: Some sources include the sign bit in i, but others do not. In our examples, we include the sign bit: thus Q4.4 is from -8 to 7.9375 (7 + 15/16): we discuss range later in this post.*

## Maths Just Works!

All the usual binary maths work when used with fixed-point numbers. Verilog can generally synthesize addition, subtraction, and multiplication on an FPGA. We cannot synthesize division automatically, but we can multiply by fractional numbers, e.g. multiply by 0.1 instead of dividing by 10.

### Addition & Subtraction

Addition works in exactly the same way as for integers:

```
  0011.1010     3.6250
+ 0100.0001   + 4.0625
= 0111.1011   = 7.6875
```

We can also do subtraction using two’s complement to express the negative number:

```
  0011.1010     3.6250
+ 1110.1000   - 1.5000
= 0010.0010   = 2.1250
```

How did we get the two’s complement for -1.5?

```
1.5 = 1 + 1/2 = 0001.1000

Start:  0001.1000 (1.5)
Invert: 1110.0111
Add 1:  0000.0001
Result: 1110.1000 (-1.5)
```
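As a minimal sketch of the invert-and-add-one steps above, the following module negates a Q4.4 value. The module name `neg_q44` and its ports are hypothetical, made up for illustration; Verilog’s unary minus on a signed value should produce the same bits.

```verilog
// Hypothetical helper (not from the original post): negate a Q4.4 value
// using the two's complement recipe: invert every bit, then add one LSB.
module neg_q44 (
    input  wire signed [7:0] a,  // Q4.4 input
    output wire signed [7:0] n   // Q4.4 output, -a
);
    assign n = ~a + 8'sd1;  // same bit pattern as writing -a
endmodule
```

One caveat: -8 (`1000.0000`) has no positive counterpart in Q4.4, so negating it gives -8 again.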

### Multiplication

Multiplication works as expected too, but the product contains twice the number of bits. Multiplying two Q4.4 numbers results in a Q8.8 product:

```
  0011.0100     3.2500
x 0010.0001     2.0625
```

You can do the long multiplication to produce:

```
00000110.10110100 = 6.703125
```

We can convert this back to our original Q4.4 format by taking the eight middle bits, which truncates the result:

```
0000[0110.1011]0100 = 0110.1011 = 6.6875
```

You can see we lost some precision there; this often happens when you convert to the original precision after multiplication. If our result had been 8 or higher, it would have overflowed our Q4.4 format because the largest integer we can represent with four bits in two’s complement is 7. We discuss range and precision in more detail below.
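The middle-bits step can be sketched as a small module. This is an illustrative sketch, not code from the original post: the module name `mul_q44`, its ports, and the optional round-to-nearest output are assumptions added here. Rounding adds half of a Q4.4 least significant bit (8 in Q8.8 terms) before taking the same middle bits.

```verilog
// Hypothetical module: Q4.4 x Q4.4 multiply returning Q4.4 results.
// Names are illustrative only. No overflow checking is done here.
module mul_q44 (
    input  wire signed [7:0] a,        // Q4.4 operand
    input  wire signed [7:0] b,        // Q4.4 operand
    output wire signed [7:0] p_trunc,  // product truncated to Q4.4
    output wire signed [7:0] p_round   // product rounded to nearest Q4.4
);
    wire signed [15:0] ab      = a * b;       // full Q8.8 product
    wire signed [15:0] ab_half = ab + 16'sd8; // add half a Q4.4 LSB (1/32)

    assign p_trunc = ab[11:4];       // drop the low four fractional bits
    assign p_round = ab_half[11:4];  // same slice after adding half an LSB
endmodule
```

For 3.25 x 2.0625 both outputs are `0110.1011`, because the discarded bits (`0100`) are below half an LSB. For 1.1875 x 1.5 (exactly 1.78125) truncation gives 1.75 while rounding gives 1.8125.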

### Division

Division is less straightforward. You can’t synthesize it with regular Verilog unless it’s by a power of two, which uses a right shift. A right shift truncates the result, so we lose the fractional part:

```
0111    7
>>1     right-shift 1 bit
0011    3
```

However, with fractional numbers, we can do accurate division by a constant using multiplication.

For example, to divide by 2 we multiply by 0.5:

```
  0111.1000             7.5000
x 0000.1000             0.5000
= 00000011.11000000     3.7500
```

Back in Q4.4: `0011.1100`. In this case, our answer is exact.

To be able to divide arbitrary fixed-point numbers see: Division in Verilog.
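When the constant divisor isn’t a power of two, its reciprocal usually can’t be represented exactly, so the result is an approximation. The sketch below assumes a Q8.8 format (more fractional bits than our usual Q4.4) and approximates division by 10; the module name `div10_q88`, its ports, and the constant `RECIP_10` are hypothetical.

```verilog
// Hypothetical module: approximate x / 10 by multiplying by 0.1 in Q8.8.
// Names and the choice of Q8.8 are assumptions for illustration.
module div10_q88 (
    input  wire signed [15:0] x,  // Q8.8 dividend
    output wire signed [15:0] q   // Q8.8 quotient, approximately x / 10
);
    // 0.1 * 2^8 = 25.6, rounded to 26, so the constant is really 0.1015625
    localparam signed [15:0] RECIP_10 = 16'sd26;

    wire signed [31:0] prod = x * RECIP_10;  // Q16.16 product

    assign q = prod[23:8];  // middle 16 bits are the Q8.8 result
endmodule
```

For an input of 50.0 this gives 5.078125 rather than 5, an error of about 1.6%; using more fractional bits for the reciprocal shrinks the error.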

## In Verilog

You can’t use a point ‘.’ in Verilog binary literals, but you can use an underscore. The following module demonstrates all the calculations we performed above:

```
module fixedtest();
    reg signed [7:0] a, b, c;
    reg signed [15:0] ab;  // large enough for product
    localparam sf = 2.0**-4.0;  // Q4.4 scaling factor is 2^-4

    initial begin
        $display("Fixed Point Examples from projectf.io.");

        a = 8'b0011_1010;  // 3.6250
        b = 8'b0100_0001;  // 4.0625
        c = a + b;         // 0111.1011 = 7.6875
        $display("%f + %f = %f", $itor(a)*sf, $itor(b)*sf, $itor(c)*sf);

        a = 8'b0011_1010;  // 3.6250
        b = 8'b1110_1000;  // -1.5000
        c = a + b;         // 0010.0010 = 2.1250
        $display("%f + %f = %f", $itor(a)*sf, $itor(b)*sf, $itor(c)*sf);

        a = 8'b0011_0100;  // 3.2500
        b = 8'b0010_0001;  // 2.0625
        ab = a * b;        // 00000110.10110100 = 6.703125
        c = ab[11:4];      // take middle 8 bits: 0110.1011 = 6.6875
        $display("%f x %f = %f", $itor(a)*sf, $itor(b)*sf, $itor(c)*sf);

        a = 8'b0111_1000;  // 7.5000
        b = 8'b0000_1000;  // 0.5000
        ab = a * b;        // 00000011.11000000 = 3.7500
        c = ab[11:4];      // take middle 8 bits: 0011.1100 = 3.7500
        $display("%f x %f = %f", $itor(a)*sf, $itor(b)*sf, $itor(c)*sf);
    end
endmodule
```

*NB. My syntax highlighter doesn’t like underscore in numbers, but they do work in actual tools.*

The **$itor** function converts an integer into a real number we can display. We divide our results by 2^4 to account for the lower four bits being fractional. For example, `00101000` would normally be interpreted as 40 (32+8), but we want it to be 2.5 (2 + 1/2). If we divide 40 by 2^4 = 16 we get the desired 2.5.

### Large Numbers, Small Numbers

If the numbers you’re using are very small or large, you can improve your accuracy by scaling them so that they fit comfortably into the range of your fixed-point numbers.

Let’s say we want to perform the following: `3/256 x 7`.

The smallest fraction we can represent in Q4.4 is 1/16, so 3/256 can’t be expressed. Rather than use a larger precision we can scale the smaller number up. If we multiply 3/256 by 2^6 we get 0.75, which comfortably fits within our precision. Because we’ve multiplied one of our numbers by 2^6, we need to fold this into the scaling factor, making it 2^-10, i.e. we divide the result by 2^10 (that’s 4 bits for our Q4.4 format and 6 bits for adjusted scaling):

```
a = 8'b0000_1100; // 0.7500 (0.75 = 3/256 x 2^6)
b = 8'b0111_0000; // 7.0000
ab = a * b; // 00000101.01000000
c = ab[11:4]; // take middle 8 bits: 0101.0100
$display("%f", $itor(c)*2.0**-10.0); // divide result by 2^10
```

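We get the result 0.082031, which is the exact value 0.08203125 displayed to the six decimal places that `%f` prints. When a scaled result doesn’t fit exactly in the available bits, improving the accuracy further would require using more bits.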

## Range & Precision

You can get a wider range by using more integer bits and better precision by using more fractional bits. However, using more bits increases the amount of FPGA logic required. Consider what your problem requires and the capabilities of your FPGA.

### Q4.4

Using two’s complement, a 4-bit value ranges from -8 (`1000`) to +7 (`0111`).
A 4-bit fraction can represent numbers as small as 1/16 (`0001`) and as large as 15/16 (`1111`).

```
Range: -8 to 7.9375 (7 + 15/16)
Precision: 0.0625 (1/16)
```

### Q16.16

Using two’s complement, a 16-bit value ranges from -32,768 to +32,767. A 16-bit fraction can represent numbers as small as 1/65,536 and as large as 65,535/65,536.

```
Range: -32,768 to 32,767.9999847...
Precision: 0.0000152...
```

According to Wikipedia, **Doom** used a Q16.16 representation:

“…for all of its non-integer computations, including map system, geometry, rendering, player movement etc. This was done in order for the game to be playable on 386 and 486SX CPUs without an FPU. For compatibility reasons, this representation is still used in modern Doom source ports.”

### Overflow

All results so far have fitted in our Q4.4 representation, but this doesn’t always happen. Let’s consider what happens when we multiply 6.5 by 4:

```
  0110.1000             6.5000
x 0100.0000             4.0000
  00011010.00000000    26.0000
```

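If we take the eight middle bits, we get 1010.0000, which is -6. Our Q4.4 representation can’t hold 26.
For a Q4.4 product to fit, the first four bits of the Q8.8 value must all match the sign bit of the middle eight: all `0` for a positive result, all `1` for a negative one. Anything else means we’ve overflowed. For example, 4 x 2 gives `00001000.00000000`: the first four bits are zero, but the middle eight bits read as -8. You can also check for a loss of precision by looking for any 1s in the four least significant bits.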
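One way to sketch these checks in Verilog is shown below. The module name `mul_q44_ovf` and its ports are hypothetical. Because the operands are signed, this sketch treats the product as fitting when the top five bits of the Q8.8 value (the four discarded bits plus the Q4.4 sign bit) are all `0` or all `1`.

```verilog
// Hypothetical module: Q4.4 multiply with overflow and precision-loss flags.
// Names are illustrative; this is a sketch, not a tuned implementation.
module mul_q44_ovf (
    input  wire signed [7:0] a,         // Q4.4 operand
    input  wire signed [7:0] b,         // Q4.4 operand
    output wire signed [7:0] p,         // Q4.4 result (middle eight bits)
    output wire              overflow,  // product does not fit in Q4.4
    output wire              inexact    // discarded low bits were non-zero
);
    wire signed [15:0] ab = a * b;  // full Q8.8 product

    assign p = ab[11:4];
    // The result fits only if bits 15..11 are all copies of the sign bit
    assign overflow = (ab[15:11] != 5'b00000) && (ab[15:11] != 5'b11111);
    assign inexact  = (ab[3:0] != 4'b0000);
endmodule
```

With 6.5 x 4 the top five bits are `00011`, so `overflow` is set. A negative product such as -1.5 x 2 has top bits `11111` and is reported as fitting.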

Overflow can also happen with addition:

```
  0110.1010     6.6250
+ 0100.0001   + 4.0625
= 1010.1011   = -5.3125 ?!
```

And if our result is less than -8 it will overflow in the negative direction:

```
  1010.0000     -6.0000
+ 1010.0000   + -6.0000
= 0100.0000   =  4.0000 ?!
```

Thankfully it’s not too hard to detect this. If both our operands are positive and the result is negative, then overflow must have occurred. Similarly, if both our operands are negative and the result is positive, then it has overflowed too.
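The sign comparison described above might be sketched like this. The module name `add_q44_ovf` and its ports are hypothetical, chosen for illustration.

```verilog
// Hypothetical module: Q4.4 addition with a signed-overflow flag.
module add_q44_ovf (
    input  wire signed [7:0] a,        // Q4.4 operand
    input  wire signed [7:0] b,        // Q4.4 operand
    output wire signed [7:0] sum,      // Q4.4 sum (wraps on overflow)
    output wire              overflow  // set when the sum is out of range
);
    assign sum = a + b;
    // Overflow: operands share a sign, but the sum's sign is different
    assign overflow = (a[7] == b[7]) && (sum[7] != a[7]);
endmodule
```

Operands with different signs can never overflow, so the flag only needs to look at the three sign bits. What to do on overflow (saturate, flag an error, or widen the result) depends on the design.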

## Converting to and from Integers

Conversion to and from regular binary integers simply requires using the appropriate left or right shift. For example, if we’re using Q16.16 format we need to left-shift an integer 16 positions to create the Q16.16 fixed-point number:

```
101010                              // decimal 42
<< 16                               // left shift 16 positions
1010100000000000000000
0000000000101010.0000000000000000   // with all bits shown in Q16.16 notation
```

To make long binary numbers easier to read it’s common practice to include spaces. For example, we could represent decimal 42 in Q16.16 as:

```
0000 0000 0010 1010.0000 0000 0000 0000
```

We can then perform all the usual arithmetic operators on it with other Q16.16 numbers. For example, let’s add decimal 2.875:

```
0000 0000 0010 1010.0000 0000 0000 0000 // 42
0000 0000 0000 0010.1110 0000 0000 0000 // + 2.875
0000 0000 0010 1100.1110 0000 0000 0000 // 44.875
```

If you need to convert a Q16.16 number back to a regular integer, you right shift 16 positions:

```
0000 0000 0010 1100.1110 0000 0000 0000 // 44.875
>> 16                                   // right shift 16 positions
0000 0000 0010 1100                     // 44
```

This removes (truncates) the fractional part, so the answer ends up being `101100` (decimal 44).
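In Verilog, both conversions could be sketched as follows. The module name `q16_16_convert` and its ports are hypothetical; the sketch assumes signed 16-bit integers and signed Q16.16 values.

```verilog
// Hypothetical module: convert between 16-bit integers and Q16.16.
module q16_16_convert (
    input  wire signed [15:0] int_in,   // plain integer
    input  wire signed [31:0] fix_in,   // Q16.16 value
    output wire signed [31:0] fix_out,  // int_in expressed as Q16.16
    output wire signed [15:0] int_out   // fix_in truncated to an integer
);
    // Left shift by 16: integer bits on top, fractional bits all zero
    assign fix_out = {int_in, 16'b0};
    // Arithmetic right shift by 16 keeps the sign; same bits as fix_in[31:16]
    assign int_out = fix_in >>> 16;
endmodule
```

Note that truncating a negative value this way rounds towards negative infinity: -44.875 becomes -45, not -44.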