Fixed Point Numbers in Verilog
Sometimes you need more precision than integers can provide, but floating-point computation is not trivial (try reading IEEE 754). You could use a library or IP block, but simple fixed point maths can often get the job done with little effort. Furthermore, most FPGAs have dedicated DSP blocks that make multiplication and addition of integers fast; we can take advantage of that with a fixed-point approach.
New to the series? Start with Numbers in Verilog.
Share your thoughts with @WillFlux on Mastodon or Twitter. If you like what I do, sponsor me. 🙏
Series Outline
- Numbers in Verilog - introduction to numbers in Verilog
- Vectors and Arrays - working with Verilog vectors and arrays
- Multiplication with FPGA DSPs - efficient multiplication with DSPs
- Fixed-Point Numbers in Verilog (this post) - precision without complexity
- Division in Verilog - divided we stand
- More maths to follow
What is a Fixed Point Number?
In a regular binary integer, the bits represent powers of two, with the least significant bit being 1.
For example, decimal 13 is 1101 in binary: 8 + 4 + 1 = 13.
With decimal numbers, we’re used to the idea of using a decimal separator, a point or comma, to separate integer and fractional parts. 1001 is one thousand and one, whereas 10.01 is ten and one hundredth.
We can do the same thing in binary with a binary point and use the bits to represent any powers of two we like.
For example, we can think of 4.75 as being 4 + 1/2 + 1/4.
In binary terms this can be visualized as:
8 4 2 1 1/2 1/4 1/8 1/16
^ ^ ^ ^ ^ ^ ^ ^
0 1 0 0 1 1 0 0
Or, with a handy point to mark the fractional part: 0100.1100.
We’re choosing to interpret the value as being fixed point, but from an FPGA logic point of view, it’s just an 8-bit integer: provided we are consistent in the position of the fixed point we’ll get the expected result from mathematical operations. See later in this post if you have existing integers you need to convert to fixed point.
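To see that the fixed point is purely a matter of interpretation, here is a small simulation-only sketch that prints the same eight bits both ways. The module name q44_interpret is made up for illustration; it isn't part of any library.

```verilog
module q44_interpret();
    reg [7:0] bits;
    initial begin
        bits = 8'b0100_1100;
        // Read as a plain integer: 64 + 8 + 4 = 76
        $display("as integer: %0d", bits);
        // Read as Q4.4: the low four bits are fractional, so divide by 2^4 = 16
        // 76 / 16 = 4.75
        $display("as Q4.4:    %f", $itor(bits) / 16.0);
    end
endmodule
```

The hardware only ever sees the integer 76; the division by 16 happens in our heads (or in the testbench).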
Q Notation
To express the number of integer and fractional bits we use Q number format: Qi.f where i is the number of integer bits and f is the number of fractional bits. 0100.1100 has four integer and four fractional bits, so is Q4.4. We’ll mostly use Q4.4 in this post to keep the examples manageable.
ProTip: Some sources include the sign bit in i, but others do not. In our examples, we include the sign bit: thus Q4.4 is from -8 to 7.9375 (7 + 15/16): we discuss range later in this post.
Maths Just Works!
All the usual binary maths work when used with fixed-point numbers. Verilog can generally synthesize addition, subtraction, and multiplication on an FPGA. We cannot synthesize division automatically, but we can multiply by fractional numbers, e.g. multiply by 0.1 instead of dividing by 10.
Addition & Subtraction
Addition works in exactly the same way as for integers:
0011.1010 3.6250
+ 0100.0001 + 4.0625
= 0111.1011 = 7.6875
We can also do subtraction using two’s complement to express the negative number:
0011.1010 3.6250
+ 1110.1000 - 1.5000
= 0010.0010 = 2.1250
How did we get the two’s complement for -1.5?
1.5 = 1 + 1/2 = 0001.1000
Start: 0001.1000 (1.5)
Invert: 1110.0111
Add 1: 0000.0001
Result: 1110.1000 (-1.5)
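In Verilog you rarely need to do the invert-and-add-one steps by hand. As a minimal sketch (module and signal names are my own), the following shows that the manual method and the unary minus operator produce the same bits for a signed Q4.4 value:

```verilog
module negate_q44();
    reg signed [7:0] x, inv_plus_one, negated;
    initial begin
        x = 8'b0001_1000;            // 1.5 in Q4.4
        inv_plus_one = ~x + 8'd1;    // invert every bit, then add 1
        negated = -x;                // unary minus does the same job
        // Both values should be 11101000, i.e. 1110.1000 (-1.5)
        $display("invert+1: %b  unary minus: %b", inv_plus_one, negated);
    end
endmodule
```

One caveat: the most negative value (1000.0000, or -8) has no positive counterpart in Q4.4, so negating it gives -8 again.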
Multiplication
Multiplication works as expected too, but the product contains twice the number of bits. Multiplying two Q4.4 numbers results in a Q8.8 product:
0011.0100 3.2500
x 0010.0001 2.0625
You can do the long multiplication to produce:
00000110.10110100 = 6.703125
We can convert this back to our original Q4.4 format by taking the eight middle bits, which truncates the four lowest fractional bits:
0000[0110.1011]0100 = 0110.1011 = 6.6875
You can see we lost some precision there; this often happens when you convert to the original precision after multiplication. If our result had been 8 or higher, it would have overflowed our Q4.4 format because the largest integer we can represent with four bits in two’s complement is 7. We discuss range and precision in more detail below.
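Taking the middle bits simply drops the four lowest bits of the product. If you would rather round to the nearest Q4.4 value, one common approach is to add half a Q4.4 step to the product before slicing. The sketch below is my own illustration, with operands chosen so the two methods differ; it assumes the result stays within the Q4.4 range.

```verilog
module mul_round_q44();
    reg signed [7:0]  a, b, c_trunc, c_round;
    reg signed [15:0] ab, ab_half;
    initial begin
        a = 8'b0001_0011;        // 1.1875
        b = 8'b0001_0101;        // 1.3125
        ab = a * b;              // Q8.8 product: 00000001.10001111 = 1.55859375
        c_trunc = ab[11:4];      // truncate: 0001.1000 = 1.5
        // One Q4.4 step is 16 in Q8.8 units, so half a step is 8
        ab_half = ab + 16'sd8;
        c_round = ab_half[11:4]; // round to nearest: 0001.1001 = 1.5625
        $display("truncated: %b  rounded: %b", c_trunc, c_round);
    end
endmodule
```

Rounding costs an extra adder and can push a value that is just below the maximum over the edge, so you may want to combine it with an overflow check.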
Division
Division is less straightforward. You can’t synthesize it with regular Verilog unless it’s by a power of two, which uses a right shift. A right shift truncates the result, so we lose the fractional part:
0111 7
>>1 right-shift 1 bit
0011 3
However, with fractional numbers, we can do accurate division by a constant using multiplication.
For example, to divide by 2 we multiply by 0.5:
0111.1000 7.5000
x 0000.1000 0.5000
= 00000011.11000000 3.7500
Back in Q4.4: 0011.1100. In this case, our answer is exact.
To be able to divide arbitrary fixed-point numbers see: Division in Verilog.
In Verilog
You can’t use a point ‘.’ in Verilog binary literals, but you can use an underscore. The following module demonstrates all the calculations we performed above:
module fixedtest();
reg signed [7:0] a, b, c;
reg signed [15:0] ab; // large enough for product
localparam SF = 2.0**-4.0; // Q4.4 scaling factor is 2^-4
initial begin
$display("Fixed Point Examples from projectf.io.");
a = 8'b0011_1010; // 3.6250
b = 8'b0100_0001; // 4.0625
c = a + b; // 0111.1011 = 7.6875
$display("%f + %f = %f", $itor(a*SF), $itor(b*SF), $itor(c*SF));
a = 8'b0011_1010; // 3.6250
b = 8'b1110_1000; // -1.5000
c = a + b; // 0010.0010 = 2.1250
$display("%f + %f = %f", $itor(a*SF), $itor(b*SF), $itor(c*SF));
a = 8'b0011_0100; // 3.2500
b = 8'b0010_0001; // 2.0625
ab = a * b; // 00000110.10110100 = 6.703125
c = ab[11:4]; // take middle 8 bits: 0110.1011 = 6.6875
$display("%f * %f = %f", $itor(a*SF), $itor(b*SF), $itor(c*SF));
a = 8'b0111_1000; // 7.5000
b = 8'b0000_1000; // 0.5000
ab = a * b; // 00000011.11000000 = 3.7500
c = ab[11:4]; // take middle 8 bits: 0011.1100 = 3.7500
$display("%f * %f = %f", $itor(a*SF), $itor(b*SF), $itor(c*SF));
end
endmodule
NB. My syntax highlighter doesn’t like underscore in numbers, but they do work in actual tools.
The $itor function converts an integer into a real number we can display. We divide our results by 2^4 (that is, multiply by the scaling factor SF = 2^-4) to account for the lower four bits being fractional. For example, 00101000 would normally be interpreted as 40 (32+8), but we want it to be 2.5 (2 + 1/2). If we divide 40 by 2^4 = 16 we get the desired 2.5.
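If you display a lot of fixed-point values in a testbench, it can be tidier to wrap the conversion in a function. This is a simulation-only sketch; q44_to_real is a hypothetical helper name, not something from the Verilog standard or the module above.

```verilog
module q44_display();
    // Hypothetical helper: convert a signed Q4.4 value to a real for display
    function real q44_to_real;
        input signed [7:0] x;
        integer xi;
        begin
            xi = x;                          // sign-extend to a full integer
            q44_to_real = $itor(xi) / 16.0;  // divide by 2^4
        end
    endfunction

    reg signed [7:0] v;
    initial begin
        v = 8'b0010_1000;                // 40 as an integer
        $display("%f", q44_to_real(v));  // 40 / 16 = 2.5
        v = 8'b1110_1000;                // -24 as an integer
        $display("%f", q44_to_real(v));  // -24 / 16 = -1.5
    end
endmodule
```

Real numbers like this are fine for simulation output, but they are not synthesizable, so keep them out of your design logic.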
Large Numbers, Small Numbers
If the numbers you’re using are very small or large, you can improve your accuracy by scaling them so that they fit comfortably into the range of your fixed-point numbers.
Let’s say we want to perform the following: 3/256 x 7.
The smallest fraction we can represent in Q4.4 is 1/16, so 3/256 can’t be expressed. Rather than use a larger precision we can scale the smaller number up. If we multiply 3/256 by 2^6 we get 0.75, which comfortably fits within our precision. Because we’ve multiplied one of our numbers by 2^6, we need to add this to the scaling factor, making it 2^10 (that’s 4 bits for our Q4.4 format and 6 bits for adjusted scaling):
a = 8'b0000_1100; // 0.7500 (0.75 = 3/256 x 2^6)
b = 8'b0111_0000; // 7.0000
ab = a * b; // 00000101.01000000
c = ab[11:4]; // take middle 8 bits: 0101.0100
$display("%f", $itor(c*2.0**-10.0)); // divide result by 2^10
We get the result 0.082031, which is the exact value of 0.08203125 shown to the six decimal places that %f prints. When a scaled product doesn’t fit exactly, improving the accuracy further requires using more bits.
Range & Precision
You can get a wider range by using more integer bits and better precision by using more fractional bits. However, using more bits increases the amount of FPGA logic required. Consider what your problem requires and the capabilities of your FPGA.
Q4.4
Using two’s complement, a 4-bit value ranges from -8 (1000) to +7 (0111).
A 4-bit fraction can represent numbers as small as 1/16 (0001) and as large as 15/16 (1111).
Range: -8 to 7.9375 (7 + 15/16)
Precision: 0.0625 (1/16)
Q16.16
Using two’s complement, a 16-bit value ranges from -32768 to +32767.
A 16-bit fraction can represent numbers as small as 1/65536 and as large as 65535/65536.
Range: -32768 to 32767.9999847...
Precision: 0.0000152...
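The same sums work for any Qi.f format where i includes the sign bit: the smallest step is 2^-f, the minimum is -2^(i-1), and the maximum is 2^(i-1) minus one step. Here is a rough simulation-only sketch that prints them; the module and the parameter names I and F are my own invention.

```verilog
module q_range();
    // Hypothetical parameters: I integer bits (including sign), F fractional bits
    parameter I = 4;
    parameter F = 4;
    real lsb, min_val, max_val;
    initial begin
        lsb     = 2.0 ** (-F);              // precision: one fractional step
        min_val = -(2.0 ** (I - 1));        // most negative value
        max_val = (2.0 ** (I - 1)) - lsb;   // one step below 2^(I-1)
        $display("Q%0d.%0d: range %f to %f, precision %f",
                 I, F, min_val, max_val, lsb);
    end
endmodule
```

With the defaults this reports the Q4.4 figures above; overriding both parameters with 16 gives the Q16.16 figures, though %f will only show the precision to six decimal places.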
According to Wikipedia, Doom used a Q16.16 representation:
“…for all of its non-integer computations, including map system, geometry, rendering, player movement etc. This was done in order for the game to be playable on 386 and 486SX CPUs without an FPU. For compatibility reasons, this representation is still used in modern Doom source ports.”
Overflow
All results so far have fitted in our Q4.4 representation, but this doesn’t always happen. Let’s consider what happens when we multiply 6.5 by 4:
0110.1000 6.5000
x 0100.0000 4.0000
00011010.00000000 26.0000
If we take the eight middle bits, we get 1010.0000, which is -6. Our Q4.4 representation can’t hold 26.
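For a signed Q4.4 result, the top five bits of the Q8.8 product (the four integer bits we discard plus the sign bit we keep) must all be the same: all 0 for a positive result or all 1 for a negative one. If they differ, as with 00011 here, then we’ve overflowed. You can also check for a loss of precision by looking for any 1s in the four least significant bits.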
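Here is one way you might check a product in Verilog. For signed numbers the safe test is that the four discarded top bits all match the sign bit of the Q4.4 result, which means the top five bits of the product are all 0 or all 1. The signal names overflow and inexact are my own; treat this as a sketch rather than a finished design.

```verilog
module mul_overflow_q44();
    reg signed [7:0]  a, b, c;
    reg signed [15:0] ab;
    reg overflow, inexact;
    initial begin
        a  = 8'b0110_1000;   // 6.5
        b  = 8'b0100_0000;   // 4.0
        ab = a * b;          // 00011010.00000000 = 26.0 in Q8.8
        c  = ab[11:4];       // 1010.0000, which reads as -6
        // The result fits only if ab[15:11] are all copies of the sign bit
        overflow = (ab[15:11] != 5'b00000) && (ab[15:11] != 5'b11111);
        // Any 1s in the discarded fractional bits mean we lost precision
        inexact  = (ab[3:0] != 4'b0000);
        $display("c=%b overflow=%b inexact=%b", c, overflow, inexact);
    end
endmodule
```

For this example the top five bits are 00011, so overflow is set, while the low four bits are all zero, so no precision was lost.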
Overflow can also happen with addition:
0110.1010 6.6250
+ 0100.0001 + 4.0625
= 1010.1011 = -5.3125 ?!
And if our result is less than -8 it will overflow in the negative direction:
1010.0000 -6.0000
+ 1010.0000 + -6.0000
= 0100.0000 = 4.0000 ?!
Thankfully it’s not too hard to detect this. If both our operands are positive and the result is negative, then overflow must have occurred. Similarly, if both our operands are negative and the result is positive, then it has overflowed too.
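The sign-based rule translates directly into a comparison of the top bits. The following sketch wraps it in a task (the name check is arbitrary) and runs it on the two overflowing sums above plus the subtraction example from earlier in the post:

```verilog
module add_overflow_q44();
    // Add two signed Q4.4 values and report whether the sum overflowed
    task check;
        input signed [7:0] x, y;
        reg signed [7:0] s;
        reg ovf;
        begin
            s = x + y;
            // Overflow: operands share a sign, but the sum's sign differs
            ovf = (x[7] == y[7]) && (s[7] != x[7]);
            $display("%b + %b = %b overflow=%b", x, y, s, ovf);
        end
    endtask

    initial begin
        check(8'b0110_1010, 8'b0100_0001);  //  6.625 + 4.0625: overflow
        check(8'b1010_0000, 8'b1010_0000);  // -6.0 + -6.0: overflow
        check(8'b0011_1010, 8'b1110_1000);  //  3.625 + -1.5: fine
    end
endmodule
```

Adding numbers with different signs can never overflow, which is why the check only fires when the operand signs match.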
Converting to and from Integers
Conversion to and from regular binary integers simply requires using the appropriate left or right shift. For example, if we’re using Q16.16 format we need to left-shift an integer 16 positions to create the Q16.16 fixed-point number:
101010 // decimal 42
<< 16 // left shift 16 positions
1010100000000000000000
0000000000101010.0000000000000000 // with all bits shown in Q16.16 notation
To make long binary numbers easier to read it’s common practice to include spaces. For example, we could represent decimal 42 in Q16.16 as:
0000 0000 0010 1010.0000 0000 0000 0000
We can then perform all the usual arithmetic operations on it with other Q16.16 numbers. For example, let’s add decimal 2.875:
0000 0000 0010 1010.0000 0000 0000 0000 // 42
0000 0000 0000 0010.1110 0000 0000 0000 // + 2.875
0000 0000 0010 1100.1110 0000 0000 0000 // 44.875
If you need to convert a Q16.16 number back to a regular integer, you right shift 16 positions:
0000 0000 0010 1100.1110 0000 0000 0000 // 44.875
>> 16 // right shift 16 positions
0000 0000 0010 1100 // 44
This removes (truncates) the fractional part, so the answer ends up being 101100 (decimal 44).
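In Verilog the round trip looks something like the sketch below. The signal names are made up, and I have used the arithmetic shift operators (<<< and >>>) so that the sign is preserved for negative values.

```verilog
module q16_16_convert();
    reg signed [15:0] i, back;
    reg signed [31:0] q, sum;
    initial begin
        i    = 16'sd42;
        q    = i <<< 16;             // integer to Q16.16: 0x002A_0000
        sum  = q + 32'sh0002_E000;   // add 2.875 (0x0002.E000 in Q16.16)
        back = sum >>> 16;           // back to an integer: fraction dropped
        $display("%0d -> %h -> %0d", i, sum, back);  // back should be 44
    end
endmodule
```

Bear in mind that an arithmetic right shift rounds towards negative infinity, so -1.5 becomes -2 rather than -1.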
What’s Next?
If you enjoyed this post, please sponsor me. Sponsors help me create more FPGA and RISC-V projects for everyone, and they get early access to blog posts and source code. 🙏
In our next installment we get all divisive with Division in Verilog. Or check out some of our other maths posts: square root, and sine & cosine. I also have a demo using fixed-point multiplication to render the Mandelbrot set in Verilog.