22 December 2020

Square Root in Verilog

The square root is useful in many circumstances, including statistics, graphics, and signal processing. In this FPGA recipe, we’re going to look at a straightforward digit-by-digit square root algorithm for integer and fixed-point numbers. There are lower-latency methods, but this one is simple, using only subtraction and bit shifts. You might also be interested in Division in Verilog.

Updated 2021-01-04. Feedback to @WillFlux is most welcome.

Getting Radical

The square root of a number is a second number that multiplied by itself produces the first number. If b is the square root of a, then the following are different ways of representing their relationship:

  • b² = a
  • b = a1/2
  • b = √a

The square root is usually represented with the radical sign .

Let’s look at a few concrete examples:

  • √4 = 2
  • √100 = 10
  • √2 ≈ 1.41421
  • √3 ≈ 1.73205

Most square roots are irrational, so we only get an approximate answer algorithmically. However, the good news is that our method will provide an exact solution if one exists.

We’re only going to consider the positive root in this post, but there is also a negative root. For example, the square roots of 4 are +2 and -2.

Triangles

Pythagoras’s theorem is a fundamental equation in geometry. For a right-angled triangle, the area of the square of the hypotenuse is equal to the sum of the areas of the squares on the other two sides.

Pythagorean Triangle

a² + b² = c²

Finding c is a common problem in computer graphics, which means solving √(a² + b²).

For 3D modelling, it’s common to need the reciprocal of c.

Diagram courtesy of Wikipedia under the Creative Commons Attribution-Share Alike licence.

The Long Route

So far, so (probably) familiar. However, you’re unlikely to have calculated square roots by hand. Superficially, square roots seem too hard to do manually, but the approach is almost the same as long division. I won’t go into the mathematics here, but Wikipedia has a decent write up: methods of computing square roots. Instead, we’ll take a simple example, then turn it into a Verilog algorithm.

Example: √1111001

The number we’re finding the root of is known as the radicand.

For our example, the radicand is 1111001 (121 in decimal).

Our algorithm works with pairs of digits. Before we begin our calculation, we need to split the radicand into pairs starting with the least significant. Thus 1111001 becomes 01 11 10 01.

     1  0  1  1     Our answer: one digit for each pair of digits in the radicand
    ———————————
   √01 11 10 01     Our radicand

    01              Bring down the most significant pair
  - 01              Subtract 01
    ——
    00              Our answer is NOT negative, so our first answer digit is 1

    00 11           Bring down the next pair of digits and append to previous result
  - 01 01           Append 01 to our existing answer, 1, and subtract it
    —————
    11 00           Our result is negative, so our second answer digit is 0
                    We discard the result because it's negative

    00 11 10        Keep the existing digits, 00 11, and append the next pair
  -    10 01        Append 01 to our existing answer, 10, and subtract it
    ————————
       01 01        Our answer is NOT negative, so our next answer digit is 1

       01 01 01     Bring down the next pair of digits and append to previous result
    -  01 01 01     Append 01 to our existing answer, 101, and subtract it
       ————————
             00     Our answer is NOT negative, so our next answer digit is 1

There are no more pairs of digits, and the result of our last step is 0, so our answer is exact:

√1111001 = 1011 or in decimal √121 = 11.

We’ll look at what happens for irrational roots, such as √2, when we come to fixed-point numbers.

Algorithm Implementation

To make our algorithm usable on an FPGA, we need to turn the steps into simple operations we can represent in Verilog. The main change is using shifts to select numbers to work on. We’ll use four registers in our algorithm:

  • X - input radicand we want the square root of
  • A - holds the current value we’re working on
  • T - result of sign test
  • Q - the square root
Input: X=01111001 (decimal 121)

Step  A         X         T         Q       Description
——————————————————————————————————————————————————————————————————————————————
      00000000  01111001  00000000  0000    Starting values.

1     00000001  11100100                    Left shift X by two places into A.
                          00000000          Set T = A - {Q,01}: 01 - 01.
                                    0000    Left shift Q.
      00000000                      0001    Is T≥0? Yes. Set A=T and Q[0]=1.

2     00000011  10010000                    Left shift X by two places into A.
                          11111100          Set T = A - {Q,01}: 11 - 101.
                                    0010    Left shift Q.
                                            Is T≥0? No. Move to next step.

3     00001110  01000000                    Left shift X by two places into A.
                          00000101          Set T = A - {Q,01}: 1110 - 1001
                                    0100    Left shift Q.
      00000101                      0101    Is T≥0? Yes. Set A=T and Q[0]=1.

4     00010101  00000000                    Left shift X by two places into A.
                          00000000          Set T = A - {Q,01}: 10101 - 10101.
                                    1010    Left shift Q.
      00000000                      1011    Is T≥0? Yes. Set A=T and Q[0]=1.
——————————————————————————————————————————————————————————————————————————————

Output: Q=1010 (decimal 11), R=0 (remainder taken from final A).

Verilog Module

Our Verilog design uses the above algorithm, but you can configure the width of the radicand. WIDTH must be a multiple of two, for example, when working with seven binary digits, you must set the width to eight.

// Project F: Square Root (Integer)
// (C)2021 Will Green, Open source hardware released under the MIT License
// Learn more at https://projectf.io

`default_nettype none
`timescale 1ns / 1ps

module sqrt #(parameter WIDTH=8) (      // width of radicand
    input wire logic clk,
    input wire logic start,             // start signal
    output     logic busy,              // calculation in progress
    output     logic valid,             // root and rem are valid
    input wire logic [WIDTH-1:0] rad,   // radicand
    output     logic [WIDTH-1:0] root,  // root
    output     logic [WIDTH-1:0] rem    // remainder
    );

    logic [WIDTH-1:0] x, x_next;    // radicand copy
    logic [WIDTH-1:0] q, q_next;    // intermediate root (quotient)
    logic [WIDTH+1:0] ac, ac_next;  // accumulator (2 bits wider)
    logic [WIDTH+1:0] test_res;     // sign test result (2 bits wider)

    localparam ITER = WIDTH >> 1;   // iterations are half radicand width
    logic [$clog2(ITER)-1:0] i;     // iteration counter

    always_comb begin
        test_res = ac - {q, 2'b01};
        if (test_res[WIDTH+1] == 0) begin  // test_res ≥0? (check MSB)
            {ac_next, x_next} = {test_res[WIDTH-1:0], x, 2'b0};
            q_next = {q[WIDTH-2:0], 1'b1};
        end else begin
            {ac_next, x_next} = {ac[WIDTH-1:0], x, 2'b0};
            q_next = q << 1;
        end
    end

    always_ff @(posedge clk) begin
        if (start) begin
            busy <= 1;
            valid <= 0;
            i <= 0;
            q <= 0;
            {ac, x} <= {{WIDTH{1'b0}}, rad, 2'b0};
        end else if (busy) begin
            if (i == ITER-1) begin  // we're done
                busy <= 0;
                valid <= 1;
                root <= q_next;
                rem <= ac_next[WIDTH+1:2];  // undo final shift
            end else begin  // next iteration
                i <= i + 1;
                x <= x_next;
                ac <= ac_next;
                q <= q_next;
            end
        end
    end
endmodule

To use the module, set WIDTH to the correct number of bits and input rad to the radicand. To begin the calculation set start high for one clock. The valid signal indicates when the output data is valid; you can then read the results from root and rem. The calculation takes one cycle for each pair of bits in the radicand, so a 16-bit radicand would take eight cycles. The busy signal is high during calculation.

The Verilog itself is straightforward. The algorithm is in the always_comb block, though we start with the initial shift already in place. The always_ff block sets up the initial values, then runs the algorithm for the number of digits in the root.

Testing

The following test bench exercises the module:

// Project F: Test Bench for Integer Square Root
// (C)2021 Will Green, Open source hardware released under the MIT License
// Learn more at https://projectf.io

`default_nettype none
`timescale 1ns / 1ps

module sqrt_int_tb();

    parameter CLK_PERIOD = 10;
    parameter WIDTH = 8;

    logic clk;
    logic start;             // start signal
    logic busy;              // calculation in progress
    logic valid;             // root and rem are valid
    logic [WIDTH-1:0] rad;   // radicand
    logic [WIDTH-1:0] root;  // root
    logic [WIDTH-1:0] rem;   // remainder

    sqrt #(.WIDTH(WIDTH)) sqrt_inst (.*);

    always #(CLK_PERIOD / 2) clk = ~clk;

    initial begin
                clk = 1;

        #100    rad = 8'b00000000;  // 0
                start = 1;
        #10     start = 0;
        #40     $display("\t%d:\tsqrt(%d) =%d (rem =%d)", $time, rad, root, rem);

                rad = 8'b00000001;  // 1
                start = 1;
        #10     start = 0;
        #40     $display("\t%d:\tsqrt(%d) =%d (rem =%d)", $time, rad, root, rem);

                rad = 8'b01111001;  // 121
                start = 1;
        #10     start = 0;
        #40     $display("\t%d:\tsqrt(%d) =%d (rem =%d)", $time, rad, root, rem);

                rad = 8'b01010001;  // 81
                start = 1;
        #10     start = 0;
        #40     $display("\t%d:\tsqrt(%d) =%d (rem =%d)", $time, rad, root, rem);

                rad = 8'b01011010;  // 90
                start = 1;
        #10     start = 0;
        #40     $display("\t%d:\tsqrt(%d) =%d (rem =%d)", $time, rad, root, rem);

                rad = 8'b11111111;  // 255
                start = 1;
        #10     start = 0;
        #40     $display("\t%d:\tsqrt(%d) =%d (rem =%d)", $time, rad, root, rem);

                $finish;
    end
endmodule

The output looks like this:

sqrt(  0) =  0 (rem =  0)
sqrt(  1) =  1 (rem =  0)
sqrt(121) = 11 (rem =  0)
sqrt( 81) =  9 (rem =  0)
sqrt( 90) =  9 (rem =  9)
sqrt(255) = 15 (rem = 30)

Fixed-Point Roots

In a previous recipe, we looked at Fixed Point Numbers in Verilog. It turns out that supporting fixed-point square roots is straightforward: we just run more iterations to account for the fractional digits. Unlike with division, there’s no possibility of overflow, as the root of a radicand greater than 1 is always smaller than the radicand itself.

This new version of the sqrt module adds a parameter FBITS for the number of fractional bits in the radicand (FBITS must be a multiple of two). For example, if we want to calculate √2 using Q8.8 format (eight integer and eight fractional bits), we set WIDTH=16 and FBITS=8. The only change within the module is for ITER, which now accounts for fractional bits. For a Q8.8 radicand, we now perform (16+8)/2 = 12 iterations, generating four more bits for the root.

// Project F: Square Root
// (C)2021 Will Green, Open source hardware released under the MIT License
// Learn more at https://projectf.io

`default_nettype none
`timescale 1ns / 1ps

module sqrt #(
    parameter WIDTH=8,  // width of radicand
    parameter FBITS=0   // fractional bits (for fixed point)
    ) (
    input wire logic clk,
    input wire logic start,             // start signal
    output     logic busy,              // calculation in progress
    output     logic valid,             // root and rem are valid
    input wire logic [WIDTH-1:0] rad,   // radicand
    output     logic [WIDTH-1:0] root,  // root
    output     logic [WIDTH-1:0] rem    // remainder
    );

    logic [WIDTH-1:0] x, x_next;    // radicand copy
    logic [WIDTH-1:0] q, q_next;    // intermediate root (quotient)
    logic [WIDTH+1:0] ac, ac_next;  // accumulator (2 bits wider)
    logic [WIDTH+1:0] test_res;     // sign test result (2 bits wider)

    localparam ITER = (WIDTH+FBITS) >> 1;  // iterations are half radicand+fbits width
    logic [$clog2(ITER)-1:0] i;            // iteration counter

    always_comb begin
        test_res = ac - {q, 2'b01};
        if (test_res[WIDTH+1] == 0) begin  // test_res ≥0? (check MSB)
            {ac_next, x_next} = {test_res[WIDTH-1:0], x, 2'b0};
            q_next = {q[WIDTH-2:0], 1'b1};
        end else begin
            {ac_next, x_next} = {ac[WIDTH-1:0], x, 2'b0};
            q_next = q << 1;
        end
    end

    always_ff @(posedge clk) begin
        if (start) begin
            busy <= 1;
            valid <= 0;
            i <= 0;
            q <= 0;
            {ac, x} <= {{WIDTH{1'b0}}, rad, 2'b0};
        end else if (busy) begin
            if (i == ITER-1) begin  // we're done
                busy <= 0;
                valid <= 1;
                root <= q_next;
                rem <= ac_next[WIDTH+1:2];  // undo final shift
            end else begin  // next iteration
                i <= i + 1;
                x <= x_next;
                ac <= ac_next;
                q <= q_next;
            end
        end
    end
endmodule

And we can create a simple test bench for it:

// Project F: Test Bench for Fixed-Point Square Root
// (C)2021 Will Green, Open source hardware released under the MIT License
// Learn more at https://projectf.io

`default_nettype none
`timescale 1ns / 1ps

module sqrt_tb();

    parameter CLK_PERIOD = 10;
    parameter WIDTH = 16;
    parameter FBITS = 8;
    parameter SF = 2.0**-8.0;  // Q8.8 scaling factor is 2^-8

    logic clk;
    logic start;             // start signal
    logic busy;              // calculation in progress
    logic valid;             // root and rem are valid
    logic [WIDTH-1:0] rad;   // radicand
    logic [WIDTH-1:0] root;  // root
    logic [WIDTH-1:0] rem;   // remainder

    sqrt #(.WIDTH(WIDTH), .FBITS(FBITS)) sqrt_inst (.*);

    always #(CLK_PERIOD / 2) clk = ~clk;

    initial begin
                clk = 1;

        #100    rad = 16'b1110_1000_1001_0000;  // 232.56250000
                start = 1;
        #10     start = 0;
        #120    $display("\t%d:\tsqrt(%f) = %b (%f) (rem = %b)",
                    $time, $itor(rad)*SF, root, $itor(root)*SF, rem);

                rad = 16'b0000_0000_0100_0000;  // 0.25
                start = 1;
        #10     start = 0;
        #120    $display("\t%d:\tsqrt(%f) = %b (%f) (rem = %b)",
                    $time, $itor(rad)*SF, root, $itor(root)*SF, rem);

                rad = 16'b0000_0010_0000_0000;  // 2.0
                start = 1;
        #10     start = 0;
        #120    $display("\t%d:\tsqrt(%f) = %b (%f) (rem = %b)",
                    $time, $itor(rad)*SF, root, $itor(root)*SF, rem);

        $finish;
    end
endmodule

NB. I have tested with a wide range of roots, but only show a few here to keep the example managable.

The output looks like this:

sqrt(232.562500) = 0000111101000000 (15.250000) (rem = 0000000000000000)
sqrt(  0.250000) = 0000000010000000  (0.500000) (rem = 0000000000000000)
sqrt(  2.000000) = 0000000101101010  (1.414062) (rem = 0000000000011100)

1.01101010 is the correct result for √2 using eight fractional bits. Wolfram Alpha provides a quick way to check binary square roots, for example: √2 to base 2.

The smallest value the Q8.8 format can represent is 1/256 or ~0.004 in decimal; if you want a more accurate result, you need to use more fractional bits.

That wraps up this recipe, but watch out for forthcoming graphics posts using square root.

©2021 Will Green, Project F