22 September 2020

Life on Screen

Welcome back to Exploring FPGA Graphics. In previous posts, we raced the beam: every frame the entire screen was generated from scratch. In this fourth part, we’re going to draw on a bitmap and drive the display from that. Bitmaps allow for more complex graphics, but require memory and increase latency.

In this series, we explore graphics at the hardware level and get a feel for the power of FPGAs. We start by learning how displays work, before racing the beam with Pong, drawing starfields and sprites, simulating life with bitmaps, drawing lines and triangles, and finally creating simple 3D models. I’ll be writing and revising this series throughout 2020. New to the series? Start with Exploring FPGA Graphics.

NB. This post is a draft: designs for iCEBreaker and other improvements are coming.

Updated 2020-09-23. Get in touch with @WillFlux or open an issue on GitHub.

He is Archimedes, Mick Jagger, Salvador Dalí, and Richard Feynman, all rolled into one.
The Guardian, John Horton Conway: the world’s most charismatic mathematician (2015)

Series Outline

  • Exploring FPGA Graphics - how displays work and simple animated colour graphics
  • FPGA Pong - race the beam to create the arcade classic
  • FPGA Ad Astra - animated starfields, hardware sprites, and bitmap fonts
  • Life on Screen (this post) - bitmaps and Conway’s Game of Life
  • Hard Lines - 2D drawing (planned)
  • More to follow

Requirements

For this series, you need an FPGA board with video output. We’ll be working at 640x480, so pretty much any video output will do. You should be comfortable with programming your FPGA board and reasonably familiar with Verilog.

We’ll be demoing with these boards (FPGA type):

Follow the source README to quickly build a project for either of these boards.

Source

All the Verilog designs featured in this series are available in the Exploring FPGAs repo, and source links are included throughout the blog. The designs are open source under the permissive MIT licence, but this blog is subject to normal copyright restrictions.

An Array of Bits

A bitmapped display is backed by a memory array: each pixel corresponds to one or more locations in memory. To draw the display, we read out the value of the current pixel from memory. To set the colour of a pixel, we modify the contents of its associated memory location.

Bitmaps provide two significant benefits over what we’ve done so far: we’re free to create sophisticated graphics using whatever technique we like, and the setting of pixel colour is separated from the actual screen drawing process. The flexibility of bitmaps comes at the cost of increased memory storage and latency.

We start with a small 80x60 monochrome bitmap to simulate life, before increasing resolution and colours to display Earth itself.

Conway’s Life

John Conway was a remarkable mathematician, active in many fields, who sadly died in 2020. Conway is best known to the public for recreational mathematics with Martin Gardner in Scientific American. I remember playing Sprouts at school and trying to impress people by calculating the day of the week with the Doomsday rule. For now, I’ll be limiting myself to Game of Life, but I highly recommend learning more about Conway: The world’s most charismatic mathematician is an excellent place to start.

Game of Life first appeared in the Mathematical Games column in the October 1970 issue of Scientific American. Anyone can access the article at ibiblio.org, or you can see a scan of the original at JSTOR (requires login via academic institution or library).

The Rules of Life

The universe of the Game of Life is an infinite, two-dimensional grid of cells. A cell is in one of two states: dead or alive. Every cell interacts with its eight neighbours (those to the left, right, top, bottom, and the four diagonals).

The following rules determine the fate of a cell in the next generation:

  1. Survival: Every cell with two or three neighbours survives.
  2. Deaths:
    a. Every cell with four or more neighbours dies from overpopulation.
    b. Every cell with zero or one neighbours dies from isolation.
  3. Births: Every empty cell with exactly three neighbours comes to life.

To understand this, it helps to see some practical examples. Common patterns within the Life universe have been given names. The diagram below shows two patterns: “Beehive” and “Beacon”.

Conway’s Life Examples

Every live cell in the Beehive pattern has two or three neighbours: so, no cells die. No dead cells in Beehive have three neighbours: so, no cells come to life. This pattern is categorised as a “still life”, because it doesn’t change from one generation to the next.

The Beacon pattern is a bit more interesting: it oscillates between two states. Two dead cells in the centre of the pattern have three neighbours, so come to life. However, they then have four neighbours, so die the next generation due to overcrowding: this pattern repeats endlessly.

However, things start to get really interesting with slightly more complex patterns. The following photo shows the Gosper glider gun, which repeatedly generates small repeating patterns called gliders.

Gosper Glider Gun

It’s even possible to construct logic gates using gliders, and ultimately a universal Turing machine. As this post is primarily about FPGA graphics, we’re not going to dig further into the possibilities of Conway’s Life, but check out Wikipedia’s article on Conway’s Game of Life.

Life in Hardware

Now that we have a basic understanding of Conway’s Life, we’re going to create a hardware implementation in Verilog.

We start by rearranging the rules to simplify the design:

  1. If a cell is alive
    a. if it has 2 or 3 neighbours, it’s alive next generation
    b. otherwise, it’s dead next generation
  2. If a cell is dead
    a. if it has 3 neighbours, it’s alive next generation
    b. otherwise, it’s dead next generation
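As a quick software cross-check of the rearranged rules, here is a minimal Python sketch. The function name next_state is my own invention for illustration and isn't part of the Verilog design.

```python
def next_state(alive: bool, neighbours: int) -> bool:
    """Return True if the cell is alive in the next generation."""
    if alive:
        return neighbours in (2, 3)  # rule 1: survive with 2 or 3, else die
    return neighbours == 3           # rule 2: born with exactly 3, else stay dead


# Tabulate the outcome for every possible neighbour count (0 to 8).
for n in range(9):
    print(n, next_state(True, n), next_state(False, n))
```

The two branches map directly onto the alive and dead cases above, which is roughly the shape the hardware decision takes too.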

Handily, a bitmap is a two-dimensional grid as required by Life, but our bitmap won’t be infinite! Instead, our bitmap wraps around top and bottom, and between left and right: this is known as a toroidal array. Contrast this with a 2D map of Earth, which wraps around at the east and west, but not at the north and south.
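To make the wrap-around concrete, here is a small Python sketch of neighbour lookup on an idealised toroidal grid using modulo arithmetic. The names WIDTH, HEIGHT, and neighbours are assumptions for illustration; the hardware design works with linear addresses rather than (x, y) pairs.

```python
WIDTH, HEIGHT = 80, 60  # same size as the Life bitmap in this post


def neighbours(x: int, y: int) -> list:
    """Return the eight neighbours of (x, y) on a toroidal grid."""
    return [
        ((x + dx) % WIDTH, (y + dy) % HEIGHT)  # modulo wraps at every edge
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dx, dy) != (0, 0)                  # skip the cell itself
    ]


print(neighbours(0, 0))
```

The top-left cell ends up with neighbours in the right-hand column and the bottom row, which is exactly what a toroidal array means.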

Since a cell’s state depends on all those around it, we can’t calculate the new state of each cell sequentially in one bitmap. Instead, we have two buffers: we read the current generation from one and write the next generation to the other. We then swap buffers and repeat the process. This double buffering also allows us to safely update the display without risking any artefacts or tearing: the display is always being driven from a complete bitmap. Our life module is ignorant of the two buffers; they are handled in the controlling top module, with reads and writes directed to different offsets.
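The double-buffering idea can be modelled in a few lines of Python. This is only a sketch under my own assumptions: one flat memory holds both buffers, and the names step, mem, and front are hypothetical, not taken from the Verilog source.

```python
WIDTH, HEIGHT = 6, 6          # small world, matching the life module defaults
CELLS = WIDTH * HEIGHT


def step(mem: list, front: int) -> int:
    """Compute one generation and return the new front buffer (0 or 1)."""
    rd = front * CELLS        # offset of the buffer we read from
    wr = (1 - front) * CELLS  # offset of the buffer we write to
    for cid in range(CELLS):
        x, y = cid % WIDTH, cid // WIDTH
        count = sum(
            mem[rd + ((y + dy) % HEIGHT) * WIDTH + (x + dx) % WIDTH]
            for dy in (-1, 0, 1)
            for dx in (-1, 0, 1)
            if (dx, dy) != (0, 0)
        )
        alive = mem[rd + cid]
        mem[wr + cid] = 1 if (count == 3 or (alive and count == 2)) else 0
    return 1 - front          # swap buffers


mem = [0] * (2 * CELLS)       # both buffers in one memory
for x in (1, 2, 3):           # horizontal blinker on row 2
    mem[2 * WIDTH + x] = 1
front = step(mem, 0)
print([mem[front * CELLS + y * WIDTH + 2] for y in range(HEIGHT)])
```

Because every read comes from the old buffer and every write goes to the new one, the order in which cells are visited doesn’t matter.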

Create a life module with the following design - life.sv:

module life #(
    parameter WORLD_WIDTH=6,
    parameter WORLD_HEIGHT=6,
    parameter ADDRW=$clog2(WORLD_WIDTH * WORLD_HEIGHT)
    ) (
    input  wire logic clk,
    input  wire logic start,
    input  wire logic run,
    output      logic [ADDRW-1:0] id,
    input  wire logic r_status,
    output      logic w_status,
    output      logic we,
    output      logic done
    );

    // simulation parameters
    localparam CELL_COUNT = WORLD_WIDTH * WORLD_HEIGHT;  // total number of cells
    localparam NEIGHBOURS_COUNT = 8;  // number of neighbours each cell has

    // number of alive neighbours (could be 8!)
    logic [$clog2(NEIGHBOURS_COUNT+1)-1:0] neighbours_alive;

    // internal cell and neighbour IDs
    logic [ADDRW-1:0] cid, cid_next;
    logic [ADDRW-1:0] nid;  // adding nid_next would improve timing slack
    logic [$clog2(NEIGHBOURS_COUNT)-1:0] npos, npos_next;

    // simulation state
    enum {IDLE, NEXT_CELL, NEIGHBOURS, CURRENT_CELL, UPDATE_CELL} state, state_next;
    always_comb begin
        case(state)
            IDLE: state_next = (start && !done) ? NEXT_CELL : IDLE;
            NEXT_CELL: begin
                if (done) begin
                    state_next = IDLE;
                end else if (run) begin
                    state_next = NEIGHBOURS;
                end else begin
                    state_next = NEXT_CELL;
                end
            end
            NEIGHBOURS: state_next = (npos == NEIGHBOURS_COUNT-1) ? CURRENT_CELL : NEIGHBOURS;
            CURRENT_CELL: state_next = UPDATE_CELL;
            UPDATE_CELL: state_next = NEXT_CELL;
            default: state_next = IDLE;
        endcase
    end

    always_ff @(posedge clk) begin
        state <= state_next;
        cid <= cid_next;
        npos <= npos_next;
    end

    // simulation calculations
    always_comb begin
        we = (state == UPDATE_CELL) ? 1 : 0;  // enable writing when updating
        id = cid;
        npos_next = npos;
        cid_next = cid;
        w_status = 0;

        case(state)
            IDLE: begin
                cid_next = 0;
                nid = 0;
                npos_next = 0;
            end
            NEIGHBOURS: begin
                // map neighbour index onto ID
                case (npos)
                    3'd0: nid = cid - (WORLD_WIDTH + 1);
                    3'd1: nid = cid - WORLD_WIDTH;
                    3'd2: nid = cid - (WORLD_WIDTH - 1);
                    3'd3: nid = cid - 1;
                    3'd4: nid = cid + 1;
                    3'd5: nid = cid + (WORLD_WIDTH - 1);
                    3'd6: nid = cid + WORLD_WIDTH;
                    3'd7: nid = cid + (WORLD_WIDTH + 1);
                endcase

                // because the life universe wraps we need to correct for possible under/overflow
                if (nid >= CELL_COUNT) begin
                    if (npos <= 3'd3) begin
                        nid = CELL_COUNT - (2**ADDRW - nid);
                    end else begin
                        nid = nid - CELL_COUNT;
                    end
                end

                npos_next = (state == NEIGHBOURS) ? npos + 1 : 0;
                id = nid;
            end
            UPDATE_CELL: begin
                if (r_status == 1) begin  // if cell is currently alive
                    w_status = (neighbours_alive == 4'd2 || neighbours_alive == 4'd3) ? 1 : 0;
                end else begin  // or dead
                    w_status = (neighbours_alive == 4'd3) ? 1 : 0;
                end

                // ready for next cell
                cid_next = (cid < CELL_COUNT-1) ? cid + 1 : 0;
            end
        endcase
    end

    always_ff @(posedge clk) begin
        case(state)
            IDLE: begin
                neighbours_alive <= 0;
                done <= 0;
            end
            // BRAM takes one cycle to read data, so we need to offset by one cycle
            NEIGHBOURS: if (npos >= 3'd1) neighbours_alive <= neighbours_alive + {3'b0, r_status};
            CURRENT_CELL: neighbours_alive <= neighbours_alive + {3'b0, r_status};
            UPDATE_CELL: begin
                // prepare for next cell
                if (cid < CELL_COUNT-1) begin
                    neighbours_alive <= 0;
                end else begin
                    done <= 1;
                end
            end
        endcase
    end
endmodule

Our module is a simple finite state machine (FSM) that reads all eight neighbours for each cell. This isn’t the most efficient design in terms of memory access: you could reuse the previous neighbour reads at the cost of slightly more complex logic. I may improve this module at a future date.

The cell under consideration is cid (between 0 and 4,799 for our 80x60 world). npos is the position of a neighbour relative to the cell we’re dealing with (0 is top left, 1 top middle, 2 top right etc.), while nid is the position of the neighbour in the bitmap (again between 0 and 4,799). Further details on the operation of the Life module will be added in later drafts.
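The under/overflow correction on nid is easier to follow in software. The Python sketch below models the case statement and the correction for an 80x60 world with a 13-bit address; neighbour_id and OFFSETS are my own names, and this is a model of the arithmetic rather than a description of the synthesised logic.

```python
WORLD_WIDTH, WORLD_HEIGHT = 80, 60
CELL_COUNT = WORLD_WIDTH * WORLD_HEIGHT
ADDRW = (CELL_COUNT - 1).bit_length()  # 13 bits, as $clog2(4800) gives

# Offsets in the same order as the case statement (npos 0 to 7).
OFFSETS = [
    -(WORLD_WIDTH + 1), -WORLD_WIDTH, -(WORLD_WIDTH - 1), -1,
    1, WORLD_WIDTH - 1, WORLD_WIDTH, WORLD_WIDTH + 1,
]


def neighbour_id(cid: int, npos: int) -> int:
    """Model the nid calculation, including ADDRW-bit wrap-around."""
    nid = (cid + OFFSETS[npos]) % (1 << ADDRW)  # fixed-width arithmetic
    if nid >= CELL_COUNT:
        if npos <= 3:                           # subtraction underflowed
            nid = CELL_COUNT - ((1 << ADDRW) - nid)
        else:                                   # addition ran past the end
            nid = nid - CELL_COUNT
    return nid


print(neighbour_id(0, 0))  # top-left neighbour of the first cell
```

Under these assumptions, the corrected value always equals (cid + offset) modulo CELL_COUNT. As far as I can tell, one consequence of using linear addresses is that the left neighbour of a cell in column 0 is the last cell of the previous row, not of the same row, so the horizontal wrap is slightly twisted compared with a textbook torus.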

To exercise this module we have a test bench for Vivado: life_tb.sv. If you have Xilinx Vivado installed, try using the test bench with the included test seed files. You can find instructions for running the simulation in the source README.

Bitmap RAM

In FPGA Ad Astra, we used block RAM (BRAM) as a small ROM to store font bitmaps; this time it will handle the whole display.

Our Life bitmap stores 1 bit for each of 80x60 cells: 4,800 bits (600 bytes) in total. Our 640x480 screen has eight times more horizontal and vertical resolution, making it straightforward to scale this bitmap up to the display.

We’re going to use a simple dual-port BRAM, with the Life simulation and display controller sharing the read port. We could have used true dual-port BRAM to avoid sharing, but iCE40 and most external memories don’t have dual read-write ports.

The simple dual-port BRAM modules:

You can learn more about block RAM in FPGA Memory Types.

Scaling Bitmaps

In the previous section, I said scaling our bitmap for display was straightforward, because it’s eight times smaller than our screen resolution. However, practical scaling isn’t quite as simple as it first appears.

For example, given our display coordinates sx and sy you could calculate the 80x60 pixel bitmap address with:

always_comb bitmap_addr = sx / 8 + (sy / 8) * 80;  // ?!

This has at least three problems:

  1. It doesn’t account for memory latency: it takes one or more cycles to read BRAM
  2. It wastes memory bandwidth: every value is read 64 times per frame!
  3. It uses multiplication, which takes up valuable (probably DSP) logic

Using a line buffer allows us to avoid all three of these problems. The line buffer uses three simple dual-port BRAMs, one for each colour channel (red, green, blue). As a module, it’s almost trivial; the interesting parts are how bitmap data is read efficiently and how addresses are calculated.

We write a single 80-bit line of the bitmap into the line buffer, then we can read this out for eight display lines before going back to bitmap memory for the next 80 bits of data. By making efficient use of memory bandwidth, we can better share memory with other data.
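To put numbers on the bandwidth saving, the following Python sketch compares the naive per-pixel address calculation with a line-buffer approach and counts bitmap reads per frame. It’s a software model with made-up names (draw_naive, draw_linebuffer) and an arbitrary test pattern; it ignores memory latency and display blanking.

```python
SCREEN_W, SCREEN_H = 640, 480
BMP_W, BMP_H = 80, 60
SCALE = 8

# Arbitrary test pattern standing in for the Life bitmap.
bitmap = [(x * 7 + y * 13) % 2 for y in range(BMP_H) for x in range(BMP_W)]


def draw_naive():
    """Read bitmap memory for every screen pixel."""
    frame, reads = [], 0
    for sy in range(SCREEN_H):
        for sx in range(SCREEN_W):
            frame.append(bitmap[sx // SCALE + (sy // SCALE) * BMP_W])
            reads += 1
    return frame, reads


def draw_linebuffer():
    """Fetch one bitmap line, then reuse it for SCALE screen lines."""
    frame, reads, addr = [], 0, 0
    line = [0] * BMP_W
    for sy in range(SCREEN_H):
        if sy % SCALE == 0:             # time to read a fresh line of data
            for i in range(BMP_W):
                line[i] = bitmap[addr]  # simple incrementing address
                addr += 1
                reads += 1
        for sx in range(SCREEN_W):
            frame.append(line[sx // SCALE])
    return frame, reads


naive_frame, naive_reads = draw_naive()
lb_frame, lb_reads = draw_linebuffer()
print(naive_reads, lb_reads, naive_frame == lb_frame)
```

Both approaches produce the same picture, but the line buffer needs one bitmap read per cell per frame, and the bitmap address is a plain counter with no multiplication.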

Add a linebuffer module - linebuffer.sv:

module linebuffer #(
    parameter WIDTH=8,              // data width
    parameter DEPTH=2048,           // length of line
    parameter ADDRW=$clog2(DEPTH)   // address width
    ) (
    input  wire logic clk_write,
    input  wire logic clk_read, 
    input  wire logic [2:0] we,
    input  wire logic [ADDRW-1:0] addr_write,
    input  wire logic [ADDRW-1:0] addr_read,
    input  wire logic [WIDTH-1:0] data_in_0, data_in_1, data_in_2,
    output      logic [WIDTH-1:0] data_out_0, data_out_1, data_out_2
    );

    // channel 0
    bram_sdp #(.WIDTH(WIDTH), .DEPTH(DEPTH)) ch0 (
        .clk_write,
        .clk_read,
        .we(we[0]),
        .addr_write,
        .addr_read,
        .data_in(data_in_0),
        .data_out(data_out_0)
    );

    // channel 1
    bram_sdp #(.WIDTH(WIDTH), .DEPTH(DEPTH)) ch1 (
        .clk_write,
        .clk_read,
        .we(we[1]),
        .addr_write,
        .addr_read,
        .data_in(data_in_1),
        .data_out(data_out_1)
    );

    // channel 2
    bram_sdp #(.WIDTH(WIDTH), .DEPTH(DEPTH)) ch2 (
        .clk_write,
        .clk_read,
        .we(we[2]),
        .addr_write,
        .addr_read,
        .data_in(data_in_2),
        .data_out(data_out_2)
    );
endmodule

We use three separate BRAM instances because colours are output separately to the display and 4-bit-wide buffers are a better fit for BRAM hardware than a single 12-bit-wide buffer. A future post will cover display scaling more fully, but we have enough to efficiently display bitmaps for now.

Clock Domain Crossing

The line buffer module has another significant benefit: it allows the pixel clock to be different from the rest of the design. For example, you could run your Life simulation at 100 MHz (line buffer write port) and display it on a 720p60 screen with a 74.25 MHz pixel clock (line buffer read port). Clock domain crossing (CDC) is usually achieved with a FIFO, but the line buffer does it while providing efficient scaling.

Top Module

We can tie the design together with a top module, which includes the address calculations for the bitmap and line buffer.

The Xilinx version is shown below:

module top_life (
    input  wire logic clk_100m,     // 100 MHz clock
    input  wire logic btn_rst,      // reset button (active low)
    output      logic vga_hsync,    // horizontal sync
    output      logic vga_vsync,    // vertical sync
    output      logic [3:0] vga_r,  // 4-bit VGA red
    output      logic [3:0] vga_g,  // 4-bit VGA green
    output      logic [3:0] vga_b   // 4-bit VGA blue
    );

    parameter GEN_FRAMES = 15;  // each generation lasts this many frames
    parameter SEED_FILE = "simple_life.mem";  // seed to initiate universe with

    // generate pixel clock
    logic clk_pix;
    logic clk_locked;
    clock_gen clock_640x480 (
       .clk(clk_100m),
       .rst(!btn_rst),  // reset button is active low
       .clk_pix,
       .clk_locked
    );

    // display timings
    localparam CORDW = 10;  // screen coordinate width in bits
    logic [CORDW-1:0] sx, sy;
    logic de;
    display_timings timings_640x480 (
        .clk_pix,
        .rst(!clk_locked),  // wait for clock lock
        .sx,
        .sy,
        .hsync(vga_hsync),
        .vsync(vga_vsync),
        .de
    );

    logic frame_end;   // high for one cycle at the end of a frame
    logic line_start;  // high for one cycle at line start (drawing lines only)
    always_comb begin
        frame_end = (sy == 524 && sx == 799);
        line_start = (sy < 480 && sx == 0);
    end

    // life bitmap
    localparam BMP_COUNT  = 2;  // double buffered
    localparam BMP_WIDTH  = 80;
    localparam BMP_HEIGHT = 60;
    localparam BMP_PIXELS = BMP_WIDTH * BMP_HEIGHT;
    localparam BMP_DEPTH  = BMP_COUNT * BMP_PIXELS;
    localparam BMP_ADDRW  = $clog2(BMP_DEPTH);
    localparam BMP_DATAW  = 1;

    logic bmp_we;
    logic [BMP_ADDRW-1:0] bmp_addr_read, bmp_addr_write;
    logic [BMP_DATAW-1:0] bmp_data_in, bmp_data_out;

    bram_sdp #(
        .WIDTH(BMP_DATAW),
        .DEPTH(BMP_DEPTH),
        .INIT_F(SEED_FILE)
    ) bmp_life (
        .clk_read(clk_pix),
        .clk_write(clk_pix),
        .we(bmp_we),
        .addr_write(bmp_addr_write),
        .addr_read(bmp_addr_read),
        .data_in(bmp_data_in),
        .data_out(bmp_data_out)
    );

    // update frame counter
    logic life_start;  // trigger next calculation
    logic life_done;   // signals complete calculation
    logic front_buffer;  // which buffer to draw the display from
    logic [$clog2(GEN_FRAMES)-1:0] cnt_frames;
    always_ff @(posedge clk_pix) begin
        if (frame_end) cnt_frames <= cnt_frames + 1;
        if (cnt_frames == GEN_FRAMES - 1) begin
            front_buffer <= ~front_buffer;
            cnt_frames <= 0;
            life_start <= 1;
        end else life_start <= 0;
    end

    logic life_run;
    logic [BMP_ADDRW-1:0] cell_id;
    life #(
        .WORLD_WIDTH(BMP_WIDTH),
        .WORLD_HEIGHT(BMP_HEIGHT),
        .ADDRW(BMP_ADDRW)
    ) life_sim (
        .clk(clk_pix),
        .start(life_start),
        .run(life_run),
        .id(cell_id),
        .r_status(bmp_data_out),
        .w_status(bmp_data_in),
        .we(bmp_we),
        .done(life_done)
    );

    // line buffer
    localparam LB_SCALE_V = 8;                // factor to scale vertical drawing
    localparam LB_SCALE_H = 8;                // factor to scale horizontal drawing
    localparam LB_LINE  = 640 / LB_SCALE_H;   // line length
    localparam LB_ADDRW = $clog2(LB_LINE);    // line address width
    localparam LB_WIDTH = 4;                  // 4-bits per colour channel

    // line buffer read port (pixel clock)
    logic [LB_ADDRW-1:0] lb_addr_read_pix;
    logic [LB_WIDTH-1:0] lb0_out_pix, lb1_out_pix, lb2_out_pix;

    // line buffer write port (system clock - latency corrected)
    logic lb_re;  // signals when the line buffer needs to read data from bitmap
    logic [2:0] lb_we, lb_we_l1;
    logic [LB_ADDRW-1:0] lb_addr_write, lb_addr_write_l1;
    logic [LB_WIDTH-1:0] lb0_in, lb1_in, lb2_in;

    // latency correction
    always_ff @(posedge clk_pix) begin
        lb_we_l1 <= lb_we;
        lb_addr_write_l1 <= lb_addr_write;
    end

    linebuffer #(
        .WIDTH(LB_WIDTH),
        .DEPTH(LB_LINE)
        ) lb (
        .clk_write(clk_pix),
        .clk_read(clk_pix),
        .we(lb_we_l1),                  // use lb_we_l1 to correct for latency
        .addr_write(lb_addr_write_l1),  // use lb_addr_write_l1 to correct for latency
        .addr_read(lb_addr_read_pix),
        .data_in_0(lb0_in),
        .data_in_1(lb1_in),
        .data_in_2(lb2_in),
        .data_out_0(lb0_out_pix),
        .data_out_1(lb1_out_pix),
        .data_out_2(lb2_out_pix)
    );

    // address for lb to read from bitmap
    logic [BMP_ADDRW-1:0] lb_bmp_addr_read;
    always_ff @(posedge clk_pix) begin
        if (frame_end) begin
            lb_bmp_addr_read <= 0;
        end else if (lb_re) begin
            lb_bmp_addr_read <= lb_bmp_addr_read + 1;
        end
    end

    // calculate linebuffer system address
    logic [$clog2(LB_SCALE_V)-1:0] lb_repeat;  // repeat line based on scale
    always_ff @(posedge clk_pix) begin
        if (line_start) begin  // start new line
            if (lb_repeat == 0) begin  // time to read a fresh line of data?
                lb_re <= 1;
                lb_we <= 3'b111;
                lb_addr_write <= 0;
            end
            lb_repeat <= (lb_repeat == LB_SCALE_V-1) ? 0 : lb_repeat + 1;
        end else begin
            if (lb_addr_write < LB_LINE-1) begin  // next pixel
                lb_addr_write <= lb_addr_write + 1;
            end else begin  // disable drawing at end of line
                lb_re <= 0;
                lb_we <= 0;
            end
        end
    end

    // sim can run when linebuffer is not requesting data (leave one line empty)
    always_comb life_run = (lb_repeat > 1 && lb_repeat < LB_SCALE_V-1);

    // calculate linebuffer pixel address
    logic [$clog2(LB_SCALE_H)-1:0] scale_pix_cnt;
    always_ff @(posedge clk_pix) begin
        if (sx == 798) begin  // address 0 when sx=799, so we need to set when sx=798 (latency=1)
            lb_addr_read_pix <= 0;
            scale_pix_cnt <= 0;
        end else if (lb_addr_read_pix < LB_LINE-1) begin
            scale_pix_cnt <= (scale_pix_cnt < LB_SCALE_H-1) ? scale_pix_cnt + 1 : 0;
            if (scale_pix_cnt == LB_SCALE_H-1) lb_addr_read_pix <= lb_addr_read_pix + 1;
        end
    end

    // read into line buffer
    always_comb begin
        bmp_addr_read = (lb_re) ? lb_bmp_addr_read : cell_id;
        if (front_buffer == 1) bmp_addr_read = bmp_addr_read + BMP_PIXELS;
        bmp_addr_write = (front_buffer == 1) ? cell_id : cell_id + BMP_PIXELS;
        lb0_in = (bmp_data_out) ? 4'hF : 4'h9;
        lb1_in = (bmp_data_out) ? 4'hF : 4'h0;
        lb2_in = (bmp_data_out) ? 4'hF : 4'h0;
    end

    // VGA output
    always_comb begin
        vga_r = de ? lb2_out_pix : 4'h0;
        vga_g = de ? lb1_out_pix : 4'h0;
        vga_b = de ? lb0_out_pix : 4'h0;
    end
endmodule

NB. I will explain the latency correction in more detail in a later post on display scaling.

And we need a constraints file for our design:

  • Xilinx XC7: arty.xdc
  • Lattice iCE40: coming soon

You should be able to build this design and see the simple example with beehive, blinker, toad, and beacon patterns. See README for details on how to build the project.

If that works, try the Gosper glider gun by changing the seed file in top_life.sv:

parameter SEED_FILE = "gosper_glider.mem";  // seed to initiate universe with

The Gosper pattern repeatedly generates small gliders that move down the screen. Our glider gun doesn’t continue indefinitely because our universe wraps around: the gliders interfere with the gun pattern, and the universe eventually breaks down, before settling into a pattern of simple oscillators and still lifes. It’s straightforward to update the life module not to wrap around: imagine a world bordered by dead cells.
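For comparison, here is how a neighbour count might look in Python when the world is bordered by dead cells instead of wrapping. This is a software sketch only; count_neighbours is a hypothetical helper and I haven’t made the equivalent change to the life module.

```python
def count_neighbours(grid: list, x: int, y: int) -> int:
    """Count live neighbours, treating everything off-grid as dead."""
    height, width = len(grid), len(grid[0])
    total = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            nx, ny = x + dx, y + dy
            inside = 0 <= nx < width and 0 <= ny < height
            if (dx, dy) != (0, 0) and inside:
                total += grid[ny][nx]
    return total


grid = [
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
]
print(count_neighbours(grid, 0, 0))
```

On a wrapping world, the top-left corner of this grid would see the other three corners as live neighbours; with a dead border it sees none, so gliders that reach the edge can’t come back to interfere with the gun.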

You can also change how fast the simulation runs by changing the value of GEN_FRAMES (a value of 60 makes each generation last 1 second):

parameter GEN_FRAMES  = 60;  // each generation lasts this many frames

The Blue Planet

This next section is for the Arty (XC7) only, sorry. The iCEBreaker doesn’t have enough BRAM for these designs. If there’s interest, I’ll consider creating a version for iCEBreaker using SPRAM.

Our Life simulation featured tiny cells. At the other end of the scale, we can look back at Earth: the only place in the Universe known to support life.

Earthrise was taken by Bill Anders aboard Apollo 8 on December 24, 1968 (two years before Conway’s Life appeared in Scientific American), while in orbit around the Moon. Every person, animal, city, book… they’re all there on that little blue sphere.

Earthrise

Memory Requirements

To do the photo any justice we’re going to need a somewhat bigger and deeper bitmap than the one we used to simulate Life. Using 4 bits per pixel allows 16 colours, and if we want half-decent quality, then 320x240 is a good starting point. Going with 320x240x4-bit requires 300 Kb of BRAM (307,200 bits), which doesn’t tax the Arty’s resources too much.
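The memory figures in this post are simple to verify. The short Python sketch below works them through, taking 1 Kb to mean 1,024 bits; bitmap_bits is just an illustrative helper name.

```python
def bitmap_bits(width: int, height: int, bits_per_pixel: int) -> int:
    """Return the storage needed for a bitmap, in bits."""
    return width * height * bits_per_pixel


life_bits = bitmap_bits(80, 60, 1)     # monochrome Life bitmap (one buffer)
earth_bits = bitmap_bits(320, 240, 4)  # 16-colour Earthrise bitmap

print(life_bits, "bits =", life_bits // 8, "bytes")
print(earth_bits, "bits =", earth_bits // 1024, "Kb")
```

Remember that the double-buffered Life design needs two copies of its bitmap, while the photo needs only one.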

So Few Colours?

16 colours might bring back memories of garish early computer displays, but we’re not using any old 16 colours. Instead of each 4-bit pixel value representing a fixed colour, it represents a custom colour from a palette optimised for this image. The palette is held in a colour lookup table (CLUT). If a particular pixel has the value 1010 (ten), then we consult entry 10 in the CLUT, which contains the 12-bit colour to display.

While we’re still limited to a total of 16 colours, they can now be any 16 from the 4,096 colours the Arty’s VGA output is capable of producing. For example, a picture of a forest might have many greens, while one of a city would use more greys. This design was common in older computers, for example, the original Amiga chipset supported 32 colours from a possible 4,096: very similar to our design! The GIF and PNG formats still make use of this approach to squeeze the best quality out of 256-colour images.
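A colour lookup is simple enough to sketch in Python. The palette values below are made up for illustration (the real ones come from the palette file), and I’m assuming the 12-bit colour is laid out with red in the top four bits, then green, then blue.

```python
# Hypothetical 16-entry palette of 12-bit colours (0xRGB); values are made up.
PALETTE = [
    0x000, 0x111, 0x223, 0x335, 0x447, 0x559, 0x66B, 0x77D,
    0x88F, 0x9AF, 0xABF, 0xBCF, 0xCDF, 0xDEF, 0xEEF, 0xFFF,
]


def lookup(index: int) -> tuple:
    """Map a 4-bit pixel value to 4-bit (red, green, blue) channels."""
    colr = PALETTE[index & 0xF]  # consult the CLUT
    red = (colr >> 8) & 0xF
    green = (colr >> 4) & 0xF
    blue = colr & 0xF
    return red, green, blue


print(lookup(0b1010))  # pixel value ten selects entry 10
```

Changing the picture’s colours only requires a new 16-entry table; the 4-bit bitmap data itself stays the same size.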

If you’re interested in learning more about different colour palettes, there’s an excellent List of color palettes page on Wikipedia. That’s enough colour theory; let’s get on with loading our image onto the FPGA.

Preparing an Image for FPGA Memory

To display our image, we need a mechanism to load it and its palette into our design. Verilog provides a system task called $readmemh that can load values into memory from a text file of hex values. I’ve created a handy Python script, img2fmem, that generates suitable files for images and their palettes. I’ve used the script to convert a 320x240 version of Earthrise into hex values that can be loaded into block RAM.
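To give a feel for what $readmemh consumes, here is a minimal Python sketch that formats a list of values as whitespace-separated hex text. This is not the img2fmem implementation, and its real output layout may differ; to_mem_text and the sample values are assumptions for illustration.

```python
def to_mem_text(values: list, hex_digits: int) -> str:
    """Format values as one zero-padded hex number per line."""
    return "\n".join(f"{v:0{hex_digits}X}" for v in values) + "\n"


pixels = [0, 1, 10, 15]          # 4-bit palette indices: one hex digit each
palette = [0x000, 0x00F, 0xFFF]  # 12-bit colours: three hex digits each

print(to_mem_text(pixels, 1), end="")
print(to_mem_text(palette, 3), end="")
```

You could write the returned strings to .mem files and point a memory’s init file parameter at them, in the same way the designs in this post reference their seed and image files.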

You can see the hexadecimal data files for the bitmap and palette in the project source:

Creating Your Own Images

You can easily create your own images using the img2fmem script. The script is written in Python and uses the Pillow image library to perform the conversion. You can find it in the Project F FPGA Tools repo. Make sure your images are precisely 320x240 if you want to use them with the design in this tutorial.

Displaying the Photo

The actual display of the photo is more straightforward than with Life, as there’s no updating, just reading of data values - top_earth.sv:

module top_earth (
    input  wire logic clk_100m,     // 100 MHz clock
    input  wire logic btn_rst,      // reset button (active low)
    output      logic vga_hsync,    // horizontal sync
    output      logic vga_vsync,    // vertical sync
    output      logic [3:0] vga_r,  // 4-bit VGA red
    output      logic [3:0] vga_g,  // 4-bit VGA green
    output      logic [3:0] vga_b   // 4-bit VGA blue
    );

    // generate pixel clock
    logic clk_pix;
    logic clk_locked;
    clock_gen clock_640x480 (
       .clk(clk_100m),
       .rst(!btn_rst),  // reset button is active low
       .clk_pix,
       .clk_locked
    );

    // display timings
    localparam CORDW = 10;  // screen coordinate width in bits
    logic [CORDW-1:0] sx, sy;
    logic de;
    display_timings timings_640x480 (
        .clk_pix,
        .rst(!clk_locked),  // wait for clock lock
        .sx,
        .sy,
        .hsync(vga_hsync),
        .vsync(vga_vsync),
        .de
    );

    logic frame_end;   // high for one cycle at the end of a frame
    logic line_start;  // high for one cycle at line start (drawing lines only)
    always_comb begin
        frame_end = (sy == 524 && sx == 799);
        line_start = (sy < 480 && sx == 0);
    end

    // bitmap buffer
    localparam BMP_WIDTH  = 320;
    localparam BMP_HEIGHT = 240;
    localparam BMP_PIXELS = BMP_WIDTH * BMP_HEIGHT; 
    localparam BMP_ADDRW = $clog2(BMP_PIXELS);
    localparam BMP_DATAW = 4;  // colour bits per pixel
    localparam BMP_IMAGE = "earthrise_320x240.mem";
    localparam BMP_PALETTE = "earthrise_320x240_palette.mem";

    logic [BMP_ADDRW-1:0] bmp_addr_read;
    logic [BMP_DATAW-1:0] colr_idx;

    bram_sdp #(
        .WIDTH(BMP_DATAW),
        .DEPTH(BMP_PIXELS),
        .INIT_F(BMP_IMAGE)
    ) bram_bmp ( 
        .clk_read(clk_pix),
        .clk_write(clk_pix),
        .we(0),
        .addr_write(),
        .addr_read(bmp_addr_read),
        .data_in(),
        .data_out(colr_idx)
    );

    // Vivado makes this a LUT ROM
    logic [11:0] palette [0:15];  // 16 x 12-bit colour palette entries
    logic [11:0] colr;
    initial begin
        $display("Loading palette.");
        $readmemh(BMP_PALETTE, palette);  // bitmap palette to load
    end

    // line buffer
    localparam LB_SCALE_V = 2;                // factor to scale vertical drawing
    localparam LB_SCALE_H = 2;                // factor to scale horizontal drawing
    localparam LB_LINE  = 640 / LB_SCALE_H;   // line length
    localparam LB_ADDRW = $clog2(LB_LINE);    // line address width
    localparam LB_WIDTH = 4;                  // 4-bits per colour channel

    // line buffer read port (pixel clock)
    logic [LB_ADDRW-1:0] lb_addr_read_pix;
    logic [LB_WIDTH-1:0] lb0_out_pix, lb1_out_pix, lb2_out_pix;

    // line buffer write port (system clock - latency corrected)
    logic lb_re;  // signals when the line buffer needs to read data from bitmap
    logic [2:0] lb_we, lb_we_l1;
    logic [LB_ADDRW-1:0] lb_addr_write, lb_addr_write_l1;
    logic [LB_WIDTH-1:0] lb0_in, lb1_in, lb2_in;

    // latency correction
    always_ff @(posedge clk_pix) begin
        lb_we_l1 <= lb_we;
        lb_addr_write_l1 <= lb_addr_write;
    end

    linebuffer #(
        .WIDTH(LB_WIDTH),
        .DEPTH(LB_LINE)
        ) lb (
        .clk_write(clk_pix),
        .clk_read(clk_pix),
        .we(lb_we_l1),                  // use lb_we_l1 to correct for latency
        .addr_write(lb_addr_write_l1),  // use lb_addr_write_l1 to correct for latency
        .addr_read(lb_addr_read_pix),
        .data_in_0(lb0_in),
        .data_in_1(lb1_in),
        .data_in_2(lb2_in),
        .data_out_0(lb0_out_pix),
        .data_out_1(lb1_out_pix),
        .data_out_2(lb2_out_pix)
    );

    // address for lb to read from bitmap
    logic [BMP_ADDRW-1:0] lb_bmp_addr_read;
    always_ff @(posedge clk_pix) begin
        if (frame_end) begin
            lb_bmp_addr_read <= 0;
        end else if (lb_re) begin
            lb_bmp_addr_read <= lb_bmp_addr_read + 1;
        end
    end

    // calculate linebuffer system address
    logic [$clog2(LB_SCALE_V)-1:0] lb_repeat;  // repeat line based on scale
    always_ff @(posedge clk_pix) begin
        if (line_start) begin  // start new line
            if (lb_repeat == 0) begin  // time to read a fresh line of data?
                lb_re <= 1;
                lb_we <= 3'b111;
                lb_addr_write <= 0;
            end
            lb_repeat <= (lb_repeat == LB_SCALE_V-1) ? 0 : lb_repeat + 1;
        end else begin
            if (lb_addr_write < LB_LINE-1) begin  // next pixel
                lb_addr_write <= lb_addr_write + 1;
            end else begin  // disable drawing at end of line
                lb_re <= 0;
                lb_we <= 0;
            end
        end
    end

    // calculate linebuffer pixel address
    logic [$clog2(LB_SCALE_H)-1:0] scale_pix_cnt;
    always_ff @(posedge clk_pix) begin
        if (sx == 798) begin  // address 0 when sx=799, so we need to set when sx=798 (latency=1)
            lb_addr_read_pix <= 0;
            scale_pix_cnt <= 0;
        end else if (lb_addr_read_pix < LB_LINE-1) begin
            scale_pix_cnt <= (scale_pix_cnt < LB_SCALE_H-1) ? scale_pix_cnt + 1 : 0;
            if (scale_pix_cnt == LB_SCALE_H-1) lb_addr_read_pix <= lb_addr_read_pix + 1;
        end
    end

    // read into line buffer
    always_comb begin
        bmp_addr_read = lb_bmp_addr_read;
        lb0_in = colr[3:0];
        lb1_in = colr[7:4];
        lb2_in = colr[11:8];
    end

    // lookup colour in CLUT
    always_ff @(posedge clk_pix) begin
        colr <= palette[colr_idx];
    end

    // VGA output
    always_comb begin
        vga_r = de ? lb2_out_pix : 4'h0;
        vga_g = de ? lb1_out_pix : 4'h0;
        vga_b = de ? lb0_out_pix : 4'h0;
    end
endmodule

The same constraints file can be used as before: arty.xdc.

Fizzle Out

In the third part of our series, FPGA Ad Astra, we made use of linear feedback shift registers (LFSRs) to create animated starfields. LFSRs have another use: creating random dissolve effects, as used in Wolfenstein 3D and described by Fabien Sanglard in his excellent Fizzlefade blog post. We can use this effect to dissolve our photograph to an empty background using a separate bitmap mask.
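While the hardware design is in progress, the dissolve ordering can be prototyped in Python. This sketch uses a 17-bit Galois LFSR with the tap mask 0x12000, the same constant described in the Fizzlefade post, to visit every pixel of a 320x240 bitmap exactly once. The function name fizzle_order and the way out-of-range values are skipped are my own choices, not the eventual Verilog design.

```python
WIDTH, HEIGHT = 320, 240
PIXELS = WIDTH * HEIGHT
TAPS = 0x12000  # 17-bit maximal-length Galois LFSR tap mask


def fizzle_order():
    """Yield each pixel address exactly once in pseudo-random order."""
    state = 1
    while True:
        addr = state - 1   # the LFSR never reaches 0, so shift down by one
        if addr < PIXELS:  # skip values that fall outside the bitmap
            yield addr
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= TAPS
        if state == 1:     # back at the seed: full sequence completed
            return


order = list(fizzle_order())
print(len(order), order[:5])
```

A 17-bit register is the smallest that covers 76,800 pixels, so fewer than half of the 131,071 states are wasted on out-of-range addresses. Each yielded address could be used to clear one bit in the mask bitmap.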

The fizzlefade design is still being worked on. Check back soon.

Explore

I hope you enjoyed this instalment of Exploring FPGA Graphics, but nothing beats creating your own designs. Here are a few suggestions to get you started:

  • Experiment with your own Game of Life seed files
  • Display your own photos as bitmaps
  • Transition between two bitmaps using fizzlefade
  • Implement a Universal Turing Machine using Life: take a look at Paul Rendell’s Attic to get started

Next Time

In the next part, we’ll learn about drawing lines: the basis of most 2D and 3D graphics. The next part will be available later in 2020. Follow @WillFlux for updates.

©2020 Will Green, Project F