30 October 2020

Framebuffers

Welcome back to Exploring FPGA Graphics. In the previous two parts, we worked with sprites, but as graphics become more complex, a different approach is needed. Instead of drawing directly to the screen, we draw to a framebuffer, which is read out when required by the screen. This post provides an introduction to framebuffers and how to scale them. We’ll also learn how to fizzlefade graphics Wolfenstein 3D style. In the next part, we’ll use a framebuffer to visualize a simulation of life.

In this series, we explore graphics at the hardware level and get a feel for the power of FPGAs. We start by learning how displays work, before racing the beam with Pong, starfields and sprites, simulating life with bitmaps, drawing lines and triangles, and finally creating simple 3D models. I’ll be writing and revising this series throughout 2020 and 2021. New to the series? Start with Exploring FPGA Graphics.

Updated 2021-01-19. Get in touch with @WillFlux or open an issue on GitHub.

Series Outline

  • Exploring FPGA Graphics - learn how displays work and animate simple shapes
  • FPGA Pong - race the beam to create the arcade classic
  • Hardware Sprites - fast, colourful graphics with minimal resources
  • FPGA Ad Astra - demo with hardware sprites and animated starfields
  • Framebuffers (this post) - driving the display from a bitmap in memory
  • Life on Screen - the screen comes alive with Conway’s Game of Life

More parts to follow.

Requirements

For this series, you need an FPGA board with video output. We’ll be working at 640x480, so pretty much any video output will do. You should be comfortable with programming your FPGA board and reasonably familiar with Verilog.

We’ll be demoing with these boards:

  • iCEBreaker (Lattice iCE40) with DVI output
  • Arty (Xilinx 7 series) with VGA output

Source

The SystemVerilog designs featured in this series are available from the projf-explore repo on GitHub. The designs are open source hardware under the permissive MIT licence, but this blog is subject to normal copyright restrictions.

Framebuffer

A framebuffer is an in-memory bitmap that drives pixels on screen. When you write to a memory location within the framebuffer, the corresponding pixel will change on screen. Using a framebuffer provides two big benefits: we’re free to create sophisticated graphics using whatever technique we like, and the setting of pixel colour is separated from the process driving the screen. The flexibility of a framebuffer comes at the cost of increased memory storage and latency.

A Small Buffer

A framebuffer requires enough memory to hold the complete frame. To keep things simple, we’re going to store our framebuffer in internal FPGA block memory (BRAM). The iCEBreaker’s iCE40 FPGA has thirty 4kb BRAMs, for 120 kb in total, so that’s what we’ll target. You can learn more about block RAM in FPGA Memory Types.

If we divide our 640x480 screen by four in both dimensions, we get 160x120. 160x120 is 19,200 pixels, so a monochrome framebuffer requires 18.75 kilobits of memory (19,200 = 18.75 * 1024).
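
The arithmetic is easy to check in software. Here’s a rough Python sketch (not part of the FPGA design; the function names are just for illustration) that works out the storage for a bitmap and a lower bound on the number of 4kb BRAMs it needs:

```python
import math

def framebuffer_bits(width, height, bits_per_pixel):
    """Total storage in bits for a width x height bitmap."""
    return width * height * bits_per_pixel

def brams_needed(total_bits, bram_bits=4096):
    """Lower bound on BRAM count; ignores any port-width restrictions."""
    return math.ceil(total_bits / bram_bits)

mono = framebuffer_bits(160, 120, 1)
print(mono, mono / 1024)       # 19200 bits, 18.75 kilobits
print(brams_needed(mono))      # 5

full = framebuffer_bits(640, 480, 1)
print(full / 1024)             # 300.0 kilobits
```

Even at one bit per pixel, a full 640x480 bitmap needs 300 kilobits, which is why we divide the resolution by four to fit the iCE40’s 120 kb.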

We’ll start by wiring up our usual display timings with a framebuffer based on BRAM, then draw a simple horizontal line in it to confirm everything is working.

The Xilinx version is shown below:

module top_line (
    input  wire logic clk_100m,     // 100 MHz clock
    input  wire logic btn_rst,      // reset button (active low)
    output      logic vga_hsync,    // horizontal sync
    output      logic vga_vsync,    // vertical sync
    output      logic [3:0] vga_r,  // 4-bit VGA red
    output      logic [3:0] vga_g,  // 4-bit VGA green
    output      logic [3:0] vga_b   // 4-bit VGA blue
    );

    // generate pixel clock
    logic clk_pix;
    logic clk_locked;
    clock_gen clock_640x480 (
       .clk(clk_100m),
       .rst(!btn_rst),  // reset button is active low
       .clk_pix,
       .clk_locked
    );

    // display timings
    localparam CORDW = 10;  // screen coordinate width in bits
    logic [CORDW-1:0] sx, sy;
    logic hsync, vsync;
    display_timings_480p timings_640x480 (
        .clk_pix,
        .rst(!clk_locked),  // wait for clock lock
        .sx,
        .sy,
        .hsync,
        .vsync,
        .de()
    );

    // size of screen with and without blanking
    localparam H_RES_FULL = 800;
    localparam V_RES_FULL = 525;
    localparam H_RES = 640;
    localparam V_RES = 480;

    // framebuffer
    localparam FB_WIDTH  = 160;
    localparam FB_HEIGHT = 120;
    localparam FB_PIXELS = FB_WIDTH * FB_HEIGHT;
    localparam FB_ADDRW  = $clog2(FB_PIXELS);
    localparam FB_DATAW  = 1;  // colour bits per pixel

    logic fb_we;
    logic [FB_ADDRW-1:0] fb_addr_write, fb_addr_read;
    logic [FB_DATAW-1:0] fb_colr_write, fb_colr_read;

    bram_sdp #(
        .WIDTH(FB_DATAW),
        .DEPTH(FB_PIXELS)
    ) framebuffer (
        .clk_write(clk_pix),
        .clk_read(clk_pix),
        .we(fb_we),
        .addr_write(fb_addr_write),
        .addr_read(fb_addr_read),
        .data_in(fb_colr_write),
        .data_out(fb_colr_read)
    );

    // draw a horizontal line at the top of the framebuffer
    always_ff @(posedge clk_pix) begin
        if (sy >= V_RES) begin  // draw in blanking interval
            if (fb_we == 0 && fb_addr_write != FB_WIDTH-1) begin
                fb_colr_write <= 1;
                fb_we <= 1;
            end else if (fb_addr_write != FB_WIDTH-1) begin
                fb_addr_write <= fb_addr_write + 1;
            end else begin
                fb_colr_write <= 0;
                fb_we <= 0;
            end
        end
    end

    // determine when framebuffer is active for reading
    logic fb_active;
    always_comb fb_active = (sy < FB_HEIGHT && sx < FB_WIDTH);

    // calculate framebuffer read address for output to display
    always_ff @(posedge clk_pix) begin
        if (sy == V_RES_FULL-1 && sx == H_RES_FULL-1) begin
            fb_addr_read <= 0;  // reset address at end of frame
        end else if (fb_active) begin
            fb_addr_read <= fb_addr_read + 1;
        end
    end

    // VGA output
    always_ff @(posedge clk_pix) begin
        vga_hsync <= hsync;
        vga_vsync <= vsync;
        vga_r <= (fb_active && fb_colr_read) ? 4'hF : 4'h0;
        vga_g <= (fb_active && fb_colr_read) ? 4'hF : 4'h0;
        vga_b <= (fb_active && fb_colr_read) ? 4'hF : 4'h0;
    end
endmodule

Build this design and you should see a white horizontal line at the top left of your screen. It’s only 160 pixels long, so it won’t go right across the screen. If you’re using a VGA monitor you may need to tweak your monitor settings to ensure the line is visible.
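
If you’d like to convince yourself that the read address logic walks the framebuffer in order, here’s a small Python model of it. It’s a sketch under simplifying assumptions: it steps through one frame of sx and sy values, mimics the fb_active test and the registered address counter, and ignores BRAM read latency.

```python
H_RES_FULL, V_RES_FULL = 800, 525
FB_WIDTH, FB_HEIGHT = 160, 120

def simulate_frame():
    """Step through one frame; record the read address at each active pixel."""
    addr = 0
    seen = {}
    for sy in range(V_RES_FULL):
        for sx in range(H_RES_FULL):
            fb_active = sy < FB_HEIGHT and sx < FB_WIDTH
            if fb_active:
                seen[(sx, sy)] = addr
            # register update on the clock edge
            if sy == V_RES_FULL - 1 and sx == H_RES_FULL - 1:
                addr = 0  # reset address at end of frame
            elif fb_active:
                addr += 1
    return seen, addr

seen, final_addr = simulate_frame()
print(seen[(0, 0)], seen[(0, 1)], seen[(159, 119)])  # 0 160 19199
print(len(seen), final_addr)                         # 19200 0
```

Each active pixel at (sx, sy) sees address sy * 160 + sx, and the counter is back at zero ready for the next frame.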

Building the Designs
In the Framebuffers section of the git repo, you’ll find the design files, a makefile for iCEBreaker, a Vivado project for Arty, and instructions for building the designs for both boards.

A Small Bitmap

Michelangelo’s David

Now our framebuffer is wired up correctly, let’s load something into it. A small monochrome framebuffer calls for a striking image: I’ve chosen David by Michelangelo.

The version on the right is the original from Wikipedia; it has 64 shades of grey. I created the middle image by reducing the original to 16 colours using img2fmem with no dithering (we will discuss this tool later in the post). The monochrome image on the left was created by Gerbrant using Floyd-Steinberg dithering.

Create a new top module to load our monochrome image of David:

top_david_v1 BRAM usage: 1x36Kb on XC7 and 10x4Kb on iCE40

The iCE40 version is shown below:

module top_david_v1 (
    input  wire logic clk_12m,      // 12 MHz clock
    input  wire logic btn_rst,      // reset button (active high)
    output      logic dvi_clk,      // DVI pixel clock
    output      logic dvi_hsync,    // DVI horizontal sync
    output      logic dvi_vsync,    // DVI vertical sync
    output      logic dvi_de,       // DVI data enable
    output      logic [3:0] dvi_r,  // 4-bit DVI red
    output      logic [3:0] dvi_g,  // 4-bit DVI green
    output      logic [3:0] dvi_b   // 4-bit DVI blue
    );

    // generate pixel clock
    logic clk_pix;
    logic clk_locked;
    clock_gen clock_640x480 (
       .clk(clk_12m),
       .rst(btn_rst),
       .clk_pix,
       .clk_locked
    );

    // display timings
    localparam CORDW = 10;  // screen coordinate width in bits
    logic [CORDW-1:0] sx, sy;
    logic hsync, vsync, de;
    display_timings_480p timings_640x480 (
        .clk_pix,
        .rst(!clk_locked),  // wait for clock lock
        .sx,
        .sy,
        .hsync,
        .vsync,
        .de
    );

    // size of screen with and without blanking
    localparam H_RES_FULL = 800;
    localparam V_RES_FULL = 525;
    localparam H_RES = 640;
    localparam V_RES = 480;

    // framebuffer
    localparam FB_WIDTH  = 160;
    localparam FB_HEIGHT = 120;
    localparam FB_PIXELS = FB_WIDTH * FB_HEIGHT;
    localparam FB_ADDRW  = $clog2(FB_PIXELS);
    localparam FB_DATAW  = 1;  // colour bits per pixel
    localparam FB_IMAGE  = "../res/david/david_1bit.mem";

    logic fb_we;
    logic [FB_ADDRW-1:0] fb_addr_write, fb_addr_read;
    logic [FB_DATAW-1:0] fb_colr_write, fb_colr_read;

    bram_sdp #(
        .WIDTH(FB_DATAW),
        .DEPTH(FB_PIXELS),
        .INIT_F(FB_IMAGE)
    ) framebuffer (
        .clk_write(clk_pix),
        .clk_read(clk_pix),
        .we(fb_we),
        .addr_write(fb_addr_write),
        .addr_read(fb_addr_read),
        .data_in(fb_colr_write),
        .data_out(fb_colr_read)
    );

    // draw a horizontal line at the top of the framebuffer
    always_ff @(posedge clk_pix) begin
        if (sy >= V_RES) begin  // draw in blanking interval
            if (fb_we == 0 && fb_addr_write != FB_WIDTH-1) begin
                fb_colr_write <= 1;
                fb_we <= 1;
            end else if (fb_addr_write != FB_WIDTH-1) begin
                fb_addr_write <= fb_addr_write + 1;
            end else begin
                fb_colr_write <= 0;
                fb_we <= 0;
            end
        end
    end

    // flag when framebuffer is active for display
    logic fb_active;
    always_comb fb_active = (sy < FB_HEIGHT && sx < FB_WIDTH);

    always_ff @(posedge clk_pix) begin
        if (sy == V_RES_FULL-1 && sx == H_RES_FULL-1) begin
            fb_addr_read <= 0;  // reset address at end of frame
        end else if (fb_active) begin
            fb_addr_read <= fb_addr_read + 1;
        end
    end

    logic [3:0] red, green, blue;  // output colour
    always_comb begin
        red   = (fb_active && fb_colr_read) ? 4'hF : 4'h0;
        green = (fb_active && fb_colr_read) ? 4'hF : 4'h0;
        blue  = (fb_active && fb_colr_read) ? 4'hF : 4'h0;
    end

    // Output DVI clock: 180° out of phase with other DVI signals
    SB_IO #(
        .PIN_TYPE(6'b010000)  // PIN_OUTPUT_DDR
    ) dvi_clk_io (
        .PACKAGE_PIN(dvi_clk),
        .OUTPUT_CLK(clk_pix),
        .D_OUT_0(1'b0),
        .D_OUT_1(1'b1)
    );

    // Output DVI signals
    SB_IO #(
        .PIN_TYPE(6'b010100)  // PIN_OUTPUT_REGISTERED
    ) dvi_signal_io [14:0] (
        .PACKAGE_PIN({dvi_hsync, dvi_vsync, dvi_de, dvi_r, dvi_g, dvi_b}),
        .OUTPUT_CLK(clk_pix),
        .D_OUT_0({hsync, vsync, de, red, green, blue}),
        .D_OUT_1()
    );
endmodule

Build this design and program your board. You should see the dithered image of David looking at you from the top left of your screen. We’re still drawing a white line at the top of the framebuffer, so this replaces the top line of the bitmap we’re loading.

iCE40: One Bit Too Many
The iCE40 version uses more block RAM than you might expect: 160x120 should fit into five 4kb BRAMs, but the david_v1 design requires ten. This is because the iCE40 BRAMs are a minimum of two bits wide. When we expand the colour depth to 4-bit, the BRAM usage will be the expected twenty. Learn more in the Lattice ICE Technology Library.
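
Here’s a rough Python sketch of where ten comes from. It assumes the memory is built from the narrowest iCE40 BRAM shape (2048 addresses of 2 bits); the real mapping is decided by the synthesis tool, so treat this as a back-of-the-envelope count rather than a description of what the tool does.

```python
import math

# 4kb iCE40 block RAM shapes as (depth, width)
ICE40_SHAPES = [(256, 16), (512, 8), (1024, 4), (2048, 2)]
NARROWEST = min(ICE40_SHAPES, key=lambda shape: shape[1])  # (2048, 2)

def brams_for(pixels, bits_per_pixel, shape=NARROWEST):
    """Count BRAMs when tiling one shape to cover the memory."""
    depth, width = shape
    columns = math.ceil(bits_per_pixel / width)  # BRAMs side by side for width
    rows = math.ceil(pixels / depth)             # BRAMs stacked for depth
    return rows * columns

pixels = 160 * 120
print(brams_for(pixels, 1))  # 10: half of every 2-bit word is unused
print(brams_for(pixels, 4))  # 20
```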

Casting Shade

We can increase the bits assigned to each pixel to support more colours, or in the case of this image of David, more shades. Our video output is 12-bit, which means there are 16 possible shades of grey.

As with the hedgehog graphic in Hardware Sprites, we store the palette in a file: david_palette.mem.

FFF EEE DDD CCC BBB AAA 999 888 777 666 555 444 333 222 111 000

Greyscale

We load the palette file into a colour lookup table (CLUT) ROM, which is used to find the colour of each pixel for display. If you need a refresher on CLUTs, see A Refined Palette.
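
To make the lookup concrete, here’s a minimal Python model of a CLUT. It parses the greyscale palette above into 4-bit red, green, and blue values, then looks up a few colour indices; the indices in framebuffer_line are made up for the example.

```python
PALETTE_TEXT = "FFF EEE DDD CCC BBB AAA 999 888 777 666 555 444 333 222 111 000"

def load_palette(text):
    """Parse whitespace-separated 12-bit hex colours into (r, g, b) tuples."""
    clut = []
    for word in text.split():
        value = int(word, 16)
        clut.append(((value >> 8) & 0xF, (value >> 4) & 0xF, value & 0xF))
    return clut

clut = load_palette(PALETTE_TEXT)
framebuffer_line = [0, 3, 7, 15]  # hypothetical colour indices
print([clut[i] for i in framebuffer_line])
# [(15, 15, 15), (12, 12, 12), (8, 8, 8), (0, 0, 0)]
```

The framebuffer only stores the 4-bit index; the 12-bit colour comes from the table.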

Build the updated top module with 4-bit greyscale David:

top_david_v2 BRAM usage: 4x36Kb on XC7 and 20x4Kb on iCE40

Warm Tones

Because the palette is separate from the image, we can quickly change it. In top_david_v2.sv, update localparam FB_PALETTE to reference david_palette_warm.mem and rebuild.

The warm palette looks like this:

FED EDC DCB CBA BA9 A98 987 876 765 654 543 432 321 210 100 000

Warm shades

There is also an inverted palette, david_palette_invert.mem, or you can create your own.
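
If you fancy making your own palette, a few lines of Python will do. This sketch inverts every 12-bit colour in a palette string; it’s one plausible way to build an inverted palette and isn’t necessarily how david_palette_invert.mem was made.

```python
GREY = "FFF EEE DDD CCC BBB AAA 999 888 777 666 555 444 333 222 111 000"

def invert_palette(text):
    """Invert each 12-bit colour (bitwise NOT within 12 bits)."""
    return " ".join(f"{0xFFF ^ int(word, 16):03X}" for word in text.split())

print(invert_palette(GREY))
# 000 111 222 333 444 555 666 777 888 999 AAA BBB CCC DDD EEE FFF
```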

Quick Aside: Palettes
Wikipedia has a lovely list of color palettes for different hardware systems.

Framebuffer Scaling

Our framebuffer is too small to fill a 640x480 screen, and modern monitors don’t support lower resolutions. To make our framebuffer fill the screen, we need to scale it up.

We’ve made our framebuffer an integer divisor of our display resolution, so scaling ought to be simple. However, practical scaling isn’t quite as simple as it first appears. For example, given our display coordinates sx and sy you could calculate the current framebuffer read address with:

always_comb fb_addr_read = sx / 4 + (sy / 4) * 160;  // ?!

This has at least three problems:

  1. It doesn’t account for memory latency: it takes one or more cycles to read BRAM
  2. It wastes memory bandwidth: every value is read 16 times per frame!
  3. It uses multiplication, which takes up valuable (probably DSP) logic
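
The second point is easy to confirm with a quick Python count. This sketch evaluates the naive address calculation for every visible pixel and tallies how often each framebuffer address is read:

```python
from collections import Counter

H_RES, V_RES, FB_WIDTH, SCALE = 640, 480, 160, 4

reads = Counter(
    sx // SCALE + (sy // SCALE) * FB_WIDTH
    for sy in range(V_RES)
    for sx in range(H_RES)
)
print(len(reads))           # 19200 distinct addresses
print(set(reads.values()))  # {16}
```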

Using a linebuffer allows us to avoid all three of these problems. The linebuffer uses three simple dual-port BRAMs, one for each colour channel (red, green, blue). We copy a single 160-pixel line of the framebuffer into the linebuffer, then we read this out for four display lines before going back to the framebuffer for the next line. By making efficient use of memory bandwidth, we can better share memory with other systems, such as a CPU.

Add a linebuffer module - linebuffer.sv:

module linebuffer #(
    parameter WIDTH=8,   // data width of each channel
    parameter LEN=2048,  // length of line
    parameter SCALEW=6   // horizontal scaling width
    ) (
    input  wire logic clk_in,
    input  wire logic clk_out,
    input  wire logic en_in,
    input  wire logic en_out,
    input  wire logic rst_in,
    input  wire logic rst_out,
    input  wire logic [SCALEW-1:0] scale,
    input  wire logic [WIDTH-1:0] data_in_0, data_in_1, data_in_2,
    output      logic [WIDTH-1:0] data_out_0, data_out_1, data_out_2
    );

    logic [$clog2(LEN)-1:0] addr_in, addr_out;
    logic [SCALEW-1:0] cnt_scale;

    // correct scale: if scale is 0, set to 1
    logic [SCALEW-1:0] scale_cor;
    always_comb scale_cor = (scale == 0) ? 1 : scale;

    always_ff @(posedge clk_in) begin
        if (en_in) addr_in <= (addr_in == LEN-1) ? 0 : addr_in + 1;
        if (rst_in) addr_in <= 0;  // reset takes precedence
    end

    always_ff @(posedge clk_out) begin
        if (en_out) begin
            cnt_scale <= (cnt_scale == scale_cor-1) ? 0 : cnt_scale + 1;
            if (cnt_scale == scale_cor-1)
                addr_out <= (addr_out == LEN-1) ? 0 : addr_out + 1;
        end
        if (rst_out) begin  // reset takes precedence
            addr_out <= 0;
            cnt_scale <= 0;
        end
    end

    // channel 0
    bram_sdp #(.WIDTH(WIDTH), .DEPTH(LEN)) ch0 (
        .clk_write(clk_in),
        .clk_read(clk_out),
        .we(en_in),
        .addr_write(addr_in),
        .addr_read(addr_out),
        .data_in(data_in_0),
        .data_out(data_out_0)
    );

    // channel 1
    bram_sdp #(.WIDTH(WIDTH), .DEPTH(LEN)) ch1 (
        .clk_write(clk_in),
        .clk_read(clk_out),
        .we(en_in),
        .addr_write(addr_in),
        .addr_read(addr_out),
        .data_in(data_in_1),
        .data_out(data_out_1)
    );

    // channel 2
    bram_sdp #(.WIDTH(WIDTH), .DEPTH(LEN)) ch2 (
        .clk_write(clk_in),
        .clk_read(clk_out),
        .we(en_in),
        .addr_write(addr_in),
        .addr_read(addr_out),
        .data_in(data_in_2),
        .data_out(data_out_2)
    );
endmodule

The linebuffer module handles horizontal scaling, but not vertical scaling. Vertical scaling requires the repetition of lines; this is dealt with by the top module. We use three separate BRAM instances because colours are output separately to the display and 4-bit-wide buffers are a better fit for BRAM hardware than a single 12-bit-wide buffer.
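
Here’s a small Python model of the linebuffer’s output address logic, to show how cnt_scale stretches each pixel horizontally. It’s a sketch that mirrors the addr_out and cnt_scale registers from linebuffer.sv, but leaves out the BRAMs and their read latency.

```python
class LinebufferOut:
    """Software model of the linebuffer's output address counter."""

    def __init__(self, length, scale):
        self.length = length
        self.scale = scale if scale != 0 else 1  # same correction as scale_cor
        self.addr_out = 0
        self.cnt_scale = 0

    def clock(self, en_out=True, rst_out=False):
        """Return the current address, then apply one clk_out edge."""
        addr = self.addr_out
        if rst_out:  # reset takes precedence
            self.addr_out = 0
            self.cnt_scale = 0
        elif en_out:
            if self.cnt_scale == self.scale - 1:
                self.cnt_scale = 0
                self.addr_out = 0 if addr == self.length - 1 else addr + 1
            else:
                self.cnt_scale += 1
        return addr

lb = LinebufferOut(length=160, scale=4)
addrs = [lb.clock() for _ in range(640)]
print(addrs[:9])  # [0, 0, 0, 0, 1, 1, 1, 1, 2]
```

With a scale of four, 160 stored pixels cover all 640 display pixels, and the address wraps back to zero at the end of the line.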

Quick Aside: Clock Domain Crossing
The linebuffer module has another significant benefit: it allows the pixel clock to be different from the rest of the design. For example, you could draw on the framebuffer at 100 MHz (linebuffer data_in) and display it on a 720p60 screen with a 74.25 MHz pixel clock (linebuffer data_out). CDC is usually achieved with a FIFO, but the linebuffer does it while providing efficient scaling.

Top Scaling

To drive the linebuffer, we need to enable and reset its data inputs and outputs at appropriate times. This extract from the top module (source link below the listing) shows vertical scaling by a factor of four:

    // linebuffer (LB)
    localparam LB_SCALE_V = 4;               // scale vertical drawing
    localparam LB_SCALE_H = 4;               // scale horizontal drawing
    localparam LB_LEN = H_RES / LB_SCALE_H;  // line length
    localparam LB_WIDTH = 4;                 // bits per colour channel

    // LB data in from FB
    logic lb_en_in, lb_en_in_1;  // allow for BRAM latency correction
    logic [LB_WIDTH-1:0] lb_in_0, lb_in_1, lb_in_2;

    // correct vertical scale: if scale is 0, set to 1
    logic [$clog2(LB_SCALE_V+1):0] scale_v_cor;
    always_comb scale_v_cor = (LB_SCALE_V == 0) ? 1 : LB_SCALE_V;

    // count screen lines for vertical scaling - read when cnt_scale_v==0
    logic [$clog2(LB_SCALE_V):0] cnt_scale_v;
    always_ff @(posedge clk_pix) begin
        if (sx == 0)
            cnt_scale_v <= (cnt_scale_v == scale_v_cor-1) ? 0 : cnt_scale_v + 1;
        if (sy == V_RES_FULL-1) cnt_scale_v <= 0;
    end

    logic [$clog2(FB_WIDTH)-1:0] fb_h_cnt;  // counter for FB pixels on line
    always_ff @(posedge clk_pix) begin
        if (sy == V_RES_FULL-1 && sx == H_RES-1) fb_addr_read <= 0;

        // reset horizontal counter at the start of blanking on reading lines
        if (cnt_scale_v == 0 && sx == H_RES) begin
            if (fb_addr_read < FB_PIXELS-1) fb_h_cnt <= 0;  // read all pixels?
        end

        // read each pixel on FB line and write to LB
        if (fb_h_cnt < FB_WIDTH) begin
            lb_en_in <= 1;
            fb_h_cnt <= fb_h_cnt + 1;
            fb_addr_read <= fb_addr_read + 1;
        end else begin
            lb_en_in <= 0;
        end

        // enable LB data in with latency correction
        lb_en_in_1 <= lb_en_in;
    end

    // LB data out to display
    logic [LB_WIDTH-1:0] lb_out_0, lb_out_1, lb_out_2;

    linebuffer #(
        .WIDTH(LB_WIDTH),
        .LEN(LB_LEN)
        ) lb_inst (
        .clk_in(clk_pix),
        .clk_out(clk_pix),
        .en_in(lb_en_in_1),  // correct for BRAM latency
        .en_out(sy < V_RES && sx < H_RES),
        .rst_in(sx == H_RES),  // reset at start of horizontal blanking
        .rst_out(sx == H_RES),
        .scale(LB_SCALE_H),
        .data_in_0(lb_in_0),
        .data_in_1(lb_in_1),
        .data_in_2(lb_in_2),
        .data_out_0(lb_out_0),
        .data_out_1(lb_out_1),
        .data_out_2(lb_out_2)
    );
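
It can be tricky to see which screen lines trigger a framebuffer read, so here’s a rough Python model of cnt_scale_v from the extract above. It works a line at a time and makes one simplification: rather than tracking fb_addr_read, it stops once all 120 framebuffer lines have been read.

```python
V_RES_FULL = 525
LB_SCALE_V = 4
FB_HEIGHT = 120

def reading_lines():
    """Return the screen lines (sy) whose blanking period loads the linebuffer."""
    cnt_scale_v = 0
    fb_lines_read = 0
    reads = []
    # start on the last line of the frame, where the counter is held at zero
    for sy in [V_RES_FULL - 1] + list(range(V_RES_FULL - 1)):
        if sy == V_RES_FULL - 1:
            cnt_scale_v = 0
        elif cnt_scale_v == LB_SCALE_V - 1:  # update happens at sx == 0
            cnt_scale_v = 0
        else:
            cnt_scale_v += 1
        # at sx == H_RES: read a framebuffer line if the counter is zero
        if cnt_scale_v == 0 and fb_lines_read < FB_HEIGHT:
            reads.append(sy)
            fb_lines_read += 1
    return reads

reads = reading_lines()
print(reads[:4], reads[-1], len(reads))  # [524, 3, 7, 11] 475 120
```

The first framebuffer line is loaded during the last line of the previous frame, ready for screen line 0; after that, a new line is loaded every fourth screen line.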

Build the updated top module with scaled David:

top_david BRAM usage: 4.5x36Kb on XC7 and 23x4Kb on iCE40

Quick Aside: BRAM Optimization
You’d expect the linebuffer to use 3x18Kb BRAMs (1.5x36Kb) on the Xilinx FPGA, but because all three colour channels are the same for the greyscale palette, the linebuffers are optimised from 3x18Kb BRAMs to 1x18Kb. The warm palette does use 1.5x36Kb BRAMs for a total of 5.5.

Creating Your Own Images

You can easily create your own images using img2fmem. The script is written in Python and uses the Pillow image library to perform the conversion. You can find it in the Project F FPGA Tools repo. Make sure your images are the same dimensions as the framebuffer you’re using.

To convert an image called acme.png to 4-bit colour with 12-bit palette for use with $readmemh:

img2fmem.py acme.png 4 mem 12

For details on installation and command line options, see the img2fmem README.
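
If you’re curious what a memory file for $readmemh can look like, here’s a hand-rolled Python sketch that formats 4-bit colour indices as whitespace-separated hex digits. This isn’t img2fmem’s code and its output layout may differ; to_mem and per_line are names invented for this example.

```python
def to_mem(indices, per_line=16):
    """Format 4-bit colour indices as hex digits, per_line values per row."""
    if any(not 0 <= i <= 15 for i in indices):
        raise ValueError("colour indices must fit in 4 bits")
    rows = []
    for start in range(0, len(indices), per_line):
        row = indices[start:start + per_line]
        rows.append(" ".join(f"{i:X}" for i in row))
    return "\n".join(rows) + "\n"

print(to_mem([0, 1, 10, 15], per_line=2), end="")
# 0 1
# A F
```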

Fizzle Out

In FPGA Ad Astra, we used linear feedback shift registers (LFSRs) to create animated starfields. LFSRs have another graphical use: creating random dissolve effects, as used in Wolfenstein 3D and explained by Fabien Sanglard in his excellent post: Fizzlefade.

We can use LFSRs to dissolve our image of David using a mask. Our mask is another bitmap, but with only 1 bit per pixel. If a mask pixel is set to 0, then the framebuffer pixel colour is shown, otherwise we show red. By using a mask, rather than altering the original bitmap, we can fade the image back in again or apply other effects.
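
In software terms, the mask works like this. The Python sketch below composites one short line of 12-bit colours with a 1-bit mask; the colour 0xA00 matches the red used in the Xilinx listing, while the pixel and mask values are made up for the example.

```python
FIZZLE_COLOUR = 0xA00  # 12-bit red

def composite(colours, mask):
    """Show the fizzle colour wherever the mask bit is set."""
    return [FIZZLE_COLOUR if m else c for c, m in zip(colours, mask)]

line = [0xFFF, 0xCCC, 0x888, 0x000]
mask = [0, 1, 0, 1]
print([f"{c:03X}" for c in composite(line, mask)])
# ['FFF', 'A00', '888', 'A00']
```

The original line is left untouched, so clearing the mask brings the image straight back.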

As previously discussed, the iCE40 doesn’t support 1-bit wide BRAM, so we’ve had to reduce the horizontal resolution of the fizzle bitmap mask so the complete design still fits into the 30 iCE40 BRAMs.

top_david_fizzle BRAM usage: 6x36Kb on XC7 and 28x4Kb on iCE40

Xilinx version shown below:

module top_david_fizzle (
    input  wire logic clk_100m,     // 100 MHz clock
    input  wire logic btn_rst,      // reset button (active low)
    output      logic vga_hsync,    // horizontal sync
    output      logic vga_vsync,    // vertical sync
    output      logic [3:0] vga_r,  // 4-bit VGA red
    output      logic [3:0] vga_g,  // 4-bit VGA green
    output      logic [3:0] vga_b   // 4-bit VGA blue
    );

    // generate pixel clock
    logic clk_pix;
    logic clk_locked;
    clock_gen clock_640x480 (
       .clk(clk_100m),
       .rst(!btn_rst),  // reset button is active low
       .clk_pix,
       .clk_locked
    );

    // display timings
    localparam CORDW = 10;  // screen coordinate width in bits
    logic [CORDW-1:0] sx, sy;
    logic hsync, vsync, de;
    display_timings_480p timings_640x480 (
        .clk_pix,
        .rst(!clk_locked),  // wait for clock lock
        .sx,
        .sy,
        .hsync,
        .vsync,
        .de
    );

    // size of screen with and without blanking
    localparam H_RES_FULL = 800;
    localparam V_RES_FULL = 525;
    localparam H_RES = 640;
    localparam V_RES = 480;

    // framebuffer (FB)
    localparam FB_WIDTH   = 160;
    localparam FB_HEIGHT  = 120;
    localparam FB_PIXELS  = FB_WIDTH * FB_HEIGHT;
    localparam FB_ADDRW   = $clog2(FB_PIXELS);
    localparam FB_DATAW   = 4;  // colour bits per pixel
    localparam FB_IMAGE   = "david.mem";
    localparam FB_PALETTE = "david_palette.mem";

    logic fb_we;
    logic [FB_ADDRW-1:0] fb_addr_write, fb_addr_read;
    logic [FB_DATAW-1:0] fb_cidx_write, fb_cidx_read;

    bram_sdp #(
        .WIDTH(FB_DATAW),
        .DEPTH(FB_PIXELS),
        .INIT_F(FB_IMAGE)
    ) framebuffer (
        .clk_read(clk_pix),
        .clk_write(clk_pix),
        .we(fb_we),
        .addr_write(fb_addr_write),
        .addr_read(fb_addr_read),
        .data_in(fb_cidx_write),
        .data_out(fb_cidx_read)
    );

    // draw a horizontal line at the top of the framebuffer
    always_ff @(posedge clk_pix) begin
        if (sy >= V_RES) begin  // draw in blanking interval
            if (fb_we == 0 && fb_addr_write != FB_WIDTH-1) begin
                fb_cidx_write <= 4'h0;  // first palette entry (white)
                fb_we <= 1;
            end else if (fb_addr_write != FB_WIDTH-1) begin
                fb_addr_write <= fb_addr_write + 1;
            end else begin
                fb_we <= 0;
            end
        end
    end

    // fizzlebuffer (FZ)
    logic [FB_ADDRW-1:0] fz_addr_write;
    logic fz_en_in, fz_en_out;
    logic fz_we;

    bram_sdp #(
        .WIDTH(1),
        .DEPTH(FB_PIXELS),
        .INIT_F("")
    ) fz_inst (
        .clk_write(clk_pix),
        .clk_read(clk_pix),
        .we(fz_we),
        .addr_write(fz_addr_write),
        .addr_read(fb_addr_read),  // share read address with FB
        .data_in(fz_en_in),
        .data_out(fz_en_out)
    );

    // 15-bit LFSR (160x120 < 2^15)
    logic lfsr_en;
    logic [14:0] lfsr;
    lfsr #(
        .LEN(15),
        .TAPS(15'b110000000000000)
    ) lfsr_fz (
        .clk(clk_pix),
        .rst(!clk_locked),
        .en(lfsr_en),
        .sreg(lfsr)
    );

    localparam FADE_WAIT = 600;   // wait for 600 frames before fading
    localparam FADE_RATE = 3200;  // every 3200 pixel clocks update LFSR
    logic [$clog2(FADE_WAIT)-1:0] cnt_fade_wait;
    logic [$clog2(FADE_RATE)-1:0] cnt_fade_rate;
    always_ff @(posedge clk_pix) begin
        if (sy == V_RES && sx == H_RES) begin  // start of blanking
            cnt_fade_wait <= (cnt_fade_wait != FADE_WAIT-1) ?
                cnt_fade_wait + 1 : cnt_fade_wait;
        end
        if (cnt_fade_wait == FADE_WAIT-1) begin
            cnt_fade_rate <= (cnt_fade_rate == FADE_RATE) ?
                0 : cnt_fade_rate + 1;
        end
    end

    always_comb begin
        fz_addr_write = lfsr;
        if (cnt_fade_rate == FADE_RATE) begin
            lfsr_en = 1;
            fz_we = 1;
            fz_en_in = 1;
        end else begin
            lfsr_en = 0;
            fz_we = 0;
            fz_en_in = 0;
        end
    end

    // linebuffer (LB)
    localparam LB_SCALE_V = 4;               // scale vertical drawing
    localparam LB_SCALE_H = 4;               // scale horizontal drawing
    localparam LB_LEN = H_RES / LB_SCALE_H;  // line length
    localparam LB_WIDTH = 4;                 // bits per colour channel

    // LB data in from FB
    logic lb_en_in, lb_en_in_1;  // allow for BRAM latency correction
    logic [LB_WIDTH-1:0] lb_in_0, lb_in_1, lb_in_2;

    // correct vertical scale: if scale is 0, set to 1
    logic [$clog2(LB_SCALE_V+1):0] scale_v_cor;
    always_comb scale_v_cor = (LB_SCALE_V == 0) ? 1 : LB_SCALE_V;

    // count screen lines for vertical scaling - read when cnt_scale_v==0
    logic [$clog2(LB_SCALE_V):0] cnt_scale_v;
    always_ff @(posedge clk_pix) begin
        if (sx == 0)
            cnt_scale_v <= (cnt_scale_v == scale_v_cor-1) ? 0 : cnt_scale_v + 1;
        if (sy == V_RES_FULL-1) cnt_scale_v <= 0;
    end

    logic [$clog2(FB_WIDTH)-1:0] fb_h_cnt;  // counter for FB pixels on line
    always_ff @(posedge clk_pix) begin
        if (sy == V_RES_FULL-1 && sx == H_RES-1) fb_addr_read <= 0;

        // reset horizontal counter at the start of blanking on reading lines
        if (cnt_scale_v == 0 && sx == H_RES) begin
            if (fb_addr_read < FB_PIXELS-1) fb_h_cnt <= 0;  // read all pixels?
        end

        // read each pixel on FB line and write to LB
        if (fb_h_cnt < FB_WIDTH) begin
            lb_en_in <= 1;
            fb_h_cnt <= fb_h_cnt + 1;
            fb_addr_read <= fb_addr_read + 1;
        end else begin
            lb_en_in <= 0;
        end

        // enable LB data in with latency correction
        lb_en_in_1 <= lb_en_in;
    end

    // LB data out to display
    logic [LB_WIDTH-1:0] lb_out_0, lb_out_1, lb_out_2;

    linebuffer #(
        .WIDTH(LB_WIDTH),
        .LEN(LB_LEN)
        ) lb_inst (
        .clk_in(clk_pix),
        .clk_out(clk_pix),
        .en_in(lb_en_in_1),  // correct for BRAM latency
        .en_out(sy < V_RES && sx < H_RES),
        .rst_in(sx == H_RES),  // reset at start of horizontal blanking
        .rst_out(sx == H_RES),
        .scale(LB_SCALE_H),
        .data_in_0(lb_in_0),
        .data_in_1(lb_in_1),
        .data_in_2(lb_in_2),
        .data_out_0(lb_out_0),
        .data_out_1(lb_out_1),
        .data_out_2(lb_out_2)
    );

    // colour lookup table (ROM) 16x12-bit entries
    logic [11:0] clut_colr;
    rom_async #(
        .WIDTH(12),
        .DEPTH(16),
        .INIT_F(FB_PALETTE)
    ) clut (
        .addr(fb_cidx_read),
        .data(clut_colr)
    );

    // map colour index to palette using CLUT and read into LB
    always_ff @(posedge clk_pix) begin
        {lb_in_2, lb_in_1, lb_in_0} <= fz_en_out ? 12'hA00 : clut_colr;
    end

    // VGA output
    always_ff @(posedge clk_pix) begin
        vga_hsync <= hsync;
        vga_vsync <= vsync;
        vga_r <= de ? lb_out_2 : 4'h0;
        vga_g <= de ? lb_out_1 : 4'h0;
        vga_b <= de ? lb_out_0 : 4'h0;
    end
endmodule

You may have noticed that the top-left pixel doesn’t change; this is because the LFSR produces every value except zero. To fix this, add some logic to specifically update the pixel at address zero when fading.
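
You can check this behaviour with a short Python model. The lfsr module’s source isn’t shown in this post, so the sketch assumes a Galois LFSR that shifts right and XORs in the taps when the low bit is set; the seed of 1 is also an assumption, though any non-zero seed visits the same set of values.

```python
TAPS = 0b110000000000000  # 15-bit taps from the listing
FB_PIXELS = 160 * 120

def lfsr_step(sreg, taps=TAPS):
    """One Galois LFSR step: shift right, XOR taps if the low bit was set."""
    return (sreg >> 1) ^ (taps if sreg & 1 else 0)

def visited_addresses(seed=1):
    """Collect every value the LFSR produces before it repeats."""
    seen = set()
    sreg = seed
    while sreg not in seen:
        seen.add(sreg)
        sreg = lfsr_step(sreg)
    return seen

seen = visited_addresses()
print(len(seen))                              # 32767
print(0 in seen)                              # False
print(sum(1 for a in seen if a < FB_PIXELS))  # 19199
```

The LFSR reaches all 19,199 non-zero framebuffer addresses but never address zero, which is the top-left pixel.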

Explore

I hope you enjoyed this instalment of Exploring FPGA Graphics, but nothing beats creating your own designs. Here are a few suggestions to get you started:

  • Load your own picture into the framebuffer using img2fmem
  • Cross-fade between two images
  • Fade an image by adjusting the intensity of the palette entries
  • Try moving the image up and down the screen by changing the initial value of fb_addr_read

Next Time

In the next part, we’ll take our framebuffer and implement Conway’s Game of Life in Life on Screen. We’ll then move on to drawing lines and shapes in early 2021: stay tuned!

Constructive feedback is always welcome. Get in touch with @WillFlux or open an issue on GitHub.

©2021 Will Green, Project F