30 October 2020

Framebuffers

Welcome back to Exploring FPGA Graphics. In the previous two parts, we worked with sprites, but another approach is needed as graphics become more complex. Instead of drawing directly to the screen, we draw to a framebuffer, which is read out to the screen. This post provides an introduction to framebuffers and how to scale them up. We’ll also learn how to fizzlefade graphics Wolfenstein 3D style. In the next part, we’ll use a framebuffer to visualize a simulation of life.

In this series, we explore graphics at the hardware level and get a feel for the power of FPGAs. We’ll learn how displays work, race the beam with Pong, animate starfields and sprites, paint Michelangelo’s David, simulate life with bitmaps, draw lines and shapes, and finally render simple 3D models. New to the series? Start with Exploring FPGA Graphics.

Updated 2021-04-17. Get in touch with @WillFlux or open an issue on GitHub.

April 2021: Includes new framebuffer module design; further explanation will be added soon.

Series Outline

  • Exploring FPGA Graphics - learn how displays work and animate simple shapes
  • FPGA Pong - race the beam to create the arcade classic
  • Hardware Sprites - fast, colourful graphics with minimal resources
  • FPGA Ad Astra - demo with hardware sprites and animated starfields
  • Framebuffers (this post) - driving the display from a bitmap in memory
  • Life on Screen - the screen comes alive with Conway’s Game of Life
  • Lines and Triangles - drawing lines and triangles with a framebuffer
  • FPGA Shapes - filling and animating shapes
  • Simple 3D - models and wireframe rendering (draft coming soon)

Requirements

For this series, you need an FPGA board with video output. We’ll be working at 640x480, so pretty much any video output will do. It helps to be comfortable with programming your FPGA board and reasonably familiar with Verilog.

We’ll be demoing with these boards:

  • iCEBreaker (Lattice iCE40UP5K) with DVI output
  • Arty (Xilinx 7 series) with VGA output

Source

The SystemVerilog designs featured in this series are available from the projf-explore git repo under the open-source MIT licence: build on them to your heart’s content. The rest of the blog content is subject to standard copyright restrictions: don’t republish it without permission.

Framebuffer

A framebuffer is an in-memory bitmap that drives pixels on the screen. When you write to a memory location within the framebuffer, the corresponding pixel will change on the screen. Using a framebuffer provides two big benefits: we’re free to create sophisticated graphics using whatever technique we like, and the setting of pixel colour is separated from the process of driving the screen. The flexibility of a framebuffer comes at the cost of increased memory storage and latency.

A Small Buffer

A framebuffer requires enough memory to hold the complete frame. To keep things simple, we’ll store our framebuffer in internal FPGA block memory (BRAM). The iCEBreaker’s iCE40 FPGA has thirty 4Kb BRAMs, for 120 Kb in total, so that’s what we’ll target. You can learn more about block RAM in FPGA Memory Types.

If we divide our 640x480 screen by four in both dimensions, we get 160x120. That is 19,200 pixels, so a monochrome framebuffer at one bit per pixel requires 19,200 bits, or 18.75 kilobits of memory (19,200 = 18.75 * 1024).
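If you want to check this arithmetic in simulation, the following is a minimal sketch. It is not part of the series designs: the module name fb_size_sketch and the SCREEN_* and DIVISOR parameters are made up for illustration.

```systemverilog
// Hypothetical helper (not from the projf-explore repo): prints framebuffer
// storage requirements for a given size and colour depth.
module fb_size_sketch;
    localparam SCREEN_WIDTH  = 640;
    localparam SCREEN_HEIGHT = 480;
    localparam DIVISOR       = 4;  // divide screen by four in both dimensions
    localparam FB_WIDTH  = SCREEN_WIDTH  / DIVISOR;  // 160
    localparam FB_HEIGHT = SCREEN_HEIGHT / DIVISOR;  // 120
    localparam FB_PIXELS = FB_WIDTH * FB_HEIGHT;     // 19,200
    localparam FB_ADDRW  = $clog2(FB_PIXELS);        // address width in bits

    initial begin
        $display("framebuffer: %0dx%0d = %0d pixels",
            FB_WIDTH, FB_HEIGHT, FB_PIXELS);
        $display("address width: %0d bits", FB_ADDRW);
        // storage at 1 bit per pixel (monochrome) and 4 bits per pixel
        for (int bpp = 1; bpp <= 4; bpp = bpp * 4) begin
            $display("%0d bpp: %0d bits (%.2f kilobits)",
                bpp, FB_PIXELS * bpp, (FB_PIXELS * bpp) / 1024.0);
        end
        $finish;
    end
endmodule
```

For the monochrome case, this reports 19,200 bits (18.75 kilobits) and a 15-bit address, since 19,200 lies between 2^14 and 2^15.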

We’ll start by wiring up our usual display timings with a framebuffer based on BRAM, then draw a simple horizontal line in it to confirm everything is working.

The Arty version is shown below:

module top_line (
    input  wire logic clk_100m,     // 100 MHz clock
    input  wire logic btn_rst,      // reset button (active low)
    output      logic vga_hsync,    // horizontal sync
    output      logic vga_vsync,    // vertical sync
    output      logic [3:0] vga_r,  // 4-bit VGA red
    output      logic [3:0] vga_g,  // 4-bit VGA green
    output      logic [3:0] vga_b   // 4-bit VGA blue
    );

    // generate pixel clock
    logic clk_pix;
    logic clk_locked;
    clock_gen_480p clock_pix_inst (
       .clk(clk_100m),
       .rst(!btn_rst),  // reset button is active low
       .clk_pix,
       .clk_locked
    );

    // display timings
    localparam CORDW = 16;
    logic hsync, vsync;
    logic frame;
    logic signed [CORDW-1:0] sx, sy;
    display_timings_480p #(.CORDW(CORDW)) display_timings_inst (
        .clk_pix,
        .rst(!clk_locked),  // wait for pixel clock lock
        .sx,
        .sy,
        .hsync,
        .vsync,
        .de(),
        .frame,
        .line()
    );

    // framebuffer (FB)
    localparam FB_WIDTH  = 160;
    localparam FB_HEIGHT = 120;
    localparam FB_PIXELS = FB_WIDTH * FB_HEIGHT;
    localparam FB_ADDRW  = $clog2(FB_PIXELS);
    localparam FB_DATAW  = 1;  // colour bits per pixel

    logic fb_we;
    logic [FB_ADDRW-1:0] fb_addr_write, fb_addr_read;
    logic [FB_DATAW-1:0] fb_colr_write, fb_colr_read;

    bram_sdp #(
        .WIDTH(FB_DATAW),
        .DEPTH(FB_PIXELS)
    ) bram_inst (
        .clk_write(clk_pix),
        .clk_read(clk_pix),
        .we(fb_we),
        .addr_write(fb_addr_write),
        .addr_read(fb_addr_read),
        .data_in(fb_colr_write),
        .data_out(fb_colr_read)
    );

    // draw line across middle of framebuffer
    logic [$clog2(FB_WIDTH)-1:0] cnt_draw;
    enum {IDLE, DRAW, DONE} state;
    initial state = IDLE;  // needed for Yosys
    always @(posedge clk_pix) begin
        case (state)
            DRAW:
                if (cnt_draw < FB_WIDTH-1) begin
                    fb_addr_write <= fb_addr_write + 1;
                    cnt_draw <= cnt_draw + 1;
                end else begin
                    fb_we <= 0;
                    state <= DONE;
                end
            IDLE:
                if (frame) begin
                    fb_colr_write <= 1;
                    fb_we <= 1;
                    fb_addr_write <= (FB_HEIGHT>>1) * FB_WIDTH;
                    cnt_draw <= 0;
                    state <= DRAW;
                end
            default: state <= DONE;  // done forever!
        endcase

        if (!clk_locked) state <= IDLE;
    end

    logic paint;  // which area of the framebuffer should we paint?
    always_comb paint = (sy >= 0 && sy < FB_HEIGHT && sx >= 0 && sx < FB_WIDTH);

    // calculate framebuffer read address for display output
    // we start at address zero, so calculation doesn't add latency
    always_ff @(posedge clk_pix) begin
        if (frame) begin  // reset address at start of frame
            fb_addr_read <= 0;
        end else if (paint) begin  // increment address in painting area
            fb_addr_read <= fb_addr_read + 1;
        end
    end

    // reading from BRAM takes one cycle: delay display signals to match
    logic paint_p1, hsync_p1, vsync_p1;
    always @(posedge clk_pix) begin
        paint_p1 <= paint;
        hsync_p1 <= hsync;
        vsync_p1 <= vsync;
    end

    // VGA output
    always_ff @(posedge clk_pix) begin
        vga_hsync <= hsync_p1;
        vga_vsync <= vsync_p1;
        vga_r <= (paint_p1 && fb_colr_read) ? 4'hF : 4'h0;
        vga_g <= (paint_p1 && fb_colr_read) ? 4'hF : 4'h0;
        vga_b <= (paint_p1 && fb_colr_read) ? 4'hF : 4'h0;
    end
endmodule

Build this design, and you should see a white horizontal line near the top left of your screen. It’s only 160 pixels long, so it won’t go right across the screen.

Building the Designs
In the Framebuffers section of the git repo, you’ll find the design files, a makefile for iCEBreaker, a Vivado project for Arty, and instructions for building the designs for both boards.

A Small Bitmap

Michelangelo’s David

Now our framebuffer is wired up correctly, let’s load something into it. A small monochrome framebuffer calls for a striking image: I’ve chosen David by Michelangelo.

The version on the right is the original from Wikipedia; it has 64 shades of grey. I created the middle image by reducing the original to 16 colours using img2fmem with no dithering (we will discuss this tool later in the post). The monochrome image on the left was created by Gerbrant using Floyd-Steinberg dithering.

Create a new top module to load the monochrome image of David. We’ll also draw a white border around the edge of the image using a simple finite state machine.

top_david_v1 BRAM usage: 1x36Kb on Arty and 10x4Kb on iCEBreaker

The iCEBreaker version is shown below:

module top_david_v1 (
    input  wire logic clk_12m,      // 12 MHz clock
    input  wire logic btn_rst,      // reset button (active high)
    output      logic dvi_clk,      // DVI pixel clock
    output      logic dvi_hsync,    // DVI horizontal sync
    output      logic dvi_vsync,    // DVI vertical sync
    output      logic dvi_de,       // DVI data enable
    output      logic [3:0] dvi_r,  // 4-bit DVI red
    output      logic [3:0] dvi_g,  // 4-bit DVI green
    output      logic [3:0] dvi_b   // 4-bit DVI blue
    );

    // generate pixel clock
    logic clk_pix;
    logic clk_locked;
    clock_gen_480p clock_pix_inst (
       .clk(clk_12m),
       .rst(btn_rst),
       .clk_pix,
       .clk_locked
    );

    // display timings
    localparam CORDW = 16;
    logic hsync, vsync;
    logic de, frame;
    logic signed [CORDW-1:0] sx, sy;
    display_timings_480p #(.CORDW(CORDW)) display_timings_inst (
        .clk_pix,
        .rst(!clk_locked),  // wait for pixel clock lock
        .sx,
        .sy,
        .hsync,
        .vsync,
        .de,
        .frame,
        .line()
    );

    // framebuffer (FB)
    localparam FB_WIDTH  = 160;
    localparam FB_HEIGHT = 120;
    localparam FB_PIXELS = FB_WIDTH * FB_HEIGHT;
    localparam FB_ADDRW  = $clog2(FB_PIXELS);
    localparam FB_DATAW  = 1;  // colour bits per pixel
    localparam FB_IMAGE  = "../res/david/david_1bit.mem";

    logic fb_we;
    logic [FB_ADDRW-1:0] fb_addr_write, fb_addr_read;
    logic [FB_DATAW-1:0] fb_colr_write, fb_colr_read;

    bram_sdp #(
        .WIDTH(FB_DATAW),
        .DEPTH(FB_PIXELS),
        .INIT_F(FB_IMAGE)
    ) bram_inst (
        .clk_write(clk_pix),
        .clk_read(clk_pix),
        .we(fb_we),
        .addr_write(fb_addr_write),
        .addr_read(fb_addr_read),
        .data_in(fb_colr_write),
        .data_out(fb_colr_read)
    );

    // draw box around framebuffer
    logic [$clog2(FB_WIDTH)-1:0] cnt_draw;
    enum {IDLE, TOP, RIGHT, BOTTOM, LEFT, DONE} state;
    initial state = IDLE;  // needed for Yosys
    always @(posedge clk_pix) begin
        case (state)
            TOP:
                if (cnt_draw < FB_WIDTH-1) begin
                    fb_addr_write <= fb_addr_write + 1;
                    cnt_draw <= cnt_draw + 1;
                end else begin
                    cnt_draw <= 0;
                    state <= RIGHT;
                end
            RIGHT:
                if (cnt_draw < FB_HEIGHT-1) begin
                    fb_addr_write <= fb_addr_write + FB_WIDTH;
                    cnt_draw <= cnt_draw + 1;
                end else begin
                    fb_addr_write <= 0;
                    cnt_draw <= 0;
                    state <= LEFT;
                end
            LEFT:
                if (cnt_draw < FB_HEIGHT-1) begin
                    fb_addr_write <= fb_addr_write + FB_WIDTH;
                    cnt_draw <= cnt_draw + 1;
                end else begin
                    cnt_draw <= 0;
                    state <= BOTTOM;
                end
            BOTTOM:
                if (cnt_draw < FB_WIDTH-1) begin
                    fb_addr_write <= fb_addr_write + 1;
                    cnt_draw <= cnt_draw + 1;
                end else begin
                    fb_we <= 0;
                    state <= DONE;
                end
            IDLE:
                if (frame) begin
                    fb_colr_write <= 1;
                    fb_we <= 1;
                    fb_addr_write <= 0;  // start at top-left (also after reset)
                    cnt_draw <= 0;
                    state <= TOP;
                end
            default: state <= DONE;  // done forever!
        endcase

        if (!clk_locked) state <= IDLE;
    end

    logic paint;  // which area of the framebuffer should we paint?
    always_comb paint = (sy >= 0 && sy < FB_HEIGHT && sx >= 0 && sx < FB_WIDTH);

    // calculate framebuffer read address for display output
    always_ff @(posedge clk_pix) begin
        if (frame) begin  // reset address at start of frame
            fb_addr_read <= 0;
        end else if (paint) begin  // increment address in painting area
            fb_addr_read <= fb_addr_read + 1;
        end
    end

    // reading from BRAM takes one cycle: delay display signals to match
    logic paint_p1, hsync_p1, vsync_p1, de_p1;
    always @(posedge clk_pix) begin
        paint_p1 <= paint;
        hsync_p1 <= hsync;
        vsync_p1 <= vsync;
        de_p1 <= de;
    end

    logic [3:0] red, green, blue;  // output colour
    always_comb begin
        red   = (paint_p1 && fb_colr_read) ? 4'hF : 4'h0;
        green = (paint_p1 && fb_colr_read) ? 4'hF : 4'h0;
        blue  = (paint_p1 && fb_colr_read) ? 4'hF : 4'h0;
    end

    // Output DVI clock: 180° out of phase with other DVI signals
    SB_IO #(
        .PIN_TYPE(6'b010000)  // PIN_OUTPUT_DDR
    ) dvi_clk_io (
        .PACKAGE_PIN(dvi_clk),
        .OUTPUT_CLK(clk_pix),
        .D_OUT_0(1'b0),
        .D_OUT_1(1'b1)
    );

    // Output DVI signals
    SB_IO #(
        .PIN_TYPE(6'b010100)  // PIN_OUTPUT_REGISTERED
    ) dvi_signal_io [14:0] (
        .PACKAGE_PIN({dvi_hsync, dvi_vsync, dvi_de, dvi_r, dvi_g, dvi_b}),
        .OUTPUT_CLK(clk_pix),
        .D_OUT_0({hsync_p1, vsync_p1, de_p1, red, green, blue}),
        .D_OUT_1()
    );
endmodule

Build this design and program your board. You should see the dithered image of David looking at you from the top left of your screen.

There is one cycle between supplying an address to the BRAM and receiving the data. We delay the display signals one cycle, so they remain in sync with the pixels we’re reading from BRAM.
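To see where that cycle comes from, here is a minimal sketch of a simple dual-port BRAM with the same port names used in the designs above. It is an assumption-laden illustration, not the bram_sdp module from the repo, which also supports an INIT_F parameter and may differ in other details.

```systemverilog
// Hypothetical sketch of a simple dual-port block RAM.
// The read is registered: data_out reflects addr_read one clock later.
module bram_sdp_sketch #(
    parameter WIDTH=1,                // data width in bits
    parameter DEPTH=19200,            // number of memory locations
    localparam ADDRW=$clog2(DEPTH)    // address width in bits
    ) (
    input  wire logic clk_write,
    input  wire logic clk_read,
    input  wire logic we,
    input  wire logic [ADDRW-1:0] addr_write,
    input  wire logic [ADDRW-1:0] addr_read,
    input  wire logic [WIDTH-1:0] data_in,
    output      logic [WIDTH-1:0] data_out
    );

    logic [WIDTH-1:0] memory [DEPTH];

    // write port
    always_ff @(posedge clk_write) begin
        if (we) memory[addr_write] <= data_in;
    end

    // read port: this register is the one cycle of latency
    always_ff @(posedge clk_read) begin
        data_out <= memory[addr_read];
    end
endmodule
```

Because data_out is registered on clk_read, the pixel for a given address arrives one clock after the address is presented, which is why paint, hsync, vsync, and de each get a matching _p1 register.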

iCE40: One Bit Too Many
The iCE40 version uses more block RAM than you might expect: 160x120 should fit into five 4Kb BRAMs, but the david_v1 design requires ten. This is because the iCE40 BRAMs are a minimum of two bits wide, so each one-bit pixel occupies a two-bit memory location. When we expand the colour depth to 4-bit, the BRAM usage will be the expected twenty. Learn more in the Lattice ICE Technology Library.

Casting Shade

We can increase the number of bits assigned to each pixel to support more colours or, in the case of David, more shades. Our video output is 12-bit, with four bits per colour channel, which means there are 16 possible shades of grey.

As with the hedgehog graphic in Hardware Sprites, we store the palette in a file: david_palette.mem.

FFF EEE DDD CCC BBB AAA 999 888 777 666 555 444 333 222 111 000

Greyscale

We load the palette file into a colour lookup table (CLUT) ROM, which is used to find the colour of each pixel for display. If you need a refresher on CLUTs, see A Refined Palette.
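As a rough illustration of the lookup, the sketch below reads the palette file into a 16-entry table and splits each 12-bit entry into three channels. The module name clut_sketch and its ports are hypothetical, and it assumes red is the most significant hex digit of each palette entry.

```systemverilog
// Hypothetical sketch of a colour lookup table (CLUT).
// Assumes 16 entries of 12-bit colour, with red as the most significant digit.
module clut_sketch #(
    parameter F_PALETTE="david_palette.mem"  // hex palette, one entry per colour
    ) (
    input  wire logic [3:0] cidx,   // colour index read from the framebuffer
    output      logic [3:0] red,    // 4-bit colour channels for display
    output      logic [3:0] green,
    output      logic [3:0] blue
    );

    logic [11:0] clut_mem [16];  // 16 colours x 12 bits

    initial $readmemh(F_PALETTE, clut_mem);  // load palette at start-up

    // combinational lookup: index in, colour out
    always_comb {red, green, blue} = clut_mem[cidx];
endmodule
```

With the greyscale palette, index 0 maps to FFF (white) and index F maps to 000 (black), so the framebuffer only needs four bits per pixel regardless of how wide the output colour is.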

Fizzle Out

In FPGA Ad Astra, we used linear feedback shift registers (LFSRs) to create animated starfields. LFSRs have another graphical use: creating random dissolve effects, as used in Wolfenstein 3D and explained by Fabien Sanglard in his excellent post: Fizzlefade.

We’ll apply the fizzlefade to David to dissolve our image to black after a delay.

The fizzle logic is shown below. Most of the logic is for delaying and controlling the fade rate.

    // fizzlefade!
    logic lfsr_en;
    logic [14:0] lfsr;
    lfsr #(  // 15-bit LFSR (160x120 < 2^15)
        .LEN(15),
        .TAPS(15'b110000000000000)
    ) lsfr_fz (
        .clk(clk_pix),
        .rst(!clk_locked),
        .en(lfsr_en),
        .sreg(lfsr)
    );

    localparam FADE_WAIT = 600;   // wait for 600 frames before fading
    localparam FADE_RATE = 3200;  // every 3200 pixel clocks update LFSR
    logic [$clog2(FADE_WAIT)-1:0] cnt_wait;
    logic [$clog2(FADE_RATE)-1:0] cnt_rate;
    always_ff @(posedge clk_pix) begin
        if (frame) begin
            cnt_wait <= (cnt_wait != FADE_WAIT-1) ? cnt_wait + 1 : cnt_wait;
        end
        if (cnt_wait == FADE_WAIT-1) begin
            if (cnt_rate == FADE_RATE-1) begin
                lfsr_en <= 1;
                fb_we <= 1;
                fb_addr_write <= lfsr;
                cnt_rate <= 0;
            end else begin
                cnt_rate <= cnt_rate + 1;
                lfsr_en <= 0;
                fb_we <= 0;
            end
        end
        fb_cidx_write <= 4'hF;  // palette index
    end

Build the updated top module with 4-bit greyscale David and fizzlefade:

top_david_v2 BRAM usage: 4x36Kb on Arty and 20x4Kb on iCEBreaker

You may have noticed that the top-left pixel doesn’t fade out; this is because the LFSR produces every value except zero. To fix this, you need to add some logic to specifically update the pixel at address zero when fading.
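One possible fix, offered as a sketch rather than the repo’s solution, is to emit address zero once before handing over to the LFSR. The module fizzle_addr_sketch, its step and we ports, and the inlined Galois LFSR are all assumptions for illustration; the lfsr module used in the series may differ in detail.

```systemverilog
// Hypothetical sketch: fizzlefade address generator that also covers address 0.
// A maximal-length LFSR never produces zero, so we emit zero once, up front.
module fizzle_addr_sketch #(
    parameter LEN=15,                     // LFSR length in bits
    parameter TAPS=15'b110000000000000    // same taps as the fizzle logic above
    ) (
    input  wire logic clk,
    input  wire logic rst,
    input  wire logic step,               // high for one cycle per pixel to fade
    output      logic [LEN-1:0] addr,     // framebuffer write address
    output      logic we                  // write enable: follows step by a cycle
    );

    logic [LEN-1:0] sreg;   // LFSR state (never zero)
    logic zero_done;        // have we emitted address zero yet?

    always_ff @(posedge clk) begin
        we <= step;
        if (step) begin
            if (!zero_done) begin
                addr <= 0;  // the one address the LFSR can't reach
                zero_done <= 1;
            end else begin
                addr <= sreg;
                // Galois LFSR: shift right, apply taps when a one falls out
                sreg <= {1'b0, sreg[LEN-1:1]} ^ (sreg[0] ? TAPS : {LEN{1'b0}});
            end
        end
        if (rst) begin
            sreg <= 1;  // any non-zero seed works
            zero_done <= 0;
            we <= 0;
        end
    end
endmodule
```

In the fizzle logic, step would correspond to the cnt_rate == FADE_RATE-1 condition, with addr and we driving fb_addr_write and fb_we.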

Warm Tones

Because the palette is separate from the image, we can quickly change it. In top_david_v2.sv, update localparam FB_PALETTE to reference david_palette_warm.mem and rebuild.

The warm palette looks like this:

FED EDC DCB CBA BA9 A98 987 876 765 654 543 432 321 210 100 000

Warm shades

There is also an inverted palette, david_palette_invert.mem, or you can create your own.

Quick Aside: Palettes
Wikipedia has a lovely list of color palettes for different hardware systems.

Crude Scaling

Our framebuffer is too small to fill a 640x480 screen, and modern monitors don’t support lower resolutions. To make our framebuffer fill the screen, we need to scale it up.

We’ve made our framebuffer an integer fraction of our display resolution, so scaling ought to be simple: just divide the coordinates by four!

    fb_addr_read <= FB_WIDTH * (sy/4) + (sx/4);

This does work, as the third version of David demonstrates.

However, I don’t recommend using this approach in any but the most basic designs.

With crude scaling, the memory access pattern is dictated by the pixel clock and beam position (sx,sy):

  • Every framebuffer memory location is read multiple times (16x for four-fold scaling)
  • It uses multiplication, even though display output is sequential
  • Memory access is driven by the pixel clock

The fundamental problem is that our framebuffer is tied to the display logic.
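The multiplication, at least, is avoidable. Because the display is scanned in order, the read address can be stepped with counters instead of being recalculated from (sx,sy). The sketch below shows the idea; the module name and ports are hypothetical, and it assumes that frame pulses during blanking and that de is high for exactly FB_WIDTH*SCALE pixels on each visible line.

```systemverilog
// Hypothetical sketch: multiplication-free read address for integer scaling.
// Assumes `frame` pulses during blanking and `de` is high for exactly
// FB_WIDTH*SCALE pixels on each visible line.
module scale_addr_sketch #(
    parameter FB_WIDTH=160,   // framebuffer width in pixels
    parameter FB_HEIGHT=120,  // framebuffer height in pixels
    parameter SCALE=4         // integer scaling factor (>=1)
    ) (
    input  wire logic clk_pix,
    input  wire logic frame,  // start of frame
    input  wire logic de,     // data enable: high for visible pixels
    output      logic [$clog2(FB_WIDTH*FB_HEIGHT)-1:0] addr  // read address
    );

    logic [$clog2(FB_WIDTH*FB_HEIGHT)-1:0] addr_line;  // address of line start
    logic [$clog2(FB_WIDTH)-1:0] x;                    // position in FB line
    logic [$clog2(SCALE)-1:0] cnt_h, cnt_v;            // repeat counters

    always_ff @(posedge clk_pix) begin
        if (frame) begin  // back to the first pixel of the framebuffer
            addr <= 0;
            addr_line <= 0;
            x <= 0;
            cnt_h <= 0;
            cnt_v <= 0;
        end else if (de) begin
            if (cnt_h == SCALE-1) begin  // pixel repeated SCALE times
                cnt_h <= 0;
                if (x == FB_WIDTH-1) begin  // end of screen line
                    x <= 0;
                    if (cnt_v == SCALE-1) begin  // line repeated SCALE times
                        cnt_v <= 0;
                        addr_line <= addr_line + FB_WIDTH;
                        addr <= addr_line + FB_WIDTH;  // next FB line
                    end else begin
                        cnt_v <= cnt_v + 1;
                        addr <= addr_line;  // replay the same FB line
                    end
                end else begin
                    x <= x + 1;
                    addr <= addr + 1;
                end
            end else cnt_h <= cnt_h + 1;
        end
    end
endmodule
```

This removes the multiplier, but every framebuffer location is still read 16 times and the reads still happen in the pixel clock domain, so the other problems remain.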

Even without scaling, we soon run into severe limitations:

  • The whole design is bound to the pixel clock
  • External memory, such as DRAM, has its own access and latency patterns
  • Sharing memory between the framebuffer and a CPU or other logic is difficult
  • Supporting screen resolutions with different pixel clocks means reworking the design

FPGAs are adroit at handling such situations: let’s build a better framebuffer.

A Better Buffer

We want to isolate our display logic, and its pixel clock, from the rest of our design.

We need two clock domains! Has the fear taken you yet?

Never fear, dual-port BRAM comes to the rescue. Rather than directly reading the framebuffer for display, we can use a linebuffer. The linebuffer has three BRAMs, one for each colour channel: red, green, blue.

diagram to follow

For our 160x120 framebuffer, we copy a single 160-pixel line into the linebuffer; then we read this out for four display lines before going back to the framebuffer for the following line of data. We write data into the linebuffer using our chosen design clock and read it out in the pixel clock domain.

We’ve freed our design from the tyranny of the pixel clock.

iCEBreaker BRAM & PLL
While iCE40UP5K BRAM is not true dual-port, it does support reads and writes on independent clocks.
However, iCE40UP5K has only one PLL, so you can only generate one clock.

Linebuffer Module

Add the linebuffer module - [linebuffer.sv]:

module linebuffer #(
    parameter WIDTH=8,    // data width of each channel
    parameter LEN=640,    // length of line
    parameter SCALE=1     // scaling factor (>=1)
    ) (
    input  wire logic clk_in,    // input clock
    input  wire logic clk_out,   // output clock
    input  wire logic rst_in,    // reset (clk_in)
    input  wire logic rst_out,   // reset (clk_out)
    output      logic data_req,  // request input data (clk_in)
    input  wire logic en_in,     // enable input (clk_in)
    input  wire logic en_out,    // enable output (clk_out)
    input  wire logic frame,     // start a new frame (clk_out)
    input  wire logic line,      // start a new line (clk_out)
    input  wire logic [WIDTH-1:0] din_0,  din_1,  din_2,  // data in (clk_in)
    output      logic [WIDTH-1:0] dout_0, dout_1, dout_2  // data out (clk_out)
    );

    // output data to display
    logic [$clog2(LEN)-1:0] addr_out;        // output address (pixel counter)
    logic [$clog2(SCALE)-1:0] cnt_v, cnt_h;  // scale counters
    logic set_end;   // line set end
    logic get_data;  // fresh data needed
    always_ff @(posedge clk_out) begin
        if (frame) begin  // reset addr and counters at frame start
            addr_out <= 0;
            cnt_h <= 0;
            cnt_v <= 0;
            set_end <= 1;  // ensure first line of frame triggers data_req
        end else if (en_out && !set_end) begin
            if (cnt_h == SCALE-1) begin
                cnt_h <= 0;
                if (addr_out == LEN-1) begin  // end of line
                    addr_out <= 0;
                    if (cnt_v == SCALE-1) begin  // end of line set
                        cnt_v <= 0;
                        set_end <= 1;
                    end else cnt_v <= cnt_v + 1;
                end else addr_out <= addr_out + 1;
            end else cnt_h <= cnt_h + 1;
        end else if (get_data) set_end <= 0;
        if (rst_out) begin
            addr_out <= 0;
            cnt_h <= 0;
            cnt_v <= 0;
            set_end <= 0;
        end
    end

    // request new data at end of line set (needs to be in clk_in domain)
    always_comb get_data = (line && set_end);
    xd xd_req (.clk_i(clk_out), .clk_o(clk_in),
               .rst_i(rst_out), .rst_o(rst_in), .i(get_data), .o(data_req));

    // read data in
    logic [$clog2(LEN)-1:0] addr_in;
    always_ff @(posedge clk_in) begin
        if (en_in) addr_in <= (addr_in == LEN-1) ? 0 : addr_in + 1;
        if (data_req) addr_in <= 0;  // reset addr_in when we request new data
        if (rst_in) addr_in <= 0;
    end

    // channel 0
    bram_sdp #(.WIDTH(WIDTH), .DEPTH(LEN)) ch0 (
        .clk_write(clk_in),
        .clk_read(clk_out),
        .we(en_in),
        .addr_write(addr_in),
        .addr_read(addr_out),
        .data_in(din_0),
        .data_out(dout_0)
    );

    // channel 1
    bram_sdp #(.WIDTH(WIDTH), .DEPTH(LEN)) ch1 (
        .clk_write(clk_in),
        .clk_read(clk_out),
        .we(en_in),
        .addr_write(addr_in),
        .addr_read(addr_out),
        .data_in(din_1),
        .data_out(dout_1)
    );

    // channel 2
    bram_sdp #(.WIDTH(WIDTH), .DEPTH(LEN)) ch2 (
        .clk_write(clk_in),
        .clk_read(clk_out),
        .we(en_in),
        .addr_write(addr_in),
        .addr_read(addr_out),
        .data_in(din_2),
        .data_out(dout_2)
    );
endmodule

There’s a test bench you can use to exercise the module with Vivado: [xc7/linebuffer_tb.sv].

Crossing the Streams

When the output (clk_out domain) finishes with a line of data, it needs to signal the input (clk_in domain) to provide another line. We could use a pair of flip-flops to cross domains, but that could fail if the destination domain is of lower frequency. We could use a BRAM, but using a whole BRAM to move one bit of data is wasteful. Instead, we use a simple trick to safely send a pulse of data between two domains, irrespective of their frequencies.

Add the [xd.sv] module:

module xd (
    input  wire logic clk_i,  //  input clock: source domain
    input  wire logic clk_o,  // output clock: destination domain
    input  wire logic rst_i,  // reset (source domain)
    input  wire logic rst_o,  // reset (destination domain)
    input  wire logic i,      //  input pulse: source domain
    output      logic o       // output pulse: destination domain
    );

    // toggle reg when pulse received in source domain
    logic toggle_i;
    always_ff @(posedge clk_i) begin
        toggle_i <= toggle_i ^ i;
        if (rst_i) toggle_i <= 1'b0;
    end

    // cross to destination domain via shift reg
    logic [3:0] shr_o;
    always_ff @(posedge clk_o) begin
        shr_o <= {shr_o[2:0], toggle_i};
        if (rst_o) shr_o <= 4'b0;
    end

    // output pulse when transition occurs
    always_comb begin
        o = shr_o[3] ^ shr_o[2];
    end
endmodule

Before you rush to add new clock domains to your designs with xd.sv, you need to be aware of a significant limitation: this only works for isolated pulses of one clock cycle. This module is ideal for sending well-spaced events but can’t be used to send arbitrary data from one clock domain to another: use BRAM or a proper FIFO for that. Thanks to fpga4fun for this approach to CDC.

The following simulation shows what happens if you try to abuse this approach. Note how the final two-cycle pulse becomes two separate pulses when sent from slow to fast but disappears altogether when sent from fast to slow. You have been warned!

xd module simulation

There’s a test bench you can use to exercise the module with Vivado: [common/xc7/xd_tb.sv].
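For a quick feel of the well-behaved case, here is a minimal testbench sketch that sends three widely spaced single-cycle pulses through the xd module listed above and counts how many arrive. The testbench name, clock periods, and pulse spacing are assumptions chosen for illustration; it is not the repo’s xd_tb.sv.

```systemverilog
// Hypothetical testbench sketch: send three well-spaced pulses through xd
// and count how many arrive in the destination domain.
module xd_pulse_sketch_tb;
    logic clk_i = 0, clk_o = 0;
    logic rst_i = 1, rst_o = 1;
    logic i = 0, o;
    int   cnt_o = 0;  // pulses seen in destination domain

    always #5  clk_i = ~clk_i;  // source clock (assumed 10 time unit period)
    always #18 clk_o = ~clk_o;  // slower destination clock (assumed 36 units)

    xd xd_inst (.clk_i, .clk_o, .rst_i, .rst_o, .i, .o);

    // each output pulse lasts one clk_o cycle, so it is counted once
    always @(posedge clk_o) if (o) cnt_o <= cnt_o + 1;

    initial begin
        repeat (4) @(posedge clk_o);  // hold reset in both domains
        rst_i <= 0;
        rst_o <= 0;
        repeat (3) begin
            repeat (20) @(posedge clk_i);  // leave a generous gap
            i <= 1;                        // single-cycle pulse in source domain
            @(posedge clk_i);
            i <= 0;
        end
        repeat (20) @(posedge clk_i);  // allow the last pulse to cross
        $display("pulses received: %0d", cnt_o);
        $finish;
    end
endmodule
```

With these clock periods and gaps, you should see three pulses received. Shrinking the gap or widening the pulse is a good way to reproduce the failure modes described above.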

Framebuffer Module

The linebuffer doesn’t know anything about the framebuffer design. This is deliberate. We want the freedom to store our framebuffer however we like or even push data into the linebuffer from other sources. This post will use BRAM to create a simple framebuffer module that drives the linebuffer for us.

Add the framebuffer module - [framebuffer.sv]:

module framebuffer #(
    parameter CORDW=16,      // signed coordinate width (bits)
    parameter WIDTH=160,     // width of framebuffer in pixels
    parameter HEIGHT=120,    // height of framebuffer in pixels
    parameter CIDXW=4,       // colour index data width: 4=16, 8=256 colours
    parameter CHANW=4,       // width of RGB colour channels (4 or 8 bit)
    parameter SCALE=4,       // display output scaling factor (>=1)
    parameter F_IMAGE="",    // image file to load into framebuffer
    parameter F_PALETTE=""   // palette file to load into CLUT
    ) (
    input  wire logic clk_sys,    // system clock
    input  wire logic clk_pix,    // pixel clock
    input  wire logic rst_sys,    // reset (clk_sys)
    input  wire logic rst_pix,    // reset (clk_pix)
    input  wire logic de,         // data enable for display (clk_pix)
    input  wire logic frame,      // start a new frame (clk_pix)
    input  wire logic line,       // start a new screen line (clk_pix)
    input  wire logic we,         // write enable
    input  wire logic signed [CORDW-1:0] x,  // horizontal pixel coordinate
    input  wire logic signed [CORDW-1:0] y,  // vertical pixel coordinate
    input  wire logic [CIDXW-1:0] cidx,   // framebuffer colour index
    output      logic clip,               // pixel coordinate outside buffer
    output      logic [CHANW-1:0] red,    // colour output to display (clk_pix)
    output      logic [CHANW-1:0] green,  //     "    "    "    "    "
    output      logic [CHANW-1:0] blue    //     "    "    "    "    "
    );

    logic frame_sys;  // start of new frame in system clock domain
    xd xd_frame (.clk_i(clk_pix), .clk_o(clk_sys),
                 .rst_i(rst_pix), .rst_o(rst_sys), .i(frame), .o(frame_sys));

    // framebuffer (FB)
    localparam FB_PIXELS = WIDTH * HEIGHT;
    localparam FB_DEPTH  = FB_PIXELS;  // single buffer
    localparam FB_ADDRW  = $clog2(FB_DEPTH);
    localparam FB_DATAW  = CIDXW;

    logic [FB_ADDRW-1:0] fb_addr_read, fb_addr_write;
    logic [FB_DATAW-1:0] fb_cidx_read, fb_cidx_read_p1;

    // calculate write address from pixel coordinates (two stage: mul then add)
    logic signed [CORDW-1:0] x_add;     // pixel position on line
    logic [FB_ADDRW-1:0] fb_addr_line;  // address of line for writing
    always_ff @(posedge clk_sys) begin
        fb_addr_line <= WIDTH * y;  // write address 1st stage
        x_add <= x;  // save x for write address 2nd stage
        fb_addr_write <= fb_addr_line + x_add;
    end

    // draw colour and write enable (delay to match address calculation)
    logic fb_we, we_in_p1;
    logic [FB_DATAW-1:0] fb_cidx_write, cidx_in_p1;
    always_ff @(posedge clk_sys) begin
        we_in_p1 <= we;
        cidx_in_p1 <= cidx;  // draw colour
        clip <= (y < 0 || y >= HEIGHT || x < 0 || x >= WIDTH);  // clipped?
        fb_we <= (clip) ? 0 : we_in_p1;  // write enable if not clipped
        fb_cidx_write <= cidx_in_p1;
    end

    // framebuffer memory (BRAM)
    bram_sdp #(
        .WIDTH(FB_DATAW),
        .DEPTH(FB_DEPTH),
        .INIT_F(F_IMAGE)
    ) bram_inst (
        .clk_write(clk_sys),
        .clk_read(clk_sys),
        .we(fb_we),
        .addr_write(fb_addr_write),
        .addr_read(fb_addr_read),
        .data_in(fb_cidx_write),
        .data_out(fb_cidx_read)
    );

    // linebuffer (LB)
    localparam LB_SCALE = SCALE;  // scale (horizontal and vertical)
    localparam LB_LEN   = WIDTH;  // line length matches framebuffer
    localparam LB_BPC   = CHANW;  // bits per colour channel

    // Load data from FB into LB
    logic lb_data_req;  // LB requesting data
    logic [$clog2(LB_LEN+1)-1:0] cnt_h;  // count pixels in line to read
    always_ff @(posedge clk_sys) begin
        if (fb_addr_read < FB_PIXELS-1) begin
            if (lb_data_req) begin
                cnt_h <= 0;  // start new line
            end else if (cnt_h < LB_LEN) begin  // advance to start of next line
                cnt_h <= cnt_h + 1;
                fb_addr_read <= fb_addr_read + 1;
            end
        end else cnt_h <= LB_LEN;
        if (frame_sys) fb_addr_read <= 0;  // new frame
        if (rst_sys) begin
            fb_addr_read <= 0;
            cnt_h <= LB_LEN;  // don't start reading after reset
        end
    end

    // LB enable (not corrected for latency)
    logic lb_en_in, lb_en_out;
    always_comb lb_en_in  = cnt_h < LB_LEN;
    always_comb lb_en_out = de;

    // LB enable in: address calc and CLUT reg add three cycles of latency
    localparam LAT = 3;  // write latency
    logic [LAT-1:0] lb_en_in_sr;
    always @(posedge clk_sys) begin
        lb_en_in_sr <= {lb_en_in, lb_en_in_sr[LAT-1:1]};
        if (rst_sys) lb_en_in_sr <= 0;
    end

    // LB colour channels
    logic [LB_BPC-1:0] lb_in_0,  lb_in_1,  lb_in_2;
    logic [LB_BPC-1:0] lb_out_0, lb_out_1, lb_out_2;

    linebuffer #(
        .WIDTH(LB_BPC),   // data width of each channel
        .LEN(LB_LEN),     // length of line
        .SCALE(LB_SCALE)  // scaling factor (>=1)
        ) lb_inst (
        .clk_in(clk_sys),        // input clock
        .clk_out(clk_pix),       // output clock
        .rst_in(rst_sys),        // reset (clk_in)
        .rst_out(rst_pix),       // reset (clk_out)
        .data_req(lb_data_req),  // request input data (clk_in)
        .en_in(lb_en_in_sr[0]),  // enable input (clk_in)
        .en_out(lb_en_out),      // enable output (clk_out)
        .frame,                  // start a new frame (clk_out)
        .line,                   // start a new line (clk_out)
        .din_0(lb_in_0),         // data in (clk_in)
        .din_1(lb_in_1),
        .din_2(lb_in_2),
        .dout_0(lb_out_0),       // data out (clk_out)
        .dout_1(lb_out_1),
        .dout_2(lb_out_2)
    );

    // improve timing with register between BRAM and async ROM
    always @(posedge clk_sys) fb_cidx_read_p1 <= fb_cidx_read;

    // colour lookup table (ROM)
    localparam CLUTW = 3 * CHANW;
    logic [CLUTW-1:0] clut_colr;
    rom_async #(
        .WIDTH(CLUTW),
        .DEPTH(2**CIDXW),
        .INIT_F(F_PALETTE)
    ) clut (
        .addr(fb_cidx_read_p1),
        .data(clut_colr)
    );

    // map colour index to palette using CLUT and read into LB
    always_ff @(posedge clk_sys) {lb_in_2, lb_in_1, lb_in_0} <= clut_colr;

    logic lb_en_out_p1;  // LB enable out: reading from LB BRAM takes one cycle
    always_ff @(posedge clk_pix) lb_en_out_p1 <= lb_en_out;

    // colour output - combinational because top module should register
    always_comb begin
        red   = lb_en_out_p1 ? lb_out_2 : 0;
        green = lb_en_out_p1 ? lb_out_1 : 0;
        blue  = lb_en_out_p1 ? lb_out_0 : 0;
    end
endmodule

There’s a test bench you can use to exercise the module with Vivado: [xc7/framebuffer_tb.sv].

An explanation of the framebuffer module will be added shortly.
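In the meantime, here is a rough Python model of two details from the end of the module: the enable shift register that delays `lb_en_in` and the way the CLUT output is split into colour channels. It's a sketch for reasoning about the logic, not part of the design. The function names are made up for illustration, and the example colour value `0xF80` is arbitrary rather than taken from either palette.

```python
LAT = 3    # write latency, matching the localparam in the module
CHANW = 4  # bits per colour channel (the value used by the Arty top module)

def delay_line(samples, lat=LAT):
    """Model lb_en_in_sr: the new bit enters at the top, bit 0 is the tap.

    Returns the value of bit 0 after each clock edge.
    """
    sr = [0] * lat
    taps = []
    for s in samples:
        sr = sr[1:] + [s]    # same as {lb_en_in, lb_en_in_sr[LAT-1:1]}
        taps.append(sr[0])   # what the linebuffer sees on en_in
    return taps

def unpack_colour(colr, chanw=CHANW):
    """Split a CLUT entry as {lb_in_2, lb_in_1, lb_in_0} <= clut_colr does."""
    mask = (1 << chanw) - 1
    lb_2 = (colr >> (2 * chanw)) & mask  # top bits: drives red
    lb_1 = (colr >> chanw) & mask        # middle bits: drives green
    lb_0 = colr & mask                   # bottom bits: drives blue
    return lb_2, lb_1, lb_0

def colour_out(colr, enable, chanw=CHANW):
    """Model the final always_comb: output black when the enable is low."""
    return unpack_colour(colr, chanw) if enable else (0, 0, 0)

if __name__ == "__main__":
    print(delay_line([1, 0, 0, 0, 0]))
    print(unpack_colour(0xF80))
    print(colour_out(0xF80, enable=False))
```

A single-cycle pulse on the input passes through three registers before it reaches bit 0, which is the delay that `LAT` sets.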

Build the updated top module with the new framebuffer module:

top_david BRAM usage: 4.5x36Kb on Arty and 23x4Kb on iCEBreaker

The Arty version is shown below:

module top_david (
    input  wire logic clk_100m,     // 100 MHz clock
    input  wire logic btn_rst,      // reset button (active low)
    output      logic vga_hsync,    // horizontal sync
    output      logic vga_vsync,    // vertical sync
    output      logic [3:0] vga_r,  // 4-bit VGA red
    output      logic [3:0] vga_g,  // 4-bit VGA green
    output      logic [3:0] vga_b   // 4-bit VGA blue
    );

    // generate pixel clock
    logic clk_pix;
    logic clk_locked;
    clock_gen_480p clock_pix_inst (
       .clk(clk_100m),
       .rst(!btn_rst),  // reset button is active low
       .clk_pix,
       .clk_locked
    );

    // display timings
    localparam CORDW = 16;
    logic hsync, vsync;
    logic de, frame, line;
    display_timings_480p #(.CORDW(CORDW)) display_timings_inst (
        .clk_pix,
        .rst(!clk_locked),  // wait for pixel clock lock
        .sx(),
        .sy(),
        .hsync,
        .vsync,
        .de,
        .frame,
        .line
    );

    logic frame_sys;  // start of new frame in system clock domain
    xd xd_frame (.clk_i(clk_pix), .clk_o(clk_100m),
                 .rst_i(1'b0), .rst_o(1'b0), .i(frame), .o(frame_sys));

    // framebuffer (FB)
    localparam FB_WIDTH   = 160;
    localparam FB_HEIGHT  = 120;
    localparam FB_CIDXW   = 4;
    localparam FB_CHANW   = 4;
    localparam FB_SCALE   = 4;
    localparam FB_IMAGE   = "david.mem";
    localparam FB_PALETTE = "david_palette.mem";

    logic fb_we;
    logic signed [CORDW-1:0] fbx, fby;  // framebuffer coordinates
    logic [FB_CIDXW-1:0] fb_cidx;
    logic [FB_CHANW-1:0] fb_red, fb_green, fb_blue;  // colours for display

    framebuffer #(
        .WIDTH(FB_WIDTH),
        .HEIGHT(FB_HEIGHT),
        .CIDXW(FB_CIDXW),
        .CHANW(FB_CHANW),
        .SCALE(FB_SCALE),
        .F_IMAGE(FB_IMAGE),
        .F_PALETTE(FB_PALETTE)
    ) fb_inst (
        .clk_sys(clk_100m),
        .clk_pix,
        .rst_sys(1'b0),
        .rst_pix(1'b0),
        .de,
        .frame,
        .line,
        .we(fb_we),
        .x(fbx),
        .y(fby),
        .cidx(fb_cidx),
        .clip(),
        .red(fb_red),
        .green(fb_green),
        .blue(fb_blue)
    );

    // draw box around framebuffer
    enum {IDLE, TOP, RIGHT, BOTTOM, LEFT, DONE} state;
    initial state = IDLE;  // needed for Yosys
    always @(posedge clk_100m) begin
        case (state)
            TOP:
                if (fbx < FB_WIDTH-1) begin
                    fbx <= fbx + 1;
                end else begin
                    state <= RIGHT;
                end
            RIGHT:
                if (fby < FB_HEIGHT-1) begin
                    fby <= fby + 1;
                end else begin
                    fbx <= 0;
                    fby <= 0;
                    state <= LEFT;
                end
            LEFT:
                if (fby < FB_HEIGHT-1) begin
                    fby <= fby + 1;
                end else begin
                    state <= BOTTOM;
                end
            BOTTOM:
                if (fbx < FB_WIDTH-1) begin
                    fbx <= fbx + 1;
                end else begin
                    fb_we <= 0;
                    state <= DONE;
                end
            IDLE:
                if (frame_sys) begin
                    fbx <= 0;
                    fby <= 0;
                    fb_cidx <= 4'h0;  // palette index
                    fb_we <= 1;
                    state <= TOP;
                end
            default: state <= DONE;  // done forever!
        endcase
    end

    // reading from FB takes one cycle: delay display signals to match
    logic hsync_p1, vsync_p1;
    always_ff @(posedge clk_pix) begin
        hsync_p1 <= hsync;
        vsync_p1 <= vsync;
    end

    // VGA output
    always_ff @(posedge clk_pix) begin
        vga_hsync <= hsync_p1;
        vga_vsync <= vsync_p1;
        vga_r <= fb_red;
        vga_g <= fb_green;
        vga_b <= fb_blue;
    end
endmodule

We’re back to drawing a border around David, but note the clock domain! We’re drawing at 100 MHz, while the pixel clock is only 25.2 MHz. The iCEBreaker version continues to use the pixel clock for drawing.
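If you'd like to check which pixels the box-drawing state machine touches without running a simulation, the following Python sketch steps through the same states. It's a simplified model: it records the coordinates presented while `fb_we` is high and ignores any write latency inside the framebuffer module. The function name `draw_box` is hypothetical.

```python
def draw_box(width=160, height=120):
    """Step the box-drawing FSM one clock at a time.

    Returns the list of (x, y) coordinates presented with write enable
    high, plus the number of clock cycles from the frame start to DONE.
    """
    state, x, y, we = "IDLE", 0, 0, 0
    written = []
    cycles = 0
    while state != "DONE":
        if we:
            written.append((x, y))  # coordinates seen by the framebuffer
        if state == "IDLE":
            # assume frame_sys is high on the first cycle
            x, y, we, state = 0, 0, 1, "TOP"
        elif state == "TOP":
            if x < width - 1:
                x += 1
            else:
                state = "RIGHT"
        elif state == "RIGHT":
            if y < height - 1:
                y += 1
            else:
                x, y, state = 0, 0, "LEFT"
        elif state == "LEFT":
            if y < height - 1:
                y += 1
            else:
                state = "BOTTOM"
        elif state == "BOTTOM":
            if x < width - 1:
                x += 1
            else:
                we, state = 0, "DONE"
        cycles += 1
    return written, cycles

if __name__ == "__main__":
    pixels, cycles = draw_box()
    print(len(set(pixels)), "unique pixels in", cycles, "cycles")
```

In this model, the corner pixels are presented more than once because each state change takes a cycle, but the set of pixels written is exactly the one-pixel border of the framebuffer.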

Quick Aside: BRAM Optimization
You’d expect the linebuffer in the final David design to use 3x18Kb BRAMs (1.5x36Kb) on the Xilinx FPGA: one for each colour channel. However, all three colour channels are the same for the greyscale palette, so the linebuffer is optimised from 3x18Kb BRAMs to 1x18Kb. If you try the warm palette, the linebuffer will use 1.5x36Kb BRAMs, for a design total of 5.5x36Kb.
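The totals follow from simple arithmetic, sketched below in Python. The one assumption is that the BRAM used by the rest of the design stays the same when you change the palette; that figure isn't stated directly, so the sketch derives it from the 4.5x36Kb greyscale total.

```python
def lb_36kb(n_18kb):
    """Express a count of 18Kb BRAMs in units of 36Kb BRAMs."""
    return n_18kb / 2

TOTAL_GREY = 4.5          # reported top_david usage on Arty (greyscale)
LB_GREY = lb_36kb(1)      # linebuffer optimised to 1x18Kb
LB_COLOUR = lb_36kb(3)    # one 18Kb BRAM per colour channel

# assumption: everything outside the linebuffer is unchanged
REST = TOTAL_GREY - LB_GREY
TOTAL_COLOUR = REST + LB_COLOUR

if __name__ == "__main__":
    print("rest of design:", REST, "x36Kb")
    print("warm palette total:", TOTAL_COLOUR, "x36Kb")
```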

Creating Your Own Images

You can easily create your own images using img2fmem. The script is written in Python and uses the Pillow image library to perform the conversion. You can find it in the Project F FPGA Tools repo. Make sure your images are the same dimensions as the framebuffer you’re using.

To convert an image called acme.png to 4-bit colour with a 12-bit palette for use with $readmemh:

img2fmem.py acme.png 4 mem 12

For details on installation and command-line options, see the img2fmem README.
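To give a feel for what a 12-bit palette entry looks like, here is a small Python sketch that packs 8-bit RGB values into three hex digits, the form `$readmemh` reads. This is not img2fmem's code: the function names are hypothetical, and truncating each channel to its top four bits is an assumption made for illustration. The script may convert colours differently.

```python
def rgb888_to_rgb444_hex(r, g, b):
    """Pack 8-bit R, G, B into a 12-bit value as three hex digits.

    Assumption: keep the top four bits of each channel (truncation).
    """
    for chan in (r, g, b):
        if not 0 <= chan <= 255:
            raise ValueError("channel values must be in the range 0-255")
    value = ((r >> 4) << 8) | ((g >> 4) << 4) | (b >> 4)
    return f"{value:03X}"

def palette_text(colours):
    """Format a list of (r, g, b) tuples as one hex value per line."""
    return "\n".join(rgb888_to_rgb444_hex(*c) for c in colours) + "\n"

def pixel_count_ok(pixels, width=160, height=120):
    """Check an image has exactly one colour index per framebuffer pixel."""
    return len(pixels) == width * height

if __name__ == "__main__":
    print(palette_text([(0, 0, 0), (255, 136, 0), (255, 255, 255)]), end="")
```

Red ends up in the top four bits and blue in the bottom four, matching how the framebuffer module splits the CLUT output into channels.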

Explore

I hope you enjoyed this instalment of Exploring FPGA Graphics, but nothing beats creating your own designs. Here are a few suggestions to get you started:

  • Load your own picture into the framebuffer using img2fmem
  • Use a BRAM for the colour lookup table, so you can change the palette at runtime
  • Scale up the fizzlefade using the framebuffer module - watch out for the different clock domains
  • Combine sprites with a framebuffer: what effects can you create?

Next Time

In the next part, we’ll take our framebuffer and implement Conway’s Game of Life in Life on Screen before moving on to drawing Lines and Triangles.

Constructive feedback is always welcome. Get in touch with @WillFlux or open an issue on GitHub.

©2021 Will Green, Project F