Welcome back to Exploring FPGA Graphics. In the previous two parts, we worked with sprites, but another approach is needed as graphics become more complex. Instead of drawing directly to the screen, we draw to a framebuffer, which is read out to the screen. This post provides an introduction to framebuffers and how to scale them up. We’ll also learn how to fizzlefade graphics Wolfenstein 3D style. In the next part, we’ll use a framebuffer to visualize a simulation of life. This post was last updated in January 2022.
I am working on a revised version of this post for summer 2022.
In this series, we explore graphics at the hardware level and get a feel for the power of FPGAs. We’ll learn how displays work, race the beam with Pong, animate starfields and sprites, paint Michelangelo’s David, simulate life with bitmaps, draw lines and shapes, and create smooth animation with double buffering. New to the series? Start with Intro to FPGA Graphics.
Get in touch: GitHub Issues, 1BitSquared Discord, @WillFlux (Mastodon), @WillFlux (Twitter)
Series Outline
- Beginning FPGA Graphics - video signals and basic graphics
- Racing the Beam - simple demos with minimal logic
- FPGA Pong - recreate the classic arcade on an FPGA
- Display Signals - revisit display signals and meet colour palettes
- Hardware Sprites - fast, colourful graphics for games
- Ad Astra - demo with starfields and hardware sprites
- Framebuffers (this post) - bitmap graphics featuring Michelangelo’s David
- Life on Screen - Conway’s Game of Life in logic
- Lines and Triangles - drawing lines and triangles
- 2D Shapes - filled shapes and simple pictures
- Animated Shapes - animation and double-buffering
Sponsor My Work
If you like what I do, consider sponsoring me on GitHub.
I love FPGAs and want to help more people discover and use them in their projects.
My hardware designs are open source, and my blog is advert free.
Requirements
For this series, you need an FPGA board with video output. We’ll be working at 640x480, so pretty much any video output will do. It helps to be comfortable with programming your FPGA board and reasonably familiar with Verilog.
We’ll be demoing with these boards:
- iCEBreaker (Lattice iCE40) with 12-Bit DVI Pmod
- Digilent Arty A7-35T (Xilinx Artix-7) with Pmod VGA
Source
The SystemVerilog designs featured in this series are available from the projf-explore git repo under the open-source MIT licence: build on them to your heart’s content. The rest of the blog content is subject to standard copyright restrictions: don’t republish it without permission.
Framebuffer
A framebuffer is an in-memory bitmap that drives pixels on the screen. When you write to a memory location within the framebuffer, the corresponding pixel will change on the screen. Using a framebuffer provides two big benefits: we’re free to create sophisticated graphics using whatever technique we like, and the setting of pixel colour is separated from the process of driving the screen. The flexibility of a framebuffer comes at the cost of increased memory storage and latency.
A Small Buffer
A framebuffer requires enough memory to hold the complete frame. To keep things simple, we’ll store our framebuffer in internal FPGA block memory (BRAM). The iCEBreaker’s iCE40 FPGA has thirty 4Kb BRAMs, for 120Kb in total, so that’s what we’ll target. You can learn more about block RAM in FPGA Memory Types.
If we divide our 640x480 screen by four in both dimensions, we get 160x120. 160x120 is 19,200 pixels, so a monochrome framebuffer requires 18.75 kilobits of memory (19,200 = 18.75 * 1024).
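The arithmetic above is easy to check with a few lines of Python. This is only a sketch for experimenting with other sizes; the function names are my own and aren't part of the repo.

```python
def framebuffer_bits(width: int, height: int, bits_per_pixel: int) -> int:
    """Total storage needed for one frame, in bits."""
    return width * height * bits_per_pixel


def kilobits(bits: int) -> float:
    """Convert bits to kilobits (1 Kb = 1024 bits)."""
    return bits / 1024


# a full-size monochrome frame is too big for the iCE40's 120 Kb of BRAM
print(kilobits(framebuffer_bits(640, 480, 1)))  # 300.0
# dividing both dimensions by four gives a buffer that fits comfortably
print(kilobits(framebuffer_bits(160, 120, 1)))  # 18.75
```

Try changing `bits_per_pixel` to 4 to see how much room a 16-colour version of the same buffer needs.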
We’ll start by wiring up our usual display module with a framebuffer based on BRAM, then draw a simple horizontal line in it to confirm everything is working.
- Arty (XC7): xc7/top_line.sv
- iCEBreaker (iCE40): ice40/top_line.sv
The Arty version is shown below:
module top_line (
input wire logic clk_100m, // 100 MHz clock
input wire logic btn_rst, // reset button (active low)
output logic vga_hsync, // horizontal sync
output logic vga_vsync, // vertical sync
output logic [3:0] vga_r, // 4-bit VGA red
output logic [3:0] vga_g, // 4-bit VGA green
output logic [3:0] vga_b // 4-bit VGA blue
);
// generate pixel clock
logic clk_pix;
logic clk_locked;
clock_gen_480p clock_pix_inst (
.clk(clk_100m),
.rst(!btn_rst), // reset button is active low
.clk_pix,
.clk_locked
);
// display sync signals and coordinates
localparam CORDW = 16;
logic signed [CORDW-1:0] sx, sy;
logic hsync, vsync;
logic frame;
display_480p #(.CORDW(CORDW)) display_inst (
.clk_pix,
.rst(!clk_locked),
.sx,
.sy,
.hsync,
.vsync,
.de(),
.frame,
.line()
);
// framebuffer (FB)
localparam FB_WIDTH = 160;
localparam FB_HEIGHT = 120;
localparam FB_PIXELS = FB_WIDTH * FB_HEIGHT;
localparam FB_ADDRW = $clog2(FB_PIXELS);
localparam FB_DATAW = 1; // colour bits per pixel
logic fb_we;
logic [FB_ADDRW-1:0] fb_addr_write, fb_addr_read;
logic [FB_DATAW-1:0] fb_colr_write, fb_colr_read;
bram_sdp #(
.WIDTH(FB_DATAW),
.DEPTH(FB_PIXELS)
) bram_inst (
.clk_write(clk_pix),
.clk_read(clk_pix),
.we(fb_we),
.addr_write(fb_addr_write),
.addr_read(fb_addr_read),
.data_in(fb_colr_write),
.data_out(fb_colr_read)
);
// draw line across middle of framebuffer
logic [$clog2(FB_WIDTH)-1:0] cnt_draw;
enum {IDLE, DRAW, DONE} state;
always_ff @(posedge clk_pix) begin
case (state)
DRAW:
if (cnt_draw < FB_WIDTH-1) begin
fb_addr_write <= fb_addr_write + 1;
cnt_draw <= cnt_draw + 1;
end else begin
fb_we <= 0;
state <= DONE;
end
IDLE:
if (frame) begin
fb_colr_write <= 1;
fb_we <= 1;
fb_addr_write <= (FB_HEIGHT>>1) * FB_WIDTH;
cnt_draw <= 0;
state <= DRAW;
end
default: state <= DONE; // done forever!
endcase
end
logic paint; // which area of the framebuffer should we paint?
always_comb paint = (sy >= 0 && sy < FB_HEIGHT && sx >= 0 && sx < FB_WIDTH);
// calculate framebuffer read address for display output
// we start at address zero, so calculation doesn't add latency
always_ff @(posedge clk_pix) begin
if (frame) begin // reset address at start of frame
fb_addr_read <= 0;
end else if (paint) begin // increment address in painting area
fb_addr_read <= fb_addr_read + 1;
end
end
// reading from BRAM takes one cycle: delay display signals to match
logic paint_p1, hsync_p1, vsync_p1;
always_ff @(posedge clk_pix) begin
paint_p1 <= paint;
hsync_p1 <= hsync;
vsync_p1 <= vsync;
end
// VGA output
always_ff @(posedge clk_pix) begin
vga_hsync <= hsync_p1;
vga_vsync <= vsync_p1;
vga_r <= (paint_p1 && fb_colr_read) ? 4'hF : 4'h0;
vga_g <= (paint_p1 && fb_colr_read) ? 4'hF : 4'h0;
vga_b <= (paint_p1 && fb_colr_read) ? 4'hF : 4'h0;
end
endmodule
Build this design, and you should see a white horizontal line near the top left of your screen. It’s only 160 pixels long, so it won’t go right across the screen.
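If you’d like to check the drawing logic before building, here’s a rough cycle-by-cycle Python model of the IDLE/DRAW/DONE state machine. It’s a sketch under a few assumptions of mine: registers start at zero, the frame flag is high on the first cycle, and the BRAM write uses the register values from before each clock edge.

```python
FB_WIDTH, FB_HEIGHT = 160, 120


def draw_line():
    """Model the line-drawing state machine one clock cycle at a time."""
    fb = [0] * (FB_WIDTH * FB_HEIGHT)
    state, we, addr, colr, cnt = "IDLE", 0, 0, 0, 0
    frame = 1  # assumption: frame flag is asserted straight away
    for _ in range(FB_WIDTH + 10):
        # the BRAM sees the register values from before this clock edge
        if we:
            fb[addr] = colr
        # next-state logic (mirrors the non-blocking assignments)
        if state == "IDLE":
            if frame:
                colr, we, cnt = 1, 1, 0
                addr = (FB_HEIGHT >> 1) * FB_WIDTH  # start of middle line
                state = "DRAW"
        elif state == "DRAW":
            if cnt < FB_WIDTH - 1:
                addr, cnt = addr + 1, cnt + 1
            else:
                we, state = 0, "DONE"
    return fb


fb = draw_line()
print(sum(fb))  # 160 pixels set
print(all(fb[60 * FB_WIDTH + x] for x in range(FB_WIDTH)))  # True: row 60 is full
```

The model writes exactly one framebuffer line starting at address 9600, which is row 60 of 120.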
Building the Designs
In the Framebuffers section of the git repo, you’ll find the design files, a makefile for iCEBreaker, a Vivado project for Arty, and instructions for building the designs for both boards.
A Small Bitmap
Now our framebuffer is wired up correctly, let’s load something into it. A small monochrome framebuffer calls for a striking image: I’ve chosen David by Michelangelo.
The version on the right is the original from Wikipedia; it has 64 shades of grey. I created the middle image by reducing the original to 16 colours using img2fmem with no dithering (we will discuss this tool later in the post). The monochrome image on the left was created by Gerbrant using Floyd-Steinberg dithering.
Create a new top module to load the monochrome image of David. We’ll also draw a white border around the edge of the image using a simple finite state machine.
- Arty (XC7): xc7/top_david_v1.sv
- iCEBreaker (iCE40): ice40/top_david_v1.sv
top_david_v1 BRAM usage: 1x36Kb on Arty and 10x4Kb on iCEBreaker
The iCEBreaker version is shown below:
module top_david_v1 (
input wire logic clk_12m, // 12 MHz clock
input wire logic btn_rst, // reset button (active high)
output logic dvi_clk, // DVI pixel clock
output logic dvi_hsync, // DVI horizontal sync
output logic dvi_vsync, // DVI vertical sync
output logic dvi_de, // DVI data enable
output logic [3:0] dvi_r, // 4-bit DVI red
output logic [3:0] dvi_g, // 4-bit DVI green
output logic [3:0] dvi_b // 4-bit DVI blue
);
// generate pixel clock
logic clk_pix;
logic clk_locked;
clock_gen_480p clock_pix_inst (
.clk(clk_12m),
.rst(btn_rst),
.clk_pix,
.clk_locked
);
// display sync signals and coordinates
localparam CORDW = 16;
logic signed [CORDW-1:0] sx, sy;
logic hsync, vsync;
logic de, frame;
display_480p #(.CORDW(CORDW)) display_inst (
.clk_pix,
.rst(!clk_locked),
.sx,
.sy,
.hsync,
.vsync,
.de,
.frame,
.line()
);
// framebuffer (FB)
localparam FB_WIDTH = 160;
localparam FB_HEIGHT = 120;
localparam FB_PIXELS = FB_WIDTH * FB_HEIGHT;
localparam FB_ADDRW = $clog2(FB_PIXELS);
localparam FB_DATAW = 1; // colour bits per pixel
localparam FB_IMAGE = "../res/david/david_1bit.mem";
logic fb_we;
logic [FB_ADDRW-1:0] fb_addr_write, fb_addr_read;
logic [FB_DATAW-1:0] fb_colr_write, fb_colr_read;
bram_sdp #(
.WIDTH(FB_DATAW),
.DEPTH(FB_PIXELS),
.INIT_F(FB_IMAGE)
) bram_inst (
.clk_write(clk_pix),
.clk_read(clk_pix),
.we(fb_we),
.addr_write(fb_addr_write),
.addr_read(fb_addr_read),
.data_in(fb_colr_write),
.data_out(fb_colr_read)
);
// draw box around framebuffer
logic [$clog2(FB_WIDTH)-1:0] cnt_draw;
enum {IDLE, TOP, RIGHT, BOTTOM, LEFT, DONE} state;
always_ff @(posedge clk_pix) begin
case (state)
TOP:
if (cnt_draw < FB_WIDTH-1) begin
fb_addr_write <= fb_addr_write + 1;
cnt_draw <= cnt_draw + 1;
end else begin
cnt_draw <= 0;
state <= RIGHT;
end
RIGHT:
if (cnt_draw < FB_HEIGHT-1) begin
fb_addr_write <= fb_addr_write + FB_WIDTH;
cnt_draw <= cnt_draw + 1;
end else begin
fb_addr_write <= 0;
cnt_draw <= 0;
state <= LEFT;
end
LEFT:
if (cnt_draw < FB_HEIGHT-1) begin
fb_addr_write <= fb_addr_write + FB_WIDTH;
cnt_draw <= cnt_draw + 1;
end else begin
cnt_draw <= 0;
state <= BOTTOM;
end
BOTTOM:
if (cnt_draw < FB_WIDTH-1) begin
fb_addr_write <= fb_addr_write + 1;
cnt_draw <= cnt_draw + 1;
end else begin
fb_we <= 0;
state <= DONE;
end
IDLE:
if (frame) begin
fb_colr_write <= 1;
fb_we <= 1;
fb_addr_write <= 0; // start drawing at the top-left corner
cnt_draw <= 0;
state <= TOP;
end
default: state <= DONE; // done forever!
endcase
if (!clk_locked) state <= IDLE;
end
logic paint; // which area of the framebuffer should we paint?
always_comb paint = (sy >= 0 && sy < FB_HEIGHT && sx >= 0 && sx < FB_WIDTH);
// calculate framebuffer read address for display output
always_ff @(posedge clk_pix) begin
if (frame) begin // reset address at start of frame
fb_addr_read <= 0;
end else if (paint) begin // increment address in painting area
fb_addr_read <= fb_addr_read + 1;
end
end
// reading from BRAM takes one cycle: delay display signals to match
logic paint_p1, hsync_p1, vsync_p1, de_p1;
always_ff @(posedge clk_pix) begin
paint_p1 <= paint;
hsync_p1 <= hsync;
vsync_p1 <= vsync;
de_p1 <= de;
end
logic [3:0] red, green, blue; // output colour
always_comb begin
red = (paint_p1 && fb_colr_read) ? 4'hF : 4'h0;
green = (paint_p1 && fb_colr_read) ? 4'hF : 4'h0;
blue = (paint_p1 && fb_colr_read) ? 4'hF : 4'h0;
end
// Output DVI clock: 180° out of phase with other DVI signals
SB_IO #(
.PIN_TYPE(6'b010000) // PIN_OUTPUT_DDR
) dvi_clk_io (
.PACKAGE_PIN(dvi_clk),
.OUTPUT_CLK(clk_pix),
.D_OUT_0(1'b0),
.D_OUT_1(1'b1)
);
// Output DVI signals
SB_IO #(
.PIN_TYPE(6'b010100) // PIN_OUTPUT_REGISTERED
) dvi_signal_io [14:0] (
.PACKAGE_PIN({dvi_hsync, dvi_vsync, dvi_de, dvi_r, dvi_g, dvi_b}),
.OUTPUT_CLK(clk_pix),
.D_OUT_0({hsync_p1, vsync_p1, de_p1, red, green, blue}),
.D_OUT_1()
);
endmodule
Build this design and program your board. You should see the dithered image of David looking at you from the top left of your screen.
There is one cycle between supplying an address to the BRAM and receiving the data. We delay the display signals one cycle, so they remain in sync with the pixels we’re reading from BRAM.
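To see why the delay matters, here’s a toy Python model of a read port with one cycle of latency. It isn’t the bram_sdp module, just a sketch with names of my own choosing: the `delay_paint` flag switches between using `paint_p1` (as in the design) and the undelayed `paint`.

```python
class SyncBram:
    """Toy BRAM read port: data appears one clock after the address."""

    def __init__(self, contents):
        self.mem = list(contents)
        self.data_out = 0  # assumption: output register starts at zero

    def clock(self, addr_read):
        self.data_out = self.mem[addr_read]


def read_out(contents, n, delay_paint=True):
    """Read n pixels and return what reaches the display."""
    bram = SyncBram(contents)
    pixels, addr, paint_p1 = [], 0, 0
    for cycle in range(n + 1):
        paint = 1 if cycle < n else 0
        visible = paint_p1 if delay_paint else paint
        if visible:
            pixels.append(bram.data_out)  # value registered on the last edge
        # clock edge: BRAM registers its output, display signals are delayed
        bram.clock(addr % len(bram.mem))
        paint_p1 = paint
        addr += paint
    return pixels


print(read_out([5, 6, 7, 8], 4))                     # [5, 6, 7, 8]
print(read_out([5, 6, 7, 8], 4, delay_paint=False))  # [0, 5, 6, 7]
```

Without the delay, every pixel arrives one position late and the last one is lost.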
iCE40: One Bit Too Many
The iCE40 version uses more block RAM than you might expect: 160x120 should fit into five 4Kb BRAMs, but the david_v1 design requires ten. This is because the iCE40 BRAMs are a minimum of two bits wide. When we expand the colour depth to 4-bit, the BRAM usage will be the expected twenty. Learn more in the Lattice ICE Technology Library.
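We can reproduce these numbers with a little Python. This sketch assumes the tools configure every iCE40 BRAM in its narrowest arrangement, 2048 words of 2 bits; that’s my assumption about the synthesis result rather than something the build reports directly, and the function names are mine.

```python
import math

BRAM_BITS = 4096  # one iCE40 block RAM holds 4 Kb


def expected_brams(depth, width):
    """Naive estimate: pack each bit plane into whole 4 Kb BRAMs."""
    return math.ceil(depth / BRAM_BITS) * width


def brams_2048x2(depth, width):
    """BRAMs needed if each block is 2048 words deep and 2 bits wide."""
    return math.ceil(depth / 2048) * math.ceil(width / 2)


depth = 160 * 120
print(expected_brams(depth, 1), brams_2048x2(depth, 1))  # 5 10
print(expected_brams(depth, 4), brams_2048x2(depth, 4))  # 20 20
```

With one bit per pixel, half of each two-bit word is wasted; at four bits per pixel, nothing is.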
Casting Shade
We can increase the bits assigned to each pixel to support more colours, or in the case of David, more shades. Our video output is 12-bit, with four bits for each of red, green, and blue, so there are 16 possible shades of grey.
As with the hedgehog graphic in Hardware Sprites, we store the palette in a file: david_palette.mem.
FFF EEE DDD CCC BBB AAA 999 888 777 666 555 444 333 222 111 000
We load the palette file into a colour lookup table (CLUT) ROM, which is used to find the colour of each pixel for display. If you need a refresher on CLUTs, see A Refined Palette.
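In software terms, a CLUT is just an array indexed by the pixel value. Here’s a minimal Python sketch that parses the greyscale palette above; `load_clut` is a name I’ve made up, and I’m treating red as the most significant nibble, which matches how the framebuffer module later in this post unpacks the colour.

```python
PALETTE = "FFF EEE DDD CCC BBB AAA 999 888 777 666 555 444 333 222 111 000"


def load_clut(text):
    """Parse 12-bit hex colours into (red, green, blue) tuples of 4-bit values."""
    clut = []
    for word in text.split():
        value = int(word, 16)
        clut.append(((value >> 8) & 0xF, (value >> 4) & 0xF, value & 0xF))
    return clut


clut = load_clut(PALETTE)
print(len(clut))  # 16 entries for a 4-bit colour index
print(clut[0])    # (15, 15, 15): index 0 is white
print(clut[15])   # (0, 0, 0): index 15 is black
```

Note that index 0 is white and index 15 is black in this palette, which is why the fizzlefade below writes 4'hF to fade to black.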
Fizzle Out
In FPGA Ad Astra, we used linear feedback shift registers (LFSRs) to create animated starfields. LFSRs have another graphical use: creating random dissolve effects, as used in Wolfenstein 3D and explained by Fabien Sanglard in his excellent post: Fizzlefade.
We’ll apply the fizzlefade to David to dissolve our image to black after a delay.
The fizzle logic is shown below. Most of the logic is for delaying and controlling the fade rate.
// fizzlefade!
logic lfsr_en;
logic [14:0] lfsr;
lfsr #( // 15-bit LFSR (160x120 < 2^15)
.LEN(15),
.TAPS(15'b110000000000000)
) lsfr_fz (
.clk(clk_pix),
.rst(!clk_locked),
.en(lfsr_en),
.seed(0), // use default seed
.sreg(lfsr)
);
localparam FADE_WAIT = 600; // wait for 600 frames before fading
localparam FADE_RATE = 3200; // every 3200 pixel clocks update LFSR
logic [$clog2(FADE_WAIT)-1:0] cnt_wait;
logic [$clog2(FADE_RATE)-1:0] cnt_rate;
always_ff @(posedge clk_pix) begin
if (frame) begin
cnt_wait <= (cnt_wait != FADE_WAIT-1) ? cnt_wait + 1 : cnt_wait;
end
if (cnt_wait == FADE_WAIT-1) begin
if (cnt_rate == FADE_RATE-1) begin
lfsr_en <= 1;
fb_we <= 1;
fb_addr_write <= lfsr;
cnt_rate <= 0;
end else begin
cnt_rate <= cnt_rate + 1;
lfsr_en <= 0;
fb_we <= 0;
end
end
fb_cidx_write <= 4'hF; // palette index
end
Build the updated top module with 4-bit greyscale David and fizzlefade:
- Arty (XC7): xc7/top_david_v2.sv
- iCEBreaker (iCE40): ice40/top_david_v2.sv
top_david_v2 BRAM usage: 4x36Kb on Arty and 20x4Kb on iCEBreaker
You may have noticed that the top-left pixel doesn’t fade out; this is because the LFSR produces every value except zero. To fix this, you need to add some logic to specifically update the pixel at address zero when fading.
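You can confirm this behaviour with a software model. The sketch below assumes a right-shifting Galois LFSR with the same taps as the design and a seed of 1; the real lfsr module’s default seed may differ, and the model simply ignores values beyond the end of the framebuffer.

```python
TAPS = 0b110000000000000  # same 15-bit taps as the design


def lfsr_step(sreg, taps=TAPS):
    """One step of a right-shifting Galois LFSR."""
    lsb = sreg & 1
    sreg >>= 1
    if lsb:
        sreg ^= taps
    return sreg


def fizzle_order(pixels=160 * 120, seed=1):
    """Return framebuffer addresses in the order the LFSR visits them."""
    order = []
    sreg = seed
    while True:
        if sreg < pixels:  # assumption: out-of-range values are skipped
            order.append(sreg)
        sreg = lfsr_step(sreg)
        if sreg == seed:  # sequence has wrapped around
            break
    return order


order = fizzle_order()
print(len(order))        # 19199: every address except one
print(0 in order)        # False: address zero is never visited
print(len(set(order)))   # 19199: no address is visited twice
```

Every pixel is visited exactly once per cycle of the LFSR, apart from address zero, which needs the special handling described above.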
Warm Tones
Because the palette is separate from the image, we can quickly change it. In top_david_v2.sv, update localparam FB_PALETTE to reference david_palette_warm.mem and rebuild.
The warm palette looks like this:
FED EDC DCB CBA BA9 A98 987 876 765 654 543 432 321 210 100 000
There is also an inverted palette, david_palette_invert.mem, or you can create your own.
Colour Palettes
Wikipedia has a lovely list of color palettes for different hardware systems.
Crude Scaling
Our framebuffer is too small to fill a 640x480 screen, and modern monitors don’t support lower resolutions. To make our framebuffer fill the screen, we need to scale it up.
We’ve made our framebuffer an integer fraction of our display resolution, so scaling ought to be simple: just divide the screen coordinates by four!
fb_addr_read <= FB_WIDTH * (sy/4) + (sx/4);
This does work, as the third version of David demonstrates:
- Arty (XC7): xc7/top_david_v3.sv
- iCEBreaker (iCE40): ice40/top_david_v3.sv
However, I don’t recommend using this approach in any but the most basic designs.
With crude scaling, the memory access pattern is dictated by the pixel clock and beam position (sx,sy):
- Every framebuffer memory location is read multiple times (16x for four-fold scaling)
- It uses multiplication, even though display output is sequential
- It’s only practical for powers of two, where division is a simple shift (scaling our framebuffer by 6 or 10 would need a hardware divider)
- Memory access is driven by the pixel clock
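The first point in the list above is easy to demonstrate. This Python sketch applies the same address calculation to every visible screen position and counts how often each framebuffer location is read; `crude_addr` is my own name for the expression.

```python
from collections import Counter

FB_WIDTH, FB_HEIGHT, SCALE = 160, 120, 4


def crude_addr(sx, sy):
    """Framebuffer read address for screen position (sx, sy)."""
    return FB_WIDTH * (sy // SCALE) + (sx // SCALE)


reads = Counter(crude_addr(sx, sy) for sy in range(480) for sx in range(640))
print(len(reads))           # 19200 distinct addresses
print(set(reads.values()))  # {16}: each one is read sixteen times per frame
```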
The fundamental problem is that our framebuffer is tied to the display logic.
Even without scaling, we soon run into severe limitations because the whole design is bound to the pixel clock clk_pix:
- Sharing memory between the framebuffer and other logic forces it to run at clk_pix
- External memory, such as DRAM, has its own timings that don’t match clk_pix
- Different screen resolutions require different clk_pix, which changes the whole design
FPGAs are adroit at handling such situations: let’s build a better framebuffer.
A Better Buffer
We want to isolate our display logic, and its pixel clock, from the rest of our design.
We need two clock domains! Has the fear taken you yet?
Never fear, dual-port BRAM comes to the rescue. Rather than directly reading the framebuffer for display, we can use a linebuffer. The linebuffer has three BRAMs, one for each colour channel: red, green, blue.
diagram to follow
For our 160x120 framebuffer, we copy a single 160-pixel line into the linebuffer; then we read this out for four display lines before going back to the framebuffer for the following line of data. We write data into the linebuffer using our chosen design clock frequency and read it out in the pixel clock domain.
We’ve freed our design from the tyranny of the pixel clock.
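Ignoring clocks entirely, the data flow of the linebuffer approach can be sketched in Python as follows. This is a behavioural model only, with names of my own; it shows each framebuffer location being read once per frame while the display still gets a full-size image.

```python
def display_with_linebuffer(fb, width, height, scale):
    """Return (screen rows, number of framebuffer reads) for one frame."""
    screen = []
    fb_reads = 0
    for y in range(height):
        # copy one framebuffer line into the linebuffer (system clock side)
        linebuffer = fb[y * width:(y + 1) * width]
        fb_reads += width
        # read the linebuffer out for `scale` display lines (pixel clock side)
        for _line in range(scale):
            row = []
            for pix in linebuffer:
                row.extend([pix] * scale)  # repeat each pixel horizontally
            screen.append(row)
    return screen, fb_reads


screen, fb_reads = display_with_linebuffer(list(range(6)), 3, 2, 2)
print(screen[0])  # [0, 0, 1, 1, 2, 2]
print(screen[2])  # [3, 3, 4, 4, 5, 5]
print(fb_reads)   # 6: one read per framebuffer pixel
```

For our 160x120 buffer scaled by four, that’s 19,200 framebuffer reads per frame instead of 307,200.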
iCEBreaker BRAM & PLL
While iCE40UP5K BRAM is not true dual-port, it does support asynchronous reads and writes: the read and write ports can use separate clocks.
However, iCE40UP5K has only one PLL, so you can only generate one clock.
Linebuffer Module
Add the linebuffer module - [linebuffer.sv]:
module linebuffer #(
parameter WIDTH=8, // data width of each channel
parameter LEN=640, // length of line
parameter SCALE=1 // scaling factor (>=1)
) (
input wire logic clk_in, // input clock
input wire logic clk_out, // output clock
input wire logic rst_in, // reset (clk_in)
input wire logic rst_out, // reset (clk_out)
output logic data_req, // request input data (clk_in)
input wire logic en_in, // enable input (clk_in)
input wire logic en_out, // enable output (clk_out)
input wire logic frame, // start a new frame (clk_out)
input wire logic line, // start a new line (clk_out)
input wire logic [WIDTH-1:0] din_0, din_1, din_2, // data in (clk_in)
output logic [WIDTH-1:0] dout_0, dout_1, dout_2 // data out (clk_out)
);
// output data to display
logic [$clog2(LEN)-1:0] addr_out; // output address (pixel counter)
logic [$clog2(SCALE)-1:0] cnt_v, cnt_h; // scale counters
logic set_end; // line set end
logic get_data; // fresh data needed
always_ff @(posedge clk_out) begin
if (frame) begin // reset addr and counters at frame start
addr_out <= 0;
cnt_h <= 0;
cnt_v <= 0;
set_end <= 1; // ensure first line of frame triggers data_req
end else if (en_out && !set_end) begin
if (cnt_h == SCALE-1) begin
cnt_h <= 0;
if (addr_out == LEN-1) begin // end of line
addr_out <= 0;
if (cnt_v == SCALE-1) begin // end of line set
cnt_v <= 0;
set_end <= 1;
end else cnt_v <= cnt_v + 1;
end else addr_out <= addr_out + 1;
end else cnt_h <= cnt_h + 1;
end else if (get_data) set_end <= 0;
if (rst_out) begin
addr_out <= 0;
cnt_h <= 0;
cnt_v <= 0;
set_end <= 0;
end
end
// request new data at end of line set (needs to be in clk_in domain)
always_comb get_data = (line && set_end);
xd xd_req (.clk_i(clk_out), .clk_o(clk_in),
.rst_i(rst_out), .rst_o(rst_in), .i(get_data), .o(data_req));
// read data in
logic [$clog2(LEN)-1:0] addr_in;
always_ff @(posedge clk_in) begin
if (en_in) addr_in <= (addr_in == LEN-1) ? 0 : addr_in + 1;
if (data_req) addr_in <= 0; // reset addr_in when we request new data
if (rst_in) addr_in <= 0;
end
// channel 0
bram_sdp #(.WIDTH(WIDTH), .DEPTH(LEN)) ch0 (
.clk_write(clk_in),
.clk_read(clk_out),
.we(en_in),
.addr_write(addr_in),
.addr_read(addr_out),
.data_in(din_0),
.data_out(dout_0)
);
// channel 1
bram_sdp #(.WIDTH(WIDTH), .DEPTH(LEN)) ch1 (
.clk_write(clk_in),
.clk_read(clk_out),
.we(en_in),
.addr_write(addr_in),
.addr_read(addr_out),
.data_in(din_1),
.data_out(dout_1)
);
// channel 2
bram_sdp #(.WIDTH(WIDTH), .DEPTH(LEN)) ch2 (
.clk_write(clk_in),
.clk_read(clk_out),
.we(en_in),
.addr_write(addr_in),
.addr_read(addr_out),
.data_in(din_2),
.data_out(dout_2)
);
endmodule
There’s a test bench you can use to exercise the module with Vivado: [xc7/linebuffer_tb.sv].
Crossing the Streams
When the output (clk_out domain) finishes with a line of data, it needs to signal the input (clk_in domain) to provide another line. We use the clock/xd module to achieve this. Read Simple Clock Domain Crossing to learn more about this module.
Add the [clock/xd.sv] module.
Framebuffer Module
The linebuffer doesn’t know anything about the framebuffer design. This is deliberate. We want the freedom to store our framebuffer however we like or even push data into the linebuffer from other sources. This post will use BRAM to create a simple framebuffer module that drives the linebuffer for us.
Add the framebuffer module for BRAM - [framebuffer_bram.sv]:
module framebuffer_bram #(
parameter CORDW=16, // signed coordinate width (bits)
parameter WIDTH=320, // width of framebuffer in pixels
parameter HEIGHT=180, // height of framebuffer in pixels
parameter CIDXW=4, // colour index data width: 4=16, 8=256 colours
parameter CHANW=4, // width of RGB colour channels (4 or 8 bit)
parameter SCALE=4, // display output scaling factor (>=1)
parameter F_IMAGE="", // image file to load into framebuffer
parameter F_PALETTE="" // palette file to load into CLUT
) (
input wire logic clk_sys, // system clock
input wire logic clk_pix, // pixel clock
input wire logic rst_sys, // reset (clk_sys)
input wire logic rst_pix, // reset (clk_pix)
input wire logic de, // data enable for display (clk_pix)
input wire logic frame, // start a new frame (clk_pix)
input wire logic line, // start a new screen line (clk_pix)
input wire logic we, // write enable
input wire logic signed [CORDW-1:0] x, // horizontal pixel coordinate
input wire logic signed [CORDW-1:0] y, // vertical pixel coordinate
input wire logic [CIDXW-1:0] cidx, // framebuffer colour index
output logic busy, // busy with reading for display output
output logic clip, // pixel coordinate outside buffer
output logic [CHANW-1:0] red, // colour output to display (clk_pix)
output logic [CHANW-1:0] green, // " " " " "
output logic [CHANW-1:0] blue // " " " " "
);
logic frame_sys; // start of new frame in system clock domain
xd xd_frame (.clk_i(clk_pix), .clk_o(clk_sys),
.rst_i(rst_pix), .rst_o(rst_sys), .i(frame), .o(frame_sys));
// framebuffer (FB)
localparam FB_PIXELS = WIDTH * HEIGHT;
localparam FB_ADDRW = $clog2(FB_PIXELS);
localparam FB_DEPTH = FB_PIXELS;
localparam FB_DATAW = CIDXW;
localparam FB_DUALPORT = 1; // separate read and write ports?
logic [FB_ADDRW-1:0] fb_addr_read, fb_addr_write;
logic [FB_DATAW-1:0] fb_cidx_read, fb_cidx_read_p1;
// write address components
logic signed [CORDW-1:0] x_add; // pixel position on line
logic [FB_ADDRW-1:0] fb_addr_line; // address of line for drawing
// calculate write address from pixel coordinates (two stages: mul then add)
always_ff @(posedge clk_sys) begin
fb_addr_line <= WIDTH * y; // write address 1st stage (y could be negative)
x_add <= x; // save x for write address 2nd stage
fb_addr_write <= fb_addr_line + x_add; // clipping is checked later
end
// draw colour and write enable (delay to match address calculation)
logic fb_we, we_in_p1;
logic [FB_DATAW-1:0] fb_cidx_write, cidx_in_p1;
always_ff @(posedge clk_sys) begin
// first stage
we_in_p1 <= we;
cidx_in_p1 <= cidx; // draw colour
clip <= (y < 0 || y >= HEIGHT || x < 0 || x >= WIDTH); // clipped?
// second stage
fb_we <= (busy || clip) ? 0 : we_in_p1; // write if neither busy nor clipped
fb_cidx_write <= cidx_in_p1;
end
// framebuffer memory (BRAM)
bram_sdp #(
.WIDTH(FB_DATAW),
.DEPTH(FB_DEPTH),
.INIT_F(F_IMAGE)
) bram_inst (
.clk_write(clk_sys),
.clk_read(clk_sys),
.we(fb_we),
.addr_write(fb_addr_write),
.addr_read(fb_addr_read),
.data_in(fb_cidx_write),
.data_out(fb_cidx_read)
);
// linebuffer (LB)
localparam LB_SCALE = SCALE; // scale (horizontal and vertical)
localparam LB_LEN = WIDTH; // line length matches framebuffer
localparam LB_BPC = CHANW; // bits per colour channel
logic lb_data_req; // LB requesting data
logic [$clog2(LB_LEN+1)-1:0] cnt_h; // count pixels in line to read
// LB enable (not corrected for latency)
logic lb_en_in, lb_en_out;
always_comb lb_en_in = cnt_h < LB_LEN;
always_comb lb_en_out = de;
// LB enable in: BRAM, address calc, and CLUT reg add three cycles of latency
localparam LAT = 3; // write latency
logic [LAT-1:0] lb_en_in_sr;
always_ff @(posedge clk_sys) begin
lb_en_in_sr <= {lb_en_in, lb_en_in_sr[LAT-1:1]};
if (rst_sys) lb_en_in_sr <= 0;
end
// Load data from FB into LB
always_ff @(posedge clk_sys) begin
if (fb_addr_read < FB_PIXELS-1) begin
if (lb_data_req) begin
cnt_h <= 0; // start new line
if (!FB_DUALPORT) busy <= 1; // set busy flag if not dual port
end else if (cnt_h < LB_LEN) begin // advance to start of next line
cnt_h <= cnt_h + 1;
fb_addr_read <= fb_addr_read + 1;
end
end else cnt_h <= LB_LEN;
if (frame_sys) begin
fb_addr_read <= 0; // new frame
busy <= 0; // LB reads don't cross frame boundary
end
if (rst_sys) begin
fb_addr_read <= 0;
busy <= 0;
cnt_h <= LB_LEN; // don't start reading after reset
end
if (lb_en_in_sr == 3'b100) busy <= 0; // LB read done: match latency `LAT`
end
// LB colour channels
logic [LB_BPC-1:0] lb_in_0, lb_in_1, lb_in_2;
logic [LB_BPC-1:0] lb_out_0, lb_out_1, lb_out_2;
linebuffer #(
.WIDTH(LB_BPC), // data width of each channel
.LEN(LB_LEN), // length of line
.SCALE(LB_SCALE) // scaling factor (>=1)
) lb_inst (
.clk_in(clk_sys), // input clock
.clk_out(clk_pix), // output clock
.rst_in(rst_sys), // reset (clk_in)
.rst_out(rst_pix), // reset (clk_out)
.data_req(lb_data_req), // request input data (clk_in)
.en_in(lb_en_in_sr[0]), // enable input (clk_in)
.en_out(lb_en_out), // enable output (clk_out)
.frame, // start a new frame (clk_out)
.line, // start a new line (clk_out)
.din_0(lb_in_0), // data in (clk_in)
.din_1(lb_in_1),
.din_2(lb_in_2),
.dout_0(lb_out_0), // data out (clk_out)
.dout_1(lb_out_1),
.dout_2(lb_out_2)
);
// improve timing with register between BRAM and async ROM
always_ff @(posedge clk_sys) fb_cidx_read_p1 <= fb_cidx_read;
// colour lookup table (ROM)
localparam CLUTW = 3 * CHANW;
logic [CLUTW-1:0] clut_colr;
rom_async #(
.WIDTH(CLUTW),
.DEPTH(2**CIDXW),
.INIT_F(F_PALETTE)
) clut (
.addr(fb_cidx_read_p1),
.data(clut_colr)
);
// map colour index to palette using CLUT and read into LB
always_ff @(posedge clk_sys) {lb_in_2, lb_in_1, lb_in_0} <= clut_colr;
logic lb_en_out_p1; // LB enable out: reading from LB BRAM takes one cycle
always_ff @(posedge clk_pix) lb_en_out_p1 <= lb_en_out;
// colour output - combinational because top module should register
always_comb begin
red = lb_en_out_p1 ? lb_out_2 : 0;
green = lb_en_out_p1 ? lb_out_1 : 0;
blue = lb_en_out_p1 ? lb_out_0 : 0;
end
endmodule
There’s a test bench you can use to exercise the module with Vivado: [xc7/framebuffer_bram_tb.sv].
Explanation of framebuffer module will be added shortly.
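In the meantime, here’s a small Python sketch of one part of the module: turning pixel coordinates into a write address, with clipping. The hardware does this over two pipeline stages (multiply, then add); this model collapses them into one function, and `fb_write_addr` is a name of my own.

```python
def fb_write_addr(x, y, width=160, height=120):
    """Return the framebuffer address for (x, y), or None if clipped."""
    clip = y < 0 or y >= height or x < 0 or x >= width
    if clip:
        return None  # the module suppresses the write instead
    return width * y + x


print(fb_write_addr(0, 0))      # 0
print(fb_write_addr(159, 119))  # 19199
print(fb_write_addr(-1, 5))     # None
print(fb_write_addr(160, 0))    # None
```

Without the clip check, (160, 0) would land on address 160, which is the first pixel of the next line, so off-screen drawing would wrap around onto the picture.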
Build the updated top module with the new framebuffer module:
- Arty (XC7): xc7/top_david.sv
- iCEBreaker (iCE40): ice40/top_david.sv
top_david BRAM usage: 4.5x36Kb on Arty and 23x4Kb on iCEBreaker
The Arty version is shown below:
module top_david (
input wire logic clk_100m, // 100 MHz clock
input wire logic btn_rst, // reset button (active low)
output logic vga_hsync, // horizontal sync
output logic vga_vsync, // vertical sync
output logic [3:0] vga_r, // 4-bit VGA red
output logic [3:0] vga_g, // 4-bit VGA green
output logic [3:0] vga_b // 4-bit VGA blue
);
// generate pixel clock
logic clk_pix;
logic clk_locked;
clock_gen_480p clock_pix_inst (
.clk(clk_100m),
.rst(!btn_rst), // reset button is active low
.clk_pix,
.clk_locked
);
// display sync signals and coordinates
localparam CORDW = 16;
logic hsync, vsync;
logic de, frame, line;
display_480p #(.CORDW(CORDW)) display_inst (
.clk_pix,
.rst(!clk_locked),
.sx(),
.sy(),
.hsync,
.vsync,
.de,
.frame,
.line
);
logic frame_sys; // start of new frame in system clock domain
xd xd_frame (.clk_i(clk_pix), .clk_o(clk_100m),
.rst_i(1'b0), .rst_o(1'b0), .i(frame), .o(frame_sys));
// framebuffer (FB)
localparam FB_WIDTH = 160;
localparam FB_HEIGHT = 120;
localparam FB_CIDXW = 4;
localparam FB_CHANW = 4;
localparam FB_SCALE = 4;
localparam FB_IMAGE = "david.mem";
localparam FB_PALETTE = "david_palette.mem";
logic fb_we; // write enable
logic signed [CORDW-1:0] fbx, fby; // draw coordinates
logic [FB_CIDXW-1:0] fb_cidx; // draw colour index
logic [FB_CHANW-1:0] fb_red, fb_green, fb_blue; // colours for display output
framebuffer_bram #(
.WIDTH(FB_WIDTH),
.HEIGHT(FB_HEIGHT),
.CIDXW(FB_CIDXW),
.CHANW(FB_CHANW),
.SCALE(FB_SCALE),
.F_IMAGE(FB_IMAGE),
.F_PALETTE(FB_PALETTE)
) fb_inst (
.clk_sys(clk_100m),
.clk_pix,
.rst_sys(1'b0),
.rst_pix(1'b0),
.de,
.frame,
.line,
.we(fb_we),
.x(fbx),
.y(fby),
.cidx(fb_cidx),
.clip(),
.busy(),
.red(fb_red),
.green(fb_green),
.blue(fb_blue)
);
// draw box around framebuffer
enum {IDLE, TOP, RIGHT, BOTTOM, LEFT, DONE} state;
always_ff @(posedge clk_100m) begin
case (state)
TOP:
if (fbx < FB_WIDTH-1) begin
fbx <= fbx + 1;
end else begin
state <= RIGHT;
end
RIGHT:
if (fby < FB_HEIGHT-1) begin
fby <= fby + 1;
end else begin
fbx <= 0;
fby <= 0;
state <= LEFT;
end
LEFT:
if (fby < FB_HEIGHT-1) begin
fby <= fby + 1;
end else begin
state <= BOTTOM;
end
BOTTOM:
if (fbx < FB_WIDTH-1) begin
fbx <= fbx + 1;
end else begin
fb_we <= 0;
state <= DONE;
end
IDLE:
if (frame_sys) begin
fbx <= 0;
fby <= 0;
fb_cidx <= 4'h0; // palette index
fb_we <= 1;
state <= TOP;
end
default: state <= DONE; // done forever!
endcase
end
// reading from FB takes one cycle: delay display signals to match
logic hsync_p1, vsync_p1;
always_ff @(posedge clk_pix) begin
hsync_p1 <= hsync;
vsync_p1 <= vsync;
end
// VGA output
always_ff @(posedge clk_pix) begin
vga_hsync <= hsync_p1;
vga_vsync <= vsync_p1;
vga_r <= fb_red;
vga_g <= fb_green;
vga_b <= fb_blue;
end
endmodule
We’re back to drawing a border around David, but note the clock domain! We’re drawing at 100 MHz, while the pixel clock is only 25.2 MHz. The iCEBreaker version continues to use the pixel clock for drawing (the iCE40UP has only one PLL).
BRAM Optimization
You’d expect the linebuffer in the final David design to use 3x18Kb BRAMs (1.5x36Kb) on the Xilinx FPGA, but because all three colour channels are the same for the greyscale palette, the linebuffer is optimised from 3x18Kb BRAMs to 1x18Kb (0.5x36Kb), giving the 4.5x36Kb total above. If you try the warm palette, the linebuffer will use the full 1.5x36Kb, for a total of 5.5x36Kb.
Creating Your Own Images
You can easily create your own images using img2fmem. The script is written in Python and uses the Pillow image library to perform the conversion. You can find it in the Project F FPGA Tools repo. Make sure your images are the same dimensions as the framebuffer you’re using.
To convert an image called acme.png to 4-bit colour with a 12-bit palette for use with $readmemh:
img2fmem.py acme.png 4 mem 12
For details on installation and command-line options, see the img2fmem README.
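If you just want a test pattern, you don’t need an image at all: $readmemh accepts hex values separated by whitespace. The sketch below is not how img2fmem works and its output layout may differ from that tool’s; it’s a hypothetical helper of mine that generates a checkerboard of 4-bit palette indices sized for our 160x120 framebuffer.

```python
def checkerboard_mem(width=160, height=120, square=8, dark=0xF, light=0x0):
    """Return $readmemh text: one hex palette index per pixel, one row per line."""
    lines = []
    for y in range(height):
        row = []
        for x in range(width):
            is_dark = ((x // square) + (y // square)) % 2 == 1
            row.append(format(dark if is_dark else light, "X"))
        lines.append(" ".join(row))
    return "\n".join(lines) + "\n"


text = checkerboard_mem()
print(len(text.split()))          # 19200 values, one per pixel
print(" ".join(text.split()[:10]))  # 0 0 0 0 0 0 0 0 F F
```

Write the returned text to a file and point FB_IMAGE at it. With the greyscale palette, index 0 is white and index F is black.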
Explore
I hope you enjoyed this instalment of Exploring FPGA Graphics, but nothing beats creating your own designs. Here are a few suggestions to get you started:
- Load your own picture into the framebuffer using img2fmem
- Use a BRAM for the colour lookup table, so you can change the palette at runtime
- Scale up the fizzlefade using the framebuffer module - watch out for the different clock domains
- Combine sprites with a framebuffer: what effects can you create?
Next Time
In the next part, we’ll take our framebuffer and implement Conway’s Game of Life in Life on Screen before moving onto drawing Lines and Triangles. Check out the site map for more FPGA projects and tutorials.