FPGA Memory Types
Designing with FPGAs involves many types of memory, some familiar from other devices, but some that are specific to FPGAs. This how-to gives a quick overview of the different flavours, together with their strengths and weaknesses, and some sample designs. This guide includes external memory types, such as SRAM and HBM, that are also used with CPUs and GPUs, so much of what is said here is generally applicable, but the focus is on FPGAs. You might also be interested in Initialize Memory in Verilog.
To give you a sense of the memory capability of small FPGAs, we include the memory specifications for Lattice iCE40 UP5K and Xilinx Spartan XC7S25.
You can find BRAM ROM and RAM designs in the Project F Verilog Library.
This is a draft post. More content to follow.
- Memory Terminology
- Distributed RAM
- Block RAM and UltraRAM
- High Bandwidth Memory
- Static RAM
- Dynamic RAM
- Pseudo SRAM
- Address Width - how many bits are needed to address all the elements in the memory array
- Bandwidth - how much data can be transferred by a memory interface each second
- (Data) Width - how many bits there are in each element
- Depth - how many elements there are in the memory array
- Organisation - depth x width (see below)
- Latency - how long it takes for a memory interface to start returning data
ProTip: In SystemVerilog you can use $clog2 to calculate the address width from the depth.
Memory is described by its organisation or depth x width. For example, a 4 Mb SRAM could be described as 512K x 8, which means it has 524,288 (512 x 1024) locations, each of which holds 8 bits. The same memory capacity could also be organised as 256K x 16, which means it has 262,144 (256 x 1024) locations, each of which holds 16 bits.
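As a minimal sketch of the terminology above, the following simulation-only module derives the address width and capacity of the 512K x 8 example using $clog2. The module and parameter names are illustrative, not taken from any library.

```systemverilog
// Sketch: derive address width and capacity from a memory's organisation.
// Module and parameter names are hypothetical.
module mem_org_demo;
    localparam int DEPTH    = 512 * 1024;     // 512K locations
    localparam int WIDTH    = 8;              // 8 bits per location
    localparam int ADDRW    = $clog2(DEPTH);  // address bits needed for DEPTH locations
    localparam int CAPACITY = DEPTH * WIDTH;  // total capacity in bits

    initial begin
        // prints "524288 x 8: 19 address bits, 4194304 bits total"
        $display("%0d x %0d: %0d address bits, %0d bits total",
            DEPTH, WIDTH, ADDRW, CAPACITY);
    end
endmodule
```

Changing the parameters to 256K x 16 gives the same capacity with one fewer address bit.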
Flip-flops are the state keepers of FPGAs. Each flip-flop holds one bit.
When you create a simple counter you’re using flip-flops:
```systemverilog
logic [15:0] cnt = 0;  // 16 flip-flops

always_ff @(posedge clk) begin
    cnt <= cnt + 1;
end
```
Flip-flops are great for saving state, but they’re only suitable for the smallest memories: their numbers are limited, they’re spread throughout the FPGA (making routing larger memories hard), and they don’t support multiple ports.
The Lattice iCE40 UP5K has 5,280 flip-flops, while the Xilinx Spartan 7S25 has 29,200.
Distributed ram is built with LUTs. LUTs are usually used to create the logic of your design, but can also support memory in some FPGAs. Distributed ram is, as its name suggests, distributed throughout the FPGA. A single 6-input LUT can store 64 bits.
Distributed ram is read asynchronously, but written to synchronously (requires a clock). Writes are limited to a single port, but you can read from up to four ports in some FPGAs. Distributed ram is flexible in the data width it supports: for example, if you’re dealing with 32-level data, you can use a width of 5 bits.
Given the asynchronous nature of reads, distributed ram is ideal for fast buffers: you can use a value immediately, rather than waiting for the next clock tick. You can also use distributed ram to create small ROMs. However, distributed ram is not suited to large memories: for memories larger than about 128 bits (based on Xilinx 7 Series), you’ll get better performance (and lower power consumption) using block ram (see next section).
In Xilinx 7 Series FPGAs, only LUTs in SLICEM blocks may be used as memory. A Spartan 7S25 FPGA has 14,600 6-input LUTs, of which 5,000 are SLICEM, so you have a maximum of: 5,000 x 64 bits = 320,000 bits.
For more on Xilinx distributed ram see UG474: 7 Series FPGAs Configurable Logic Block.
The following example shows a ROM module using distributed ram (note the lack of a clock):
```systemverilog
module rom_async #(
    parameter WIDTH=8,
    parameter DEPTH=256,
    parameter INIT_F="",
    localparam ADDRW=$clog2(DEPTH)
    ) (
    input wire logic [ADDRW-1:0] addr,
    output     logic [WIDTH-1:0] data
    );

    logic [WIDTH-1:0] memory [DEPTH];

    initial begin
        if (INIT_F != 0) begin
            $display("Creating rom_async from init file '%s'.", INIT_F);
            $readmemh(INIT_F, memory);
        end
    end

    always_comb data = memory[addr];
endmodule
```
To learn more about loading memory with $readmemh and $readmemb, see Initialize Memory in Verilog.
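The ROM above only reads. As a hedged sketch of the synchronous-write, asynchronous-read behaviour described earlier, a writable distributed ram might look like the module below. The module name distram_sdp and the default parameters (5-bit data, 64 locations) are assumptions for illustration. Whether a tool maps this onto LUT ram rather than flip-flops or block ram depends on the toolchain and its settings.

```systemverilog
// Sketch: simple dual-port ram with synchronous write and asynchronous read.
// The module name and defaults are hypothetical, not from a vendor library.
module distram_sdp #(
    parameter WIDTH=5,   // e.g. 32-level data
    parameter DEPTH=64,
    localparam ADDRW=$clog2(DEPTH)
    ) (
    input  wire logic clk,                     // write clock
    input  wire logic we,                      // write enable
    input  wire logic [ADDRW-1:0] addr_write,  // write address
    input  wire logic [ADDRW-1:0] addr_read,   // read address
    input  wire logic [WIDTH-1:0] data_in,     // data in
    output      logic [WIDTH-1:0] data_out     // data out
    );

    logic [WIDTH-1:0] memory [DEPTH];

    // synchronous write: data is stored on the clock edge
    always_ff @(posedge clk) begin
        if (we) memory[addr_write] <= data_in;
    end

    // asynchronous read: no clock, so data follows the read address
    always_comb data_out = memory[addr_read];
endmodule
```

Because the read is combinational, a value written on one clock edge is visible at data_out as soon as the write completes, without waiting for a further clock tick.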
Block ram (BRAM) is implemented using dedicated ram circuitry within the FPGA and is ideal for memories from a few hundred bits up to hundreds of kilobits. But BRAM is not just better than distributed ram for larger memories: I’d go so far as to say that block RAM is one of the great things about developing hardware with FPGAs.
Xilinx 7 series FPGAs have 36 Kb BRAM blocks with two ports and up to 72 bit data width. Each port has an independent clock, so you can share data across clock domains. For example, you could have a RISC-V CPU accessing a block ram at 180 MHz, while your custom hardware accesses the same block at 133.3 MHz.
BRAMs are quite flexible in organisation; with two read/write ports (true dual-port BRAM) you can have data widths of 1, 2, 4, 9, 18, and 36 bits. The latter widths are multiples of 9 because each byte has an extra bit, typically used for parity or ECC (error correction). For example, if you have 4-bit data then the organisation would be 8K x 4 (only 32 of the 36 Kb are used in this case). If you restrict yourself to one read and one write port (simple dual-port BRAM) then you can have 72-bit wide data.
Each 36 Kb block may also be split into two independent 18 Kb BRAMs, though the maximum data width of each is reduced.
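The organisations of a single 36 Kb block described above can be tabulated with a little arithmetic. This is a simulation-only sketch: the function name bram36_depth is made up for illustration, and it is only meaningful for the supported widths (1, 2, 4, 9, 18, 36, and 72 for simple dual-port).

```systemverilog
// Sketch: depth of one 36 Kb block ram for each supported data width.
// Function and module names are hypothetical.
module bram36_org_demo;
    function automatic int bram36_depth(input int width);
        // Widths that are multiples of 9 use all 36 Kb;
        // widths of 1, 2, and 4 only use 32 Kb of the block.
        return (width % 9 == 0) ? (36 * 1024) / width : (32 * 1024) / width;
    endfunction

    initial begin
        // narrow widths: 1, 2, 4
        for (int w = 1; w <= 4; w *= 2)
            $display("%0d x %0d", bram36_depth(w), w);
        // byte-plus-extra-bit widths: 9, 18, 36, 72 (72 is simple dual-port only)
        for (int w = 9; w <= 72; w *= 2)
            $display("%0d x %0d", bram36_depth(w), w);
    end
endmodule
```

For a width of 4 the function returns 8,192, matching the 8K x 4 organisation mentioned above.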
- Inference vs Primitives
- Larger memories with multiple blocks…
- Byte-level writes…
- Block ram columns…
- Collisions occur when both ports access the same address…
- Output registers for higher clock speeds…
For more details on Xilinx BRAM see UG473: 7 Series FPGAs Memory Resources.
BRAM Data Sheet Capacity
Be careful when interpreting the headline BRAM capacity on FPGA data sheets. For example, if you look at Xilinx’s 7 Series Overview, you’ll see that the Spartan 7S25 has 1,620 Kb of block ram. However, you shouldn’t think of this as having ~200 KiB of ram like you would on a microcontroller. This figure includes the 9th bit usually used for error correction; for 32-bit data the capacity is 1,440 Kb. More importantly, BRAM is composed of many small blocks spread across the FPGA: you can’t expect to combine them all together into one big memory and get good performance.
iCE40 UP5K has 4 Kb BRAMs…
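The data sheet arithmetic can be checked with a few constants. This sketch assumes the headline figure is made up of 36 Kb blocks, which gives 45 blocks (1,620 / 36); the module and parameter names are illustrative.

```systemverilog
// Sketch: headline vs usable block ram capacity for the Spartan 7S25.
// Names are hypothetical; BLOCKS is derived from the 1,620 Kb headline figure.
module bram_capacity_demo;
    localparam int BLOCKS      = 45;           // 1,620 Kb / 36 Kb per block
    localparam int HEADLINE_KB = BLOCKS * 36;  // includes the 9th bit of each byte
    localparam int DATA_KB     = BLOCKS * 32;  // 8 data bits per byte only
    localparam int DATA_KIB    = DATA_KB / 8;  // kilobits to kibibytes

    initial begin
        // prints "Headline: 1620 Kb, data: 1440 Kb (180 KiB)"
        $display("Headline: %0d Kb, data: %0d Kb (%0d KiB)",
            HEADLINE_KB, DATA_KB, DATA_KIB);
    end
endmodule
```

So with plain 8, 16, or 32-bit data you have 180 KiB to work with, rather than the ~200 KiB the headline figure suggests.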
For more details on Lattice BRAM see Memory Usage Guide for iCE40 Devices.
First in, first out…
The following example shows a dual-port block ram module for iCE40 (a port can only be read or write on this FPGA):
New example to be added
The following example shows a simple Xilinx block ram module using a single read and a single write port (ports may be both read and write):
```systemverilog
module bram_basic_xc7 #(
    parameter WIDTH=8,
    parameter DEPTH=256,
    parameter INIT_F="",
    localparam ADDRW=$clog2(DEPTH)
    ) (
    input wire logic clk,                     // clock (port a & b)
    input wire logic we,                      // write enable (port a)
    input wire logic [ADDRW-1:0] addr_write,  // write address (port a)
    input wire logic [ADDRW-1:0] addr_read,   // read address (port b)
    input wire logic [WIDTH-1:0] data_in,     // data in (port a)
    output     logic [WIDTH-1:0] data_out     // data out (port b)
    );

    /* verilator lint_off MULTIDRIVEN */
    logic [WIDTH-1:0] memory [DEPTH];
    /* verilator lint_on MULTIDRIVEN */

    initial begin
        if (INIT_F != 0) begin
            $display("Loading memory init file '%s' into bram_basic.", INIT_F);
            $readmemh(INIT_F, memory);
        end
    end

    // Port A: Sync Write
    always_ff @(posedge clk) begin
        if (we) begin
            memory[addr_write] <= data_in;
        end
    end

    // Port B: Sync Read
    always_ff @(posedge clk) begin
        data_out <= memory[addr_read];
    end
endmodule
```
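The module above shares one clock between both ports. As a sketch of the independent port clocks mentioned earlier, the variant below gives the write and read ports their own clocks. The module name bram_sdp_dual_clk and the port names clk_write and clk_read are assumptions for illustration, and whether a given tool infers dual-clock block ram from this pattern should be checked against its synthesis guide.

```systemverilog
// Sketch: simple dual-port block ram with separate write and read clocks.
// Module and port names are hypothetical.
module bram_sdp_dual_clk #(
    parameter WIDTH=8,
    parameter DEPTH=256,
    localparam ADDRW=$clog2(DEPTH)
    ) (
    input  wire logic clk_write,               // write clock (port a)
    input  wire logic clk_read,                // read clock (port b)
    input  wire logic we,                      // write enable (port a)
    input  wire logic [ADDRW-1:0] addr_write,  // write address (port a)
    input  wire logic [ADDRW-1:0] addr_read,   // read address (port b)
    input  wire logic [WIDTH-1:0] data_in,     // data in (port a)
    output      logic [WIDTH-1:0] data_out     // data out (port b)
    );

    logic [WIDTH-1:0] memory [DEPTH];

    // Port A: write in the clk_write domain
    always_ff @(posedge clk_write) begin
        if (we) memory[addr_write] <= data_in;
    end

    // Port B: read in the clk_read domain
    always_ff @(posedge clk_read) begin
        data_out <= memory[addr_read];
    end
endmodule
```

The memory itself carries data between the two domains, but any control signals that cross between them (for example, a flag saying new data is ready) still need proper clock domain crossing logic.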
The following example shows a synchronous ROM module that should work with both Xilinx and Lattice BRAM:
```systemverilog
module rom_sync #(
    parameter WIDTH=8,
    parameter DEPTH=512,
    parameter INIT_F="",
    localparam ADDRW=$clog2(DEPTH)
    ) (
    input wire logic clk,
    input wire logic [ADDRW-1:0] addr,
    output     logic [WIDTH-1:0] data
    );

    logic [WIDTH-1:0] memory [DEPTH];

    initial begin
        if (INIT_F != 0) begin
            $display("Creating rom_sync from init file '%s'.", INIT_F);
            $readmemh(INIT_F, memory);
        end
    end

    always_ff @(posedge clk) begin
        data <= memory[addr];
    end
endmodule
```
UltraRAM is a type of memory available in Xilinx UltraScale and UltraScale+ FPGAs. UltraRAM is like block ram on steroids: bigger but less agile. The blocks are 288 Kb in size (8x regular BRAM): combining all the blocks in a column gives you up to 36 Mb of memory to play with. However, while UltraRAM is dual-ported, it does not support independent clocks on each port, and is a fixed 72 bits wide.
High Bandwidth Memory
Better known as HBM - Bandwidth, bandwidth and more bandwidth…
Found only on high-end FPGAs and graphics cards…
If you were to imagine a memory chip, you’d probably imagine something like static ram (SRAM). You provide the address of the element you want on the address bus, wait some nanoseconds and the data is returned on the data bus.
SRAM’s big issue is cost, which also limits the capacities it’s available in.
Few recent FPGA dev boards include SRAM, which is a shame, as this ram is ideal for beginners and low-latency hardware designs. One that does is the Digilent Cmod A7, which features 4 Mb (512K x 8) of 8ns SRAM.
Static RAM is available in both synchronous and asynchronous types…
Async SRAM doesn’t use a clock… speed in nanoseconds
Async SRAM is most commonly ~3V, which works well with current FPGA designs.
4 Mb asynchronous SRAM ICs are cheap (for SRAM) and widely available at speeds of 10 ns with an 8 or 16-bit data bus. A 4 Mb (512 KiB) 10 ns SRAM costs around $2 in small quantities. A 4 Mb (512K x 8) SRAM has 8 data pins and 19 address pins, and typically three control pins: write enable, output enable, and chip select.
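To see what this means for an FPGA's I/O budget, the pin count for the 512K x 8 part can be worked out from its organisation. This is a rough sketch with illustrative names, and it assumes the three control pins listed above are the only ones needed.

```systemverilog
// Sketch: FPGA pins needed to connect a 512K x 8 asynchronous SRAM.
// Names are hypothetical; CTRL_PINS assumes WE, OE, and CS only.
module sram_pins_demo;
    localparam int DEPTH      = 512 * 1024;
    localparam int DATA_PINS  = 8;              // one pin per data bit
    localparam int ADDR_PINS  = $clog2(DEPTH);  // one pin per address bit
    localparam int CTRL_PINS  = 3;              // write enable, output enable, chip select
    localparam int TOTAL_PINS = DATA_PINS + ADDR_PINS + CTRL_PINS;

    initial begin
        // prints "SRAM needs 30 FPGA pins"
        $display("SRAM needs %0d FPGA pins", TOTAL_PINS);
    end
endmodule
```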
Faster SRAM is clocked… speed in megahertz… not always full random I/O… pipelined vs flow-through…
The iCE40 UP FPGAs include Synchronous SRAM, see SPRAM on iCE40 FPGA.
DRAM is the everyday memory that forms the main memory in your PC. Dynamic ram needs to be refreshed periodically, otherwise the data is lost. It’s cheap and offers plenty of bandwidth, but DRAM is complex to interface with: you’re almost certainly going to have to use vendor IP blocks and a cache to make DRAM usable. If you’re accessing data sequentially, or interfacing with a CPU via a cache, then DRAM works well, but for random I/O the high latency is a significant issue.
SRAM is expensive, and DRAM is complex, and both use a significant number of I/O pins. By using a self-refreshing circuit and a simplified interface, similar to SPI flash, a new type of ram can be created. This type of ram is usually referred to as Pseudo SRAM.
The two interfaces you’ll come across are HyperRAM, created by Cypress, and xSPI (OctalSPI), standardised by JEDEC in early 2020. In both cases you only need 11 or 12 FPGA pins to interface with the ram. As of summer 2020 this ram is available in capacities up to 256 Mb (32 MiB) in 1.8V and 3V. 64 Mb (8 MiB) parts cost around $3 in small quantities.
We haven’t seen this ram on many FPGA dev boards, but Kevin Hubbard’s open source HyperRAM Pmod has proved popular, and is available pre-assembled from 1BitSquared. I hope we’ll see an updated version with faster xSPI ram in due course.
Check out my FPGA demos or the FPGA graphics tutorials.
Have a question or suggestion? Contact @WillFlux or join me on Project F Discussions or 1BitSquared Discord. If you like what I do, consider sponsoring me on GitHub. Thank you.
The wonderful image of the Micron MT4C1024 DRAM used in the social media card for this post comes from Zeptobars and is licensed under a Creative Commons licence.