24 August 2020

FPGA Memory Types

Designing with FPGAs involves many types of memory, some familiar from other devices, but some that are specific to FPGAs. This post gives a quick overview of the different flavours, together with their strengths and weaknesses, and some sample designs. This guide includes external memory types, such as SRAM and HBM, that are used in CPUs and GPUs, so much of what is said here is generally applicable, but the focus is on FPGAs.

To give you a sense of the memory capability of small FPGAs, we include the memory specifications for Lattice iCE40 UP5K and Xilinx Spartan 7S25.

NB. This post is an early draft: expect mistakes and missing content.

Updated 2020-08-28. Feedback to @WillFlux is most welcome.

Memory Terminology

  • Address Width - how many bits are needed to address all the elements in the memory array
  • Bandwidth - how much data can be transferred by a memory interface each second
  • (Data) Width - how many bits there are in each element
  • Depth - how many elements there are in the memory array
  • Organisation - depth x width (see below)
  • Latency - how long it takes for a memory interface to start returning data

ProTip: In SystemVerilog you can use $clog2 to calculate the address width from the depth.
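
As a minimal sketch of this tip, the hypothetical module below derives the address width for two depths. The module and parameter names are only for illustration.

```systemverilog
module clog2_demo;
    localparam DEPTH_A = 256;  // a power of two
    localparam DEPTH_B = 640;  // not a power of two

    // $clog2 returns the ceiling of log2 of its argument
    localparam ADDRW_A = $clog2(DEPTH_A);  // 8 bits: 2^8 = 256
    localparam ADDRW_B = $clog2(DEPTH_B);  // 10 bits: 2^10 = 1024 >= 640

    logic [ADDRW_A-1:0] addr_a;
    logic [ADDRW_B-1:0] addr_b;

    initial begin
        $display("depth %0d needs %0d address bits", DEPTH_A, ADDRW_A);
        $display("depth %0d needs %0d address bits", DEPTH_B, ADDRW_B);
    end
endmodule
```

When the depth isn't a power of two, the address bus can express locations that don't exist, so your design needs to keep addresses below the depth.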

Memory Organisation

Memory is described by its organisation or depth x width. For example, a 4 Mb SRAM could be described as 512K x 8, which means it has 524,288 (512 x 1024) locations, each of which holds 8 bits. The same memory capacity could also be organised as 256K x 16, which means it has 262,144 (256 x 1024) locations, each of which holds 16 bits.
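
The arithmetic can be checked with a short simulation-only sketch. The names below are hypothetical; the point is that both organisations hold the same number of bits, but the wider one needs one fewer address bit.

```systemverilog
module organisation_demo;
    // two organisations (depth x width) of the same 4 Mb memory
    localparam DEPTH_X8  = 512 * 1024;  // 524,288 locations
    localparam WIDTH_X8  = 8;
    localparam DEPTH_X16 = 256 * 1024;  // 262,144 locations
    localparam WIDTH_X16 = 16;

    initial begin
        // capacity in bits = depth x width
        $display("512K x 8:  %0d bits, %0d address bits",
            DEPTH_X8 * WIDTH_X8, $clog2(DEPTH_X8));
        $display("256K x 16: %0d bits, %0d address bits",
            DEPTH_X16 * WIDTH_X16, $clog2(DEPTH_X16));
    end
endmodule
```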

Flip-Flops

Flip-flops are the state keepers of FPGAs. Each flip-flop holds one bit.

When you create a simple counter you’re using flip-flops:

    logic [15:0] cnt = 0;  // 16 flip-flops
    always_ff @(posedge clk) begin
        cnt <= cnt + 1;
    end

Flip-flops are great for saving state, but they only suit the smallest memories: their numbers are limited, they’re spread throughout the FPGA (making larger memories hard to route), and they don’t support multiple ports.

The Lattice iCE40 UP5K has 5,280 flip-flops, while the Xilinx Spartan 7S25 has 29,200.

Distributed RAM

Distributed ram is built with LUTs. LUTs are usually used to create the logic of your design, but can also support memory in some FPGAs. Distributed ram is, as its name suggests, distributed throughout the FPGA. A single 6-input LUT can store 64 bits.

Distributed ram is read asynchronously, but written to synchronously (requires a clock). Writes are limited to a single port, but you can read from up to four ports in some FPGAs. Distributed ram is flexible in the data width it supports: for example, 32-level data only needs a width of 5 bits.

Given the asynchronous nature of reads, distributed ram is ideal for fast buffers: you can use a value immediately, rather than waiting for the next clock tick. You can also use distributed ram to create small ROMs. However, distributed ram is not suited to large memories: for memories larger than about 128 bits (based on Xilinx 7 Series) you’ll get better performance and lower power consumption from block ram (see next section).
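
A small buffer with a synchronous write and an asynchronous read might be sketched as follows. The module name and defaults are hypothetical, and the `ram_style` attribute is a Vivado synthesis hint that other tools may ignore; whether distributed ram is actually inferred depends on your tool and FPGA.

```systemverilog
module distram_sdp #(
    parameter WIDTH=5,   // e.g. 32-level data
    parameter DEPTH=64,  // fits a single 6-input LUT per data bit
    localparam ADDRW=$clog2(DEPTH)
    ) (
    input wire logic clk,                     // clock (write only)
    input wire logic we,                      // write enable
    input wire logic [ADDRW-1:0] addr_write,  // write address
    input wire logic [ADDRW-1:0] addr_read,   // read address
    input wire logic [WIDTH-1:0] data_in,     // data in
    output     logic [WIDTH-1:0] data_out     // data out
    );

    (* ram_style = "distributed" *)  // assumption: Vivado synthesis
    logic [WIDTH-1:0] memory [DEPTH];

    // synchronous write
    always_ff @(posedge clk) begin
        if (we) memory[addr_write] <= data_in;
    end

    // asynchronous read: no clock, data follows the address
    always_comb data_out = memory[addr_read];
endmodule
```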

Not all FPGAs support distributed ram; for example, the iCE40 UP5K doesn’t. In Xilinx 7 Series FPGAs, only LUTs in SLICEM blocks may be used as memory. A Spartan 7S25 FPGA has 14,600 6-input LUTs, of which 5,000 are SLICEM, so you have a maximum of: 5,000 x 64 bits = 320,000 bits.

For more on Xilinx distributed ram see UG474: 7 Series FPGAs Configurable Logic Block.

The following example shows a ROM module using distributed ram (note the lack of a clock):

module rom_async #(
    parameter WIDTH=8, 
    parameter DEPTH=256, 
    parameter INIT_F="",
    localparam ADDRW=$clog2(DEPTH)
    ) (
    input wire logic [ADDRW-1:0] addr,
    output     logic [WIDTH-1:0] data
    );

    logic [WIDTH-1:0] memory [DEPTH];

    initial begin
        if (INIT_F != 0) begin
            $display("Creating rom_async from init file '%s'.", INIT_F);
            $readmemh(INIT_F, memory);
        end
    end

    always_comb data = memory[addr];
endmodule

To learn more about loading memory with $readmemh and $readmemb see Initialize Memory in Verilog.
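
As a usage sketch, the ROM above could be instantiated with an init file like this. The wrapper module and the file name "example.mem" are hypothetical; the file is assumed to hold one hex value per line.

```systemverilog
module rom_async_demo (
    input wire logic [3:0] addr,
    output     logic [7:0] data
    );

    // "example.mem" is a hypothetical file of hex values, such as:
    //   00
    //   1F
    //   A5
    rom_async #(
        .WIDTH(8),
        .DEPTH(16),
        .INIT_F("example.mem")
    ) rom_inst (
        .addr(addr),
        .data(data)
    );
endmodule
```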

Block RAM

Block ram (BRAM) is implemented using dedicated ram circuitry within the FPGA and is ideal for memories from a few hundred bits up to hundreds of kilobits. But BRAM is not just better than distributed ram for larger memories: I’d go so far as to say that block RAM is one of the great things about developing hardware with FPGAs.

Being written…

Xilinx BRAM

Xilinx 7 Series FPGAs have 36 Kb BRAM blocks with two ports and up to 72-bit data width. Each port has an independent clock, so you can share data across clock domains. For example, you could have a RISC-V CPU accessing a block ram at 180 MHz, while your custom hardware accesses the same block at 133.3 MHz.
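
Sharing a block ram between clock domains might look something like the sketch below, which writes on one clock and reads on another. The module and port names are hypothetical. It doesn't handle the case where both domains touch the same address at the same time, so any handshaking between the two domains still needs proper clock domain crossing.

```systemverilog
module bram_sdp_dualclk #(
    parameter WIDTH=8,
    parameter DEPTH=256,
    localparam ADDRW=$clog2(DEPTH)
    ) (
    input wire logic clk_write,               // write clock (port a)
    input wire logic clk_read,                // read clock (port b)
    input wire logic we,                      // write enable (port a)
    input wire logic [ADDRW-1:0] addr_write,  // write address (port a)
    input wire logic [ADDRW-1:0] addr_read,   // read address (port b)
    input wire logic [WIDTH-1:0] data_in,     // data in (port a)
    output     logic [WIDTH-1:0] data_out     // data out (port b)
    );

    logic [WIDTH-1:0] memory [DEPTH];

    // Port A: write in the clk_write domain
    always_ff @(posedge clk_write) begin
        if (we) memory[addr_write] <= data_in;
    end

    // Port B: read in the clk_read domain
    always_ff @(posedge clk_read) begin
        data_out <= memory[addr_read];
    end
endmodule
```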

BRAMs are quite flexible in organisation; with two read/write ports (true dual-port BRAM) you can have data widths of 1, 2, 4, 9, 18, and 36 bits. The latter widths are multiples of 9 because each byte comes with an extra bit, usually used for parity or error correction. For example, if you have 4-bit data then the organisation would be 8K x 4 (only 32 of the 36 Kb are used in this case). If you restrict yourself to one read and one write port (simple dual-port BRAM) then you can have 72-bit wide data.
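
The depth available at each width follows from simple arithmetic, sketched here as a simulation-only helper. The function name is hypothetical, and it assumes a single 36 Kb block: widths of 1, 2, and 4 bits use 32 Kb, while widths of 9 bits and above use all 36 Kb.

```systemverilog
module bram36_organisation;
    // depth of one 36 Kb block at a given data width (0 if not supported)
    function automatic int bram36_depth(input int width);
        case (width)
            1, 2, 4:       return 32 * 1024 / width;  // 9th bits unused
            9, 18, 36, 72: return 36 * 1024 / width;  // 72: simple dual-port only
            default:       return 0;
        endcase
    endfunction

    int widths [7] = '{1, 2, 4, 9, 18, 36, 72};

    initial begin
        foreach (widths[i]) begin
            $display("%0d x %0d", bram36_depth(widths[i]), widths[i]);
        end
    end
endmodule
```

For example, a width of 4 gives 8,192 locations (8K x 4), while a width of 36 gives 1,024 locations (1K x 36).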

Each 36 Kb block may also be split into two independent 18 Kb BRAMs, though the maximum data width of each is reduced.

  • Inference vs Primitives
  • Larger memories with multiple blocks…
  • Byte-level writes…
  • Block ram columns…
  • Collisions occur when both ports access the same address…
  • Output registers for higher clock speeds…

For more details on Xilinx BRAM see UG473: 7 Series FPGAs Memory Resources.

BRAM Data Sheet Capacity

Be careful when interpreting the headline BRAM capacity on FPGA data sheets. For example, if you look at Xilinx’s 7 Series Overview, you’ll see that the Spartan 7S25 has 1,620 Kb of block ram. However, you shouldn’t think of this as having ~200 KiB of ram like you would on a microcontroller. This figure includes the 9th bit usually used for error correction; for 32-bit data the capacity is 1,440 Kb. More importantly, BRAM is composed of many small blocks spread across the FPGA: you can’t expect to combine them all together into one big memory and get good performance.
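
The figures above can be worked through in a few lines. This is only an illustrative calculation with hypothetical names; here Kb means 1,024 bits and KiB means 1,024 bytes.

```systemverilog
module bram_capacity_7s25;
    localparam TOTAL_KB = 1620;              // headline capacity in Kb
    localparam BLOCKS   = TOTAL_KB / 36;     // 45 blocks of 36 Kb
    localparam DATA_KB  = TOTAL_KB * 8 / 9;  // 1,440 Kb without the 9th bits
    localparam DATA_KIB = DATA_KB / 8;       // 180 KiB of byte-wide data

    initial begin
        $display("%0d blocks, %0d Kb of data, %0d KiB",
            BLOCKS, DATA_KB, DATA_KIB);
    end
endmodule
```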

iCE40 BRAM

iCE40 UP5K has 4 Kb BRAMs…

Being written…

For more details on Lattice BRAM see Memory Usage Guide for iCE40 Devices.

FIFOs

First in, first out…

Example Modules

The following example shows a Lattice iCE40 block ram module.

module bram_ice40 #(
    parameter WIDTH=8, 
    parameter DEPTH=512, 
    parameter INIT_F="",
    localparam ADDRW=$clog2(DEPTH)
    ) (
    input wire logic clk,
    input wire logic [ADDRW-1:0] addr,
    input wire logic we,
    input wire logic [WIDTH-1:0] data_in,
    output     logic [WIDTH-1:0] data_out
    );

    logic [WIDTH-1:0] memory [DEPTH];

    initial begin
        if (INIT_F != 0) begin
            $display("Loading memory init file '%s' into bram.", INIT_F);
            $readmemh(INIT_F, memory);
        end
    end

    always_ff @(posedge clk) begin
        if (we) memory[addr] <= data_in;
        data_out <= memory[addr];  // non-blocking: registered (sync) read
    end
endmodule

The following example shows a basic Xilinx block ram module using a single read and a single write port.

module bram_basic_xc7 #(
    parameter WIDTH=8, 
    parameter DEPTH=256, 
    parameter INIT_F="",
    localparam ADDRW=$clog2(DEPTH)
    ) (
    input wire logic clk,                       // clock (port a & b)
    input wire logic we,                        // write enable (port a)
    input wire logic [ADDRW-1:0] addr_write,    // write address (port a)
    input wire logic [ADDRW-1:0] addr_read,     // read address (port b)
    input wire logic [WIDTH-1:0] data_in,       // data in (port a)
    output     logic [WIDTH-1:0] data_out       // data out (port b)
    );

    /* verilator lint_off MULTIDRIVEN */
    logic [WIDTH-1:0] memory [DEPTH];
    /* verilator lint_on MULTIDRIVEN */

    initial begin
        if (INIT_F != 0) begin
            $display("Loading memory init file '%s' into bram_basic.", INIT_F);
            $readmemh(INIT_F, memory);
        end
    end

    // Port A: Sync Write
    always_ff @(posedge clk) begin
        if (we) begin
            memory[addr_write] <= data_in;
        end
    end

    // Port B: Sync Read
    always_ff @(posedge clk) begin
        data_out <= memory[addr_read];
    end
endmodule
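
To see the one-cycle read latency in simulation, a minimal test bench along these lines could be used with the Xilinx module above. The test bench name, clock period, and test values are arbitrary choices for illustration; inputs change on the falling clock edge to avoid races with the rising edge.

```systemverilog
module bram_basic_xc7_tb;
    logic clk = 0;
    logic we = 0;
    logic [7:0] addr_write = 0;
    logic [7:0] addr_read = 0;
    logic [7:0] data_in = 0;
    logic [7:0] data_out;

    bram_basic_xc7 #(.WIDTH(8), .DEPTH(256)) bram_inst (
        .clk, .we, .addr_write, .addr_read, .data_in, .data_out
    );

    always #5 clk = ~clk;  // clock period of 10 time units

    initial begin
        // write 0xA5 to address 0x10 on the next rising edge
        @(negedge clk);
        we = 1;
        addr_write = 8'h10;
        data_in = 8'hA5;

        // stop writing and request a read of the same address
        @(negedge clk);
        we = 0;
        addr_read = 8'h10;

        // data_out was updated on the rising edge in between
        @(negedge clk);
        if (data_out == 8'hA5) $display("PASS: read back 0x%h", data_out);
        else $display("FAIL: read back 0x%h", data_out);
        $finish;
    end
endmodule
```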

UltraRAM

UltraRAM is a type of memory available in Xilinx UltraScale+ FPGAs. UltraRAM is like block ram on steroids: bigger but less agile. The blocks are 288 Kb in size (8x regular BRAM): combining all the blocks in a column gives you up to 36 Mb of memory to play with. However, while UltraRAM is dual-ported, it does not support independent clocks on each port, and is a fixed 72 bits wide.
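
An inference sketch for a single UltraRAM block might look like this. The module name is hypothetical and the `ram_style` attribute is a Vivado synthesis hint; whether UltraRAM is actually inferred depends on your device and tool version.

```systemverilog
module uram_sdp #(
    parameter DEPTH=4096,  // 4K x 72 = 288 Kb: one UltraRAM block
    localparam WIDTH=72,   // UltraRAM is a fixed 72 bits wide
    localparam ADDRW=$clog2(DEPTH)
    ) (
    input wire logic clk,                     // one clock for both ports
    input wire logic we,                      // write enable
    input wire logic [ADDRW-1:0] addr_write,  // write address
    input wire logic [ADDRW-1:0] addr_read,   // read address
    input wire logic [WIDTH-1:0] data_in,     // data in
    output     logic [WIDTH-1:0] data_out     // data out
    );

    (* ram_style = "ultra" *)  // assumption: Vivado synthesis
    logic [WIDTH-1:0] memory [DEPTH];

    always_ff @(posedge clk) begin
        if (we) memory[addr_write] <= data_in;
        data_out <= memory[addr_read];
    end
endmodule
```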

High Bandwidth Memory

Being written…

Better known as HBM - Bandwidth, bandwidth and more bandwidth…

Found only on high-end FPGAs and graphics cards…

Static RAM

Being written…

If you were to imagine a memory chip, you’d probably imagine something like static ram (SRAM). You provide the address of the element you want on the address bus, wait some nanoseconds and the data is returned on the data bus.

SRAM’s big issue is cost, which also limits the capacities it’s available in.

Few recent FPGA dev boards include SRAM, which is a shame, as this ram is ideal for beginners and low-latency hardware designs. One that does is the Digilent Cmod A7, which features 4 Mb (512K x 8) of 8ns SRAM.

Static RAM is available in both synchronous and asynchronous types…

Asynchronous SRAM

Async SRAM doesn’t use a clock… speed in nanoseconds

Async SRAM is most commonly ~3V, which works well with current FPGA designs.

4 Mb asynchronous SRAM ICs are cheap (for SRAM) and widely available at speeds of 10 ns with an 8 or 16-bit data bus. A 4 Mb (512 KiB) 10 ns SRAM costs around $2 in small quantities. A 4 Mb (512K x 8) SRAM has 8 data pins and 19 address pins, and typically three control pins: write enable, output enable, and chip select.
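
To make the pin list concrete, here is a read-only sketch for a 512K x 8 part. All names are hypothetical, and it assumes active-low control pins and a clock period comfortably longer than the SRAM access time plus board and I/O delays. Writes are left out, as they need a bidirectional data bus and careful timing of the write enable.

```systemverilog
module sram_async_read (
    input wire logic clk,
    input wire logic [18:0] addr,       // address requested by your design
    output     logic [7:0]  data,       // data for the address sampled one clock earlier
    output     logic [18:0] sram_addr,  // 19 address pins
    input wire logic [7:0]  sram_data,  // 8 data pins (input only: no writes)
    output     logic sram_ce_n,         // chip select (assumed active low)
    output     logic sram_oe_n,         // output enable (assumed active low)
    output     logic sram_we_n          // write enable (assumed active low)
    );

    // keep the chip selected with its outputs enabled; never write
    always_comb begin
        sram_ce_n = 0;
        sram_oe_n = 0;
        sram_we_n = 1;
    end

    always_ff @(posedge clk) begin
        sram_addr <= addr;       // drive the address pins from a register
        data      <= sram_data;  // capture data for the previous address
    end
endmodule
```

With a 10 ns SRAM this approach only holds up at modest clock speeds; check the timing diagrams in the data sheet for your part.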

Synchronous SRAM

Faster SRAM is clocked… speed in megahertz… not always full random I/O… pipelined vs flow-through…

Dynamic RAM

Being written…

DRAM is the everyday memory that forms the main memory in your PC. Dynamic ram needs to be refreshed periodically, otherwise the data is lost. It’s cheap and offers plenty of bandwidth, but DRAM is complex to interface with: you’re almost certainly going to have to use vendor IP blocks and a cache to make DRAM usable. If you’re accessing data sequentially, or interfacing with a CPU via a cache, then DRAM works well, but for random I/O the high latency is a significant issue.

Pseudo SRAM

Being written…

SRAM is expensive, and DRAM is complex, and both use a significant number of I/O pins. By using a self-refreshing circuit and a simplified interface, similar to SPI flash, a new type of ram can be created. This type of ram is usually referred to as Pseudo SRAM.

The two interfaces you’ll come across are HyperRAM, created by Cypress, and xSPI (OctalSPI), standardised by JEDEC in early 2020. In both cases you only need 11 or 12 FPGA pins to interface with the ram. As of summer 2020 this ram is available in capacities up to 256 Mb (32 MiB) in 1.8V and 3V. 64 Mb (8 MiB) parts cost around $3 in small quantities.

We haven’t seen this ram on many FPGA dev boards, but Kevin Hubbard’s open source HyperRAM Pmod has proved popular, and is available pre-assembled from 1BitSquared. I hope we’ll see an updated version with faster xSPI ram in due course.

PS. The wonderful image of the Micron MT4C1024 DRAM used in the social media card for this post comes from Zeptobars and is licensed under a Creative Commons licence.

©2020 Will Green, Project F