SPDK From First Principles

SPDK deep learning path

Chapter 4: NAND Flash And SSD Internals

Source: drafts/hardware/04-nand-ssd-internals.md

Chapter Goal

SPDK programmers do not usually program NAND directly. They program bdevs, NVMe namespaces, queue pairs, and DMA buffers. But SSD behavior leaks through every abstraction: latency cliffs, write amplification, endurance limits, trim behavior, power-loss protection, thermal throttling, and zoned constraints all originate below the block interface.

This chapter gives you the SSD mental model needed to understand why the same SPDK workload can be smooth on one drive and chaotic on another.

Beginner Mental Model

Think of an SSD as a warehouse that cannot overwrite labels in place. When you update "box 100," the warehouse puts the new box somewhere else, updates an index card saying "box 100 is now on shelf 8," and later cleans up the old shelf.

host view:
  LBA 100 -> latest bytes

SSD internal view:
  old physical page A: stale copy of LBA 100
  new physical page B: current copy of LBA 100
  mapping table: LBA 100 -> physical page B

That mapping table belongs to the flash translation layer, or FTL, the firmware component at the heart of a normal SSD. The host sees logical blocks. The SSD internally manages pages, erase blocks, dies, channels, bad blocks, wear, error correction, and background cleanup.

NAND Geometry

The names vary by vendor and generation, but the useful hierarchy is:

SSD controller
  channels
    packages
      dies
        planes
          blocks / erase blocks
            pages
              sectors / codewords

The important constraints are:

  • Read is page-sized or near page-sized internally.
  • Program writes data to pages, usually with restrictions on order and repetition.
  • Erase works on a much larger erase block.
  • A page generally cannot be overwritten in place without erasing the whole block first.
  • Erase wears out the media.

The SPDK documentation doc/ssd_internals.md:18 describes erase blocks as large implementation-specific units and explains the asymmetric write/erase behavior. That document intentionally warns that it is a simplified software-developer model, which is exactly the right lens here.

Why SSDs Remap Writes

Suppose a filesystem overwrites LBA 7:

before:
  LBA 7 -> physical page 111

write new data to LBA 7:
  controller chooses empty physical page 829
  controller writes data there
  controller changes map: LBA 7 -> physical page 829
  physical page 111 becomes invalid

This is an out-of-place update. It turns random host overwrites into sequential-ish media programs. The price is garbage collection: eventually the device must reclaim blocks full of stale physical pages.
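The remapping steps can be sketched as a toy model. The class name TinyFtl and its fields are hypothetical, chosen only for illustration; they are not SPDK or vendor firmware structures, and the sketch ignores erase blocks, wear, and persistence.

```python
class TinyFtl:
    """Toy out-of-place mapper: one logical block per physical page."""

    def __init__(self, num_pages):
        self.l2p = {}                       # logical block -> physical page
        self.stale = set()                  # pages holding superseded data
        self.free = list(range(num_pages))  # never-programmed pages, in order

    def write(self, lba):
        """Place new data for lba on a fresh page and return that page."""
        if not self.free:
            raise RuntimeError("no free page: garbage collection needed")
        new_page = self.free.pop(0)
        old_page = self.l2p.get(lba)
        if old_page is not None:
            self.stale.add(old_page)        # old copy is now garbage
        self.l2p[lba] = new_page
        return new_page


ftl = TinyFtl(num_pages=8)
first = ftl.write(7)     # LBA 7 lands on page 0
second = ftl.write(7)    # the overwrite lands on page 1; page 0 goes stale
print(ftl.l2p, ftl.stale)
```

Note that the overwrite never touches page 0. The only thing that changes for the old copy is its bookkeeping status, which is why stale pages accumulate until something reclaims them.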

SPDK's FTL library exposes the same concepts in software. doc/ftl.md:11 defines L2P, the logical-to-physical map. lib/ftl/ftl_core.h:121 contains the l2p pointer in struct spdk_ftl_dev. lib/ftl/ftl_internal.h:76 describes P2L, the physical-to-logical mapping used during relocation and dirty shutdown recovery.

You should not assume a commercial SSD's firmware matches SPDK FTL internals. The value of SPDK FTL as a source anchor is that it makes the otherwise hidden terms concrete.

Pages, Erase Blocks, And Write Amplification

If the host writes 4 KiB, the SSD may have to write much more than 4 KiB internally. Write amplification is:

physical bytes written to NAND / logical bytes written by host

Write amplification comes from metadata, garbage collection, relocation, parity/internal RAID, write shaping, and poor alignment. A drive with a write amplification of 3 writes three NAND bytes for every host byte. That matters for performance and endurance.
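As a worked instance of the ratio, the sketch below computes write amplification from two byte counters. The counter values are invented for illustration; real drives report comparable figures through vendor-specific logs, which this sketch does not model.

```python
def write_amplification(nand_bytes, host_bytes):
    """Return physical NAND bytes written per logical host byte."""
    if host_bytes <= 0:
        raise ValueError("host_bytes must be positive")
    return nand_bytes / host_bytes


GIB = 1024 ** 3
# Hypothetical counters: the host wrote 100 GiB, while the NAND absorbed
# 100 GiB of host data, 180 GiB of GC relocation, and 20 GiB of metadata.
host = 100 * GIB
nand = (100 + 180 + 20) * GIB
print(write_amplification(nand, host))   # 3.0
```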

A simplified garbage collection cycle:

erase block before GC:
  [valid][stale][stale][valid][stale][stale][valid][stale]

GC:
  read valid pages
  write valid pages elsewhere
  erase whole block
  return block to free pool

erase block after GC:
  [empty][empty][empty][empty][empty][empty][empty][empty]

doc/ssd_internals.md:61 through doc/ssd_internals.md:71 give the same high-level GC sequence. SPDK FTL models reusable regions as bands. doc/ftl.md:80 explains relocation: valid blocks are copied so a band can be reused. lib/ftl/ftl_band.h:34 defines band states such as FREE, OPEN, FULL, CLOSING, and CLOSED.
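The GC cycle can be sketched for a single hypothetical eight-page block with three valid pages. The function name and the string page states are assumptions made for this example, not SPDK FTL types.

```python
def collect_block(block, free_pages):
    """Garbage-collect one erase block.

    block: list of page states, each "valid", "stale", or "empty".
    free_pages: number of empty pages available in other blocks.
    Returns (erased_block, pages_copied, free_pages_left).
    """
    valid = block.count("valid")
    if valid > free_pages:
        raise RuntimeError("not enough free pages to relocate valid data")
    erased = ["empty"] * len(block)       # erase resets every page at once
    return erased, valid, free_pages - valid


before = ["valid", "stale", "stale", "valid",
          "stale", "stale", "valid", "stale"]
after, copied, left = collect_block(before, free_pages=8)
reclaimed = len(before) - copied          # net pages gained by this GC pass
print(copied, reclaimed)                  # 3 5
```

The three copied pages are pure overhead: the host never asked for them to be written again. The more valid pages a victim block holds, the less each GC pass reclaims and the higher the write amplification.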

Parallelism: Channels, Dies, Planes

SSDs are fast because they do many slow things in parallel. One NAND die is not magic. A controller spreads reads and writes across channels and dies, much like a storage-aware RAID engine. Sequential writes can fill parallel lanes efficiently. Random small writes may force more metadata work, read-modify-write, and garbage collection.

This is why queue depth helps until it does not. More outstanding work gives the controller scheduling freedom. Too much outstanding work can increase tail latency, make flushes wait behind a backlog, or cause thermal/power throttling.
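One way to see the knee is Little's Law, which relates mean outstanding I/Os, throughput, and mean latency: queue depth = IOPS * latency. The sketch below assumes a deliberately crude device model with a fixed unloaded latency and a hard IOPS ceiling; both numbers are hypothetical, and real drives degrade far less cleanly.

```python
def model(queue_depth, base_latency_s=100e-6, max_iops=500_000):
    """Return (iops, mean_latency_s) for a saturating device model."""
    iops = min(queue_depth / base_latency_s, max_iops)
    latency = queue_depth / iops          # Little's Law rearranged
    return iops, latency


for qd in (1, 8, 32, 128, 512):
    iops, latency = model(qd)
    print(f"QD={qd:4d}  IOPS={iops:9.0f}  mean latency={latency * 1e6:7.1f} us")
```

Below the ceiling, extra queue depth buys throughput at constant latency. Past it, throughput is flat and every additional outstanding I/O only adds waiting time.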

SPDK exposes queue control at higher layers. The bdev chapter discussed max_rw_size, write unit size, and alignment. The NVMe chapter will explain how those logical I/O requests become NVMe submission queue entries.

Overprovisioning And Spare Area

An SSD usually has more physical NAND than it reports as logical capacity. That spare area is overprovisioning. It gives the controller room to:

  • keep free erase blocks available,
  • replace bad blocks,
  • absorb bursts,
  • lower write amplification,
  • spread wear,
  • recover from power failures and stage metadata updates.

SPDK's FTL configuration makes overprovisioning explicit: include/spdk/ftl.h:89 stores the percentage of base device blocks not exposed to the user. That mirrors what hardware SSDs do internally. A cloud system can create a similar effect by not filling a drive to 100% logical occupancy.

Misconception to kill: "A 7.68 TB SSD contains exactly 7.68 TB of NAND." It almost certainly contains more raw NAND and exposes less after reserved area, metadata, bad block handling, and formatting.
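A quick calculation makes the gap concrete. The raw NAND figure below is an assumption for illustration, not a specification of any real 7.68 TB drive. Conventions also differ on the denominator: this sketch divides by exposed capacity, whereas a percentage of raw blocks would divide by raw capacity.

```python
def overprovisioning_pct(raw_bytes, exposed_bytes):
    """Spare capacity as a percentage of the exposed logical capacity."""
    return 100.0 * (raw_bytes - exposed_bytes) / exposed_bytes


raw = 8 * 1024 ** 4            # hypothetical: 8 TiB of raw NAND
exposed = 7.68 * 1000 ** 4     # 7.68 TB advertised to the host
print(f"{overprovisioning_pct(raw, exposed):.1f}% spare")
```

Part of the gap comes from nothing more than the difference between binary and decimal units. Not all of that spare is available to garbage collection, since some goes to metadata and bad block replacement.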

Wear Leveling And Endurance

Each erase block has a finite program/erase lifetime. Wear leveling tries to avoid killing a small subset of blocks while others stay fresh. Dynamic wear leveling spreads new writes. Static wear leveling occasionally moves cold data so old blocks are not permanently occupied by rarely changing data.
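The two policies can be sketched as selection functions over per-block erase counts. The function names, the threshold, and the counts are hypothetical; real firmware policies are proprietary and considerably more involved.

```python
def pick_block_for_write(free_blocks, erase_counts):
    """Dynamic wear leveling: use the least-worn free block next."""
    return min(free_blocks, key=lambda b: erase_counts[b])


def pick_cold_block_to_move(data_blocks, erase_counts, threshold):
    """Static wear leveling: if wear is too uneven, return the least-worn
    block holding data so its cold contents can be moved; else None."""
    coldest = min(data_blocks, key=lambda b: erase_counts[b])
    spread = max(erase_counts.values()) - erase_counts[coldest]
    return coldest if spread > threshold else None


erase_counts = {0: 120, 1: 15, 2: 118, 3: 97}   # hypothetical P/E counts
print(pick_block_for_write([0, 3], erase_counts))                   # 3
print(pick_cold_block_to_move([1, 2], erase_counts, threshold=50))  # 1
```

Block 1 has barely been erased because cold data has been sitting on it. Moving that data frees the young block for hot writes, at the cost of extra NAND writes that the host never requested.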

Endurance is usually specified as drive writes per day (DWPD) over the warranty period, or as total terabytes or petabytes written (TBW/PBW). Write amplification connects host workload to NAND wear:

NAND writes = host writes * write amplification

Cloud volume systems should treat endurance as a shared resource. A noisy tenant with random sync writes can consume more NAND lifetime than raw host bytes suggest.
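The arithmetic behind that warning is short. The drive rating, warranty length, tenant volumes, and write amplification values below are all hypothetical; the point is the ratio, not the numbers.

```python
def rated_host_writes_tb(dwpd, capacity_tb, warranty_years):
    """Total host terabytes a DWPD rating allows over the warranty."""
    return dwpd * capacity_tb * 365 * warranty_years


def nand_writes_tb(host_writes_tb, write_amplification):
    """NAND terabytes programmed for a given host write volume."""
    return host_writes_tb * write_amplification


# Hypothetical drive: 1 DWPD, 7.68 TB, 5-year warranty.
budget = rated_host_writes_tb(1, 7.68, 5)
# Two tenants each write 1000 TB, with different write amplification.
sequential = nand_writes_tb(1000, 1.2)
random_sync = nand_writes_tb(1000, 4.0)
print(round(budget), sequential, random_sync)
```

Both tenants look identical in host-side byte accounting, yet in this example the random-sync tenant programs more than three times as much NAND.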

TRIM, UNMAP, And Deallocate

When the host deletes data, the SSD cannot infer that from ordinary overwrites or filesystem metadata. It needs an explicit hint. The command family is called TRIM in SATA, UNMAP in SCSI, and deallocate in NVMe Dataset Management.

When the SSD knows LBAs are no longer live, GC can skip their old physical pages. doc/ssd_internals.md:44 explains this from the device perspective. In SPDK FTL, trim state is explicit: lib/ftl/ftl_core.h:168 has a trim submission queue and lib/ftl/ftl_core.h:171 has a trim valid map. doc/ftl.md:149 also notes FTL trim alignment constraints.

Do not overpromise unmap. It is usually a hint or logical deallocation operation, not a secure erase guarantee. It may improve performance later, but issuing unmap in the foreground can still cost time now.
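The effect on garbage collection can be sketched with a toy map. The helper names and the (block, page) tuple layout are assumptions for this example and do not mirror SPDK FTL's trim queue or valid map.

```python
def pages_to_relocate(block_lbas, l2p, block_id):
    """Count pages in a block that GC must copy before erasing it.

    block_lbas: LBA stored in each page of the block.
    l2p: current map of LBA -> (block_id, page_index); absent means
         the LBA was overwritten elsewhere or deallocated.
    """
    return sum(1 for page, lba in enumerate(block_lbas)
               if l2p.get(lba) == (block_id, page))


def trim(l2p, lbas):
    """Deallocate LBAs: drop their mappings so old pages become stale."""
    for lba in lbas:
        l2p.pop(lba, None)


block_lbas = [10, 11, 12, 13]                       # contents of block 0
l2p = {lba: (0, i) for i, lba in enumerate(block_lbas)}
before = pages_to_relocate(block_lbas, l2p, 0)      # all four still live
trim(l2p, [10, 11, 12])                             # host deleted a file
after = pages_to_relocate(block_lbas, l2p, 0)       # only LBA 13 remains
print(before, after)                                # 4 1
```

Without the trim, the device would faithfully copy three pages of deleted data every time it collected this block.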

ECC, Read Disturb, Retention, And Bad Blocks

NAND stores charge. Charge leaks. Reads can disturb neighboring cells. Program operations are imperfect. SSDs use error-correcting codes, read retry, refresh, bad block maps, and media management to hide this from the host.

Symptoms that may bubble up:

  • A read that used to be fast becomes slow because the controller performs retries.
  • A drive starts reporting media errors or health warnings.
  • Latency increases as the drive refreshes or relocates data.
  • SMART / NVMe health data shows spare depletion or temperature warnings.

Beginner trap: a successful read does not mean the media was easy to read. It only means the controller recovered the data within its error budget.

Power-Loss Protection

Power-loss protection is not one feature. It can include capacitors, firmware protocols, non-volatile cache, metadata journaling, and conservative completion rules.

Without PLP, a device may complete a write when data is in volatile cache. With PLP, the same completion may be safe because the cache can be drained after power loss. Flush latency and sustained sync-write performance often differ dramatically between consumer and enterprise SSDs because of PLP.

For SPDK users, the practical rule is: do not infer durability from performance. Read the device data sheet, test power-fail behavior when possible, and preserve flush semantics through virtual bdev stacks.

Thermal And Power Throttling

SSDs are active computers. Controllers and NAND heat up. Firmware may reduce performance to stay inside thermal or power envelopes. This creates confusing behavior: a benchmark looks excellent for 60 seconds, then collapses; or reads stay stable while writes slow down.

Operational hints:

  • Check drive temperature and warning logs.
  • Run benchmarks long enough to reach steady state.
  • Compare cold-start, preconditioned, and sustained measurements.
  • Watch tail latency, not just average throughput.
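The last hint is easy to demonstrate with synthetic numbers. The latency samples below are invented, and the nearest-rank percentile is one of several common definitions; benchmark tools may compute percentiles differently.

```python
import statistics


def summarize(latencies_us):
    """Return (mean, p50, p99) of latency samples in microseconds."""
    ordered = sorted(latencies_us)

    def pct(p):
        # Nearest-rank percentile: smallest sample with >= p% at or below it.
        rank = max(1, -(-len(ordered) * p // 100))   # ceiling division
        return ordered[rank - 1]

    return statistics.fmean(ordered), pct(50), pct(99)


# Hypothetical run: 98% of I/Os take 100 us, 2% stall for 5 ms.
samples = [100] * 980 + [5000] * 20
mean, p50, p99 = summarize(samples)
print(mean, p50, p99)    # 198.0 100 5000
```

The mean barely doubles and the median does not move at all, while the 99th percentile is fifty times worse. An average-only report would hide exactly the stalls that throttling and background work produce.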

Zoned Namespaces As The FTL Leaking Upward

Zoned Namespaces (ZNS) expose some placement rules to the host. Instead of pretending every LBA can be overwritten freely, the device divides capacity into zones with write pointers. The host writes sequentially within zones and resets zones when data is no longer needed.

SPDK exposes zoned bdev concepts in include/spdk/bdev_zone.h. NVMe ZNS structures appear in include/spdk/nvme_spec.h:4675, including zone capacity, start LBA, and write pointer fields. The mental model is: ZNS shifts some FTL responsibility from opaque firmware to host software so the system can reduce write amplification and improve predictability.

Misconception to kill: "ZNS is just a faster normal SSD." It is a different contract. Host software must respect zone state and write-pointer rules.
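The write-pointer rule can be sketched as a small class. This Zone class is hypothetical and is not the SPDK zoned bdev API; it also omits zone states, open-zone limits, and zone append, all of which a real ZNS host must handle.

```python
class Zone:
    """Toy zone: sequential-write-only region with a write pointer."""

    def __init__(self, start_lba, capacity):
        self.start_lba = start_lba
        self.capacity = capacity
        self.write_pointer = start_lba     # next LBA the zone will accept

    def write(self, lba, num_blocks):
        """Accept a write only if it starts exactly at the write pointer."""
        if lba != self.write_pointer:
            raise ValueError("write must start at the write pointer")
        if lba + num_blocks > self.start_lba + self.capacity:
            raise ValueError("write crosses the zone capacity")
        self.write_pointer += num_blocks

    def reset(self):
        """Discard the zone's data and rewind the write pointer."""
        self.write_pointer = self.start_lba


zone = Zone(start_lba=0, capacity=16)
zone.write(0, 8)          # sequential: accepted
try:
    zone.write(0, 8)      # overwrite of LBA 0: rejected
except ValueError as err:
    print("rejected:", err)
zone.reset()              # data gone, zone writable from LBA 0 again
zone.write(0, 8)
```

The rejected overwrite is the contract change in miniature: the host, not the firmware, now decides where updated data goes and when old data is reclaimed.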

Latency Cliffs

A latency cliff happens when a workload crosses an internal threshold:

  • free block pool becomes low,
  • background GC cannot keep up,
  • SLC cache fills,
  • thermal limit engages,
  • metadata cache misses increase,
  • deep queues first hide, then amplify, tail latency,
  • the drive leaves its fresh state and reaches steady state, where host writes compete with GC.

SPDK can make latency cliffs more visible because its polled, user-space I/O path removes the kernel scheduler and interrupt overhead that would otherwise blur device latency. That is a benefit, but it also means the application must understand what the hardware is doing.

Source Anchors

  • doc/ssd_internals.md:18: erase blocks and asymmetric erase/program behavior.
  • doc/ssd_internals.md:27: logical blocks as firmware constructs rather than fixed media locations.
  • doc/ssd_internals.md:61: garbage collection sequence.
  • doc/ftl.md:11: L2P map.
  • doc/ftl.md:25: bands and sequential writing.
  • doc/ftl.md:80: relocation/garbage collection.
  • doc/ftl.md:109: FTL metadata.
  • lib/ftl/ftl_core.h:103: array of bands.
  • lib/ftl/ftl_core.h:121: L2P table.
  • lib/ftl/ftl_core.h:130: valid map.
  • lib/ftl/ftl_internal.h:76: P2L mapping explanation.
  • lib/ftl/ftl_band.h:34: band states.
  • include/spdk/ftl.h:89: overprovisioning configuration.

Operational Lab

Use a paper model with four erase blocks, each containing four pages. The host writes LBAs in this order:

0, 1, 2, 3, 0, 1, 4, 5, 0, 6

Rules:

  • A page can be programmed once between erases.
  • Updating an LBA writes a new physical page.
  • The old physical page for that LBA becomes stale.
  • When no empty page exists, choose the erase block with the fewest valid pages, move its valid pages elsewhere, then erase it.

Tasks:

  1. Draw the mapping after each write.
  2. Count stale pages after the tenth write.
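  3. Pick a GC victim. The ten writes do not exhaust the sixteen pages, so assume GC must run now and apply the rule to the fully written blocks.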
  4. Count how many extra page writes GC creates.
  5. Compute write amplification for this tiny example.

This exercise is intentionally small. Real SSDs have far more blocks, pages, and layers of parallelism, but the mapping pressure is the same.
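Once you have worked the lab on paper, a short script can check your answers. It assumes pages are filled in order across blocks, and, because ten writes do not exhaust sixteen pages, it forces a GC pass by hand on the fully written block with the fewest valid pages. The function and variable names are made up for this exercise.

```python
PAGES_PER_BLOCK = 4
NUM_BLOCKS = 4


def run_lab(host_writes):
    """Replay host writes page by page; return (l2p, stale, valid_per_block)."""
    l2p = {}                                  # LBA -> physical page number
    stale = set()
    next_page = 0                             # pages are filled in order
    for lba in host_writes:
        if next_page == NUM_BLOCKS * PAGES_PER_BLOCK:
            raise RuntimeError("out of empty pages: GC would run here")
        if lba in l2p:
            stale.add(l2p[lba])
        l2p[lba] = next_page
        next_page += 1
    valid = [0] * NUM_BLOCKS
    for page in l2p.values():
        valid[page // PAGES_PER_BLOCK] += 1
    return l2p, stale, valid


writes = [0, 1, 2, 3, 0, 1, 4, 5, 0, 6]
l2p, stale, valid = run_lab(writes)
# Only fully written blocks are GC candidates (one page per host write).
full_blocks = [b for b in range(NUM_BLOCKS)
               if (b + 1) * PAGES_PER_BLOCK <= len(writes)]
victim = min(full_blocks, key=lambda b: valid[b])
gc_writes = valid[victim]                     # valid pages GC must copy
wa = (len(writes) + gc_writes) / len(writes)
print(l2p, sorted(stale), victim, gc_writes, wa)
```

Try changing the write order so that updates are spread across more LBAs, and watch how the valid-page count of the best victim, and with it the write amplification, changes.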

Source Reading Exercise

Read doc/ftl.md:25 through doc/ftl.md:52. Then open lib/ftl/ftl_band.h:34 through lib/ftl/ftl_band.h:94.

Answer:

  • Which states represent an empty reusable region?
  • Which states represent a region accepting writes?
  • Where is the write pointer stored?
  • Why does a band need a close sequence ID?
  • Why does the P2L map need a checksum?

Self-Check

  1. Why is overwrite-in-place a bad model for NAND SSDs?
  2. What is write amplification?
  3. Why does overprovisioning improve random write behavior?
  4. Why can an unmap improve future garbage collection?
  5. Why can a read be slow even when no host write is active?
  6. How does ZNS change the host/device contract?
  7. Why is preconditioning necessary for serious SSD benchmarks?

References

  • Local SPDK documentation: doc/ssd_internals.md.
  • Local SPDK documentation: doc/ftl.md.
  • Local source: lib/ftl/ftl_core.h, lib/ftl/ftl_internal.h, lib/ftl/ftl_band.h, and include/spdk/ftl.h.
  • NVM Express specifications landing page, including NVM and ZNS command set specifications: https://nvmexpress.org/specifications/
  • NVM Express Base Specification page: https://nvmexpress.org/specification/nvm-express-base-specification/