SPDK From First Principles

SPDK deep learning path

Chapter 5: NVMe SSDs As Queue Machines

Source: drafts/hardware/05-nvme-queue-machine.md

Chapter Goal

NVMe is not "a faster disk command set." It is a queue protocol designed for many-core hosts and parallel SSD controllers. The host and controller communicate mostly through shared memory queues and small MMIO doorbell writes. SPDK's NVMe driver is built around that model: allocate queue pairs, submit commands, ring doorbells, poll completions, and invoke callbacks.

By the end of this chapter, you should be able to draw the submission/completion flow, explain SQs and CQs, identify command and completion fields in SPDK source, reason about phase bits and queue wraparound, and understand why a qpair is normally owned by one thread.

Beginner Mental Model

An NVMe queue pair is two circular arrays in host memory:

host memory

submission queue (SQ)                  completion queue (CQ)
tail -> next free slot                 head -> next completion to consume

  +-----+-----+-----+-----+              +-----+-----+-----+-----+
  | cmd | cmd |     |     |              | cpl | cpl |     |     |
  +-----+-----+-----+-----+              +-----+-----+-----+-----+
     ^ device reads commands                ^ host reads completions

MMIO doorbells:
  host writes SQ tail to tell controller new commands exist
  host writes CQ head to tell controller completions were consumed

The host writes command entries into the SQ. The controller DMA-reads them. Later the controller DMA-writes completion entries into the CQ. The host polls the CQ and runs callbacks. Doorbells are small MMIO register writes that synchronize producer/consumer positions.
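The producer/consumer bookkeeping above can be sketched as a small model. This is a minimal illustration, not SPDK code: struct ring_model and its helper functions are hypothetical names, and the "ring is full when tail + 1 equals head" rule is the usual one-empty-slot convention for circular queues.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one submission ring; not SPDK's struct. */
struct ring_model {
    uint16_t num_entries; /* total slots in the ring */
    uint16_t sq_tail;     /* next free slot, advanced by the host */
    uint16_t sq_head;     /* next slot the controller will fetch */
};

/* The ring is full when advancing the tail would land on the head. */
static bool ring_full(const struct ring_model *r)
{
    return (uint16_t)((r->sq_tail + 1) % r->num_entries) == r->sq_head;
}

/* Claim the slot at sq_tail and return its index, or -1 if the ring is full. */
static int ring_submit(struct ring_model *r)
{
    if (ring_full(r)) {
        return -1;
    }
    int slot = r->sq_tail;
    r->sq_tail = (uint16_t)((r->sq_tail + 1) % r->num_entries);
    return slot;
}

/* Model the controller fetching one command, which advances the head. */
static void ring_fetch(struct ring_model *r)
{
    if (r->sq_head != r->sq_tail) {
        r->sq_head = (uint16_t)((r->sq_head + 1) % r->num_entries);
    }
}
```

With four entries, three submissions fill the ring; a fourth succeeds only after the modeled controller fetches at least one command.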

flowchart LR
  app[SPDK caller] --> tracker[Request tracker and CID]
  tracker --> sq[Submission Queue in host memory]
  sq --> db[MMIO SQ tail doorbell]
  db --> ctrlr[NVMe controller]
  ctrlr --> media[SSD media and FTL]
  media --> cq[Completion Queue in host memory]
  cq --> poll[SPDK qpair poller]
  poll --> cb[User completion callback]
  poll --> cqdb[MMIO CQ head doorbell]

Why NVMe Exists

Legacy storage protocols were designed around fewer queues, deeper kernel mediation, and mechanical disks. SSDs need:

  • many independent queues,
  • high queue depth,
  • low per-command overhead,
  • no interrupt required for every completion,
  • command formats sized for cache lines,
  • parallelism that maps to CPU cores and controller internals.

NVMe over PCIe uses host memory queues and PCIe DMA. NVMe over Fabrics carries the same command model over transports such as RDMA and TCP. This chapter focuses on local PCIe because it exposes the hardware mechanics most directly.

The Command And Completion Structures

SPDK's local copy of the NVMe command layout is in include/spdk/nvme_spec.h:1452. struct spdk_nvme_cmd is 64 bytes (include/spdk/nvme_spec.h:1504). Key fields:

  • opc: opcode, such as read, write, flush, identify, create queue.
  • fuse: fused operation marker.
  • cid: command identifier, used to match completion to request state.
  • nsid: namespace identifier.
  • mptr: metadata pointer.
  • dptr: PRP or SGL data pointer.
  • cdw10 through cdw15: command-specific dwords.

The completion entry is struct spdk_nvme_cpl at include/spdk/nvme_spec.h:1519, and it is 16 bytes (include/spdk/nvme_spec.h:1537). Key fields:

  • cdw0 and cdw1: command-specific result.
  • sqhd: submission queue head pointer from the controller's point of view.
  • sqid: submission queue identifier.
  • cid: command identifier.
  • status: status code, status code type, retry hints, do-not-retry bit, and phase tag.

Beginner trap: the command carries the request; the completion does not contain the original data buffer. The driver uses cid and its own tracker/request tables to find the callback and buffer state.
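To make the cid lookup concrete, here is a minimal sketch of a tracker table indexed by command identifier. Every name in it (model_tracker, model_cpl, model_track, model_complete, model_store_status) is hypothetical and chosen for illustration; SPDK's real tracker structures carry much more state.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_NUM_TRACKERS 4

typedef void (*model_cb_fn)(void *cb_arg, uint16_t status);

/* Hypothetical per-command state kept by the driver, indexed by CID. */
struct model_tracker {
    model_cb_fn cb_fn; /* callback to run on completion */
    void *cb_arg;      /* caller context, e.g. the owner of the data buffer */
    int active;        /* nonzero while the command is outstanding */
};

/* Only the fields this sketch needs from a completion entry. */
struct model_cpl {
    uint16_t cid;
    uint16_t status;
};

static struct model_tracker g_trackers[MODEL_NUM_TRACKERS];

/* Example callback: store the completion status where the caller can see it. */
static void model_store_status(void *cb_arg, uint16_t status)
{
    *(uint16_t *)cb_arg = status;
}

/* Record callback state under a free CID; returns the CID or -1 if none is free. */
static int model_track(model_cb_fn cb_fn, void *cb_arg)
{
    for (int cid = 0; cid < MODEL_NUM_TRACKERS; cid++) {
        if (!g_trackers[cid].active) {
            g_trackers[cid] = (struct model_tracker){ cb_fn, cb_arg, 1 };
            return cid;
        }
    }
    return -1;
}

/* Use the CID carried by the completion to recover the request state. */
static int model_complete(const struct model_cpl *cpl)
{
    if (cpl->cid >= MODEL_NUM_TRACKERS || !g_trackers[cpl->cid].active) {
        return -1; /* unknown or stale CID */
    }
    struct model_tracker *tr = &g_trackers[cpl->cid];
    tr->active = 0;
    tr->cb_fn(tr->cb_arg, cpl->status);
    return 0;
}
```

The completion itself carries no buffer pointer. Everything the callback needs is recovered from host-side state through the cid.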

Admin Queues And I/O Queues

Every NVMe controller has an admin queue pair. Admin commands create and delete I/O queues, identify controllers and namespaces, get logs, set features, abort commands, and manage asynchronous events. I/O queue pairs carry reads, writes, flushes, write zeroes, dataset management, and command-set-specific I/O operations.

SPDK's controller register structure in include/spdk/nvme_spec.h:550 includes admin queue attributes and base addresses:

  • aqa: admin queue attributes.
  • asq: admin submission queue base address.
  • acq: admin completion queue base address.

The same register structure exposes doorbells at include/spdk/nvme_spec.h:611: each queue has a submission queue tail doorbell and completion queue head doorbell.

The Queue Pair In SPDK

For PCIe, SPDK's struct nvme_pcie_qpair is in lib/nvme/nvme_pcie_internal.h:140. The hot fields are the queue indices and flags:

  • num_entries
  • last_sq_tail
  • sq_tail
  • cq_head
  • sq_head
  • flags.phase
  • sq_vaddr
  • cq_vaddr
  • cmd_bus_addr
  • cpl_bus_addr

The names tell the story. sq_vaddr and cq_vaddr are CPU virtual addresses for the host-side arrays. cmd_bus_addr and cpl_bus_addr are the bus/IOVA addresses of the same arrays, which the device uses for DMA. The host writes commands and advances sq_tail; the controller writes completion entries; the host consumes them and advances cq_head.

SPDK's own documentation says qpair scaling is lock-free but thread-constrained. doc/nvme.md:153 through doc/nvme.md:160 explains that queue pairs contain no locks or atomics and a given qpair may only be used by a single thread at a time. Violating this is undefined behavior.

Misconception to kill: "NVMe has many queues so any thread can submit to any queue." The scalable model is many queues with clear ownership, not one shared queue with hidden locks.

Submission Flow

The simplified SPDK PCIe submission path:

application / bdev_nvme
  builds an nvme_request
  calls nvme_qpair_submit_request()

common qpair layer
  queues request if the qpair is backed up
  calls transport submit

PCIe transport
  copies command to SQ slot
  associates CID with tracker
  advances sq_tail
  rings SQ doorbell

controller
  sees new tail
  DMA-reads SQ entries
  executes commands

The common submit wrapper is in lib/nvme/nvme_qpair.c:1171. It handles queued requests and the -EAGAIN case by inserting requests into qpair->queued_req rather than failing the user operation immediately.

The PCIe doorbell function is nvme_pcie_qpair_ring_sq_doorbell() in lib/nvme/nvme_pcie_internal.h:248. It writes the new SQ tail with spdk_mmio_write_4() at lib/nvme/nvme_pcie_internal.h:272. There is a memory barrier before the MMIO write at lib/nvme/nvme_pcie_internal.h:269; the host must make sure command memory is visible before telling the device to fetch it.
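The ordering requirement can be sketched in portable C. This is an illustrative model only: struct model_sq and model_sq_submit are hypothetical, the doorbell is an ordinary volatile variable standing in for the MMIO register, and a C11 release fence stands in for whatever barrier the real driver uses (real MMIO ordering needs architecture-specific barriers). Queue-full checking is omitted to keep the focus on ordering.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define MODEL_SQ_ENTRIES 4

/* 64-byte stand-in for a command; only the size matters here. */
struct model_cmd {
    uint8_t bytes[64];
};

struct model_sq {
    struct model_cmd slots[MODEL_SQ_ENTRIES]; /* the SQ array in "host memory" */
    uint16_t sq_tail;                         /* host's producer index */
    volatile uint32_t *tail_doorbell;         /* stand-in for the MMIO register */
};

/* Copy the command into its slot, advance the tail, then ring the doorbell. */
static void model_sq_submit(struct model_sq *sq, const struct model_cmd *cmd)
{
    memcpy(&sq->slots[sq->sq_tail], cmd, sizeof(*cmd));
    sq->sq_tail = (uint16_t)((sq->sq_tail + 1) % MODEL_SQ_ENTRIES);

    /* Order the command store before the doorbell store. */
    atomic_thread_fence(memory_order_release);
    *sq->tail_doorbell = sq->sq_tail;
}
```

If the doorbell write could become visible before the command bytes, the controller might fetch a slot that still holds stale data, which is why the fence sits between the two stores.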

Completion Flow

The simplified completion path:

controller
  DMA-writes CQE
  writes the phase tag for the current lap (the value inverts each time the CQ wraps)

host poller
  reads CQE at cq_head
  checks phase bit
  looks up tracker by cid
  completes request callback
  advances cq_head
  rings CQ doorbell

The CQ doorbell function is nvme_pcie_qpair_ring_cq_doorbell() in lib/nvme/nvme_pcie_internal.h:277. It writes the consumed CQ head to the controller at lib/nvme/nvme_pcie_internal.h:295.

The phase bit solves a ring-buffer ambiguity. When the CQ wraps, slot 0 is reused. Without an extra bit, the host could not reliably tell whether a slot contains an old completion from the previous lap or a new completion from this lap. SPDK stores the expected phase in flags.phase at lib/nvme/nvme_pcie_internal.h:159.

Diagram in prose:

CQ has 4 slots. Expected phase = 1.

lap 1:
  slot 0 phase 1 -> new
  slot 1 phase 1 -> new
  slot 2 phase 1 -> new
  slot 3 phase 1 -> new
  wrap; expected phase becomes 0

lap 2:
  slot 0 phase 0 -> new
  slots 1-3 still hold phase 1 from lap 1 -> stale until the controller rewrites them
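The same walk-through can be expressed as a small simulation. This is a sketch under stated assumptions: struct model_cq and its functions are hypothetical, both the host and controller sides live in one struct purely for illustration, and the model does not guard against the controller overwriting unconsumed entries.

```c
#include <stdint.h>
#include <string.h>

#define MODEL_CQ_ENTRIES 4

struct model_cqe {
    uint16_t cid;
    uint8_t phase; /* phase tag written by the controller */
};

struct model_cq {
    struct model_cqe slots[MODEL_CQ_ENTRIES];
    uint16_t cq_head;       /* host consumer index */
    uint8_t expected_phase; /* phase value that marks a new entry */
    uint16_t ctrlr_tail;    /* controller producer index (model only) */
    uint8_t ctrlr_phase;    /* phase the controller writes on this lap */
};

static void model_cq_init(struct model_cq *cq)
{
    memset(cq, 0, sizeof(*cq)); /* zeroed slots carry phase 0, so they look stale */
    cq->expected_phase = 1;
    cq->ctrlr_phase = 1;
}

/* Controller side: post a completion, inverting the phase on wrap. */
static void model_ctrlr_post(struct model_cq *cq, uint16_t cid)
{
    cq->slots[cq->ctrlr_tail] = (struct model_cqe){ cid, cq->ctrlr_phase };
    if (++cq->ctrlr_tail == MODEL_CQ_ENTRIES) {
        cq->ctrlr_tail = 0;
        cq->ctrlr_phase ^= 1;
    }
}

/* Host side: return the next new CID, or -1 if the slot at cq_head is stale. */
static int model_cq_poll(struct model_cq *cq)
{
    struct model_cqe *cqe = &cq->slots[cq->cq_head];
    if (cqe->phase != cq->expected_phase) {
        return -1;
    }
    if (++cq->cq_head == MODEL_CQ_ENTRIES) {
        cq->cq_head = 0;
        cq->expected_phase ^= 1;
    }
    return cqe->cid;
}
```

After the host consumes all four lap-1 entries, slot 0 still holds phase 1 while the host now expects phase 0, so the poll correctly reports nothing new until the controller writes that slot again.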

Doorbells And MMIO

Doorbells are not normal memory writes. They are MMIO writes to registers mapped from the PCIe device's BAR. MMIO writes can be expensive compared with ordinary cached stores. That is why batching and shadow doorbells exist.

SPDK models controller registers in include/spdk/nvme_spec.h:540 through include/spdk/nvme_spec.h:615. The doorbell array begins at include/spdk/nvme_spec.h:611. In the PCIe qpair code, SPDK optionally updates shadow doorbells and only performs MMIO when required (lib/nvme/nvme_pcie_internal.h:260 through lib/nvme/nvme_pcie_internal.h:274).

Misconception to kill: "Ringing the doorbell moves the command." It does not copy the command. The command is already in host memory. The doorbell tells the controller that the producer index changed.
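For orientation, the NVMe base specification places the doorbell registers starting at offset 0x1000 in the controller register space, spaced by a stride derived from the CAP.DSTRD field. The helper names below are hypothetical; the arithmetic is a sketch of that layout, not a copy of SPDK's code.

```c
#include <stdint.h>

#define MODEL_DOORBELL_BASE 0x1000u

/* Byte offset of the SQ tail doorbell for queue `qid`, given CAP.DSTRD. */
static uint32_t sq_tail_doorbell_offset(uint16_t qid, uint8_t dstrd)
{
    return MODEL_DOORBELL_BASE + (2u * qid) * (4u << dstrd);
}

/* Byte offset of the CQ head doorbell for queue `qid`, given CAP.DSTRD. */
static uint32_t cq_head_doorbell_offset(uint16_t qid, uint8_t dstrd)
{
    return MODEL_DOORBELL_BASE + (2u * qid + 1u) * (4u << dstrd);
}
```

With a stride field of 0, queue 0 (the admin queue) uses offsets 0x1000 and 0x1004, and I/O queue 1 uses 0x1008 and 0x100c. Each doorbell is a single 32-bit register, which is why a doorbell write carries only an index and never the command itself.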

Namespaces And Controllers

An NVMe controller is the command-processing entity. A namespace is a block address space exposed through that controller. A physical SSD may expose one namespace or many. Multipath and NVMe-oF can make this more complex, but the beginner model is:

controller
  admin qpair
  io qpair 1
  io qpair 2
  namespace 1: logical blocks
  namespace 2: logical blocks

An I/O command usually names the namespace in cmd.nsid and the LBA/range in command-specific dwords. The qpair is the transport path; the namespace is the storage object.
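As a worked example of "namespace in nsid, range in command-specific dwords", here is a sketch of filling in an NVM command set Read. The struct and function names are hypothetical and the struct holds only the fields used here, not the full 64-byte layout. The field placement follows the NVM command set: the starting LBA is split across cdw10 and cdw11, and the low 16 bits of cdw12 hold a zero-based block count.

```c
#include <stdint.h>

#define MODEL_OPC_READ 0x02u /* NVM command set Read opcode */

/* Only the command fields this sketch fills in; not the full 64-byte layout. */
struct model_io_cmd {
    uint8_t opc;
    uint16_t cid;
    uint32_t nsid;
    uint32_t cdw10;
    uint32_t cdw11;
    uint32_t cdw12;
};

/* Build a read of `lba_count` blocks (1..65536) starting at `lba`. */
static struct model_io_cmd model_build_read(uint32_t nsid, uint64_t lba, uint32_t lba_count)
{
    struct model_io_cmd cmd = { 0 };
    cmd.opc = MODEL_OPC_READ;
    cmd.nsid = nsid;
    cmd.cdw10 = (uint32_t)(lba & 0xffffffffu); /* starting LBA, low 32 bits */
    cmd.cdw11 = (uint32_t)(lba >> 32);         /* starting LBA, high 32 bits */
    cmd.cdw12 = (lba_count - 1) & 0xffffu;     /* zero-based block count */
    return cmd;
}
```

Nothing in the command names a queue. The same command could be placed on any I/O qpair of the controller, which is the sense in which the qpair is the path and the namespace is the object.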

Queue Full, Backpressure, And -EAGAIN

Queues are finite. Trackers are finite. Requests can be temporarily impossible to submit even though the device is healthy. SPDK's common qpair layer turns some -EAGAIN returns into internal queueing (lib/nvme/nvme_qpair.c:1192 through lib/nvme/nvme_qpair.c:1197).

This matters operationally. A user-level submit API returning success may mean "accepted by SPDK for eventual transport submission," not necessarily "already placed into the hardware SQ." The completion callback remains the only reliable signal that a command has finished.
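The "accepted but not yet in hardware" state can be modeled in a few lines. This is a simplified sketch, not the logic in nvme_qpair.c: struct model_qpair and its functions are hypothetical, hardware capacity is reduced to a counter, and requests are plain integer ids.

```c
#include <errno.h>
#include <stddef.h>

#define MODEL_HW_SLOTS 2
#define MODEL_MAX_QUEUED 8

struct model_qpair {
    int hw_outstanding;           /* commands currently in the hardware SQ */
    int queued[MODEL_MAX_QUEUED]; /* request ids waiting in software */
    size_t num_queued;
};

/* Transport submit: fails with -EAGAIN when no hardware slot is free. */
static int model_transport_submit(struct model_qpair *qp, int req_id)
{
    (void)req_id;
    if (qp->hw_outstanding >= MODEL_HW_SLOTS) {
        return -EAGAIN;
    }
    qp->hw_outstanding++;
    return 0;
}

/* Common layer: park the request on -EAGAIN instead of failing the caller. */
static int model_submit(struct model_qpair *qp, int req_id)
{
    int rc = model_transport_submit(qp, req_id);
    if (rc == -EAGAIN && qp->num_queued < MODEL_MAX_QUEUED) {
        qp->queued[qp->num_queued++] = req_id;
        return 0;
    }
    return rc;
}

/* One completion frees a slot; resubmit the oldest parked request, if any. */
static void model_complete_one(struct model_qpair *qp)
{
    if (qp->hw_outstanding == 0) {
        return;
    }
    qp->hw_outstanding--;
    if (qp->num_queued > 0) {
        int req_id = qp->queued[0];
        for (size_t i = 1; i < qp->num_queued; i++) {
            qp->queued[i - 1] = qp->queued[i];
        }
        qp->num_queued--;
        (void)model_transport_submit(qp, req_id);
    }
}
```

In this model the third submit returns 0 even though only two commands are in the hardware queue. The parked request reaches hardware only when a completion is processed, which is one more reason polling cannot be skipped.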

Timeouts, Aborts, And Resets

NVMe has explicit error paths:

  • A command can complete with an error status.
  • A qpair can fail or disconnect.
  • A command can time out in the host.
  • The host can issue an abort.
  • The controller or queue can be reset.
  • The bdev layer can reset above the NVMe layer.

The completion status includes dnr ("do not retry") and status code/type fields in include/spdk/nvme_spec.h:1506 through include/spdk/nvme_spec.h:1513. SPDK also has queued-request abort paths around lib/nvme/nvme_qpair.c:1222.

Beginner trap: an abort is itself a command and may race with normal completion. A command can complete just as the host decides it timed out. Correct code must tolerate late completions, failed aborts, and reset-driven cleanup.
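One way to reason about the race is a "complete exactly once" rule: a timeout only requests an abort, and whichever completion arrives first finishes the request while later ones are ignored. The sketch below is a hypothetical model of that rule (model_req and its functions are illustrative names), not a description of how SPDK implements timeouts or aborts.

```c
#include <stdbool.h>

enum model_req_state { MODEL_REQ_OUTSTANDING, MODEL_REQ_DONE };

struct model_req {
    enum model_req_state state;
    int completions_delivered; /* how many times the callback would have run */
    bool abort_requested;
};

/* Host timeout path: ask for an abort but keep the request outstanding,
 * because the controller may still DMA into the request's buffer. */
static void model_on_timeout(struct model_req *req)
{
    if (req->state == MODEL_REQ_OUTSTANDING) {
        req->abort_requested = true;
    }
}

/* Completion path: deliver the callback once and ignore anything later. */
static bool model_on_completion(struct model_req *req)
{
    if (req->state != MODEL_REQ_OUTSTANDING) {
        return false;
    }
    req->state = MODEL_REQ_DONE;
    req->completions_delivered++;
    return true;
}
```

The design point is that the timeout path never frees the request or its buffer on its own. Only a completion, whether normal, aborted, or produced by reset cleanup, ends the request's lifetime.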

CQ Full And Completion Flow Control

A controller cannot write infinite completions. If the host stops polling, the CQ fills. Once the CQ is full, the controller cannot post further completions until the host consumes entries and advances the CQ head. That can stall progress even though the SQ entries were submitted correctly.

SPDK's model is polling-first. doc/nvme.md:111 through doc/nvme.md:116 states that the application submits I/O and must poll each queue pair with outstanding I/O by calling spdk_nvme_qpair_process_completions().

Misconception to kill: "Polling is just busy waiting." In SPDK, polling is the completion engine. If you do not poll, callbacks do not run, buffers are not released, and higher layers may stop making progress.

Multipath And ANA Preview

Asymmetric Namespace Access (ANA) and multipath are advanced topics for later chapters, but they start from this chapter's model. If there are multiple controllers or paths to a namespace, each path has its own queues and state. A path can become optimized, non-optimized, inaccessible, or lost. The host must decide where to submit I/O and how to recover when a path's queues fail.

The key mental model: multipath is not one magic queue. It is multiple queue machines coordinated by policy.

Source Reading Exercise

Read:

  1. include/spdk/nvme_spec.h:1452 through include/spdk/nvme_spec.h:1537.
  2. include/spdk/nvme_spec.h:550 through include/spdk/nvme_spec.h:615.
  3. lib/nvme/nvme_pcie_internal.h:140 through lib/nvme/nvme_pcie_internal.h:203.
  4. lib/nvme/nvme_pcie_internal.h:248 through lib/nvme/nvme_pcie_internal.h:298.
  5. lib/nvme/nvme_qpair.c:1171 through lib/nvme/nvme_qpair.c:1200.

Answer:

  • Which structure is 64 bytes and which is 16 bytes?
  • Which field matches a completion to a submitted command?
  • Why does SPDK need both virtual addresses and bus addresses for queues?
  • What memory barrier appears before ringing the SQ doorbell?
  • What does the common qpair layer do with -EAGAIN?

Operational Lab

Build a paper queue with depth 4. Use the one-empty-slot rule if you like (the queue counts as full when advancing the tail would reach the head), or allow all 4 entries and track an explicit count. Submit commands with CIDs 10, 11, 12, 13, then complete them out of order as 11, 10, 13, 12.

Tasks:

  1. Track sq_tail after each submission.
  2. Track cq_head after each completion is consumed.
  3. Explain why out-of-order completion is fine.
  4. Explain why cid is necessary.
  5. Wrap the completion queue once and show when the phase bit changes.

Self-Check

  1. What is the difference between an SQ and a CQ?
  2. What does a doorbell write communicate?
  3. Why is a qpair usually single-thread owned in SPDK?
  4. Why does a CQE contain cid?
  5. What problem does the phase bit solve?
  6. Why must an SPDK application poll completions?
  7. What is the difference between controller and namespace?

References

  • Local source: include/spdk/nvme_spec.h.
  • Local source: lib/nvme/nvme_pcie_internal.h.
  • Local source: lib/nvme/nvme_qpair.c.
  • Local SPDK documentation: doc/nvme.md.
  • NVM Express specifications landing page: https://nvmexpress.org/specifications/
  • NVM Express Base Specification page: https://nvmexpress.org/specification/nvm-express-base-specification/