Chapter Goal
NVMe is not "a faster disk command set." It is a queue protocol designed for many-core hosts and parallel SSD controllers. The host and controller communicate mostly through shared memory queues and small MMIO doorbell writes. SPDK's NVMe driver is built around that model: allocate queue pairs, submit commands, ring doorbells, poll completions, and invoke callbacks.
By the end of this chapter, you should be able to draw the submission/completion flow, explain SQs and CQs, identify command and completion fields in SPDK source, reason about phase bits and queue wraparound, and understand why a qpair is normally owned by one thread.
Beginner Mental Model
An NVMe queue pair is two circular arrays in host memory:
host memory
  submission queue (SQ)            completion queue (CQ)
  tail -> next free slot           head -> next completion to consume
  +-----+-----+-----+-----+        +-----+-----+-----+-----+
  | cmd | cmd |     |     |        | cpl | cpl |     |     |
  +-----+-----+-----+-----+        +-----+-----+-----+-----+
  ^ device reads commands          ^ host reads completions
MMIO doorbells:
  host writes SQ tail to tell controller new commands exist
  host writes CQ head to tell controller completions were consumed
The host writes command entries into the SQ. The controller DMA-reads them. Later the controller DMA-writes completion entries into the CQ. The host polls the CQ and runs callbacks. Doorbells are small MMIO register writes that synchronize producer/consumer positions.
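The producer/consumer positions described above behave like indices into a circular array. The following is a minimal sketch of that index arithmetic, assuming the one-empty-slot convention for telling a full queue from an empty one. The type `ring_index` and the `ring_*` helpers are hypothetical names for illustration and are not SPDK APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one circular queue; not an SPDK type. */
struct ring_index {
    uint16_t head;        /* consumer position */
    uint16_t tail;        /* producer position */
    uint16_t num_entries; /* number of slots in the array */
};

/* Advance an index by one slot, wrapping to 0 at the end of the array. */
static uint16_t ring_next(uint16_t idx, uint16_t num_entries)
{
    assert(num_entries > 1 && idx < num_entries);
    return (uint16_t)((idx + 1u) % num_entries);
}

/* Empty when producer and consumer point at the same slot. */
static bool ring_is_empty(const struct ring_index *r)
{
    return r->head == r->tail;
}

/* Full when advancing the tail would collide with the head.
 * One slot stays unused so that full and empty remain distinguishable. */
static bool ring_is_full(const struct ring_index *r)
{
    return ring_next(r->tail, r->num_entries) == r->head;
}

/* Producer side: claim the slot at tail. Returns false if the ring is full. */
static bool ring_produce(struct ring_index *r, uint16_t *slot)
{
    if (ring_is_full(r)) {
        return false;
    }
    *slot = r->tail;
    r->tail = ring_next(r->tail, r->num_entries);
    return true;
}
```

With `num_entries = 4`, three calls to `ring_produce()` succeed and the fourth reports a full ring until the consumer advances `head`.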
Why NVMe Exists
Legacy storage protocols were designed around fewer queues, deeper kernel mediation, and mechanical disks. SSDs need:
- many independent queues,
- high queue depth,
- low per-command overhead,
- no interrupt required for every completion,
- command formats sized for cache lines,
- parallelism that maps to CPU cores and controller internals.
NVMe over PCIe uses host memory queues and PCIe DMA. NVMe over Fabrics carries the same command model over transports such as RDMA and TCP. This chapter focuses on local PCIe because it exposes the hardware mechanics most directly.
The Command And Completion Structures
SPDK's local copy of the NVMe command layout is in include/spdk/nvme_spec.h:1452. struct spdk_nvme_cmd is 64 bytes (include/spdk/nvme_spec.h:1504). Key fields:
- opc: opcode, such as read, write, flush, identify, create queue.
- fuse: fused operation marker.
- cid: command identifier, used to match completion to request state.
- nsid: namespace identifier.
- mptr: metadata pointer.
- dptr: PRP or SGL data pointer.
- cdw10 through cdw15: command-specific dwords.
The completion entry is struct spdk_nvme_cpl at include/spdk/nvme_spec.h:1519, and it is 16 bytes (include/spdk/nvme_spec.h:1537). Key fields:
- cdw0 and cdw1: command-specific result.
- sqhd: submission queue head pointer from the controller's point of view.
- sqid: submission queue identifier.
- cid: command identifier.
- status: status code, status code type, retry hints, do-not-retry bit, and phase tag.
Beginner trap: the command carries the request; the completion does not contain the original data buffer. The driver uses cid and its own tracker/request tables to find the callback and buffer state.
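To make the sizes and the cid lookup concrete, here is a simplified sketch. The `demo_*` structs are stand-ins that only reproduce the 64-byte and 16-byte sizes and the field order described above; they are not SPDK's definitions, which use bitfields and unions. The tracker table, callback type, and `demo_complete()` are likewise hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for a submission queue entry (NOT SPDK's struct). */
struct demo_nvme_cmd {
    uint8_t  opc;          /* opcode */
    uint8_t  flags;        /* fuse and data-pointer type bits, packed */
    uint16_t cid;          /* command identifier */
    uint32_t nsid;         /* namespace identifier */
    uint32_t rsvd[2];
    uint64_t mptr;         /* metadata pointer */
    uint64_t dptr[2];      /* PRP1/PRP2 or one SGL descriptor */
    uint32_t cdw10_15[6];  /* command-specific dwords */
};
static_assert(sizeof(struct demo_nvme_cmd) == 64, "SQ entry must be 64 bytes");

/* Simplified stand-in for a completion queue entry (NOT SPDK's struct). */
struct demo_nvme_cpl {
    uint32_t cdw0;
    uint32_t cdw1;
    uint16_t sqhd;
    uint16_t sqid;
    uint16_t cid;
    uint16_t status;       /* bit 0 = phase tag, bits 15:1 = status field */
};
static_assert(sizeof(struct demo_nvme_cpl) == 16, "CQ entry must be 16 bytes");

typedef void (*demo_cb_fn)(void *cb_arg, const struct demo_nvme_cpl *cpl);

/* Host-side state that the completion itself does not carry. */
struct demo_tracker {
    demo_cb_fn cb_fn;
    void *cb_arg;          /* e.g. the caller's buffer or request state */
    int active;
};

#define DEMO_NUM_TRACKERS 8

/* Example callback: counts completions through cb_arg. */
static void demo_count_cb(void *cb_arg, const struct demo_nvme_cpl *cpl)
{
    (void)cpl;
    (*(int *)cb_arg)++;
}

/* Look up the tracker by cid, release it, then run its callback.
 * Returns 0 on success, -1 if the cid does not name an active tracker. */
static int demo_complete(struct demo_tracker *trackers,
                         const struct demo_nvme_cpl *cpl)
{
    if (cpl->cid >= DEMO_NUM_TRACKERS || !trackers[cpl->cid].active) {
        return -1;
    }
    struct demo_tracker *tr = &trackers[cpl->cid];
    tr->active = 0;
    tr->cb_fn(tr->cb_arg, cpl);
    return 0;
}
```

Indexing the table directly by cid is a design choice of this sketch; it keeps lookup constant-time and makes a duplicate or unknown cid easy to reject.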
Admin Queues And I/O Queues
Every NVMe controller has an admin queue pair. Admin commands create and delete I/O queues, identify controllers and namespaces, get logs, set features, abort commands, and manage asynchronous events. I/O queue pairs carry reads, writes, flushes, write zeroes, dataset management, and command-set-specific I/O operations.
SPDK's controller register structure in include/spdk/nvme_spec.h:550 includes admin queue attributes and base addresses:
- aqa: admin queue attributes.
- asq: admin submission queue base address.
- acq: admin completion queue base address.
The same register structure exposes doorbells at include/spdk/nvme_spec.h:611: each queue pair has a submission queue tail doorbell and a completion queue head doorbell.
The Queue Pair In SPDK
For PCIe, SPDK's struct nvme_pcie_qpair is in lib/nvme/nvme_pcie_internal.h:140. The hot fields are the queue indices and flags:
- num_entries
- last_sq_tail
- sq_tail
- cq_head
- sq_head
- flags.phase
- sq_vaddr
- cq_vaddr
- cmd_bus_addr
- cpl_bus_addr
The names tell the story. sq_vaddr and cq_vaddr are CPU virtual addresses for the host-side arrays. cmd_bus_addr and cpl_bus_addr are bus/IOVA addresses the device can DMA to or from. The host advances sq_tail as it submits; the controller writes completion entries; the host advances cq_head as it consumes them.
SPDK's own documentation says qpair scaling is lock-free but thread-constrained. doc/nvme.md:153 through doc/nvme.md:160 explains that queue pairs contain no locks or atomics and a given qpair may only be used by a single thread at a time. Violating this is undefined behavior.
Misconception to kill: "NVMe has many queues so any thread can submit to any queue." The scalable model is many queues with clear ownership, not one shared queue with hidden locks.
Submission Flow
The simplified SPDK PCIe submission path:
application / bdev_nvme
    builds an nvme_request
    calls nvme_qpair_submit_request()
common qpair layer
    queues request if the qpair is backed up
    calls transport submit
PCIe transport
    copies command to SQ slot
    associates CID with tracker
    advances sq_tail
    rings SQ doorbell
controller
    sees new tail
    DMA-reads SQ entries
    executes commands
The common submit wrapper is in lib/nvme/nvme_qpair.c:1171. It handles queued requests and the -EAGAIN case by inserting requests into qpair->queued_req rather than failing the user operation immediately.
The PCIe doorbell function is nvme_pcie_qpair_ring_sq_doorbell() in lib/nvme/nvme_pcie_internal.h:248. It writes the new SQ tail with spdk_mmio_write_4() at lib/nvme/nvme_pcie_internal.h:272. There is a memory barrier before the MMIO write at lib/nvme/nvme_pcie_internal.h:269; the host must make sure command memory is visible before telling the device to fetch it.
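The ordering requirement can be sketched in isolation. The example below is a simulation and not SPDK code: the doorbell is an ordinary variable behind a volatile pointer, and a C11 release fence stands in for the barrier the source describes. On real hardware the appropriate barrier is platform-specific. All `demo_*` names are hypothetical, and the queue-full check is omitted for brevity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define DEMO_SQ_ENTRIES 4u
#define DEMO_CMD_SIZE   64u

struct demo_sq {
    uint8_t slots[DEMO_SQ_ENTRIES][DEMO_CMD_SIZE]; /* stands in for the SQ array */
    uint16_t sq_tail;                              /* host-side producer index */
    volatile uint32_t *sq_tdbl;                    /* stands in for the mapped tail doorbell */
};

static void demo_submit(struct demo_sq *sq, const void *cmd)
{
    assert(sq != NULL && cmd != NULL && sq->sq_tdbl != NULL);

    /* 1. Copy the 64-byte command into the slot at the current tail. */
    memcpy(sq->slots[sq->sq_tail], cmd, DEMO_CMD_SIZE);

    /* 2. Advance the tail, wrapping at the end of the array. */
    sq->sq_tail = (uint16_t)((sq->sq_tail + 1u) % DEMO_SQ_ENTRIES);

    /* 3. Make the command stores visible before the doorbell store.
     *    (Stand-in barrier; an assumption of this sketch.) */
    atomic_thread_fence(memory_order_release);

    /* 4. Publish the new tail. On real hardware this is an MMIO write. */
    *sq->sq_tdbl = sq->sq_tail;
}
```

Swapping steps 3 and 4 would let the controller observe the new tail before the command bytes are guaranteed visible, which is exactly the hazard the barrier prevents.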
Completion Flow
The simplified completion path:
controller
    DMA-writes CQE
    toggles / sets phase tag as appropriate
host poller
    reads CQE at cq_head
    checks phase bit
    looks up tracker by cid
    completes request callback
    advances cq_head
    rings CQ doorbell
The CQ doorbell function is nvme_pcie_qpair_ring_cq_doorbell() in lib/nvme/nvme_pcie_internal.h:277. It writes the consumed CQ head to the controller at lib/nvme/nvme_pcie_internal.h:295.
The phase bit solves a ring-buffer ambiguity. When the CQ wraps, slot 0 is reused. Without an extra bit, the host could not reliably tell whether a slot contains an old completion from the previous lap or a new completion from this lap. SPDK stores the expected phase in flags.phase at lib/nvme/nvme_pcie_internal.h:159.
Diagram in prose:
CQ has 4 slots. Expected phase = 1.
lap 1:
    slot 0 phase 1 -> new
    slot 1 phase 1 -> new
    slot 2 phase 1 -> new
    slot 3 phase 1 -> new
wrap; expected phase becomes 0
lap 2:
    slot 0 phase 0 -> new
    old phase 1 entries are ignored after expected phase flips
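The same four-slot scenario can be simulated in a few lines. This is a sketch under simplifying assumptions: the "controller" is an ordinary function writing into the same array, the completion entry is reduced to a cid and a status word whose bit 0 is the phase tag, and all `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_CQ_ENTRIES 4u

struct demo_cqe {
    uint16_t cid;
    uint16_t status; /* bit 0 is the phase tag */
};

struct demo_cq {
    struct demo_cqe entries[DEMO_CQ_ENTRIES]; /* zeroed: phase 0 in every slot */
    uint16_t cq_head;                         /* next slot the host will inspect */
    uint8_t  phase;                           /* phase the host expects; starts at 1 */
};

/* Simulated controller: post one completion, flipping its phase on wrap. */
static void demo_post(struct demo_cq *cq, uint16_t *ctrl_tail,
                      uint8_t *ctrl_phase, uint16_t cid)
{
    cq->entries[*ctrl_tail].cid = cid;
    cq->entries[*ctrl_tail].status = *ctrl_phase;
    if (++(*ctrl_tail) == DEMO_CQ_ENTRIES) {
        *ctrl_tail = 0;
        *ctrl_phase = (uint8_t)(*ctrl_phase ^ 1u);
    }
}

/* Host poller: consume entries whose phase matches the expected phase.
 * Stores consumed cids in out[] and returns how many were consumed. */
static uint32_t demo_poll(struct demo_cq *cq, uint16_t *out, uint32_t max)
{
    uint32_t n = 0;
    while (n < max) {
        const struct demo_cqe *cqe = &cq->entries[cq->cq_head];
        if ((cqe->status & 1u) != cq->phase) {
            break; /* stale entry from the previous lap, or never written */
        }
        out[n++] = cqe->cid;
        if (++cq->cq_head == DEMO_CQ_ENTRIES) {
            cq->cq_head = 0;
            cq->phase = (uint8_t)(cq->phase ^ 1u); /* expected phase flips on wrap */
        }
    }
    return n;
}
```

After four posts and one poll, the host has wrapped to slot 0 and expects phase 0, so the old phase 1 entries still sitting in the array are not consumed again.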
Doorbells And MMIO
Doorbells are not normal memory writes. They are MMIO writes to registers mapped from the PCIe device's BAR. MMIO writes can be expensive compared with ordinary cached stores. That is why batching and shadow doorbells exist.
SPDK models controller registers in include/spdk/nvme_spec.h:540 through include/spdk/nvme_spec.h:615. The doorbell array begins at include/spdk/nvme_spec.h:611. In the PCIe qpair code, SPDK optionally updates shadow doorbells and only performs MMIO when required (lib/nvme/nvme_pcie_internal.h:260 through lib/nvme/nvme_pcie_internal.h:274).
Misconception to kill: "Ringing the doorbell moves the command." It does not copy the command. The command is already in host memory. The doorbell tells the controller that the producer index changed.
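For orientation, the location of each doorbell register inside the BAR follows a fixed formula in the NVMe base specification for PCIe: doorbells start at offset 0x1000, and the stride between them is 4 << CAP.DSTRD bytes. The helper names below are hypothetical; this sketch only computes offsets and does not touch hardware.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_DOORBELL_BASE 0x1000u /* start of the doorbell region in the BAR */

/* Stride in bytes between consecutive doorbell registers.
 * dstrd is the 4-bit CAP.DSTRD field from the controller capabilities. */
static uint32_t demo_doorbell_stride(uint32_t dstrd)
{
    assert(dstrd <= 15u);
    return 4u << dstrd;
}

/* Offset of the submission queue tail doorbell for queue qid. */
static uint32_t demo_sq_tdbl_offset(uint16_t qid, uint32_t dstrd)
{
    return DEMO_DOORBELL_BASE + (2u * qid) * demo_doorbell_stride(dstrd);
}

/* Offset of the completion queue head doorbell for queue qid. */
static uint32_t demo_cq_hdbl_offset(uint16_t qid, uint32_t dstrd)
{
    return DEMO_DOORBELL_BASE + (2u * qid + 1u) * demo_doorbell_stride(dstrd);
}
```

With DSTRD = 0, the admin queue (qid 0) uses offsets 0x1000 and 0x1004, and I/O queue 1 uses 0x1008 and 0x100C, matching the interleaved tail/head pairs in the register structure.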
Namespaces And Controllers
An NVMe controller is the command-processing entity. A namespace is a block address space exposed through that controller. A physical SSD may expose one namespace or many. Multipath and NVMe-oF can make this more complex, but the beginner model is:
controller
    admin qpair
    io qpair 1
    io qpair 2
    namespace 1: logical blocks
    namespace 2: logical blocks
An I/O command usually names the namespace in cmd.nsid and the LBA/range in command-specific dwords. The qpair is the transport path; the namespace is the storage object.
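As an illustration of how the namespace and LBA range land in a command, the sketch below fills in an NVM read. It assumes the NVM command set encoding: opcode 0x02, starting LBA split across cdw10 (low 32 bits) and cdw11 (high 32 bits), and a zero-based block count in the low 16 bits of cdw12. The struct is a reduced, hypothetical stand-in that omits the data pointer and does not reproduce the real 64-byte layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NVME_OPC_READ 0x02u

/* Reduced stand-in holding only the fields this example sets. */
struct demo_io_cmd {
    uint8_t  opc;
    uint16_t cid;
    uint32_t nsid;
    uint32_t cdw10; /* starting LBA, bits 31:0 */
    uint32_t cdw11; /* starting LBA, bits 63:32 */
    uint32_t cdw12; /* bits 15:0 = number of logical blocks, zero-based */
};

static void demo_build_read(struct demo_io_cmd *cmd, uint16_t cid,
                            uint32_t nsid, uint64_t lba, uint32_t lba_count)
{
    /* The zero-based 16-bit field can express 1..65536 blocks. */
    assert(lba_count >= 1u && lba_count <= 65536u);

    memset(cmd, 0, sizeof(*cmd));
    cmd->opc   = DEMO_NVME_OPC_READ;
    cmd->cid   = cid;
    cmd->nsid  = nsid;                       /* which namespace */
    cmd->cdw10 = (uint32_t)(lba & 0xFFFFFFFFu);
    cmd->cdw11 = (uint32_t)(lba >> 32);
    cmd->cdw12 = lba_count - 1u;             /* 8 blocks is encoded as 7 */
}
```

Note that nothing in the command names the qpair: the same command bytes could be placed on any I/O queue of the controller that exposes the namespace.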
Queue Full, Backpressure, And -EAGAIN
Queues are finite. Trackers are finite. Requests can be temporarily impossible to submit even though the device is healthy. SPDK's common qpair layer turns some -EAGAIN returns into internal queueing (lib/nvme/nvme_qpair.c:1192 through lib/nvme/nvme_qpair.c:1197).
This matters operationally. A user-level submit API returning success may mean "accepted by SPDK for eventual transport submission," not necessarily "already placed into the hardware SQ." The completion callback remains the source of truth for command completion.
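The difference between "accepted" and "in the hardware queue" can be shown with a toy model. This is a sketch of the general idea and not SPDK's implementation: the slot count, the fixed-size FIFO, the policy of queueing behind already-waiting requests, and every `demo_*` name are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>

#define DEMO_HW_SLOTS   2 /* pretend the transport can hold two commands */
#define DEMO_MAX_QUEUED 8

struct demo_qpair {
    int hw_in_flight;             /* commands placed in the "hardware" queue */
    int queued[DEMO_MAX_QUEUED];  /* FIFO of request ids waiting for a slot */
    int queued_head;
    int queued_count;
};

/* Transport layer: fails with -EAGAIN when no slot is free. */
static int demo_transport_submit(struct demo_qpair *q, int req)
{
    (void)req;
    if (q->hw_in_flight == DEMO_HW_SLOTS) {
        return -EAGAIN;
    }
    q->hw_in_flight++;
    return 0;
}

/* Common layer: absorb -EAGAIN by queueing, so the caller still sees success. */
static int demo_submit(struct demo_qpair *q, int req)
{
    /* Keep FIFO order: only try the transport if nothing is already waiting. */
    if (q->queued_count == 0 && demo_transport_submit(q, req) == 0) {
        return 0;
    }
    if (q->queued_count == DEMO_MAX_QUEUED) {
        return -ENOMEM;
    }
    q->queued[(q->queued_head + q->queued_count) % DEMO_MAX_QUEUED] = req;
    q->queued_count++;
    return 0;
}

/* Called when one hardware command completes: free a slot, then retry
 * the oldest waiting request. */
static void demo_complete_one(struct demo_qpair *q)
{
    assert(q->hw_in_flight > 0);
    q->hw_in_flight--;
    if (q->queued_count > 0 &&
        demo_transport_submit(q, q->queued[q->queued_head]) == 0) {
        q->queued_head = (q->queued_head + 1) % DEMO_MAX_QUEUED;
        q->queued_count--;
    }
}
```

In this model a third submit returns 0 even though only two commands are in flight; the third only reaches the transport when a completion is processed, which is one more reason polling cannot be skipped.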
Timeouts, Aborts, And Resets
NVMe has explicit error paths:
- A command can complete with an error status.
- A qpair can fail or disconnect.
- A command can time out in the host.
- The host can issue an abort.
- The controller or queue can be reset.
- The bdev layer can reset above the NVMe layer.
The completion status includes dnr ("do not retry") and status code/type fields in include/spdk/nvme_spec.h:1506 through include/spdk/nvme_spec.h:1513. SPDK also has queued-request abort paths around lib/nvme/nvme_qpair.c:1222.
Beginner trap: an abort is itself a command and may race with normal completion. A command can complete just as the host decides it timed out. Correct code must tolerate late completions, failed aborts, and reset-driven cleanup.
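One way to reason about these races is a small per-command state machine. The sketch below is hypothetical and much simpler than a real driver: it only demonstrates that the user callback runs exactly once, whether the command completes normally, completes late after a timeout, or is cleaned up by a reset. The states and `demo_*` functions are illustrative names.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_tr_state { DEMO_TR_FREE, DEMO_TR_IN_FLIGHT, DEMO_TR_TIMED_OUT };

struct demo_tracker {
    enum demo_tr_state state;
    int callbacks_run;     /* how many times the user callback fired */
    bool abort_requested;
};

/* Host-side timeout: mark the command, but do NOT free the tracker.
 * The device may still complete it or DMA into its buffer. */
static void demo_on_timeout(struct demo_tracker *tr)
{
    if (tr->state == DEMO_TR_IN_FLIGHT) {
        tr->state = DEMO_TR_TIMED_OUT;
        tr->abort_requested = true; /* a real driver would submit an abort here */
    }
}

/* Completion arrival, possibly late. Returns false for a completion
 * that matches no outstanding command, which is simply ignored. */
static bool demo_on_completion(struct demo_tracker *tr)
{
    if (tr->state == DEMO_TR_FREE) {
        return false;
    }
    tr->state = DEMO_TR_FREE;
    tr->abort_requested = false;
    tr->callbacks_run++;
    return true;
}

/* Reset-driven cleanup: complete anything still outstanding, once. */
static void demo_on_reset(struct demo_tracker *tr)
{
    if (tr->state != DEMO_TR_FREE) {
        tr->state = DEMO_TR_FREE;
        tr->abort_requested = false;
        tr->callbacks_run++;
    }
}
```

The key design choice is that a timeout changes state without releasing resources; only a completion or a reset returns the tracker to the free state.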
CQ Full And Completion Flow Control
A controller cannot post an unlimited number of completions. If the host stops polling, the CQ fills. Once the CQ is full, the controller cannot post further completions until the host consumes entries and advances the CQ head doorbell. That can stall progress even though the SQ entries were submitted correctly.
SPDK's model is polling-first. doc/nvme.md:111 through doc/nvme.md:116 states that the application submits I/O and must poll each queue pair with outstanding I/O by calling spdk_nvme_qpair_process_completions().
Misconception to kill: "Polling is just busy waiting." In SPDK, polling is the completion engine. If you do not poll, callbacks do not run, buffers are not released, and higher layers may stop making progress.
Multipath And ANA Preview
Asymmetric Namespace Access (ANA) and multipath are advanced topics for later chapters, but they start from this chapter's model. If there are multiple controllers or paths to a namespace, each path has its own queues and state. A path can become optimized, non-optimized, inaccessible, or lost. The host must decide where to submit I/O and how to recover when a path's queues fail.
The key mental model: multipath is not one magic queue. It is multiple queue machines coordinated by policy.
Source Reading Exercise
Read:
- include/spdk/nvme_spec.h:1452 through include/spdk/nvme_spec.h:1537.
- include/spdk/nvme_spec.h:550 through include/spdk/nvme_spec.h:615.
- lib/nvme/nvme_pcie_internal.h:140 through lib/nvme/nvme_pcie_internal.h:203.
- lib/nvme/nvme_pcie_internal.h:248 through lib/nvme/nvme_pcie_internal.h:298.
- lib/nvme/nvme_qpair.c:1171 through lib/nvme/nvme_qpair.c:1200.
Answer:
- Which structure is 64 bytes and which is 16 bytes?
- Which field matches a completion to a submitted command?
- Why does SPDK need both virtual addresses and bus addresses for queues?
- What memory barrier appears before ringing the SQ doorbell?
- What does the common qpair layer do with -EAGAIN?
Operational Lab
Build a paper queue with depth 4. Use the one-empty-slot rule if you like, or allow all 4 entries with an explicit count. Submit commands with CIDs 10, 11, 12, 13, then complete them out of order as 11, 10, 13, 12.
Tasks:
- Track sq_tail after each submission.
- Track cq_head after each completion is consumed.
- Explain why out-of-order completion is fine.
- Explain why cid is necessary.
- Wrap the completion queue once and show when the phase bit changes.
Self-Check
- What is the difference between an SQ and a CQ?
- What does a doorbell write communicate?
- Why is a qpair usually single-thread owned in SPDK?
- Why does a CQE contain cid?
- What problem does the phase bit solve?
- Why must an SPDK application poll completions?
- What is the difference between controller and namespace?
References
- Local source: include/spdk/nvme_spec.h.
- Local source: lib/nvme/nvme_pcie_internal.h.
- Local source: lib/nvme/nvme_qpair.c.
- Local SPDK documentation: doc/nvme.md.
- NVM Express specifications landing page: https://nvmexpress.org/specifications/
- NVM Express Base Specification page: https://nvmexpress.org/specification/nvm-express-base-specification/