SPDK From First Principles

SPDK deep learning path

Orientation: SPDK Is Not Magic

The whole map: hardware, runtime, bdev graph, transports, JSON-RPC, and diskengine.

Source: content/chapters/00-orientation.md

The honest starting point

SPDK feels like a black box because it sits at the intersection of several things that are usually hidden from application developers: SSD firmware behavior, NVMe queue mechanics, PCIe device access, DMA memory, Linux driver binding, event-loop scheduling, asynchronous C callbacks, and a large plugin-based storage framework. If you start by opening lib/bdev/bdev.c at a random line, the code looks like a pile of callbacks and intrusive lists. If you start at the bottom and climb one layer at a time, it becomes a machine.

This book is written for that second path. The goal is not to memorize every SPDK function. The goal is to build enough mental machinery that when you read a source file you know what questions to ask:

  • What object is this function manipulating?
  • Which thread owns that object?
  • Is this code running in a reactor event, an spdk_thread message, a poller, an RPC handler, or a completion callback?
  • Is the work synchronous, or did the function merely submit async work?
  • Who completes the operation?
  • Who frees the memory?
  • What happens during reset, remove, -ENOMEM, hotplug, or shutdown?

Once those questions become automatic, SPDK source stops being mysterious. It becomes dense, but readable.
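Several of those questions (sync or async, who completes, what happens on -ENOMEM) can be rehearsed on a toy model before touching real source. The sketch below is not SPDK code. `ToyDevice`, `queue_wait`, and `run_demo` are hypothetical names, loosely modeled on the submit-then-callback pattern and on the retry-after-ENOMEM idea behind `spdk_bdev_queue_io_wait`.

```python
import errno
from collections import deque


class ToyDevice:
    """Hypothetical device with a fixed pool of in-flight request slots."""

    def __init__(self, slots):
        self.free_slots = slots
        self.inflight = deque()  # (payload, completion callback)
        self.waiters = deque()   # retry callbacks parked after -ENOMEM

    def submit(self, payload, on_complete):
        """Queue work and return at once. 0 means submitted, not finished."""
        if self.free_slots == 0:
            return -errno.ENOMEM
        self.free_slots -= 1
        self.inflight.append((payload, on_complete))
        return 0

    def queue_wait(self, retry):
        """Park a retry callback until a slot frees up."""
        self.waiters.append(retry)

    def poll(self):
        """Complete one request. The poller, not the submitter, fires the callback."""
        if not self.inflight:
            return 0
        payload, on_complete = self.inflight.popleft()
        self.free_slots += 1
        on_complete(payload)
        if self.waiters:
            self.waiters.popleft()()  # give one parked caller its retry
        return 1


def run_demo():
    dev = ToyDevice(slots=1)
    log = []

    def done(payload):
        log.append("done " + payload)

    log.append("submit a -> %d" % dev.submit("a", done))
    rc = dev.submit("b", done)
    log.append("submit b -> %s" % ("ENOMEM" if rc == -errno.ENOMEM else rc))
    dev.queue_wait(lambda: log.append("retry b -> %d" % dev.submit("b", done)))
    while dev.poll():
        pass
    return log
```

Reading the returned log in order answers the checklist for this toy: `submit` only queues, `poll` completes, and the -ENOMEM caller gets a second chance only after an earlier request has finished.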

The full stack you are trying to understand

At the highest level, diskengine uses SPDK as a storage engine daemon. diskengine is Go code. It does not link libspdk directly. Instead, it speaks JSON-RPC over a Unix socket to the SPDK process. SPDK owns the C runtime objects: bdevs, lvolstores, NVMe controllers, NVMe-oF subsystems, vhost controllers, pollers, threads, and channels.
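To make "speaks JSON-RPC over a Unix socket" concrete, here is a minimal sketch of such a client. It is written in Python for brevity and is not diskengine's Go client. The helper names `build_request`, `parse_response`, and `call` are hypothetical. `bdev_get_bdevs` is a real SPDK RPC method, and `/var/tmp/spdk.sock` is SPDK's default socket path, though whether diskengine uses that path is an assumption.

```python
import json
import socket


def build_request(method, params=None, request_id=1):
    """Encode one JSON-RPC 2.0 request as bytes."""
    req = {"jsonrpc": "2.0", "method": method, "id": request_id}
    if params:
        req["params"] = params
    return json.dumps(req).encode("utf-8")


def parse_response(raw, request_id=1):
    """Return the result field, or raise if the server reported an error."""
    resp = json.loads(raw)
    if resp.get("id") != request_id:
        raise RuntimeError("response id does not match request id")
    if "error" in resp:
        raise RuntimeError("RPC failed: %s" % resp["error"].get("message"))
    return resp["result"]


def call(sock_path, method, params=None, request_id=1):
    """Send one request over a Unix socket and read until the reply parses."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        s.sendall(build_request(method, params, request_id))
        buf = b""
        while True:
            chunk = s.recv(4096)
            if not chunk:
                raise ConnectionError("socket closed before a full reply arrived")
            buf += chunk
            try:
                return parse_response(buf, request_id)
            except ValueError:
                # Incomplete JSON (or a split UTF-8 sequence): keep reading.
                continue
```

With a target running, `call("/var/tmp/spdk.sock", "bdev_get_bdevs")` would return the list of bdevs the target currently knows about. The client holds no SPDK objects of its own. It only sees what the daemon reports.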

The full path looks like this:

  1. A guest VM issues storage IO.
  2. QEMU exposes that IO to a host-side backend such as SPDK vhost-blk or vfio-user NVMe.
  3. SPDK turns guest queue activity into bdev IO.
  4. A RAID bdev may mirror or split that IO across base bdevs.
  5. The base bdevs may be remote NVMe-oF controllers attached over RDMA.
  6. On storage nodes, those remote exports are lvol bdevs.
  7. lvol bdevs are blobs inside a blobstore.
  8. Blobstore maps blob clusters onto a base bdev.
  9. The base bdev may be a physical NVMe namespace.
  10. The NVMe library turns IO into NVMe commands, puts them in submission queues, rings doorbells, polls completions, and returns completion callbacks back up the stack.
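Steps 7 and 8 are mostly address arithmetic. The sketch below shows the idea under stated assumptions: a 4 MiB cluster size, a per-blob table of cluster indices, and `None` standing for a cluster that has not been allocated yet. The function name and the table layout are hypothetical. The real blobstore keeps this mapping in on-disk metadata and works in device blocks.

```python
CLUSTER_SIZE = 4 * 1024 * 1024  # assumed cluster size in bytes


def blob_to_base_offset(cluster_map, blob_offset, cluster_size=CLUSTER_SIZE):
    """Translate a byte offset inside a blob to a byte offset on the base bdev.

    cluster_map[i] is the base-device cluster that backs blob cluster i,
    or None if that blob cluster is not allocated.
    """
    index, within = divmod(blob_offset, cluster_size)
    if index >= len(cluster_map):
        raise ValueError("offset is past the end of the blob")
    base_cluster = cluster_map[index]
    if base_cluster is None:
        return None  # nothing on the base bdev backs this range yet
    return base_cluster * cluster_size + within


# A three-cluster blob whose clusters are scattered across the base device.
cluster_map = [7, 2, None]
print(blob_to_base_offset(cluster_map, 0))                    # 29360128
print(blob_to_base_offset(cluster_map, CLUSTER_SIZE + 4096))  # 8392704
print(blob_to_base_offset(cluster_map, 2 * CLUSTER_SIZE))     # None
```

The point to carry forward is that adjacent offsets inside an lvol need not be adjacent on the base bdev once they cross a cluster boundary.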

That is one path. SPDK also includes iSCSI, accel, crypto, malloc bdevs, null bdevs, passthru bdevs, TCP transports, RDMA transports, and many more pieces. The book focuses on the pieces you need for diskengine and for reading/extending the C source.

The four mental models

1. The hardware model

An SSD is not a magic byte array. It is a controller plus NAND flash. NAND has pages and erase blocks. It cannot overwrite in place the way RAM does. The SSD controller maintains a flash translation layer that maps host logical block addresses to physical flash locations. This is why write amplification, garbage collection, wear leveling, latency cliffs, TRIM/UNMAP, and power-loss behavior matter.
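A toy flash translation layer makes the "cannot overwrite in place" rule and its cost visible. This is a deliberately tiny sketch, not how any real controller works: `ToyFTL` is a hypothetical name, it maps whole pages, it has no spare capacity, and its garbage collector simply buffers the live pages of one block, erases the block, and programs them back.

```python
STALE = "stale"  # marker for a physical page whose data was superseded


class ToyFTL:
    """Page-mapped FTL over NAND that can only program erased pages."""

    def __init__(self, blocks, pages_per_block):
        self.ppb = pages_per_block
        self.flash = [None] * (blocks * pages_per_block)  # None = erased
        self.l2p = {}          # logical page number -> physical page number
        self.host_writes = 0   # pages the host asked to write
        self.nand_writes = 0   # pages actually programmed into NAND

    def _program(self, lpn):
        ppn = self.flash.index(None)  # first erased page; ValueError if none
        self.flash[ppn] = lpn
        self.l2p[lpn] = ppn
        self.nand_writes += 1

    def write(self, lpn):
        """Host write: never overwrites, always lands on a fresh page."""
        self.host_writes += 1
        old = self.l2p.get(lpn)
        if old is not None:
            self.flash[old] = STALE   # old copy becomes garbage
        if None not in self.flash:
            self.gc()
        self._program(lpn)

    def gc(self):
        """Erase the block with the most stale pages, rewriting its live pages."""
        blocks = range(len(self.flash) // self.ppb)
        victim = max(
            blocks,
            key=lambda b: self.flash[b * self.ppb:(b + 1) * self.ppb].count(STALE),
        )
        start = victim * self.ppb
        live = [p for p in self.flash[start:start + self.ppb] if p != STALE]
        self.flash[start:start + self.ppb] = [None] * self.ppb  # erase whole block
        for lpn in live:
            self._program(lpn)  # extra NAND writes the host never asked for


ftl = ToyFTL(blocks=2, pages_per_block=4)
for lpn in range(6):
    ftl.write(lpn)       # fill six of eight pages
for _ in range(3):
    ftl.write(0)         # rewrite the same logical page three times
print(ftl.host_writes, ftl.nand_writes)  # 9 11
```

Nine host writes cost eleven NAND programs in this run, a write amplification of about 1.22. The third rewrite of page 0 found no erased page, so garbage collection had to relocate two live pages first. That stall is a miniature latency cliff.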

NVMe is the protocol host software uses to talk to modern SSDs. NVMe is a queue protocol. The host writes commands into submission queues in host memory and rings a doorbell register to tell the device that new entries are waiting. The device fetches the commands with DMA, transfers payload data with DMA, and writes completions into completion queues. The host observes those completions, either after an interrupt or, as SPDK does, by polling.
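The queue mechanics are easier to hold in your head after simulating them once. The sketch below models one submission/completion queue pair as two rings, with a doorbell the host must ring and a phase bit the host uses to tell new completions from stale ones. `ToyQueuePair` and its fields are hypothetical names, and the model takes shortcuts: the host peeks at the device's submission head directly, and the device never checks for completion queue overflow.

```python
class ToyQueuePair:
    """One submission queue and one completion queue, both rings in 'host memory'."""

    def __init__(self, depth):
        self.depth = depth
        self.sq = [None] * depth
        self.cq = [None] * depth    # entries are (command id, status, phase)
        self.sq_tail = 0            # host: next free submission slot
        self.sq_tail_doorbell = 0   # device register: tail the device knows about
        self.sq_head = 0            # device: next submission slot to fetch
        self.cq_tail = 0            # device: next completion slot to fill
        self.device_phase = 1
        self.cq_head = 0            # host: next completion slot to inspect
        self.host_phase = 1

    def submit(self, cid, opcode):
        """Host: place a command in the ring. The device cannot see it yet."""
        next_tail = (self.sq_tail + 1) % self.depth
        if next_tail == self.sq_head:
            raise BufferError("submission queue full")
        self.sq[self.sq_tail] = (cid, opcode)
        self.sq_tail = next_tail

    def ring_doorbell(self):
        """Host: publish the new tail to the device."""
        self.sq_tail_doorbell = self.sq_tail

    def device_step(self):
        """Device: fetch commands up to the doorbell and post completions."""
        while self.sq_head != self.sq_tail_doorbell:
            cid, _opcode = self.sq[self.sq_head]
            self.sq_head = (self.sq_head + 1) % self.depth
            self.cq[self.cq_tail] = (cid, 0, self.device_phase)
            self.cq_tail = (self.cq_tail + 1) % self.depth
            if self.cq_tail == 0:
                self.device_phase ^= 1  # flip phase on every wrap

    def poll_completions(self):
        """Host: consume entries whose phase matches the expected phase."""
        done = []
        while True:
            entry = self.cq[self.cq_head]
            if entry is None or entry[2] != self.host_phase:
                break
            done.append(entry[0])
            self.cq_head = (self.cq_head + 1) % self.depth
            if self.cq_head == 0:
                self.host_phase ^= 1
        return done


qp = ToyQueuePair(depth=4)
qp.submit(1, "read")
qp.submit(2, "write")
qp.device_step()
print(qp.poll_completions())  # [] because no doorbell was rung
qp.ring_doorbell()
qp.device_step()
print(qp.poll_completions())  # [1, 2]
```

Two details are worth noticing. Nothing happens until the doorbell is rung, and the host never asks the device whether work is done. It only reads its own memory and compares phase bits.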

2. The kernel-bypass model

Normal storage IO goes through Linux system calls, kernel block layers, kernel drivers, interrupts, wakeups, and copies. SPDK moves the driver and storage stack into userspace. It uses DPDK and VFIO to set up hugepage-backed DMA memory and direct device access. It polls instead of waiting for interrupts. This trades CPU for latency: a polling core stays fully busy even when no IO is in flight, and in exchange the data path loses its kernel crossings and much of its latency variance.

This does not mean the kernel is gone. The kernel still provides process isolation, IOMMU support, VFIO, memory management, scheduling, and filesystems for normal files. SPDK removes the kernel from the hot storage data path after setup.

3. The SPDK runtime model

SPDK is cooperative, event-driven C. Reactors are pinned OS threads. spdk_thread is a lightweight SPDK execution context scheduled on a reactor. Pollers are callbacks that run repeatedly. Messages are callbacks sent to another spdk_thread. io_channel is per-thread device state that allows hot paths to avoid locks.

The rule is simple and brutal: do not block. If a poller blocks, it stalls the reactor. If a callback waits synchronously for work that needs the same thread, it can deadlock. If you touch an object from the wrong thread, debug builds often assert because SPDK would rather crash than silently corrupt state.
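The runtime model can be mimicked in a few lines. This is a sketch of the idea only: `ToyThread` and `ToyReactor` are hypothetical stand-ins for `spdk_thread` and a reactor, `send_msg` plays the role that `spdk_thread_send_msg` plays in SPDK, and the whole thing runs on one Python thread.

```python
from collections import deque


class ToyThread:
    """Stand-in for an spdk_thread: a message queue plus registered pollers."""

    def __init__(self, name):
        self.name = name
        self.messages = deque()
        self.pollers = []

    def send_msg(self, fn, *args):
        """Ask this thread to run fn later. The sender never waits."""
        self.messages.append((fn, args))

    def poll(self):
        """One pass: run the messages queued so far, then every poller once."""
        work = 0
        for _ in range(len(self.messages)):
            fn, args = self.messages.popleft()
            fn(*args)
            work += 1
        for poller in self.pollers:
            work += poller()
        return work


class ToyReactor:
    """Stand-in for a reactor: one loop that round-robins its threads."""

    def __init__(self, threads):
        self.threads = threads

    def run(self, passes):
        for _ in range(passes):
            for thread in self.threads:
                thread.poll()  # a poller that blocks here stalls everything


def run_demo():
    log = []
    state = {"count": 0}       # owned by the 'owner' thread only
    owner = ToyThread("owner")
    other = ToyThread("other")

    def bump(n):               # always runs on owner, so it needs no lock
        state["count"] += n
        log.append("owner bumped to %d" % state["count"])

    sent = []

    def other_poller():
        if sent:
            return 0           # idle: nothing to do on this pass
        sent.append(True)
        owner.send_msg(bump, 5)  # do not touch state from here
        log.append("other sent msg")
        return 1

    other.pollers.append(other_poller)
    ToyReactor([owner, other]).run(passes=2)
    return log, state["count"]
```

`run_demo()` returns the log `["other sent msg", "owner bumped to 5"]` and a count of 5. The update happens one pass after the message is sent, on the owning thread. If `other_poller` slept or waited for the bump to land, the loop would never reach `owner.poll()` again, which is the deadlock the paragraph above warns about.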

4. The bdev graph model

The bdev layer is SPDK's common block-device abstraction. A physical NVMe namespace is a bdev. An lvol is a bdev. A RAID device is a bdev. A vhost controller consumes a bdev. NVMe-oF exports bdevs as namespaces. A virtual bdev is just a bdev that forwards or transforms IO to one or more base bdevs.
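A small model shows what "a virtual bdev is just a bdev" means in practice. The classes below are hypothetical and synchronous, whereas real SPDK bdev IO is asynchronous and completes through callbacks. `MemBdev` is loosely like a malloc bdev, and `Raid1Bdev` is a virtual bdev that mirrors writes to its base bdevs.

```python
class MemBdev:
    """Leaf bdev backed by memory."""

    def __init__(self, name, num_blocks, block_size=512):
        self.name = name
        self.block_size = block_size
        self.data = bytearray(num_blocks * block_size)

    def write(self, lba, buf):
        off = lba * self.block_size
        if len(buf) % self.block_size or off + len(buf) > len(self.data):
            raise ValueError("write must be block-aligned and in range")
        self.data[off:off + len(buf)] = buf

    def read(self, lba, num_blocks):
        off = lba * self.block_size
        return bytes(self.data[off:off + num_blocks * self.block_size])


class Raid1Bdev:
    """Virtual bdev: same interface, but IO is forwarded to base bdevs."""

    def __init__(self, name, bases):
        self.name = name
        self.bases = bases
        self.block_size = bases[0].block_size

    def write(self, lba, buf):
        for base in self.bases:   # mirror every write
            base.write(lba, buf)

    def read(self, lba, num_blocks):
        return self.bases[0].read(lba, num_blocks)


def describe(bdev, depth=0):
    """Return the graph below a bdev as indented lines."""
    lines = ["  " * depth + bdev.name]
    for base in getattr(bdev, "bases", []):
        lines += describe(base, depth + 1)
    return lines


a = MemBdev("nvme0n1", num_blocks=8)
b = MemBdev("nvme1n1", num_blocks=8)
raid = Raid1Bdev("raid1", [a, b])
raid.write(2, b"x" * 512)
print(describe(raid))  # ['raid1', '  nvme0n1', '  nvme1n1']
```

A consumer holding `raid` cannot tell whether it is talking to a leaf or to a stack of layers. That is what lets lvols, RAID, and remote NVMe-oF bdevs compose.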

Once you see the world as a bdev graph, diskengine becomes easier to reason about. It is constantly reconciling desired database state against actual SPDK bdev graph state.
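Reconciliation, reduced to its core, is a diff between two maps. The sketch below is an illustration of that idea, not diskengine's actual logic: the function name, the shape of the specs, and the bdev names are all hypothetical.

```python
def plan_reconcile(desired, actual):
    """Compare desired bdev specs with what the target reports.

    Both arguments map bdev name -> spec (any comparable value).
    Returns the names to create, to delete, and those whose spec differs.
    """
    create = sorted(name for name in desired if name not in actual)
    delete = sorted(name for name in actual if name not in desired)
    mismatch = sorted(
        name for name in desired
        if name in actual and desired[name] != actual[name]
    )
    return {"create": create, "delete": delete, "mismatch": mismatch}


desired = {
    "lvs0/vol-a": {"size_mib": 1024},
    "lvs0/vol-b": {"size_mib": 2048},
}
actual = {
    "lvs0/vol-a": {"size_mib": 512},
    "lvs0/vol-old": {"size_mib": 256},
}
print(plan_reconcile(desired, actual))
```

For these inputs the plan is to create `lvs0/vol-b`, delete `lvs0/vol-old`, and flag `lvs0/vol-a` as a mismatch. A loop built on this would re-read the actual state after acting rather than trusting that each step took effect.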

What to read locally

Use these files as landmarks:

  • README.md for SPDK's own high-level promise.
  • app/spdk_tgt/spdk_tgt.c for the generic target entry point.
  • lib/event/app.c for app startup and subsystem initialization.
  • lib/event/reactor.c for reactor loops.
  • lib/thread/thread.c and include/spdk/thread.h for spdk_thread, messages, pollers, and channels.
  • lib/bdev/bdev.c, include/spdk/bdev.h, and include/spdk/bdev_module.h for bdev.
  • module/bdev/null/bdev_null.c for a small bdev module.
  • module/bdev/nvme/ for the NVMe bdev module.
  • lib/nvme/ for the initiator-side NVMe library.
  • lib/nvmf/ for the target-side NVMe-oF library.
  • lib/blob/, lib/lvol/, and module/bdev/lvol/ for blobstore and lvol.
  • module/bdev/raid/ for RAID.
  • lib/vhost/ and lib/nvmf/vfio_user.c for VM-facing transports.
  • /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/spdkclient/ for diskengine's JSON-RPC client.

Primary external references

Keep these open while studying:

How to use this book

Read Part 1 even if you are tempted to jump to SPDK. If you do not understand why DMA memory must be special, why NVMe is queue-based, or why NAND cannot overwrite in place, many SPDK choices will look arbitrary.

When a chapter shows a source path, open it in the repo. Do not only read the quoted excerpts. The excerpt is the doorway. The file is the lesson.

When a lab asks you to predict behavior, actually predict before reading the answer or running a command. SPDK debugging is mostly state classification. You get better by forcing yourself to name the state before poking it.

Self-check

  • Can you explain why diskengine is not "using an SPDK library" in the normal linked-library sense?
  • Can you name the four core mental models: hardware, kernel bypass, runtime, bdev graph?
  • Can you point to the source file that owns reactor polling?
  • Can you point to the source file that owns bdev IO routing?
  • Can you explain why "RPC returned success" may not mean "the storage graph is fully converged"?