Chapter 7: Linux Storage Path vs SPDK Path | SPDK From First Principles

Reader Promise

By the end of this chapter you should be able to explain why SPDK exists without hand-waving. You should be able to draw the normal Linux path for a read or write, draw the SPDK path for the same operation, and name the tradeoffs instead of repeating "kernel slow, SPDK fast". That phrase is not good enough. The real difference is ownership of queues, memory, threads, interrupts, context switches, batching, and failure handling.

The Linux stack is a general-purpose storage operating system. SPDK is a specialized userspace storage runtime. Linux optimizes for protection, sharing, fairness, device diversity, filesystems, page cache, suspend/resume, hotplug, and decades of applications. SPDK optimizes for a narrower target: high-throughput, low-latency storage services that can reserve CPU cores, hugepage memory, and direct device ownership.

The Normal Linux Read Path

Imagine an application calls pread(fd, buf, 4096, offset). If the file is on a filesystem backed by an NVMe SSD, the path is roughly:

application thread
  |
  | syscall boundary
  v
kernel VFS
  |
  | file, inode, page cache, filesystem mapping
  v
filesystem
  |
  | logical file blocks -> block device sectors
  v
Linux block layer
  |
  | request allocation, scheduler/merge, bio/request mapping
  v
NVMe kernel driver
  |
  | submit command to hardware queue
  v
NVMe SSD
  |
  | interrupt or poll completion
  v
kernel completion path
  |
  | wake task, copy/map data, return from syscall
  v
application thread

That stack is not dumb. It provides a lot:

Permission checks and process isolation.
Filesystem semantics.
Page cache.
Device sharing.
Request accounting.
Cgroup and IO scheduling policy.
Kernel driver recovery.
Hotplug and power-management integration.
A stable POSIX programming model.

For many systems, this is exactly what you want.
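
A minimal sketch of the call at the top of that path, using Python's os.pread as a thin wrapper over the pread syscall. The scratch file and the helper names read_block and make_demo_file are illustrative assumptions, not part of any real service. The point is only that the calling thread does not continue until the kernel has walked the whole stack and handed back the data.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # same 4 KiB size as the pread() example above


def read_block(path, offset, length=BLOCK_SIZE):
    """Blocking read: the thread stays in the kernel until data or an error returns."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # One syscall. VFS, filesystem, page cache, block layer, and the
        # device driver are all hidden behind this single call.
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)


def make_demo_file(num_blocks=4):
    """Create a scratch file whose Nth block is filled with the byte value N."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for i in range(num_blocks):
            f.write(bytes([i]) * BLOCK_SIZE)
    return path


if __name__ == "__main__":
    path = make_demo_file()
    try:
        data = read_block(path, offset=2 * BLOCK_SIZE)
        print(len(data), data[0])
    finally:
        os.unlink(path)
```

Notice how little the caller has to know: no queues, no DMA memory, no completion handling. That simplicity is what the kernel path buys.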

What Costs Show Up In The Kernel Path

The costs are not one single thing. They are a pile of small costs that become visible at high IOPS:

A syscall crosses from userspace into kernel mode.
The filesystem may consult metadata.
The page cache may copy, map, dirty, or reclaim pages, unless the IO bypasses it with O_DIRECT.
The block layer allocates and transforms request objects.
The IO scheduler, if one is configured for the device, may merge or reorder requests.
The NVMe driver submits to hardware queues owned by the kernel.
Interrupt handling may run completion work on a CPU other than the one that submitted the IO.
The sleeping application may need to be woken.
Locks, atomics, memory barriers, and shared queues protect many users and devices.

None of these are inherently bad. They buy generality. But if your storage server already owns the disk, already speaks an async protocol, already uses direct buffers, and already dedicates CPU to IO, some of that generality is overhead.
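
To see why small costs only become visible at high IOPS, it helps to work out the CPU time available per IO. The sketch below is plain arithmetic. The 500 ns overhead figure is a made-up assumption chosen to make the numbers easy to read, not a measurement of any kernel or device.

```python
def per_io_budget_ns(iops, cores=1):
    """CPU time available per IO, in nanoseconds, if `cores` cores do nothing else."""
    return cores * 1_000_000_000 / iops


def overhead_fraction(overhead_ns, iops, cores=1):
    """Fraction of the per-IO budget consumed by a fixed per-IO overhead."""
    return overhead_ns / per_io_budget_ns(iops, cores)


if __name__ == "__main__":
    # HYPOTHETICAL overhead, chosen only to make the arithmetic visible.
    assumed_overhead_ns = 500
    for iops in (10_000, 100_000, 1_000_000):
        budget = per_io_budget_ns(iops)
        share = overhead_fraction(assumed_overhead_ns, iops)
        print(f"{iops:>9} IOPS: budget {budget:>9.0f} ns/IO, overhead share {share:.1%}")
```

At 10,000 IOPS the assumed overhead is half a percent of one core and nobody notices. At one million IOPS the budget is 1000 ns per IO, so the same overhead is half the core.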

The SPDK Path

SPDK moves the device driver, queue ownership, and polling loop into userspace. For an NVMe read through SPDK's bdev layer, the shape is closer to:

storage service callback
  |
  | spdk_bdev_read()
  v
bdev core
  |
  | allocate/route/split/QoS if needed
  v
bdev module, such as bdev_nvme
  |
  | submit to per-thread NVMe qpair
  v
NVMe submission queue in DMA-safe hugepage memory
  |
  | MMIO doorbell
  v
NVMe SSD
  |
  | completion written into host memory
  v
SPDK poller observes completion
  |
  | callback chain runs on SPDK thread
  v
storage service continuation

The crucial differences:

The hot path is async. You submit and later receive a callback.
The application usually does not block.
Completion is usually found by polling, not by sleeping and waiting for an interrupt.
Buffers are allocated from DMA-safe memory.
Device queues are owned by the SPDK process through VFIO/UIO-style binding.
Per-core or per-thread resources avoid many cross-core locks.
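
The submit, poll, callback shape can be sketched with a toy model. This is not the SPDK API: ToyQpair, submit_read, device_step, and poll are hypothetical names, and device_step stands in for the SSD writing a completion into host memory. What the sketch does show is that submission returns immediately and the callback runs only when the owning thread polls.

```python
from collections import deque


class ToyQpair:
    """Toy model of a submission/completion queue pair. NOT the SPDK API."""

    def __init__(self):
        self._submitted = deque()   # commands the "device" has not finished yet
        self._completed = deque()   # completions waiting to be polled
        self._next_id = 0

    def submit_read(self, lba, callback):
        """Queue a read and return immediately; nothing blocks here."""
        cmd_id = self._next_id
        self._next_id += 1
        self._submitted.append((cmd_id, lba, callback))
        return cmd_id

    def device_step(self):
        """Stand-in for the SSD: finish one outstanding command, if any."""
        if self._submitted:
            cmd_id, lba, callback = self._submitted.popleft()
            data = f"data@{lba}"
            self._completed.append((cmd_id, data, callback))

    def poll(self):
        """Drain completions and run their callbacks on the calling thread."""
        count = 0
        while self._completed:
            cmd_id, data, callback = self._completed.popleft()
            callback(cmd_id, data)
            count += 1
        return count


if __name__ == "__main__":
    results = []
    qpair = ToyQpair()
    qpair.submit_read(100, lambda cmd_id, data: results.append((cmd_id, data)))
    print(qpair.poll(), results)   # nothing has completed yet
    qpair.device_step()
    print(qpair.poll(), results)   # the callback has now run
```

Because only one thread ever touches the queue pair, the toy needs no locks. That is the same reason per-thread resources keep the real hot path cheap.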

What SPDK Removes

SPDK can remove or reduce:

Syscall overhead on the IO hot path.
Kernel block-layer request scheduling.
Interrupt and wakeup overhead in the common poll-mode path.
Extra copies when the application already has DMA-safe buffers.
Kernel-driver queue sharing between unrelated processes.
Lock contention from multi-tenant kernel abstractions.

This is why SPDK is attractive for storage appliances, NVMe-oF targets, userspace vhost targets, virtual block device stacks, and cloud volume services.

What SPDK Adds

SPDK does not make complexity disappear. It moves complexity into the userspace storage service:

The process must reserve hugepages.
The process must bind devices away from kernel drivers.
The process must obey strict thread-affinity rules: per-thread resources such as IO channels must be used only on the SPDK thread that owns them.
The process must avoid blocking poller threads.
The process must manage async cleanup, resets, removals, and reconnects.
The process must expose its own control plane.
The process must be monitored like a storage operating system.

This matters for diskengine. If diskengine tells SPDK to create a volume, attach an NVMe-oF controller, or export a bdev, diskengine is depending on an external userspace storage runtime. The failure modes are not just Linux file errors. They include JSON-RPC replay errors, bdev examine delays, VFIO binding problems, reactor stalls, qpair resets, and metadata operations in blobstore/lvol.
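
The first two obligations in the list above can be checked from outside the SPDK process. The following is a rough preflight sketch, assuming a Linux host: it reads hugepage counters from /proc/meminfo and the bound driver from the sysfs driver symlink of a PCI device. The function names are hypothetical, the PCI address in the demo is a placeholder, and the driver list covers only vfio-pci and uio_pci_generic.

```python
import os


def parse_hugepages(meminfo_text):
    """Extract hugepage counters from /proc/meminfo-style text."""
    wanted = ("HugePages_Total", "HugePages_Free", "Hugepagesize")
    found = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            found[key] = int(rest.split()[0])  # drop the trailing "kB" if present
    return found


def bound_driver(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return the kernel driver name bound to a PCI device, or None if unbound."""
    link = os.path.join(sysfs_root, pci_addr, "driver")
    if not os.path.islink(link):
        return None
    return os.path.basename(os.readlink(link))


def usable_by_userspace_driver(driver_name):
    """True if the device is bound to a userspace-IO style driver."""
    return driver_name in ("vfio-pci", "uio_pci_generic")


if __name__ == "__main__":
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as f:
            print(parse_hugepages(f.read()))
    # PLACEHOLDER PCI address; substitute a real one from lspci.
    driver = bound_driver("0000:00:00.0")
    print(driver, usable_by_userspace_driver(driver))
```

A check like this does not prove SPDK will start, but it separates host setup problems from runtime problems before anyone starts reading reactor logs.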

Side-By-Side Mental Model

Linux path                           SPDK path
----------                           ---------
application calls read/write         service calls async SPDK API
kernel owns device driver            SPDK process owns device driver
kernel manages NVMe queues           SPDK manages NVMe queues
interrupts wake sleepers             pollers find completions
general scheduling/fairness          dedicated cores and explicit queues
page cache often involved            DMA-safe buffers preferred
blocking API is common               callback state machines are normal
OS handles broad policy              storage service must own policy

When The Linux Path Is Better

SPDK is not automatically the right answer. The Linux path may be better when:

You need normal filesystems and POSIX semantics.
You need ordinary process isolation and device sharing.
IO rate is modest and development simplicity matters more.
CPU cores cannot be dedicated to polling.
Operational teams already understand kernel storage better.
Latency targets are not tight enough to justify SPDK complexity.
The workload benefits from the page cache.
You need mature kernel features such as broad hardware quirks, power management, or conventional multipath integration.

The worst SPDK design is a system that pays all of SPDK's complexity costs but does not use its queue ownership, polling, batching, or async model.

When SPDK Is The Right Tool

SPDK starts to make sense when:

You are building a storage service rather than an ordinary application.
The service owns the storage devices or remote NVMe-oF connections.
The workload is high IOPS or latency-sensitive.
You can dedicate CPU cores.
The data path is already asynchronous.
You can allocate DMA-safe memory and avoid page-cache semantics.
The control plane can tolerate async operations and explicit failure handling.
You need userspace composition of bdevs, lvol, RAID, vhost, NVMe-oF, or vfio-user.

This is why it fits diskengine. diskengine is not trying to be cp or sqlite on a laptop filesystem. It is orchestrating cloud volumes, bdev stacks, NVMe-oF exports, vhost/vfio-user exposure, snapshots, RAID, and recovery loops.

Source Anchors

Use these files to connect the mental model to code:

lib/event/app.c: spdk_app_start() sets up the SPDK application framework.
lib/event/reactor.c: reactor loops run pollers and messages.
lib/thread/thread.c: spdk_thread_send_msg(), pollers, and channel ownership.
lib/bdev/bdev.c: public bdev submission becomes module-specific work.
module/bdev/nvme/bdev_nvme.c: bdev requests become NVMe library submissions.
lib/nvme/nvme.c, lib/nvme/nvme_qpair.c: NVMe controller and qpair mechanics.
lib/env_dpdk/env.c: SPDK's DPDK-backed environment.
scripts/setup.sh: device binding, hugepages, and VFIO setup.

Read those files with three questions in mind: who owns the thread, who owns the queue, and who calls the completion?

Edge Cases And Failure Modes

Polling burns CPU. A poll-mode storage server can use a full core even when the workload is light.
Blocking is poisonous. A blocking call on a reactor can stall unrelated IO.
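Wrong memory can fail DMA. Ordinary heap memory is not pinned and may have no stable address the device can use, so data buffers on the NVMe path should come from SPDK's DMA-safe allocators such as spdk_dma_zmalloc() or spdk_zmalloc().
Wrong device binding hides disks from SPDK. If the kernel nvme driver is still bound to the controller, SPDK's userspace NVMe driver cannot claim it. scripts/setup.sh exists to rebind the device and set up hugepages.
Kernel bypass changes observability. iostat reports kernel block-layer counters, so IO submitted through SPDK's userspace driver does not appear there. You need SPDK stats, logs, tracepoints, and RPCs.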
Page cache assumptions break. SPDK bdev IO is block IO, not filesystem buffered IO.
Crash semantics move upward. If the SPDK process dies, your service design must define what happens to exported volumes and in-flight operations.
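
The "blocking is poisonous" point is easy to demonstrate with a toy single-threaded loop. This is a sketch, not SPDK's reactor: run_reactor, GapRecorder, and blocking_poller are hypothetical names, and time.sleep stands in for any blocking call. The recorder measures how long it went without being polled.

```python
import time


def run_reactor(pollers, iterations):
    """Toy single-threaded reactor: call every poller once per loop iteration."""
    for _ in range(iterations):
        for poller in pollers:
            poller()


class GapRecorder:
    """Poller that records the longest time between two of its own invocations."""

    def __init__(self):
        self.last = None
        self.max_gap = 0.0

    def __call__(self):
        now = time.monotonic()
        if self.last is not None:
            self.max_gap = max(self.max_gap, now - self.last)
        self.last = now


def blocking_poller(delay_s=0.05):
    """A misbehaving poller: sleeps, standing in for a blocking syscall."""
    def poll():
        time.sleep(delay_s)
    return poll


if __name__ == "__main__":
    healthy = GapRecorder()
    run_reactor([healthy], iterations=5)
    stalled = GapRecorder()
    run_reactor([stalled, blocking_poller(0.05)], iterations=5)
    print(f"max gap alone: {healthy.max_gap:.6f}s, with blocker: {stalled.max_gap:.6f}s")
```

The well-behaved poller did nothing wrong, yet its worst-case gap grows to at least the blocker's sleep time. Every qpair and every message queue serviced by that loop inherits the same delay.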

Misconceptions To Kill

"SPDK is just faster read()." No. It is a different runtime and driver model.
"Polling is always better." No. Polling trades CPU for latency and throughput.
"Kernel storage is obsolete." No. Kernel storage is the right default for many systems.
"Userspace means unsafe." Not exactly. VFIO and IOMMU exist to make userspace DMA device access controlled, but the application still has more responsibility.
"SPDK means no kernel at all." No. Linux still provides process isolation, memory management, VFIO/IOMMU infrastructure, networking support, and scheduling of the userspace process.

Lab: Trace One Read Both Ways

Pick one IO: a 4 KiB read at LBA 100.
Draw the Linux path from application to NVMe driver completion.
Draw the SPDK path from spdk_bdev_read() to completion callback.
Mark every place where a thread can sleep.
Mark every place where ownership crosses from one subsystem to another.
Mark where the data buffer must be DMA-safe in the SPDK path.

Source Reading Exercise

Open lib/bdev/bdev.c and find where a bdev IO is submitted. Then open module/bdev/nvme/bdev_nvme.c and find where the NVMe bdev module submits to an NVMe qpair. Write down:

The object representing the logical IO.
The object representing the per-thread channel.
The callback that finishes the IO.
The function that would be unsafe to block inside.

Operational Exercise

On a real host, classify a storage problem into Linux-path or SPDK-path first:

If lsblk does not show the NVMe device after binding to VFIO, is that a bug?
If SPDK can see the controller but iostat is quiet during heavy IO, is that a bug?
If CPU usage stays high while idle, is that a bug or an expected polling tradeoff?
If an SPDK reactor blocks in a filesystem call, which unrelated IOs might stall?
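
For the lsblk and iostat questions, it helps to look at where the kernel's view comes from. The sketch below parses /proc/diskstats-style text, which is the kind of kernel block-layer counter that iostat reports on. The sample text and device names are invented for illustration, and parse_diskstats and kernel_sees are hypothetical helper names.

```python
# MADE-UP sample in /proc/diskstats layout: major, minor, name, then counters.
SAMPLE_DISKSTATS = """\
   8       0 sda 1000 10 20000 500 2000 20 40000 900 0 700 1400
 259       0 nvme0n1 5 0 40 1 7 0 56 2 0 3 3
"""


def parse_diskstats(text):
    """Map device name -> (reads_completed, writes_completed)."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 8:
            continue  # skip blank or truncated lines
        name = parts[2]
        # parts[3] is reads completed, parts[7] is writes completed.
        stats[name] = (int(parts[3]), int(parts[7]))
    return stats


def kernel_sees(device, text):
    """True if the kernel block layer has an entry for this device name."""
    return device in parse_diskstats(text)


if __name__ == "__main__":
    print(parse_diskstats(SAMPLE_DISKSTATS))
    print(kernel_sees("nvme0n1", SAMPLE_DISKSTATS))
    print(kernel_sees("nvme1n1", SAMPLE_DISKSTATS))
```

On a real host you would read /proc/diskstats instead of the sample string. A device that SPDK owns has no kernel block device, so its absence from this view is a clue for classification, not a fault in itself.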

References

SPDK documentation index: https://spdk.io/doc/
SPDK message passing and concurrency: https://spdk.io/doc/concurrency.html
SPDK NVMe driver guide: https://spdk.io/doc/nvme.html
SPDK block device guide: https://spdk.io/doc/bdev.html
DPDK Environment Abstraction Layer: https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html
Linux VFIO documentation: https://docs.kernel.org/driver-api/vfio.html

Self-Check

Why can the Linux path be better even if SPDK can be faster?
What does polling remove, and what does it cost?
Why does SPDK care about hugepages and DMA-safe buffers?
What is the difference between a blocking syscall path and an async callback path?
Why does diskengine need to treat SPDK as a storage runtime rather than a library-shaped black box?