SPDK From First Principles

SPDK deep learning path

Chapter 6: PCIe, MMIO, DMA, IOMMU, VFIO, And Hugepages

Source: drafts/hardware/06-pcie-mmio-dma-iommu-vfio-hugepages.md

Chapter Goal

SPDK is fast because it puts userspace code close to hardware. That means SPDK applications must understand what the kernel normally hides: PCIe discovery, BAR mapping, MMIO registers, DMA-capable memory, IOVA addresses, IOMMU permissions, VFIO ownership, and hugepage-backed memory.

By the end of this chapter, you should be able to explain why SPDK setup binds devices to vfio-pci, why hugepages matter, what an IOVA is, why an IOMMU group can block device assignment, and how a user process can safely ring an NVMe doorbell without using the kernel NVMe driver.

Beginner Mental Model

A PCIe NVMe SSD is not a file. It is a device on a bus:

CPU process
  normal virtual memory
  hugepage-backed DMA buffers
  mapped PCI BAR virtual address

kernel
  VFIO owns PCI device
  IOMMU maps allowed DMA ranges

PCIe device
  BAR registers for MMIO
  DMA engine reads commands / writes completions / reads or writes data

SPDK's userspace driver needs two kinds of access:

  1. Register access: map the device BAR and write MMIO registers such as NVMe doorbells.
  2. DMA access: give the device addresses for command queues, completion queues, and data buffers.

The kernel is still involved. It enforces permissions, sets up IOMMU mappings, exposes VFIO file descriptors, and reserves hugepages. SPDK bypasses the kernel storage stack; it does not bypass hardware protection.

PCIe Device Identity: BDF And BARs

PCIe devices are identified by bus-device-function addresses, often written as:

0000:5e:00.0
domain:bus:device.function
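
As a small illustration of the address format above, the following Python sketch splits a BDF string into its numeric fields. The names PciAddress, parse_bdf, and format_bdf are hypothetical helpers for this chapter, not part of SPDK or DPDK.

```python
import re
from typing import NamedTuple


class PciAddress(NamedTuple):
    domain: int
    bus: int
    device: int
    function: int


# Optional domain, then bus:device.function. All fields are hexadecimal.
_BDF_RE = re.compile(
    r"^(?:([0-9a-fA-F]{4,8}):)?([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$"
)


def parse_bdf(text: str) -> PciAddress:
    """Parse 'dddd:bb:dd.f' (or 'bb:dd.f', assuming domain 0)."""
    match = _BDF_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a PCI address: {text!r}")
    domain, bus, device, function = match.groups()
    device_number = int(device, 16)
    if device_number > 0x1F:  # the device field is only 5 bits wide
        raise ValueError(f"device number out of range: {text!r}")
    return PciAddress(
        int(domain or "0", 16), int(bus, 16), device_number, int(function, 16)
    )


def format_bdf(addr: PciAddress) -> str:
    """Render the address the way lspci -D and sysfs print it."""
    return f"{addr.domain:04x}:{addr.bus:02x}:{addr.device:02x}.{addr.function:x}"
```

Parsing into integers rather than comparing strings avoids mismatches between the short form (5e:00.0) and the full form (0000:5e:00.0) of the same device.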

The device exposes configuration space and Base Address Registers (BARs). A BAR describes a region of device memory or I/O space that the host can map. NVMe controllers expose registers through a memory BAR. When SPDK maps that BAR, ordinary-looking pointer writes become MMIO transactions on PCIe.

SPDK's environment API declares PCI helpers in include/spdk/env.h. The BAR mapping API is spdk_pci_device_map_bar() at include/spdk/env.h:898, and the DPDK-backed implementation is in lib/env_dpdk/pci.c:740.
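
To see BARs without any SPDK code, Linux exposes each PCI device's resources in /sys/bus/pci/devices/<BDF>/resource, one "start end flags" line per resource, with the first six lines corresponding to BAR 0 through BAR 5. The sketch below parses that text. The sample line is illustrative rather than taken from a real machine, and parse_resource is a hypothetical helper name.

```python
from typing import List, NamedTuple

# Resource flag bits as defined in the Linux kernel's include/linux/ioport.h.
IORESOURCE_IO = 0x100
IORESOURCE_MEM = 0x200
IORESOURCE_MEM_64 = 0x100000


class Bar(NamedTuple):
    index: int
    start: int
    size: int
    kind: str
    is_64bit: bool


def parse_resource(text: str) -> List[Bar]:
    """Parse the first six lines of a sysfs 'resource' file into BARs."""
    bars = []
    for index, line in enumerate(text.splitlines()[:6]):
        fields = line.split()
        if len(fields) != 3:
            continue
        start, end, flags = (int(field, 16) for field in fields)
        if start == 0 and end == 0 and flags == 0:
            continue  # unused BAR slot
        if flags & IORESOURCE_MEM:
            kind = "mem"
        elif flags & IORESOURCE_IO:
            kind = "io"
        else:
            kind = "other"
        bars.append(
            Bar(index, start, end - start + 1, kind, bool(flags & IORESOURCE_MEM_64))
        )
    return bars


# Illustrative sample: one 16 KiB 64-bit memory BAR at index 0.
SAMPLE_RESOURCE = (
    "0x00000000fb000000 0x00000000fb003fff 0x0000000000140204\n"
    + "0x0000000000000000 0x0000000000000000 0x0000000000000000\n" * 5
)
SAMPLE_BARS = parse_resource(SAMPLE_RESOURCE)
```

On a real system you would pass the contents of the sysfs file instead of SAMPLE_RESOURCE. An NVMe controller should show at least one "mem" entry, which is the register window that gets mapped for MMIO.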

MMIO Is Not RAM

MMIO means memory-mapped I/O. The CPU has a virtual address that points to a device register window, not DRAM. A store to that address is not just a change of stored data; it is a command that has side effects on the device.

For NVMe, doorbells are MMIO registers. SPDK models them in include/spdk/nvme_spec.h:611 as submission queue tail and completion queue head doorbells. SPDK writes them with spdk_mmio_write_4() in lib/nvme/nvme_pcie_internal.h:272 and lib/nvme/nvme_pcie_internal.h:295.

Important differences from RAM:

  • MMIO is usually uncached or specially ordered.
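  • Writes may be posted (the CPU does not wait for the device to receive them), so driver logic needs barriers, for example to make a new queue entry visible in memory before the doorbell write that announces it.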
  • Reading MMIO can be expensive and may have side effects.
  • You cannot treat a device register pointer as normal shared memory.

Misconception to kill: "If I have a pointer, it is memory." In userspace drivers, some pointers are portals into hardware.
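
To make the doorbell registers mentioned above concrete, this sketch computes where they sit inside the NVMe register BAR. It follows the register map in the NVMe base specification as I understand it: doorbells start at offset 0x1000, and the stride between them is 4 << CAP.DSTRD bytes. The function name doorbell_offsets is hypothetical; SPDK's own code expresses this differently.

```python
from typing import Tuple

DOORBELL_BASE = 0x1000  # first doorbell register in the NVMe BAR


def doorbell_offsets(qid: int, dstrd: int = 0) -> Tuple[int, int]:
    """Return (SQ tail doorbell offset, CQ head doorbell offset) for a queue.

    qid   -- queue identifier (0 is the admin queue)
    dstrd -- the 4-bit doorbell stride field from the CAP register
    """
    if not 0 <= qid <= 0xFFFF:
        raise ValueError("queue id must fit in 16 bits")
    if not 0 <= dstrd <= 0xF:
        raise ValueError("DSTRD is a 4-bit field")
    stride = 4 << dstrd
    sq_tail = DOORBELL_BASE + (2 * qid) * stride
    cq_head = DOORBELL_BASE + (2 * qid + 1) * stride
    return sq_tail, cq_head
```

A 4-byte store to BAR base plus one of these offsets is what "ringing the doorbell" means. The value written is a queue index, but the effect is that the controller starts fetching commands or learns that completions were consumed.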

DMA: The Device Touches Host Memory

Direct Memory Access lets the device read and write host memory without the CPU copying every byte.

For NVMe:

  • The controller DMA-reads SQ entries.
  • The controller DMA-writes CQ entries.
  • The controller DMA-reads host buffers for writes.
  • The controller DMA-writes host buffers for reads.

That requires addresses the device can use. CPU virtual addresses are not automatically valid PCIe DMA addresses. SPDK and DPDK translate or map memory into IOVA space.

SPDK's bdev required_alignment field (include/spdk/bdev_module.h:513) matters partly because DMA engines can have alignment restrictions. SPDK's env memory APIs include memzones that are IOVA-contiguous by default unless SPDK_MEMZONE_NO_IOVA_CONTIG is used (include/spdk/env.h:255).
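
The alignment check itself is simple bit arithmetic. The sketch below assumes the alignment is expressed as a power-of-two exponent (so 12 means 4096 bytes); I believe that matches how required_alignment is encoded, but treat that as an assumption and check the header. The helper names are hypothetical.

```python
def is_aligned(addr: int, align_shift: int) -> bool:
    """True if addr is a multiple of 2**align_shift."""
    mask = (1 << align_shift) - 1
    return (addr & mask) == 0


def align_up(addr: int, align_shift: int) -> int:
    """Round addr up to the next multiple of 2**align_shift."""
    mask = (1 << align_shift) - 1
    return (addr + mask) & ~mask
```

A buffer that fails is_aligned() for the device's requirement has to be replaced with an aligned allocation or copied through an aligned bounce buffer before the device can DMA to it.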

Physical Address, Virtual Address, IOVA

Three address spaces matter:

  • VA: CPU virtual address in the process.
  • PA: host physical address.
  • IOVA: I/O virtual address used by a device for DMA.

With no IOMMU, devices often DMA to physical addresses. With an IOMMU, devices DMA to IOVAs, and the IOMMU translates those IOVAs to physical memory while enforcing permissions.

DPDK exposes IOVA modes. SPDK passes an explicit --iova-mode to DPDK when configured (lib/env_dpdk/init.c:523). SPDK also forces iova-mode=pa in some no-IOMMU or limited-IOMMU cases (lib/env_dpdk/init.c:530 through lib/env_dpdk/init.c:547).

Beginner trap: IOVA may equal VA, may equal PA, or may equal neither depending on mode and platform. Code that assumes one globally will eventually fail.
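
The trap can be shown with a toy model. This is not how DPDK implements IOVA modes; it is a minimal sketch with a made-up page table (ToyAddressSpace and its addresses are hypothetical) that shows how the same buffer gets a different device-visible address depending on the mode.

```python
from typing import Dict


class ToyAddressSpace:
    """Maps page-aligned virtual addresses to physical addresses."""

    def __init__(self, page_size: int, va_to_pa: Dict[int, int]) -> None:
        self.page_size = page_size
        self.va_to_pa = va_to_pa

    def pa(self, va: int) -> int:
        """Translate a virtual address to a physical address."""
        offset = va % self.page_size
        page_va = va - offset
        if page_va not in self.va_to_pa:
            raise KeyError(f"VA {va:#x} is not backed by a known page")
        return self.va_to_pa[page_va] + offset

    def iova(self, va: int, mode: str) -> int:
        """Return the address a device would be given for this buffer."""
        if mode == "pa":
            return self.pa(va)  # device uses physical addresses
        if mode == "va":
            return va  # IOMMU is programmed so that IOVA == VA
        raise ValueError(f"unknown IOVA mode: {mode!r}")


# One 2 MiB hugepage; both addresses are invented for the example.
space = ToyAddressSpace(2 << 20, {0x7F0000000000: 0x140000000})
buffer_va = 0x7F0000000000 + 0x1234
```

Here space.iova(buffer_va, "va") returns the pointer value unchanged, while space.iova(buffer_va, "pa") returns an address inside the physical page. Code that hands a raw pointer to a device only works in the first case.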

IOMMU Protection

An IOMMU is an MMU for device DMA. It lets the kernel say: this device may DMA only to these IOVA ranges with these permissions. Without it, a buggy or malicious device can overwrite arbitrary physical memory.

Linux VFIO documentation describes VFIO as an IOMMU/device-agnostic framework for exposing direct device access to userspace in an IOMMU-protected environment. That is the security model SPDK relies on when using vfio-pci.

SPDK checks whether it is using IOMMU-backed DMA through spdk_iommu_is_enabled() (include/spdk/env.h:702), implemented in SPDK's DPDK-backed env memory code around lib/env_dpdk/memory.c:1064.
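
The protection idea can be modeled in a few lines. This is a toy, assuming a flat list of mappings instead of the multi-level page tables a real IOMMU uses; ToyIommu, IommuFault, and dma_faults are hypothetical names.

```python
from typing import List, NamedTuple


class IommuFault(Exception):
    """Raised when a simulated DMA access is not permitted."""


class Mapping(NamedTuple):
    iova: int
    pa: int
    size: int
    writable: bool


class ToyIommu:
    def __init__(self) -> None:
        self._mappings: List[Mapping] = []

    def map(self, iova: int, pa: int, size: int, writable: bool) -> None:
        """Allow the device to access [iova, iova + size)."""
        self._mappings.append(Mapping(iova, pa, size, writable))

    def translate(self, iova: int, length: int, write: bool) -> int:
        """Return the physical address for a DMA access, or fault."""
        for m in self._mappings:
            if m.iova <= iova and iova + length <= m.iova + m.size:
                if write and not m.writable:
                    raise IommuFault("write to a read-only mapping")
                return m.pa + (iova - m.iova)
        raise IommuFault(f"no mapping covers {iova:#x}+{length:#x}")


def dma_faults(iommu: ToyIommu, iova: int, length: int, write: bool) -> bool:
    """Convenience wrapper: True if the access would be blocked."""
    try:
        iommu.translate(iova, length, write)
    except IommuFault:
        return True
    return False


# Example: one read-only 8 KiB window (addresses are invented).
demo_iommu = ToyIommu()
demo_iommu.map(iova=0x10000, pa=0x140000000, size=0x2000, writable=False)
```

A device read inside the window translates normally, while a write to it, or any access outside it, faults instead of reaching host memory. That is the property VFIO depends on.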

VFIO: Userspace Device Ownership

VFIO is the kernel framework that lets a userspace process safely own a device. A kernel driver such as nvme normally owns an NVMe SSD. For SPDK to drive it directly, that kernel driver must release it and vfio-pci must bind to it.

SPDK's scripts/setup.sh is the practical entry point. The help text says it allocates hugepages and binds NVMe, I/OAT, VMD, and Virtio devices (scripts/setup.sh:33). It selects vfio-pci when available (scripts/setup.sh:403) and warns about IOMMU group constraints (scripts/setup.sh:202 through scripts/setup.sh:213).

DPDK's Linux driver guide recommends vfio-pci for DPDK-bound devices and explains that most devices must be unbound from their kernel driver and bound to vfio-pci before the application runs.

Misconception to kill: "Binding to VFIO turns off the kernel." The kernel still mediates access. It just stops acting as the storage driver for that device.
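
To show what "rebinding" means mechanically, the sketch below builds a dry-run list of sysfs writes that would move a device to vfio-pci. It uses the generic Linux driver_override, unbind, and drivers_probe files; scripts/setup.sh may use a different sequence, so treat this as one possible approach. The function vfio_rebind_plan is hypothetical and deliberately writes nothing.

```python
from typing import List, Optional, Tuple


def vfio_rebind_plan(
    bdf: str, current_driver: Optional[str], sysfs_root: str = "/sys"
) -> List[Tuple[str, str]]:
    """Return (path, value) pairs describing a rebind, without doing it."""
    device_dir = f"{sysfs_root}/bus/pci/devices/{bdf}"
    steps: List[Tuple[str, str]] = []
    if current_driver == "vfio-pci":
        return steps  # already bound where we want it
    # Tell the PCI core which driver should claim this device next.
    steps.append((f"{device_dir}/driver_override", "vfio-pci"))
    if current_driver is not None:
        # Ask the current kernel driver (for example nvme) to let go.
        steps.append((f"{device_dir}/driver/unbind", bdf))
    # Trigger a new probe so the override takes effect.
    steps.append((f"{sysfs_root}/bus/pci/drivers_probe", bdf))
    return steps
```

Printing the plan for a device such as 0000:5e:00.0 before running any setup script is a cheap way to see that the kernel stays in charge: every step is a request to the kernel's PCI core.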

IOMMU Groups

An IOMMU group is the smallest isolation unit the kernel can safely assign. If two PCIe functions cannot be isolated from each other, they appear in the same group. VFIO assignment normally requires the whole group to be safe: all devices in the group must be bound to VFIO-compatible drivers or unbound.

This is why setup can fail even when the target NVMe device looks correct. A bridge, multifunction device, or platform topology can put another active device in the same group.

Operational checklist:

readlink /sys/bus/pci/devices/0000:5e:00.0/iommu_group
ls -l /sys/bus/pci/devices/0000:5e:00.0/iommu_group/devices

If another device in the group is still bound to a kernel driver, VFIO may reject the group. DPDK's Linux driver guide calls out this limitation for devices behind bridges and multifunction groupings.
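
The same check can be scripted. The sketch below reads group membership from sysfs and reports which members would block assignment. The viability rule is simplified to "unbound or bound to vfio-pci"; the kernel's real check has additional exemptions, so this is an approximation. Both function names are hypothetical.

```python
import os
from typing import Dict, List, Optional


def group_members(bdf: str, sysfs_root: str = "/sys") -> Dict[str, Optional[str]]:
    """Map each device in bdf's IOMMU group to its bound driver (or None)."""
    devices_dir = os.path.join(
        sysfs_root, "bus", "pci", "devices", bdf, "iommu_group", "devices"
    )
    members: Dict[str, Optional[str]] = {}
    for name in sorted(os.listdir(devices_dir)):
        driver_link = os.path.join(devices_dir, name, "driver")
        if os.path.islink(driver_link):
            # The link target ends in the driver name, e.g. .../drivers/nvme
            members[name] = os.path.basename(os.readlink(driver_link))
        else:
            members[name] = None  # no driver bound
    return members


def blocking_devices(members: Dict[str, Optional[str]]) -> List[str]:
    """Return group members still owned by a non-VFIO kernel driver."""
    return sorted(
        name
        for name, driver in members.items()
        if driver is not None and driver != "vfio-pci"
    )
```

If blocking_devices(group_members("0000:5e:00.0")) is not empty, each listed device must also be unbound or moved to vfio-pci before the group can be assigned, which may not be acceptable if one of them is in use.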

Hugepages

Hugepages are large pages, commonly 2 MiB or 1 GiB, reserved from a pool that is separate from ordinary pageable memory. SPDK and DPDK use them because DMA needs pinned memory whose physical location does not change, and because large pages reduce TLB misses and the number of mappings that must be tracked.

SPDK setup exposes:

  • HUGEMEM for hugepage memory size (scripts/setup.sh:54).
  • NRHUGE for number of pages (scripts/setup.sh:61).
  • HUGENODE for NUMA node selection (scripts/setup.sh:62).
  • HUGEPGSZ for page size (scripts/setup.sh:67).
  • SKIP_HUGE, CLEAR_HUGE, and persistence options.

The setup script mounts hugetlbfs if needed (scripts/setup.sh:594 through scripts/setup.sh:600) and configures hugepages (scripts/setup.sh:603).

SPDK's DPDK env code can disable hugepages with --no-huge, but lib/env_dpdk/init.c:403 through lib/env_dpdk/init.c:418 shows restrictions: disabling hugepages needs explicit memory sizing and is incompatible with some hugepage options and PA IOVA assumptions.

DPDK's EAL documentation describes hugepage-backed memory allocation, and the Linux getting-started guide notes hugepages must be reserved as root before running applications as non-root.
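
The relationship between a memory size and a page count is worth working through once. The sketch below parses the hugepage lines of /proc/meminfo and computes how many pages a given amount of memory needs. It assumes the size is given in megabytes, as the setup script's help text describes for HUGEMEM; the function names and the sample text are illustrative.

```python
from typing import Dict


def parse_hugepage_meminfo(text: str) -> Dict[str, int]:
    """Extract HugePages_* counters and Hugepagesize (kB) from meminfo text."""
    info: Dict[str, int] = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        key = key.strip()
        if key.startswith("HugePages_") or key == "Hugepagesize":
            info[key] = int(rest.split()[0])
    return info


def pages_needed(hugemem_mb: int, page_kb: int) -> int:
    """Number of hugepages needed to cover hugemem_mb, rounded up."""
    if hugemem_mb < 0 or page_kb <= 0:
        raise ValueError("sizes must be positive")
    return (hugemem_mb * 1024 + page_kb - 1) // page_kb


# Illustrative excerpt in the format of /proc/meminfo.
SAMPLE_MEMINFO = """\
MemTotal:       131072000 kB
HugePages_Total:    1024
HugePages_Free:     1000
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
"""
sample_info = parse_hugepage_meminfo(SAMPLE_MEMINFO)
```

With 2 MiB pages, 2048 MB needs 1024 pages; with 1 GiB pages the same memory needs only 2. Comparing pages_needed() against HugePages_Free is a quick way to predict whether initialization will find enough memory.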

DMA Mapping In SPDK Source

When VFIO is enabled, SPDK maps memory into the IOMMU. The core helper _vfio_iommu_map_dma() in lib/env_dpdk/memory.c:1084 fills a vfio_iommu_type1_dma_map with:

  • read/write flags,
  • virtual address,
  • IOVA,
  • size.

It then calls ioctl(g_vfio.fd, VFIO_IOMMU_MAP_DMA, ...) at lib/env_dpdk/memory.c:1119, unless mapping is deferred because no SPDK-managed VFIO device has been attached yet.
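
The four fields listed above correspond to the layout of struct vfio_iommu_type1_dma_map in the Linux UAPI header linux/vfio.h: two 32-bit fields (argsz, flags) followed by three 64-bit fields (vaddr, iova, size). The sketch below packs that layout to make it tangible. It only builds the bytes and does not call ioctl; build_dma_map and describe_dma_map are hypothetical helpers, and the addresses are invented.

```python
import struct
from typing import Dict

# Flag bits from linux/vfio.h.
VFIO_DMA_MAP_FLAG_READ = 1 << 0   # device may read this range
VFIO_DMA_MAP_FLAG_WRITE = 1 << 1  # device may write this range

# argsz (u32), flags (u32), vaddr (u64), iova (u64), size (u64)
_DMA_MAP = struct.Struct("=IIQQQ")


def build_dma_map(
    vaddr: int, iova: int, size: int, read: bool = True, write: bool = True
) -> bytes:
    """Pack the argument structure that VFIO_IOMMU_MAP_DMA expects."""
    flags = 0
    if read:
        flags |= VFIO_DMA_MAP_FLAG_READ
    if write:
        flags |= VFIO_DMA_MAP_FLAG_WRITE
    return _DMA_MAP.pack(_DMA_MAP.size, flags, vaddr, iova, size)


def describe_dma_map(buf: bytes) -> Dict[str, int]:
    """Unpack the structure back into named fields."""
    argsz, flags, vaddr, iova, size = _DMA_MAP.unpack(buf)
    return {"argsz": argsz, "flags": flags, "vaddr": vaddr, "iova": iova, "size": size}


# Example: map one 2 MiB hugepage with IOVA equal to the virtual address.
demo_map = build_dma_map(vaddr=0x7F0000000000, iova=0x7F0000000000, size=2 << 20)
```

Seeing vaddr and iova side by side makes the contract explicit: the process tells the kernel which of its own memory the device may reach, and under which device-visible address.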

BAR mapping has a similar IOMMU detail. spdk_pci_device_map_bar() maps the BAR and, if IOMMU is enabled, maps the BAR into IOMMU space too (lib/env_dpdk/pci.c:751 through lib/env_dpdk/pci.c:773). In VA mode, SPDK uses the mapped virtual address as IOVA; in PA mode, it uses the physical address.

This is the concrete bridge between abstract diagrams and real source: SPDK does not merely "get a pointer." It arranges permissions so a device and process can safely exchange DMA and MMIO.

Kernel-Bound Versus SPDK-Bound Devices

Kernel-bound NVMe:

application -> syscall/io_uring/libaio -> kernel block layer -> kernel nvme driver -> device

SPDK-bound NVMe:

SPDK app -> SPDK NVMe driver -> VFIO/MMIO/DMA -> device

The second path avoids syscalls and kernel block scheduling on the I/O fast path, but it changes ownership:

  • The kernel no longer exposes /dev/nvmeXnY for normal use.
  • Filesystems mounted from that device must be unmounted first.
  • The SPDK process must reserve and own DMA-capable memory.
  • Operational tooling must use SPDK RPCs or NVMe passthrough paths instead of normal block tools.

Misconception to kill: "SPDK can use a mounted kernel NVMe disk directly." For local PCIe SPDK NVMe, the device is normally rebound away from the kernel NVMe driver. Sharing a mounted block device with SPDK direct hardware access would be data corruption territory.
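
Before rebinding anything, it helps to check what is mounted. The sketch below scans text in the format of /proc/mounts for NVMe namespaces that back a mounted filesystem. It is a partial check under stated assumptions: it does not detect swap, LVM, device-mapper, or RAID users of the device. The function name and sample text are hypothetical.

```python
import re
from typing import Dict, List

# Matches /dev/nvme0n1 and partitions such as /dev/nvme0n1p2.
_NVME_DEV_RE = re.compile(r"^/dev/(nvme\d+n\d+)(?:p\d+)?$")


def mounted_nvme_namespaces(mounts_text: str) -> Dict[str, List[str]]:
    """Map each NVMe namespace to the mount points that use it."""
    in_use: Dict[str, List[str]] = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        match = _NVME_DEV_RE.match(fields[0])
        if match:
            in_use.setdefault(match.group(1), []).append(fields[1])
    return in_use


# Illustrative sample in /proc/mounts format.
SAMPLE_MOUNTS = """\
/dev/nvme0n1p2 / ext4 rw,relatime 0 0
/dev/nvme0n1p1 /boot/efi vfat rw 0 0
/dev/sda1 /data xfs rw 0 0
tmpfs /run tmpfs rw 0 0
"""
sample_in_use = mounted_nvme_namespaces(SAMPLE_MOUNTS)
```

Any namespace that appears in the result belongs to the kernel for now and should not be handed to SPDK until its filesystems are unmounted.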

NUMA

PCIe devices attach to a specific CPU socket, and memory also belongs to NUMA nodes. If the SPDK polling core and its hugepages are on socket 0 but the NVMe device is behind socket 1, every doorbell write and every device DMA must cross the inter-socket link.

Symptoms:

  • Lower bandwidth than expected.
  • Higher tail latency.
  • CPU cycles spent waiting on remote memory.
  • Performance changes when core masks or hugepage allocation changes.

SPDK setup has HUGENODE; DPDK and SPDK expose NUMA-aware allocation. The beginner operational rule is: keep the device, the polling core, and the hugepage memory on the same NUMA node when possible.
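
Linux reports the pieces needed for this check in sysfs: a device's node in /sys/bus/pci/devices/<BDF>/numa_node and each node's CPUs in /sys/devices/system/node/node<N>/cpulist. The sketch below works on the contents of those files. The helper names are hypothetical, and the node layout in the usage note is invented.

```python
from typing import Dict, Optional, Set


def parse_cpulist(text: str) -> Set[int]:
    """Expand a kernel cpulist such as '0-15,32-47' into a set of CPU ids."""
    cpus: Set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def node_of_cpu(cpu: int, node_cpulists: Dict[int, str]) -> Optional[int]:
    """Return the NUMA node that owns the given CPU, if any."""
    for node, cpulist in node_cpulists.items():
        if cpu in parse_cpulist(cpulist):
            return node
    return None


def placement_is_local(
    device_node: int, core: int, hugepage_node: int, node_cpulists: Dict[int, str]
) -> bool:
    """True if device, polling core, and hugepages share one NUMA node.

    A device_node of -1 (the kernel's value for 'unknown') never matches.
    """
    core_node = node_of_cpu(core, node_cpulists)
    return core_node is not None and device_node == core_node == hugepage_node
```

For example, with node 0 owning CPUs 0-7 and node 1 owning CPUs 8-15, a device on node 1 polled from core 9 is local only if the hugepages were also reserved on node 1.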

Page Faults And Pinned Memory

DMA cannot wait for the kernel to page memory in from swap. Device DMA buffers must be resident and mapped. Hugepages are reserved up front and are not swapped out, which is what DPDK and SPDK need. Normal malloc() memory may be unsuitable because:

  • it can be paged,
  • it may not have stable physical mappings,
  • it may not be IOVA-contiguous,
  • it may not be registered with VFIO/IOMMU.

SPDK has APIs for DMA-safe allocation; later chapters will cover iobuf, mempools, and memory domains.

Failure Modes

  • No IOMMU enabled: vfio-pci cannot provide isolation, so binding fails unless VFIO's unsafe no-IOMMU mode or a UIO driver is used instead.
  • Wrong driver: device remains bound to nvme, so SPDK cannot claim it through VFIO.
  • Active filesystem: rebinding a live kernel device risks data loss.
  • IOMMU group not viable: another device in the group is still kernel-bound.
  • Hugepages missing: EAL initialization fails or SPDK cannot allocate DMA buffers.
  • Hugepages on wrong NUMA node: performance degrades silently instead of the setup failing loudly.
  • RLIMIT_MEMLOCK too low: non-root process cannot lock enough memory.
  • DMA entry limit reached: many small mappings, especially with --no-huge, exhaust VFIO map entries.
  • IOVA mode mismatch: device receives addresses it cannot translate.
  • BAR mapping failure: process cannot access MMIO registers.
  • Page fault in data path: using non-DMA-safe memory creates failures or bounce-buffer overhead.
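
Some of these failures can be predicted before launch. As one example, the sketch below checks the RLIMIT_MEMLOCK item using Python's standard resource module. The helper memlock_allows is hypothetical, and comparing against the soft limit is my assumption about what matters for an unprivileged process.

```python
import resource
from typing import Optional, Tuple


def memlock_allows(
    bytes_needed: int, limits: Optional[Tuple[int, int]] = None
) -> bool:
    """True if the soft RLIMIT_MEMLOCK permits locking bytes_needed bytes.

    limits is a (soft, hard) pair; by default it is read from the
    current process.
    """
    if limits is None:
        limits = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    soft, _hard = limits
    # RLIM_INFINITY is a sentinel, so it must be tested before comparing.
    if soft == resource.RLIM_INFINITY:
        return True
    return soft >= bytes_needed
```

Calling memlock_allows(2 << 30) from the account that will run the SPDK application shows whether a 2 GiB pinned region fits under the current limit; if not, raise the limit with ulimit -l or the system's limits configuration before starting.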

Operational Lab

Do this as a dry run on a development machine; do not rebind production devices.

  1. Pick a PCI device BDF from lspci.
  2. Inspect its current driver:
lspci -k -s 0000:5e:00.0
  3. Inspect its IOMMU group:
readlink /sys/bus/pci/devices/0000:5e:00.0/iommu_group
ls -l /sys/bus/pci/devices/0000:5e:00.0/iommu_group/devices
  4. Inspect hugepage state:
grep -i huge /proc/meminfo
find /sys/devices/system/node -path '*hugepages*' -name nr_hugepages -print -exec cat {} \;
  5. Explain whether the device could be safely rebound to VFIO and what other devices would be affected.

Source Reading Exercise

Read:

  1. scripts/setup.sh:33 through scripts/setup.sh:104.
  2. scripts/setup.sh:194 through scripts/setup.sh:213.
  3. scripts/setup.sh:403 through scripts/setup.sh:418.
  4. scripts/setup.sh:500 through scripts/setup.sh:607.
  5. lib/env_dpdk/pci.c:740 through lib/env_dpdk/pci.c:775.
  6. lib/env_dpdk/memory.c:1084 through lib/env_dpdk/memory.c:1127.
  7. lib/env_dpdk/init.c:523 through lib/env_dpdk/init.c:547.

Answer:

  • Which script variables control hugepage allocation?
  • Where does setup warn about IOMMU groups?
  • Which driver does setup prefer?
  • What fields are passed to VFIO_IOMMU_MAP_DMA?
  • When does SPDK choose VA as IOVA for BAR mapping?

Self-Check

  1. What is a PCI BDF?
  2. What is a BAR?
  3. Why is MMIO different from normal memory?
  4. What does DMA let the device do?
  5. Why is an IOMMU important for userspace drivers?
  6. What does VFIO provide?
  7. Why do hugepages matter for SPDK?
  8. Why can an IOMMU group stop device binding?
  9. What is the difference between VA, PA, and IOVA?

References

  • Local source: scripts/setup.sh.
  • Local source: include/spdk/env.h.
  • Local source: lib/env_dpdk/pci.c.
  • Local source: lib/env_dpdk/memory.c.
  • Local source: lib/env_dpdk/init.c.
  • Linux kernel VFIO documentation: https://www.kernel.org/doc/html/v6.6/driver-api/vfio.html
  • Linux kernel IOMMU userspace API documentation: https://www.kernel.org/doc/html/v6.7/userspace-api/iommu.html
  • Linux kernel IOMMUFD documentation: https://www.kernel.org/doc/html/v6.11/userspace-api/iommufd.html
  • DPDK Linux drivers guide, VFIO and IOMMU groups: https://doc.dpdk.org/guides/linux_gsg/linux_drivers.html
  • DPDK EAL programmer's guide: https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html
  • DPDK non-root and hugepage setup guide: https://doc.dpdk.org/guides/linux_gsg/enable_func.html
  • NVM Express specifications landing page: https://nvmexpress.org/specifications/