SPDK From First Principles

SPDK deep learning path

Chapter 8: DPDK EAL And SPDK Env

Source: drafts/runtime/08-dpdk-eal-and-spdk-env.md

Reader Promise

By the end of this chapter, a beginner should be able to explain why an SPDK process starts by initializing an "environment" before it initializes storage subsystems, why that environment is usually DPDK EAL, and why errors about hugepages, VFIO, IOVA, NUMA, core masks, and permissions are not random setup chores. They are the foundation that lets later SPDK code poll devices from userspace and hand DMA-safe buffers to hardware.

This chapter is intentionally practical. If a production diskengine node fails before bdevs appear, the failure usually lives here: CPU selection, hugepage memory, PCI ownership, IOMMU/VFIO, shared memory IDs, or virtual-to-physical translation.

Mental Model

The SPDK env layer is the process contract with the host machine.

Before SPDK can submit fast I/O, it needs answers to questions ordinary C programs usually ignore:

  • Which CPU cores will run the event loops?
  • Where can the process allocate memory that will not move?
  • Can hardware DMA to that memory?
  • How does a userspace pointer become an IOVA or physical address?
  • Which PCI devices are visible to the process?
  • Is this process the only SPDK process using the hugepage namespace, or is it sharing state with another process?

DPDK's Environment Abstraction Layer (EAL) answers most of these questions. SPDK wraps it in spdk_env_* APIs so most SPDK libraries do not call DPDK directly.

The beginner trap is to think of EAL as "the networking library." In SPDK, EAL is the platform bring-up layer: core masks, hugepage-backed allocation, PCI enumeration, memzones, mempools, and thread launching.

Where The Env Fits In Startup

Prose diagram:

main()
  prepares spdk_app_opts
  calls spdk_app_start()
    app_setup_env()
      fills spdk_env_opts from app opts
      calls spdk_env_init()
        builds DPDK EAL command line
        calls rte_eal_init()
        initializes PCI env, memory map, vtophys
    initializes reactors and threads
    initializes subsystems

This chapter focuses on the app_setup_env() -> spdk_env_init() segment. The next startup chapter picks up at reactors and subsystems.

Source Anchors

  • include/spdk/env.h: struct spdk_env_opts, spdk_env_opts_init(), spdk_env_init(), spdk_env_fini(), spdk_malloc(), spdk_zmalloc(), spdk_dma_malloc(), spdk_dma_zmalloc(), spdk_mempool_create(), spdk_memzone_reserve(), spdk_vtophys()
  • include/spdk/event.h: struct spdk_app_opts, spdk_app_opts_init(), spdk_app_start()
  • lib/event/app.c: app_setup_env(), spdk_app_start(), spdk_app_opts_init()
  • lib/env_dpdk/init.c: build_eal_cmdline(), spdk_env_init(), spdk_env_dpdk_post_init(), spdk_env_fini()
  • lib/env_dpdk/env.c: spdk_malloc(), spdk_zmalloc(), spdk_dma_malloc_socket(), spdk_dma_zmalloc_socket(), spdk_mempool_create_ctor(), spdk_memzone_reserve_aligned()
  • lib/env_dpdk/memory.c: vtophys_init(), spdk_vtophys(), mem_disable_vtophys(), vtophys_notify(), vtophys_iommu_init()
  • lib/env_dpdk/pci.c: spdk_pci_device_map_bar(), spdk_pci_device_unmap_bar(), hotplug and DMA BAR mapping paths
  • scripts/setup.sh: host preparation for hugepages, VFIO/UIO binding, and device setup

The Two Option Structures

SPDK has both application options and environment options.

struct spdk_app_opts is the public event-framework option structure. It includes things such as the app name, JSON config, RPC address, reactor mask, memory size, PCI allow/block lists, hugepage options, interrupt mode, trace options, and delay_subsystem_init.

struct spdk_env_opts is lower-level. It is what the env implementation needs: process name, core mask or lcore map, shared memory ID, memory channel count, main core, hugepage flags, PCI settings, IOVA mode, base virtual address, and NUMA behavior.

The bridge is lib/event/app.c:app_setup_env(). It creates a local struct spdk_env_opts, calls spdk_env_opts_init(), copies fields from struct spdk_app_opts, then calls spdk_env_init().

The important beginner detail: a command-line option often lands in spdk_app_opts, but the failure message may come later from DPDK EAL or the env layer after the value has been translated.

How EAL Arguments Are Built

lib/env_dpdk/init.c:build_eal_cmdline() converts spdk_env_opts into DPDK arguments. It is worth reading slowly because many startup failures are explained there.

Key decisions:

  • If shm_id < 0, SPDK adds --no-shconf. That is a single-process style where DPDK shared configuration files are disabled.
  • Exactly one of core_mask and lcore_map must be set. If both are set or neither is set, initialization fails.
  • lcore_map becomes a DPDK --lcores=... argument.
  • A core mask beginning with [ is treated as a core list and converted to -l.
  • A core mask beginning with - is treated as literal EAL arguments.
  • Otherwise, the value is passed as -c <mask>.
  • mem_channel > 0 becomes -n.
  • mem_size >= 0 becomes -m.
  • no_huge disables hugepage-backed memory and is validated against incompatible settings, such as hugepage-specific options and iova-mode=pa.
  • no_pci adds --no-pci and disables vtophys mapping.
  • iova_mode is passed through to EAL where supported.

That means "reactor mask" is not merely SPDK policy. It becomes an EAL CPU selection argument. If it is malformed, DPDK can reject the process before any SPDK subsystem exists.
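
The CPU-selection rules in the list above can be sketched as a small standalone function. This is a simplified model for study, not the code in build_eal_cmdline(): format_cpu_arg() is a hypothetical name, and the sketch assumes each rule produces exactly one argument string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Simplified model of the CPU-selection rules (hypothetical helper,
 * not an SPDK function). Writes one EAL-style argument into out.
 * Returns 0 on success, -1 on invalid input or truncation.
 */
int format_cpu_arg(const char *core_mask, const char *lcore_map,
                   char *out, size_t out_len)
{
    int n;

    assert(out != NULL && out_len > 0);

    /* Exactly one of the two inputs may be set. */
    if ((core_mask == NULL) == (lcore_map == NULL)) {
        return -1;
    }

    if (lcore_map != NULL) {
        n = snprintf(out, out_len, "--lcores=%s", lcore_map);
    } else if (core_mask[0] == '-') {
        /* Literal EAL argument, passed through unchanged. */
        n = snprintf(out, out_len, "%s", core_mask);
    } else if (core_mask[0] == '[') {
        /* Core list: skip the leading '[' and drop a trailing ']'. */
        n = snprintf(out, out_len, "-l %s", core_mask + 1);
        if (n > 0 && (size_t)n < out_len && out[n - 1] == ']') {
            out[n - 1] = '\0';
        }
    } else {
        /* Anything else is treated as a hexadecimal core mask. */
        n = snprintf(out, out_len, "-c %s", core_mask);
    }

    /* Treat formatting errors and truncation as failure. */
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

Calling format_cpu_arg("[0,2-3]", NULL, buf, sizeof(buf)) leaves "-l 0,2-3" in buf, while passing both a core mask and an lcore map returns -1 before anything is formatted. That early return is the model's version of the "exactly one must be set" rule.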

What spdk_env_init() Actually Does

lib/env_dpdk/init.c:spdk_env_init() is the DPDK-backed implementation.

Its sequence is:

  1. Validate whether this is first initialization or reinitialization.
  2. Validate opts_user and opts_size.
  3. Copy options using env_copy_opts().
  4. Initialize OpenSSL settings.
  5. Call build_eal_cmdline().
  6. Print the DPDK EAL parameter list.
  7. Copy the argument array because DPDK may rearrange it.
  8. Call rte_eal_init().
  9. Determine whether legacy memory mode is needed.
  10. Call spdk_env_dpdk_post_init().

spdk_env_dpdk_post_init() initializes:

  • PCI environment through pci_env_init().
  • SPDK memory map through mem_map_init().
  • virtual-to-physical translation through vtophys_init().

So when spdk_env_init() succeeds, the process has more than "DPDK started." It has an SPDK-compatible env implementation ready for memory allocation, PCI, and address translation.
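
Step 7 of the sequence is easy to skim past. The following is a minimal sketch of why a shallow copy of the argument array helps when a callee may permute it. Both dup_arg_array() and reverse_args() are hypothetical names for illustration; reverse_args() merely stands in for any parser that reorders its argv.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: shallow-copy an argv-style array. The copy holds
 * the same string pointers in a new pointer array, so a callee may
 * permute the copy while the original order stays intact for cleanup.
 * The caller releases the returned array with free().
 */
char **dup_arg_array(char **args, int argc)
{
    char **copy;

    assert(args != NULL && argc >= 0);

    /* One extra zeroed slot keeps the copy NULL-terminated. */
    copy = calloc((size_t)argc + 1, sizeof(*copy));
    if (copy == NULL) {
        return NULL;
    }
    memcpy(copy, args, (size_t)argc * sizeof(*copy));
    return copy;
}

/* Stand-in for a parser that reorders the array it is given. */
void reverse_args(char **args, int argc)
{
    for (int i = 0, j = argc - 1; i < j; i++, j--) {
        char *tmp = args[i];
        args[i] = args[j];
        args[j] = tmp;
    }
}
```

After reverse_args() runs on the copy, the original array still lists the arguments in build order, so the owner can free each string exactly once without caring what the parser did.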

Hugepages And Why Normal malloc() Is Not Enough

SPDK storage paths often pass buffers to hardware or to other DMA-capable components. Normal heap memory is a poor fit: the kernel can swap it out or migrate it to a different physical page, a single buffer can be scattered across many small physical pages, and it lacks the address translation metadata SPDK needs.

SPDK's DMA allocation APIs are declared in include/spdk/env.h:

  • spdk_dma_malloc()
  • spdk_dma_malloc_socket()
  • spdk_dma_zmalloc()
  • spdk_dma_zmalloc_socket()
  • spdk_dma_realloc()
  • spdk_dma_free()

In the DPDK env implementation, lib/env_dpdk/env.c:spdk_dma_malloc_socket() calls spdk_malloc() with SPDK_MALLOC_DMA | SPDK_MALLOC_SHARE. spdk_malloc() uses DPDK's rte_malloc_socket() and enforces at least cache-line alignment.

Beginner rule:

If an SPDK API says "must be allocated with spdk_dma_malloc() or variants," do not substitute malloc(). The code may compile, but a controller, DMA engine, RDMA NIC, or zero-copy path may fail later when it tries to translate or register the buffer.

Vtophys And IOVA

spdk_vtophys() is the reader-friendly name for a hard problem: translate a virtual address in the process into an address usable for DMA. In the DPDK env, see lib/env_dpdk/memory.c:spdk_vtophys().

It uses g_vtophys_map, initialized by lib/env_dpdk/memory.c:vtophys_init(). The map is populated by callbacks such as vtophys_notify() as memory is registered, mapped, or unmapped.

Important modes:

  • With IOVA as physical address, devices use physical addresses.
  • With IOVA as virtual address, devices are given IOVAs that match the process's virtual addresses, and the IOMMU translates them to physical memory.
  • With --no-pci, lib/env_dpdk/init.c:build_eal_cmdline() calls mem_disable_vtophys(), and spdk_vtophys() may return the virtual address directly because no PCI DMA translation is needed.

Misconception to kill:

"Hugepages automatically mean every pointer can be used for DMA." No. The buffer still needs to come from the right allocator or memory registration path, and the device must be able to address it under the current IOVA/IOMMU mode.
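
To make the translation idea concrete, here is a toy lookup table, written as a sketch under stated assumptions. It is not the g_vtophys_map implementation. All toy_* names, the 2 MiB granule, the fixed table size, and the error sentinel are illustrative choices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE   (2ULL * 1024 * 1024) /* assumed 2 MiB granule */
#define TOY_MAX_PAGES   16
#define TOY_XLATE_ERROR UINT64_MAX           /* hypothetical sentinel */

struct toy_map_entry {
    uint64_t vaddr; /* page-aligned virtual address */
    uint64_t iova;  /* page-aligned address the device would use */
};

struct toy_map {
    struct toy_map_entry pages[TOY_MAX_PAGES];
    size_t count;
};

/* Register one page-aligned translation. Returns 0 or -1. */
int toy_map_register(struct toy_map *map, uint64_t vaddr, uint64_t iova)
{
    assert(map != NULL);
    if (vaddr % TOY_PAGE_SIZE != 0 || iova % TOY_PAGE_SIZE != 0 ||
        map->count == TOY_MAX_PAGES) {
        return -1;
    }
    map->pages[map->count].vaddr = vaddr;
    map->pages[map->count].iova = iova;
    map->count++;
    return 0;
}

/* Translate an address inside a registered page; otherwise report an error. */
uint64_t toy_translate(const struct toy_map *map, uint64_t vaddr)
{
    uint64_t base = vaddr - (vaddr % TOY_PAGE_SIZE);

    assert(map != NULL);
    for (size_t i = 0; i < map->count; i++) {
        if (map->pages[i].vaddr == base) {
            /* Keep the offset within the page. */
            return map->pages[i].iova + (vaddr - base);
        }
    }
    return TOY_XLATE_ERROR;
}
```

A pointer that was never registered translates to TOY_XLATE_ERROR, which is the toy equivalent of handing a plain malloc() buffer to a DMA path. In an IOVA-as-VA setup, the registered iova would simply equal vaddr.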

PCI Ownership And VFIO

SPDK is a userspace storage stack. For direct NVMe PCI access, the kernel NVMe driver must not own the controller. The device is usually bound to vfio-pci, and the process uses VFIO and DPDK PCI enumeration.

This is why scripts/setup.sh matters. It is not a ceremonial install script. It prepares hugepages and driver binding so DPDK can discover and map devices.

Failure patterns:

  • Kernel still owns the NVMe device: SPDK cannot directly drive it.
  • IOMMU/VFIO is unavailable or misconfigured: DPDK may fail to map DMA.
  • Running without needed privileges: lib/event/app.c:app_setup_env() logs that you may need root after spdk_env_init() fails and getuid() != 0.
  • PCI allowlist excludes the target device: env initializes, but the expected controller does not appear.

Mempools And Memzones

SPDK uses fixed-size object pools heavily because runtime allocation is expensive and failure-prone in hot I/O paths.

The public mempool APIs live in include/spdk/env.h:

  • spdk_mempool_create()
  • spdk_mempool_create_ctor()
  • spdk_mempool_get()
  • spdk_mempool_get_bulk()
  • spdk_mempool_put()
  • spdk_mempool_put_bulk()
  • spdk_mempool_count()
  • spdk_mempool_lookup()

The DPDK-backed implementations are in lib/env_dpdk/env.c.

Examples elsewhere:

  • lib/thread/thread.c:_thread_lib_init() creates g_spdk_msg_mempool for cross-thread messages.
  • lib/event/reactor.c:spdk_reactors_init() creates g_spdk_event_mempool for events.
  • lib/thread/iobuf.c:spdk_iobuf_initialize() creates shared iobuf backing pools through lower-level ring and memory helpers.

Memzones are named shared memory regions. They support cases where a component needs a named, aligned region rather than many small objects.
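
The fixed-size pool idea can be illustrated with a few lines of plain C. This is a toy, single-threaded sketch with hypothetical toy_pool_* names. It uses ordinary heap memory and has none of the per-core caching or DMA properties of a real mempool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy fixed-size object pool: all memory is allocated once, up front. */
struct toy_pool {
    void **free_items; /* stack of currently available objects */
    size_t free_count;
    size_t capacity;
    char *backing;     /* one contiguous allocation for all objects */
};

struct toy_pool *toy_pool_create(size_t count, size_t elem_size)
{
    struct toy_pool *p;

    assert(count > 0 && elem_size > 0);

    p = calloc(1, sizeof(*p));
    if (p == NULL) {
        return NULL;
    }
    p->backing = calloc(count, elem_size);
    p->free_items = calloc(count, sizeof(*p->free_items));
    if (p->backing == NULL || p->free_items == NULL) {
        free(p->backing);
        free(p->free_items);
        free(p);
        return NULL;
    }
    for (size_t i = 0; i < count; i++) {
        p->free_items[i] = p->backing + i * elem_size;
    }
    p->free_count = count;
    p->capacity = count;
    return p;
}

/* Returns NULL when the pool is exhausted; it never grows. */
void *toy_pool_get(struct toy_pool *p)
{
    assert(p != NULL);
    return p->free_count > 0 ? p->free_items[--p->free_count] : NULL;
}

void toy_pool_put(struct toy_pool *p, void *obj)
{
    assert(p != NULL && obj != NULL && p->free_count < p->capacity);
    p->free_items[p->free_count++] = obj;
}

void toy_pool_free(struct toy_pool *p)
{
    if (p != NULL) {
        free(p->backing);
        free(p->free_items);
        free(p);
    }
}
```

The point to notice is toy_pool_get() returning NULL. A pool sized at startup turns "out of memory" into a condition the hot path must handle explicitly, usually by retrying later or applying backpressure.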

NUMA Is A Performance Feature And A Failure Mode

SPDK often runs with one or more reactors pinned to cores. Memory locality matters because a core polling an NVMe qpair or transport queue may touch buffers, descriptors, and completion state millions of times per second.

struct spdk_env_opts includes enforce_numa. When it is set, lib/env_dpdk/init.c:build_eal_cmdline() calls mem_enforce_numa(). In lib/env_dpdk/env.c:spdk_malloc() and spdk_zmalloc(), an allocation that fails on the requested NUMA node is retried with SOCKET_ID_ANY, unless NUMA is enforced, in which case the failure is returned to the caller.

Beginner rule:

If performance is unexpectedly uneven, inspect NUMA. If startup fails only with strict NUMA options, inspect hugepage distribution per NUMA node.
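
The fallback behavior can be modeled in a few lines. This is a sketch of the decision only, with hypothetical toy_* names and a two-node machine assumed; it tracks free bytes per node instead of allocating real memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NODE_ANY     (-1)
#define TOY_ALLOC_FAILED (-2)
#define TOY_NUM_NODES    2 /* assumed two-socket host */

struct toy_numa_mem {
    size_t free_bytes[TOY_NUM_NODES];
    bool enforce_numa;
};

/*
 * Returns the node that satisfied the request, or TOY_ALLOC_FAILED.
 * A request for a specific node falls back to any node only when
 * enforce_numa is false.
 */
int toy_numa_alloc(struct toy_numa_mem *m, size_t size, int node)
{
    assert(m != NULL && node >= TOY_NODE_ANY && node < TOY_NUM_NODES);

    if (node != TOY_NODE_ANY) {
        if (m->free_bytes[node] >= size) {
            m->free_bytes[node] -= size;
            return node;
        }
        if (m->enforce_numa) {
            return TOY_ALLOC_FAILED; /* strict mode: no fallback */
        }
    }

    /* Fallback (or explicit "any"): first node with enough room. */
    for (int i = 0; i < TOY_NUM_NODES; i++) {
        if (m->free_bytes[i] >= size) {
            m->free_bytes[i] -= size;
            return i;
        }
    }
    return TOY_ALLOC_FAILED;
}
```

With node 0 nearly empty, the same request succeeds on node 1 in relaxed mode and fails in strict mode. That is the pattern behind both symptoms in the beginner rule: silent remote-node memory shows up as uneven performance, and strict mode turns the same shortage into a startup failure.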

Edge Cases And Failure Modes

  • Core mask and lcore map both set: build_eal_cmdline() rejects it.
  • Neither core mask nor lcore map set at env level: rejected, though spdk_app_start() sets a default reactor mask if the app left both unset.
  • --no-huge combined with hugepage-specific options: rejected.
  • --no-huge without explicit memory sizing: rejected by the DPDK env code path.
  • iova-mode=pa with --no-huge: rejected in the no-huge checks.
  • no_pci disables PCI and vtophys behavior; that is valid for some tests but wrong for direct NVMe PCI.
  • Root permissions may be needed for hugepages, VFIO, device binding, or memory locking.
  • Reinitialization has special rules: spdk_env_init(NULL) is used after a prior spdk_env_fini() in the same process.
  • opts_size too small can hide newer fields. Both app and env options use opts_size to preserve ABI compatibility.
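
The opts_size mechanism in the last bullet is easier to see in code. The following is a minimal sketch of the general pattern, assuming a hypothetical struct toy_opts and toy defaults. It is not a copy of env_copy_opts().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical options struct; the caller sets opts_size. */
struct toy_opts {
    size_t opts_size;
    int shm_id;
    int mem_size;
    int enforce_numa; /* imagine this was added in a newer release */
};

/* True if `field` lies entirely within the first `size` bytes. */
#define TOY_FIELD_FITS(field, size) \
    (offsetof(struct toy_opts, field) + \
     sizeof(((struct toy_opts *)0)->field) <= (size))

/*
 * Copy only the fields the caller's struct is known to contain.
 * Fields beyond src->opts_size keep the library default.
 */
void toy_opts_copy(struct toy_opts *dst, const struct toy_opts *src)
{
    assert(dst != NULL && src != NULL);

    /* Toy defaults (assumed values for this sketch). */
    dst->opts_size = sizeof(*dst);
    dst->shm_id = -1;
    dst->mem_size = -1;
    dst->enforce_numa = 0;

    if (TOY_FIELD_FITS(shm_id, src->opts_size)) {
        dst->shm_id = src->shm_id;
    }
    if (TOY_FIELD_FITS(mem_size, src->opts_size)) {
        dst->mem_size = src->mem_size;
    }
    if (TOY_FIELD_FITS(enforce_numa, src->opts_size)) {
        dst->enforce_numa = src->enforce_numa;
    }
}
```

A caller compiled against an older header reports a smaller opts_size, so the library never reads past the end of that caller's struct. The flip side is the failure mode named above: a newer field the caller believes it set is ignored if opts_size does not cover it.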

Misconceptions To Kill

  • "SPDK bypasses Linux, so Linux setup does not matter." It bypasses parts of the kernel I/O path, but it depends heavily on Linux hugepages, VFIO/IOMMU, PCI binding, and process permissions.
  • "A reactor mask is just an SPDK preference." It becomes an EAL CPU argument and determines where OS threads are launched.
  • "DMA-safe memory is just aligned memory." Alignment is necessary but not sufficient. The memory must also be pinned, registered with the env layer, and translatable to an address the device can use.
  • "If env init succeeds, all NVMe devices are ready." Env init means the platform is ready. Controllers still need probing, attachment, bdev creation, and subsystem config.
  • "Mempool exhaustion is like malloc slowness." In hot paths, exhaustion usually means a designed backpressure path, NOMEM retry path, or fatal configuration error.

Diskengine Relevance

In an excloud diskengine-style deployment, SPDK often runs as an external daemon controlled by RPC. If the daemon never reaches RPC runtime state, diskengine cannot reconcile devices, volumes, or exports.

When diagnosing an early failure, classify it before chasing bdev code:

  • Env failure: EAL rejects arguments, hugepages unavailable, VFIO missing.
  • Startup failure: reactors or app thread fail after env init.
  • Subsystem failure: one subsystem init callback returns non-zero.
  • Config failure: startup or runtime JSON RPC fails.

This chapter covers the first class.

Prose Diagram: Address Translation Path

Imagine a write buffer as a card moving through five boxes:

  1. The application has a C pointer, like 0x7f....
  2. The pointer comes from spdk_dma_zmalloc(), so it belongs to SPDK/DPDK managed memory.
  3. SPDK's memory map knows the virtual range.
  4. spdk_vtophys() translates it to an address valid for the current IOVA mode.
  5. A device or transport can use that address in a descriptor, SGE, or DMA mapping.

If the card starts from plain malloc(), it may fall out between boxes 2 and 3.

Source Reading Exercise

Read these functions in order:

  1. lib/event/app.c:spdk_app_start()
  2. lib/event/app.c:app_setup_env()
  3. lib/env_dpdk/init.c:spdk_env_init()
  4. lib/env_dpdk/init.c:build_eal_cmdline()
  5. lib/env_dpdk/init.c:spdk_env_dpdk_post_init()
  6. lib/env_dpdk/memory.c:vtophys_init()

Questions while reading:

  • Where is the default reactor mask chosen?
  • Which options are copied from app opts into env opts?
  • What happens if DPDK returns EALREADY?
  • Which function initializes vtophys?
  • Which options disable or alter vtophys behavior?

Operational Lab

No live NVMe device is required.

  1. Run scripts/setup.sh status and write down hugepage count, device binding, and IOMMU/VFIO status.
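  2. Inspect an SPDK app command line that uses -m, -c, --no-pci, or --wait-for-rpc. Note that SPDK app flags and EAL flags reuse letters with different meanings: on an SPDK app, -m is the reactor core mask and -c is the JSON config file, while in EAL, -c is the core mask and -m is the memory size.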
  3. Map each option to fields in struct spdk_app_opts and struct spdk_env_opts.
  4. Predict the DPDK EAL argument that build_eal_cmdline() will produce.
  5. Compare your prediction to the startup log line beginning with DPDK EAL parameters.

Debug variation:

  • Try a deliberately invalid combination in a disposable dev environment, such as both a core mask and an lcore map, and identify where initialization rejects it.

Self-Check

  1. Why does SPDK initialize env before reactors?
  2. What is the difference between spdk_app_opts and spdk_env_opts?
  3. Why does spdk_dma_zmalloc() matter for DMA paths?
  4. What is the role of spdk_vtophys()?
  5. Why can --no-pci be useful for tests but wrong for NVMe PCI?
  6. What does opts_size protect against?
  7. How can NUMA settings affect both startup and performance?

References

  • Local source: include/spdk/env.h
  • Local source: include/spdk/event.h
  • Local source: lib/event/app.c
  • Local source: lib/env_dpdk/init.c
  • Local source: lib/env_dpdk/env.c
  • Local source: lib/env_dpdk/memory.c
  • Local source: scripts/setup.sh