SPDK From First Principles

SPDK deep learning path

Chapter 12: Memory, Iobuf, Mempools, Zero Copy

Source: drafts/runtime/12-memory-iobuf-mempools-zero-copy.md

Reader Promise

By the end of this chapter, a beginner should be able to explain SPDK's memory categories, why DMA-safe allocation is different from ordinary allocation, how mempools and iobuf reduce hot-path allocation, how iobuf wait queues handle NOMEM pressure, and what "zero copy" really means in SPDK contexts.

The chapter also kills a dangerous myth: zero copy does not mean "no memory management." It usually means the rules for memory ownership, alignment, lifetime, and device compatibility become stricter.

Mental Model

SPDK memory choices answer three questions:

  1. Who owns this memory?
  2. Can hardware or another process safely access it?
  3. What happens when memory is temporarily unavailable?

Common categories:

  • ordinary C heap: okay for control-plane metadata, not generally for DMA
  • SPDK DMA memory: allocated through spdk_dma_*, suitable for many device paths
  • mempool objects: fixed-size reusable objects, often for messages or I/O descriptors
  • memzones: named shared/aligned regions
  • iobuf buffers: shared runtime data buffers with per-thread caches and wait queues
  • memory domains: abstraction for memory owned by another DMA-capable domain such as RDMA

Source Anchors

  • include/spdk/env.h: spdk_malloc(), spdk_zmalloc(), spdk_dma_malloc(), spdk_dma_zmalloc(), spdk_dma_free(), spdk_mempool_create(), spdk_mempool_get(), spdk_mempool_put(), spdk_memzone_reserve(), spdk_vtophys()
  • lib/env_dpdk/env.c: spdk_malloc(), spdk_zmalloc(), spdk_dma_malloc_socket(), spdk_dma_zmalloc_socket(), spdk_mempool_create_ctor(), spdk_mempool_get(), spdk_mempool_put()
  • lib/env_dpdk/memory.c: vtophys_init(), spdk_vtophys(), vtophys_notify(), vtophys_iommu_init()
  • include/spdk/dma.h: struct spdk_memory_domain, spdk_memory_domain_create(), spdk_memory_domain_set_translation(), spdk_memory_domain_translate_data(), spdk_memory_domain_transfer_data(), spdk_memory_domain_get_system_domain()
  • include/spdk/thread.h: struct spdk_iobuf_opts, struct spdk_iobuf_channel, spdk_iobuf_initialize(), spdk_iobuf_finish(), spdk_iobuf_register_module(), spdk_iobuf_channel_init(), spdk_iobuf_get(), spdk_iobuf_put(), spdk_iobuf_entry_abort(), spdk_iobuf_get_stats()
  • lib/thread/iobuf.c: spdk_iobuf_initialize(), spdk_iobuf_set_opts(), spdk_iobuf_channel_init(), spdk_iobuf_channel_fini(), spdk_iobuf_get(), spdk_iobuf_put(), spdk_iobuf_for_each_entry(), spdk_iobuf_entry_abort(), spdk_iobuf_get_stats()
  • lib/nvmf/transport.c: spdk_iobuf_register_module() use, spdk_iobuf_channel_init() use, spdk_iobuf_get() use, spdk_iobuf_put() use, nvmf_request_iobuf_get_cb()
  • include/spdk/nvmf.h: opts->no_srq and opts->zero_copy related target options, including "Use zero-copy operations if the underlying bdev supports them"
  • include/spdk_internal/sock_module.h: zerocopy_threshold for socket implementations

DMA-Safe Allocation

SPDK APIs that interact with devices often require DMA-safe buffers. The public API is in include/spdk/env.h:

  • spdk_dma_malloc()
  • spdk_dma_malloc_socket()
  • spdk_dma_zmalloc()
  • spdk_dma_zmalloc_socket()
  • spdk_dma_realloc()
  • spdk_dma_free()

In the DPDK-based env, lib/env_dpdk/env.c:spdk_dma_zmalloc_socket() calls spdk_zmalloc() with SPDK_MALLOC_DMA | SPDK_MALLOC_SHARE. spdk_zmalloc() allocates through DPDK and aligns to at least the cache line size.

These functions take a parameter named unused. It must be NULL; the implementation returns NULL if it is not.

Beginner rule:

Use the allocation family expected by the API you call. Do not pass a stack buffer or ordinary heap buffer to a DMA path just because the type is void *.
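The two rules above (the unused argument must be NULL, and alignment is raised to at least a cache line) can be illustrated with a small standalone model. This is only a sketch: toy_zmalloc() and TOY_CACHE_LINE are hypothetical names, the 64-byte cache line is an assumption, and the memory it returns comes from the ordinary C heap, so it is NOT DMA-safe. The real allocation goes through DPDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_CACHE_LINE 64u /* assumed cache line size for this sketch */

/*
 * Toy model of the argument handling described in the text.
 * The returned memory is plain heap memory and is NOT DMA-safe.
 */
static void *toy_zmalloc(size_t size, size_t align, uint64_t *unused)
{
    if (unused != NULL) {
        return NULL; /* mirrors the "unused must be NULL" rule */
    }
    if (size == 0) {
        return NULL; /* sketch choice, not taken from the SPDK source */
    }
    if (align < TOY_CACHE_LINE) {
        align = TOY_CACHE_LINE; /* at least cache-line alignment */
    }
    assert((align & (align - 1)) == 0); /* power of two expected */

    /* aligned_alloc() wants the size to be a multiple of the alignment. */
    size_t rounded = (size + align - 1) / align * align;
    void *p = aligned_alloc(align, rounded);
    if (p != NULL) {
        memset(p, 0, rounded); /* the "z" in zmalloc: zero the buffer */
    }
    return p;
}
```

A caller that passes align = 0 still gets a cache-line-aligned, zeroed buffer, and a caller that passes a non-NULL unused pointer gets NULL back instead of a buffer.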

Alignment

Alignment appears everywhere in storage:

  • cache-line alignment avoids false sharing and supports CPU efficiency
  • device descriptors may require specific alignment
  • metadata/DIF layouts may require block or protection information boundaries
  • hugepage-backed memory simplifies translation and pinning

lib/env_dpdk/env.c:spdk_malloc() and spdk_zmalloc() apply at least RTE_CACHE_LINE_SIZE alignment. lib/thread/iobuf.c:spdk_iobuf_initialize() rounds small and large iobuf sizes up to IOBUF_ALIGNMENT.
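Rounding a size up to an alignment, and checking whether an address is aligned, are both one-line bit operations. The sketch below shows them with hypothetical helper names (toy_align_up, toy_is_aligned); the alignment is passed in as a parameter because the actual value of IOBUF_ALIGNMENT is not quoted in this chapter. Note that the alignment check passes for any suitably aligned address, including one in ordinary static or stack memory, so it says nothing about DMA safety.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Round value up to the next multiple of align (a power of two). */
static size_t toy_align_up(size_t value, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (value + align - 1) & ~(align - 1);
}

/* True if ptr is a multiple of align (a power of two). */
static bool toy_is_aligned(const void *ptr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)ptr & (align - 1)) == 0;
}
```

For example, toy_align_up(4097, 4096) yields 8192, which is the kind of rounding a buffer size goes through before pools are created.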

Misconception to kill:

"If the address is aligned, it is DMA-safe." No. Alignment is one requirement. DMA safety also needs appropriate allocation, mapping, lifetime, and address translation.

Mempools

Mempools are fixed-size object pools. They are ideal for small objects with high allocation frequency and predictable sizes.

In include/spdk/env.h, the mempool API includes:

  • create/free: spdk_mempool_create(), spdk_mempool_create_ctor(), spdk_mempool_free()
  • get/put: spdk_mempool_get(), spdk_mempool_get_bulk(), spdk_mempool_put(), spdk_mempool_put_bulk()
  • introspection: spdk_mempool_count(), spdk_mempool_lookup(), spdk_mempool_obj_iter(), spdk_mempool_mem_iter()

Examples:

  • lib/thread/thread.c:_thread_lib_init() creates a message mempool.
  • lib/event/reactor.c:spdk_reactors_init() creates an event mempool.
  • NVMf RDMA and iSCSI modules create transport/session/task pools.

Operational meaning:

Mempools turn allocation pressure into explicit resource pressure. If a mempool is empty, the system should either queue, retry, apply backpressure, or fail cleanly.
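To make "explicit resource pressure" concrete, here is a minimal fixed-size object pool in plain C. It is a teaching sketch, not how spdk_mempool_* is implemented: the toy_ names and the capacity of 4 are invented, and there is no locking or per-core caching. The point is only that get returns NULL when the pool is empty, and the caller has to decide what to do about it.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_POOL_CAP 4 /* made-up capacity for the sketch */

struct toy_msg {
    int payload;
};

struct toy_pool {
    struct toy_msg objs[TOY_POOL_CAP];       /* preallocated storage */
    struct toy_msg *free_list[TOY_POOL_CAP]; /* stack of free objects */
    size_t free_count;
};

static void toy_pool_init(struct toy_pool *p)
{
    for (size_t i = 0; i < TOY_POOL_CAP; i++) {
        p->free_list[i] = &p->objs[i];
    }
    p->free_count = TOY_POOL_CAP;
}

/* Returns NULL when empty: the caller must queue, retry, back off, or fail. */
static struct toy_msg *toy_pool_get(struct toy_pool *p)
{
    if (p->free_count == 0) {
        return NULL;
    }
    return p->free_list[--p->free_count];
}

/* Returns an object to the pool; no allocation or free happens here. */
static void toy_pool_put(struct toy_pool *p, struct toy_msg *m)
{
    assert(p->free_count < TOY_POOL_CAP);
    p->free_list[p->free_count++] = m;
}
```

After four successful gets, the fifth returns NULL. Putting one object back makes exactly one more get succeed, which is why a leak of pool objects shows up as permanent exhaustion rather than as growing heap usage.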

Iobuf: Why It Exists

iobuf is a shared pool of data buffers with per-thread caches and wait queues. It exists because many SPDK transports and modules need temporary data buffers, but allocating from the heap in the I/O path is too slow and unpredictable.

The public iobuf API lives in include/spdk/thread.h, not in a separate iobuf.h in this tree.

Important types:

  • struct spdk_iobuf_opts: pool counts, buffer sizes, NUMA behavior
  • struct spdk_iobuf_channel: per-thread cache state
  • struct spdk_iobuf_entry: wait queue entry for async buffer acquisition

Important functions:

  • spdk_iobuf_set_opts()
  • spdk_iobuf_initialize()
  • spdk_iobuf_register_module()
  • spdk_iobuf_channel_init()
  • spdk_iobuf_get()
  • spdk_iobuf_put()
  • spdk_iobuf_entry_abort()
  • spdk_iobuf_channel_fini()
  • spdk_iobuf_finish()

Iobuf Initialization

lib/thread/iobuf.c:spdk_iobuf_initialize():

  1. Rounds small and large buffer sizes up to the iobuf alignment.
  2. Initializes iobuf nodes for each relevant NUMA ID.
  3. Registers &g_iobuf as an io_device.
  4. Marks iobuf initialized.

Because iobuf is registered as an io_device, it uses the io_channel mechanism from the previous chapter.

spdk_iobuf_finish() unregisters that io_device and eventually frees modules and node pools in iobuf_unregister_cb().

Iobuf Module Registration

Only registered iobuf modules can create iobuf channels. lib/thread/iobuf.c:spdk_iobuf_register_module() stores module names in g_iobuf.modules. spdk_iobuf_channel_init() searches for the module name before creating the channel.

This is a useful guardrail:

  • It lets stats be grouped by module.
  • It prevents accidental anonymous pool usage.
  • It makes wait queues module-aware.

Example:

lib/nvmf/transport.c builds an iobuf module name for transports and calls spdk_iobuf_register_module() when the transport uses iobuf.

Iobuf Channels And Per-Thread Caches

lib/thread/iobuf.c:spdk_iobuf_channel_init():

  1. Verifies the module exists.
  2. Gets a parent io_channel for &g_iobuf.
  3. Finds a free channel slot in the parent channel context.
  4. Sets ch->parent and ch->module.
  5. Initializes small and large caches for each NUMA ID.
  6. Populates caches from central pools.

The per-thread channel caches reduce contention on central pools. A hot thread can get and put buffers from its local cache most of the time.

Failure mode:

If cache population cannot dequeue enough buffers from the central pool, initialization returns -ENOMEM and logs that the user may need to increase small_pool_count or large_pool_count.
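The failure mode is easiest to see as arithmetic: every channel takes its cache worth of buffers out of the central pool at init time. The sketch below models that with counters only. It is a simplification under stated assumptions: the toy_ names are hypothetical, the model treats cache population as all-or-nothing, and it ignores NUMA nodes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Central pools modelled as plain counters of available buffers. */
struct toy_central {
    size_t small_avail;
    size_t large_avail;
};

/* Move one channel's worth of buffers into its cache, or fail with -ENOMEM. */
static int toy_channel_init(struct toy_central *c,
                            size_t small_cache, size_t large_cache)
{
    if (c->small_avail < small_cache || c->large_avail < large_cache) {
        return -ENOMEM; /* the pool counts are too small for this cache size */
    }
    c->small_avail -= small_cache;
    c->large_avail -= large_cache;
    return 0;
}

/* How many channels can be initialized before the central pools run dry. */
static size_t toy_channels_that_fit(struct toy_central c,
                                    size_t small_cache, size_t large_cache)
{
    size_t n = 0;

    assert(small_cache > 0 || large_cache > 0); /* otherwise this never ends */
    while (toy_channel_init(&c, small_cache, large_cache) == 0) {
        n++;
    }
    return n;
}
```

With made-up numbers of 1024 small and 64 large buffers, and per-channel caches of 128 small and 16 large, the small pool would allow eight channels but the large pool only four, so the fifth channel init fails. Whichever class runs out first sets the limit.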

spdk_iobuf_get()

lib/thread/iobuf.c:spdk_iobuf_get() takes:

  • iobuf channel
  • requested length
  • optional wait entry
  • optional callback

It asserts the parent io_channel belongs to the current spdk_thread.

Then it chooses the small or large pool:

  • len <= small.bufsize: small pool
  • otherwise: large pool, with an assertion that len <= large.bufsize

Then:

  • If a local cached buffer exists, return it immediately.
  • Otherwise dequeue a batch from the central ring, cache all but one, and return one.
  • If no central buffers exist:
      - if an entry is provided, queue the entry and callback
      - return NULL

Important callback detail:

If a buffer is available immediately, the callback is not executed. The caller receives the buffer directly.
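The decision order in the get path (local cache, then central pool, then wait queue) can be sketched as a standalone model. Everything here is an assumption made for illustration: the toy_ names are hypothetical, the central ring is modelled as a simple array used as a stack, the batch size is "whatever fits in the cache", and there is a single pool rather than separate small and large pools per NUMA node.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CACHE_MAX 4 /* made-up per-channel cache size */

typedef void (*toy_get_cb)(void *ctx, void *buf);

struct toy_entry {
    toy_get_cb cb;
    void *ctx;
    struct toy_entry *next;
};

struct toy_chan {
    void *cache[TOY_CACHE_MAX]; /* thread-local buffers */
    size_t cache_count;
    void **central;             /* shared pool, modelled as a stack */
    size_t central_count;
    struct toy_entry *wait_head, *wait_tail;
};

/* Small/large selection by length, as described in the text. */
static int toy_is_large(size_t len, size_t small_bufsize, size_t large_bufsize)
{
    if (len <= small_bufsize) {
        return 0;
    }
    assert(len <= large_bufsize);
    return 1;
}

static void *toy_get(struct toy_chan *ch, struct toy_entry *entry,
                     toy_get_cb cb, void *ctx)
{
    if (ch->cache_count > 0) { /* 1. local cache hit */
        return ch->cache[--ch->cache_count];
    }
    if (ch->central_count > 0) { /* 2. take a batch, keep one, cache the rest */
        void *buf = ch->central[--ch->central_count];
        while (ch->central_count > 0 && ch->cache_count < TOY_CACHE_MAX) {
            ch->cache[ch->cache_count++] = ch->central[--ch->central_count];
        }
        return buf;
    }
    if (entry != NULL) { /* 3. nothing available: queue the waiter */
        entry->cb = cb;
        entry->ctx = ctx;
        entry->next = NULL;
        if (ch->wait_tail != NULL) {
            ch->wait_tail->next = entry;
        } else {
            ch->wait_head = entry;
        }
        ch->wait_tail = entry;
    }
    return NULL; /* cb is NOT called here; it runs later, from a put */
}
```

Note that the callback is stored only in case 3. In cases 1 and 2 the caller gets the buffer as the return value and must continue its own work, which is the callback detail the text calls out.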

spdk_iobuf_put()

lib/thread/iobuf.c:spdk_iobuf_put() returns a buffer.

It:

  1. Chooses NUMA ID if iobuf NUMA is enabled.
  2. Chooses small or large pool based on the same length rule.
  3. If no waiters exist:
      - returns the buffer to the local cache or central pool depending on cache size
  4. If waiters exist:
      - removes the first waiter
      - calls the waiter's callback with the returned buffer

This is an important backpressure path. A waiting I/O can resume when another I/O returns a buffer.

Beginner rule:

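Pass spdk_iobuf_put() the same len that was passed to spdk_iobuf_get(). The implementation uses len to pick the small or large pool, so a value in the wrong length class sends the buffer to the wrong pool. The public docs explicitly say it must be the exact same value.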
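The put path is the mirror image of get: a returned buffer goes to the first waiter if there is one, and only otherwise back into a cache or pool. The following standalone model shows that ordering. It is a sketch under assumptions: the toy_ names are hypothetical, the central pool is just a counter, the cache size of 2 is invented, and NUMA and small/large selection are left out.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CACHE_MAX 2 /* made-up per-channel cache size */

struct toy_waiter {
    void (*cb)(struct toy_waiter *w, void *buf);
    void *got;               /* filled in by the callback in this sketch */
    struct toy_waiter *next;
};

struct toy_put_chan {
    void *cache[TOY_CACHE_MAX];
    size_t cache_count;
    size_t central_count;    /* central pool modelled as a counter */
    struct toy_waiter *wait_head;
};

/* Example callback: record the buffer so the waiting request can resume. */
static void toy_resume(struct toy_waiter *w, void *buf)
{
    w->got = buf;
}

static void toy_put(struct toy_put_chan *ch, void *buf)
{
    struct toy_waiter *w = ch->wait_head;

    if (w != NULL) {
        ch->wait_head = w->next; /* remove the first waiter */
        w->next = NULL;
        w->cb(w, buf);           /* hand the buffer straight to it */
        return;
    }
    if (ch->cache_count < TOY_CACHE_MAX) {
        ch->cache[ch->cache_count++] = buf; /* local cache has room */
    } else {
        ch->central_count++;                /* overflow to the central pool */
    }
}
```

With two waiters queued, the first two puts resume them in FIFO order and nothing reaches the cache. Only once the wait queue is empty do returned buffers start refilling the cache and then the central pool.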

NOMEM Is Often A Designed State

When spdk_iobuf_get() returns NULL with a queued entry, that is not necessarily fatal. It can mean "wait until a buffer is returned."

But NOMEM can also indicate bad sizing:

  • too few small buffers
  • too few large buffers
  • per-channel caches too large for central pool
  • I/O unit size larger than large buffer size
  • leak where buffers are not returned
  • wrong path using iobuf for unexpectedly large data

lib/nvmf/transport.c checks iobuf options against transport io_unit_size. It warns when requested shared buffers exceed available pool size.

Aborting Iobuf Waiters

Sometimes a request waiting for a buffer must be canceled because the connection, qpair, or operation is torn down.

lib/thread/iobuf.c:spdk_iobuf_entry_abort() walks NUMA caches and removes the entry from the appropriate wait queue.

lib/nvmf/transport.c uses spdk_iobuf_for_each_entry() and spdk_iobuf_entry_abort() to abort pending buffer requests for requests that should no longer continue.

Edge case:

If teardown forgets to abort waiters, a later buffer return can call a callback for an operation that no longer has valid ownership.
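Aborting a waiter is, at its core, unlinking one entry from a queue before anyone can hand it a buffer. A minimal model of that operation, using a hand-written singly linked list and hypothetical toy_ names (the real code walks per-NUMA caches and uses SPDK's own queue types), might look like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_wait_entry {
    struct toy_wait_entry *next;
    int request_id; /* stands in for the owning request */
};

struct toy_wait_queue {
    struct toy_wait_entry *head;
};

/* Append an entry at the tail so waiters are served in FIFO order. */
static void toy_enqueue(struct toy_wait_queue *q, struct toy_wait_entry *e)
{
    struct toy_wait_entry **pp = &q->head;

    while (*pp != NULL) {
        pp = &(*pp)->next;
    }
    e->next = NULL;
    *pp = e;
}

/* Unlink one entry. Returns false if it was not queued. */
static bool toy_abort(struct toy_wait_queue *q, struct toy_wait_entry *e)
{
    for (struct toy_wait_entry **pp = &q->head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == e) {
            *pp = e->next; /* later puts can no longer reach this entry */
            e->next = NULL;
            return true;
        }
    }
    return false;
}
```

Once toy_abort() has returned true, the memory holding the entry can be released safely, because no future buffer return will find it. Skipping this step is what leaves a stale callback reachable from the queue.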

Memory Domains

include/spdk/dma.h defines memory domains. They abstract memory that may belong to different DMA-capable domains.

Key functions:

  • spdk_memory_domain_create()
  • spdk_memory_domain_set_translation()
  • spdk_memory_domain_set_pull()
  • spdk_memory_domain_set_push()
  • spdk_memory_domain_set_data_transfer()
  • spdk_memory_domain_translate_data()
  • spdk_memory_domain_transfer_data()
  • spdk_memory_domain_get_system_domain()

Why this exists:

Some data may live in memory registered with an RDMA NIC, accelerator, GPU-like device, or another domain. Instead of always copying into system memory first, SPDK can ask domains how to translate, pull, push, or transfer data.

Beginner simplification:

Memory domains are SPDK's way to ask, "Can this memory be used from there, and if not, how do we move it?"

Zero Copy

"Zero copy" means a data path avoids one or more CPU memory copies. It does not mean:

  • no DMA
  • no descriptors
  • no memory registration
  • no ownership rules
  • no fallback path
  • no metadata handling

SPDK has several zero-copy-adjacent concepts:

  • DMA-safe buffers avoid copying into driver-owned kernel memory.
  • NVMe-oF may use transport buffers or bdev-provided buffers.
  • Socket implementations may use MSG_ZEROCOPY depending on thresholds; see include/spdk_internal/sock_module.h.
  • NVMf target options include using zero-copy operations if the underlying bdev supports them; see include/spdk/nvmf.h.
  • Memory domains can allow direct data movement between domains.

The practical question is not "is this zero-copy?" It is:

  • Who owns the buffer?
  • Is the buffer valid until completion?
  • Is it aligned and registered for the device or transport?
  • Can the next layer consume the same iovecs?
  • What fallback happens if zero-copy is unsupported?

Metadata And DIF/DIX

Storage buffers may include metadata or protection information. The performance app references DIF/DIX paths in app/spdk_nvme_perf/perf.c, and many public APIs distinguish data and metadata buffers.

Metadata complicates zero-copy:

  • metadata may be separate from data
  • protection information may need insert, strip, generate, check, or update
  • hardware and transport capabilities differ
  • "hide metadata" options can change what a host sees

Beginner rule:

Do not assume a block is only user data bytes. Always inspect bdev block size, metadata size, and protection information settings before reasoning about buffer length.
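A short worked calculation shows how metadata changes buffer lengths. The struct and function names below are hypothetical and do not correspond to SPDK's bdev API; the sketch simply assumes a format described by three values: data bytes per block, metadata bytes per block, and whether metadata is interleaved with data or carried in a separate buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_block_format {
    uint32_t data_size;  /* user data bytes per block */
    uint32_t md_size;    /* metadata bytes per block, 0 if none */
    bool md_interleaved; /* true: metadata follows each block's data */
};

/* Bytes needed in the main data buffer for num_blocks blocks. */
static size_t toy_data_buf_len(const struct toy_block_format *f, size_t num_blocks)
{
    size_t per_block = f->data_size;

    if (f->md_interleaved) {
        per_block += f->md_size; /* metadata travels in the same buffer */
    }
    return per_block * num_blocks;
}

/* Bytes needed in a separate metadata buffer, if the format uses one. */
static size_t toy_md_buf_len(const struct toy_block_format *f, size_t num_blocks)
{
    return f->md_interleaved ? 0 : (size_t)f->md_size * num_blocks;
}
```

For eight blocks of 4096 data bytes with 8 bytes of metadata each, the interleaved layout needs one 32832-byte buffer, while the separate layout needs a 32768-byte data buffer plus a 64-byte metadata buffer. Code that assumes 8 x 4096 in both cases is wrong for the interleaved one.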

Bounce Buffers

A bounce buffer is a temporary buffer used when the original buffer cannot be used directly.

Reasons:

  • source memory is not DMA-safe
  • alignment does not meet device requirements
  • memory domain cannot be translated
  • metadata layout does not match the next layer
  • transport requires a contiguous or differently sized buffer

Bounce buffers are not "bad" by themselves. They are a correctness fallback. But unexpected bounce-buffer use can destroy performance, so it should be visible in source reading and metrics.
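The fallback decision can be sketched as a small standalone function: use the caller's buffer directly when it qualifies, otherwise copy into a freshly aligned buffer and count the event. This is a toy model with hypothetical names. It uses aligned_alloc() from the C library, which yields aligned heap memory but not DMA-safe memory, and it takes "is the source DMA-safe" as an input flag because that property cannot be derived from the address alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct toy_bounce {
    void *io_buf; /* buffer actually handed to the "device" */
    void *bounce; /* non-NULL only if a copy was made */
};

/* Prepare a write: direct if possible, bounced otherwise. 0 on success. */
static int toy_prepare_write(struct toy_bounce *b, void *src, size_t len,
                             size_t align, bool src_dma_safe,
                             size_t *bounce_count)
{
    assert(len > 0 && align != 0 && (align & (align - 1)) == 0);

    b->io_buf = src;
    b->bounce = NULL;
    if (src_dma_safe && ((uintptr_t)src & (align - 1)) == 0) {
        return 0; /* direct path: no copy */
    }

    /* aligned_alloc() wants the size to be a multiple of the alignment. */
    size_t rounded = (len + align - 1) / align * align;
    b->bounce = aligned_alloc(align, rounded);
    if (b->bounce == NULL) {
        return -1;
    }
    memcpy(b->bounce, src, len); /* the extra CPU copy */
    b->io_buf = b->bounce;
    (*bounce_count)++;           /* make the fallback visible in metrics */
    return 0;
}

/* Release the bounce buffer, if any, once the I/O has completed. */
static void toy_finish(struct toy_bounce *b)
{
    free(b->bounce);
    b->bounce = NULL;
}
```

The counter is the important design choice here: the I/O succeeds either way, so without a metric the only symptom of an unexpected bounce is higher CPU use and latency.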

Edge Cases And Failure Modes

  • Passing non-NULL unused to SPDK allocation APIs: returns NULL.
  • Ordinary malloc() buffer passed to DMA path: may fail translation or device access.
  • iobuf request larger than large buffer size: trips the len <= large.bufsize assertion in spdk_iobuf_get().
  • iobuf put with mismatched length: the buffer can be returned to the wrong pool or cache.
  • Module not registered with iobuf: spdk_iobuf_channel_init() returns -ENODEV.
  • Not enough pool entries to populate channel caches: -ENOMEM.
  • Forgetting to return iobuf buffers: pool starvation and wait queues grow.
  • Forgetting to abort waiters on teardown: callback can fire after owner is gone.
  • NUMA pool too small for selected topology: startup or channel init fails.
  • Zero-copy path unsupported by underlying bdev: must fall back or fail as designed.
  • Metadata/DIF path changes data length assumptions.

Misconceptions To Kill

  • "Zero copy means no buffers." Zero-copy is still buffer management.
  • "DMA memory is just fast memory." It is memory with properties needed for device access.
  • "Mempools are premature optimization." In SPDK hot paths, they are how failure and latency are controlled.
  • "NOMEM always means fatal." It may mean queue and retry.
  • "iobuf is global only." It has global pools plus per-thread channel caches.
  • "Metadata is rare, so ignore it." Metadata and protection info are central to many enterprise storage paths.

Diskengine Relevance

Diskengine may observe SPDK failures as generic I/O failures or slow operations. Memory pressure can be the hidden cause.

Useful classification:

  • Env memory failure: hugepages or DPDK memory unavailable.
  • Mempool exhaustion: fixed object pool too small or leak.
  • Iobuf pressure: data buffers exhausted; requests wait.
  • DMA translation failure: buffer or device domain mismatch.
  • Zero-copy fallback: correctness preserved but latency/CPU changes.
  • Metadata mismatch: request shape incompatible with bdev or export path.

When a diskengine reconciliation loop repeatedly retries an SPDK RPC or I/O operation, inspect whether SPDK is making progress or waiting on buffers.

Prose Diagram: Iobuf Get/Put

Think of iobuf as a warehouse with thread-local shelves:

  1. A module registers as a warehouse customer.
  2. Each SPDK thread opens a local shelf with spdk_iobuf_channel_init().
  3. spdk_iobuf_get() first checks the local shelf.
  4. If empty, it takes a box from the central warehouse and may stock extra boxes on the shelf.
  5. If the warehouse is empty, the request writes its name on a waiting list.
  6. spdk_iobuf_put() either puts the box back on the shelf or hands it directly to the first waiter.

The warehouse is shared, but shelf access is thread-owned.

Source Reading Exercise

Read iobuf allocation flow:

  1. lib/thread/iobuf.c:spdk_iobuf_initialize()
  2. lib/thread/iobuf.c:spdk_iobuf_register_module()
  3. lib/thread/iobuf.c:spdk_iobuf_channel_init()
  4. lib/thread/iobuf.c:spdk_iobuf_get()
  5. lib/thread/iobuf.c:spdk_iobuf_put()
  6. lib/thread/iobuf.c:spdk_iobuf_entry_abort()

Then connect it to a transport:

  1. lib/nvmf/transport.c:nvmf_transport_use_iobuf()
  2. lib/nvmf/transport.c:spdk_iobuf_register_module() call sites
  3. lib/nvmf/transport.c:spdk_iobuf_channel_init() call sites
  4. lib/nvmf/transport.c:spdk_iobuf_get() call sites
  5. lib/nvmf/transport.c:nvmf_request_iobuf_get_cb()

Questions:

  • Which path returns a buffer immediately?
  • Which path queues an entry?
  • Where is the module recorded on the wait entry?
  • How does teardown abort pending entries?

Operational Lab

Source-only sizing lab:

  1. Read struct spdk_iobuf_opts in include/spdk/thread.h.
  2. Write down small pool count, large pool count, small buffer size, large buffer size, and NUMA behavior.
  3. Find a transport or module that calls spdk_iobuf_channel_init().
  4. Compare its small and large cache sizes to the global pool counts.
  5. Explain what happens if every reactor creates a channel at once.

Runtime lab:

  1. Start SPDK with a workload that uses NVMf or another iobuf consumer.
  2. Query iobuf stats if the RPC is available in the built app.
  3. Watch cache hits, main pool use, and retry counts.
  4. Increase queue depth or I/O size and observe whether retries grow.

Self-Check

  1. Why is spdk_dma_zmalloc() different from calloc()?
  2. What are mempools good for?
  3. Why does iobuf have per-thread channels?
  4. What happens when spdk_iobuf_get() cannot allocate a buffer and an entry is provided?
  5. Why must teardown abort iobuf waiters?
  6. What does zero-copy not guarantee?
  7. How can metadata change buffer reasoning?

References

  • Local source: include/spdk/env.h
  • Local source: lib/env_dpdk/env.c
  • Local source: lib/env_dpdk/memory.c
  • Local source: include/spdk/dma.h
  • Local source: include/spdk/thread.h
  • Local source: lib/thread/iobuf.c
  • Local source: lib/nvmf/transport.c
  • Local source: include/spdk/nvmf.h
  • Local source: include/spdk_internal/sock_module.h