Reader Promise
By the end of this chapter, a beginner should be able to explain SPDK's memory categories, why DMA-safe allocation is different from ordinary allocation, how mempools and iobuf reduce hot-path allocation, how iobuf wait queues handle NOMEM pressure, and what "zero copy" really means in SPDK contexts.
The chapter also kills a dangerous myth: zero copy does not mean "no memory management." It usually means memory ownership, alignment, lifetime, and device compatibility become more strict.
Mental Model
SPDK memory choices answer three questions:
- Who owns this memory?
- Can hardware or another process safely access it?
- What happens when memory is temporarily unavailable?
Common categories:
- ordinary C heap: okay for control-plane metadata, not generally for DMA
- SPDK DMA memory: allocated through spdk_dma_*, suitable for many device paths
- mempool objects: fixed-size reusable objects, often for messages or I/O descriptors
- memzones: named shared/aligned regions
- iobuf buffers: shared runtime data buffers with per-thread caches and wait queues
- memory domains: abstraction for memory owned by another DMA-capable domain such as RDMA
Source Anchors
- include/spdk/env.h: spdk_malloc(), spdk_zmalloc(), spdk_dma_malloc(), spdk_dma_zmalloc(), spdk_dma_free(), spdk_mempool_create(), spdk_mempool_get(), spdk_mempool_put(), spdk_memzone_reserve(), spdk_vtophys()
- lib/env_dpdk/env.c: spdk_malloc(), spdk_zmalloc(), spdk_dma_malloc_socket(), spdk_dma_zmalloc_socket(), spdk_mempool_create_ctor(), spdk_mempool_get(), spdk_mempool_put()
- lib/env_dpdk/memory.c: vtophys_init(), spdk_vtophys(), vtophys_notify(), vtophys_iommu_init()
- include/spdk/dma.h: struct spdk_memory_domain, spdk_memory_domain_create(), spdk_memory_domain_set_translation(), spdk_memory_domain_translate_data(), spdk_memory_domain_transfer_data(), spdk_memory_domain_get_system_domain()
- include/spdk/thread.h: struct spdk_iobuf_opts, struct spdk_iobuf_channel, spdk_iobuf_initialize(), spdk_iobuf_finish(), spdk_iobuf_register_module(), spdk_iobuf_channel_init(), spdk_iobuf_get(), spdk_iobuf_put(), spdk_iobuf_entry_abort(), spdk_iobuf_get_stats()
- lib/thread/iobuf.c: spdk_iobuf_initialize(), spdk_iobuf_set_opts(), spdk_iobuf_channel_init(), spdk_iobuf_channel_fini(), spdk_iobuf_get(), spdk_iobuf_put(), spdk_iobuf_for_each_entry(), spdk_iobuf_entry_abort(), spdk_iobuf_get_stats()
- lib/nvmf/transport.c: spdk_iobuf_register_module() use, spdk_iobuf_channel_init() use, spdk_iobuf_get() use, spdk_iobuf_put() use, nvmf_request_iobuf_get_cb()
- include/spdk/nvmf.h: opts->no_srq and opts->zero_copy related target options, including "Use zero-copy operations if the underlying bdev supports them"
- include/spdk_internal/sock_module.h: zerocopy_threshold for socket implementations
DMA-Safe Allocation
SPDK APIs that interact with devices often require DMA-safe buffers. The public API is in include/spdk/env.h:
- spdk_dma_malloc()
- spdk_dma_malloc_socket()
- spdk_dma_zmalloc()
- spdk_dma_zmalloc_socket()
- spdk_dma_realloc()
- spdk_dma_free()
In the DPDK env implementation, lib/env_dpdk/env.c:spdk_dma_zmalloc_socket() calls spdk_zmalloc() with SPDK_MALLOC_DMA | SPDK_MALLOC_SHARE. spdk_zmalloc() uses DPDK allocation and aligns to at least cache line size.
These allocation functions take a parameter named unused. It must be NULL; the implementation returns NULL if it is not.
Beginner rule:
Use the allocation family expected by the API you call. Do not pass a stack buffer or ordinary heap buffer to a DMA path just because the type is void *.
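The argument handling described above can be sketched without SPDK. This is a minimal model, not the SPDK implementation: toy_zmalloc() and TOY_CACHE_LINE are hypothetical names, the 64-byte cache line is an assumption, and the memory comes from the ordinary C heap, so it is not DMA-safe. It only shows the two visible rules: a non-NULL unused pointer is rejected, and alignment is raised to at least one cache line.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_CACHE_LINE 64 /* assumed cache line size for this sketch */

/*
 * Hypothetical model of the argument checks only.
 * The returned memory is plain heap memory and is NOT DMA-safe.
 */
void *toy_zmalloc(size_t size, size_t align, uint64_t *unused)
{
    size_t padded;
    void *buf;

    if (unused != NULL) {
        return NULL; /* mirrors the "unused must be NULL" rule */
    }
    if (align < TOY_CACHE_LINE) {
        align = TOY_CACHE_LINE; /* at least cache-line alignment */
    }
    assert((align & (align - 1)) == 0); /* power of two expected */

    /* aligned_alloc wants a size that is a multiple of the alignment. */
    padded = (size + align - 1) / align * align;
    buf = aligned_alloc(align, padded);
    if (buf == NULL) {
        return NULL;
    }
    memset(buf, 0, padded); /* the "z" in zmalloc: zero-filled */
    return buf;
}
```

A caller would check the result for NULL and release it with free(). In real SPDK code the buffer would be released with spdk_dma_free() or spdk_free(), matching the allocator that produced it.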
Alignment
Alignment appears everywhere in storage:
- cache-line alignment avoids false sharing and supports CPU efficiency
- device descriptors may require specific alignment
- metadata/DIF layouts may require block or protection information boundaries
- hugepage-backed memory simplifies translation and pinning
lib/env_dpdk/env.c:spdk_malloc() and spdk_zmalloc() apply at least RTE_CACHE_LINE_SIZE alignment. lib/thread/iobuf.c:spdk_iobuf_initialize() rounds small and large iobuf sizes up to IOBUF_ALIGNMENT.
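Rounding a size up to an alignment is simple arithmetic, sketched below. The helper names are hypothetical, and TOY_ALIGNMENT is an assumed value chosen for illustration; check IOBUF_ALIGNMENT in lib/thread/iobuf.c for the real constant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_ALIGNMENT 4096u /* assumed value, for illustration only */

/* Round size up to the next multiple of a power-of-two alignment. */
static inline size_t toy_align_up(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}

/* An address is aligned when its low bits are all zero. */
static inline int toy_is_aligned(const void *addr, size_t align)
{
    return ((uintptr_t)addr & (align - 1)) == 0;
}
```

With these definitions, toy_align_up(8192, TOY_ALIGNMENT) stays 8192 and toy_align_up(8193, TOY_ALIGNMENT) becomes 12288. Note that toy_is_aligned() checks only the address bits; it says nothing about whether the memory is mapped or registered for a device.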
Misconception to kill:
"If the address is aligned, it is DMA-safe." No. Alignment is one requirement. DMA safety also needs appropriate allocation, mapping, lifetime, and address translation.
Mempools
Mempools are fixed-size object pools. They are ideal for small objects with high allocation frequency and predictable sizes.
In include/spdk/env.h, the mempool API includes:
- create/free: spdk_mempool_create(), spdk_mempool_create_ctor(), spdk_mempool_free()
- get/put: spdk_mempool_get(), spdk_mempool_get_bulk(), spdk_mempool_put(), spdk_mempool_put_bulk()
- introspection: spdk_mempool_count(), spdk_mempool_lookup(), spdk_mempool_obj_iter(), spdk_mempool_mem_iter()
Examples:
- lib/thread/thread.c:_thread_lib_init() creates a message mempool.
- lib/event/reactor.c:spdk_reactors_init() creates an event mempool.
- NVMf RDMA and iSCSI modules create transport/session/task pools.
Operational meaning:
Mempools turn allocation pressure into explicit resource pressure. If a mempool is empty, the system should either queue, retry, apply backpressure, or fail cleanly.
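The idea of explicit resource pressure can be seen in a tiny fixed-size pool. This is a sketch under stated assumptions, not how SPDK or DPDK implement mempools: all toy_* names and the capacity of 4 are hypothetical. The point is that get returns NULL when the pool is empty, and the caller has to decide what to do about it.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_POOL_CAP 4 /* hypothetical capacity */

struct toy_msg {
    int opcode;
};

/* Fixed-size pool: every object is preallocated and tracked on a free stack. */
struct toy_pool {
    struct toy_msg objs[TOY_POOL_CAP];
    struct toy_msg *free_list[TOY_POOL_CAP];
    size_t free_count;
};

void toy_pool_init(struct toy_pool *p)
{
    for (size_t i = 0; i < TOY_POOL_CAP; i++) {
        p->free_list[i] = &p->objs[i];
    }
    p->free_count = TOY_POOL_CAP;
}

/* Returns NULL when the pool is empty; no hidden heap allocation happens. */
struct toy_msg *toy_pool_get(struct toy_pool *p)
{
    if (p->free_count == 0) {
        return NULL;
    }
    return p->free_list[--p->free_count];
}

/* Returning more objects than were taken is a caller bug. */
void toy_pool_put(struct toy_pool *p, struct toy_msg *m)
{
    assert(p->free_count < TOY_POOL_CAP);
    p->free_list[p->free_count++] = m;
}
```

A caller that receives NULL from toy_pool_get() is at the decision point named above: queue the work, retry later, push back on the producer, or fail the operation cleanly.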
Iobuf: Why It Exists
iobuf is a shared pool of data buffers with per-thread caches and wait queues. It exists because many SPDK transports and modules need temporary data buffers, but allocating from the heap in the I/O path is too slow and unpredictable.
The public iobuf API lives in include/spdk/thread.h, not in a separate iobuf.h in this tree.
Important types:
- struct spdk_iobuf_opts: pool counts, buffer sizes, NUMA behavior
- struct spdk_iobuf_channel: per-thread cache state
- struct spdk_iobuf_entry: wait queue entry for async buffer acquisition
Important functions:
- spdk_iobuf_set_opts()
- spdk_iobuf_initialize()
- spdk_iobuf_register_module()
- spdk_iobuf_channel_init()
- spdk_iobuf_get()
- spdk_iobuf_put()
- spdk_iobuf_entry_abort()
- spdk_iobuf_channel_fini()
- spdk_iobuf_finish()
Iobuf Initialization
lib/thread/iobuf.c:spdk_iobuf_initialize():
- Rounds small and large buffer sizes up to the iobuf alignment.
- Initializes iobuf nodes for each relevant NUMA ID.
- Registers &g_iobuf as an io_device.
- Marks iobuf initialized.
Because iobuf is registered as an io_device, it uses the io_channel mechanism from the previous chapter.
spdk_iobuf_finish() unregisters that io_device and eventually frees modules and node pools in iobuf_unregister_cb().
Iobuf Module Registration
Only registered iobuf modules can create iobuf channels. lib/thread/iobuf.c:spdk_iobuf_register_module() stores module names in g_iobuf.modules. spdk_iobuf_channel_init() searches for the module name before creating the channel.
This is a useful guardrail:
- It lets stats be grouped by module.
- It prevents accidental anonymous pool usage.
- It makes wait queues module-aware.
Example:
lib/nvmf/transport.c builds an iobuf module name for transports and calls spdk_iobuf_register_module() when the transport uses iobuf.
Iobuf Channels And Per-Thread Caches
lib/thread/iobuf.c:spdk_iobuf_channel_init():
- Verifies the module exists.
- Gets a parent io_channel for &g_iobuf.
- Finds a free channel slot in the parent channel context.
- Sets ch->parent and ch->module.
- Initializes small and large caches for each NUMA ID.
- Populates caches from central pools.
The per-thread channel caches reduce contention on central pools. A hot thread can get and put buffers from its local cache most of the time.
Failure mode:
If cache population cannot dequeue enough buffers from the central pool, initialization returns -ENOMEM and logs that the user may need to increase small_pool_count or large_pool_count.
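A quick way to reason about this failure is to compare worst-case cache demand with the central pool size. The helpers below are a hypothetical sizing aid, not part of SPDK, and the numbers in the usage note are example values rather than SPDK defaults.

```c
#include <assert.h>
#include <stdint.h>

/* Buffers needed if every channel fills its cache at the same time. */
uint64_t toy_cache_demand(uint32_t num_channels, uint32_t cache_size)
{
    return (uint64_t)num_channels * cache_size;
}

/* Number of buffers missing from the central pool, or 0 if it is big enough. */
uint64_t toy_cache_shortfall(uint64_t pool_count, uint32_t num_channels,
                             uint32_t cache_size)
{
    uint64_t demand = toy_cache_demand(num_channels, cache_size);

    return demand > pool_count ? demand - pool_count : 0;
}
```

For example, 96 channels with a cache of 16 large buffers each need 1536 buffers, so a pool of 1024 is short by 512 and some channel initialization would fail. Remember that the channel count is roughly threads multiplied by the number of modules that open channels, and that a pool exactly equal to cache demand leaves nothing in the central pool for refills.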
spdk_iobuf_get()
lib/thread/iobuf.c:spdk_iobuf_get() takes:
- iobuf channel
- requested length
- optional wait entry
- optional callback
It asserts the parent io_channel belongs to the current spdk_thread.
Then it chooses the small or large pool:
- len <= small.bufsize: small pool
- otherwise: large pool, with an assertion that len <= large.bufsize
Then:
- If a local cached buffer exists, return it immediately.
- Otherwise dequeue a batch from the central ring, cache all but one, and return one.
- If no central buffers exist:
  - if an entry is provided, queue the entry and callback
  - return NULL
Important callback detail:
If a buffer is available immediately, the callback is not executed. The caller receives the buffer directly.
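The get path can be modeled in a few dozen lines. This is a simplified sketch, not the code in lib/thread/iobuf.c: every toy_* name, the cache size, and the batch size are hypothetical, the central pool is a plain array instead of a ring, and NUMA handling and small/large pool selection are left out.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CACHE_MAX 8 /* hypothetical per-thread cache capacity */
#define TOY_BATCH     4 /* hypothetical refill batch size */

struct toy_entry;
typedef void (*toy_get_cb)(struct toy_entry *entry, void *buf);

struct toy_entry {
    toy_get_cb cb;
    struct toy_entry *next;
};

struct toy_channel {
    void *cache[TOY_CACHE_MAX];  /* thread-local cache */
    size_t cache_count;
    void **central;              /* shared pool, modeled as a stack */
    size_t central_count;
    struct toy_entry *wait_head; /* FIFO of waiting requests */
    struct toy_entry *wait_tail;
};

void *toy_get(struct toy_channel *ch, struct toy_entry *entry, toy_get_cb cb)
{
    size_t n;

    assert(ch != NULL);

    /* 1. Local cache hit: return immediately; the callback is never called. */
    if (ch->cache_count > 0) {
        return ch->cache[--ch->cache_count];
    }

    /* 2. Refill from the central pool: cache all but one, return one. */
    n = ch->central_count < TOY_BATCH ? ch->central_count : TOY_BATCH;
    if (n > 0) {
        while (n > 1) {
            ch->cache[ch->cache_count++] = ch->central[--ch->central_count];
            n--;
        }
        return ch->central[--ch->central_count];
    }

    /* 3. Nothing available: queue the entry if the caller supplied one. */
    if (entry != NULL) {
        entry->cb = cb;
        entry->next = NULL;
        if (ch->wait_tail != NULL) {
            ch->wait_tail->next = entry;
        } else {
            ch->wait_head = entry;
        }
        ch->wait_tail = entry;
    }
    return NULL;
}
```

The caller therefore has two success paths: a non-NULL return means it owns a buffer right now, while a NULL return with a queued entry means the callback will deliver the buffer later.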
spdk_iobuf_put()
lib/thread/iobuf.c:spdk_iobuf_put() returns a buffer.
It:
- Chooses NUMA ID if iobuf NUMA is enabled.
- Chooses small or large pool based on the same length rule.
- If no waiters exist:
  - returns to local cache or central pool depending on cache size
- If waiters exist:
  - removes the first waiter
  - calls the waiter's callback with the returned buffer
This is an important backpressure path. A waiting I/O can resume when another I/O returns a buffer.
Beginner rule:
The len passed to spdk_iobuf_put() must match the length class used by spdk_iobuf_get(). The public docs explicitly say it must be the exact same value.
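The put path, including the handoff to a waiter, can be sketched the same way. Again this is a hypothetical model rather than SPDK code: the toy_* names are invented, the central pool is reduced to a counter, and NUMA and pool selection are omitted.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CACHE_MAX 2 /* hypothetical per-thread cache capacity */

struct toy_waiter;
typedef void (*toy_waiter_cb)(struct toy_waiter *w, void *buf);

struct toy_waiter {
    toy_waiter_cb cb;
    struct toy_waiter *next;
    void *buf; /* filled in by the example callback below */
};

struct toy_put_channel {
    void *cache[TOY_CACHE_MAX];
    size_t cache_count;
    size_t returned_to_central; /* counter standing in for the central pool */
    struct toy_waiter *wait_head;
};

/* Example callback: the waiting request records its buffer and can resume. */
void toy_resume(struct toy_waiter *w, void *buf)
{
    w->buf = buf;
}

void toy_put(struct toy_put_channel *ch, void *buf)
{
    struct toy_waiter *w = ch->wait_head;

    assert(buf != NULL);

    if (w != NULL) {
        /* Waiters exist: hand the buffer straight to the first one. */
        ch->wait_head = w->next;
        w->next = NULL;
        w->cb(w, buf);
        return;
    }

    /* No waiters: keep it locally if there is room, else give it back. */
    if (ch->cache_count < TOY_CACHE_MAX) {
        ch->cache[ch->cache_count++] = buf;
    } else {
        ch->returned_to_central++;
    }
}
```

Notice that a returned buffer never touches the cache when a waiter exists. That direct handoff is what lets a stalled I/O resume as soon as any other I/O on the thread completes.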
NOMEM Is Often A Designed State
When spdk_iobuf_get() returns NULL with a queued entry, that is not necessarily fatal. It can mean "wait until a buffer is returned."
But NOMEM can also indicate bad sizing:
- too few small buffers
- too few large buffers
- per-channel caches too large for central pool
- I/O unit size larger than large buffer size
- leak where buffers are not returned
- wrong path using iobuf for unexpectedly large data
lib/nvmf/transport.c checks iobuf options against transport io_unit_size. It warns when requested shared buffers exceed available pool size.
Aborting Iobuf Waiters
Sometimes a request waiting for a buffer must be canceled because the connection, qpair, or operation is torn down.
lib/thread/iobuf.c:spdk_iobuf_entry_abort() walks NUMA caches and removes the entry from the appropriate wait queue.
lib/nvmf/transport.c uses spdk_iobuf_for_each_entry() and spdk_iobuf_entry_abort() to abort pending buffer requests for requests that should no longer continue.
Edge case:
If teardown forgets to abort waiters, a later buffer return can call a callback for an operation that no longer has valid ownership.
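Aborting a waiter amounts to unlinking its entry before anything can hand it a buffer. The sketch below uses a hand-rolled singly linked queue with hypothetical toy_* names; it does not reproduce the queue macros or NUMA walk used by SPDK.

```c
#include <assert.h>
#include <stddef.h>

struct toy_pending {
    int request_id;
    struct toy_pending *next;
};

struct toy_wait_queue {
    struct toy_pending *head;
};

/* Append an entry at the tail so waiters are served in FIFO order. */
void toy_enqueue(struct toy_wait_queue *q, struct toy_pending *e)
{
    struct toy_pending **link = &q->head;

    assert(e != NULL);
    while (*link != NULL) {
        link = &(*link)->next;
    }
    e->next = NULL;
    *link = e;
}

/*
 * Unlink an entry so a later buffer return can no longer reach it.
 * Returns 1 if the entry was queued and has been removed, 0 otherwise.
 */
int toy_abort(struct toy_wait_queue *q, struct toy_pending *e)
{
    struct toy_pending **link = &q->head;

    while (*link != NULL) {
        if (*link == e) {
            *link = e->next;
            e->next = NULL;
            return 1;
        }
        link = &(*link)->next;
    }
    return 0;
}
```

Teardown code would call the abort step for every pending entry that belongs to the dying connection or qpair before freeing the request memory those entries live in.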
Memory Domains
include/spdk/dma.h defines memory domains. They abstract memory that may belong to different DMA-capable domains.
Key functions:
- spdk_memory_domain_create()
- spdk_memory_domain_set_translation()
- spdk_memory_domain_set_pull()
- spdk_memory_domain_set_push()
- spdk_memory_domain_set_data_transfer()
- spdk_memory_domain_translate_data()
- spdk_memory_domain_transfer_data()
- spdk_memory_domain_get_system_domain()
Why this exists:
Some data may live in memory registered with an RDMA NIC, accelerator, GPU-like device, or another domain. Instead of always copying into system memory first, SPDK can ask domains how to translate, pull, push, or transfer data.
Beginner simplification:
Memory domains are SPDK's way to ask, "Can this memory be used from there, and if not, how do we move it?"
Zero Copy
"Zero copy" means a data path avoids one or more CPU memory copies. It does not mean:
- no DMA
- no descriptors
- no memory registration
- no ownership rules
- no fallback path
- no metadata handling
SPDK has several zero-copy-adjacent concepts:
- DMA-safe buffers avoid copying into driver-owned kernel memory.
- NVMe-oF may use transport buffers or bdev-provided buffers.
- Socket implementations may use MSG_ZEROCOPY depending on thresholds; see include/spdk_internal/sock_module.h.
- NVMf target options include using zero-copy operations if the underlying bdev supports them; see include/spdk/nvmf.h.
- Memory domains can allow direct data movement between domains.
The practical question is not "is this zero-copy?" It is:
- Who owns the buffer?
- Is the buffer valid until completion?
- Is it aligned and registered for the device or transport?
- Can the next layer consume the same iovecs?
- What fallback happens if zero-copy is unsupported?
Metadata And DIF/DIX
Storage buffers may include metadata or protection information. The performance app references DIF/DIX paths in app/spdk_nvme_perf/perf.c, and many public APIs distinguish data and metadata buffers.
Metadata complicates zero-copy:
- metadata may be separate from data
- protection information may need insert, strip, generate, check, or update
- hardware and transport capabilities differ
- "hide metadata" options can change what a host sees
Beginner rule:
Do not assume a block is only user data bytes. Always inspect bdev block size, metadata size, and protection information settings before reasoning about buffer length.
Bounce Buffers
A bounce buffer is a temporary buffer used when the original buffer cannot be used directly.
Reasons:
- source memory is not DMA-safe
- alignment does not meet device requirements
- memory domain cannot be translated
- metadata layout does not match the next layer
- transport requires a contiguous or differently sized buffer
Bounce buffers are not "bad" by themselves. They are a correctness fallback. But unexpected bounce-buffer use can destroy performance, so it should be visible in source reading and metrics.
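One of the reasons above, misaligned source memory, can be sketched as follows. This is an illustration with hypothetical toy_* names that uses ordinary heap memory; it models only the alignment decision, not DMA safety, memory domains, or metadata layout.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct toy_io_buf {
    void *ptr;   /* buffer handed to the next layer */
    int bounced; /* 1 if ptr is a temporary copy that must be freed */
};

/*
 * Use src directly when it is aligned; otherwise copy it into an aligned
 * bounce buffer. align must be a power of two. Returns 0 on success.
 */
int toy_prepare_write(const void *src, size_t len, size_t align,
                      struct toy_io_buf *out)
{
    size_t padded;

    assert(align != 0 && (align & (align - 1)) == 0);

    if (((uintptr_t)src & (align - 1)) == 0) {
        out->ptr = (void *)src; /* direct path: no copy */
        out->bounced = 0;
        return 0;
    }

    /* aligned_alloc wants a size that is a multiple of the alignment. */
    padded = (len + align - 1) & ~(align - 1);
    out->ptr = aligned_alloc(align, padded);
    if (out->ptr == NULL) {
        return -1;
    }
    memcpy(out->ptr, src, len); /* the extra copy is the cost of bouncing */
    out->bounced = 1;
    return 0;
}

/* Free the bounce buffer if one was used; never free the caller's memory. */
void toy_release(struct toy_io_buf *b)
{
    if (b->bounced) {
        free(b->ptr);
    }
    b->ptr = NULL;
    b->bounced = 0;
}
```

The bounced flag is the piece worth copying into real designs: counting how often it is set is a cheap way to make unexpected bounce-buffer use visible in metrics.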
Edge Cases And Failure Modes
- Passing non-NULL unused to SPDK allocation APIs: returns NULL.
- Ordinary malloc() buffer passed to DMA path: may fail translation or device access.
- iobuf request larger than large buffer size: assertion path.
- iobuf put with mismatched length: wrong pool/cache behavior.
- Module not registered with iobuf: spdk_iobuf_channel_init() returns -ENODEV.
- Not enough pool entries to populate channel caches: -ENOMEM.
- Forgetting to return iobuf buffers: pool starvation and wait queues grow.
- Forgetting to abort waiters on teardown: callback can fire after owner is gone.
- NUMA pool too small for selected topology: startup or channel init fails.
- Zero-copy path unsupported by underlying bdev: must fall back or fail as designed.
- Metadata/DIF path changes data length assumptions.
Misconceptions To Kill
- "Zero copy means no buffers." Zero-copy is still buffer management.
- "DMA memory is just fast memory." It is memory with properties needed for device access.
- "Mempools are premature optimization." In SPDK hot paths, they are how failure and latency are controlled.
- "NOMEM always means fatal." It may mean queue and retry.
- "iobuf is global only." It has global pools plus per-thread channel caches.
- "Metadata is rare, so ignore it." Metadata and protection info are central to many enterprise storage paths.
Diskengine Relevance
Diskengine may observe SPDK failures as generic I/O failures or slow operations. Memory pressure can be the hidden cause.
Useful classification:
- Env memory failure: hugepages or DPDK memory unavailable.
- Mempool exhaustion: fixed object pool too small or leak.
- Iobuf pressure: data buffers exhausted; requests wait.
- DMA translation failure: buffer or device domain mismatch.
- Zero-copy fallback: correctness preserved but latency/CPU changes.
- Metadata mismatch: request shape incompatible with bdev or export path.
When a diskengine reconciliation loop repeatedly retries an SPDK RPC or I/O operation, inspect whether SPDK is making progress or waiting on buffers.
Prose Diagram: Iobuf Get/Put
Think of iobuf as a warehouse with thread-local shelves:
- A module registers as a warehouse customer.
- Each SPDK thread opens a local shelf with spdk_iobuf_channel_init().
- spdk_iobuf_get() first checks the local shelf.
- If empty, it takes a box from the central warehouse and may stock extra boxes on the shelf.
- If the warehouse is empty, the request writes its name on a waiting list.
- spdk_iobuf_put() either puts the box back on the shelf or hands it directly to the first waiter.
The warehouse is shared, but shelf access is thread-owned.
Source Reading Exercise
Read iobuf allocation flow:
- lib/thread/iobuf.c:spdk_iobuf_initialize()
- lib/thread/iobuf.c:spdk_iobuf_register_module()
- lib/thread/iobuf.c:spdk_iobuf_channel_init()
- lib/thread/iobuf.c:spdk_iobuf_get()
- lib/thread/iobuf.c:spdk_iobuf_put()
- lib/thread/iobuf.c:spdk_iobuf_entry_abort()
Then connect it to a transport:
- lib/nvmf/transport.c:nvmf_transport_use_iobuf()
- lib/nvmf/transport.c:spdk_iobuf_register_module() call sites
- lib/nvmf/transport.c:spdk_iobuf_channel_init() call sites
- lib/nvmf/transport.c:spdk_iobuf_get() call sites
- lib/nvmf/transport.c:nvmf_request_iobuf_get_cb()
Questions:
- Which path returns a buffer immediately?
- Which path queues an entry?
- Where is the module recorded on the wait entry?
- How does teardown abort pending entries?
Operational Lab
Source-only sizing lab:
- Read struct spdk_iobuf_opts in include/spdk/thread.h.
- Write down small pool count, large pool count, small buffer size, large buffer size, and NUMA behavior.
- Find a transport or module that calls spdk_iobuf_channel_init().
- Compare its small and large cache sizes to the global pool counts.
- Explain what happens if every reactor creates a channel at once.
Runtime lab:
- Start SPDK with a workload that uses NVMf or another iobuf consumer.
- Query iobuf stats if the RPC is available in the built app.
- Watch cache hits, main pool use, and retry counts.
- Increase queue depth or I/O size and observe whether retries grow.
Self-Check
- Why is spdk_dma_zmalloc() different from calloc()?
- What are mempools good for?
- Why does iobuf have per-thread channels?
- What happens when spdk_iobuf_get() cannot allocate a buffer and an entry is provided?
- Why must teardown abort iobuf waiters?
- What does zero-copy not guarantee?
- How can metadata change buffer reasoning?
References
- Local source: include/spdk/env.h
- Local source: lib/env_dpdk/env.c
- Local source: lib/env_dpdk/memory.c
- Local source: include/spdk/dma.h
- Local source: include/spdk/thread.h
- Local source: lib/thread/iobuf.c
- Local source: lib/nvmf/transport.c
- Local source: include/spdk/nvmf.h
- Local source: include/spdk_internal/sock_module.h