Reader Promise
By the end of this chapter you should be able to look at a block device in SPDK and answer five practical questions:
- What object represents the device?
- Who registered it?
- Who is allowed to open it?
- What per-thread resources are used to submit I/O?
- Which function is called when an I/O reaches the module?
The short version is this: a bdev is not a disk. A bdev is SPDK's common contract for anything that behaves like a block device. It might be a physical NVMe namespace, a file, a malloc-backed fake disk, a logical volume, a RAID volume, or a virtual bdev stacked on another bdev. The bdev layer gives all of these things one uniform API and one uniform I/O object.
Why This Matters For diskengine/excloud
diskengine treats SPDK as an external storage engine. It asks SPDK to create, discover, stack, export, resize, and delete storage objects through JSON-RPC. Almost every diskengine operation eventually names a bdev: an NVMe namespace bdev, an lvol bdev, a RAID bdev, or a bdev exported through NVMe-oF, vhost, or vfio-user.
If a volume is missing, stuck, busy, or returning I/O errors, the first debugging step is usually not "look at NVMe." It is "understand the bdev object graph":
- Is the bdev registered?
- Is it still examining?
- Is it open by another module?
- Is it claimed by a virtual bdev module?
- Does the descriptor have write permission?
- Does the caller have a channel on the right SPDK thread?
- Did the module say the I/O type is supported?
Those questions are answered by the object model.
The Core Objects
struct spdk_bdev
The bdev object is the central descriptor for a block-device surface. It contains geometry, capabilities, ownership, module callbacks, and internal lifecycle state.
Source anchor: include/spdk/bdev_module.h:struct spdk_bdev.
Important public-facing fields:
- name: unique bdev name.
- aliases: alternate names.
- product_name: human-readable device class.
- blocklen: logical block size.
- phys_blocklen: physical block size.
- blockcnt: number of logical blocks.
- md_len: metadata bytes per block, when metadata exists.
- dif_type, dif_pi_format, dif_check_flags: protection information details.
- required_alignment: buffer alignment requirement. The bdev layer may allocate bounce buffers if a request violates this.
- max_segment_size, max_num_segments, max_rw_size: constraints that can force request splitting.
- max_unmap, max_unmap_segments, max_write_zeroes, max_copy: operation-specific limits.
- reset_io_drain_timeout: reset behavior control.
- module: the module that registered the bdev.
- fn_table: the module operations called by the bdev layer.
Important internal fields:
- internal.status: normal, unregistering, removing, etc.
- internal.open_descs: descriptors currently open on this bdev.
- internal.claim_type and internal.claim: ownership by a virtual module.
- internal.qos: rate-limit state.
- internal.reset_in_progress and internal.queued_resets: reset serialization.
- internal.locked_ranges and internal.pending_locked_ranges: quiesce and range-lock state.
- internal.stat: accumulated statistics from destroyed channels.
Beginner mental model: struct spdk_bdev is a device record. It does not itself do I/O. It points at the module function table that does I/O.
struct spdk_bdev_module
A bdev module is a producer of one or more bdevs. The module might be physical, like NVMe, or virtual, like passthru, lvol, RAID, crypto, or delay.
Source anchor: include/spdk/bdev_module.h:struct spdk_bdev_module.
Important callbacks and fields:
- module_init: called during bdev subsystem startup.
- module_fini: called during shutdown.
- fini_start: optional early shutdown hook.
- config_json: emits module-level JSON config.
- name: module name.
- get_ctx_size: tells the bdev layer how much per-I/O driver_ctx memory to append to every struct spdk_bdev_io.
- examine_config: first examine pass for virtual modules. It must complete synchronously.
- examine_disk: second examine pass for virtual modules. It may do I/O and complete asynchronously.
- async_init, async_fini, async_fini_start: tell the bdev subsystem that callbacks finish later.
The registration macro is SPDK_BDEV_MODULE_REGISTER(name, module).
Source anchor: include/spdk/bdev_module.h:SPDK_BDEV_MODULE_REGISTER().
Example source anchors:
- module/bdev/null/bdev_null.c:null_if.
- module/bdev/null/bdev_null.c:SPDK_BDEV_MODULE_REGISTER(null, &null_if).
- module/bdev/passthru/vbdev_passthru.c:passthru_if.
- module/bdev/passthru/vbdev_passthru.c:SPDK_BDEV_MODULE_REGISTER(passthru, &passthru_if).
- module/bdev/nvme/bdev_nvme.c:nvme_if.
- module/bdev/nvme/bdev_nvme.c:SPDK_BDEV_MODULE_REGISTER(nvme, &nvme_if).
Misconception to kill: registering a module is not the same as registering a bdev. A module becomes known at process startup. A bdev becomes visible only when the module allocates a struct spdk_bdev, fills it in, and calls spdk_bdev_register().
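The two-step distinction can be modeled with two separate registries. The sketch below is a toy under stated assumptions: reg_module_register(), reg_bdev_register(), and reg_bdev_get_by_name() are hypothetical names, the fixed-size arrays are a simplification, and nothing here is SPDK code. The point is only that registering a module leaves the bdev lookup empty until a bdev is registered by name.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define REG_MAX 8

/* Toy types; hypothetical stand-ins, not SPDK structures. */
struct reg_module { const char *name; };
struct reg_bdev { const char *name; const struct reg_module *module; };

static const struct reg_module *g_modules[REG_MAX];
static size_t g_num_modules;
static struct reg_bdev *g_bdevs[REG_MAX];
static size_t g_num_bdevs;

/* Step 1: the module becomes known. No block device exists yet. */
int reg_module_register(const struct reg_module *module)
{
	assert(module != NULL);
	if (g_num_modules == REG_MAX) {
		return -ENOMEM;
	}
	g_modules[g_num_modules++] = module;
	return 0;
}

/* Look up a registered bdev by name; NULL if none is visible. */
struct reg_bdev *reg_bdev_get_by_name(const char *name)
{
	for (size_t i = 0; i < g_num_bdevs; i++) {
		if (strcmp(g_bdevs[i]->name, name) == 0) {
			return g_bdevs[i];
		}
	}
	return NULL;
}

/* Step 2: a bdev becomes visible only once it is registered by name. */
int reg_bdev_register(struct reg_bdev *bdev)
{
	if (bdev == NULL || bdev->name == NULL || bdev->module == NULL) {
		return -EINVAL;
	}
	if (reg_bdev_get_by_name(bdev->name) != NULL) {
		return -EEXIST; /* names must be unique */
	}
	if (g_num_bdevs == REG_MAX) {
		return -ENOMEM;
	}
	g_bdevs[g_num_bdevs++] = bdev;
	return 0;
}
```

After reg_module_register() alone, reg_bdev_get_by_name("Null0") still returns NULL; the name resolves only after reg_bdev_register(), and a second registration under the same name fails with -EEXIST.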
struct spdk_bdev_fn_table
The function table is the module's implementation of the bdev contract.
Source anchor: include/spdk/bdev_module.h:struct spdk_bdev_fn_table.
Important entries:
- destruct(void *ctx): destroy the backend object. May return 1 for asynchronous destruct and later call spdk_bdev_destruct_done().
- submit_request(struct spdk_io_channel *ch, struct spdk_bdev_io *bdev_io): handle one bdev I/O.
- io_type_supported(void *ctx, enum spdk_bdev_io_type type): advertise which I/O operations this bdev supports.
- get_io_channel(void *ctx): return a module channel for the current SPDK thread.
- dump_info_json, write_config_json: optional JSON output.
- get_memory_domains: optional memory-domain support.
- reset_device_stat, dump_device_stat_json: optional module-specific statistics.
Example source anchors:
- module/bdev/null/bdev_null.c:null_fn_table.
- module/bdev/passthru/vbdev_passthru.c:vbdev_passthru_fn_table.
Beginner mental model: bdev core code owns the generic policy and lifecycle; the module function table owns the backend-specific work.
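The split between generic policy and backend work can be sketched as a capability check followed by an indirect call. Everything in this example is hypothetical (the ft_ prefix, the three-value I/O type enum, the counter-only "null" backend), and the -ENOTSUP return is this sketch's choice; it illustrates the dispatch pattern, not SPDK's actual submission path.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy I/O types and function table; hypothetical, not SPDK's. */
enum ft_io_type { FT_IO_READ, FT_IO_WRITE, FT_IO_UNMAP };

struct ft_fn_table {
	bool (*io_type_supported)(void *ctx, enum ft_io_type type);
	int (*submit_request)(void *ctx, enum ft_io_type type);
};

/* Backend context for a toy "null" device that only counts requests. */
struct ft_null_ctx { int submitted; };

bool ft_null_io_type_supported(void *ctx, enum ft_io_type type)
{
	(void)ctx;
	return type == FT_IO_READ || type == FT_IO_WRITE;
}

int ft_null_submit_request(void *ctx, enum ft_io_type type)
{
	struct ft_null_ctx *null_ctx = ctx;

	(void)type;
	null_ctx->submitted++;
	return 0;
}

const struct ft_fn_table ft_null_fn_table = {
	.io_type_supported = ft_null_io_type_supported,
	.submit_request = ft_null_submit_request,
};

/* Generic side: apply common policy, then hand off to the module. */
int ft_core_submit(const struct ft_fn_table *fn_table, void *ctx,
		   enum ft_io_type type)
{
	assert(fn_table != NULL && fn_table->submit_request != NULL);
	if (fn_table->io_type_supported != NULL &&
	    !fn_table->io_type_supported(ctx, type)) {
		return -ENOTSUP; /* rejected before the module sees it */
	}
	return fn_table->submit_request(ctx, type);
}
```

A write through ft_core_submit() reaches the backend and increments its counter; an unmap is rejected by the generic layer and the backend's submit_request is never called.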
struct spdk_bdev_desc
A descriptor is an open handle. Applications and modules do not normally submit I/O by holding only a struct spdk_bdev *; they open it and get a descriptor.
Source anchors:
- lib/bdev/bdev.c:spdk_bdev_open_ext().
- lib/bdev/bdev.c:spdk_bdev_open_ext_v2().
- lib/bdev/bdev.c:bdev_open().
- lib/bdev/bdev.c:spdk_bdev_close().
Descriptor facts:
- It is bound to the SPDK thread that opened it.
- It records whether the opener requested write access.
- It stores the event callback used for remove and media-management events.
- It participates in the open_descs list on the bdev.
- It can own claims through newer claim APIs.
Why write permission matters: bdev_open() rejects a write descriptor if the bdev is already claimed by a module in a way that prevents additional writers. You can have many readers, but write access is deliberately constrained because virtual modules need exclusive control when they stack on a base bdev.
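The write-permission rule can be modeled in a few lines. This is a simplified sketch: od_open() and od_close() are hypothetical, the claim is reduced to a single boolean, and the -EPERM return value is an assumption of the sketch. It shows only the asymmetry described above: readers are still admitted on a claimed device, writers are not.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy device and descriptor; hypothetical, not SPDK structures. */
struct od_bdev {
	int open_descs;  /* number of currently open descriptors */
	bool claimed;    /* true once a module has claimed the device */
};

struct od_desc {
	struct od_bdev *bdev;
	bool write;      /* whether the opener asked for write access */
};

/* Open a descriptor. Readers are always admitted in this sketch;
 * writers are refused once the device is claimed. */
int od_open(struct od_bdev *bdev, bool write, struct od_desc *desc)
{
	if (bdev == NULL || desc == NULL) {
		return -EINVAL;
	}
	if (write && bdev->claimed) {
		return -EPERM;
	}
	desc->bdev = bdev;
	desc->write = write;
	bdev->open_descs++;
	return 0;
}

/* Close a descriptor and drop it from the device's open count. */
void od_close(struct od_desc *desc)
{
	assert(desc != NULL && desc->bdev != NULL);
	assert(desc->bdev->open_descs > 0);
	desc->bdev->open_descs--;
	desc->bdev = NULL;
}
```

In this model, a failed write open leaves open_descs untouched, so the caller has nothing to close on the error path.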
struct spdk_io_channel And struct spdk_bdev_channel
The public channel type is struct spdk_io_channel. The bdev layer stores bdev-specific state in a struct spdk_bdev_channel as the channel context.
Source anchors:
- lib/bdev/bdev.c:spdk_bdev_get_io_channel().
- lib/bdev/bdev.c:bdev_channel_create().
- lib/bdev/bdev.c:bdev_channel_destroy().
bdev_channel_create() does several important things:
- Calls the module get_io_channel() callback.
- Gets an accel channel.
- Gets a bdev management channel.
- Creates or reuses a shared resource for NOMEM retry state.
- Initializes submitted, locked, QoS, accel, and memory-domain queues.
- Allocates per-channel statistics.
- Copies existing locked ranges into the new channel.
- Enables QoS on the channel if the bdev already has QoS.
Misconception to kill: a channel is not a queue pair by definition. For NVMe bdevs, a channel will lead to an NVMe qpair. For a null bdev, it leads to a simple poller queue. For virtual bdevs, it often contains a base bdev channel. "Channel" means per-thread module state, not a specific hardware object.
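The "per-thread module state" idea can be sketched as a table keyed by thread. The model below is deliberately crude and entirely hypothetical: threads are small integers, ch_get_io_channel() and ch_put_io_channel() are invented names, and module_state is a placeholder for whatever a real module would keep (a qpair, a base channel, a queue). It illustrates that repeated gets on one thread share a reference-counted channel while a different thread gets its own.

```c
#include <assert.h>
#include <stddef.h>

#define CH_MAX_THREADS 4

/* Toy channel: one per (device, thread) pair. Hypothetical types. */
struct ch_channel {
	int thread_id;
	int ref;           /* how many holders on this thread */
	int module_state;  /* placeholder for module-private state */
};

struct ch_io_device {
	struct ch_channel channels[CH_MAX_THREADS];
	int creates;       /* how many channels were actually created */
};

/* Return the calling thread's channel, creating it on first use. */
struct ch_channel *ch_get_io_channel(struct ch_io_device *dev, int thread_id)
{
	struct ch_channel *ch;

	if (dev == NULL || thread_id < 0 || thread_id >= CH_MAX_THREADS) {
		return NULL;
	}
	ch = &dev->channels[thread_id];
	if (ch->ref == 0) {
		/* First holder on this thread: set up per-thread state. */
		ch->thread_id = thread_id;
		ch->module_state = 0;
		dev->creates++;
	}
	ch->ref++;
	return ch;
}

/* Drop one reference; state is reusable once ref reaches zero. */
void ch_put_io_channel(struct ch_channel *ch)
{
	assert(ch != NULL && ch->ref > 0);
	ch->ref--;
}
```

Two gets on thread 0 followed by one get on thread 1 create exactly two channels in this model, which is why a caller must obtain its channel on the thread that will submit the I/O.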
struct spdk_bdev_io
Every I/O submitted through the bdev layer becomes a struct spdk_bdev_io.
Source anchor: include/spdk/bdev_module.h:struct spdk_bdev_io.
Public-ish fields:
- bdev: target bdev.
- type: operation type.
- u.bdev: block I/O parameters for read, write, unmap, flush, write zeroes, copy, and zcopy.
- u.reset: reset parameters.
- u.abort: abort parameters.
- u.nvme_passthru: NVMe passthrough command parameters.
- driver_ctx: per-I/O memory reserved for the module using spdk_bdev_module.get_ctx_size.
Internal fields:
- internal.ch: bdev channel.
- internal.desc: descriptor used to submit.
- internal.cb and internal.caller_ctx: user completion callback.
- internal.status: pending, success, failed, NOMEM, NVMe error, etc.
- internal.submit_tsc: timestamp for latency accounting.
- internal.split: parent/child split tracking.
- internal.buf and internal.bounce_buf: iobuf and alignment handling.
- internal.link: queue link reused for NOMEM, QoS, memory-domain, accel, and reset queues.
Misconception to kill: driver_ctx is not a malloc you do yourself per I/O. The bdev layer sizes struct spdk_bdev_io to include module-private memory. The module advertises the size through get_ctx_size().
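The trailing-memory layout can be sketched with a C flexible array member. This is an illustration under assumptions: struct dc_bdev_io, dc_bdev_io_alloc(), and the passthru-style context are hypothetical, and the sketch calls calloc() once per I/O purely for brevity, whereas a real I/O pool would be sized up front. The technique shown is one allocation that covers both the generic I/O object and the module-private bytes that follow it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy I/O object; hypothetical. Fixed fields are ordered so that
 * driver_ctx starts on an 8-byte boundary. */
struct dc_bdev_io {
	uint64_t submit_tsc;
	uint32_t type;
	int32_t status;
	unsigned char driver_ctx[]; /* module-private bytes follow */
};

/* Toy module descriptor: only the size callback is modeled. */
struct dc_module {
	size_t (*get_ctx_size)(void);
};

/* What a hypothetical passthru-style module keeps per I/O. */
struct dc_passthru_io_ctx {
	uint64_t base_offset_blocks;
	uint32_t retries;
};

size_t dc_passthru_get_ctx_size(void)
{
	return sizeof(struct dc_passthru_io_ctx);
}

/* One zeroed allocation sized for the I/O plus the module's context.
 * Returns NULL on allocation failure; release with free(). */
struct dc_bdev_io *dc_bdev_io_alloc(const struct dc_module *module)
{
	size_t ctx_size = 0;

	assert(module != NULL);
	/* The module context may contain 64-bit fields. */
	assert(offsetof(struct dc_bdev_io, driver_ctx) % sizeof(uint64_t) == 0);
	if (module->get_ctx_size != NULL) {
		ctx_size = module->get_ctx_size();
	}
	return calloc(1, sizeof(struct dc_bdev_io) + ctx_size);
}
```

A module in this model reaches its state with a cast such as (struct dc_passthru_io_ctx *)io->driver_ctx and never allocates or frees that memory separately.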
Registration Lifecycle
The simple registration flow is:
- Module startup registers or prepares module-global state.
- A concrete bdev object is allocated.
- The module fills struct spdk_bdev.
- The module sets bdev->ctxt, bdev->fn_table, and bdev->module.
- The module calls spdk_bdev_register().
- The bdev layer inserts the name, creates internal state, opens a temporary descriptor, and runs examine callbacks.
- When examine is complete, the bdev becomes generally usable.
Source anchors:
- lib/bdev/bdev.c:spdk_bdev_register().
- lib/bdev/bdev.c:bdev_register().
- lib/bdev/bdev.c:bdev_examine().
- lib/bdev/bdev.c:spdk_bdev_wait_for_examine().
- module/bdev/null/bdev_null.c:bdev_null_create().
- module/bdev/null/bdev_null.c:bdev_null_initialize().
The important detail in spdk_bdev_register() is thread ownership. It checks spdk_thread_is_app_thread(NULL) and rejects registration from the wrong thread. This is why modules often bounce lifecycle work back to the app thread.
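The "bounce to the app thread" pattern can be sketched with a single message queue. The model is hypothetical throughout: threads are integers, at_send_to_app_thread() and at_app_thread_poll() stand in for a real thread-messaging facility, and the -EINVAL return for a wrong-thread call is this sketch's choice. It shows a registration that fails when called directly from a worker thread and succeeds once the same call is delivered as a message to the app thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define AT_APP_THREAD 0
#define AT_QUEUE_MAX 8

typedef void (*at_msg_fn)(void *arg);
struct at_msg { at_msg_fn fn; void *arg; };

/* Toy global state; hypothetical, not SPDK's thread library. */
int g_at_current_thread = AT_APP_THREAD; /* thread "running" now */
int g_at_registered;                     /* successful registrations */
static struct at_msg g_at_queue[AT_QUEUE_MAX];
static size_t g_at_queue_len;

/* Registration is only legal on the app thread in this model. */
int at_bdev_register(const char *name)
{
	if (name == NULL || g_at_current_thread != AT_APP_THREAD) {
		return -EINVAL;
	}
	g_at_registered++;
	return 0;
}

/* Queue a function to run later on the app thread. */
int at_send_to_app_thread(at_msg_fn fn, void *arg)
{
	if (fn == NULL || g_at_queue_len == AT_QUEUE_MAX) {
		return -ENOMEM;
	}
	g_at_queue[g_at_queue_len].fn = fn;
	g_at_queue[g_at_queue_len].arg = arg;
	g_at_queue_len++;
	return 0;
}

/* Simulate the app thread draining its queue, then switch back. */
void at_app_thread_poll(void)
{
	int saved = g_at_current_thread;

	g_at_current_thread = AT_APP_THREAD;
	for (size_t i = 0; i < g_at_queue_len; i++) {
		g_at_queue[i].fn(g_at_queue[i].arg);
	}
	g_at_queue_len = 0;
	g_at_current_thread = saved;
}

/* Message body: runs on the app thread, so registration succeeds. */
void at_register_msg(void *arg)
{
	int rc = at_bdev_register(arg);

	assert(rc == 0);
	(void)rc;
}
```

The design cost is that registration becomes asynchronous for the worker: the result is only known after the app thread has polled, so any follow-up work has to be chained from the message rather than written after the send call.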
Open, Claim, And Stack
Virtual bdevs sit on base bdevs. To do this safely, they usually:
- Open the base bdev with spdk_bdev_open_ext().
- Store the base descriptor.
- Claim the base bdev.
- Create and register a new virtual bdev.
- On destruct, release the claim and close the base descriptor.
Source anchors:
- include/spdk/bdev_module.h:enum spdk_bdev_claim_type.
- include/spdk/bdev_module.h:spdk_bdev_module_claim_bdev().
- include/spdk/bdev_module.h:spdk_bdev_module_claim_bdev_desc().
- include/spdk/bdev_module.h:spdk_bdev_module_release_bdev().
- module/bdev/passthru/vbdev_passthru.c:vbdev_passthru_register().
- module/bdev/passthru/vbdev_passthru.c:vbdev_passthru_destruct().
The passthru module is intentionally simple and therefore valuable. In vbdev_passthru_register(), it opens the base bdev, copies geometry to the virtual bdev, registers an io_device for per-thread virtual-bdev state, claims the base bdev, and then registers the virtual bdev. In vbdev_passthru_destruct(), it removes itself from the global list, releases the base claim, closes the base descriptor on the original thread, and unregisters its io_device.
Misconception to kill: a claim is not the same as an open descriptor. A descriptor says "I have a handle." A claim says "my module owns a stacking relationship that constrains other writers."
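The open, claim, register sequence and its mirror-image teardown can be sketched as follows. This is a toy with hypothetical names (stk_vbdev_create(), stk_vbdev_destruct()); the descriptor is reduced to a counter, the claim to an owner string, and the io_device step is omitted. What it illustrates is the unwind discipline: a failed claim must close the descriptor it just opened, and destruct must release the claim before closing.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy base device and virtual device; hypothetical types. */
struct stk_base {
	int open_descs;          /* open descriptors on the base */
	const char *claimed_by;  /* owning module name, or NULL */
};

struct stk_vbdev {
	struct stk_base *base;
	bool registered;
};

/* Open the base, claim it, then expose the virtual device. */
int stk_vbdev_create(struct stk_vbdev *vbdev, struct stk_base *base,
		     const char *module_name)
{
	if (vbdev == NULL || base == NULL || module_name == NULL) {
		return -EINVAL;
	}
	base->open_descs++;               /* 1. open the base */
	if (base->claimed_by != NULL) {   /* 2. claim can fail... */
		base->open_descs--;       /*    ...so undo the open */
		return -EPERM;
	}
	base->claimed_by = module_name;
	vbdev->base = base;
	vbdev->registered = true;         /* 3. register the vbdev */
	return 0;
}

/* Tear down in reverse: unregister, release claim, close. */
void stk_vbdev_destruct(struct stk_vbdev *vbdev)
{
	assert(vbdev != NULL && vbdev->registered && vbdev->base != NULL);
	vbdev->registered = false;
	vbdev->base->claimed_by = NULL;
	vbdev->base->open_descs--;
	vbdev->base = NULL;
}
```

In this model a second module that tries to stack on an already claimed base fails cleanly and leaves the open count unchanged; after the first virtual device is destructed, the same attempt succeeds.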
Prose Diagram
Imagine a vertical diagram with five boxes:
Top box: "Application or upper SPDK layer." It holds a spdk_bdev_desc and a spdk_io_channel.
Second box: "bdev core." It validates block ranges, allocates spdk_bdev_io, applies splitting, QoS, reset, NOMEM retry, statistics, and completion routing.
Third box: "struct spdk_bdev." This is the named object with geometry and a pointer to fn_table.
Fourth box: "module function table." submit_request, get_io_channel, io_type_supported, destruct.
Bottom box: "backend." This may be an NVMe namespace, a file, malloc memory, another bdev, or a network connection.
Arrows go down for submission. Arrows go up for completion. Side arrows from the bdev box point to descriptors, claims, QoS, reset state, and locked ranges.
Edge Cases And Failure Modes
- Duplicate names: spdk_bdev_register() can fail with -EEXIST. The name tree is global across bdev names and aliases.
- Wrong thread: bdev registration must happen on the app thread; descriptor close asserts that the current thread matches the descriptor thread.
- Missing event callback: spdk_bdev_open_ext_v2() rejects a NULL event callback.
- Write denied: bdev_open() rejects a write open if the bdev is already claimed.
- Unregister while open: unregister starts removal and notifies descriptors, but final destruction can be deferred until descriptors close.
- Asynchronous destruct: modules that cannot destroy immediately return 1 from destruct() and later call spdk_bdev_destruct_done().
- Channel destruction with queued I/O: bdev_channel_destroy() aborts queued NOMEM and iobuf-waiting I/O and rolls channel statistics into bdev-wide stats.
- Geometry mismatch in virtual modules: if a virtual bdev copies fields incorrectly from its base bdev, upper layers may submit I/O that the base cannot handle.
- Hidden metadata: descriptor options can change visible block size for callers. Do not assume bdev->blocklen is always the byte count observed by a descriptor.
- Claims released too late: shutdown can hang or removal can fail if a virtual module keeps a claim without presenting or cleaning up virtual bdevs correctly.
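The "unregister while open" case from the list above can be modeled as a small state machine. The sketch is hypothetical (ur_unregister(), ur_close(), and the three-state enum are invented for illustration) and ignores asynchronous destruct. It shows why an unregister call can return while the device still exists: destruction is triggered by whichever event comes last, the unregister or the final close.

```c
#include <assert.h>
#include <stddef.h>

/* Toy lifecycle states; hypothetical, not SPDK's status enum. */
enum ur_status { UR_READY, UR_REMOVING, UR_DESTROYED };

struct ur_bdev {
	enum ur_status status;
	int open_descs;          /* descriptors still open */
	int remove_events_sent;  /* remove notifications delivered */
};

/* Begin removal: notify every open descriptor, and destroy right
 * away only if nobody holds the device open. */
void ur_unregister(struct ur_bdev *bdev)
{
	assert(bdev != NULL && bdev->status == UR_READY);
	bdev->status = UR_REMOVING;
	bdev->remove_events_sent += bdev->open_descs;
	if (bdev->open_descs == 0) {
		bdev->status = UR_DESTROYED;
	}
}

/* Close one descriptor; the last close during removal completes
 * the deferred destruction. */
void ur_close(struct ur_bdev *bdev)
{
	assert(bdev != NULL && bdev->open_descs > 0);
	bdev->open_descs--;
	if (bdev->status == UR_REMOVING && bdev->open_descs == 0) {
		bdev->status = UR_DESTROYED;
	}
}
```

With two descriptors open, ur_unregister() sends two notifications and leaves the device in the removing state; it reaches the destroyed state only after the second ur_close(). A descriptor holder that ignores its remove event therefore keeps the device alive indefinitely in this model.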
Misconceptions To Kill
- "A bdev is always hardware." No. It is an abstraction.
- "The module owns all policy." No. The bdev core owns common policy like splitting, QoS, reset gating, NOMEM retry, and completion routing.
- "The public channel is the module channel." Not exactly. The public channel is a bdev channel whose context contains or points to the module's channel.
- "A descriptor can be closed anywhere." No. It is tied to the opening SPDK thread.
- "A virtual bdev should just keep a pointer to its base bdev." It normally needs a descriptor, a claim, an event callback, and per-thread base channels.
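- "Returning success from an RPC means all examine side effects are visible." Not always. Some paths wait for examine; others return while asynchronous examine is still running, so callers that depend on the results need to wait explicitly, for example through spdk_bdev_wait_for_examine().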
Source Reading Exercise
Read these in order:
- module/bdev/null/bdev_null.c:bdev_null_create().
- module/bdev/null/bdev_null.c:null_fn_table.
- lib/bdev/bdev.c:spdk_bdev_register().
- lib/bdev/bdev.c:spdk_bdev_open_ext().
- lib/bdev/bdev.c:bdev_channel_create().
- include/spdk/bdev_module.h:struct spdk_bdev_io.
Questions:
- Where is the null bdev's backend context stored?
- Which function returns the null module's per-thread channel?
- Where does the bdev layer store the user's completion callback?
- Why does spdk_bdev_register() open a temporary descriptor?
- Which parts of struct spdk_bdev should a module fill, and which parts are explicitly internal?
Operational Lab
No live SPDK system is required.
- Pick a bdev name that appears in an RPC config, for example Nvme0n1.
- Determine which module owns it by finding the constructor RPC. For NVMe, the constructor is usually bdev_nvme_attach_controller.
- Find that module's struct spdk_bdev_module.
- Find the module's struct spdk_bdev_fn_table.
- Find the function that fills bdev->name, bdev->blocklen, bdev->blockcnt, bdev->ctxt, bdev->fn_table, and bdev->module.
- Write down how the module would destroy that bdev.
Expected outcome: you should be able to explain the bdev's owner, lifecycle, and I/O dispatch function without running SPDK.
Self-Check
- What is the difference between struct spdk_bdev, struct spdk_bdev_desc, and struct spdk_io_channel?
- Why does a module provide get_ctx_size()?
- What is the role of spdk_bdev_module_claim_bdev() in a virtual bdev?
- Why can unregister be delayed after spdk_bdev_unregister() is called?
- Why is submit_request() not a public API for applications?
- Which source function should you read first when debugging "bdev exists but open fails"?
- Which source function should you read first when debugging "channel creation fails"?
- Why is copying a base bdev's block size not enough to implement a correct virtual bdev?
References
- Local source: include/spdk/bdev_module.h.
- Local source: lib/bdev/bdev.c.
- Local source: module/bdev/null/bdev_null.c.
- Local source: module/bdev/passthru/vbdev_passthru.c.
- SPDK documentation: https://spdk.io/doc/