Beginner Mental Model
SPDK bdev modules do not only create physical bdevs. Many modules create virtual bdevs on top of other bdevs. A virtual bdev module needs a way to notice that a base bdev exists and decide whether it should create a child bdev. That mechanism is bdev_examine.
In prose:
1. A base bdev appears.
2. The bdev core asks every interested module:
   - "Does your config say you should use this?"
   - "Does the disk metadata say you should use this?"
3. The module may claim the base bdev.
4. If the module recognizes the base, it creates one or more virtual bdevs.
5. Those new bdevs may themselves be examined by other modules.
This is how SPDK can restart, load a base device, discover it contains an lvolstore, and recreate lvol bdevs without a user manually specifying every child bdev.
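The flow above can be sketched as a toy model. This is not SPDK code: ToyModule, discover, and the bdev names are hypothetical, and real examine is asynchronous and driven by callbacks rather than a simple queue.

```python
from collections import deque


class ToyModule:
    """Hypothetical stand-in for a virtual bdev module (not the SPDK API)."""

    def __init__(self, name, recognizes):
        self.name = name
        # Maps a base bdev name to the child bdev names this module would create.
        self.recognizes = recognizes

    def examine(self, bdev):
        return self.recognizes.get(bdev, [])


def discover(initial_bdevs, modules):
    """Examine every bdev, including children registered along the way."""
    registered = []
    claims = {}
    queue = deque(initial_bdevs)
    while queue:
        bdev = queue.popleft()
        registered.append(bdev)
        for module in modules:
            children = module.examine(bdev)
            # In this toy, the first module that recognizes a base claims it.
            if children and bdev not in claims:
                claims[bdev] = module.name
                queue.extend(children)
    return registered, claims


lvol = ToyModule("lvol", {"Malloc0": ["lvs0/lv0", "lvs0/lv1"]})
passthru = ToyModule("passthru", {"lvs0/lv0": ["PT0"]})
registered, claims = discover(["Malloc0"], [lvol, passthru])
```

The point of the sketch is that a child registered during examine goes back into the same queue, so one base bdev can produce a multi-level stack.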
Why This Matters For diskengine/excloud
Virtual bdev stacking is the difference between "I have one NVMe namespace" and "I have a storage graph":
NVMe bdev
-> RAID bdev
-> lvolstore
-> lvol bdev
-> crypto/delay/error/passthru wrapper
-> NVMe-oF namespace or vhost controller
If a reconciler creates or deletes nodes out of order, or assumes bdevs appear synchronously, it will hit races:
- A base bdev is present but not examined yet.
- A virtual bdev is not present because its module has not finished asynchronous metadata IO.
- A base bdev is claimed by one virtual module and therefore not available to another.
- A manual examine request is rejected because auto-examine is still enabled.
- A delete operation is blocked because a child bdev still claims the base.
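One way a reconciler could tolerate the first two races is to poll for the expected bdev instead of assuming it exists right after the base appears. The helper below is a minimal sketch: wait_for_bdev and the injected list_bdevs callable are hypothetical, and a real client would back list_bdevs with a bdev_get_bdevs RPC call.

```python
import time


def wait_for_bdev(list_bdevs, name, timeout_s=5.0, interval_s=0.05,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll until `name` is listed or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        if name in list_bdevs():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Fake backend: the lvol bdev shows up on the third listing.
calls = {"n": 0}


def fake_list_bdevs():
    calls["n"] += 1
    if calls["n"] < 3:
        return ["Malloc0"]
    return ["Malloc0", "lvs0/lv0"]


found = wait_for_bdev(fake_list_bdevs, "lvs0/lv0", sleep=lambda _: None)
missing = wait_for_bdev(lambda: [], "lvs0/lv9", timeout_s=0.0,
                        sleep=lambda _: None)
```

Injecting sleep and clock keeps the helper testable without real delays.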
The bdev Module Contract
The module structure is in:
include/spdk/bdev_module.h:struct spdk_bdev_module
Relevant members (async_init and async_fini are boolean flags; the others are function pointers):
- include/spdk/bdev_module.h:struct spdk_bdev_module.examine_config
- include/spdk/bdev_module.h:struct spdk_bdev_module.examine_disk
- include/spdk/bdev_module.h:struct spdk_bdev_module.module_init
- include/spdk/bdev_module.h:struct spdk_bdev_module.module_fini
- include/spdk/bdev_module.h:struct spdk_bdev_module.async_init
- include/spdk/bdev_module.h:struct spdk_bdev_module.async_fini
Relevant APIs:
- include/spdk/bdev.h:spdk_bdev_examine
- include/spdk/bdev.h:spdk_bdev_wait_for_examine
- include/spdk/bdev_module.h:spdk_bdev_module_examine_done
- include/spdk/bdev_module.h:spdk_bdev_module_claim_bdev
- include/spdk/bdev_module.h:spdk_bdev_module_claim_bdev_desc
- include/spdk/bdev_module.h:spdk_bdev_module_release_bdev
The comments in include/spdk/bdev_module.h are unusually important. They say:
- examine_config is the first notification.
- examine_config may create vbdevs based on configuration but cannot send IO to the bdev.
- examine_config must decide synchronously whether to claim.
- examine_config must call spdk_bdev_module_examine_done() before returning.
- examine_disk is the second notification.
- examine_disk may use IO and finish asynchronously.
- examine_disk must call spdk_bdev_module_examine_done() when complete.
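The examine_done requirement is easier to see as a counter of outstanding callbacks. The class below is a toy sketch of that idea only. ExamineTracker is a hypothetical name, and the real bookkeeping lives inside lib/bdev/bdev.c.

```python
class ExamineTracker:
    """Toy counter of outstanding examine callbacks (not SPDK code)."""

    def __init__(self):
        self.outstanding = 0

    def start(self):
        # The core hands a bdev to a module's examine callback.
        self.outstanding += 1

    def examine_done(self):
        # The module reports completion, possibly much later for examine_disk.
        if self.outstanding == 0:
            raise RuntimeError("examine_done without a matching examine")
        self.outstanding -= 1

    def idle(self):
        # A wait-for-examine style caller is released only when this is true.
        return self.outstanding == 0


tracker = ExamineTracker()
tracker.start()                  # examine_config begins
tracker.examine_done()           # reported before examine_config returns
tracker.start()                  # examine_disk begins metadata IO
idle_during_io = tracker.idle()
tracker.examine_done()           # IO completion path reports done
idle_after_io = tracker.idle()
```

In this model, a module that never calls examine_done leaves the counter above zero, so anything waiting for examine to finish would wait forever.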
The Core Examine Algorithm
Implementation anchors:
- lib/bdev/bdev.c:bdev_examine
- lib/bdev/bdev.c:spdk_bdev_examine
- lib/bdev/bdev.c:spdk_bdev_wait_for_examine
- lib/bdev/bdev.c:spdk_bdev_module_examine_done
- lib/bdev/bdev.c:bdev_ok_to_examine
- lib/bdev/bdev.c:bdev_in_examine_allowlist
- lib/bdev/bdev.c:bdev_examine_allowlist_check
The internal bdev_examine() does two phases:
- It calls every module's examine_config, if present.
- It calls examine_disk according to the bdev's claim state.
The claim state matters:
- If the bdev is unclaimed, all modules with examine_disk may examine it.
- If the bdev has an exclusive v1 claim, only the claiming module's examine_disk is called.
- If the bdev has v2 claims, all claiming modules with examine_disk may examine it.
Prose diagram:
bdev_examine(bdev)
|
+-- for every module:
| module->examine_config(bdev)
| module must call examine_done
|
+-- inspect bdev claim state:
      none:
        call every module->examine_disk(bdev)
      exclusive v1 claim:
        call only claimant module->examine_disk(bdev)
      v2 claims:
        call each claiming module->examine_disk(bdev)
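The second phase of the diagram reduces to a small selection function. This is an illustrative sketch: examine_disk_targets and the string labels "none", "v1", and "v2" are hypothetical, not identifiers from lib/bdev/bdev.c.

```python
def examine_disk_targets(modules, claim_kind, claimants=()):
    """Return the module names whose examine_disk would be called.

    modules: list of (module_name, has_examine_disk) pairs.
    claim_kind: "none", "v1" (one exclusive claimant), or "v2".
    claimants: names of the modules currently holding claims.
    """
    with_disk = [name for name, has_disk in modules if has_disk]
    if claim_kind == "none":
        return with_disk
    if claim_kind == "v1":
        # Only the single exclusive claimant may examine the disk.
        return [name for name in with_disk if name == claimants[0]]
    if claim_kind == "v2":
        # Every claiming module that implements examine_disk is called.
        return [name for name in with_disk if name in claimants]
    raise ValueError(f"unknown claim kind: {claim_kind}")


modules = [("lvol", True), ("raid", True), ("passthru", False)]
unclaimed = examine_disk_targets(modules, "none")
exclusive = examine_disk_targets(modules, "v1", ["raid"])
shared = examine_disk_targets(modules, "v2", ["lvol", "passthru"])
```

Note that a module without examine_disk is never called in this phase, even when it holds a claim.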
Manual examine is controlled by spdk_bdev_examine(). It must be called on the app thread and fails if auto-examine is enabled. It inserts the bdev name into an allowlist and examines immediately if the bdev already exists.
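The allowlist behavior described above can be modelled in a few lines. This is a simplified sketch with hypothetical names (ExamineGate, register, manual_examine). It ignores the app-thread check and bdev aliases, and it raises an exception where the real function returns an error code.

```python
class ExamineGate:
    """Toy model of auto-examine versus the manual examine allowlist."""

    def __init__(self, auto_examine):
        self.auto_examine = auto_examine
        self.allowlist = set()
        self.registered = set()
        self.examined = []

    def register(self, name):
        # A newly registered bdev is examined only if the policy allows it.
        self.registered.add(name)
        if self.auto_examine or name in self.allowlist:
            self.examined.append(name)

    def manual_examine(self, name):
        if self.auto_examine:
            raise RuntimeError("manual examine rejected: auto-examine is on")
        self.allowlist.add(name)
        # Examine right away if the bdev already exists.
        if name in self.registered:
            self.examined.append(name)


gate = ExamineGate(auto_examine=False)
gate.register("Malloc0")
examined_before = list(gate.examined)
gate.manual_examine("Malloc0")    # bdev exists: examined immediately
gate.manual_examine("Nvme0n1")    # bdev missing: only allowlisted
gate.register("Nvme0n1")          # examined on arrival
```

The second manual_examine call shows why ordering matters less than it seems: allowlisting a name before its bdev exists still results in an examine once the bdev registers.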
RPC anchors:
- lib/bdev/bdev_rpc.c:rpc_bdev_examine
- lib/bdev/bdev_rpc.c:rpc_bdev_wait_for_examine
Why examine_config And examine_disk Both Exist
examine_config is for configuration-driven virtual bdev creation. It is not allowed to send IO, so it cannot read on-disk metadata. This is useful for modules that already have complete information from JSON-RPC or config replay.
examine_disk is for disk-driven discovery. It may open the base bdev, allocate an IO channel, read metadata, and finish asynchronously. lvol and RAID use this style to discover on-disk lvolstore or RAID superblock metadata.
Examples:
- lvol: module/bdev/lvol/vbdev_lvol.c:vbdev_lvs_examine_config handles external snapshot hotplug notification. module/bdev/lvol/vbdev_lvol.c:vbdev_lvs_examine_disk tries to load an lvolstore from the bdev.
- RAID: module/bdev/raid/bdev_raid.c:raid_bdev_examine loads or checks RAID superblocks.
Claims
Claims prevent two independent modules from treating the same base bdev as their private write target.
Source anchors:
- include/spdk/bdev_module.h:enum spdk_bdev_claim_type
- include/spdk/bdev_module.h:struct spdk_bdev_claim_opts
- lib/bdev/bdev.c:spdk_bdev_module_claim_bdev
- lib/bdev/bdev.c:spdk_bdev_module_claim_bdev_desc
- lib/bdev/bdev.c:spdk_bdev_module_release_bdev
- lib/bdev/bdev.c:claim_verify_rwo
- lib/bdev/bdev.c:claim_verify_rom
- lib/bdev/bdev.c:claim_verify_rwm
- lib/bdev/bdev.c:claim_bdev
- lib/bdev/bdev.c:bdev_desc_release_claims
The older spdk_bdev_module_claim_bdev() establishes an exclusive write claim. Newer code may use descriptor claims through spdk_bdev_module_claim_bdev_desc() for read-only-many and shared-write styles.
Claim misconceptions:
- A claim is not the same as opening a bdev. A descriptor can exist without a claim.
- A claim is not a bdev reference count. It is a permission/ownership relationship.
- A claim does not submit IO. It controls which modules may build on the bdev and who may write.
- Releasing a descriptor can release associated v2 claims; examine has special logic for claims released while iterating.
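The first two misconceptions can be illustrated with a toy object that tracks opens and claims separately. This sketch models only an exclusive, v1-style claim. ToyBdev and its methods are hypothetical and return a bool where the real API returns an error code.

```python
class ToyBdev:
    """Toy bdev that keeps open descriptors and the claim as separate state."""

    def __init__(self, name):
        self.name = name
        self.open_descriptors = 0
        self.claimed_by = None

    def open(self):
        # Opening never takes a claim in this model.
        self.open_descriptors += 1
        return self.open_descriptors

    def claim(self, module):
        # Exclusive claim: refused if any module already holds it.
        if self.claimed_by is not None:
            return False
        self.claimed_by = module
        return True

    def release(self, module):
        if self.claimed_by != module:
            raise RuntimeError(f"{module} does not hold the claim")
        self.claimed_by = None


base = ToyBdev("Malloc0")
base.open()
base.open()
opened_without_claim = base.claimed_by is None
lvol_ok = base.claim("lvol")
raid_ok = base.claim("raid")         # refused: lvol holds the claim
base.release("lvol")
raid_retry_ok = base.claim("raid")   # succeeds after release
```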
Stacking Patterns
One-to-One Wrapper
A one-to-one wrapper creates one child bdev over one base bdev. It usually forwards IO after adding behavior.
Examples:
- module/bdev/passthru/vbdev_passthru.c
- module/bdev/delay/vbdev_delay.c
- module/bdev/error/vbdev_error.c
Pattern:
base bdev
  -> wrapper vbdev
       submit_request:
         maybe transform or delay
         submit child IO to base
         complete original bdev_io
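As a rough illustration of the wrapper pattern, the sketch below forwards requests to a base and completes the original request from the child's completion. ToyBase and ToyFaultWrapper are hypothetical. They mimic the shape of an error-injection wrapper, not the code in vbdev_error.c.

```python
class ToyBase:
    """Toy base device that records requests and completes them at once."""

    def __init__(self):
        self.log = []

    def submit(self, op, offset_blocks, num_blocks, on_complete):
        self.log.append((op, offset_blocks, num_blocks))
        on_complete(True)


class ToyFaultWrapper:
    """One-to-one wrapper: adds behavior, then forwards to the base."""

    def __init__(self, base, fail_ops=()):
        self.base = base
        self.fail_ops = set(fail_ops)

    def submit(self, op, offset_blocks, num_blocks, on_complete):
        if op in self.fail_ops:
            # Added behavior: fail the request without touching the base.
            on_complete(False)
            return
        # Forward the child IO. The base's completion completes the original.
        self.base.submit(op, offset_blocks, num_blocks, on_complete)


base = ToyBase()
wrapper = ToyFaultWrapper(base, fail_ops=["write"])
results = []
wrapper.submit("read", 0, 8, results.append)
wrapper.submit("write", 8, 8, results.append)
```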
One-to-Many Partitioning
A partition-like module creates multiple child bdevs from ranges of one base.
Examples:
- lib/bdev/part.c
- module/bdev/split/vbdev_split.c
- module/bdev/gpt/vbdev_gpt.c
Important helper anchors:
- lib/bdev/part.c:spdk_bdev_part_base_construct_ext
- lib/bdev/part.c:spdk_bdev_part_submit_request
- lib/bdev/part.c:spdk_bdev_part_submit_request_ext
- lib/bdev/part.c:spdk_bdev_part_get_base_bdev
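The core of a partition-style module is an offset translation from the child's block range into the base's block range. The function below is a minimal sketch of that arithmetic. part_to_base and its parameters are hypothetical and are not taken from lib/bdev/part.c.

```python
def part_to_base(part_start_blocks, part_num_blocks, io_offset_blocks,
                 io_num_blocks):
    """Translate an IO on a partition into an offset on the base bdev."""
    in_range = (
        io_offset_blocks >= 0
        and io_num_blocks > 0
        and io_offset_blocks + io_num_blocks <= part_num_blocks
    )
    if not in_range:
        raise ValueError("IO falls outside the partition")
    # The child's block 0 is the base's block `part_start_blocks`.
    return part_start_blocks + io_offset_blocks


# A partition that starts at base block 2048 and is 1024 blocks long.
base_offset = part_to_base(2048, 1024, 10, 8)
```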
Many-to-One Aggregation
Aggregation modules create one child bdev from multiple base bdevs.
Examples:
- module/bdev/raid/bdev_raid.c
- module/bdev/raid/raid0.c
- module/bdev/raid/raid1.c
- module/bdev/raid/concat.c
- module/bdev/raid/raid5f.c
Pattern:
base0 + base1 + base2 + ...
  -> aggregate vbdev
       submit_request:
         map logical offset to one or more base offsets
         submit child IO(s)
         complete original bdev_io after collecting completions
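The "map logical offset" step can be made concrete with textbook striping arithmetic. This is a sketch of generic RAID0-style mapping under assumed parameters (strip size in blocks, equal-sized bases). It is not copied from raid0.c, and raid0_map is a hypothetical name.

```python
def raid0_map(offset_blocks, strip_size_blocks, num_base_bdevs):
    """Map a logical block offset to (base index, offset on that base)."""
    strip = offset_blocks // strip_size_blocks
    base_index = strip % num_base_bdevs        # strips rotate across bases
    strip_on_base = strip // num_base_bdevs    # full rotations before this one
    offset_in_strip = offset_blocks % strip_size_blocks
    return base_index, strip_on_base * strip_size_blocks + offset_in_strip


# 128-block strips over three base bdevs.
first = raid0_map(0, 128, 3)
second = raid0_map(128, 128, 3)
wrapped = raid0_map(400, 128, 3)
```

An IO that crosses a strip boundary would need to be split into one child IO per strip, which is where the "collect completions" step comes from.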
Object Adapter
lvol is not just a pass-through bdev wrapper. Its child bdevs are backed by blobs in a blobstore. That means the base bdev may be shared by many lvol child bdevs through the lvolstore claim.
Pattern:
base bdev
-> blobstore/lvolstore owns base
-> lvol bdev A
-> lvol bdev B
-> lvol bdev C
Recursive Discovery
When a virtual bdev is registered, it is a bdev like any other. The bdev core may examine it too. This can create stacks:
Malloc0 appears
-> lvol examine finds lvolstore
-> lvol bdev Lv0 appears
-> another module may examine Lv0
The stack is not a tree in the abstract; it is a graph constrained by claims and module behavior. A bdev can have aliases, consumers, and claims. Some modules create children from multiple bases, and some bdevs can be read by many modules.
Shutdown Ordering
The bdev subsystem tries to shut down top-down so children go away before bases. Source anchors:
- lib/bdev/bdev.c:spdk_bdev_module_fini_done
- lib/bdev/bdev.c:spdk_bdev_module_fini_start_done
The shutdown path skips claimed bdevs at first because a claimed bdev is likely a base for a virtual child. If only claimed bdevs remain, that suggests a module failed to unclaim correctly or the graph has a loop.
Beginner misconception to kill: unregistering a base bdev while children still exist is not a normal successful teardown. Virtual modules must handle remove events, unregister children, release claims, and complete async destruction.
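For a reconciler, top-down teardown amounts to a topological ordering in which every bdev is removed before the bases it sits on. The function below is a sketch of that ordering over a hypothetical child-to-bases map. It is not how lib/bdev/bdev.c walks its bdev list.

```python
def teardown_order(bases_of):
    """Order bdevs so each one is removed before any of its bases.

    bases_of maps a child bdev name to the list of base bdevs it uses.
    """
    nodes = set(bases_of)
    for bases in bases_of.values():
        nodes.update(bases)
    # Count how many children still sit on top of each bdev.
    consumers = {node: 0 for node in nodes}
    for bases in bases_of.values():
        for base in bases:
            consumers[base] += 1
    ready = sorted(node for node in nodes if consumers[node] == 0)
    order = []
    while ready:
        node = ready.pop(0)
        order.append(node)
        for base in bases_of.get(node, []):
            consumers[base] -= 1
            if consumers[base] == 0:
                ready.append(base)
    if len(order) != len(nodes):
        raise ValueError("graph has a loop, nothing more can be removed")
    return order


graph = {
    "PT0": ["lvs0/lv0"],
    "lvs0/lv0": ["Raid0"],
    "Raid0": ["Malloc0", "Malloc1"],
}
order = teardown_order(graph)
```

The ValueError branch corresponds to the "only claimed bdevs remain" situation: no bdev is free of consumers, so nothing can be removed safely.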
lvol Examine Case Study
Key source anchors:
- module/bdev/lvol/vbdev_lvol.c:g_lvol_if
- module/bdev/lvol/vbdev_lvol.c:SPDK_BDEV_MODULE_REGISTER(lvol, &g_lvol_if)
- module/bdev/lvol/vbdev_lvol.c:vbdev_lvs_examine_config
- module/bdev/lvol/vbdev_lvol.c:vbdev_lvs_examine_disk
- module/bdev/lvol/vbdev_lvol.c:_vbdev_lvs_examine
- module/bdev/lvol/vbdev_lvol.c:_vbdev_lvs_examine_cb
- module/bdev/lvol/vbdev_lvol.c:vbdev_lvs_examine_done
vbdev_lvs_examine_config() formats the bdev UUID and notifies lvolstores that a missing external snapshot may have appeared. It calls spdk_bdev_module_examine_done() before returning.
vbdev_lvs_examine_disk() rejects bdevs that report a nonzero metadata size, allocates a request, creates a blobstore device wrapper from the bdev, and calls spdk_lvs_load_ext(). The completion path ultimately calls spdk_bdev_module_examine_done().
If lvolstore load succeeds, _vbdev_lvs_examine_cb() claims the base with spdk_bs_bdev_claim(), records the lvolstore/base pair, and opens every lvol so _create_lvol_disk() can register lvol bdevs.
RAID Examine Case Study
Key anchors:
- module/bdev/raid/bdev_raid.c:g_raid_if
- module/bdev/raid/bdev_raid.c:SPDK_BDEV_MODULE_REGISTER(raid, &g_raid_if)
- module/bdev/raid/bdev_raid.c:raid_bdev_examine
- module/bdev/raid/bdev_raid.c:raid_bdev_examine_load_sb
- module/bdev/raid/bdev_raid.c:raid_bdev_examine_cont
- module/bdev/raid/bdev_raid.c:raid_bdev_examine_sb
- module/bdev/raid/bdev_raid.c:raid_bdev_examine_no_sb
- module/bdev/raid/bdev_raid.c:raid_bdev_examine_done
- module/bdev/raid/bdev_raid_sb.c:raid_bdev_load_base_bdev_superblock
RAID examine tries to read a superblock if superblocks are enabled. If it finds RAID metadata, it may create or update a RAID bdev and configure the base slot. If no superblock is found, it may still use configuration-driven RAID definitions.
Operational Debugging
When a child bdev is missing:
- Confirm the base bdev exists with bdev_get_bdevs.
- Check whether bdev_auto_examine is enabled.
- If auto-examine is disabled, confirm bdev_examine was called for the base bdev.
- Wait for examine with bdev_wait_for_examine.
- Check whether the base bdev is claimed by an unexpected module.
- Check module logs for examine_config or examine_disk errors.
- Check on-disk metadata compatibility: lvolstore super blob, blobstore superblock, RAID superblock.
- Check whether a child was created but immediately unregistered due to open/claim/registration failure.
Source anchors for state:
- lib/bdev/bdev_rpc.c:rpc_dump_bdev_info writes claimed and claim_type fields.
- lib/bdev/bdev.c:bdev_examine_allowlist_config_json records manual examine allowlist in config JSON.
- lib/bdev/bdev.c:bdev_wait_for_examine_cb implements wait-for-examine polling.
Labs
Lab 1: Manual Examine Mental Trace
Assume bdev_auto_examine=false.
Trace:
1. Create Malloc0.
2. Do not call bdev_examine.
3. Create an lvolstore on Malloc0.
4. Restart with only Malloc0 recreated.
5. Call bdev_get_bdevs.
6. Call bdev_examine Malloc0.
7. Call bdev_wait_for_examine.
Expected reasoning:
- Before manual examine, lvol bdevs are not auto-loaded.
- spdk_bdev_examine() must run on the app thread.
- It inserts the name into the allowlist and examines immediately if the bdev exists.
- lvol examine may finish asynchronously because it reads blobstore/lvol metadata.
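Steps 5 to 7 of the trace can be written out as JSON-RPC 2.0 request bodies. This sketch only builds the payloads and does not send them. rpc_request is a hypothetical helper, and the "name" parameter for bdev_examine is an assumption to check against the JSON-RPC documentation for your SPDK version.

```python
import json


def rpc_request(request_id, method, params=None):
    """Build a JSON-RPC 2.0 request body as a string."""
    request = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params:
        request["params"] = params
    return json.dumps(request)


trace = [
    rpc_request(1, "bdev_get_bdevs"),
    # Assumed parameter name: "name" identifies the bdev to examine.
    rpc_request(2, "bdev_examine", {"name": "Malloc0"}),
    rpc_request(3, "bdev_wait_for_examine"),
]
decoded = [json.loads(body) for body in trace]
```

Issuing bdev_wait_for_examine last is what turns the asynchronous lvol load into something a script can depend on before listing bdevs again.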
Lab 2: Read The Claim Graph
Use bdev_get_bdevs JSON and identify:
- Which bdevs are physical bases.
- Which bdevs are virtual children.
- Which bases are claimed.
- Which module owns each claim.
- Whether any virtual child could itself be used as a base for another module.
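A small script can do the first pass of this lab. The sample below is hand-written for illustration, not captured output: it uses only the claimed and claim_type fields mentioned earlier, and the bdev names and the "exclusive_write" value are assumptions. Identifying the owning module requires reading the rest of the real output for your SPDK version.

```python
import json

# Hand-written sample shaped like a trimmed bdev_get_bdevs result.
SAMPLE = json.loads("""
[
  {"name": "Malloc0", "claimed": true, "claim_type": "exclusive_write"},
  {"name": "lvs0/lv0", "claimed": false},
  {"name": "lvs0/lv1", "claimed": false}
]
""")


def claimed_bases(bdevs):
    """Map each claimed bdev name to its reported claim type."""
    return {b["name"]: b.get("claim_type") for b in bdevs if b.get("claimed")}


def unclaimed_bdevs(bdevs):
    """List bdevs that no module has claimed, so they could still be bases."""
    return [b["name"] for b in bdevs if not b.get("claimed")]


claimed = claimed_bases(SAMPLE)
free = unclaimed_bdevs(SAMPLE)
```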
Lab 3: Source Trace A Stack
Pick this stack:
Malloc0 -> RAID0 -> lvolstore -> lvol bdev -> passthru bdev
For each arrow, write:
- Which module creates the child.
- Which source function registers the child bdev.
- Which claim protects the base.
- Which submit_request function handles IO at that layer.
Self-Check
- Why does examine_config have to finish synchronously?
- Why can examine_disk finish asynchronously?
- What happens if a module forgets to call spdk_bdev_module_examine_done()?
- Why does claim state change which modules get examine_disk?
- What is the difference between opening a bdev and claiming it?
- Why can manual examine fail when auto-examine is enabled?
- Where does lvol claim the base bdev after loading an lvolstore?
- Why is shutdown ordering top-down for virtual bdev stacks?
References
- Local bdev API: include/spdk/bdev.h
- Local module API: include/spdk/bdev_module.h
- Local examine implementation: lib/bdev/bdev.c
- Local examine RPC: lib/bdev/bdev_rpc.c
- Local lvol examine: module/bdev/lvol/vbdev_lvol.c
- Local RAID examine: module/bdev/raid/bdev_raid.c
- Local virtual bdev examples: module/bdev/passthru/vbdev_passthru.c, module/bdev/split/vbdev_split.c, module/bdev/delay/vbdev_delay.c