SPDK From First Principles

SPDK deep learning path

Chapter: bdev_examine And Virtual bdev Stacking

Source: drafts/lvol-raid/03-bdev-examine-and-stacking.md

Beginner Mental Model

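SPDK bdev modules do more than expose physical devices. Many of them create virtual bdevs layered on top of other bdevs. A virtual bdev module needs a way to learn that a base bdev has appeared and to decide whether to build a child bdev on it. That mechanism is bdev examine: the bdev core's internal bdev_examine() function calling each module's examine_config and examine_disk callbacks.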

In prose:

A base bdev appears.
The bdev core asks every interested module:
  "Does your config say you should use this?"
  "Does the disk metadata say you should use this?"
The module may claim the base bdev.
If the module recognizes the base, it creates one or more virtual bdevs.
Those new bdevs may themselves be examined by other modules.

This is how SPDK can restart, load a base device, discover it contains an lvolstore, and recreate lvol bdevs without a user manually specifying every child bdev.

Why This Matters For diskengine/excloud

Virtual bdev stacking is the difference between "I have one NVMe namespace" and "I have a storage graph":

NVMe bdev
  -> RAID bdev
     -> lvolstore
        -> lvol bdev
           -> crypto/delay/error/passthru wrapper
              -> NVMe-oF namespace or vhost controller

If a reconciler creates or deletes nodes out of order, or assumes bdevs appear synchronously, it will hit races and ordering failures such as:

  • A base bdev is present but not examined yet.
  • A virtual bdev is not present because its module has not finished asynchronous metadata IO.
  • A base bdev is claimed by one virtual module and therefore not available to another.
  • A manual bdev_examine call is rejected because auto-examine is enabled.
  • A delete operation is blocked because a child bdev still claims the base.

The bdev Module Contract

The module structure is in:

  • include/spdk/bdev_module.h:struct spdk_bdev_module

Relevant function pointers:

  • include/spdk/bdev_module.h:struct spdk_bdev_module.examine_config
  • include/spdk/bdev_module.h:struct spdk_bdev_module.examine_disk
  • include/spdk/bdev_module.h:struct spdk_bdev_module.module_init
  • include/spdk/bdev_module.h:struct spdk_bdev_module.module_fini
  • include/spdk/bdev_module.h:struct spdk_bdev_module.async_init
  • include/spdk/bdev_module.h:struct spdk_bdev_module.async_fini

Relevant APIs:

  • include/spdk/bdev.h:spdk_bdev_examine
  • include/spdk/bdev.h:spdk_bdev_wait_for_examine
  • include/spdk/bdev_module.h:spdk_bdev_module_examine_done
  • include/spdk/bdev_module.h:spdk_bdev_module_claim_bdev
  • include/spdk/bdev_module.h:spdk_bdev_module_claim_bdev_desc
  • include/spdk/bdev_module.h:spdk_bdev_module_release_bdev

The comments in include/spdk/bdev_module.h are unusually important. They say:

  • examine_config is the first notification.
  • examine_config may create vbdevs based on configuration but cannot send IO to the bdev.
  • examine_config must decide synchronously whether to claim.
  • examine_config must call spdk_bdev_module_examine_done() before returning.
  • examine_disk is the second notification.
  • examine_disk may use IO and finish asynchronously.
  • examine_disk must call spdk_bdev_module_examine_done() when complete.
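
The "must call examine_done before returning" rule can be pictured as a per-module counter of pending actions. The following is a toy model under that assumption, not the SPDK implementation: the `Module` class, the `action_in_progress` counter, and `run_examine_config` are illustrative names chosen for this sketch.

```python
class Module:
    """Toy stand-in for a bdev module; it only tracks pending actions."""

    def __init__(self, name, examine_config=None):
        self.name = name
        self.examine_config = examine_config
        self.action_in_progress = 0


def examine_done(module):
    """Models spdk_bdev_module_examine_done(): one pending action finished."""
    assert module.action_in_progress > 0
    module.action_in_progress -= 1


def run_examine_config(module, bdev_name):
    """Call examine_config and report whether the module honored the contract."""
    before = module.action_in_progress
    module.action_in_progress += 1
    module.examine_config(module, bdev_name)
    # If the counter did not return to its prior value, the callback
    # returned without signalling completion.
    return module.action_in_progress == before


def good_config(module, bdev_name):
    # Decides synchronously from configuration, issues no IO, then signals.
    examine_done(module)


def forgetful_config(module, bdev_name):
    # Bug: returns without calling examine_done.
    pass
```

A module like `forgetful_config` leaves the counter stuck above zero, which in this model is exactly the condition that would keep a wait-for-examine from ever completing.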

The Core Examine Algorithm

Implementation anchors:

  • lib/bdev/bdev.c:bdev_examine
  • lib/bdev/bdev.c:spdk_bdev_examine
  • lib/bdev/bdev.c:spdk_bdev_wait_for_examine
  • lib/bdev/bdev.c:spdk_bdev_module_examine_done
  • lib/bdev/bdev.c:bdev_ok_to_examine
  • lib/bdev/bdev.c:bdev_in_examine_allowlist
  • lib/bdev/bdev.c:bdev_examine_allowlist_check

The internal bdev_examine() does two phases:

  1. It calls every module's examine_config, if present.
  2. It calls examine_disk according to the bdev's claim state.

The claim state matters:

  • If the bdev is unclaimed, all modules with examine_disk may examine it.
  • If the bdev has an exclusive v1 claim, only the claiming module's examine_disk is called.
  • If the bdev has v2 claims, all claiming modules with examine_disk may examine it.

Prose diagram:

bdev_examine(bdev)
  |
  +-- for every module:
  |     module->examine_config(bdev)
  |     module must call examine_done
  |
  +-- inspect bdev claim state:
        none:
          call every module->examine_disk(bdev)
        exclusive v1 claim:
          call only claimant module->examine_disk(bdev)
        v2 claims:
          call each claiming module->examine_disk(bdev)
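
The diagram above can be turned into a small executable model. This is a sketch of the dispatch rule only, written for this chapter; the `Bdev` and `Module` classes and the claim-state strings are hypothetical and do not mirror SPDK's data structures.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

CLAIM_NONE = "none"
CLAIM_EXCL_V1 = "exclusive_v1"
CLAIM_V2 = "v2"


@dataclass
class Bdev:
    name: str
    claim_type: str = CLAIM_NONE
    claimants: List[str] = field(default_factory=list)  # claiming module names


@dataclass
class Module:
    name: str
    examine_config: Optional[Callable] = None
    examine_disk: Optional[Callable] = None


def bdev_examine(bdev: Bdev, modules: List[Module]) -> List[str]:
    """Return the ordered list of callbacks the toy core invokes."""
    calls = []
    # Phase 1: every module that has examine_config is notified.
    for m in modules:
        if m.examine_config is not None:
            calls.append(f"{m.name}.examine_config")
            m.examine_config(m, bdev)
    # Phase 2: who gets examine_disk depends on the claim state
    # as it stands after phase 1.
    if bdev.claim_type == CLAIM_NONE:
        eligible = modules
    else:
        # An exclusive v1 claim has one claimant; v2 claims may have several.
        eligible = [m for m in modules if m.name in bdev.claimants]
    for m in eligible:
        if m.examine_disk is not None:
            calls.append(f"{m.name}.examine_disk")
            m.examine_disk(m, bdev)
    return calls


def noop(module, bdev):
    pass


modules = [
    Module("raid", examine_config=noop, examine_disk=noop),
    Module("lvol", examine_config=noop, examine_disk=noop),
]
```

With an unclaimed bdev both modules see both phases; with `Bdev("Malloc1", CLAIM_EXCL_V1, ["lvol"])` only the lvol entry receives examine_disk.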

Manual examine is controlled by spdk_bdev_examine(). It must be called on the app thread and fails if auto-examine is enabled. It inserts the bdev name into an allowlist and examines immediately if the bdev already exists.
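
The gating rules for manual examine can be sketched as a small state machine. This is a simplified model and not the code in lib/bdev/bdev.c: `ExamineState`, `manual_examine`, and `register_bdev` are hypothetical names, the specific negative errno values are assumptions made for the sketch, and alias handling is ignored.

```python
import errno


class ExamineState:
    """Toy model of the examine-related global state."""

    def __init__(self, auto_examine):
        self.auto_examine = auto_examine
        self.allowlist = []      # names accepted for manual examine
        self.registered = set()  # bdevs that currently exist
        self.examined = []       # order in which bdevs were examined


def manual_examine(state, name, on_app_thread=True):
    """Models the checks described for spdk_bdev_examine()."""
    if not on_app_thread:
        return -errno.EINVAL  # assumed error code
    if state.auto_examine:
        return -errno.EINVAL  # assumed error code
    if name in state.allowlist:
        return -errno.EEXIST  # assumed error code for a duplicate name
    state.allowlist.append(name)
    if name in state.registered:
        state.examined.append(name)  # bdev already exists: examine now
    return 0


def register_bdev(state, name):
    """A new bdev is examined only if auto-examine or the allowlist permits."""
    state.registered.add(name)
    if state.auto_examine or name in state.allowlist:
        state.examined.append(name)
```

The model shows why the allowlist matters in both orders: a name added before the bdev exists takes effect when the bdev registers, and a name added afterwards triggers examine immediately.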

RPC anchors:

  • lib/bdev/bdev_rpc.c:rpc_bdev_examine
  • lib/bdev/bdev_rpc.c:rpc_bdev_wait_for_examine

Why examine_config And examine_disk Both Exist

examine_config is for configuration-driven virtual bdev creation. It is not allowed to send IO, so it cannot read on-disk metadata. This is useful for modules that already have complete information from JSON-RPC or config replay.

examine_disk is for disk-driven discovery. It may open the base bdev, allocate an IO channel, read metadata, and finish asynchronously. lvol and RAID use this style to discover on-disk lvolstore or RAID superblock metadata.

Examples:

  • lvol: module/bdev/lvol/vbdev_lvol.c:vbdev_lvs_examine_config handles external snapshot hotplug notification. module/bdev/lvol/vbdev_lvol.c:vbdev_lvs_examine_disk tries to load an lvolstore from the bdev.
  • RAID: module/bdev/raid/bdev_raid.c:raid_bdev_examine loads or checks RAID superblocks.
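
The difference between the two callbacks is mostly about when completion is signalled. The sketch below models a disk-driven examine that cannot finish until a metadata read completes on a later poll. It is an illustration only: `Core`, `submit_read`, and `lvol_like_examine_disk` are hypothetical names, and the single-queue poller is a stand-in for SPDK's real IO path.

```python
from collections import deque


class Core:
    """Toy bdev core with a pending-examine counter and a fake IO queue."""

    def __init__(self):
        self.pending_examines = 0
        self.io_queue = deque()
        self.log = []

    def start_examine_disk(self, examine_disk, bdev):
        self.pending_examines += 1
        examine_disk(self, bdev)

    def submit_read(self, bdev, on_complete):
        # The read does not complete inline; it completes on a later poll.
        self.io_queue.append((bdev, on_complete))

    def examine_done(self):
        self.pending_examines -= 1

    def poll(self):
        if self.io_queue:
            bdev, on_complete = self.io_queue.popleft()
            on_complete(self, bdev)

    def examine_finished(self):
        # What a wait-for-examine style check would look at in this model.
        return self.pending_examines == 0


def lvol_like_examine_disk(core, bdev):
    # Cannot decide yet: the answer depends on on-disk metadata.
    core.submit_read(bdev, metadata_loaded)


def metadata_loaded(core, bdev):
    core.log.append(f"loaded metadata from {bdev}")
    core.examine_done()  # completion is signalled from the IO callback
```

Between `start_examine_disk` and the first `poll`, the base bdev exists but its children do not, which is the window a reconciler has to tolerate.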

Claims

Claims prevent two independent modules from treating the same base bdev as their private write target.

Source anchors:

  • include/spdk/bdev_module.h:enum spdk_bdev_claim_type
  • include/spdk/bdev_module.h:struct spdk_bdev_claim_opts
  • lib/bdev/bdev.c:spdk_bdev_module_claim_bdev
  • lib/bdev/bdev.c:spdk_bdev_module_claim_bdev_desc
  • lib/bdev/bdev.c:spdk_bdev_module_release_bdev
  • lib/bdev/bdev.c:claim_verify_rwo
  • lib/bdev/bdev.c:claim_verify_rom
  • lib/bdev/bdev.c:claim_verify_rwm
  • lib/bdev/bdev.c:claim_bdev
  • lib/bdev/bdev.c:bdev_desc_release_claims

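The older spdk_bdev_module_claim_bdev() establishes an exclusive write claim (the v1 claim referred to earlier). Newer code may take descriptor-based v2 claims through spdk_bdev_module_claim_bdev_desc(), which cover read-many/write-one, read-only-many, and shared-write styles; the claim_verify_rwo, claim_verify_rom, and claim_verify_rwm helpers listed above check each style in turn.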

Claim misconceptions:

  • A claim is not the same as opening a bdev. A descriptor can exist without a claim.
  • A claim is not a bdev reference count. It is a permission/ownership relationship.
  • A claim does not submit IO. It controls which modules may build on the bdev and who may write.
  • Releasing a descriptor can release associated v2 claims; examine has special logic for claims released while iterating.
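
To make the ownership idea concrete, here is a deliberately simplified compatibility model for the claim types. It is a sketch under stated assumptions, not a transcription of the claim_verify_* functions: it ignores descriptor write mode and already-open writers, the `try_claim` helper is hypothetical, and returning -EPERM for a conflict is an assumption.

```python
import errno
from dataclasses import dataclass, field
from typing import List

EXCL_WRITE = "SPDK_BDEV_CLAIM_EXCL_WRITE"
RWO = "SPDK_BDEV_CLAIM_READ_MANY_WRITE_ONE"
ROM = "SPDK_BDEV_CLAIM_READ_MANY_WRITE_NONE"
RWM = "SPDK_BDEV_CLAIM_READ_MANY_WRITE_SHARED"


@dataclass
class Claim:
    module: str
    claim_type: str
    shared_key: int = 0


@dataclass
class Bdev:
    name: str
    claims: List[Claim] = field(default_factory=list)


def try_claim(bdev, module, claim_type, shared_key=0):
    """Grant the claim if it is compatible with existing claims (simplified)."""
    if bdev.claims:
        existing = bdev.claims[0]
        compatible = (
            # Read-only claims coexist with other read-only claims.
            (claim_type == ROM and existing.claim_type == ROM)
            # Shared-write claims coexist only when the keys match.
            or (claim_type == RWM and existing.claim_type == RWM
                and existing.shared_key == shared_key)
        )
        if not compatible:
            return -errno.EPERM  # assumed error code
    bdev.claims.append(Claim(module, claim_type, shared_key))
    return 0
```

Note that nothing here opens the bdev or submits IO: the model only records who is allowed to build on the base, which is the distinction the list above draws.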

Stacking Patterns

One-to-One Wrapper

A one-to-one wrapper creates one child bdev over one base bdev. It usually forwards IO after adding behavior.

Examples:

  • module/bdev/passthru/vbdev_passthru.c
  • module/bdev/delay/vbdev_delay.c
  • module/bdev/error/vbdev_error.c

Pattern:

base bdev
  -> wrapper vbdev
       submit_request:
         maybe transform or delay
         submit child IO to base
         complete original bdev_io
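
The forwarding pattern can be sketched with two toy classes. This is an illustration of the control flow only; `BaseBdev`, `Wrapper`, and the callback signature are hypothetical and much simpler than the real submit_request and bdev_io machinery.

```python
class BaseBdev:
    """Toy base device: one bytes value per block."""

    def __init__(self, num_blocks):
        self.blocks = [b""] * num_blocks

    def submit_request(self, op, lba, data, done):
        if op == "write":
            self.blocks[lba] = data
            done(True, None)
        else:
            done(True, self.blocks[lba])


class Wrapper:
    """Toy one-to-one wrapper: count the request, forward it, complete the parent."""

    def __init__(self, base):
        self.base = base
        self.forwarded = 0

    def submit_request(self, op, lba, data, done):
        # A real wrapper might transform, delay, or inject an error here.
        self.forwarded += 1

        def child_done(success, payload):
            # Complete the original request with the child's status.
            done(success, payload)

        self.base.submit_request(op, lba, data, child_done)
```

The key property is that the original request completes only from the child's completion callback, never directly from `submit_request`.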

One-to-Many Partitioning

A partition-like module creates multiple child bdevs from ranges of one base.

Examples:

  • lib/bdev/part.c
  • module/bdev/split/vbdev_split.c
  • module/bdev/gpt/vbdev_gpt.c

Important helper anchors:

  • lib/bdev/part.c:spdk_bdev_part_base_construct_ext
  • lib/bdev/part.c:spdk_bdev_part_submit_request
  • lib/bdev/part.c:spdk_bdev_part_submit_request_ext
  • lib/bdev/part.c:spdk_bdev_part_get_base_bdev
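
At its core, partitioning is an offset translation plus a bounds check. The sketch below shows that arithmetic in isolation. It is not the part.c implementation: `Part`, `split_evenly`, `in_range`, and `translate` are hypothetical helpers, and the child naming is only illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Part:
    name: str
    offset_blocks: int  # first block of this partition on the base
    num_blocks: int


def split_evenly(base_name: str, base_blocks: int, n: int) -> List[Part]:
    """Carve a base into n equal partitions, dropping any remainder."""
    size = base_blocks // n
    return [Part(f"{base_name}p{i}", i * size, size) for i in range(n)]


def in_range(part: Part, lba: int, count: int) -> bool:
    """True if the request lies entirely inside the partition."""
    return lba >= 0 and count > 0 and lba + count <= part.num_blocks


def translate(part: Part, lba: int, count: int) -> int:
    """Map a partition-relative LBA to the base bdev's LBA."""
    if not in_range(part, lba, count):
        raise ValueError("request outside partition")
    return part.offset_blocks + lba
```

For a 1000-block base split four ways, block 10 of the third partition lands on base block 510, and a request that would run past the partition's end is refused instead of spilling into its neighbor.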

Many-to-One Aggregation

Aggregation modules create one child bdev from multiple base bdevs.

Examples:

  • module/bdev/raid/bdev_raid.c
  • module/bdev/raid/raid0.c
  • module/bdev/raid/raid1.c
  • module/bdev/raid/concat.c
  • module/bdev/raid/raid5f.c

Pattern:

base0 + base1 + base2 + ...
  -> aggregate vbdev
       submit_request:
         map logical offset to one or more base offsets
         submit child IO(s)
         collect completions
         complete original bdev_io
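
The "map logical offset to one or more base offsets" step is easiest to see for simple striping. The following is generic striping arithmetic written for illustration, not a transcription of raid0.c; `stripe_map` and `split_request` are hypothetical names, and all units are blocks.

```python
def stripe_map(lba, strip_blocks, num_bases):
    """Map a logical block to (base_index, base_lba) for plain striping."""
    strip_index = lba // strip_blocks      # which strip, counted across all bases
    offset_in_strip = lba % strip_blocks
    base_index = strip_index % num_bases   # strips rotate across the bases
    stripe_row = strip_index // num_bases  # how many full rows precede this strip
    return base_index, stripe_row * strip_blocks + offset_in_strip


def split_request(lba, count, strip_blocks, num_bases):
    """Split a logical extent into child extents: (base_index, base_lba, blocks)."""
    children = []
    while count > 0:
        base_index, base_lba = stripe_map(lba, strip_blocks, num_bases)
        # A child IO may not cross a strip boundary.
        run = min(count, strip_blocks - (lba % strip_blocks))
        children.append((base_index, base_lba, run))
        lba += run
        count -= run
    return children
```

With 8-block strips over three bases, a 12-block request starting at block 6 becomes three child IOs, one per base, and the aggregate layer completes the original request only after all three have completed.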

Object Adapter

lvol is not just a pass-through bdev wrapper. Its child bdevs are backed by blobs in a blobstore. That means many lvol child bdevs can share one base bdev, with the lvolstore holding the claim on that base on their behalf.

Pattern:

base bdev
  -> blobstore/lvolstore owns base
     -> lvol bdev A
     -> lvol bdev B
     -> lvol bdev C

Recursive Discovery

When a virtual bdev is registered, it is a bdev like any other. The bdev core may examine it too. This can create stacks:

Malloc0 appears
  -> lvol examine finds lvolstore
     -> lvol bdev Lv0 appears
        -> another module may examine Lv0

The stack is not a tree in the abstract; it is a graph constrained by claims and module behavior. A bdev can have aliases, consumers, and claims. Some modules create children from multiple bases, and some bdevs can be read by many modules.
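
Recursive discovery can be modeled as registration that triggers examine, which may trigger further registration. This is a minimal sketch, assuming each module can be reduced to a function that returns the names of the children it would create; `register`, `lvol_like`, and `passthru_like` are hypothetical and skip claims and asynchronous completion entirely.

```python
def register(name, modules, order):
    """Register a bdev, then let every module examine it and add children."""
    order.append(name)
    for module in modules:
        for child in module(name):
            # A child is a bdev like any other, so it is examined too.
            register(child, modules, order)


def lvol_like(name):
    # Pretend Malloc0 carries lvolstore metadata describing one lvol.
    return ["Lv0"] if name == "Malloc0" else []


def passthru_like(name):
    # Pretend configuration says to wrap Lv0.
    return ["PT0"] if name == "Lv0" else []
```

Registering only `Malloc0` yields `Malloc0`, `Lv0`, `PT0` in that order, even though the caller never named the two upper layers.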

Shutdown Ordering

The bdev subsystem tries to shut down top-down so children go away before bases. Source anchors:

  • lib/bdev/bdev.c:spdk_bdev_module_fini_done
  • lib/bdev/bdev.c:spdk_bdev_module_fini_start_done

The shutdown path skips claimed bdevs at first because a claimed bdev is likely a base for a virtual child. If only claimed bdevs remain, that suggests a module failed to unclaim correctly or the graph has a loop.

Beginner misconception to kill: unregistering a base bdev while children still exist is not a normal successful teardown. Virtual modules must handle remove events, unregister children, release claims, and complete async destruction.
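
The top-down rule can be sketched as repeated removal of bdevs that nothing is built on. This is a model of the ordering idea only, not the shutdown code in lib/bdev/bdev.c; `teardown_order` and the dictionary representation of the graph are assumptions made for the example.

```python
def teardown_order(children_of):
    """Return (order, stuck) for a map of bdev name -> set of child bdev names.

    `order` lists bdevs in a safe removal order. `stuck` lists bdevs that
    can never be removed because only claimed bdevs remain.
    """
    remaining = {b: set(children) for b, children in children_of.items()}
    order = []
    while remaining:
        # A bdev is removable once nothing built on top of it remains.
        free = sorted(b for b, children in remaining.items() if not children)
        if not free:
            return order, sorted(remaining)  # a loop or a leaked claim
        for b in free:
            order.append(b)
            del remaining[b]
        for children in remaining.values():
            children.difference_update(free)
    return order, []


graph = {
    "Malloc0": {"Raid0"},
    "Malloc1": {"Raid0"},
    "Raid0": {"Lv0"},
    "Lv0": {"PT0"},
    "PT0": set(),
}
```

For `graph` the wrapper goes first and the two Malloc bases go last. A cycle such as `{"A": {"B"}, "B": {"A"}}` produces an empty order and both names in `stuck`, which corresponds to the "only claimed bdevs remain" symptom described above.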

lvol Examine Case Study

Key source anchors:

  • module/bdev/lvol/vbdev_lvol.c:g_lvol_if
  • module/bdev/lvol/vbdev_lvol.c:SPDK_BDEV_MODULE_REGISTER(lvol, &g_lvol_if)
  • module/bdev/lvol/vbdev_lvol.c:vbdev_lvs_examine_config
  • module/bdev/lvol/vbdev_lvol.c:vbdev_lvs_examine_disk
  • module/bdev/lvol/vbdev_lvol.c:_vbdev_lvs_examine
  • module/bdev/lvol/vbdev_lvol.c:_vbdev_lvs_examine_cb
  • module/bdev/lvol/vbdev_lvol.c:vbdev_lvs_examine_done

vbdev_lvs_examine_config() formats the bdev UUID and notifies lvolstores that a missing external snapshot may have appeared. It calls spdk_bdev_module_examine_done() before returning.

vbdev_lvs_examine_disk() rejects bdevs that report a nonzero metadata size. For any other bdev it allocates a request, creates a blobstore device wrapper from the bdev, and calls spdk_lvs_load_ext(). The completion path ultimately calls spdk_bdev_module_examine_done().

If lvolstore load succeeds, _vbdev_lvs_examine_cb() claims the base with spdk_bs_bdev_claim(), records the lvolstore/base pair, and opens every lvol so _create_lvol_disk() can register lvol bdevs.

RAID Examine Case Study

Key anchors:

  • module/bdev/raid/bdev_raid.c:g_raid_if
  • module/bdev/raid/bdev_raid.c:SPDK_BDEV_MODULE_REGISTER(raid, &g_raid_if)
  • module/bdev/raid/bdev_raid.c:raid_bdev_examine
  • module/bdev/raid/bdev_raid.c:raid_bdev_examine_load_sb
  • module/bdev/raid/bdev_raid.c:raid_bdev_examine_cont
  • module/bdev/raid/bdev_raid.c:raid_bdev_examine_sb
  • module/bdev/raid/bdev_raid.c:raid_bdev_examine_no_sb
  • module/bdev/raid/bdev_raid.c:raid_bdev_examine_done
  • module/bdev/raid/bdev_raid_sb.c:raid_bdev_load_base_bdev_superblock

RAID examine tries to read a superblock if superblocks are enabled. If it finds RAID metadata, it may create or update a RAID bdev and configure the base slot. If no superblock is found, it may still use configuration-driven RAID definitions.

Operational Debugging

When a child bdev is missing:

  1. Confirm the base bdev exists with bdev_get_bdevs.
  2. Check whether bdev_auto_examine is enabled.
  3. If auto-examine is disabled, confirm bdev_examine was called for the base bdev.
  4. Wait for examine with bdev_wait_for_examine.
  5. Check whether the base bdev is claimed by an unexpected module.
  6. Check module logs for examine_config or examine_disk errors.
  7. Check on-disk metadata compatibility: lvolstore super blob, blobstore superblock, RAID superblock.
  8. Check whether a child was created but immediately unregistered due to open/claim/registration failure.

Source anchors for state:

  • lib/bdev/bdev_rpc.c:rpc_dump_bdev_info writes claimed and claim_type fields.
  • lib/bdev/bdev.c:bdev_examine_allowlist_config_json records manual examine allowlist in config JSON.
  • lib/bdev/bdev.c:bdev_wait_for_examine_cb implements wait-for-examine polling.
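
Step 5 of the checklist can be scripted against bdev_get_bdevs output. The sketch below relies only on the name, claimed, and claim_type fields mentioned above; the `SAMPLE` document and its claim_type value are made up for illustration and are not captured from a real target, and `claim_report` is a hypothetical helper.

```python
import json

# Illustrative input only; real output contains many more fields.
SAMPLE = """
[
  {"name": "Malloc0", "claimed": true, "claim_type": "exclusive_write"},
  {"name": "Lv0", "claimed": false}
]
"""


def claim_report(bdevs_json):
    """Map each bdev name to its claim type, or None when unclaimed."""
    report = {}
    for bdev in json.loads(bdevs_json):
        if bdev.get("claimed"):
            # Tolerate output that reports a claim without naming its type.
            report[bdev["name"]] = bdev.get("claim_type", "unknown")
        else:
            report[bdev["name"]] = None
    return report
```

A base that is expected to host a child but reports no claim suggests examine has not run or did not recognize the metadata; a base claimed with an unexpected type suggests another module got there first.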

Labs

Lab 1: Manual Examine Mental Trace

Assume bdev_auto_examine=false.

Trace:

  1. Create Malloc0.
  2. Do not call bdev_examine.
  3. Create an lvolstore on Malloc0.
  4. Restart with only Malloc0 recreated.
  5. Call bdev_get_bdevs.
  6. Call bdev_examine Malloc0.
  7. Call bdev_wait_for_examine.

Expected reasoning:

  • Before manual examine, lvol bdevs are not auto-loaded.
  • spdk_bdev_examine() must run on the app thread.
  • It inserts the name into the allowlist and examines immediately if the bdev exists.
  • lvol examine may finish asynchronously because it reads blobstore/lvol metadata.

Lab 2: Read The Claim Graph

Use bdev_get_bdevs JSON and identify:

  • Which bdevs are physical bases.
  • Which bdevs are virtual children.
  • Which bases are claimed.
  • Which module owns each claim.
  • Whether any virtual child could itself be used as a base for another module.

Lab 3: Source Trace A Stack

Pick this stack:

Malloc0 -> RAID0 -> lvolstore -> lvol bdev -> passthru bdev

For each arrow, write:

  • Which module creates the child.
  • Which source function registers the child bdev.
  • Which claim protects the base.
  • Which submit_request function handles IO at that layer.

Self-Check

  1. Why does examine_config have to finish synchronously?
  2. Why can examine_disk finish asynchronously?
  3. What happens if a module forgets to call spdk_bdev_module_examine_done()?
  4. Why does claim state change which modules get examine_disk?
  5. What is the difference between opening a bdev and claiming it?
  6. Why can manual examine fail when auto-examine is enabled?
  7. Where does lvol claim the base bdev after loading an lvolstore?
  8. Why is shutdown ordering top-down for virtual bdev stacks?

References

  • Local bdev API: include/spdk/bdev.h
  • Local module API: include/spdk/bdev_module.h
  • Local examine implementation: lib/bdev/bdev.c
  • Local examine RPC: lib/bdev/bdev_rpc.c
  • Local lvol examine: module/bdev/lvol/vbdev_lvol.c
  • Local RAID examine: module/bdev/raid/bdev_raid.c
  • Local virtual bdev examples: module/bdev/passthru/vbdev_passthru.c, module/bdev/split/vbdev_split.c, module/bdev/delay/vbdev_delay.c