SPDK From First Principles

SPDK deep learning path

Chapter: Snapshots, Clones, Resize, Delete, And Stacking Edge Cases

Source: drafts/lvol-raid/05-snapshots-clones-resize-delete-edge-cases.md

Purpose

This chapter is the "what breaks in production" companion to the blobstore, lvol, bdev examine, and RAID chapters. It focuses on operations that look simple from their RPC names but are multi-step asynchronous state machines in the source:

  • Snapshot.
  • Clone.
  • External snapshot clone.
  • Resize.
  • Delete.
  • Inflate/decouple parent.
  • Base bdev remove/hotplug.
  • RAID base replacement/rebuild.
  • Stacking lvol over RAID or RAID over lvol.

The goal is to give a beginner a reliable way to classify failures and then jump to the right source anchors.

A Shared Mental Model: Every Operation Has Owners

For each operation, ask five questions:

  1. Object: what object is being changed?
  2. Owner thread: which SPDK thread owns metadata mutation?
  3. Base resources: which bdevs or bs_devs are open and claimed?
  4. Visibility: when does the child bdev appear or disappear?
  5. Cleanup: if a middle step fails, who rolls back metadata, releases claims, closes descriptors, and completes callbacks?

This keeps async code readable. Avoid starting from callback names. Start from the object and the ownership boundary.
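
One way to make the five questions concrete is a small notes record that reports which questions are still open. This is only a note-taking sketch, not SPDK code: the OwnershipNote type and the unanswered helper are hypothetical names, and the example answers paraphrase the snapshot section of this chapter.

```python
from dataclasses import dataclass, fields


@dataclass
class OwnershipNote:
    """Hypothetical notes record: one field per question above."""
    obj: str = ""             # 1. object being changed
    owner_thread: str = ""    # 2. thread that owns metadata mutation
    base_resources: str = ""  # 3. bdevs or bs_devs that are open and claimed
    visibility: str = ""      # 4. when the child bdev appears or disappears
    cleanup: str = ""         # 5. who rolls back, releases, closes, completes


def unanswered(note):
    """Return the names of the questions that still have no answer."""
    return [f.name for f in fields(note) if not getattr(note, f.name).strip()]


# Partially filled note for snapshot creation; the blanks are what to look up next.
snapshot_note = OwnershipNote(
    obj="original blob and new snapshot blob",
    visibility="snapshot bdev appears only if bdev registration succeeds",
    cleanup="restore cluster maps, xattrs, and flags",
)

print(unanswered(snapshot_note))  # ['owner_thread', 'base_resources']
```

An empty result means the operation is understood well enough to start reading callbacks.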

Snapshot Edge Cases

Primary anchors:

  • lib/blob/blobstore.c:spdk_bs_create_snapshot
  • lib/blob/blobstore.c:bs_snapshot_swap_cluster_maps
  • lib/blob/blobstore.c:bs_snapshot_newblob_sync_cpl
  • lib/blob/blobstore.c:bs_snapshot_origblob_sync_cpl
  • lib/blob/blobstore.c:blob_freeze_io
  • lib/blob/blobstore.c:blob_unfreeze_io
  • lib/lvol/lvol.c:spdk_lvol_create_snapshot
  • module/bdev/lvol/vbdev_lvol.c:vbdev_lvol_create_snapshot
  • module/bdev/lvol/vbdev_lvol_rpc.c:rpc_bdev_lvol_snapshot

What can go wrong:

  • The source lvol is not open, so origlvol->blob is not usable.
  • Snapshot name conflicts with an existing or pending lvol name.
  • Allocating the new snapshot blob fails.
  • Metadata sync of the new snapshot fails after cluster maps were swapped.
  • Metadata sync of the original blob fails after parent xattrs were changed.
  • IO freeze/unfreeze fails.
  • A bdev registration failure occurs after the blob/lvol snapshot exists.

Misconception: snapshot creation is a single atomic C function. In reality it is a callback chain with multiple points where cleanup must restore cluster maps, xattrs, and flags.
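
The "callback chain with rollback" shape can be illustrated with a toy model. This is not how blobstore is implemented: the run_chain helper is hypothetical, the step names paraphrase the failure list above, and both the step order and the strict reverse-order undo are assumptions made for illustration.

```python
def run_chain(steps, fail_at=None):
    """Run steps in order; on failure, undo completed steps in reverse order."""
    log, done = [], []
    for name in steps:
        if name == fail_at:
            log.append("fail: " + name)
            log.extend("undo: " + prev for prev in reversed(done))
            return False, log
        log.append("do: " + name)
        done.append(name)
    return True, log


# Step names paraphrase the failure list above; the order is an assumption.
SNAPSHOT_STEPS = [
    "allocate new snapshot blob",
    "freeze IO on original blob",
    "swap cluster maps",
    "sync new snapshot metadata",
    "sync original blob metadata",
    "unfreeze IO",
]

ok, log = run_chain(SNAPSHOT_STEPS, fail_at="sync new snapshot metadata")
for line in log:
    print(line)
```

The point of the exercise is the question it forces: for each step in the real source, which callback is responsible for the matching undo?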

Debugging path:

RPC failed: bdev_lvol_snapshot
  -> check module/bdev/lvol/vbdev_lvol_rpc.c:rpc_bdev_lvol_snapshot
  -> check lvol lookup by bdev name
  -> check lib/lvol/lvol.c:spdk_lvol_create_snapshot
  -> check blobstore snapshot callbacks
  -> check module/bdev/lvol/vbdev_lvol.c:_create_lvol_disk

Self-check:

  • What state changes happen before the new snapshot bdev is visible?
  • If _create_lvol_disk() fails, does the underlying snapshot blob still exist?
  • Why must the original blob become thin-provisioned after snapshot?

Clone Edge Cases

Primary anchors:

  • lib/blob/blobstore.c:spdk_bs_create_clone
  • lib/lvol/lvol.c:spdk_lvol_create_clone
  • module/bdev/lvol/vbdev_lvol.c:vbdev_lvol_create_clone
  • module/bdev/lvol/vbdev_lvol_rpc.c:rpc_bdev_lvol_clone

What can go wrong:

  • The source is not a snapshot or is not read-only.
  • Clone name conflicts.
  • Clone creation succeeds but bdev registration fails.
  • The clone reads old data from parent and new data from itself; test expectations may assume a full copy and appear "wrong."
  • Parent snapshot deletion later rewrites clone metadata.

Misconception: a clone is a deep copy. In blobstore/lvol, a clone is a copy-on-write (COW) dependency on its parent until it is inflated or decoupled.

Prose diagram:

snapshot S
  cluster map: [P0, P1, P2]

clone C
  cluster map: [0, 0, 0]      (0 = unallocated; reads fall through to the parent)
  parent: S

write to C cluster 1
  cluster map: [0, P3, 0]

read C cluster 0 -> S:P0
read C cluster 1 -> C:P3
read C cluster 2 -> S:P2
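
The read-resolution rule in the diagram can be modelled in a few lines. This is a toy model under stated assumptions, not blobstore code: ToyBlob is a hypothetical class, clusters hold strings instead of data, and an unallocated cluster is represented by None.

```python
class ToyBlob:
    """Toy blob: a cluster map plus an optional parent to fall back on."""

    def __init__(self, name, clusters, parent=None):
        self.name = name
        self.clusters = list(clusters)  # None means unallocated
        self.parent = parent

    def write(self, index, data):
        # A write always lands in this blob, never in the parent.
        self.clusters[index] = data

    def read(self, index):
        # Walk up the parent chain until some blob has the cluster allocated.
        blob = self
        while blob is not None:
            if index < len(blob.clusters) and blob.clusters[index] is not None:
                return blob.name, blob.clusters[index]
            blob = blob.parent
        return None, "zeroes"


snap = ToyBlob("snap", ["old-0", "old-1", "old-2"])
clone = ToyBlob("clone", [None, None, None], parent=snap)
clone.write(1, "new-1")

print([clone.read(i) for i in range(3)])
```

Each read returns both the data and the blob that served it, which is the per-range explanation the lab below asks for.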

Lab:

  1. Create lvol.
  2. Write pattern A to whole lvol.
  3. Snapshot.
  4. Clone snapshot.
  5. Write pattern B to one range of clone.
  6. Read from source, snapshot, clone.
  7. Explain each byte range by parent/child cluster mapping.

External Snapshot Clone Edge Cases

Primary anchors:

  • include/spdk/blob.h:spdk_bs_esnap_dev_create
  • lib/blob/blobstore.c:spdk_bs_blob_set_external_parent
  • lib/blob/blobstore.c:spdk_blob_set_esnap_bs_dev
  • lib/blob/blobstore.c:spdk_blob_is_degraded
  • lib/lvol/lvol.c:spdk_lvol_create_esnap_clone
  • lib/lvol/lvol.c:lvs_esnap_bs_dev_create
  • lib/lvol/lvol.c:spdk_lvs_esnap_missing_add
  • lib/lvol/lvol.c:spdk_lvs_esnap_missing_remove
  • lib/lvol/lvol.c:lvs_esnap_degraded_hotplug
  • module/bdev/lvol/vbdev_lvol.c:vbdev_lvol_create_bdev_clone
  • module/bdev/lvol/vbdev_lvol.c:vbdev_lvol_esnap_dev_create
  • module/bdev/lvol/vbdev_lvol.c:vbdev_lvs_examine_config

What can go wrong:

  • External bdev name is wrong.
  • External bdev UUID cannot be parsed.
  • External snapshot ID length is invalid.
  • External bdev exists but cannot be opened read-only.
  • External bdev claim fails.
  • External parent disappears after clone creation.
  • External parent is missing during lvolstore load; lvol is tracked as degraded.
  • Memory domain reporting omits the missing external snapshot and consumers miss zero-copy capability.

Important source note: module/bdev/lvol/vbdev_lvol.c:vbdev_lvol_esnap_dev_create validates the esnap ID, opens a bdev by UUID string, creates a bs_dev, and claims it. If the bdev is missing, it calls lib/lvol/lvol.c:spdk_lvs_esnap_missing_add.

Hotplug flow:

external parent bdev appears
  -> bdev core examine
  -> vbdev_lvol examine_config
  -> spdk_lvs_notify_hotplug(uuid)
  -> lvs_esnap_degraded_hotplug
  -> blobstore sets esnap bs_dev
  -> lvol bdevs for no-longer-degraded clones may be created
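
The bookkeeping behind this flow can be sketched as a table of missing parents. This is a toy model, not the lvol library: ToyLvolStore and its methods are hypothetical names, and the only behavior modelled is "remember who is waiting, release them when the parent appears."

```python
class ToyLvolStore:
    """Toy model of tracking clones whose external parent is missing."""

    def __init__(self):
        self.missing = {}     # esnap id -> lvol names waiting for it
        self.registered = []  # lvol names that have a visible bdev

    def load_esnap_clone(self, lvol_name, esnap_id, present_bdevs):
        # Parent present at load time: the lvol bdev can be created right away.
        if esnap_id in present_bdevs:
            self.registered.append(lvol_name)
            return "registered"
        # Parent missing: keep the lvol, but track it as degraded.
        self.missing.setdefault(esnap_id, []).append(lvol_name)
        return "degraded"

    def notify_hotplug(self, esnap_id):
        # A bdev appeared: release every lvol that was waiting for it.
        waiting = self.missing.pop(esnap_id, [])
        self.registered.extend(waiting)
        return len(waiting) > 0


lvs = ToyLvolStore()
print(lvs.load_esnap_clone("clone1", "uuid-a", present_bdevs=set()))
print(lvs.notify_hotplug("uuid-b"))  # unrelated bdev, nobody was waiting
print(lvs.notify_hotplug("uuid-a"))  # the missing parent arrives
print(lvs.registered)
```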

Lab:

  • Use test/lvol/external_snapshot.sh.
  • Simulate missing parent by loading lvolstore before recreating external snapshot bdev.
  • Predict whether lvol bdevs appear immediately or after parent hotplug.
  • Trace the call to spdk_lvs_notify_hotplug.

Resize Edge Cases

Blob/lvol anchors:

  • lib/blob/blobstore.c:spdk_blob_resize
  • lib/blob/blobstore.c:bs_resize_freeze_cpl
  • lib/blob/blobstore.c:bs_resize_unfreeze_cpl
  • lib/lvol/lvol.c:spdk_lvol_resize
  • lib/lvol/lvol.c:lvol_blob_resize_cb
  • module/bdev/lvol/vbdev_lvol.c:vbdev_lvol_resize
  • module/bdev/lvol/vbdev_lvol.c:_vbdev_lvol_resize_cb
  • module/bdev/lvol/vbdev_lvol_rpc.c:rpc_bdev_lvol_resize

RAID/base anchors:

  • module/bdev/raid/bdev_raid.c:raid_bdev_resize_base_bdev
  • module/bdev/raid/bdev_raid.c:raid_bdev_resize_write_sb_cb
  • lib/blob/blobstore.c:spdk_bs_grow_live
  • lib/lvol/lvol.c:spdk_lvs_grow_live

What can go wrong:

  • Resize target is not cluster-aligned at the lvol layer.
  • Snapshot/read-only blob cannot be resized.
  • Another locked operation is active, causing -EBUSY.
  • bdev visible size is not updated after blob resize failure or partial path failure.
  • Base RAID grows but lvolstore does not grow.
  • Base RAID shrinks and lvolstore/blobstore metadata expects old capacity.
  • Clone grows beyond parent; reads beyond the parent's size must not be forwarded to the parent.

Misconception: "I resized the base, so every child is larger." Each layer has its own grow/resize operation:

physical base grew
  -> RAID may resize logical bdev
  -> blobstore/lvolstore may grow to use new RAID size
  -> individual lvol may resize
  -> exported target/guest must observe size change
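
The "one layer at a time" rule can be sketched with a toy size table. This is a simplification, not SPDK behavior: the grow and lagging helpers are hypothetical, sizes are abstract units, and the model assumes each layer maps 1:1 onto the one below it (no RAID capacity math, no metadata overhead, no thin-provisioned overcommit).

```python
LAYERS = ["base", "raid", "lvolstore", "lvol"]


def grow(sizes, layer, new_size):
    """Grow exactly one layer; refuse to exceed the layer directly below it."""
    idx = LAYERS.index(layer)
    if idx > 0 and new_size > sizes[LAYERS[idx - 1]]:
        raise ValueError(f"{layer} cannot exceed {LAYERS[idx - 1]}")
    updated = dict(sizes)
    updated[layer] = new_size
    return updated


def lagging(sizes):
    """List layers that are smaller than the layer directly below them."""
    return [upper for lower, upper in zip(LAYERS, LAYERS[1:])
            if sizes[upper] < sizes[lower]]


sizes = {layer: 100 for layer in LAYERS}
sizes = grow(sizes, "base", 200)
print(lagging(sizes))  # only the next layer up has room to grow

for layer in ["raid", "lvolstore", "lvol"]:
    sizes = grow(sizes, layer, 200)
print(lagging(sizes))
```

Growing the base changes nothing above it; each upper layer needs its own explicit step, in order.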

Lab:

  1. Create RAID bdev.
  2. Create lvolstore on RAID bdev.
  3. Create lvol.
  4. Grow base bdevs.
  5. Observe RAID size.
  6. Grow lvolstore.
  7. Resize lvol.
  8. Explain which command changed which layer.

Delete Edge Cases

Blob anchors:

  • lib/blob/blobstore.c:spdk_bs_delete_blob
  • lib/blob/blobstore.c:bs_is_blob_deletable
  • lib/blob/blobstore.c:update_clone_on_snapshot_deletion
  • lib/blob/blobstore.c:delete_snapshot_freeze_io_cb
  • lib/blob/blobstore.c:delete_snapshot_update_extent_pages
  • lib/blob/blobstore.c:delete_snapshot_sync_clone_cpl
  • lib/blob/blobstore.c:delete_snapshot_sync_snapshot_cpl

lvol/vbdev anchors:

  • lib/lvol/lvol.c:spdk_lvol_deletable
  • lib/lvol/lvol.c:spdk_lvol_destroy
  • lib/lvol/lvol.c:lvol_delete_blob_cb
  • module/bdev/lvol/vbdev_lvol.c:vbdev_lvol_destroy
  • module/bdev/lvol/vbdev_lvol.c:_vbdev_lvol_destroy
  • module/bdev/lvol/vbdev_lvol.c:_vbdev_lvol_destroy_cb
  • module/bdev/lvol/vbdev_lvol_rpc.c:rpc_bdev_lvol_delete

What can go wrong:

  • lvol is still open.
  • lvol bdev has active users and cannot unregister immediately.
  • Snapshot has more than one clone.
  • Snapshot has one clone but clone is open or locked.
  • Metadata allocation for delete path fails.
  • Clone update fails after snapshot is marked pending removal.
  • External snapshot reference must be moved from deleted snapshot to clone.
  • Degraded lvol delete has to close metadata without unregistering a normal bdev.

Important distinction:

  • spdk_lvol_deletable() asks whether an lvol can be deleted from a snapshot/clone perspective.
  • spdk_lvol_destroy() performs lvol-level checks and calls blobstore delete.
  • vbdev_lvol_destroy() handles bdev visibility and unregister before lvol destroy.

Prose diagram for delete decision:

delete lvol requested
  |
  +-- is lvol open? yes -> -EBUSY
  |
  +-- is it a snapshot with >1 clone? yes -> fail
  |
  +-- is it a snapshot with exactly 1 clone?
  |      yes -> blobstore may update clone and remove snapshot
  |
  +-- unregister bdev if visible
  |
  +-- delete blob metadata
  |
  +-- remove lvol from lvolstore list
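
The decision diagram reads naturally as a function. This is a sketch of the diagram only, not of the source: delete_plan and its parameters are hypothetical, and the real checks are spread across the vbdev, lvol, and blobstore layers listed in the anchors.

```python
def delete_plan(is_open, is_snapshot=False, clone_count=0, bdev_visible=True):
    """Return (allowed, steps) following the delete decision diagram."""
    if is_open:
        return False, ["-EBUSY"]
    if is_snapshot and clone_count > 1:
        return False, ["fail: snapshot has more than one clone"]

    steps = []
    if is_snapshot and clone_count == 1:
        # The single clone inherits what it needs before the snapshot goes away.
        steps.append("update clone, then remove snapshot")
    if bdev_visible:
        steps.append("unregister bdev")
    steps.append("delete blob metadata")
    steps.append("remove lvol from lvolstore list")
    return True, steps


print(delete_plan(is_open=True))
print(delete_plan(is_open=False, is_snapshot=True, clone_count=2))
print(delete_plan(is_open=False, is_snapshot=True, clone_count=1))
```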

Lab:

  1. Create source lvol.
  2. Snapshot S.
  3. Create clones C1 and C2.
  4. Try deleting S. Expect failure.
  5. Delete C2.
  6. Try deleting S again. Follow source path that updates C1.
  7. Verify C1 still reads data previously inherited from S.

Inflate And Decouple Parent

Anchors:

  • include/spdk/lvol.h:spdk_lvol_inflate
  • include/spdk/lvol.h:spdk_lvol_decouple_parent
  • lib/lvol/lvol.c:spdk_lvol_inflate
  • lib/lvol/lvol.c:spdk_lvol_decouple_parent
  • lib/blob/blobstore.c:spdk_bs_inflate_blob
  • lib/blob/blobstore.c:spdk_bs_blob_decouple_parent
  • module/bdev/lvol/vbdev_lvol_rpc.c:rpc_bdev_lvol_inflate
  • module/bdev/lvol/vbdev_lvol_rpc.c:rpc_bdev_lvol_decouple_parent

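Inflate allocates every unallocated cluster of the blob, copying data from the parent chain or zero-filling where no ancestor has data, so the blob no longer depends on sparse/backing reads. Decouple parent removes the dependency on the immediate parent while preserving data semantics: clusters allocated in the parent are copied, and clusters the parent does not have stay thin. Both are IO-heavy compared with snapshot/clone creation, which only writes metadata.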

What can go wrong:

  • Not enough free clusters to inflate.
  • Parent is degraded or missing.
  • IO error while copying parent data.
  • Operation is attempted on the wrong blob state.
  • Another locked operation is in progress.

Misconception: decouple just clears a parent pointer. In fact it must preserve reads for every logical block, which means copying data before the dependency is removed.
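
The difference between the two operations shows up in how many clusters each has to copy. The following is a toy model, not blobstore code: blobs are plain dicts, the function names are hypothetical, every blob in a chain is assumed to have the same number of clusters, and the copy rules follow the author's reading of the RPC documentation (inflate fills every hole, decouple copies only what the immediate parent has allocated).

```python
def lookup(blob, i):
    """Resolve cluster i by walking up the parent chain."""
    while blob is not None:
        if blob["clusters"][i] is not None:
            return blob["clusters"][i]
        blob = blob["parent"]
    return "zeroes"


def inflate(blob, free_clusters):
    """Allocate every hole, then drop the parent entirely."""
    needed = [i for i, c in enumerate(blob["clusters"]) if c is None]
    if len(needed) > free_clusters:
        raise RuntimeError("not enough free clusters to inflate")
    for i in needed:
        blob["clusters"][i] = lookup(blob["parent"], i)
    blob["parent"] = None
    return len(needed)


def decouple_parent(blob, free_clusters):
    """Copy only clusters the immediate parent has, then skip over it."""
    parent = blob["parent"]
    needed = [i for i, c in enumerate(blob["clusters"])
              if c is None and parent["clusters"][i] is not None]
    if len(needed) > free_clusters:
        raise RuntimeError("not enough free clusters to decouple")
    for i in needed:
        blob["clusters"][i] = parent["clusters"][i]
    blob["parent"] = parent["parent"]
    return len(needed)


snap = {"clusters": ["a", None, "c"], "parent": None}
clone = {"clusters": [None, "x", None], "parent": snap}
before = [lookup(clone, i) for i in range(3)]
copied = inflate(clone, free_clusters=2)
after = [lookup(clone, i) for i in range(3)]
print(copied, before == after)
```

In both cases the check that matters is the last one: reads before and after the operation must be identical.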

RAID Remove/Rebuild Edge Cases

Anchors:

  • module/bdev/raid/bdev_raid.c:raid_bdev_remove_base_bdev
  • module/bdev/raid/bdev_raid.c:_raid_bdev_remove_base_bdev
  • module/bdev/raid/bdev_raid.c:raid_bdev_process_base_bdev_remove
  • module/bdev/raid/bdev_raid.c:raid_bdev_start_rebuild
  • module/bdev/raid/bdev_raid.c:raid_bdev_process_start
  • module/bdev/raid/bdev_raid.c:raid_bdev_process_finish
  • module/bdev/raid/raid1.c:raid1_submit_process_request
  • module/bdev/raid/raid5f.c:raid5f_submit_process_request

What can go wrong:

  • Removing a base drops operational count below minimum.
  • Removing the rebuild target stops rebuild.
  • Removing a non-target may or may not require stopping the rebuild process, depending on how many operational bases remain.
  • Superblock write after remove or rebuild finish fails.
  • Foreground IO crosses rebuild processed/unprocessed boundary.
  • Replacement base has incompatible block size or size.
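
The "crosses the boundary" case is easiest to see with block arithmetic. This is a classification sketch only: classify, split_at_boundary, and process_offset are hypothetical names, the boundary is modelled as a single block offset, and splitting is shown as one possible way to handle a crossing request rather than as a statement of what the RAID module does.

```python
def classify(io_offset, io_blocks, process_offset):
    """Place an IO range relative to the rebuild's processed boundary."""
    io_end = io_offset + io_blocks
    if io_end <= process_offset:
        return "processed"
    if io_offset >= process_offset:
        return "unprocessed"
    return "crosses boundary"


def split_at_boundary(io_offset, io_blocks, process_offset):
    """Split a crossing IO into a processed part and an unprocessed part."""
    if classify(io_offset, io_blocks, process_offset) != "crosses boundary":
        return [(io_offset, io_blocks)]
    first = process_offset - io_offset
    return [(io_offset, first), (process_offset, io_blocks - first)]


# Rebuild has processed blocks [0, 16).
print(classify(0, 8, 16))
print(classify(16, 8, 16))
print(classify(12, 8, 16))
print(split_at_boundary(12, 8, 16))
```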

Debugging prompt:

  • If RAID is online but upper lvol IO fails, check whether the RAID level can actually serve the requested range degraded.
  • If RAID is configuring, check whether all expected bases have been examined and whether superblocks agree.
  • If rebuild never progresses, inspect process window, QoS options, and quiesce callbacks.

Stack Order Edge Cases

RAID Under lvol

base disks -> RAID -> lvolstore -> lvol bdevs

Pros:

  • One lvolstore sees one base.
  • RAID handles disk failure below lvol.
  • Snapshots/clones are independent of individual physical disks.

Risks:

  • RAID resize must be followed by lvolstore/blobstore grow.
  • RAID outage makes all lvols unavailable.
  • RAID rebuild can affect latency for all lvols.

Source paths:

  • RAID IO: module/bdev/raid/bdev_raid.c:raid_bdev_submit_request
  • lvol IO: module/bdev/lvol/vbdev_lvol.c:vbdev_lvol_submit_request
  • blob IO: lib/blob/blobstore.c:blob_request_submit_op

lvol Under RAID

base disk -> lvolstore -> lvol bdev A
base disk -> lvolstore -> lvol bdev B
lvol bdev A + lvol bdev B -> RAID

Pros:

  • Can compose from logical volumes.
  • Useful for testing or special deployments.

Risks:

  • RAID base removal is now an lvol bdev lifecycle event.
  • Underlying lvol snapshot/resize/delete can disrupt RAID base assumptions.
  • Claims and delete ordering become easier to get wrong.

Rule of thumb: choose the stack where failure domains match the product model. If RAID is for physical durability, put it below lvol. If RAID is a logical experiment, document every lifecycle coupling.

Debugging Checklist By Symptom

lvol bdev missing after restart

Check:

  • lib/bdev/bdev.c:bdev_examine
  • module/bdev/lvol/vbdev_lvol.c:vbdev_lvs_examine_disk
  • lib/lvol/lvol.c:spdk_lvs_load_ext
  • lib/lvol/lvol.c:load_next_lvol
  • module/bdev/lvol/vbdev_lvol.c:_create_lvol_disk

Likely causes:

  • base bdev not present
  • manual examine not called
  • lvolstore load failed
  • base bdev already claimed
  • external snapshot missing/degraded
  • bdev registration failed

delete returns busy

Check:

  • lib/lvol/lvol.c:spdk_lvol_destroy
  • module/bdev/lvol/vbdev_lvol.c:_vbdev_lvol_destroy
  • lib/blob/blobstore.c:bs_is_blob_deletable

Likely causes:

  • lvol open reference exists
  • bdev has users
  • snapshot has multiple clones
  • clone/snapshot locked operation in progress

resize succeeds but guest sees old size

Check:

  • lib/lvol/lvol.c:spdk_lvol_resize
  • module/bdev/lvol/vbdev_lvol.c:_vbdev_lvol_resize_cb
  • exporting transport resize notification path
  • upper consumer cache/state

Likely causes:

  • blob resized but bdev notification not observed
  • target/guest did not rescan
  • wrong layer resized

RAID bdev not online

Check:

  • module/bdev/raid/bdev_raid_rpc.c:rpc_bdev_raid_get_bdevs
  • module/bdev/raid/bdev_raid.c:raid_bdev_configure
  • module/bdev/raid/bdev_raid.c:raid_bdev_examine
  • module/bdev/raid/bdev_raid.c:raid_bdev_examine_sb

Likely causes:

  • missing base
  • incompatible base size/block size
  • stale or conflicting superblock
  • too few operational bases
  • still configuring after async examine

thin write fails after create succeeded

Check:

  • lib/blob/blobstore.c:bs_claim_cluster
  • lib/blob/blobstore.c:blob_request_submit_op_single
  • lib/blob/blobstore.c:blob_insert_cluster_on_md_thread

Likely cause:

  • blobstore free clusters exhausted
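
Why create can succeed while a later write fails is easiest to see in a toy allocator. This is a model of thin provisioning in general, not of blobstore internals: ToyBlobstore and its methods are hypothetical, and errors are returned as negative errno values to mirror the convention used in this chapter.

```python
import errno


class ToyBlobstore:
    """Toy thin-provisioning model: clusters are claimed on first write."""

    def __init__(self, total_clusters):
        self.free_clusters = total_clusters
        self.lvols = {}  # name -> list of "is this cluster allocated" flags

    def create_thin_lvol(self, name, num_clusters):
        # Creation only records the logical size; no data clusters are claimed.
        self.lvols[name] = [False] * num_clusters
        return 0

    def write(self, name, cluster_index):
        allocated = self.lvols[name]
        if not allocated[cluster_index]:
            if self.free_clusters == 0:
                return -errno.ENOSPC
            self.free_clusters -= 1
            allocated[cluster_index] = True
        return 0


bs = ToyBlobstore(total_clusters=2)
print(bs.create_thin_lvol("a", 4), bs.create_thin_lvol("b", 4))  # overcommitted
print(bs.write("a", 0), bs.write("b", 0))  # both first writes claim a cluster
print(bs.write("a", 1))                    # no free clusters left
print(bs.write("a", 0))                    # rewrite of an allocated cluster
```

Two lvols of four clusters each were created on a store with two clusters, and nothing failed until the third distinct cluster was touched.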

Labs

Lab A: Failure Matrix

Create a table in notes with rows:

  • snapshot
  • clone
  • external clone
  • resize
  • delete clone
  • delete snapshot with one clone
  • delete snapshot with two clones
  • RAID base remove
  • RAID replacement add
  • lvolstore grow

Columns:

  • RPC entry function
  • library function
  • metadata owner
  • base bdev claim involved
  • visible bdev change
  • expected failure if object is open
  • expected failure if ENOSPC
  • recovery path after restart

Populate the table only from local source anchors in these drafts.
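
If a spreadsheet is more convenient than handwritten notes, the empty matrix can be generated as CSV. This is only a convenience sketch: the empty_matrix helper is hypothetical, and the row and column labels are copied from the lists above.

```python
import csv
import io

ROWS = [
    "snapshot", "clone", "external clone", "resize", "delete clone",
    "delete snapshot with one clone", "delete snapshot with two clones",
    "RAID base remove", "RAID replacement add", "lvolstore grow",
]
COLUMNS = [
    "RPC entry function", "library function", "metadata owner",
    "base bdev claim involved", "visible bdev change",
    "expected failure if object is open", "expected failure if ENOSPC",
    "recovery path after restart",
]


def empty_matrix():
    """Return the failure matrix as CSV text with every cell left blank."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["operation"] + COLUMNS)
    for operation in ROWS:
        writer.writerow([operation] + [""] * len(COLUMNS))
    return buf.getvalue()


print(empty_matrix())
```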

Lab B: Callback Chain Trace

Pick bdev_lvol_delete and trace to completion:

module/bdev/lvol/vbdev_lvol_rpc.c:rpc_bdev_lvol_delete
module/bdev/lvol/vbdev_lvol.c:vbdev_lvol_destroy
module/bdev/lvol/vbdev_lvol.c:_vbdev_lvol_destroy
module/bdev/lvol/vbdev_lvol.c:_vbdev_lvol_destroy_cb
lib/lvol/lvol.c:spdk_lvol_destroy
lib/blob/blobstore.c:spdk_bs_delete_blob
lib/blob/blobstore.c:bs_is_blob_deletable

Then answer:

  • Which callbacks complete bdev unregister?
  • Which callbacks complete blob deletion?
  • Which object is freed last?
  • Where is the lvol removed from the lvolstore list?

Lab C: Prose Diagram Drill

Draw in prose a stack with:

Nvme0n1 + Nvme1n1 -> RAID1 -> lvolstore -> lvol clone from external snapshot -> NVMe-oF namespace

For each edge, write:

  • Who owns the base?
  • What claim exists?
  • Which source function maps IO?
  • What happens if that layer's base is removed?

Self-Check

  1. Why can lvol create succeed but a later thin write fail?
  2. Why is deleting a snapshot with one clone possible but deleting one with two clones blocked?
  3. What source path hotplugs a missing external snapshot parent?
  4. Why does resize have to be reasoned about one layer at a time?
  5. How can RAID rebuild affect foreground IO mapping?
  6. Why does bdev_wait_for_examine matter after creating or restoring base bdevs?
  7. What is the difference between degraded RAID and degraded lvol?
  8. Which stack order is usually easier when RAID is meant to protect physical disks?

References

  • Blobstore edge logic: lib/blob/blobstore.c
  • lvol edge logic: lib/lvol/lvol.c
  • lvol bdev edge logic: module/bdev/lvol/vbdev_lvol.c
  • bdev examine and claims: lib/bdev/bdev.c, include/spdk/bdev_module.h
  • RAID lifecycle: module/bdev/raid/bdev_raid.c, module/bdev/raid/bdev_raid_sb.c
  • Labs/tests: test/lvol/snapshot_clone.sh, test/lvol/resize.sh, test/lvol/external_snapshot.sh, test/lvol/hotremove.sh, test/bdev/bdev_raid.sh