Beginner Mental Model
Blobstore is SPDK's small storage engine for "large named chunks of blocks" called blobs. If a normal block device gives you one flat address range, blobstore turns that range into a collection of independently sized objects. Each blob has metadata, a cluster map, attributes, and optional relationships to snapshots or parents. Blobstore is not a filesystem. It does not have directories, permissions, pathnames, page cache, journaling in the filesystem sense, or POSIX file semantics. It is closer to a lightweight allocator and metadata layer for storage systems that already know they want asynchronous block IO.
The simplest picture is:
base bdev
|
| adapted by module/blob/bdev/blob_bdev.c into struct spdk_bs_dev
v
blobstore
|
+-- super block
+-- used-cluster bitmap
+-- used-blobid bitmap
+-- metadata pages
+-- blob A -> cluster map -> base device LBAs
+-- blob B -> cluster map -> base device LBAs
A blob is a logical array of IO units. Blobstore maps those IO units to physical clusters. A cluster is the allocation unit. Thin blobs may have logical clusters that are not physically allocated yet. Reads from unallocated clusters resolve through a backing device: a zeroes device for ordinary thin blobs, a snapshot blob for clones, or an external snapshot device for esnap clones.
The important beginner rule is: blobstore IO is asynchronous, but blob metadata ownership is strict. The API documentation in include/spdk/blob.h says functions that do not take an IO channel may be called only on the thread that called spdk_bs_init() or spdk_bs_load(). Functions beginning with spdk_blob_io take a channel created with spdk_bs_alloc_io_channel() and must be called on the thread that created that channel.
Why This Matters For diskengine/excloud
Cloud volume systems often need fast create, clone, snapshot, resize, and delete without blocking the IO path. Blobstore is the layer that makes those operations cheap enough to compose. lvol uses blobstore to provide logical volumes. vbdev_lvol exposes those lvols as SPDK bdevs, and RAID may sit under or over those bdevs depending on design. If blobstore semantics are misunderstood, higher layers tend to make wrong assumptions:
- Thin create is not free forever; first writes allocate clusters and can fail with -ENOSPC.
- Snapshot create is not just "copy metadata"; it freezes IO, swaps cluster maps, sets parent relationships, and persists several pieces of metadata.
- Delete of a snapshot with clones may rewrite clone metadata and parent relationships.
- Resize is asynchronous and serialized by locked_operation_in_progress.
- An lvol bdev may exist before every external snapshot parent is available; degraded lvol behavior comes from blobstore and lvol cooperation.
Core Objects
Public API objects:
- struct spdk_bs_dev in include/spdk/blob.h: the device interface blobstore uses. It has function pointers for read, write, readv, writev, unmap, write_zeroes, flush, copy, is_zeroes, translate_lba, and is_degraded.
- struct spdk_blob_store in include/spdk/blob.h: opaque publicly, implemented privately in lib/blob/blobstore.h.
- struct spdk_blob in include/spdk/blob.h: opaque publicly, implemented privately in lib/blob/blobstore.h.
- spdk_blob_id: a uint64_t identifier.
Private implementation anchors:
- lib/blob/blobstore.h:struct spdk_blob_store
- lib/blob/blobstore.h:struct spdk_blob
- lib/blob/blobstore.h:struct spdk_bs_channel
- lib/blob/blobstore.h:struct spdk_bs_super_block
- lib/blob/blobstore.h:struct spdk_blob_md_page
- lib/blob/blobstore.h:struct spdk_blob_md_descriptor_extent_rle
- lib/blob/blobstore.h:struct spdk_blob_md_descriptor_extent_table
- lib/blob/blobstore.h:struct spdk_blob_md_descriptor_extent_page
The private struct spdk_blob_store contains the backing struct spdk_bs_dev, cluster sizing, metadata ranges, bit pools/arrays for allocation, trees/lists of open blobs and snapshots, and external snapshot callbacks. The private struct spdk_blob contains active and clean metadata, the blob ID, parent ID, state, cluster maps, extent pages, xattrs, open ref count, pending persists, and the backing bs_dev used for reads not satisfied by its own allocated clusters.
Units: Pages, Clusters, IO Units, LBAs
Blobstore uses several units at once:
- bdev block/LBA: the unit the backing device exposes.
- IO unit: blobstore's IO address unit. It is derived from device block length.
- metadata page: fixed blobstore metadata page unit.
- cluster: allocation unit for blob data.
- blob logical cluster: index into a blob's cluster map.
Source anchors:
- lib/blob/blobstore.h:bs_byte_to_lba
- lib/blob/blobstore.h:bs_dev_byte_to_lba
- lib/blob/blobstore.h:bs_page_to_lba
- lib/blob/blobstore.h:bs_md_page_to_lba
- lib/blob/blobstore.h:bs_cluster_to_io_unit
- lib/blob/blobstore.h:bs_io_unit_to_cluster
- lib/blob/blobstore.h:bs_cluster_to_lba
- lib/blob/blobstore.h:bs_lba_to_cluster
- lib/blob/blobstore.h:bs_blob_io_unit_to_lba
Beginner misconception to kill: a blob's "size" is not necessarily allocated space. spdk_blob_get_num_clusters() returns logical size in clusters. spdk_blob_get_num_allocated_clusters() returns actual allocated clusters. In a thin blob, these may differ. In a snapshot/clone chain, a read may walk from child to parent rather than consuming an allocated cluster in the child.
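The unit relationships and the logical-versus-allocated distinction can be sketched with a toy model. This is Python for illustration only, not SPDK code: the sizes are assumed values, the function name is hypothetical, and the toy cluster map stores a physical cluster index (None for unallocated), which may differ from how the real cluster map encodes entries.

```python
# Toy model of blobstore address arithmetic. All sizes are assumed values.
IO_UNIT_SIZE = 4096            # bytes per blobstore IO unit (assumed)
DEV_BLOCK_SIZE = 512           # bytes per backing-device LBA (assumed)
CLUSTER_SIZE = 1024 * 1024     # bytes per cluster (assumed)

IO_UNITS_PER_CLUSTER = CLUSTER_SIZE // IO_UNIT_SIZE
LBAS_PER_IO_UNIT = IO_UNIT_SIZE // DEV_BLOCK_SIZE
LBAS_PER_CLUSTER = CLUSTER_SIZE // DEV_BLOCK_SIZE


def blob_io_unit_to_lba(cluster_map, io_unit):
    """Translate a blob-relative IO unit to a device LBA, or None if thin."""
    logical_cluster = io_unit // IO_UNITS_PER_CLUSTER
    offset_in_cluster = io_unit % IO_UNITS_PER_CLUSTER
    physical_cluster = cluster_map[logical_cluster]
    if physical_cluster is None:
        return None  # unallocated: a read would go to the backing device
    return physical_cluster * LBAS_PER_CLUSTER + offset_in_cluster * LBAS_PER_IO_UNIT


# A thin blob: logical size is 4 clusters, but only logical cluster 1 is
# allocated (to physical cluster 7).
cluster_map = [None, 7, None, None]
num_clusters = len(cluster_map)                                   # logical size
num_allocated = sum(1 for c in cluster_map if c is not None)      # real usage
```

In this sketch num_clusters plays the role of spdk_blob_get_num_clusters() and num_allocated the role of spdk_blob_get_num_allocated_clusters(); the two differ whenever the map has holes.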
Device Adaptation: bdev To Blobstore
Blobstore does not directly speak struct spdk_bdev. It speaks struct spdk_bs_dev. The adapter lives in:
- include/spdk/blob_bdev.h
- module/blob/bdev/blob_bdev.c
The lvol bdev module uses spdk_bdev_create_bs_dev_ext() to wrap a bdev as a blobstore device. In the lvol examine path, see module/bdev/lvol/vbdev_lvol.c:_vbdev_lvs_examine, which calls spdk_bdev_create_bs_dev_ext(bdev->name, ...) before calling spdk_lvs_load_ext().
The reverse direction also exists: a blob can be exposed as a blobstore device. That is used for snapshot backing:
- lib/blob/blob_bs_dev.c:bs_create_blob_bs_dev
- lib/blob/blob_bs_dev.c:blob_bs_dev_read
- lib/blob/blob_bs_dev.c:blob_bs_is_zeroes
- lib/blob/blob_bs_dev.c:blob_bs_translate_lba
- lib/blob/blob_bs_dev.c:blob_bs_is_degraded
Prose diagram:
clone blob read
|
| if child cluster is allocated:
| read child cluster from base blobstore device
|
| if child cluster is unallocated:
| read from back_bs_dev
|
+-- back_bs_dev may be:
zeroes device -> ordinary thin empty data
blob-backed bs_dev -> internal snapshot parent
external snapshot bs_dev -> bdev outside this lvolstore
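The read resolution in the diagram can be modeled in a few lines. This is a hedged Python sketch, not the SPDK implementation: ToyBlob, ZeroesDev, and read_cluster are hypothetical names, and the 4-byte cluster size is chosen only to keep the example readable.

```python
CLUSTER_BYTES = 4  # tiny cluster size so the example stays readable (assumed)


class ZeroesDev:
    """Backing device for ordinary thin blobs: every cluster reads as zeroes."""

    def read_cluster(self, index):
        return bytes(CLUSTER_BYTES)


class ToyBlob:
    """Hypothetical blob: its own clusters plus a backing device for the gaps."""

    def __init__(self, num_clusters, back_dev):
        self.clusters = [None] * num_clusters    # None means unallocated
        self.back_dev = back_dev

    def read_cluster(self, index):
        data = self.clusters[index]
        if data is not None:
            return data                            # served by this blob
        return self.back_dev.read_cluster(index)   # resolved by the backing device


snapshot = ToyBlob(3, ZeroesDev())
snapshot.clusters[0] = b"AAAA"
snapshot.clusters[1] = b"BBBB"

# A blob can itself act as another blob's backing device.
clone = ToyBlob(3, snapshot)
clone.clusters[1] = b"bbbb"     # the clone has overwritten cluster 1
```

Because ToyBlob exposes the same read_cluster method as ZeroesDev, a snapshot can stand in wherever a backing device is expected, which mirrors the role lib/blob/blob_bs_dev.c plays for real blobs.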
Loading And Initializing A Blobstore
Public entry points:
- include/spdk/blob.h:spdk_bs_init
- include/spdk/blob.h:spdk_bs_load
- include/spdk/blob.h:spdk_bs_grow
- include/spdk/blob.h:spdk_bs_grow_live
- include/spdk/blob.h:spdk_bs_destroy
- include/spdk/blob.h:spdk_bs_unload
Implementation anchors:
- lib/blob/blobstore.c:spdk_bs_init
- lib/blob/blobstore.c:spdk_bs_load
- lib/blob/blobstore.c:bs_alloc
- lib/blob/blobstore.c:bs_parse_super
- lib/blob/blobstore.c:bs_recover
- lib/blob/blobstore.c:bs_load_replay_md
- lib/blob/blobstore.c:bs_load_replay_md_parse_page
- lib/blob/blobstore.c:bs_load_write_used_md
- lib/blob/blobstore.c:spdk_bs_grow
- lib/blob/blobstore.c:spdk_bs_grow_live
Initialization creates new metadata. Loading reads and validates existing metadata. The load path allocates a struct spdk_bs_load_ctx, reads the superblock, reconstructs in-memory state from metadata pages, recovers if necessary, rebuilds allocation masks, and then completes with a blobstore handle.
The private load context in lib/blob/blobstore.c:struct spdk_bs_load_ctx is worth reading because it shows what blobstore believes it needs to reconstruct the world:
- super: superblock buffer.
- mask: used-cluster or used-blobid mask.
- page: current metadata page.
- extent_pages: loaded extent pages.
- used_clusters: load-time allocation view.
- force_recover: option for recovery behavior.
- iterator callback fields for walking blobs.
Beginner misconception to kill: load is not merely "read one superblock and return." It may replay metadata page chains and repair masks. Dirty shutdown changes the path.
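To make "rebuild allocation masks" concrete, here is a minimal Python sketch of recomputing a used-cluster mask from replayed blob cluster maps. It is a toy under stated assumptions: the function name and data layout are hypothetical, metadata is assumed to occupy the first clusters, and real recovery does considerably more (page validation, blob ID masks, snapshot relationships).

```python
def rebuild_used_clusters(total_clusters, md_clusters, blob_cluster_maps):
    """Recompute a used-cluster mask from replayed blob metadata (toy model)."""
    used = [False] * total_clusters
    for c in range(md_clusters):
        used[c] = True              # assumed: metadata occupies the first clusters
    for blob_id, cluster_map in blob_cluster_maps.items():
        for physical in cluster_map:
            if physical is None:
                continue            # thin hole: nothing to claim
            if used[physical]:
                raise ValueError(f"cluster {physical} claimed twice (blob {blob_id:#x})")
            used[physical] = True
    return used


# Cluster maps as they might look after walking each blob's metadata pages.
# The blob IDs are made-up values.
replayed = {
    0x100000000: [2, 3, None],
    0x100000001: [None, 5],
}
mask = rebuild_used_clusters(total_clusters=8, md_clusters=2, blob_cluster_maps=replayed)
free_clusters = mask.count(False)
```

The point of the sketch is that the mask is derived state: if it cannot be trusted after a dirty shutdown, it can be recomputed from the per-blob metadata.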
Cluster Allocation
Source anchors:
- lib/blob/blobstore.c:bs_claim_cluster
- lib/blob/blobstore.c:bs_release_cluster
- lib/blob/blobstore.c:blob_insert_cluster_on_md_thread
- lib/blob/blobstore.c:blob_insert_cluster_msg
- lib/blob/blobstore.c:blob_free_cluster_on_md_thread
- lib/blob/blobstore.c:blob_free_cluster_msg
- lib/blob/blobstore.c:blob_write_extent_page
Cluster allocation is the core of thin provisioning. When a thin blob receives a write to an unallocated logical cluster, blobstore claims a physical cluster from the blobstore pool, updates the blob's active metadata, persists the relevant metadata page or extent page, and then performs IO. Releasing a cluster clears the mapping, updates allocation counts, writes metadata, and returns the cluster to the free pool.
The IO path has to be careful at cluster boundaries. A single user IO may cover multiple blob clusters and may need to split. Relevant source anchors:
- lib/blob/blobstore.c:blob_request_submit_op
- lib/blob/blobstore.c:blob_request_submit_op_single
- lib/blob/blobstore.c:blob_request_submit_op_split
- lib/blob/blobstore.c:blob_request_submit_op_split_next
- lib/blob/blobstore.c:blob_request_submit_rw_iov
Prose diagram for a write to an unallocated cluster:
spdk_blob_io_write()
-> blob_request_submit_op()
-> determine cluster coverage
-> if one cluster: blob_request_submit_op_single()
-> if crosses boundary: blob_request_submit_op_split()
-> if target cluster missing and write needs storage:
allocate cluster on metadata thread
update cluster map / extent page
submit write to bs_dev
-> completion callback returns bserrno
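The boundary-splitting step can be illustrated with plain arithmetic. This Python sketch is not the SPDK split logic; split_by_cluster is a hypothetical helper and the cluster size of 8 IO units is an assumed value.

```python
IO_UNITS_PER_CLUSTER = 8  # assumed value for illustration


def split_by_cluster(offset, length):
    """Split [offset, offset + length), in IO units, into per-cluster pieces.

    Each piece is (logical_cluster, start_io_unit, io_unit_count) and never
    crosses a cluster boundary.
    """
    pieces = []
    while length > 0:
        room = IO_UNITS_PER_CLUSTER - (offset % IO_UNITS_PER_CLUSTER)
        count = min(length, room)
        pieces.append((offset // IO_UNITS_PER_CLUSTER, offset, count))
        offset += count
        length -= count
    return pieces


# A 12-unit IO starting 2 units before the end of cluster 0.
pieces = split_by_cluster(6, 12)
```

Each resulting piece targets exactly one logical cluster, so each can be checked for allocation independently before it is submitted.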
IO API Surface
Public IO API:
- include/spdk/blob.h:spdk_blob_io_read
- include/spdk/blob.h:spdk_blob_io_write
- include/spdk/blob.h:spdk_blob_io_readv
- include/spdk/blob.h:spdk_blob_io_writev
- include/spdk/blob.h:spdk_blob_io_readv_ext
- include/spdk/blob.h:spdk_blob_io_writev_ext
- include/spdk/blob.h:spdk_blob_io_unmap
- include/spdk/blob.h:spdk_blob_io_write_zeroes
Implementation:
- lib/blob/blobstore.c:spdk_blob_io_read
- lib/blob/blobstore.c:spdk_blob_io_write
- lib/blob/blobstore.c:spdk_blob_io_unmap
- lib/blob/blobstore.c:spdk_blob_io_write_zeroes
- lib/blob/blobstore.c:spdk_blob_io_readv_ext
- lib/blob/blobstore.c:spdk_blob_io_writev_ext
Request helpers:
- lib/blob/request.c:bs_sequence_start_bs
- lib/blob/request.c:bs_sequence_start_blob
- lib/blob/request.c:bs_batch_open
- lib/blob/request.c:bs_user_op_alloc
- lib/blob/request.c:bs_user_op_execute
Beginner rule: callback completion is the only point at which the caller knows the operation has completed. spdk_blob_io_write() has no return value to inspect, so returning from it tells the caller nothing about the outcome. Success and failure alike, including failures detected immediately at submission, are reported through the completion callback's bserrno argument.
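The callback-only contract can be shown with a toy model. This is a Python sketch, not the SPDK API: toy_blob_io_write, poll, and the pending queue are hypothetical stand-ins, and the choice of -EINVAL for an out-of-range request is an assumption of the toy.

```python
import errno
from collections import deque

pending = deque()   # stands in for work queued on an SPDK thread


def toy_blob_io_write(blob, offset, data, cb_fn, cb_arg):
    """Returns nothing; the outcome is only ever delivered through cb_fn."""
    if offset + len(data) > len(blob):
        cb_fn(cb_arg, -errno.EINVAL)    # immediate failure, still via the callback
        return

    def complete():
        blob[offset:offset + len(data)] = data
        cb_fn(cb_arg, 0)

    pending.append(complete)            # accepted, but not yet complete


def poll():
    """Run queued completions, as a reactor loop would."""
    while pending:
        pending.popleft()()


results = []


def on_done(tag, bserrno):
    results.append((tag, bserrno))


blob = bytearray(8)
toy_blob_io_write(blob, 0, b"abcd", on_done, "w1")
after_submit = list(results)            # still empty: w1 has not completed
toy_blob_io_write(blob, 6, b"toolong", on_done, "w2")
poll()
```

Note that w2 completes (with an error) before w1 does, even though it was submitted later. Code that assumes work is done when the submit call returns, or that completions arrive in submission order, breaks on exactly this pattern.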
Snapshots And Clones
Public API:
- include/spdk/blob.h:spdk_bs_create_snapshot
- include/spdk/blob.h:spdk_bs_create_clone
- include/spdk/blob.h:spdk_blob_is_snapshot
- include/spdk/blob.h:spdk_blob_is_clone
- include/spdk/blob.h:spdk_blob_get_parent_snapshot
- include/spdk/blob.h:spdk_blob_get_clones
Implementation anchors:
- lib/blob/blobstore.c:spdk_bs_create_snapshot
- lib/blob/blobstore.c:spdk_bs_create_clone
- lib/blob/blobstore.c:bs_snapshot_swap_cluster_maps
- lib/blob/blobstore.c:bs_snapshot_newblob_sync_cpl
- lib/blob/blobstore.c:bs_snapshot_origblob_sync_cpl
- lib/blob/blobstore.c:blob_freeze_io
- lib/blob/blobstore.c:blob_unfreeze_io
- lib/blob/blobstore.c:bs_blob_list_add
- lib/blob/blobstore.c:spdk_blob_is_snapshot
- lib/blob/blobstore.c:spdk_blob_is_clone
- lib/blob/blobstore.c:spdk_blob_get_parent_snapshot
- lib/blob/blobstore.c:spdk_blob_get_clones
A snapshot operation converts a mutable blob into a clone of a new read-only snapshot. That sentence is easy to miss. After snapshot creation:
- The new snapshot blob is read-only.
- The original blob remains writable but becomes thin-provisioned.
- The original blob points at the snapshot as backing storage for clusters it has not overwritten since the snapshot.
- Metadata xattrs record the relationship.
- IO is frozen while the relationship and cluster maps are changed.
Prose diagram:
Before snapshot:
volume blob V
cluster map: [A, B, C, D]
parent: none or previous parent
After snapshot named S:
snapshot blob S, read-only
cluster map: [A, B, C, D]
volume blob V, writable thin clone
cluster map: [0, 0, 0, 0]
parent/backing: S
New write to V cluster 2:
V allocates E
V cluster map: [0, 0, E, 0]
reads cluster 2 from V
reads clusters 0,1,3 from S
A clone operation creates a new writable thin blob backed by a read-only snapshot. Source readers should compare spdk_bs_create_snapshot and spdk_bs_create_clone. Snapshot creation modifies the original blob; clone creation leaves the snapshot as parent and creates a new child.
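The difference between the two operations is easier to see in a toy model that reproduces the diagram above. This Python sketch is illustrative only: ToyBlob, create_snapshot, and create_clone are hypothetical names, 0 marks an unallocated cluster as in the diagram, and IO freezing and metadata persistence are left out entirely.

```python
class ToyBlob:
    def __init__(self, name, cluster_map, parent=None, read_only=False):
        self.name = name
        self.cluster_map = cluster_map    # 0 means "unallocated", as in the diagram
        self.parent = parent
        self.read_only = read_only

    def resolve(self, index):
        """Return the physical cluster that serves a read of logical cluster index."""
        blob = self
        while blob is not None:
            if blob.cluster_map[index] != 0:
                return blob.cluster_map[index]
            blob = blob.parent
        return 0                          # no ancestor has data: reads as zeroes


def create_snapshot(orig, snap_name):
    """The snapshot takes over orig's clusters and parent; orig becomes its thin clone."""
    snap = ToyBlob(snap_name, orig.cluster_map, parent=orig.parent, read_only=True)
    orig.cluster_map = [0] * len(snap.cluster_map)
    orig.parent = snap
    return snap


def create_clone(snap, clone_name):
    """A clone is a brand-new thin blob whose parent is an existing snapshot."""
    assert snap.read_only
    return ToyBlob(clone_name, [0] * len(snap.cluster_map), parent=snap)


v = ToyBlob("V", ["A", "B", "C", "D"])
s = create_snapshot(v, "S")     # modifies V: it is now a thin clone of S
v.cluster_map[2] = "E"          # first write to V cluster 2 allocates E
c = create_clone(s, "C1")       # leaves S untouched and adds a new child
```

create_snapshot mutates the original blob, while create_clone only adds a new one; that asymmetry is the thing to look for when comparing the two real functions.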
External Snapshots
External snapshots allow a blob to use a non-blobstore device as its parent. lvol exposes this through bdev_lvol_clone_bdev and parent-setting RPCs, but the blobstore core owns the metadata relationship.
Public blob anchors:
- include/spdk/blob.h:spdk_bs_blob_set_external_parent
- include/spdk/blob.h:spdk_blob_get_esnap_id
- include/spdk/blob.h:spdk_blob_is_esnap_clone
- include/spdk/blob.h:spdk_blob_set_esnap_bs_dev
- include/spdk/blob.h:spdk_blob_get_esnap_bs_dev
- include/spdk/blob.h:spdk_blob_is_degraded
Implementation anchors:
- lib/blob/blobstore.c:spdk_bs_blob_set_external_parent
- lib/blob/blobstore.c:spdk_blob_set_esnap_bs_dev
- lib/blob/blobstore.c:spdk_blob_get_esnap_bs_dev
- lib/blob/blobstore.c:spdk_blob_is_degraded
- lib/blob/blobstore.c:blob_esnap_destroy_bs_dev_channels
- lib/blob/blobstore.c:blob_esnap_destroy_bs_channel
- lib/blob/blobstore.c:blob_esnap_get_io_channel
The public callback type spdk_bs_esnap_dev_create in include/spdk/blob.h is central. When blobstore loads an esnap clone, it calls the consumer-provided callback to open the external snapshot as a struct spdk_bs_dev. If the external snapshot is unavailable, the consumer can arrange degraded behavior.
Beginner misconception to kill: an external snapshot is not copied into the lvolstore by default. The clone depends on the external parent until it is inflated, decoupled, shallow-copied elsewhere, or reparented.
Resize And Grow
Blob resize:
- include/spdk/blob.h:spdk_blob_resize
- lib/blob/blobstore.c:spdk_blob_resize
- lib/blob/blobstore.c:blob_resize
- lib/blob/blobstore.c:bs_resize_freeze_cpl
- lib/blob/blobstore.c:bs_resize_unfreeze_cpl
Blobstore grow:
- include/spdk/blob.h:spdk_bs_grow
- include/spdk/blob.h:spdk_bs_grow_live
- lib/blob/blobstore.c:spdk_bs_grow
- lib/blob/blobstore.c:spdk_bs_grow_live
- lib/blob/blobstore.c:bs_load_try_to_grow
- lib/blob/blobstore.c:bs_grow_live_load_super_cpl
- lib/blob/blobstore.c:bs_grow_live_super_write_cpl
Resize is serialized with locked_operation_in_progress. If the blob is metadata-read-only (md_ro), resize fails with -EPERM. If the requested cluster count is unchanged, resize completes successfully without work. If another locked operation is active, resize fails with -EBUSY.
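The order of those checks matters, and a small model makes the consequences visible. This Python sketch mirrors only the sequence described in the paragraph above; ToyBlob and resize_precheck are hypothetical names, and the real function reports these results through its completion callback rather than a return value.

```python
import errno


class ToyBlob:
    def __init__(self, num_clusters, md_ro=False):
        self.num_clusters = num_clusters
        self.md_ro = md_ro
        self.locked_operation_in_progress = False


def resize_precheck(blob, new_num_clusters):
    """Return the bserrno to complete with immediately, or None to start resizing."""
    if blob.md_ro:
        return -errno.EPERM             # metadata read-only, e.g. a snapshot
    if new_num_clusters == blob.num_clusters:
        return 0                        # nothing to do
    if blob.locked_operation_in_progress:
        return -errno.EBUSY             # another locked operation is active
    return None


snap = ToyBlob(4, md_ro=True)
busy = ToyBlob(4)
busy.locked_operation_in_progress = True
idle = ToyBlob(4)
```

One consequence of this ordering in the sketch: a no-op resize on a busy blob reports success rather than -EBUSY, because the unchanged-size check comes first.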
Important edge cases:
- Growing logical size may not allocate clusters immediately for a thin blob.
- Shrinking must update metadata so clusters beyond the new end are not considered part of the blob.
- A clone can grow beyond its parent; backing reads are valid only where the parent has a range. See lib/blob/blob_bs_dev.c:blob_bs_is_range_valid, which explicitly accounts for a backing blob having fewer clusters than the child after expansion.
- Blobstore grow is about the underlying device becoming larger, not an individual blob becoming larger.
Delete Semantics
Public API:
include/spdk/blob.h:spdk_bs_delete_blob
Implementation anchors:
- lib/blob/blobstore.c:spdk_bs_delete_blob
- lib/blob/blobstore.c:bs_is_blob_deletable
- lib/blob/blobstore.c:update_clone_on_snapshot_deletion
- lib/blob/blobstore.c:delete_snapshot_open_clone_cb
- lib/blob/blobstore.c:delete_snapshot_freeze_io_cb
- lib/blob/blobstore.c:delete_snapshot_sync_snapshot_xattr_cpl
- lib/blob/blobstore.c:delete_snapshot_update_extent_pages
- lib/blob/blobstore.c:delete_snapshot_sync_clone_cpl
- lib/blob/blobstore.c:delete_snapshot_sync_snapshot_cpl
- lib/blob/blobstore.c:bs_delete_blob_finish
Blob deletion has special rules for snapshots:
- A normal blob with no clones can be removed.
- A snapshot with more than one clone cannot be removed directly.
- A snapshot with one clone may be removed by updating the clone to inherit the snapshot's parent/data relationship.
- An open snapshot cannot be removed except for the internal reference patterns allowed by the deletion logic.
- Deleting a snapshot may require freezing the clone, copying cluster mappings, changing parent xattrs, handling external snapshot references, and persisting metadata in a power-failure-aware order.
Prose diagram for deleting a snapshot with one clone:
Before:
Parent P -> Snapshot S -> Clone C
Delete S:
freeze C
mark S pending removal
copy needed S cluster mappings into C
change C parent from S to P, zeroes, or external snapshot depending on S
sync C metadata
sync S metadata / remove S from snapshot lists
unfreeze C
After:
Parent P -> Clone C
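The mapping hand-off in that sequence can be sketched as a toy. This is Python for illustration, not the SPDK deletion path: the names are hypothetical, -EBUSY for the multi-clone case is an assumed error code, the clone and snapshot are assumed to have the same number of clusters, and freezing, pending-removal marking, and metadata sync ordering are omitted.

```python
import errno


class ToyBlob:
    def __init__(self, name, cluster_map, parent=None):
        self.name, self.cluster_map, self.parent = name, cluster_map, parent
        self.clones = []
        if parent is not None:
            parent.clones.append(self)


def delete_snapshot(snap):
    """Return 0 on success or a negative errno, mimicking bserrno."""
    if len(snap.clones) > 1:
        return -errno.EBUSY      # assumed error code for "more than one clone"
    if snap.parent is not None:
        snap.parent.clones.remove(snap)
    if snap.clones:
        clone = snap.clones[0]
        # The clone keeps its own clusters and inherits the rest from the snapshot.
        clone.cluster_map = [own if own is not None else inherited
                             for own, inherited in zip(clone.cluster_map, snap.cluster_map)]
        clone.parent = snap.parent
        if snap.parent is not None:
            snap.parent.clones.append(clone)
    snap.clones = []
    snap.cluster_map = []
    return 0


p = ToyBlob("P", [1, None, None])
s = ToyBlob("S", [None, 2, 3], parent=p)
c = ToyBlob("C", [None, None, 9], parent=s)
rc = delete_snapshot(s)

s2 = ToyBlob("S2", [4])
x = ToyBlob("X", [None], parent=s2)
y = ToyBlob("Y", [None], parent=s2)
rc_multi = delete_snapshot(s2)
```

After the call, C holds the snapshot's cluster 2 for logical cluster 1, keeps its own cluster 9, and still resolves logical cluster 0 through P.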
Edge Cases And Failure Modes
ENOSPC On Thin Writes
Thin provisioning means creation does not reserve all clusters. First write to a previously unallocated cluster may call bs_claim_cluster and fail if the free cluster pool is exhausted. The visible failure may happen at lvol bdev write time even though lvol creation succeeded earlier.
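A toy model shows why the failure surfaces late. This Python sketch is not SPDK code: ToyBlobstore, claim_cluster, and thin_write are hypothetical, and the pool of two clusters is deliberately tiny.

```python
import errno


class ToyBlobstore:
    def __init__(self, total_clusters):
        self.free = list(range(total_clusters))

    def claim_cluster(self):
        """Hand out a free physical cluster, or None when the pool is empty."""
        return self.free.pop(0) if self.free else None


def thin_write(bs, cluster_map, logical_cluster):
    """Return 0 or -ENOSPC; allocates only on the first write to a cluster."""
    if cluster_map[logical_cluster] is None:
        physical = bs.claim_cluster()
        if physical is None:
            return -errno.ENOSPC
        cluster_map[logical_cluster] = physical
    return 0


bs = ToyBlobstore(total_clusters=2)
vol = [None] * 4        # "creating" a 4-cluster thin volume always succeeds here
rcs = [thin_write(bs, vol, i) for i in (0, 1, 1, 2)]
```

The volume is logically four clusters but the store only has two, so creation and the first three writes succeed (the third is a rewrite of an already allocated cluster) and the fourth fails.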
EBUSY On Concurrent Metadata Operations
Snapshot, resize, parent changes, and some delete paths use locked operations. lib/blob/blobstore.c:spdk_blob_resize checks blob->locked_operation_in_progress and returns -EBUSY if another locked operation is active.
Read-Only Metadata
Snapshots are read-only. lib/blob/blobstore.c:spdk_blob_resize fails with -EPERM when blob->md_ro is true. Snapshot deletion temporarily overrides metadata read-only flags in controlled cleanup paths; that does not mean callers may mutate snapshot metadata freely.
Dirty Load And Recovery
Crash or power loss during metadata updates can leave dirty metadata. The load path has recovery-specific logic:
- lib/blob/blobstore.c:bs_recover
- lib/blob/blobstore.c:bs_load_replay_md
- lib/blob/blobstore.c:bs_load_write_used_md
- lib/blob/blobstore.c:bs_load_cur_md_page_valid
Blobstore metadata update ordering is designed so load can reason about partially completed operations. For example, snapshot delete uses SNAPSHOT_PENDING_REMOVAL to make recovery possible.
External Snapshot Missing
If an esnap clone is loaded but its external parent cannot be opened, the blob/lvol may be degraded. The blobstore public predicate is include/spdk/blob.h:spdk_blob_is_degraded, implemented by lib/blob/blobstore.c:spdk_blob_is_degraded.
IO Channel Lifetime
Blobstore IO APIs require a channel. External snapshot channels may need to be destroyed on all relevant threads. See:
- lib/blob/blobstore.c:blob_esnap_destroy_bs_dev_channels
- lib/blob/blobstore.c:blob_esnap_destroy_bs_channel
- lib/blob/blobstore.c:blob_esnap_get_io_channel
Misconceptions To Kill
- "Blobstore is a filesystem." It is not. It is an async object allocator over a block device.
- "Snapshot copies all data." It does not. It changes metadata relationships and shares clusters until overwritten.
- "Thin provisioning means writes always succeed later." No. Allocation can fail at write time.
- "Resize allocates data." Logical resize and physical allocation are separate for thin blobs.
- "A clone owns all bytes it can read." A clone may read from its own clusters, a snapshot, an external snapshot, or zeroes.
- "Delete only removes a record." Snapshot delete can rewrite clone metadata and parent relationships.
- "Load is simple." Load may replay metadata, recover dirty state, validate metadata checksums, and rebuild allocation masks.
Source Reading Exercise
Read these in order and write down every callback transition:
- lib/blob/blobstore.c:spdk_bs_create_snapshot
- lib/blob/blobstore.c:bs_snapshot_newblob_sync_cpl
- lib/blob/blobstore.c:bs_snapshot_origblob_sync_cpl
- lib/blob/blobstore.c:blob_freeze_io
- lib/blob/blobstore.c:blob_unfreeze_io
Questions:
- At what point is the new snapshot marked read-only?
- Why does snapshot creation swap cluster maps?
- What must be undone if syncing the new snapshot metadata fails?
- What prevents concurrent writes while the cluster maps are being rearranged?
Operational Lab
Use test/lvol/snapshot_clone.sh as the high-level lab and test/unit/lib/blob/blob.c/blob_ut.c as the source-level lab.
Suggested lab flow:
- Create a malloc bdev.
- Create an lvolstore on it.
- Create an lvol.
- Write recognizable data to different offsets.
- Create a snapshot.
- Write new data to the original lvol.
- Create a clone from the snapshot.
- Compare reads from original, snapshot, and clone.
- Delete a snapshot with zero, one, and more than one clone, and record expected errors.
If running a full SPDK target is unavailable, trace the expected RPC-to-source path instead:
bdev_lvol_snapshot RPC
-> module/bdev/lvol/vbdev_lvol_rpc.c:rpc_bdev_lvol_snapshot
-> module/bdev/lvol/vbdev_lvol.c:vbdev_lvol_create_snapshot
-> lib/lvol/lvol.c:spdk_lvol_create_snapshot
-> lib/blob/blobstore.c:spdk_bs_create_snapshot
Self-Check
- Why can a thin blob have a large logical size but few allocated clusters?
- What is the difference between spdk_bs_grow_live() and spdk_blob_resize()?
- Why does snapshot creation freeze IO?
- Why can deleting a snapshot require updating a clone?
- What does spdk_blob_is_degraded() mean for an esnap clone?
- Why is struct spdk_bs_dev separate from struct spdk_bdev?
- Which functions would you inspect first for an -EBUSY returned by lvol resize?
- Which source file adapts a blob into a backing device for clone reads?
References
- Local API: include/spdk/blob.h
- Local implementation: lib/blob/blobstore.c
- Local private structures: lib/blob/blobstore.h
- Local bdev adapter: module/blob/bdev/blob_bdev.c
- Local blob-backed bs_dev: lib/blob/blob_bs_dev.c
- Local tests: test/blobstore/blobstore.sh, test/blobstore/blobstore_grow/blobstore_grow.sh, test/lvol/snapshot_clone.sh, test/lvol/thin_provisioning.sh