Chapter Goal
This chapter explains diskengine storage-node mode as a set of reconciliation loops around SPDK. The reader should understand how local NVMe devices are discovered, bound for SPDK, attached as bdevs, turned into lvstores, carved into lvols, exported over NVMe-oF, monitored, resized, snapshotted, deleted, and reconciled after restart.
Beginner Mental Model
Storage-node mode owns physical SSD capacity. It does not directly serve VM writes through Go handlers. Instead, it prepares SPDK objects so compute/baremetal nodes can perform data I/O through SPDK transports.
The loop is:
- Find local physical NVMe devices.
- Make them usable by SPDK.
- Attach each disk as an SPDK NVMe bdev.
- Create or import an lvol store on the disk.
- Create lvol bdevs for volumes.
- Export lvols as NVMe-oF namespaces.
- Keep DB state and SPDK state close enough that restarts can self-heal.
The design is eventually consistent. A database row enters a state like NEW, CREATING, UP, DELETING, or RESIZING. A loop observes it, tries the SPDK operation, and updates the database when the operation is confirmed.
Entry Point And Loops
Storage-node mode starts here:
- /home/lolwierd/Projects/excloud/diskengine/diskengine/cmd/diskengine/main.go: main
- /home/lolwierd/Projects/excloud/diskengine/diskengine/cmd/diskengine/init.go: init
- /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/storagenode/storagenode.go: Start
- /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/storagenode/storagenode.go: Stop
Start first checks the SPDK RPC socket and runs a startup verification pass:
- /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/storagenode/utils.go: ensureSockExists
- /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/storagenode/verifystate.go: verifyState
Then it starts loops:
- /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/storagenode/disk_init.go: diskInitLoop
- /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/storagenode/disk_discover.go: diskDiscoverLoop
- /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/storagenode/disk_health.go: diskHealthLoop
- /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/storagenode/resize.go: inPlaceResizeLoop
- /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/storagenode/nvmeofexport.go: nvmeofExportLoop
- /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/storagenode/snapshotcreate.go: snapshotCreateLoop
- /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/storagenode/provisionlvol.go: provisioningLoop
- /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/storagenode/lvol_delete.go: lvolDeleteLoop
- /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/storagenode/snapshotdelete.go: snapshotDeleteLoop
- /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/iostatscraper/scraper.go: StartStorage
Disk Discovery
Disk discovery inspects Linux sysfs for kernel-visible NVMe devices:
- internal/storagenode/disk_discover.go: discoverStep
- internal/storagenode/disk_discover.go: enumerateSysDisks
Important caveat from the source: devices bound to vfio-pci are no longer exposed under /sys/block in the same way. Discovery sees kernel-bound devices before SPDK takes ownership. That is why discovery and initialization need to be reasoned about together.
Repository anchors include:
- internal/repository/disk.go: GetNewDisks
- internal/repository/disk.go: GetDisksByNode
Disk Initialization
Disk initialization handles NEW disks:
- internal/storagenode/disk_init.go: processNewDisks
- internal/storagenode/disk_init.go: initialiseDisk
- internal/storagenode/disk_init.go: bindToVfio
- internal/storagenode/disk_init.go: ensureVfioPciModuleLoaded
- internal/storagenode/disk_init.go: checkIOMMUAvailable
- internal/storagenode/disk_init.go: bindToDriver
- internal/storagenode/disk_init.go: examineAndFindLvstore
- internal/storagenode/disk_init.go: findNvmeBdevName
The SPDK calls involved are wrapped by:
- internal/spdkclient/wrappers.go: BdevNvmeAttachController
- internal/spdkclient/wrappers.go: BdevExamine
- internal/spdkclient/wrappers.go: BdevLvolGetLvstores
- internal/spdkclient/wrappers.go: BdevLvolCreateLvstore
- internal/spdkclient/wrappers.go: BdevGetBdevs
SPDK source anchors:
- module/bdev/nvme/bdev_nvme_rpc.c: rpc_bdev_nvme_attach_controller
- module/bdev/lvol/vbdev_lvol_rpc.c: rpc_bdev_lvol_create_lvstore
- lib/bdev/bdev_rpc.c: rpc_bdev_examine
Lvstore And Lvol Provisioning
An lvstore is SPDK blobstore-backed allocation space on a base bdev. An lvol is a thin logical volume bdev inside that store.
Provisioning source anchors:
- internal/storagenode/provisionlvol.go: provisioningLoop
- internal/storagenode/provisionlvol.go: processProvisioning
- internal/storagenode/provisionlvol.go: provisionLvol
- internal/storagenode/provisionlvol.go: isNamespaceAttached
- internal/storagenode/provisionlvol.go: findExistingLvolUUID
- internal/spdkclient/wrappers.go: BdevLvolCreate
- internal/spdkclient/wrappers.go: NvmfSubsystemAddNs
The important sequence in provisionLvol is:
- Validate NQN/RDMA placement info.
- Ensure NVMe-oF target objects exist.
- Create the lvol.
- Attach the lvol bdev as a namespace to the subsystem.
- Finalize DB state.
If namespace attachment fails after lvol creation, the source logs that an orphaned bdev may require cleanup. This is an important production edge case: partial success is real.
NVMe-oF Export Reconciliation
Exports are reconciled separately from provisioning:
- internal/storagenode/nvmeofexport.go: nvmeofExportLoop
- internal/storagenode/nvmeofexport.go: reconcileExports
- internal/storagenode/nvmeofexport.go: reconcileDiskPlacementFromEnv
- internal/storagenode/utils.go: ensureNvmeofReady
- internal/storagenode/nqn.go: deterministicBaseNQN
- internal/storagenode/nqn.go: deterministicLvolNQN
The SPDK operations are:
- nvmf_get_transports
- nvmf_create_transport
- nvmf_get_subsystems
- nvmf_create_subsystem
- nvmf_subsystem_add_listener
- nvmf_subsystem_add_ns
- bdev_get_bdevs
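These method names are sent to SPDK as JSON-RPC 2.0 requests over the RPC socket. The sketch below shows only how such a request body could be encoded; rpcRequest and encodeRequest are hypothetical names and say nothing about how internal/spdkclient is actually structured. The "name" parameter for bdev_get_bdevs follows SPDK's documented RPC, and the bdev name used is made up.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rpcRequest is the JSON-RPC 2.0 envelope. Params is omitted when nil, which
// suits parameterless calls such as nvmf_get_subsystems.
type rpcRequest struct {
	JSONRPC string      `json:"jsonrpc"`
	ID      int         `json:"id"`
	Method  string      `json:"method"`
	Params  interface{} `json:"params,omitempty"`
}

// encodeRequest builds the request body for one RPC call.
func encodeRequest(id int, method string, params interface{}) ([]byte, error) {
	return json.Marshal(rpcRequest{JSONRPC: "2.0", ID: id, Method: method, Params: params})
}

func main() {
	req, err := encodeRequest(1, "nvmf_get_subsystems", nil)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(req))
	// {"jsonrpc":"2.0","id":1,"method":"nvmf_get_subsystems"}

	// "lvs0/vol1" is an illustrative bdev name, not one taken from diskengine.
	req, err = encodeRequest(2, "bdev_get_bdevs", map[string]string{"name": "lvs0/vol1"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(req))
	// {"jsonrpc":"2.0","id":2,"method":"bdev_get_bdevs","params":{"name":"lvs0/vol1"}}
}
```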
SPDK source anchors:
- lib/nvmf/nvmf_rpc.c: rpc_nvmf_create_transport
- lib/nvmf/nvmf_rpc.c: rpc_nvmf_create_subsystem
- lib/nvmf/nvmf_rpc.c: rpc_nvmf_subsystem_add_listener
- lib/nvmf/nvmf_rpc.c: rpc_nvmf_subsystem_add_ns
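Because exports are reconciled separately from provisioning, the core of an export pass can be pictured as a set difference between what the database wants exported and what SPDK currently has attached. The following is a minimal sketch of that idea only; missingExports is a hypothetical helper, and the real reconcileExports may compare different keys.

```go
package main

import (
	"fmt"
	"sort"
)

// missingExports returns the lvol UUIDs that should be exported (desired) but
// have no namespace attached in SPDK (attached).
func missingExports(desired, attached []string) []string {
	have := make(map[string]bool, len(attached))
	for _, u := range attached {
		have[u] = true
	}
	var missing []string
	for _, u := range desired {
		if !have[u] {
			missing = append(missing, u)
		}
	}
	sort.Strings(missing) // stable order makes logs and retries predictable
	return missing
}

func main() {
	desired := []string{"uuid-c", "uuid-a", "uuid-b"} // from DB placements
	attached := []string{"uuid-b"}                     // from SPDK namespaces
	fmt.Println(missingExports(desired, attached))     // [uuid-a uuid-c]
}
```

Each UUID in the result is a candidate for an nvmf_subsystem_add_ns call on the next pass.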
Health, Resize, Snapshots, And Delete
Health:
- internal/storagenode/disk_health.go: checkDiskHealth
- internal/storagenode/disk_health.go: classifyHealth
- internal/spdkclient/wrappers.go: BdevNvmeGetControllerHealthInfo
- SPDK anchor: module/bdev/nvme/bdev_nvme_rpc.c: rpc_bdev_nvme_get_controller_health_info
Resize:
- internal/storagenode/resize.go: processInPlaceResizes
- internal/storagenode/resize.go: zeroLvolDeltaRegion
- internal/spdkclient/wrappers.go: BdevLvolResize
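In-place resize hinges on comparing the size SPDK reports for the lvol bdev with the capacity recorded in the database. A minimal sketch of that comparison follows. The block_size and num_blocks fields are the ones bdev_get_bdevs reports; the bdevInfo type and growDelta function are hypothetical and do not reproduce processInPlaceResizes.

```go
package main

import "fmt"

// bdevInfo holds the two size fields reported by bdev_get_bdevs.
type bdevInfo struct {
	BlockSize uint64 `json:"block_size"`
	NumBlocks uint64 `json:"num_blocks"`
}

// growDelta reports whether the DB capacity exceeds the current SPDK size and,
// if so, by how many bytes the lvol would have to grow.
func growDelta(b bdevInfo, dbCapacityBytes uint64) (deltaBytes uint64, needed bool) {
	current := b.BlockSize * b.NumBlocks
	if dbCapacityBytes <= current {
		return 0, false // already large enough; nothing to do
	}
	return dbCapacityBytes - current, true
}

func main() {
	lvol := bdevInfo{BlockSize: 512, NumBlocks: 2097152} // 1 GiB in SPDK
	delta, needed := growDelta(lvol, 2<<30)               // DB says 2 GiB
	fmt.Println(delta, needed)                            // 1073741824 true
}
```

The check is idempotent: once SPDK reports the larger size, the same comparison returns false and the loop stops issuing resizes.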
Snapshots:
- internal/storagenode/snapshotcreate.go: snapshotCreateLoop
- internal/storagenode/snapshotdelete.go: snapshotDeleteLoop
- internal/spdkclient/wrappers.go: BdevLvolSnapshot
Deletion:
- internal/storagenode/lvol_delete.go: lvolDeleteLoop
- internal/storagenode/lvol_delete.go: processDeletingLvols
- internal/spdkclient/wrappers.go: BdevLvolDelete
Metrics:
- internal/iostatscraper/collector.go: collectStorage
- internal/spdkclient/wrappers.go: BdevGetIostat
Prose Diagram: Storage Node Reconciler
Draw a hub-and-spoke diagram. At the center is "SPDK RPC socket." Around it are loops:
- disk discovery,
- disk init,
- provisioning,
- export reconcile,
- resize,
- snapshot create/delete,
- lvol delete,
- health,
- iostat scraper.
On the left is the database. On the right is SPDK object state. Each loop reads database state, reads SPDK state, makes one small change, and updates database state. Under SPDK, draw the data object stack: physical NVMe bdev -> lvstore -> lvol bdev -> NVMf namespace.
Edge Cases And Failure Modes
Device disappears from sysfs after VFIO bind:
This is expected. Do not conclude the disk is gone just because /sys/block no longer shows it after initialization.
No IOMMU or VFIO failure:
bindToVfio and checkIOMMUAvailable are storage-node bootstrap risks. Without SPDK ownership of the device, no lvstore can be created.
Lvstore exists in SPDK but not DB:
verifyState and initialization recovery can self-heal by inserting or aligning DB rows, depending on the observed state.
Lvol exists but namespace missing:
Provisioning may have partially succeeded. nvmeofExportLoop should attach missing namespaces for UP/RESIZING placements, but operators should inspect for orphaned bdevs.
Transport exists but wrong listener address:
A successful nvmf_create_transport call does not prove the listener is correct. Inspect the listen addresses that nvmf_get_subsystems reports for the subsystem.
Resize grows DB capacity before SPDK lvol:
inPlaceResizeLoop compares SPDK bdev size against DB capacity and issues bdev_lvol_resize.
Snapshot delete while clone depends on it:
SPDK lvol/blobstore can reject unsafe deletes. The loop must leave state for retry or manual cleanup.
Misconceptions To Kill
"Storage-node mode is a data proxy."
No. It prepares SPDK exports. VM writes go through SPDK transport and bdev paths, not Go request handlers.
"Disk discovery continues to see VFIO-bound devices."
No. The source explicitly notes this limitation.
"An lvol is exported automatically when created."
No. The lvol bdev must be attached to an NVMf subsystem namespace.
"DB state is always the truth."
DB state is intended state plus reconciliation memory. SPDK state and hardware state can drift; loops compare and repair.
Lab: Provision One lvol On Paper
Create a written sequence for a new volume replica:
- Disk is already UP with lvstore UUID.
- lvols row enters CREATING.
- provisioningLoop observes it.
- ensureNvmeofReady verifies RDMA transport, subsystem, and listener.
- BdevLvolCreate creates lvol bdev.
- NvmfSubsystemAddNs exports it.
- repository finalization stores SPDK lvol UUID and NQN.
For each step, name the SPDK RPC or diskengine function.
Operational Debug Exercise
Symptom: baremetal cannot connect to a newly created volume.
On storage node:
- Is the lvol UP in DB?
- Does bdev_get_bdevs show the lvol UUID?
- Does nvmf_get_subsystems show the NQN?
- Does the subsystem have a listener on the expected RDMA IP/port?
- Does the subsystem have a namespace with the lvol UUID?
- Did nvmeofExportLoop log add-listener or add-ns errors?
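The three subsystem questions in the checklist can be answered mechanically from nvmf_get_subsystems output. The sketch below assumes the response is a JSON array whose entries carry nqn, listen_addresses (with traddr and trsvcid), and namespaces (with uuid); verify those field names against your SPDK version. The checkExport function, the exportCheck type, and all sample values are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// subsystem decodes only the fields this check needs.
type subsystem struct {
	NQN       string `json:"nqn"`
	Listeners []struct {
		TrType  string `json:"trtype"`
		TrAddr  string `json:"traddr"`
		TrSvcID string `json:"trsvcid"`
	} `json:"listen_addresses"`
	Namespaces []struct {
		NSID int    `json:"nsid"`
		UUID string `json:"uuid"`
	} `json:"namespaces"`
}

// exportCheck records the answer to each checklist question.
type exportCheck struct {
	SubsystemFound bool
	ListenerFound  bool
	NamespaceFound bool
}

// checkExport looks for the NQN, then for the expected listener and namespace.
func checkExport(raw []byte, nqn, addr, port, lvolUUID string) (exportCheck, error) {
	var subs []subsystem
	if err := json.Unmarshal(raw, &subs); err != nil {
		return exportCheck{}, fmt.Errorf("decode subsystems: %w", err)
	}
	var res exportCheck
	for _, s := range subs {
		if s.NQN != nqn {
			continue
		}
		res.SubsystemFound = true
		for _, l := range s.Listeners {
			if l.TrAddr == addr && l.TrSvcID == port {
				res.ListenerFound = true
			}
		}
		for _, n := range s.Namespaces {
			if n.UUID == lvolUUID {
				res.NamespaceFound = true
			}
		}
	}
	return res, nil
}

func main() {
	// Sample output with a listener but no namespace: the "lvol exists but
	// namespace missing" case.
	raw := []byte(`[{
		"nqn": "nqn.2024-01.example:vol1",
		"listen_addresses": [{"trtype": "RDMA", "traddr": "10.0.0.5", "trsvcid": "4420"}],
		"namespaces": []
	}]`)
	res, err := checkExport(raw, "nqn.2024-01.example:vol1", "10.0.0.5", "4420", "lvol-uuid-1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", res)
	// {SubsystemFound:true ListenerFound:true NamespaceFound:false}
}
```

A result with SubsystemFound and ListenerFound true but NamespaceFound false points at the add-ns step rather than at the transport or listener.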
Self-Check
- Why does storage-node mode bind disks to VFIO?
- What is the difference between an lvstore and an lvol?
- Which loop ensures NVMe-oF exports exist?
- Why can lvol creation and namespace attachment partially succeed?
- Why is verifyState run before loops start?
References
- Local diskengine: /home/lolwierd/Projects/excloud/diskengine/diskengine/docs/storagenode.md
- Local diskengine: /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/storagenode
- Local diskengine: /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/spdkclient
- Local SPDK: module/bdev/lvol/vbdev_lvol_rpc.c
- Local SPDK: lib/nvmf/nvmf_rpc.c
- Local SPDK: module/bdev/nvme/bdev_nvme_rpc.c
- SPDK lvol documentation: https://spdk.io/doc/logical_volumes.html
- SPDK NVMe-oF documentation: https://spdk.io/doc/nvmf.html