SPDK From First Principles

Chapter 29: diskengine Storage Node Mode

Source: drafts/transport-diskengine/29-diskengine-storage-node-mode.md

Chapter Goal

This chapter explains diskengine storage-node mode as a set of reconciliation loops around SPDK. The reader should understand how local NVMe devices are discovered, bound for SPDK, attached as bdevs, turned into lvstores, carved into lvols, exported over NVMe-oF, monitored, resized, snapshotted, deleted, and reconciled after restart.

Beginner Mental Model

Storage-node mode owns physical SSD capacity. It does not directly serve VM writes through Go handlers. Instead, it prepares SPDK objects so compute/baremetal nodes can perform data I/O through SPDK transports.

The loop is:

  1. Find local physical NVMe devices.
  2. Make them usable by SPDK.
  3. Attach each disk as an SPDK NVMe bdev.
  4. Create or import an lvol store on the disk.
  5. Create lvol bdevs for volumes.
  6. Export lvols as NVMe-oF namespaces.
  7. Keep DB state and SPDK state close enough that restarts can self-heal.

The design is eventually consistent. A database row enters a state like NEW, CREATING, UP, DELETING, or RESIZING. A loop observes it, tries the SPDK operation, and updates the database when the operation is confirmed.
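The state-driven retry described above can be sketched in a few lines of Go. This is a minimal illustration, not diskengine's code: the Row type, the reconcileOnce function, and the errUnavailable value are hypothetical names, and only the state names come from the text.

```go
package main

import (
	"errors"
	"fmt"
)

// State mirrors the row states named in the text.
type State string

const (
	StateCreating State = "CREATING"
	StateUp       State = "UP"
)

// Row is a hypothetical, simplified database row.
type Row struct {
	ID    string
	State State
}

// errUnavailable stands in for any failed SPDK call.
var errUnavailable = errors.New("rpc socket unavailable")

// reconcileOnce tries the SPDK-side operation for a CREATING row and
// promotes the row to UP only after the operation is confirmed. On failure
// the row is left untouched so the next pass retries it.
func reconcileOnce(row *Row, create func(id string) error) error {
	if row.State != StateCreating {
		return nil // nothing for this loop to do
	}
	if err := create(row.ID); err != nil {
		return fmt.Errorf("create %s: %w", row.ID, err)
	}
	row.State = StateUp
	return nil
}

func main() {
	row := &Row{ID: "vol-1", State: StateCreating}

	// First pass: the operation fails, so the row stays in CREATING.
	err := reconcileOnce(row, func(string) error { return errUnavailable })
	fmt.Println(row.State, err)

	// Second pass: the operation succeeds and the row is promoted.
	err = reconcileOnce(row, func(string) error { return nil })
	fmt.Println(row.State, err)
}
```

The point of the shape is that the database is only advanced after confirmation, so a crash or RPC failure between passes leaves a row that the next pass can pick up again.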

Entry Point And Loops

Storage-node mode starts here:

  • cmd/diskengine/main.go: main
  • cmd/diskengine/init.go: init
  • internal/storagenode/storagenode.go: Start
  • internal/storagenode/storagenode.go: Stop

Start first checks the SPDK RPC socket and runs a startup verification pass:

  • internal/storagenode/utils.go: ensureSockExists
  • internal/storagenode/verifystate.go: verifyState

Then it starts loops:

  • internal/storagenode/disk_init.go: diskInitLoop
  • internal/storagenode/disk_discover.go: diskDiscoverLoop
  • internal/storagenode/disk_health.go: diskHealthLoop
  • internal/storagenode/resize.go: inPlaceResizeLoop
  • internal/storagenode/nvmeofexport.go: nvmeofExportLoop
  • internal/storagenode/snapshotcreate.go: snapshotCreateLoop
  • internal/storagenode/provisionlvol.go: provisioningLoop
  • internal/storagenode/lvol_delete.go: lvolDeleteLoop
  • internal/storagenode/snapshotdelete.go: snapshotDeleteLoop
  • internal/iostatscraper/scraper.go: StartStorage
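One common way to run a set of independent loops like these is a ticker-driven goroutine per loop, all sharing a cancellable context. The sketch below assumes that shape; runLoop, startLoops, and runUntil are hypothetical helpers, and the actual Start and Stop functions may be organized differently.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// runLoop calls step right away and then once per interval until ctx is
// cancelled. A failed step is logged and retried on the next tick.
func runLoop(ctx context.Context, name string, interval time.Duration, step func(context.Context) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for ctx.Err() == nil {
		if err := step(ctx); err != nil {
			fmt.Printf("%s: step failed, will retry: %v\n", name, err)
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// startLoops launches one goroutine per named step and returns a stop
// function that cancels them all and waits for them to exit.
func startLoops(parent context.Context, interval time.Duration, steps map[string]func(context.Context) error) (stop func()) {
	ctx, cancel := context.WithCancel(parent)
	var wg sync.WaitGroup
	for name, step := range steps {
		wg.Add(1)
		go func(name string, step func(context.Context) error) {
			defer wg.Done()
			runLoop(ctx, name, interval, step)
		}(name, step)
	}
	return func() {
		cancel()
		wg.Wait()
	}
}

// runUntil runs a single loop synchronously and cancels it from inside the
// step after n calls, returning how many times the step ran.
func runUntil(n int) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	calls := 0
	runLoop(ctx, "demo", time.Millisecond, func(context.Context) error {
		calls++
		if calls == n {
			cancel()
		}
		return nil
	})
	return calls
}

func main() {
	noop := func(context.Context) error { return nil }
	stop := startLoops(context.Background(), 10*time.Millisecond, map[string]func(context.Context) error{
		"diskDiscover": noop,
		"provisioning": noop,
	})
	time.Sleep(25 * time.Millisecond)
	stop() // cancels every loop and waits for the goroutines to finish
	fmt.Println("calls before cancel:", runUntil(3))
}
```

Checking ctx.Err() at the top of each iteration keeps the loop from running one extra step when cancellation and a tick become ready at the same moment.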

Disk Discovery

Disk discovery inspects Linux sysfs for kernel-visible NVMe devices:

  • internal/storagenode/disk_discover.go: discoverStep
  • internal/storagenode/disk_discover.go: enumerateSysDisks

Important caveat from the source: once a device is bound to vfio-pci, the kernel nvme driver no longer owns it, so its block device no longer appears under /sys/block. Discovery therefore sees only kernel-bound devices, before SPDK takes ownership. That is why discovery and initialization need to be reasoned about together.
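A minimal sketch of sysfs-based enumeration follows, assuming the scan simply lists /sys/block and keeps entries whose names start with "nvme". The function names are hypothetical, and enumerateSysDisks likely reads more attributes (size, serial, PCI address) than this does.

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"strings"
)

// filterNvmeNames keeps /sys/block entry names that look like kernel NVMe
// namespaces (for example "nvme0n1") and returns them sorted.
func filterNvmeNames(names []string) []string {
	var out []string
	for _, n := range names {
		if strings.HasPrefix(n, "nvme") {
			out = append(out, n)
		}
	}
	sort.Strings(out)
	return out
}

// listKernelNvme reads a sysfs block directory. Only devices still bound to
// the kernel nvme driver appear here; a device handed to vfio-pci does not.
func listKernelNvme(sysBlock string) ([]string, error) {
	entries, err := os.ReadDir(sysBlock)
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		names = append(names, e.Name())
	}
	return filterNvmeNames(names), nil
}

func main() {
	disks, err := listKernelNvme("/sys/block")
	if err != nil {
		fmt.Println("cannot read sysfs:", err)
		return
	}
	fmt.Println("kernel-visible NVMe namespaces:", disks)
}
```

An empty result on a storage node that has already initialized its disks is consistent with the caveat above and does not by itself mean the hardware is missing.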

Repository anchors include:

  • internal/repository/disk.go: GetNewDisks
  • internal/repository/disk.go: GetDisksByNode

Disk Initialization

Disk initialization handles NEW disks:

  • internal/storagenode/disk_init.go: processNewDisks
  • internal/storagenode/disk_init.go: initialiseDisk
  • internal/storagenode/disk_init.go: bindToVfio
  • internal/storagenode/disk_init.go: ensureVfioPciModuleLoaded
  • internal/storagenode/disk_init.go: checkIOMMUAvailable
  • internal/storagenode/disk_init.go: bindToDriver
  • internal/storagenode/disk_init.go: examineAndFindLvstore
  • internal/storagenode/disk_init.go: findNvmeBdevName
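Rebinding a PCI device to vfio-pci is usually done through a short sequence of sysfs writes: unbind the current driver, set driver_override, then ask the PCI core to re-probe. The sketch below only builds that plan and checks for IOMMU groups. It assumes the standard Linux sysfs layout; the type and function names are hypothetical, and bindToVfio may take a different route.

```go
package main

import (
	"fmt"
	"os"
	"path"
)

// sysfsWrite is one "write Value into Path" step.
type sysfsWrite struct {
	Path  string
	Value string
}

// vfioBindPlan returns the sysfs writes that move a PCI device (for example
// "0000:3b:00.0") from its current driver to vfio-pci.
func vfioBindPlan(pciAddr string) []sysfsWrite {
	dev := path.Join("/sys/bus/pci/devices", pciAddr)
	return []sysfsWrite{
		{Path: path.Join(dev, "driver", "unbind"), Value: pciAddr},      // detach the kernel nvme driver
		{Path: path.Join(dev, "driver_override"), Value: "vfio-pci"},   // pin the next driver match
		{Path: "/sys/bus/pci/drivers_probe", Value: pciAddr},           // re-probe so vfio-pci claims it
	}
}

// iommuAvailable reports whether the kernel exposes at least one IOMMU
// group, which vfio-pci normally requires.
func iommuAvailable(groupsDir string) bool {
	entries, err := os.ReadDir(groupsDir)
	return err == nil && len(entries) > 0
}

func main() {
	fmt.Println("IOMMU groups present:", iommuAvailable("/sys/kernel/iommu_groups"))
	for _, w := range vfioBindPlan("0000:3b:00.0") {
		fmt.Printf("echo %s > %s\n", w.Value, w.Path)
	}
}
```

Keeping the plan separate from the writes makes the sequence easy to log and review before anything touches a production disk.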

The SPDK calls involved are wrapped by:

  • internal/spdkclient/wrappers.go: BdevNvmeAttachController
  • internal/spdkclient/wrappers.go: BdevExamine
  • internal/spdkclient/wrappers.go: BdevLvolGetLvstores
  • internal/spdkclient/wrappers.go: BdevLvolCreateLvstore
  • internal/spdkclient/wrappers.go: BdevGetBdevs

SPDK source anchors:

  • module/bdev/nvme/bdev_nvme_rpc.c: rpc_bdev_nvme_attach_controller
  • module/bdev/lvol/vbdev_lvol_rpc.c: rpc_bdev_lvol_create_lvstore
  • lib/bdev/bdev_rpc.c: rpc_bdev_examine
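Each wrapper ultimately sends a JSON-RPC 2.0 request over the SPDK RPC socket. The sketch below encodes three such requests using SPDK's documented method and parameter names. The Go types and the encode helper are hypothetical and are not the spdkclient API; the controller name, PCI address, and lvstore name are example values.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rpcRequest is a JSON-RPC 2.0 envelope.
type rpcRequest struct {
	JSONRPC string `json:"jsonrpc"`
	ID      int    `json:"id"`
	Method  string `json:"method"`
	Params  any    `json:"params,omitempty"`
}

// attachParams holds bdev_nvme_attach_controller parameters for a local
// PCIe device.
type attachParams struct {
	Name   string `json:"name"`
	Trtype string `json:"trtype"`
	Traddr string `json:"traddr"`
}

// createLvstoreParams holds the required bdev_lvol_create_lvstore parameters.
type createLvstoreParams struct {
	BdevName string `json:"bdev_name"`
	LvsName  string `json:"lvs_name"`
}

// encode renders one request as the JSON text that goes on the socket.
func encode(id int, method string, params any) (string, error) {
	b, err := json.Marshal(rpcRequest{JSONRPC: "2.0", ID: id, Method: method, Params: params})
	return string(b), err
}

func main() {
	// Attaching controller "nvme0" yields namespace bdevs named like "nvme0n1".
	attach, _ := encode(1, "bdev_nvme_attach_controller",
		attachParams{Name: "nvme0", Trtype: "PCIe", Traddr: "0000:3b:00.0"})
	// List existing lvstores first so an existing store is reused, not recreated.
	list, _ := encode(2, "bdev_lvol_get_lvstores", nil)
	create, _ := encode(3, "bdev_lvol_create_lvstore",
		createLvstoreParams{BdevName: "nvme0n1", LvsName: "lvs0"})
	fmt.Println(attach)
	fmt.Println(list)
	fmt.Println(create)
}
```

Listing lvstores before creating one matters because creating a new lvstore on a disk that already holds one would discard the existing volumes.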

Lvstore And Lvol Provisioning

An lvstore is SPDK blobstore-backed allocation space on a base bdev. An lvol is a logical volume bdev carved from that store; SPDK can create it thin-provisioned or fully allocated.

Provisioning source anchors:

  • internal/storagenode/provisionlvol.go: provisioningLoop
  • internal/storagenode/provisionlvol.go: processProvisioning
  • internal/storagenode/provisionlvol.go: provisionLvol
  • internal/storagenode/provisionlvol.go: isNamespaceAttached
  • internal/storagenode/provisionlvol.go: findExistingLvolUUID
  • internal/spdkclient/wrappers.go: BdevLvolCreate
  • internal/spdkclient/wrappers.go: NvmfSubsystemAddNs

The important sequence in provisionLvol is:

  1. Validate NQN/RDMA placement info.
  2. Ensure NVMe-oF target objects exist.
  3. Create the lvol.
  4. Attach the lvol bdev as a namespace to the subsystem.
  5. Finalize DB state.

If namespace attachment fails after lvol creation, the source logs that an orphaned bdev may require cleanup. Partial success is a real production case: the lvol can exist in SPDK while no namespace exports it.
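The create-then-attach sequence and its partial-failure window can be sketched against a small interface. Note the deliberate difference from the source: the text says provisionLvol logs the possible orphan, whereas this sketch attempts an immediate rollback, which is an alternative design and not the documented behavior. The spdk interface, fakeSPDK, and provision are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// spdk is the small slice of SPDK operations the sketch needs.
type spdk interface {
	CreateLvol(lvstore, name string, sizeMiB uint64) (bdev string, err error)
	AddNamespace(nqn, bdev string) error
	DeleteLvol(bdev string) error
}

// provision creates an lvol and exports it. If the export fails it tries to
// delete the lvol so no orphan is left behind.
func provision(c spdk, lvstore, name, nqn string, sizeMiB uint64) (string, error) {
	if nqn == "" {
		return "", errors.New("missing NQN")
	}
	bdev, err := c.CreateLvol(lvstore, name, sizeMiB)
	if err != nil {
		return "", fmt.Errorf("create lvol: %w", err)
	}
	if err := c.AddNamespace(nqn, bdev); err != nil {
		if delErr := c.DeleteLvol(bdev); delErr != nil {
			return "", fmt.Errorf("add namespace: %v; orphaned lvol %s needs cleanup: %w", err, bdev, delErr)
		}
		return "", fmt.Errorf("add namespace: %w", err)
	}
	return bdev, nil
}

// fakeSPDK is an in-memory stand-in used to exercise the failure path.
type fakeSPDK struct {
	failAddNs bool
	lvols     map[string]bool
}

func newFake(failAddNs bool) *fakeSPDK {
	return &fakeSPDK{failAddNs: failAddNs, lvols: map[string]bool{}}
}

func (f *fakeSPDK) CreateLvol(lvstore, name string, sizeMiB uint64) (string, error) {
	id := lvstore + "/" + name // the fake identifies lvols by "lvstore/name"
	f.lvols[id] = true
	return id, nil
}

func (f *fakeSPDK) AddNamespace(nqn, bdev string) error {
	if f.failAddNs {
		return errors.New("add namespace rejected")
	}
	return nil
}

func (f *fakeSPDK) DeleteLvol(bdev string) error {
	delete(f.lvols, bdev)
	return nil
}

func main() {
	f := newFake(true)
	_, err := provision(f, "lvs0", "vol-1", "nqn.2024-01.io.example:vol-1", 1024)
	fmt.Println("error:", err, "| lvols left behind:", len(f.lvols))
}
```

Whichever policy is chosen, the database row should only be finalized after both steps succeed, so a retry starts from a known state.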

NVMe-oF Export Reconciliation

Exports are reconciled separately from provisioning:

  • internal/storagenode/nvmeofexport.go: nvmeofExportLoop
  • internal/storagenode/nvmeofexport.go: reconcileExports
  • internal/storagenode/nvmeofexport.go: reconcileDiskPlacementFromEnv
  • internal/storagenode/utils.go: ensureNvmeofReady
  • internal/storagenode/nqn.go: deterministicBaseNQN
  • internal/storagenode/nqn.go: deterministicLvolNQN

The SPDK operations are:

  • nvmf_get_transports
  • nvmf_create_transport
  • nvmf_get_subsystems
  • nvmf_create_subsystem
  • nvmf_subsystem_add_listener
  • nvmf_subsystem_add_ns
  • bdev_get_bdevs

SPDK source anchors:

  • lib/nvmf/nvmf_rpc.c: rpc_nvmf_create_transport
  • lib/nvmf/nvmf_rpc.c: rpc_nvmf_create_subsystem
  • lib/nvmf/nvmf_rpc.c: rpc_nvmf_subsystem_add_listener
  • lib/nvmf/nvmf_rpc.c: rpc_nvmf_subsystem_add_ns
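Export reconciliation is idempotent when it first reads what SPDK already has and then issues only the missing calls. The sketch below plans those calls from simplified in-memory views of nvmf_get_transports and nvmf_get_subsystems. The struct shapes and planExport are hypothetical simplifications, not the reconcileExports implementation, and the NQN and address are example values.

```go
package main

import "fmt"

// subsystem is a simplified view of one nvmf_get_subsystems entry.
type subsystem struct {
	NQN        string
	Listeners  []string // "traddr:trsvcid"
	Namespaces []string // bdev names
}

// desired describes the export one lvol should have.
type desired struct {
	NQN      string
	Listener string
	Bdev     string
}

func contains(list []string, v string) bool {
	for _, x := range list {
		if x == v {
			return true
		}
	}
	return false
}

// planExport returns, in order, the SPDK RPC methods still needed to make
// the observed state match the desired export.
func planExport(want desired, transports []string, subs []subsystem) []string {
	var ops []string
	if !contains(transports, "RDMA") {
		ops = append(ops, "nvmf_create_transport")
	}
	var found *subsystem
	for i := range subs {
		if subs[i].NQN == want.NQN {
			found = &subs[i]
		}
	}
	if found == nil {
		return append(ops, "nvmf_create_subsystem", "nvmf_subsystem_add_listener", "nvmf_subsystem_add_ns")
	}
	if !contains(found.Listeners, want.Listener) {
		ops = append(ops, "nvmf_subsystem_add_listener")
	}
	if !contains(found.Namespaces, want.Bdev) {
		ops = append(ops, "nvmf_subsystem_add_ns")
	}
	return ops
}

func main() {
	want := desired{NQN: "nqn.2024-01.io.example:vol-1", Listener: "10.0.0.5:4420", Bdev: "lvs0/vol-1"}
	// The subsystem and listener exist but the namespace was never attached.
	observed := []subsystem{{NQN: want.NQN, Listeners: []string{want.Listener}}}
	fmt.Println(planExport(want, []string{"RDMA"}, observed))
}
```

Running the plan twice against unchanged SPDK state yields the same result, and running it after the calls succeed yields an empty plan.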

Health, Resize, Snapshots, And Delete

Health:

  • internal/storagenode/disk_health.go: checkDiskHealth
  • internal/storagenode/disk_health.go: classifyHealth
  • internal/spdkclient/wrappers.go: BdevNvmeGetControllerHealthInfo
  • SPDK anchor: module/bdev/nvme/bdev_nvme_rpc.c: rpc_bdev_nvme_get_controller_health_info
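Health classification typically maps a few NVMe SMART values to a coarse status. The sketch below shows one possible mapping. The thresholds, the status names, and the smart struct are assumptions made for illustration; classifyHealth may use different inputs and rules.

```go
package main

import "fmt"

// smart holds a few values from the NVMe SMART / health log.
type smart struct {
	CriticalWarning   uint8 // non-zero means the controller raised a warning bit
	TemperatureC      int
	AvailableSparePct int
	SpareThresholdPct int
	PercentageUsed    int // vendor estimate of endurance consumed
}

// classify maps SMART values to a coarse status. The cut-offs (90% used,
// 70 degrees Celsius) are illustrative and not taken from diskengine.
func classify(s smart) string {
	switch {
	case s.CriticalWarning != 0, s.AvailableSparePct < s.SpareThresholdPct:
		return "FAILING"
	case s.PercentageUsed >= 90, s.TemperatureC >= 70:
		return "DEGRADED"
	default:
		return "HEALTHY"
	}
}

func main() {
	fmt.Println(classify(smart{TemperatureC: 35, AvailableSparePct: 100, SpareThresholdPct: 10, PercentageUsed: 3}))
	fmt.Println(classify(smart{TemperatureC: 35, AvailableSparePct: 100, SpareThresholdPct: 10, PercentageUsed: 95}))
	fmt.Println(classify(smart{CriticalWarning: 1, AvailableSparePct: 100, SpareThresholdPct: 10}))
}
```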

Resize:

  • internal/storagenode/resize.go: processInPlaceResizes
  • internal/storagenode/resize.go: zeroLvolDeltaRegion
  • internal/spdkclient/wrappers.go: BdevLvolResize

Snapshots:

  • internal/storagenode/snapshotcreate.go: snapshotCreateLoop
  • internal/storagenode/snapshotdelete.go: snapshotDeleteLoop
  • internal/spdkclient/wrappers.go: BdevLvolSnapshot

Deletion:

  • internal/storagenode/lvol_delete.go: lvolDeleteLoop
  • internal/storagenode/lvol_delete.go: processDeletingLvols
  • internal/spdkclient/wrappers.go: BdevLvolDelete

Metrics:

  • internal/iostatscraper/collector.go: collectStorage
  • internal/spdkclient/wrappers.go: BdevGetIostat

Prose Diagram: Storage Node Reconciler

Draw a hub-and-spoke diagram. At the center is "SPDK RPC socket." Around it are loops:

  • disk discovery,
  • disk init,
  • provisioning,
  • export reconcile,
  • resize,
  • snapshot create/delete,
  • lvol delete,
  • health,
  • iostat scraper.

On the left is the database. On the right is SPDK object state. Each loop reads database state, reads SPDK state, makes one small change, and updates database state. Under SPDK, draw the data object stack: physical NVMe bdev -> lvstore -> lvol bdev -> NVMe-oF namespace.

Edge Cases And Failure Modes

Device disappears from sysfs after VFIO bind:

This is expected. Do not conclude the disk is gone just because /sys/block no longer shows it after initialization.

No IOMMU or VFIO failure:

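bindToVfio and checkIOMMUAvailable guard the riskiest bootstrap step. If the IOMMU is unavailable or the vfio-pci bind fails, SPDK cannot take ownership of the device, so the disk cannot be attached and no lvstore can be created on it.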

Lvstore exists in SPDK but not DB:

verifyState and initialization recovery can self-heal by inserting or aligning DB rows, depending on the observed state.
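The first step of any such self-healing pass is a plain set comparison between what the database records and what SPDK reports. The sketch below computes that drift for lvstore UUIDs. It is a hypothetical helper; what verifyState does with each side of the difference is not shown here.

```go
package main

import (
	"fmt"
	"sort"
)

// diffSets compares identifiers known to the database with identifiers
// reported by SPDK and returns what each side is missing, sorted.
func diffSets(db, spdk []string) (onlyDB, onlySPDK []string) {
	inDB := make(map[string]bool, len(db))
	for _, id := range db {
		inDB[id] = true
	}
	inSPDK := make(map[string]bool, len(spdk))
	for _, id := range spdk {
		inSPDK[id] = true
		if !inDB[id] {
			onlySPDK = append(onlySPDK, id) // exists in SPDK, unknown to the DB
		}
	}
	for _, id := range db {
		if !inSPDK[id] {
			onlyDB = append(onlyDB, id) // recorded in the DB, absent from SPDK
		}
	}
	sort.Strings(onlyDB)
	sort.Strings(onlySPDK)
	return onlyDB, onlySPDK
}

func main() {
	onlyDB, onlySPDK := diffSets(
		[]string{"lvs-a", "lvs-b"},
		[]string{"lvs-b", "lvs-c"},
	)
	fmt.Println("in DB only:", onlyDB)
	fmt.Println("in SPDK only:", onlySPDK)
}
```

Entries found only in SPDK are candidates for inserting or aligning rows; entries found only in the database usually need investigation before anything is deleted.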

Lvol exists but namespace missing:

Provisioning may have partially succeeded. nvmeofExportLoop should attach missing namespaces for UP/RESIZING placements, but operators should inspect for orphaned bdevs.

Transport exists but wrong listener address:

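A successful nvmf_create_transport only creates the transport; listeners are added per subsystem with nvmf_subsystem_add_listener. Inspect the listen_addresses that nvmf_get_subsystems reports for the subsystem and compare them with the expected address and port.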

Resize grows DB capacity before SPDK lvol:

inPlaceResizeLoop compares SPDK bdev size against DB capacity and issues bdev_lvol_resize.
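The comparison can be sketched from the block_size and num_blocks fields that bdev_get_bdevs reports. The sketch assumes the resize target is expressed in MiB and that the newly added region starts at the old end of the volume. The planResize helper and resizePlan type are hypothetical, and the exact byte range handled by zeroLvolDeltaRegion is not confirmed by the text.

```go
package main

import "fmt"

const mib = 1 << 20

// resizePlan describes what a grow operation would do.
type resizePlan struct {
	Needed     bool
	TargetMiB  uint64 // size to request from bdev_lvol_resize
	ZeroOffset uint64 // first byte of the newly added region
	ZeroLength uint64 // length in bytes of the newly added region
}

// planResize compares the size SPDK reports (blockSize * numBlocks) with the
// capacity recorded in the database and plans a grow if the DB is larger.
func planResize(blockSize, numBlocks, dbBytes uint64) resizePlan {
	current := blockSize * numBlocks
	if dbBytes <= current {
		return resizePlan{} // already large enough; shrinking is not handled here
	}
	target := (dbBytes + mib - 1) / mib // round up to whole MiB
	return resizePlan{
		Needed:     true,
		TargetMiB:  target,
		ZeroOffset: current,
		ZeroLength: target*mib - current,
	}
}

func main() {
	// A 1 GiB lvol (4096-byte blocks) whose DB capacity was raised to 2 GiB.
	fmt.Printf("%+v\n", planResize(4096, 262144, 2<<30))
}
```

SPDK may round the final size up to the lvstore cluster size, so re-reading the bdev after the resize is safer than trusting the requested value.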

Snapshot delete while clone depends on it:

SPDK lvol/blobstore can reject unsafe deletes. The loop must leave state for retry or manual cleanup.

Misconceptions To Kill

"Storage-node mode is a data proxy."

No. It prepares SPDK exports. VM writes go through SPDK transport and bdev paths, not Go request handlers.

"Disk discovery continues to see VFIO-bound devices."

No. The source explicitly notes this limitation.

"An lvol is exported automatically when created."

No. The lvol bdev must be attached as a namespace to an NVMe-oF subsystem before any initiator can see it.

"DB state is always the truth."

DB state is intended state plus reconciliation memory. SPDK state and hardware state can drift; loops compare and repair.

Lab: Provision One lvol On Paper

Create a written sequence for a new volume replica:

  1. Disk is already UP with lvstore UUID.
  2. lvols row enters CREATING.
  3. provisioningLoop observes it.
  4. ensureNvmeofReady verifies RDMA transport, subsystem, and listener.
  5. BdevLvolCreate creates lvol bdev.
  6. NvmfSubsystemAddNs exports it.
  7. repository finalization stores SPDK lvol UUID and NQN.

For each step, name the SPDK RPC or diskengine function.

Operational Debug Exercise

Symptom: baremetal cannot connect to a newly created volume.

On storage node:

  1. Is the lvol UP in DB?
  2. Does bdev_get_bdevs show the lvol UUID?
  3. Does nvmf_get_subsystems show the NQN?
  4. Does the subsystem have a listener on the expected RDMA IP/port?
  5. Does the subsystem have a namespace with the lvol UUID?
  6. Did nvmeofExportLoop log add-listener or add-ns errors?

Self-Check

  1. Why does storage-node mode bind disks to VFIO?
  2. What is the difference between an lvstore and an lvol?
  3. Which loop ensures NVMe-oF exports exist?
  4. Why can lvol creation and namespace attachment partially succeed?
  5. Why is verifyState run before loops start?

References

  • Local diskengine: /home/lolwierd/Projects/excloud/diskengine/diskengine/docs/storagenode.md
  • Local diskengine: /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/storagenode
  • Local diskengine: /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/spdkclient
  • Local SPDK: module/bdev/lvol/vbdev_lvol_rpc.c
  • Local SPDK: lib/nvmf/nvmf_rpc.c
  • Local SPDK: module/bdev/nvme/bdev_nvme_rpc.c
  • SPDK lvol documentation: https://spdk.io/doc/logical_volumes.html
  • SPDK NVMe-oF documentation: https://spdk.io/doc/nvmf.html