Chapter Goal
This chapter follows one guest write from a VM down to a physical SSD and then follows completion back up. It ties together vhost-blk, RAID, NVMe-oF initiator, NVMe-oF target, lvol/blobstore, and physical NVMe.
The key idea: diskengine constructs and repairs the graph, but the write itself flows through SPDK data-plane code.
Scenario
Assume a VM writes 4 KiB to a disk backed by a replicated excloud volume:
- The VM is running on a baremetal node.
- QEMU uses a vhost-blk socket created by SPDK on that baremetal node.
- The vhost-blk controller exposes raid_123.
- raid_123 has base bdevs produced by bdev_nvme connections to storage-node NQNs.
- Each storage-node NQN exports one lvol as an NVMe-oF namespace.
- Each lvol is backed by an SPDK lvstore/blobstore on a local physical NVMe SSD.
Stage 0: Control Plane Already Ran
Before the write, these diskengine loops prepared the graph.
Storage node:
- internal/storagenode/disk_init.go: initialiseDisk
- internal/storagenode/provisionlvol.go: provisionLvol
- internal/storagenode/nvmeofexport.go: reconcileExports
Baremetal:
- internal/baremetal/nvme_attach.go: reconcileNVMeConnections
- internal/baremetal/raidensure.go: ensureRaid
- internal/baremetal/attach.go: ensureVhost
This distinction matters. If you are debugging one write, first decide whether the graph exists. If it does not, debug the control plane. If it exists but I/O fails, debug the data path.
Stage 1: Guest Submits virtio-blk Request
Inside the VM, the guest kernel submits a block write to its virtio-blk device. The guest fills virtqueue descriptors that point to guest memory containing the request header and payload. QEMU and SPDK have negotiated vhost-user memory mappings, so SPDK can translate those guest physical addresses.
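The address translation described above can be sketched as a lookup in a table of shared memory regions. This is a simplified model, not SPDK's vhost_gpa_to_vva: the MemRegion type, the region layout, and the addresses are hypothetical, and the sketch rejects a buffer that crosses a region boundary, whereas a real backend may split such a buffer into several iovs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MemRegion:
    """One guest memory region shared over vhost-user (simplified model)."""
    guest_phys_addr: int  # start of the region in guest physical address space
    size: int             # region length in bytes
    host_vaddr: int       # where the backend process mapped that region


def gpa_to_vva(regions: List[MemRegion], gpa: int, length: int) -> int:
    """Translate the guest physical range [gpa, gpa + length) to a backend address.

    Raises ValueError when the range is not fully inside a single region.
    """
    for region in regions:
        start = region.guest_phys_addr
        end = start + region.size
        if start <= gpa and gpa + length <= end:
            return region.host_vaddr + (gpa - start)
    raise ValueError(f"guest range {gpa:#x}+{length:#x} is not mapped")


# Hypothetical layout: two 1 GiB regions, the second one above 4 GiB.
regions = [
    MemRegion(guest_phys_addr=0x0000_0000, size=0x4000_0000, host_vaddr=0x7F00_0000_0000),
    MemRegion(guest_phys_addr=0x1_0000_0000, size=0x4000_0000, host_vaddr=0x7F10_0000_0000),
]

# A 4 KiB payload buffer that the guest placed at GPA 0x1_0000_2000.
payload_vva = gpa_to_vva(regions, gpa=0x1_0000_2000, length=4096)
print(hex(payload_vva))
```

The offset inside the region is preserved, so the backend can read the payload in place without copying it out of guest memory.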
SPDK anchors:
- lib/vhost/vhost_blk.c: process_blk_task
- lib/vhost/vhost_blk.c: process_packed_blk_task
- lib/vhost/vhost_blk.c: blk_iovs_split_queue_setup
- lib/vhost/vhost_blk.c: blk_iovs_packed_queue_setup
- lib/vhost/vhost_internal.h: vhost_gpa_to_vva
- lib/vhost/vhost_internal.h: vhost_vring_desc_to_iov
The output of this stage is a spdk_vhost_blk_task with iovs and request metadata.
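To make the request metadata concrete, here is a minimal sketch of decoding the 16-byte virtio-blk request header that precedes the payload in the descriptor chain. The field layout (little-endian type, reserved, sector) and the 512-byte sector unit follow the virtio block device specification; the function name and the returned dictionary are illustrative, not SPDK structures.

```python
import struct

VIRTIO_BLK_T_IN = 0      # read
VIRTIO_BLK_T_OUT = 1     # write
VIRTIO_BLK_T_FLUSH = 4   # flush
SECTOR_SIZE = 512        # virtio-blk sectors are always 512 bytes

# struct virtio_blk_req header: le32 type, le32 reserved, le64 sector
HEADER = struct.Struct("<IIQ")


def parse_request(header: bytes, payload_len: int) -> dict:
    """Decode the header and derive the byte range the request covers."""
    if len(header) != HEADER.size:
        raise ValueError("virtio-blk header must be 16 bytes")
    if payload_len % SECTOR_SIZE:
        raise ValueError("payload must be a whole number of sectors")
    req_type, _reserved, sector = HEADER.unpack(header)
    return {
        "is_write": req_type == VIRTIO_BLK_T_OUT,
        "sector": sector,
        "byte_offset": sector * SECTOR_SIZE,
        "num_sectors": payload_len // SECTOR_SIZE,
    }


# The 4 KiB write from the scenario, placed at byte offset 1 MiB.
header = HEADER.pack(VIRTIO_BLK_T_OUT, 0, (1024 * 1024) // SECTOR_SIZE)
request = parse_request(header, payload_len=4096)
print(request)
```

A 4 KiB write is therefore eight virtio sectors, regardless of the block size the lower layers use.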
Stage 2: vhost-blk Submits bdev Write To RAID
The request reaches:
- lib/vhost/vhost_blk.c: vhost_user_process_blk_request
- lib/vhost/vhost_blk.c: virtio_blk_process_request
For a write, virtio_blk_process_request submits bdev I/O to the backing bdev, which is raid_123 in this scenario. The completion callback is:
lib/vhost/vhost_blk.c: blk_request_complete_cb
Nothing has reached the SSD yet. At this moment SPDK has created an asynchronous bdev write against a RAID bdev.
Stage 3: RAID Maps The Write To Base bdevs
The RAID module receives the bdev write through its module submit path:
- module/bdev/raid/bdev_raid.c: raid_bdev_submit_request
- module/bdev/raid/raid1.c
For RAID1, the write must be propagated to mirror bases according to the module logic. Each base bdev is an NVMe bdev created by bdev_nvme_attach_controller.
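The mirror fan-out can be pictured as one parent write that waits for a counter of outstanding base writes to reach zero. The sketch below is a toy model under stated assumptions: the MirrorWrite class is hypothetical, and its error policy (the parent fails if any base write fails) is chosen for simplicity and does not claim to match what raid1.c does with a failed base bdev.

```python
from typing import Callable, List


class MirrorWrite:
    """Tracks one parent write that was fanned out to several base bdevs."""

    def __init__(self, num_bases: int, on_done: Callable[[bool], None]):
        if num_bases < 1:
            raise ValueError("a mirror needs at least one base bdev")
        self.remaining = num_bases  # base writes still in flight
        self.all_ok = True          # assumed policy: any failure fails the parent
        self.on_done = on_done

    def base_done(self, success: bool) -> None:
        """Completion callback for one base write."""
        self.all_ok = self.all_ok and success
        self.remaining -= 1
        if self.remaining == 0:
            # Only the last base completion finishes the parent I/O.
            self.on_done(self.all_ok)


completions: List[bool] = []
write = MirrorWrite(num_bases=2, on_done=completions.append)

write.base_done(True)             # first replica acknowledged
pending_after_first = list(completions)
write.base_done(True)             # second replica acknowledged
print(pending_after_first, completions)
```

The point of the model is latency: the guest-visible write is at least as slow as the slowest required replica.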
Useful RAID RPC/debug anchors:
- module/bdev/raid/bdev_raid_rpc.c: rpc_bdev_raid_get_bdevs
- module/bdev/raid/bdev_raid_rpc.c: rpc_bdev_raid_add_base_bdev
- module/bdev/raid/bdev_raid_rpc.c: rpc_bdev_raid_remove_base_bdev
If RAID is rebuilding or degraded, behavior depends on current base state. This is why diskengine's health loop watches RAID processes:
internal/baremetal/baremetal_health.go: runHealthIteration
Stage 4: bdev_nvme Converts Base Writes To NVMe Commands
Each RAID base bdev receives a write:
- module/bdev/nvme/bdev_nvme.c: bdev_nvme_submit_request
- module/bdev/nvme/bdev_nvme.c: _bdev_nvme_submit_request
- module/bdev/nvme/bdev_nvme.c: bdev_nvme_writev
The bdev module submits NVMe namespace write commands through the NVMe library:
- lib/nvme/nvme_ns_cmd.c: spdk_nvme_ns_cmd_writev
- lib/nvme/nvme_ns_cmd.c: spdk_nvme_ns_cmd_write_ext
- lib/nvme/nvme_qpair.c
- lib/nvme/nvme_rdma.c
Because this is a remote storage-node export, the NVMe transport is RDMA in the typical diskengine path. The write is now an NVMe-oF command traveling from baremetal to storage node.
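As a worked example of what the NVMe command carries, the sketch below derives the starting LBA and block count for the 4 KiB write. It relies on two facts from the NVMe specification: the Write opcode is 0x01 and the Number of Logical Blocks field is zero-based. The function name, the chosen byte offset, and the namespace block sizes are assumptions for illustration.

```python
NVME_OPC_WRITE = 0x01  # NVM command set Write opcode


def nvme_write_fields(byte_offset: int, byte_len: int, block_size: int) -> dict:
    """Compute the LBA fields of an NVMe Write for an aligned byte range."""
    if byte_len <= 0:
        raise ValueError("write length must be positive")
    if byte_offset % block_size or byte_len % block_size:
        raise ValueError("range must be aligned to the namespace block size")
    slba = byte_offset // block_size
    nlb = byte_len // block_size - 1  # zero-based: 0 means one block
    return {
        "opcode": NVME_OPC_WRITE,
        "slba": slba,
        "cdw10": slba & 0xFFFFFFFF,   # low 32 bits of the starting LBA
        "cdw11": slba >> 32,          # high 32 bits of the starting LBA
        "nlb": nlb & 0xFFFF,          # bits 15:0 of command dword 12
    }


# Same 4 KiB write at byte offset 1 MiB, on two assumed namespace formats.
on_512 = nvme_write_fields(1024 * 1024, 4096, block_size=512)
on_4k = nvme_write_fields(1024 * 1024, 4096, block_size=4096)
print(on_512["slba"], on_512["nlb"], on_4k["slba"], on_4k["nlb"])
```

On a 4096-byte-block namespace the same guest write becomes a single-block command, which is one reason block size mismatches between layers show up as alignment errors rather than slow I/O.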
Stage 5: Storage Node NVMf Target Receives The Write
On the storage node, the RDMA transport receives the request and calls the common NVMf execution path:
- lib/nvmf/rdma.c: spdk_nvmf_request_exec call sites
- lib/nvmf/ctrlr.c: spdk_nvmf_request_exec
- lib/nvmf/ctrlr.c: nvmf_ctrlr_process_io_cmd
- lib/nvmf/ctrlr.c: spdk_nvmf_request_get_bdev
- lib/nvmf/ctrlr_bdev.c: nvmf_bdev_ctrlr_write_cmd
spdk_nvmf_request_get_bdev resolves the namespace ID to the lvol bdev attached to that subsystem. This is where the NQN/namespace created by storage-node control plane becomes a real bdev operation.
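The same NQN and namespace-ID resolution can be checked from outside the data path by walking the output of the nvmf_get_subsystems RPC. The sketch below assumes the output shape used by recent SPDK releases (a list of subsystems, each with "nqn" and a "namespaces" list whose entries carry "nsid" and "bdev_name"); verify the field names against your SPDK version. The NQNs and the bdev name in the sample are hypothetical.

```python
from typing import List, Optional


def bdev_for_namespace(subsystems: List[dict], nqn: str, nsid: int) -> Optional[str]:
    """Return the bdev name exported as (nqn, nsid), or None if not exported."""
    for subsystem in subsystems:
        if subsystem.get("nqn") != nqn:
            continue
        # The discovery subsystem has no namespaces, hence the default.
        for namespace in subsystem.get("namespaces", []):
            if namespace.get("nsid") == nsid:
                return namespace.get("bdev_name")
    return None


# Hypothetical nvmf_get_subsystems result from one storage node.
sample = [
    {"nqn": "nqn.2014-08.org.nvmexpress.discovery", "subtype": "Discovery"},
    {
        "nqn": "nqn.2024-01.example:volume-123-replica-0",
        "subtype": "NVMe",
        "namespaces": [
            {"nsid": 1, "bdev_name": "3f0c9a4e-5b1d-4c7e-9a53-0d2f6b8e1c11"},
        ],
    },
]

exported = bdev_for_namespace(sample, "nqn.2024-01.example:volume-123-replica-0", 1)
missing = bdev_for_namespace(sample, "nqn.2024-01.example:volume-123-replica-0", 2)
print(exported, missing)
```

A None result for an NQN that the control plane believes it exported points at the storage-node export loop, not at the write path.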
Stage 6: lvol/blobstore Writes To Physical NVMe bdev
The lvol bdev maps guest-visible logical blocks to blobstore clusters. Source anchors for deeper reading:
- module/bdev/lvol/vbdev_lvol.c
- module/bdev/lvol/vbdev_lvol_rpc.c
- lib/blob/blobstore.c
- module/blob/bdev/blob_bdev.c
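The logical-block-to-cluster mapping mentioned above is, at its core, integer division plus a per-volume table. The sketch below is a toy model of a thin-provisioned volume: the ThinVolume class, the 4 MiB cluster size, and the allocate-on-first-write behaviour are assumptions for illustration, and the real blobstore persists its mapping in on-disk metadata that this model ignores.

```python
from typing import Dict, List, Tuple

MIB = 1024 * 1024


class ThinVolume:
    """Maps logical clusters of one volume to physical clusters of a store."""

    def __init__(self, cluster_size: int, free_clusters: List[int]):
        self.cluster_size = cluster_size
        self.free = list(free_clusters)      # physical clusters not yet used
        self.mapping: Dict[int, int] = {}    # logical cluster -> physical cluster

    def resolve_write(self, byte_offset: int) -> Tuple[int, int]:
        """Return (physical_cluster, offset_in_cluster) for a write."""
        logical, within = divmod(byte_offset, self.cluster_size)
        if logical not in self.mapping:
            if not self.free:
                raise RuntimeError("store is out of free clusters")
            # First write to this logical cluster: allocate backing space.
            self.mapping[logical] = self.free.pop(0)
        return self.mapping[logical], within


volume = ThinVolume(cluster_size=4 * MIB, free_clusters=[17, 18, 19])

# A write at 9 MiB + 4 KiB lands in logical cluster 2.
first = volume.resolve_write(9 * MIB + 4096)
# A later write to the same logical cluster reuses the allocation.
second = volume.resolve_write(8 * MIB)
print(first, second, volume.mapping)
```

In this model the first write to a cluster does extra work, which is the kind of effect to keep in mind when thin-provisioned volumes show uneven write latency.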
The lvol ultimately submits I/O to its base bdev, which is the physical NVMe namespace attached on the storage node:
- module/bdev/nvme/bdev_nvme.c: bdev_nvme_submit_request
- module/bdev/nvme/bdev_nvme.c: bdev_nvme_writev
- lib/nvme/nvme_pcie.c
- lib/nvme/nvme_qpair.c
At the physical device boundary, the write is an NVMe command on a PCIe qpair. The SSD controller writes data into NAND through its internal flash translation layer. SPDK does not manage NAND pages directly.
Stage 7: Completion Returns Upward
Completion reverses the path:
- Physical SSD completes NVMe command.
- Storage-node NVMe driver polls completion.
- lvol/blobstore completes its bdev I/O.
- NVMf target completes request with spdk_nvmf_request_complete.
- RDMA response reaches baremetal initiator.
- Baremetal bdev_nvme_writev_done completes base bdev write.
- RAID completes when required base writes finish.
- vhost blk_request_complete_cb sets virtio status.
- vhost_user_blk_request_finish updates the used ring.
- Guest sees the virtio-blk completion.
SPDK completion anchors:
- module/bdev/nvme/bdev_nvme.c: bdev_nvme_writev_done
- lib/nvmf/ctrlr.c: spdk_nvmf_request_complete
- lib/nvmf/transport.c: nvmf_transport_req_complete
- lib/vhost/vhost_blk.c: blk_request_complete_cb
- lib/vhost/vhost_blk.c: vhost_user_blk_request_finish
Prose Diagram: Complete Write Path
Draw two machines: baremetal node on the left, storage node on the right.
Baremetal stack, top to bottom:
VM guest filesystem -> guest virtio-blk driver -> QEMU -> SPDK vhost-blk -> raid_123 -> bdev_nvme remote base bdevs -> RDMA NIC.
Storage-node stack:
RDMA NIC -> SPDK NVMf target -> subsystem namespace -> lvol bdev -> blobstore/lvstore -> physical NVMe bdev -> PCIe SSD.
Draw the write arrow left-to-right across RDMA between the bdev_nvme layer and NVMf target. Draw completion right-to-left all the way back to the guest. Use dashed boxes around diskengine loops above both machines to show they create the graph but are not on the per-I/O arrow.
Debugging By Layer
Guest layer:
- Is the disk visible?
- Are writes hanging or failing with I/O errors?
vhost layer:
- vhost_get_controllers
- thread_get_pollers
- Source: lib/vhost/vhost_blk.c
RAID layer:
- bdev_raid_get_bdevs
- base status, rebuild process, online/configuring/offline.
- Source: module/bdev/raid/bdev_raid_rpc.c.
Baremetal NVMe initiator:
- bdev_nvme_get_controllers
- bdev_nvme_get_io_paths
- Source: module/bdev/nvme/bdev_nvme_rpc.c.
Storage-node NVMf target:
- nvmf_get_subsystems
- nvmf_get_transports
- Source: lib/nvmf/nvmf_rpc.c.
Storage-node bdev/lvol:
- bdev_get_bdevs
- bdev_lvol_get_lvstores
- Source: module/bdev/lvol/vbdev_lvol_rpc.c.
Physical NVMe:
- bdev_nvme_get_controller_health_info
- SMART/temperature/media errors.
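The per-layer RPC names above can be driven from a small script. SPDK speaks JSON-RPC 2.0 over a Unix domain socket, by default /var/tmp/spdk.sock; the sketch below builds requests for each layer and includes a minimal blocking client. The LAYER_CHECKS table and helper names are hypothetical, the client assumes one reply per connection, and the per-controller health RPC is left out because it needs a controller name as a parameter.

```python
import itertools
import json
import socket

DEFAULT_SOCK = "/var/tmp/spdk.sock"  # SPDK's default RPC socket path
_request_ids = itertools.count(1)

# Layer -> list of (method, params). bdev_raid_get_bdevs needs a category.
LAYER_CHECKS = {
    "vhost": [("vhost_get_controllers", None), ("thread_get_pollers", None)],
    "raid": [("bdev_raid_get_bdevs", {"category": "all"})],
    "nvme_initiator": [("bdev_nvme_get_controllers", None),
                       ("bdev_nvme_get_io_paths", None)],
    "nvmf_target": [("nvmf_get_subsystems", None), ("nvmf_get_transports", None)],
    "lvol": [("bdev_get_bdevs", None), ("bdev_lvol_get_lvstores", None)],
}


def build_request(method, params=None):
    """Serialize one JSON-RPC 2.0 request as UTF-8 bytes."""
    request = {"jsonrpc": "2.0", "method": method, "id": next(_request_ids)}
    if params:
        request["params"] = params
    return json.dumps(request).encode("utf-8")


def call(method, params=None, sock_path=DEFAULT_SOCK, timeout=5.0):
    """Send one request and return its result, raising on an RPC error."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(sock_path)
        sock.sendall(build_request(method, params))
        buf = b""
        while True:
            chunk = sock.recv(65536)
            if not chunk:
                raise ConnectionError("RPC socket closed before a full reply")
            buf += chunk
            try:
                reply = json.loads(buf)
            except ValueError:
                continue  # reply is still incomplete, keep reading
            if "error" in reply:
                raise RuntimeError(reply["error"])
            return reply["result"]


def check_layer(layer, sock_path=DEFAULT_SOCK):
    """Run every check for one layer and return {method: result}."""
    return {m: call(m, p, sock_path) for m, p in LAYER_CHECKS[layer]}
```

On a baremetal node, check_layer("raid") followed by check_layer("nvme_initiator") walks the stack from the top down; on a storage node, start with check_layer("nvmf_target").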
Edge Cases And Misleading Symptoms
Guest hang can be vhost completion, not SSD failure:
If bdev I/O completes but the used ring is not updated or the interrupt is missed, the guest waits even though lower storage is fine.
RAID online can hide one bad replica:
A degraded RAID can still serve I/O. Check rebuild and base status, not just presence.
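A check along these lines can be scripted against one entry of the bdev_raid_get_bdevs output. The sketch assumes the field names emitted by recent SPDK releases ("state", "num_base_bdevs", "num_base_bdevs_operational", "base_bdevs_list" with "is_configured", and an optional "process" object); confirm them against your SPDK version before relying on it. The sample entry is made up.

```python
from typing import List


def raid_findings(raid: dict) -> List[str]:
    """List reasons why a RAID bdev should not be treated as fully healthy."""
    findings = []
    if raid.get("state") != "online":
        findings.append(f"state is {raid.get('state')!r}, not online")

    total = raid.get("num_base_bdevs")
    operational = raid.get("num_base_bdevs_operational")
    if total is not None and operational is not None and operational < total:
        findings.append(f"degraded: {operational} of {total} base bdevs operational")

    for slot, base in enumerate(raid.get("base_bdevs_list", [])):
        if not base.get("is_configured", False):
            findings.append(f"slot {slot} has no configured base bdev")

    process = raid.get("process")
    if process:
        findings.append(f"background process running: {process.get('type')}")
    return findings


# Hypothetical entry: online and serving I/O, but with one replica missing.
sample_raid = {
    "name": "raid_123",
    "state": "online",
    "raid_level": "raid1",
    "num_base_bdevs": 2,
    "num_base_bdevs_operational": 1,
    "base_bdevs_list": [
        {"name": "base_a", "is_configured": True},
        {"name": None, "is_configured": False},
    ],
}

findings = raid_findings(sample_raid)
print(findings)
```

An empty list is the only result that means "present and fully redundant"; the sample above is online yet produces two findings.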
NVMe controller enabled can still have path issues:
Use I/O path RPCs and stats, not just controller list.
Storage-node lvol exists but is not exported:
Baremetal attach fails at NVMf connect/discovery even though storage capacity exists.
Physical SSD healthy does not prove export healthy:
Network, target subsystem, namespace mapping, and bdev graph can fail above the SSD.
Control-plane race can look like data-plane failure:
If the VM starts before vhost or RAID is ready, the symptom may be a missing or failed disk. Check diskengine state transitions.
Misconceptions To Kill
"The write goes through diskengine Go code."
No. diskengine created the objects. The write flows through SPDK.
"There is one queue."
No. There are guest virtqueues, RAID bdev queues, NVMe initiator qpairs, NVMf target qpairs, lvol/blobstore work, and physical NVMe queues.
"Completion means durable on NAND."
Completion means the storage stack and device reported command completion according to their semantics. Durability depends on flush/FUA, volatile caches, SSD power-loss protection, and the protocol command used.
"A single RPC can diagnose the whole path."
No. Use layer-specific checks.
Lab: Build A Trace Checklist
For a volume 123, write a checklist with expected object names:
- vhost controller: vhost<volume_vm_mapping_id>
- RAID bdev: raid_123
- base bdevs: derived from NQNs using baseBdevNameFromNQN
- storage-node NQNs: from volume_lvol_mapping
- namespace bdev names: lvol UUIDs
- physical controller names: storage-node NvmeDisk<disk_id>
Then map each object to one SPDK RPC that can prove it exists.
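One possible shape for the finished checklist is a small table generator. The sketch below is a starting point, not a reference answer: it assumes the placeholders in the names above are filled by plain concatenation, it takes base bdev names as input instead of reimplementing baseBdevNameFromNQN, and the RPC chosen for each row is a suggestion to verify during the lab. All sample identifiers are hypothetical.

```python
from typing import Iterable, List


def trace_checklist(volume_id: int, mapping_id: int,
                    base_bdev_names: Iterable[str], nqns: Iterable[str],
                    lvol_uuids: Iterable[str], disk_ids: Iterable[int]) -> List[dict]:
    """Return one row per object type: where it lives, its names, and an RPC."""
    return [
        {"node": "baremetal", "object": "vhost controller",
         "names": [f"vhost{mapping_id}"], "rpc": "vhost_get_controllers"},
        {"node": "baremetal", "object": "RAID bdev",
         "names": [f"raid_{volume_id}"], "rpc": "bdev_raid_get_bdevs"},
        {"node": "baremetal", "object": "base bdevs",
         "names": list(base_bdev_names), "rpc": "bdev_get_bdevs"},
        {"node": "storage", "object": "storage-node NQNs",
         "names": list(nqns), "rpc": "nvmf_get_subsystems"},
        {"node": "storage", "object": "namespace bdevs",
         "names": list(lvol_uuids), "rpc": "bdev_get_bdevs"},
        {"node": "storage", "object": "physical controllers",
         "names": [f"NvmeDisk{d}" for d in disk_ids],
         "rpc": "bdev_nvme_get_controllers"},
    ]


checklist = trace_checklist(
    volume_id=123,
    mapping_id=7,
    base_bdev_names=["base_a", "base_b"],
    nqns=["nqn.2024-01.example:volume-123-replica-0",
          "nqn.2024-01.example:volume-123-replica-1"],
    lvol_uuids=["3f0c9a4e-5b1d-4c7e-9a53-0d2f6b8e1c11",
                "a8d2e6f0-1b3c-4d5e-8f70-9c1b2a3d4e5f"],
    disk_ids=[4, 9],
)
for row in checklist:
    print(f"{row['node']:<10}{row['object']:<22}{row['rpc']:<28}{', '.join(row['names'])}")
```

Keeping the node column explicit helps, because the same RPC name (for example bdev_get_bdevs) answers a different question depending on which SPDK instance receives it.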
Source Reading Exercise
Read in this order, stopping at the first async submission/completion pair in each layer:
- lib/vhost/vhost_blk.c: virtio_blk_process_request
- module/bdev/raid/bdev_raid.c: raid_bdev_submit_request
- module/bdev/nvme/bdev_nvme.c: bdev_nvme_writev
- lib/nvmf/ctrlr_bdev.c: nvmf_bdev_ctrlr_write_cmd
- module/bdev/lvol/vbdev_lvol.c
For each layer, identify:
- input object,
- output object,
- async completion callback.
Operational Debug Exercise
Symptom: write latency spikes every few minutes.
Investigate:
- Guest/vhost queue depth and session state.
- RAID rebuild activity from bdev_raid_get_bdevs.
- NVMe-oF path reconnects or disabled paths.
- Storage-node disk health and temperature.
- lvol/blobstore free space and snapshots/clones.
- CPU reactor saturation and poller stats.
- Network RDMA counters outside SPDK if available.
Tie each observation to a layer in the diagram. Avoid jumping straight from "VM slow" to "SSD bad."
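One way to tie observations to layers is to timestamp both the latency spikes and the candidate events, then measure how many spikes fall near each kind of event. The sketch below assumes you already have (timestamp, latency) samples and per-layer event timestamps from logs or polling; the spike threshold (a multiple of the median), the matching window, and the synthetic data are all illustrative choices.

```python
from statistics import median
from typing import List, Sequence, Tuple


def find_spikes(samples: Sequence[Tuple[float, float]], factor: float = 5.0) -> List[float]:
    """Return timestamps whose latency exceeds factor times the median."""
    baseline = median(latency for _, latency in samples)
    return [ts for ts, latency in samples if latency > factor * baseline]


def fraction_explained(spikes: Sequence[float], events: Sequence[float],
                       window_s: float) -> float:
    """Fraction of spikes that have some event within window_s seconds."""
    if not spikes:
        return 0.0
    hits = sum(1 for s in spikes if any(abs(s - e) <= window_s for e in events))
    return hits / len(spikes)


# Synthetic data: 10 minutes of 0.2 ms writes with three 5 ms spikes.
samples = [(float(t), 5.0 if t in (120, 300, 480) else 0.2) for t in range(600)]

# Hypothetical event timestamps collected per layer.
events_by_layer = {
    "nvme_initiator path reconnect": [119.0, 299.0, 479.0],
    "raid rebuild start": [50.0],
}

spikes = find_spikes(samples)
report = {layer: fraction_explained(spikes, events, window_s=2.0)
          for layer, events in events_by_layer.items()}
print(spikes, report)
```

A layer whose events line up with most spikes is the first place to look; a layer with a fraction near zero can be deprioritized even if it looks suspicious in isolation.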
Self-Check
- Which layer translates guest descriptors into iovs?
- Which layer turns a RAID base write into an NVMe command?
- Which layer maps NVMf namespace ID to storage-node lvol bdev?
- Why is diskengine not in the write hot path?
- Name three places where completion can be delayed after the physical SSD has accepted the command.
References
- Local SPDK: lib/vhost/vhost_blk.c
- Local SPDK: module/bdev/raid/bdev_raid.c
- Local SPDK: module/bdev/nvme/bdev_nvme.c
- Local SPDK: lib/nvmf/ctrlr.c
- Local SPDK: lib/nvmf/ctrlr_bdev.c
- Local SPDK: module/bdev/lvol/vbdev_lvol.c
- Local diskengine: /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/storagenode
- Local diskengine: /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal
- SPDK bdev documentation: https://spdk.io/doc/bdev.html
- SPDK NVMe-oF documentation: https://spdk.io/doc/nvmf.html