SPDK From First Principles

Chapter 31: Complete VM Write To SSD Walkthrough

Source: drafts/transport-diskengine/31-complete-vm-write-to-ssd-walkthrough.md

Chapter Goal

This chapter follows one guest write from a VM down to a physical SSD and then follows completion back up. It ties together vhost-blk, RAID, NVMe-oF initiator, NVMe-oF target, lvol/blobstore, and physical NVMe.

The key idea: diskengine constructs and repairs the graph, but the write itself flows through SPDK data-plane code.

flowchart TB
  guest[VM guest block write] --> qemu[QEMU virtio/vfio device]
  qemu --> export[vhost or vfio-user endpoint]
  export --> raid[RAID bdev on baremetal]
  raid --> nvmeInit[bdev_nvme initiator]
  nvmeInit --> fabric[NVMe-oF fabric]
  fabric --> nvmf[NVMe-oF target on storage node]
  nvmf --> lvol[lvol bdev]
  lvol --> blob[blobstore cluster map]
  blob --> phys[physical NVMe bdev]
  phys --> ssd[NVMe SSD]
  ssd --> phys
  phys --> blob
  blob --> lvol
  lvol --> nvmf
  nvmf --> fabric
  fabric --> nvmeInit
  nvmeInit --> raid
  raid --> export
  export --> qemu
  qemu --> guest

Scenario

Assume a VM writes 4 KiB to a disk backed by a replicated excloud volume:

  • The VM is running on a baremetal node.
  • QEMU uses a vhost-blk socket created by SPDK on that baremetal node.
  • The vhost-blk controller exposes raid_123.
  • raid_123 has base bdevs produced by bdev_nvme connections to storage-node NQNs.
  • Each storage-node NQN exports one lvol as an NVMe-oF namespace.
  • Each lvol is backed by an SPDK lvstore/blobstore on a local physical NVMe SSD.

Stage 0: Control Plane Already Ran

Before the write, these diskengine loops prepared the graph.

Storage node:

  • internal/storagenode/disk_init.go: initialiseDisk
  • internal/storagenode/provisionlvol.go: provisionLvol
  • internal/storagenode/nvmeofexport.go: reconcileExports

Baremetal:

  • internal/baremetal/nvme_attach.go: reconcileNVMeConnections
  • internal/baremetal/raidensure.go: ensureRaid
  • internal/baremetal/attach.go: ensureVhost

This distinction matters. When you debug a single write, first decide whether the graph exists. If any object in it is missing, debug the control plane. If every object exists but I/O still fails, debug the data path.
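The decision above can be written down as a tiny triage helper. This is only a sketch: the object labels and the `observed` set are hypothetical inputs that would, in practice, be filled in from SPDK RPC output on each node; nothing here is diskengine code.

```python
# Toy triage helper: decide which plane to debug first.
# EXPECTED_GRAPH uses descriptive labels, not real object names (assumption).
EXPECTED_GRAPH = [
    "vhost controller",
    "RAID bdev",
    "remote NVMe base bdevs",
    "NVMe-oF subsystem and namespace",
    "lvol bdev",
    "physical NVMe bdev",
]


def triage(observed, io_failing):
    """Return (plane to debug first, list of missing graph objects)."""
    missing = [obj for obj in EXPECTED_GRAPH if obj not in observed]
    if missing:
        # The graph is incomplete, so the write never had a full path.
        return "control plane", missing
    if io_failing:
        # Every object exists, so the failure is in per-I/O processing.
        return "data path", []
    return "healthy", []


print(triage({"vhost controller", "RAID bdev"}, io_failing=True))
```

The point of the ordering is that a missing object makes any data-path investigation premature.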

Stage 1: Guest Submits virtio-blk Request

Inside the VM, the guest kernel submits a block write to its virtio-blk device. The guest fills virtqueue descriptors that point to guest memory containing the request header and payload. QEMU and SPDK have negotiated vhost-user memory mappings, so SPDK can translate those guest physical addresses.

SPDK anchors:

  • lib/vhost/vhost_blk.c: process_blk_task
  • lib/vhost/vhost_blk.c: process_packed_blk_task
  • lib/vhost/vhost_blk.c: blk_iovs_split_queue_setup
  • lib/vhost/vhost_blk.c: blk_iovs_packed_queue_setup
  • lib/vhost/vhost_internal.h: vhost_gpa_to_vva
  • lib/vhost/vhost_internal.h: vhost_vring_desc_to_iov

The output of this stage is an spdk_vhost_blk_task with iovs and request metadata.
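To make the address translation concrete, here is a toy model of turning one descriptor into iovs. It is a sketch under stated assumptions, not the SPDK implementation: `MemRegion`, `gpa_to_vva`, and `desc_to_iovs` are illustrative Python names, and the region table is invented. It shows why one descriptor can become more than one iov when its buffer crosses a memory-region boundary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemRegion:
    guest_phys_addr: int  # start of the region in guest physical space
    size: int             # region length in bytes
    host_virt_addr: int   # where the same bytes are mapped in the backend process


def gpa_to_vva(regions, gpa, length):
    """Translate a guest physical address; return (host address, contiguous bytes)."""
    for r in regions:
        if r.guest_phys_addr <= gpa < r.guest_phys_addr + r.size:
            offset = gpa - r.guest_phys_addr
            return r.host_virt_addr + offset, min(length, r.size - offset)
    raise ValueError(f"GPA {gpa:#x} is not covered by any mapped region")


def desc_to_iovs(regions, gpa, length):
    """Turn one descriptor (gpa, length) into one or more (host address, length) iovs."""
    iovs = []
    while length > 0:
        vva, chunk = gpa_to_vva(regions, gpa, length)
        iovs.append((vva, chunk))
        gpa += chunk
        length -= chunk
    return iovs


# Two hypothetical 4 KiB regions mapped at unrelated host addresses.
regions = [
    MemRegion(0x0000, 0x1000, 0x7F00_0000_0000),
    MemRegion(0x1000, 0x1000, 0x7F00_4000_0000),
]
# A 4 KiB payload starting 2 KiB into the first region spans both regions.
for addr, size in desc_to_iovs(regions, 0x800, 4096):
    print(hex(addr), size)
```

A descriptor that points outside every mapped region raises an error here; in a real backend that case must fail the request rather than touch unmapped memory.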

Stage 2: vhost-blk Submits bdev Write To RAID

The request reaches:

  • lib/vhost/vhost_blk.c: vhost_user_process_blk_request
  • lib/vhost/vhost_blk.c: virtio_blk_process_request

For a write, virtio_blk_process_request submits bdev I/O to the backing bdev, which is raid_123 in this scenario. The completion callback is:

  • lib/vhost/vhost_blk.c: blk_request_complete_cb

Nothing has reached the SSD yet. At this moment SPDK has created an asynchronous bdev write against a RAID bdev.

Stage 3: RAID Maps The Write To Base bdevs

The RAID module receives the bdev write through its module submit path:

  • module/bdev/raid/bdev_raid.c: raid_bdev_submit_request
  • module/bdev/raid/raid1.c

For RAID1, the module submits the same write to each operational mirror base and completes the parent I/O only after those base writes have reported back. Each base bdev is an NVMe bdev created by bdev_nvme_attach_controller.
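The fan-out can be pictured as one parent write waiting on several child writes. The sketch below is a toy model, not the raid1 module: `MirrorWrite` and the base names are hypothetical, and how per-base failures combine into the parent status is deliberately left out because that policy lives in the module and may differ between SPDK versions.

```python
class MirrorWrite:
    """Parent write that completes once every base write has reported back."""

    def __init__(self, base_names, on_complete):
        self.pending = set(base_names)
        self.results = {}
        self.on_complete = on_complete

    def base_done(self, base_name, ok):
        # Called once per base bdev when its child write finishes.
        self.pending.remove(base_name)
        self.results[base_name] = ok
        if not self.pending:
            # Only the last child completion triggers the parent completion.
            self.on_complete(self.results)


completed = []
write = MirrorWrite(["base_a", "base_b"], completed.append)

write.base_done("base_b", True)
print("after first base:", completed)   # parent still pending

write.base_done("base_a", True)
print("after second base:", completed)  # parent completed with both results
```

The practical consequence is that the slowest replica sets the latency of a mirrored write.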

Useful RAID RPC/debug anchors:

  • module/bdev/raid/bdev_raid_rpc.c: rpc_bdev_raid_get_bdevs
  • module/bdev/raid/bdev_raid_rpc.c: rpc_bdev_raid_add_base_bdev
  • module/bdev/raid/bdev_raid_rpc.c: rpc_bdev_raid_remove_base_bdev

If the RAID bdev is rebuilding or degraded, what happens to a write depends on the current state of each base bdev. This is why diskengine's health loop watches RAID processes:

  • internal/baremetal/baremetal_health.go: runHealthIteration

Stage 4: bdev_nvme Converts Base Writes To NVMe Commands

Each RAID base bdev receives a write:

  • module/bdev/nvme/bdev_nvme.c: bdev_nvme_submit_request
  • module/bdev/nvme/bdev_nvme.c: _bdev_nvme_submit_request
  • module/bdev/nvme/bdev_nvme.c: bdev_nvme_writev

The bdev module submits NVMe namespace write commands through the NVMe library:

  • lib/nvme/nvme_ns_cmd.c: spdk_nvme_ns_cmd_writev
  • lib/nvme/nvme_ns_cmd.c: spdk_nvme_ns_cmd_write_ext
  • lib/nvme/nvme_qpair.c
  • lib/nvme/nvme_rdma.c

Because this is a remote storage-node export, the NVMe transport is RDMA in the typical diskengine path. The write is now an NVMe-oF command traveling from baremetal to storage node.
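As a worked example of what "converts to an NVMe command" means for the scenario's 4 KiB write, the sketch below computes the starting LBA and block count. It is illustrative arithmetic only: `to_nvme_write` is a hypothetical helper, the 512-byte block size and the 1 MiB offset are assumptions, and the real bdev layer already works in block units. The one detail taken from the NVMe specification is that the Number of Logical Blocks field is zero-based.

```python
NVME_OPC_WRITE = 0x01  # NVM command set Write opcode


def to_nvme_write(offset_bytes, length_bytes, block_size=512):
    """Return the addressing fields of an NVMe Write for a byte range."""
    if length_bytes <= 0:
        raise ValueError("length must be positive")
    if offset_bytes % block_size or length_bytes % block_size:
        raise ValueError("offset and length must be block aligned")
    slba = offset_bytes // block_size
    block_count = length_bytes // block_size
    # NLB is encoded zero-based: a value of 0 means one block.
    return {"opcode": NVME_OPC_WRITE, "slba": slba, "nlb": block_count - 1}


# 4 KiB write at byte offset 1 MiB on a namespace with 512-byte blocks.
print(to_nvme_write(1 << 20, 4096))
# Same write on a namespace formatted with 4096-byte blocks.
print(to_nvme_write(1 << 20, 4096, block_size=4096))
```

With 512-byte blocks the write covers eight blocks (NLB 7); with 4096-byte blocks it covers one block (NLB 0).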

Stage 5: Storage Node NVMf Target Receives The Write

On the storage node, the RDMA transport receives the request and calls the common NVMf execution path:

  • lib/nvmf/rdma.c: spdk_nvmf_request_exec call sites
  • lib/nvmf/ctrlr.c: spdk_nvmf_request_exec
  • lib/nvmf/ctrlr.c: nvmf_ctrlr_process_io_cmd
  • lib/nvmf/ctrlr.c: spdk_nvmf_request_get_bdev
  • lib/nvmf/ctrlr_bdev.c: nvmf_bdev_ctrlr_write_cmd

spdk_nvmf_request_get_bdev resolves the namespace ID to the lvol bdev attached to that subsystem. This is where the NQN/namespace created by storage-node control plane becomes a real bdev operation.
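Conceptually, the resolution step is a lookup keyed by subsystem and namespace ID. The following is a minimal sketch with invented data: the NQN, the UUID-like bdev name, and `resolve_bdev` are all hypothetical, and the real target keeps this state in subsystem and namespace structures rather than dictionaries.

```python
# subsystem NQN -> {namespace ID -> bdev name}; contents are made up for illustration.
SUBSYSTEMS = {
    "nqn.2024-01.example:vol123-replica0": {
        1: "3f1c2a9e-0000-4000-8000-000000000001",  # lvol bdev named by UUID
    },
}


def resolve_bdev(nqn, nsid):
    """Return the bdev name behind (nqn, nsid), or None if nothing is attached."""
    namespaces = SUBSYSTEMS.get(nqn, {})
    return namespaces.get(nsid)


print(resolve_bdev("nqn.2024-01.example:vol123-replica0", 1))
print(resolve_bdev("nqn.2024-01.example:vol123-replica0", 2))  # no such namespace
```

A lookup that returns nothing corresponds to a command the target must reject before any bdev I/O is created.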

Stage 6: lvol/blobstore Writes To Physical NVMe bdev

The lvol bdev maps guest-visible logical blocks to blobstore clusters. Source anchors for deeper reading:

  • module/bdev/lvol/vbdev_lvol.c
  • module/bdev/lvol/vbdev_lvol_rpc.c
  • lib/blob/blobstore.c
  • module/blob/bdev/blob_bdev.c
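Before reading those files, it can help to hold a simplified picture of the mapping in mind. The sketch below is a toy model, not blobstore's on-disk format: the 4 MiB cluster size, the 512-byte block size, the `cluster_map` contents, and `map_block` are all assumptions, and real blobstore also accounts for metadata regions that are ignored here.

```python
CLUSTER_SIZE = 4 * 1024 * 1024  # assumed cluster size in bytes
BLOCK_SIZE = 512                # assumed logical block size in bytes
BLOCKS_PER_CLUSTER = CLUSTER_SIZE // BLOCK_SIZE


def map_block(cluster_map, lba):
    """Map an lvol LBA to an LBA on the base bdev, or None if unallocated."""
    cluster_index, block_in_cluster = divmod(lba, BLOCKS_PER_CLUSTER)
    physical_cluster = cluster_map[cluster_index]
    if physical_cluster is None:
        # Thin-provisioned hole: no physical cluster has been assigned yet.
        return None
    return physical_cluster * BLOCKS_PER_CLUSTER + block_in_cluster


# Logical cluster 0 lives in physical cluster 7, cluster 1 is unallocated,
# and cluster 2 lives in physical cluster 3.
cluster_map = [7, None, 3]

print(map_block(cluster_map, 10))
print(map_block(cluster_map, BLOCKS_PER_CLUSTER))
print(map_block(cluster_map, 2 * BLOCKS_PER_CLUSTER + 5))
```

The model makes one property visible: logically adjacent blocks in different clusters can land far apart on the physical device.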

The lvol ultimately submits I/O to its base bdev, which is the physical NVMe namespace attached on the storage node:

  • module/bdev/nvme/bdev_nvme.c: bdev_nvme_submit_request
  • module/bdev/nvme/bdev_nvme.c: bdev_nvme_writev
  • lib/nvme/nvme_pcie.c
  • lib/nvme/nvme_qpair.c

At the physical device boundary, the write is an NVMe command on a PCIe qpair. The SSD controller writes data into NAND through its internal flash translation layer. SPDK does not manage NAND pages directly.

Stage 7: Completion Returns Upward

Completion reverses the path:

  1. Physical SSD completes NVMe command.
  2. Storage-node NVMe driver polls completion.
  3. lvol/blobstore completes its bdev I/O.
  4. NVMf target completes request with spdk_nvmf_request_complete.
  5. RDMA response reaches baremetal initiator.
  6. Baremetal bdev_nvme_writev_done completes base bdev write.
  7. RAID completes when required base writes finish.
  8. vhost blk_request_complete_cb sets virtio status.
  9. vhost_user_blk_request_finish updates the used ring.
  10. Guest sees the virtio-blk completion.

SPDK completion anchors:

  • module/bdev/nvme/bdev_nvme.c: bdev_nvme_writev_done
  • lib/nvmf/ctrlr.c: spdk_nvmf_request_complete
  • lib/nvmf/transport.c: nvmf_transport_req_complete
  • lib/vhost/vhost_blk.c: blk_request_complete_cb
  • lib/vhost/vhost_blk.c: vhost_user_blk_request_finish
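The symmetry between submission and completion can be shown with nested callbacks. This is a deliberately simplified sketch: `run_write` and the layer labels are hypothetical, and the toy completes synchronously, whereas the real stack completes asynchronously when pollers reap completions.

```python
LAYERS = [
    "vhost-blk",
    "raid",
    "bdev_nvme initiator",
    "nvmf target",
    "lvol",
    "physical nvme",
]


def run_write(layers):
    """Submit down through each layer, then unwind completions upward."""
    trace = []

    def submit(index, on_done):
        name = layers[index]
        trace.append(f"submit {name}")

        def complete():
            trace.append(f"complete {name}")
            on_done()  # hand completion to the layer above

        if index + 1 < len(layers):
            submit(index + 1, complete)
        else:
            complete()  # bottom layer: the device finishes the command

    submit(0, lambda: trace.append("guest sees completion"))
    return trace


for line in run_write(LAYERS):
    print(line)
```

Each layer's completion callback is the only link back to the layer above, which is why a lost or delayed callback at any level stalls everything higher up.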

Prose Diagram: Complete Write Path

Draw two machines: baremetal node on the left, storage node on the right.

Baremetal stack, top to bottom:

VM guest filesystem -> guest virtio-blk driver -> QEMU -> SPDK vhost-blk -> raid_123 -> bdev_nvme remote base bdevs -> RDMA NIC.

Storage-node stack:

RDMA NIC -> SPDK NVMf target -> subsystem namespace -> lvol bdev -> blobstore/lvstore -> physical NVMe bdev -> PCIe SSD.

Draw the write arrow left-to-right across RDMA between the bdev_nvme layer and NVMf target. Draw completion right-to-left all the way back to the guest. Use dashed boxes around diskengine loops above both machines to show they create the graph but are not on the per-I/O arrow.

Debugging By Layer

Guest layer:

  • Is the disk visible?
  • Are writes hanging or failing with I/O errors?

vhost layer:

  • vhost_get_controllers
  • thread_get_pollers
  • Source: lib/vhost/vhost_blk.c

RAID layer:

  • bdev_raid_get_bdevs
  • Check base bdev status, any rebuild process, and the RAID state (online/configuring/offline).
  • Source: module/bdev/raid/bdev_raid_rpc.c.

Baremetal NVMe initiator:

  • bdev_nvme_get_controllers
  • bdev_nvme_get_io_paths
  • Source: module/bdev/nvme/bdev_nvme_rpc.c.

Storage-node NVMf target:

  • nvmf_get_subsystems
  • nvmf_get_transports
  • Source: lib/nvmf/nvmf_rpc.c.

Storage-node bdev/lvol:

  • bdev_get_bdevs
  • bdev_lvol_get_lvstores
  • Source: module/bdev/lvol/vbdev_lvol_rpc.c.

Physical NVMe:

  • bdev_nvme_get_controller_health_info
  • Check SMART data, temperature, and media errors.
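The RPC names above are JSON-RPC 2.0 methods served over SPDK's RPC socket. The sketch below only builds the request bodies and does not connect to anything. The per-layer grouping mirrors the lists above; `build_request` is a hypothetical helper, and passing `{"category": "all"}` to bdev_raid_get_bdevs reflects that RPC's category parameter. Verify parameters against the SPDK version you run.

```python
import json

# Method names come from the layer lists above; the grouping keys are labels only.
LAYER_RPCS = {
    "vhost": [("vhost_get_controllers", None), ("thread_get_pollers", None)],
    "raid": [("bdev_raid_get_bdevs", {"category": "all"})],
    "nvme initiator": [("bdev_nvme_get_controllers", None), ("bdev_nvme_get_io_paths", None)],
    "nvmf target": [("nvmf_get_subsystems", None), ("nvmf_get_transports", None)],
    "lvol": [("bdev_get_bdevs", None), ("bdev_lvol_get_lvstores", None)],
}


def build_request(method, params=None, request_id=1):
    """Serialize one JSON-RPC 2.0 request; params are omitted when empty."""
    request = {"jsonrpc": "2.0", "method": method, "id": request_id}
    if params:
        request["params"] = params
    return json.dumps(request)


for layer, calls in LAYER_RPCS.items():
    for method, params in calls:
        print(layer, "->", build_request(method, params))
```

In practice the same calls are usually issued through SPDK's scripts/rpc.py on the node that owns the layer, since baremetal and storage-node objects live in different SPDK processes.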

Edge Cases And Misleading Symptoms

Guest hang can be vhost completion, not SSD failure:

If the bdev I/O completes but the used ring is not updated or the interrupt is missed, the guest keeps waiting even though the lower storage layers are fine.

RAID online can hide one bad replica:

A degraded RAID can still serve I/O. Check rebuild and base status, not just presence.

An enabled NVMe controller can still have path issues:

Use the I/O path RPCs and stats, not just the controller list.

Storage-node lvol exists but is not exported:

Baremetal attach fails at NVMf connect/discovery even though storage capacity exists.

Physical SSD healthy does not prove export healthy:

Network, target subsystem, namespace mapping, and bdev graph can fail above the SSD.

Control-plane race can look like data-plane failure:

If the VM starts before vhost or RAID is ready, the symptom may be a missing or failed disk. Check diskengine state transitions.

Misconceptions To Kill

"The write goes through diskengine Go code."

No. diskengine created the objects. The write flows through SPDK.

"There is one queue."

No. There are guest virtqueues, RAID bdev queues, NVMe initiator qpairs, NVMf target qpairs, lvol/blobstore work, and physical NVMe queues.

"Completion means durable on NAND."

Completion means the storage stack and device reported command completion according to their semantics. Durability depends on flush/FUA, volatile caches, SSD power-loss protection, and the protocol command used.
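The gap between "completed" and "durable" is easy to see in a toy device model. This sketch assumes a drive with a volatile write cache and no power-loss protection; `ToySsd` and its methods are hypothetical and stand in for device behavior, not for any SPDK API.

```python
class ToySsd:
    """Device with a volatile write cache in front of persistent media."""

    def __init__(self):
        self.cache = {}  # lost on power failure (assumption: no power-loss protection)
        self.media = {}  # survives power failure

    def write(self, lba, data, fua=False):
        if fua:
            # Force Unit Access: data must be on media before completion.
            self.media[lba] = data
            self.cache.pop(lba, None)
        else:
            self.cache[lba] = data
        return "completed"

    def flush(self):
        # Flush: everything previously completed becomes durable.
        self.media.update(self.cache)
        self.cache.clear()
        return "completed"

    def power_loss(self):
        self.cache.clear()

    def read(self, lba):
        return self.cache.get(lba, self.media.get(lba))


ssd = ToySsd()
ssd.write(0, b"plain")            # completed, but only in the volatile cache
ssd.write(1, b"forced", fua=True)  # completed and on media
ssd.power_loss()
print(ssd.read(0), ssd.read(1))
```

After the simulated power loss only the FUA write survives, even though both writes reported completion.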

"A single RPC can diagnose the whole path."

No. Use layer-specific checks.

Lab: Build A Trace Checklist

For volume 123, write a checklist with the expected object names:

  1. vhost controller: vhost<volume_vm_mapping_id>
  2. RAID bdev: raid_123
  3. base bdevs: derived from NQNs using baseBdevNameFromNQN
  4. storage-node NQNs: from volume_lvol_mapping
  5. namespace bdev names: lvol UUIDs
  6. physical controller names: storage-node NvmeDisk<disk_id>

Then map each object to one SPDK RPC that can prove it exists.
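One possible way to start the lab is to generate the checklist from the identifiers you already know. Everything in this sketch is illustrative: `build_checklist`, the shape of the `replicas` input, and the sample IDs are hypothetical, the base bdev name is left as a placeholder because the output of baseBdevNameFromNQN is not shown in this chapter, and the RPC chosen for each row is one reasonable option drawn from the debugging section, not the only one.

```python
def build_checklist(volume_id, vm_mapping_id, replicas):
    """Return (object kind, expected name, RPC that can show it) rows.

    replicas: list of dicts with 'nqn', 'lvol_uuid', 'disk_id' (hypothetical shape).
    """
    rows = [
        ("vhost controller", f"vhost{vm_mapping_id}", "vhost_get_controllers"),
        ("RAID bdev", f"raid_{volume_id}", "bdev_raid_get_bdevs"),
    ]
    for r in replicas:
        # Placeholder: the real name comes from diskengine's baseBdevNameFromNQN.
        rows.append(("base bdev", f"<baseBdevNameFromNQN({r['nqn']})>", "bdev_get_bdevs"))
        rows.append(("storage-node NQN", r["nqn"], "nvmf_get_subsystems"))
        rows.append(("namespace bdev", r["lvol_uuid"], "bdev_get_bdevs"))
        rows.append(("physical controller", f"NvmeDisk{r['disk_id']}", "bdev_nvme_get_controllers"))
    return rows


# Sample inputs are invented for illustration.
replicas = [
    {"nqn": "nqn.2024-01.example:vol123-r0", "lvol_uuid": "uuid-r0", "disk_id": 7},
    {"nqn": "nqn.2024-01.example:vol123-r1", "lvol_uuid": "uuid-r1", "disk_id": 9},
]
for kind, name, rpc in build_checklist(123, 42, replicas):
    print(f"{kind:20} {name:50} {rpc}")
```

Remember that the first two rows and the base bdev rows are checked on the baremetal node, while the remaining rows are checked on each storage node.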

Source Reading Exercise

Read in this order, stopping at the first async submission/completion pair in each layer:

  1. lib/vhost/vhost_blk.c: virtio_blk_process_request
  2. module/bdev/raid/bdev_raid.c: raid_bdev_submit_request
  3. module/bdev/nvme/bdev_nvme.c: bdev_nvme_writev
  4. lib/nvmf/ctrlr_bdev.c: nvmf_bdev_ctrlr_write_cmd
  5. module/bdev/lvol/vbdev_lvol.c

For each layer, identify:

  • input object,
  • output object,
  • async completion callback.

Operational Debug Exercise

Symptom: write latency spikes every few minutes.

Investigate:

  1. Guest/vhost queue depth and session state.
  2. RAID rebuild activity from bdev_raid_get_bdevs.
  3. NVMe-oF path reconnects or disabled paths.
  4. Storage-node disk health and temperature.
  5. lvol/blobstore free space and snapshots/clones.
  6. CPU reactor saturation and poller stats.
  7. Network RDMA counters outside SPDK if available.

Tie each observation to a layer in the diagram. Avoid jumping straight from "VM slow" to "SSD bad."
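One way to tie observations to layers is to line up spike times against timestamped events from each layer. The sketch below is a minimal illustration with invented data: `correlate`, the five-second window, and the sample events are assumptions, and real timestamps would come from whatever logs and RPC polling you have in place.

```python
def correlate(spike_times, events, window_s=5.0):
    """For each spike, list (layer, what) events within window_s seconds of it."""
    report = {}
    for spike in spike_times:
        report[spike] = [
            (layer, what)
            for when, layer, what in events
            if abs(when - spike) <= window_s
        ]
    return report


# (time in seconds, layer, observation); all values are made up.
events = [
    (118.0, "RAID", "rebuild process active"),
    (300.5, "NVMe-oF initiator", "path reconnect"),
    (900.0, "reactor", "poller busy time high"),
]
spikes = [120.0, 301.0, 600.0]

for spike, matches in correlate(spikes, events).items():
    print(spike, matches or "no layer event nearby")
```

A spike with no nearby event, like the third one here, is a sign that a layer is not yet being observed, not that the layer is healthy.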

Self-Check

  1. Which layer translates guest descriptors into iovs?
  2. Which layer turns a RAID base write into an NVMe command?
  3. Which layer maps NVMf namespace ID to storage-node lvol bdev?
  4. Why is diskengine not in the write hot path?
  5. Name three places where completion can be delayed after the physical SSD has accepted the command.

References

  • Local SPDK: lib/vhost/vhost_blk.c
  • Local SPDK: module/bdev/raid/bdev_raid.c
  • Local SPDK: module/bdev/nvme/bdev_nvme.c
  • Local SPDK: lib/nvmf/ctrlr.c
  • Local SPDK: lib/nvmf/ctrlr_bdev.c
  • Local SPDK: module/bdev/lvol/vbdev_lvol.c
  • Local diskengine: /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/storagenode
  • Local diskengine: /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal
  • SPDK bdev documentation: https://spdk.io/doc/bdev.html
  • SPDK NVMe-oF documentation: https://spdk.io/doc/nvmf.html