SPDK From First Principles

SPDK deep learning path

Chapter 24: vhost-blk And QEMU Exposure

Source: drafts/transport-diskengine/24-vhost-blk-qemu.md

Chapter Goal

This chapter explains how SPDK exposes a bdev to a VM through vhost-blk. By the end, the reader should know what the vhost-user socket represents, how QEMU virtio-blk requests are translated into SPDK bdev I/O, why teardown ordering matters, and how diskengine uses vhost controllers in baremetal mode.

Beginner Mental Model

Virtio is the guest-visible device model. vhost is the backend acceleration model: it moves virtqueue processing out of QEMU's emulation loop. vhost-user places that backend in a separate userspace process and connects QEMU to it over a Unix domain socket.

In this setup:

  • The guest sees a virtio-blk disk.
  • QEMU owns guest emulation and passes vring information to SPDK.
  • SPDK owns the vhost-blk backend.
  • SPDK reads descriptors from guest memory, translates them into bdev I/O, and writes used-ring completions.

The guest does not know about SPDK bdev names, RAID bdevs, NVMe-oF, lvols, or storage nodes. It sees a block disk. That is the entire point.

Why This Matters For diskengine/excloud

In baremetal mode, diskengine builds a local bdev graph first:

remote lvol namespaces -> bdev_nvme bdevs -> RAID bdev -> optional QoS -> vhost-blk controller.

The VM attachment edge is:

  • /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/attach.go: startAttachLoop
  • /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/attach.go: attach
  • /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/attach.go: ensureVhost
  • /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/spdkclient/wrappers.go: VhostCreateBlkController
  • /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/spdkclient/wrappers.go: VhostGetControllers
  • /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/spdkclient/wrappers.go: VhostDeleteController

The teardown edge is:

  • /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/vhost_detach.go: startVhostDetachLoop
  • /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/vhost_detach.go: detachVhost
  • /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/raid_detach.go: finalizeVolumeDetach

The vhost controller name is part of the contract between diskengine and VM launch orchestration. If QEMU points at the wrong socket or at a stale one, the guest disk will not appear even if the RAID bdev is healthy.

RPC And Controller Creation

The SPDK RPC handlers and the parameter decoder table are:

  • lib/vhost/vhost_rpc.c: rpc_vhost_create_blk_controller
  • lib/vhost/vhost_rpc.c: rpc_vhost_create_blk_controller_decoders
  • lib/vhost/vhost_rpc.c: rpc_vhost_delete_controller
  • lib/vhost/vhost_rpc.c: rpc_vhost_get_controllers

The block controller constructor is:

  • lib/vhost/vhost_blk.c: spdk_vhost_blk_construct

That constructor opens the backing bdev and creates the vhost-user device. The functions and structures involved are:

  • lib/vhost/rte_vhost_user.c: vhost_user_dev_create
  • lib/vhost/rte_vhost_user.c: vhost_user_dev_start
  • lib/vhost/vhost_internal.h: struct spdk_vhost_dev
  • lib/vhost/vhost_internal.h: struct spdk_vhost_session

The controller's configuration is written out for replay by:

  • lib/vhost/vhost_blk.c: vhost_blk_write_config_json
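The RPC handlers above are reached through JSON-RPC 2.0 requests. As a minimal sketch, the request bodies could be built like this. The parameter names (ctrlr, dev_name, cpumask, readonly) follow the SPDK JSON-RPC documentation as understood here and should be checked against the SPDK version in use; the names vhost42 and raid_vol42 are hypothetical.

```python
import json


def rpc_request(method, params=None, request_id=1):
    """Wrap a method name and optional params in a JSON-RPC 2.0 envelope."""
    req = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        req["params"] = params
    return req


def create_blk_controller(ctrlr, dev_name, cpumask=None, readonly=False):
    """Request body for vhost_create_blk_controller.

    ctrlr names the controller (and its socket); dev_name is the backing bdev.
    """
    params = {"ctrlr": ctrlr, "dev_name": dev_name}
    if cpumask is not None:
        params["cpumask"] = cpumask
    if readonly:
        params["readonly"] = True
    return rpc_request("vhost_create_blk_controller", params)


def delete_controller(ctrlr):
    """Request body for vhost_delete_controller."""
    return rpc_request("vhost_delete_controller", {"ctrlr": ctrlr})


def get_controllers():
    """Request body for vhost_get_controllers (lists every controller)."""
    return rpc_request("vhost_get_controllers")


# Hypothetical names, for illustration only.
print(json.dumps(create_blk_controller("vhost42", "raid_vol42"), indent=2))
```

The sketch only builds payloads. Sending them to the SPDK RPC socket, and handling error responses, is left to the caller.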

Request Path

The core virtio-blk request handling lives in:

  • lib/vhost/vhost_blk.c: process_blk_task
  • lib/vhost/vhost_blk.c: process_packed_blk_task
  • lib/vhost/vhost_blk.c: blk_iovs_split_queue_setup
  • lib/vhost/vhost_blk.c: blk_iovs_packed_queue_setup
  • lib/vhost/vhost_blk.c: vhost_user_process_blk_request
  • lib/vhost/vhost_blk.c: virtio_blk_process_request
  • lib/vhost/vhost_blk.c: blk_request_complete_cb
  • lib/vhost/vhost_blk.c: vhost_user_blk_request_finish

Descriptor translation helpers live in:

  • lib/vhost/vhost_internal.h: vhost_vq_get_desc
  • lib/vhost/vhost_internal.h: vhost_vq_get_desc_packed
  • lib/vhost/vhost_internal.h: vhost_vring_desc_to_iov
  • lib/vhost/vhost_internal.h: vhost_vring_packed_desc_to_iov
  • lib/vhost/vhost_internal.h: vhost_gpa_to_vva

Once virtio_blk_process_request understands the request type, it submits bdev I/O. Reads, writes, flushes, unmaps, and write-zeroes become bdev calls. Completion returns through blk_request_complete_cb, which sets virtio status and enqueues the used-ring completion.
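To make the "understands the request type" step concrete, here is a small sketch that decodes the 16-byte virtio-blk request header and picks the matching bdev operation. The constants and header layout come from the virtio specification. The function plan_bdev_op and the operation names are illustrative and are not SPDK's code.

```python
import struct

# Request types from the virtio-blk specification.
VIRTIO_BLK_T_IN = 0
VIRTIO_BLK_T_OUT = 1
VIRTIO_BLK_T_FLUSH = 4
VIRTIO_BLK_T_DISCARD = 11
VIRTIO_BLK_T_WRITE_ZEROES = 13

# Status byte values written back to the guest.
VIRTIO_BLK_S_OK = 0
VIRTIO_BLK_S_IOERR = 1
VIRTIO_BLK_S_UNSUPP = 2

SECTOR_SIZE = 512  # virtio-blk sectors are always 512 bytes

BDEV_OP = {
    VIRTIO_BLK_T_IN: "read",
    VIRTIO_BLK_T_OUT: "write",
    VIRTIO_BLK_T_FLUSH: "flush",
    VIRTIO_BLK_T_DISCARD: "unmap",
    VIRTIO_BLK_T_WRITE_ZEROES: "write_zeroes",
}


def plan_bdev_op(header):
    """Decode the request header: le32 type, le32 reserved, le64 sector.

    Returns (op, byte_offset, status). status is None when the request
    should be submitted, or a virtio status when it is rejected up front.
    """
    if len(header) < 16:
        raise ValueError("virtio-blk request header is 16 bytes")
    req_type, _reserved, sector = struct.unpack("<IIQ", header[:16])
    op = BDEV_OP.get(req_type)
    if op is None:
        return None, None, VIRTIO_BLK_S_UNSUPP
    return op, sector * SECTOR_SIZE, None
```

For example, a header with type 1 and sector 8 maps to a write at byte offset 4096.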

Prose Diagram: Guest Write Through vhost-blk

Draw a four-lane sequence:

  1. Guest kernel virtio-blk driver.
  2. QEMU vhost-user front side.
  3. SPDK vhost-blk backend.
  4. SPDK bdev graph.

The write path:

  1. Guest fills virtqueue descriptors.
  2. QEMU/vhost-user shares the vring and memory mapping.
  3. SPDK poller finds an available descriptor.
  4. process_blk_task builds iovs.
  5. virtio_blk_process_request submits a bdev write.
  6. The bdev graph completes the I/O.
  7. blk_request_complete_cb records status.
  8. vhost_user_blk_request_finish updates the used ring.
  9. Guest gets an interrupt or polls for the completion.

The diagram should show guest memory as a shared memory region next to lanes 2 and 3, because descriptor iovs refer to guest memory.
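The link between descriptors and shared guest memory can be sketched as a walk over a split-ring descriptor chain that translates each guest physical address into a backend address. The flag values come from the virtio specification. Desc, MemRegion, and both functions are hypothetical simplifications: among other things, the sketch requires each descriptor to fit inside one memory region.

```python
from dataclasses import dataclass

VRING_DESC_F_NEXT = 1   # chain continues at desc.next
VRING_DESC_F_WRITE = 2  # buffer is device-writable (guest-readable result)


@dataclass
class Desc:
    addr: int   # guest physical address of the buffer
    len: int
    flags: int
    next: int


@dataclass
class MemRegion:
    gpa_start: int  # guest physical start of the region
    size: int
    host_base: int  # where the backend mapped the region


def gpa_to_host(regions, gpa, length):
    """Translate a guest physical range, or return None if unmapped."""
    for r in regions:
        if r.gpa_start <= gpa and gpa + length <= r.gpa_start + r.size:
            return r.host_base + (gpa - r.gpa_start)
    return None


def chain_to_iovs(table, head, regions):
    """Walk one chain and return (host_addr, len, device_writable) tuples."""
    iovs = []
    idx = head
    while True:
        # A chain longer than the table must contain a loop.
        if idx >= len(table) or len(iovs) >= len(table):
            raise ValueError("invalid descriptor chain")
        d = table[idx]
        host = gpa_to_host(regions, d.addr, d.len)
        if host is None:
            raise ValueError("descriptor points outside guest memory")
        iovs.append((host, d.len, bool(d.flags & VRING_DESC_F_WRITE)))
        if not d.flags & VRING_DESC_F_NEXT:
            return iovs
        idx = d.next
```

A typical write arrives as three descriptors: a device-readable header, a device-readable data buffer, and a one-byte device-writable status.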

Teardown Ordering

vhost teardown is a common source of data-plane bugs. A controller with an active VM session is not just a config object: the guest may still have outstanding writes. Deleting the RAID bdev under an active vhost controller can make I/O fail or hang.

diskengine uses several gates:

  • vhost_detach.go: detachVhost deletes mapping-level vhost controllers and then cleans markers.
  • raid_detach.go: finalizeVolumeDetach checks that no vhost controllers remain before deleting RAID and detaching NVMe.
  • attach.go: ensureVhost creates vhost only after RAID exists and QoS can be applied.
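The "no vhost controllers remain" gate can be sketched as a pure check over the result of vhost_get_controllers. This is an illustration, not diskengine's implementation. The backend_specific.block.bdev field layout is assumed from the SPDK JSON-RPC documentation example, and the function names are hypothetical.

```python
def controllers_using_bdev(controllers, bdev_name):
    """Return names of vhost-blk controllers whose backing bdev matches."""
    names = []
    for ctrlr in controllers:
        block = ctrlr.get("backend_specific", {}).get("block", {})
        if block.get("bdev") == bdev_name:
            names.append(ctrlr.get("ctrlr"))
    return names


def can_delete_raid(controllers, raid_bdev):
    """Gate RAID deletion on the absence of any vhost exposure.

    Returns (ok, blockers) so the caller can log what is still attached.
    """
    blockers = controllers_using_bdev(controllers, raid_bdev)
    return len(blockers) == 0, blockers
```

Returning the blocking controller names, rather than a bare boolean, makes it easier to log why a detach did not proceed.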

The SPDK-side session structure and device lifecycle functions are in:

  • lib/vhost/vhost_internal.h: struct spdk_vhost_session
  • lib/vhost/rte_vhost_user.c: vhost_user_dev_create
  • lib/vhost/rte_vhost_user.c: vhost_user_dev_start

Edge Cases And Failure Modes

Socket exists but controller is stale:

A Unix socket path can remain from an older run, or a controller can exist without a VM currently being attached. Use vhost_get_controllers, not filesystem checks alone.
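A minimal sketch of combining the two signals follows, with the RPC listing as the authority and the filesystem check as a secondary hint. The function names and result strings are hypothetical. controller_names is assumed to be the list of ctrlr values already extracted from vhost_get_controllers.

```python
import os
import stat


def is_unix_socket(path):
    """True only if the path exists and is a socket inode."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except (FileNotFoundError, NotADirectoryError):
        return False


def classify_endpoint(socket_path, controller_names, expected_ctrlr):
    """Combine the RPC view (authoritative) with the filesystem view."""
    on_disk = is_unix_socket(socket_path)
    registered = expected_ctrlr in controller_names
    if registered and on_disk:
        return "controller registered"
    if registered:
        return "controller registered, socket file missing"
    if on_disk:
        return "stale socket file"
    return "absent"
```

Even the "controller registered" result only says that the endpoint exists. It does not show that a guest is connected to it.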

Guest still connected:

Deleting the controller may fail or be unsafe if a vhost session is active. Inspect vhost_get_controllers session data before deleting lower bdevs.

Descriptor is invalid:

Bad or unexpected virtqueue descriptors fail at descriptor-to-iov setup. Source anchors: blk_iovs_split_queue_setup, blk_iovs_packed_queue_setup, vhost_vring_desc_to_iov.

Read-only device:

virtio_blk_process_request can reject writes if the controller or backing bdev is read-only.

NOMEM:

vhost-blk can queue an I/O wait when bdev submission fails with -ENOMEM, then resubmit the request once resources free up. Source anchors: lib/vhost/vhost_blk.c: blk_request_queue_io, lib/vhost/vhost_blk.c: blk_request_resubmit.
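The queue-and-resubmit pattern can be illustrated with a toy model: a backend with a fixed number of in-flight slots, and a submitter that parks a request when submission fails and retries it when a completion frees a slot. TinyBdev and Submitter are hypothetical and only mirror the shape of the behavior, not SPDK's API.

```python
from collections import deque


class TinyBdev:
    """Toy backend with a fixed number of in-flight slots."""

    def __init__(self, slots):
        self.free = slots

    def submit(self, req):
        if self.free == 0:
            return False  # stands in for a -ENOMEM return
        self.free -= 1
        return True

    def complete(self):
        self.free += 1


class Submitter:
    def __init__(self, bdev):
        self.bdev = bdev
        self.inflight = deque()
        self.waiting = deque()  # requests parked after a failed submit

    def submit(self, req):
        if self.bdev.submit(req):
            self.inflight.append(req)
        else:
            self.waiting.append(req)

    def on_completion(self):
        """Finish the oldest in-flight request, then retry parked ones."""
        done = self.inflight.popleft()
        self.bdev.complete()
        while self.waiting and self.bdev.free > 0:
            req = self.waiting.popleft()
            self.bdev.submit(req)
            self.inflight.append(req)
        return done
```

The request is never failed back to the guest in this model. It simply completes later, which is why a NOMEM episode shows up as latency rather than as an I/O error.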

Packed versus split queues:

Modern virtio can use packed queues. Debugging only split-ring code can miss the active path. Read both process_blk_task and process_packed_blk_task.

Misconceptions To Kill

"vhost-blk is an NVMe device."

No. The guest sees virtio-blk. The backing bdev may eventually hit NVMe, RAID, or lvol, but that is hidden.

"The vhost socket contains the data."

No. The socket carries control messages and file descriptors. Bulk data is in shared guest memory referenced by descriptors.
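The split between control channel and data can be demonstrated with plain POSIX mechanisms: a Unix socket passes a file descriptor via SCM_RIGHTS, and both sides then mmap the same memory. This sketch is not the vhost-user wire protocol. The "mem-table" label is a stand-in for a real binary message, and a temporary file stands in for guest RAM.

```python
import array
import mmap
import os
import socket
import tempfile


def send_fd(sock, label, fd):
    """Send a small control message with one fd attached as ancillary data."""
    sock.sendmsg([label], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                            array.array("i", [fd]))])


def recv_fd(sock, msglen=64):
    """Receive the control message and the fd that came with it."""
    fds = array.array("i")
    msg, ancdata, _flags, _addr = sock.recvmsg(
        msglen, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds.frombytes(data[:len(data) - (len(data) % fds.itemsize)])
    return msg, fds[0]


def demo():
    size = mmap.PAGESIZE
    frontend, backend = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with tempfile.TemporaryFile() as backing:
        backing.truncate(size)
        guest_mem = mmap.mmap(backing.fileno(), size)
        guest_mem[0:5] = b"hello"   # "guest" places data in its memory
        guest_mem[5:6] = b"\xff"    # status byte not yet written
        send_fd(frontend, b"mem-table", backing.fileno())

        label, fd = recv_fd(backend)
        backend_mem = mmap.mmap(fd, size)
        payload = bytes(backend_mem[0:5])  # read without using the socket
        backend_mem[5:6] = b"\x00"         # "backend" writes status in place
        status_seen_by_guest = guest_mem[5]

        backend_mem.close()
        guest_mem.close()
        os.close(fd)
    frontend.close()
    backend.close()
    return label, payload, status_seen_by_guest
```

Only the short label and the descriptor cross the socket. The payload and the status byte move through the shared mapping.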

"Deleting a vhost controller deletes the volume."

No. It removes a VM exposure endpoint. The underlying RAID, NVMe bdevs, and storage-node lvols are separate objects.

"QEMU can use any bdev name directly."

No. QEMU connects to a vhost-user socket and never sees a bdev name. diskengine and SPDK map that socket/controller to a bdev.

Lab: Trace One Write Request

Open lib/vhost/vhost_blk.c and follow one request:

  1. process_blk_task
  2. blk_iovs_split_queue_setup
  3. vhost_user_process_blk_request
  4. virtio_blk_process_request
  5. the write case inside that function
  6. blk_request_complete_cb
  7. vhost_user_blk_request_finish

For each step, write whether it is parsing guest descriptors, submitting SPDK I/O, or completing guest-visible status.

Operational Debug Exercise

Symptom: VM boots but disk is missing.

Check:

  1. Does diskengine think the VM mapping is ATTACHING or ATTACHED?
  2. Does vhost_get_controllers show the expected vhost<volume_vm_mapping_id>?
  3. Does that controller point at the expected RAID bdev?
  4. Does QEMU reference the same socket path/name?
  5. Does bdev_raid_get_bdevs show the RAID online?
  6. Are remote NVMe bdevs enabled underneath the RAID?

Do not start by debugging the SSD. A missing guest disk is often a vhost or QEMU socket wiring problem.
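The first five checks can be encoded as an ordered function that reports the first failure. This is a sketch under assumptions: the inputs are already collected by the operator, the backend_specific.block.bdev layout is assumed from the SPDK JSON-RPC documentation example, the "online" state string is an assumption about bdev_raid_get_bdevs output, and every function and parameter name is hypothetical. Check 6 (the remote NVMe bdevs) is left as a manual step.

```python
def first_failed_check(mapping_state, controllers, expected_ctrlr,
                       expected_bdev, qemu_socket, expected_socket,
                       raid_state):
    """Return a description of the first failing check, or None."""
    if mapping_state not in ("ATTACHING", "ATTACHED"):
        return "mapping is not ATTACHING or ATTACHED"

    ctrlr = next((c for c in controllers
                  if c.get("ctrlr") == expected_ctrlr), None)
    if ctrlr is None:
        return "expected vhost controller is missing"

    bdev = ctrlr.get("backend_specific", {}).get("block", {}).get("bdev")
    if bdev != expected_bdev:
        return "controller points at a different bdev"

    if qemu_socket != expected_socket:
        return "QEMU references a different socket path"

    if raid_state != "online":
        return "RAID bdev is not online"

    return None
```

Running the checks in this order keeps attention on the vhost and QEMU wiring before anything lower in the bdev graph.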

Self-Check

  1. What object does QEMU connect to?
  2. What object does SPDK submit bdev I/O to?
  3. Why can a vhost controller exist without proving a guest is currently using it?
  4. Where does SPDK translate guest descriptors into iovs?
  5. Why should RAID deletion wait until vhost exposure is gone?

References

  • Local SPDK: lib/vhost/vhost_rpc.c
  • Local SPDK: lib/vhost/vhost_blk.c
  • Local SPDK: lib/vhost/rte_vhost_user.c
  • Local SPDK: lib/vhost/vhost_internal.h
  • Local SPDK: include/spdk/vhost.h
  • Local diskengine: /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/attach.go
  • Local diskengine: /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/vhost_detach.go
  • SPDK vhost documentation: https://spdk.io/doc/vhost.html
  • QEMU vhost-user documentation: https://www.qemu.org/docs/master/interop/vhost-user.html