SPDK From First Principles

SPDK deep learning path

Chapter 25: vfio-user NVMe Exposure

Source: drafts/transport-diskengine/25-vfio-user-nvme-exposure.md

Chapter Goal

This chapter explains vfio-user as a way to expose an emulated NVMe controller through a Unix socket, with guest-visible PCI/NVMe semantics and SPDK bdev-backed storage. The reader should understand how this differs from vhost-blk, how queue memory and doorbells enter the design, and where to look when a vfio-user endpoint wedges.

Beginner Mental Model

vhost-blk gives a VM a virtio-blk device. vfio-user can give a VM something that behaves like a PCI device. In SPDK's NVMe-oF vfio-user transport, the exported device presents NVMe controller semantics. The guest or client sees NVMe queues, doorbells, admin commands, I/O commands, and completions.

That makes vfio-user closer to "a userspace PCIe NVMe device" than to "a network socket storage protocol." It still uses a Unix socket for control and memory setup, but the guest-facing model is NVMe.

The useful beginner comparison:

  • vhost-blk: virtio request -> SPDK bdev I/O.
  • vfio-user NVMe: NVMe SQE/CQE and doorbells -> SPDK NVMf request -> SPDK bdev I/O.

SPDK Source Anchors

The target-side vfio-user transport is:

  • lib/nvmf/vfio_user.c: struct nvmf_vfio_user_req
  • lib/nvmf/vfio_user.c: struct nvmf_vfio_user_sq
  • lib/nvmf/vfio_user.c: struct nvmf_vfio_user_cq
  • lib/nvmf/vfio_user.c: struct nvmf_vfio_user_ctrlr
  • lib/nvmf/vfio_user.c: struct nvmf_vfio_user_endpoint
  • lib/nvmf/vfio_user.c: struct nvmf_vfio_user_poll_group
  • lib/nvmf/vfio_user.c: vfio_user_ctrlr_switch_doorbells
  • lib/nvmf/vfio_user.c: ctrlr_kick
  • lib/nvmf/vfio_user.c: poll_group_kick
  • lib/nvmf/vfio_user.c: SPDK_NVMF_TRANSPORT_REGISTER

The common NVMf request execution still ends up at:

  • lib/nvmf/ctrlr.c: spdk_nvmf_request_exec
  • lib/nvmf/ctrlr.c: nvmf_ctrlr_process_admin_cmd
  • lib/nvmf/ctrlr.c: nvmf_ctrlr_process_io_cmd
  • lib/nvmf/ctrlr_bdev.c: nvmf_bdev_ctrlr_write_cmd
  • lib/nvmf/ctrlr.c: spdk_nvmf_request_complete

The generic vfio-user protocol structures are:

  • include/spdk/vfio_user_spec.h: enum vfio_user_command
  • include/spdk/vfio_user_spec.h: struct vfio_user_header
  • include/spdk/vfio_user_spec.h: struct vfio_user_dma_map
  • include/spdk/vfio_user_spec.h: struct vfio_user_region_access
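
To make the message framing concrete, here is a minimal Python sketch of a vfio-user message header. The field order and widths (message ID, command, total message size, flags, error number) follow my reading of the published vfio-user protocol document; treat them as an assumption and check them against struct vfio_user_header before relying on them. The helper names and the command value used in examples are hypothetical.

```python
import struct

# Assumed layout: msg_id (u16), cmd (u16), msg_size (u32), flags (u32),
# error_no (u32). Little-endian packing is an assumption of this sketch.
HEADER = struct.Struct("<HHIII")

FLAG_TYPE_MASK = 0xF      # bits 0-3: 0 = command, 1 = reply
FLAG_NO_REPLY = 1 << 4    # sender does not expect a reply
FLAG_ERROR = 1 << 5       # reply carries an error number


def pack_header(msg_id, cmd, payload_len, reply=False, error_no=0):
    """Build a header; msg_size covers the header plus the payload."""
    flags = 1 if reply else 0
    if error_no:
        flags |= FLAG_ERROR
    return HEADER.pack(msg_id, cmd, HEADER.size + payload_len, flags, error_no)


def unpack_header(raw):
    """Split the leading header bytes of a message into named fields."""
    msg_id, cmd, msg_size, flags, error_no = HEADER.unpack_from(raw)
    return {
        "msg_id": msg_id,
        "cmd": cmd,
        "payload_len": msg_size - HEADER.size,
        "is_reply": (flags & FLAG_TYPE_MASK) == 1,
        "no_reply": bool(flags & FLAG_NO_REPLY),
        "error_no": error_no if flags & FLAG_ERROR else 0,
    }
```

A reply is matched to its request by message ID, so when a control-plane exchange appears stuck, the first thing to compare in a trace is the msg_id of the last command against the msg_id of the last reply.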

SPDK also has a vfio-user client path for connecting to vfio-user PCI devices:

  • lib/vfio_user/host/vfio_user.c: vfio_user_dev_send_request
  • lib/vfio_user/host/vfio_user.c: vfio_user_check_version
  • lib/vfio_user/host/vfio_user.c: vfio_user_dev_dma_map_unmap
  • lib/vfio_user/host/vfio_user_pci.c: spdk_vfio_user_setup
  • lib/vfio_user/host/vfio_user_pci.c: spdk_vfio_user_pci_bar_access
  • lib/nvme/nvme_vfio_user.c: nvme_vfio_ctrlr_construct
  • lib/nvme/nvme_vfio_user.c: nvme_vfio_ctrlr_enable
  • lib/nvme/nvme_vfio_user.c: nvme_vfio_setup_bar0

How vfio-user NVMe Differs From Network NVMe-oF

The common NVMf layer still sees requests, but they arrive differently:

  • RDMA/TCP transports receive commands over a network transport.
  • vfio-user receives guest/device interactions through a local socket and mapped memory.
  • Queue doorbells are device register writes, not packets.
  • Queue entries live in guest memory regions mapped into the backend.
  • Completion visibility depends on CQ updates, eventfds/interrupts, and poll mode behavior.

For a beginner, the critical shift is this: with vfio-user, a doorbell write is an I/O signal. The guest updates a submission queue in memory and then rings a doorbell. The backend must notice the doorbell, consume SQEs, execute requests, write CQEs, and notify the guest.
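
The doorbell-driven consumption step can be modeled in a few lines. This is a toy Python sketch, not SPDK code: the queue is a plain list, and consume_sq is a hypothetical name. It only shows that the backend walks from its own head index up to the tail value the guest wrote, wrapping at the end of the ring.

```python
def consume_sq(sq_entries, head, tail_doorbell):
    """Return (commands, new_head) for entries between head and the doorbell tail.

    sq_entries models SQ memory, head is the backend-owned index, and
    tail_doorbell is the value the guest last wrote to the SQ tail doorbell.
    """
    size = len(sq_entries)
    if not 0 <= tail_doorbell < size:
        raise ValueError("doorbell value outside the queue")
    commands = []
    while head != tail_doorbell:
        commands.append(sq_entries[head])
        head = (head + 1) % size  # wrap at the end of the ring
    return commands, head
```

If the doorbell value never changes, the loop body never runs, which is exactly what a "lost doorbell" hang looks like from the backend side: valid SQEs in memory and nothing consuming them.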

Doorbells, SQs, And CQs

NVMe has submission queues and completion queues. A submission queue tail doorbell tells the controller new commands are available. A completion queue head doorbell tells the controller the host has consumed completions.
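
Each queue's doorbells sit at fixed offsets in BAR0. The sketch below uses the formula from the NVMe base specification: doorbells start at offset 0x1000, and the stride is 4 << CAP.DSTRD bytes. The function names are illustrative and do not come from SPDK.

```python
DOORBELL_BASE = 0x1000  # first doorbell register in BAR0, per the NVMe spec


def doorbell_stride_bytes(dstrd):
    """Stride between consecutive doorbells, derived from CAP.DSTRD."""
    return 4 << dstrd


def sq_tail_doorbell_offset(qid, dstrd=0):
    """BAR0 offset of the submission queue tail doorbell for queue qid."""
    return DOORBELL_BASE + (2 * qid) * doorbell_stride_bytes(dstrd)


def cq_head_doorbell_offset(qid, dstrd=0):
    """BAR0 offset of the completion queue head doorbell for queue qid."""
    return DOORBELL_BASE + (2 * qid + 1) * doorbell_stride_bytes(dstrd)
```

With DSTRD = 0, the admin queue's doorbells are at 0x1000 and 0x1004, and I/O queue 1's are at 0x1008 and 0x100C. Knowing the layout makes it possible to tell from a raw BAR write offset which queue the guest just rang.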

In lib/nvmf/vfio_user.c, the structures make those ideas explicit:

  • nvmf_vfio_user_sq stores submission queue state.
  • nvmf_vfio_user_cq stores completion queue state.
  • nvmf_vfio_user_ctrlr stores controller-wide BAR0 doorbell and shadow doorbell pointers.
  • vfio_user_ctrlr_switch_doorbells switches queue doorbell locations between BAR0 and shadow doorbell memory.

The source also has helper functions for head/tail movement:

  • lib/nvmf/vfio_user.c: sq_headp
  • lib/nvmf/vfio_user.c: sq_dbl_tailp
  • lib/nvmf/vfio_user.c: cq_dbl_headp
  • lib/nvmf/vfio_user.c: cq_tailp
  • lib/nvmf/vfio_user.c: sq_head_advance
  • lib/nvmf/vfio_user.c: cq_tail_advance

These are worth reading slowly. Many vfio-user bugs are queue-index bugs in disguise.
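
As a reading aid, here is a small Python model of the index arithmetic such helpers have to perform. It is a sketch of the NVMe ring rules, not a transcription of the SPDK functions: the names are hypothetical, and the phase handling follows the NVMe specification's rule that the phase tag inverts each time the completion queue wraps.

```python
def advance_sq_head(head, size):
    """Move the controller-owned SQ head forward by one entry, with wrap."""
    return (head + 1) % size


def advance_cq_tail(tail, phase, size):
    """Move the controller-owned CQ tail forward by one entry.

    The phase bit flips on every wrap so the host can tell new CQEs from
    stale ones left over from the previous pass through the ring.
    """
    tail += 1
    if tail == size:
        tail = 0
        phase ^= 1
    return tail, phase


def host_sees_new_entry(entry_phase, expected_phase):
    """Host-side check: a CQE is new only if its phase matches expectations."""
    return entry_phase == expected_phase
```

An off-by-one in either wrap, or a phase bit that fails to flip, produces a guest that either misses completions or re-reads stale ones. Both symptoms look like a hang or like corrupted completions rather than like an index bug.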

Prose Diagram: vfio-user NVMe Request Flow

Draw a diagram with two boxes at the top:

  • Guest VM NVMe driver.
  • SPDK vfio-user endpoint.

Between them draw:

  • Unix socket control channel.
  • Shared memory mappings.
  • BAR/register access path.
  • eventfd/interrupt path.

Then show the I/O flow:

Guest writes command into SQ memory -> guest writes SQ tail doorbell -> SPDK vfio-user backend observes doorbell/kick -> backend builds spdk_nvmf_request -> common NVMf controller path executes command -> bdev I/O completes -> backend writes CQE into completion queue memory -> backend signals interrupt/event or relies on polling -> guest observes completion and advances CQ head.

The diagram should make clear that command data and completion queues are memory structures, not JSON-RPC messages.

Control Plane

For endpoints created through SPDK's separate vfu_tgt library, the socket base path is controlled by:

  • lib/vfu_tgt/tgt_rpc.c: rpc_vfu_tgt_set_base_path
  • lib/vfu_tgt/tgt_rpc.c: SPDK_RPC_REGISTER("vfu_tgt_set_base_path", ...)

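NVMe-oF vfio-user endpoints are created through the normal NVMf subsystem and listener setup: the transport type is registered like any other NVMf transport, and adding a listener of that type to a subsystem is what brings the endpoint up. The listener address is a local filesystem path rather than a network address. The conceptual target objects are still subsystem, listener, namespace, qpair, and request.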
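
The setup sequence can be sketched as a list of JSON-RPC requests. The method and parameter names below follow my reading of SPDK's JSON-RPC documentation and should be verified against the SPDK version in use. The NQN, bdev name, and directory are placeholders, and the helper function is hypothetical.

```python
import json


def vfio_user_setup_requests(nqn, bdev_name, socket_dir):
    """Return JSON-RPC request dicts in the order they would be sent."""
    calls = [
        ("nvmf_create_transport", {"trtype": "VFIOUSER"}),
        ("nvmf_create_subsystem", {"nqn": nqn, "allow_any_host": True}),
        ("nvmf_subsystem_add_ns",
         {"nqn": nqn, "namespace": {"bdev_name": bdev_name}}),
        ("nvmf_subsystem_add_listener",
         {"nqn": nqn,
          "listen_address": {
              "trtype": "VFIOUSER",
              "traddr": socket_dir,   # a directory, not an IP address
              "trsvcid": "0",
          }}),
    ]
    return [
        {"jsonrpc": "2.0", "id": i, "method": method, "params": params}
        for i, (method, params) in enumerate(calls, start=1)
    ]


if __name__ == "__main__":
    # Placeholder values; the bdev is assumed to exist already.
    for req in vfio_user_setup_requests(
            "nqn.2019-07.io.spdk:cnode0", "Malloc0", "/var/run/vfio-user-demo"):
        print(json.dumps(req))
```

The order matters: the transport must exist before a listener of that type can be added, and the namespace should be attached before a guest connects if the guest is expected to see storage at first probe.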

Edge Cases And Failure Modes

Doorbell lost or not observed:

The guest can write SQEs correctly, but I/O will not progress if the backend misses a doorbell or a poll group kick. Source anchors: ctrlr_kick, poll_group_kick, and the doorbell helper functions.

CQ full:

If completions cannot be posted, requests can stall even though bdev I/O finished. Compare the CQ tail the backend maintains with the CQ head doorbell the guest last wrote to see whether the guest is consuming completions.
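
The full condition is the standard NVMe ring rule: a queue is full when advancing the tail would make it equal to the head. The Python sketch below models that rule and a simple "defer until there is room" policy. The deferral list is an assumption of this sketch, not a description of how SPDK queues stalled completions.

```python
def cq_is_full(tail, guest_head, size):
    """True when posting one more CQE would overrun entries the guest
    has not consumed yet."""
    return (tail + 1) % size == guest_head


def post_or_defer(cq, tail, guest_head, cqe, deferred):
    """Write cqe at tail if there is room, otherwise park it in deferred.

    Returns the new tail index (unchanged when the CQE was deferred).
    """
    size = len(cq)
    if cq_is_full(tail, guest_head, size):
        deferred.append(cqe)
        return tail
    cq[tail] = cqe
    return (tail + 1) % size
```

In this model, a four-entry CQ holds at most three unconsumed completions. A guest that stops writing its CQ head doorbell therefore stalls the backend after size - 1 completions, regardless of how fast the bdev layer is.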

Wrong memory mapping:

If guest memory is not mapped, or DMA map/unmap handling fails, queue entries or the PRP/SGL data pointers inside commands may refer to memory the backend cannot access. Source anchors (client side): vfio_user_dev_dma_map_unmap, spdk_vfio_user_setup.
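
The failure is easier to reason about with a model of the translation step. The sketch below is a toy Python version of a DMA map table: guest address ranges map to backend-local addresses, and a lookup fails if the requested range is not fully covered by one mapped region. The class and method names are hypothetical, and real implementations track more state (file descriptors, offsets, protection flags).

```python
class DmaMap:
    """Toy table of guest address ranges mapped into the backend."""

    def __init__(self):
        self._regions = []  # entries are (guest_addr, length, local_addr)

    def map(self, guest_addr, length, local_addr):
        """Record that [guest_addr, guest_addr + length) is accessible."""
        self._regions.append((guest_addr, length, local_addr))

    def unmap(self, guest_addr, length):
        """Forget a previously mapped range."""
        self._regions = [
            r for r in self._regions if (r[0], r[1]) != (guest_addr, length)
        ]

    def translate(self, guest_addr, length):
        """Return the backend-local address, or None if the range is not
        fully contained in a single mapped region."""
        for base, size, local in self._regions:
            if base <= guest_addr and guest_addr + length <= base + size:
                return local + (guest_addr - base)
        return None
```

A None result corresponds to the case where a queue base address or data pointer names guest memory the backend was never given. Checking the start and the end of the range matters, because a buffer that begins inside a region can still run past its end.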

Interrupt versus poll mode:

The guest may rely on interrupts while the backend expects polling, or vice versa. Check whether CQEs are being written to CQ memory but never signaled to the guest.

Queue deletion while I/O is outstanding:

Submission queue and completion queue lifetime is separate from individual requests. A teardown path must account for in-flight work.

Migration/shadow doorbells:

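Shadow doorbells come from the NVMe Doorbell Buffer Config admin command: the guest writes doorbell values into ordinary guest memory instead of BAR0 registers, which avoids a trap on every doorbell write. Live migration is a separate concern, but it has to save and restore the same queue and doorbell state. Both add places where queue state may live, so beginners should not ignore them when debugging endpoints that use shadow doorbells or have been migrated.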
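
A toy Python model shows why the doorbell location matters for debugging. The class below keeps two doorbell arrays and an "active" reference, and carries the current values across when switching. That carry-over is an assumption of this sketch about what a switch must preserve. The names are hypothetical and are not the SPDK data structures.

```python
class DoorbellView:
    """Toy model of doorbells that can live in BAR0 or in a shadow buffer."""

    def __init__(self, num_queues):
        self.bar0 = [0] * (2 * num_queues)  # SQ tail and CQ head per queue
        self.active = self.bar0

    @staticmethod
    def slot(qid, is_cq):
        """Index of a queue's doorbell: SQ tail first, then CQ head."""
        return 2 * qid + (1 if is_cq else 0)

    def switch_to_shadow(self, shadow):
        """Start reading doorbells from shadow, keeping current values."""
        shadow[:] = self.active
        self.active = shadow

    def switch_to_bar0(self):
        """Go back to BAR0, keeping current values."""
        self.bar0[:] = self.active
        self.active = self.bar0

    def read(self, qid, is_cq=False):
        """Read the doorbell from whichever location is active."""
        return self.active[self.slot(qid, is_cq)]
```

Once the shadow buffer is active in this model, the BAR0 copy goes stale. A debugger that inspects only BAR0 will see doorbells that appear frozen even while the guest is submitting I/O.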

Misconceptions To Kill

"vfio-user is just faster vhost."

No. It exposes a different device model. vhost-blk exposes virtio-blk; vfio-user NVMe exposes NVMe controller semantics.

"Because it uses a Unix socket, I/O data is copied through the socket."

No. The socket is for protocol/control and passing mappings. Queue and data access use mapped memory.

"NVMe-oF means network."

Not always. In SPDK, the NVMf target layer also has a vfio-user transport, which reuses the NVMe-oF subsystem, controller, and request machinery over a local vfio-user connection instead of a network.

"A completion callback means the guest has seen the completion."

No. It means SPDK has completed its internal work. The backend must still write the CQE and either notify the guest or leave the CQE visible for guest polling.

Lab: Queue State Reading

Open lib/nvmf/vfio_user.c and find:

  1. struct nvmf_vfio_user_sq
  2. struct nvmf_vfio_user_cq
  3. sq_head_advance
  4. cq_tail_advance
  5. vfio_user_ctrlr_switch_doorbells

Write a short note explaining where the host-owned index lives and where the controller-owned index lives. Then compare this to the NVMe queue model from the hardware chapters.

Operational Debug Exercise

Symptom: the VM sees the vfio-user NVMe device, but I/O hangs.

Check:

  1. Did the endpoint socket accept a connection?
  2. Did the guest create admin and I/O queues?
  3. Are SQ doorbells changing?
  4. Are requests reaching spdk_nvmf_request_exec?
  5. Are bdev I/O completions returning?
  6. Are CQEs written?
  7. Is the guest being notified, or is it polling the CQ?

If requests never reach spdk_nvmf_request_exec, debug vfio-user/queue handling. If they reach bdev and do not complete, debug the bdev graph. If bdev completes and the guest still hangs, debug CQ/interrupt delivery.
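
The checklist can be written down as a small triage helper. This Python sketch encodes the checks in order and returns the first layer that fails. The observation keys and layer descriptions are made up for illustration; gathering the observations (logs, tracepoints, a debugger) is left to the operator.

```python
# Ordered checks: (observation key, layer to suspect when it is false).
CHECKS = [
    ("socket_connected", "endpoint socket / client connection"),
    ("queues_created", "admin and I/O queue creation"),
    ("sq_doorbells_moving", "doorbell delivery or poll group kick"),
    ("reached_request_exec", "vfio-user queue handling"),
    ("bdev_completed", "bdev graph"),
    ("cqe_written", "CQ posting (possibly CQ full)"),
    ("guest_notified_or_polling", "interrupt / eventfd delivery"),
]


def first_suspect(observations):
    """Return the first layer whose check is false or missing, else None."""
    for key, layer in CHECKS:
        if not observations.get(key, False):
            return layer
    return None
```

For example, if every observation up to bdev_completed is true but cqe_written is false, the helper points at CQ posting, which matches the "bdev completes and the guest still hangs" branch of the exercise.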

Self-Check

  1. What does a doorbell indicate?
  2. Why is guest memory mapping central to vfio-user?
  3. How does vfio-user NVMe differ from vhost-blk?
  4. Which common SPDK function executes a built NVMf request?
  5. Why can bdev completion still leave the guest waiting?

References

  • Local SPDK: lib/nvmf/vfio_user.c
  • Local SPDK: include/spdk/vfio_user_spec.h
  • Local SPDK: lib/vfio_user/host/vfio_user.c
  • Local SPDK: lib/vfio_user/host/vfio_user_pci.c
  • Local SPDK: lib/nvme/nvme_vfio_user.c
  • Local SPDK: lib/vfu_tgt/tgt_rpc.c
  • QEMU vfio-user protocol documentation: https://www.qemu.org/docs/master/interop/vfio-user.html
  • NVM Express specifications: https://nvmexpress.org/specifications/