Chapter Goal
This chapter explains vfio-user as a way to expose an emulated NVMe controller through a Unix socket, with guest-visible PCI/NVMe semantics and SPDK bdev-backed storage. The reader should understand how this differs from vhost-blk, how queue memory and doorbells enter the design, and where to look when a vfio-user endpoint wedges.
Beginner Mental Model
vhost-blk gives a VM a virtio-blk device. vfio-user can give a VM something that behaves like a PCI device. In SPDK's NVMe-oF vfio-user transport, the exported device presents NVMe controller semantics. The guest or client sees NVMe queues, doorbells, admin commands, I/O commands, and completions.
That makes vfio-user closer to "a userspace PCIe NVMe device" than to "a network socket storage protocol." It still uses a Unix socket for control and memory setup, but the guest-facing model is NVMe.
The useful beginner comparison:
- vhost-blk: virtio request -> SPDK bdev I/O.
- vfio-user NVMe: NVMe SQE/CQE and doorbells -> SPDK NVMf request -> SPDK bdev I/O.
SPDK Source Anchors
The target-side vfio-user transport is:
- lib/nvmf/vfio_user.c: struct nvmf_vfio_user_req
- lib/nvmf/vfio_user.c: struct nvmf_vfio_user_sq
- lib/nvmf/vfio_user.c: struct nvmf_vfio_user_cq
- lib/nvmf/vfio_user.c: struct nvmf_vfio_user_ctrlr
- lib/nvmf/vfio_user.c: struct nvmf_vfio_user_endpoint
- lib/nvmf/vfio_user.c: struct nvmf_vfio_user_poll_group
- lib/nvmf/vfio_user.c: vfio_user_ctrlr_switch_doorbells
- lib/nvmf/vfio_user.c: ctrlr_kick
- lib/nvmf/vfio_user.c: poll_group_kick
- lib/nvmf/vfio_user.c: SPDK_NVMF_TRANSPORT_REGISTER
The common NVMf request execution still ends up at:
- lib/nvmf/ctrlr.c: spdk_nvmf_request_exec
- lib/nvmf/ctrlr.c: nvmf_ctrlr_process_admin_cmd
- lib/nvmf/ctrlr.c: nvmf_ctrlr_process_io_cmd
- lib/nvmf/ctrlr_bdev.c: nvmf_bdev_ctrlr_write_cmd
- lib/nvmf/ctrlr.c: spdk_nvmf_request_complete
The generic vfio-user protocol structures are:
- include/spdk/vfio_user_spec.h: enum vfio_user_command
- include/spdk/vfio_user_spec.h: struct vfio_user_header
- include/spdk/vfio_user_spec.h: struct vfio_user_dma_map
- include/spdk/vfio_user_spec.h: struct vfio_user_region_access
SPDK also has a vfio-user client path for connecting to vfio-user PCI devices:
- lib/vfio_user/host/vfio_user.c: vfio_user_dev_send_request
- lib/vfio_user/host/vfio_user.c: vfio_user_check_version
- lib/vfio_user/host/vfio_user.c: vfio_user_dev_dma_map_unmap
- lib/vfio_user/host/vfio_user_pci.c: spdk_vfio_user_setup
- lib/vfio_user/host/vfio_user_pci.c: spdk_vfio_user_pci_bar_access
- lib/nvme/nvme_vfio_user.c: nvme_vfio_ctrlr_construct
- lib/nvme/nvme_vfio_user.c: nvme_vfio_ctrlr_enable
- lib/nvme/nvme_vfio_user.c: nvme_vfio_setup_bar0
How vfio-user NVMe Differs From Network NVMe-oF
The common NVMf layer still sees requests. But the way requests arrive is different:
- RDMA/TCP transports receive commands over a network transport.
- vfio-user receives guest/device interactions through a local socket and mapped memory.
- Queue doorbells are device register writes, not packets.
- Queue entries live in guest memory regions mapped into the backend.
- Completion visibility depends on CQ updates, eventfds/interrupts, and poll mode behavior.
For a beginner, the critical shift is this: with vfio-user, a doorbell write is an I/O signal. The guest updates a submission queue in memory and then rings a doorbell. The backend must notice the doorbell, consume SQEs, execute requests, write CQEs, and notify the guest.
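The sequence above can be sketched as one polling pass. This is a reduced model under stated assumptions, not SPDK's code: struct sketch_queue_pair and sketch_poll_once are hypothetical names, the doorbell is a plain shared integer, command execution is replaced by immediate success, and CQ-full handling is left out.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_QSIZE 8u

/* Hypothetical, heavily reduced submission and completion entries. */
struct sketch_sqe { uint16_t cid; uint8_t opcode; };
struct sketch_cqe { uint16_t cid; uint16_t sq_head; uint16_t status; };

struct sketch_queue_pair {
	struct sketch_sqe sq[SKETCH_QSIZE]; /* stands in for guest SQ memory */
	struct sketch_cqe cq[SKETCH_QSIZE]; /* stands in for guest CQ memory */
	volatile uint32_t sq_tail_dbl;      /* written by the guest */
	uint32_t sq_head;                   /* owned by the backend */
	uint32_t cq_tail;                   /* owned by the backend */
	uint32_t notifications;             /* counts simulated interrupts */
};

/* One polling pass: consume every SQE the tail doorbell has published. */
static int sketch_poll_once(struct sketch_queue_pair *qp)
{
	uint32_t tail = qp->sq_tail_dbl % SKETCH_QSIZE;
	int handled = 0;

	while (qp->sq_head != tail) {
		const struct sketch_sqe *cmd = &qp->sq[qp->sq_head];
		struct sketch_cqe *cpl = &qp->cq[qp->cq_tail];

		qp->sq_head = (qp->sq_head + 1) % SKETCH_QSIZE;

		/* A real backend would build a request and run bdev I/O here. */
		cpl->cid = cmd->cid;
		cpl->sq_head = (uint16_t)qp->sq_head;
		cpl->status = 0;
		qp->cq_tail = (qp->cq_tail + 1) % SKETCH_QSIZE;
		handled++;
	}
	if (handled > 0) {
		qp->notifications++; /* one notification per batch in this sketch */
	}
	return handled;
}
```

Nothing happens in this model until sq_tail_dbl changes, which is the point: SQEs sitting in memory are invisible to the backend until the doorbell moves.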
Doorbells, SQs, And CQs
NVMe has submission queues and completion queues. A submission queue tail doorbell tells the controller new commands are available. A completion queue head doorbell tells the controller the host has consumed completions.
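As a small worked example, the NVMe register map places doorbells in BAR0 starting at offset 0x1000, with a stride of 4 << CAP.DSTRD bytes, SQ tail doorbells on even slots and CQ head doorbells on odd slots. The helper names below are hypothetical and only illustrate that arithmetic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_DOORBELL_BASE 0x1000u

/* BAR0 byte offset of the SQ tail doorbell for queue qid. */
static uint32_t sketch_sq_tail_dbl_offset(uint16_t qid, uint8_t dstrd)
{
	return SKETCH_DOORBELL_BASE + (2u * qid) * (4u << dstrd);
}

/* BAR0 byte offset of the CQ head doorbell for queue qid. */
static uint32_t sketch_cq_head_dbl_offset(uint16_t qid, uint8_t dstrd)
{
	return SKETCH_DOORBELL_BASE + (2u * qid + 1u) * (4u << dstrd);
}

/* Reverse mapping: decode a BAR0 write offset into (qid, is_cq). */
static bool sketch_decode_doorbell(uint32_t offset, uint8_t dstrd,
				   uint16_t *qid, bool *is_cq)
{
	uint32_t stride = 4u << dstrd;
	uint32_t index;

	if (offset < SKETCH_DOORBELL_BASE ||
	    (offset - SKETCH_DOORBELL_BASE) % stride != 0) {
		return false; /* not a doorbell register */
	}
	index = (offset - SKETCH_DOORBELL_BASE) / stride;
	*qid = (uint16_t)(index / 2);
	*is_cq = (index % 2) != 0;
	return true;
}
```

With DSTRD equal to 0, queue 1 has its SQ tail doorbell at 0x1008 and its CQ head doorbell at 0x100C.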
In lib/nvmf/vfio_user.c, the structures make those ideas explicit:
- nvmf_vfio_user_sq stores submission queue state.
- nvmf_vfio_user_cq stores completion queue state.
- nvmf_vfio_user_ctrlr stores controller-wide BAR0 doorbell and shadow doorbell pointers.
- vfio_user_ctrlr_switch_doorbells switches queue doorbell locations between BAR0 and shadow doorbell memory.
The source also has helper functions for head/tail movement:
- lib/nvmf/vfio_user.c: sq_headp
- lib/nvmf/vfio_user.c: sq_dbl_tailp
- lib/nvmf/vfio_user.c: cq_dbl_headp
- lib/nvmf/vfio_user.c: cq_tailp
- lib/nvmf/vfio_user.c: sq_head_advance
- lib/nvmf/vfio_user.c: cq_tail_advance
These are worth reading slowly. Many vfio-user bugs are queue-index bugs in disguise.
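To make the index arithmetic concrete, here is a minimal ring model. It is an illustration only: struct sketch_ring and its helpers are hypothetical and are not the SPDK functions listed above, although they follow the usual NVMe convention that indexes wrap at the queue size and the CQ phase flips on every wrap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sketch_ring {
	uint32_t size;  /* number of slots */
	uint32_t head;  /* consumer index */
	uint32_t tail;  /* producer index */
	bool phase;     /* only meaningful when the ring models a CQ */
};

/* Entries produced but not yet consumed, safe across wrap-around. */
static uint32_t sketch_ring_pending(const struct sketch_ring *r)
{
	return (r->tail + r->size - r->head) % r->size;
}

/* Consumer side: move head forward by one slot. */
static void sketch_head_advance(struct sketch_ring *r)
{
	assert(r->head < r->size);
	r->head = (r->head + 1) % r->size;
}

/* CQ producer side: move tail forward and flip phase on wrap. */
static void sketch_cq_tail_advance(struct sketch_ring *r)
{
	assert(r->tail < r->size);
	r->tail++;
	if (r->tail == r->size) {
		r->tail = 0;
		r->phase = !r->phase;
	}
}
```

An off-by-one in either helper shows up as skipped commands, duplicated completions, or a guest that never recognizes a new CQE, which is why index state is the first thing to print when a queue stalls.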
Prose Diagram: vfio-user NVMe Request Flow
Draw a diagram with two boxes at the top:
- Guest VM NVMe driver.
- SPDK vfio-user endpoint.
Between them draw:
- Unix socket control channel.
- Shared memory mappings.
- BAR/register access path.
- eventfd/interrupt path.
Then show the I/O flow:
Guest writes command into SQ memory -> guest writes SQ tail doorbell -> SPDK vfio-user backend observes doorbell/kick -> backend builds spdk_nvmf_request -> common NVMf controller path executes command -> bdev I/O completes -> backend writes CQE into completion queue memory -> backend signals interrupt/event or relies on polling -> guest observes completion and advances CQ head.
The diagram should make clear that command data and completion queues are memory structures, not JSON-RPC messages.
Control Plane
The base path under which the vfu_tgt library creates endpoint sockets is controlled by:
- lib/vfu_tgt/tgt_rpc.c: rpc_vfu_tgt_set_base_path
- lib/vfu_tgt/tgt_rpc.c: SPDK_RPC_REGISTER("vfu_tgt_set_base_path", ...)
NVMe-oF vfio-user endpoints are normally created as part of NVMf subsystem and listener setup. The transport type is registered with the NVMf transport layer and named when a listener is added. The conceptual target objects are still subsystem, listener, namespace, qpair, and request.
Edge Cases And Failure Modes
Doorbell lost or not observed:
The guest can write SQEs correctly but I/O will not progress if the backend misses a doorbell or poll group kick. Source anchors: ctrlr_kick, poll_group_kick, and doorbell helper functions.
CQ full:
If completions cannot be posted, requests can stall even though bdev I/O finished. Debug CQ head/tail state and guest consumption.
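A minimal sketch of the CQ-full condition, assuming the common ring convention that one slot is always left empty. The names are hypothetical and the deferral counter is an illustration of the stall, not a description of how SPDK queues pending completions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sketch_cq {
	uint32_t size;      /* number of CQE slots */
	uint32_t head;      /* last value the guest wrote to the CQ head doorbell */
	uint32_t tail;      /* next slot the backend will fill */
	uint32_t deferred;  /* completions finished by bdev but not yet posted */
};

static bool sketch_cq_is_full(const struct sketch_cq *cq)
{
	return (cq->tail + 1) % cq->size == cq->head;
}

/* Try to post one completion; remember it if the CQ has no room. */
static bool sketch_cq_post(struct sketch_cq *cq)
{
	if (sketch_cq_is_full(cq)) {
		cq->deferred++;
		return false;
	}
	cq->tail = (cq->tail + 1) % cq->size;
	return true;
}

/* Call after the guest moves the CQ head doorbell. */
static uint32_t sketch_cq_retry_deferred(struct sketch_cq *cq)
{
	uint32_t posted = 0;

	while (cq->deferred > 0 && !sketch_cq_is_full(cq)) {
		cq->tail = (cq->tail + 1) % cq->size;
		cq->deferred--;
		posted++;
	}
	return posted;
}
```

If deferred stays above zero while head never changes, the guest is not consuming completions, and the stall is on the guest side of the CQ.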
Wrong memory mapping:
If guest memory is not mapped or DMA map/unmap handling fails, descriptors or queue entries may point to inaccessible memory. Source anchors: vfio_user_dev_dma_map_unmap, spdk_vfio_user_setup.
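The lookup that fails in this case can be sketched as a bounds-checked translation from a guest address to a backend pointer. The region table and function name are hypothetical; a real backend keeps this state in its own structures and builds it from DMA map messages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One mapped guest memory region (hypothetical layout). */
struct sketch_dma_region {
	uint64_t iova;   /* guest address where the region starts */
	uint64_t len;    /* region length in bytes */
	uint8_t *vaddr;  /* where the backend mapped it */
};

/*
 * Translate [addr, addr + len) to a backend pointer.
 * Returns NULL if the range is not fully inside one mapped region.
 */
static void *sketch_gpa_to_vva(const struct sketch_dma_region *regions,
			       size_t count, uint64_t addr, uint64_t len)
{
	for (size_t i = 0; i < count; i++) {
		const struct sketch_dma_region *r = &regions[i];
		uint64_t off;

		if (addr < r->iova) {
			continue;
		}
		off = addr - r->iova;
		/* Compare against the remaining space to avoid overflow in addr + len. */
		if (off <= r->len && len <= r->len - off) {
			return r->vaddr + off;
		}
	}
	return NULL;
}
```

A NULL result for a queue base address or a data pointer is the signature of this failure mode: the guest address is valid in the guest but was never mapped, or was already unmapped, in the backend.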
Interrupt versus poll mode:
The guest may rely on interrupts while the backend expects polling or vice versa. Check whether completions are written but not signaled.
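One way to make the "written but not signaled" check concrete is to count both events separately. The sketch below assumes Linux, uses a plain eventfd as a stand-in for the interrupt path, and uses hypothetical names throughout; it does not reflect how SPDK wires interrupts to the guest.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

struct sketch_notifier {
	int efd;                /* eventfd the guest side would wait on */
	bool interrupt_mode;    /* false: the guest is expected to poll the CQ */
	uint64_t cqes_written;
	uint64_t signals_sent;
};

static int sketch_notifier_init(struct sketch_notifier *n, bool interrupt_mode)
{
	n->efd = eventfd(0, EFD_NONBLOCK);
	n->interrupt_mode = interrupt_mode;
	n->cqes_written = 0;
	n->signals_sent = 0;
	return n->efd < 0 ? -errno : 0;
}

/* Call after a CQE has been written to the completion queue. */
static int sketch_notifier_cqe_posted(struct sketch_notifier *n)
{
	uint64_t one = 1;

	n->cqes_written++;
	if (!n->interrupt_mode) {
		return 0; /* poll mode: no signal is sent */
	}
	if (write(n->efd, &one, sizeof(one)) != (ssize_t)sizeof(one)) {
		return -errno;
	}
	n->signals_sent++;
	return 0;
}

/* Guest-side stand-in: number of signals pending, or 0 if none. */
static uint64_t sketch_notifier_drain(struct sketch_notifier *n)
{
	uint64_t count = 0;

	if (read(n->efd, &count, sizeof(count)) != (ssize_t)sizeof(count)) {
		return 0;
	}
	return count;
}
```

If cqes_written grows while signals_sent stays at zero and the guest is sleeping on an interrupt, the two sides disagree about the mode.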
Queue deletion while I/O is outstanding:
Submission queue and completion queue lifetime is separate from individual requests. A teardown path must account for in-flight work.
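A common way to handle this, shown here as a hedged sketch rather than SPDK's actual teardown path, is to count in-flight requests and defer the free until the count reaches zero. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sketch_sq_lifetime {
	uint32_t inflight;       /* requests submitted but not completed */
	bool delete_requested;   /* guest asked for the queue to be deleted */
	bool freed;              /* resources actually released */
};

static void sketch_sq_maybe_free(struct sketch_sq_lifetime *sq)
{
	if (sq->delete_requested && sq->inflight == 0) {
		sq->freed = true;
	}
}

/* New work is refused once deletion has been requested. */
static bool sketch_sq_submit(struct sketch_sq_lifetime *sq)
{
	if (sq->delete_requested || sq->freed) {
		return false;
	}
	sq->inflight++;
	return true;
}

static void sketch_sq_complete(struct sketch_sq_lifetime *sq)
{
	assert(sq->inflight > 0);
	sq->inflight--;
	sketch_sq_maybe_free(sq);
}

static void sketch_sq_request_delete(struct sketch_sq_lifetime *sq)
{
	sq->delete_requested = true;
	sketch_sq_maybe_free(sq);
}
```

The property to look for in any real teardown path is the same: the queue's memory must outlive the last completion that refers to it.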
Migration/shadow doorbells:
Shadow doorbells keep doorbell values in shared memory rather than in BAR0 registers, and they also matter for live migration and stop-and-copy style handling. They add another place where queue state may live. Beginners should not ignore them when debugging migrated endpoints.
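To see why queue state can live in two places, consider a guest that updates a shadow doorbell in shared memory and performs the BAR0 write only when its update crosses an event index published by the backend. The sketch below uses the wrap-safe 16-bit comparison commonly found in guest drivers for this purpose; the structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if moving from old_idx to new_idx stepped past event_idx. */
static bool sketch_need_event(uint16_t event_idx, uint16_t new_idx,
			      uint16_t old_idx)
{
	return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}

struct sketch_doorbell {
	uint32_t shadow;       /* doorbell value in shared memory */
	uint32_t event_idx;    /* published by the backend */
	uint32_t bar0;         /* last value written through the register */
	uint32_t mmio_writes;  /* how many register writes happened */
};

/* Guest side: publish a new tail, ringing BAR0 only when required. */
static void sketch_guest_ring(struct sketch_doorbell *db, uint16_t new_tail)
{
	uint16_t old_tail = (uint16_t)db->shadow;

	db->shadow = new_tail;
	if (sketch_need_event((uint16_t)db->event_idx, new_tail, old_tail)) {
		db->bar0 = new_tail;
		db->mmio_writes++;
	}
}
```

With event_idx set to 2, ringing tails 1 and 2 updates only the shadow value, so a debugger that reads BAR0 alone would conclude that no commands were submitted.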
Misconceptions To Kill
"vfio-user is just faster vhost."
No. It exposes a different device model. vhost-blk exposes virtio-blk; vfio-user NVMe exposes NVMe controller semantics.
"Because it uses a Unix socket, I/O data is copied through the socket."
No. The socket is for protocol/control and passing mappings. Queue and data access use mapped memory.
"NVMe-oF means network."
In SPDK, the NVMf target layer also has a vfio-user transport. It is fabrics-style controller semantics over a local vfio-user transport.
"A completion callback means the guest has seen the completion."
No. It means SPDK has completed internal work. The backend still must write the CQE and then notify the guest or otherwise make the completion visible.
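The gap between "internal work done" and "guest can see it" can be shown with the 16-byte NVMe completion entry and its phase tag. The field layout follows the NVMe specification; the struct and function names are hypothetical, and a real backend would also need the appropriate memory barriers before the phase bit becomes visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sketch_nvme_cqe {
	uint32_t cdw0;    /* command specific */
	uint32_t rsvd;
	uint16_t sqhd;    /* SQ head pointer */
	uint16_t sqid;    /* SQ identifier */
	uint16_t cid;     /* command identifier */
	uint16_t status;  /* bit 0: phase tag, bits 15:1: status field */
};
_Static_assert(sizeof(struct sketch_nvme_cqe) == 16, "NVMe CQEs are 16 bytes");

/* Backend side: fill a CQ slot, tagging it with the current phase. */
static void sketch_write_cqe(struct sketch_nvme_cqe *slot, uint16_t cid,
			     uint16_t sqid, uint16_t sqhd, uint16_t sc,
			     bool phase)
{
	slot->cdw0 = 0;
	slot->rsvd = 0;
	slot->sqhd = sqhd;
	slot->sqid = sqid;
	slot->cid = cid;
	slot->status = (uint16_t)((sc << 1) | (phase ? 1u : 0u));
}

/* Guest side: a slot is new only if its phase matches what the guest expects. */
static bool sketch_guest_sees(const struct sketch_nvme_cqe *slot,
			      bool expected_phase)
{
	return (slot->status & 1u) == (expected_phase ? 1u : 0u);
}
```

Until sketch_write_cqe runs, the guest's check on that slot keeps returning false, no matter how many internal callbacks have fired.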
Lab: Queue State Reading
Open lib/nvmf/vfio_user.c and find:
- struct nvmf_vfio_user_sq
- struct nvmf_vfio_user_cq
- sq_head_advance
- cq_tail_advance
- vfio_user_ctrlr_switch_doorbells
Write a short note explaining where the host-owned index lives and where the controller-owned index lives. Then compare this to the NVMe queue model from the hardware chapters.
Operational Debug Exercise
Symptom: VM sees vfio-user NVMe device but I/O hangs.
Check:
- Did the endpoint socket accept a connection?
- Did the guest create admin and I/O queues?
- Are SQ doorbells changing?
- Are requests reaching spdk_nvmf_request_exec?
- Are bdev I/O completions returning?
- Are CQEs written?
- Is the guest being notified, or is it polling the CQ?
If requests never reach spdk_nvmf_request_exec, debug vfio-user/queue handling. If they reach bdev and do not complete, debug the bdev graph. If bdev completes and the guest still hangs, debug CQ/interrupt delivery.
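The triage rule above can be written as a small decision function. The counters are hypothetical: SPDK does not export them under these names, and in practice they would come from tracing, logging, or a debugger.

```c
#include <assert.h>
#include <stdint.h>

enum sketch_suspect {
	SKETCH_HEALTHY,
	SKETCH_QUEUE_HANDLING,  /* vfio-user doorbell and SQ processing */
	SKETCH_BDEV_GRAPH,      /* bdev layers below the NVMf request */
	SKETCH_CQ_DELIVERY      /* CQE writes and interrupt delivery */
};

struct sketch_io_counters {
	uint64_t sqes_submitted;          /* commands the guest queued */
	uint64_t requests_executed;       /* reached spdk_nvmf_request_exec */
	uint64_t bdev_completed;          /* bdev I/O completions returned */
	uint64_t guest_completions_seen;  /* completions the guest consumed */
};

/* Return the first layer where the counts stop matching. */
static enum sketch_suspect sketch_triage(const struct sketch_io_counters *c)
{
	if (c->requests_executed < c->sqes_submitted) {
		return SKETCH_QUEUE_HANDLING;
	}
	if (c->bdev_completed < c->requests_executed) {
		return SKETCH_BDEV_GRAPH;
	}
	if (c->guest_completions_seen < c->bdev_completed) {
		return SKETCH_CQ_DELIVERY;
	}
	return SKETCH_HEALTHY;
}
```

The order of the checks matters: each stage is only meaningful once the stage before it has been cleared.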
Self-Check
- What does a doorbell indicate?
- Why is guest memory mapping central to vfio-user?
- How does vfio-user NVMe differ from vhost-blk?
- Which common SPDK function executes a built NVMf request?
- Why can bdev completion still leave the guest waiting?
References
- Local SPDK: lib/nvmf/vfio_user.c
- Local SPDK: include/spdk/vfio_user_spec.h
- Local SPDK: lib/vfio_user/host/vfio_user.c
- Local SPDK: lib/vfio_user/host/vfio_user_pci.c
- Local SPDK: lib/nvme/nvme_vfio_user.c
- Local SPDK: lib/vfu_tgt/tgt_rpc.c
- QEMU vfio-user protocol documentation: https://www.qemu.org/docs/master/interop/vfio-user.html
- NVM Express specifications: https://nvmexpress.org/specifications/