Chapter Goal
After this chapter, the reader should be able to explain how SPDK exposes a local bdev as a remote NVMe namespace. They should know the difference between a target, transport, subsystem, listener, namespace, controller, qpair, poll group, and request. They should also be able to trace a remote write from an RDMA/TCP/vfio-user transport callback into spdk_nvmf_request_exec() and then into bdev I/O.
Beginner Mental Model
NVMe-oF is NVMe queue semantics carried over a fabric. A remote host still believes it is submitting NVMe commands to a controller. The controller is not a physical PCIe device on that host; it is represented by SPDK inside nvmf_tgt.
Think of the target as a building:
- The target is the whole building.
- A transport is a road type into the building: RDMA, TCP, FC, or vfio-user.
- A listener is a doorway on a road: an address and port/socket.
- A subsystem is a named storage tenant, identified by NQN.
- A namespace is one block device inside that subsystem.
- A qpair is one active queue connection from a host.
- A request is one NVMe command flowing through that qpair.
- A poll group is the CPU-local worker that polls transport events and bdev completions.
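The ownership relationships in the list above can be sketched as a small data model. This is only a mental-model aid, not SPDK's actual structs: the class names, the example NQN and address, and the "lowest free NSID" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Listener:
    """A doorway: transport type plus address and service id (port)."""
    trtype: str
    traddr: str
    trsvcid: str


@dataclass
class Namespace:
    """One block device exported inside a subsystem."""
    nsid: int
    bdev_name: str


@dataclass
class Subsystem:
    """A named storage tenant, identified by its NQN."""
    nqn: str
    listeners: list = field(default_factory=list)
    namespaces: dict = field(default_factory=dict)

    def add_namespace(self, bdev_name: str) -> Namespace:
        # Assumption for illustration: pick the lowest unused NSID, starting at 1.
        nsid = 1
        while nsid in self.namespaces:
            nsid += 1
        ns = Namespace(nsid, bdev_name)
        self.namespaces[nsid] = ns
        return ns


@dataclass
class Target:
    """The whole building: transports plus subsystems keyed by NQN."""
    transports: set = field(default_factory=set)
    subsystems: dict = field(default_factory=dict)


target = Target()
target.transports.add("RDMA")
subsys = Subsystem("nqn.2024-01.io.example:demo")  # hypothetical NQN
target.subsystems[subsys.nqn] = subsys
subsys.listeners.append(Listener("RDMA", "192.0.2.10", "4420"))  # hypothetical address
ns = subsys.add_namespace("abcd-uuid")
print(ns.nsid, ns.bdev_name)  # 1 abcd-uuid
```

Note that the NQN lives on the subsystem while the address and port live on the listener; qpairs, requests, and poll groups are runtime objects and are deliberately left out of this static picture.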
NVMe-oF is not "a network filesystem." It does not understand files, directories, extents, or VM images. It exports block namespaces. The guest or host above it owns the filesystem or partition table.
Why This Matters For diskengine/excloud
In diskengine storage-node mode, each lvol becomes reachable from compute/baremetal nodes by being added to an NVMe-oF subsystem as a namespace. The storage node loop creates or verifies:
- an RDMA transport,
- a subsystem NQN,
- a listener using the storage node RDMA IP and port,
- a namespace pointing at the lvol bdev.
In diskengine baremetal mode, the other side of the same relationship appears as bdev_nvme_attach_controller. The baremetal node connects to the storage node's NQN and sees a local SPDK bdev, usually named from the controller name plus namespace suffix.
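As a rough sketch of the baremetal side, the JSON-RPC request for bdev_nvme_attach_controller might be built as below. The parameter names follow SPDK's documented RPC. The controller name, address, and NQN are placeholder values, and the "name + n + NSID" bdev naming is the convention this sketch assumes; verify both against your SPDK version.

```python
import json


def attach_controller_request(name, traddr, trsvcid, subnqn, request_id=1):
    """Build a JSON-RPC 2.0 request that attaches a remote NVMe-oF controller."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "bdev_nvme_attach_controller",
        "params": {
            "name": name,          # local controller name chosen by the caller
            "trtype": "RDMA",
            "adrfam": "IPv4",
            "traddr": traddr,      # storage node RDMA IP
            "trsvcid": trsvcid,    # storage node port
            "subnqn": subnqn,      # NQN of the remote subsystem
        },
    }


def expected_bdev_name(controller_name, nsid):
    """Assumed naming convention: controller name, then 'n', then the NSID."""
    return f"{controller_name}n{nsid}"


req = attach_controller_request(
    "Nvme0", "192.0.2.10", "4420", "nqn.2024-01.io.example:demo"
)
print(json.dumps(req, indent=2))
print(expected_bdev_name("Nvme0", 1))  # Nvme0n1
```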
Object Lifecycle
The event subsystem initializes the target before user workloads can use it. The source tour starts at:
module/event/subsystems/nvmf/nvmf_tgt.c: nvmf_subsystem_init
module/event/subsystems/nvmf/nvmf_tgt.c: nvmf_tgt_advance_state
module/event/subsystems/nvmf/nvmf_tgt.c: nvmf_tgt_create_target
module/event/subsystems/nvmf/nvmf_tgt.c: nvmf_tgt_create_poll_groups
module/event/subsystems/nvmf/nvmf_tgt.c: nvmf_subsystem_write_config_json
The target object itself is created by the library:
lib/nvmf/nvmf.c: spdk_nvmf_tgt_create
lib/nvmf/nvmf.c: spdk_nvmf_tgt_destroy
lib/nvmf/nvmf.c: spdk_nvmf_poll_group_create
lib/nvmf/nvmf.c: spdk_nvmf_tgt_new_qpair
Transport implementations register themselves behind a common interface:
lib/nvmf/transport.c: spdk_nvmf_transport_register
lib/nvmf/transport.c: spdk_nvmf_transport_create_async
lib/nvmf/transport.c: spdk_nvmf_transport_listen
lib/nvmf/transport.c: nvmf_transport_poll_group_create
include/spdk/nvmf_transport.h: struct spdk_nvmf_transport_ops
Subsystems and namespaces are managed here:
lib/nvmf/subsystem.c: spdk_nvmf_subsystem_create
lib/nvmf/subsystem.c: spdk_nvmf_subsystem_start
lib/nvmf/subsystem.c: spdk_nvmf_subsystem_stop
lib/nvmf/subsystem.c: spdk_nvmf_subsystem_add_listener_ext
lib/nvmf/subsystem.c: spdk_nvmf_subsystem_add_ns_ext
include/spdk/nvmf.h: spdk_nvmf_subsystem_add_ns_ext
include/spdk/nvmf.h: spdk_nvmf_subsystem_add_listener_ext
Control-plane RPCs for this chapter are registered in:
lib/nvmf/nvmf_rpc.c: rpc_nvmf_create_transport
lib/nvmf/nvmf_rpc.c: rpc_nvmf_create_subsystem
lib/nvmf/nvmf_rpc.c: rpc_nvmf_subsystem_add_listener
lib/nvmf/nvmf_rpc.c: rpc_nvmf_subsystem_add_ns
lib/nvmf/nvmf_rpc.c: rpc_nvmf_get_subsystems
The I/O Path
A host sends a command over RDMA, TCP, FC, or vfio-user. The transport decodes enough of the wire/device protocol to create an spdk_nvmf_request. It then calls the common execution path:
lib/nvmf/ctrlr.c: spdk_nvmf_request_exec
That function classifies the command. Admin commands go to:
lib/nvmf/ctrlr.c: nvmf_ctrlr_process_admin_cmd
I/O commands go to:
lib/nvmf/ctrlr.c: nvmf_ctrlr_process_io_cmd
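A simplified model of that classification step is sketched below. It assumes the dispatch depends on two things: whether the opcode is the Fabrics opcode (0x7F), and whether the qpair is the admin queue (queue ID 0). The function name and return strings are made up for illustration. Confirm the real branching by reading spdk_nvmf_request_exec.

```python
FABRICS_OPCODE = 0x7F  # opcode shared by all NVMe-oF Fabrics commands


def classify(opcode: int, qid: int) -> str:
    """Toy dispatch: Fabrics commands first, then admin queue, else I/O."""
    if opcode == FABRICS_OPCODE:
        return "fabrics"
    if qid == 0:
        # Queue ID 0 is the admin queue of a controller.
        return "admin"
    return "io"


# The same opcode byte means different things on different queues:
# 0x02 is Get Log Page on the admin queue and Read on an I/O queue.
print(classify(0x02, qid=0))  # admin
print(classify(0x02, qid=1))  # io
print(classify(0x7F, qid=1))  # fabrics
```

The point of the sketch is that the opcode alone is not enough to pick a handler; the queue the command arrived on matters too.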
For namespace I/O, the controller code resolves the namespace to a bdev descriptor and channel:
lib/nvmf/ctrlr.c: spdk_nvmf_request_get_bdev
The bdev-backed command helpers live in:
lib/nvmf/ctrlr_bdev.c: nvmf_ctrlr_process_io_cmd_resubmit
lib/nvmf/ctrlr_bdev.c: nvmf_bdev_ctrl_queue_io
lib/nvmf/ctrlr_bdev.c: nvmf_bdev_ctrlr_read_cmd
lib/nvmf/ctrlr_bdev.c: nvmf_bdev_ctrlr_write_cmd
lib/nvmf/ctrlr_bdev.c: nvmf_bdev_ctrlr_flush_cmd
lib/nvmf/ctrlr_bdev.c: nvmf_bdev_ctrlr_unmap
The write command eventually submits spdk_bdev_writev_blocks_ext; the read command uses spdk_bdev_readv_blocks_ext. Those calls are asynchronous. The NVMe-oF request is not complete when the bdev I/O is submitted. Completion happens when the bdev module calls back, the NVMe status is filled in, and the request is returned through:
lib/nvmf/ctrlr.c: spdk_nvmf_request_complete
lib/nvmf/transport.c: nvmf_transport_req_complete
Transport-specific completion code then sends a CQE or response to the host.
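The submit-now, complete-later shape can be imitated with a toy poller. Everything here (FakeBdev, Request, the status strings) is hypothetical and only mirrors the control flow described above; it does not use SPDK's real types or signatures.

```python
from collections import deque


class FakeBdev:
    """Toy backing device: submissions are queued and finish only when polled."""

    def __init__(self):
        self._inflight = deque()

    def writev_blocks(self, lba, num_blocks, cb, cb_arg):
        # Submission returns immediately; nothing has completed yet.
        self._inflight.append((cb, cb_arg))

    def poll(self):
        # A poller later discovers completions and fires the callbacks.
        completed = 0
        while self._inflight:
            cb, cb_arg = self._inflight.popleft()
            cb(True, cb_arg)
            completed += 1
        return completed


class Request:
    """Stand-in for an NVMe-oF request identified by its command ID."""

    def __init__(self, cid):
        self.cid = cid
        self.status = None


sent_cqes = []  # what the "transport" has sent back to the host


def request_complete(req):
    sent_cqes.append((req.cid, req.status))


def write_done(success, req):
    # Bdev completion callback: fill in the status, then complete the request.
    req.status = "SUCCESS" if success else "INTERNAL_DEVICE_ERROR"
    request_complete(req)


bdev = FakeBdev()
req = Request(cid=7)
bdev.writev_blocks(0, 8, write_done, req)
print(req.status, sent_cqes)  # None []
bdev.poll()
print(req.status, sent_cqes)  # SUCCESS [(7, 'SUCCESS')]
```

Between the two print calls the submitting code has already returned, which is why a reactor thread can keep servicing other qpairs while the write is in flight.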
Prose Diagram: Target Request Flow
Picture a left-to-right diagram with five vertical lanes:
- Remote host NVMe driver.
- SPDK transport.
- SPDK NVMf controller.
- SPDK bdev layer.
- Physical or virtual backing bdev.
The arrows are:
Host submits SQE -> transport receives capsule/request -> spdk_nvmf_request_exec -> nvmf_ctrlr_process_io_cmd -> spdk_nvmf_request_get_bdev -> spdk_bdev_writev_blocks_ext or spdk_bdev_readv_blocks_ext -> backing bdev finishes -> bdev callback -> spdk_nvmf_request_complete -> transport sends CQE -> host observes completion.
The important visual detail is that completion is a separate arrow coming back later. Nothing should be drawn as a blocking function call waiting for the SSD.
RDMA, TCP, And vfio-user Distinctions
The common NVMf layer does not care whether a write arrived from RDMA or TCP once it has an spdk_nvmf_request. The transport matters before and after that point.
RDMA:
- Uses RDMA queue pairs and memory registration.
- Sensitive to RNIC, RDMA CM, MTU, PFC/ECN, and hostaddr binding.
- Common in diskengine's storage-node to baremetal path.
- Source anchors:
lib/nvmf/rdma.c: spdk_nvmf_request_exec call sites
lib/nvmf/rdma.c: spdk_nvmf_request_complete call sites
TCP:
- Uses sockets instead of RDMA verbs.
- Easier to bring up because it needs no RDMA-capable NIC, usually at a higher CPU cost.
- Source anchors:
lib/nvmf/tcp.c: spdk_nvmf_request_exec call sites
vfio-user:
- Looks like a local PCIe device to a VM or client process over a Unix socket.
- Uses guest memory mapping and doorbell handling rather than network packets.
- Source anchors:
lib/nvmf/vfio_user.c: nvmf_vfio_user_poll_group
lib/nvmf/vfio_user.c: vfio_user_ctrlr_switch_doorbells
lib/nvmf/vfio_user.c: spdk_nvmf_request_exec call sites
Edge Cases And Failure Modes
Listener exists but subsystem has no namespace:
The host may discover or connect to a subsystem but see no usable capacity. Check nvmf_get_subsystems and verify namespace entries. In diskengine, this usually points at internal/storagenode/provisionlvol.go: provisionLvol or internal/storagenode/nvmeofexport.go: reconcileExports.
Namespace bdev disappeared:
If the bdev backing a namespace is removed, outstanding I/O may fail and reconnect behavior depends on the initiator. Read lib/nvmf/subsystem.c namespace removal paths and lib/nvmf/ctrlr.c: spdk_nvmf_request_get_bdev.
Host NQN mismatch:
If allow_any_host is false and the host was not added, connect fails even though the network path is fine. Source anchors: include/spdk/nvmf.h: spdk_nvmf_subsystem_add_host_ext, include/spdk/nvmf.h: spdk_nvmf_subsystem_set_allow_any_host.
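The admission rule described here reduces to a two-part check. The sketch below is a hypothetical model of that rule, not the SPDK implementation, and the host NQNs are invented examples.

```python
def connect_allowed(allow_any_host, allowed_hostnqns, hostnqn):
    """Return True if a host with this NQN may connect to the subsystem."""
    if allow_any_host:
        return True
    # Otherwise the host must have been added to the subsystem explicitly.
    return hostnqn in allowed_hostnqns


allowed = {"nqn.2014-08.org.nvmexpress:uuid:host-a"}
print(connect_allowed(False, allowed, "nqn.2014-08.org.nvmexpress:uuid:host-a"))  # True
print(connect_allowed(False, allowed, "nqn.2014-08.org.nvmexpress:uuid:host-b"))  # False
print(connect_allowed(True, set(), "nqn.2014-08.org.nvmexpress:uuid:host-b"))     # True
```

The second case is the one that looks like a network fault from the initiator side even though packets are flowing correctly.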
Transport exists but does not listen:
nvmf_create_transport creates only the transport object; it does not create any subsystem listener. Listeners are added separately with nvmf_subsystem_add_listener, which may call the target's listen functions as needed.
Buffer pressure:
Transports may need iobufs for request data. Source anchors: lib/nvmf/transport.c: spdk_nvmf_request_get_buffers, lib/nvmf/transport.c: nvmf_request_iobuf_get_cb.
Subsystem state transitions:
Some changes require pause/stop/resume semantics. A beginner mistake is to think a namespace list is just an array that can be mutated freely while I/O is running. Use the subsystem state functions as the source of truth.
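A toy state machine makes the point concrete. The rule it enforces, that namespace changes are accepted only while the subsystem is inactive or paused, is an assumption of this sketch. The real state set is larger and its transitions are asynchronous, so check lib/nvmf/subsystem.c for the actual conditions.

```python
class ToySubsystem:
    """Toy state machine; real SPDK has more states and async transitions."""

    def __init__(self):
        self.state = "ACTIVE"
        self.namespaces = []

    def pause(self):
        if self.state != "ACTIVE":
            raise RuntimeError(f"cannot pause from {self.state}")
        self.state = "PAUSED"

    def resume(self):
        if self.state != "PAUSED":
            raise RuntimeError(f"cannot resume from {self.state}")
        self.state = "ACTIVE"

    def add_ns(self, bdev_name):
        # Assumed rule for this sketch: reject changes while I/O may be running.
        if self.state not in ("INACTIVE", "PAUSED"):
            raise RuntimeError(f"cannot add namespace while {self.state}")
        self.namespaces.append(bdev_name)


subsys = ToySubsystem()
try:
    subsys.add_ns("abcd-uuid")
except RuntimeError as err:
    print("rejected:", err)  # rejected: cannot add namespace while ACTIVE

subsys.pause()
subsys.add_ns("abcd-uuid")
subsys.resume()
print(subsys.state, subsys.namespaces)  # ACTIVE ['abcd-uuid']
```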
Misconceptions To Kill
"NVMe-oF exports disks."
More precisely, it exports namespaces backed by SPDK bdevs. The bdev may be an NVMe namespace, an lvol, a RAID bdev, a malloc bdev, or another virtual bdev.
"An NQN is an IP address."
An NQN is a name. A listener provides addressability. diskengine stores both because compute nodes need the NQN and the RDMA endpoint.
"Creating a subsystem moves data."
Creating a subsystem changes the target's namespace/control-plane state. Data moves only when a host sends I/O.
"The target thread blocks on remote writes."
The target submits asynchronous bdev I/O and returns later through completion callbacks. Blocking a reactor would harm every qpair on that thread.
Lab: Build A Minimal Mental Config
Without running SPDK, write the minimal JSON-RPC sequence for a storage node exporting one lvol named abcd-uuid through RDMA:
- nvmf_create_transport with trtype=RDMA.
- nvmf_create_subsystem with an NQN such as nqn.2024-01.io.excloud:storage.node.disk.lvol.
- nvmf_subsystem_add_listener with traddr, trsvcid, adrfam, and trtype.
- nvmf_subsystem_add_ns with namespace bdev abcd-uuid.
Then inspect lib/nvmf/nvmf_rpc.c and identify which C RPC handler decodes each step.
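One possible shape for that sequence, to compare against your own attempt, is sketched below. The method and parameter names follow SPDK's JSON-RPC documentation as far as this guide assumes. The address, port, IPv4 address family, and allow_any_host=True are placeholder choices for the lab, not diskengine's actual settings.

```python
import json


def rpc(method, params, request_id):
    """Wrap a method and its params in a JSON-RPC 2.0 envelope."""
    return {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}


def export_sequence(nqn, traddr, trsvcid, bdev_name):
    """Return the four requests that export one bdev over RDMA, in order."""
    listen_address = {
        "trtype": "RDMA",
        "adrfam": "IPv4",
        "traddr": traddr,
        "trsvcid": trsvcid,
    }
    steps = [
        ("nvmf_create_transport", {"trtype": "RDMA"}),
        # allow_any_host is a lab simplification, not a recommendation.
        ("nvmf_create_subsystem", {"nqn": nqn, "allow_any_host": True}),
        ("nvmf_subsystem_add_listener", {"nqn": nqn, "listen_address": listen_address}),
        ("nvmf_subsystem_add_ns", {"nqn": nqn, "namespace": {"bdev_name": bdev_name}}),
    ]
    return [rpc(method, params, i) for i, (method, params) in enumerate(steps, start=1)]


requests = export_sequence(
    "nqn.2024-01.io.excloud:storage.node.disk.lvol", "192.0.2.10", "4420", "abcd-uuid"
)
for request in requests:
    print(json.dumps(request))
```

The order matters: the transport must exist before a listener can use it, and the subsystem must exist before a listener or namespace can be attached to it.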
Source Reading Exercise
Start at lib/nvmf/ctrlr.c: spdk_nvmf_request_exec. Follow only the I/O command path. Write down:
- Where the opcode is classified.
- Where nsid becomes a bdev.
- Where the bdev call is submitted.
- Where spdk_nvmf_request_complete is called after bdev completion.
Do not read the whole file linearly. Use symbol search and call references.
Operational Debug Exercise
Symptom: baremetal node cannot attach a volume over RDMA.
Classify it:
- Does the storage node show the RDMA transport in nvmf_get_transports?
- Does nvmf_get_subsystems show the target NQN?
- Does that subsystem have a listener with the expected RDMA IP and port?
- Does it have a namespace whose bdev_name is the lvol UUID?
- On the baremetal node, does bdev_nvme_attach_controller fail during connect, or does it succeed with no bdev appearing?
Only after answering these should you suspect the bdev layer or lvol metadata.
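The target-side questions in the checklist can be turned into a small triage helper. It assumes the RPC results have already been parsed into Python lists of dicts. The field names (trtype, nqn, listen_addresses, namespaces, bdev_name) are taken to match SPDK's output, but treat that shape as an assumption and compare it with what your version actually returns. The function and messages are hypothetical.

```python
def classify_export(transports, subsystems, nqn, traddr, trsvcid, bdev_name):
    """Walk the checklist in order and report the first missing piece."""
    if not any(str(t.get("trtype", "")).upper() == "RDMA" for t in transports):
        return "no RDMA transport"

    subsys = next((s for s in subsystems if s.get("nqn") == nqn), None)
    if subsys is None:
        return "subsystem NQN missing"

    def matches(listener):
        return (
            str(listener.get("trtype", "")).upper() == "RDMA"
            and listener.get("traddr") == traddr
            and str(listener.get("trsvcid")) == str(trsvcid)
        )

    if not any(matches(l) for l in subsys.get("listen_addresses", [])):
        return "listener missing or wrong address"

    if not any(n.get("bdev_name") == bdev_name for n in subsys.get("namespaces", [])):
        return "namespace missing"

    return "target side looks complete; check the initiator"


transports = [{"trtype": "RDMA"}]
subsystems = [{
    "nqn": "nqn.2024-01.io.example:demo",
    "listen_addresses": [{"trtype": "RDMA", "traddr": "192.0.2.10", "trsvcid": "4420"}],
    "namespaces": [],
}]
print(classify_export(transports, subsystems,
                      "nqn.2024-01.io.example:demo", "192.0.2.10", "4420", "abcd-uuid"))
# namespace missing
```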
Self-Check
- What object owns the NQN?
- What object owns the RDMA IP and port?
- Why can a subsystem exist without being useful to a host?
- Why is spdk_nvmf_request_complete not called immediately after spdk_bdev_writev_blocks_ext?
- Which diskengine storage-node functions create or verify NVMe-oF exports?
References
- Local SPDK: include/spdk/nvmf.h
- Local SPDK: include/spdk/nvmf_transport.h
- Local SPDK: module/event/subsystems/nvmf/nvmf_tgt.c
- Local SPDK: lib/nvmf/nvmf_rpc.c
- Local SPDK: lib/nvmf/ctrlr.c
- Local SPDK: lib/nvmf/ctrlr_bdev.c
- Local diskengine: /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/storagenode/nvmeofexport.go
- Local diskengine: /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/storagenode/provisionlvol.go
- SPDK NVMe-oF documentation: https://spdk.io/doc/nvmf.html
- NVM Express specifications: https://nvmexpress.org/specifications/