Chapter Goal
This chapter explains how SPDK acts as an NVMe initiator and then exposes connected namespaces as SPDK bdevs. For diskengine, this is the core of baremetal mode: compute-side SPDK connects to storage-node NVMe-oF subsystems and turns each remote lvol namespace into a local bdev that can be assembled into RAID and exported to QEMU.
Beginner Mental Model
The NVMe-oF target chapter described a remote subsystem that exports a namespace. The initiator side is the mirror image. It asks:
Given an NQN and a transport address, can I create a controller connection and produce one or more local bdev names?
SPDK's bdev_nvme module wraps the lower-level NVMe library. The NVMe library knows how to connect, create admin and I/O qpairs, submit commands, poll completions, reconnect, and reset. The bdev module translates generic bdev I/O into NVMe namespace commands.
To a user of the bdev layer, the result is just another bdev. It can be a base for RAID, lvol, vhost, NBD, or another virtual bdev. The fact that it is remote over RDMA is hidden behind the bdev module.
Why This Matters For diskengine/excloud
diskengine baremetal mode does not write to storage-node lvols by calling storage-node APIs per I/O. It connects once through SPDK, assembles local bdev graph objects, and then data I/O stays in SPDK. The Go process reconciles attachment state; it is not in the write fast path.
The key diskengine source anchors are:
- /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/nvme_attach.go: startNvmeAttachLoop
- /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/nvme_attach.go: reconcileNVMeConnections
- /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/nvme_attach.go: attachNvmeConnection
- /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/nvme_attach.go: controllerNameForNQN
- /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/utils.go: baseBdevNameFromNQN
- /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/spdkclient/wrappers.go: BdevNvmeAttachController
- /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/spdkclient/wrappers.go: BdevNvmeGetControllers
- /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/spdkclient/wrappers.go: BdevNvmeGetIoPaths
- /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/spdkclient/wrappers.go: BdevNvmeSetMultipathPolicy
RPC To bdev Creation
The RPC entry point is:
- module/bdev/nvme/bdev_nvme_rpc.c: rpc_bdev_nvme_attach_controller
- module/bdev/nvme/bdev_nvme_rpc.c: rpc_bdev_nvme_attach_controller_decoders
- module/bdev/nvme/bdev_nvme_rpc.c: rpc_bdev_nvme_attach_controller_done
- module/bdev/nvme/bdev_nvme_rpc.c: rpc_bdev_nvme_attach_controller_examined
That handler decodes fields such as name, trtype, traddr, trsvcid, adrfam, subnqn, hostnqn, hostaddr, multipath, ctrlr_loss_timeout_sec, reconnect_delay_sec, and fast_io_fail_timeout_sec. It then calls:
module/bdev/nvme/bdev_nvme.c: spdk_bdev_nvme_create
The public header for the module is:
- include/spdk/module/bdev/nvme.h: spdk_bdev_nvme_create
- include/spdk/module/bdev/nvme.h: spdk_bdev_nvme_delete
- include/spdk/module/bdev/nvme.h: spdk_bdev_nvme_set_multipath_policy
- include/spdk/module/bdev/nvme.h: struct spdk_bdev_nvme_ctrlr_opts
Controller and bdev object creation flows through:
- module/bdev/nvme/bdev_nvme.c: nvme_bdev_ctrlr_create
- module/bdev/nvme/bdev_nvme.c: nvme_bdev_create
- module/bdev/nvme/bdev_nvme.c: bdev_nvme_create_ctrlr_channel_cb
- module/bdev/nvme/bdev_nvme.c: bdev_nvme_create_bdev_channel_cb
- module/bdev/nvme/bdev_nvme.c: bdev_nvme_get_io_channel
The module registers a bdev function table with:
module/bdev/nvme/bdev_nvme.c: nvmelib_fn_table
The function table is where the bdev layer learns how to submit I/O to this module.
bdev I/O To NVMe Command
Once the controller and namespace bdev exist, normal bdev I/O enters:
- module/bdev/nvme/bdev_nvme.c: bdev_nvme_submit_request_initial
- module/bdev/nvme/bdev_nvme.c: bdev_nvme_submit_request
- module/bdev/nvme/bdev_nvme.c: _bdev_nvme_submit_request
Reads and writes reach:
- module/bdev/nvme/bdev_nvme.c: bdev_nvme_readv
- module/bdev/nvme/bdev_nvme.c: bdev_nvme_writev
- module/bdev/nvme/bdev_nvme.c: bdev_nvme_readv_done
- module/bdev/nvme/bdev_nvme.c: bdev_nvme_writev_done
The actual NVMe namespace commands are lower-level library calls:
- lib/nvme/nvme_ns_cmd.c: spdk_nvme_ns_cmd_readv
- lib/nvme/nvme_ns_cmd.c: spdk_nvme_ns_cmd_writev
- lib/nvme/nvme_ns_cmd.c: spdk_nvme_ns_cmd_read_ext
- lib/nvme/nvme_ns_cmd.c: spdk_nvme_ns_cmd_write_ext
For RDMA transport mechanics, start at:
- lib/nvme/nvme_rdma.c
- lib/nvme/nvme_fabric.c
- lib/nvme/nvme_qpair.c
- lib/nvme/nvme_poll_group.c
Prose Diagram: Initiator Object Stack
Draw a top-down stack:
- diskengine baremetal loop.
- SPDK JSON-RPC bdev_nvme_attach_controller.
- bdev_nvme controller object.
- NVMe controller connection and admin qpair.
- Per-thread bdev channel.
- NVMe I/O qpair.
- Remote NVMe-oF subsystem namespace.
- Storage-node lvol bdev.
Next to the stack, draw a separate horizontal data path:
bdev write to NvmeRemoteNqn1 -> bdev_nvme_submit_request -> spdk_nvme_ns_cmd_write* -> RDMA/TCP qpair -> storage-node NVMf target -> storage-node bdev.
The diagram should show diskengine only above the stack, not in the data path.
Multipath In The Beginner Model
Multipath means one logical NVMe bdev may have more than one path to storage. A path may be another controller connection or route to the same namespace. The policy chooses which path receives I/O, and failover behavior controls what happens when a path degrades or disappears.
SPDK source anchors:
- module/bdev/nvme/bdev_nvme_rpc.c: rpc_bdev_nvme_get_io_paths
- module/bdev/nvme/bdev_nvme_rpc.c: _rpc_bdev_nvme_get_io_paths
- module/bdev/nvme/bdev_nvme_rpc.c: rpc_bdev_nvme_set_multipath_policy
- module/bdev/nvme/bdev_nvme.c: nvme_bdev_channel
- module/bdev/nvme/bdev_nvme.h: enum spdk_bdev_nvme_multipath_policy
diskengine baremetal attaches with deterministic controller names derived from NQN, refreshes I/O path information, and calls bdev_nvme_set_multipath_policy for managed bdevs. This matters because storage nodes and networks fail independently. A single logical volume may have primary and secondary lvol placements.
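As a rough illustration of the policy call mentioned above, the following sketch builds the JSON-RPC 2.0 request body for bdev_nvme_set_multipath_policy. The policy values (active_passive, active_active) and the optional selector field reflect my reading of the SPDK JSON-RPC documentation and should be verified against the SPDK version in use. The bdev name is hypothetical, and the source does not say which policy diskengine actually sets.

```python
import json

# Policy names as I understand the SPDK JSON-RPC docs; verify for your version.
VALID_POLICIES = ("active_passive", "active_active")


def multipath_policy_request(bdev_name, policy, selector=None, request_id=1):
    """Build a JSON-RPC 2.0 request dict for bdev_nvme_set_multipath_policy."""
    if policy not in VALID_POLICIES:
        raise ValueError("unknown multipath policy: %r" % (policy,))
    params = {"name": bdev_name, "policy": policy}
    if selector is not None:
        # Path selector (for example "round_robin"); relevant to active_active.
        params["selector"] = selector
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "bdev_nvme_set_multipath_policy",
        "params": params,
    }


if __name__ == "__main__":
    # "examplectrlrn1" is a hypothetical managed bdev name.
    request = multipath_policy_request("examplectrlrn1", "active_passive")
    print(json.dumps(request, indent=2))
```

Note that the request names the bdev, not the controller: the policy applies to the logical namespace bdev that may sit on top of several controller paths.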
Reconnect, Loss Timeout, And Fast Fail
Three knobs are easy to confuse:
- reconnect_delay_sec: how long to wait between reconnect attempts.
- ctrlr_loss_timeout_sec: how long a controller may be lost before SPDK gives up.
- fast_io_fail_timeout_sec: how soon queued I/O may fail while the controller is unavailable.
diskengine's baremetal attach uses short, reconnect-oriented values. They are referenced in /home/lolwierd/Projects/excloud/diskengine/diskengine/docs/baremetal.md, and the code constructs the params in internal/baremetal/nvme_attach.go: attachNvmeConnection.
The operational tradeoff is simple: a long loss timeout hides transient network loss but can make guest I/O appear hung. A short fast-fail timeout reveals problems quickly but may surface transient blips to the VM.
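To make the interaction of the three knobs concrete, here is a deliberately simplified timeline model. It is not SPDK's state machine. It assumes that ctrlr_loss_timeout_sec of -1 means "retry forever", that a zero reconnect_delay_sec means no reconnect is attempted, and that a zero fast_io_fail_timeout_sec disables fast fail, which is my reading of the SPDK JSON-RPC documentation. The numeric values in the usage lines are illustrative and are not diskengine's settings.

```python
def outage_outcome(outage_sec, reconnect_delay_sec, ctrlr_loss_timeout_sec,
                   fast_io_fail_timeout_sec):
    """Classify what an outage of outage_sec seconds looks like (simplified)."""
    if reconnect_delay_sec <= 0:
        # Assumed: no reconnect attempts, so the controller fails outright.
        return {"controller": "failed", "io": "failed", "reconnect_attempts": 0}

    # The controller is given up once the outage reaches the loss timeout.
    gave_up = (ctrlr_loss_timeout_sec != -1
               and outage_sec >= ctrlr_loss_timeout_sec)
    retry_window = ctrlr_loss_timeout_sec if gave_up else outage_sec
    attempts = int(retry_window // reconnect_delay_sec)
    if gave_up:
        return {"controller": "given_up", "io": "failed",
                "reconnect_attempts": attempts}

    # Still reconnecting: queued I/O fails only after the fast-fail timeout.
    fast_failed = (fast_io_fail_timeout_sec > 0
                   and outage_sec >= fast_io_fail_timeout_sec)
    return {"controller": "reconnecting",
            "io": "failed" if fast_failed else "queued",
            "reconnect_attempts": attempts}


if __name__ == "__main__":
    # Illustrative knob values: 1 s reconnect delay, 10 s loss, 5 s fast fail.
    for outage in (3, 6, 12):
        print(outage, outage_outcome(outage, 1, 10, 5))
```

With these example values, a 3-second blip stays invisible to the guest apart from latency, a 6-second outage surfaces I/O errors while SPDK keeps reconnecting, and a 12-second outage exceeds the loss timeout.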
Edge Cases And Failure Modes
Attach succeeds but no bdev appears:
The controller may connect while namespace discovery or bdev examine lags. Check the RPC return from bdev_nvme_attach_controller, bdev_get_bdevs, and bdev_nvme_get_controllers.
Controller name collision:
SPDK controller names are not just labels. diskengine derives names from NQN using controllerNameForNQN; if naming changes, cleanup and base-bdev derivation can break.
NQN maps to expected bdev name incorrectly:
diskengine assumes one namespace per subsystem in baseBdevNameFromNQN, producing a controller-derived name plus n1. If a subsystem exports multiple namespaces, that assumption fails.
Local RDMA source address missing:
Baremetal attach may fail before SPDK connect if diskengine cannot find a usable local RDMA interface. See internal/baremetal/nvme_attach.go: localRDMAHostAddr.
Concurrent attach/detach:
The diskengine client notes in internal/spdkclient/coord.go: LockController explain why overlapping controller operations are risky. Even when SPDK is asynchronous, object state transitions can be externally visible through RPCs.
Reset or reconnect during bdev graph inspection:
diskengine avoids some bdev_get_bdevs paths during reset because production crashes were observed. See internal/baremetal/utils.go: areBaseBdevsReady and internal/baremetal/utils.go: areBaseBdevsPresentViaIoPaths.
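The naming edge cases above (controller name collision and the one-namespace assumption) can be sketched as follows. SPDK names namespace bdevs as the controller name followed by n and the namespace ID. The hashing scheme here is hypothetical: the source does not show what controllerNameForNQN actually does, so controller_name_for_nqn is only a stand-in that is deterministic for a given NQN.

```python
import hashlib


def controller_name_for_nqn(nqn):
    """Hypothetical deterministic controller name; NOT diskengine's scheme."""
    digest = hashlib.sha256(nqn.encode("utf-8")).hexdigest()[:12]
    return "nvme_" + digest


def bdev_name(controller_name, nsid):
    """SPDK convention: namespace bdevs are named <controller>n<nsid>."""
    return "%sn%d" % (controller_name, nsid)


def base_bdev_name_from_nqn(nqn):
    """Mirror of the one-namespace assumption: always namespace ID 1."""
    return bdev_name(controller_name_for_nqn(nqn), 1)


if __name__ == "__main__":
    nqn = "nqn.2024-01.io.excloud:storage.nodeA.disk123.lvol456"
    ctrlr = controller_name_for_nqn(nqn)
    print("controller:", ctrlr)
    print("assumed base bdev:", base_bdev_name_from_nqn(nqn))
    # If the subsystem exported namespaces 1 and 2, the second bdev would
    # never be derived by the one-namespace helper.
    print("all bdevs for nsids 1, 2:", [bdev_name(ctrlr, n) for n in (1, 2)])
```

Because the base bdev name is derived rather than discovered, any change to the controller naming function silently changes every derived bdev name, which is why cleanup and RAID assembly depend on it staying stable.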
Misconceptions To Kill
"The remote lvol is copied to baremetal when attached."
No. Attach creates a controller connection and local bdev representation. Data is read and written remotely as I/O occurs.
"A controller is the same thing as a bdev."
No. A controller can expose one or more namespaces. Each namespace can become a bdev. diskengine mostly assumes one namespace, but the concepts remain separate.
"Multipath means RAID."
No. Multipath is multiple transport paths to the same logical namespace. RAID combines multiple bdevs into another bdev.
"If bdev_nvme_get_controllers shows enabled, all data paths are healthy."
Not always. Use bdev_nvme_get_io_paths, bdev I/O stats, and RAID state to understand actual path use.
Lab: Attach Sequence Walkthrough
Given:
- NQN: nqn.2024-01.io.excloud:storage.nodeA.disk123.lvol456
- Target RDMA IP: 10.10.1.20
- Target RDMA port: 4420
- Local host RDMA IP: 10.10.2.30
Write the intended bdev_nvme_attach_controller params:
- name: deterministic name from NQN.
- trtype: RDMA.
- adrfam: IPv4 or IPv6.
- traddr: target IP.
- trsvcid: target port.
- subnqn: NQN.
- hostaddr: local RDMA IP if required.
- reconnect/loss/fast-fail values.
- multipath: enabled where diskengine expects it.
Then inspect module/bdev/nvme/bdev_nvme_rpc.c: rpc_bdev_nvme_attach_controller_decoders and verify every field exists.
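One way to check your answer to this lab is to build the request as data and compare its keys against the decoder table. The sketch below assembles a JSON-RPC 2.0 request for bdev_nvme_attach_controller using the lab values. The controller name and the three timeout values are placeholders, not diskengine's real values, and the multipath string "multipath" is my understanding of the accepted values (disable, failover, multipath) from the SPDK JSON-RPC documentation.

```python
import json


def attach_controller_request(name, subnqn, traddr, trsvcid, hostaddr=None,
                              adrfam="ipv4", multipath="multipath",
                              reconnect_delay_sec=0, ctrlr_loss_timeout_sec=0,
                              fast_io_fail_timeout_sec=0, request_id=1):
    """Build a JSON-RPC 2.0 request dict for bdev_nvme_attach_controller."""
    params = {
        "name": name,
        "trtype": "RDMA",
        "adrfam": adrfam,
        "traddr": traddr,
        "trsvcid": str(trsvcid),  # the service ID is carried as a string
        "subnqn": subnqn,
        "multipath": multipath,
        "reconnect_delay_sec": reconnect_delay_sec,
        "ctrlr_loss_timeout_sec": ctrlr_loss_timeout_sec,
        "fast_io_fail_timeout_sec": fast_io_fail_timeout_sec,
    }
    if hostaddr is not None:
        # Local source address, only set when the fabric requires it.
        params["hostaddr"] = hostaddr
    return {"jsonrpc": "2.0", "id": request_id,
            "method": "bdev_nvme_attach_controller", "params": params}


if __name__ == "__main__":
    request = attach_controller_request(
        name="example_ctrlr",  # placeholder for the NQN-derived name
        subnqn="nqn.2024-01.io.excloud:storage.nodeA.disk123.lvol456",
        traddr="10.10.1.20",
        trsvcid=4420,
        hostaddr="10.10.2.30",
        reconnect_delay_sec=1,        # placeholder value
        ctrlr_loss_timeout_sec=10,    # placeholder value
        fast_io_fail_timeout_sec=5,   # placeholder value
    )
    print(json.dumps(request, indent=2))
```

Every key under params should have a matching entry in rpc_bdev_nvme_attach_controller_decoders; a key with no decoder is a sign that the field name or the SPDK version is not what you expected.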
Source Reading Exercise
Start at module/bdev/nvme/bdev_nvme_rpc.c: rpc_bdev_nvme_attach_controller. Follow the path until bdev registration:
- Where is JSON decoded?
- Where is spdk_bdev_nvme_create called?
- Where does a namespace become an SPDK bdev?
- Where is the bdev function table installed?
Then start at module/bdev/nvme/bdev_nvme.c: bdev_nvme_writev and identify where the completion callback returns to the bdev layer.
Operational Debug Exercise
Symptom: RAID stays in the configuring state on baremetal.
Check in this order:
- Does bdev_nvme_get_controllers show enabled controllers for all required NQNs?
- Does bdev_nvme_get_io_paths show bdev names matching diskengine's baseBdevNameFromNQN output?
- Does bdev_raid_get_bdevs category=all show the RAID with missing bases?
- Did diskengine skip bdev_get_bdevs and rely on controller/path checks during reset?
- Are storage-node exports still present in nvmf_get_subsystems?
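The first three checks can be sketched as a function over already-fetched RPC results. The JSON shapes assumed here (a ctrlrs list with a state field, poll_groups containing io_paths with connected and accessible flags, and base_bdevs_list entries with is_configured) are my recollection of recent SPDK output and may differ between versions, so treat them as assumptions to confirm against real output. The controller and bdev names are hypothetical.

```python
def triage(expected, controllers, io_paths, raid_bdevs):
    """Return findings in check order.

    expected: dict mapping controller name -> expected base bdev name.
    controllers: parsed result of bdev_nvme_get_controllers (assumed shape).
    io_paths: parsed result of bdev_nvme_get_io_paths (assumed shape).
    raid_bdevs: parsed result of bdev_raid_get_bdevs (assumed shape).
    """
    findings = []

    # Check 1: every required controller has at least one enabled path.
    enabled = set()
    for ctrlr in controllers:
        if any(p.get("state") == "enabled" for p in ctrlr.get("ctrlrs", [])):
            enabled.add(ctrlr.get("name"))
    for name in sorted(expected):
        if name not in enabled:
            findings.append("controller %s: not enabled or missing" % name)

    # Check 2: every expected bdev has a connected, accessible I/O path.
    usable = set()
    for group in io_paths.get("poll_groups", []):
        for path in group.get("io_paths", []):
            if path.get("connected") and path.get("accessible"):
                usable.add(path.get("bdev_name"))
    for name in sorted(expected):
        if expected[name] not in usable:
            findings.append("bdev %s: no connected, accessible I/O path"
                            % expected[name])

    # Check 3: RAID bdevs with base slots that are not configured.
    for raid in raid_bdevs:
        missing = [i for i, base in enumerate(raid.get("base_bdevs_list", []))
                   if not base.get("is_configured")]
        if missing:
            findings.append("raid %s (%s): unconfigured base slots %s"
                            % (raid.get("name"), raid.get("state"), missing))
    return findings


if __name__ == "__main__":
    findings = triage(
        expected={"ctrlA": "ctrlAn1", "ctrlB": "ctrlBn1"},
        controllers=[{"name": "ctrlA", "ctrlrs": [{"state": "enabled"}]}],
        io_paths={"poll_groups": [{"io_paths": [
            {"bdev_name": "ctrlAn1", "connected": True, "accessible": True}]}]},
        raid_bdevs=[{"name": "raid1", "state": "configuring",
                     "base_bdevs_list": [
                         {"name": "ctrlAn1", "is_configured": True},
                         {"name": None, "is_configured": False}]}],
    )
    for line in findings:
        print(line)
```

The last two checks in the list are not covered: whether diskengine skipped bdev_get_bdevs is a question for diskengine's own logs, and nvmf_get_subsystems has to be run on the storage node, not the compute node.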
Self-Check
- What is the difference between subnqn and traddr?
- Why does diskengine need deterministic controller names?
- Which SPDK function is the RPC bridge into NVMe bdev creation?
- Why is bdev_nvme still a bdev module even when the target is remote?
- What is the risk of assuming every subsystem has exactly one namespace?
References
- Local SPDK: module/bdev/nvme/bdev_nvme_rpc.c
- Local SPDK: module/bdev/nvme/bdev_nvme.c
- Local SPDK: include/spdk/module/bdev/nvme.h
- Local SPDK: lib/nvme/nvme_rdma.c
- Local diskengine: /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/nvme_attach.go
- Local diskengine: /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/utils.go
- SPDK NVMe driver documentation: https://spdk.io/doc/nvme.html
- SPDK bdev documentation: https://spdk.io/doc/bdev.html