SPDK From First Principles

SPDK deep learning path

Chapter 23: NVMe-oF Initiator Through `bdev_nvme`

Source: drafts/transport-diskengine/23-nvme-of-initiator-bdev-nvme.md

Chapter Goal

This chapter explains how SPDK acts as an NVMe initiator and then exposes connected namespaces as SPDK bdevs. For diskengine, this is the core of baremetal mode: compute-side SPDK connects to storage-node NVMe-oF subsystems and turns each remote lvol namespace into a local bdev that can be assembled into RAID and exported to QEMU.

Beginner Mental Model

The NVMe-oF target chapter described a remote subsystem that exports a namespace. The initiator side is the mirror image. It asks:

Given an NQN and a transport address, can I create a controller connection and produce one or more local bdev names?

SPDK's bdev_nvme module wraps the lower-level NVMe library. The NVMe library knows how to connect, create admin and I/O qpairs, submit commands, poll completions, reconnect, and reset. The bdev module translates generic bdev I/O into NVMe namespace commands.

To a user of the bdev layer, the result is just another bdev. It can be a base for RAID, lvol, vhost, NBD, or another virtual bdev. The fact that the namespace is remote, reached over RDMA or TCP, is hidden behind the bdev module.

Why This Matters For diskengine/excloud

In baremetal mode, diskengine does not call storage-node APIs for each I/O to a storage-node lvol. It connects once through SPDK and assembles the local bdev graph; from then on, data I/O stays inside SPDK. The Go process reconciles attachment state and is not in the write fast path.
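The reconcile idea can be sketched as a set difference between the NQNs that should be attached and the NQNs that currently are. This is a minimal illustration, not the logic of reconcileNVMeConnections: the function name plan_reconcile and the plain NQN lists are assumptions made for the example.

```python
def plan_reconcile(desired_nqns, attached_nqns):
    """Return (to_attach, to_detach) as sorted lists of NQNs.

    Hypothetical helper: the control plane only decides which
    connections to add or drop. It never touches data I/O.
    """
    desired = set(desired_nqns)
    attached = set(attached_nqns)
    to_attach = sorted(desired - attached)   # wanted but not connected yet
    to_detach = sorted(attached - desired)   # connected but no longer wanted
    return to_attach, to_detach


if __name__ == "__main__":
    attach, detach = plan_reconcile(["nqn.a", "nqn.b"], ["nqn.b", "nqn.c"])
    print("attach:", attach)
    print("detach:", detach)
```

Running the plan repeatedly against the same inputs gives the same answer, which is what makes a periodic reconcile loop safe to retry.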

The key diskengine source anchors are:

  • /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/nvme_attach.go: startNvmeAttachLoop
  • /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/nvme_attach.go: reconcileNVMeConnections
  • /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/nvme_attach.go: attachNvmeConnection
  • /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/nvme_attach.go: controllerNameForNQN
  • /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/utils.go: baseBdevNameFromNQN
  • /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/spdkclient/wrappers.go: BdevNvmeAttachController
  • /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/spdkclient/wrappers.go: BdevNvmeGetControllers
  • /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/spdkclient/wrappers.go: BdevNvmeGetIoPaths
  • /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/spdkclient/wrappers.go: BdevNvmeSetMultipathPolicy

RPC To bdev Creation

The RPC entry point and its supporting decoder and callbacks are:

  • module/bdev/nvme/bdev_nvme_rpc.c: rpc_bdev_nvme_attach_controller
  • module/bdev/nvme/bdev_nvme_rpc.c: rpc_bdev_nvme_attach_controller_decoders
  • module/bdev/nvme/bdev_nvme_rpc.c: rpc_bdev_nvme_attach_controller_done
  • module/bdev/nvme/bdev_nvme_rpc.c: rpc_bdev_nvme_attach_controller_examined

That handler decodes fields such as name, trtype, traddr, trsvcid, adrfam, subnqn, hostnqn, hostaddr, multipath, ctrlr_loss_timeout_sec, reconnect_delay_sec, and fast_io_fail_timeout_sec. It then calls:

  • module/bdev/nvme/bdev_nvme.c: spdk_bdev_nvme_create

The public header for the module is:

  • include/spdk/module/bdev/nvme.h: spdk_bdev_nvme_create
  • include/spdk/module/bdev/nvme.h: spdk_bdev_nvme_delete
  • include/spdk/module/bdev/nvme.h: spdk_bdev_nvme_set_multipath_policy
  • include/spdk/module/bdev/nvme.h: struct spdk_bdev_nvme_ctrlr_opts

Controller and bdev object creation flows through:

  • module/bdev/nvme/bdev_nvme.c: nvme_bdev_ctrlr_create
  • module/bdev/nvme/bdev_nvme.c: nvme_bdev_create
  • module/bdev/nvme/bdev_nvme.c: bdev_nvme_create_ctrlr_channel_cb
  • module/bdev/nvme/bdev_nvme.c: bdev_nvme_create_bdev_channel_cb
  • module/bdev/nvme/bdev_nvme.c: bdev_nvme_get_io_channel

The module registers a bdev function table with:

  • module/bdev/nvme/bdev_nvme.c: nvmelib_fn_table

The function table is where the bdev layer learns how to submit I/O to this module.

bdev I/O To NVMe Command

Once the controller and namespace bdev exist, normal bdev I/O enters:

  • module/bdev/nvme/bdev_nvme.c: bdev_nvme_submit_request_initial
  • module/bdev/nvme/bdev_nvme.c: bdev_nvme_submit_request
  • module/bdev/nvme/bdev_nvme.c: _bdev_nvme_submit_request

Reads and writes reach:

  • module/bdev/nvme/bdev_nvme.c: bdev_nvme_readv
  • module/bdev/nvme/bdev_nvme.c: bdev_nvme_writev
  • module/bdev/nvme/bdev_nvme.c: bdev_nvme_readv_done
  • module/bdev/nvme/bdev_nvme.c: bdev_nvme_writev_done

The actual NVMe namespace commands are lower-level library calls:

  • lib/nvme/nvme_ns_cmd.c: spdk_nvme_ns_cmd_readv
  • lib/nvme/nvme_ns_cmd.c: spdk_nvme_ns_cmd_writev
  • lib/nvme/nvme_ns_cmd.c: spdk_nvme_ns_cmd_read_ext
  • lib/nvme/nvme_ns_cmd.c: spdk_nvme_ns_cmd_write_ext

For RDMA transport mechanics, start at:

  • lib/nvme/nvme_rdma.c
  • lib/nvme/nvme_fabric.c
  • lib/nvme/nvme_qpair.c
  • lib/nvme/nvme_poll_group.c

Prose Diagram: Initiator Object Stack

Draw a top-down stack:

  1. diskengine baremetal loop.
  2. SPDK JSON-RPC bdev_nvme_attach_controller.
  3. bdev_nvme controller object.
  4. NVMe controller connection and admin qpair.
  5. Per-thread bdev channel.
  6. NVMe I/O qpair.
  7. Remote NVMe-oF subsystem namespace.
  8. Storage-node lvol bdev.

Next to the stack, draw a separate horizontal data path:

bdev write to NvmeRemoteNqn1 -> bdev_nvme_submit_request -> spdk_nvme_ns_cmd_write* -> RDMA/TCP qpair -> storage-node NVMf target -> storage-node bdev.

The diagram should show diskengine only above the stack, not in the data path.

Multipath In The Beginner Model

Multipath means one logical NVMe bdev may have more than one path to storage. A path may be another controller connection or route to the same namespace. The policy chooses which path receives I/O, and failover behavior controls what happens when a path degrades or disappears.
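A toy model can make the policy and failover ideas concrete. The sketch below is not SPDK's implementation; the class ToyMultipath and its methods are invented for illustration, and round robin stands in for whatever selector an active-active policy actually uses.

```python
class ToyMultipath:
    """Toy model of path selection for one logical bdev (illustration only)."""

    def __init__(self, paths, policy="active_passive"):
        self.paths = list(paths)        # ordered path names, first is preferred
        self.healthy = set(self.paths)  # paths currently usable
        self.policy = policy
        self._next = 0                  # round-robin cursor for active_active

    def fail(self, path):
        """Mark a path as unusable, as if its connection was lost."""
        self.healthy.discard(path)

    def pick(self):
        """Choose the path for the next I/O."""
        usable = [p for p in self.paths if p in self.healthy]
        if not usable:
            raise RuntimeError("no usable path")
        if self.policy == "active_passive":
            return usable[0]            # always the first healthy path
        path = usable[self._next % len(usable)]
        self._next += 1
        return path


if __name__ == "__main__":
    mp = ToyMultipath(["pathA", "pathB"])
    print(mp.pick())
    mp.fail("pathA")                    # failover: the next pick moves to pathB
    print(mp.pick())
```

In the active-passive case the second path carries no I/O until the first one fails, which is why a healthy-looking standby path says little about whether it will work under load.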

SPDK source anchors:

  • module/bdev/nvme/bdev_nvme_rpc.c: rpc_bdev_nvme_get_io_paths
  • module/bdev/nvme/bdev_nvme_rpc.c: _rpc_bdev_nvme_get_io_paths
  • module/bdev/nvme/bdev_nvme_rpc.c: rpc_bdev_nvme_set_multipath_policy
  • module/bdev/nvme/bdev_nvme.c: nvme_bdev_channel
  • module/bdev/nvme/bdev_nvme.h: enum spdk_bdev_nvme_multipath_policy

diskengine baremetal attaches with deterministic controller names derived from NQN, refreshes I/O path information, and calls bdev_nvme_set_multipath_policy for managed bdevs. This matters because storage nodes and networks fail independently. A single logical volume may have primary and secondary lvol placements.
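As a sketch of what the policy call looks like on the wire, the helper below builds a JSON-RPC request body for bdev_nvme_set_multipath_policy. The parameter names (name, policy, selector) follow the SPDK JSON-RPC documentation as far as I know; the helper name, the example bdev name, and the chosen policy are assumptions, not diskengine's actual settings.

```python
import json

VALID_POLICIES = ("active_passive", "active_active")


def multipath_policy_request(bdev_name, policy, selector=None, request_id=1):
    """Build a JSON-RPC 2.0 request dict (hypothetical helper, nothing is sent)."""
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    params = {"name": bdev_name, "policy": policy}   # name is the bdev, not the controller
    if selector is not None:
        params["selector"] = selector                # e.g. "round_robin"
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "bdev_nvme_set_multipath_policy",
        "params": params,
    }


if __name__ == "__main__":
    req = multipath_policy_request("nvme_examplen1", "active_passive")
    print(json.dumps(req, indent=2))
```

Note that the policy is set per bdev, so the call only makes sense after the namespace bdev exists.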

Reconnect, Loss Timeout, And Fast Fail

Three knobs are easy to confuse:

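  • reconnect_delay_sec: how long to wait between reconnect attempts.
  • ctrlr_loss_timeout_sec: how long a controller may stay disconnected before SPDK gives up and deletes it. A value of -1 means retry forever; 0 means do not retry at all.
  • fast_io_fail_timeout_sec: how long I/O may stay queued while the controller is unavailable before it is failed back to the caller. A value of 0 disables this timeout.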
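The three values constrain each other, and a small checker makes the relationships easier to remember. The rules below follow my reading of the SPDK JSON-RPC documentation for these parameters; exact boundary handling may differ between SPDK versions, so treat this as a sketch. The function name check_reconnect_params is an assumption.

```python
def check_reconnect_params(ctrlr_loss_timeout_sec, reconnect_delay_sec,
                           fast_io_fail_timeout_sec):
    """Return a list of problems; an empty list means the combination looks valid."""
    loss = ctrlr_loss_timeout_sec
    delay = reconnect_delay_sec
    fast = fast_io_fail_timeout_sec
    problems = []
    if loss < -1:
        problems.append("ctrlr_loss_timeout_sec must be -1, 0, or positive")
    elif loss == 0:
        # No reconnect retry at all: the other two knobs have nothing to act on.
        if delay != 0 or fast != 0:
            problems.append("with ctrlr_loss_timeout_sec=0 the other values must be 0")
    else:
        if delay <= 0:
            problems.append("reconnect_delay_sec must be positive when retrying")
        elif loss != -1 and delay >= loss:
            problems.append("reconnect_delay_sec must be below ctrlr_loss_timeout_sec")
        if fast != 0:
            if fast < delay:
                problems.append("fast_io_fail_timeout_sec must not be below reconnect_delay_sec")
            if loss != -1 and fast >= loss:
                problems.append("fast_io_fail_timeout_sec must be below ctrlr_loss_timeout_sec")
    return problems


if __name__ == "__main__":
    # Placeholder values for illustration, not diskengine's configuration.
    print(check_reconnect_params(10, 2, 5))
    print(check_reconnect_params(0, 1, 0))
```

A useful way to read a valid combination is as an ordering: reconnect delay first, then fast I/O fail, then controller loss.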

diskengine uses short reconnect-oriented values in baremetal attach references from /home/lolwierd/Projects/excloud/diskengine/diskengine/docs/baremetal.md, and the code constructs those params in internal/baremetal/nvme_attach.go: attachNvmeConnection.

The operational tradeoff is simple: a long loss timeout hides transient network loss but can make guest I/O appear hung. A short fast-fail timeout reveals problems quickly but may surface transient blips to the VM.

Edge Cases And Failure Modes

Attach succeeds but no bdev appears:

The controller may connect before namespace discovery or bdev examine has finished. The attach RPC returns the names of the bdevs it created, so check that return value from bdev_nvme_attach_controller first, then compare it with bdev_get_bdevs and bdev_nvme_get_controllers.
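One way to handle the lag is a bounded poll rather than a single check. The sketch below assumes a caller-supplied function list_bdev_names that returns the bdev names currently visible, whatever RPC backs it; both that callable and the helper name wait_for_bdevs are hypothetical.

```python
import time


def wait_for_bdevs(list_bdev_names, expected, attempts=5, delay_sec=0.2):
    """Poll until every expected bdev name is listed.

    Returns the sorted names that are still missing after the last attempt,
    so an empty list means success.
    """
    missing = set(expected)
    for attempt in range(attempts):
        missing = set(expected) - set(list_bdev_names())
        if not missing:
            break
        if attempt + 1 < attempts:
            time.sleep(delay_sec)       # back off before asking again
    return sorted(missing)


if __name__ == "__main__":
    # Fake source: the bdev shows up on the third query.
    answers = iter([[], [], ["nvme_examplen1"]])
    print(wait_for_bdevs(lambda: next(answers), ["nvme_examplen1"], delay_sec=0))
```

Keeping the attempt count bounded matters: an attach that never produces a bdev should surface as an error instead of a silent wait.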

Controller name collision:

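SPDK controller names are not just labels. The name given at attach time becomes the prefix of every namespace bdev name (for example, controller Nvme0 with namespace 1 yields Nvme0n1), and it identifies the controller in later RPCs such as detach. diskengine derives names from the NQN using controllerNameForNQN; if the naming scheme changes, cleanup and base-bdev derivation can break.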

NQN maps to expected bdev name incorrectly:

diskengine assumes one namespace per subsystem in baseBdevNameFromNQN, producing a controller-derived name plus n1. If a subsystem exports multiple namespaces, that assumption fails.
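The naming assumption is easy to see in code. SPDK names a namespace bdev as the controller name followed by n and the namespace ID. The hashing scheme below is purely hypothetical and is not what controllerNameForNQN does; it only shows what "deterministic" means and where the one-namespace assumption enters.

```python
import hashlib


def controller_name_for_nqn(nqn):
    """Hypothetical deterministic name: same NQN in, same name out."""
    digest = hashlib.sha256(nqn.encode("utf-8")).hexdigest()[:12]
    return "nvme_" + digest


def bdev_name(controller_name, nsid=1):
    """SPDK-style namespace bdev name: <controller>n<nsid>."""
    return f"{controller_name}n{nsid}"


def base_bdev_name_from_nqn(nqn):
    """Bakes in the single-namespace assumption by always using nsid 1."""
    return bdev_name(controller_name_for_nqn(nqn), 1)


if __name__ == "__main__":
    nqn = "nqn.2024-01.io.excloud:storage.nodeA.disk123.lvol456"
    print(base_bdev_name_from_nqn(nqn))
    # A second namespace on the same subsystem would get a different bdev name
    # that the single-namespace helper never produces.
    print(bdev_name(controller_name_for_nqn(nqn), 2))
```

Because the helper never asks which namespace IDs exist, a subsystem that exports namespace 2 instead of 1 would attach successfully while the expected base bdev never appears.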

Local RDMA source address missing:

Baremetal attach may fail before SPDK connect if diskengine cannot find a usable local RDMA interface. See internal/baremetal/nvme_attach.go: localRDMAHostAddr.

Concurrent attach/detach:

The notes on LockController in internal/spdkclient/coord.go explain why overlapping controller operations are risky. Because SPDK completes these operations asynchronously, intermediate object states can be visible to other callers through RPCs.

Reset or reconnect during bdev graph inspection:

diskengine avoids some bdev_get_bdevs paths during reset because production crashes were observed. See internal/baremetal/utils.go: areBaseBdevsReady and internal/baremetal/utils.go: areBaseBdevsPresentViaIoPaths.

Misconceptions To Kill

"The remote lvol is copied to baremetal when attached."

No. Attach creates a controller connection and local bdev representation. Data is read and written remotely as I/O occurs.

"A controller is the same thing as a bdev."

No. A controller can expose one or more namespaces. Each namespace can become a bdev. diskengine mostly assumes one namespace, but the concepts remain separate.

"Multipath means RAID."

No. Multipath is multiple transport paths to the same logical namespace. RAID combines multiple bdevs into another bdev.

"If bdev_nvme_get_controllers shows enabled, all data paths are healthy."

Not always. Use bdev_nvme_get_io_paths, bdev I/O stats, and RAID state to understand actual path use.
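A short sketch of reading path state from a bdev_nvme_get_io_paths result follows. The response shape used here (poll_groups containing io_paths with bdev_name, connected, accessible, and transport fields) is assumed from the SPDK JSON-RPC documentation and may vary by version; the sample addresses are made up.

```python
def usable_paths(io_paths_result):
    """Map each bdev name to its distinct usable (traddr, trsvcid) paths."""
    found = {}
    for group in io_paths_result.get("poll_groups", []):
        for path in group.get("io_paths", []):
            addrs = found.setdefault(path["bdev_name"], set())
            # Count a path only if it is both connected and accessible.
            if path.get("connected") and path.get("accessible"):
                transport = path.get("transport", {})
                addrs.add((transport.get("traddr"), transport.get("trsvcid")))
    return {name: sorted(addrs) for name, addrs in found.items()}


SAMPLE = {
    "poll_groups": [
        {"thread": "t0", "io_paths": [
            {"bdev_name": "nvme_examplen1", "connected": True, "accessible": True,
             "transport": {"traddr": "10.10.1.20", "trsvcid": "4420"}},
            {"bdev_name": "nvme_examplen1", "connected": False, "accessible": False,
             "transport": {"traddr": "10.10.1.21", "trsvcid": "4420"}},
        ]},
        {"thread": "t1", "io_paths": [
            {"bdev_name": "nvme_examplen1", "connected": True, "accessible": True,
             "transport": {"traddr": "10.10.1.20", "trsvcid": "4420"}},
        ]},
    ]
}

if __name__ == "__main__":
    print(usable_paths(SAMPLE))
```

The same path appears once per poll group, so the helper deduplicates by transport address before counting. A bdev with an empty list has a controller entry but no path that can carry I/O.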

Lab: Attach Sequence Walkthrough

Given:

  • NQN: nqn.2024-01.io.excloud:storage.nodeA.disk123.lvol456
  • Target RDMA IP: 10.10.1.20
  • Target RDMA port: 4420
  • Local host RDMA IP: 10.10.2.30

Write the intended bdev_nvme_attach_controller params:

  • name: deterministic name from NQN.
  • trtype: RDMA.
  • adrfam: IPv4 or IPv6.
  • traddr: target IP.
  • trsvcid: target port.
  • subnqn: NQN.
  • hostaddr: local RDMA IP if required.
  • reconnect/loss/fast-fail values.
  • multipath: enabled where diskengine expects it.

Then inspect module/bdev/nvme/bdev_nvme_rpc.c: rpc_bdev_nvme_attach_controller_decoders and verify every field exists.
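The lab parameters can be written out as a JSON-RPC request body to compare against the decoder table. This is a sketch: the controller name nvme_lvol456 and the three timeout values are placeholders chosen for illustration, not diskengine's real naming or configuration, and the multipath value assumes the string form described in the SPDK JSON-RPC documentation.

```python
import json


def attach_controller_request(name, subnqn, traddr, trsvcid, hostaddr=None,
                              ctrlr_loss_timeout_sec=0, reconnect_delay_sec=0,
                              fast_io_fail_timeout_sec=0, request_id=1):
    """Build a bdev_nvme_attach_controller request dict (nothing is sent)."""
    params = {
        "name": name,
        "trtype": "RDMA",
        "adrfam": "IPv4",
        "traddr": traddr,
        "trsvcid": str(trsvcid),        # the service ID is passed as a string
        "subnqn": subnqn,
        "multipath": "multipath",
        "ctrlr_loss_timeout_sec": ctrlr_loss_timeout_sec,
        "reconnect_delay_sec": reconnect_delay_sec,
        "fast_io_fail_timeout_sec": fast_io_fail_timeout_sec,
    }
    if hostaddr is not None:
        params["hostaddr"] = hostaddr   # local source address, only if required
    return {"jsonrpc": "2.0", "id": request_id,
            "method": "bdev_nvme_attach_controller", "params": params}


if __name__ == "__main__":
    req = attach_controller_request(
        name="nvme_lvol456",            # placeholder for the NQN-derived name
        subnqn="nqn.2024-01.io.excloud:storage.nodeA.disk123.lvol456",
        traddr="10.10.1.20", trsvcid=4420, hostaddr="10.10.2.30",
        ctrlr_loss_timeout_sec=10, reconnect_delay_sec=2,
        fast_io_fail_timeout_sec=5)     # placeholder timeouts
    print(json.dumps(req, indent=2))
```

Every key under params should have a matching entry in rpc_bdev_nvme_attach_controller_decoders; a key without one is the first thing to investigate.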

Source Reading Exercise

Start at module/bdev/nvme/bdev_nvme_rpc.c: rpc_bdev_nvme_attach_controller. Follow the path until bdev registration:

  1. Where is JSON decoded?
  2. Where is spdk_bdev_nvme_create called?
  3. Where does a namespace become an SPDK bdev?
  4. Where is the bdev function table installed?

Then start at module/bdev/nvme/bdev_nvme.c: bdev_nvme_writev and identify where the completion callback returns to the bdev layer.

Operational Debug Exercise

Symptom: RAID stays configuring on baremetal.

Check in this order:

  1. Does bdev_nvme_get_controllers show enabled controllers for all required NQNs?
  2. Does bdev_nvme_get_io_paths show bdev names matching diskengine's baseBdevNameFromNQN output?
  3. Does bdev_raid_get_bdevs category=all show the RAID with missing bases?
  4. Did diskengine skip bdev_get_bdevs and rely on controller/path checks during reset?
  5. Are storage-node exports still present in nvmf_get_subsystems?
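Most of the checks above reduce to comparing an expected set of names against an observed set, in a fixed order. The sketch below captures that shape; it covers only the set-comparison steps, not the log inspection in step 4, and the helper name and sample data are assumptions.

```python
def first_failure(checks):
    """Return (label, missing_names) for the first check with gaps, else None.

    checks is an ordered list of (label, expected_names, observed_names).
    """
    for label, expected, observed in checks:
        missing = sorted(set(expected) - set(observed))
        if missing:
            return label, missing
    return None


if __name__ == "__main__":
    # Made-up observations: controllers are fine, one base bdev has no I/O path.
    checks = [
        ("controllers enabled", ["ctrl_a", "ctrl_b"], ["ctrl_a", "ctrl_b"]),
        ("io paths present", ["ctrl_an1", "ctrl_bn1"], ["ctrl_an1"]),
        ("raid base bdevs", ["ctrl_an1", "ctrl_bn1"], ["ctrl_an1"]),
    ]
    print(first_failure(checks))
```

Stopping at the first failing layer keeps the investigation at the lowest broken level: a missing I/O path explains a missing RAID base, not the other way around.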

Self-Check

  1. What is the difference between subnqn and traddr?
  2. Why does diskengine need deterministic controller names?
  3. Which SPDK function is the RPC bridge into NVMe bdev creation?
  4. Why is bdev_nvme still a bdev module even when the target is remote?
  5. What is the risk of assuming every subsystem has exactly one namespace?

References

  • Local SPDK: module/bdev/nvme/bdev_nvme_rpc.c
  • Local SPDK: module/bdev/nvme/bdev_nvme.c
  • Local SPDK: include/spdk/module/bdev/nvme.h
  • Local SPDK: lib/nvme/nvme_rdma.c
  • Local diskengine: /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/nvme_attach.go
  • Local diskengine: /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/utils.go
  • SPDK NVMe driver documentation: https://spdk.io/doc/nvme.html
  • SPDK bdev documentation: https://spdk.io/doc/bdev.html