SPDK From First Principles

SPDK deep learning path

Chapter 22: NVMe-oF Target

Source: drafts/transport-diskengine/22-nvme-of-target.md

Chapter Goal

After this chapter, the reader should be able to explain how SPDK exposes a local bdev as a remote NVMe namespace. They should know the difference between a target, transport, subsystem, listener, namespace, controller, qpair, poll group, and request. They should also be able to trace a remote write from an RDMA/TCP/vfio-user transport callback into spdk_nvmf_request_exec() and then into bdev I/O.

Beginner Mental Model

NVMe-oF is NVMe queue semantics carried over a fabric. A remote host still believes it is submitting NVMe commands to a controller. The controller is not a physical PCIe device on that host; it is represented by SPDK inside nvmf_tgt.

Think of the target as a building:

  • The target is the whole building.
  • A transport is a road type into the building: RDMA, TCP, FC, or vfio-user.
  • A listener is a doorway on a road: an address and port/socket.
  • A subsystem is a named storage tenant, identified by NQN.
  • A namespace is one block device inside that subsystem.
  • A qpair is one active queue connection from a host.
  • A request is one NVMe command flowing through that qpair.
  • A poll group is the CPU-local worker that polls transport events and bdev completions.

NVMe-oF is not "a network filesystem." It does not understand files, directories, extents, or VM images. It exports block namespaces. The guest or host above it owns the filesystem or partition table.
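The building analogy can be written down as a small data model. This is a minimal sketch for building intuition, not SPDK's actual structures: the class names, the placeholder NQN, the address 192.0.2.10, and the bdev name lvol0 are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Listener:
    """A doorway: transport type plus address and service id."""
    trtype: str
    traddr: str
    trsvcid: str


@dataclass
class Namespace:
    """One block device inside a subsystem, backed by a bdev name."""
    nsid: int
    bdev_name: str


@dataclass
class Subsystem:
    """A named storage tenant. The NQN lives here, not on the listener."""
    nqn: str
    listeners: list = field(default_factory=list)
    namespaces: list = field(default_factory=list)


@dataclass
class Target:
    """The whole building: the roads in, and the tenants inside."""
    transports: set = field(default_factory=set)
    subsystems: dict = field(default_factory=dict)

    def add_listener(self, nqn, listener):
        # A doorway only makes sense on a road that already exists.
        if listener.trtype not in self.transports:
            raise ValueError(f"no {listener.trtype} transport in this target")
        self.subsystems[nqn].listeners.append(listener)


NQN = "nqn.2024-01.io.example:vol0"  # placeholder NQN, not a real export
target = Target()
target.transports.add("RDMA")
target.subsystems[NQN] = Subsystem(NQN)
target.add_listener(NQN, Listener("RDMA", "192.0.2.10", "4420"))
target.subsystems[NQN].namespaces.append(Namespace(nsid=1, bdev_name="lvol0"))
```

The model keeps the name (NQN) and the address (listener) on different objects, which is the distinction the rest of the chapter relies on.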

Why This Matters For diskengine/excloud

In diskengine storage-node mode, each lvol becomes reachable from compute/baremetal nodes by being added to an NVMe-oF subsystem as a namespace. The storage node loop creates or verifies:

  • an RDMA transport,
  • a subsystem NQN,
  • a listener using the storage node RDMA IP and port,
  • a namespace pointing at the lvol bdev.

In diskengine baremetal mode, the other side of the same relationship appears as bdev_nvme_attach_controller. The baremetal node connects to the storage node's NQN and sees a local SPDK bdev. The bdev name is usually the controller name plus a namespace suffix: a controller attached as Nvme0 exposes namespace 1 as Nvme0n1.
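The initiator-side call can be sketched as a JSON-RPC request. The method and parameter names below are the ones SPDK's bdev_nvme_attach_controller accepts. The controller name, address, port, and NQN are placeholders, and the helper functions are hypothetical, not diskengine code.

```python
import json


def attach_controller_request(name, traddr, trsvcid, subnqn, request_id=1):
    """Build the JSON-RPC body for attaching to a remote RDMA subsystem."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "bdev_nvme_attach_controller",
        "params": {
            "name": name,        # local controller name, chosen by the caller
            "trtype": "RDMA",
            "adrfam": "IPv4",
            "traddr": traddr,    # storage node listener address
            "trsvcid": trsvcid,  # storage node listener port
            "subnqn": subnqn,    # the subsystem NQN to connect to
        },
    }


def expected_bdev_name(name, nsid):
    """SPDK names NVMe bdevs as <controller name>n<namespace id>."""
    return f"{name}n{nsid}"


request = attach_controller_request(
    "Nvme0", "192.0.2.10", "4420", "nqn.2024-01.io.example:vol0"
)
print(json.dumps(request, indent=2))
print(expected_bdev_name("Nvme0", 1))
```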

Object Lifecycle

The event subsystem initializes the target before user workloads can use it. The source tour starts at:

  • module/event/subsystems/nvmf/nvmf_tgt.c: nvmf_subsystem_init
  • module/event/subsystems/nvmf/nvmf_tgt.c: nvmf_tgt_advance_state
  • module/event/subsystems/nvmf/nvmf_tgt.c: nvmf_tgt_create_target
  • module/event/subsystems/nvmf/nvmf_tgt.c: nvmf_tgt_create_poll_groups
  • module/event/subsystems/nvmf/nvmf_tgt.c: nvmf_subsystem_write_config_json

The target object itself is created by the library:

  • lib/nvmf/nvmf.c: spdk_nvmf_tgt_create
  • lib/nvmf/nvmf.c: spdk_nvmf_tgt_destroy
  • lib/nvmf/nvmf.c: spdk_nvmf_poll_group_create
  • lib/nvmf/nvmf.c: spdk_nvmf_tgt_new_qpair

Transport implementations register themselves behind a common interface:

  • lib/nvmf/transport.c: spdk_nvmf_transport_register
  • lib/nvmf/transport.c: spdk_nvmf_transport_create_async
  • lib/nvmf/transport.c: spdk_nvmf_transport_listen
  • lib/nvmf/transport.c: nvmf_transport_poll_group_create
  • include/spdk/nvmf_transport.h: struct spdk_nvmf_transport_ops
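The idea of registering behind a common interface can be modeled in a few lines. This is a toy sketch of the pattern only: the real struct spdk_nvmf_transport_ops has many more callbacks, and every name below (TransportOps, transport_register, transport_create) is an illustrative assumption rather than the SPDK API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TransportOps:
    """A tiny stand-in for an ops table: a name plus function pointers."""
    name: str
    create: Callable[[dict], dict]


_registry = {}


def transport_register(ops):
    """Each transport registers its ops once, keyed by name."""
    if ops.name in _registry:
        raise ValueError(f"transport {ops.name} already registered")
    _registry[ops.name] = ops


def transport_create(name, opts):
    """Generic code looks up the ops by name and never sees the details."""
    ops = _registry.get(name.upper())
    if ops is None:
        raise KeyError(f"unknown transport {name}")
    return ops.create(opts)


transport_register(TransportOps("RDMA", lambda opts: {"kind": "rdma", **opts}))
transport_register(TransportOps("TCP", lambda opts: {"kind": "tcp", **opts}))

rdma = transport_create("rdma", {"max_queue_depth": 128})
```

The caller of transport_create passes a name and options; which implementation runs is decided entirely by what was registered.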

Subsystems and namespaces are managed here:

  • lib/nvmf/subsystem.c: spdk_nvmf_subsystem_create
  • lib/nvmf/subsystem.c: spdk_nvmf_subsystem_start
  • lib/nvmf/subsystem.c: spdk_nvmf_subsystem_stop
  • lib/nvmf/subsystem.c: spdk_nvmf_subsystem_add_listener_ext
  • lib/nvmf/subsystem.c: spdk_nvmf_subsystem_add_ns_ext
  • include/spdk/nvmf.h: spdk_nvmf_subsystem_add_ns_ext
  • include/spdk/nvmf.h: spdk_nvmf_subsystem_add_listener_ext

Control-plane RPCs for this chapter are registered in:

  • lib/nvmf/nvmf_rpc.c: rpc_nvmf_create_transport
  • lib/nvmf/nvmf_rpc.c: rpc_nvmf_create_subsystem
  • lib/nvmf/nvmf_rpc.c: rpc_nvmf_subsystem_add_listener
  • lib/nvmf/nvmf_rpc.c: rpc_nvmf_subsystem_add_ns
  • lib/nvmf/nvmf_rpc.c: rpc_nvmf_get_subsystems

The I/O Path

A host sends a command over RDMA, TCP, FC, or vfio-user. The transport decodes enough of the wire/device protocol to create an spdk_nvmf_request. It then calls the common execution path:

  • lib/nvmf/ctrlr.c: spdk_nvmf_request_exec

That function classifies the command. Fabrics commands such as Connect are handled first. Of the remaining commands, those arriving on the admin queue go to:

  • lib/nvmf/ctrlr.c: nvmf_ctrlr_process_admin_cmd

I/O commands go to:

  • lib/nvmf/ctrlr.c: nvmf_ctrlr_process_io_cmd
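The classification step can be sketched as a three-way decision. This is a simplified model, not the SPDK source: it assumes that fabrics commands carry opcode 0x7F, that the admin queue is queue ID 0, and that everything else is namespace I/O. The function name classify is hypothetical.

```python
FABRIC_OPCODE = 0x7F  # NVMe-oF fabrics command opcode


def classify(opcode, qid):
    """Return which handler family a command would be routed to."""
    if opcode == FABRIC_OPCODE:
        return "fabrics"  # e.g. Connect, Property Get/Set
    if qid == 0:
        return "admin"    # commands on the admin queue
    return "io"           # commands on an I/O queue


print(classify(0x7F, 0))  # Connect on the admin queue
print(classify(0x06, 0))  # Identify (admin opcode 0x06)
print(classify(0x01, 3))  # Write (NVM opcode 0x01) on I/O queue 3
```

Note that the same numeric opcode can mean different things on the admin queue and on an I/O queue, which is why the queue matters and not only the opcode.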

For namespace I/O, the controller code resolves the namespace to a bdev descriptor and channel:

  • lib/nvmf/ctrlr.c: spdk_nvmf_request_get_bdev

The bdev-backed command helpers live in:

  • lib/nvmf/ctrlr_bdev.c: nvmf_ctrlr_process_io_cmd_resubmit
  • lib/nvmf/ctrlr_bdev.c: nvmf_bdev_ctrl_queue_io
  • lib/nvmf/ctrlr_bdev.c: nvmf_bdev_ctrlr_read_cmd
  • lib/nvmf/ctrlr_bdev.c: nvmf_bdev_ctrlr_write_cmd
  • lib/nvmf/ctrlr_bdev.c: nvmf_bdev_ctrlr_flush_cmd
  • lib/nvmf/ctrlr_bdev.c: nvmf_bdev_ctrlr_unmap

The write command eventually submits spdk_bdev_writev_blocks_ext; the read command uses spdk_bdev_readv_blocks_ext. Those calls are asynchronous. The NVMe-oF request is not complete when the bdev I/O is submitted. Completion happens when the bdev module calls back, the NVMe status is filled in, and the request is returned through:

  • lib/nvmf/ctrlr.c: spdk_nvmf_request_complete
  • lib/nvmf/transport.c: nvmf_transport_req_complete

Transport-specific completion code then sends a CQE or response to the host.
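The split between submission and completion is easier to see in a toy model. The sketch below uses a fake bdev that only queues work on submit and finishes it when polled. All class and function names here are illustrative assumptions; they imitate the shape of the flow and are not SPDK calls.

```python
from collections import deque


class FakeBdev:
    """Submission queues work; completion happens later, when polled."""

    def __init__(self):
        self._inflight = deque()
        self.blocks = {}

    def writev_blocks(self, lba, data, cb, cb_arg):
        # Returns immediately. Nothing has been written yet.
        self._inflight.append((lba, data, cb, cb_arg))

    def poll(self):
        completed = 0
        while self._inflight:
            lba, data, cb, cb_arg = self._inflight.popleft()
            self.blocks[lba] = data
            cb(True, cb_arg)  # the completion callback runs here
            completed += 1
        return completed


class Request:
    def __init__(self, lba, data):
        self.lba = lba
        self.data = data
        self.status = None  # None means still outstanding


events = []


def request_complete(req):
    # Stand-in for handing the request back to the transport.
    events.append(("cqe", req.lba, req.status))


def write_done(success, req):
    # Fill in the NVMe status first, then complete the request.
    req.status = "SUCCESS" if success else "INTERNAL_DEVICE_ERROR"
    request_complete(req)


def request_exec(bdev, req):
    bdev.writev_blocks(req.lba, req.data, write_done, req)
    events.append(("submitted", req.lba))


bdev = FakeBdev()
req = Request(8, b"hello")
request_exec(bdev, req)
status_after_submit = req.status  # still None: no completion yet
bdev.poll()
```

After request_exec returns, the request is still outstanding. Only the later poll runs the callback, sets the status, and produces the completion event.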

Prose Diagram: Target Request Flow

Picture a left-to-right diagram with five vertical lanes:

  1. Remote host NVMe driver.
  2. SPDK transport.
  3. SPDK NVMf controller.
  4. SPDK bdev layer.
  5. Physical or virtual backing bdev.

The arrows are:

Host submits SQE
  -> transport receives capsule/request
  -> spdk_nvmf_request_exec
  -> nvmf_ctrlr_process_io_cmd
  -> spdk_nvmf_request_get_bdev
  -> spdk_bdev_writev_blocks_ext or spdk_bdev_readv_blocks_ext
  -> backing bdev finishes
  -> bdev callback
  -> spdk_nvmf_request_complete
  -> transport sends CQE
  -> host observes completion.

The important visual detail is that completion is a separate arrow coming back later. Nothing should be drawn as a blocking function call waiting for the SSD.

RDMA, TCP, And vfio-user Distinctions

The common NVMf layer does not care whether a write arrived from RDMA or TCP once it has an spdk_nvmf_request. The transport matters before and after that point.

RDMA:

  • Uses RDMA queue pairs and memory registration.
  • Sensitive to RNIC, RDMA CM, MTU, PFC/ECN, and hostaddr binding.
  • Common in diskengine's storage-node to baremetal path.
  • Source anchors: lib/nvmf/rdma.c: spdk_nvmf_request_exec call sites, lib/nvmf/rdma.c: spdk_nvmf_request_complete call sites.

TCP:

  • Uses sockets instead of RDMA verbs.
  • Easier to bring up because it needs no RDMA-capable NIC or lossless-fabric tuning, but usually costs more CPU per I/O.
  • Source anchors: lib/nvmf/tcp.c: spdk_nvmf_request_exec call sites.

vfio-user:

  • Looks like a local PCIe device to a VM or client process over a Unix socket.
  • Uses guest memory mapping and doorbell handling rather than network packets.
  • Source anchors: lib/nvmf/vfio_user.c: nvmf_vfio_user_poll_group, lib/nvmf/vfio_user.c: vfio_user_ctrlr_switch_doorbells, lib/nvmf/vfio_user.c: spdk_nvmf_request_exec call sites.

Edge Cases And Failure Modes

Listener exists but subsystem has no namespace:

The host may discover or connect to a subsystem but see no usable capacity. Check nvmf_get_subsystems and verify namespace entries. In diskengine, this usually points at internal/storagenode/provisionlvol.go: provisionLvol or internal/storagenode/nvmeofexport.go: reconcileExports.

Namespace bdev disappeared:

If the bdev backing a namespace is removed, outstanding I/O may fail and reconnect behavior depends on the initiator. Read lib/nvmf/subsystem.c namespace removal paths and lib/nvmf/ctrlr.c: spdk_nvmf_request_get_bdev.

Host NQN mismatch:

If allow_any_host is false and the host was not added, connect fails even though the network path is fine. Source anchors: include/spdk/nvmf.h: spdk_nvmf_subsystem_add_host_ext, include/spdk/nvmf.h: spdk_nvmf_subsystem_set_allow_any_host.
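The access rule can be checked offline against one entry of nvmf_get_subsystems output. This is a minimal sketch, assuming the entry carries the allow_any_host flag and a hosts list of objects with an nqn field, as the RPC reports them. The function name host_allowed and the sample NQNs are hypothetical.

```python
def host_allowed(subsystem, hostnqn):
    """Return True if this host NQN would pass the subsystem's host check."""
    if subsystem.get("allow_any_host", False):
        return True
    return any(h.get("nqn") == hostnqn for h in subsystem.get("hosts", []))


subsystem = {
    "nqn": "nqn.2024-01.io.example:vol0",
    "allow_any_host": False,
    "hosts": [{"nqn": "nqn.2014-08.org.nvmexpress:uuid:host-a"}],
}

print(host_allowed(subsystem, "nqn.2014-08.org.nvmexpress:uuid:host-a"))
print(host_allowed(subsystem, "nqn.2014-08.org.nvmexpress:uuid:host-b"))
```

The comparison is an exact string match, so a host NQN that differs only in case or in a trailing character still fails.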

Transport exists but does not listen:

nvmf_create_transport creates the transport object. It does not open any listening address by itself. Listeners are added separately through nvmf_subsystem_add_listener, which asks the target to listen on the given address if needed and then associates that address with the subsystem.

Buffer pressure:

Transports may need iobufs for request data. Source anchors: lib/nvmf/transport.c: spdk_nvmf_request_get_buffers, lib/nvmf/transport.c: nvmf_request_iobuf_get_cb.
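Buffer pressure can be pictured as a fixed pool with a wait queue. The sketch below is a toy model of that idea only. It does not reproduce the SPDK iobuf API, and the BufferPool class and its methods are assumptions made for illustration.

```python
from collections import deque


class BufferPool:
    """A fixed pool: callers either get a buffer now or wait for a callback."""

    def __init__(self, count):
        self._free = count
        self._waiters = deque()

    def get(self, on_available):
        """Return True if a buffer was granted immediately.

        Otherwise queue the callback and return False; the callback
        runs later, when another request releases a buffer.
        """
        if self._free > 0:
            self._free -= 1
            return True
        self._waiters.append(on_available)
        return False

    def put(self):
        """Release a buffer, handing it to the oldest waiter if any."""
        if self._waiters:
            self._waiters.popleft()()
        else:
            self._free += 1


pool = BufferPool(count=1)
resumed = []

got_a = pool.get(lambda: resumed.append("a"))  # granted immediately
got_b = pool.get(lambda: resumed.append("b"))  # pool empty, request b waits
pool.put()                                     # request a finishes, b resumes
```

A request that has to wait is not failed. It is parked until a buffer returns, so buffer pressure tends to show up as added latency before it shows up as errors.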

Subsystem state transitions:

Some changes require pause/stop/resume semantics. A beginner mistake is to think a namespace list is just an array that can be mutated freely while I/O is running. Use the subsystem state functions as the source of truth.
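The rule can be modeled as a small state machine. This is a simplified sketch that assumes namespace changes are accepted only while the subsystem is inactive or paused, and it omits the transitional states (activating, pausing, resuming, deactivating) that the real subsystem passes through. SubsystemModel and its methods are hypothetical names.

```python
from enum import Enum


class State(Enum):
    INACTIVE = "inactive"
    ACTIVE = "active"
    PAUSED = "paused"


class SubsystemModel:
    def __init__(self):
        self.state = State.INACTIVE
        self.namespaces = []

    def start(self):
        self.state = State.ACTIVE

    def pause(self):
        if self.state is not State.ACTIVE:
            raise RuntimeError("only an active subsystem can be paused")
        self.state = State.PAUSED

    def resume(self):
        if self.state is not State.PAUSED:
            raise RuntimeError("only a paused subsystem can be resumed")
        self.state = State.ACTIVE

    def add_ns(self, bdev_name):
        # The namespace list is frozen while I/O may be running.
        if self.state not in (State.INACTIVE, State.PAUSED):
            raise RuntimeError("pause the subsystem before changing namespaces")
        self.namespaces.append(bdev_name)
        return len(self.namespaces)  # the new namespace ID in this model


subsys = SubsystemModel()
subsys.add_ns("lvol0")      # allowed: subsystem not started yet
subsys.start()

rejected = False
try:
    subsys.add_ns("lvol1")  # rejected: subsystem is active
except RuntimeError:
    rejected = True

subsys.pause()
subsys.add_ns("lvol1")      # allowed: subsystem is paused
subsys.resume()
```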

Misconceptions To Kill

"NVMe-oF exports disks."

More precisely, it exports namespaces backed by SPDK bdevs. The bdev may be an NVMe namespace, an lvol, a RAID bdev, a malloc bdev, or another virtual bdev.

"An NQN is an IP address."

An NQN is a name. A listener provides addressability. diskengine stores both because compute nodes need the NQN and the RDMA endpoint.

"Creating a subsystem moves data."

Creating a subsystem changes the target's namespace/control-plane state. Data moves only when a host sends I/O.

"The target thread blocks on remote writes."

The target submits asynchronous bdev I/O and returns later through completion callbacks. Blocking a reactor would harm every qpair on that thread.

Lab: Build A Minimal Mental Config

Without running SPDK, write the minimal JSON-RPC sequence for a storage node exporting one lvol named abcd-uuid through RDMA:

  1. nvmf_create_transport with trtype=RDMA.
  2. nvmf_create_subsystem with an NQN such as nqn.2024-01.io.excloud:storage.node.disk.lvol.
  3. nvmf_subsystem_add_listener with traddr, trsvcid, adrfam, and trtype.
  4. nvmf_subsystem_add_ns with the namespace's bdev_name set to abcd-uuid.

Then inspect lib/nvmf/nvmf_rpc.c and identify which C RPC handler decodes each step.
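After attempting the lab, one possible answer to compare against is sketched below. The method and parameter names follow SPDK's JSON-RPC interface. The address 192.0.2.10 and port 4420 are placeholders, and the helper export_lvol_requests is hypothetical.

```python
import json


def export_lvol_requests(nqn, bdev_name, traddr, trsvcid):
    """Return the four JSON-RPC requests, in the order they must be sent."""
    calls = [
        ("nvmf_create_transport", {"trtype": "RDMA"}),
        ("nvmf_create_subsystem", {"nqn": nqn}),
        ("nvmf_subsystem_add_listener", {
            "nqn": nqn,
            "listen_address": {
                "trtype": "RDMA",
                "adrfam": "IPv4",
                "traddr": traddr,
                "trsvcid": trsvcid,
            },
        }),
        ("nvmf_subsystem_add_ns", {
            "nqn": nqn,
            "namespace": {"bdev_name": bdev_name},
        }),
    ]
    return [
        {"jsonrpc": "2.0", "id": i, "method": method, "params": params}
        for i, (method, params) in enumerate(calls, start=1)
    ]


requests = export_lvol_requests(
    nqn="nqn.2024-01.io.excloud:storage.node.disk.lvol",
    bdev_name="abcd-uuid",
    traddr="192.0.2.10",  # placeholder storage node RDMA IP
    trsvcid="4420",       # placeholder port
)
for request in requests:
    print(json.dumps(request))
```

Because allow_any_host is not set in this sketch, a host would also have to be added to the subsystem before it could connect.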

Source Reading Exercise

Start at lib/nvmf/ctrlr.c: spdk_nvmf_request_exec. Follow only the I/O command path. Write down:

  1. Where the opcode is classified.
  2. Where nsid becomes a bdev.
  3. Where the bdev call is submitted.
  4. Where spdk_nvmf_request_complete is called after bdev completion.

Do not read the whole file linearly. Use symbol search and call references.

Operational Debug Exercise

Symptom: baremetal node cannot attach a volume over RDMA.

Classify it:

  1. Does storage node show the RDMA transport in nvmf_get_transports?
  2. Does nvmf_get_subsystems show the target NQN?
  3. Does that subsystem have a listener with the expected RDMA IP and port?
  4. Does it have a namespace whose bdev_name is the lvol UUID?
  5. On the baremetal node, does bdev_nvme_attach_controller fail during connect, or does it succeed without any bdev appearing?

Only after answering these should you suspect the bdev layer or lvol metadata.
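Questions 1 through 4 can be run mechanically against saved RPC output. The sketch below assumes the transports list comes from nvmf_get_transports and the subsystems list from nvmf_get_subsystems, with the field names those RPCs report. The triage function and the sample values are hypothetical.

```python
def triage(transports, subsystems, nqn, traddr, trsvcid, bdev_name):
    """Return the number of the first checklist question that fails.

    Returns 5 when questions 1-4 all pass, meaning the storage node
    export looks complete and attention should move to the initiator.
    """
    if not any(t.get("trtype", "").upper() == "RDMA" for t in transports):
        return 1
    subsystem = next((s for s in subsystems if s.get("nqn") == nqn), None)
    if subsystem is None:
        return 2
    listener_ok = any(
        addr.get("trtype", "").upper() == "RDMA"
        and addr.get("traddr") == traddr
        and addr.get("trsvcid") == trsvcid
        for addr in subsystem.get("listen_addresses", [])
    )
    if not listener_ok:
        return 3
    if not any(ns.get("bdev_name") == bdev_name
               for ns in subsystem.get("namespaces", [])):
        return 4
    return 5


transports = [{"trtype": "RDMA"}]
subsystems = [{
    "nqn": "nqn.2024-01.io.example:vol0",
    "listen_addresses": [
        {"trtype": "RDMA", "adrfam": "IPv4",
         "traddr": "192.0.2.10", "trsvcid": "4420"},
    ],
    "namespaces": [],  # listener present, but nothing exported yet
}]

first_failure = triage(transports, subsystems, "nqn.2024-01.io.example:vol0",
                       "192.0.2.10", "4420", "abcd-uuid")
print(first_failure)
```

The sample data reproduces the "listener exists but subsystem has no namespace" case, so the function stops at question 4.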

Self-Check

  1. What object owns the NQN?
  2. What object owns the RDMA IP and port?
  3. Why can a subsystem exist without being useful to a host?
  4. Why is spdk_nvmf_request_complete not called immediately after spdk_bdev_writev_blocks_ext?
  5. Which diskengine storage-node functions create or verify NVMe-oF exports?

References

  • Local SPDK: include/spdk/nvmf.h
  • Local SPDK: include/spdk/nvmf_transport.h
  • Local SPDK: module/event/subsystems/nvmf/nvmf_tgt.c
  • Local SPDK: lib/nvmf/nvmf_rpc.c
  • Local SPDK: lib/nvmf/ctrlr.c
  • Local SPDK: lib/nvmf/ctrlr_bdev.c
  • Local diskengine: internal/storagenode/nvmeofexport.go
  • Local diskengine: internal/storagenode/provisionlvol.go
  • SPDK NVMe-oF documentation: https://spdk.io/doc/nvmf.html
  • NVM Express specifications: https://nvmexpress.org/specifications/