SPDK From First Principles

SPDK deep learning path

Chapter 17: NVMe Initiator Library

Source: drafts/bdev-nvme/17-nvme-initiator-library.md

Reader Promise

By the end of this chapter you should be able to explain SPDK's NVMe initiator as a queue-machine library. You should be able to find the source paths for probe, connect, controller initialization, admin queue progress, I/O qpair allocation, qpair completion polling, poll groups, namespaces, detach, reset, and hotremove callbacks.

This chapter is about lib/nvme, not the bdev module. The bdev module uses this library, but the initiator library can also be used directly by applications.

Mental Model

NVMe is not "a disk API." It is a controller and queue protocol. The host allocates submission queues and completion queues, submits commands, rings doorbells or otherwise notifies the transport, and polls completions. SPDK's NVMe initiator library owns the userspace implementation of that model across PCIe and NVMe-oF transports.

The major objects are:

  • struct spdk_nvme_transport_id: where and how to connect.
  • struct spdk_nvme_probe_ctx: async probe/connect state.
  • struct spdk_nvme_ctrlr: one attached NVMe controller.
  • struct spdk_nvme_ns: one namespace under a controller.
  • struct spdk_nvme_qpair: one admin or I/O queue pair.
  • struct spdk_nvme_poll_group: a group of I/O qpairs polled together.
  • struct spdk_nvme_transport: transport-specific operations for PCIe, TCP, RDMA, etc.

Source anchors:

  • include/spdk/nvme.h:struct spdk_nvme_transport_id.
  • include/spdk/nvme.h:struct spdk_nvme_probe_ctx.
  • include/spdk/nvme.h:struct spdk_nvme_ctrlr.
  • include/spdk/nvme.h:struct spdk_nvme_qpair.
  • include/spdk/nvme.h:struct spdk_nvme_ns.
  • include/spdk/nvme.h:struct spdk_nvme_poll_group.
  • lib/nvme/nvme_internal.h:struct spdk_nvme_qpair.
  • lib/nvme/nvme_internal.h:struct spdk_nvme_ctrlr.
  • lib/nvme/nvme_internal.h:struct spdk_nvme_ns.
  • lib/nvme/nvme_internal.h:struct spdk_nvme_probe_ctx.

Why This Matters For diskengine/excloud

diskengine asks SPDK to attach local PCIe NVMe drives on storage nodes and remote NVMe-oF namespaces on baremetal nodes. In both cases, the bdev module eventually uses the NVMe initiator library to create controllers, namespaces, qpairs, and poll groups.

When something fails, the symptom may appear as "bdev missing" or "volume I/O hung," but the actual cause may be lower:

  • PCIe device not bound to VFIO.
  • NVMe-oF connect failed.
  • Admin qpair never connected, so the controller never reached ready.
  • I/O qpair failed.
  • Controller reset is in progress.
  • Namespace changed or disappeared.
  • Poll group is not processing completions.
  • Reconnect timeout policy expired.

The NVMe library is where those states are represented.

Probe And Connect

There are two common entry styles:

  • Probe: enumerate matching controllers and use callbacks to decide which to attach.
  • Connect: directly connect to one transport ID.

Public source anchors:

  • include/spdk/nvme.h:spdk_nvme_probe().
  • include/spdk/nvme.h:spdk_nvme_probe_ext().
  • include/spdk/nvme.h:spdk_nvme_connect().
  • include/spdk/nvme.h:spdk_nvme_probe_async_ext().
  • include/spdk/nvme.h:spdk_nvme_connect_async().
  • include/spdk/nvme.h:spdk_nvme_probe_poll_async().

Implementation source anchors:

  • lib/nvme/nvme.c:spdk_nvme_probe().
  • lib/nvme/nvme.c:spdk_nvme_probe_ext().
  • lib/nvme/nvme.c:spdk_nvme_connect().
  • lib/nvme/nvme.c:spdk_nvme_probe_async_ext().
  • lib/nvme/nvme.c:spdk_nvme_connect_async().
  • lib/nvme/nvme.c:spdk_nvme_probe_poll_async().
  • lib/nvme/nvme.c:nvme_probe_ctx_init().
  • lib/nvme/nvme.c:nvme_probe_internal().
  • lib/nvme/nvme.c:nvme_init_controllers().

Synchronous Probe

spdk_nvme_probe() calls spdk_nvme_probe_ext(). If no transport ID is provided, spdk_nvme_probe_ext() creates a PCIe wildcard transport ID. It then creates an async probe context with spdk_nvme_probe_async_ext() and drives it to completion with nvme_init_controllers().

Callbacks:

  • probe_cb: decide whether to attach a discovered controller and optionally modify controller options.
  • attach_cb: receive a ready controller.
  • attach_fail_cb: optional failure callback.
  • remove_cb: optional callback for controllers no longer present.

Direct Connect

spdk_nvme_connect() requires a transport ID. It initializes driver state, copies controller options safely, creates an async connect context with spdk_nvme_connect_async(), drives initialization with nvme_init_controllers(), then finds the attached controller by transport ID and host NQN.

Direct connect is common for NVMe-oF. Probe is common for PCIe discovery and for cases where multiple controllers may match.

Misconception to kill: "connect is a separate stack from probe." It is not. spdk_nvme_connect() still uses the probe machinery internally; it is a direct-connect flavor of probe.

Async Probe Context

The async path exists because controller initialization can take time and must be polled without blocking.

Source anchors:

  • lib/nvme/nvme.c:spdk_nvme_probe_async_ext().
  • lib/nvme/nvme.c:spdk_nvme_connect_async().
  • lib/nvme/nvme.c:spdk_nvme_probe_poll_async().

spdk_nvme_probe_poll_async():

  • Polls every controller in probe_ctx->init_ctrlrs.
  • Polls destruction of failed controllers.
  • Marks the global driver initialized once both the init list and the failed list are empty.
  • Frees the probe context and returns 0 when done.
  • Returns -EAGAIN while work remains.

This pattern is visible in the NVMe bdev module:

Source anchor: module/bdev/nvme/bdev_nvme.c:bdev_nvme_async_poll().

That poller calls spdk_nvme_probe_poll_async() until the attach work finishes.
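The -EAGAIN contract can be sketched with a small self-contained model in C. This is not SPDK code: the toy_ names and the fixed count of three polls per controller are assumptions for illustration, and unlike the real function the toy does not free its context. It shows a non-blocking poll step plus a synchronous wrapper that drives it to completion, which is roughly the relationship the chapter describes between the async and synchronous entry points.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_INIT_STEPS 3 /* hypothetical number of polls per controller */

/* Toy probe context. Not an SPDK structure. */
struct toy_probe_ctx {
    int init_ctrlrs; /* controllers that still need init steps */
    int steps_left;  /* init steps left on the current controller */
    int attached;    /* how many attach callbacks would have fired */
};

/* One non-blocking poll: advance one step, report -EAGAIN until finished. */
static int toy_probe_poll_async(struct toy_probe_ctx *ctx)
{
    assert(ctx != NULL);
    if (ctx->init_ctrlrs > 0 && --ctx->steps_left == 0) {
        ctx->attached++;    /* stands in for attach_cb */
        ctx->init_ctrlrs--; /* controller leaves the init list */
        ctx->steps_left = TOY_INIT_STEPS;
    }
    return ctx->init_ctrlrs > 0 ? -EAGAIN : 0;
}

/* Synchronous wrapper: keep polling until the context reports completion. */
static int toy_probe_sync(struct toy_probe_ctx *ctx, int *polls)
{
    int rc;

    *polls = 0;
    do {
        rc = toy_probe_poll_async(ctx);
        (*polls)++;
    } while (rc == -EAGAIN);
    return rc;
}

/* Usage: two controllers at three polls each finish after six polls. */
int toy_probe_demo(void)
{
    struct toy_probe_ctx ctx = {
        .init_ctrlrs = 2, .steps_left = TOY_INIT_STEPS, .attached = 0
    };
    int polls;

    if (toy_probe_sync(&ctx, &polls) != 0 || ctx.attached != 2) {
        return -1;
    }
    return polls;
}
```

An event-loop caller would not spin as toy_probe_sync() does. It would call the poll function once per poller invocation and unregister the poller when the return value stops being -EAGAIN.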

Controller State Machine

The controller object represents an attached NVMe controller and its admin queue. Initialization is a state machine.

Source anchors:

  • lib/nvme/nvme.c:nvme_ctrlr_poll_internal().
  • lib/nvme/nvme_ctrlr.c:nvme_ctrlr_process_init().
  • lib/nvme/nvme_ctrlr.c:NVME_CTRLR_STATE_CONNECT_ADMINQ.
  • lib/nvme/nvme_ctrlr.c:NVME_CTRLR_STATE_WAIT_FOR_CONNECT_ADMINQ.

One key point in the source: when the controller state is NVME_CTRLR_STATE_CONNECT_ADMINQ, SPDK asks the transport to connect the admin qpair. In NVME_CTRLR_STATE_WAIT_FOR_CONNECT_ADMINQ, it calls spdk_nvme_qpair_process_completions(ctrlr->adminq, 0) and watches the qpair state transition to connected/enabled.

Beginner mental model: the controller is not "ready" when memory for the object is allocated. It becomes ready after a series of admin queue and identify/configuration steps complete.
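A short self-contained sketch in C can illustrate the shape of such a state machine. The state list here is hypothetical and heavily simplified (the real enum in lib/nvme has many more states), and the two boolean flags stand in for completions that the admin qpair would deliver. The point is only that each call advances at most one state and never blocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified, hypothetical state list. Not the real SPDK enum. */
enum toy_ctrlr_state {
    TOY_STATE_CONNECT_ADMINQ,
    TOY_STATE_WAIT_FOR_CONNECT_ADMINQ,
    TOY_STATE_IDENTIFY,
    TOY_STATE_WAIT_FOR_IDENTIFY,
    TOY_STATE_READY
};

struct toy_ctrlr {
    enum toy_ctrlr_state state;
    bool adminq_connected; /* set when the admin qpair connect finishes */
    bool identify_done;    /* set when the identify completion arrives */
};

/* Advance at most one state per call and never block. */
static void toy_ctrlr_process_init(struct toy_ctrlr *c)
{
    assert(c != NULL);
    switch (c->state) {
    case TOY_STATE_CONNECT_ADMINQ:
        /* Start the admin qpair connect, then wait for it. */
        c->state = TOY_STATE_WAIT_FOR_CONNECT_ADMINQ;
        break;
    case TOY_STATE_WAIT_FOR_CONNECT_ADMINQ:
        if (c->adminq_connected) {
            c->state = TOY_STATE_IDENTIFY;
        }
        break;
    case TOY_STATE_IDENTIFY:
        /* Submit the identify command, then wait for its completion. */
        c->state = TOY_STATE_WAIT_FOR_IDENTIFY;
        break;
    case TOY_STATE_WAIT_FOR_IDENTIFY:
        if (c->identify_done) {
            c->state = TOY_STATE_READY;
        }
        break;
    case TOY_STATE_READY:
        break;
    }
}

/* Usage: count the polls needed to reach READY. */
int toy_ctrlr_demo(void)
{
    struct toy_ctrlr c = { .state = TOY_STATE_CONNECT_ADMINQ };
    int polls = 0;

    while (c.state != TOY_STATE_READY) {
        toy_ctrlr_process_init(&c);
        polls++;
        /* Pretend completions arrive after the third and sixth polls. */
        if (polls == 3) {
            c.adminq_connected = true;
        }
        if (polls == 6) {
            c.identify_done = true;
        }
    }
    return polls;
}
```

The controller object exists from the first line of toy_ctrlr_demo(), but it is only usable after seven polls. If the polling stopped at any point, the state would simply stay where it was.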

Admin Queue

The admin queue is used for controller management: identify, get log page, set features, namespace management, async events, and fabrics connect.

Source anchors:

  • lib/nvme/nvme_ctrlr.c:nvme_ctrlr_process_init().
  • lib/nvme/nvme_ctrlr.c:spdk_nvme_ctrlr_process_admin_completions().
  • lib/nvme/nvme_qpair.c:spdk_nvme_qpair_process_completions().

Admin completions often drive state transitions. If admin queue progress stops, controller initialization, reset, namespace changes, and asynchronous event handling can stall.

I/O Qpairs

Applications submit namespace I/O on I/O qpairs, not on the admin queue.

Source anchors:

  • include/spdk/nvme.h:spdk_nvme_ctrlr_alloc_io_qpair().
  • lib/nvme/nvme_ctrlr.c:spdk_nvme_ctrlr_alloc_io_qpair().
  • lib/nvme/nvme_ctrlr.c:spdk_nvme_ctrlr_connect_io_qpair().
  • lib/nvme/nvme_ctrlr.c:spdk_nvme_ctrlr_disconnect_io_qpair().

spdk_nvme_ctrlr_alloc_io_qpair():

  • Locks the controller.
  • Requires controller state NVME_CTRLR_STATE_READY.
  • Copies default qpair options and user overrides.
  • Validates caller-provided SQ/CQ buffers if used.
  • Rejects incompatible interrupt and delayed-submit options.
  • Creates an I/O qpair.
  • Connects it unless create_only was requested.

Misconception to kill: "allocating an I/O qpair is just memory allocation." It can also fail because the controller is resetting or still initializing.
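The state gate and the option merge can be sketched in a few lines of self-contained C. This is a toy model under stated assumptions, not the SPDK implementation: the toy_ types, the three-state enum, the default queue size of 128, and the rule that a queue size of 0 means "use the default" are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified controller states. */
enum toy_state { TOY_INITIALIZING, TOY_READY, TOY_RESETTING };

struct toy_qpair_opts {
    unsigned queue_size; /* 0 means "use the controller default" */
    bool create_only;    /* allocate but do not connect */
};

struct toy_io_qpair {
    unsigned queue_size;
    bool connected;
};

struct toy_ctrlr {
    enum toy_state state;
    struct toy_qpair_opts default_opts;
};

/* Returns NULL if the controller is not ready or the options are invalid. */
static struct toy_io_qpair *toy_alloc_io_qpair(const struct toy_ctrlr *c,
                                               const struct toy_qpair_opts *user)
{
    struct toy_qpair_opts opts;
    struct toy_io_qpair *qp;

    assert(c != NULL);
    if (c->state != TOY_READY) {
        return NULL; /* state gate: no allocation is even attempted */
    }

    opts = c->default_opts; /* start from defaults */
    if (user != NULL) {     /* then apply caller overrides */
        if (user->queue_size != 0) {
            opts.queue_size = user->queue_size;
        }
        opts.create_only = user->create_only;
    }
    if (opts.queue_size < 2) {
        return NULL; /* reject options the toy model cannot honor */
    }

    qp = calloc(1, sizeof(*qp));
    if (qp == NULL) {
        return NULL;
    }
    qp->queue_size = opts.queue_size;
    qp->connected = !opts.create_only;
    return qp;
}

/* Usage: returns how many of the three checks behaved as described. */
int toy_alloc_demo(void)
{
    struct toy_ctrlr c = { .state = TOY_INITIALIZING, .default_opts = { 128, false } };
    struct toy_qpair_opts lazy = { .queue_size = 0, .create_only = true };
    struct toy_io_qpair *qp;
    int ok = 0;

    ok += (toy_alloc_io_qpair(&c, NULL) == NULL); /* still initializing */

    c.state = TOY_READY;
    qp = toy_alloc_io_qpair(&c, &lazy);
    ok += (qp != NULL && !qp->connected && qp->queue_size == 128);
    free(qp);

    c.state = TOY_RESETTING;
    ok += (toy_alloc_io_qpair(&c, NULL) == NULL); /* reset in progress */
    return ok;
}
```

A caller that sees NULL therefore cannot assume it ran out of memory. In this model, two of the three NULL paths have nothing to do with allocation.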

Namespace Lifecycle

Namespaces are the NVMe units that look like block devices. The initiator library exposes namespace objects, while the bdev module turns namespaces into bdevs.

Source anchors:

  • include/spdk/nvme.h:spdk_nvme_ctrlr_get_ns().
  • include/spdk/nvme.h:spdk_nvme_ns_get_ctrlr().
  • lib/nvme/nvme_internal.h:struct spdk_nvme_ns.

The initiator library can also notice changed namespace lists through admin events and log pages. Higher layers must decide how to present additions/removals. The NVMe bdev module handles this in its namespace populate/depopulate path.

Command Submission And Completion

The namespace command APIs build NVMe commands and submit them on qpairs.

Source anchors:

  • include/spdk/nvme.h:spdk_nvme_ns_cmd_readv().
  • include/spdk/nvme.h:spdk_nvme_ns_cmd_readv_with_md().
  • include/spdk/nvme.h:spdk_nvme_ns_cmd_readv_ext().
  • include/spdk/nvme.h:spdk_nvme_ns_cmd_writev().
  • include/spdk/nvme.h:spdk_nvme_ns_cmd_writev_with_md().
  • include/spdk/nvme.h:spdk_nvme_ns_cmd_writev_ext().
  • include/spdk/nvme.h:spdk_nvme_ns_cmd_flush().
  • include/spdk/nvme.h:spdk_nvme_ns_cmd_write_zeroes().
  • lib/nvme/nvme_ns_cmd.c:nvme_ns_cmd_rw_ext().
  • lib/nvme/nvme_ns_cmd.c:nvme_ns_cmd_rwv_ext().
  • lib/nvme/nvme_ns_cmd.c:spdk_nvme_ns_cmd_flush().

Completion progress is explicit:

Source anchor: lib/nvme/nvme_qpair.c:spdk_nvme_qpair_process_completions().

This function:

  • Processes register operations and transport events for admin queues.
  • Detects failed or removed controllers.
  • Rejects work when qpair is not enabled except for connecting/disconnecting states.
  • Handles error injection queues.
  • Calls transport-specific completion processing.
  • Resubmits queued requests when possible.

Misconception to kill: "submitted I/O finishes on its own." If nobody polls completions, I/O will not finish. SPDK is poll-driven; even the interrupt integrations ultimately schedule completion processing.
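The following self-contained C sketch models the explicit-polling contract. It is not SPDK code: the toy_ names and the ring of eight slots are assumptions, and the model treats every submitted command as already finished by the device so that only the host-side behavior is visible. It shows that callbacks run only inside the poll call, that a completion limit is honored, and that a failed qpair reports -ENXIO.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SLOTS 8u /* hypothetical ring size; must be a power of two */

typedef void (*toy_cmd_cb)(void *cb_arg);

enum toy_qpair_state { TOY_QPAIR_ENABLED, TOY_QPAIR_FAILED };

/* Toy qpair holding commands whose completions have not been reaped yet. */
struct toy_qpair {
    enum toy_qpair_state state;
    toy_cmd_cb cb[TOY_SLOTS];
    void *cb_arg[TOY_SLOTS];
    uint32_t head; /* next completion to deliver */
    uint32_t tail; /* next free slot */
};

/* Queue a command; the toy device is assumed to finish it instantly. */
static int toy_submit(struct toy_qpair *q, toy_cmd_cb cb, void *cb_arg)
{
    assert(q != NULL && cb != NULL);
    if (q->state != TOY_QPAIR_ENABLED) {
        return -ENXIO;
    }
    if (q->tail - q->head >= TOY_SLOTS) {
        return -ENOMEM;
    }
    q->cb[q->tail % TOY_SLOTS] = cb;
    q->cb_arg[q->tail % TOY_SLOTS] = cb_arg;
    q->tail++;
    return 0;
}

/* Deliver up to max completions (0 means no limit); negative on failure. */
static int32_t toy_process_completions(struct toy_qpair *q, uint32_t max)
{
    int32_t n = 0;

    if (q->state == TOY_QPAIR_FAILED) {
        return -ENXIO;
    }
    while (q->head != q->tail && (max == 0 || (uint32_t)n < max)) {
        uint32_t slot = q->head++ % TOY_SLOTS;

        q->cb[slot](q->cb_arg[slot]); /* user callback runs here, not earlier */
        n++;
    }
    return n;
}

static void toy_count_cb(void *cb_arg)
{
    (*(int *)cb_arg)++;
}

/* Usage: returns -ENXIO if every earlier step behaved as described. */
int toy_completion_demo(void)
{
    struct toy_qpair q = { .state = TOY_QPAIR_ENABLED };
    int done = 0;
    int i;

    for (i = 0; i < 3; i++) {
        toy_submit(&q, toy_count_cb, &done);
    }
    if (done != 0) {
        return -1; /* nothing may complete without polling */
    }
    if (toy_process_completions(&q, 2) != 2 || done != 2) {
        return -2; /* the limit of two must be honored */
    }
    if (toy_process_completions(&q, 0) != 1 || done != 3) {
        return -3; /* the unlimited poll reaps the remainder */
    }
    q.state = TOY_QPAIR_FAILED;
    return toy_process_completions(&q, 0);
}
```

Between the submit loop and the first poll, the counter stays at zero no matter how long the program waits. That is the behavior to remember when a volume appears hung: check that something is calling the completion function.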

Poll Groups

Polling each qpair individually is possible, but poll groups let SPDK group qpairs by transport and process completions together.

Source anchors:

  • include/spdk/nvme.h:spdk_nvme_poll_group_create().
  • include/spdk/nvme.h:spdk_nvme_poll_group_add().
  • include/spdk/nvme.h:spdk_nvme_poll_group_process_completions().
  • lib/nvme/nvme_poll_group.c:spdk_nvme_poll_group_create().
  • lib/nvme/nvme_poll_group.c:spdk_nvme_poll_group_add().
  • lib/nvme/nvme_poll_group.c:spdk_nvme_poll_group_process_completions().

spdk_nvme_poll_group_create() allocates the group, copies optional accel callbacks, creates an fd group for interrupt mode when supported, and initializes transport-group state.

spdk_nvme_poll_group_add() requires the qpair to be disconnected, validates interrupt compatibility, creates a transport poll group if needed, and delegates to the transport.

spdk_nvme_poll_group_process_completions() prevents reentrant polling, loops all transport poll groups, accumulates completions, and returns a negative error if any transport reports one.
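The aggregation behavior just described can be modeled in a small self-contained C sketch. It is a toy under stated assumptions, not the SPDK implementation: the toy_ types are hypothetical, and each transport sub-group is reduced to a single number that is either a count of ready completions or a negative errno.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_TGROUPS 4

/* One transport-specific sub-group, reduced to a single number. */
struct toy_transport_group {
    int64_t pending; /* completions ready to reap, or a negative errno */
};

struct toy_poll_group {
    struct toy_transport_group tgroups[TOY_MAX_TGROUPS];
    int num_tgroups;
    bool in_process_completions; /* reentrancy guard */
};

/* Sum completions across transports; return the first error if any. */
static int64_t toy_poll_group_process_completions(struct toy_poll_group *g)
{
    int64_t total = 0;
    int64_t error = 0;
    int i;

    assert(g != NULL);
    if (g->in_process_completions) {
        return 0; /* a nested call does no work */
    }
    g->in_process_completions = true;

    for (i = 0; i < g->num_tgroups; i++) {
        int64_t local = g->tgroups[i].pending;

        if (local < 0) {
            if (error == 0) {
                error = local; /* remember only the first error */
            }
        } else {
            total += local;
            g->tgroups[i].pending = 0; /* completions were consumed */
        }
    }

    g->in_process_completions = false;
    return error != 0 ? error : total;
}

/* Usage: returns the healthy total if the error and reentry checks pass. */
int64_t toy_poll_group_demo(void)
{
    struct toy_poll_group g = { .tgroups = { { 3 }, { 5 } }, .num_tgroups = 2 };
    int64_t healthy = toy_poll_group_process_completions(&g);

    /* Refill the first transport and make the second report an error. */
    g.tgroups[0].pending = 4;
    g.tgroups[1].pending = -ENXIO;
    if (toy_poll_group_process_completions(&g) != -ENXIO) {
        return -1000;
    }

    /* Simulate a call made while polling is already in progress. */
    g.in_process_completions = true;
    if (toy_poll_group_process_completions(&g) != 0) {
        return -1001;
    }
    return healthy;
}
```

Note that in this model a negative return does not mean that no completions were processed. The healthy transport was still drained; the error from the other transport simply takes priority in the return value.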

The NVMe bdev module builds on this:

Source anchors:

  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_create_poll_group_cb().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_poll().

Reset, Detach, And Hotremove

Detach source anchors:

  • include/spdk/nvme.h:spdk_nvme_detach().
  • include/spdk/nvme.h:spdk_nvme_detach_async().
  • include/spdk/nvme.h:spdk_nvme_detach_poll_async().
  • lib/nvme/nvme.c:spdk_nvme_detach().
  • lib/nvme/nvme.c:spdk_nvme_detach_async().
  • lib/nvme/nvme.c:spdk_nvme_detach_poll_async().

Controller shutdown and detach may require polling the admin queue and checking controller status.

Source anchor: lib/nvme/nvme_ctrlr.c:nvme_ctrlr_shutdown_poll_async().

Hotremove can surface as failed qpair processing, remove callbacks from probe, or bdev module hotplug logic. For PCIe, physical presence and VFIO ownership matter. For fabrics, disconnect/reconnect policy matters.

Prose Diagram

Imagine a controller as a box with one admin qpair at the top and multiple I/O qpairs below it. To the left is spdk_nvme_transport_id, which names either a PCIe address or a fabrics address. To the right is a poll loop.

Probe/connect creates a probe context. The probe context creates controller objects. Controller initialization drives the admin qpair until the controller is ready. The application allocates I/O qpairs. Namespace commands go down I/O qpairs. The poll loop calls either spdk_nvme_qpair_process_completions() or spdk_nvme_poll_group_process_completions() and completions invoke user callbacks.

Edge Cases And Failure Modes

  • Probe callback rejects a controller: no attach callback for it.
  • Attach callback receives a controller that is ready, but later namespace changes still need handling.
  • Direct connect with bad transport ID: probe/connect context may fail or attach nothing.
  • Controller already attached: shared controller logic and transport ID comparison matter.
  • Controller not ready: spdk_nvme_ctrlr_alloc_io_qpair() returns NULL.
  • Interrupt mode mismatch: poll group rejects qpairs with incompatible interrupt settings.
  • Qpair failed or removed: completion polling can return -ENXIO.
  • Admin queue stuck: controller initialization, reset, or detach may not progress.
  • No polling: commands remain outstanding.
  • PCIe secondary process: probe polling has special behavior for non-primary processes.
  • User detaches while other threads use controller: API docs warn the application must ensure no other users remain.

Misconceptions To Kill

  • "SPDK NVMe is synchronous because spdk_nvme_connect() returns a controller." Internally it drives an async initialization process.
  • "A namespace is a bdev." No. A namespace is an NVMe library object. The NVMe bdev module maps each namespace to an SPDK bdev.
  • "I/O qpairs are global." They are resources tied to a controller and commonly allocated per thread or channel.
  • "Poll group is optional decoration." For high-scale initiator use, poll groups are central to efficient completion handling.
  • "Admin queue is only used at startup." It is also used for health, async events, log pages, reset, detach, and passthrough admin commands.

Source Reading Exercise

Read:

  1. lib/nvme/nvme.c:spdk_nvme_connect().
  2. lib/nvme/nvme.c:spdk_nvme_connect_async().
  3. lib/nvme/nvme.c:spdk_nvme_probe_poll_async().
  4. lib/nvme/nvme_ctrlr.c:nvme_ctrlr_process_init().
  5. lib/nvme/nvme_ctrlr.c:spdk_nvme_ctrlr_alloc_io_qpair().
  6. lib/nvme/nvme_poll_group.c:spdk_nvme_poll_group_process_completions().
  7. lib/nvme/nvme_qpair.c:spdk_nvme_qpair_process_completions().

Questions:

  • How does spdk_nvme_connect() reuse probe machinery?
  • What return value means async probe is still in progress?
  • Why can I/O qpair allocation fail when memory is available?
  • Which function must run to make completions happen?
  • Where is interrupt-mode compatibility checked for poll groups?

Operational Lab

Build a debug checklist for "NVMe-oF attach hangs":

  1. Was the transport ID parsed correctly?
  2. Did spdk_nvme_connect_async() return a probe context?
  3. Is spdk_nvme_probe_poll_async() being called repeatedly?
  4. Did an attach callback run?
  5. Did controller initialization reach ready?
  6. Is admin queue completion processing happening?
  7. Were I/O qpairs allocated and connected?
  8. Is a poll group processing completions?
  9. Did qpair failure reason become non-none?

For each item, write the source function you would instrument first.

Self-Check

  1. What is the difference between probe and connect?
  2. Why does async probe return -EAGAIN?
  3. What does the admin qpair do during controller initialization?
  4. Why are I/O qpairs separate from the admin qpair?
  5. What does a poll group contain?
  6. Why can completion polling return a negative value?
  7. What object represents a namespace in lib/nvme?
  8. Why is direct use of the NVMe library lower level than bdev use?

References

  • Local source: include/spdk/nvme.h.
  • Local source: lib/nvme/nvme.c.
  • Local source: lib/nvme/nvme_ctrlr.c.
  • Local source: lib/nvme/nvme_qpair.c.
  • Local source: lib/nvme/nvme_poll_group.c.
  • Local source: lib/nvme/nvme_ns_cmd.c.
  • NVM Express specifications: https://nvmexpress.org/specifications/
  • SPDK documentation: https://spdk.io/doc/