SPDK From First Principles

SPDK deep learning path

Chapter 18: NVMe bdev Module

Source: drafts/bdev-nvme/18-nvme-bdev-module.md

Reader Promise

By the end of this chapter you should be able to trace bdev_nvme_attach_controller from JSON-RPC to NVMe connect to namespace bdev registration, then trace a read or write from bdev I/O to spdk_nvme_ns_cmd_* and completion polling. You should also understand the per-thread qpair model, poll groups, multipath/failover basics, reset, namespace changes, health/stat surfaces, and common failure modes.

This chapter connects Part 4's bdev model to Chapter 17's NVMe initiator library.

Mental Model

The NVMe bdev module adapts NVMe controllers and namespaces into SPDK bdevs.

At the bottom:

  • lib/nvme owns controllers, namespaces, qpairs, commands, and completions.

In the middle:

  • module/bdev/nvme owns controller groups, namespace-to-bdev mapping, per-thread bdev channels, NVMe qpair channels, poll groups, multipath, reconnect, and bdev-specific options.

At the top:

  • Applications see ordinary bdevs named from the attach controller base name and namespace ID.

Key source anchors:

  • module/bdev/nvme/bdev_nvme.c:nvme_if.
  • module/bdev/nvme/bdev_nvme.c:struct nvme_bdev_io.
  • module/bdev/nvme/bdev_nvme.h:struct nvme_bdev_ctrlr.
  • module/bdev/nvme/bdev_nvme.h:struct nvme_bdev.
  • module/bdev/nvme/bdev_nvme.h:struct nvme_qpair.
  • module/bdev/nvme/bdev_nvme.h:struct nvme_ctrlr_channel.
  • module/bdev/nvme/bdev_nvme.h:struct nvme_io_path.
  • module/bdev/nvme/bdev_nvme.h:struct nvme_bdev_channel.
  • module/bdev/nvme/bdev_nvme.h:struct nvme_poll_group.

Why This Matters For diskengine/excloud

diskengine uses NVMe bdevs in two major ways:

  • Storage node: attach local PCIe NVMe SSDs and build higher-level storage on top.
  • Baremetal or consumer node: attach remote NVMe-oF namespaces, often with multipath/failover policy.

The NVMe bdev module is where control-plane terms such as "controller name," "transport ID," "host NQN," "subsystem NQN," "reconnect delay," "controller loss timeout," "multipath," and "namespace bdev name" become source-level state.

When a diskengine operation says "attach this remote disk," this module is usually the bridge from JSON-RPC intent to real queues.

Module Registration And Options

The module is registered as nvme.

Source anchors:

  • module/bdev/nvme/bdev_nvme.c:nvme_if.
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_get_ctx_size().
  • module/bdev/nvme/bdev_nvme.c:SPDK_BDEV_MODULE_REGISTER(nvme, &nvme_if).

The module sets:

  • module_init = bdev_nvme_init.
  • module_fini = bdev_nvme_fini.
  • async_fini = true.
  • config_json = bdev_nvme_config_json.
  • get_ctx_size = bdev_nvme_get_ctx_size.

Per-I/O context is struct nvme_bdev_io.

Source anchor: module/bdev/nvme/bdev_nvme.c:struct nvme_bdev_io.

That context stores:

  • SGL iteration state.
  • Current I/O path.
  • NVMe completion status.
  • Extended I/O options.
  • fused-command state.
  • retry count and retry time.
  • zone-report state.
  • submit timestamp.

RPC Attach Controller

The main user-facing RPC is bdev_nvme_attach_controller.

Source anchors:

  • module/bdev/nvme/bdev_nvme_rpc.c:struct rpc_bdev_nvme_attach_controller.
  • module/bdev/nvme/bdev_nvme_rpc.c:rpc_bdev_nvme_attach_controller_decoders.
  • module/bdev/nvme/bdev_nvme_rpc.c:rpc_bdev_nvme_attach_controller().
  • module/bdev/nvme/bdev_nvme_rpc.c:rpc_bdev_nvme_attach_controller_done().
  • module/bdev/nvme/bdev_nvme_rpc.c:rpc_bdev_nvme_attach_controller_examined().
  • module/bdev/nvme/bdev_nvme_rpc.c:SPDK_RPC_REGISTER("bdev_nvme_attach_controller", ...).

The RPC:

  1. Allocates request context.
  2. Gets default NVMe controller options with spdk_nvme_ctrlr_get_default_ctrlr_opts().
  3. Gets default bdev NVMe controller options with spdk_bdev_nvme_get_default_ctrlr_opts().
  4. Decodes JSON.
  5. Parses trtype, traddr, optional adrfam, trsvcid, subnqn, host fields, digest/auth fields, timeout fields, and multipath mode.
  6. Checks duplicate controller/path cases.
  7. Validates num_io_queues.
  8. Calls spdk_bdev_nvme_create().
  9. Waits for bdev examine before returning the created bdev names.

Misconception to kill: bdev_nvme_attach_controller does not itself create queues and bdevs inline. It validates RPC input and hands off to the async attach path.

Programmatic Create Path

The bdev module's public create API is spdk_bdev_nvme_create().

Source anchors:

  • include/spdk/module/bdev/nvme.h:spdk_bdev_nvme_create().
  • module/bdev/nvme/bdev_nvme.c:spdk_bdev_nvme_create().

spdk_bdev_nvme_create():

  • Rejects duplicate transport ID/host NQN.
  • Validates controller name length.
  • Validates controller loss/reconnect/fast-fail parameters.
  • Allocates nvme_async_probe_ctx.
  • Copies base name, output names array, callback, transport ID, bdev options, and driver options.
  • Applies module-wide options such as transport retry count, keep-alive timeout, admin read-ANA behavior, TOS, and interrupt mode.
  • Resolves PSK or DH-HMAC-CHAP keys if configured.
  • Starts an async NVMe probe/connect and registers a poller to finish it.

Connect callbacks:

  • module/bdev/nvme/bdev_nvme.c:connect_attach_cb().
  • module/bdev/nvme/bdev_nvme.c:connect_set_failover_cb().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_async_poll().

bdev_nvme_async_poll() calls spdk_nvme_probe_poll_async() until the NVMe library attach process is done.
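The poll-until-done shape can be sketched in isolation. This is a minimal model, not the module's code: fake_probe_ctx, fake_probe_poll_async(), attach_state, and attach_poller() are hypothetical stand-ins. It also assumes the library poll call reports "still in progress" as -EAGAIN and "finished" as 0, which should be confirmed against include/spdk/nvme.h.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the NVMe library's async probe context. */
struct fake_probe_ctx {
    int steps_left;             /* connect steps still pending */
};

/* Stand-in for the library poll call: -EAGAIN while work remains, 0 when done. */
static int fake_probe_poll_async(struct fake_probe_ctx *ctx)
{
    assert(ctx != NULL);
    if (ctx->steps_left > 0) {
        ctx->steps_left--;
        return -EAGAIN;
    }
    return 0;
}

/* Hypothetical per-attach state kept alive between poller ticks. */
struct attach_state {
    struct fake_probe_ctx probe;
    bool poller_registered;
    bool done_cb_called;
};

/* One poller tick. Returns true if the poller should stay registered. */
static bool attach_poller(struct attach_state *st)
{
    int rc = fake_probe_poll_async(&st->probe);

    if (rc == -EAGAIN) {
        return true;                    /* not finished: run again next tick */
    }
    st->poller_registered = false;      /* finished or failed: stop polling */
    st->done_cb_called = true;          /* the attach callback would fire here */
    return false;
}
```

The point of the shape is that the RPC handler returns long before the connect finishes; only the final poller tick triggers the callback that eventually answers the RPC.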

Controller Grouping

The NVMe bdev module groups one or more NVMe controllers under a bdev controller name.

Source anchors:

  • module/bdev/nvme/bdev_nvme.h:struct nvme_bdev_ctrlr.
  • module/bdev/nvme/bdev_nvme.c:g_nvme_bdev_ctrlrs.
  • module/bdev/nvme/bdev_nvme.c:nvme_bdev_ctrlr_get_by_name().
  • module/bdev/nvme/bdev_nvme.c:nvme_bdev_ctrlr_create().

struct nvme_bdev_ctrlr contains:

  • name: logical controller name from RPC.
  • ctrlrs: one or more struct nvme_ctrlr paths.
  • bdevs: namespace bdevs created from those controllers.

This structure is what lets SPDK represent multipath or failover for one logical controller name.

Edge case: when adding another path to an existing controller name, the module checks that multipath/failover configuration, host NQN, subnqn, and path details are compatible.

Source anchors:

  • module/bdev/nvme/bdev_nvme_rpc.c:rpc_bdev_nvme_attach_controller().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_check_multipath().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_add_secondary_trid().

Namespace To bdev Mapping

Once a controller is attached, namespaces are populated and mapped to bdevs.

Source anchors:

  • module/bdev/nvme/bdev_nvme.c:nvme_ctrlr_populate_namespaces().
  • module/bdev/nvme/bdev_nvme.c:nvme_ctrlr_populate_namespace().
  • module/bdev/nvme/bdev_nvme.c:nvme_bdev_create().
  • module/bdev/nvme/bdev_nvme.c:nvme_ctrlr_populate_namespace_done().
  • module/bdev/nvme/bdev_nvme.c:nvme_ctrlr_populate_namespaces_done().

nvme_bdev_create():

  • Allocates struct nvme_bdev.
  • Builds the embedded struct spdk_bdev with nbdev_create().
  • Registers an io_device for the NVMe bdev.
  • Links the namespace to the bdev.
  • Links the bdev to the controller group.
  • Calls spdk_bdev_register().

If a bdev for the namespace already exists, nvme_ctrlr_populate_namespace() adds the namespace path to the existing bdev instead of creating a duplicate.

Namespace removal source anchors:

  • module/bdev/nvme/bdev_nvme.c:nvme_ctrlr_depopulate_namespace().
  • module/bdev/nvme/bdev_nvme.c:nvme_ctrlr_depopulate_namespace_done().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_delete_io_path_done().

If the last namespace path for a bdev disappears, the module unregisters the bdev.
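The populate and depopulate rules amount to path counting per namespace bdev. The sketch below is a toy model under that assumption: ns_bdev, bdev_table, populate_namespace(), and depopulate_namespace() are hypothetical names, namespaces are matched by NSID only, and table slots are never reused.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_BDEVS 8

/* Hypothetical namespace bdev: tracks only how many paths reach it. */
struct ns_bdev {
    unsigned nsid;
    int path_count;
    bool registered;
};

struct bdev_table {
    struct ns_bdev bdevs[MODEL_MAX_BDEVS];
    int count;
};

/* Find a registered bdev for this namespace, or NULL. */
static struct ns_bdev *table_find(struct bdev_table *t, unsigned nsid)
{
    int i;

    for (i = 0; i < t->count; i++) {
        if (t->bdevs[i].registered && t->bdevs[i].nsid == nsid) {
            return &t->bdevs[i];
        }
    }
    return NULL;
}

/* First path creates and registers the bdev; later paths only add a path. */
static struct ns_bdev *populate_namespace(struct bdev_table *t, unsigned nsid)
{
    struct ns_bdev *b = table_find(t, nsid);

    if (b != NULL) {
        b->path_count++;
        return b;
    }
    if (t->count == MODEL_MAX_BDEVS) {
        return NULL;
    }
    b = &t->bdevs[t->count++];
    b->nsid = nsid;
    b->path_count = 1;
    b->registered = true;
    return b;
}

/* Dropping the last path unregisters the bdev. */
static void depopulate_namespace(struct bdev_table *t, unsigned nsid)
{
    struct ns_bdev *b = table_find(t, nsid);

    if (b == NULL) {
        return;
    }
    assert(b->path_count > 0);
    if (--b->path_count == 0) {
        b->registered = false;
    }
}
```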

Misconception to kill: one controller attach does not mean one bdev. A single attach can create multiple bdevs, one per namespace, which is why the RPC returns an array of bdev names.
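A minimal sketch of why the result is an array: each active namespace gets its own name built from the base name and the namespace ID. The helper names make_bdev_name() and name_namespace_bdevs() are hypothetical, and the "<base>n<nsid>" format is an assumption chosen to match names such as Nvme0n1.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MODEL_NAME_LEN 32

/* Format "<base>n<nsid>" into out. Returns 0 on success, -1 on bad input
 * or if the name would not fit. */
static int make_bdev_name(char *out, size_t out_len, const char *base, unsigned nsid)
{
    int n;

    assert(out != NULL && base != NULL);
    if (strlen(base) == 0) {
        return -1;
    }
    n = snprintf(out, out_len, "%sn%u", base, nsid);
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}

/* Name one bdev per namespace ID. Returns how many names were written,
 * which is capped by max_names even if more namespaces exist. */
static size_t name_namespace_bdevs(const char *base, const unsigned *nsids, size_t num_ns,
                                   char names[][MODEL_NAME_LEN], size_t max_names)
{
    size_t i, created = 0;

    for (i = 0; i < num_ns && created < max_names; i++) {
        if (make_bdev_name(names[created], MODEL_NAME_LEN, base, nsids[i]) == 0) {
            created++;
        }
    }
    return created;
}
```

The max_names cap mirrors the situation where the caller's output array is smaller than the number of namespaces: more bdevs can exist than names are reported.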

Per-Thread qpair And Channel Model

The bdev layer asks the module for an I/O channel:

Source anchor: module/bdev/nvme/bdev_nvme.c:bdev_nvme_get_io_channel().

The NVMe bdev channel is struct nvme_bdev_channel.

Source anchor: module/bdev/nvme/bdev_nvme.h:struct nvme_bdev_channel.

It stores:

  • Current I/O path.
  • Multipath policy and selector.
  • Round-robin state.
  • List of available I/O paths.
  • Retry I/O list and retry poller.
  • Resetting flag.

Poll group state lives in struct nvme_poll_group.

Source anchor: module/bdev/nvme/bdev_nvme.h:struct nvme_poll_group.

It stores:

  • struct spdk_nvme_poll_group *group.
  • accel channel.
  • poller.
  • qpair list.
  • interrupt state.
  • spin-time stats.

Poll group creation:

Source anchor: module/bdev/nvme/bdev_nvme.c:bdev_nvme_create_poll_group_cb().

It calls spdk_nvme_poll_group_create(), registers bdev_nvme_poll() as the poller, and sets up interrupt integration when enabled.

Completion polling:

Source anchor: module/bdev/nvme/bdev_nvme.c:bdev_nvme_poll().

It calls spdk_nvme_poll_group_process_completions() and checks for disconnected qpairs.

Beginner mental model: an NVMe bdev channel is the bdev-facing per-thread object. Under it are one or more NVMe I/O paths, each pointing at a namespace and a qpair. The poll group is the per-thread completion engine for those qpairs.
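The relationships in that mental model can be written down as a toy object graph. All model_* types and model_poll() are hypothetical; the real structures carry far more state, and the busy/idle return value is an assumption about how a poller reports work.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_PATHS  4
#define MODEL_MAX_QPAIRS 8

struct model_ns { unsigned nsid; };

struct model_qpair {
    int id;
    int pending;                /* completions waiting to be reaped */
};

/* One I/O path: a namespace reached through one qpair. */
struct model_io_path {
    struct model_ns *ns;
    struct model_qpair *qpair;
};

/* Per-thread, bdev-facing object: owns the list of usable paths. */
struct model_bdev_channel {
    struct model_io_path *current;
    struct model_io_path paths[MODEL_MAX_PATHS];
    size_t num_paths;
};

/* Per-thread completion engine: every qpair on this thread is polled here. */
struct model_poll_group {
    struct model_qpair *qpairs[MODEL_MAX_QPAIRS];
    size_t num_qpairs;
};

/* One poller tick. Returns 1 if any completion was reaped (busy), 0 if idle. */
static int model_poll(struct model_poll_group *pg)
{
    size_t i;
    int completions = 0;

    assert(pg != NULL && pg->num_qpairs <= MODEL_MAX_QPAIRS);
    for (i = 0; i < pg->num_qpairs; i++) {
        completions += pg->qpairs[i]->pending;
        pg->qpairs[i]->pending = 0;
    }
    return completions > 0 ? 1 : 0;
}
```

Note that the channel and the poll group both reach the same qpair objects: the channel uses a qpair to submit, the poll group uses it to reap.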

I/O Submission

The bdev function table points to:

Source anchors:

  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_submit_request_initial().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_submit_request().
  • module/bdev/nvme/bdev_nvme.c:_bdev_nvme_submit_request().

bdev_nvme_submit_request_initial() initializes retry tracking, then calls bdev_nvme_submit_request().

bdev_nvme_submit_request():

  • Stores submit timestamp.
  • Records trace.
  • Chooses an I/O path with bdev_nvme_find_io_path().
  • Fails non-admin I/O with -ENXIO if no path exists.
  • Calls _bdev_nvme_submit_request().

_bdev_nvme_submit_request() switches on bdev_io->type and calls NVMe-specific helpers.
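The submit flow above reduces to "pick a path, fail fast if there is none, then dispatch on type". The sketch below models only that control flow; the model_* names are hypothetical, path selection is reduced to "first path in the list", and the admin case simply marks the request as accepted without touching an I/O qpair.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum model_io_type { MODEL_IO_READ, MODEL_IO_WRITE, MODEL_IO_FLUSH, MODEL_IO_ADMIN };

struct model_path { int qpair_id; };

struct model_io {
    enum model_io_type type;
    int status;                 /* 0 = submitted, negative errno = failed */
    int submitted_on_qpair;     /* -1 if no I/O qpair was used */
};

/* Hypothetical path lookup: NULL when no usable path exists. */
static struct model_path *model_find_io_path(struct model_path *paths, size_t n)
{
    return n > 0 ? &paths[0] : NULL;
}

static void model_submit(struct model_io *io, struct model_path *paths, size_t n)
{
    struct model_path *path = model_find_io_path(paths, n);

    assert(io != NULL);
    io->submitted_on_qpair = -1;
    if (path == NULL && io->type != MODEL_IO_ADMIN) {
        io->status = -ENXIO;    /* no path: fail non-admin I/O immediately */
        return;
    }
    switch (io->type) {
    case MODEL_IO_READ:
    case MODEL_IO_WRITE:
    case MODEL_IO_FLUSH:
        io->submitted_on_qpair = path->qpair_id;
        io->status = 0;
        break;
    case MODEL_IO_ADMIN:
        io->status = 0;         /* handled separately from the I/O path */
        break;
    default:
        io->status = -EINVAL;   /* unknown type in this toy model */
        break;
    }
}
```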

Relevant I/O helper source anchors:

  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_readv().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_writev().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_unmap().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_flush().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_get_buf_cb().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_admin_passthru().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_abort().

NVMe library command source anchors used by the module:

  • include/spdk/nvme.h:spdk_nvme_ns_cmd_readv_with_md().
  • include/spdk/nvme.h:spdk_nvme_ns_cmd_readv_ext().
  • include/spdk/nvme.h:spdk_nvme_ns_cmd_writev_with_md().
  • include/spdk/nvme.h:spdk_nvme_ns_cmd_writev_ext().
  • include/spdk/nvme.h:spdk_nvme_ns_cmd_flush().
  • include/spdk/nvme.h:spdk_nvme_ns_cmd_write_zeroes().
  • include/spdk/nvme.h:spdk_nvme_ctrlr_cmd_admin_raw().

Completion source anchors:

  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_io_complete_nvme_status().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_io_complete().

The NVMe completion status is preserved so upper layers can inspect NVMe status code type and status code through bdev error helpers.
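A small model shows what "preserved" means in practice: the completion stores the raw status fields next to the pass/fail verdict so a caller can read them back later. The model_* names are hypothetical. The only NVMe fact relied on is that status code type 0 (generic) with status code 0 means success.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct model_nvme_status {
    uint8_t sct;                /* status code type */
    uint8_t sc;                 /* status code */
};

struct model_bdev_io {
    bool success;
    uint32_t cdw0;
    struct model_nvme_status nvme;
};

/* Complete an I/O while keeping the raw NVMe status for later inspection. */
static void model_complete_nvme_status(struct model_bdev_io *io, uint32_t cdw0,
                                       uint8_t sct, uint8_t sc)
{
    assert(io != NULL);
    io->cdw0 = cdw0;
    io->nvme.sct = sct;
    io->nvme.sc = sc;
    io->success = (sct == 0 && sc == 0);
}

/* Upper layers read the preserved status back instead of a bare pass/fail. */
static void model_get_nvme_status(const struct model_bdev_io *io, uint32_t *cdw0,
                                  uint8_t *sct, uint8_t *sc)
{
    assert(io != NULL && cdw0 != NULL && sct != NULL && sc != NULL);
    *cdw0 = io->cdw0;
    *sct = io->nvme.sct;
    *sc = io->nvme.sc;
}
```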

Multipath, Failover, ANA, And I/O Paths

The bdev module's I/O path object is struct nvme_io_path.

Source anchor: module/bdev/nvme/bdev_nvme.h:struct nvme_io_path.

It ties:

  • nvme_ns: namespace path.
  • qpair: qpair for that path.
  • nbdev_ch: the owning NVMe bdev channel, whose cached current path must be cleared when this path changes.
  • optional per-path stats.

The multipath policy and selector are stored on struct nvme_bdev. Each struct nvme_bdev_channel keeps its own copy and caches the currently selected I/O path.

Source anchors:

  • module/bdev/nvme/bdev_nvme.h:struct nvme_bdev.
  • module/bdev/nvme/bdev_nvme.h:struct nvme_bdev_channel.
  • include/spdk/module/bdev/nvme.h:spdk_bdev_nvme_set_multipath_policy().

ANA state can influence path selection for NVMe-oF multipath. Namespace populate can parse ANA log information before adding namespace paths.

Source anchors:

  • module/bdev/nvme/bdev_nvme.c:nvme_ctrlr_populate_namespace().
  • module/bdev/nvme/bdev_nvme.c:nvme_ctrlr_read_ana_log_page().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_parse_ana_log_page().

Misconception to kill: multipath does not mean every I/O is blindly sprayed across every controller. The module tracks path state, policy, selector, retry, ANA information, and qpair health.
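One way to picture state-aware selection is a selector that skips unusable paths and prefers ANA-optimized ones. This is a simplified illustration, not the module's algorithm: the model_* names are hypothetical, only three ANA states are modeled, and the preference order (optimized first, non-optimized as fallback) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_ana_state {
    MODEL_ANA_OPTIMIZED,
    MODEL_ANA_NON_OPTIMIZED,
    MODEL_ANA_INACCESSIBLE
};

struct model_sel_path {
    int id;
    enum model_ana_state ana;
    bool qpair_connected;
};

/* A path is usable only if its qpair is connected and ANA allows access. */
static bool model_path_usable(const struct model_sel_path *p)
{
    return p->qpair_connected && p->ana != MODEL_ANA_INACCESSIBLE;
}

/* Returns the first usable optimized path, else the first usable
 * non-optimized path, else NULL. */
static const struct model_sel_path *model_select_path(const struct model_sel_path *paths,
                                                      size_t n)
{
    const struct model_sel_path *fallback = NULL;
    size_t i;

    assert(paths != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (!model_path_usable(&paths[i])) {
            continue;
        }
        if (paths[i].ana == MODEL_ANA_OPTIMIZED) {
            return &paths[i];
        }
        if (fallback == NULL) {
            fallback = &paths[i];
        }
    }
    return fallback;
}
```

A NULL result corresponds to the "no I/O path" case, where normal I/O cannot be submitted.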

Reset And Reconnect

There are two layers of reset:

  • bdev core reset: freezes all bdev channels and submits an I/O of type SPDK_BDEV_IO_TYPE_RESET.
  • NVMe bdev reset: resets one or more NVMe controller paths and rebuilds qpairs.

Source anchors:

  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_reset_io().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_freeze_bdev_channel().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_freeze_bdev_channel_done().
  • module/bdev/nvme/bdev_nvme.c:_bdev_nvme_reset_io().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_reset_io_continue().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_reset_io_complete().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_reset_ctrlr().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_reset_ctrlr_unsafe().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_reset_create_qpair().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_reset_destroy_qpairs().

The reset I/O freezes NVMe bdev channels, resets controller paths sequentially, and then unfreezes channels and aborts retry I/O. When a controller is already resetting, reset I/O can be queued to avoid fighting the app framework's reset strategy.
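The freeze, sequential reset, and unfreeze sequence can be modeled synchronously. This is a sketch only: the real flow is asynchronous and spread across the callbacks listed above, the model_reset_* names are hypothetical, and stopping at the first failed path is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-thread channel state relevant to reset. */
struct model_reset_channel {
    bool resetting;             /* true while the channel is frozen */
    int retry_queued;           /* I/O parked on the retry list */
};

/* Hypothetical controller path; reset_ok simulates the reset outcome. */
struct model_reset_path {
    bool reset_ok;
    bool reset_attempted;
};

/* Runs one reset I/O. Returns 0 if every attempted path reset succeeded,
 * -1 otherwise. *aborted receives the number of retry I/O aborted. */
static int model_reset_io(struct model_reset_channel *chs, size_t num_chs,
                          struct model_reset_path *paths, size_t num_paths,
                          int *aborted)
{
    size_t i;
    int rc = 0;

    assert(aborted != NULL);

    /* 1. Freeze every channel so no new I/O picks a path mid-reset. */
    for (i = 0; i < num_chs; i++) {
        chs[i].resetting = true;
    }

    /* 2. Reset controller paths one at a time. */
    for (i = 0; i < num_paths; i++) {
        paths[i].reset_attempted = true;
        if (!paths[i].reset_ok) {
            rc = -1;
            break;
        }
    }

    /* 3. Unfreeze channels and abort I/O waiting on the retry lists. */
    *aborted = 0;
    for (i = 0; i < num_chs; i++) {
        chs[i].resetting = false;
        *aborted += chs[i].retry_queued;
        chs[i].retry_queued = 0;
    }
    return rc;
}
```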

Reconnect and failover use controller loss parameters from the attach options:

  • ctrlr_loss_timeout_sec.
  • reconnect_delay_sec.
  • fast_io_fail_timeout_sec.

Validation source anchor: module/bdev/nvme/bdev_nvme.c:bdev_nvme_check_io_error_resiliency_params().

Edge cases:

  • ctrlr_loss_timeout_sec == 0 disables reconnect, so reconnect_delay_sec and fast_io_fail_timeout_sec must both be 0.
  • ctrlr_loss_timeout_sec == -1 means keep trying indefinitely, so reconnect_delay_sec must be nonzero.
  • fast_io_fail_timeout_sec, when set, must not be less than reconnect_delay_sec.
  • A finite ctrlr_loss_timeout_sec must not be less than reconnect_delay_sec or fast_io_fail_timeout_sec.
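These rules fit in a few lines of validation. The function below is a sketch written from the rules as listed here, not a copy of bdev_nvme_check_io_error_resiliency_params(). The name model_check_resiliency_params() is hypothetical, and three details are assumptions: values below -1 are rejected, a fast-fail timeout of 0 means "not set", and a finite controller loss timeout also requires a nonzero reconnect delay.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool model_check_resiliency_params(int32_t ctrlr_loss_timeout_sec,
                                          uint32_t reconnect_delay_sec,
                                          uint32_t fast_io_fail_timeout_sec)
{
    if (ctrlr_loss_timeout_sec < -1) {
        return false;           /* assumption: only -1 is a valid negative */
    }
    if (ctrlr_loss_timeout_sec == 0) {
        /* Reconnect disabled: the other two knobs must stay unset. */
        return reconnect_delay_sec == 0 && fast_io_fail_timeout_sec == 0;
    }
    /* From here on reconnect is enabled (finite or infinite). */
    if (reconnect_delay_sec == 0) {
        return false;
    }
    if (fast_io_fail_timeout_sec != 0 &&
        fast_io_fail_timeout_sec < reconnect_delay_sec) {
        return false;
    }
    if (ctrlr_loss_timeout_sec > 0) {
        uint32_t loss = (uint32_t)ctrlr_loss_timeout_sec;

        assert(loss > 0);
        if (loss < reconnect_delay_sec || loss < fast_io_fail_timeout_sec) {
            return false;
        }
    }
    return true;
}
```

For example, (ctrlr_loss_timeout_sec, reconnect_delay_sec, fast_io_fail_timeout_sec) = (30, 5, 10) passes, while (3, 5, 0) fails because the controller would be declared lost before the first reconnect attempt.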

Health, Stats, And Config

The NVMe bdev module exposes module options and controller information through RPC/config JSON paths.

Source anchors:

  • module/bdev/nvme/bdev_nvme_rpc.c:rpc_bdev_nvme_set_options().
  • module/bdev/nvme/bdev_nvme_rpc.c:rpc_dump_nvme_bdev_controller_info().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_config_json().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_reset_device_stat().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_format_nvme_status().

Per-bdev error statistics can be enabled through module options. Path statistics can also be enabled. These are useful when a logical bdev has multiple paths and one path is unhealthy.

Prose Diagram

Imagine three layers.

Top layer: bdev API. A caller opens Nvme0n1, gets a bdev channel, submits a read.

Middle layer: NVMe bdev module. The bdev channel points to struct nvme_bdev_channel, which has a list of nvme_io_path objects. Path selection picks one path. The selected path points to a namespace and qpair.

Bottom layer: NVMe library. The namespace command API builds an NVMe read command and submits it on the qpair. The qpair belongs to a poll group. bdev_nvme_poll() polls the poll group. The NVMe completion callback completes the bdev I/O.

Side boxes: controller group, namespace tree, ANA state, retry queue, reset state, reconnect timers, and JSON-RPC configuration.

Edge Cases And Failure Modes

  • Attach creates zero bdevs: controller may attach but namespaces are unsupported, inactive, filtered, or populate failed.
  • max_bdevs too small: attach can create more bdevs than names returned to RPC; the module logs when it cannot return all names.
  • Duplicate controller name with same path: RPC rejects it.
  • Duplicate controller name with different subnqn or hostnqn: RPC rejects it.
  • Multipath disabled: adding a second path to same controller name is rejected.
  • Same namespace reached through multiple controllers: module adds namespace path to existing bdev.
  • Last path removed: namespace bdev unregisters.
  • No I/O path: non-admin I/O completes -ENXIO.
  • Qpair failure: poll path clears I/O path caches.
  • Admin queue failure: admin poller triggers disconnected/failover handling.
  • Reset while another reset runs: reset I/O can queue.
  • Reconnect settings invalid: create rejects them before connect.
  • Interrupt mode with non-PCIe: create rejects it.
  • Base bdev semantics do not apply: an NVMe bdev is a physical bdev in bdev terms, not a virtual bdev stacked on a base bdev, but a remote fabrics namespace can still disappear like any network resource.

Misconceptions To Kill

  • "NVMe bdev is just a thin wrapper around spdk_nvme_connect()." No. It adds bdev registration, namespace mapping, channels, qpairs, poll groups, multipath, retry, reset, stats, JSON config, and RPC validation.
  • "One controller attach means one bdev." No. One controller can expose multiple namespaces.
  • "One bdev means one controller." Not with multipath. One namespace bdev can have multiple namespace paths.
  • "Reset is handled entirely by bdev core." No. bdev core coordinates reset I/O, then NVMe bdev resets controllers and qpairs.
  • "A path failure immediately means the bdev is gone." Not necessarily. Multipath or reconnect may keep the bdev visible.
  • "Admin commands use the current I/O path." No. Admin passthrough has its own handling and can proceed even when no regular I/O path is selected.

Source Reading Exercise

Trace attach:

  1. module/bdev/nvme/bdev_nvme_rpc.c:rpc_bdev_nvme_attach_controller().
  2. include/spdk/module/bdev/nvme.h:spdk_bdev_nvme_create().
  3. module/bdev/nvme/bdev_nvme.c:spdk_bdev_nvme_create().
  4. module/bdev/nvme/bdev_nvme.c:connect_attach_cb().
  5. module/bdev/nvme/bdev_nvme.c:nvme_bdev_ctrlr_create().
  6. module/bdev/nvme/bdev_nvme.c:nvme_ctrlr_populate_namespaces().
  7. module/bdev/nvme/bdev_nvme.c:nvme_ctrlr_populate_namespace().
  8. module/bdev/nvme/bdev_nvme.c:nvme_bdev_create().
  9. lib/bdev/bdev.c:spdk_bdev_register().

Trace one read:

  1. lib/bdev/bdev.c:spdk_bdev_readv_blocks().
  2. lib/bdev/bdev.c:bdev_io_submit().
  3. module/bdev/nvme/bdev_nvme.c:bdev_nvme_submit_request_initial().
  4. module/bdev/nvme/bdev_nvme.c:bdev_nvme_submit_request().
  5. module/bdev/nvme/bdev_nvme.c:bdev_nvme_readv().
  6. include/spdk/nvme.h:spdk_nvme_ns_cmd_readv_with_md() or include/spdk/nvme.h:spdk_nvme_ns_cmd_readv_ext().
  7. module/bdev/nvme/bdev_nvme.c:bdev_nvme_poll().
  8. lib/nvme/nvme_poll_group.c:spdk_nvme_poll_group_process_completions().
  9. module/bdev/nvme/bdev_nvme.c:bdev_nvme_io_complete_nvme_status().

Questions:

  • Where is JSON converted to struct spdk_nvme_transport_id?
  • Where does the module decide whether this is a second path?
  • Where does namespace ID become bdev identity?
  • Where is the per-thread qpair reached from a bdev channel?
  • Where does no-path turn into an I/O failure?

Operational Lab

Debug "remote NVMe bdev exists but I/O hangs" on paper.

Checklist:

  1. Is bdev_nvme_poll() registered on the thread that owns the qpair?
  2. Does spdk_nvme_poll_group_process_completions() return completions, zero, or negative?
  3. Does qpair failure reason indicate a disconnected qpair?
  4. Did bdev_nvme_check_io_qpairs() clear path caches?
  5. Does bdev_nvme_find_io_path() return NULL?
  6. Are I/O sitting on retry_io_list?
  7. Is nbdev_ch->resetting true?
  8. Is admin queue poller seeing failure and triggering failover?
  9. Are reconnect and fast-fail timeouts configured coherently?

For each checklist item, write down the source function where you would add a log line to answer it.

Self-Check

  1. Why does bdev_nvme_attach_controller return an array?
  2. What is the difference between struct nvme_bdev_ctrlr and struct nvme_ctrlr?
  3. What does struct nvme_io_path connect together?
  4. Why does the NVMe bdev module need a poll group?
  5. What happens when no I/O path is available for a normal read?
  6. Why are reconnect timeout combinations validated before attach?
  7. How does a namespace removal become a bdev unregister?
  8. Why is multipath more than a list of transport IDs?

References

  • Local source: include/spdk/module/bdev/nvme.h.
  • Local source: module/bdev/nvme/bdev_nvme.c.
  • Local source: module/bdev/nvme/bdev_nvme.h.
  • Local source: module/bdev/nvme/bdev_nvme_rpc.c.
  • Local source: include/spdk/nvme.h.
  • Local source: lib/nvme/nvme.c.
  • Local source: lib/nvme/nvme_poll_group.c.
  • NVM Express specifications: https://nvmexpress.org/specifications/
  • SPDK documentation: https://spdk.io/doc/