Chapter 33: Debugging Playbooks | SPDK From First Principles

Reader Promise

This chapter turns the failure taxonomy into action. The goal is not to memorize every SPDK failure. The goal is to stop debugging randomly. Each playbook starts with a symptom, identifies the likely layer, names the first safe observations, and then points to the SPDK source families that explain the behavior.

The rule: observe before mutating. Do not delete bdevs, kill controllers, or replay config until you know which layer is inconsistent.
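
One way to make this rule mechanical is an allowlist of read-only RPC methods in whatever tooling issues RPCs. The sketch below is a minimal illustration, not part of SPDK: `guard_rpc` is a hypothetical helper, and the allowlist is a partial set of SPDK RPC methods that only read state. Check it against your SPDK version before relying on it.

```python
# Read-only SPDK RPC methods that are safe as first observations.
# Illustrative and not exhaustive; verify against your SPDK version.
READ_ONLY_RPCS = frozenset({
    "bdev_get_bdevs",
    "bdev_get_iostat",
    "bdev_lvol_get_lvstores",
    "bdev_raid_get_bdevs",
    "bdev_nvme_get_controllers",
    "nvmf_get_subsystems",
    "vhost_get_controllers",
    "thread_get_stats",
    "framework_get_reactors",
})


def guard_rpc(method: str, layer_identified: bool) -> bool:
    """Return True if the RPC may be issued now (hypothetical helper).

    Read-only methods are always allowed. Anything else is treated as a
    mutation and is allowed only once the inconsistent layer is known.
    """
    if method in READ_ONLY_RPCS:
        return True
    return layer_identified


if __name__ == "__main__":
    print(guard_rpc("bdev_get_bdevs", layer_identified=False))    # True
    print(guard_rpc("bdev_lvol_delete", layer_identified=False))  # False
```

Treating every unknown method as a mutation is deliberate: a missing allowlist entry costs a manual override, while a missing denylist entry can cost data.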

Playbook Format

Field	Meaning
Symptom	What the operator or diskengine loop sees.
First classification	Hardware/env, runtime, bdev graph, metadata, transport, RPC/config, or diskengine reconciliation.
First safe checks	Read-only RPCs, logs, stats, or source inspection.
Likely source family	The SPDK files to read next.
Mutation boundary	The point where it becomes risky to change state.
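
If playbooks are kept in tooling rather than in prose, the five fields map onto a small record type. The sketch below is one possible shape: the `Playbook` class is hypothetical, and the example values are read off the Volume Missing playbook in this chapter. The candidate layers in particular are an interpretation of its steps, not a statement from SPDK.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Playbook:
    """One playbook record; fields mirror the format table (hypothetical type)."""
    symptom: str
    first_classification: Tuple[str, ...]   # candidate layers, most likely first
    first_safe_checks: Tuple[str, ...]      # read-only observations only
    likely_source_family: Tuple[str, ...]   # SPDK files to read next
    mutation_boundary: str


# Example values taken from the Volume Missing playbook in this chapter.
VOLUME_MISSING = Playbook(
    symptom="Volume missing",
    # Interpretation: the layers that the Volume Missing steps point at.
    first_classification=(
        "hardware/env", "metadata", "transport", "diskengine reconciliation",
    ),
    first_safe_checks=("bdev_get_bdevs shows the base device", "lvstore exists"),
    likely_source_family=(
        "lib/bdev/bdev.c",
        "module/bdev/nvme/bdev_nvme.c",
        "module/bdev/lvol/vbdev_lvol.c",
        "lib/lvol/lvol.c",
        "lib/blob/blobstore.c",
    ),
    mutation_boundary="do not recreate an lvol until import state and naming are known",
)
```

The record is frozen so that a playbook cannot be edited in the middle of an incident.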

Volume Missing

Step	Question	Next action
1	Does `bdev_get_bdevs` show the base device?	If no, debug NVMe attach, VFIO binding, or NVMe-oF initiator.
2	Does the lvstore exist?	If no, check bdev examine and blobstore/lvol import.
3	Does the lvol exist without being exported?	Debug transport/export state first, not the blobstore.
4	Does the diskengine DB say the volume exists?	Reconcile desired state against observed SPDK state.
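
The four steps form a ladder: stop at the first rung that fails. A minimal sketch of that ladder follows, assuming the caller has already collected plain sets of names from read-only RPCs and from the diskengine DB. The function, the enum, and the example names are hypothetical. Real lvol bdevs may appear under a UUID or an alias, so the name matching here is a simplification.

```python
from enum import Enum


class NextAction(Enum):
    DEBUG_BASE_ATTACH = "debug NVMe attach, VFIO binding, or NVMe-oF initiator"
    CHECK_EXAMINE_IMPORT = "check bdev examine and blobstore/lvol import"
    DEBUG_EXPORT = "debug transport/export state"
    RECONCILE = "reconcile desired state against observed SPDK state"
    NONE = "observed state matches; nothing to do"


def classify_missing_volume(bdev_names, lvstore_names, exported_names,
                            db_volumes, *, base, lvstore, lvol):
    """Walk the Volume Missing steps in order and stop at the first failure."""
    if base not in bdev_names:                       # step 1
        return NextAction.DEBUG_BASE_ATTACH
    if lvstore not in lvstore_names:                 # step 2
        return NextAction.CHECK_EXAMINE_IMPORT
    if lvol in bdev_names and lvol not in exported_names:   # step 3
        return NextAction.DEBUG_EXPORT
    if (lvol in db_volumes) != (lvol in bdev_names):        # step 4
        return NextAction.RECONCILE
    return NextAction.NONE


if __name__ == "__main__":
    action = classify_missing_volume(
        bdev_names={"Nvme0n1"}, lvstore_names=set(), exported_names=set(),
        db_volumes={"vol1"}, base="Nvme0n1", lvstore="lvs0", lvol="vol1",
    )
    print(action.value)
```

Every branch returns something to look at next. None of them recreates or deletes an object, which keeps the ladder on the safe side of the mutation boundary.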

Source anchors:

lib/bdev/bdev.c
module/bdev/nvme/bdev_nvme.c
module/bdev/lvol/vbdev_lvol.c
lib/lvol/lvol.c
lib/blob/blobstore.c

Mutation boundary: do not recreate an lvol until you know whether metadata import is still pending or the lvol exists under a different name.

RAID Stuck Configuring

First classify whether the RAID bdev lacks base bdevs, has metadata disagreement, or is waiting on rebuild/online transition.

Checks:

List bdevs and confirm every base name.
Confirm base block sizes and lengths.
Check RAID state and base membership.
Check whether a base was removed and later re-added with a different identity.
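
The first two checks can be run against a saved `bdev_get_bdevs` result without touching the array. The sketch below assumes each bdev entry is a dict with `name`, `block_size`, and `num_blocks` keys. Confirm those field names against the output of your SPDK version. `check_raid_bases` is a hypothetical helper, and a length mismatch is reported as a finding to examine, not necessarily an error.

```python
def check_raid_bases(bdevs, base_names):
    """Return read-only findings about the expected RAID base bdevs.

    bdevs: list of dicts parsed from a bdev listing (assumed keys:
           name, block_size, num_blocks).
    base_names: base bdev names the RAID is expected to use, in order.
    """
    by_name = {b["name"]: b for b in bdevs}
    findings = []

    # Check 1: every expected base name must be present in the bdev list.
    for name in base_names:
        if name not in by_name:
            findings.append(f"missing base bdev: {name}")

    # Check 2: block sizes and lengths of the bases that are present.
    present = [by_name[n] for n in base_names if n in by_name]
    block_sizes = {b["block_size"] for b in present}
    if len(block_sizes) > 1:
        findings.append(f"block size mismatch: {sorted(block_sizes)}")
    lengths = {b["num_blocks"] for b in present}
    if len(lengths) > 1:
        findings.append(f"length mismatch (num_blocks): {sorted(lengths)}")
    return findings


if __name__ == "__main__":
    listing = [
        {"name": "Nvme0n1", "block_size": 512, "num_blocks": 1000},
        {"name": "Nvme1n1", "block_size": 4096, "num_blocks": 1000},
    ]
    for finding in check_raid_bases(listing, ["Nvme0n1", "Nvme1n1", "Nvme2n1"]):
        print(finding)
```

An empty result does not prove the array is healthy. It only rules out the base-device branch, so the next step is RAID state and membership.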

Source anchors:

module/bdev/raid/bdev_raid.c
module/bdev/raid/raid1.c
module/bdev/raid/raid0.c
lib/bdev/bdev.c

Mutation boundary: do not force base replacement until you know whether the current array state is degraded-but-recoverable or inconsistent.

Controller Reconnect Loop

Symptoms:

Repeated attach/reconnect logs.
IO stalls then resumes.
NVMe-oF path oscillates.
Multipath never settles on an active path.

Checks:

Identify transport: PCIe, RDMA, TCP, or vfio-user.
Check controller timeout and reconnect options.
Check qpair failure reason if available.
Confirm whether the issue is one controller, one path, or every path.
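
The last check, scoping the failure, is simple set arithmetic once paths are written down as (controller, path) pairs. The sketch below is a hypothetical helper over hand-collected data. It does not parse SPDK logs or RPC output, and the controller and path labels are made up for illustration.

```python
def failure_scope(all_paths, failing_paths):
    """Classify how widely a reconnect problem is spread.

    all_paths, failing_paths: sets of (controller, path_id) tuples.
    Returns one of: "none", "every path", "one controller", "one path", "mixed".
    """
    failing = failing_paths & all_paths   # ignore paths we do not know about
    if not failing:
        return "none"
    if failing == all_paths:
        return "every path"

    controllers = {ctrlr for ctrlr, _ in failing}
    if len(controllers) == 1:
        ctrlr = next(iter(controllers))
        ctrlr_paths = {p for p in all_paths if p[0] == ctrlr}
        if failing == ctrlr_paths:
            return "one controller"       # all of its paths are failing
        if len(failing) == 1:
            return "one path"
    return "mixed"


if __name__ == "__main__":
    paths = {("nvme0", "a"), ("nvme0", "b"), ("nvme1", "a")}
    print(failure_scope(paths, {("nvme0", "a")}))                  # one path
    print(failure_scope(paths, {("nvme0", "a"), ("nvme0", "b")}))  # one controller
    print(failure_scope(paths, paths))                             # every path
```

The scope narrows where to read next. "One path" points at that link or target port, "one controller" at the controller state machine, and "every path" at something shared such as the host network, the reactor, or the target.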

Source anchors:

lib/nvme/nvme_ctrlr.c
lib/nvme/nvme_qpair.c
module/bdev/nvme/bdev_nvme.c:bdev_nvme_reconnect_ctrlr
module/bdev/nvme/bdev_nvme.c:bdev_nvme_failover_ctrlr
module/bdev/nvme/bdev_nvme.c:bdev_nvme_reset_ctrlr

Mutation boundary: do not detach a controller backing live bdevs until descriptors, exports, and diskengine desired state are accounted for.

Guest IO Hang

The key question is where the IO stopped:

guest driver
  -> vhost/vfio-user queue
    -> SPDK transport request
      -> bdev submit
        -> lower bdev/lvol/RAID/NVMe
          -> completion callback
            -> CQ/used-ring notification

Checks:

Did the guest submit a descriptor, SQE, or doorbell?
Did SPDK receive the request?
Did bdev stats increase?
Did the lower bdev complete?
Did SPDK write completion state back to the guest-visible queue?
Did the guest receive an interrupt or poll the completion?
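
Because the six questions follow the IO path in order, the first "no" names the stage to investigate. A minimal sketch, assuming the answers have already been gathered by hand into a dict; the stage labels and the function are hypothetical:

```python
# Stages in IO-path order, one per check above.
STAGES = (
    "guest submitted",
    "SPDK received",
    "bdev stats increased",
    "lower bdev completed",
    "completion written to guest queue",
    "guest notified or polled",
)


def first_stalled_stage(observed):
    """Return the first stage with no observed progress, or None.

    observed: dict mapping stage label -> bool. A missing entry counts
    as "not observed", so unanswered questions are never skipped.
    """
    for stage in STAGES:
        if not observed.get(stage, False):
            return stage
    return None


if __name__ == "__main__":
    answers = {
        "guest submitted": True,
        "SPDK received": True,
        "bdev stats increased": True,
        "lower bdev completed": True,
    }
    print(first_stalled_stage(answers))  # completion written to guest queue
```

In the usage example the lower bdev completed but nothing reached the guest-visible queue, so the next files to read are on the export side, not the storage side.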

Source anchors:

lib/vhost/vhost_blk.c
lib/nvmf/vfio_user.c
lib/nvmf/ctrlr_bdev.c
lib/bdev/bdev.c
module/bdev/raid/bdev_raid.c
module/bdev/lvol/vbdev_lvol.c

Mutation boundary: do not reset the guest-facing device until you know whether the bdev completed. A completed bdev with a waiting guest points to export/notification, not storage media.

Config Replay Failure

Symptoms:

SPDK starts but expected objects are missing.
Replay logs show duplicate-name or missing-base errors.
diskengine retries restore and produces repeated RPC failures.

Checks:

Determine which RPC failed first.
Check whether prior RPCs partially succeeded.
Compare saved config, current SPDK graph, and diskengine DB.
Identify whether bdev examine is still in progress.
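
The three-way comparison is easiest to read as a table of object names against the three sources. The sketch below assumes each source has already been reduced to a set of object names. How those names are extracted from the saved config, the running SPDK instance, and the diskengine DB is left out, and the function is hypothetical.

```python
def three_way_diff(saved, observed, db):
    """Return only the names on which the three sources disagree.

    saved:    names in the saved config
    observed: names in the current SPDK graph
    db:       names in the diskengine DB
    Result maps name -> (in_saved, in_observed, in_db).
    """
    result = {}
    for name in sorted(saved | observed | db):
        presence = (name in saved, name in observed, name in db)
        if not all(presence):
            result[name] = presence
    return result


if __name__ == "__main__":
    diff = three_way_diff(
        saved={"lvs0/vol1", "lvs0/vol2"},
        observed={"lvs0/vol1"},
        db={"lvs0/vol1", "lvs0/vol2", "lvs0/vol3"},
    )
    for name, presence in diff.items():
        print(name, presence)
```

A name that is saved and in the DB but not observed suggests a failed or pending replay step. A name that exists only in the DB suggests the saved config is stale. Neither case justifies rerunning creates until the partial-success question is answered.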

Source anchors:

lib/rpc/rpc.c
lib/init/json_config.c
lib/init/subsystem.c
module/bdev/nvme/bdev_nvme_rpc.c
module/bdev/lvol/vbdev_lvol_rpc.c
lib/nvmf/nvmf_rpc.c

Mutation boundary: do not rerun a non-idempotent create loop until duplicate-name and partial-success state are understood.

High Latency

Classify the latency:

Device media latency.
NVMe qpair timeout/retry.
Reactor starvation.
bdev queueing/QoS.
lvol/blobstore metadata operation.
RAID rebuild or degraded path.
Transport congestion.
Guest notification delay.

First safe checks:

bdev stats.
reactor/thread/poller stats.
NVMe health/log pages.
transport-specific counters.
diskengine reconciliation logs.
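
Cumulative bdev counters only become useful as a difference between two snapshots. The sketch below computes average read latency over an interval, assuming per-bdev counters named `num_read_ops` and `read_latency_ticks` and a `tick_rate` in ticks per second, as reported by `bdev_get_iostat`. Treat the field names as assumptions to verify against your SPDK version. The helper itself is hypothetical.

```python
def avg_read_latency_us(before, after, tick_rate):
    """Average read latency in microseconds between two counter snapshots.

    before, after: dicts with cumulative num_read_ops and read_latency_ticks
                   for the same bdev (assumed field names).
    tick_rate:     ticks per second.
    Returns None when no reads completed in the interval.
    """
    ops = after["num_read_ops"] - before["num_read_ops"]
    ticks = after["read_latency_ticks"] - before["read_latency_ticks"]
    if ops <= 0:
        return None
    return ticks / ops * 1_000_000 / tick_rate


if __name__ == "__main__":
    snap0 = {"num_read_ops": 100, "read_latency_ticks": 1_000_000}
    snap1 = {"num_read_ops": 200, "read_latency_ticks": 3_000_000}
    # 2,000,000 ticks over 100 reads at 2 GHz is 10 microseconds per read.
    print(avg_read_latency_us(snap0, snap1, tick_rate=2_000_000_000))  # 10.0
```

Comparing the same interval at the exported bdev and at the base bdev shows whether the latency is added above or below the bdev graph. A `None` result is itself a finding: it means no reads completed in the interval.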

Source anchors:

lib/thread/thread.c
lib/event/reactor.c
lib/bdev/bdev.c
module/bdev/nvme/bdev_nvme.c
lib/nvme/nvme_ctrlr_cmd.c

Mutation boundary: do not tune queue depth or disable polling until you know whether the latency comes from queueing, media, CPU starvation, or transport.

ENOMEM And NOMEM

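SPDK often distinguishes an ordinary allocation failure, which surfaces to the caller as ENOMEM, from the bdev NOMEM retry path, where the bdev layer queues an IO that completed with NOMEM status and resubmits it later. Do not treat every memory-looking error as fatal.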

Checks:

Which allocation failed?
Is the IO queued for retry?
Is an iobuf wait path involved?
Is there a no-memory poller?
Are large requests being split?
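
The second check, whether the IO is queued for retry, is easier to reason about with a toy model. The sketch below is a deliberate simplification and not SPDK's implementation: the class, its capacity limit, and its retry-on-every-completion policy are all hypothetical. The real retry conditions live in the bdev layer source.

```python
from collections import deque


class ToyBdevChannel:
    """Toy model of a channel that parks IO on NOMEM instead of failing it."""

    def __init__(self, capacity):
        self.capacity = capacity      # max IOs the lower layer accepts at once
        self.in_flight = set()
        self.nomem_queue = deque()    # IOs waiting for an internal retry

    def submit(self, io_id):
        if len(self.in_flight) >= self.capacity:
            # Lower layer reported no memory: park the IO, do not fail it.
            self.nomem_queue.append(io_id)
            return "queued_for_retry"
        self.in_flight.add(io_id)
        return "submitted"

    def complete(self, io_id):
        self.in_flight.remove(io_id)
        # A completion frees resources, so retry parked IOs in FIFO order.
        while self.nomem_queue and len(self.in_flight) < self.capacity:
            self.in_flight.add(self.nomem_queue.popleft())


if __name__ == "__main__":
    ch = ToyBdevChannel(capacity=1)
    print(ch.submit("io-a"))   # submitted
    print(ch.submit("io-b"))   # queued_for_retry
    ch.complete("io-a")
    print(sorted(ch.in_flight), len(ch.nomem_queue))  # ['io-b'] 0
```

In the model the caller of `io-b` never sees an error. The IO is delayed and then proceeds. An operator who sees memory-related messages should therefore check the retry queue before concluding that IO was lost.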

Source anchors:

lib/bdev/bdev.c:_bdev_io_handle_no_mem
lib/bdev/bdev.c:bdev_io_retry
lib/thread/thread.c
lib/env_dpdk/env.c
lib/thread/iobuf.c

Mutation boundary: do not fail user-visible IO until you know whether SPDK expects to retry internally.

Self-Check

Why should a guest IO hang be split into "bdev completed" and "guest notified" branches?
What makes config replay dangerous to retry blindly?
Why can RAID configuring be a base-device problem rather than a RAID algorithm problem?
Why is high latency a taxonomy problem before it is a tuning problem?
What is the first safe action in every playbook?