Chapter 32: Failure Mode Taxonomy | SPDK From First Principles

Chapter Goal

This chapter gives a structured method for debugging SPDK failures. It does not try to list every possible bug. Instead, it teaches how to classify symptoms, choose the right first probe, and follow the failure to source code. The goal is to avoid random command execution and build a repeatable debugging habit.

Beginner Mental Model

Most SPDK failures fall into a few buckets:

process does not start.
RPC socket is not usable.
configuration replay fails.
object creation fails.
I/O does not move.
I/O moves but is slow.
transport connects but later disconnects.
memory or buffer pools are exhausted.
an assert or crash stops the process.
shutdown hangs or leaks objects.

The first job is classification. Do not debug a config replay failure like a data-path latency issue. Do not debug a missing RPC method like a JSON syntax issue.

symptom
  |
  +-- classify failure bucket
  |
  +-- collect narrow evidence
  |
  +-- map evidence to source branch
  |
  +-- confirm state with RPC or debugger
  |
  +-- reduce reproduction
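
The classification step can be sketched as a small lookup, shown here only to make the habit concrete. The bucket names and first probes are paraphrased from this chapter; the matching substrings in RULES are hypothetical examples, not a list of real SPDK log messages, so adapt them to the errors you actually see.

```python
# Ordered (substring, bucket) rules. The substrings are hypothetical examples,
# not an authoritative list of SPDK log messages.
RULES = [
    ("method not found", "unknown method"),
    ("spdk_json_decode_object failed", "params decode failure"),
    ("connection refused", "RPC socket failure"),
    ("enomem", "memory or buffer exhaustion"),
    ("assert", "assert or crash"),
]

# First probe per bucket, paraphrased from the sections of this chapter.
FIRST_PROBE = {
    "unknown method": "call rpc_get_methods and search for SPDK_RPC_REGISTER",
    "params decode failure": "read the decoder table in the handler",
    "RPC socket failure": "check that the process exists and the socket path is exact",
    "memory or buffer exhaustion": "inspect hugepages, then query iobuf_get_stats",
    "assert or crash": "save stdout, stderr, the exact binary, and the core file",
}


def classify(first_error: str) -> str:
    """Return the failure bucket for the first visible error, or 'unclassified'."""
    text = first_error.lower()
    for needle, bucket in RULES:
        if needle in text:
            return bucket
    return "unclassified"


def first_probe(first_error: str) -> str:
    """Return the suggested first probe, or a reminder to classify by hand."""
    return FIRST_PROBE.get(classify(first_error), "classify by hand before probing")
```

An "unclassified" result is useful too: it signals that the symptom needs a human decision before any command is run.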

Source Anchors

lib/event/app.c: spdk_app_start: application startup and shutdown framework.
lib/init/json_config.c: spdk_subsystem_load_config: JSON config replay.
lib/rpc/rpc.c: jsonrpc_handler: RPC method lookup and dispatch.
lib/jsonrpc/jsonrpc_client_tcp.c: client-side socket connection errors.
lib/jsonrpc/jsonrpc_server.c: server parse and response behavior.
lib/bdev/bdev.c: core bdev open, submit, complete, and unregister logic.
lib/bdev/bdev_rpc.c: bdev query and stats RPCs.
lib/thread/thread.c: thread, poller, message, and I/O channel behavior.
lib/event/app_rpc.c: thread, poller, reactor, and scheduler RPCs.
lib/nvmf/transport.c: NVMe-oF transport creation and poll group behavior.
lib/nvmf/ctrlr.c: NVMe-oF connect, qpair, controller, and command failures.
module/bdev/nvme/bdev_nvme.c: NVMe bdev attach, reset, path, and detach behavior.
lib/env_dpdk/init.c: DPDK environment initialization.
lib/env_dpdk/memory.c: vtophys and memory mapping behavior.
scripts/setup.sh: hugepage and PCI binding helper.
scripts/core-collector.sh: coredump and backtrace helper.
doc/system_configuration.md: official system configuration guide.
doc/gdb_macros.md: official GDB macro guide.

Start With A Failure Record

Before changing anything, write a short failure record. It should include:

command line used to start the app.
exact SPDK binary path.
config file or RPC sequence.
socket path.
timestamp.
workload command.
first visible error.
whether the process is still alive.
whether the symptom is deterministic.

This sounds boring, but it saves time. SPDK failures are often stateful, and the state can disappear after a restart.
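
One way to make the record hard to skip is to give it a fixed shape. The sketch below mirrors the list above as a Python dataclass and serializes it to JSON. The class name, field names, and example values are illustrative assumptions, not part of SPDK.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _utc_now() -> str:
    """Return the current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class FailureRecord:
    """One failure record; field names are illustrative, not an SPDK format."""
    command_line: str
    binary_path: str
    config_or_rpc_sequence: str
    socket_path: str
    workload_command: str
    first_visible_error: str
    process_alive: bool
    deterministic: bool
    timestamp: str = field(default_factory=_utc_now)

    def to_json(self) -> str:
        """Serialize the record so it can be saved next to the logs."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Example values are placeholders for your own environment.
example = FailureRecord(
    command_line="build/bin/spdk_tgt -m 0x3",
    binary_path="/opt/spdk/build/bin/spdk_tgt",
    config_or_rpc_sequence="config.json",
    socket_path="/var/tmp/spdk.sock",
    workload_command="(workload command here)",
    first_visible_error="(first error line here)",
    process_alive=True,
    deterministic=False,
)
```

Saving example.to_json() before the first restart keeps the evidence even if the process state is lost.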

Startup Failures

Startup failures happen before the app reaches runtime. Common causes:

missing hugepages.
PCI devices not bound to expected driver.
unsupported command-line option.
feature not compiled in.
config replay error.
RPC socket path conflict.
DPDK EAL initialization failure.

First probes:

1. capture full stdout and stderr
2. confirm command line
3. confirm build options if feature-dependent
4. inspect hugepage setup
5. inspect PCI binding
6. run with a minimal config

Source anchors:

lib/event/app.c: spdk_app_start.
lib/env_dpdk/init.c: spdk_env_init.
scripts/setup.sh: configure_linux_hugepages.
scripts/setup.sh: configure_linux_pci.
doc/applications.md.
doc/system_configuration.md.

Do not start by editing the config if the process fails before parsing it. First identify whether the failure is in the framework, the environment, or the config.

RPC Socket Failures

Socket failures mean the client cannot talk to the process. They are different from RPC method failures.

Common causes:

process is not running.
socket path is wrong.
stale socket file remains.
permissions block access.
server is listening on a different address.
TCP address is malformed.
app is still starting and socket is not open yet.

Source anchors:

lib/init/rpc.c: spdk_rpc_initialize.
lib/jsonrpc/jsonrpc_server_tcp.c: spdk_jsonrpc_server_listen.
lib/jsonrpc/jsonrpc_client_tcp.c: spdk_jsonrpc_client_connect.
app/spdk_top/spdk_top.c: example call site of spdk_jsonrpc_client_connect.

Debug sequence:

1. check process exists
2. check the exact socket path
3. check permissions
4. remove stale socket only when the process is definitely gone
5. use rpc_get_methods as a low-risk test

If rpc_get_methods succeeds, the RPC socket path is working: the client reached the process and received a JSON-RPC response. Move to method or state debugging.
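
The path and permission checks in the debug sequence above can be sketched for a Unix domain socket as follows. This is a minimal probe using only the Python standard library; the function name and the returned labels are assumptions made for illustration. It connects and immediately closes without sending a request, so it tells you only whether something is listening, not whether RPC works.

```python
import errno
import os
import socket
import stat


def probe_unix_socket(path: str, timeout: float = 1.0) -> str:
    """Classify the state of a Unix domain socket path without sending data."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    except PermissionError:
        return "permission_denied"
    if not stat.S_ISSOCK(mode):
        return "not_a_socket"

    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
    except ConnectionRefusedError:
        # The socket file exists but nothing accepts connections on it:
        # either a stale file or a process that is not listening yet.
        return "stale_or_not_listening"
    except PermissionError:
        return "permission_denied"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return "error:" + str(errno.errorcode.get(exc.errno, exc.errno))
    finally:
        sock.close()
    return "listening"
```

A result of "stale_or_not_listening" is not permission to delete the file. Confirm the process is gone first, as step 4 says.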

Unknown Method

Unknown method is not the same as bad params. The server could parse the JSON and dispatch to the RPC layer, but no allowed method matched the name.
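
The distinction shows up in the JSON-RPC error code. The sketch below classifies a response using the standard JSON-RPC 2.0 codes; it assumes the server reports unknown methods and bad params with those standard codes, and the function name and returned labels are illustrative.

```python
# Standard JSON-RPC 2.0 error codes.
PARSE_ERROR = -32700
INVALID_REQUEST = -32600
METHOD_NOT_FOUND = -32601
INVALID_PARAMS = -32602
INTERNAL_ERROR = -32603

_LABELS = {
    PARSE_ERROR: "parse error: the server could not parse the JSON",
    INVALID_REQUEST: "invalid request: valid JSON, wrong request shape",
    METHOD_NOT_FOUND: "unknown method: no method matched the name",
    INVALID_PARAMS: "bad params: the method exists, the params do not fit",
    INTERNAL_ERROR: "internal error: inspect the target log",
}


def classify_rpc_response(response: dict) -> str:
    """Map a decoded JSON-RPC response to a debugging bucket."""
    error = response.get("error")
    if error is None:
        return "ok"
    code = error.get("code")
    # Codes outside the standard set are application-defined; keep the
    # message so it can be searched for in the source tree.
    return _LABELS.get(code, "other error (%s): %s" % (code, error.get("message")))
```

Keep the raw message as well as the code, because the message is what you will search for in the source tree.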

Possible causes:

typo.
method not compiled in.
module not linked into this binary.
method hidden by current state filtering.
old client script against newer or older server.

Source anchors:

lib/rpc/rpc.c: jsonrpc_handler.
lib/rpc/rpc.c: rpc_rpc_get_methods.
include/spdk/rpc.h: SPDK_RPC_REGISTER.

Debug sequence:

1. call rpc_get_methods with current=false if available
2. search source for SPDK_RPC_REGISTER("method_name"
3. identify module and build feature
4. check binary and branch version
5. verify current framework state

If the registration is absent from the source tree, the client is wrong for this checkout. If the registration exists but the method does not appear in the rpc_get_methods output, suspect the build or the current framework state.
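
The decision above can be written down as a short function. This sketch assumes that rpc_get_methods with current=false lists every method registered in the running binary and that current=true lists only the methods callable in the current state. The function names and the returned strings are illustrative.

```python
import re


def registration_pattern(method: str) -> "re.Pattern":
    """Regex that matches SPDK_RPC_REGISTER("<method>" in a source file."""
    return re.compile(r'SPDK_RPC_REGISTER\(\s*"' + re.escape(method) + r'"')


def diagnose_missing_method(method, source_text, all_methods, current_methods):
    """Follow the chapter's decision path for an unknown-method error.

    source_text     -- concatenated source files searched for the registration
    all_methods     -- rpc_get_methods result with current=false
    current_methods -- rpc_get_methods result with current=true
    """
    if not registration_pattern(method).search(source_text):
        return "client is wrong for this checkout"
    if method not in all_methods:
        return "suspect build: module not compiled or not linked"
    if method not in current_methods:
        return "suspect state: method not allowed in current framework state"
    return "method is available: re-check the name the client sent"
```

The last branch catches the simplest cause on the list, a typo on the client side.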

Params Decode Failures

Params decode failures mean the method exists but the JSON shape is wrong. The most reliable source is the decoder table in the handler.

Source anchors:

module/bdev/malloc/bdev_malloc_rpc.c.
lib/bdev/bdev_rpc.c.
lib/nvmf/nvmf_rpc.c.
module/event/subsystems/nvmf/nvmf_rpc.c.

Debug sequence:

1. find handler from SPDK_RPC_REGISTER
2. find request struct
3. find decoder table
4. check required keys
5. check type of each key
6. check semantic validation after decoding

Misleading case:

The error text may say spdk_json_decode_object failed. That only proves the JSON-to-struct conversion failed, for example because a key is missing or has the wrong type. It says nothing about whether the object named in the params exists.
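
Steps 4 and 5 of the debug sequence can be rehearsed offline by copying the decoder table into a small checker. The sketch below is a Python stand-in, not the SPDK decoder: EXAMPLE_TABLE is a hypothetical table whose keys are illustrative, and the checker assumes strict decoding in which unknown keys are rejected. Always copy the real keys and types from the handler you are debugging.

```python
from typing import Any, Dict, List, NamedTuple


class Key(NamedTuple):
    """One row of a decoder table: key name, expected type, optional flag."""
    name: str
    py_type: type
    optional: bool = False


# Hypothetical table for illustration; read the real one from the handler.
EXAMPLE_TABLE = [
    Key("name", str),
    Key("num_blocks", int),
    Key("block_size", int),
    Key("uuid", str, optional=True),
]


def check_params(table: List[Key], params: Dict[str, Any]) -> List[str]:
    """Return a list of shape problems; an empty list means the shape fits."""
    problems = []
    for key in table:
        if key.name not in params:
            if not key.optional:
                problems.append("missing required key: " + key.name)
            continue
        value = params[key.name]
        # bool is a subclass of int in Python, so reject it for int keys.
        if isinstance(value, bool) and key.py_type is not bool:
            ok = False
        else:
            ok = isinstance(value, key.py_type)
        if not ok:
            problems.append("wrong type for %s: expected %s, got %s" % (
                key.name, key.py_type.__name__, type(value).__name__))
    known = {key.name for key in table}
    for name in params:
        if name not in known:
            problems.append("unknown key: " + name)
    return problems
```

An empty result means only that the shape fits. Semantic validation after decoding (step 6) can still reject the request.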

Config Replay Failures

Config replay failures are RPC failures during startup config loading. They can be caused by:

wrong method phase.
bad method order.
missing feature.
object dependency not yet created.
stale generated config from another SPDK version.
path or device name differences across machines.

Source anchors:

lib/init/json_config.c: spdk_subsystem_load_config.
lib/init/json_config.c: json_config_prepare_ctx.
lib/init/subsystem_rpc.c: rpc_framework_get_config.
include/spdk_internal/init.h: write_config_json.

Debug sequence:

1. replay from a clean process
2. identify the first failing method
3. run that method manually if possible
4. inspect method phase
5. inspect dependencies before that method
6. compare with framework_get_config output from a working process

Do not fix the third error first. The first failed method often causes many later failures.
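
Finding the first failing method is mechanical once the config is flattened into call order. The sketch below walks a config of the form {"subsystems": [{"subsystem": ..., "config": [{"method": ..., "params": ...}]}]} and stops at the first error. The send callable is a hypothetical stand-in for whatever client you use to issue one RPC and return the decoded response.

```python
from typing import Any, Callable, Dict, Iterator, Optional, Tuple


def iter_methods(config: Dict[str, Any]) -> Iterator[Tuple[str, str, dict]]:
    """Yield (subsystem, method, params) in the order the file lists them."""
    for subsystem in config.get("subsystems", []):
        # A subsystem may carry an empty or null config section.
        for entry in subsystem.get("config") or []:
            yield subsystem["subsystem"], entry["method"], entry.get("params", {})


def first_failure(
    config: Dict[str, Any],
    send: Callable[[str, dict], dict],
) -> Optional[Tuple[int, str, str, dict]]:
    """Replay methods one by one and return the first that reports an error.

    Returns (index, subsystem, method, error) or None if every call succeeds.
    """
    for index, (subsystem, method, params) in enumerate(iter_methods(config)):
        response = send(method, params)
        if "error" in response:
            return index, subsystem, method, response["error"]
    return None
```

Stopping at the first error is deliberate. Everything after it may fail only because the first object was never created.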

Object Creation Failures

Object creation failures include bdev, transport, subsystem, lvol, RAID, and vhost creation. They usually fail for one of these reasons:

name conflict.
missing base object.
invalid size or alignment.
unsupported feature.
resource exhaustion.
wrong state.
ownership conflict.

First probes:

bdev_get_bdevs
framework_get_config
rpc_get_methods
log_get_flags
thread_get_io_channels

Source anchors:

lib/bdev/bdev.c: spdk_bdev_register.
lib/bdev/bdev.c: spdk_bdev_open_ext.
module/bdev/raid/bdev_raid.c: raid_bdev_write_config_json.
lib/nvmf/nvmf.c: spdk_nvmf_tgt_write_config_json.
lib/vhost/vhost_rpc.c.

If the error mentions an existing name, query current state before retrying. Repeated create attempts can make the state less clear.
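
A query-before-retry guard can be as small as the sketch below. It takes the decoded result of bdev_get_bdevs and assumes each entry carries a name and an optional aliases list. The function names are illustrative.

```python
from typing import Any, Dict, List


def bdev_name_in_use(bdevs: List[Dict[str, Any]], name: str) -> bool:
    """Return True if any bdev already uses the name as its name or an alias."""
    for bdev in bdevs:
        if bdev.get("name") == name or name in bdev.get("aliases", []):
            return True
    return False


def should_retry_create(bdevs: List[Dict[str, Any]], name: str) -> bool:
    """Retry a create only when current state shows the name is still free."""
    return not bdev_name_in_use(bdevs, name)
```

If the name is already in use, inspect that object before doing anything else. It may be the partial result of the first attempt.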

No I/O Progress

"No I/O" means an application or client reports that I/O is stuck or not completing. First identify the highest layer where the request is known to exist.

Questions:

Did the client submit I/O?
Did SPDK receive it?
Did the bdev layer submit it?
Did the lower transport/device complete it?
Did completion return upward?

RPC probes:

bdev_get_iostat
thread_get_stats
thread_get_pollers
framework_get_reactors
nvmf_get_stats
bdev_nvme_get_transport_statistics

Source anchors:

lib/bdev/bdev.c: spdk_bdev_io_complete.
module/bdev/raid/bdev_raid.c: raid_bdev_submit_request.
module/bdev/nvme/bdev_nvme.c.
lib/nvmf/ctrlr.c: spdk_nvmf_request_exec call paths.
lib/nvmf/transport.c: nvmf_tgroup_poll.

Debug sequence:

1. sample bdev iostat twice
2. sample thread stats twice
3. inspect poller names
4. check transport stats
5. enable narrow traces for the layer where counters stop
6. search source for the relevant tracepoints or status branch

If top-layer counters increase but lower-layer counters do not, the blockage is between those layers.
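
That rule can be expressed as a comparison of two counter samples. In the sketch below, the layer names and the idea of one completed-I/O counter per layer are simplifying assumptions. In practice each number comes from a different RPC, such as bdev_get_iostat or nvmf_get_stats.

```python
from typing import Dict, List, Optional, Tuple


def locate_blockage(
    layers: List[str],
    before: Dict[str, int],
    after: Dict[str, int],
) -> Optional[Tuple[Optional[str], str]]:
    """Find the boundary where counters stop moving.

    layers is ordered from top (client) to bottom (device). Returns
    (last_moving_layer, first_stalled_layer), or None if every layer moved.
    last_moving_layer is None when even the top layer shows no progress.
    """
    last_moving = None
    for layer in layers:
        if after[layer] - before[layer] > 0:
            last_moving = layer
        else:
            return last_moving, layer
    return None
```

For example, if the client and NVMe-oF counters advance between samples but the bdev counter does not, the function returns ("nvmf", "bdev"), which is where narrow tracing should go next.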

Latency Or Low Throughput

Performance symptoms need deltas and baselines. One slow run without a baseline is hard to interpret.

Check:

CPU core mask.
NUMA placement.
interrupt versus polling mode.
queue depth.
block size.
bdev stack depth.
transport retransmits or disconnects.
hugepage memory pressure.
iobuf pressure.
busy versus idle reactor time.

Source anchors:

lib/event/app_rpc.c: rpc_framework_get_reactors.
lib/event/app_rpc.c: rpc_thread_get_stats.
module/scheduler/dynamic/scheduler_dynamic.c.
module/event/subsystems/iobuf/iobuf_rpc.c.
lib/bdev/bdev_rpc.c: rpc_bdev_get_iostat.
doc/performance_reports.md.
doc/system_configuration.md.

Debug sequence:

1. record workload parameters
2. capture two stats samples during steady workload
3. compare busy_tsc and idle_tsc deltas
4. compare bdev I/O deltas with client I/O deltas
5. check queue depth and poller activity
6. change one variable at a time

Avoid mixing performance tuning with correctness debugging. First prove I/O is correct and completing. Then tune.
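
Step 3 of the debug sequence comes down to a small calculation on two samples. The sketch below assumes each sample is a dict with cumulative busy and idle tick counters. The key names busy and idle are assumptions, so use whichever fields your SPDK version reports.

```python
from typing import Dict, Optional


def busy_fraction(first: Dict[str, int], second: Dict[str, int]) -> Optional[float]:
    """Fraction of ticks spent busy between two cumulative samples.

    Returns None when no ticks elapsed, so that a bad sampling window is
    not mistaken for a fully idle thread.
    """
    busy = second["busy"] - first["busy"]
    idle = second["idle"] - first["idle"]
    total = busy + idle
    if total <= 0:
        return None
    return busy / total
```

With samples {"busy": 100, "idle": 900} and {"busy": 400, "idle": 1600}, the deltas are 300 busy and 700 idle ticks, so the thread was busy for 0.3 of the window. The cumulative totals alone would not show that.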

Memory And Buffer Exhaustion

SPDK uses hugepage-backed memory and fixed-size pools in many places. Memory failures often appear as -ENOMEM, NOMEM, or messages about buffers.

Common sources:

hugepage reservation too small.
DPDK memory initialization failed.
iobuf pool exhausted.
bdev I/O pool exhausted.
transport shared buffer pool exhausted.
NVMe request pool exhausted.
external memory mapping failed.

Source anchors:

doc/memory.md.
doc/system_configuration.md.
scripts/setup.sh.
lib/env_dpdk/init.c.
lib/env_dpdk/memory.c.
lib/thread/iobuf.c.
module/event/subsystems/iobuf/iobuf_rpc.c: rpc_iobuf_get_stats.
lib/nvmf/transport.c: spdk_nvmf_request_get_buffers.

Debug sequence:

1. inspect hugepages before starting
2. inspect app mem-size and huge-dir options
3. query iobuf_get_stats if app is alive
4. check bdev_get_iostat and queue depths
5. reduce queue depth to test resource pressure
6. inspect logs for first allocation failure

Do not assume every ENOMEM is host RAM exhaustion. It may be a specific SPDK pool.
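
Step 1 of the debug sequence can be done by reading /proc/meminfo on Linux. The sketch below parses the HugePages_Total, HugePages_Free, and Hugepagesize fields from text you pass in. It covers only the default hugepage size that /proc/meminfo reports, and the function names are illustrative.

```python
from typing import Dict

_WANTED = ("HugePages_Total", "HugePages_Free", "Hugepagesize")


def parse_hugepages(meminfo_text: str) -> Dict[str, int]:
    """Extract hugepage counters from the text of /proc/meminfo.

    Hugepagesize is reported in kB; the page counts are plain numbers.
    """
    info = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() in _WANTED:
            info[name.strip()] = int(rest.split()[0])
    return info


def free_hugepage_bytes(info: Dict[str, int]) -> int:
    """Free hugepage memory in bytes for the default hugepage size."""
    return info["HugePages_Free"] * info["Hugepagesize"] * 1024
```

Typical use is parse_hugepages(open("/proc/meminfo").read()), captured before the app starts so the number is not distorted by the app's own reservation.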

NVMe-oF Connect Failures

NVMe-oF connect failures may originate in the network, the transport, access control, namespace configuration, or controller state.

Check:

transport exists.
listener address exists.
subsystem NQN matches.
host NQN is allowed.
namespace exists and is visible.
TLS or DH-HMAC-CHAP options match if enabled.
queue depth and keep-alive timeout (KATO) options are valid.

Source anchors:

lib/nvmf/nvmf_rpc.c.
module/event/subsystems/nvmf/nvmf_rpc.c.
lib/nvmf/transport.c: spdk_nvmf_transport_create.
lib/nvmf/ctrlr.c: spdk_nvmf_ctrlr_connect.
lib/nvmf/ctrlr.c error branches for invalid connect parameters.
doc/nvmf.md.

Debug sequence:

1. query nvmf_get_transports
2. query nvmf_get_subsystems
3. query nvmf_subsystem_get_listeners
4. check host allow list
5. inspect target logs for connect rejection
6. inspect host-side nvme-cli error separately

A TCP connection accepted by the target does not prove that the NVMe-oF connect succeeded. The protocol can still reject the host after the socket connection is established.
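
Several items on the checklist can be verified from one nvmf_get_subsystems result before looking at the network at all. The sketch below assumes each subsystem entry has the fields nqn, allow_any_host, hosts, listen_addresses, and namespaces. Treat those field names as assumptions and compare them with the output of your own target.

```python
from typing import Any, Dict, List


def connect_problems(
    subsystems: List[Dict[str, Any]],
    subnqn: str,
    hostnqn: str,
    traddr: str,
    trsvcid: str,
) -> List[str]:
    """List target-side reasons a host connect could be rejected."""
    match = next((s for s in subsystems if s.get("nqn") == subnqn), None)
    if match is None:
        return ["subsystem NQN not found: " + subnqn]

    problems = []
    listeners = match.get("listen_addresses", [])
    if not any(
        l.get("traddr") == traddr and str(l.get("trsvcid")) == str(trsvcid)
        for l in listeners
    ):
        problems.append("no listener on %s:%s" % (traddr, trsvcid))

    allowed = match.get("allow_any_host", False) or any(
        h.get("nqn") == hostnqn for h in match.get("hosts", [])
    )
    if not allowed:
        problems.append("host NQN not allowed: " + hostnqn)

    if not match.get("namespaces"):
        problems.append("subsystem has no namespaces")
    return problems
```

An empty list does not prove the connect will succeed. It only removes the configuration causes, so the target log and the host-side error are the next evidence to read.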

Crashes And Asserts

Crashes need preservation. Do not immediately restart in a way that destroys core files or logs.

First steps:

1. save stdout and stderr
2. save exact binary
3. save core file
4. collect backtrace
5. capture config and reproduction
6. check whether build has debug symbols

Source anchors:

scripts/core-collector.sh.
doc/gdb_macros.md.
scripts/gdb_macros.py.
include/spdk/assert.h.
include/spdk_internal/assert.h.

Useful GDB helpers from official docs:

spdk_print_bdevs
spdk_find_bdev
spdk_print_threads
spdk_print_nvmf_subsystems

An assert means an invariant was violated. The assert line tells you the invariant. The cause may be much earlier. Build a timeline from logs, RPCs, and recent state changes.
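
Building the timeline is mostly a merge by timestamp. The sketch below assumes you have already parsed each evidence source into events with a numeric timestamp. The Event type, its level values, and the function names are illustrative.

```python
from typing import Iterable, List, NamedTuple, Optional


class Event(NamedTuple):
    """One timeline entry from a log, an RPC history, or a state change."""
    timestamp: float
    source: str
    level: str
    text: str


def build_timeline(*streams: Iterable[Event]) -> List[Event]:
    """Merge any number of event streams into one list ordered by time."""
    return sorted(
        (event for stream in streams for event in stream),
        key=lambda event: event.timestamp,
    )


def first_error(timeline: List[Event]) -> Optional[Event]:
    """Return the earliest event marked as an error, or None."""
    return next((event for event in timeline if event.level == "error"), None)


def events_before(timeline: List[Event], event: Event) -> List[Event]:
    """Everything that happened strictly before the given event."""
    return [e for e in timeline if e.timestamp < event.timestamp]
```

The events returned by events_before for the first error are the candidates for the real cause. The assert itself is usually the last entry, not the first.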

Shutdown Hangs

Shutdown failures often involve leaked references, outstanding I/O, or callbacks that never complete.

Check:

open bdev descriptors.
outstanding bdev I/O.
pollers still registered.
threads that did not exit.
transports with remaining qpairs.
NBD or vhost users still connected.

Source anchors:

lib/event/app.c: spdk_app_start_shutdown.
lib/init/subsystem.c: spdk_subsystem_fini.
lib/bdev/bdev.c unregister and close paths.
lib/nvmf/nvmf.c disconnect paths.
lib/vhost/vhost.c and lib/vhost/vhost_rpc.c.

Debug sequence:

1. capture thread_get_pollers before shutdown
2. stop workload cleanly
3. remove external connections
4. issue shutdown
5. inspect final logs
6. attach debugger only if process remains stuck

Shutdown debugging is much easier if you first quiesce clients.

Edge Cases

The First Error Is Not The Loudest

Later errors may be noisier because cleanup fails after the first issue. Find the first error in time.

Retry Changes State

A second create call may fail with "already exists" after the first partially succeeded. Query state before retrying.

Client Error And Target Error Differ

The client can report timeout while the target reports access denied or invalid params. Collect both sides.

Debug Logging Changes Timing

Verbose logs can hide races or create new latency. Use narrow flags and traces where possible.

Counter Deltas Need A Workload Window

Counters sampled before and after different workloads are not comparable. Record when each sample was captured.

Missing Feature Looks Like Missing Method

If a method is absent, check configure options and compiled modules before editing JSON.

Misconceptions To Kill

"The last log line is the cause." It may only be the last cleanup message.
"ENOMEM means the machine ran out of RAM." It may be an SPDK pool.
"Connection accepted means protocol accepted." NVMe-oF can reject after socket connect.
"A replay failure means the config file is corrupt." It may be valid for a different build or state.
"Crashes are always data-path bugs." Control-plane lifetime errors can crash too.
"Stats prove causality." Stats suggest where to inspect next.
"Restarting is debugging." Restarting can destroy the evidence.
"All failures need GDB first." Most start with logs, RPC state, and source search.

Lab: Build A Failure Tree

Choose one error message from your logs. Search for the string in the source tree. Write the function name. Write the condition that triggers it. Write three possible upstream causes. Write one RPC or command that would confirm each cause.

Lab: No I/O Drill

Create a hypothetical workload that reports stuck writes. List the exact RPCs you would sample. For each RPC, write what result would move suspicion up or down the stack. Draw a path from client to bdev to device. Mark the first layer where counters stop.

Lab: Config Replay Reduction

Take a full generated config. Copy only the first subsystem section needed for a malloc bdev. Replay it. Add one section at a time until failure appears. Record the first method that fails. Find the corresponding handler and decoder.

Lab: Crash Preservation

Read scripts/core-collector.sh. Identify how it calls GDB. Read doc/gdb_macros.md. Write the GDB commands you would run to print SPDK threads and bdevs from a core. Explain why the exact binary matters.

Self-Check

What is the first job in debugging a new failure?
Why is an unknown method different from a params error?
What function replays JSON config?
Which RPC helps inspect bdev I/O counters?
Why should performance stats be sampled twice?
Name two SPDK-specific causes of ENOMEM.
What should you preserve before restarting after a crash?
Why can shutdown hang after clients disconnect?

References

doc/system_configuration.md for host setup.
doc/memory.md for SPDK memory behavior.
doc/applications.md for application options and coredump-related flags.
doc/gdb_macros.md for debugger helpers.
doc/tracing.md for trace collection.
doc/spdk_top.md for interactive stats.
scripts/setup.sh for hugepage and PCI setup.
scripts/core-collector.sh for coredump collection.
test/unit/unittest.sh for focused regression tests.