SPDK From First Principles

Chapter 27: Config Save, Replay, And --wait-for-rpc

How SPDK turns JSON-RPC state into reproducible configuration, why startup can pause for control-plane replay, and what goes wrong in diskengine restore loops.

Source: content/chapters/27-config-save-replay-wait-for-rpc.md

Reader Promise

By the end of this chapter you should understand why SPDK configuration is not just a static file. SPDK applications are built around subsystems, RPC methods, bdev examination, and asynchronous state changes. A saved config is a replay script: a list of control-plane operations that try to recreate runtime objects in the right order.

That distinction matters. A config replay can fail because a name already exists, because a base bdev has not appeared yet, because a subsystem is not initialized, because a method is runtime-only, because an operation is asynchronous, or because diskengine's database thinks a volume should exist while SPDK's current graph says it does not.

The Mental Model

desired state
  |
  | JSON-RPC methods
  v
SPDK runtime objects
  |
  | framework_get_config and subsystem writers
  v
saved JSON config
  |
  | replay at next startup
  v
runtime objects again, if dependencies exist

A saved config is not magic persistence for every bit of memory. It is a control-plane reconstruction recipe.
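
To make the recipe idea concrete, here is a minimal sketch that flattens a saved config into ordered steps. It assumes the usual layout of an SPDK JSON config: a top-level "subsystems" list whose entries each carry a "config" list of method/params pairs. The sample values and the helper name replay_steps are illustrative, not taken from a real deployment.

```python
import json

# Illustrative sample only; a real saved config holds many more entries.
SAVED_CONFIG = json.loads("""
{
  "subsystems": [
    {"subsystem": "bdev", "config": [
      {"method": "bdev_malloc_create",
       "params": {"name": "Malloc0", "num_blocks": 131072, "block_size": 512}}
    ]},
    {"subsystem": "nvmf", "config": [
      {"method": "nvmf_create_transport", "params": {"trtype": "TCP"}}
    ]}
  ]
}
""")


def replay_steps(config):
    """Yield (subsystem, method, params) in file order."""
    for entry in config.get("subsystems", []):
        # A subsystem with nothing to save may carry a null config.
        for op in entry.get("config") or []:
            yield entry["subsystem"], op["method"], op.get("params", {})


STEPS = list(replay_steps(SAVED_CONFIG))
```

Read this way, the file is visibly a list of calls to make, not a snapshot of memory: each step can succeed, fail, or partially apply on its own.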

Why --wait-for-rpc Exists

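SPDK applications can be started with --wait-for-rpc, which pauses initialization before the subsystems come up and waits for RPC commands. An orchestrator can connect, apply startup-time options, and then call framework_start_init to let startup continue; most object-creating RPCs only become available after that point. For a storage system, this control over startup is useful because the orchestrator may need to: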

  • Attach controllers.
  • Create bdevs.
  • Load lvstores.
  • Expose subsystems.
  • Set names and policies that are not known at compile time.
  • Reconcile desired state from an external database.

The danger is that "SPDK process is running" does not mean "storage graph is ready". With --wait-for-rpc, readiness becomes a two-step concept: process alive, then control-plane initialization complete.
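
A readiness probe can make the two steps explicit. The sketch below assumes the default RPC socket path /var/tmp/spdk.sock, and it assumes that framework_start_init is listed by rpc_get_methods with "current" set only while the application is still paused. The helper names make_unix_call and classify_readiness are hypothetical, and the client is deliberately minimal (one request per connection, no batching).

```python
import json
import socket


def make_unix_call(path="/var/tmp/spdk.sock", timeout_s=5.0):
    """Build a tiny JSON-RPC caller bound to one Unix socket path."""

    def call(method, params=None):
        request = {"jsonrpc": "2.0", "id": 1, "method": method}
        if params is not None:
            request["params"] = params
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout_s)
            sock.connect(path)  # raises OSError if the socket is absent
            sock.sendall(json.dumps(request).encode())
            buf = b""
            while True:
                chunk = sock.recv(4096)
                if not chunk:
                    raise ConnectionError("closed before a full response")
                buf += chunk
                try:
                    response = json.loads(buf)
                    break
                except ValueError:
                    continue  # response is still incomplete, keep reading
        if "error" in response:
            raise RuntimeError(response["error"])
        return response["result"]

    return call


def classify_readiness(call):
    """Return 'unreachable', 'waiting_for_rpc', or 'ready'."""
    try:
        current = call("rpc_get_methods", {"current": True})
    except OSError:
        return "unreachable"  # no process, or RPC server not listening yet
    # Assumption: this method is callable only before init completes.
    if "framework_start_init" in current:
        return "waiting_for_rpc"
    return "ready"
```

Note that "ready" here only means subsystem initialization has finished. Whether the storage graph is complete is a separate question that the orchestrator still has to answer by inspecting bdevs and exports.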

Config Save Is A Set Of Writers

SPDK subsystems and modules can contribute config output. The saved config usually describes objects through RPC-equivalent operations:

  • Construct this bdev.
  • Attach this NVMe controller.
  • Create this transport.
  • Create this NVMe-oF subsystem.
  • Add this namespace.
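  • Create this lvol store (an existing lvstore is normally rediscovered from its on-disk metadata during examine rather than created again).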

Source anchors:

  • lib/init/subsystem.c and lib/init/json_config.c (under lib/event/ in older releases): subsystem initialization, config hooks, and JSON config loading.
  • lib/rpc/rpc.c: JSON-RPC server machinery.
  • module/bdev/nvme/bdev_nvme_rpc.c: NVMe bdev RPCs.
  • module/bdev/lvol/vbdev_lvol_rpc.c: lvol RPCs.
  • lib/bdev/bdev.c: bdev registration and lookup semantics.
  • scripts/rpc.py: the operator-facing Python wrapper for RPC calls.
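
The collection step can be sketched as a loop over subsystems. This assumes framework_get_subsystems returns entries with a "subsystem" field and that framework_get_config takes the subsystem name as "name"; call(method, params) is a hypothetical JSON-RPC helper, and collect_config is an illustrative name, not the function in scripts/rpc.py.

```python
def collect_config(call):
    """Assemble one config document from per-subsystem writer output."""
    subsystems = []
    for item in call("framework_get_subsystems", None):
        name = item["subsystem"]
        # Each subsystem reports its own list of method/params entries;
        # a subsystem with nothing to save may report null.
        config = call("framework_get_config", {"name": name})
        subsystems.append({"subsystem": name, "config": config})
    return {"subsystems": subsystems}
```

The point of the sketch is the shape: no single component knows the whole config. The document is the concatenation of what each writer chose to emit, so anything a writer does not emit will not be replayed.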

Replay Is Order-Sensitive

The replay order matters because objects depend on other objects:

attach physical controller
  -> physical namespace bdev appears
    -> create/import lvstore
      -> lvol bdevs appear
        -> export lvol over NVMe-oF/vhost/vfio-user

If you try to export an lvol before the lvol exists, replay fails. If you try to import an lvstore before its base bdev is examined, replay may need to wait or fail depending on the operation. If you create an object with a duplicate name, replay may fail even though the desired end state already exists.
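
The dependency chain above can be expressed as a graph and ordered mechanically. This is a minimal sketch using the standard library's graphlib (Python 3.9 or later); the step names are illustrative labels for the stages in the diagram, not RPC method names, and run_step is a hypothetical callback that returns True on success.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must complete before it.
DEPENDS_ON = {
    "attach_controller": set(),
    "namespace_bdev_appears": {"attach_controller"},
    "load_lvstore": {"namespace_bdev_appears"},
    "lvol_bdevs_appear": {"load_lvstore"},
    "export_lvol": {"lvol_bdevs_appear"},
}


def replay_order(depends_on):
    """Return the steps with every dependency before its dependents."""
    return list(TopologicalSorter(depends_on).static_order())


def replay(order, run_step):
    """Run steps in order; stop at the first failure.

    Returns (completed_steps, failed_step_or_None) so the caller can see
    exactly how far a partial replay got.
    """
    done = []
    for step in order:
        if not run_step(step):
            return done, step
        done.append(step)
    return done, None
```

Stopping at the first failure is a design choice: continuing past a failed dependency only produces more errors and hides which step was the real cause.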

diskengine Restore Loops

diskengine adds another desired-state layer. Its database can say "volume X should exist and be exported", while SPDK says:

  • Base NVMe controller is missing.
  • lvstore import has not completed.
  • lvol exists but is degraded.
  • bdev is present but export subsystem is missing.
  • export exists but listener is not reachable.
  • previous replay partially succeeded.

The control plane should be idempotent where possible. That means repeated restore attempts should converge rather than create duplicate objects or oscillate between create/delete states.
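
One common way to get convergence is to look before creating. The sketch below is an illustration for a Malloc bdev only. It assumes bdev_get_bdevs fails with a negative ENODEV code when the named bdev is missing and that returned entries carry num_blocks and block_size; RpcError and call(method, params) are hypothetical stand-ins for whatever client the orchestrator uses.

```python
import errno


class RpcError(Exception):
    """Hypothetical error type carrying the JSON-RPC error code."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


def find_bdev(call, name):
    """Return the bdev description, or None when it does not exist."""
    try:
        found = call("bdev_get_bdevs", {"name": name})
    except RpcError as err:
        if err.code == -errno.ENODEV:  # assumption: missing bdev => -ENODEV
            return None
        raise  # anything else is a real failure, not "missing"
    return found[0] if found else None


def ensure_malloc_bdev(call, name, num_blocks, block_size):
    """Converge on one Malloc bdev; safe to call repeatedly."""
    existing = find_bdev(call, name)
    if existing is None:
        call("bdev_malloc_create",
             {"name": name, "num_blocks": num_blocks, "block_size": block_size})
        return "created"
    same = (existing.get("num_blocks") == num_blocks
            and existing.get("block_size") == block_size)
    # On mismatch, report instead of deleting and recreating.
    return "already_present" if same else "conflict"
```

Calling ensure_malloc_bdev twice with the same arguments creates the bdev once and then reports "already_present". The "conflict" result matters as much as the other two: an object with the right name and the wrong shape is exactly the case where a blind delete/create loop starts to oscillate.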

Common Replay Failure Modes

  • Duplicate names: replay tries to create Malloc0, Nvme0n1, or an lvol that already exists.
  • Missing base bdev: virtual bdev creation depends on a base device that has not appeared.
  • Late examine: bdev examine discovers metadata asynchronously, so a dependent operation may run too early.
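  • Runtime-only method during startup: most object-creating RPCs are registered for the runtime state only, so they are rejected until subsystem init has completed.
  • Startup-only method during runtime: some option-setting RPCs are accepted only before initialization, because changing them after the application has started serving IO is not safe.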
  • Ignored partial failure: a script logs an error but continues, leaving a half-built graph.
  • Non-idempotent delete/create: delete may be async or blocked by open descriptors, so immediate recreate can race.
  • External system disagreement: diskengine DB, SPDK graph, and guest-visible exports disagree.
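
Several of these modes can be told apart from the error code alone, which is why a replay script should keep the code instead of just logging a message. The mapping below is a sketch under stated assumptions, not a documented contract: it assumes SPDK handlers commonly return a negative errno as the JSON-RPC error code and that -1 signals a call made in the wrong RPC state. call_raw(method, params) is a hypothetical helper that returns the raw response dict.

```python
import errno

INVALID_STATE = -1          # assumption; note that -1 also equals -EPERM
METHOD_NOT_FOUND = -32601   # standard JSON-RPC 2.0 code


def classify_rpc_error(code):
    """Map an error code to a coarse replay failure category."""
    if code == -errno.EEXIST:
        return "already_exists"      # duplicate name: verify, do not recreate
    if code == -errno.ENODEV:
        return "dependency_missing"  # base bdev absent or not examined yet
    if code == INVALID_STATE:
        return "wrong_rpc_state"     # startup vs runtime mismatch
    if code == METHOD_NOT_FOUND:
        return "unknown_method"
    return "unclassified"


def replay_strict(steps, call_raw):
    """Apply (method, params) steps; stop and report at the first error."""
    applied = []
    for method, params in steps:
        response = call_raw(method, params)
        if "error" in response:
            return {"applied": applied, "failed": method,
                    "kind": classify_rpc_error(response["error"]["code"]),
                    "error": response["error"]}  # keep the exact error
        applied.append(method)
    return {"applied": applied, "failed": None, "kind": None, "error": None}
```

The returned "applied" list is the record of a partial replay. Treat the category as a hint for what to check next, not as proof of the cause.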

How To Read Replay Code

When reading a config or restore path, ask:

  1. What object is the source of truth?
  2. What RPC creates or mutates it?
  3. What dependencies must already exist?
  4. Is the operation synchronous or callback-driven?
  5. What happens if the object already exists?
  6. What happens if the object is missing but should eventually appear?
  7. What exact error gets returned to the orchestrator?

This is the same async reasoning pattern used everywhere else in SPDK.

Source Reading Exercise

Trace one operation: NVMe bdev attach.

  1. Find the RPC entry point in module/bdev/nvme/bdev_nvme_rpc.c.
  2. Follow the call into the attach path in module/bdev/nvme/bdev_nvme.c.
  3. Identify where controller discovery becomes namespace bdev registration.
  4. Find what would be saved by config output.
  5. Write down what a replay script must assume before it can use the new bdev.
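
For step 5, one assumption a replay script should not make is that the namespace bdev is usable the moment the attach call is issued. A bounded wait is a simple way to encode that. The sketch assumes bdev_get_bdevs without parameters lists all bdevs with "name" and "aliases" fields; call(method, params) is a hypothetical JSON-RPC helper, and the clock and sleep parameters exist only so the loop can be exercised without real delays.

```python
import time


def wait_for_bdev(call, name, timeout_s=10.0, interval_s=0.2,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll until a bdev with this name or alias is registered.

    Returns True when it appears, False when the deadline passes.
    """
    deadline = clock() + timeout_s
    while True:
        for bdev in call("bdev_get_bdevs", None):
            if name == bdev["name"] or name in bdev.get("aliases", []):
                return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A False result is information, not just a failure: it tells the orchestrator the dependency never appeared within the budget, which is a different problem from a create call that was rejected. SPDK also provides a bdev_wait_for_examine RPC that may be a better fit when the thing being waited on is examine itself.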

Operational Exercise

Take a hypothetical failed restore:

diskengine wants volume vol-a
SPDK has lvstore lvs0
SPDK does not have lvol vol-a
NVMe-oF subsystem nqn.excloud:vol-a exists with no namespace
guest attach is retrying

Classify the failure:

  • Is it bdev graph state?
  • lvol metadata state?
  • export state?
  • diskengine desired-state mismatch?
  • replay ordering?

Then write the safest next check. Do not start by deleting things. Start by observing names, open descriptors, and whether async operations are still in flight.
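
A read-only classifier for this scenario might look like the sketch below. It takes lists that have already been fetched and mutates nothing. The field names are assumptions about the shape of bdev_lvol_get_lvstores, bdev_get_bdevs, and nvmf_get_subsystems output, the "lvstore/lvol" alias form is assumed, and classify_restore and its finding labels are illustrative.

```python
def classify_restore(volume, lvstore, nqn, lvstores, bdevs, subsystems):
    """Return findings ordered from the bottom of the graph upward."""
    findings = []
    alias = f"{lvstore}/{volume}"  # assumed lvol alias form

    if not any(store.get("name") == lvstore for store in lvstores):
        findings.append("lvstore_missing")

    lvol_present = any(
        alias in bdev.get("aliases", []) or bdev.get("name") == alias
        for bdev in bdevs
    )
    if not lvol_present:
        findings.append("lvol_missing")

    subsystem = next((s for s in subsystems if s.get("nqn") == nqn), None)
    if subsystem is None:
        findings.append("export_missing")
    elif not subsystem.get("namespaces"):
        findings.append("export_without_namespace")
    elif not subsystem.get("listen_addresses"):
        findings.append("export_without_listener")
    return findings


# The hypothetical failed restore from the exercise.
FINDINGS = classify_restore(
    "vol-a", "lvs0", "nqn.excloud:vol-a",
    lvstores=[{"name": "lvs0"}],
    bdevs=[{"name": "Nvme0n1", "aliases": []}],
    subsystems=[{"nqn": "nqn.excloud:vol-a", "namespaces": [],
                 "listen_addresses": []}],
)
```

For the scenario above the findings are a missing lvol and an export with no namespace. Together they point at a replay that created the subsystem but never got a bdev to attach, so the next check is why the lvol is absent, not whether the subsystem should be deleted.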

Misconceptions To Kill

  • "Config is just a file." It is a replay of operations.
  • "If replay failed, nothing changed." Many failures are partial.
  • "If the SPDK process is up, storage is ready." Startup may be paused or still examining bdevs.
  • "Retry always helps." Retrying non-idempotent operations can create duplicates or noisy error loops.
  • "The DB is truth." The DB is desired state; SPDK runtime and device reality still have to converge.

References

  • SPDK JSON-RPC guide: https://spdk.io/doc/jsonrpc.html
  • SPDK applications overview: https://spdk.io/doc/app_overview.html
  • SPDK block device guide: https://spdk.io/doc/bdev.html
  • SPDK logical volumes: https://spdk.io/doc/logical_volumes.html

Self-Check

  • Why is config replay order-sensitive?
  • What does --wait-for-rpc change about readiness?
  • Why can a replay failure be partial?
  • What makes a restore operation idempotent?
  • Why should diskengine reconcile instead of blindly recreate every missing object?