SPDK From First Principles

SPDK deep learning path

Chapter 30: diskengine Baremetal Mode

Source: drafts/transport-diskengine/30-diskengine-baremetal-mode.md

Chapter Goal

This chapter explains diskengine baremetal mode from first principles. By the end, the reader should understand:

  • what diskengine does on a compute host when it runs with -mode baremetal;
  • what changes when SPDK owns real local NVMe PCI controllers directly;
  • what the Linux kernel loses when a controller is moved from the kernel nvme driver to VFIO or UIO;
  • why diskengine's Go code talks to SPDK through JSON-RPC instead of touching NVMe registers directly;
  • how attach, RAID, vhost, teardown, reset, and recovery paths fail in production.

There is an important naming trap. In the local diskengine source, "baremetal mode" is the compute-side reconciler for VM volumes. It currently attaches remote storage-node exports using NVMe-oF RDMA, builds a local SPDK graph, and exposes that graph to QEMU through vhost-blk. In SPDK documentation, "bare metal" often also means SPDK owns local PCIe NVMe controllers directly. The same SPDK RPC, bdev_nvme_attach_controller, covers both cases:

  • remote NVMe-oF: trtype=RDMA, traddr=<target IP>, trsvcid=<port>, subnqn=<target NQN>;
  • local PCIe: trtype=PCIe, traddr=<PCI BDF>, for example 0000:82:00.0.

The current diskengine baremetal loop uses the first form. This chapter also explains the second form because it is the key operational difference when SPDK owns physical drives on a host.

Beginner Mental Model

Think of diskengine baremetal mode as a reconciler beside SPDK:

database desired state
        |
        v
diskengine Go loops
        |
        | JSON-RPC over Unix socket
        v
SPDK C runtime
        |
        +-- bdev_nvme controller and namespace bdevs
        +-- RAID bdevs
        +-- QoS limits
        +-- vhost-blk controllers
        v
QEMU sees a virtio/vhost disk

The Go process does not allocate DMA buffers, map PCI BARs, create NVMe submission queues, or poll completion queues. It decides what should exist. SPDK's C code owns the hardware-facing objects and the user-space storage stack.

For the current remote diskengine path, the stack looks like this:

storage node lvol export
        |
        | NVMe-oF RDMA
        v
baremetal bdev_nvme bdevs, for example Nvme_xxxn1
        |
        v
raid_<volume_id>
        |
        v
vhost<volume_vm_mapping_id>
        |
        v
QEMU and guest OS

For local PCI ownership, the bottom of the stack changes:

physical NVMe controller at PCI BDF 0000:82:00.0
        |
        | bound to vfio-pci or uio, not Linux nvme
        v
SPDK bdev_nvme controller, for example Nvme0
        |
        v
Nvme0n1 namespace bdev

The rest of the bdev graph may still use RAID, lvol, QoS, vhost, or NVMe-oF target modules. The major difference is ownership of the physical controller.

What The Host Kernel Loses

When SPDK owns a local NVMe PCI controller, Linux no longer owns that controller as a block device. SPDK's userspace documentation states the practical consequence directly: after unbinding an NVMe device from the kernel, paths like /dev/nvme0n1 disappear and the kernel block stack is no longer involved.

That means:

  • mounted filesystems on that controller must be unmounted first;
  • ordinary tools that depend on /dev/nvme* stop seeing the drive;
  • udev rules for the kernel block device no longer apply;
  • kernel md/dm-multipath/LVM do not manage that device;
  • kernel block-layer accounting no longer describes that device's I/O path;
  • crashes in the SPDK process can make the device unavailable until SPDK is restarted or the driver is rebound;
  • rollback requires detaching from SPDK and rebinding the PCI function to the kernel nvme driver.

SPDK gets low-latency polling and direct queue control in exchange. It maps device BARs and DMA memory through VFIO/UIO and implements the storage path in user space.

The SPDK userspace doc describes the ownership transfer:

doc/userspace.md:18

In order for SPDK to take control of a device, it must first instruct the
operating system to relinquish control. This is often referred to as unbinding
the kernel driver from the device and on Linux is done by
[writing to a file in sysfs](https://lwn.net/Articles/143397/).
SPDK then rebinds the driver to one of two special device drivers that come
bundled with Linux -
[uio](https://www.kernel.org/doc/html/latest/driver-api/uio-howto.html) or
[vfio](https://www.kernel.org/doc/Documentation/vfio.txt).

The important part for a beginner is not the shell mechanics. The important part is ownership. Only one driver owns a PCI function at a time. If Linux nvme owns it, SPDK cannot use VFIO to map it. If SPDK owns it through VFIO/UIO, Linux does not expose it as a normal block disk.
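
The ownership question can be answered from sysfs before any attach is attempted. The following is a minimal sketch, not diskengine code: classifyDriver, boundDriver, and the owner type are hypothetical names, and the set of userspace driver names (vfio-pci, uio_pci_generic) is an assumption about a typical host. The sysfs root is a parameter so the function can be pointed at a test directory.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// owner says which side of the kernel/userspace split holds a PCI function.
type owner string

const (
	ownerKernel    owner = "kernel-nvme"
	ownerUserspace owner = "userspace"
	ownerOther     owner = "other"
)

// classifyDriver maps a sysfs driver name to an owner class.
// The userspace driver names listed here are an assumption, not an
// exhaustive list of what a given host may use.
func classifyDriver(driver string) owner {
	switch driver {
	case "nvme":
		return ownerKernel
	case "vfio-pci", "uio_pci_generic":
		return ownerUserspace
	default:
		return ownerOther
	}
}

// boundDriver resolves <sysfsRoot>/bus/pci/devices/<bdf>/driver, a symlink
// whose last path element is the name of the bound driver. The link is
// absent when no driver is bound, which surfaces here as an error.
func boundDriver(sysfsRoot, bdf string) (string, error) {
	link := filepath.Join(sysfsRoot, "bus", "pci", "devices", bdf, "driver")
	target, err := os.Readlink(link)
	if err != nil {
		return "", fmt.Errorf("read driver link for %s: %w", bdf, err)
	}
	return filepath.Base(target), nil
}
```

On a real host the call would be boundDriver("/sys", "0000:82:00.0"), followed by classifyDriver on the result. A result of ownerKernel means SPDK cannot claim the function yet.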

Entry Point And Loops

diskengine baremetal mode starts in internal/baremetal/baremetal.go. The start function checks that the SPDK RPC socket exists, creates a cancellable context, runs a recovery pass, then starts independent reconciliation loops.

/home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/baremetal.go:19

func Start() {
	logger.Info.Println("Starting baremetal provisioner service")
	if err := helpers.EnsureSockExists(config.Value.SPDK_RPC_SOCK); err != nil {
		logger.Error.Fatalf("SPDK RPC sock check failed: %v", err)
	}
	ctx := context.Background()
	cancelCtx, cancel = context.WithCancel(ctx)

	runSPDKRecovery()

	// Start NVMe attach loop (decoupled from RAID ensure)
	wg.Add(1)
	go func() {
		defer wg.Done()
		startNvmeAttachLoop(ctx)
	}()

Line by line:

  • EnsureSockExists makes SPDK a hard dependency. diskengine is not the storage engine itself; it orchestrates a running SPDK process.
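  • cancelCtx and wg are there to make the loops stoppable as a group. Note that the excerpt passes ctx, not cancelCtx, to startNvmeAttachLoop, so cancel() reaches that loop only if the loop consults the package-level cancelCtx itself; confirm this when reading the source.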
  • runSPDKRecovery handles diskengine markers left after an SPDK restart.
  • each loop runs in a goroutine because attach, RAID creation, initialization, vhost exposure, detach, resize, health, and stats can each be blocked by different dependencies.

The loops are intentionally simple:

  • startNvmeAttachLoop: make required NVMe connections exist.
  • startRaidEnsureLoop: make raid_<volume_id> exist and heal stuck/degraded RAID state.
  • startInitialiseLoop: copy snapshot contents when needed.
  • startAttachLoop: expose initialized RAID bdevs to QEMU with vhost-blk.
  • startVhostDetachLoop: remove VM-facing controllers.
  • startRaidDetachLoop: remove RAID and unused NVMe controllers after vhost is gone.
  • startResizeLoop: preconnect pending namespaces and change RAID membership.
  • startHealthLoop and iostatscraper.StartBaremetal: observe RAID and I/O state.
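
The pattern behind these loops, one goroutine per concern, a ticker, and a shared cancellation signal, can be sketched as follows. This is an illustration under assumptions, not the diskengine implementation: loopGroup, startLoops, and stop are hypothetical names, and the steps are plain func() values to keep the sketch small.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// loopGroup runs several reconcile loops and stops them together.
type loopGroup struct {
	cancel context.CancelFunc
	wg     sync.WaitGroup
}

// startLoops launches one goroutine per step. Each goroutine calls its step
// once per interval until the shared context is cancelled. A step that fails
// is expected to log and return; the next tick is the retry.
func startLoops(interval time.Duration, steps ...func()) *loopGroup {
	ctx, cancel := context.WithCancel(context.Background())
	g := &loopGroup{cancel: cancel}
	for _, step := range steps {
		g.wg.Add(1)
		go func(step func()) {
			defer g.wg.Done()
			ticker := time.NewTicker(interval)
			defer ticker.Stop()
			for {
				select {
				case <-ctx.Done():
					return
				case <-ticker.C:
					step()
				}
			}
		}(step)
	}
	return g
}

// stop cancels every loop and waits for all goroutines to exit.
func (g *loopGroup) stop() {
	g.cancel()
	g.wg.Wait()
}
```

Because each loop has its own goroutine and ticker, a slow attach step does not delay the vhost or health steps. The cost is that the steps must tolerate seeing each other's half-finished work.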

JSON-RPC Is The C/Go Boundary

The source boundary is very clean. Go builds a struct, serializes it as JSON-RPC, and waits for a JSON result. SPDK receives the request in C and mutates its internal object graph.
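
The request and response envelopes are ordinary JSON-RPC 2.0. The sketch below shows only the encoding and decoding halves and leaves out the Unix socket transport. The names rpcRequest, rpcResponse, encodeRequest, and decodeResult are hypothetical and are not the diskengine spdkclient types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rpcRequest is the JSON-RPC 2.0 request envelope.
type rpcRequest struct {
	JSONRPC string `json:"jsonrpc"`
	ID      int    `json:"id"`
	Method  string `json:"method"`
	Params  any    `json:"params,omitempty"`
}

// rpcError is the error object carried by a failed response.
type rpcError struct {
	Code    int    `json:"code"`
	Message string `json:"message"`
}

// rpcResponse keeps the result raw so the caller chooses the target type.
type rpcResponse struct {
	ID     int             `json:"id"`
	Result json.RawMessage `json:"result"`
	Error  *rpcError       `json:"error"`
}

// encodeRequest serializes one call. A nil params value is omitted.
func encodeRequest(id int, method string, params any) ([]byte, error) {
	return json.Marshal(rpcRequest{JSONRPC: "2.0", ID: id, Method: method, Params: params})
}

// decodeResult checks the id, surfaces an RPC error as a Go error, and
// otherwise unmarshals the result into out.
func decodeResult(raw []byte, wantID int, out any) error {
	var resp rpcResponse
	if err := json.Unmarshal(raw, &resp); err != nil {
		return fmt.Errorf("decode response: %w", err)
	}
	if resp.ID != wantID {
		return fmt.Errorf("response id %d does not match request id %d", resp.ID, wantID)
	}
	if resp.Error != nil {
		return fmt.Errorf("rpc error %d: %s", resp.Error.Code, resp.Error.Message)
	}
	return json.Unmarshal(resp.Result, out)
}
```

Keeping Result as json.RawMessage is a design choice: the envelope code stays generic, and each wrapper decides whether it expects a string array, an object, or a boolean.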

The wrapper for bdev_nvme_attach_controller is small:

/home/lolwierd/Projects/excloud/diskengine/diskengine/internal/spdkclient/wrappers.go:10

// BdevNvmeAttachController attaches a PCIe NVMe device and returns names of created bdevs.
func (c *Client) BdevNvmeAttachController(params BdevNvmeAttachControllerParams) ([]string, error) {
	resp, err := c.Call("bdev_nvme_attach_controller", params)
	if err != nil {
		return nil, fmt.Errorf("BdevNvmeAttachController call failed: %w", err)
	}
	// Response is JSON array of strings
	var names []string
	data, err := json.Marshal(resp.Result)

The comment says PCIe, but the wrapper is transport-neutral: it passes whatever Trtype appears in params. In current diskengine baremetal mode that value is rdma. For local PCI ownership it would be PCIe.

The current attach loop builds an RDMA request:

/home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/nvme_attach.go:187

multipath := "multipath"
reconnectDelaySec := 1
ctrlrLossTimeoutSec := 10
fastIoFailTimeoutSec := 0
name := controllerNameForNQN(conn.NQN)

params := spdkclient.BdevNvmeAttachControllerParams{
	Name:                 name,
	Subnqn:               &conn.NQN,
	Trtype:               "rdma",
	Traddr:               conn.RDMAIP.String(),
	Trsvcid:              &svc,
	Adrfam:               &adrfam,
	Multipath:            &multipath,
	ReconnectDelaySec:    &reconnectDelaySec,
	CtrlrLossTimeoutSec:  &ctrlrLossTimeoutSec,
	FastIoFailTimeoutSec: &fastIoFailTimeoutSec,
}

Line by line:

  • name := controllerNameForNQN(conn.NQN) makes controller names deterministic. That is how retries avoid creating random duplicate controller names.
  • Trtype: "rdma" chooses the NVMe-oF RDMA transport. This is the current diskengine compute-host path.
  • Traddr is the storage node's RDMA IP, not a PCI BDF.
  • Trsvcid is the target port.
  • Subnqn selects the exported NVMe-oF subsystem.
  • Multipath tells SPDK that multiple paths under one controller name may represent the same logical NVMe bdev.
  • the reconnect/loss timers define how long SPDK should keep trying during transient path failure.

For IPv4 RDMA, current diskengine also requires a local source address even though SPDK's RPC field is optional. pickHostAddr() reads RDMA_IPS and chooses a local IPv4 address. ensureHostAddrInterfaceUp() then rejects the attach if no address is selected, no matching interface exists, or the interface is down. A failed attach can therefore be caused by local RDMA_IPS configuration or interface state rather than by an SPDK fabric-connect failure.

The attach params type also exposes SPDK security fields Psk, DhchapKey, and DhchapCtrlrKey, matching the RPC fields psk, dhchap_key, and dhchap_ctrlr_key. The current attach loop does not set them. Unless another layer provisions those values, this RDMA initiator path is unauthenticated at the SPDK attach call and relies on network isolation plus the target-side access policy.

For a local physical controller, the analogous JSON-RPC request would be conceptually:

{
  "method": "bdev_nvme_attach_controller",
  "params": {
    "name": "Nvme0",
    "trtype": "PCIe",
    "traddr": "0000:82:00.0"
  }
}

No subnqn, no RDMA IP, no port. The identity is the PCI BDF.
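
The two request shapes can be produced from one Go struct if the fabric-only fields are optional. This is a simplified sketch: attachParams, pcieAttach, rdmaAttach, and encodeParams are hypothetical names, the struct uses plain strings with omitempty instead of the pointer fields seen in the diskengine excerpt, and the adrfam value "ipv4" is an assumed example.

```go
package main

import (
	"encoding/json"
	"errors"
)

// attachParams carries a subset of the bdev_nvme_attach_controller fields.
// The first three are required by the RPC; the rest are omitted when empty.
type attachParams struct {
	Name    string `json:"name"`
	Trtype  string `json:"trtype"`
	Traddr  string `json:"traddr"`
	Adrfam  string `json:"adrfam,omitempty"`
	Trsvcid string `json:"trsvcid,omitempty"`
	Subnqn  string `json:"subnqn,omitempty"`
}

// validate enforces only the three fields every transport needs.
func (p attachParams) validate() error {
	if p.Name == "" || p.Trtype == "" || p.Traddr == "" {
		return errors.New("name, trtype, and traddr are required")
	}
	return nil
}

// pcieAttach identifies a local controller by its PCI BDF.
func pcieAttach(name, bdf string) attachParams {
	return attachParams{Name: name, Trtype: "PCIe", Traddr: bdf}
}

// rdmaAttach identifies a remote subsystem by IP, port, and NQN.
func rdmaAttach(name, ip, port, nqn string) attachParams {
	return attachParams{Name: name, Trtype: "rdma", Traddr: ip, Adrfam: "ipv4", Trsvcid: port, Subnqn: nqn}
}

// encodeParams validates and serializes the params object.
func encodeParams(p attachParams) (string, error) {
	if err := p.validate(); err != nil {
		return "", err
	}
	b, err := json.Marshal(p)
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```

Calling encodeParams(pcieAttach("Nvme0", "0000:82:00.0")) yields a params object with only name, trtype, and traddr, which matches the PCIe request shown above.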

SPDK Parses The Transport ID

On the SPDK side, the RPC handler declares which JSON fields it accepts:

module/bdev/nvme/bdev_nvme_rpc.c:308

static const struct spdk_json_object_decoder rpc_bdev_nvme_attach_controller_decoders[] = {
	{"name", offsetof(struct rpc_bdev_nvme_attach_controller, name), spdk_json_decode_string},
	{"trtype", offsetof(struct rpc_bdev_nvme_attach_controller, trtype), spdk_json_decode_string},
	{"traddr", offsetof(struct rpc_bdev_nvme_attach_controller, traddr), spdk_json_decode_string},

	{"adrfam", offsetof(struct rpc_bdev_nvme_attach_controller, adrfam), spdk_json_decode_string, true},
	{"trsvcid", offsetof(struct rpc_bdev_nvme_attach_controller, trsvcid), spdk_json_decode_string, true},
	{"priority", offsetof(struct rpc_bdev_nvme_attach_controller, priority), spdk_json_decode_string, true},
	{"subnqn", offsetof(struct rpc_bdev_nvme_attach_controller, subnqn), spdk_json_decode_string, true},

Only name, trtype, and traddr are required here. Fields like adrfam, trsvcid, and subnqn are optional because PCIe does not use them while RDMA/TCP normally do.

Later the same handler parses trtype, copies traddr, and calls the bdev NVMe create API:

module/bdev/nvme/bdev_nvme_rpc.c:433

/* Parse trstring */
rc = spdk_nvme_transport_id_populate_trstring(&trid, ctx->req.trtype);
if (rc < 0) {
	SPDK_ERRLOG("Failed to parse trtype: %s\n", ctx->req.trtype);
	spdk_jsonrpc_send_error_response_fmt(request, -EINVAL, "Failed to parse trtype: %s",
					     ctx->req.trtype);
	goto cleanup;
}

/* Parse trtype */
rc = spdk_nvme_transport_id_parse_trtype(&trid.trtype, ctx->req.trtype);
assert(rc == 0);

and:

module/bdev/nvme/bdev_nvme_rpc.c:605

ctx->request = request;
/* Should already be zero due to the calloc(), but set explicitly for clarity. */
ctx->req.bdev_opts.from_discovery_service = false;
ctx->req.bdev_opts.psk = ctx->req.psk;
ctx->req.bdev_opts.dhchap_key = ctx->req.dhchap_key;
ctx->req.bdev_opts.dhchap_ctrlr_key = ctx->req.dhchap_ctrlr_key;
rc = spdk_bdev_nvme_create(&trid, ctx->req.name, ctx->names, ctx->req.max_bdevs,
			   rpc_bdev_nvme_attach_controller_done, ctx, &ctx->req.drv_opts,
			   &ctx->req.bdev_opts);

This is the handoff from JSON-RPC into SPDK's internal NVMe bdev module. If the call succeeds, SPDK eventually replies with an array of bdev names such as Nvme0n1.

The transport ID definition explains why local PCIe is different from fabrics:

include/spdk/nvme.h:435

enum spdk_nvme_transport_type {
	/**
	 * PCIe Transport (locally attached devices)
	 */
	SPDK_NVME_TRANSPORT_PCIE = 256,

	/**
	 * RDMA Transport (RoCE, iWARP, etc.)
	 */
	SPDK_NVME_TRANSPORT_RDMA = SPDK_NVMF_TRTYPE_RDMA,

and:

include/spdk/nvme.h:509

/**
 * Transport address of the NVMe-oF endpoint. For transports which use IP
 * addressing (e.g. RDMA), this should be an IP address. For PCIe, this
 * can either be a zero length string (the whole bus) or a PCI address
 * in the format DDDD:BB:DD.FF or DDDD.BB.DD.FF. For FC the string is
 * formatted as: nn-0xWWNN:pn-0xWWPN where WWNN is the Node_Name of the

The line about PCIe is the key: for local baremetal ownership, traddr is a PCI address, not a network address.
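
An orchestrator that accepts a BDF from inventory may want to validate it before sending the RPC. The sketch below accepts the two spellings named in the header comment, with a four-digit domain and a single function digit as in the example 0000:82:00.0. normalizeBDF is a hypothetical helper, and its pattern is deliberately stricter than whatever SPDK's own parser accepts.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// bdfPattern matches domain, bus, device, and function with either ':' or
// '.' between the first three parts. Groups 2 and 4 capture the separators
// so mixed spellings can be rejected.
var bdfPattern = regexp.MustCompile(`^([0-9a-f]{4})([:.])([0-9a-f]{2})([:.])([0-9a-f]{2})\.([0-7])$`)

// normalizeBDF returns the lower-case colon form (dddd:bb:dd.f) or an error
// when the input is not a PCI address in one of the two accepted spellings.
func normalizeBDF(traddr string) (string, error) {
	m := bdfPattern.FindStringSubmatch(strings.ToLower(strings.TrimSpace(traddr)))
	if m == nil || m[2] != m[4] {
		return "", fmt.Errorf("%q is not a PCI address", traddr)
	}
	return fmt.Sprintf("%s:%s:%s.%s", m[1], m[3], m[5], m[6]), nil
}
```

Normalizing before comparison also helps later checks, because the same controller then has one spelling in the database, in logs, and in the RPC.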

bdev_nvme_create Does The Real Attach

The C function spdk_bdev_nvme_create rejects duplicate controller identities and invalid names before it starts the async probe/connect path:

module/bdev/nvme/bdev_nvme.c:6753

if (nvme_ctrlr_get(trid, drv_opts->hostnqn) != NULL) {
	SPDK_ERRLOG("A controller with the provided trid (traddr: %s, hostnqn: %s) "
		    "already exists.\n", trid->traddr, drv_opts->hostnqn);
	return -EEXIST;
}

len = strnlen(base_name, SPDK_CONTROLLER_NAME_MAX);

if (len == 0 || len == SPDK_CONTROLLER_NAME_MAX) {
	SPDK_ERRLOG("controller name must be between 1 and %d characters\n", SPDK_CONTROLLER_NAME_MAX - 1);
	return -EINVAL;
}

For beginners, this is why name and transport identity matter separately:

  • the transport ID says which controller/path SPDK should attach;
  • the controller name becomes the prefix for bdevs;
  • a duplicate transport ID is not a harmless second attach;
  • an invalid or reused name can collide with an existing bdev graph.

The function then chooses attach behavior and starts an async connection:

module/bdev/nvme/bdev_nvme.c:6856

if (nvme_bdev_ctrlr_get_by_name(base_name) == NULL || ctx->bdev_opts.multipath) {
	attach_cb = connect_attach_cb;
} else {
	attach_cb = connect_set_failover_cb;
}

nvme_ctrlr = nvme_ctrlr_get_by_name(ctx->base_name);
if (nvme_ctrlr  && nvme_ctrlr->opts.multipath != ctx->bdev_opts.multipath) {
	/* All controllers with the same name must be configured the same
	 * way, either for multipath or failover. If the configuration doesn't
	 * match - report error.
	 */
	free_nvme_async_probe_ctx(ctx);
	return -EINVAL;
}

ctx->probe_ctx = spdk_nvme_connect_async(trid, &ctx->drv_opts, attach_cb);

For RDMA, multipath can mean several target paths under one logical NVMe bdev. For PCIe, a local controller is a local PCI function. SPDK later rejects PCIe failover paths; local PCI does not become a network multipath device just because the same RPC supports multipath for fabrics.

Host Setup: Hugepages, VFIO, And Binding

Local PCI attach only works after the host has been prepared. SPDK's scripts/setup.sh does two separate jobs:

  1. reserve hugepage memory for DMA-capable userspace buffers;
  2. move selected PCI devices from their normal kernel driver to a userspace-friendly driver such as vfio-pci.

The script's usage text summarizes the intent:

scripts/setup.sh:33

echo "Helper script for allocating hugepages and binding NVMe, I/OAT, VMD and Virtio devices"
echo "to a generic VFIO kernel driver. If VFIO is not available on the system, this script"
echo "will fall back to UIO. NVMe and Virtio devices with active mountpoints will be ignored."
echo "All hugepage operations use default hugepage size on the system (hugepagesz)."

The active-mountpoint guard is not cosmetic. Binding a mounted root or data disk away from the kernel would remove the block device underneath the filesystem.

Driver choice depends on IOMMU and available modules:

scripts/setup.sh:391

if [[ "${DRIVER_OVERRIDE}" == "none" ]]; then
	driver_name=none
elif [[ -n "${DRIVER_OVERRIDE}" ]]; then
	driver_path="$DRIVER_OVERRIDE"
	driver_name="${DRIVER_OVERRIDE##*/}"
	# modprobe and the sysfs don't use the .ko suffix.
	driver_name=${driver_name%.ko}
	# path = name -> there is no path
	if [[ "$driver_path" = "$driver_name" ]]; then
		driver_path=""
	fi
elif is_iommu_enabled; then
	driver_name=vfio-pci

If IOMMU is enabled, setup.sh selects vfio-pci. Without IOMMU, the script can fall back to UIO if available. That fallback may work in labs, but it gives up the IOMMU isolation that normally keeps a userspace DMA bug from writing arbitrary host memory.

Hugepage allocation has its own failure mode:

scripts/setup.sh:511

echo $((NRHUGE < 0 ? 0 : NRHUGE)) > "$hp_int"

allocated_hugepages=$(< "$hp_int")
if ((allocated_hugepages < NRHUGE)); then
	cat <<- ERROR

		## ERROR: requested $NRHUGE hugepages but $allocated_hugepages could be allocated ${2:+on node$2}.
		## Memory might be heavily fragmented. Please try flushing the system cache, or reboot the machine.
	ERROR
	return 1
fi

SPDK can fail before any disk attach if hugepages are exhausted or fragmented. In production, this looks like "SPDK will not start" or "SPDK cannot allocate DMA memory," not like an NVMe protocol problem.
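
A preflight check can separate hugepage problems from NVMe problems before SPDK starts. The sketch below parses the hugepage counters that Linux reports in /proc/meminfo. parseMeminfo and hugepageInfo are hypothetical names, and the function takes the file contents as a string so the caller decides how to read the file.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// hugepageInfo holds the default-size hugepage counters from /proc/meminfo.
type hugepageInfo struct {
	Total  int // HugePages_Total
	Free   int // HugePages_Free
	SizeKB int // Hugepagesize, in kB
}

// parseMeminfo extracts the three hugepage fields and fails if any is missing.
func parseMeminfo(text string) (hugepageInfo, error) {
	var info hugepageInfo
	seen := 0
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 {
			continue
		}
		var dst *int
		switch fields[0] {
		case "HugePages_Total:":
			dst = &info.Total
		case "HugePages_Free:":
			dst = &info.Free
		case "Hugepagesize:":
			dst = &info.SizeKB
		default:
			continue
		}
		n, err := strconv.Atoi(fields[1])
		if err != nil {
			return info, fmt.Errorf("parse %s %q: %w", fields[0], fields[1], err)
		}
		*dst = n
		seen++
	}
	if err := sc.Err(); err != nil {
		return info, err
	}
	if seen != 3 {
		return info, fmt.Errorf("found %d of 3 hugepage fields", seen)
	}
	return info, nil
}

// freeMB converts the free page count to megabytes.
func (h hugepageInfo) freeMB() int {
	return h.Free * h.SizeKB / 1024
}
```

These counters describe only the default hugepage size and the whole machine. Per-NUMA-node shortages need the per-node sysfs counters, which this sketch does not read.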

NVMe-oF Attach Loop In Current diskengine

The current diskengine baremetal attach loop is a reconciler over database rows. It queries mapped volumes, includes resizing volumes so pending paths are preconnected, asks SPDK which paths are already connected, and attaches only missing paths.

/home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/nvme_attach.go:51

func reconcileNVMeConnections(ctx context.Context, client *spdkclient.Client) {
	// Build expected connections scoped to what this baremetal currently needs.
	vols, err := repository.GetVolumesMappedToBaremetalByStates(ctx, config.Value.BAREMETAL_ID, []types.VolumeBaremetalMappingState{
		types.VOLUME_BM_ASSIGNED,
		types.VOLUME_BM_INITIALISED,
		types.VOLUME_BM_ATTACHED,
	})
	if err != nil {
		logger.Error.Printf("nvme attach: failed to get mapped volumes: %v", err)
		return
	}

The important design choice is retry, not one-shot provisioning. A missing target, down RDMA interface, stale route, or temporary SPDK error logs a warning and retries next tick. That is why a single attach failure should not change the database state: the row stays as it is and the next tick tries again.

Path detection also uses SPDK controller state rather than Linux devices:

/home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/nvme_attach.go:143

// getConnectedNVMePaths returns a map of connected paths: NQN -> IP -> enabled.
func getConnectedNVMePaths(client *spdkclient.Client) (map[string]map[string]bool, error) {
	controllers, err := client.BdevNvmeGetControllers()
	if err != nil {
		return nil, err
	}

	paths := make(map[string]map[string]bool)
	for _, ctrl := range controllers {
		for _, info := range ctrl.Ctrlrs {
			nqn := info.Trid.Subnqn
			ip := info.Trid.Traddr

For local PCIe, the same idea applies but the identity would be controller name plus BDF. Looking for /dev/nvme0n1 is the wrong check after SPDK owns the controller.
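
The reconcile step that follows path detection is a set difference: desired paths minus paths SPDK already knows. A minimal sketch, assuming the NQN -> IP -> enabled map shape from the excerpt above. nvmePath and missingPaths are hypothetical names. Treating a present-but-disabled path as already attached is an assumption of this sketch, made because a second attach of an existing path would be rejected as a duplicate.

```go
package main

// nvmePath identifies one fabric path to one subsystem.
type nvmePath struct {
	NQN string
	IP  string
}

// missingPaths returns the desired paths that SPDK does not list at all,
// in the order they appear in desired. A path that is listed but disabled
// is not returned: it already exists in SPDK and needs a health decision,
// not another attach call.
func missingPaths(desired []nvmePath, connected map[string]map[string]bool) []nvmePath {
	var missing []nvmePath
	for _, p := range desired {
		// Indexing a nil inner map is safe and reports "not present".
		if _, present := connected[p.NQN][p.IP]; present {
			continue
		}
		missing = append(missing, p)
	}
	return missing
}
```

Because the function is pure, the same desired list and the same SPDK snapshot always produce the same attach plan, which is what makes the next-tick retry safe.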

RAID Ensure And Namespace Geometry

diskengine turns attached namespace bdevs into a RAID bdev before exposing a disk to a VM. The current code expects two ready lvol replicas and names the RAID raid_<volume_id>.

/home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/raidensure.go:109

if len(lvols) != 2 {
	logger.Info.Printf("raid ensure: vol %d has %d lvols (want 2); skipping", vol.VolumeID, len(lvols))
	return fmt.Errorf("raid ensure: vol %d has %d lvols (want 2); skipping", vol.VolumeID, len(lvols))
}

baseBdevs := make([]string, 0, 2)
for _, l := range lvols {
	name, nerr := baseBdevNameFromNQN(l.NQN)
	if nerr != nil {
		logger.Warn.Printf("raid ensure: invalid NQN for vol %d (uuid=%s): %v", vol.VolumeID, l.UUID, nerr)

Before creating RAID, diskengine checks that bases are ready and explicitly avoids a generic bdev listing during reset windows:

/home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/raidensure.go:128

// Guard: only attempt RAID create once base subsystems/controllers are up.
// Avoid bdev_get_bdevs here; it can crash SPDK during NVMe controller reset.
lvolNQNs := []string{lvols[0].NQN, lvols[1].NQN}
ready, err := areBaseBdevsReady(client, lvolNQNs, baseBdevs)
if err != nil {
	logger.Error.Printf("raid ensure: check base bdevs failed for vol %d: %v", vol.VolumeID, err)
	return fmt.Errorf("raid ensure: check base bdevs failed for vol %d: %w", vol.VolumeID, err)
}
if !ready {
	logger.Warn.Printf("raid ensure: base bdevs not ready yet for vol %d; deferring RAID create", vol.VolumeID)

Namespace geometry matters here. RAID assumes the base bdevs are compatible enough for the chosen RAID level: block size, size, metadata behavior, and data offset must match what higher layers expect. A local PCI namespace and a remote lvol namespace may both be "NVMe bdevs," but that does not make them interchangeable. If a resize creates a pending namespace with different geometry, the correct response is to stop and inspect, not force a RAID membership change.
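
A geometry gate can be a small pure function that runs before any membership change. The sketch below compares two base bdevs strictly on block size, block count, and metadata size. geometry and geometryMismatches are hypothetical names, and requiring equal block counts is an assumption of this sketch rather than a documented SPDK RAID rule.

```go
package main

import "fmt"

// geometry holds the properties two RAID bases are compared on.
type geometry struct {
	BlockSize uint32 // bytes per logical block
	NumBlocks uint64 // number of logical blocks
	MdSize    uint32 // metadata bytes per block, 0 if none
}

// geometryMismatches returns one human-readable line per differing property.
// An empty result means the two bases look interchangeable to this check.
func geometryMismatches(a, b geometry) []string {
	var out []string
	if a.BlockSize != b.BlockSize {
		out = append(out, fmt.Sprintf("block size %d != %d", a.BlockSize, b.BlockSize))
	}
	if a.NumBlocks != b.NumBlocks {
		out = append(out, fmt.Sprintf("block count %d != %d", a.NumBlocks, b.NumBlocks))
	}
	if a.MdSize != b.MdSize {
		out = append(out, fmt.Sprintf("metadata size %d != %d", a.MdSize, b.MdSize))
	}
	return out
}
```

Returning every mismatch instead of the first one gives the operator the full picture in a single log line, which matters when the right response is to stop and inspect.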

JSON-RPC Replay And Stale State

SPDK's runtime state lives inside the SPDK process. diskengine has database state and saved RPC config. After a restart, those layers can disagree.

The helper comments explain the required ordering:

/home/lolwierd/Projects/excloud/diskengine/diskengine/internal/helpers/spdk_helpers.go:96

// RestoreConfig replays JSON-RPC calls persisted on disk to restore SPDK state.
//
// The SPDK framework has two phases:
//  1. Pre-init phase (when SPDK is started with --wait-for-rpc)
//  2. Post-init phase (after framework_start_init)
//
// Certain RPCs must be sent *before* the framework is started; otherwise they
// are rejected and the desired settings are not applied.  To guarantee the
// correct ordering we:

The implementation splits pre-init options from post-init configuration:

/home/lolwierd/Projects/excloud/diskengine/diskengine/internal/helpers/spdk_helpers.go:175

// Split RPCs by phase
preInitSet := map[string]struct{}{
	"framework_set_scheduler": {},
	"sock_set_default_impl":   {},
	"sock_impl_set_options":   {},
	"iobuf_set_options":       {},
	"accel_set_options":       {},
	"bdev_set_options":        {},
	"bdev_raid_set_options":   {},
	"bdev_nvme_set_options":   {},

This matters for stale JSON-RPC state. If a saved config tries to recreate a controller that SPDK already has, duplicate-name or duplicate-transport errors can be harmless replay artifacts or real misconfiguration. The operator must compare:

  • framework_get_config;
  • bdev_nvme_get_controllers;
  • bdev_get_bdevs when safe;
  • diskengine database mappings;
  • saved config file contents.

Do not blindly delete lower bdevs just because replay returned "already exists." First confirm whether the live graph matches desired state.
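
The comparison can be made mechanical for the attach case. The sketch below classifies a wanted attach against a snapshot of live controllers, keyed by controller name. All names here (trid, verdict, classifyAttach) are hypothetical, and the rules are assumptions of this sketch: an exact match is a replay artifact, a new address for the same subsystem under the same name is an additional path, and everything else that overlaps is a conflict.

```go
package main

import "strings"

// trid is the part of the transport ID this check compares.
type trid struct {
	Trtype, Traddr, Subnqn string
}

// sameTrid compares transport IDs, ignoring case in the transport type.
func sameTrid(a, b trid) bool {
	return strings.EqualFold(a.Trtype, b.Trtype) && a.Traddr == b.Traddr && a.Subnqn == b.Subnqn
}

type verdict string

const (
	verdictAttach       verdict = "attach"
	verdictReplay       verdict = "already-present"
	verdictNameConflict verdict = "name-conflict"
	verdictTridConflict verdict = "trid-owned-by-other-name"
)

// classifyAttach decides what an attach of (name, want) means given the
// live controllers, expressed as controller name -> attached paths.
func classifyAttach(name string, want trid, live map[string][]trid) verdict {
	if paths, ok := live[name]; ok {
		for _, p := range paths {
			if sameTrid(p, want) {
				return verdictReplay // same name, same path: harmless replay
			}
		}
		for _, p := range paths {
			if p.Subnqn != want.Subnqn {
				return verdictNameConflict // name already used for another subsystem
			}
		}
		return verdictAttach // additional path to the same subsystem
	}
	for _, paths := range live {
		for _, p := range paths {
			if sameTrid(p, want) {
				return verdictTridConflict // path exists under a different name
			}
		}
	}
	return verdictAttach
}
```

Only verdictReplay is safe to ignore. The two conflict verdicts are the cases where deleting or recreating bdevs without inspection can damage a live graph.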

Exposing The Disk To QEMU

After RAID exists and initialization is complete, diskengine creates a vhost-blk controller. This is the VM-facing point. The guest does not see SPDK's internal bdev names; it sees a disk presented through QEMU/vhost.

The key ordering rule is:

  1. NVMe bdevs must exist.
  2. RAID must exist and be usable.
  3. Snapshot initialization must finish if required.
  4. vhost-blk can be created.
  5. VM state can move to attached/available.

Creating vhost first would expose a missing or unstable backend. Deleting RAID before deleting vhost would strand a VM-facing controller on a removed lower device.
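
The ordering rule can be written as a single decision function that returns the next allowed action for a volume. This is a sketch of the rule as stated above, not diskengine's state machine. volumeState, nextAction, and the action strings are hypothetical.

```go
package main

// volumeState summarizes what exists for one volume on this host.
type volumeState struct {
	BasesReady   bool // the NVMe namespace bdevs exist
	RaidOnline   bool // the RAID bdev exists and is usable
	InitRequired bool // the volume must be filled from a snapshot
	InitDone     bool // snapshot initialization finished
	VhostPresent bool // the vhost-blk controller exists
}

// nextAction returns the earliest step that is still outstanding.
// Later steps are never returned while an earlier one is incomplete.
func nextAction(s volumeState) string {
	switch {
	case !s.BasesReady:
		return "wait-for-nvme-bdevs"
	case !s.RaidOnline:
		return "ensure-raid"
	case s.InitRequired && !s.InitDone:
		return "initialise-from-snapshot"
	case !s.VhostPresent:
		return "create-vhost"
	default:
		return "mark-attached"
	}
}
```

Because the cases are checked top to bottom, a volume whose RAID went missing after vhost was created is sent back to ensure-raid instead of being marked attached.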

Teardown, Detach, And Rollback

diskengine splits teardown for safety. VM-facing vhost teardown is separate from volume-level RAID/NVMe teardown. The RAID detach loop gates lower deletion on absence of all relevant vhost controllers:

/home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/raid_detach.go:81

ctrlSet := map[string]struct{}{}
for _, c := range allControllers {
	ctrlSet[c.Ctrlr] = struct{}{}
}
for _, id := range mappingIDs {
	name := fmt.Sprintf("vhost%d", id)
	if _, ok := ctrlSet[name]; ok {
		logger.Warn.Printf("raid detach: vol %d spdk gate failed (vhost %s still present); deferring", volumeID, name)
		return nil
	}
}

Only after that gate does it delete RAID and detach unused NVMe controllers:

/home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/raid_detach.go:93

// Delete RAID bdev
raidName := fmt.Sprintf("raid_%d", volumeID)
if er := client.BdevRaidDelete(spdkclient.BdevRaidDeleteParams{Name: raidName}); er != nil {
	if !isSPDKNotFoundErr(er) {
		// Real error: do NOT transition to DETACHED. Retry next tick.
		return fmt.Errorf("delete raid %s: %w", raidName, er)
	}
}

// Detach NVMe controllers (best-effort)
if err := detachUnusedNvmeControllers(ctx, client, volumeID); err != nil {

That gate is the stricter teardown invariant: all relevant vhost controllers must be absent before lower bdevs are deleted. A controller with zero sessions is not enough because it is still an exposure object and QEMU can reconnect to its socket.

The current detachVhost implementation weakens that invariant. It checks for no active sessions, then deletes raid_<volume_id> and detaches unused NVMe controllers before calling VhostDeleteController. If the final vhost delete fails or is delayed, SPDK can briefly retain a sessionless vhost controller backed by storage that was already removed. Treat that as a correctness hazard in the current code. The RAID detach gate above is the order to preserve: remove VM-facing exposure first, then remove RAID and NVMe bases.
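
The order to preserve can be sketched as one function that refuses to touch lower bdevs while any vhost controller remains. This is an illustration, not a patch for detachVhost: teardownOps, teardownVolume, and errVhostPresent are hypothetical, and the four operations are injected as functions so the sketch does not depend on the real spdkclient API. DeleteVhost is assumed to treat an already-absent controller as success.

```go
package main

import (
	"errors"
	"fmt"
)

// errVhostPresent reports that the exposure gate is still closed.
var errVhostPresent = errors.New("vhost controller still present")

// teardownOps groups the SPDK operations the teardown needs.
type teardownOps struct {
	DeleteVhost      func(name string) error
	VhostExists      func(name string) (bool, error)
	DeleteRaid       func(name string) error
	DetachUnusedNvme func(volumeID int) error
}

// teardownVolume removes VM-facing exposure first, verifies that it is
// gone, and only then deletes the RAID bdev and detaches NVMe controllers.
func teardownVolume(ops teardownOps, volumeID int, mappingIDs []int) error {
	// 1. Remove every vhost controller for this volume.
	for _, id := range mappingIDs {
		name := fmt.Sprintf("vhost%d", id)
		if err := ops.DeleteVhost(name); err != nil {
			return fmt.Errorf("delete %s: %w", name, err)
		}
	}
	// 2. Gate: a delete that was accepted but has not completed still blocks.
	for _, id := range mappingIDs {
		name := fmt.Sprintf("vhost%d", id)
		present, err := ops.VhostExists(name)
		if err != nil {
			return fmt.Errorf("check %s: %w", name, err)
		}
		if present {
			return fmt.Errorf("%w: %s", errVhostPresent, name)
		}
	}
	// 3. Lower layers, top down.
	raid := fmt.Sprintf("raid_%d", volumeID)
	if err := ops.DeleteRaid(raid); err != nil {
		return fmt.Errorf("delete %s: %w", raid, err)
	}
	return ops.DetachUnusedNvme(volumeID)
}
```

Every early return leaves the lower graph intact, so calling the function again on the next tick is safe.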

For local PCI rollback, there is one extra operational step outside diskengine: after SPDK detaches the controller and no SPDK process owns it, rebind the BDF to the Linux nvme driver. SPDK's setup.sh reset path exists for this purpose. Rollback is not "restart Linux and hope." The operator should know the BDF, confirm no SPDK bdev uses it, detach in SPDK, then rebind.

SPDK's detach RPC accepts optional path fields and calls spdk_bdev_nvme_delete:

module/bdev/nvme/bdev_nvme_rpc.c:750

rpc_bdev_nvme_detach_controller(struct spdk_jsonrpc_request *request,
				const struct spdk_json_val *params)
{
	struct rpc_bdev_nvme_detach_controller req = {NULL};
	struct spdk_nvme_path_id path = {};
	size_t len, maxlen;
	int rc = 0;

	if (spdk_json_decode_object(params, rpc_bdev_nvme_detach_controller_decoders,
				    SPDK_COUNTOF(rpc_bdev_nvme_detach_controller_decoders),

and:

module/bdev/nvme/bdev_nvme_rpc.c:850

rc = spdk_bdev_nvme_delete(req.name, &path, rpc_bdev_nvme_detach_controller_done, request);

if (rc != 0) {
	spdk_jsonrpc_send_error_response(request, rc, spdk_strerror(-rc));
}

Reset And Hot-Remove

Reset is not a normal attach retry. It can interrupt I/O, alter controller state, and race with bdev queries. SPDK exposes bdev_nvme_reset_controller:

module/bdev/nvme/bdev_nvme_rpc.c:1359

static void
rpc_bdev_nvme_reset_controller(struct spdk_jsonrpc_request *request,
			       const struct spdk_json_val *params)
{
	rpc_bdev_nvme_controller_op(request, params, NVME_CTRLR_OP_RESET);
}
SPDK_RPC_REGISTER("bdev_nvme_reset_controller", rpc_bdev_nvme_reset_controller, SPDK_RPC_RUNTIME)

In a reset storm, repeated resets can keep bdevs oscillating between enabled, missing, and examining states. diskengine's RAID code already hints at this by avoiding broad bdev_get_bdevs during NVMe controller reset. Operationally, slow down the actor causing resets, inspect controller health, and avoid layering more delete/recreate operations on top of an unstable controller.
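
One way to slow down the actor causing resets is a per-controller budget: at most a fixed number of resets inside a sliding window. The sketch below is an assumption about how such a policy could look, not something diskengine or SPDK provides. resetLimiter, newResetLimiter, and allow are hypothetical, and timestamps are plain seconds supplied by the caller.

```go
package main

// resetLimiter allows at most max resets per controller within window seconds.
type resetLimiter struct {
	max     int
	window  int64
	history map[string][]int64 // controller name -> timestamps of granted resets
}

func newResetLimiter(max int, windowSec int64) *resetLimiter {
	return &resetLimiter{max: max, window: windowSec, history: make(map[string][]int64)}
}

// allow reports whether a reset of ctrlr may be issued at nowSec and, if so,
// records it. Timestamps at or before nowSec-window are dropped first.
func (l *resetLimiter) allow(ctrlr string, nowSec int64) bool {
	cutoff := nowSec - l.window
	kept := l.history[ctrlr][:0] // filter in place; safe for a nil slice
	for _, t := range l.history[ctrlr] {
		if t > cutoff {
			kept = append(kept, t)
		}
	}
	if len(kept) >= l.max {
		l.history[ctrlr] = kept
		return false
	}
	l.history[ctrlr] = append(kept, nowSec)
	return true
}
```

The limiter is only useful if a single component owns it. Two actors with separate budgets can still produce a storm together.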

For PCIe hot-remove, SPDK registers removal callbacks and can poll for hotplug:

module/bdev/nvme/bdev_nvme.c:6187

static void
remove_cb(void *cb_ctx, struct spdk_nvme_ctrlr *ctrlr)
{
	struct nvme_ctrlr *nvme_ctrlr = cb_ctx;

	bdev_nvme_delete_ctrlr(nvme_ctrlr, true);
}

and:

module/bdev/nvme/bdev_nvme.c:6212

bdev_nvme_hotplug(void *arg)
{
	struct spdk_nvme_transport_id trid_pcie;

	if (g_hotplug_probe_ctx) {
		return SPDK_POLLER_BUSY;
	}

	memset(&trid_pcie, 0, sizeof(trid_pcie));
	spdk_nvme_trid_populate_transport(&trid_pcie, SPDK_NVME_TRANSPORT_PCIE);

Hot-remove is an error path, not a graceful detach. The best case is that SPDK notices, tears down controller state, and upper layers see I/O errors. For a VM-backed disk, the operator must then decide whether the RAID layer still has enough replicas, whether the VM should continue, and whether the database should mark the physical path faulty.

Required Edge Cases

Controller already bound to the kernel:

If a local PCI controller is still bound to nvme, bdev_nvme_attach_controller with trtype=PCIe will not be able to claim it through VFIO/UIO. Check scripts/setup.sh status, /sys/bus/pci/devices/<bdf>/driver, and active mounts. Do not bind away a mounted filesystem.

IOMMU disabled:

setup.sh prefers vfio-pci when is_iommu_enabled succeeds. Without IOMMU it may fall back to UIO. Treat that as a security and safety downgrade. On production systems, enabling IOMMU in firmware and kernel command line is usually the right fix.

Hugepage exhaustion:

SPDK needs hugepages for DMA-capable buffers. If the requested hugepages cannot be allocated, fix memory fragmentation, reserve pages earlier in boot, reduce HUGEMEM/NRHUGE, or choose the right NUMA node. Do not debug this as an NVMe namespace problem.

Duplicate names:

SPDK rejects duplicate transport IDs and invalid controller names. diskengine's current RDMA path derives names from NQNs to make retries idempotent. A local PCI orchestrator should do the same from BDFs or inventory IDs.

Hot-remove:

For PCIe, the device can disappear below SPDK. SPDK has callbacks and hotplug polling, but upper layers still see I/O errors. RAID may survive if another base is healthy; a single local namespace does not.

Reset storms:

Repeated reset attempts can make bdev state flicker. Avoid broad graph mutations while a controller is resetting. Prefer controller/path health checks, bounded retries, and one owner for reset policy.

Stale JSON-RPC state:

Saved config, live SPDK state, and diskengine database state can diverge after crash/restart. Compare all three before deleting. Some "already exists" errors are successful idempotency; others are real name conflicts.

Multipath versus local PCI:

Multipath policy is meaningful for remote NVMe-oF paths. A local PCI BDF is a local controller path. Do not model two unrelated local drives as multipath just because the RPC accepts a multipath option.

Namespace geometry mismatch:

RAID and snapshot copy assume compatible block sizes, sizes, and metadata offsets. During resize, pending namespaces must be checked before membership changes. If geometry differs unexpectedly, stop and inspect the source mapping.

Operational rollback:

Rollback from local SPDK ownership means: stop VM exposure, delete vhost, delete RAID/lvol users, detach the NVMe controller from SPDK, stop SPDK if needed, and rebind the BDF to the kernel nvme driver. Then confirm /dev/nvme* reappears before mounting anything.

Misconceptions To Kill

"Baremetal mode always means SPDK owns physical SSDs."

Not in the current diskengine source. Current diskengine baremetal mode is a compute-side reconciler for remote NVMe-oF lvol exports. SPDK can also own local PCIe controllers, but that is a transport choice at the bdev_nvme_attach_controller boundary.

"Go is doing NVMe I/O."

No. Go sends JSON-RPC. SPDK C code owns controllers, bdevs, queues, pollers, DMA memory, reset handling, and hot-remove callbacks.

"If SPDK owns the controller, Linux can still mount it."

No. The kernel block device disappears when the kernel driver is unbound. Use SPDK exports such as vhost, NBD, NVMe-oF target, or CUSE if you need another interface.

"Detach is one delete call."

No. VM exposure, RAID/lvol users, NVMe controllers, PCI binding, and database state have different lifetimes and must be unwound in order.

"Multipath fixes every device failure."

No. Multipath is for multiple paths to the same storage identity. It is not a substitute for RAID, replication, or local PCI redundancy.

Lab: Read One Attach End To End

For the current diskengine remote path:

  1. Start at internal/baremetal/nvme_attach.go:reconcileNVMeConnections.
  2. Follow repository.GetRequiredNVMeConnectionsForVolumes to see how NQN, RDMA IP, and port are derived.
  3. Read attachNvmeSoft and identify the exact BdevNvmeAttachControllerParams.
  4. Read internal/spdkclient/wrappers.go:BdevNvmeAttachController.
  5. In SPDK, read module/bdev/nvme/bdev_nvme_rpc.c:rpc_bdev_nvme_attach_controller.
  6. Continue to module/bdev/nvme/bdev_nvme.c:spdk_bdev_nvme_create.
  7. Confirm the returned bdev names feed RAID creation in internal/baremetal/raidensure.go.

For a local PCI experiment, compare only step 3: the transport ID becomes trtype=PCIe and traddr=<BDF>. Everything below the RPC changes from fabric connection to PCI probing and BAR/DMA ownership.

Source Reading Path

Primary diskengine files:

  • /home/lolwierd/Projects/excloud/diskengine/diskengine/docs/baremetal.md
  • /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/baremetal.go
  • /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/nvme_attach.go
  • /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/raidensure.go
  • /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/baremetal/raid_detach.go
  • /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/helpers/spdk_helpers.go
  • /home/lolwierd/Projects/excloud/diskengine/diskengine/internal/spdkclient/wrappers.go

Primary SPDK files:

  • /home/lolwierd/spdk/doc/userspace.md
  • /home/lolwierd/spdk/doc/bdev.md
  • /home/lolwierd/spdk/doc/nvme.md
  • /home/lolwierd/spdk/doc/jsonrpc.md.jinja2
  • /home/lolwierd/spdk/scripts/setup.sh
  • /home/lolwierd/spdk/include/spdk/nvme.h
  • /home/lolwierd/spdk/module/bdev/nvme/bdev_nvme_rpc.c
  • /home/lolwierd/spdk/module/bdev/nvme/bdev_nvme.c
  • /home/lolwierd/spdk/lib/nvme/nvme.c
  • /home/lolwierd/spdk/lib/nvme/nvme_pcie.c

Operational Debug Exercise

Symptom: guest write hangs.

Classify the failure from top to bottom:

  1. Does vhost_get_controllers show the expected vhost<mapping_id>?
  2. Does QEMU still have the vhost socket open?
  3. Does bdev_raid_get_bdevs show raid_<volume_id> online or degraded?
  4. Do the expected NVMe bdevs still exist?
  5. For remote diskengine: does bdev_nvme_get_io_paths show enabled RDMA paths to the expected NQNs?
  6. For local PCI: does bdev_nvme_get_controllers show the expected trtype=PCIe and BDF?
  7. Are there recent reset, hot-remove, or hugepage allocation errors in SPDK logs?
  8. Does diskengine database state agree with live SPDK state?

The main lesson is to avoid crossing layers too early. If the vhost controller is missing, do not start with PCI binding. If the PCI controller is gone, do not keep recreating vhost. Walk the graph in order.