Chapter 14: bdev I/O Path In Detail | SPDK From First Principles

Reader Promise

By the end of this chapter you should be able to trace a single write from the public bdev API to the module submit_request() callback, then trace the completion back to the user callback. You should also be able to explain why an I/O might be split, queued for QoS, queued for NOMEM retry, aborted by reset, or completed later even when the module completes it immediately.

This chapter follows the bdev core path, not a specific backend. NVMe, malloc, null, lvol, RAID, crypto, compression, and passthru all enter this same core machinery. That is the main reason the bdev layer exists: it gives applications one asynchronous block I/O contract while modules provide the backend-specific implementation.

The official SPDK programming guide describes bdev requests as asynchronous spdk_bdev_io objects submitted on an I/O channel, with descriptors carrying permissions and the bdev layer providing common services such as queueing on memory pressure, statistics, reset, and timeout tracking. The custom module guide gives the other side of the contract: the bdev core calls a module's submit_request() function, and the module must eventually complete the I/O through spdk_bdev_io_complete() or a typed completion helper.

flowchart TB
    api[Public bdev API] --> validate[Validate descriptor, range, metadata]
    validate --> alloc[Allocate spdk_bdev_io]
    alloc --> init[Initialize callback and split state]
    init --> submit[bdev_io_submit]
    submit --> locked{Locked LBA range?}
    locked -->|yes| lockedq[ch->io_locked]
    locked -->|no| split{Split needed?}
    split -->|yes| children[Create child IOs]
    split -->|no| gate[_bdev_io_submit]
    children --> gate
    gate --> reset{Reset in progress?}
    reset -->|yes| aborted[Complete aborted]
    reset -->|no| qos{QoS enabled?}
    qos -->|queued| qosq[ch->qos_queued_io]
    qos -->|ready| submit2[bdev_io_do_submit]
    submit2 --> nomemq{NOMEM queue already active?}
    nomemq -->|yes| nomemq2[shared_resource->nomem_io]
    nomemq -->|no| module[Module submit_request]
    module --> complete[spdk_bdev_io_complete]
    complete --> retry{Status NOMEM?}
    retry -->|yes| nomemq2
    retry -->|no| finish[bdev_io_complete]
    finish --> usercb[User completion callback]
    usercb --> free[spdk_bdev_free_io]
    nomemq2 --> retryloop[bdev_shared_ch_retry_io]
    retryloop --> module

The High-Level Path

The common write path is:

1. Caller submits through a public API such as spdk_bdev_writev_blocks().
2. bdev core validates descriptor permissions, block range, metadata, and optional extended parameters.
3. bdev core allocates a struct spdk_bdev_io from the channel's per-thread cache or the global pool.
4. bdev core fills operation parameters and initializes internal fields, including the user callback and the split flag.
5. bdev_io_submit() checks locked ranges, records the I/O as submitted, emits trace state, and either splits or continues.
6. _bdev_io_submit() checks reset and QoS gates.
7. bdev_io_do_submit() handles NOMEM backpressure and calls bdev_submit_request().
8. bdev_submit_request() invokes the module's fn_table->submit_request(ioch, bdev_io).
9. The module eventually calls spdk_bdev_io_complete() or a typed completion wrapper.
10. bdev core updates outstanding counts, handles NOMEM retry, bounce buffers, accel sequence completion, statistics, trace, and synchronous-completion deferral.
11. The user completion callback runs on the I/O's SPDK thread.
12. The callback frees the request with spdk_bdev_free_io() after it is done inspecting the I/O.

The important ownership rule is that the public API user owns the completion callback and later frees the spdk_bdev_io, but the bdev layer owns routing, queue membership, statistics, retry, and callback dispatch. A module receives a bdev I/O object, but it does not call the user's callback directly.

Step 1: Public API Builds An I/O

Public submit APIs are mostly builders. They do not perform hardware I/O. A write call validates the descriptor and range, allocates a generic I/O object, fills the write-specific fields, records the callback, and then enters the core path.

This excerpt shows the public write path's essential shape:

/* lib/bdev/bdev.c */
int
spdk_bdev_writev_blocks(struct spdk_bdev_desc *desc, struct spdk_io_channel *ch,
			struct iovec *iov, int iovcnt,
			uint64_t offset_blocks, uint64_t num_blocks,
			spdk_bdev_io_completion_cb cb, void *cb_arg)
{
	struct spdk_bdev *bdev = spdk_bdev_desc_get_bdev(desc);

	return bdev_writev_blocks_with_md(desc, ch, iov, iovcnt, NULL, offset_blocks,
					  num_blocks, NULL, NULL, NULL, bdev->dif_check_flags, 0, 0,
					  cb, cb_arg);
}

The helper does the actual validation and object setup:

/* lib/bdev/bdev.c */
if (spdk_unlikely(!desc->write)) {
	return -EBADF;
}

if (spdk_unlikely(!bdev_io_valid_blocks(bdev, offset_blocks, num_blocks))) {
	return -EINVAL;
}

bdev_io = bdev_channel_get_io(channel);
if (spdk_unlikely(!bdev_io)) {
	return -ENOMEM;
}

bdev_io->internal.ch = channel;
bdev_io->internal.desc = desc;
bdev_io->type = SPDK_BDEV_IO_TYPE_WRITE;
bdev_io->u.bdev.iovs = iov;
bdev_io->u.bdev.iovcnt = iovcnt;
bdev_io->u.bdev.md_buf = md_buf;
bdev_io->u.bdev.num_blocks = num_blocks;
bdev_io->u.bdev.offset_blocks = offset_blocks;
bdev_io_init(bdev_io, bdev, cb_arg, cb);

The callback is stored, not called. The desc establishes permissions, the channel ties the I/O to the caller's SPDK thread, and bdev_io_init() sets the internal status to pending. It also determines whether this I/O will need splitting before module submission.

/* lib/bdev/bdev.c */
void
bdev_io_init(struct spdk_bdev_io *bdev_io,
	     struct spdk_bdev *bdev, void *cb_arg,
	     spdk_bdev_io_completion_cb cb)
{
	bdev_io->bdev = bdev;
	bdev_io->internal.f.raw = 0;
	bdev_io->internal.caller_ctx = cb_arg;
	bdev_io->internal.cb = cb;
	bdev_io->internal.status = SPDK_BDEV_IO_STATUS_PENDING;
	bdev_io->internal.f.in_submit_request = false;
	bdev_io->internal.error.nvme.cdw0 = 0;
	bdev_io->num_retries = 0;
	...
	if (cb == bdev_io_split_done) {
		bdev_io->internal.f.child_io = true;
		bdev_io->internal.f.split = false;
	} else {
		bdev_io->internal.f.child_io = false;
		bdev_io->internal.f.split = bdev_io_should_split(bdev_io);
	}
}

The child-I/O test is subtle. Split children re-enter the normal helper paths, but their callback is bdev_io_split_done, so bdev_io_init() marks them as children and does not evaluate the split rule for them a second time.

Step 2: Allocation And I/O Wait

spdk_bdev_io objects are pooled. The fast path uses a per-thread cache associated with the bdev management channel. If that cache is empty and nobody is already waiting, bdev core tries the global mempool. If someone is already queued for an I/O object, a new caller does not jump ahead.

/* lib/bdev/bdev.c */
struct spdk_bdev_io *
bdev_channel_get_io(struct spdk_bdev_channel *channel)
{
	struct spdk_bdev_mgmt_channel *ch = channel->shared_resource->mgmt_ch;
	struct spdk_bdev_io *bdev_io;

	if (ch->per_thread_cache_count > 0) {
		bdev_io = STAILQ_FIRST(&ch->per_thread_cache);
		STAILQ_REMOVE_HEAD(&ch->per_thread_cache, internal.buf_link);
		ch->per_thread_cache_count--;
	} else if (spdk_unlikely(!TAILQ_EMPTY(&ch->io_wait_queue))) {
		/*
		 * Don't try to look for bdev_ios in the global pool if there are
		 * waiters on bdev_ios - we don't want this caller to jump the line.
		 */
		bdev_io = NULL;
	} else {
		bdev_io = spdk_mempool_get(g_bdev_mgr.bdev_io_pool);
	}

	return bdev_io;
}

If allocation fails here, the public submit function returns -ENOMEM. That return value means the I/O was not submitted and the user's completion callback will not run for that attempt. A caller that wants notification when an I/O object becomes available can use spdk_bdev_queue_io_wait() immediately after -ENOMEM.

spdk_bdev_free_io() is the other half of this contract. The official API reference says the user should call it only after the completion callback has run. Locally, freeing returns the object to the per-thread cache when possible and drains waiters from io_wait_queue:

/* lib/bdev/bdev.c */
void
spdk_bdev_free_io(struct spdk_bdev_io *bdev_io)
{
	struct spdk_bdev_mgmt_channel *ch;

	assert(bdev_io != NULL);
	assert(bdev_io->internal.status != SPDK_BDEV_IO_STATUS_PENDING);

	ch = bdev_io->internal.ch->shared_resource->mgmt_ch;

	if (bdev_io->internal.f.has_buf) {
		bdev_io_put_buf(bdev_io);
	}

	if (ch->per_thread_cache_count < ch->bdev_io_cache_size) {
		ch->per_thread_cache_count++;
		STAILQ_INSERT_HEAD(&ch->per_thread_cache, bdev_io, internal.buf_link);
		while (ch->per_thread_cache_count > 0 && !TAILQ_EMPTY(&ch->io_wait_queue)) {
			struct spdk_bdev_io_wait_entry *entry;

			entry = TAILQ_FIRST(&ch->io_wait_queue);
			TAILQ_REMOVE(&ch->io_wait_queue, entry, link);
			entry->cb_fn(entry->cb_arg);
		}
	} else {
		spdk_mempool_put(g_bdev_mgr.bdev_io_pool, (void *)bdev_io);
	}
}

Do not confuse public -ENOMEM with SPDK_BDEV_IO_STATUS_NOMEM. Public -ENOMEM means no bdev I/O entered the path. SPDK_BDEV_IO_STATUS_NOMEM means a module received the I/O but could not start it because of transient resource pressure, so the bdev core retries it.
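
The allocation and free rules in this step can be modelled without SPDK at all. The sketch below is a toy, not SPDK code: plain counters stand in for the per-thread cache, the global pool, and io_wait_queue, and every toy_ name is hypothetical. It also assumes that each woken waiter immediately resubmits and takes one object, which is the typical use of spdk_bdev_queue_io_wait() but not something the core enforces.

```c
#include <assert.h>
#include <errno.h>

/* Toy model: plain counters stand in for the real lists and mempool. */
struct toy_mgmt_ch {
	int cache_count;	/* objects in the per-thread cache */
	int cache_size;		/* capacity of the per-thread cache */
	int pool_count;		/* objects left in the global pool */
	int waiters;		/* entries parked on the I/O wait queue */
	int woken;		/* waiters whose callback has run */
};

/* Mirrors the three-way choice in bdev_channel_get_io(). */
static int
toy_get_io(struct toy_mgmt_ch *ch)
{
	if (ch->cache_count > 0) {
		ch->cache_count--;
		return 0;
	}
	if (ch->waiters > 0) {
		return -ENOMEM;	/* someone is already waiting: no line jumping */
	}
	if (ch->pool_count > 0) {
		ch->pool_count--;
		return 0;
	}
	return -ENOMEM;
}

/* Mirrors the cache-or-pool choice in spdk_bdev_free_io(). */
static void
toy_free_io(struct toy_mgmt_ch *ch)
{
	if (ch->cache_count >= ch->cache_size) {
		ch->pool_count++;
		return;
	}

	ch->cache_count++;
	while (ch->cache_count > 0 && ch->waiters > 0) {
		int rc;

		ch->waiters--;
		ch->woken++;
		/* Assumption: the woken waiter resubmits and takes one object. */
		rc = toy_get_io(ch);
		assert(rc == 0);
		(void)rc;
	}
}
```

With one waiter queued, a later caller is refused even if the global pool has an object, and the next free wakes the waiter first. That is the fairness property the comment in bdev_channel_get_io() describes.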

Block-size note for source reading: descriptor-facing helpers may report a different block size when metadata is hidden. spdk_bdev_desc_get_block_size() returns bdev->blocklen - bdev->md_len for a descriptor opened with hide_metadata, while module/core internals often use the module-visible bdev->blocklen. When a snippet multiplies by bdev->blocklen, read it as an internal path, not necessarily the block size a public descriptor reports.

Step 3: Submit Accounting And Locked Ranges

bdev_io_submit() is the central entry after extended metadata, memory-domain, and accel setup have either completed or been bypassed. It first enforces locked LBA ranges. These locks are used by operations that need temporary exclusion over a region. Child I/Os skip this test because the parent was already checked before splitting.

/* lib/bdev/bdev.c */
void
bdev_io_submit(struct spdk_bdev_io *bdev_io)
{
	struct spdk_bdev_channel *ch = bdev_io->internal.ch;

	assert(bdev_io->internal.status == SPDK_BDEV_IO_STATUS_PENDING);

	/* Child I/Os are not checked against locked ranges because their parent I/O was already
	 * checked before splitting, so they must be allowed to proceed. */
	if (!bdev_io->internal.f.child_io && !TAILQ_EMPTY(&ch->locked_ranges)) {
		struct lba_range *range;

		TAILQ_FOREACH(range, &ch->locked_ranges, tailq) {
			if (bdev_io_range_is_locked(bdev_io, range)) {
				TAILQ_INSERT_TAIL(&ch->io_locked, bdev_io, internal.ch_link);
				return;
			}
		}
	}

	bdev_ch_add_to_io_submitted(bdev_io);
	bdev_io->internal.submit_tsc = spdk_get_ticks();
	...
	if (bdev_io->internal.f.split) {
		bdev_io_split(bdev_io);
		return;
	}

	_bdev_io_submit(bdev_io);
}

Adding to io_submitted is not just a list operation. It increments channel queue depth, gives completion code something to remove, and sets up latency accounting through submit_tsc. If an I/O is completed before reaching this point, bdev core uses a separate "unsubmitted" completion path because there is no submitted-list entry to remove.

Locked ranges are normal bdev-core serialization, not backend-specific hardware locks. NVMe passthrough commands are treated conservatively elsewhere because bdev core cannot decode arbitrary commands well enough to prove a range is safe.
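
The heart of a locked-range check is an interval overlap test. The following is a minimal sketch of that test only, using a hypothetical toy_lba_range type; the real bdev_io_range_is_locked() applies further checks that this sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an LBA range: [offset, offset + length). */
struct toy_lba_range {
	uint64_t offset;	/* first block */
	uint64_t length;	/* number of blocks, must be non-zero */
};

/* Two half-open block ranges overlap when each starts before the other ends. */
static bool
toy_ranges_overlap(const struct toy_lba_range *a, const struct toy_lba_range *b)
{
	assert(a->length > 0 && b->length > 0);

	return a->offset < b->offset + b->length &&
	       b->offset < a->offset + a->length;
}
```

An I/O that ends exactly where a lock begins, or begins exactly where it ends, does not overlap, so it would not be parked on io_locked by this test.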

Step 4: Splitting

Splitting exists because a user request shape is not always a legal backend request shape. A bdev can expose limits such as maximum read/write size, maximum segment count, maximum segment size, optimal I/O boundary, write unit size, maximum unmap size, maximum write-zeroes size, and maximum copy size. Rather than forcing every module to reimplement common split policy, bdev core can turn one parent into multiple child I/Os before the module sees them.

The split decision is type-specific:

/* lib/bdev/bdev.c */
static bool
bdev_rw_should_split(struct spdk_bdev_io *bdev_io)
{
	uint32_t io_boundary;
	struct spdk_bdev *bdev = bdev_io->bdev;
	uint32_t max_segment_size = bdev->max_segment_size;
	uint32_t max_size = bdev->max_rw_size;
	int max_segs = bdev->max_num_segments;

	io_boundary = bdev_rw_get_io_boundary(bdev, bdev_io->type);

	if (spdk_likely(!io_boundary && !max_segs && !max_segment_size && !max_size)) {
		return false;
	}
	...
	if (max_size) {
		if (bdev_io->u.bdev.num_blocks > max_size) {
			return true;
		}
	}

	return false;
}

And the dispatcher keeps the rule local to each I/O type:

/* lib/bdev/bdev.c */
static bool
bdev_io_should_split(struct spdk_bdev_io *bdev_io)
{
	switch (bdev_io->type) {
	case SPDK_BDEV_IO_TYPE_READ:
	case SPDK_BDEV_IO_TYPE_WRITE:
		return bdev_rw_should_split(bdev_io);
	case SPDK_BDEV_IO_TYPE_UNMAP:
		return bdev_unmap_should_split(bdev_io);
	case SPDK_BDEV_IO_TYPE_WRITE_ZEROES:
		return bdev_write_zeroes_should_split(bdev_io);
	case SPDK_BDEV_IO_TYPE_COPY:
		return bdev_copy_should_split(bdev_io);
	default:
		return false;
	}
}

The parent I/O remains the caller-visible I/O. It stores split progress: current offset, remaining blocks, outstanding child count, and final status. For read and write, _bdev_rw_split() computes a child window that obeys boundaries and limits, then submits the child through the normal bdev API helpers using bdev_io_split_done as the child callback.

/* lib/bdev/bdev.c */
static int
bdev_io_split_submit(struct spdk_bdev_io *bdev_io, struct iovec *iov, int iovcnt, void *md_buf,
		     uint64_t num_blocks, uint64_t *offset, uint64_t *remaining)
{
	int rc;
	uint64_t current_offset, current_remaining, current_src_offset;
	spdk_bdev_io_wait_cb io_wait_fn;

	current_offset = *offset;
	current_remaining = *remaining;

	assert(bdev_io->internal.f.split);

	bdev_io->internal.split.outstanding++;

	io_wait_fn = _bdev_rw_split;
	switch (bdev_io->type) {
	case SPDK_BDEV_IO_TYPE_READ:
		rc = bdev_readv_blocks_with_md(..., current_offset, num_blocks,
						bdev_io_split_done, bdev_io);
		break;
	case SPDK_BDEV_IO_TYPE_WRITE:
		rc = bdev_writev_blocks_with_md(..., current_offset, num_blocks,
						 bdev_io_split_done, bdev_io);
		break;
	...
	}

The excerpt above uses ... only to hide long argument plumbing; the important part is that each child is a real bdev I/O with the parent as callback argument. If a child allocation hits public -ENOMEM, the split code can use I/O wait and continue later. That is why splitting interacts with the same allocation fairness rules as ordinary submissions.
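
To make the child-window idea concrete, here is a small sketch that considers only two of the limits named earlier: an I/O boundary and a maximum size, both in blocks. It is an illustration under those assumptions, not the logic of _bdev_rw_split(), which also has to respect segment count and segment size. The toy_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Blocks the next child may cover when it starts at `offset`.
 * A boundary or max_size of 0 means "no limit" in this sketch.
 */
static uint64_t
toy_child_blocks(uint64_t offset, uint64_t remaining, uint32_t boundary, uint32_t max_size)
{
	uint64_t n = remaining;

	if (boundary != 0) {
		uint64_t to_boundary = boundary - (offset % boundary);

		if (to_boundary < n) {
			n = to_boundary;
		}
	}
	if (max_size != 0 && max_size < n) {
		n = max_size;
	}
	return n;
}

/* Walks the parent the way a split loop would and counts the children. */
static int
toy_count_children(uint64_t offset, uint64_t num_blocks, uint32_t boundary, uint32_t max_size)
{
	int children = 0;

	while (num_blocks > 0) {
		uint64_t n = toy_child_blocks(offset, num_blocks, boundary, max_size);

		assert(n > 0 && n <= num_blocks);
		offset += n;
		num_blocks -= n;
		children++;
	}
	return children;
}
```

For example, a 20-block parent starting at block 6 with an 8-block boundary yields children of 2, 8, 8, and 2 blocks. Adding a 4-block maximum size turns that into six children.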

The child callback frees each child and either advances the parent or completes it:

/* lib/bdev/bdev.c */
static void
bdev_io_split_done(struct spdk_bdev_io *bdev_io, bool success, void *cb_arg)
{
	struct spdk_bdev_io *parent_io = cb_arg;
	bool use_accel_sequence;
	void *caller_ctx;

	assert(parent_io->internal.f.split);

	if (!success) {
		parent_io->internal.status = bdev_io->internal.status;
		parent_io->internal.error = bdev_io->internal.error;
		/* If any child I/O failed, stop further splitting process. */
		parent_io->internal.split.current_offset_blocks += parent_io->internal.split.remaining_num_blocks;
		parent_io->internal.split.remaining_num_blocks = 0;
	}

	use_accel_sequence = bdev_io_use_accel_sequence(bdev_io);
	caller_ctx = bdev_io->internal.caller_ctx;
	spdk_bdev_free_io(bdev_io);

	parent_io->internal.split.outstanding--;
	if (parent_io->internal.split.outstanding != 0) {
		return;
	}

The user callback runs only for the parent. That is why split code copies child error status into the parent and frees children internally. A module's submit_request() is therefore not guaranteed to receive the exact request shape submitted by the original caller.
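
The parent-side bookkeeping can be sketched as a small state machine. This is a toy model with hypothetical names; in particular, the way it promotes a pending parent to success at the end is a simplification chosen for the example, not a claim about how bdev core tracks parent status.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum toy_status { TOY_PENDING, TOY_SUCCESS, TOY_FAILED };

struct toy_parent {
	uint64_t remaining_blocks;	/* blocks not yet handed to a child */
	uint32_t outstanding;		/* children submitted but not completed */
	enum toy_status status;
};

/*
 * Called once per finished child. Returns true when the parent has no
 * children in flight and nothing left to split, so it can be completed.
 */
static bool
toy_split_done(struct toy_parent *parent, bool child_success)
{
	assert(parent->outstanding > 0);

	if (!child_success) {
		parent->status = TOY_FAILED;
		parent->remaining_blocks = 0;	/* stop creating new children */
	}

	parent->outstanding--;
	if (parent->outstanding != 0) {
		return false;	/* siblings are still in flight */
	}
	if (parent->remaining_blocks != 0) {
		return false;	/* caller should create more children */
	}
	if (parent->status == TOY_PENDING) {
		parent->status = TOY_SUCCESS;
	}
	return true;
}
```

The point to notice is that a failed child does not complete the parent immediately. The parent still waits for every child that was already submitted.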

Step 5: Reset And QoS Gates

After locked ranges and splitting, _bdev_io_submit() handles per-channel gates. This function is deliberately small because it is a hot path.

/* lib/bdev/bdev.c */
static inline void
_bdev_io_submit(struct spdk_bdev_io *bdev_io)
{
	struct spdk_bdev *bdev = bdev_io->bdev;
	struct spdk_bdev_channel *bdev_ch = bdev_io->internal.ch;

	if (spdk_likely(bdev_ch->flags == 0)) {
		bdev_io_do_submit(bdev_ch, bdev_io);
		return;
	}

	if (bdev_ch->flags & BDEV_CH_RESET_IN_PROGRESS) {
		_bdev_io_complete_in_submit(bdev_ch, bdev_io, SPDK_BDEV_IO_STATUS_ABORTED);
	} else if (bdev_ch->flags & BDEV_CH_QOS_ENABLED) {
		if (spdk_unlikely(bdev_io->type == SPDK_BDEV_IO_TYPE_ABORT) &&
		    bdev_abort_queued_io(&bdev_ch->qos_queued_io, bdev_io->u.abort.bio_to_abort)) {
			_bdev_io_complete_in_submit(bdev_ch, bdev_io, SPDK_BDEV_IO_STATUS_SUCCESS);
		} else {
			TAILQ_INSERT_TAIL(&bdev_ch->qos_queued_io, bdev_io, internal.link);
			bdev_qos_io_submit(bdev_ch, bdev->internal.qos);
		}
	} else {
		SPDK_ERRLOG("unknown bdev_ch flag %x found\n", bdev_ch->flags);
		_bdev_io_complete_in_submit(bdev_ch, bdev_io, SPDK_BDEV_IO_STATUS_FAILED);
	}
}

Reset in progress means new non-reset work should not start. The bdev reset path freezes channels, aborts queued work such as NOMEM and QoS entries, and then either submits reset to the module or waits for outstanding I/O to drain depending on reset configuration. From the point of view of a new write, the important rule is simple: when BDEV_CH_RESET_IN_PROGRESS is set, the write completes as aborted through bdev core.

QoS is a delay gate, not a new backend. The I/O stays on the bdev channel's qos_queued_io list until the configured rate-limit accounting says it can proceed. When it can proceed, it continues into bdev_io_do_submit() exactly like a non-QoS I/O.

/* lib/bdev/bdev.c */
static bool
bdev_qos_queue_io(struct spdk_bdev_qos *qos, struct spdk_bdev_io *bdev_io)
{
	int i;

	if (bdev_qos_io_to_limit(bdev_io) == true) {
		for (i = 0; i < SPDK_BDEV_QOS_NUM_RATE_LIMIT_TYPES; i++) {
			if (!qos->rate_limits[i].queue_io) {
				continue;
			}

			if (qos->rate_limits[i].queue_io(&qos->rate_limits[i],
							 bdev_io) == true) {
				for (i -= 1; i >= 0 ; i--) {
					if (!qos->rate_limits[i].queue_io) {
						continue;
					}
					qos->rate_limits[i].rewind_quota(&qos->rate_limits[i], bdev_io);
				}
				return true;
			}
		}
	}

	return false;
}

bdev_qos_queue_io() returns true when the I/O must remain queued. The submit helper removes only I/Os that pass all relevant limits:

/* lib/bdev/bdev.c */
static int
bdev_qos_io_submit(struct spdk_bdev_channel *ch, struct spdk_bdev_qos *qos)
{
	struct spdk_bdev_io		*bdev_io = NULL, *tmp = NULL;
	int				submitted_ios = 0;

	TAILQ_FOREACH_SAFE(bdev_io, &ch->qos_queued_io, internal.link, tmp) {
		if (!bdev_qos_queue_io(qos, bdev_io)) {
			TAILQ_REMOVE(&ch->qos_queued_io, bdev_io, internal.link);
			bdev_io_do_submit(ch, bdev_io);

			submitted_ios++;
		}
	}

	return submitted_ios;
}

The QoS poller refills quota by timeslice and tries queued I/O again. This is why a QoS-delayed I/O can complete later even though the backend was idle when the user submitted it.

/* lib/bdev/bdev.c */
static int
bdev_channel_poll_qos(void *arg)
{
	struct spdk_bdev *bdev = arg;
	struct spdk_bdev_qos *qos = bdev->internal.qos;
	uint64_t now = spdk_get_ticks();
	int i;
	int64_t remaining_last_timeslice;

	if (spdk_unlikely(qos->thread == NULL)) {
		return SPDK_POLLER_IDLE;
	}

	if (now < (qos->last_timeslice + qos->timeslice_size)) {
		return SPDK_POLLER_IDLE;
	}

	/* Reset for next round of rate limiting */
	for (i = 0; i < SPDK_BDEV_QOS_NUM_RATE_LIMIT_TYPES; i++) {
		remaining_last_timeslice = __atomic_exchange_n(&qos->rate_limits[i].remaining_this_timeslice,
					   0, __ATOMIC_RELAXED);
		...
	}

QoS can apply to normal reads and writes, selected passthrough paths, and zcopy start phases. The exact inclusion rule lives in bdev_qos_io_to_limit(), so when debugging a surprising QoS delay, read that function first.
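
The charge-then-rewind pattern in bdev_qos_queue_io() is easier to see with the callbacks stripped away. The sketch below is a toy with two hypothetical limits and non-atomic counters. Its refill rule assumes that unused quota is dropped and that an overrun is carried into the next timeslice; treat that as an assumption of the example and check bdev_channel_poll_qos() for the real behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NUM_LIMITS 2	/* e.g. one IOPS limit and one byte limit */

struct toy_limit {
	int64_t remaining;	/* quota left in the current timeslice */
	int64_t per_slice;	/* quota granted at each refill */
};

/*
 * Returns true when the I/O must stay queued. costs[i] is what the I/O
 * would consume from limits[i].
 */
static bool
toy_qos_queue_io(struct toy_limit *limits, const int64_t *costs)
{
	int i;

	for (i = 0; i < TOY_NUM_LIMITS; i++) {
		assert(costs[i] >= 0);
		if (limits[i].remaining <= 0) {
			/* Give back what the earlier limits already charged. */
			for (i -= 1; i >= 0; i--) {
				limits[i].remaining += costs[i];
			}
			return true;
		}
		limits[i].remaining -= costs[i];
	}
	return false;
}

/* Assumed refill rule: drop leftovers, carry an overrun forward. */
static void
toy_qos_refill(struct toy_limit *limits)
{
	int i;

	for (i = 0; i < TOY_NUM_LIMITS; i++) {
		if (limits[i].remaining > 0) {
			limits[i].remaining = 0;
		}
		limits[i].remaining += limits[i].per_slice;
	}
}
```

Without the rewind, an I/O blocked by the byte limit would still have consumed IOPS quota each time it was examined, and queued I/O would drain the first limit without ever being submitted.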

Step 6: The Module Call

bdev_io_do_submit() is the last generic gate before module code runs. It handles aborts of queued I/O, validates write-unit split expectations, respects an active NOMEM queue, increments outstanding counts, marks in_submit_request, and calls the module.

/* lib/bdev/bdev.c */
static inline void
bdev_io_do_submit(struct spdk_bdev_channel *bdev_ch, struct spdk_bdev_io *bdev_io)
{
	struct spdk_bdev *bdev = bdev_io->bdev;
	struct spdk_io_channel *ch = bdev_ch->channel;
	struct spdk_bdev_shared_resource *shared_resource = bdev_ch->shared_resource;
	...
	if (spdk_likely(TAILQ_EMPTY(&shared_resource->nomem_io))) {
		bdev_io_increment_outstanding(bdev_ch, shared_resource);
		bdev_io->internal.f.in_submit_request = true;
		bdev_submit_request(bdev, ch, bdev_io);
		bdev_io->internal.f.in_submit_request = false;
	} else {
		bdev_queue_nomem_io_tail(shared_resource, bdev_io, BDEV_IO_RETRY_STATE_SUBMIT);
		if (shared_resource->nomem_threshold == 0 && shared_resource->io_outstanding == 0) {
			bdev_shared_ch_retry_io(shared_resource);
		}
	}
}

The in_submit_request flag exists for completion safety. A module is allowed to complete an I/O before submit_request() returns. That is legal, but bdev core must avoid calling the user's completion callback recursively from inside the module call stack. The completion section shows how this flag is used.

bdev_submit_request() is the dispatch point:

/* lib/bdev/bdev.c */
static inline void
bdev_submit_request(struct spdk_bdev *bdev, struct spdk_io_channel *ioch,
		    struct spdk_bdev_io *bdev_io)
{
	/* After a request is submitted to a bdev module, the ownership of an accel sequence
	 * associated with that bdev_io is transferred to the bdev module. So, clear the internal
	 * sequence pointer to make sure we won't touch it anymore. */
	if ((bdev_io->type == SPDK_BDEV_IO_TYPE_WRITE ||
	     bdev_io->type == SPDK_BDEV_IO_TYPE_READ) && bdev_io->u.bdev.accel_sequence != NULL) {
		assert(!bdev_io_needs_sequence_exec(bdev_io));
		bdev_io->internal.f.has_accel_sequence = false;
	}

	assert((bdev_io->type != SPDK_BDEV_IO_TYPE_WRITE &&
		bdev_io->type != SPDK_BDEV_IO_TYPE_READ) ||
	       ((bdev_io->u.bdev.dif_check_flags & bdev->dif_check_flags) ==
		bdev_io->u.bdev.dif_check_flags));

	bdev->fn_table->submit_request(ioch, bdev_io);
}

The module interface is intentionally narrow:

/* include/spdk/bdev_module.h */
struct spdk_bdev_fn_table {
	/** Destroy the backend block device object. */
	int (*destruct)(void *ctx);

	/** Process the IO. */
	void (*submit_request)(struct spdk_io_channel *ch, struct spdk_bdev_io *);

	/** Check if the block device supports a specific I/O type. */
	bool (*io_type_supported)(void *ctx, enum spdk_bdev_io_type);

The module gets an SPDK I/O channel for the thread where bdev core is submitting. It may complete immediately, queue work to a poller, forward to a lower bdev, or submit to hardware. It must complete each I/O exactly once through the bdev completion APIs.

Step 7: Completion

The public module completion entry is spdk_bdev_io_complete(). A module that wants to report NVMe, SCSI, AIO, or base-I/O status can use typed wrappers, but they all end by setting bdev status and entering this core path.

/* lib/bdev/bdev.c */
void
spdk_bdev_io_complete(struct spdk_bdev_io *bdev_io, enum spdk_bdev_io_status status)
{
	struct spdk_bdev *bdev = bdev_io->bdev;
	struct spdk_bdev_channel *bdev_ch = bdev_io->internal.ch;
	struct spdk_bdev_shared_resource *shared_resource = bdev_ch->shared_resource;

	if (spdk_unlikely(bdev_io->internal.status != SPDK_BDEV_IO_STATUS_PENDING)) {
		SPDK_ERRLOG("Unexpected completion on IO from %s module, status was %s\n",
			    spdk_bdev_get_module_name(bdev),
			    bdev_io_status_get_string(bdev_io->internal.status));
		assert(false);
	}
	bdev_io->internal.status = status;

	if (spdk_unlikely(bdev_io->type == SPDK_BDEV_IO_TYPE_RESET)) {
		assert(bdev_io == bdev->internal.reset_in_progress);
		spdk_bdev_for_each_channel(bdev, bdev_unfreeze_channel, bdev_io,
					   bdev_reset_complete);
		return;
	} else {
		bdev_io_decrement_outstanding(bdev_ch, shared_resource);
		...
		if (spdk_unlikely(_bdev_io_handle_no_mem(bdev_io, BDEV_IO_RETRY_STATE_SUBMIT))) {
			return;
		}
	}

	bdev_io_complete(bdev_io);
}

This function changes ownership state again: outstanding counts drop, reset completion takes a special path, successful I/O may need post-processing, and NOMEM may turn completion into retry instead of final callback.

The final callback handoff is split into two functions. _bdev_io_complete() invokes the user callback on the original SPDK thread. bdev_io_complete() runs before it: if the module completed synchronously inside submit_request(), it re-queues itself as a thread message and returns; otherwise it does the completion accounting and then calls _bdev_io_complete().

/* lib/bdev/bdev.c */
static inline void
_bdev_io_complete(void *ctx)
{
	struct spdk_bdev_io *bdev_io = ctx;

	if (spdk_unlikely(bdev_io_use_accel_sequence(bdev_io))) {
		assert(bdev_io->internal.status != SPDK_BDEV_IO_STATUS_SUCCESS);
		spdk_accel_sequence_abort(bdev_io->internal.accel_sequence);
	}

	assert(bdev_io->internal.cb != NULL);
	assert(spdk_get_thread() == spdk_bdev_io_get_thread(bdev_io));

	bdev_io->internal.cb(bdev_io, bdev_io->internal.status == SPDK_BDEV_IO_STATUS_SUCCESS,
			     bdev_io->internal.caller_ctx);
}

/* lib/bdev/bdev.c */
static inline void
bdev_io_complete(void *ctx)
{
	struct spdk_bdev_io *bdev_io = ctx;
	struct spdk_bdev_channel *bdev_ch = bdev_io->internal.ch;
	uint64_t tsc, tsc_diff;

	if (spdk_unlikely(bdev_io->internal.f.in_submit_request)) {
		/*
		 * Defer completion to avoid potential infinite recursion if the
		 * user's completion callback issues a new I/O.
		 */
		spdk_thread_send_msg(spdk_bdev_io_get_thread(bdev_io),
				     bdev_io_complete, bdev_io);
		return;
	}

	tsc = spdk_get_ticks();
	tsc_diff = tsc - bdev_io->internal.submit_tsc;

	bdev_ch_remove_from_io_submitted(bdev_io);
	spdk_trace_record_tsc(tsc, TRACE_BDEV_IO_DONE, bdev_ch->trace_id, 0, (uintptr_t)bdev_io,
			      bdev_io->internal.caller_ctx, bdev_ch->queue_depth);
	...
	bdev_io_update_io_stat(bdev_io, tsc_diff);
	_bdev_io_complete(bdev_io);
}

The deferral rule is one of the easiest bdev details to miss. If a null-like module completed every I/O inline and bdev core called the user's callback inline, a completion callback that submits another I/O could recursively call back into the same module indefinitely. Sending a message to the I/O's thread breaks that recursion while preserving the thread-affinity guarantee.
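
The effect of the deferral can be demonstrated with a toy event loop. Everything here is hypothetical scaffolding: a one-slot pointer stands in for spdk_thread_send_msg(), the "module" completes inline, and the callback reuses the same I/O object instead of freeing it and allocating a new one. The sketch only tracks how deeply user callbacks nest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_io {
	bool in_submit_request;
	int completions;
};

static struct toy_io *g_msg;		/* one-slot stand-in for a thread message queue */
static int g_depth, g_max_depth;	/* nesting depth of user callbacks */
static int g_to_submit;			/* follow-up I/Os the callback still wants to issue */

static void toy_submit(struct toy_io *io);

/* User callback that chains another I/O, the pattern that risks recursion. */
static void
toy_user_cb(struct toy_io *io)
{
	g_depth++;
	if (g_depth > g_max_depth) {
		g_max_depth = g_depth;
	}
	io->completions++;
	if (g_to_submit > 0) {
		g_to_submit--;
		toy_submit(io);
	}
	g_depth--;
}

/* Completion path: defer when called from inside the module's submit. */
static void
toy_complete(struct toy_io *io)
{
	if (io->in_submit_request) {
		g_msg = io;	/* run later from the message loop */
		return;
	}
	toy_user_cb(io);
}

/* Submit path with a module that completes before returning. */
static void
toy_submit(struct toy_io *io)
{
	io->in_submit_request = true;
	toy_complete(io);
	io->in_submit_request = false;
}

/* Stand-in for the thread's message loop. */
static void
toy_run_messages(void)
{
	while (g_msg != NULL) {
		struct toy_io *io = g_msg;

		g_msg = NULL;
		toy_complete(io);
	}
}
```

However many follow-up I/Os the callback chains, the callback nesting depth in this model never exceeds one, because each inline completion is bounced through the message loop instead of running on the submit call stack.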

NOMEM Retry

SPDK_BDEV_IO_STATUS_NOMEM is a retry request, not a final failure. The enum documents the intended module use:

/* include/spdk/bdev_module.h */
/*
 * NOMEM should be returned when a bdev module cannot start an I/O because of
 *  some lack of resources.  It may not be returned for RESET I/O.  I/O completed
 *  with NOMEM status will be retried after some I/O from the same channel have
 *  completed.
 */
SPDK_BDEV_IO_STATUS_NOMEM = -4,

When completion sees NOMEM, bdev core resets the I/O status to pending and queues it on the shared resource, usually at the head so that later I/Os do not jump ahead of an earlier one that hit NOMEM.

/* lib/bdev/bdev.c */
static inline bool
_bdev_io_handle_no_mem(struct spdk_bdev_io *bdev_io, enum bdev_io_retry_state state)
{
	struct spdk_bdev_channel *bdev_ch = bdev_io->internal.ch;
	struct spdk_bdev_shared_resource *shared_resource = bdev_ch->shared_resource;

	if (spdk_unlikely(bdev_io->internal.status == SPDK_BDEV_IO_STATUS_NOMEM)) {
		bdev_io->internal.status = SPDK_BDEV_IO_STATUS_PENDING;
		bdev_queue_nomem_io_head(shared_resource, bdev_io, state);

		if (shared_resource->io_outstanding == 0 && !shared_resource->nomem_poller) {
			shared_resource->nomem_poller = SPDK_POLLER_REGISTER(bdev_no_mem_poller, shared_resource,
							10 * SPDK_MSEC_TO_USEC);
		}
		...
		return true;
	}

	if (spdk_unlikely(!TAILQ_EMPTY(&shared_resource->nomem_io))) {
		bdev_ch_retry_io(bdev_ch);
	}

	return false;
}

The queue helper computes a retry threshold from the number of outstanding I/Os. This is not arbitrary. Some drivers cannot accept a replacement request while still unwinding completion for the request that just freed resources.

/* lib/bdev/bdev.c */
static inline void
bdev_queue_nomem_io_head(struct spdk_bdev_shared_resource *shared_resource,
			 struct spdk_bdev_io *bdev_io, enum bdev_io_retry_state state)
{
	/* Wait for some of the outstanding I/O to complete before we retry any of the nomem_io.
	 * Normally we will wait for NOMEM_THRESHOLD_COUNT I/O to complete but for low queue depth
	 * channels we will instead wait for half to complete.
	 */
	shared_resource->nomem_threshold = spdk_max((int64_t)shared_resource->io_outstanding / 2,
					   (int64_t)shared_resource->io_outstanding - NOMEM_THRESHOLD_COUNT);

	assert(state != BDEV_IO_RETRY_STATE_INVALID);
	bdev_io->internal.retry_state = state;
	TAILQ_INSERT_HEAD(&shared_resource->nomem_io, bdev_io, internal.link);
}

Retry runs from the shared resource queue. For ordinary submit retry, it increments outstanding again, clears stale status details, increments num_retries, and calls the same module dispatch point.

/* lib/bdev/bdev.c */
static inline void
bdev_ch_resubmit_io(struct spdk_bdev_shared_resource *shared_resource, struct spdk_bdev_io *bdev_io)
{
	struct spdk_bdev *bdev = bdev_io->bdev;

	bdev_io_increment_outstanding(bdev_io->internal.ch, shared_resource);
	bdev_io->internal.error.nvme.cdw0 = 0;
	bdev_io->num_retries++;
	bdev_submit_request(bdev, spdk_bdev_io_get_io_channel(bdev_io), bdev_io);
}

If there are no outstanding I/Os left to trigger retry, bdev_no_mem_poller() prevents a permanent stall at low queue depth:

/* lib/bdev/bdev.c */
static int
bdev_no_mem_poller(void *ctx)
{
	struct spdk_bdev_shared_resource *shared_resource = ctx;

	if (!TAILQ_EMPTY(&shared_resource->nomem_io)) {
		bdev_shared_ch_retry_io(shared_resource);
	}

	/* Keep poller registered if list is not empty and there are no io outstanding. */
	if (!TAILQ_EMPTY(&shared_resource->nomem_io) && shared_resource->io_outstanding == 0) {
		return SPDK_POLLER_BUSY;
	}

	spdk_poller_unregister(&shared_resource->nomem_poller);
	return SPDK_POLLER_IDLE;
}

NOMEM retry is ordered and cooperative. It tries to preserve fairness, avoid retry storms, and prevent an I/O from getting stuck merely because the application is running at queue depth one.
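
The threshold arithmetic from bdev_queue_nomem_io_head() can be checked in isolation. In this sketch the constant value 8 is an assumption for illustration (look up NOMEM_THRESHOLD_COUNT in lib/bdev/bdev.c for the real one), and toy_may_retry() assumes that retry is allowed once the outstanding count has dropped to the threshold or below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NOMEM_THRESHOLD_COUNT 8	/* assumed value, for illustration only */

/* Same shape as the spdk_max() expression in the excerpt above. */
static int64_t
toy_nomem_threshold(uint64_t io_outstanding)
{
	int64_t half = (int64_t)io_outstanding / 2;
	int64_t minus_count = (int64_t)io_outstanding - TOY_NOMEM_THRESHOLD_COUNT;

	assert(half >= 0);
	return half > minus_count ? half : minus_count;
}

/* Assumed retry gate: wait until enough outstanding I/O has completed. */
static bool
toy_may_retry(uint64_t io_outstanding, int64_t threshold)
{
	return (int64_t)io_outstanding <= threshold;
}
```

With these assumptions, 64 outstanding I/Os give a threshold of 56, so retry waits for 8 completions. With only 4 outstanding, the threshold is 2, so retry waits for half of them. With 1 or 0 outstanding the threshold is 0, which is the case the NOMEM poller exists to cover.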

Completion In A Simple Module

The null bdev is a compact backend to read after the core path. It demonstrates the module contract without NVMe hardware detail. The module handles supported I/O types and calls spdk_bdev_io_complete(); it never calls the user's bdev callback.

/* module/bdev/null/bdev_null.c */
switch (bdev_io->type) {
case SPDK_BDEV_IO_TYPE_WRITE_ZEROES:
case SPDK_BDEV_IO_TYPE_RESET:
	TAILQ_INSERT_TAIL(&ch->io, null_io, link);
	break;
case SPDK_BDEV_IO_TYPE_ABORT:
	if (bdev_null_abort_io(ch, bdev_io->u.abort.bio_to_abort)) {
		spdk_bdev_io_complete(bdev_io, SPDK_BDEV_IO_STATUS_SUCCESS);
	} else {
		spdk_bdev_io_complete(bdev_io, SPDK_BDEV_IO_STATUS_FAILED);
	}
	break;
case SPDK_BDEV_IO_TYPE_FLUSH:
case SPDK_BDEV_IO_TYPE_UNMAP:
default:
	spdk_bdev_io_complete(bdev_io, SPDK_BDEV_IO_STATUS_FAILED);
	break;
}

For read, write, write zeroes, and reset, null queues the I/O on its channel list so a poller can complete it later (the excerpt above omits the read and write cases). That makes it a useful teaching backend: it shows that a module can finish later even when the operation has no real device latency.
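
The queue-then-poll shape can be sketched in a few lines. This is not the null module's code: the fixed-size array, the toy_ names, and the completed counter are hypothetical stand-ins, and the counter increment marks the spot where a real module would call spdk_bdev_io_complete().

```c
#include <assert.h>
#include <stddef.h>

#define TOY_QUEUE_DEPTH 8

struct toy_io {
	int completed;	/* how many times this I/O has been completed */
};

struct toy_channel {
	struct toy_io *pending[TOY_QUEUE_DEPTH];
	size_t count;
};

/* submit_request stand-in: only queues, never completes inline. */
static int
toy_submit_request(struct toy_channel *ch, struct toy_io *io)
{
	if (ch->count == TOY_QUEUE_DEPTH) {
		/* A real module could complete with SPDK_BDEV_IO_STATUS_NOMEM here. */
		return -1;
	}
	ch->pending[ch->count++] = io;
	return 0;
}

/* Poller stand-in: completes everything queued so far, exactly once each. */
static int
toy_poll(struct toy_channel *ch)
{
	size_t i, n = ch->count;

	ch->count = 0;
	for (i = 0; i < n; i++) {
		assert(ch->pending[i]->completed == 0);
		ch->pending[i]->completed++;
	}
	return (int)n;
}
```

Nothing completes until the poller runs, and a second poll with an empty queue does no work. Because completion never happens inside the submit call, this shape does not trigger the synchronous-completion deferral in bdev core.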

Edge Cases And Failure Modes

Public API returns -ENOMEM: no spdk_bdev_io was allocated, no I/O was submitted, and no completion callback will happen for that attempt. Register an I/O wait entry immediately if the upper layer wants to resubmit when memory is available.

Module completes with SPDK_BDEV_IO_STATUS_NOMEM: the module received the I/O but could not start it. bdev core resets the status to pending, queues the same I/O on shared_resource->nomem_io, and retries later. The user callback does not run unless retry eventually completes with a final status.

Module completes synchronously: bdev_io_complete() sees in_submit_request and sends a message to the I/O's thread. The user's callback still runs on the correct thread, but not on the module's current call stack.

I/O overlaps a locked range: bdev_io_submit() puts the parent I/O on ch->io_locked. Split children are not rechecked because the parent already passed or waited at the locked-range gate.

Reset is in progress: _bdev_io_submit() completes new non-reset I/O as aborted. Reset handling also aborts queued work and serializes module reset with outstanding I/O.

QoS is enabled and quota is exhausted: the I/O waits on ch->qos_queued_io until the QoS poller refills quota and bdev_qos_io_submit() removes it from the queue.
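
A toy version of that gate is shown below. The names (qos_state, qos_submit, qos_drain, qos_poll) are hypothetical, and the model counts whole I/Os per timeslice only; it is a sketch of the queue-and-refill idea, not a reproduction of how the bdev core accounts for rate limits.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of QoS gating; names are hypothetical, not SPDK's. */
struct qos_io {
	int submitted;                /* 1 once passed to the module analogue */
	struct qos_io *next;
};

struct qos_state {
	int remaining;                /* I/Os still allowed in this timeslice */
	int per_slice;                /* quota restored at each refill */
	struct qos_io *head;          /* FIFO of delayed I/O */
	struct qos_io *tail;
};

/* Submit everything the remaining quota allows, oldest first. */
static int qos_drain(struct qos_state *q)
{
	int sent = 0;

	while (q->head != NULL && q->remaining > 0) {
		struct qos_io *io = q->head;

		q->head = io->next;
		if (q->head == NULL) {
			q->tail = NULL;
		}
		q->remaining--;
		io->submitted = 1;
		sent++;
	}
	return sent;
}

/* New I/O always joins the queue first, which preserves ordering. */
static void qos_submit(struct qos_state *q, struct qos_io *io)
{
	io->submitted = 0;
	io->next = NULL;
	if (q->tail != NULL) {
		q->tail->next = io;
	} else {
		q->head = io;
	}
	q->tail = io;
	qos_drain(q);
}

/* Poller analogue: start a new timeslice and release delayed I/O. */
static int qos_poll(struct qos_state *q)
{
	assert(q->per_slice > 0);
	q->remaining = q->per_slice;
	return qos_drain(q);
}
```

With a quota of one I/O per slice and three submissions, the first I/O goes through immediately and each later qos_poll() call releases exactly one more, in arrival order.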

Split child fails: bdev_io_split_done() copies the child status and error into the parent, stops further split progress, frees the child, and completes the parent after outstanding children drain.
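
The bookkeeping behind that rule can be sketched as follows. This is a toy model: sp_parent, sp_submit_child, and sp_child_done are hypothetical names, and a boolean replaces the real status and error fields. It shows the two properties that matter: failure is sticky, and the parent completes only after every outstanding child has drained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of split bookkeeping; names are hypothetical, not SPDK's. */
struct sp_parent {
	uint64_t remaining_blocks;    /* blocks not yet handed to a child */
	uint32_t outstanding;         /* children submitted but not completed */
	bool failed;                  /* sticky: set by the first failing child */
	bool completed;               /* parent completion has been delivered */
};

/* Hand out one child of up to max_blocks; returns the child's size. */
static uint64_t sp_submit_child(struct sp_parent *p, uint64_t max_blocks)
{
	uint64_t n = p->remaining_blocks < max_blocks ? p->remaining_blocks : max_blocks;

	if (n == 0) {
		return 0;             /* nothing left, or splitting was stopped */
	}
	p->remaining_blocks -= n;
	p->outstanding++;
	return n;
}

/* Child completion: record failure, stop splitting, finish when drained. */
static void sp_child_done(struct sp_parent *p, bool success)
{
	assert(p->outstanding > 0);
	p->outstanding--;

	if (!success) {
		p->failed = true;
		p->remaining_blocks = 0;   /* no further children will be created */
	}
	if (p->outstanding == 0 && p->remaining_blocks == 0) {
		p->completed = true;       /* parent completes once; failed holds the result */
	}
}
```

If two children are in flight and the first one fails, the parent is marked failed but not completed. No new child can be created, and the parent completes as failed when the second child returns, even if that child succeeded.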

Metadata or memory-domain setup fails before bdev_io_submit(): bdev core uses bdev_io_complete_unsubmitted() because the I/O was never added to io_submitted.

User forgets spdk_bdev_free_io() after callback: I/O objects leak from the pool or per-thread cache, and later submissions can fail with -ENOMEM at the public API.

Misconceptions To Kill

"Every failed I/O reaches hardware." No. Permissions, range, metadata, reset, locked ranges, QoS, and allocation can all stop or delay an I/O before backend hardware sees it.
"NOMEM is always final failure." No. Public -ENOMEM is a failed submission attempt; SPDK_BDEV_IO_STATUS_NOMEM is a module-level retry request.
"Splitting is a module problem." Often no. bdev core handles common split rules before module dispatch.
"Completion callback always runs inline." No. It may run later because of a poller, QoS, NOMEM retry, split children, bounce-buffer work, accel sequence work, or synchronous-completion deferral.
"Reset just submits a reset command." No. bdev reset freezes channels, aborts queued work, serializes reset, and may wait for outstanding I/O.
"The module owns user callback routing." No. The module completes a bdev I/O. bdev core routes status to the user callback and maintains statistics.

Source Reading Exercise

Trace a write in this order:

lib/bdev/bdev.c:spdk_bdev_writev_blocks().
lib/bdev/bdev.c:bdev_writev_blocks_with_md().
lib/bdev/bdev.c:bdev_io_init().
lib/bdev/bdev.c:_bdev_io_submit_ext().
lib/bdev/bdev.c:bdev_io_submit().
lib/bdev/bdev.c:bdev_io_should_split().
lib/bdev/bdev.c:_bdev_io_submit().
lib/bdev/bdev.c:bdev_io_do_submit().
lib/bdev/bdev.c:bdev_submit_request().
module/bdev/null/bdev_null.c:bdev_null_submit_request().
module/bdev/null/bdev_null.c:null_io_poll().
lib/bdev/bdev.c:spdk_bdev_io_complete().
lib/bdev/bdev.c:bdev_io_complete().
lib/bdev/bdev.c:_bdev_io_complete().
lib/bdev/bdev.c:spdk_bdev_free_io().

Questions to answer while reading:

Where is the I/O counted as submitted?
Where does module code first run?
Why does the bdev core store in_submit_request?
Where would a write be rejected for a read-only descriptor?
Where would a large write be split?
Which queue holds QoS-delayed I/O?
Which queue holds retryable NOMEM I/O?
Why can a child I/O bypass the locked-range check?

Operational Lab

Build a paper trace for this scenario:

A caller submits a 1 MiB write.
The bdev has max_rw_size set to the equivalent of 128 KiB (the field is expressed in logical blocks).
QoS is enabled and quota is exhausted.
The module completes the first child with NOMEM on first attempt, then success on retry.

Your trace should include:

Parent I/O allocation and bdev_io_init().
Parent entry into bdev_io_submit() and split decision.
First child allocation through bdev_writev_blocks_with_md().
Child queueing on qos_queued_io.
QoS poller refilling quota and calling bdev_qos_io_submit().
Child entry into bdev_io_do_submit() and module submit_request().
Module completion with SPDK_BDEV_IO_STATUS_NOMEM.
_bdev_io_handle_no_mem() queueing the child on shared_resource->nomem_io.
Retry through bdev_shared_ch_retry_io() and bdev_ch_resubmit_io().
Successful child completion and bdev_io_split_done().
More child submissions until parent remaining_num_blocks reaches zero.
Parent callback and spdk_bdev_free_io().
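
To check the child count in your paper trace, the arithmetic can be sketched with a small helper. It is hypothetical (lab_split is not an SPDK function) and it assumes a 512-byte block size, a write starting at block 0, and a split driven only by the size limit, with no boundary or segment constraints. Under those assumptions the 1 MiB write is 2048 blocks, the 128 KiB limit is 256 blocks, and the parent needs 8 children.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for the paper trace; not an SPDK function. */
struct lab_child {
	uint64_t offset_blocks;
	uint64_t num_blocks;
};

/*
 * Fill out[] with the children a size-only split would produce and return
 * the count. Returns 0 if max_child_blocks is 0 or out[] is too small.
 */
static uint32_t lab_split(uint64_t offset_blocks, uint64_t num_blocks,
			  uint64_t max_child_blocks,
			  struct lab_child *out, uint32_t out_cap)
{
	uint32_t count = 0;

	assert(out != NULL);
	if (max_child_blocks == 0) {
		return 0;
	}
	while (num_blocks > 0) {
		uint64_t n = num_blocks < max_child_blocks ? num_blocks : max_child_blocks;

		if (count == out_cap) {
			return 0;
		}
		out[count].offset_blocks = offset_blocks;
		out[count].num_blocks = n;
		count++;
		offset_blocks += n;
		num_blocks -= n;
	}
	return count;
}
```

Calling lab_split(0, 2048, 256, kids, 16) fills eight entries; the last child starts at block 1792 and covers 256 blocks. In the lab scenario, only the first of these children takes the NOMEM detour before succeeding.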

References

Local source: lib/bdev/bdev.c.
Local source: include/spdk/bdev.h.
Local source: include/spdk/bdev_module.h.
Local source: module/bdev/null/bdev_null.c.
Local source: module/bdev/passthru/vbdev_passthru.c.
Official SPDK Block Device Layer Programming Guide: https://spdk.io/doc/bdev_pg.html
Official SPDK Writing a Custom Block Device Module: https://spdk.io/doc/bdev_module.html
Official SPDK bdev API reference: https://spdk.io/doc/bdev_8h.html