SPDK From First Principles

SPDK deep learning path

Chapter 16: QoS, Reset, Remove, Hotplug, Events

Source: drafts/bdev-nvme/16-qos-reset-remove-hotplug-events.md

Reader Promise

By the end of this chapter you should be able to explain the non-happy-path machinery around bdev I/O: rate limits, reset, unregister, hotremove, media-management events, and quiesce. These are the paths that matter when production systems hang, drain, reconnect, delete, or rebalance.

The happy path is "submit I/O, complete I/O." Real systems spend a lot of engineering effort on what happens while someone is deleting a volume, resetting a controller, changing a namespace, limiting throughput, or handling a device disappearing.

QoS: Rate Limiting At The bdev Layer

SPDK bdev QoS is per-bdev rate limiting implemented in the bdev core. It can limit:

  • read/write IOPS.
  • combined read/write bytes per second.
  • read bytes per second.
  • write bytes per second.

Source anchors:

  • lib/bdev/bdev.c:spdk_bdev_set_qos_rate_limits().
  • lib/bdev/bdev.c:bdev_set_qos_rate_limits().
  • lib/bdev/bdev.c:bdev_qos_is_iops_rate_limit().
  • lib/bdev/bdev.c:bdev_qos_io_to_limit().
  • lib/bdev/bdev.c:bdev_qos_queue_io().
  • lib/bdev/bdev.c:bdev_qos_io_submit().
  • lib/bdev/bdev.c:bdev_channel_poll_qos().
  • lib/bdev/bdev.c:bdev_qos_update_max_quota_per_timeslice().
  • lib/bdev/bdev_rpc.c:rpc_bdev_set_qos_limit().

How QoS Works

spdk_bdev_set_qos_rate_limits() validates and normalizes the requested limits. Byte limits arrive in MiB/s and are converted to bytes/s. A limit that is not a multiple of the minimum granularity is rounded up, and the adjustment is logged. If no rate limit remains enabled after the update, QoS is disabled for the bdev.
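The normalization step can be modeled in a few lines. This is a toy Python sketch, not SPDK code: the name normalize_limits, the dictionary keys, and the two granularity constants (1000 I/Os per second and 1 MiB per second) are assumptions made for illustration, so check the real values in lib/bdev/bdev.c for your tree.

```python
MIB = 1024 * 1024
MIN_IOS_PER_SEC = 1000   # assumed IOPS granularity
MIN_BYTES_PER_SEC = MIB  # assumed byte-rate granularity


def round_up(value, multiple):
    """Round value up to the nearest multiple; aligned values are unchanged."""
    return -(-value // multiple) * multiple


def normalize_limits(rw_iops, rw_mibps, r_mibps, w_mibps):
    """Return (limits, qos_enabled). A limit of 0 means that limit is off."""
    limits = {
        "rw_ios_per_sec": round_up(rw_iops, MIN_IOS_PER_SEC),
        # Byte limits arrive in MiB/s and are stored in bytes/s.
        "rw_bytes_per_sec": round_up(rw_mibps * MIB, MIN_BYTES_PER_SEC),
        "r_bytes_per_sec": round_up(r_mibps * MIB, MIN_BYTES_PER_SEC),
        "w_bytes_per_sec": round_up(w_mibps * MIB, MIN_BYTES_PER_SEC),
    }
    # QoS stays enabled only while at least one limit is nonzero.
    qos_enabled = any(value > 0 for value in limits.values())
    return limits, qos_enabled
```

In this model a request for 1500 IOPS is stored as 2000, and passing zero for every limit turns QoS off.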

When QoS is enabled:

  • BDEV_CH_QOS_ENABLED is set on bdev channels.
  • _bdev_io_submit() queues incoming I/O on qos_queued_io.
  • bdev_qos_io_submit() walks queued I/O and submits those with available quota.
  • bdev_channel_poll_qos() periodically refills quota and tries again.

QoS does not apply to every I/O type. bdev_qos_io_to_limit() includes normal reads, writes, NVMe passthrough I/O types, and zcopy starts. Every other I/O type, including zcopy end, bypasses QoS accounting because it does not represent data transfer in the same way.
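The queue-and-refill behavior is easier to see in a small simulation. The sketch below is a simplified Python model with a single limit and a hypothetical class name (ToyQosChannel). It assumes an I/O is admitted whenever any quota remains, so the balance can go negative, and that a negative balance carries into the next timeslice while unused quota does not. The real code tracks several limits at once and uses atomic operations.

```python
from collections import deque


class ToyQosChannel:
    """One rate limit with a fixed quota per timeslice."""

    def __init__(self, quota_per_timeslice):
        self.quota = quota_per_timeslice
        self.remaining = quota_per_timeslice
        self.queued = deque()   # stands in for qos_queued_io
        self.submitted = []     # I/O handed on to the module

    def submit(self, cost):
        """Queue an I/O, then submit whatever the quota allows."""
        self.queued.append(cost)
        self._drain()

    def _drain(self):
        # Admit I/O while any quota is left. The last admitted I/O may
        # cost more than what remains, which drives the balance negative.
        while self.queued and self.remaining > 0:
            cost = self.queued.popleft()
            self.remaining -= cost
            self.submitted.append(cost)

    def poll(self):
        """Start a new timeslice, as the periodic QoS poller would."""
        if self.remaining > 0:
            self.remaining = 0          # unused quota is not banked
        self.remaining += self.quota    # an overrun shrinks the new quota
        self._drain()
```

With a quota of 4 and three I/Os of cost 3, the first two are admitted (leaving a balance of -2) and the third waits for the next poll.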

QoS Edge Cases

  • QoS change in progress: a second change is rejected with -EAGAIN, reported through the completion callback.
  • Limits too large: byte limits may be capped to avoid overflow.
  • Limits not aligned to minimum rate: bdev logs and rounds up.
  • Last descriptor closes: QoS poller shutdown is asynchronous and has special handling.
  • I/O larger than one timeslice quota: the code allows slight overrun and accounts for it in the next timeslice.
  • QoS plus reset: reset aborts QoS queued I/O on each channel.

Misconception to kill: QoS is not implemented in the NVMe module. It is bdev-layer policy, so it can apply to any bdev module that uses the core path.

Reset: Freeze, Drain, Submit, Unfreeze

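Reset looks like a simple I/O type at the API boundary: the caller passes a descriptor, a channel, and a completion callback, just as for a read or write.

Source anchor: lib/bdev/bdev.c:spdk_bdev_reset().

Internally it is a management operation that coordinates every channel of the bdev.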

Source anchors:

  • lib/bdev/bdev.c:bdev_start_reset().
  • lib/bdev/bdev.c:bdev_reset_freeze_channel().
  • lib/bdev/bdev.c:bdev_reset_freeze_channel_done().
  • lib/bdev/bdev.c:bdev_reset_check_outstanding_io().
  • lib/bdev/bdev.c:bdev_reset_check_outstanding_io_done().
  • lib/bdev/bdev.c:bdev_reset_poll_for_outstanding_io().
  • lib/bdev/bdev.c:bdev_reset_complete().
  • lib/bdev/bdev.c:_bdev_reset_complete().

Reset Sequence

  1. Public caller submits reset with descriptor and channel.
  2. bdev core allocates a reset spdk_bdev_io.
  3. bdev_start_reset() adds it to submitted I/O.
  4. The core takes a channel reference so the channel survives the reset.
  5. The core serializes against any existing reset using bdev->internal.reset_in_progress.
  6. The core freezes every channel by setting BDEV_CH_RESET_IN_PROGRESS.
  7. On each channel it aborts NOMEM, iobuf-waiting, and QoS queued I/O.
  8. If reset_io_drain_timeout == 0, the core submits the reset to the module immediately.
  9. Otherwise the core polls for outstanding I/O. If the I/O drains before the timeout, the reset completes without reaching the module; if not, the reset is submitted once the timeout expires.
  10. Module handles SPDK_BDEV_IO_TYPE_RESET in submit_request().
  11. Module completes the reset I/O.
  12. bdev core unfreezes channels and completes queued reset requests with the same status.
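The sequence above can be sketched as a toy state machine. This Python model is an illustration under simplifying assumptions, not SPDK code: the class ToyResetBdev and its methods are hypothetical, one list per channel stands in for the NOMEM, iobuf, and QoS queues, and everything runs synchronously on one thread.

```python
class ToyResetBdev:
    """Models freeze, abort, serialize, and unfreeze for a bdev reset."""

    def __init__(self, num_channels):
        self.channels = [{"frozen": False, "queued": []}
                         for _ in range(num_channels)]
        self.reset_in_progress = None  # the reset that currently owns the bdev
        self.queued_resets = []        # resets that arrived while one was active
        self.log = []                  # (name, status) completions, in order

    def submit_io(self, ch_idx, name):
        ch = self.channels[ch_idx]
        if ch["frozen"]:
            self.log.append((name, "aborted"))  # new I/O during reset
        else:
            ch["queued"].append(name)           # waits in a side queue

    def start_reset(self, name):
        if self.reset_in_progress is not None:
            self.queued_resets.append(name)     # serialized behind the first
            return
        self.reset_in_progress = name
        for ch in self.channels:
            ch["frozen"] = True
            for io in ch["queued"]:
                self.log.append((io, "aborted"))
            ch["queued"].clear()

    def complete_reset(self, status):
        for ch in self.channels:
            ch["frozen"] = False
        # Queued resets complete with the status of the one that ran.
        for name in [self.reset_in_progress] + self.queued_resets:
            self.log.append((name, status))
        self.queued_resets.clear()
        self.reset_in_progress = None
```

Submitting a second reset while the first is active never freezes the channels twice; it only adds one more completion with the same status.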

The drain behavior is controlled by struct spdk_bdev.reset_io_drain_timeout.

Source anchor: include/spdk/bdev_module.h:struct spdk_bdev.reset_io_drain_timeout.

The intended use is shared lower devices. For example, multiple lvol bdevs may share one underlying bdev. A nonzero drain timeout lets SPDK avoid sending a disruptive reset to the lower device if the outstanding I/O drains naturally.
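The drain decision can be summarized as a small pure function. This is a hedged Python sketch: the function name, the string results, and the assumption that the timeout is measured in seconds are illustrative choices, and the unabortable_work flag stands in for outstanding memory-domain or accel operations.

```python
def reset_decision(drain_timeout_s, elapsed_s, outstanding_io,
                   unabortable_work=False):
    """Decide what the core does with a reset at one polling step."""
    if drain_timeout_s == 0:
        # No drain window configured: go straight to the module.
        return "submit_reset"
    busy = outstanding_io > 0 or unabortable_work
    if not busy:
        # Everything drained, so the lower device is never disturbed.
        return "complete_without_module_reset"
    if elapsed_s < drain_timeout_s:
        return "poll_again"
    if unabortable_work:
        # Work that cannot be aborted safely is still running at timeout.
        return "fail_reset"
    return "submit_reset"
```

For a shared lower device, the useful outcome is the second one: the caller sees a successful reset even though no reset command was sent.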

Reset Edge Cases

  • Reset while reset already in progress: later reset I/O are queued and completed with the first reset status.
  • New I/O during reset: _bdev_io_submit() completes it as aborted.
  • Outstanding memory-domain or accel work: reset can fail after timeout because those operations cannot be aborted safely.
  • Module does not support reset: API or module path should fail cleanly.
  • Reset completes inline: core still handles unfreeze and completion routing.
  • Reset on virtual bdev: virtual module must decide whether to forward reset to base, fan out to multiple bases, or implement local semantics.

Misconception to kill: reset is not merely "send one command to hardware." bdev core coordinates all channels before the module sees the reset I/O.

Remove And Unregister

Remove has two related ideas:

  • A module or control plane unregisters a bdev intentionally.
  • A lower device disappears and upper users receive a remove event.

Source anchors:

  • include/spdk/bdev_module.h:spdk_bdev_unregister().
  • include/spdk/bdev_module.h:spdk_bdev_unregister_by_name().
  • lib/bdev/bdev.c:spdk_bdev_unregister().
  • lib/bdev/bdev.c:spdk_bdev_unregister_by_name().
  • lib/bdev/bdev.c:bdev_unregister().
  • lib/bdev/bdev.c:bdev_unregister_unsafe().
  • lib/bdev/bdev.c:spdk_bdev_close().

spdk_bdev_unregister():

  • Requires an SPDK thread.
  • Is expected to be called from the app thread.
  • Rejects duplicate unregister/remove with -EBUSY.
  • Sets bdev status to unregistering.
  • Records the unregister callback and thread.
  • Stops queue-depth sampling.
  • Iterates channels to abort queued work.
  • Notifies open descriptors of removal.
  • Defers final destruction until descriptors are closed when needed.

spdk_bdev_unregister_by_name() is safer for external callers because it opens the named bdev, verifies the module owner, calls unregister, and closes its temporary descriptor.

Descriptor Event Callback

Descriptors are opened with an event callback, passed to spdk_bdev_open_ext(). Virtual modules depend on this callback to learn that a base bdev is going away.

Example source anchors:

  • module/bdev/passthru/vbdev_passthru.c:vbdev_passthru_base_bdev_event_cb().
  • module/bdev/passthru/vbdev_passthru.c:vbdev_passthru_base_bdev_hotremove_cb().

When passthru receives SPDK_BDEV_EVENT_REMOVE for its base bdev, it unregisters the virtual bdev.

Misconception to kill: remove notification does not magically close every descriptor. Upper layers must stop using the bdev and close descriptors so destruction can finish.
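A toy model makes the deferred destruction concrete. The Python sketch below uses hypothetical names (ToyLifecycleBdev, open, close) and delivers the remove event synchronously, whereas SPDK delivers it on the descriptor's thread and reports unregister errors through a callback. The point is the ordering: destruction happens only after the last descriptor closes.

```python
import errno


class ToyLifecycleBdev:
    """Tracks open descriptors and defers destruction until they close."""

    def __init__(self):
        self.status = "registered"
        self.descs = []
        self.destroyed = False

    def open(self, event_cb):
        desc = {"event_cb": event_cb}
        self.descs.append(desc)
        return desc

    def close(self, desc):
        self.descs.remove(desc)
        self._maybe_destroy()

    def unregister(self):
        if self.status != "registered":
            return -errno.EBUSY         # duplicate unregister is rejected
        self.status = "unregistering"
        # Iterate over a copy: a callback may close its descriptor inline.
        for desc in list(self.descs):
            desc["event_cb"]("remove", desc)
        self._maybe_destroy()
        return 0

    def _maybe_destroy(self):
        if self.status == "unregistering" and not self.descs:
            self.destroyed = True
```

An opener whose callback closes the descriptor lets destruction finish at once; an opener that ignores the event keeps the bdev in the unregistering state until it eventually calls close.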

Hotplug And Hotremove

At the bdev level, hotremove is observed as remove events and unregister. At the module level, physical modules may have their own hotplug pollers.

For NVMe bdevs, the relevant source anchors are:

  • module/bdev/nvme/bdev_nvme_rpc.c:rpc_bdev_nvme_set_hotplug().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_set_hotplug().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_hotplug().
  • module/bdev/nvme/bdev_nvme.c:bdev_nvme_remove_poller().

Hotplug is primary-process sensitive for PCIe. The NVMe module limits or rejects some hotplug behavior for secondary processes because PCIe discovery and ownership are not just ordinary userspace state.

For virtual modules, the most important hotplug pattern is "base bdev appears later." Passthru stores desired base and virtual names, then vbdev_passthru_examine() tries to create the virtual bdev whenever a new bdev is examined.

Source anchors:

  • module/bdev/passthru/vbdev_passthru.c:bdev_passthru_create_disk().
  • module/bdev/passthru/vbdev_passthru.c:vbdev_passthru_examine().

Media Events

Some bdevs support media-management events. The bdev core stores pending media events on descriptors and notifies interested openers.

Source anchors:

  • include/spdk/bdev_module.h:spdk_bdev_push_media_events().
  • include/spdk/bdev_module.h:spdk_bdev_notify_media_management().
  • lib/bdev/bdev.c:spdk_bdev_push_media_events().
  • lib/bdev/bdev.c:spdk_bdev_notify_media_management().
  • lib/bdev/bdev.c:spdk_bdev_get_media_events().

spdk_bdev_push_media_events():

  • Requires bdev->media_events.
  • Finds a writable descriptor with a media event buffer.
  • Moves events into that descriptor's pending queue.

spdk_bdev_notify_media_management():

  • Walks open descriptors.
  • Sends SPDK_BDEV_EVENT_MEDIA_MANAGEMENT to descriptors with pending media events.

Edge case: if no suitable descriptor exists, pushing media events returns -ENODEV.
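The push, notify, and get steps can be modeled as three small functions. The Python sketch below is a simplification with hypothetical names: a capacity of 0 stands in for a descriptor without a media event buffer, and a counter stands in for delivery of SPDK_BDEV_EVENT_MEDIA_MANAGEMENT.

```python
import errno
from collections import deque


class ToyDesc:
    def __init__(self, write, capacity):
        self.write = write
        self.capacity = capacity   # 0 models "no media event buffer"
        self.pending = deque()
        self.notifications = 0


def push_media_events(descs, events):
    """Queue events on the first writable descriptor; return count queued."""
    desc = next((d for d in descs if d.write), None)
    if desc is None or desc.capacity == 0:
        return -errno.ENODEV
    pushed = 0
    for event in events:
        if len(desc.pending) >= desc.capacity:
            break                  # buffer full: later events are not queued
        desc.pending.append(event)
        pushed += 1
    return pushed


def notify_media_management(descs):
    """Notify only descriptors that have pending events."""
    for desc in descs:
        if desc.pending:
            desc.notifications += 1


def get_media_events(desc, max_events):
    """Hand pending events to the opener, freeing buffer space."""
    count = min(max_events, len(desc.pending))
    return [desc.pending.popleft() for _ in range(count)]
```

Because the push returns how many events were queued, a module can detect a full buffer by comparing the return value with the number it tried to push.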

Quiesce And Locked Ranges

Quiesce lets the registering module temporarily stop I/O for a bdev or an LBA range. This is used for operations that need a stable range, such as metadata updates, reshaping, or other internal transitions.

Source anchors:

  • include/spdk/bdev_module.h:spdk_bdev_quiesce().
  • include/spdk/bdev_module.h:spdk_bdev_unquiesce().
  • include/spdk/bdev_module.h:spdk_bdev_quiesce_range().
  • include/spdk/bdev_module.h:spdk_bdev_unquiesce_range().
  • lib/bdev/bdev.c:spdk_bdev_quiesce().
  • lib/bdev/bdev.c:spdk_bdev_unquiesce().
  • lib/bdev/bdev.c:spdk_bdev_quiesce_range().
  • lib/bdev/bdev.c:spdk_bdev_unquiesce_range().
  • lib/bdev/bdev.c:_spdk_bdev_quiesce().
  • lib/bdev/bdev.c:bdev_io_range_is_locked().

Rules:

  • Only the module that registered the bdev may call quiesce APIs.
  • Full bdev quiesce is implemented as a range from offset 0 to bdev->blockcnt.
  • I/O submitted after quiesce queues until unquiesce.
  • Unquiesce range must match the exact range previously quiesced.

Edge case: reads pass through a locked range whose quiesce flag is clear, but a quiesced range blocks them too. Writes and other modifying commands are blocked whenever they overlap a locked range.
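The per-I/O check can be written as a short predicate. This Python sketch is a model of the decision described for bdev_io_range_is_locked(), under assumptions: the string type names and the dictionary layout are hypothetical, and the exemption for I/O issued by the owner of the lock is left out.

```python
NVME_PASSTHROUGH = {"nvme_io", "nvme_io_md"}
RANGE_CHECKED = {"read", "write", "unmap", "write_zeroes", "zcopy", "copy"}


def ranges_overlap(off_a, len_a, off_b, len_b):
    """True if [off_a, off_a + len_a) intersects [off_b, off_b + len_b)."""
    return off_a < off_b + len_b and off_b < off_a + len_a


def io_is_blocked(io_type, offset, num_blocks, locked):
    """Decide whether one I/O must wait behind one locked range."""
    if io_type in NVME_PASSTHROUGH:
        # The command is not decoded, so assume it overlaps.
        return True
    if io_type == "read" and not locked["quiesce"]:
        # Plain range locks let reads through; quiesced ranges do not.
        return False
    if io_type in RANGE_CHECKED:
        return ranges_overlap(offset, num_blocks,
                              locked["offset"], locked["length"])
    return False   # types with no LBA range are never held here
```

The conservative passthrough branch is why an NVMe passthrough command can stall behind a quiesced range it does not actually touch.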

Prose Diagram

Picture one bdev channel with four lists attached:

  • io_submitted: I/O that has entered the core path and has not completed.
  • qos_queued_io: I/O waiting for rate-limit quota.
  • shared nomem_io: I/O waiting for backend resources.
  • io_locked: I/O blocked by a quiesced or locked LBA range.

Now picture reset as a red vertical bar across all channels. It sets reset-in-progress flags, aborts NOMEM, iobuf, and QoS queues, waits or submits reset, then removes the red bar after reset completion.

Picture unregister as a lifecycle arrow from "registered" to "unregistering" to "removing" to "destroyed." Open descriptors can hold the object in the middle until they close.

Edge Cases And Failure Modes

  • QoS configured but no descriptors open: poller creation and destruction are lazy and tied to channel/descriptor lifecycle.
  • QoS disabled with old poller still running: the code swaps QoS structures to avoid races during shutdown.
  • Reset storm: bdev core serializes resets; upper layers should still avoid loops that keep resetting a failing backend.
  • Remove while descriptors are open: destruction waits; users must close.
  • Remove event ignored: virtual bdevs can remain visible over a dead base until they fail I/O or unregister later.
  • Hotplug disabled: remove polling may still run depending on module policy.
  • Media event buffer full: later events may not be queued.
  • Quiesce exact-match requirement: unquiesce with a different range returns error.
  • NVMe passthrough during quiesce: bdev core assumes worst-case overlap.
  • Channel destroyed with queued I/O: queued work is aborted to avoid dangling callbacks.

Misconceptions To Kill

  • "Delete means freed immediately." No. Open descriptors and async destruct delay final free.
  • "Hotremove is only a PCIe problem." No. Any base bdev removal matters to virtual bdevs.
  • "QoS changes hardware queue depth." Not directly. It queues at the bdev layer before module submission.
  • "Reset only affects the caller's channel." No. bdev reset iterates all channels.
  • "Quiesce is for applications." The public module API says only the registering module may call it.
  • "Media events are completions." No. They are descriptor events, separate from I/O completion callbacks.

Source Reading Exercise

Read these paths:

  1. QoS: lib/bdev/bdev_rpc.c:rpc_bdev_set_qos_limit() -> lib/bdev/bdev.c:spdk_bdev_set_qos_rate_limits() -> lib/bdev/bdev.c:bdev_channel_poll_qos().
  2. Reset: lib/bdev/bdev.c:spdk_bdev_reset() -> lib/bdev/bdev.c:bdev_start_reset() -> lib/bdev/bdev.c:bdev_reset_freeze_channel() -> lib/bdev/bdev.c:spdk_bdev_io_complete().
  3. Remove: lib/bdev/bdev.c:spdk_bdev_unregister() -> lib/bdev/bdev.c:spdk_bdev_close().
  4. Virtual hotremove: module/bdev/passthru/vbdev_passthru.c:vbdev_passthru_base_bdev_event_cb().

Questions:

  • Which queues are aborted during reset?
  • What protects against two resets running at once?
  • Why can unregister finish later than the RPC that requested it?
  • How does passthru convert a base remove event into virtual bdev removal?
  • Which QoS limits apply to writes but not reads?

Operational Lab

Create a timeline for a volume delete that races with I/O:

  1. Application has a descriptor open.
  2. I/O is queued by QoS.
  3. Control plane asks to unregister the bdev.
  4. bdev core starts removal.
  5. Descriptor receives remove event.
  6. Application stops I/O and closes descriptor.
  7. Module destruct runs.

For each step, write which source function owns the transition. Then identify where a bug would appear if the application ignores the remove event.

Self-Check

  1. What is the difference between QoS queued I/O and NOMEM queued I/O?
  2. Why does reset take a channel reference?
  3. What does reset_io_drain_timeout change?
  4. Why does unregister notify descriptors instead of freeing immediately?
  5. What should a virtual bdev do when its base bdev is removed?
  6. Why are NVMe passthrough commands treated conservatively for locked ranges?
  7. Which functions push and notify media events?
  8. Why must unquiesce use the exact range that was quiesced?

References

  • Local source: include/spdk/bdev_module.h.
  • Local source: lib/bdev/bdev.c.
  • Local source: lib/bdev/bdev_rpc.c.
  • Local source: module/bdev/passthru/vbdev_passthru.c.
  • Local source: module/bdev/nvme/bdev_nvme.c.
  • SPDK documentation: https://spdk.io/doc/