Reader Promise
By the end of this chapter you should be able to sketch a small bdev module and know where the hard parts are. You will know the difference between a physical bdev module and a virtual bdev module, how to allocate and register a bdev, how to handle I/O, how to forward I/O to a base bdev, how claims work, and how to tear down without leaving dangling descriptors or channels.
This is not a full coding tutorial with a patch to compile. It is a source-guided design draft. The goal is to make the existing SPDK examples readable before you write code.
Physical vs Virtual bdev Modules
A physical bdev module presents a device-like backend directly. Examples include NVMe, malloc, null, aio, and uring. A physical module's submit_request() usually talks to hardware, a file descriptor, memory, or a network transport.
A virtual bdev module presents a new bdev stacked on one or more base bdevs. Examples include passthru, lvol, RAID, crypto, delay, and split. A virtual module's submit_request() usually transforms or forwards I/O to base bdevs.
The difference matters because a virtual bdev module has extra responsibilities:
- Open the base bdev.
- Register an event callback for base bdev removal.
- Claim the base bdev when exclusive stacking is required.
- Create per-thread base channels.
- Forward completion status from base I/O to original I/O.
- Release claim and close descriptor on destruct.
Minimal Physical Module: null
The null bdev is a compact physical example. It does not store data; it accepts reads and writes, completes them later from a poller, and optionally generates or verifies DIF.
Source anchors:
- module/bdev/null/bdev_null.c:struct null_bdev.
- module/bdev/null/bdev_null.c:struct null_io_channel.
- module/bdev/null/bdev_null.c:null_if.
- module/bdev/null/bdev_null.c:bdev_null_get_ctx_size().
- module/bdev/null/bdev_null.c:bdev_null_create().
- module/bdev/null/bdev_null.c:null_fn_table.
- module/bdev/null/bdev_null.c:bdev_null_submit_request().
- module/bdev/null/bdev_null.c:null_io_poll().
- module/bdev/null/bdev_null.c:bdev_null_destruct().
- module/bdev/null/bdev_null.c:bdev_null_initialize().
- module/bdev/null/bdev_null.c:bdev_null_finish().
What null teaches
The module object:
- null_if.name = "null".
- null_if.module_init = bdev_null_initialize.
- null_if.module_fini = bdev_null_finish.
- null_if.async_fini = true.
- null_if.get_ctx_size = bdev_null_get_ctx_size.
The bdev object:
- Allocated in bdev_null_create().
- Name duplicated from RPC options.
- Geometry and metadata fields filled from options.
- bdev.ctxt = null_disk.
- bdev.fn_table = &null_fn_table.
- bdev.module = &null_if.
- Registered with spdk_bdev_register().
The function table:
- destruct = bdev_null_destruct.
- submit_request = bdev_null_submit_request.
- io_type_supported = bdev_null_io_type_supported.
- get_io_channel = bdev_null_get_io_channel.
- write_config_json = bdev_null_write_config_json.
The io_device:
- Registered in bdev_null_initialize() using spdk_io_device_register().
- Uses the address of g_null_bdev_head as the io_device key.
- Per-thread channel is struct null_io_channel.
- Channel create callback registers null_io_poll().
- Channel destroy callback unregisters the poller.
The submit path:
- bdev_null_submit_request() receives an I/O from bdev core.
- It switches on bdev_io->type.
- It queues supported operations on the channel's io list.
- It completes unsupported operations as failed.
- null_io_poll() drains the queue and calls spdk_bdev_io_complete().
Misconception to kill: even a "do nothing" backend should usually avoid deep inline completion recursion. Null queues work and completes it from a poller, which behaves more like a real asynchronous backend.
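The queue-then-poll shape of the submit path can be sketched in isolation. The block below is a minimal model, not the null module's code: every toy_ name is a hypothetical stand-in for the corresponding bdev structure, and a plain linked list stands in for the channel's queue.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types for illustration only; these are NOT the SPDK structures. */
enum toy_status { TOY_PENDING, TOY_SUCCESS, TOY_FAILED };
enum toy_type { TOY_READ, TOY_WRITE, TOY_UNSUPPORTED };

struct toy_io {
    enum toy_type type;
    enum toy_status status;
    struct toy_io *next;      /* link in the per-channel queue */
};

struct toy_channel {
    struct toy_io *head;      /* oldest queued I/O */
    struct toy_io *tail;      /* newest queued I/O */
};

/* Submit path: queue supported types, fail everything else right away. */
static void toy_submit(struct toy_channel *ch, struct toy_io *io)
{
    assert(ch != NULL && io != NULL);
    io->status = TOY_PENDING;
    io->next = NULL;

    switch (io->type) {
    case TOY_READ:
    case TOY_WRITE:
        if (ch->tail != NULL) {
            ch->tail->next = io;
        } else {
            ch->head = io;
        }
        ch->tail = io;
        break;
    default:
        io->status = TOY_FAILED;
        break;
    }
}

/* Poller: drain the queue in FIFO order; returns how many I/O completed. */
static int toy_poll(struct toy_channel *ch)
{
    int completed = 0;

    while (ch->head != NULL) {
        struct toy_io *io = ch->head;

        ch->head = io->next;
        io->next = NULL;
        io->status = TOY_SUCCESS;
        completed++;
    }
    ch->tail = NULL;
    return completed;
}
```

Nothing is completed inside toy_submit() for supported types, so a caller that submits from a completion callback cannot recurse into another completion. The unsupported case is the only inline completion.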
Minimal Virtual Module: passthru
Passthru is the canonical first virtual bdev because it mostly forwards I/O unchanged.
Source anchors:
- module/bdev/passthru/vbdev_passthru.c:struct vbdev_passthru.
- module/bdev/passthru/vbdev_passthru.c:struct pt_io_channel.
- module/bdev/passthru/vbdev_passthru.c:struct passthru_bdev_io.
- module/bdev/passthru/vbdev_passthru.c:passthru_if.
- module/bdev/passthru/vbdev_passthru.c:vbdev_passthru_register().
- module/bdev/passthru/vbdev_passthru.c:vbdev_passthru_fn_table.
- module/bdev/passthru/vbdev_passthru.c:vbdev_passthru_submit_request().
- module/bdev/passthru/vbdev_passthru.c:_pt_complete_io().
- module/bdev/passthru/vbdev_passthru.c:pt_bdev_ch_create_cb().
- module/bdev/passthru/vbdev_passthru.c:pt_bdev_ch_destroy_cb().
- module/bdev/passthru/vbdev_passthru.c:vbdev_passthru_destruct().
- module/bdev/passthru/vbdev_passthru.c:vbdev_passthru_base_bdev_event_cb().
- module/bdev/passthru/vbdev_passthru.c:vbdev_passthru_examine().
What passthru teaches
Creation:
- Store requested base and virtual names in g_bdev_names.
- On RPC or examine, call vbdev_passthru_register().
- Open the base with spdk_bdev_open_ext().
- Store base_desc and base_bdev.
- Copy relevant geometry and metadata fields to pt_bdev.
- Set pt_bdev.ctxt, pt_bdev.fn_table, and pt_bdev.module.
- Register an io_device for the virtual bdev's per-thread state.
- Claim the base bdev with spdk_bdev_module_claim_bdev().
- Register the virtual bdev with spdk_bdev_register().
Channel handling:
- vbdev_passthru_get_io_channel() returns spdk_get_io_channel(pt_node).
- pt_bdev_ch_create_cb() gets a base bdev channel with spdk_bdev_get_io_channel(pt_node->base_desc).
- pt_bdev_ch_destroy_cb() puts that base channel.
I/O forwarding:
- vbdev_passthru_submit_request() switches on the original bdev I/O type.
- For reads, it calls spdk_bdev_io_get_buf() first if needed.
- For writes, flush, unmap, reset, zcopy, abort, and copy, it calls the corresponding bdev API on the base descriptor and base channel.
- _pt_complete_io() copies the base I/O status to the original I/O using spdk_bdev_io_complete_base_io_status() and frees the base I/O.
NOMEM handling:
- If a base submission API returns -ENOMEM, passthru queues a wait entry using spdk_bdev_queue_io_wait().
- When resources are available, vbdev_passthru_resubmit_io() retries the original I/O.
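The wait-and-resubmit pattern can be modeled without bdev core. This is a sketch under stated assumptions: the sim_ names are hypothetical, the "base" is just a counter of free I/O slots, and the waiter list is LIFO for brevity. Only the control flow is meant to mirror the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real wait mechanism lives in bdev core. */
struct sim_wait_entry {
    void (*cb_fn)(void *cb_arg);
    void *cb_arg;
    struct sim_wait_entry *next;
};

struct sim_base {
    int free_slots;                  /* base I/O objects currently available */
    struct sim_wait_entry *waiters;  /* callers waiting for a free slot */
};

struct sim_io {
    struct sim_base *base;
    struct sim_wait_entry wait;      /* embedded, so queuing needs no allocation */
    int submitted;                   /* 1 once the base accepted the I/O */
};

static int sim_base_submit(struct sim_base *base)
{
    if (base->free_slots == 0) {
        return -ENOMEM;
    }
    base->free_slots--;
    return 0;
}

static void sim_forward(struct sim_io *io);

/* Wait callback: simply retry the original I/O. */
static void sim_resubmit(void *cb_arg)
{
    sim_forward(cb_arg);
}

/* Forward path: submit to the base, or park the I/O on -ENOMEM. */
static void sim_forward(struct sim_io *io)
{
    if (sim_base_submit(io->base) == -ENOMEM) {
        io->wait.cb_fn = sim_resubmit;
        io->wait.cb_arg = io;
        io->wait.next = io->base->waiters;
        io->base->waiters = &io->wait;
        return;
    }
    io->submitted = 1;
}

/* A base I/O finished: free its slot and wake one waiter, if any. */
static void sim_base_release(struct sim_base *base)
{
    struct sim_wait_entry *entry = base->waiters;

    base->free_slots++;
    if (entry != NULL) {
        assert(entry->cb_fn != NULL);
        base->waiters = entry->next;
        entry->cb_fn(entry->cb_arg);
    }
}
```

The parked I/O is neither completed nor failed while it waits. It stays owned by the module until the retry succeeds or the module decides to fail it.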
Destruct:
- vbdev_passthru_destruct() removes the node from the module list.
- It releases the claim with spdk_bdev_module_release_bdev().
- It closes the base descriptor on the thread that opened it, using spdk_thread_send_msg() if needed.
- It unregisters the io_device and frees the virtual bdev object.
Hotremove:
- The base descriptor was opened with vbdev_passthru_base_bdev_event_cb() as its event callback.
- On SPDK_BDEV_EVENT_REMOVE, passthru unregisters the virtual bdev.
Misconception to kill: forwarding an I/O does not mean reusing the same struct spdk_bdev_io for the base device. The virtual module receives one bdev I/O, submits another bdev I/O to the base, then completes the original when the base I/O completes.
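The two-object relationship can be made concrete with a small model. This is a sketch with hypothetical demo_ types, not the passthru code: the "base device" owns a single I/O slot, and the virtual layer's completion callback copies the status across before returning the base I/O.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in types; hypothetical, not the SPDK definitions. */
enum demo_status { DEMO_PENDING, DEMO_SUCCESS, DEMO_FAILED };

struct demo_io {
    enum demo_status status;
    void (*cb)(struct demo_io *base_io, void *cb_arg);
    void *cb_arg;
    bool in_use;
};

/* A one-slot "base device" that hands out its own I/O object. */
struct demo_base {
    struct demo_io slot;
};

/* Returns 0 if the base accepted the request, -1 if no I/O object is free. */
static int demo_base_submit(struct demo_base *base,
                            void (*cb)(struct demo_io *, void *), void *cb_arg)
{
    if (base->slot.in_use) {
        return -1;
    }
    base->slot.in_use = true;
    base->slot.status = DEMO_PENDING;
    base->slot.cb = cb;
    base->slot.cb_arg = cb_arg;
    return 0;
}

/* Simulates the base device finishing its I/O with the given status. */
static void demo_base_finish(struct demo_base *base, enum demo_status status)
{
    base->slot.status = status;
    base->slot.cb(&base->slot, base->slot.cb_arg);
}

/* Completion callback: copy status to the original I/O, then free the base I/O. */
static void demo_complete_cb(struct demo_io *base_io, void *cb_arg)
{
    struct demo_io *orig_io = cb_arg;

    assert(base_io->in_use);
    orig_io->status = base_io->status;
    base_io->in_use = false;
}

/* Virtual layer: the original I/O travels only as the callback argument. */
static void demo_vbdev_submit(struct demo_base *base, struct demo_io *orig_io)
{
    orig_io->status = DEMO_PENDING;
    if (demo_base_submit(base, demo_complete_cb, orig_io) != 0) {
        /* Submission failed: no callback will fire, so complete it here. */
        orig_io->status = DEMO_FAILED;
    }
}
```

The original I/O is never handed to the base device. It is completed in exactly one of two places: the completion callback, or the submission-failure branch.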
A KISS Module Checklist
For a physical module:
- Define backend object containing struct spdk_bdev.
- Define per-channel state if needed.
- Define per-I/O context if needed.
- Define struct spdk_bdev_module.
- Register it with SPDK_BDEV_MODULE_REGISTER().
- Implement get_ctx_size().
- Implement module_init() and register any io_device.
- Implement module_fini() and unregister io_device.
- Implement struct spdk_bdev_fn_table.
- Fill bdev name, geometry, capabilities, module, function table, and context.
- Call spdk_bdev_register().
- Complete every submitted I/O exactly once.
- Free resources in destruct().
For a virtual module, add:
- Store configured base name and virtual name.
- Use examine to attach when base appears.
- Open base with an event callback.
- Claim base before exposing the virtual bdev.
- Create virtual io_device.
- Get and put base channels in virtual channel create/destroy.
- Forward I/O through public bdev APIs.
- Translate or copy completion status.
- On base remove, unregister virtual bdev.
- On destruct, release claim and close base descriptor on the correct thread.
The Hard Rules
Do not block
Module callbacks run on SPDK threads. Blocking in submit_request() stalls that thread and all other work scheduled there. Use pollers, asynchronous APIs, and messages.
Relevant source anchors:
- module/bdev/null/bdev_null.c:null_io_poll().
- module/bdev/passthru/vbdev_passthru.c:vbdev_passthru_resubmit_io().
Complete exactly once
Every I/O delivered to submit_request() must be completed exactly once. A missing completion hangs upper layers. A double completion trips bdev core assertions.
Relevant source anchor: lib/bdev/bdev.c:spdk_bdev_io_complete().
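One way to catch both failure modes while developing a module is to track in-flight state yourself. The sketch below is a hypothetical debugging aid with made-up track_ names. It is not how bdev core enforces the rule, only an illustration of what "exactly once" means in terms of state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not SPDK structures. */
struct track_channel {
    int outstanding;          /* I/O submitted but not yet completed */
};

struct track_io {
    struct track_channel *ch;
    bool in_flight;
};

/* Record that the module now owns this I/O. */
static void track_submit(struct track_channel *ch, struct track_io *io)
{
    io->ch = ch;
    io->in_flight = true;
    ch->outstanding++;
}

/* Returns 0 on the first completion, -1 if the I/O was not in flight. */
static int track_complete(struct track_io *io)
{
    if (!io->in_flight) {
        return -1;            /* double completion (or never submitted) */
    }
    assert(io->ch != NULL);
    io->in_flight = false;
    io->ch->outstanding--;
    return 0;
}

/* A channel with outstanding I/O at teardown means a completion was missed. */
static bool track_channel_idle(const struct track_channel *ch)
{
    assert(ch->outstanding >= 0);
    return ch->outstanding == 0;
}
```

Checking track_channel_idle() in a channel destroy callback would surface a missing completion, and a -1 from track_complete() would surface a double completion before it reaches bdev core.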
Respect thread ownership
Descriptors and many lifecycle actions are thread-bound.
Relevant source anchors:
- lib/bdev/bdev.c:spdk_bdev_close().
- module/bdev/passthru/vbdev_passthru.c:vbdev_passthru_destruct().
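The "close on the opening thread" rule comes down to a comparison and a message. The model below assumes a toy scheduler: own_thread, own_send_msg(), and own_run_thread() are hypothetical stand-ins with a one-slot mailbox, chosen only to show the control flow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins: a "thread" is an id plus a one-slot mailbox. */
struct own_thread {
    int id;
    void (*msg_fn)(void *ctx);   /* pending message, NULL if none */
    void *msg_ctx;
};

struct own_desc {
    struct own_thread *opened_on;
    bool open;
    int closed_on_id;            /* id of the thread that actually ran the close */
};

static struct own_thread *g_current;   /* which toy thread is running now */

/* The actual close; must run on the thread that opened the descriptor. */
static void own_close(void *ctx)
{
    struct own_desc *desc = ctx;

    assert(g_current == desc->opened_on);
    desc->open = false;
    desc->closed_on_id = g_current->id;
}

static void own_send_msg(struct own_thread *t, void (*fn)(void *), void *ctx)
{
    t->msg_fn = fn;
    t->msg_ctx = ctx;
}

/* Destruct-time helper: close inline if possible, otherwise hop threads. */
static void own_release(struct own_desc *desc)
{
    if (g_current == desc->opened_on) {
        own_close(desc);
    } else {
        own_send_msg(desc->opened_on, own_close, desc);
    }
}

/* Toy scheduler step: switch to thread t and run its pending message. */
static void own_run_thread(struct own_thread *t)
{
    g_current = t;
    if (t->msg_fn != NULL) {
        void (*fn)(void *) = t->msg_fn;

        t->msg_fn = NULL;
        fn(t->msg_ctx);
    }
}
```

The descriptor stays open between own_release() and the moment the owning thread runs its message, so destruct code must not assume the close has already happened when the helper returns.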
Separate "submission failed" from "I/O failed"
If a public bdev API returns nonzero, the new I/O was not submitted. Complete the original I/O yourself if you are a virtual module. If the base I/O completes unsuccessfully later, copy that status in your completion callback.
Relevant source anchors:
- module/bdev/passthru/vbdev_passthru.c:vbdev_passthru_submit_request().
- module/bdev/passthru/vbdev_passthru.c:_pt_complete_io().
Do not invent unsupported capabilities
io_type_supported() must reflect reality. If your module forwards to a base bdev, either mirror the base or deliberately restrict it.
Relevant source anchor: module/bdev/passthru/vbdev_passthru.c:vbdev_passthru_io_type_supported().
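The two policies, mirror or restrict, differ by a few lines. The sketch below uses a hypothetical cap_io_type enum and a plain capability table in place of the real I/O type enum and the base bdev query. It illustrates the policy shape only.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical I/O types; a real module would switch on the bdev I/O type enum. */
enum cap_io_type { CAP_READ, CAP_WRITE, CAP_FLUSH, CAP_UNMAP, CAP_NUM_TYPES };

/* Stand-in for "what the base bdev says it supports". */
struct cap_base {
    bool supported[CAP_NUM_TYPES];
};

static bool cap_base_supports(const struct cap_base *base, enum cap_io_type t)
{
    assert((int)t >= 0 && (int)t < CAP_NUM_TYPES);
    return base->supported[t];
}

/* Policy 1: mirror the base exactly. */
static bool cap_mirror(const struct cap_base *base, enum cap_io_type t)
{
    return cap_base_supports(base, t);
}

/* Policy 2: deliberately restrict. Read-only: allow reads and flushes,
 * and only when the base itself supports them. */
static bool cap_readonly(const struct cap_base *base, enum cap_io_type t)
{
    switch (t) {
    case CAP_READ:
    case CAP_FLUSH:
        return cap_base_supports(base, t);
    default:
        return false;
    }
}
```

In both policies the answer is never wider than the base. A restricted module still asks the base, because allowing a type the base cannot execute would advertise a capability that does not exist.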
Prose Diagram
Imagine two side-by-side diagrams.
Physical module diagram:
module_init() creates module state and io_device. An RPC calls create(). The create function allocates a backend object containing struct spdk_bdev, fills fields, and calls spdk_bdev_register(). Later, bdev core calls module submit_request(), and the module completes I/O from a poller or callback.
Virtual module diagram:
An RPC stores "base bdev name -> virtual bdev name." Examine sees the base bdev. The module opens the base, claims it, copies geometry, registers its own bdev, and creates per-thread channels that each hold a base channel. I/O enters the virtual bdev, gets submitted to the base bdev, base completion returns, virtual module completes original I/O.
Edge Cases And Failure Modes
- Create succeeds partly, then register fails: clean up name, io_device, descriptor, claim, and allocated object in reverse order.
- Base bdev not present yet: virtual modules may store configuration and create later during examine.
- Base bdev removed: event callback must unregister virtual bdevs that depend on it.
- Base descriptor opened on another thread: close it on the original thread.
- Claim fails: another module or writer already owns the base. Do not register the virtual bdev.
- spdk_bdev_register() fails after claim: release the claim and close the descriptor.
- Base submission returns -ENOMEM: queue wait or fail the original I/O intentionally.
- Base submission returns other error: complete original I/O failed.
- Base completion carries NVMe/SCSI/AIO status: use helper functions to preserve error detail when possible.
- Module supports reset but cannot reset safely: either fail reset or forward it carefully. Do not pretend reset happened.
- Asynchronous destruct needed: return 1 from destruct() and call spdk_bdev_destruct_done() later.
- Missing spdk_bdev_module_examine_done(): examine can hang subsystem progress.
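The partial-create cases above are easiest to get right with a single unwind path. The sketch below is a model, not module code: mock_vbdev holds one flag per acquired resource, mock_fail is a hypothetical failure-injection knob, and the acquisition order follows the passthru creation steps (open, io_device, claim, register).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in state; each flag represents one acquired resource. */
struct mock_vbdev {
    bool desc_open;
    bool io_device_registered;
    bool claimed;
    bool registered;
};

/* Failure injection for the sketch: which step should fail. */
enum mock_fail { FAIL_NONE, FAIL_OPEN, FAIL_CLAIM, FAIL_REGISTER };

static int mock_create(struct mock_vbdev *v, enum mock_fail fail)
{
    int rc;

    assert(!v->registered);

    if (fail == FAIL_OPEN) {
        return -ENODEV;              /* nothing acquired yet, nothing to undo */
    }
    v->desc_open = true;

    v->io_device_registered = true;

    if (fail == FAIL_CLAIM) {
        rc = -EPERM;
        goto err_io_device;
    }
    v->claimed = true;

    if (fail == FAIL_REGISTER) {
        rc = -EEXIST;
        goto err_claim;
    }
    v->registered = true;
    return 0;

    /* Unwind in reverse order of acquisition. */
err_claim:
    v->claimed = false;              /* release the claim */
err_io_device:
    v->io_device_registered = false; /* unregister the io_device */
    v->desc_open = false;            /* close the base descriptor last */
    return rc;
}
```

Each label undoes exactly the steps that succeeded before the jump, so adding a new acquisition step means adding one label rather than auditing every error branch.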
Misconceptions To Kill
- "A virtual bdev can skip claims if it is just forwarding." Not if it needs to protect write ownership and stacking semantics.
- "A module can malloc per-I/O context in submit_request()." It can, but usually should not. Use get_ctx_size() for the common context.
- "The base bdev event callback is optional." If you stack on a base bdev, ignoring remove events creates dangling state.
- "A create RPC should always fail if the base bdev is missing." Some virtual modules intentionally defer creation until examine sees the base.
- "The destructor can free everything immediately." Only if no asynchronous close/unregister/device cleanup remains.
Source Reading Exercise
Read module/bdev/passthru/vbdev_passthru.c in this order:
- passthru_if.
- struct vbdev_passthru.
- vbdev_passthru_insert_name().
- vbdev_passthru_register().
- pt_bdev_ch_create_cb().
- vbdev_passthru_submit_request().
- _pt_complete_io().
- vbdev_passthru_base_bdev_event_cb().
- vbdev_passthru_destruct().
- vbdev_passthru_examine().
Questions:
- Which function opens the base bdev?
- Which function claims it?
- Which fields are copied from base bdev to virtual bdev?
- Where does the virtual bdev get a base channel?
- Where is the original I/O completed?
- What happens when the base bdev is removed?
Operational Lab
Design a "readonly passthru" module on paper.
Requirements:
- It creates a virtual bdev on top of one base bdev.
- It allows reads and flushes.
- It rejects writes, unmap, write zeroes, copy, and zcopy.
- It unregisters when the base is removed.
Write:
- The module object fields.
- The function table fields.
- The virtual bdev fields you would copy from the base.
- The io_type_supported() policy.
- The submit_request() switch cases.
- The destruct order.
Then compare your plan to passthru. You should be able to make the design by deleting or failing cases, not by inventing a new architecture.
Self-Check
- What does get_ctx_size() buy you?
- Why does passthru store base_desc instead of only base_bdev?
- Why does a virtual bdev need per-thread channels?
- What must happen if a base submission API returns -ENOMEM?
- What is the difference between spdk_bdev_unregister() and spdk_bdev_destruct_done()?
- Why must examine callbacks call spdk_bdev_module_examine_done()?
- Why should io_type_supported() be conservative?
- What is the safest cleanup order after virtual bdev registration fails?
References
- Local source: include/spdk/bdev_module.h.
- Local source: lib/bdev/bdev.c.
- Local source: module/bdev/null/bdev_null.c.
- Local source: module/bdev/passthru/vbdev_passthru.c.
- SPDK documentation: https://spdk.io/doc/