SPDK From First Principles

SPDK deep learning path

Chapter 15: Writing A bdev Module

Source: drafts/bdev-nvme/15-writing-a-bdev-module.md

Reader Promise

By the end of this chapter you should be able to sketch a small bdev module and know where the hard parts are. You will know the difference between a physical bdev module and a virtual bdev module, how to allocate and register a bdev, how to handle I/O, how to forward I/O to a base bdev, how claims work, and how to tear down without leaving dangling descriptors or channels.

This is not a full coding tutorial with a patch to compile. It is a source-guided design draft. The goal is to make the existing SPDK examples readable before you write code.

Physical vs Virtual bdev Modules

A physical bdev module presents a device-like backend directly. Examples include NVMe, malloc, null, aio, and uring. A physical module's submit_request() usually talks to hardware, a file descriptor, memory, or a network transport.

A virtual bdev module presents a new bdev stacked on one or more base bdevs. Examples include passthru, lvol, RAID, crypto, delay, and split. A virtual module's submit_request() usually transforms or forwards I/O to base bdevs.

The difference matters because a virtual bdev module has extra responsibilities:

  • Open the base bdev.
  • Register an event callback for base bdev removal.
  • Claim the base bdev when exclusive stacking is required.
  • Create per-thread base channels.
  • Forward completion status from base I/O to original I/O.
  • Release claim and close descriptor on destruct.

Minimal Physical Module: null

The null bdev is a compact physical example. It does not store data; it accepts reads and writes, completes them later from a poller, and optionally generates or verifies DIF.

Source anchors:

  • module/bdev/null/bdev_null.c:struct null_bdev.
  • module/bdev/null/bdev_null.c:struct null_io_channel.
  • module/bdev/null/bdev_null.c:null_if.
  • module/bdev/null/bdev_null.c:bdev_null_get_ctx_size().
  • module/bdev/null/bdev_null.c:bdev_null_create().
  • module/bdev/null/bdev_null.c:null_fn_table.
  • module/bdev/null/bdev_null.c:bdev_null_submit_request().
  • module/bdev/null/bdev_null.c:null_io_poll().
  • module/bdev/null/bdev_null.c:bdev_null_destruct().
  • module/bdev/null/bdev_null.c:bdev_null_initialize().
  • module/bdev/null/bdev_null.c:bdev_null_finish().

What null teaches

The module object:

  • null_if.name = "null".
  • null_if.module_init = bdev_null_initialize.
  • null_if.module_fini = bdev_null_finish.
  • null_if.async_fini = true.
  • null_if.get_ctx_size = bdev_null_get_ctx_size.

The bdev object:

  • Allocated in bdev_null_create().
  • Name duplicated from RPC options.
  • Geometry and metadata fields filled from options.
  • bdev.ctxt = null_disk.
  • bdev.fn_table = &null_fn_table.
  • bdev.module = &null_if.
  • Registered with spdk_bdev_register().

The function table:

  • destruct = bdev_null_destruct.
  • submit_request = bdev_null_submit_request.
  • io_type_supported = bdev_null_io_type_supported.
  • get_io_channel = bdev_null_get_io_channel.
  • write_config_json = bdev_null_write_config_json.
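
The three objects above can be pictured as plain C structures that point at each other. The sketch below is a minimal model, not SPDK code: every sketch_ type and function is a hypothetical stand-in for the real definitions in include/spdk/bdev_module.h, and only a few of the real fields are shown. It illustrates the wiring pattern of a backend object that embeds its bdev and uses itself as the bdev context.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real structs have many more fields. */
struct sketch_module {
    const char *name;
    int (*module_init)(void);
    void (*module_fini)(void);
    bool async_fini;
    int (*get_ctx_size)(void);
};

struct sketch_fn_table {
    int (*destruct)(void *ctxt);
    bool (*io_type_supported)(void *ctxt, int io_type);
};

struct sketch_bdev {
    const char *name;
    uint32_t blocklen;
    uint64_t blockcnt;
    void *ctxt;                              /* handed back to every callback */
    const struct sketch_fn_table *fn_table;
    struct sketch_module *module;
};

/* Backend object that embeds the bdev it exposes. */
struct sketch_disk {
    struct sketch_bdev bdev;
};

/* Hypothetical per-I/O context whose size the module reports. */
struct sketch_io_ctx {
    int scratch;
};

static int sketch_init(void) { return 0; }
static void sketch_fini(void) { }
static int sketch_get_ctx_size(void) { return (int)sizeof(struct sketch_io_ctx); }
static int sketch_destruct(void *ctxt) { (void)ctxt; return 0; }

static bool sketch_io_type_supported(void *ctxt, int io_type)
{
    (void)ctxt;
    return io_type == 0; /* illustrative policy: only type 0 is supported */
}

static struct sketch_module sketch_if = {
    .name = "sketch",
    .module_init = sketch_init,
    .module_fini = sketch_fini,
    .async_fini = true,
    .get_ctx_size = sketch_get_ctx_size,
};

static const struct sketch_fn_table sketch_fns = {
    .destruct = sketch_destruct,
    .io_type_supported = sketch_io_type_supported,
};

/* Fill the bdev the way a create function would before registering it. */
static void sketch_fill_bdev(struct sketch_disk *disk, const char *name,
                             uint32_t blocklen, uint64_t blockcnt)
{
    disk->bdev.name = name;
    disk->bdev.blocklen = blocklen;
    disk->bdev.blockcnt = blockcnt;
    disk->bdev.ctxt = disk;
    disk->bdev.fn_table = &sketch_fns;
    disk->bdev.module = &sketch_if;
}
```

Because ctxt points back at the backend object, every function-table callback can recover its own state from the single pointer that bdev core passes in.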

The io_device:

  • Registered in bdev_null_initialize() using spdk_io_device_register().
  • Uses the address of g_null_bdev_head as the io_device key.
  • Per-thread channel is struct null_io_channel.
  • Channel create callback registers null_io_poll().
  • Channel destroy callback unregisters the poller.

The submit path:

  • bdev_null_submit_request() receives an I/O from bdev core.
  • It switches on bdev_io->type.
  • It queues supported operations on the channel's io list.
  • It completes unsupported operations as failed.
  • null_io_poll() drains the queue and calls spdk_bdev_io_complete().
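
The queue-then-poll shape of this submit path can be modeled without SPDK. The following is a sketch under stated assumptions: all sketch_ names are hypothetical, the set of supported I/O types is illustrative only and not the real null policy, and a status field stands in for calling spdk_bdev_io_complete().

```c
#include <stddef.h>
#include <sys/queue.h>

enum sketch_io_type { SKETCH_IO_READ, SKETCH_IO_WRITE, SKETCH_IO_OTHER };
enum sketch_io_status { SKETCH_IO_PENDING, SKETCH_IO_SUCCESS, SKETCH_IO_FAILED };

struct sketch_io {
    enum sketch_io_type type;
    enum sketch_io_status status;
    TAILQ_ENTRY(sketch_io) link;
};

/* Per-thread channel state: a queue of I/O waiting for the poller. */
struct sketch_channel {
    TAILQ_HEAD(, sketch_io) io;
};

static void sketch_channel_init(struct sketch_channel *ch)
{
    TAILQ_INIT(&ch->io);
}

/* Stand-in for completing an I/O back to the upper layer. */
static void sketch_io_complete(struct sketch_io *io, enum sketch_io_status status)
{
    io->status = status;
}

/* Queue supported operations; fail everything else immediately. */
static void sketch_submit_request(struct sketch_channel *ch, struct sketch_io *io)
{
    switch (io->type) {
    case SKETCH_IO_READ:
    case SKETCH_IO_WRITE:
        TAILQ_INSERT_TAIL(&ch->io, io, link);
        break;
    default:
        sketch_io_complete(io, SKETCH_IO_FAILED);
        break;
    }
}

/* Poller body: drain the queue and complete each I/O. Returns the count. */
static int sketch_io_poll(struct sketch_channel *ch)
{
    struct sketch_io *io;
    int completed = 0;

    while ((io = TAILQ_FIRST(&ch->io)) != NULL) {
        TAILQ_REMOVE(&ch->io, io, link);
        sketch_io_complete(io, SKETCH_IO_SUCCESS);
        completed++;
    }
    return completed;
}
```

A queued I/O stays pending until the poller runs, so the submitter's stack unwinds before any completion callback fires.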

Misconception to kill: "a do-nothing backend may as well complete every I/O inline." Even a "do nothing" backend should usually avoid deep inline completion recursion. Null queues work and completes it from a poller, which behaves more like a real asynchronous backend.

Minimal Virtual Module: passthru

Passthru is the canonical first virtual bdev because it mostly forwards I/O unchanged.

Source anchors:

  • module/bdev/passthru/vbdev_passthru.c:struct vbdev_passthru.
  • module/bdev/passthru/vbdev_passthru.c:struct pt_io_channel.
  • module/bdev/passthru/vbdev_passthru.c:struct passthru_bdev_io.
  • module/bdev/passthru/vbdev_passthru.c:passthru_if.
  • module/bdev/passthru/vbdev_passthru.c:vbdev_passthru_register().
  • module/bdev/passthru/vbdev_passthru.c:vbdev_passthru_fn_table.
  • module/bdev/passthru/vbdev_passthru.c:vbdev_passthru_submit_request().
  • module/bdev/passthru/vbdev_passthru.c:_pt_complete_io().
  • module/bdev/passthru/vbdev_passthru.c:pt_bdev_ch_create_cb().
  • module/bdev/passthru/vbdev_passthru.c:pt_bdev_ch_destroy_cb().
  • module/bdev/passthru/vbdev_passthru.c:vbdev_passthru_destruct().
  • module/bdev/passthru/vbdev_passthru.c:vbdev_passthru_base_bdev_event_cb().
  • module/bdev/passthru/vbdev_passthru.c:vbdev_passthru_examine().

What passthru teaches

Creation:

  1. Store requested base and virtual names in g_bdev_names.
  2. On RPC or examine, call vbdev_passthru_register().
  3. Open the base with spdk_bdev_open_ext().
  4. Store base_desc and base_bdev.
  5. Copy relevant geometry and metadata fields to pt_bdev.
  6. Set pt_bdev.ctxt, pt_bdev.fn_table, and pt_bdev.module.
  7. Register an io_device for the virtual bdev's per-thread state.
  8. Claim the base bdev with spdk_bdev_module_claim_bdev().
  9. Register the virtual bdev with spdk_bdev_register().
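
The ordering in steps 3 through 9 matters most when a later step fails. Here is a minimal sketch of the acquire-then-unwind pattern, assuming hypothetical sketch_ names, boolean flags in place of real resources, and a fail_at parameter that exists only to inject a failure for illustration. The io_device step is modeled as unable to fail.

```c
#include <stdbool.h>

enum sketch_step {
    SKETCH_STEP_NONE,      /* inject no failure */
    SKETCH_STEP_OPEN,
    SKETCH_STEP_CLAIM,
    SKETCH_STEP_REGISTER
};

struct sketch_vbdev {
    bool desc_open;
    bool io_device_registered;
    bool claimed;
    bool registered;
};

/* Stand-in for a fallible SPDK call: fails only at the injected step. */
static int sketch_try(enum sketch_step step, enum sketch_step fail_at)
{
    return step == fail_at ? -1 : 0;
}

static int sketch_vbdev_register(struct sketch_vbdev *v, enum sketch_step fail_at)
{
    int rc;

    rc = sketch_try(SKETCH_STEP_OPEN, fail_at);      /* open the base */
    if (rc != 0) {
        return rc;
    }
    v->desc_open = true;

    v->io_device_registered = true;                  /* register io_device */

    rc = sketch_try(SKETCH_STEP_CLAIM, fail_at);     /* claim the base */
    if (rc != 0) {
        goto err_unregister_io_device;
    }
    v->claimed = true;

    rc = sketch_try(SKETCH_STEP_REGISTER, fail_at);  /* expose the vbdev */
    if (rc != 0) {
        goto err_release_claim;
    }
    v->registered = true;
    return 0;

    /* Unwind in the reverse order of acquisition. */
err_release_claim:
    v->claimed = false;
err_unregister_io_device:
    v->io_device_registered = false;
    v->desc_open = false;                            /* close the descriptor */
    return rc;
}
```

Registering last means no upper layer can see the virtual bdev until the base is open and claimed, and a failure at any step leaves nothing behind.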

Channel handling:

  • vbdev_passthru_get_io_channel() returns spdk_get_io_channel(pt_node).
  • pt_bdev_ch_create_cb() gets a base bdev channel with spdk_bdev_get_io_channel(pt_node->base_desc).
  • pt_bdev_ch_destroy_cb() puts that base channel.

I/O forwarding:

  • vbdev_passthru_submit_request() switches on original bdev I/O type.
  • For reads, it calls spdk_bdev_io_get_buf() first if needed.
  • For writes, flush, unmap, reset, zcopy, abort, and copy, it calls the corresponding bdev API on the base descriptor and base channel.
  • _pt_complete_io() copies the base I/O status to the original I/O using spdk_bdev_io_complete_base_io_status() and frees the base I/O.
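
The relationship between the original I/O and the base I/O can be modeled as two separate objects linked only by a callback argument. This is a sketch, not the passthru source: the sketch_ names are hypothetical, calloc() stands in for bdev core allocating the base I/O, and a plain assignment stands in for spdk_bdev_io_complete_base_io_status().

```c
#include <stdbool.h>
#include <stdlib.h>

enum sketch_status { SKETCH_PENDING, SKETCH_SUCCESS, SKETCH_FAILED };

struct sketch_io;
typedef void (*sketch_io_cb)(struct sketch_io *base_io, bool success, void *cb_arg);

struct sketch_io {
    enum sketch_status status;
    sketch_io_cb cb;
    void *cb_arg;
};

/* Stand-in for a base submission API: it creates a new, separate base I/O. */
static struct sketch_io *sketch_base_submit(sketch_io_cb cb, void *cb_arg)
{
    struct sketch_io *base_io = calloc(1, sizeof(*base_io));

    if (base_io == NULL) {
        return NULL;
    }
    base_io->cb = cb;
    base_io->cb_arg = cb_arg;
    return base_io;
}

/* Stand-in for the base device finishing the I/O some time later. */
static void sketch_base_finish(struct sketch_io *base_io, bool success)
{
    base_io->status = success ? SKETCH_SUCCESS : SKETCH_FAILED;
    base_io->cb(base_io, success, base_io->cb_arg);
}

/* Plays the role of _pt_complete_io(): copy status, then free the base I/O. */
static void sketch_complete_io(struct sketch_io *base_io, bool success, void *cb_arg)
{
    struct sketch_io *orig_io = cb_arg;

    (void)success;
    orig_io->status = base_io->status;
    free(base_io);
}

/* Forward the original I/O by submitting a new base I/O that refers to it. */
static struct sketch_io *sketch_forward(struct sketch_io *orig_io)
{
    struct sketch_io *base_io = sketch_base_submit(sketch_complete_io, orig_io);

    if (base_io == NULL) {
        orig_io->status = SKETCH_FAILED; /* submission failed: complete it here */
    }
    return base_io;
}
```

The original I/O is never handed to the base; it travels only as cb_arg and is completed exactly once, in the completion callback or on the submission-failure path.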

NOMEM handling:

  • If a base submission API returns -ENOMEM, passthru queues a wait entry using spdk_bdev_queue_io_wait().
  • When resources are available, vbdev_passthru_resubmit_io() retries the original I/O.
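
The retry loop can be illustrated with a toy resource pool. The sketch below assumes hypothetical sketch_ names, a counter in place of the base bdev's I/O pool, and a single-slot wait list in place of the list that spdk_bdev_queue_io_wait() maintains. It shows the three-way branch on the submission return code.

```c
#include <errno.h>
#include <stddef.h>

/* Wait entry: a callback to run once a base I/O slot frees up. */
struct sketch_wait {
    void (*cb_fn)(void *cb_arg);
    void *cb_arg;
};

struct sketch_base {
    int free_slots;              /* base I/O objects currently available */
    struct sketch_wait *waiter;  /* one waiter only, for brevity */
};

struct sketch_io {
    struct sketch_base *base;
    struct sketch_wait wait;
    int submitted;
    int failed;
};

/* Stand-in for a base submission API that can run out of I/O objects. */
static int sketch_base_submit(struct sketch_base *base)
{
    if (base->free_slots == 0) {
        return -ENOMEM;
    }
    base->free_slots--;
    return 0;
}

static void sketch_submit(struct sketch_io *io);

/* Plays the role of vbdev_passthru_resubmit_io(): retry the original I/O. */
static void sketch_resubmit(void *cb_arg)
{
    sketch_submit(cb_arg);
}

static void sketch_submit(struct sketch_io *io)
{
    int rc = sketch_base_submit(io->base);

    if (rc == 0) {
        io->submitted = 1;
    } else if (rc == -ENOMEM) {
        /* Not a failure yet: park a wait entry and retry later. */
        io->wait.cb_fn = sketch_resubmit;
        io->wait.cb_arg = io;
        io->base->waiter = &io->wait;
    } else {
        io->failed = 1;          /* any other error fails the original I/O */
    }
}

/* A base I/O was freed: return the slot and wake the waiter, if any. */
static void sketch_base_release(struct sketch_base *base)
{
    struct sketch_wait *w = base->waiter;

    base->free_slots++;
    if (w != NULL) {
        base->waiter = NULL;
        w->cb_fn(w->cb_arg);
    }
}
```

The wait entry lives inside the per-I/O state, so queuing a retry needs no allocation at the moment memory is already short.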

Destruct:

  • vbdev_passthru_destruct() removes the node from the module list.
  • It releases the claim with spdk_bdev_module_release_bdev().
  • It closes the base descriptor on the thread that opened it, using spdk_thread_send_msg() if needed.
  • It unregisters the io_device and frees the virtual bdev object.
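
The "close on the opening thread" step can be modeled with a one-slot mailbox per thread. This is a sketch under assumptions: the sketch_ names are hypothetical, g_current_thread stands in for asking SPDK which thread is running, and sketch_send_msg() only imitates the role of spdk_thread_send_msg().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*sketch_msg_fn)(void *arg);

/* A "thread" here is just a mailbox holding at most one pending message. */
struct sketch_thread {
    sketch_msg_fn pending_fn;
    void *pending_arg;
};

/* Which thread is executing right now (set by the poll loop below). */
static struct sketch_thread *g_current_thread;

struct sketch_desc {
    struct sketch_thread *opened_on;
    bool open;
};

static void sketch_send_msg(struct sketch_thread *t, sketch_msg_fn fn, void *arg)
{
    t->pending_fn = fn;
    t->pending_arg = arg;
}

/* Run one pending message in the context of thread t. */
static void sketch_thread_poll(struct sketch_thread *t)
{
    sketch_msg_fn fn = t->pending_fn;

    g_current_thread = t;
    if (fn != NULL) {
        t->pending_fn = NULL;
        fn(t->pending_arg);
    }
}

/* Closing is only legal on the thread that opened the descriptor. */
static void sketch_close(void *arg)
{
    struct sketch_desc *desc = arg;

    assert(g_current_thread == desc->opened_on);
    desc->open = false;
}

/* Destruct-time close: run inline if possible, otherwise hop threads. */
static void sketch_destruct_close(struct sketch_desc *desc)
{
    if (g_current_thread == desc->opened_on) {
        sketch_close(desc);
    } else {
        sketch_send_msg(desc->opened_on, sketch_close, desc);
    }
}
```

When destruct runs on a different thread, the descriptor stays open until the opening thread next processes its messages, so later teardown steps must not assume the close has already happened.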

Hotremove:

  • The base descriptor was opened with vbdev_passthru_base_bdev_event_cb() as its event callback.
  • On SPDK_BDEV_EVENT_REMOVE, passthru unregisters the virtual bdev.

Misconception to kill: forwarding an I/O does not mean reusing the same struct spdk_bdev_io for the base device. The virtual module receives one bdev I/O, submits another bdev I/O to the base, then completes the original when the base I/O completes.

A KISS Module Checklist

For a physical module:

  1. Define backend object containing struct spdk_bdev.
  2. Define per-channel state if needed.
  3. Define per-I/O context if needed.
  4. Define struct spdk_bdev_module.
  5. Register it with SPDK_BDEV_MODULE_REGISTER().
  6. Implement get_ctx_size().
  7. Implement module_init() and register any io_device.
  8. Implement module_fini() and unregister io_device.
  9. Implement struct spdk_bdev_fn_table.
  10. Fill bdev name, geometry, capabilities, module, function table, and context.
  11. Call spdk_bdev_register().
  12. Complete every submitted I/O exactly once.
  13. Free resources in destruct().

For a virtual module, add:

  1. Store configured base name and virtual name.
  2. Use examine to attach when base appears.
  3. Open base with an event callback.
  4. Claim base before exposing the virtual bdev.
  5. Create virtual io_device.
  6. Get and put base channels in virtual channel create/destroy.
  7. Forward I/O through public bdev APIs.
  8. Translate or copy completion status.
  9. On base remove, unregister virtual bdev.
  10. On destruct, release claim and close base descriptor on the correct thread.

The Hard Rules

Do not block

Module callbacks run on SPDK threads. Blocking in submit_request() stalls that thread and all other work scheduled there. Use pollers, asynchronous APIs, and messages.

Relevant source anchors:

  • module/bdev/null/bdev_null.c:null_io_poll().
  • module/bdev/passthru/vbdev_passthru.c:vbdev_passthru_resubmit_io().

Complete exactly once

Every I/O delivered to submit_request() must be completed exactly once. A missing completion hangs upper layers. A double completion trips bdev core assertions.

Relevant source anchor: lib/bdev/bdev.c:spdk_bdev_io_complete().

Respect thread ownership

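Descriptors and many lifecycle actions are thread-bound. In particular, a descriptor must be closed with spdk_bdev_close() on the same thread that opened it, which is why passthru's destruct path sends a message to the opening thread when it runs elsewhere.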

Relevant source anchors:

  • lib/bdev/bdev.c:spdk_bdev_close().
  • module/bdev/passthru/vbdev_passthru.c:vbdev_passthru_destruct().

Separate "submission failed" from "I/O failed"

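If a public bdev API returns nonzero, the new I/O was never submitted and its completion callback will not run. A virtual module must then either complete the original I/O itself or, for -ENOMEM, queue a retry with spdk_bdev_queue_io_wait(). If the base I/O was submitted and later completes unsuccessfully, copy that status to the original I/O in your completion callback.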

Relevant source anchors:

  • module/bdev/passthru/vbdev_passthru.c:vbdev_passthru_submit_request().
  • module/bdev/passthru/vbdev_passthru.c:_pt_complete_io().

Do not invent unsupported capabilities

io_type_supported() must reflect reality. If your module forwards to a base bdev, either mirror the base or deliberately restrict it.

Relevant source anchor: module/bdev/passthru/vbdev_passthru.c:vbdev_passthru_io_type_supported().
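
The two honest policies, mirror or deliberately restrict, can be compared side by side. The sketch below assumes hypothetical sketch_ names and a boolean table in place of querying a real base bdev; the restricted variant models an imaginary layer that cannot handle unmap even when its base can.

```c
#include <stdbool.h>

enum sketch_io_type {
    SKETCH_IO_READ,
    SKETCH_IO_WRITE,
    SKETCH_IO_UNMAP,
    SKETCH_IO_FLUSH,
    SKETCH_IO_NUM_TYPES
};

/* Stand-in for a base bdev: a table of what it really supports. */
struct sketch_base {
    bool supports[SKETCH_IO_NUM_TYPES];
};

static bool sketch_base_io_type_supported(const struct sketch_base *base,
                                          enum sketch_io_type type)
{
    return base->supports[type];
}

/* Policy 1: mirror the base exactly. */
static bool sketch_mirror_io_type_supported(const struct sketch_base *base,
                                            enum sketch_io_type type)
{
    return sketch_base_io_type_supported(base, type);
}

/* Policy 2: restrict. Never advertise unmap; mirror the base otherwise. */
static bool sketch_no_unmap_io_type_supported(const struct sketch_base *base,
                                              enum sketch_io_type type)
{
    switch (type) {
    case SKETCH_IO_UNMAP:
        return false;
    default:
        return sketch_base_io_type_supported(base, type);
    }
}
```

Neither policy can report more than the base supports, which is the property that keeps upper layers from submitting I/O the stack cannot serve.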

Prose Diagram

Imagine two side-by-side diagrams.

Physical module diagram:

module_init() creates module state and io_device. An RPC calls create(). The create function allocates a backend object containing struct spdk_bdev, fills fields, and calls spdk_bdev_register(). Later, bdev core calls module submit_request(), and the module completes I/O from a poller or callback.

Virtual module diagram:

An RPC stores "base bdev name -> virtual bdev name." Examine sees the base bdev. The module opens the base, claims it, copies geometry, registers its own bdev, and creates per-thread channels that each hold a base channel. I/O enters the virtual bdev and is submitted to the base bdev as a new I/O. When the base completion returns, the virtual module completes the original I/O.

Edge Cases And Failure Modes

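  • Create succeeds partly, then register fails: undo each completed step in the reverse order it was taken (for passthru: release the claim, unregister the io_device, close the descriptor), then free the allocated object and its name.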
  • Base bdev not present yet: virtual modules may store configuration and create later during examine.
  • Base bdev removed: event callback must unregister virtual bdevs that depend on it.
  • Base descriptor opened on another thread: close it on the original thread.
  • Claim fails: another module or writer already owns the base. Do not register the virtual bdev.
  • spdk_bdev_register() fails after claim: release the claim and close the descriptor.
  • Base submission returns -ENOMEM: queue wait or fail the original I/O intentionally.
  • Base submission returns other error: complete original I/O failed.
  • Base completion carries NVMe/SCSI/AIO status: use helper functions to preserve error detail when possible.
  • Module supports reset but cannot reset safely: either fail reset or forward it carefully. Do not pretend reset happened.
  • Asynchronous destruct needed: return 1 from destruct() and call spdk_bdev_destruct_done() later.
  • Missing spdk_bdev_module_examine_done(): bdev core keeps counting the examine as in progress, which can hang subsystem progress.
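
The asynchronous destruct case in the list above follows a small two-phase protocol. Here is a minimal model of it, assuming hypothetical sketch_ names; sketch_destruct_done() imitates the role of spdk_bdev_destruct_done(), and sketch_core_unregister() is a heavily simplified stand-in for what bdev core does with the return value.

```c
#include <stdbool.h>

struct sketch_bdev {
    bool cleanup_pending;   /* module still has asynchronous work to finish */
    bool destruct_done;     /* core has been told that destruct finished */
    int destruct_errno;
};

/* Stand-in for the "destruct finished" notification to bdev core. */
static void sketch_destruct_done(struct sketch_bdev *bdev, int bdeverrno)
{
    bdev->destruct_done = true;
    bdev->destruct_errno = bdeverrno;
}

/* Module destruct callback: start async cleanup and return 1 ("not done yet"). */
static int sketch_destruct(void *ctxt)
{
    struct sketch_bdev *bdev = ctxt;

    bdev->cleanup_pending = true;
    return 1;
}

/* Called later, when the module's asynchronous cleanup has completed. */
static void sketch_cleanup_finished(struct sketch_bdev *bdev)
{
    bdev->cleanup_pending = false;
    sketch_destruct_done(bdev, 0);
}

/* Core side, simplified: a return of 1 means "wait for destruct_done". */
static void sketch_core_unregister(struct sketch_bdev *bdev)
{
    int rc = sketch_destruct(bdev);

    if (rc != 1) {
        sketch_destruct_done(bdev, rc);  /* synchronous destruct path */
    }
}
```

The module must not free the object that embeds the bdev until it has reported completion, because the core still refers to the bdev between the two phases.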

Misconceptions To Kill

  • "A virtual bdev can skip claims if it is just forwarding." Not if it needs to protect write ownership and stacking semantics.
  • "A module can malloc per-I/O context in submit_request()." It can, but usually should not. Report the size from get_ctx_size() and bdev core reserves that much space in every bdev I/O (the driver_ctx area) for the common context.
  • "The base bdev event callback is optional." If you stack on a base bdev, ignoring remove events creates dangling state.
  • "A create RPC should always fail if the base bdev is missing." Some virtual modules intentionally defer creation until examine sees the base.
  • "The destructor can free everything immediately." Only if no asynchronous close/unregister/device cleanup remains.
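
The get_ctx_size() point above is easier to see with a toy allocator. The sketch below is a conceptual model only: the sketch_ names are hypothetical, calloc() stands in for bdev core's own I/O allocation, and the per-I/O fields are invented for illustration. It shows module context living in the tail of the I/O object, so submit_request() allocates nothing.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for a bdev I/O: core fields first, module context at the tail. */
struct sketch_bdev_io {
    int type;
    /* Aligned so that any module context struct can be placed here. */
    _Alignas(max_align_t) unsigned char driver_ctx[];
};

/* Hypothetical per-I/O state a module wants attached to each I/O. */
struct sketch_module_io_ctx {
    int retries;
    void *wait_arg;
};

/* What the module reports through its get_ctx_size callback. */
static int sketch_get_ctx_size(void)
{
    return (int)sizeof(struct sketch_module_io_ctx);
}

/* Core side, conceptually: allocate the I/O plus the requested context. */
static struct sketch_bdev_io *sketch_core_alloc_io(int ctx_size)
{
    return calloc(1, sizeof(struct sketch_bdev_io) + (size_t)ctx_size);
}

/* Module side: reach the context without any per-I/O allocation. */
static struct sketch_module_io_ctx *sketch_io_ctx(struct sketch_bdev_io *io)
{
    return (struct sketch_module_io_ctx *)io->driver_ctx;
}
```

The context is created and freed together with the I/O object, which removes one allocation and one failure path from the hot submit path.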

Source Reading Exercise

Read module/bdev/passthru/vbdev_passthru.c in this order:

  1. passthru_if.
  2. struct vbdev_passthru.
  3. vbdev_passthru_insert_name().
  4. vbdev_passthru_register().
  5. pt_bdev_ch_create_cb().
  6. vbdev_passthru_submit_request().
  7. _pt_complete_io().
  8. vbdev_passthru_base_bdev_event_cb().
  9. vbdev_passthru_destruct().
  10. vbdev_passthru_examine().

Questions:

  • Which function opens the base bdev?
  • Which function claims it?
  • Which fields are copied from base bdev to virtual bdev?
  • Where does the virtual bdev get a base channel?
  • Where is the original I/O completed?
  • What happens when the base bdev is removed?

Operational Lab

Design a "readonly passthru" module on paper.

Requirements:

  • It creates a virtual bdev on top of one base bdev.
  • It allows reads and flushes.
  • It rejects writes, unmap, write zeroes, copy, and zcopy.
  • It unregisters when the base is removed.

Write:

  • The module object fields.
  • The function table fields.
  • The virtual bdev fields you would copy from the base.
  • The io_type_supported() policy.
  • The submit_request() switch cases.
  • The destruct order.

Then compare your plan to passthru. You should be able to make the design by deleting or failing cases, not by inventing a new architecture.

Self-Check

  1. What does get_ctx_size() buy you?
  2. Why does passthru store base_desc instead of only base_bdev?
  3. Why does a virtual bdev need per-thread channels?
  4. What must happen if a base submission API returns -ENOMEM?
  5. What is the difference between spdk_bdev_unregister() and spdk_bdev_destruct_done()?
  6. Why must examine callbacks call spdk_bdev_module_examine_done()?
  7. Why should io_type_supported() be conservative?
  8. What is the safest cleanup order after virtual bdev registration fails?

References

  • Local source: include/spdk/bdev_module.h.
  • Local source: lib/bdev/bdev.c.
  • Local source: module/bdev/null/bdev_null.c.
  • Local source: module/bdev/passthru/vbdev_passthru.c.
  • SPDK documentation: https://spdk.io/doc/