Beginner Mental Model
RAID in SPDK is a virtual bdev module. It takes several base bdevs and exposes one logical bdev. The RAID bdev's submit_request function maps logical offsets to base-device offsets and submits one or more child IOs. The bdev user sees one device; the RAID module sees a set of base devices, metadata, state, and sometimes a background process such as rebuild.
Prose diagram:
    base bdev 0    base bdev 1    base bdev 2
          \             |             /
           \            |            /
            +------ RAID bdev ------+
                        |
                 bdev consumers
RAID is both a data mapping layer and a lifecycle layer. It has to answer:
- Which base bdevs belong to this RAID bdev?
- Is the array online, configuring, or offline?
- Can IO proceed with missing bases?
- Does metadata on disk identify the array?
- Is a rebuild or other process active?
- What happens if a base bdev is removed or resized?
Source Map
Core RAID files:
- module/bdev/raid/bdev_raid.h
- module/bdev/raid/bdev_raid.c
- module/bdev/raid/bdev_raid_sb.c
- module/bdev/raid/bdev_raid_rpc.c
RAID level modules:
- module/bdev/raid/raid0.c
- module/bdev/raid/raid1.c
- module/bdev/raid/concat.c
- module/bdev/raid/raid5f.c
Tests:
- test/bdev/bdev_raid.sh
- test/unit/lib/bdev/raid/bdev_raid.c/bdev_raid_ut.c
- test/unit/lib/bdev/raid/bdev_raid_sb.c/bdev_raid_sb_ut.c
- test/unit/lib/bdev/raid/raid0.c/raid0_ut.c
- test/unit/lib/bdev/raid/raid1.c/raid1_ut.c
- test/unit/lib/bdev/raid/concat.c/concat_ut.c
- test/unit/lib/bdev/raid/raid5f.c/raid5f_ut.c
Core Objects
Private structures and enums:
- module/bdev/raid/bdev_raid.h:enum raid_bdev_state
- module/bdev/raid/bdev_raid.h:enum raid_process_type
- module/bdev/raid/bdev_raid.h:struct raid_base_bdev_info
- module/bdev/raid/bdev_raid.h:struct raid_bdev
- module/bdev/raid/bdev_raid.h:struct raid_bdev_io
- module/bdev/raid/bdev_raid.h:struct raid_bdev_io_channel
- module/bdev/raid/bdev_raid.h:struct raid_bdev_process_request
- module/bdev/raid/bdev_raid.h:struct raid_bdev_module
Public-ish module APIs inside RAID:
- module/bdev/raid/bdev_raid.h:raid_bdev_create
- module/bdev/raid/bdev_raid.h:raid_bdev_delete
- module/bdev/raid/bdev_raid.h:raid_bdev_add_base_bdev
- module/bdev/raid/bdev_raid.h:raid_bdev_remove_base_bdev
- module/bdev/raid/bdev_raid.h:raid_bdev_alloc_superblock
- module/bdev/raid/bdev_raid.h:raid_bdev_write_superblock
- module/bdev/raid/bdev_raid.h:raid_bdev_load_base_bdev_superblock
The struct raid_bdev_module interface is how RAID0, RAID1, concat, and raid5f plug into common RAID lifecycle code. It includes level-specific functions for configuring, stopping, submitting read/write requests, submitting null-payload requests, and background process requests.
RAID States
State anchors:
- module/bdev/raid/bdev_raid.h:enum raid_bdev_state
- module/bdev/raid/bdev_raid.c:g_raid_state_names
- module/bdev/raid/bdev_raid_rpc.c:rpc_bdev_raid_get_bdevs
Important states:
- Online: the RAID bdev is registered with the bdev layer.
- Configuring: not all needed information or base devices are present yet.
- Offline: the RAID bdev is not available for normal IO.
Beginner misconception to kill: "configured in JSON" and "online bdev exists" are not the same state. A RAID bdev can exist as an in-memory configuration object before it is registered as a bdev.
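To make the distinction concrete, here is a toy Python model of the three states. It is a sketch for intuition only: RaidState, is_registered_bdev, and filter_by_category are hypothetical names, not SPDK identifiers, and the only rule it encodes is the one stated above (online means registered with the bdev layer).

```python
from enum import Enum


class RaidState(Enum):
    """Toy mirror of the three states described above (not SPDK's enum)."""
    ONLINE = "online"
    CONFIGURING = "configuring"
    OFFLINE = "offline"


def is_registered_bdev(state):
    # Per the text above, only an online array is registered with the bdev layer.
    return state is RaidState.ONLINE


def filter_by_category(arrays, category):
    """Return sorted array names matching 'all' or one state name.

    arrays: dict mapping array name -> RaidState (a hypothetical in-memory table).
    """
    if category == "all":
        return sorted(arrays)
    wanted = RaidState(category)  # raises ValueError for an unknown category
    return sorted(name for name, state in arrays.items() if state is wanted)
```

An array in the CONFIGURING state shows up under the "all" and "configuring" categories in this model even though is_registered_bdev() is false for it.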
RAID Creation
RPC and implementation anchors:
- module/bdev/raid/bdev_raid_rpc.c:rpc_bdev_raid_create
- module/bdev/raid/bdev_raid.c:raid_bdev_create
- module/bdev/raid/bdev_raid.c:_raid_bdev_create
- module/bdev/raid/bdev_raid.c:raid_bdev_add_base_bdev
- module/bdev/raid/bdev_raid.c:raid_bdev_configure_base_bdev
- module/bdev/raid/bdev_raid.c:raid_bdev_configure
- module/bdev/raid/bdev_raid.c:raid_bdev_configure_cont
Creation flow:
bdev_raid_create RPC (handled by rpc_bdev_raid_create, which calls raid_bdev_create)
-> allocate raid_bdev object
-> store level, strip size, expected base count, UUID/superblock flags
-> add configured base names
-> when enough base bdevs are open and configured:
       initialize bdev fields
       optionally write superblocks
       register RAID bdev
During configuration, RAID opens base bdevs, creates IO channels, checks block sizes and strip constraints, stores base slots, and claims bases.
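The "exists in memory, then goes online" sequence can be sketched as a small Python model. This is an illustration under simplifying assumptions, not the SPDK implementation: ToyRaidConfig and base_appeared are hypothetical names, the toy requires every listed base before going online, and the only validation it performs is a block-size match.

```python
class ToyRaidConfig:
    """Hypothetical stand-in for an in-memory RAID object; not struct raid_bdev."""

    def __init__(self, name, base_names, block_size):
        self.name = name
        self.block_size = block_size
        # One slot per configured base name; False means "not opened yet".
        self.configured = {base: False for base in base_names}
        self.state = "configuring"

    def base_appeared(self, base_name, block_size):
        """Record that a named base opened; go online once every slot is filled."""
        if base_name not in self.configured:
            raise KeyError(f"{base_name} is not a member of {self.name}")
        if block_size != self.block_size:
            raise ValueError("base block size does not match the array")
        self.configured[base_name] = True
        if all(self.configured.values()):
            # The point where the real module would register the RAID bdev.
            self.state = "online"
        return self.state
```

Creating ToyRaidConfig("raid1", ["m0", "m1"], 512) and reporting only "m0" leaves the object in the configuring state; it becomes online after "m1" is reported too.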
RAID bdev IO Path
Common entry:
- module/bdev/raid/bdev_raid.c:raid_bdev_submit_request
- module/bdev/raid/bdev_raid.c:raid_bdev_io_init
- module/bdev/raid/bdev_raid.c:raid_bdev_submit_rw_request
- module/bdev/raid/bdev_raid.c:raid_bdev_submit_null_payload_request
- module/bdev/raid/bdev_raid.c:raid_bdev_io_complete
- module/bdev/raid/bdev_raid.c:raid_bdev_queue_io_wait
- module/bdev/raid/bdev_raid.c:raid_bdev_io_split
- module/bdev/raid/bdev_raid.c:g_raid_bdev_fn_table
Level-specific entries:
- module/bdev/raid/raid0.c:raid0_submit_rw_request
- module/bdev/raid/raid0.c:raid0_submit_null_payload_request
- module/bdev/raid/raid1.c:raid1_submit_rw_request
- module/bdev/raid/raid1.c:raid1_submit_read_request
- module/bdev/raid/raid1.c:raid1_submit_write_request
- module/bdev/raid/raid1.c:raid1_submit_null_payload_request
- module/bdev/raid/concat.c:concat_submit_rw_request
- module/bdev/raid/concat.c:concat_submit_null_payload_request
- module/bdev/raid/raid5f.c:raid5f_submit_rw_request
- module/bdev/raid/raid5f.c:raid5f_submit_read_request
- module/bdev/raid/raid5f.c:raid5f_submit_write_request
Common IO flow:
bdev user IO to RAID bdev
-> lib/bdev/bdev.c:bdev_submit_request
-> module/bdev/raid/bdev_raid.c:raid_bdev_submit_request
-> common RAID IO object setup
-> if background process affects range, adjust/split/channel-select
-> call level module submit function
-> submit one or more child IOs to base bdevs
-> collect child completions
-> complete original bdev_io
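The last two steps of the flow, collecting child completions and completing the original IO, follow a common fan-in pattern. The Python sketch below shows that pattern only. ParentIO and child_done are hypothetical names, and the "any child failure fails the parent" policy is an assumption chosen for simplicity; a redundant level may apply a different policy.

```python
class ParentIO:
    """Toy fan-in counter for one parent IO split into several child IOs."""

    def __init__(self, num_children):
        if num_children < 1:
            raise ValueError("a parent IO needs at least one child")
        self.remaining = num_children
        self.status = "success"
        self.done = False

    def child_done(self, ok):
        """Account for one child completion; return True when the parent completes."""
        if self.done:
            raise RuntimeError("parent already completed")
        if not ok:
            # Assumed policy: remember the failure, but keep waiting for siblings.
            self.status = "failed"
        self.remaining -= 1
        if self.remaining == 0:
            self.done = True
        return self.done
```

The parent completes exactly once, after the last child reports, regardless of the order in which children finish.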
RAID0 Mapping
Source anchors:
- module/bdev/raid/raid0.c:raid0_submit_rw_request
- module/bdev/raid/raid0.c:raid0_submit_null_payload_request
- module/bdev/raid/raid0.c:g_raid0_module
RAID0 stripes logical blocks across base devices. Reads and writes go to the base that owns the stripe segment. Large IO may split across strips and across base devices.
Prose diagram for strip size S, bases 0 and 1:
logical blocks:
0..S-1 -> base0 offset 0
S..2S-1 -> base1 offset 0
2S..3S-1 -> base0 offset S
3S..4S-1 -> base1 offset S
RAID0 has no redundancy. Losing a base means the array cannot serve complete data.
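The table above is plain integer arithmetic. The following Python sketch reproduces it and also splits a larger IO at strip boundaries. raid0_map and raid0_split are hypothetical helper names, all units are blocks, and the code models only the mapping, not where in the stack SPDK performs the split.

```python
def raid0_map(lba, strip_size, num_bases):
    """Map a logical block to (base index, block offset on that base)."""
    strip_index = lba // strip_size        # which strip, counted across all bases
    offset_in_strip = lba % strip_size
    base_index = strip_index % num_bases   # strips rotate over the bases
    stripe_index = strip_index // num_bases  # full rows completed before this strip
    return base_index, stripe_index * strip_size + offset_in_strip


def raid0_split(lba, num_blocks, strip_size, num_bases):
    """Split one logical IO into (base index, base offset, length) pieces."""
    children = []
    while num_blocks > 0:
        base_index, base_lba = raid0_map(lba, strip_size, num_bases)
        # A piece never crosses the end of the current strip.
        run = min(num_blocks, strip_size - lba % strip_size)
        children.append((base_index, base_lba, run))
        lba += run
        num_blocks -= run
    return children
```

With strip_size=8 and two bases, a 4-block IO starting at logical block 6 becomes two pieces: 2 blocks on base 0 at offset 6, then 2 blocks on base 1 at offset 0.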
RAID1 Mapping
Source anchors:
- module/bdev/raid/raid1.c:raid1_submit_rw_request
- module/bdev/raid/raid1.c:raid1_submit_read_request
- module/bdev/raid/raid1.c:raid1_submit_write_request
- module/bdev/raid/raid1.c:raid1_submit_process_request
- module/bdev/raid/raid1.c:g_raid1_module
RAID1 mirrors data. A write is submitted to multiple operational bases. A read may choose an operational base. Rebuild copies data from an existing operational mirror to a replacement target.
Prose diagram:
write LBA X:
    submit to base0 LBA X
    submit to base1 LBA X
    complete successfully only if the writes required by the write policy succeed
read LBA X:
    choose one operational base
    submit read
RAID1 can operate degraded if enough mirrors remain. The exact thresholds are represented in struct raid_bdev fields such as min_base_bdevs_operational.
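The read/write asymmetry can be sketched in a few lines of Python. The function names are hypothetical, and the read policy shown (pick the operational mirror with the fewest outstanding reads, lowest index on ties) is an assumption made for illustration; check raid1_submit_read_request for the policy the module actually uses.

```python
def raid1_write_targets(operational):
    """A mirrored write goes to every operational base (same LBA on each)."""
    return [index for index, ok in enumerate(operational) if ok]


def raid1_read_target(operational, outstanding_reads):
    """Pick one operational base for a read.

    Assumed policy: least outstanding reads wins, lowest index breaks ties.
    """
    candidates = raid1_write_targets(operational)
    if not candidates:
        raise RuntimeError("no operational mirror left")
    return min(candidates, key=lambda index: (outstanding_reads[index], index))
```

With mirrors [up, down, up], a write fans out to bases 0 and 2, while a read goes to only one of them.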
Concat Mapping
Source anchors:
- module/bdev/raid/concat.c:concat_submit_rw_request
- module/bdev/raid/concat.c:concat_submit_null_payload_request
- module/bdev/raid/concat.c:g_concat_module
Concat appends base devices end-to-end. It has no striping or redundancy. It is useful as a simple many-to-one mapping example:
logical 0..end(base0)-1 -> base0
logical end(base0)..next-1 -> base1
logical next..next+base2-1 -> base2
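The mapping above amounts to a lookup in a table of cumulative base sizes. The Python sketch below is a minimal model of that idea; concat_map is a hypothetical name, sizes are in blocks, and any strip-size rounding the real module applies to each base is deliberately ignored.

```python
from bisect import bisect_right
from itertools import accumulate


def concat_map(lba, base_sizes):
    """Map a logical block to (base index, block offset on that base)."""
    if not base_sizes:
        raise ValueError("at least one base is required")
    ends = list(accumulate(base_sizes))  # logical end (exclusive) of each base
    if lba < 0 or lba >= ends[-1]:
        raise ValueError("logical block is outside the concat bdev")
    base_index = bisect_right(ends, lba)  # first base whose end is beyond lba
    start = ends[base_index - 1] if base_index else 0
    return base_index, lba - start
```

For bases of 100, 50, and 200 blocks, logical block 100 is the first block of base 1 and logical block 150 is the first block of base 2.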
RAID5f Mapping
Source anchors:
- module/bdev/raid/raid5f.c:raid5f_submit_rw_request
- module/bdev/raid/raid5f.c:raid5f_submit_read_request
- module/bdev/raid/raid5f.c:raid5f_submit_write_request
- module/bdev/raid/raid5f.c:raid5f_submit_reconstruct_read
- module/bdev/raid/raid5f.c:raid5f_submit_process_request
- module/bdev/raid/raid5f.c:g_raid5f_module
RAID5f is more complex because it stores parity and may reconstruct reads when a chunk is unavailable. The implementation has stripe request helpers and process support for reconstructing data to a target.
Beginner path: understand RAID0 and RAID1 first. Then read raid5f_submit_read_request() and look for when it calls raid5f_submit_reconstruct_read().
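Before reading the reconstruct path, it helps to see the underlying idea in isolation. The Python sketch below shows generic single-parity XOR reconstruction, the textbook RAID5 technique. It is not taken from raid5f.c: xor_blocks and reconstruct_missing are hypothetical names, and the chunk layout (data chunks followed by one parity chunk) is an assumption.

```python
def xor_blocks(blocks):
    """XOR equally sized byte strings together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        if len(block) != len(out):
            raise ValueError("all chunks must have the same length")
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def reconstruct_missing(chunks, missing_index):
    """Rebuild one unavailable chunk of a stripe from all the others.

    chunks holds every data chunk plus the parity chunk; the entry at
    missing_index is ignored (it may be None).
    """
    survivors = [c for i, c in enumerate(chunks) if i != missing_index]
    return xor_blocks(survivors)
```

Because parity is the XOR of all data chunks, XOR-ing the surviving data chunks with the parity chunk yields the missing one.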
Superblocks
Source anchors:
- module/bdev/raid/bdev_raid_sb.c:raid_bdev_alloc_superblock
- module/bdev/raid/bdev_raid_sb.c:raid_bdev_init_superblock
- module/bdev/raid/bdev_raid_sb.c:raid_bdev_write_superblock
- module/bdev/raid/bdev_raid_sb.c:_raid_bdev_write_superblock
- module/bdev/raid/bdev_raid_sb.c:raid_bdev_write_superblock_cb
- module/bdev/raid/bdev_raid_sb.c:raid_bdev_load_base_bdev_superblock
- module/bdev/raid/bdev_raid.c:raid_bdev_examine_load_sb
- module/bdev/raid/bdev_raid.c:raid_bdev_examine_sb
- module/bdev/raid/bdev_raid.c:raid_bdev_create_from_sb
Superblocks let RAID discover arrays from base bdev metadata. During examine, the module may read a base bdev superblock, create the RAID object if needed, and then load/configure other base devices.
Superblocks are also written after configuration changes such as base removal, resize, or process completion.
Edge cases:
- Superblock version may be newer than the running code expects.
- A base may have stale RAID metadata.
- Some arrays may be config-only without superblock discovery.
- Superblock writes can fail and leave the array in a transitional state that examine/recovery must handle later.
Degraded Mode And Base Removal
Source anchors:
- module/bdev/raid/bdev_raid.c:raid_bdev_remove_base_bdev
- module/bdev/raid/bdev_raid.c:_raid_bdev_remove_base_bdev
- module/bdev/raid/bdev_raid.c:raid_bdev_remove_base_bdev_quiesce
- module/bdev/raid/bdev_raid.c:raid_bdev_remove_base_bdev_on_quiesced
- module/bdev/raid/bdev_raid.c:raid_bdev_remove_base_bdev_cont
- module/bdev/raid/bdev_raid.c:raid_bdev_remove_base_bdev_do_remove
- module/bdev/raid/bdev_raid.c:raid_bdev_deconfigure
- module/bdev/raid/bdev_raid.c:raid_bdev_deconfigure_base_bdev
When a base bdev is removed, RAID may:
- Quiesce the RAID bdev or affected ranges.
- Reset the base.
- Stop or modify an active background process.
- Deconfigure the base slot.
- Deconfigure or unregister the RAID bdev if too few bases remain.
- Write updated superblocks.
- Complete the remove callback asynchronously.
Beginner misconception to kill: base removal is not just clearing a pointer. The RAID bdev may still have in-flight IO, per-thread channels, child IOs waiting on resources, superblock updates, and rebuild state.
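One decision inside that sequence is easy to isolate: whether the array can keep running after the loss. The Python sketch below models only that check against a minimum-operational threshold such as min_base_bdevs_operational. The function name and the returned labels are hypothetical, and the threshold values in the usage note are assumptions; the asynchronous quiesce, superblock, and callback steps are not modeled.

```python
def after_base_removal(num_operational, min_operational):
    """Decide what happens to the array once one operational base is removed."""
    if num_operational < 1:
        raise ValueError("there is no operational base to remove")
    remaining = num_operational - 1
    if remaining >= min_operational:
        return "online-degraded"  # keep serving IO without the removed base
    return "deconfigure"          # too few bases remain for this array
```

Assuming a two-way mirror needs at least one operational base, after_base_removal(2, 1) keeps it online in degraded mode. Assuming a two-base stripe needs both, after_base_removal(2, 2) takes it down.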
Rebuild And Background Processes
Source anchors:
- module/bdev/raid/bdev_raid.h:enum raid_process_type
- module/bdev/raid/bdev_raid.c:raid_bdev_start_rebuild
- module/bdev/raid/bdev_raid.c:raid_bdev_process_alloc
- module/bdev/raid/bdev_raid.c:raid_bdev_process_start
- module/bdev/raid/bdev_raid.c:raid_bdev_process_thread_init
- module/bdev/raid/bdev_raid.c:raid_bdev_process_thread_run
- module/bdev/raid/bdev_raid.c:raid_bdev_process_lock_window_range
- module/bdev/raid/bdev_raid.c:raid_bdev_submit_process_request
- module/bdev/raid/bdev_raid.c:raid_bdev_process_request_complete
- module/bdev/raid/bdev_raid.c:raid_bdev_process_finish
- module/bdev/raid/bdev_raid.c:raid_bdev_process_finish_write_sb
- module/bdev/raid/raid1.c:raid1_submit_process_request
- module/bdev/raid/raid5f.c:raid5f_submit_process_request
Rebuild is implemented as a background process. It has a target base bdev, a window offset, a maximum window size, request objects, optional bandwidth limiting, and finish actions. The process locks/quiesces a range, reconstructs/copies data, updates channel state so normal IO sees the correct mapping, unlocks the range, and moves to the next window.
Prose diagram:
replacement base added
-> mark replacement as process target
-> start rebuild process
-> for each window:
       quiesce RAID range
       submit process IO through level-specific module
       update channels so processed range can use replacement
       unquiesce range
-> write superblocks
-> clear process target
Key subtlety: normal foreground IO may overlap the rebuild boundary. Start from module/bdev/raid/bdev_raid.c:raid_bdev_submit_request and follow the RAID-level submit path into the RAID1 process code to see how foreground IO is handled around processed and unprocessed rebuild ranges.
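Two pieces of this are simple enough to model directly: walking the device in windows, and classifying a foreground IO against the current rebuild offset. The Python sketch below does both. rebuild_windows and split_at_rebuild_offset are hypothetical names, offsets are in blocks, and the sketch assumes a single moving boundary with everything below it already processed; locking and channel updates are not modeled.

```python
def rebuild_windows(total_blocks, max_window):
    """Yield (offset, size) for each rebuild window, in order."""
    if max_window < 1:
        raise ValueError("window size must be positive")
    offset = 0
    while offset < total_blocks:
        size = min(max_window, total_blocks - offset)
        yield offset, size
        offset += size


def split_at_rebuild_offset(lba, num_blocks, processed_offset):
    """Split a foreground IO into processed and unprocessed parts.

    Blocks below processed_offset have been rebuilt and may use the target
    base; blocks at or above it must not rely on the target yet.
    """
    end = lba + num_blocks
    if end <= processed_offset:
        return [("processed", lba, num_blocks)]
    if lba >= processed_offset:
        return [("unprocessed", lba, num_blocks)]
    return [
        ("processed", lba, processed_offset - lba),
        ("unprocessed", processed_offset, end - processed_offset),
    ]
```

With the boundary at block 8, a 4-block IO starting at block 6 straddles it and is split into a 2-block processed part and a 2-block unprocessed part.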
Resize
Source anchors:
- module/bdev/raid/bdev_raid.c:raid_bdev_resize_base_bdev
- module/bdev/raid/bdev_raid.c:raid_bdev_resize_write_sb_cb
- module/bdev/raid/bdev_raid.c:raid_bdev_destruct
- module/bdev/raid/bdev_raid.c:raid_bdev_event_base_bdev
Base resize can affect RAID capacity. The RAID module must recalculate logical block count according to the RAID level and base sizes, notify the bdev layer, and persist new superblock information if superblocks are enabled.
Edge cases:
- One base grows but others do not; the RAID logical size may not change.
- A base shrinks below current mapping requirements; the array may need to be deconfigured or to fail.
- Resize during rebuild or removal has to interact with process/channel state.
- Upper layers such as lvolstore may need explicit grow operations after RAID grows.
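The first edge case follows from how capacity depends on the smallest base. The Python sketch below uses the textbook capacity formulas for each level. These formulas, including the rounding down to whole strips, are assumptions for illustration and should be checked against each level module's start and resize code; raid_capacity is a hypothetical name and sizes are in blocks.

```python
def raid_capacity(level, base_sizes, strip_size):
    """Return logical capacity in blocks under textbook assumptions."""
    if not base_sizes or strip_size < 1:
        raise ValueError("need at least one base and a positive strip size")
    n = len(base_sizes)
    # Striped levels can only use whole strips of the smallest base.
    aligned_min = (min(base_sizes) // strip_size) * strip_size
    if level == "raid0":
        return aligned_min * n
    if level == "raid1":
        return min(base_sizes)  # every mirror must hold the full address space
    if level == "concat":
        return sum((size // strip_size) * strip_size for size in base_sizes)
    if level == "raid5f":
        if n < 3:
            raise ValueError("raid5f needs at least three bases")
        return aligned_min * (n - 1)  # one chunk per stripe holds parity
    raise ValueError(f"unknown level: {level}")
```

Growing one mirror from 120 to 200 blocks while the other stays at 100 leaves the RAID1 capacity at 100 in this model, which is why a single-base grow may not change the logical size.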
JSON-RPC Surface
RPC anchors:
- module/bdev/raid/bdev_raid_rpc.c:rpc_bdev_raid_create
- module/bdev/raid/bdev_raid_rpc.c:rpc_bdev_raid_delete
- module/bdev/raid/bdev_raid_rpc.c:rpc_bdev_raid_get_bdevs
- module/bdev/raid/bdev_raid_rpc.c:rpc_bdev_raid_add_base_bdev
- module/bdev/raid/bdev_raid_rpc.c:rpc_bdev_raid_remove_base_bdev
- module/bdev/raid/bdev_raid_rpc.c:rpc_bdev_raid_set_options
The bdev_raid_get_bdevs RPC takes a category (all, online, configuring, or offline) that matches the state model. This makes it a practical debugging tool: a RAID bdev that is not visible as a registered bdev may still exist in the configuring or offline state.
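As a minimal sketch of what such a query looks like on the wire, the Python helper below builds a JSON-RPC 2.0 request body. build_rpc_request is a hypothetical helper, not part of SPDK's tooling; the method name bdev_raid_get_bdevs and its category parameter are the ones discussed above. Sending the request to the target's RPC socket is left out.

```python
import json


def build_rpc_request(method, params=None, request_id=1):
    """Serialize a JSON-RPC 2.0 request; params are omitted when empty."""
    request = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params:
        request["params"] = params
    return json.dumps(request)


# Ask for arrays that exist in memory but are not yet online.
configuring_query = build_rpc_request(
    "bdev_raid_get_bdevs", {"category": "configuring"}
)
```

Comparing the result of the "all" category with the "online" category is a quick way to spot arrays that are still waiting for bases.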
Stacking With lvol
Two common stacks:
RAID under lvol:
base0 + base1 -> RAID bdev -> lvolstore -> lvol bdevs
lvol under RAID:
lvol bdev A + lvol bdev B -> RAID bdev
RAID under lvol is usually easier to reason about for shared durability: one lvolstore sees one reliable-ish base device. lvol under RAID is possible as a bdev graph, but it creates coupling between independent lvolstores and RAID lifecycle. The right answer depends on product requirements, but the source-level responsibilities remain the same:
- Each layer claims its immediate base.
- Each layer handles remove/resize events from its base.
- Each layer exposes a new bdev that may be examined by other modules.
- Each layer translates IO and completion semantics.
Misconceptions To Kill
- "RAID is special outside bdev." No. It is a bdev module with a virtual bdev.
- "Online means all bases are present." Not necessarily for redundant RAID levels; degraded online operation may be allowed.
- "Configuring means broken." It may mean the module is waiting for more bases or metadata.
- "Superblocks are required for all RAID." Config-driven RAID can exist, but superblocks enable disk discovery.
- "Rebuild is a single IO." It is a windowed background process with quiesce/unquiesce and per-channel updates.
- "Resize of one base automatically grows every upper layer." RAID may resize its bdev, but lvolstore/blobstore grow is a separate operation.
Source Reading Exercise
Read the RAID1 write path:
1. module/bdev/raid/bdev_raid.c:raid_bdev_submit_request
2. module/bdev/raid/bdev_raid.c:raid_bdev_submit_rw_request
3. module/bdev/raid/raid1.c:raid1_submit_rw_request
4. module/bdev/raid/raid1.c:raid1_submit_write_request
5. module/bdev/raid/bdev_raid.c:raid_bdev_io_complete
Questions:
- How many child IOs can one write produce?
- Where does the level module decide which base bdevs receive IO?
- How are completions collected?
- Where would -ENOMEM or IO-wait retry enter the path?
Operational Lab
Use test/bdev/bdev_raid.sh as the main lab script.
Suggested manual lab:
1. Create three malloc bdevs.
2. Create RAID1 or RAID0 using two of them.
3. Run bdev_get_bdevs, then bdev_raid_get_bdevs with category all.
4. Create an lvolstore on the RAID bdev.
5. Create an lvol and write data.
6. Remove one RAID base bdev.
7. Observe RAID state and lvol IO behavior.
8. Add a replacement base.
9. Observe rebuild/process state.
10. Grow base bdevs and determine which layers see the new size.
Debug checklist:
- Is the RAID bdev online, configuring, or offline?
- Are all expected base names present?
- Did superblock load succeed?
- Is a background process active?
- Is a base marked is_process_target?
- Did upper layers wait for examine after base creation?
- Is the lvolstore on top of RAID using spdk_bs_grow_live() or an equivalent grow path after RAID size changes?
Self-Check
- What does RAID add beyond a simple bdev wrapper?
- Why can a RAID bdev exist but not be registered as an online bdev?
- Which functions read and write RAID superblocks?
- Why does rebuild quiesce ranges?
- How does RAID0 map logical offsets to base offsets?
- How does RAID1 differ for read and write?
- What happens if a base is removed during a background process?
- Why is RAID resize not the same as lvolstore grow?
References
- Local RAID core: module/bdev/raid/bdev_raid.c
- Local RAID structures: module/bdev/raid/bdev_raid.h
- Local RAID superblocks: module/bdev/raid/bdev_raid_sb.c
- Local RAID RPC: module/bdev/raid/bdev_raid_rpc.c
- Local RAID levels: module/bdev/raid/raid0.c, module/bdev/raid/raid1.c, module/bdev/raid/concat.c, module/bdev/raid/raid5f.c
- Local tests: test/bdev/bdev_raid.sh, test/unit/lib/bdev/raid/bdev_raid.c/bdev_raid_ut.c