Chapter 3: What A Block Device Promises | SPDK From First Principles

Chapter Goal

A block device is the lie that makes storage programming possible. It tells software: "Give me a logical block address and a length, and I will read or write that range." It hides heads, cylinders, NAND pages, erase blocks, caches, remaps, retries, and controller firmware. SPDK is built around that same promise: almost every storage object eventually becomes a struct spdk_bdev, even when the backing thing is an NVMe namespace, a malloc buffer, a file, a RAID volume, a logical volume, or a remote export.

By the end of this chapter, you should be able to explain the difference between bytes, sectors, and logical blocks; reason about alignment and atomicity; explain why flush, FUA, unmap, write zeroes, and metadata matter; and read the SPDK bdev structure without treating its fields as random driver trivia.

Beginner Mental Model

Imagine a huge numbered shelf of fixed-size boxes:

logical block number:   0      1      2      3      4      5
                      +------+------+------+------+------+------+
contents:             | 4KiB | 4KiB | 4KiB | 4KiB | 4KiB | 4KiB |
                      +------+------+------+------+------+------+

The host does not ask "put this byte at NAND die 3, plane 1, block 22, page 17." It asks "write 8 logical blocks starting at LBA 128." A block device turns byte-oriented user data into block-oriented commands. The simplest block device contract has four pieces:

A fixed logical block size, commonly 512 bytes or 4096 bytes.
A count of logical blocks.
Operations such as read, write, flush, unmap, reset, and write zeroes.
Rules about which addresses, lengths, and buffers are legal or fast.

That last line is where most production bugs hide. A block device is not an infinite byte array with magical durability. It is an asynchronous state machine with geometry, caching, media limits, and failure modes.

Bytes, Sectors, And Logical Blocks

A byte is the CPU's smallest addressable unit. A sector was historically the disk drive's native transfer unit, often 512 bytes. A logical block is the unit exposed by a modern storage API. In practice people still say "sector" when they mean "logical block," but the distinction matters when devices expose 512-byte logical blocks backed by 4096-byte physical sectors or when metadata/DIF adds extra bytes per block.

In SPDK, bdev APIs use blocks for most storage operations. The public getters expose that geometry:

include/spdk/bdev.h:807 declares spdk_bdev_get_block_size().
include/spdk/bdev.h:830 declares spdk_bdev_get_num_blocks() and documents that valid logical blocks are numbered from 0 through num_blocks - 1.
include/spdk/bdev_module.h:436 and include/spdk/bdev_module.h:445 show the backend fields blocklen and blockcnt.

The most important beginner rule is this: offsets and lengths in a block API are not byte offsets unless the function name says so. offset_blocks = 10 on a 4096-byte bdev means byte offset 40960, not byte offset 10.
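The unit rule can be made concrete with a pair of conversion helpers. This is a minimal Python sketch of the arithmetic only, not SPDK code; the function names are hypothetical.

```python
def blocks_to_bytes(offset_blocks: int, block_size: int) -> int:
    """Convert a block offset (or block count) into a byte offset (or length)."""
    if block_size <= 0 or offset_blocks < 0:
        raise ValueError("block_size must be positive and offset_blocks non-negative")
    return offset_blocks * block_size


def bytes_to_blocks(offset_bytes: int, block_size: int) -> int:
    """Convert a byte offset into a block offset, rejecting unaligned byte offsets."""
    if block_size <= 0 or offset_bytes < 0:
        raise ValueError("block_size must be positive and offset_bytes non-negative")
    if offset_bytes % block_size != 0:
        raise ValueError("byte offset is not a multiple of the block size")
    return offset_bytes // block_size
```

With a 4096-byte block size, blocks_to_bytes(10, 4096) gives 40960, and bytes_to_blocks(10, 4096) raises because byte offset 10 does not name a whole block.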

Why This Matters For diskengine And excloud

Cloud volumes need a stable illusion. Tenants and filesystems expect a volume to have a size, support reads and writes at specific offsets, and preserve ordering when the control plane asks for a flush or when a guest issues a barrier. diskengine can compose SPDK bdevs, export them, and reconcile them, but the correctness boundary still starts with the block contract.

When debugging an excloud volume, classify symptoms using the block model first:

"Read returns old data" may be ordering, flush, cache, or lost completion.
"Write succeeds but later data is zero" may be unmap/write-zeroes behavior, thin provisioning, or backend replacement.
"Only 4K writes fail" may be alignment, write unit size, metadata, or atomicity.
"Latency spikes during random write" may be SSD garbage collection hidden below the block abstraction.

SPDK does not remove those concerns. It gives you sharper tools and fewer kernel layers between the application and the device.

The SPDK bdev Contract In Source

The central structure is struct spdk_bdev in include/spdk/bdev_module.h:420. The fields are the vocabulary of the contract:

name and aliases identify the device.
blocklen is the logical block size in bytes.
phys_blocklen describes the physical block size when it differs.
io_type_supported records which operation classes the backend accepts.
blockcnt is the number of logical blocks.
write_unit_size, optimal_io_boundary, preferred_write_alignment, preferred_write_granularity, and optimal_write_size describe write shape.
max_segment_size, max_num_segments, max_unmap, max_unmap_segments, max_write_zeroes, max_copy, and max_rw_size limit request shape.
required_alignment says data buffers may need a specific alignment; SPDK may double-buffer when a caller violates it.

These are not just informational fields. They determine whether the bdev layer splits requests, rejects requests, allocates bounce buffers, or passes an operation directly to the module. The comments around include/spdk/bdev_module.h:448 explain that the bdev layer may split writes on write_unit_size or split reads/writes on optimal_io_boundary; the same comments explicitly call out that the corresponding split_on_write_unit and split_on_optimal_io_boundary flags do not force splitting for unmap, write zeroes, or flush.

The public support check is spdk_bdev_io_type_supported() in include/spdk/bdev.h:752. You should never assume a bdev supports unmap, write zeroes, compare-and-write, zone append, or NVMe passthrough just because the underlying hardware might. The exported bdev may be virtual, layered, or deliberately conservative.

Atomicity Is Smaller Than You Think

Atomicity answers: after a crash or power failure, can software observe half of a write? The naive answer is "a sector write is atomic." The useful answer is "read the contract, then still be skeptical."

There are multiple atomicity levels:

CPU store atomicity: irrelevant once data leaves CPU caches.
DMA transfer granularity: a device may see a scatter-gather request as multiple memory reads.
Device media/program granularity: the SSD may program NAND pages or internal units larger than an LBA.
Controller advertised write unit: the host-visible unit that may constrain legal or reliable writes.
Filesystem or database transaction: an upper-layer protocol built from writes, flushes, journals, checksums, and recovery.

SPDK exposes the write_unit_size and acwu (atomic compare-and-write unit) fields of a bdev in include/spdk/bdev_module.h:507 and include/spdk/bdev_module.h:510. The public getter comment in include/spdk/bdev.h:815 says write operations must be multiples of the write unit size. That is a shape rule. It is not a promise that every arbitrary multi-block write is power-fail atomic.

Misconception to kill: "If a write completion callback fired, the data is on NAND." A write completion means the device accepted and completed the command according to the protocol and its cache policy. If volatile write cache is involved, durability may still require a flush or a command with forced-unit-access semantics, depending on the protocol and device.

Alignment And Splitting

Alignment has three separate meanings:

LBA alignment: the starting block and block count must satisfy some multiple.
Buffer alignment: the host memory pointer must be aligned for DMA or backend requirements.
Internal media alignment: the SSD prefers larger sequential shapes even if it accepts smaller legal writes.

SPDK makes alignment visible in bdev fields. required_alignment in include/spdk/bdev_module.h:513 describes buffer alignment and says the bdev layer may double-buffer misaligned I/O. preferred_write_alignment, preferred_write_granularity, and optimal_write_size are performance hints. split_on_write_unit and split_on_optimal_io_boundary are enforcement flags.
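Buffer alignment reduces to a bit test on the buffer address. The sketch below is Python rather than SPDK code, assumes the alignment is given as a power-of-two byte count, and uses hypothetical function names.

```python
def is_buffer_aligned(address: int, alignment_bytes: int) -> bool:
    """Return True if address is a multiple of alignment_bytes (a power of two)."""
    if alignment_bytes <= 0 or alignment_bytes & (alignment_bytes - 1):
        raise ValueError("alignment must be a positive power of two")
    # For a power-of-two alignment, the low bits of an aligned address are all zero.
    return address & (alignment_bytes - 1) == 0


def needs_bounce_buffer(address: int, alignment_bytes: int) -> bool:
    """A misaligned caller buffer is the case where a layer may have to double-buffer."""
    return not is_buffer_aligned(address, alignment_bytes)
```

An address such as 0x1200 satisfies a 512-byte requirement but not a 4096-byte one, which is why the same buffer can be fast on one bdev and bounce on another.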

A worked example, drawn as a diagram:

Application request:
  write 12 blocks at LBA 6

Device rule:
  optimal boundary = 8 blocks

Visual:
  boundary 0          boundary 8          boundary 16
  |-------------------|-------------------|
        request starts here: [6 7 | 8 9 10 11 12 13 14 15 | 16 17]

Possible bdev-layer behavior:
  child A: LBA 6, 2 blocks
  child B: LBA 8, 8 blocks
  child C: LBA 16, 2 blocks

Splitting is a correctness tool and a performance tool. It also changes debugging. One user request may become several module requests and several completions internally, while the user sees one callback.
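The child requests in the diagram can be reproduced with a short loop. This is an illustrative Python sketch of boundary splitting under the assumption that no child may cross a multiple of the boundary; it is not the SPDK implementation, and the function name is hypothetical.

```python
from typing import List, Tuple


def split_on_boundary(start_lba: int, num_blocks: int, boundary: int) -> List[Tuple[int, int]]:
    """Split a request into (lba, length) children that never cross a boundary multiple."""
    if boundary <= 0 or num_blocks <= 0 or start_lba < 0:
        raise ValueError("boundary and num_blocks must be positive, start_lba non-negative")
    children = []
    lba = start_lba
    remaining = num_blocks
    while remaining > 0:
        # Blocks left before the next multiple of the boundary.
        to_boundary = boundary - (lba % boundary)
        length = min(remaining, to_boundary)
        children.append((lba, length))
        lba += length
        remaining -= length
    return children
```

Calling split_on_boundary(6, 12, 8) returns [(6, 2), (8, 8), (16, 2)], matching children A, B, and C. A request that already starts on a boundary and fits within it comes back as a single child.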

Flush, FUA, And Volatile Caches

Storage has at least three places data can be "written":

In host memory, before command submission.
In controller memory or volatile device cache.
In non-volatile media or protected cache.

A flush asks the device to make previously accepted writes durable. FUA (Force Unit Access), when supported by a protocol or command, asks for one particular write to bypass or commit through the volatile cache before it completes. A block abstraction that ignores flush semantics can pass tests and still corrupt a filesystem during power loss.

In SPDK, flush is an I/O type. The bdev API checks operation support through the same spdk_bdev_io_type_supported() mechanism as other types. The implementation path for flush lives in lib/bdev/bdev.c around spdk_bdev_flush() and spdk_bdev_flush_blocks(); later chapters will trace that in detail. For this chapter, the key idea is conceptual: flush is not "write more bytes." It is an ordering and durability command.
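The difference between "completed" and "durable" can be shown with a toy model. The class below is a deliberately simplified Python sketch, not a description of any real device or of SPDK: it assumes one volatile cache, one durable medium, and a power loss that discards the cache entirely. All names are hypothetical.

```python
class ToyCachedDevice:
    """Toy device: plain writes land in a volatile cache; flush makes them durable."""

    def __init__(self):
        self.cache = {}  # lba -> data, lost on power failure
        self.media = {}  # lba -> data, survives power failure

    def write(self, lba, data, fua=False):
        if fua:
            # Force-unit-access style write: durable before completion.
            self.media[lba] = data
            # Drop any older cached copy so a later flush cannot overwrite newer data.
            self.cache.pop(lba, None)
        else:
            self.cache[lba] = data

    def flush(self):
        # Commit every previously accepted write, then empty the cache.
        self.media.update(self.cache)
        self.cache.clear()

    def read(self, lba):
        # The cache holds the newest copy when present; None means never written.
        if lba in self.cache:
            return self.cache[lba]
        return self.media.get(lba)

    def power_loss(self):
        self.cache.clear()
```

In this model a write followed by power_loss() reads back as never written, while the same write followed by flush(), or issued with fua=True, survives. Note that flush() moves no new data from the host; it only changes what is durable.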

Edge cases:

Some virtual bdevs must translate one flush into flushes on multiple base bdevs.
Some devices complete flush quickly because they have power-loss protection.
Some devices expose a volatile write cache but lie or behave badly under firmware bugs.
A flush after an unmap may not mean reads return zero; it means the deallocation command's persistence rules are satisfied.

UNMAP, TRIM, Deallocate, And Write Zeroes

Unmap tells a device that a range no longer contains useful data. SATA calls the idea TRIM; SCSI calls it UNMAP; NVMe uses Dataset Management deallocate and related semantics. SPDK's bdev abstraction has unmap limits: preferred_unmap_alignment, preferred_unmap_granularity, max_unmap, and max_unmap_segments appear in include/spdk/bdev_module.h:538 through include/spdk/bdev_module.h:559.

Write zeroes is different. It asks the device to make future reads return zeroes for a range, often without transferring a zero-filled buffer from the host. SPDK exposes max_write_zeroes in include/spdk/bdev_module.h:561.

Misconception to kill: "Unmap means zero." It may, but it does not have to in every stack. A deallocated read may return zeroes, old data, undefined data, or complete with special semantics depending on protocol, provisioning mode, and bdev implementation. If an upper layer needs zeros, use a zeroing operation whose semantics are actually guaranteed for that path.
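A toy volume makes the distinction visible. This Python sketch is an assumption-laden illustration, not SPDK behavior: the unmapped_read_policy parameter is hypothetical and stands in for whatever the protocol, provisioning mode, and backend actually decide.

```python
class ToyThinVolume:
    """Toy volume where reads of unmapped blocks depend on a configurable policy."""

    def __init__(self, block_size, unmapped_read_policy="zero"):
        if unmapped_read_policy not in ("zero", "stale"):
            raise ValueError("policy must be 'zero' or 'stale'")
        self.block_size = block_size
        self.policy = unmapped_read_policy
        self.blocks = {}  # lba -> currently mapped data
        self.stale = {}   # lba -> data left behind by an unmap

    def write(self, lba, data):
        self.blocks[lba] = data
        self.stale.pop(lba, None)

    def unmap(self, lba):
        # Unmap only says the data is no longer needed; it does not define the contents.
        if lba in self.blocks:
            self.stale[lba] = self.blocks.pop(lba)

    def write_zeroes(self, lba):
        # Write zeroes defines the contents: future reads return zeroes.
        self.blocks[lba] = bytes(self.block_size)
        self.stale.pop(lba, None)

    def read(self, lba):
        if lba in self.blocks:
            return self.blocks[lba]
        if self.policy == "stale" and lba in self.stale:
            return self.stale[lba]
        return bytes(self.block_size)
```

With the "stale" policy, a read after unmap() can still return the old data, while a read after write_zeroes() returns zeroes under either policy. An upper layer that needs zeroes should depend on the second behavior, not the first.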

Metadata, DIF, And Protection Information

Some block devices carry extra metadata per block. That metadata can be interleaved with data or stored separately. It may contain Data Integrity Field (DIF) protection information such as guard tags, application tags, or reference tags.

SPDK exposes descriptor-specific metadata queries in include/spdk/bdev.h:658 through include/spdk/bdev.h:719. The bdev structure tracks metadata placement with md_interleave in include/spdk/bdev_module.h:474 and DIF placement with dif_is_head_of_md around include/spdk/bdev_module.h:482.

Beginner trap: a "4096-byte block" may not be only 4096 bytes on the wire or media. The host may manage 4096 bytes of data plus metadata. Passing buffers without considering metadata can make a perfectly aligned data request illegal.
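The buffer-size consequence can be worked out directly. The Python sketch below is illustrative only; the function and parameter names are hypothetical, and data_block_size here means the data portion of a block with metadata counted separately.

```python
def io_buffer_sizes(num_blocks, data_block_size, md_size, md_interleaved):
    """Return (data_buffer_bytes, separate_md_buffer_bytes) for one I/O."""
    if num_blocks < 0 or data_block_size <= 0 or md_size < 0:
        raise ValueError("invalid geometry")
    if md_interleaved:
        # Each block on the wire is data followed by its metadata, in one buffer.
        return num_blocks * (data_block_size + md_size), 0
    # Data and metadata travel in two separate buffers.
    return num_blocks * data_block_size, num_blocks * md_size
```

For 8 blocks of 4096 data bytes with 8 metadata bytes each, the interleaved layout needs one 32832-byte buffer, while the separate layout needs a 32768-byte data buffer plus a 64-byte metadata buffer. A caller that sizes the buffer as 8 * 4096 in the interleaved case is 64 bytes short.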

Edge Cases And Failure Modes

Out-of-range LBA: the last valid LBA is num_blocks - 1; off-by-one math often writes one block past the end.
Integer overflow: byte length is num_blocks * block_size; use wide types and validate before multiplying.
Short writes do not exist in the usual block command model; commands complete or fail, but layered software may split and partially complete internally before surfacing a failure.
Reset may fail outstanding I/O or delay new I/O.
Remove/hotplug can invalidate a bdev while descriptors and channels still exist.
A virtual bdev may have stricter limits than its base bdev.
Buffer alignment may silently cost performance because of bounce buffers.
A benchmark that reads deallocated LBAs may measure metadata fast paths instead of NAND reads.
A workload that never flushes may look fast and still be unsafe for databases.
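The first two items, off-by-one range errors and integer overflow, share one fix: compare the length against the space remaining instead of computing offset plus length. The Python sketch below is illustrative; since Python integers do not wrap, the buggy variant masks to 64 bits to imitate unsigned arithmetic in a language such as C. Function names are hypothetical.

```python
U64_MAX = (1 << 64) - 1


def range_is_valid(offset_blocks, num_blocks, blockcnt):
    """Overflow-safe check that [offset, offset + num_blocks) fits in blockcnt blocks."""
    if offset_blocks < 0 or num_blocks < 0:
        return False
    if offset_blocks > blockcnt:
        return False
    # Subtraction cannot wrap here because offset_blocks <= blockcnt.
    return num_blocks <= blockcnt - offset_blocks


def naive_range_is_valid_u64(offset_blocks, num_blocks, blockcnt):
    """Buggy check: the sum wraps around at 64 bits, as it would for unsigned integers."""
    end = (offset_blocks + num_blocks) & U64_MAX
    return end <= blockcnt
```

On a 262144-block device, range_is_valid accepts a 1-block request at LBA 262143 and rejects one at LBA 262144. The naive version wrongly accepts an offset of U64_MAX with a length of 2 because the sum wraps to 1.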

Source Reading Exercise

Read these anchors in order:

include/spdk/bdev_module.h:420 through include/spdk/bdev_module.h:575.
include/spdk/bdev.h:807 through include/spdk/bdev.h:838.
include/spdk/bdev.h:752 through include/spdk/bdev.h:759.

Answer these while reading:

Which fields describe logical geometry?
Which fields describe physical or performance geometry?
Which fields can force request splitting?
Which fields are limits rather than hints?
Which public getters expose fields directly and which expose descriptor-specific views?

Operational Lab

No live SPDK system is required.

Pick a hypothetical bdev with blocklen = 4096, blockcnt = 262144, write_unit_size = 4, and max_rw_size = 128.
Compute the byte capacity.
Decide whether each request is legal before splitting:

  - write 1 block at LBA 0
  - write 4 blocks at LBA 4
  - read 256 blocks at LBA 128
  - write 8 blocks at LBA 262140

For the read of 256 blocks, sketch how a bdev layer could split it if max_rw_size = 128.
Explain which failed cases should return an error immediately and which might be transformed.

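Expected reasoning: capacity is 262144 * 4096 = 1073741824 bytes (1 GiB); writes must be multiples of 4 blocks, so the 1-block write at LBA 0 is illegal while the 4-block write at LBA 4 is legal; the 256-block read exceeds max_rw_size and may split into two 128-block reads; the final write overruns the device because it covers LBAs 262140 through 262147, and LBAs 262144 through 262147 do not exist.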
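The lab can also be checked mechanically. The Python sketch below encodes one reading of the lab's rules: range errors and write-unit violations fail immediately, and oversized requests are split. Whether a real bdev layer rejects or transforms a given case depends on its implementation, so treat the classification as an assumption. All names are hypothetical.

```python
BLOCKLEN = 4096    # bytes per logical block
BLOCKCNT = 262144  # logical blocks on the device
WRITE_UNIT = 4     # writes must be a multiple of this many blocks
MAX_RW = 128       # largest single read or write, in blocks


def classify(op, lba, num_blocks):
    """Return 'ok', 'split', or an 'error: ...' string for one request."""
    if lba >= BLOCKCNT or num_blocks > BLOCKCNT - lba:
        return "error: past end of device"
    if op == "write" and num_blocks % WRITE_UNIT != 0:
        return "error: not a multiple of write_unit_size"
    if num_blocks > MAX_RW:
        return "split"
    return "ok"


def split_by_max(lba, num_blocks, max_blocks):
    """Cut a request into consecutive (lba, length) children of at most max_blocks."""
    children = []
    while num_blocks > 0:
        length = min(num_blocks, max_blocks)
        children.append((lba, length))
        lba += length
        num_blocks -= length
    return children


requests = [("write", 0, 1), ("write", 4, 4), ("read", 128, 256), ("write", 262140, 8)]
results = [classify(op, lba, n) for op, lba, n in requests]
capacity_bytes = BLOCKCNT * BLOCKLEN
```

Under these rules the four requests classify as a write-unit error, ok, split, and a past-end error, and split_by_max(128, 256, MAX_RW) yields the two children (128, 128) and (256, 128).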

Self-Check

Why is a logical block not the same as a NAND page?
Why can a 512-byte logical block device still prefer 4096-byte writes?
What does a flush promise that a write does not necessarily promise?
Why is unmap not the same operation as write zeroes?
Where does SPDK store the logical block size for a bdev?
What can happen when a user buffer violates required_alignment?
Why should a benchmark write a device before measuring reads?

References

Local source: include/spdk/bdev_module.h, especially struct spdk_bdev.
Local source: include/spdk/bdev.h, especially bdev geometry, metadata, and I/O support getters.
Local source: lib/bdev/bdev.c, especially request entry points such as read, write, flush, unmap, and write zeroes.
SPDK documentation: doc/bdev.md and doc/bdev_module.md.
NVM Express specifications landing page: https://nvmexpress.org/specifications/
NVM Express Base Specification page: https://nvmexpress.org/specification/nvm-express-base-specification/