Chapter Goal
This chapter teaches how to look inside a running SPDK application without first attaching a debugger. You will learn what logs are good for, what traces are good for, what stats are good for, and how to combine them without fooling yourself. The focus is beginner-first operational reasoning, grounded in SPDK source.
By the end, you should be able to do three things that matter when you start extending SPDK C code:
- Read a log, tracepoint, or RPC stat and map it back to the C function that produced it.
- Decide which observability layer can answer your current question with the least distortion.
- Explain why a missing log, stale-looking counter, or quiet trace does not automatically prove that nothing happened.
SPDK is not a traditional blocking server where one request owns one operating-system thread from start to finish. Most of the interesting work is split across reactors, SPDK threads, pollers, I/O channels, transport qpairs, and completion callbacks. That is why observability in SPDK is layered. You often need a log for the branch, a counter for the layer, a trace for ordering, and source for the invariant.
Beginner Mental Model
SPDK observability has three main layers:
- logs explain decisions and errors in human language.
- traces record high-frequency events in a compact buffer.
- stats expose counters, timings, and current object state through RPC.
Each layer answers a different question. Logs answer "what did the code decide or reject?" Traces answer "what happened in what order?" Stats answer "what is true now, and what changed since my last sample?"
symptom
|
+-- logs: error text, component, branch clue
|
+-- stats: queue depth, poller load, I/O counts, bdev state
|
+-- traces: event ordering, latency gaps, object ownership
|
+-- source: exact function, thread context, invariant
No single layer is the truth. Logs can be missing. Counters can be cumulative, reset by another caller, or collected by walking multiple threads. Traces can be disabled, overwritten, or absent from the path you care about. The best debugging comes from correlating all three with source anchors.
This is also how you should read SPDK source. When you find a symptom, do not jump directly to the deepest NVMe or RDMA code. First ask where the request is known to exist. For example, if bdev_get_iostat shows no bdev writes, an RDMA completion counter on the storage node may be irrelevant. If bdev writes increase on the baremetal node but storage-node NVMf request counters stay flat, the failure is between the initiator bdev and the target transport.
Observability Layers At A Glance
Use this table as a first-pass chooser:
| Question | First tool | Why |
|---|---|---|
| Did an RPC reject my parameters? | logs plus RPC error | RPC handlers usually decode params and return structured errors. |
| Is any I/O reaching this bdev? | bdev_get_iostat twice | Deltas show movement better than one snapshot. |
| Which SPDK thread owns this work? | thread_get_stats, thread_get_pollers, framework_get_reactors | SPDK work is thread and poller oriented. |
| Is a transport polling but not completing? | transport stats plus traces | Counters show layer health; traces show state ordering. |
| Did the request pass through a hot path stage? | traces | Tracepoints are designed for high-frequency paths. |
| Why did a specific branch fail? | logs plus source search | Log strings normally live near the branch that emitted them. |
The important habit is to sample and compare. A single "high" counter is rarely meaningful. A counter that increased by 10000 while a lower-layer counter increased by zero is meaningful.
Source Anchors
Start with these files when reading this chapter:
include/spdk/log.h: log macros, levels, flags, and component registration.lib/log/log.c: log level, print level, timestamps, and output.lib/log/log_flags.c: component flag storage, glob matching, and enable/clear behavior.lib/event/log_rpc.c: RPC handlers for runtime log level and flag changes.include/spdk/trace.h: trace recording macros and public trace API.lib/trace/trace.c:_spdk_trace_record, per-core trace histories, and circular buffer writes.lib/trace/trace_flags.c: tracepoint registration and owner id allocation.lib/trace/trace_rpc.c: trace RPC handlers.include/spdk_internal/sock_module.h: compact socket request tracepoint example.module/bdev/nvme/bdev_nvme.c: bdev NVMe I/O start/done tracepoints and NVMe pollers.lib/nvmf/tcp.candlib/nvmf/rdma.c: NVMe-oF request tracepoints and transport counters.lib/event/app_rpc.c:thread_get_stats,thread_get_pollers, andframework_get_reactors.include/spdk_internal/thread.h: internal poller stat accessors.lib/bdev/bdev_rpc.c:bdev_get_iostatRPC implementation.lib/bdev/bdev.c: aggregation of bdev stats across channels.lib/nvmf/nvmf_rpc.c:nvmf_get_statsRPC implementation.module/bdev/nvme/bdev_nvme_rpc.c:bdev_nvme_get_transport_statistics.lib/nvme/nvme_rdma.c: initiator-side RDMA poll group statistics.app/spdk_top/spdk_top.c: interactive stats client built on RPCs.doc/tracing.md,doc/spdk_top.md,doc/usdt.md,doc/event.md,doc/concurrency.md, anddoc/jsonrpc.md.jinja2.
Logs
Logs are the first thing most people inspect. They are useful because they usually include a message written near the failing branch. Search a log message in the source tree and you often find the exact condition.
SPDK uses macros such as:
SPDK_ERRLOGfor errors.SPDK_WARNLOGfor warnings.SPDK_NOTICELOGfor notable events.SPDK_INFOLOGfor informational component logs.SPDK_DEBUGLOGfor debug component logs.
Errors and warnings are often always compiled in. Debug and info logs are more selective. They depend on the current levels and on component flags. That split exists because SPDK hot paths cannot afford unlimited string formatting or printing.
Component Flags
A component flag is a named switch such as bdev, nvmf, rdma, bdev_nvme, or rpc. Source files register these flags with SPDK_LOG_REGISTER_COMPONENT.
From include/spdk/log.h:
struct spdk_log_flag {
TAILQ_ENTRY(spdk_log_flag) tailq;
const char *name;
bool enabled;
};
void spdk_log_register_flag(const char *name, struct spdk_log_flag *flag);
#define SPDK_LOG_REGISTER_COMPONENT(flag) \
struct spdk_log_flag SPDK_LOG_##flag = { \
.name = #flag, \
.enabled = false, \
}; \
__attribute__((constructor)) static void register_flag_##flag(void) \
{ \
spdk_log_register_flag(#flag, &SPDK_LOG_##flag); \
}
Read this slowly. The macro creates a global struct spdk_log_flag. The flag name is the macro argument turned into a string. The constructor runs during process startup and inserts the flag into SPDK's log flag list. That means a file can define its own logging namespace without calling an init function manually.
When you see this in a source file:
SPDK_LOG_REGISTER_COMPONENT(bdev_nvme)
the runtime flag name is bdev_nvme. If you want that component's debug logs, enable exactly that flag before reproducing the issue.
Level And Print Level
SPDK has a log level and a print level. The names sound similar, but they answer different questions:
- the log level controls which messages are considered enabled by SPDK logging.
- the print level controls which messages are printed by the default output path.
From lib/log/log.c:
void
spdk_log(enum spdk_log_level level, const char *file, const int line, const char *func,
const char *format, ...)
{
va_list ap;
va_start(ap, format);
spdk_vlog(level, file, line, func, format, ap);
va_end(ap);
}
void
spdk_vlog(enum spdk_log_level level, const char *file, const int line, const char *func,
const char *format, va_list ap)
{
...
if (g_log_opts.log) {
g_log_opts.log(level, file, line, func, format, ap);
return;
}
if (level > g_spdk_log_print_level && level > g_spdk_log_level) {
The public log call delegates to spdk_vlog. If an application installed a custom log backend, SPDK calls that backend and returns. Otherwise, the default path compares the message severity against both g_spdk_log_print_level and g_spdk_log_level. For a beginner, the practical rule is simple: if you expect debug text, set the global level, set the print level, and enable the component flag.
Runtime Log RPCs
SPDK can change logging at runtime. The handlers live in lib/event/log_rpc.c.
From lib/event/log_rpc.c:
static int
_parse_log_level(char *level)
{
if (!strcasecmp(level, "ERROR")) {
return SPDK_LOG_ERROR;
} else if (!strcasecmp(level, "WARNING")) {
return SPDK_LOG_WARN;
} else if (!strcasecmp(level, "NOTICE")) {
return SPDK_LOG_NOTICE;
} else if (!strcasecmp(level, "INFO")) {
return SPDK_LOG_INFO;
} else if (!strcasecmp(level, "DEBUG")) {
return SPDK_LOG_DEBUG;
}
return -1;
}
static void
rpc_log_set_level(struct spdk_jsonrpc_request *request,
const struct spdk_json_val *params)
{
struct rpc_log_level req = {};
int level;
if (spdk_json_decode_object(params, rpc_log_set_print_level_decoders,
SPDK_COUNTOF(rpc_log_set_print_level_decoders), &req)) {
SPDK_DEBUGLOG(log_rpc, "spdk_json_decode_object failed\n");
spdk_jsonrpc_send_error_response(request, SPDK_JSONRPC_ERROR_INTERNAL_ERROR,
"spdk_json_decode_object failed");
goto end;
}
level = _parse_log_level(req.level);
if (level == -1) {
SPDK_DEBUGLOG(log_rpc, "tried to set invalid log level\n");
spdk_jsonrpc_send_error_response(request, SPDK_JSONRPC_ERROR_INVALID_PARAMS,
"invalid log level");
goto end;
}
spdk_log_set_level(level);
spdk_jsonrpc_send_bool_response(request, true);
end:
free_rpc_log_level(&req);
}
SPDK_RPC_REGISTER("log_set_level", rpc_log_set_level, SPDK_RPC_STARTUP | SPDK_RPC_RUNTIME)
This is a standard RPC-handler pattern. The decoder validates JSON params into a small C struct. The handler converts "DEBUG" or "INFO" into an enum. Invalid input returns a JSON-RPC error. Success calls the real subsystem function, then sends a boolean response.
Useful methods include:
log_set_levellog_get_levellog_set_print_levellog_get_print_levellog_set_flaglog_clear_flaglog_get_flagslog_enable_timestamps
For example:
scripts/rpc.py log_set_level DEBUG
scripts/rpc.py log_set_print_level DEBUG
scripts/rpc.py log_set_flag bdev_nvme
After the reproduction, turn the flag off:
scripts/rpc.py log_clear_flag bdev_nvme
Avoid enabling every flag on a busy target. It can distort timing and flood the useful signal.
Flag Matching
Log flags support all and glob-style matching.
From lib/log/log_flags.c:
static int
log_set_flag(const char *name, bool value)
{
struct spdk_log_flag *flag;
int rc = -EINVAL;
if (strcasecmp(name, "all") == 0) {
TAILQ_FOREACH(flag, &g_log_flags, tailq) {
flag->enabled = value;
}
return 0;
}
TAILQ_FOREACH(flag, &g_log_flags, tailq) {
if (fnmatch(name, flag->name, FNM_CASEFOLD) == 0) {
flag->enabled = value;
rc = 0;
}
}
return rc;
}
If the name is all, every registered flag changes. Otherwise SPDK walks the registered flag list and applies fnmatch. That means a pattern can match more than one flag. This is useful, but it is also easy to over-enable logs accidentally.
Reading Log Messages
A log message is a clue to a branch. It is not a root cause by itself.
Example:
spdk_json_decode_object failed
This points to params shape, not necessarily object state. Search in the specific RPC handler. Find the decoder table. Compare the client's JSON keys and types.
Example:
Failed to open bdev
This tells you a named bdev could not be opened at that point in the code. It does not tell you whether the bdev never existed, was removed, was claimed exclusively, or was in a transient state. Search the string, inspect the nearby return code, then query bdev_get_bdevs and related module state.
Beginner workflow:
1. reproduce the symptom with default logs
2. capture exact log lines and RPC errors
3. search exact strings in source with rg
4. identify the component flag near that source file
5. enable the narrowest useful flag
6. reproduce again
7. turn the flag off
8. explain the branch using source, not only the text
Why Logs Can Be Silent
Logs can be silent for normal reasons:
- the failing branch returns an error without logging.
- the component flag is off.
- the log level or print level filters the message.
- the message is in a different process than the one you are watching.
- the application installed a custom log callback.
- the build or runtime path does not include the debug code you expected.
- the code path is hot and intentionally uses stats or traces instead of logs.
Do not conclude "no log means no failure." In SPDK, many failures are returned through callbacks or JSON-RPC responses.
Traces
Traces are structured event records. They are designed for high-frequency paths where normal logs would be too expensive or too noisy. SPDK trace entries are stored in shared memory and later decoded. The public API is in include/spdk/trace.h. The writer is lib/trace/trace.c: _spdk_trace_record. Tracepoint descriptions are registered through SPDK_TRACE_REGISTER_FN and spdk_trace_register_description.
The mental model:
code registers tpoint descriptions
|
trace mask enables group or tpoint
|
hot path calls spdk_trace_record(...)
|
entry lands in per-lcore trace buffer
|
trace parser decodes ids into names
Traces are strongest when you need ordering:
- Did the request get queued?
- Did it later get submitted?
- Did completion arrive?
- Which thread or lcore owned the event?
- Was there a long gap between two known stages?
Trace Macro
From include/spdk/trace.h:
#define _spdk_trace_record_tsc(tsc, tpoint_id, owner_id, size, object_id, num_args, ...) \
do { \
assert(tpoint_id < SPDK_TRACE_MAX_TPOINT_ID); \
if (!spdk_trace_tpoint_enabled(tpoint_id)) { \
break; \
} \
_spdk_trace_record(tsc, tpoint_id, owner_id, size, object_id, \
num_args, ## __VA_ARGS__); \
} while (0)
#define spdk_trace_record(tpoint_id, owner_id, size, object_id, ...) \
spdk_trace_record_tsc(0, tpoint_id, owner_id, size, object_id, ## __VA_ARGS__)
This macro is why disabled tracepoints are much cheaper than logs. The first runtime check asks whether the tracepoint is enabled. If not, it breaks out before writing a trace entry. If enabled, it calls _spdk_trace_record.
The fields have conventional meanings:
tpoint_id: which event this is.owner_id: which owner this event belongs to, or zero for none.size: a numeric size, often bytes or a request length.object_id: the object to correlate across events, often a pointer cast to integer.- extra args: tracepoint-specific values that must match the registered description.
Trace Buffer Write
From lib/trace/trace.c:
static inline struct spdk_trace_entry *
get_trace_entry(struct spdk_trace_history *history, uint64_t offset)
{
return &history->entries[offset & (history->num_entries - 1)];
}
void
_spdk_trace_record(uint64_t tsc, uint16_t tpoint_id, uint16_t owner_id, uint32_t size,
uint64_t object_id, int num_args, ...)
{
struct spdk_trace_history *lcore_history;
struct spdk_trace_entry *next_entry;
...
lcore = spdk_env_get_current_core();
if (spdk_likely(lcore != SPDK_ENV_LCORE_ID_ANY)) {
lcore_history = spdk_get_per_lcore_history(g_trace_file, lcore);
} else if (t_ut_lcore_history != NULL) {
lcore_history = t_ut_lcore_history;
} else {
return;
}
if (tsc == 0) {
tsc = spdk_get_ticks();
}
lcore_history->tpoint_count[tpoint_id]++;
...
next_entry = get_trace_entry(lcore_history, lcore_history->next_entry);
next_entry->tsc = tsc;
next_entry->tpoint_id = tpoint_id;
next_entry->owner_id = owner_id;
next_entry->size = size;
next_entry->object_id = object_id;
get_trace_entry masks the offset with num_entries - 1. That is the circular-buffer behavior. When the writer advances far enough, old entries are overwritten. The writer chooses a per-lcore history using the current DPDK/SPDK core id. If code runs from a registered user thread instead, it can use a user-thread trace history. If neither exists, the trace call returns without writing.
The timestamp is spdk_get_ticks() if the caller did not pass an explicit timestamp. That is why trace output needs a tick rate to convert ticks into seconds, milliseconds, or microseconds.
Near the end of the writer:
/* Ensure all elements of the trace entry are visible to outside trace tools */
spdk_smp_wmb();
lcore_history->next_entry += num_entries;
}
The write memory barrier matters because another process may read the shared memory trace file. SPDK wants the entry contents visible before it publishes the updated next_entry.
Trace Registration
Trace names are not stored in every event. Instead, SPDK registers tracepoint metadata once and trace records store compact ids.
From lib/trace/trace_flags.c:
int
trace_flags_init(void)
{
struct spdk_trace_register_fn *reg_fn;
uint16_t i;
uint16_t owner_id_start;
int rc;
reg_fn = g_reg_fn_head;
while (reg_fn) {
reg_fn->reg_fn();
reg_fn = reg_fn->next;
}
/* We will not use owner_id 0, it will be reserved to mean "no owner". */
owner_id_start = 256;
g_owner_ids.ring = calloc(g_trace_file->num_owners, sizeof(uint16_t));
...
for (i = 0; i < g_owner_ids.tail; i++) {
g_owner_ids.ring[i] = i + owner_id_start;
}
The registration functions run first. Then owner ids are allocated from a ring. Owner id zero means "no owner." This helps trace tools group related entries without storing long names in every event.
Socket Trace Example
include/spdk_internal/sock_module.h is a compact example because it shows a request moving through queues.
From include/spdk_internal/sock_module.h:
static inline void
spdk_sock_request_queue(struct spdk_sock *sock, struct spdk_sock_request *req)
{
assert(req->internal.curr_list == NULL);
if (spdk_trace_tpoint_enabled(TRACE_SOCK_REQ_QUEUE)) {
uint64_t len = 0;
int i;
for (i = 0; i < req->iovcnt; i++) {
len += SPDK_SOCK_REQUEST_IOV(req, i)->iov_len;
}
spdk_trace_record(TRACE_SOCK_REQ_QUEUE, 0, len, (uintptr_t)req, (uintptr_t)req->cb_arg);
}
TAILQ_INSERT_TAIL(&sock->queued_reqs, req, internal.link);
...
sock->queued_iovcnt += req->iovcnt;
}
The tracepoint is recorded before the request is inserted into queued_reqs. The size is the sum of all iovec lengths. The object id is req. The extra argument is req->cb_arg. If you later see the same request pointer in a pending or complete tracepoint, you can reconstruct the request's movement.
The same file records the next stages:
static inline void
spdk_sock_request_pend(struct spdk_sock *sock, struct spdk_sock_request *req)
{
assert(req->internal.curr_list == &sock->queued_reqs);
spdk_trace_record(TRACE_SOCK_REQ_PEND, 0, 0, (uintptr_t)req, (uintptr_t)req->cb_arg);
TAILQ_REMOVE(&sock->queued_reqs, req, internal.link);
...
TAILQ_INSERT_TAIL(&sock->pending_reqs, req, internal.link);
}
static inline int
spdk_sock_request_complete(struct spdk_sock *sock, struct spdk_sock_request *req, int err)
{
...
spdk_trace_record(TRACE_SOCK_REQ_COMPLETE, 0, 0, (uintptr_t)req, (uintptr_t)req->cb_arg);
req->internal.offset = 0;
req->internal.zcopy_idx = 0;
req->internal.pending_zcopy = false;
This is the kind of trace sequence you want in a hot path. It tells you ordering and object identity without printing strings on every I/O.
Bdev NVMe Trace Example
The NVMe bdev module records when a bdev I/O enters and leaves the NVMe-backed path.
From module/bdev/nvme/bdev_nvme.c:
static inline void
__bdev_nvme_io_complete(struct spdk_bdev_io *bdev_io, enum spdk_bdev_io_status status,
const struct spdk_nvme_cpl *cpl)
{
spdk_trace_record(TRACE_BDEV_NVME_IO_DONE, 0, 0, (uintptr_t)bdev_io->driver_ctx,
(uintptr_t)bdev_io);
if (cpl) {
spdk_bdev_io_complete_nvme_status(bdev_io, cpl->cdw0, cpl->status.sct, cpl->status.sc);
} else {
spdk_bdev_io_complete(bdev_io, status);
}
}
The trace happens before completing the bdev I/O. The object id is bdev_io->driver_ctx, which is the NVMe-bdev private I/O context. The extra argument is the generic bdev_io. This gives you two identities: one for the module-private state and one for the bdev-layer request.
From the submission side:
static void
bdev_nvme_submit_request(struct spdk_io_channel *ch, struct spdk_bdev_io *bdev_io)
{
struct nvme_bdev_channel *nbdev_ch = spdk_io_channel_get_ctx(ch);
struct nvme_bdev_io *nbdev_io = (struct nvme_bdev_io *)bdev_io->driver_ctx;
if (spdk_likely(nbdev_io->submit_tsc == 0)) {
nbdev_io->submit_tsc = spdk_bdev_io_get_submit_tsc(bdev_io);
} else {
nbdev_io->submit_tsc = spdk_get_ticks();
}
spdk_trace_record(TRACE_BDEV_NVME_IO_START, 0, 0, (uintptr_t)nbdev_io, (uintptr_t)bdev_io);
nbdev_io->io_path = bdev_nvme_find_io_path(nbdev_ch);
The trace happens after timestamp setup and before path selection. If TRACE_BDEV_NVME_IO_START appears but TRACE_BDEV_NVME_IO_DONE does not, the request entered this module and has not completed through this completion path within the captured trace window. That can mean a real stall. It can also mean the trace buffer wrapped or the completion happened on a different core whose history you did not inspect.
Trace Edge Cases
Trace buffers are finite. A very busy system can overwrite old entries. If the interesting event happened minutes ago, the buffer may no longer contain it. The official tracing guide recommends spdk_trace_record when the live circular buffer may not retain enough events.
Trace timestamps are tick based. Use the returned or documented tick rate to convert. Do not compare timestamps across machines as if they were wall-clock time.
Tracepoints only record what the code author instrumented. Absence of a trace event can mean the event did not happen. It can also mean the tpoint was disabled, the code path has no tpoint, the event was overwritten, or you are looking at the wrong process or lcore history.
Traces are lighter than logs, but not free. An enabled tracepoint writes shared-memory records and may copy arguments. Narrow masks reduce overhead and reduce buffer churn.
Useful commands:
build/bin/nvmf_tgt -e nvmf_rdma
build/bin/spdk_trace -s nvmf -p <pid>
build/bin/spdk_trace_record -q -s nvmf -p <pid> -f /tmp/spdk_nvmf_record.trace
build/bin/spdk_trace -f /tmp/spdk_nvmf_record.trace
scripts/rpc.py trace_get_tpoint_group_mask
scripts/rpc.py trace_disable_tpoint_group nvmf
scripts/rpc.py trace_clear_tpoint_mask nvmf 0xffffffffffffffff
Stats
Stats are sampled through RPC or tools that use RPC. They are the safest first choice for "what is the system doing now?" They are also the easiest layer to misread.
Common sources:
thread_get_statsthread_get_pollersthread_get_io_channelsframework_get_reactorsbdev_get_iostatbdev_get_bdevsnvmf_get_statsbdev_nvme_get_transport_statisticsenv_dpdk_get_mem_statsiobuf_get_stats
The implementation is distributed. Thread and reactor RPCs live in lib/event/app_rpc.c. Bdev stats live in lib/bdev/bdev_rpc.c and aggregate data from lib/bdev/bdev.c. NVMe-oF stats live in lib/nvmf/nvmf_rpc.c and transport files such as lib/nvmf/rdma.c. NVMe initiator transport statistics live in module/bdev/nvme/bdev_nvme_rpc.c and lower transport files such as lib/nvme/nvme_rdma.c.
The core rule:
stats are snapshots collected by code
snapshots are not universal truth
deltas are usually more useful than absolute values
Thread Stats RPC
SPDK apps are event-driven. Reactors run SPDK lightweight threads. Threads run pollers and messages. A high-level health check starts with:
framework_get_reactors
thread_get_stats
thread_get_pollers
From lib/event/app_rpc.c:
static void
rpc_thread_get_stats_for_each(struct spdk_jsonrpc_request *request, spdk_msg_fn fn)
{
struct rpc_get_stats_ctx *ctx;
ctx = calloc(1, sizeof(*ctx));
if (!ctx) {
spdk_jsonrpc_send_error_response(request, SPDK_JSONRPC_ERROR_INTERNAL_ERROR,
"Memory allocation error");
return;
}
ctx->request = request;
ctx->w = spdk_jsonrpc_begin_result(ctx->request);
spdk_json_write_object_begin(ctx->w);
spdk_json_write_named_uint64(ctx->w, "tick_rate", spdk_get_ticks_hz());
spdk_json_write_named_array_begin(ctx->w, "threads");
spdk_for_each_thread(fn, ctx, rpc_thread_get_stats_done);
}
This helper allocates one RPC context, starts a JSON result, writes tick_rate, and then calls spdk_for_each_thread. That is a crucial detail. The RPC does not freeze the whole process at one instant. It asks each SPDK thread to run a callback and append its local data. The result is a good operational snapshot, but not a transactional global read.
The per-thread callback:
static void
_rpc_thread_get_stats(void *arg)
{
struct rpc_get_stats_ctx *ctx = arg;
struct spdk_thread *thread = spdk_get_thread();
struct spdk_poller *poller;
struct spdk_thread_stats stats;
uint64_t active_pollers_count = 0;
uint64_t timed_pollers_count = 0;
uint64_t paused_pollers_count = 0;
for (poller = spdk_thread_get_first_active_poller(thread); poller != NULL;
poller = spdk_thread_get_next_active_poller(poller)) {
active_pollers_count++;
}
...
if (0 == spdk_thread_get_stats(&stats)) {
spdk_json_write_object_begin(ctx->w);
spdk_json_write_named_string(ctx->w, "name", spdk_thread_get_name(thread));
spdk_json_write_named_uint64(ctx->w, "id", spdk_thread_get_id(thread));
...
spdk_json_write_named_uint64(ctx->w, "busy", stats.busy_tsc);
spdk_json_write_named_uint64(ctx->w, "idle", stats.idle_tsc);
The callback runs on the thread being inspected. It counts active, timed, and paused pollers by walking the thread's poller lists. It writes cumulative busy and idle ticks. To interpret busy percent, sample twice:
busy_delta = busy_B - busy_A
idle_delta = idle_B - idle_A
busy_percent = busy_delta / (busy_delta + idle_delta)
Do not use process CPU alone. The official spdk_top guide explains why Linux top is not enough for polled-mode applications: a reactor can look like it is using a full CPU even when it is mostly polling and finding no useful work.
Poller Stats RPC
Pollers are repeated functions registered on an SPDK thread. They usually replace blocking waits or interrupt-driven logic. The official event framework docs state that pollers execute repeatedly until unregistered and run on the thread where they were registered.
From include/spdk_internal/thread.h:
struct spdk_poller_stats {
uint64_t run_count;
uint64_t busy_count;
};
const char *spdk_poller_get_name(struct spdk_poller *poller);
uint64_t spdk_poller_get_id(struct spdk_poller *poller);
const char *spdk_poller_get_state_str(struct spdk_poller *poller);
uint64_t spdk_poller_get_period_ticks(struct spdk_poller *poller);
void spdk_poller_get_stats(struct spdk_poller *poller, struct spdk_poller_stats *stats);
These are internal helpers used by the RPC layer. run_count tells you how many times the poller ran. busy_count tells you how many runs reported work. A poller can run a lot and still be mostly idle.
From lib/event/app_rpc.c:
static void
rpc_get_poller(struct spdk_poller *poller, struct spdk_json_write_ctx *w)
{
struct spdk_poller_stats stats;
uint64_t period_ticks;
period_ticks = spdk_poller_get_period_ticks(poller);
spdk_poller_get_stats(poller, &stats);
spdk_json_write_object_begin(w);
spdk_json_write_named_string(w, "name", spdk_poller_get_name(poller));
spdk_json_write_named_uint64(w, "id", spdk_poller_get_id(poller));
spdk_json_write_named_string(w, "state", spdk_poller_get_state_str(poller));
spdk_json_write_named_uint64(w, "run_count", stats.run_count);
spdk_json_write_named_uint64(w, "busy_count", stats.busy_count);
if (period_ticks) {
spdk_json_write_named_uint64(w, "period_ticks", period_ticks);
}
spdk_json_write_object_end(w);
}
period_ticks is ticks, not microseconds. Use tick_rate from related RPCs or spdk_top conversion logic when you need time. A period of zero usually means an active poller that can run on every reactor loop iteration. A nonzero period means a timed poller.
Look for:
- a thread pinned where you expect.
- a poller running far more often than expected.
- a timed poller not running.
- increasing busy time with flat I/O.
- idle time near zero on a core that should have capacity.
- a poller stuck in a paused state.
Reactor Stats RPC
Reactors are the per-core event loops in SPDK applications. framework_get_reactors shows lcore placement, process/thread ids, busy/idle ticks, interrupt mode, and lightweight threads on each reactor.
From lib/event/app_rpc.c:
static void
_rpc_framework_get_reactors(void *arg1, void *arg2)
{
struct rpc_get_stats_ctx *ctx = arg1;
uint32_t current_core;
struct spdk_reactor *reactor;
struct spdk_lw_thread *lw_thread;
struct spdk_thread *thread;
current_core = spdk_env_get_current_core();
reactor = spdk_reactor_get(current_core);
spdk_json_write_object_begin(ctx->w);
spdk_json_write_named_uint32(ctx->w, "lcore", current_core);
spdk_json_write_named_uint64(ctx->w, "tid", spdk_get_tid());
spdk_json_write_named_uint64(ctx->w, "busy", reactor->busy_tsc);
spdk_json_write_named_uint64(ctx->w, "idle", reactor->idle_tsc);
spdk_json_write_named_bool(ctx->w, "in_interrupt", reactor->in_interrupt);
...
spdk_json_write_named_array_begin(ctx->w, "lw_threads");
TAILQ_FOREACH(lw_thread, &reactor->threads, link) {
thread = spdk_thread_get_from_ctx(lw_thread);
...
spdk_json_write_named_string(ctx->w, "name", spdk_thread_get_name(thread));
This runs on each reactor and reports the SPDK threads currently placed there. Use it when a thread or poller is "missing" from where you expected it. It is also the place to notice interrupt mode.
Reactor starvation is different from ordinary busyness. A reactor can be busy because it is doing useful work, or because a poller spins without completions, or because a long callback starves other messages. Correlate:
- reactor busy/idle deltas.
- thread busy/idle deltas.
- poller
run_countandbusy_countdeltas. - bdev or transport I/O deltas.
- traces for the stage that should be progressing.
If reactor busy increases but all useful I/O counters stay flat, look for a hot poller, repeated error/retry loop, or callback that does too much per run.
Bdev Stats
bdev_get_iostat is one of the most useful RPCs for storage symptoms. It reports tick rate, current ticks, bdev I/O counters, bytes, latency-related values, queue-depth values when enabled, and driver-specific statistics when a bdev module provides them.
Beginner pattern:
sample A
wait 5 seconds under workload
sample B
compute deltas
compare submitted work with completed work
compare bdev layer with transport layer
From lib/bdev/bdev_rpc.c:
static void
rpc_get_iostat_started(struct rpc_get_iostat_ctx *rpc_ctx)
{
rpc_ctx->w = spdk_jsonrpc_begin_result(rpc_ctx->request);
spdk_json_write_object_begin(rpc_ctx->w);
spdk_json_write_named_uint64(rpc_ctx->w, "tick_rate", spdk_get_ticks_hz());
spdk_json_write_named_uint64(rpc_ctx->w, "ticks", spdk_get_ticks());
}
static void
bdev_get_iostat_done(struct spdk_bdev *bdev, struct spdk_bdev_io_stat *stat,
void *cb_arg, int rc)
{
struct bdev_get_iostat_ctx *bdev_ctx = cb_arg;
struct rpc_get_iostat_ctx *rpc_ctx = bdev_ctx->rpc_ctx;
struct spdk_json_write_ctx *w = rpc_ctx->w;
...
spdk_json_write_object_begin(w);
spdk_json_write_named_string(w, "name", spdk_bdev_get_name(bdev));
spdk_bdev_dump_io_stat_json(stat, w);
if (spdk_bdev_get_qd_sampling_period(bdev)) {
spdk_json_write_named_uint64(w, "queue_depth_polling_period",
spdk_bdev_get_qd_sampling_period(bdev));
The response includes tick_rate and current ticks. Use them to convert tick deltas. The bdev-specific object is written only after stats collection completes for that bdev.
The RPC also supports per-channel mode for one bdev:
static void
bdev_get_per_channel_stat(struct spdk_bdev_channel_iter *i, struct spdk_bdev *bdev,
struct spdk_io_channel *ch, void *ctx)
{
struct bdev_get_iostat_ctx *bdev_ctx = ctx;
struct spdk_json_write_ctx *w = bdev_ctx->rpc_ctx->w;
spdk_bdev_get_io_stat(bdev, ch, bdev_ctx->stat, bdev_ctx->rpc_ctx->reset_mode);
spdk_json_write_object_begin(w);
spdk_json_write_named_uint64(w, "thread_id", spdk_thread_get_id(spdk_get_thread()));
spdk_bdev_dump_io_stat_json(bdev_ctx->stat, w);
spdk_json_write_object_end(w);
spdk_bdev_for_each_channel_continue(i, 0);
}
Per-channel output is often the answer when a global bdev counter is confusing. SPDK bdev I/O is issued through channels tied to threads. One thread may be busy while another is idle. One channel may see retries or latency while another does not.
The aggregation path in lib/bdev/bdev.c shows why:
void
spdk_bdev_get_device_stat(struct spdk_bdev *bdev, struct spdk_bdev_io_stat *stat,
enum spdk_bdev_reset_stat_mode reset_mode,
spdk_bdev_get_device_stat_cb cb, void *cb_arg)
{
struct spdk_bdev_iostat_ctx *bdev_iostat_ctx;
...
/* Start with the statistics from previously deleted channels. */
spdk_spin_lock(&bdev->internal.spinlock);
bdev_get_io_stat(bdev_iostat_ctx->stat, bdev->internal.stat);
spdk_bdev_reset_io_stat(bdev->internal.stat, reset_mode);
spdk_spin_unlock(&bdev->internal.spinlock);
/* Then iterate and add the statistics from each existing channel. */
spdk_bdev_for_each_channel(bdev, bdev_get_each_channel_stat, bdev_iostat_ctx,
bdev_get_device_stat_done);
}
Global bdev stats are built from deleted-channel history plus existing channels. That is good because channels can disappear. It also means you should understand reset behavior. If one observer asks for reset behavior, it can change what another observer sees later. Record the exact RPC parameters used in every capture.
bdev_get_bdevs complements stats by showing static and semi-static properties:
- block size.
- number of blocks.
- aliases.
- product name.
- claim state.
- supported I/O types.
- driver-specific information.
- RAID state if driver-specific output is implemented.
Do not treat bdev_get_bdevs as performance telemetry. It is mostly shape and state. Use it with bdev_get_iostat.
NVMe Initiator Transport Stats
If SPDK is acting as an NVMe initiator through the bdev NVMe module, use:
scripts/rpc.py bdev_nvme_get_transport_statistics
From module/bdev/nvme/bdev_nvme_rpc.c:
static void
rpc_bdev_nvme_rdma_stats(struct spdk_json_write_ctx *w,
struct spdk_nvme_transport_poll_group_stat *stat)
{
struct spdk_nvme_rdma_device_stat *device_stats;
uint32_t i;
spdk_json_write_named_array_begin(w, "devices");
for (i = 0; i < stat->rdma.num_devices; i++) {
device_stats = &stat->rdma.device_stats[i];
spdk_json_write_object_begin(w);
spdk_json_write_named_string(w, "dev_name", device_stats->name);
spdk_json_write_named_uint64(w, "polls", device_stats->polls);
spdk_json_write_named_uint64(w, "idle_polls", device_stats->idle_polls);
spdk_json_write_named_uint64(w, "completions", device_stats->completions);
spdk_json_write_named_uint64(w, "queued_requests", device_stats->queued_requests);
For RDMA initiators, this reports per-device poller stats. polls and idle_polls describe polling behavior. completions tells you completions reaped from the transport. queued_requests tells you requests queued at this transport layer. Doorbell counters show how many work requests reached verbs/provider submission paths.
The RPC is per channel:
static void
rpc_bdev_nvme_stats_per_channel(struct spdk_io_channel_iter *i)
{
struct rpc_bdev_nvme_transport_stat_ctx *ctx;
struct spdk_io_channel *ch;
struct nvme_poll_group *group;
struct spdk_nvme_poll_group_stat *stat;
...
ch = spdk_io_channel_iter_get_channel(i);
group = spdk_io_channel_get_ctx(ch);
rc = spdk_nvme_poll_group_get_stats(group->group, &stat);
...
spdk_json_write_named_string(ctx->w, "thread", spdk_thread_get_name(spdk_get_thread()));
spdk_json_write_named_array_begin(ctx->w, "transports");
Again, this is not a single global atomic instant. It walks channels and asks each poll group for stats.
The lower RDMA transport fills the values in lib/nvme/nvme_rdma.c:
STAILQ_FOREACH(poller, &group->pollers, link) {
device_stat = &stats->rdma.device_stats[i];
device_stat->name = poller->device->device->name;
device_stat->polls = poller->stats.polls;
device_stat->idle_polls = poller->stats.idle_polls;
device_stat->completions = poller->stats.completions;
device_stat->queued_requests = poller->stats.queued_requests;
device_stat->total_send_wrs = poller->stats.rdma_stats.send.num_submitted_wrs;
device_stat->send_doorbell_updates = poller->stats.rdma_stats.send.doorbell_updates;
device_stat->total_recv_wrs = poller->stats.rdma_stats.recv.num_submitted_wrs;
device_stat->recv_doorbell_updates = poller->stats.rdma_stats.recv.doorbell_updates;
i++;
}
If bdev write counters increase but NVMe transport queued/completion counters do not, the I/O may not be reaching the NVMe transport path you think it is. If queued requests increase but completions do not, inspect transport health, qpair state, network/RDMA errors, target-side NVMf stats, and tracepoints.
NVMe-oF Target Stats
If SPDK is serving storage through NVMe-oF, use:
scripts/rpc.py nvmf_get_stats
From lib/nvmf/nvmf_rpc.c:
static void
_rpc_nvmf_get_stats(struct spdk_io_channel_iter *i)
{
struct rpc_nvmf_get_stats_ctx *ctx = spdk_io_channel_iter_get_ctx(i);
struct spdk_io_channel *ch;
struct spdk_nvmf_poll_group *group;
ch = spdk_get_io_channel(ctx->tgt);
group = spdk_io_channel_get_ctx(ch);
spdk_nvmf_poll_group_dump_stat(group, ctx->w);
spdk_put_io_channel(ch);
spdk_for_each_channel_continue(i, 0);
}
The target stats RPC walks I/O channels for the target. For each channel, it gets the poll group and asks the transport to dump stats. This is a direct source explanation for why target stats are poll-group oriented.
The RDMA transport maintains poller counters:
From lib/nvmf/rdma.c:
struct spdk_nvmf_rdma_poller_stat {
uint64_t completions;
uint64_t polls;
uint64_t idle_polls;
uint64_t requests;
uint64_t request_latency;
uint64_t pending_free_request;
uint64_t pending_rdma_read;
uint64_t pending_rdma_write;
uint64_t pending_rdma_send;
struct spdk_rdma_provider_qp_stats qp_stats;
};
This tells you what the target can expose:
- completions reaped from the RDMA completion queue.
- total polls and idle polls.
- requests observed.
- accumulated request latency ticks.
- pending counters for data transfer and send stages.
- provider-level send/recv work request counters.
The poller increments some of these counters here:
/* Poll for completing operations. */
reaped = ibv_poll_cq(rpoller->cq, 32, wc);
if (spdk_unlikely(reaped < 0)) {
SPDK_ERRLOG("Error polling CQ! (%d): %s\n",
errno, spdk_strerror(errno));
return -1;
} else if (reaped == 0) {
rpoller->stat.idle_polls++;
}
rpoller->stat.polls++;
rpoller->stat.completions += reaped;
If polls rises and idle_polls rises with no completions, the poller is running but not finding RDMA completions. That might be normal during idle time. Under load, it points you toward network, qpair, host, or target queue state.
The JSON output is produced here:
TAILQ_FOREACH(rpoller, &rgroup->pollers, link) {
spdk_json_write_object_begin(w);
spdk_json_write_named_string(w, "name",
ibv_get_device_name(rpoller->device->context->device));
spdk_json_write_named_uint64(w, "polls", rpoller->stat.polls);
spdk_json_write_named_uint64(w, "idle_polls", rpoller->stat.idle_polls);
spdk_json_write_named_uint64(w, "completions", rpoller->stat.completions);
spdk_json_write_named_uint64(w, "requests", rpoller->stat.requests);
spdk_json_write_named_uint64(w, "request_latency", rpoller->stat.request_latency);
spdk_json_write_named_uint64(w, "pending_rdma_read", rpoller->stat.pending_rdma_read);
spdk_json_write_named_uint64(w, "pending_rdma_write", rpoller->stat.pending_rdma_write);
spdk_json_write_named_uint64(w, "pending_rdma_send", rpoller->stat.pending_rdma_send);
Interpret these as deltas. For example, increasing pending_rdma_read is not automatically bad. It may mean work is flowing through that pending state. It becomes suspicious if pending counters rise while completions and bdev progress do not.
RPC Observability
SPDK JSON-RPC is itself observable. The official JSON-RPC docs describe methods, parameters, errors, and rpc.py. The source pattern is consistent:
- A
SPDK_RPC_REGISTERmacro binds a method name to a C handler. - The handler decodes
paramsinto a C struct. - The handler calls subsystem code.
- The handler sends a result or error.
RPC stats races are normal because many stats methods collect data asynchronously across threads or channels. That does not make them useless. It means you should:
- capture wall-clock time next to every RPC result.
- sample twice or more.
- compare deltas from the same method and same params.
- avoid mixing reset and non-reset samples.
- keep raw JSON when debugging production issues.
- record which process and socket you queried.
Example capture habit:
date -Is
scripts/rpc.py thread_get_stats > /tmp/thread-A.json
sleep 5
date -Is
scripts/rpc.py thread_get_stats > /tmp/thread-B.json
Do the same for bdev_get_iostat and transport stats in the same reproduction window.
spdk_top
spdk_top is an interactive client for SPDK stats. It connects to the RPC socket and periodically calls stats methods. The official guide says it relies on thread_get_stats, thread_get_pollers, and framework_get_reactors.
Source reading anchors:
app/spdk_top/spdk_top.c: rpc_send_reqsends JSON-RPC requests.app/spdk_top/spdk_top.ccallsthread_get_stats.app/spdk_top/spdk_top.ccallsthread_get_pollers.app/spdk_top/spdk_top.ccallsframework_get_reactors.app/spdk_top/spdk_top.c: show_statsdrives display refresh.
spdk_top is good for interactive exploration. For repeatable debugging, save raw RPC output and timestamps. Screens are hard to diff.
spdk_top also teaches the right mental model. It separates threads, pollers, and cores because those are different SPDK concepts. A core can be busy while a thread is not doing useful storage work. A poller can run frequently and remain idle. A timed poller can be healthy even if its run count grows slowly.
USDT
USDT probes are another observability path. The official guide is doc/usdt.md. Use USDT when the platform and build support it and when external tracing tools are preferred. It is especially useful when correlating SPDK events with system-level tools. For a beginner, learn logs, RPC stats, and SPDK traces first. Then add USDT for cross-process or kernel/user timelines.
diskengine Visibility
diskengine is treated in this book as an external control-plane process around SPDK. The local diskengine draft index says diskengine calls SPDK JSON-RPC through Go wrappers, reconciles database state, and lets SPDK own the data path once controllers, bdevs, lvols, RAID devices, and exports are created.
That creates a visibility split:
diskengine logs and database state
|
v
SPDK JSON-RPC object state
|
v
SPDK data path stats and traces
|
v
device, fabric, guest, or kernel-side evidence
For a diskengine storage-node issue, you usually need both:
- diskengine loop logs: did it attempt disk init, lvol provision, export, resize, snapshot, or deletion?
- SPDK state: do bdevs, lvstores, lvols, NVMf subsystems, listeners, namespaces, threads, and pollers exist?
For a diskengine baremetal issue, you usually need:
- diskengine attach/RAID/vhost loop logs.
bdev_nvme_get_controllersandbdev_nvme_get_io_pathsfor remote storage-node paths.bdev_get_bdevsand RAID state for the local graph.bdev_get_iostatfor local movement.bdev_nvme_get_transport_statisticsfor initiator transport movement.- storage-node
nvmf_get_statsandbdev_get_iostatfor target-side movement. - vhost/QEMU evidence if the guest sees a hang or missing disk.
Inference from local diskengine drafts: diskengine's data path visibility is intentionally indirect. The Go process creates and repairs SPDK objects; it is not the per-I/O proxy. Therefore, a diskengine log line can explain why an object was created, skipped, or retried, but SPDK counters and traces explain whether an actual guest write moved.
Distributed visibility gaps are expected:
- baremetal and storage-node clocks may differ.
- each SPDK process has its own RPC socket, logs, trace shared memory, and counters.
- diskengine database state can lag SPDK state during reconciliation.
- a storage-node export can exist while the baremetal initiator path is down.
- a baremetal NVMe controller can exist while its bdev namespace or I/O path is not usable.
- a vhost controller can exist with no QEMU session.
The practical answer is to build a timeline:
time T0: diskengine decided/attempted action
time T1: SPDK RPC created or rejected object
time T2: stats showed data path movement or no movement
time T3: trace showed last known stage for representative request
time T4: guest/client observed symptom
Correlation Recipes
Slow I/O Workload
1. collect logs around the symptom
2. sample bdev_get_iostat twice on the top bdev
3. sample thread_get_stats twice
4. check framework_get_reactors for core placement
5. sample transport stats twice
6. query lower-layer bdev or target stats
7. enable narrow trace groups if counters show a stuck layer
8. inspect source around the first layer that stops progressing
Do not skip step 2. If top-level bdev counters are flat, lower-layer transport analysis may be premature.
Create RPC Fails
1. capture exact RPC error
2. search the error string
3. inspect the decoder and semantic checks
4. enable the component log flag if needed
5. query object state with get RPCs
6. retry from a clean process if replay order is suspected
Many RPC failures are parameter or phase problems. Do not treat every RPC error as a device failure.
Target Accepts Connections But I/O Stalls
1. check nvmf_get_stats
2. check bdev_get_iostat for backend movement
3. check thread_get_pollers for target pollers
4. inspect transport logs
5. enable transport and bdev traces narrowly
6. compare initiator and target counters
If initiator queued requests increase but target request counters do not, suspect fabric/path/connection. If target requests increase but backend bdev counters do not, inspect namespace mapping, bdev state, and target request execution.
diskengine Baremetal Guest Write Hangs
1. confirm QEMU/vhost session exists
2. check RAID bdev state and top-level bdev iostat
3. check base NVMe bdev iostat and bdev_nvme transport stats
4. check storage-node nvmf_get_stats for the exported lvol
5. check storage-node bdev/lvol iostat
6. line up diskengine logs with SPDK RPC results
7. add traces at the first layer where counters stop
The important distinction is between control-plane readiness and data-plane progress. diskengine can believe an object should exist while SPDK is still reconnecting, examining, resetting, or tearing it down.
Edge Cases
Missing Logs Due To Flags Or Build Type
Debug logs usually require the right log level, print level, and component flag. Some debug-only state checks depend on debug builds or #ifdef DEBUG. If you cannot find a log, search the source for the return path and inspect RPC errors or stats.
Reactor Starvation
A reactor can be busy without useful progress. Look for high reactor busy deltas, high poller run deltas, low useful I/O deltas, and repeated retry/error logs. If a callback does too much work in one poller run, other messages can wait behind it.
Poller Timing Confusion
Poller period values are in ticks in the RPC source. Some tools display converted microseconds. A period of zero does not mean "never runs"; it usually means active poller. Timed pollers may run slowly by design.
Trace Buffer Overhead And Wrap
Enabled tracepoints write shared memory and can copy arguments. The buffer is circular. Use narrow groups, trace_disable_tpoint_group, trace_clear_tpoint_mask, and spdk_trace_record when you need a longer capture.
JSON-RPC Stats Races
Stats RPCs often walk threads, reactors, bdevs, or I/O channels. Objects can change while collection is in progress. Treat a response as an operational snapshot, not a serialized transaction.
Per-Thread Versus Global Counters
Global counters can hide a hot channel or a stuck thread. Use per-channel bdev stats, thread stats, and poller stats when one core or path behaves differently.
Counter Reset Changes The Story
Some RPCs support reset behavior. One observer resetting counters affects later observers. Always record reset parameters.
Zero I/O At One Layer Can Be Correct
If a request is cached, rejected early, blocked above the layer, or sent down another path, lower counters stay flat. Start from the top layer where the request is known to exist.
Distributed diskengine Visibility Gaps
In diskengine deployments, baremetal SPDK, storage-node SPDK, QEMU, guest kernel, and diskengine are different observation points. No single RPC covers the whole path. Collect from both sides of the fabric and line up times.
Misconceptions To Kill
- "A log line is the root cause." It is usually the first visible branch.
- "No errors in logs means no error." Many failures return through callbacks or RPC responses.
- "A high counter is bad." Deltas and context matter.
- "A quiet trace means nothing happened." The tracepoint may be disabled, absent, or overwritten.
- "Stats are real time." RPC stats are sampled snapshots.
- "spdk_top fixes observability." It is a client, not a data store.
- "Enabling all debug output is best." It can drown the useful signal.
- "One layer's success proves the whole path is healthy." Storage paths are stacked.
- "diskengine handles every write." In the inspected design, diskengine creates and repairs SPDK objects; SPDK moves the I/O.
Source Reading Path
Read in this order if you want to extend SPDK observability code:
include/spdk/log.handlib/event/log_rpc.cto understand log flags and runtime control.- One component that uses logs, such as
module/bdev/nvme/bdev_nvme.c. include/spdk/trace.handlib/trace/trace.cto understand trace record cost and fields.include/spdk_internal/sock_module.hfor a small tracepoint flow.module/bdev/nvme/bdev_nvme.cfor bdev NVMe start/done tracepoints.lib/event/app_rpc.cfor thread, poller, and reactor snapshots.lib/bdev/bdev_rpc.candlib/bdev/bdev.cfor bdev stat aggregation.lib/nvmf/nvmf_rpc.candlib/nvmf/rdma.cfor target transport stats.module/bdev/nvme/bdev_nvme_rpc.candlib/nvme/nvme_rdma.cfor initiator transport stats.app/spdk_top/spdk_top.cto see how a real tool consumes these RPCs.
When adding new observability, choose the layer by cost:
- use logs for rare decisions and error branches.
- use stats for cumulative counters and current state.
- use traces for hot-path ordering.
- use RPC output for human and automation access.
Labs
Lab: Map A Log Flag
Pick a source file with SPDK_LOG_REGISTER_COMPONENT. Find the component name. Find a SPDK_DEBUGLOG or SPDK_INFOLOG in that component. Find the RPC that enables log flags. Write the exact scripts/rpc.py command you would use to enable only that flag. Explain what you expect to see before reproducing the issue.
Lab: Sample Bdev Deltas
Run a simple workload against a test bdev. Capture bdev_get_iostat. Wait five seconds. Capture it again. Compute deltas for read/write operations and bytes. Convert tick-based values using the returned tick rate. Write one paragraph describing the workload from counters alone.
Lab: Thread Stats First Pass
Capture thread_get_stats and thread_get_pollers. List every thread name. List each poller name. Mark which pollers are active, timed, or paused. Explain which thread you would inspect first if I/O is not moving.
Lab: Tracepoint Reading
Open include/spdk_internal/sock_module.h. Find the calls to spdk_trace_record. Write down the event names implied by the trace constants. Explain how queued, pending, and complete socket events differ. Then open lib/trace/trace_flags.c and find where descriptions are registered.
Lab: Two-Node diskengine Timeline
For a hypothetical volume served from a storage node to a baremetal node, write the evidence you would collect at each point:
diskengine baremetal logs
baremetal SPDK bdev_get_iostat
baremetal SPDK bdev_nvme_get_transport_statistics
storage-node SPDK nvmf_get_stats
storage-node SPDK bdev_get_iostat
diskengine storage-node logs
Explain which counter should move first for a guest write and which counter should move last.
Self-Check
- What is the difference between log level and print level?
- Which file implements runtime log RPCs?
- What does a tracepoint record that a counter usually does not?
- Why should most stats be compared as deltas?
- What RPC would you use to inspect poller names?
- What does
spdk_topuse to collect its data? - Why can a trace buffer miss an old event?
- What source file implements
bdev_get_iostat? - Why is
thread_get_statsnot a single atomic global read? - Why can diskengine logs and SPDK counters disagree temporarily?
References
Local source and docs:
include/spdk/log.hlib/log/log.clib/log/log_flags.clib/event/log_rpc.cinclude/spdk/trace.hlib/trace/trace.clib/trace/trace_flags.clib/trace/trace_rpc.cinclude/spdk_internal/sock_module.hinclude/spdk_internal/thread.hlib/event/app_rpc.clib/bdev/bdev_rpc.clib/bdev/bdev.cmodule/bdev/nvme/bdev_nvme.cmodule/bdev/nvme/bdev_nvme_rpc.clib/nvmf/nvmf_rpc.clib/nvmf/rdma.clib/nvmf/tcp.clib/nvme/nvme_rdma.capp/spdk_top/spdk_top.capp/spdk_top/READMEdoc/tracing.mddoc/spdk_top.mddoc/usdt.mddoc/event.mddoc/concurrency.mddoc/jsonrpc.md.jinja2test/unit/lib/event/reactor.c/reactor_ut.ctest/unit/lib/log/log.c/log_ut.cspdk-learning-site/drafts/transport-diskengine/00-index.mdspdk-learning-site/drafts/transport-diskengine/22-nvme-of-target.mdspdk-learning-site/drafts/transport-diskengine/23-nvme-of-initiator-bdev-nvme.mdspdk-learning-site/drafts/transport-diskengine/29-diskengine-storage-node-mode.mdspdk-learning-site/drafts/transport-diskengine/30-diskengine-baremetal-mode.mdspdk-learning-site/drafts/transport-diskengine/31-complete-vm-write-to-ssd-walkthrough.md
Official SPDK docs:
- SPDK tracing framework: https://spdk.io/doc/tracepoints.html
- SPDK
spdk_top: https://spdk.io/doc/spdk_top.html - SPDK JSON-RPC: https://spdk.io/doc/jsonrpc.html