iobench

A command-line I/O benchmarking tool built on fio. It runs locally, over SSH, or in Kubernetes pods, and includes a Redpanda storage characterization profile for validating Ceph RBD suitability.

Build

cargo build -p iobench --release

Simple profile (default)

Runs standard fio benchmarks (sequential/random read/write, single/multi-job) against a local or remote target.

# Local
iobench

# Over SSH
iobench --target ssh/user@host

# On a Kubernetes pod
iobench --target k8s/namespace/pod-name

# Custom parameters
iobench --duration 60 --size 4G --block-size 128k --tests read,write,randwrite

Available tests: read, write, randread, randwrite, multiread, multiwrite, multirandread, multirandwrite.

Results are saved to ./iobench-{timestamp}/summary.csv.

Redpanda profile

Characterizes whether a Kubernetes storage backend (e.g. Ceph RBD) can sustain Redpanda's I/O patterns. Deploys a 3-pod StatefulSet with per-node PVCs and runs four fio workloads that model Redpanda's actual storage behavior.

Quick start

# Run the full profile (sequential baselines, then parallel contention test)
iobench redpanda --storage-class ceph-block

# Sequential baselines only (single-node, no cluster impact)
iobench redpanda --mode sequential --storage-class ceph-block

# Parallel contention test only
iobench redpanda --mode parallel --storage-class ceph-block

# Keep pods alive after the run (useful for re-running or debugging)
iobench redpanda --storage-class ceph-block --keep-deployment

# Deploy pods without running any workloads
iobench deploy --storage-class ceph-block

# Remove all iobench pods and PVCs
iobench undeploy

Options

Flag               Default           Description
--mode             both              sequential, parallel, or both
--storage-class    ceph-block        Kubernetes StorageClass for the PVCs
--pvc-size         50Gi              Size of each PVC
--replicas         3                 Number of pods (should match cluster node count)
--namespace        default           Kubernetes namespace
--output-dir       auto-timestamped  Directory for results
--keep-deployment  false             Don't delete pods/PVCs after the run

Workloads

The profile runs four workloads, each targeting a different aspect of Redpanda's I/O:

Workload         What it models                                Key params                                          Runtime
throughput       Bulk segment writes (optimistic upper bound)  16K seq write, no fsync, 4 jobs, iodepth=16         5 min
fsync_hot_path   Raft commit path with acks=all                16K seq write, fdatasync per op, 4 jobs, iodepth=4  10 min
selftest_512k    rpk cluster self-test 512K phase              512K seq write, fdatasync, 1 job, iodepth=4         2 min
selftest_4k_qd1  Worst-case single-stream commit               4K seq write, fdatasync, 1 job, iodepth=1           2 min

All workloads use direct=1 (bypass page cache) and ioengine=libaio.
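
To reproduce a single workload by hand, the table maps onto standard fio options roughly as shown in this sketch. It is not iobench's actual invocation: the WORKLOADS dict, the fio_args helper, the /data/testfile path, and the 4G file size are assumptions made for illustration, as is reading "fdatasync" in the selftest rows as one fdatasync per write. The fio options themselves (--rw, --bs, --numjobs, --iodepth, --fdatasync, --direct, --ioengine, --runtime, --time_based) are standard.

```python
# Hypothetical mapping of the workload table onto fio command-line options.
WORKLOADS = {
    # name: (block size, fdatasync every N writes (0 = never), jobs, iodepth, runtime in s)
    "throughput":      ("16k",  0, 4, 16, 300),
    "fsync_hot_path":  ("16k",  1, 4, 4,  600),
    "selftest_512k":   ("512k", 1, 1, 4,  120),
    "selftest_4k_qd1": ("4k",   1, 1, 1,  120),
}

def fio_args(name, filename="/data/testfile", size="4G"):
    """Build an fio argument list for one workload (path and size are placeholders)."""
    bs, fdatasync, jobs, iodepth, runtime = WORKLOADS[name]
    args = [
        "fio", f"--name={name}", f"--filename={filename}", f"--size={size}",
        "--rw=write", f"--bs={bs}",
        "--direct=1", "--ioengine=libaio",       # shared by all workloads
        f"--numjobs={jobs}", f"--iodepth={iodepth}",
        f"--runtime={runtime}", "--time_based",
        "--group_reporting", "--output-format=json",
    ]
    if fdatasync:
        args.append(f"--fdatasync={fdatasync}")  # sync after every N writes
    return args

print(" ".join(fio_args("fsync_hot_path")))
```

The printed command line can be pasted into a shell on a pod that has fio installed and a volume mounted at the chosen path.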

Execution modes

  • sequential -- Runs all four workloads back-to-back on a single node. This produces clean per-pattern baselines that can be compared against Redpanda's published hardware requirements (16,000 IOPS minimum per broker).
  • parallel -- Runs all four workloads concurrently across all nodes simultaneously. Each node writes to its own PVC. A wall-clock barrier synchronizes start times across pods so contention windows overlap. This is the production-shape test that exposes the Ceph OSD fan-in pattern.
  • both (default) -- Sequential first, then parallel.
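
The wall-clock barrier used in parallel mode can be pictured as every pod sleeping until the same agreed-upon epoch timestamp. The following is a minimal sketch of that idea, assuming pod clocks are reasonably well synchronized (for example via NTP). The names seconds_until and wait_for_barrier are hypothetical and do not describe iobench's implementation.

```python
import time

def seconds_until(start_epoch, now=None):
    """Seconds left before the shared start time; 0 if it has already passed."""
    now = time.time() if now is None else now
    return max(0.0, start_epoch - now)

def wait_for_barrier(start_epoch):
    """Block until the wall clock reaches start_epoch."""
    delay = seconds_until(start_epoch)
    while delay > 0:
        time.sleep(min(delay, 0.5))  # re-check the clock in short steps
        delay = seconds_until(start_epoch)

# A coordinator would pick one start time far enough ahead for every pod
# to become ready; 0.2 s is used here only to keep the demo short.
start_epoch = time.time() + 0.2
wait_for_barrier(start_epoch)
print("barrier released")
```

Because each pod waits on its own clock, the overlap between contention windows is only as good as the clock synchronization between nodes.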

Interpreting results

The headline metric is the worst p99.9 fdatasync latency on the fsync_hot_path workload during parallel execution:

  • <= 100 ms: PASS. Storage can sustain Raft heartbeats under load.
  • > 100 ms: FAIL. Raft heartbeats (150 ms interval, 1.5 s election timeout) will not survive load spikes, leading to election storms.
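
The pass/fail rule can also be applied to raw fio JSON directly. The sketch below assumes fio's JSON layout in which each job reports fdatasync latency under sync.lat_ns.percentile, keyed by strings such as "99.900000"; check this against the files your fio version produces. The helpers worst_p999_sync_ms and verdict are hypothetical names, and the sample data is synthetic.

```python
THRESHOLD_MS = 100.0  # pass/fail limit for p99.9 fdatasync latency

def worst_p999_sync_ms(fio_results):
    """Worst p99.9 fdatasync latency (ms) across parsed fio JSON documents."""
    worst = 0.0
    for result in fio_results:
        for job in result.get("jobs", []):
            pct = job.get("sync", {}).get("lat_ns", {}).get("percentile", {})
            if "99.900000" in pct:
                worst = max(worst, pct["99.900000"] / 1e6)  # ns -> ms
    return worst

def verdict(worst_ms):
    return "PASS" if worst_ms <= THRESHOLD_MS else "FAIL"

# Synthetic example: one node at 42 ms, another at 130 ms.
sample = [
    {"jobs": [{"sync": {"lat_ns": {"percentile": {"99.900000": 42_000_000}}}}]},
    {"jobs": [{"sync": {"lat_ns": {"percentile": {"99.900000": 130_000_000}}}}]},
]
print(verdict(worst_p999_sync_ms(sample)))  # FAIL
```

For real runs, each parallel fsync_hot_path JSON file would be loaded with json.load and passed in as one element of the list.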

Reference values for healthy NVMe, as reported by rpk cluster self-test:

  • selftest_512k: ~1182 IOPS, ~591 MiB/s
  • selftest_4k_qd1: ~406 IOPS
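
The two selftest_512k figures are consistent with each other, since bandwidth is IOPS multiplied by block size. A quick check (the helper name is illustrative):

```python
def bandwidth_mib_s(iops, block_size_kib):
    """Bandwidth in MiB/s implied by an IOPS figure and a block size in KiB."""
    return iops * block_size_kib / 1024

print(bandwidth_mib_s(1182, 512))  # 591.0
print(bandwidth_mib_s(406, 4))     # 1.5859375
```

The same conversion is handy when comparing a result reported in IOPS against a requirement stated in MiB/s, or the other way around.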

Output

Results are saved to ./iobench-redpanda-{timestamp}/:

iobench-redpanda-2026-05-03-143022/
  redpanda_summary.csv          # Full results: all metrics, all nodes, all workloads
  iobench.csv                   # Dashboard-compatible format
  sequential/
    throughput_node-0.json       # Raw fio JSON per workload per node
    fsync_hot_path_node-0.json
    ...
    node-0/                      # Per-node time-series logs
      throughput_lat.1.log
      throughput_iops.1.log
      ...
  parallel/
    throughput_node-0.json
    throughput_node-1.json
    throughput_node-2.json
    ...

For each workload and node, redpanda_summary.csv includes IOPS, bandwidth, completion latency percentiles (p50-p99.99, max), and fdatasync latency percentiles (p50-p99.99, max).

Per-second time-series logs (*_lat.*.log, *_iops.*.log) capture the bimodal/spiky behavior of Ceph under load that summary statistics miss.
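
A short script is often enough to pull the spikes out of one of these logs. The sketch below assumes fio's usual log layout of comma-separated fields (time in milliseconds, value, data direction, block size, offset) and that latency values are in nanoseconds, as in fio 3.x; confirm both against your fio version. The function names are hypothetical and the log excerpt is synthetic.

```python
def parse_fio_log(text):
    """Return (time_ms, value) pairs from the text of an fio log file."""
    samples = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        samples.append((int(fields[0]), int(fields[1])))
    return samples

def spikes(samples, threshold):
    """Samples whose value exceeds the threshold."""
    return [(t, v) for t, v in samples if v > threshold]

# Synthetic latency log excerpt: one 240 ms outlier among sub-millisecond samples.
log_text = """1000, 850000, 1, 16384, 0
2000, 910000, 1, 16384, 0
3000, 240000000, 1, 16384, 0
4000, 880000, 1, 16384, 0
"""
samples = parse_fio_log(log_text)
print(spikes(samples, threshold=100_000_000))  # [(3000, 240000000)]
```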

Warning

Parallel mode generates heavy fdatasync workloads across all nodes simultaneously. This will impact other workloads on the same Ceph pool. Run during a maintenance window or off-peak.

Dashboard

A Plotly Dash app for visualizing results. Supports both simple and Redpanda profile data.

cd iobench/dash
virtualenv venv
source venv/bin/activate
pip install -r requirements_freeze.txt

# Copy result CSVs into the dash directory, then:
python iobench-dash.py

The dashboard reads iobench.csv and/or redpanda_summary.csv from its working directory.