Redpanda iobench preflight checklist
Cluster inspection performed: 2026-05-03
Context: maintenance window, other workloads turned off, Ceph pool at ~0 IOPS idle.
Cluster topology
- PASS — 3 worker nodes available (wk0, wk1, wk2). Matches the `--replicas 3` default.
- PASS — All 3 workers are Ready, no DiskPressure / MemoryPressure / PIDPressure.
- PASS — Control plane nodes (cp3, cp4, cp5) have the `NoSchedule` taint. iobench pods cannot land there.
- PASS — All nodes carry the `kubernetes.io/hostname` label. Anti-affinity topology key will work.
- PASS — Worker nodes are untainted and have no scheduling restrictions.
Storage
- PASS — StorageClass `ceph-block` exists, `volumeBindingMode: Immediate`, `reclaimPolicy: Delete`.
- PASS — Ceph pool `ceph-blockpool` replication: size=3, min_size=2.
- PASS — 3 OSDs all Running (osd-0, osd-1, osd-2).
- PASS — Raw capacity: 503 GiB available. 3 x 50 GiB PVCs = 150 GiB usable, which consumes 450 GiB raw at size=3. Leaves 53 GiB raw headroom (~10.5%). Tight but sufficient for a benchmark run.
- PASS — Largest single-workload disk footprint: `throughput` with numjobs=4 x size=10G = 40 GiB per PVC. Fits in a 50 GiB PVC with 10 GiB headroom for logs and filesystem overhead.
- PASS — Pool is idle ("nothing is going on"), OSD commit/apply latency showing 0 ms. Confirms maintenance window baseline.
- PASS — All 3 disks are actually Intel SSDs (misidentified as HDD by Ceph due to HP RAID controller passthrough). No mixed-media concern.
- PASS (non-blocking) — Ceph health is `HEALTH_WARN: too many PGs per OSD (265 > max 250)`. This is a pre-existing tuning issue unrelated to the benchmark. Does not affect correctness of results. Note: if this were a data-integrity warning (e.g. `HEALTH_WARN: degraded PGs`), the benchmark must not proceed.
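
The capacity and footprint figures in this section can be re-derived with a few lines of arithmetic. The sketch below uses only the numbers quoted above (503 GiB raw, 3 PVCs of 50 GiB, size=3, numjobs=4, size=10G). The function names are hypothetical and are not part of iobench, and it assumes the worst case in which every PVC is fully written.

```python
def raw_headroom_gib(raw_available_gib, pvc_count, pvc_size_gib, replication_size):
    """Raw capacity left after provisioning, assuming every PVC is fully written."""
    raw_needed = pvc_count * pvc_size_gib * replication_size
    return raw_available_gib - raw_needed


def per_pvc_footprint_gib(numjobs, size_gib):
    """Disk footprint of one fio workload: one file of size_gib per job."""
    return numjobs * size_gib


# Figures taken from the Storage checks above.
headroom = raw_headroom_gib(503, pvc_count=3, pvc_size_gib=50, replication_size=3)
headroom_pct = 100 * headroom / 503
footprint = per_pvc_footprint_gib(numjobs=4, size_gib=10)
pvc_slack = 50 - footprint
```

If the raw capacity or replication factor changes, rerunning this with the new values is a quick way to see whether the headroom check still holds.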
Namespace and resource constraints
- PASS — `default` namespace exists.
- PASS — No ResourceQuota or LimitRange in `default` namespace. Pod creation won't be blocked.
- PASS — No leftover `iobench-redpanda` StatefulSet or PVCs in `default` namespace. Clean slate.
Container image
- PASS — `juicedata/fio:latest` pulls successfully and runs on this cluster.
- PASS — fio version 3.18 confirmed. Supports `fdatasync=1`, `log_avg_msec`, `write_lat_log`, `write_iops_log`, `--output-format=json`.
- PASS — Image contains `sh`, `date` (needed for wall-clock barrier), and `tar` (needed for `kubectl cp`).
Clock synchronization (barrier correctness)
- PASS — Node heartbeat timestamps are within ~5s of each other. The 60-second barrier (`start_at = now_epoch() + 60`) provides ample margin. Barrier only fails if clock skew exceeds 60s.
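
To make the barrier reasoning concrete, here is a minimal sketch of a wall-clock barrier, assuming each pod simply sleeps until its local clock reaches the shared `start_at` value. `wait_for_barrier` and `start_spread` are hypothetical names for illustration, not the tool's actual code. The clock and sleep functions are injectable so the logic can be exercised without real waiting.

```python
import time


def wait_for_barrier(start_at, now=time.time, sleep=time.sleep):
    """Block until the local wall clock reaches start_at (epoch seconds).

    Returns False without waiting if start_at is already in the past on
    this node, which is the case the 60-second margin is meant to prevent.
    """
    remaining = start_at - now()
    if remaining <= 0:
        return False
    while remaining > 0:
        # Sleep in short slices so a clock adjustment is noticed quickly.
        sleep(min(remaining, 1.0))
        remaining = start_at - now()
    return True


def start_spread(clock_offsets_s):
    """Worst-case gap between true start times, given per-node clock offsets."""
    return max(clock_offsets_s) - min(clock_offsets_s)
```

Under this model, a skew of ~5s means the pods begin within roughly 5 seconds of one another, while a node whose clock runs more than 60 seconds ahead would see the barrier as already passed.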
Code safety review
- PASS — All fio workloads use `direct=1` (bypass page cache). Benchmark measures Ceph, not RAM.
- PASS — All writes target `/data/iobench_testfile` on the mounted PVC filesystem, not a raw block device. No risk of corrupting the node OS disk.
- PASS — Pods run with default security context (no privileged, no hostPath, no hostNetwork). Blast radius is limited to the PVCs.
- PASS — `reclaimPolicy: Delete` means the PVs and their backing RBD images are removed when the PVCs are deleted. No storage leak after `undeploy`.
- PASS — `delete_resources` targets only `statefulset,pvc` with label `app=iobench-redpanda`. Cannot accidentally delete unrelated resources.
- PASS — Parallel mode prints a clear warning before starting.
- PASS — `--keep-deployment` flag available for debugging without re-provisioning.
- PASS — Workloads run sequentially within each mode (throughput, then fsync_hot_path, then selftest_512k, then selftest_4k_qd1). Disk usage doesn't accumulate across workloads since they reuse the same filename.
Resource consumption during run
- PASS — Workers have ample CPU headroom (3-6% current usage). fio with 4 jobs + iodepth 16 is not CPU-intensive on modern hardware.
- PASS — Workers have ample memory headroom (5-10% current usage). fio with `direct=1` uses minimal RAM.
- PASS — No CPU/memory resource limits or requests set on the fio container. This is intentional: resource limits would throttle the benchmark and distort results. Acceptable during a maintenance window with other workloads off.
Risks acknowledged
- Ceph pool impact: Parallel mode will saturate the Ceph pool. This is the point of the test. Confirmed other workloads are off and pool is at ~0 IOPS.
- Capacity headroom is tight: 53 GiB raw remaining after PVC provisioning (~10.5%). If the cluster has background operations that consume space (e.g., snapshots, compaction), this could trigger a `HEALTH_ERR: full` condition. Mitigated by: maintenance window, no other workloads, and `reclaimPolicy: Delete` ensuring cleanup.
- `--wait=false` on delete: `delete_resources` uses `--wait=false`, so `undeploy` returns before PVCs are fully reclaimed. This is fine, because Ceph handles RBD image deletion asynchronously. But if re-running immediately after undeploy, PVCs from the previous run may still be terminating. Mitigated by: the deploy step uses `kubectl apply`, which is idempotent.
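
For the `--wait=false` risk, an operator who wants extra certainty before re-running could poll until the previous run's PVCs are gone. The following is a sketch of such a guard, not something iobench is known to do. `wait_until_gone` and the `list_remaining` callable are hypothetical; `list_remaining` is assumed to return the names of PVCs still present, for example by wrapping `kubectl get pvc -l app=iobench-redpanda -o name`.

```python
import time


def wait_until_gone(list_remaining, timeout_s=120.0, interval_s=2.0,
                    now=time.monotonic, sleep=time.sleep):
    """Poll until list_remaining() returns an empty collection.

    list_remaining is a caller-supplied (hypothetical) function that
    returns the names of resources that still exist. Returns True once
    nothing is left, or False if timeout_s elapses first.
    """
    deadline = now() + timeout_s
    while True:
        if not list_remaining():
            return True
        if now() >= deadline:
            return False
        sleep(interval_s)
```

The timeout and interval defaults are arbitrary placeholders and would need tuning to how long RBD image deletion takes on the cluster in question.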
Verdict
All checks pass. Safe to proceed with `iobench redpanda --storage-class ceph-block` during the current maintenance window.