# Benchmark gate
A FrankenTUI change that makes the kernel slower is a regression even if every test still passes. The benchmark gate encodes “how slow is too slow” as structured evidence — thresholds with explicit budgets and tolerances — and fails CI when a measurement exceeds its ceiling.
Source: `crates/ftui-harness/src/benchmark_gate.rs` +
`scripts/perf_regression_gate.sh` + `scripts/bench_budget.sh` +
`tests/baseline.json` + `slo.yaml`.
## Mental model

```text
baseline.json ─┐
               ├─▶ BenchmarkGate.evaluate(&measurements) ─▶ GateResult
criterion run ─┘                                               │
                                                     passed? failed?
                                                                │
                                                       RolloutScorecard
```

Three pieces cooperate:

- `Threshold` — a named budget (`metric`, `budget`, `tolerance_pct`).
- `Measurement` — a named observation (`metric`, `value`, optional `unit`).
- `BenchmarkGate` — a collection of thresholds. `evaluate` zips measurements to thresholds by name and produces a `GateResult` with per-metric `MetricVerdict`s.
## API at a glance

### Threshold

```rust
pub struct Threshold {
    pub metric: String,      // e.g. "frame_render_p99_us"
    pub budget: f64,         // upper bound in whatever unit the metric uses
    pub tolerance_pct: f64,  // allowed overage, 0..100
}
```

| Method | Purpose |
|---|---|
| `Threshold::new(metric, budget)` | Zero tolerance. |
| `.tolerance_pct(pct)` | Allow a percentage overage above budget. |
| `.ceiling()` | Effective maximum = `budget * (1 + tolerance_pct / 100)`. |
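The ceiling arithmetic can be sanity-checked with a tiny standalone sketch (an illustrative re-implementation for this page, not the `ftui-harness` source):

```rust
// Illustrative re-implementation of the Threshold::ceiling() formula.
fn ceiling(budget: f64, tolerance_pct: f64) -> f64 {
    budget * (1.0 + tolerance_pct / 100.0)
}

fn main() {
    // A 2000 µs budget with 10% tolerance yields a 2200 µs ceiling.
    assert!((ceiling(2000.0, 10.0) - 2200.0).abs() < 1e-9);
    // Zero tolerance: the ceiling is the budget itself.
    assert!((ceiling(500.0, 0.0) - 500.0).abs() < 1e-9);
    println!("ok");
}
```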
### Measurement

```rust
pub struct Measurement {
    pub metric: String,
    pub value: f64,
}
```

| Method | Purpose |
|---|---|
| `Measurement::new(metric, value)` | Construct. |
| `.unit(unit)` | Attach a unit string for reports. |
### BenchmarkGate and GateResult

```rust
let gate = BenchmarkGate::new("render_frame_gate")
    .threshold(Threshold::new("frame_render_p99_us", 2000.0).tolerance_pct(10.0))
    .threshold(Threshold::new("diff_compute_p99_us", 500.0));

let measurements = vec![
    Measurement::new("frame_render_p99_us", 1950.0),
    Measurement::new("diff_compute_p99_us", 480.0),
];

let result: GateResult = gate.evaluate(&measurements);
assert!(result.passed());
```

| GateResult method | Purpose |
|---|---|
| `passed()` | All metrics within ceiling. |
| `failures()` | Iterator of failing `MetricResult`s. |
| `summary()` | Human-readable one-line summary. |

Per-metric verdicts live in `MetricVerdict::{Pass, Fail}` inside each
`MetricResult { metric, value, threshold, verdict }`.
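The pass/fail decision reduces to matching each measurement to its threshold by name and comparing against the ceiling. A minimal self-contained sketch of that logic (simplified for illustration; not the `ftui-harness` source):

```rust
#[derive(Debug, PartialEq)]
enum MetricVerdict {
    Pass,
    Fail,
}

struct Threshold {
    metric: &'static str,
    budget: f64,
    tolerance_pct: f64,
}

impl Threshold {
    fn ceiling(&self) -> f64 {
        self.budget * (1.0 + self.tolerance_pct / 100.0)
    }
}

/// Zip one measurement to its threshold by metric name; None if no
/// threshold covers the metric.
fn verdict(thresholds: &[Threshold], metric: &str, value: f64) -> Option<MetricVerdict> {
    thresholds.iter().find(|t| t.metric == metric).map(|t| {
        if value <= t.ceiling() {
            MetricVerdict::Pass
        } else {
            MetricVerdict::Fail
        }
    })
}

fn main() {
    let thresholds = [
        Threshold { metric: "frame_render_p99_us", budget: 2000.0, tolerance_pct: 10.0 },
        Threshold { metric: "diff_compute_p99_us", budget: 500.0, tolerance_pct: 0.0 },
    ];
    // 1950 µs is under the 2200 µs ceiling; 2280 µs is over it.
    assert_eq!(verdict(&thresholds, "frame_render_p99_us", 1950.0), Some(MetricVerdict::Pass));
    assert_eq!(verdict(&thresholds, "frame_render_p99_us", 2280.0), Some(MetricVerdict::Fail));
    println!("ok");
}
```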
## JSON baseline format

`BenchmarkGate::load_json(gate_name, json)` ingests:

```json
{
  "frame_render_p99_us": { "budget": 2000.0, "tolerance_pct": 10.0 },
  "diff_compute_p99_us": { "budget": 500.0 }
}
```

`load_baseline_json(gate_name, json, "p99")` reads a criterion-style
baseline and extracts the named percentile as the budget.
## How CI enforces the gate

### scripts/perf_regression_gate.sh

Runs the criterion benchmarks, compares means to the p99 budgets in
`tests/baseline.json`, and writes `target/regression-gate/regression_report.jsonl`.

```sh
./scripts/perf_regression_gate.sh               # Run + check
./scripts/perf_regression_gate.sh --check-only  # Parse existing results
./scripts/perf_regression_gate.sh --quick       # CI-friendly sampling
./scripts/perf_regression_gate.sh --json        # Emit JSONL report
./scripts/perf_regression_gate.sh --flamegraph  # Generate flamegraphs
./scripts/perf_regression_gate.sh --update      # Refresh baseline with actuals
```

A CI failure looks like:

```text
[perf-gate] FAIL frame_render_p99_us: observed 2280.0us > ceiling 2200.0us (budget 2000.0us +10%)
```

When you see that, the first questions are:

- Is the change actually faster on a different percentile? Check
  `target/criterion/.../new/estimates.json`.
- Did you add allocation on the hot path? Run with `--flamegraph`.
- Is the new budget acceptable? If yes, run `--update` and document the
  rationale in the PR.
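When triaging, it can help to pull only the failing entries out of the JSONL report. A hedged sketch, assuming a per-line `verdict` field — the field names here are illustrative guesses, not the script's documented schema:

```rust
// Hypothetical: collect the lines of a JSONL regression report whose
// (assumed) "verdict" field is "fail". A real tool would parse the JSON.
fn failing_lines(report: &str) -> Vec<&str> {
    report
        .lines()
        .filter(|line| line.contains(r#""verdict":"fail""#))
        .collect()
}

fn main() {
    let report = concat!(
        r#"{"metric":"frame_render_p99_us","observed":2280.0,"ceiling":2200.0,"verdict":"fail"}"#,
        "\n",
        r#"{"metric":"diff_compute_p99_us","observed":480.0,"ceiling":500.0,"verdict":"pass"}"#,
    );
    assert_eq!(failing_lines(report).len(), 1);
    println!("{} failing metric(s)", failing_lines(report).len());
}
```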
### scripts/bench_budget.sh

Budget-only enforcement — no baseline file required. Reads the budgets from the benchmark annotations themselves and fails on overshoot.

```sh
./scripts/bench_budget.sh
./scripts/bench_budget.sh --quick
./scripts/bench_budget.sh --check-only
./scripts/bench_budget.sh --json
```

Used for early-stage benchmarks that don’t yet have a stable baseline.
## SLO alignment

The kernel has a small set of service-level objectives in `slo.yaml`. The
benchmark gate’s budgets mirror the SLO ceilings — every budget in
`tests/baseline.json` that maps to an SLO metric should be less than or
equal to the SLO’s `max_value`.

Example correspondences:

| SLO metric | Benchmark-gate metric | Budget source |
|---|---|---|
| `render_frame_p99_us` | `frame_render_p99_us` | `slo.yaml` `max_value` 4000 µs |
| `layout_compute_p99_us` | `layout_compute_p99_us` | `slo.yaml` `max_value` 1500 µs |
| `diff_strategy_p99_us` | `diff_strategy_p99_us` | `slo.yaml` `max_value` 1000 µs |
| `ansi_present_p99_us` | `ansi_present_p99_us` | `slo.yaml` `max_value` 1200 µs |

See the SLO schema for the full list and the frame budget for what happens at runtime when a budget is exceeded.
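The budget-vs-SLO invariant above is mechanical enough to check in a few lines. A minimal sketch, with the metric-name mapping flattened to the gate-side names and the values hard-coded from the table for brevity (a real check would read `tests/baseline.json` and `slo.yaml`):

```rust
use std::collections::HashMap;

/// Return a description of every gate budget that exceeds its SLO ceiling.
fn slo_violations(
    budgets: &HashMap<&str, f64>,
    slo_max: &HashMap<&str, f64>,
) -> Vec<String> {
    budgets
        .iter()
        .filter_map(|(metric, budget)| {
            slo_max.get(metric).and_then(|max| {
                (budget > max).then(|| format!("{metric}: budget {budget} > SLO {max}"))
            })
        })
        .collect()
}

fn main() {
    // SLO ceilings, using the gate-side metric names for simplicity.
    let slo_max = HashMap::from([
        ("frame_render_p99_us", 4000.0),
        ("layout_compute_p99_us", 1500.0),
    ]);
    // Gate budgets: each must be <= the corresponding SLO max_value.
    let budgets = HashMap::from([
        ("frame_render_p99_us", 2000.0),
        ("layout_compute_p99_us", 1500.0),
    ]);
    assert!(slo_violations(&budgets, &slo_max).is_empty());
    println!("ok");
}
```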
## Feeding the rollout scorecard

A benchmark `GateResult` is a first-class input to the rollout
scorecard alongside shadow-run results:

```rust
use ftui_harness::rollout_scorecard::{
    RolloutScorecard, RolloutScorecardConfig,
};

let mut scorecard = RolloutScorecard::new(
    RolloutScorecardConfig::default().require_benchmark_pass(true),
);
scorecard.add_shadow_result(shadow_result);
scorecard.set_benchmark_gate(gate_result);
assert!(scorecard.evaluate().is_go());
```

See rollout scorecard.
## Pitfalls

**Don’t silence the gate.** If a PR legitimately raises a budget — a new
feature costs measurable time — update the budget and document why in the
PR description. A silent `--update` erases institutional memory.

**Measurement noise.** Criterion tolerates a few percent of noise by
default. The `tolerance_pct` on a gate threshold is there for measurement
variance, not for absorbing regressions. If you need 20% headroom to pass,
the change is the regression.

**Percentile choice.** p99 is the default because p50 hides tail pain.
Don’t downgrade a metric from p99 to p95 to make a gate pass — tail
latency is user-visible on every n-th frame.