
Benchmark gate

A FrankenTUI change that makes the kernel slower is a regression even if every test still passes. The benchmark gate encodes “how slow is too slow” as structured evidence — thresholds with explicit budgets and tolerances — and fails CI when a measurement exceeds its ceiling.

Source: crates/ftui-harness/src/benchmark_gate.rs + scripts/perf_regression_gate.sh + scripts/bench_budget.sh + tests/baseline.json + slo.yaml.

Mental model

```
baseline.json ─┐
               ├─▶ BenchmarkGate.evaluate(&measurements) ─▶ GateResult ─▶ RolloutScorecard
criterion run ─┘                                           (passed? failed?)
```

Three pieces cooperate:

  1. Threshold — a named budget (metric, budget, tolerance_pct).
  2. Measurement — a named observation (metric, value, optional unit).
  3. BenchmarkGate — a collection of thresholds. evaluate matches measurements to thresholds by name and produces a GateResult with per-metric MetricVerdicts.

API at a glance

Threshold

```rust
pub struct Threshold {
    pub metric: String,     // e.g. "frame_render_p99_us"
    pub budget: f64,        // upper bound in whatever unit the metric uses
    pub tolerance_pct: f64, // allowed overage, 0..100
}
```

| Method | Purpose |
| --- | --- |
| `Threshold::new(metric, budget)` | Zero tolerance. |
| `.tolerance_pct(pct)` | Allow a percentage overage above the budget. |
| `.ceiling()` | Effective maximum = `budget * (1 + tolerance_pct / 100)`. |
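The ceiling arithmetic is simple enough to check by hand. A minimal standalone sketch (not the crate's actual implementation, just the rule stated above):

```rust
// Standalone sketch of the ceiling rule; the real implementation lives
// in crates/ftui-harness/src/benchmark_gate.rs.
fn ceiling(budget: f64, tolerance_pct: f64) -> f64 {
    budget * (1.0 + tolerance_pct / 100.0)
}

fn main() {
    // A 2000 µs budget with 10% tolerance yields a 2200 µs ceiling.
    assert!((ceiling(2000.0, 10.0) - 2200.0).abs() < 1e-9);
    // Zero tolerance: the ceiling equals the budget.
    assert!((ceiling(500.0, 0.0) - 500.0).abs() < 1e-9);
    println!("ok");
}
```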

Measurement

```rust
pub struct Measurement {
    pub metric: String,
    pub value: f64,
}
```

| Method | Purpose |
| --- | --- |
| `Measurement::new(metric, value)` | Construct. |
| `.unit(unit)` | Attach a unit string for reports. |

BenchmarkGate and GateResult

```rust
let gate = BenchmarkGate::new("render_frame_gate")
    .threshold(Threshold::new("frame_render_p99_us", 2000.0).tolerance_pct(10.0))
    .threshold(Threshold::new("diff_compute_p99_us", 500.0));

let measurements = vec![
    Measurement::new("frame_render_p99_us", 1950.0),
    Measurement::new("diff_compute_p99_us", 480.0),
];

let result: GateResult = gate.evaluate(&measurements);
assert!(result.passed());
```

| GateResult method | Purpose |
| --- | --- |
| `passed()` | All metrics within ceiling. |
| `failures()` | Iterator over failing `MetricResult`s. |
| `summary()` | Human-readable one-line summary. |

Per-metric verdicts live in MetricVerdict::{Pass, Fail} inside each MetricResult { metric, value, threshold, verdict }.
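The verdict reduces to a single comparison against the ceiling. A self-contained sketch of that decision (the enum name mirrors the API above, but the logic here is an illustrative assumption, not the crate's code):

```rust
// Illustrative sketch of the per-metric verdict: a measurement passes
// iff its value is at or below the threshold's effective ceiling.
#[derive(Debug, PartialEq)]
enum MetricVerdict {
    Pass,
    Fail,
}

fn judge(value: f64, budget: f64, tolerance_pct: f64) -> MetricVerdict {
    let ceiling = budget * (1.0 + tolerance_pct / 100.0);
    if value <= ceiling {
        MetricVerdict::Pass
    } else {
        MetricVerdict::Fail
    }
}

fn main() {
    // 1950 µs observed vs a 2000 µs budget +10%: within the 2200 µs ceiling.
    assert_eq!(judge(1950.0, 2000.0, 10.0), MetricVerdict::Pass);
    // 2280 µs observed: over the ceiling, so this metric fails the gate.
    assert_eq!(judge(2280.0, 2000.0, 10.0), MetricVerdict::Fail);
    println!("ok");
}
```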

JSON baseline format

BenchmarkGate::load_json(gate_name, json) ingests:

```json
{
  "frame_render_p99_us": { "budget": 2000.0, "tolerance_pct": 10.0 },
  "diff_compute_p99_us": { "budget": 500.0 }
}
```

load_baseline_json(gate_name, json, "p99") reads a criterion-style baseline and extracts the named percentile as the budget.

How CI enforces the gate

scripts/perf_regression_gate.sh

Runs criterion benchmarks, compares means to tests/baseline.json p99 budgets, writes target/regression-gate/regression_report.jsonl.

```sh
./scripts/perf_regression_gate.sh              # Run + check
./scripts/perf_regression_gate.sh --check-only # Parse existing results
./scripts/perf_regression_gate.sh --quick      # CI-friendly sampling
./scripts/perf_regression_gate.sh --json       # Emit JSONL report
./scripts/perf_regression_gate.sh --flamegraph # Generate flamegraphs
./scripts/perf_regression_gate.sh --update     # Refresh baseline with actuals
```

A CI failure looks like:

```
[perf-gate] FAIL frame_render_p99_us: observed 2280.0us > ceiling 2200.0us (budget 2000.0us +10%)
```

When you see that, the first questions are:

  1. Is the change actually faster on a different percentile? Check target/criterion/.../new/estimates.json.
  2. Did you add allocation on the hot path? Run --flamegraph.
  3. Is the new budget acceptable? If yes, run --update and document the rationale in the PR.
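The arithmetic in that FAIL line can be reproduced with a small formatter sketch (the exact format string is an assumption inferred from the example above, not the script's source):

```rust
// Illustrative reconstruction of the gate's failure line. The format is
// an assumption based on the example output shown above.
fn fail_line(metric: &str, observed: f64, budget: f64, tolerance_pct: f64) -> String {
    let ceiling = budget * (1.0 + tolerance_pct / 100.0);
    format!(
        "[perf-gate] FAIL {metric}: observed {observed:.1}us > ceiling {ceiling:.1}us (budget {budget:.1}us +{tolerance_pct:.0}%)"
    )
}

fn main() {
    // 2280 µs observed against a 2000 µs budget with 10% tolerance.
    println!("{}", fail_line("frame_render_p99_us", 2280.0, 2000.0, 10.0));
}
```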

scripts/bench_budget.sh

Budget-only enforcement — no baseline file required. Reads the budgets from the benchmark annotations themselves and fails on overshoot.

```sh
./scripts/bench_budget.sh
./scripts/bench_budget.sh --quick
./scripts/bench_budget.sh --check-only
./scripts/bench_budget.sh --json
```

Used for early-stage benchmarks that don’t yet have a stable baseline.

SLO alignment

The kernel has a small set of service-level objectives in slo.yaml. The benchmark gate’s budgets mirror the SLO ceilings — every budget in tests/baseline.json that maps to an SLO metric should be less-than-or-equal to the SLO’s max_value.

Example correspondences:

| SLO metric | Benchmark-gate metric | Budget source |
| --- | --- | --- |
| render_frame_p99_us | frame_render_p99_us | slo.yaml max_value 4000 µs |
| layout_compute_p99_us | layout_compute_p99_us | slo.yaml max_value 1500 µs |
| diff_strategy_p99_us | diff_strategy_p99_us | slo.yaml max_value 1000 µs |
| ansi_present_p99_us | ansi_present_p99_us | slo.yaml max_value 1200 µs |
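The budget-does-not-exceed-SLO invariant is easy to check mechanically. A hypothetical sketch using the max_value figures from the table above (the budget figures themselves are illustrative assumptions):

```rust
// Hypothetical consistency check: every benchmark budget must be at or
// below its SLO ceiling. SLO max_values come from the table above; the
// baseline budgets are illustrative assumptions.
fn budgets_within_slo(pairs: &[(&str, f64, f64)]) -> Vec<String> {
    pairs
        .iter()
        .filter(|(_, budget, slo_max)| budget > slo_max)
        .map(|(metric, budget, slo_max)| {
            format!("{metric}: budget {budget} exceeds SLO max {slo_max}")
        })
        .collect()
}

fn main() {
    // (metric, baseline budget in µs, slo.yaml max_value in µs)
    let pairs = [
        ("frame_render_p99_us", 2000.0, 4000.0),
        ("layout_compute_p99_us", 1500.0, 1500.0),
        ("diff_strategy_p99_us", 1000.0, 1000.0),
        ("ansi_present_p99_us", 1200.0, 1200.0),
    ];
    let violations = budgets_within_slo(&pairs);
    assert!(violations.is_empty());
    println!("all budgets within SLO ceilings");
}
```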

See SLO schema for the full list and frame budget for what happens at runtime when a budget is exceeded.

Feeding the rollout scorecard

A benchmark GateResult is a first-class input to the rollout scorecard alongside shadow-run results:

```rust
use ftui_harness::rollout_scorecard::{RolloutScorecard, RolloutScorecardConfig};

let mut scorecard = RolloutScorecard::new(
    RolloutScorecardConfig::default().require_benchmark_pass(true),
);
scorecard.add_shadow_result(shadow_result);
scorecard.set_benchmark_gate(gate_result);
assert!(scorecard.evaluate().is_go());
```

See rollout scorecard.

Pitfalls

Don’t silence the gate. If a PR legitimately raises a budget — new feature costs measurable time — update the budget and document why in the PR description. A silent --update erases institutional memory.

Measurement noise. Criterion tolerates a few percent of run-to-run noise by default. A threshold's tolerance_pct exists to absorb that measurement variance, not to absorb regressions. If you need 20% headroom to pass, the change is the regression.

Percentile choice. p99 is the default because p50 hides tail pain. Don’t downgrade a metric from p99 to p95 to make a gate pass — tail latency is user-visible on every n-th frame.