
Benchmark gate

A FrankenTUI change that makes the kernel slower is a regression even if every test still passes. The benchmark gate encodes “how slow is too slow” as structured evidence — thresholds with explicit budgets and tolerances — and fails CI when a measurement exceeds its ceiling.

Source: crates/ftui-harness/src/benchmark_gate.rs + scripts/perf_regression_gate.sh + scripts/bench_budget.sh + tests/baseline.json + slo.yaml.

Mental model

```
baseline.json ─┐
               ├─▶ BenchmarkGate.evaluate(&measurements) ─▶ GateResult ─▶ RolloutScorecard
criterion run ─┘                                           (passed? failed?)
```

Three pieces cooperate:

  1. Threshold — a named budget (metric, budget, tolerance_pct).
  2. Measurement — a named observation (metric, value, optional unit).
  3. BenchmarkGate — a collection of thresholds. evaluate matches measurements to thresholds by name and produces a GateResult with per-metric MetricVerdicts.

API at a glance

Threshold

```rust
pub struct Threshold {
    pub metric: String,     // e.g. "frame_render_p99_us"
    pub budget: f64,        // upper bound in whatever unit the metric uses
    pub tolerance_pct: f64, // allowed overage, 0..100
}
```

| Method | Purpose |
| --- | --- |
| `Threshold::new(metric, budget)` | Zero tolerance. |
| `.tolerance_pct(pct)` | Allow a percentage overage above the budget. |
| `.ceiling()` | Effective maximum = `budget * (1 + tolerance_pct / 100)`. |
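The ceiling arithmetic is simple enough to check by hand. A minimal standalone sketch (not the crate's actual implementation, just the rule stated above):

```rust
// Standalone sketch of the ceiling rule; the real implementation lives
// in crates/ftui-harness/src/benchmark_gate.rs.
fn ceiling(budget: f64, tolerance_pct: f64) -> f64 {
    budget * (1.0 + tolerance_pct / 100.0)
}

fn main() {
    // A 2000 µs budget with 10% tolerance yields a 2200 µs ceiling.
    assert!((ceiling(2000.0, 10.0) - 2200.0).abs() < 1e-9);
    // Zero tolerance: the ceiling equals the budget.
    assert!((ceiling(500.0, 0.0) - 500.0).abs() < 1e-9);
    println!("ok");
}
```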

Measurement

```rust
pub struct Measurement {
    pub metric: String,
    pub value: f64,
}
```

| Method | Purpose |
| --- | --- |
| `Measurement::new(metric, value)` | Construct. |
| `.unit(unit)` | Attach a unit string for reports. |

BenchmarkGate and GateResult

```rust
let gate = BenchmarkGate::new("render_frame_gate")
    .threshold(Threshold::new("frame_render_p99_us", 2000.0).tolerance_pct(10.0))
    .threshold(Threshold::new("diff_compute_p99_us", 500.0));

let measurements = vec![
    Measurement::new("frame_render_p99_us", 1950.0),
    Measurement::new("diff_compute_p99_us", 480.0),
];

let result: GateResult = gate.evaluate(&measurements);
assert!(result.passed());
```

| GateResult method | Purpose |
| --- | --- |
| `passed()` | All metrics within ceiling. |
| `failures()` | Iterator over failing `MetricResult`s. |
| `summary()` | Human-readable one-line summary. |

Per-metric verdicts live in MetricVerdict::{Pass, Fail} inside each MetricResult { metric, value, threshold, verdict }.
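The verdict reduces to a single comparison against the ceiling. A self-contained sketch of that decision (the enum name mirrors the API above, but the logic here is an illustrative assumption, not the crate's code):

```rust
// Illustrative sketch of the per-metric verdict: a measurement passes
// iff its value is at or below the threshold's effective ceiling.
#[derive(Debug, PartialEq)]
enum MetricVerdict {
    Pass,
    Fail,
}

fn judge(value: f64, budget: f64, tolerance_pct: f64) -> MetricVerdict {
    let ceiling = budget * (1.0 + tolerance_pct / 100.0);
    if value <= ceiling {
        MetricVerdict::Pass
    } else {
        MetricVerdict::Fail
    }
}

fn main() {
    // 1950 µs observed vs a 2000 µs budget +10%: within the 2200 µs ceiling.
    assert_eq!(judge(1950.0, 2000.0, 10.0), MetricVerdict::Pass);
    // 2280 µs observed: over the ceiling, so this metric fails the gate.
    assert_eq!(judge(2280.0, 2000.0, 10.0), MetricVerdict::Fail);
    println!("ok");
}
```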

JSON baseline format

BenchmarkGate::load_json(gate_name, json) ingests:

```json
{
  "frame_render_p99_us": { "budget": 2000.0, "tolerance_pct": 10.0 },
  "diff_compute_p99_us": { "budget": 500.0 }
}
```

load_baseline_json(gate_name, json, "p99") reads a criterion-style baseline and extracts the named percentile as the budget.

How CI enforces the gate

scripts/perf_regression_gate.sh

Runs criterion benchmarks, compares means to tests/baseline.json p99 budgets, writes target/regression-gate/regression_report.jsonl.

```sh
./scripts/perf_regression_gate.sh              # Run + check
./scripts/perf_regression_gate.sh --check-only # Parse existing results
./scripts/perf_regression_gate.sh --quick      # CI-friendly sampling
./scripts/perf_regression_gate.sh --json       # Emit JSONL report
./scripts/perf_regression_gate.sh --flamegraph # Generate flamegraphs
./scripts/perf_regression_gate.sh --update     # Refresh baseline with actuals
```

A CI failure looks like:

```
[perf-gate] FAIL frame_render_p99_us: observed 2280.0us > ceiling 2200.0us (budget 2000.0us +10%)
```

When you see that, the first questions are:

  1. Is the change actually faster on a different percentile? Check target/criterion/.../new/estimates.json.
  2. Did you add allocation on the hot path? Run --flamegraph.
  3. Is the new budget acceptable? If yes, run --update and document the rationale in the PR.
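The arithmetic in that FAIL line can be reproduced with a small formatter sketch (the exact format string is an assumption inferred from the example above, not the script's source):

```rust
// Illustrative reconstruction of the gate's failure line. The format is
// an assumption based on the example output shown above.
fn fail_line(metric: &str, observed: f64, budget: f64, tolerance_pct: f64) -> String {
    let ceiling = budget * (1.0 + tolerance_pct / 100.0);
    format!(
        "[perf-gate] FAIL {metric}: observed {observed:.1}us > ceiling {ceiling:.1}us (budget {budget:.1}us +{tolerance_pct:.0}%)"
    )
}

fn main() {
    // 2280 µs observed against a 2000 µs budget with 10% tolerance.
    println!("{}", fail_line("frame_render_p99_us", 2280.0, 2000.0, 10.0));
}
```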

scripts/bench_budget.sh

Budget-only enforcement — no baseline file required. Reads the budgets from the benchmark annotations themselves and fails on overshoot.

```sh
./scripts/bench_budget.sh
./scripts/bench_budget.sh --quick
./scripts/bench_budget.sh --check-only
./scripts/bench_budget.sh --json
```

Used for early-stage benchmarks that don’t yet have a stable baseline.

SLO alignment

The kernel has a small set of service-level objectives in slo.yaml. The benchmark gate’s budgets mirror the SLO ceilings — every budget in tests/baseline.json that maps to an SLO metric should be less-than-or-equal to the SLO’s max_value.

Example correspondences:

| SLO metric | Benchmark-gate metric | Budget source |
| --- | --- | --- |
| render_frame_p99_us | frame_render_p99_us | slo.yaml max_value 4000 µs |
| layout_compute_p99_us | layout_compute_p99_us | slo.yaml max_value 1500 µs |
| diff_strategy_p99_us | diff_strategy_p99_us | slo.yaml max_value 1000 µs |
| ansi_present_p99_us | ansi_present_p99_us | slo.yaml max_value 1200 µs |
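The budget-does-not-exceed-SLO invariant is easy to check mechanically. A hypothetical sketch using the max_value figures from the table above (the budget figures themselves are illustrative assumptions):

```rust
// Hypothetical consistency check: every benchmark budget must be at or
// below its SLO ceiling. SLO max_values come from the table above; the
// baseline budgets are illustrative assumptions.
fn budgets_within_slo(pairs: &[(&str, f64, f64)]) -> Vec<String> {
    pairs
        .iter()
        .filter(|(_, budget, slo_max)| budget > slo_max)
        .map(|(metric, budget, slo_max)| {
            format!("{metric}: budget {budget} exceeds SLO max {slo_max}")
        })
        .collect()
}

fn main() {
    // (metric, baseline budget in µs, slo.yaml max_value in µs)
    let pairs = [
        ("frame_render_p99_us", 2000.0, 4000.0),
        ("layout_compute_p99_us", 1500.0, 1500.0),
        ("diff_strategy_p99_us", 1000.0, 1000.0),
        ("ansi_present_p99_us", 1200.0, 1200.0),
    ];
    let violations = budgets_within_slo(&pairs);
    assert!(violations.is_empty());
    println!("all budgets within SLO ceilings");
}
```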

See SLO schema for the full list and frame budget for what happens at runtime when a budget is exceeded.

Feeding the rollout scorecard

A benchmark GateResult is a first-class input to the rollout scorecard alongside shadow-run results:

```rust
use ftui_harness::rollout_scorecard::{RolloutScorecard, RolloutScorecardConfig};

let mut scorecard = RolloutScorecard::new(
    RolloutScorecardConfig::default().require_benchmark_pass(true),
);
scorecard.add_shadow_result(shadow_result);
scorecard.set_benchmark_gate(gate_result);
assert!(scorecard.evaluate().is_go());
```

See rollout scorecard.

Pitfalls

Don’t silence the gate. If a PR legitimately raises a budget — new feature costs measurable time — update the budget and document why in the PR description. A silent --update erases institutional memory.

Measurement noise. Criterion tolerates a few percent of run-to-run noise by default. A threshold's tolerance_pct exists to absorb that measurement variance, not to absorb regressions. If you need 20% headroom to pass, the change is the regression.

Percentile choice. p99 is the default because p50 hides tail pain. Don’t downgrade a metric from p99 to p95 to make a gate pass — tail latency is user-visible on every n-th frame.