BufferDiff
BufferDiff is the change-detection engine. Given two buffers of
identical dimensions it produces a compact list of changed cells,
coalesced into contiguous ChangeRuns the presenter can emit with
one cursor-move per run. It is the hottest loop in the render
kernel — for a 200 × 60 terminal at 60 Hz it compares 720 000 cells
per second in the common case, and its cost dominates every frame
that does not enter a full redraw.
The engine works at three granularities, each a successive refinement: the outer loop skips non-dirty rows, the middle layer scans in 4-cell SIMD blocks, and an optional tile-level skip uses a Summed-Area Table (SAT) to drop whole rectangular regions when the dirty-cell density is low enough.
This page documents the algorithms, the block layout, the
ChangeRun coalescing rule, and the DiffSkipHint interface the
runtime uses to bypass the diff when it has independent proof that
buffers match.
Motivation
A clear-and-redraw renderer writes every cell on every frame. A
naive per-cell PartialEq diff still visits every cell in a nested
loop. The target here is closer to O(changes) than O(width ×
height) for the common case where only a few widgets updated —
without ever sacrificing correctness when the whole screen changes.
Three trade-offs shape the design:
- SIMD auto-vectorization over branches. The diff must avoid
early-exit branches the predictor cannot learn; hence
4-cell blocks and bitwise
bits_eq. - Soundness under false positives. Dirty tracking is allowed to over-report (mark a row dirty that did not actually change); it is never allowed to under-report. False positives cost the per-cell scan for that row; false negatives cause stale frames.
- Bypass path for proofs. Certain frames (no model update, or
a Bayesian posterior saying the change density is below threshold)
do not need a diff at all — the runtime passes a
DiffSkipHintand the engine short-circuits.
Algorithm at a glance
BufferDiff::compute_dirty(old, new)
│
▼
for each row y in 0..height:
if !new.is_row_dirty(y): continue ← outer skip (row bitmap)
──────────────────────────────────────────
for each 4-cell block in row:
if cell_quad_bits_eq(old, new, block): ← SIMD compare
continue (128-bit pcmpeqd)
for i in 0..4:
if !old[i].bits_eq(&new[i]):
changes.push((x, y))
│
▼
coalesce_changes(&mut changes) → Vec<ChangeRun>With the SAT tile path enabled (opt-in via DiffSkipHint or high
span density), the outer loop first builds a 2D prefix-sum over the
dirty-cell bitmap; any tile whose prefix-sum evaluates to zero
dirties is skipped with one subtraction.
Block-based SIMD compare
const BLOCK_SIZE: usize = 4; // 4 cells = 64 bytes = one cache line
const _: () = assert!(
core::mem::size_of::<Cell>() * BLOCK_SIZE == 64,
"BLOCK_SIZE * Cell must equal 64-byte cache line"
);A block is exactly one cache line: four 16-byte cells,
64-byte-aligned. The comparator cell_quad_bits_eq ORs four
bits_eq results; LLVM auto-vectorizes this into a single 256-bit
(AVX2) or 128-bit×2 (SSE2) compare followed by a vpmovmskb /
reduction. On AArch64 Neon the codegen is symmetric via cmeq +
vmaxvq.
The inner per-cell loop is #[inline(always)] so the compiler sees
the whole 4-iteration body and can unroll it; scan_row_changes_range
is the documented entry point at diff.rs:L161-L205.
Dirty-row skipping
The outer loop consults Buffer::is_row_dirty(y) and drops clean
rows without touching memory. Given the dirty-row soundness
invariant
(cell-and-buffer),
a clean row matches its predecessor cell-for-cell, so skipping is
correct.
In steady-state a typical UI (a clock widget ticking, a list re-rendering the selected row) marks 1–3 rows dirty per frame — the diff touches O(rows_changed × width) cells instead of O(height × width), a 10–50× speedup on large terminals.
ChangeRun coalescing
Adjacent dirty cells on the same row are expensive to emit
individually (each one costs a fresh cursor move). After the scan,
the engine coalesces (x, y) positions into contiguous runs:
pub struct ChangeRun {
pub y: u16, // row
pub x0: u16, // start col (inclusive)
pub x1: u16, // end col (inclusive)
}A run amortizes a single cursor move across $x_1 - x_0
- 1$ emitted cells. The presenter consumes
Vec<ChangeRun>and decides per-run whether aCUP,CHA,CUF, or simple overwrite is cheapest.
SAT tile-skip
For very sparse dirty bitmaps (the Bayesian diff_strategy module
tracks the posterior change rate), scanning even the block-wise loop
wastes work. The optional SAT path builds a 2D prefix-sum over the
per-cell dirty bitmap (one u8 per cell) and asks, for each
rectangular tile, “is the sum of dirties inside this tile zero?”
Four table lookups and a subtraction per tile; the diff visits the
tile’s cells only if the sum is nonzero. Tile dimensions and
fallback policy are configurable via TileDiffConfig
(crates/ftui-render/src/diff.rs:L436-L545).
Typical tile sizes are 16×4 cells; the prefix-sum build cost is
O((width / tile_w) × (height / tile_h)) once per frame, amortized
across the tile queries. For dense change patterns (p > 0.3
posterior), the SAT overhead is not worth it and the engine drops
back to block scan — the Bayesian diff strategy
picks between modes based on the posterior.
DiffSkipHint
pub enum DiffSkipHint {
FullDiff, // standard path
SkipDiff, // caller guarantees buffers identical
NarrowToRows(Vec<u16>), // only these rows may differ
}The runtime evaluates a render certificate each frame — “did the
model update since last frame? are any subscriptions live?” — and
passes the result. SkipDiff produces an empty diff in O(1);
NarrowToRows cuts the outer loop to the listed rows. Incorrect
hints are not verified — the contract is that the runtime owns
the soundness proof, and the engine trusts it. Passing SkipDiff
when buffers actually differ produces a stale frame.
Worked example
use ftui_render::buffer::Buffer;
use ftui_render::cell::Cell;
use ftui_render::diff::{BufferDiff, ChangeRun};
let mut old = Buffer::new(80, 24);
let mut new = old.clone();
old.clear_dirty(); // reset tracking on both sides
new.clear_dirty();
// Two isolated changes in row 5.
new.set(10, 5, Cell::from_char('X'));
new.set(11, 5, Cell::from_char('Y'));
new.set(40, 5, Cell::from_char('Z')); // gap > merge_gap → two runs
let diff = BufferDiff::compute_dirty(&old, &new);
let runs: Vec<ChangeRun> = diff.runs();
// Two runs: [(5, 10, 11), (5, 40, 40)]
assert_eq!(runs.len(), 2);
assert_eq!(runs[0], ChangeRun::new(5, 10, 11));
assert_eq!(runs[1], ChangeRun::new(5, 40, 40));Buffers must have identical dimensions. Diffing a 80×24 against a
100×30 debug-asserts in development and produces undefined results
in release. On terminal resize, the runtime allocates a fresh
buffer at the new size, copies what it wants, and flags the first
diff as a full redraw via DiffSkipHint::FullDiff on a freshly
dirtied buffer — the SIMD path will scan it end-to-end, which is
correct but not free. Plan for this on large resizes.
Performance characteristics
| Scenario | Work per frame | Notes |
|---|---|---|
| Clean frame (no dirty rows) | O(height) | row bitmap scan only |
| One row changed | O(width) | one row of SIMD blocks |
| Scattered changes (0.1–1%) | O(height × width / 4) | block scan, most blocks skip |
| Dense changes (>30%) | O(height × width / 4) | block scan, most blocks hit |
| Full redraw / resize | O(height × width) | block scan, every block hits |
The constant factor of the block scan is ~0.5–1 CPU cycle per cell on modern x86_64 with AVX2; a 80×24 terminal diff completes in ~1 µs when no rows are dirty and ~100 µs on a full redraw.
Cross-references
- Cell & Buffer — the data the diff consumes and the dirty-tracking invariants that make it sound.
- Presenter — what consumes
ChangeRun[]and emits ANSI. - Bayesian diff strategy — the Beta-Bernoulli posterior that selects full-diff, dirty-row, or redraw.
- Screen modes — how the inline/alt-screen choice interacts with the diff on resize.
- One-writer rule — concurrent writes corrupt both the buffer and the diff.