Skip to Content
ftui-renderBufferDiff

BufferDiff

BufferDiff is the change-detection engine. Given two buffers of identical dimensions it produces a compact list of changed cells, coalesced into contiguous ChangeRuns the presenter can emit with one cursor-move per run. It is the hottest loop in the render kernel — for a 200 × 60 terminal at 60 Hz it compares 720 000 cells per second in the common case, and its cost dominates every frame that does not enter a full redraw.

The engine works at three granularities, each a successive refinement: the outer loop skips non-dirty rows, the middle layer scans in 4-cell SIMD blocks, and an optional tile-level skip uses a Summed-Area Table (SAT) to drop whole rectangular regions when the dirty-cell density is low enough.

This page documents the algorithms, the block layout, the ChangeRun coalescing rule, and the DiffSkipHint interface the runtime uses to bypass the diff when it has independent proof that buffers match.

Motivation

A clear-and-redraw renderer writes every cell on every frame. A naive per-cell PartialEq diff still visits every cell in a nested loop. The target here is closer to O(changes) than O(width × height) for the common case where only a few widgets updated — without ever sacrificing correctness when the whole screen changes.

Three trade-offs shape the design:

  1. SIMD auto-vectorization over branches. The diff must avoid early-exit branches the predictor cannot learn; hence 4-cell blocks and bitwise bits_eq.
  2. Soundness under false positives. Dirty tracking is allowed to over-report (mark a row dirty that did not actually change); it is never allowed to under-report. False positives cost the per-cell scan for that row; false negatives cause stale frames.
  3. Bypass path for proofs. Certain frames (no model update, or a Bayesian posterior saying the change density is below threshold) do not need a diff at all — the runtime passes a DiffSkipHint and the engine short-circuits.

Algorithm at a glance

BufferDiff::compute_dirty(old, new) for each row y in 0..height: if !new.is_row_dirty(y): continue ← outer skip (row bitmap) ────────────────────────────────────────── for each 4-cell block in row: if cell_quad_bits_eq(old, new, block): ← SIMD compare continue (128-bit pcmpeqd) for i in 0..4: if !old[i].bits_eq(&new[i]): changes.push((x, y)) coalesce_changes(&mut changes) → Vec<ChangeRun>

With the SAT tile path enabled (opt-in via DiffSkipHint or high span density), the outer loop first builds a 2D prefix-sum over the dirty-cell bitmap; any tile whose prefix-sum evaluates to zero dirties is skipped with one subtraction.

Block-based SIMD compare

crates/ftui-render/src/diff.rs
const BLOCK_SIZE: usize = 4; // 4 cells = 64 bytes = one cache line const _: () = assert!( core::mem::size_of::<Cell>() * BLOCK_SIZE == 64, "BLOCK_SIZE * Cell must equal 64-byte cache line" );

A block is exactly one cache line: four 16-byte cells, 64-byte-aligned. The comparator cell_quad_bits_eq ORs four bits_eq results; LLVM auto-vectorizes this into a single 256-bit (AVX2) or 128-bit×2 (SSE2) compare followed by a vpmovmskb / reduction. On AArch64 Neon the codegen is symmetric via cmeq + vmaxvq.

The inner per-cell loop is #[inline(always)] so the compiler sees the whole 4-iteration body and can unroll it; scan_row_changes_range is the documented entry point at diff.rs:L161-L205.

Dirty-row skipping

The outer loop consults Buffer::is_row_dirty(y) and drops clean rows without touching memory. Given the dirty-row soundness invariant (cell-and-buffer), a clean row matches its predecessor cell-for-cell, so skipping is correct.

In steady-state a typical UI (a clock widget ticking, a list re-rendering the selected row) marks 1–3 rows dirty per frame — the diff touches O(rows_changed × width) cells instead of O(height × width), a 10–50× speedup on large terminals.

ChangeRun coalescing

Adjacent dirty cells on the same row are expensive to emit individually (each one costs a fresh cursor move). After the scan, the engine coalesces (x, y) positions into contiguous runs:

crates/ftui-render/src/diff.rs
pub struct ChangeRun { pub y: u16, // row pub x0: u16, // start col (inclusive) pub x1: u16, // end col (inclusive) }

A run (y,x0,x1)(y, x_0, x_1) amortizes a single cursor move across $x_1 - x_0

  • 1$ emitted cells. The presenter consumes Vec<ChangeRun> and decides per-run whether a CUP, CHA, CUF, or simple overwrite is cheapest.

SAT tile-skip

For very sparse dirty bitmaps (the Bayesian diff_strategy module tracks the posterior change rate), scanning even the block-wise loop wastes work. The optional SAT path builds a 2D prefix-sum over the per-cell dirty bitmap (one u8 per cell) and asks, for each rectangular tile, “is the sum of dirties inside this tile zero?”

sum(x0,y0,x1,y1)=S(x1,y1)S(x01,y1)S(x1,y01)+S(x01,y01)\text{sum}(x_0, y_0, x_1, y_1) = S(x_1, y_1) - S(x_0-1, y_1) - S(x_1, y_0-1) + S(x_0-1, y_0-1)

Four table lookups and a subtraction per tile; the diff visits the tile’s cells only if the sum is nonzero. Tile dimensions and fallback policy are configurable via TileDiffConfig (crates/ftui-render/src/diff.rs:L436-L545).

Typical tile sizes are 16×4 cells; the prefix-sum build cost is O((width / tile_w) × (height / tile_h)) once per frame, amortized across the tile queries. For dense change patterns (p > 0.3 posterior), the SAT overhead is not worth it and the engine drops back to block scan — the Bayesian diff strategy picks between modes based on the posterior.

DiffSkipHint

crates/ftui-render/src/diff.rs
pub enum DiffSkipHint { FullDiff, // standard path SkipDiff, // caller guarantees buffers identical NarrowToRows(Vec<u16>), // only these rows may differ }

The runtime evaluates a render certificate each frame — “did the model update since last frame? are any subscriptions live?” — and passes the result. SkipDiff produces an empty diff in O(1); NarrowToRows cuts the outer loop to the listed rows. Incorrect hints are not verified — the contract is that the runtime owns the soundness proof, and the engine trusts it. Passing SkipDiff when buffers actually differ produces a stale frame.

Worked example

examples/diff.rs
use ftui_render::buffer::Buffer; use ftui_render::cell::Cell; use ftui_render::diff::{BufferDiff, ChangeRun}; let mut old = Buffer::new(80, 24); let mut new = old.clone(); old.clear_dirty(); // reset tracking on both sides new.clear_dirty(); // Two isolated changes in row 5. new.set(10, 5, Cell::from_char('X')); new.set(11, 5, Cell::from_char('Y')); new.set(40, 5, Cell::from_char('Z')); // gap > merge_gap → two runs let diff = BufferDiff::compute_dirty(&old, &new); let runs: Vec<ChangeRun> = diff.runs(); // Two runs: [(5, 10, 11), (5, 40, 40)] assert_eq!(runs.len(), 2); assert_eq!(runs[0], ChangeRun::new(5, 10, 11)); assert_eq!(runs[1], ChangeRun::new(5, 40, 40));

Buffers must have identical dimensions. Diffing a 80×24 against a 100×30 debug-asserts in development and produces undefined results in release. On terminal resize, the runtime allocates a fresh buffer at the new size, copies what it wants, and flags the first diff as a full redraw via DiffSkipHint::FullDiff on a freshly dirtied buffer — the SIMD path will scan it end-to-end, which is correct but not free. Plan for this on large resizes.

Performance characteristics

ScenarioWork per frameNotes
Clean frame (no dirty rows)O(height)row bitmap scan only
One row changedO(width)one row of SIMD blocks
Scattered changes (0.1–1%)O(height × width / 4)block scan, most blocks skip
Dense changes (>30%)O(height × width / 4)block scan, most blocks hit
Full redraw / resizeO(height × width)block scan, every block hits

The constant factor of the block scan is ~0.5–1 CPU cycle per cell on modern x86_64 with AVX2; a 80×24 terminal diff completes in ~1 µs when no rows are dirty and ~100 µs on a full redraw.

Cross-references

  • Cell & Buffer — the data the diff consumes and the dirty-tracking invariants that make it sound.
  • Presenter — what consumes ChangeRun[] and emits ANSI.
  • Bayesian diff strategy — the Beta-Bernoulli posterior that selects full-diff, dirty-row, or redraw.
  • Screen modes — how the inline/alt-screen choice interacts with the diff on resize.
  • One-writer rule — concurrent writes corrupt both the buffer and the diff.

Where next