Grapheme Pool
Most terminal cells hold a single Unicode scalar — an ASCII letter, a
CJK ideograph, or an emoji that happens to be one codepoint wide. A
4-byte char fits inline in CellContent with no heap lookup, and
99% of cells in a typical UI are this case. The remaining 1% — ZWJ
sequences like 👨👩👧👦, flag emoji, complex combining
sequences — need multi-codepoint strings that do not fit in 4 bytes.
GraphemePool interns those strings and hands back a 32-bit
GraphemeId that encodes three fields: a slot index, a generation
counter, and a display width. The generation counter makes slot reuse
safe: if a GraphemeId is held past the lifetime of its pool entry,
the lookup returns None rather than the new tenant’s content.
This page documents the bit packing, the interning semantics, and the refcount-plus-generation dance that keeps stale IDs safe.
Motivation
Three design constraints conflict:
- Keep
Cellat 16 bytes. A cell must not embed a heap pointer or a variable-length payload; the SIMD compare assumes a fixed layout. - Avoid heap allocations on the hot path. ASCII-dominated text (every code editor, every log viewer) would be unusable if each char round-tripped through a pool.
- Support full Unicode faithfully. Showing
👨👩👧👦as four separate cells, each rendering a fragment of the ZWJ sequence, is broken.
The resolution: CellContent stores either an inline char or a
GraphemeId, discriminated by bit 31. Simple cells skip the pool
entirely; complex cells pay one lookup per emission and share the
string across every cell that uses it.
GraphemeId bit layout
#[repr(transparent)]
pub struct GraphemeId(u32); bit 31 bit 30 ──── 27 | bit 26 ──── 16 | bit 15 ──── 0
↓ ↓ ↓ ↓
type width (4) generation (11) slot (16)
discriminant 0..=15 cells 0..=2047 0..=65535
(0 = Char,
1 = GraphemeId)- Slot (16 bits). Index into
Vec<Option<GraphemeSlot>>. 64K distinct grapheme clusters per frame is plenty — even heavily emoji-rich UIs never approach this. - Generation (11 bits). Incremented when a slot is freed and reused. 2048 distinct lives per slot detects the ABA problem: holding an ID while the slot turns over, then dereferencing it, fails safely.
- Width (4 bits). Cached display width (0–15 cells). The width
is also embedded in the
GraphemeSlotitself for debugging, but theGraphemeIdcarries it redundantly so the layout engine can askid.width()without a pool indirection.
Bit 31 discriminates the CellContent union: when it is 0, the
remaining 31 bits are a char (cast to u32); when it is 1, the
remaining 31 bits are a packed GraphemeId.
How ASCII stays inline
pub struct CellContent(u32);
impl CellContent {
pub const fn from_char(c: char) -> Self {
CellContent((c as u32) & 0x7FFF_FFFF) // clear bit 31
}
pub fn as_char(self) -> Option<char> {
if self.0 >> 31 == 0 { // bit 31 = 0 → inline
char::from_u32(self.0)
} else {
None
}
}
pub fn as_grapheme_id(self) -> Option<GraphemeId> {
if self.0 >> 31 == 1 {
Some(GraphemeId::from_raw(self.0 & 0x7FFF_FFFF))
} else {
None
}
}
}Cell::from_char('a') stores 0x00000061 — pure char cast, no
allocation, no pool touch. Cell::from_char('€')
(U+20AC) stores 0x000020AC — still inline. Only when content
exceeds what a char can represent (multi-codepoint clusters) does
the pool come into play.
Interior code points greater than the Unicode maximum
(0x10FFFF = 0x0010FFFF) are not possible via from_char, so
there is no ambiguity with bit 31 being accidentally set by a
valid Unicode char.
Interning and refcount
pub struct GraphemePool {
slots: Vec<Option<GraphemeSlot>>, // None = free
generations: Vec<u16>, // per-slot, incremented on reuse
lookup: AHashMap<String, GraphemeId>, // dedup
free_list: Vec<u32>, // reuse freed slots LIFO
}
struct GraphemeSlot {
text: String,
width: u8,
refcount: u32,
}Interning (pool.intern(s, width)):
- Look up
sin theAHashMap. If present, increment the slot’s refcount and return the existingGraphemeId. - If not present, pop a slot from
free_list(or push a new one); populatetext,width,refcount = 1; insert in the lookup map; return aGraphemeIdpacked from(slot, current_generation, width).
Retain (pool.retain(id)) — increment refcount when a GraphemeId
is copied into another cell.
Release (pool.release(id)) — decrement refcount; if it reaches
zero, set the slot to None, bump generations[slot] += 1 (mod
2048), remove from the lookup map, push slot onto free_list.
Lookup (pool.get(id)):
- Extract
slotfrom the id. - If
slot >= slots.len(), returnNone. - If
generations[slot] != id.generation(), returnNone(stale!). - If
slots[slot].is_none(), returnNone. - Otherwise return the
&str.
Stale-safe lookup — the ABA guard
t0: id = pool.intern("👨👩👧👦", 2); // slot=3, gen=0
pool.release(id); pool.release(id); // refcount → 0, slot freed
// generations[3] += 1 => now 1
t1: id2 = pool.intern("🧑🏽💻", 2); // reuses slot 3, gen=1
t2: pool.get(id); // id.generation() == 0, generations[3] == 1 → None
pool.get(id2); // matches → Some("🧑🏽💻")Without the generation check, get(id) at t2 would return
"🧑🏽💻" — the new tenant — silently corrupting whatever used the
stale id. Eleven bits of generation give 2048 distinct lives; a
slot would have to cycle 2049 times and line up with a stale ID
and have refcount mismatch at exactly the wrong moment to alias.
In practice this is a correctness guarantee, not just an
optimization.
Worked example
use ftui_render::grapheme_pool::GraphemePool;
use ftui_render::cell::{Cell, CellContent};
let mut pool = GraphemePool::new();
// Intern a ZWJ family emoji.
let family = pool.intern("👨👩👧👦", 2);
assert_eq!(family.width(), 2);
assert_eq!(pool.get(family), Some("👨👩👧👦"));
// Place it in two cells of a buffer — retain the id for the second.
let cell_a = Cell::new(CellContent::from_grapheme_id(family));
pool.retain(family);
let cell_b = Cell::new(CellContent::from_grapheme_id(family));
// ... cells used in a frame ...
// On overwrite, release once per cell.
pool.release(family); // refcount 2 → 1
assert!(pool.get(family).is_some()); // still alive
pool.release(family); // refcount 1 → 0
assert!(pool.get(family).is_none()); // freed + generation bumpedWide-character continuation
When a cell has width > 1, subsequent columns are filled with
Cell::CONTINUATION (a distinguished content value) so the diff
does not treat the “second half” of a CJK character as an
independent cell. The continuation has transparent colors and no
attrs, so it is cheap to compare. The presenter knows to skip
continuation cells during emission.
Don’t stash a GraphemeId in long-lived state. IDs are valid
only as long as the pool holds a reference. A cache that saves
GraphemeIds across frames must either retain each one (and
release on eviction) or re-intern on every use. Storing the raw
&str and interning lazily is safer for occasional lookups; the
pool’s AHashMap will dedup.
Capacity and telemetry
pool.len() is the active slot count, pool.capacity() is the
allocated upper bound. The runtime exposes these via the
observability widgets so a developer building emoji-heavy UI can see
whether pool churn is a concern. Typical steady-state usage is
dozens of slots — the pool is not a hot allocator.
Cross-references
- Cell & Buffer —
CellContentlayout and the bit-31 discriminant. - Frame — where the
&mut GraphemePoolis borrowed from. - Presenter — the lookup path during ANSI emission.
- Grapheme width — the layout engine’s counterpart (admission cache, NFC normalization).
- Screen modes — continuation cells behave the same inline and alt-screen.
- One-writer rule — pool access is single-writer by virtue of frame ownership.