Per-script SEO byte budgets and a script-aware clamp.
Background. Google Search Central and Bing Webmaster Guidelines both
document SERP snippet limits in pixels, not characters. Latin
glyphs render at roughly half the pixel width of CJK glyphs, while
Arabic/Hebrew letterforms sit between the two. A single length
budget for <title> / <meta description> will always be wrong for
at least one of the 14 publishing languages — typically over-truncating
Latin copy and over-running CJK by a factor of two.
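Because the budgets here are expressed in bytes, UTF-8 encoding widens the gap between scripts: ASCII Latin letters take 1 byte each, Arabic and Hebrew letters 2 bytes, and most CJK ideographs 3 bytes. The snippet below is a minimal illustration of that arithmetic using the standard `TextEncoder` API. The helper name `utf8ByteLength` and the sample strings are hypothetical and not part of this module.

```typescript
// Hypothetical helper: UTF-8 byte length of a string via the standard TextEncoder.
function utf8ByteLength(text: string): number {
  return new TextEncoder().encode(text).length;
}

// Illustrative samples, one per script family.
const latinBytes = utf8ByteLength("news"); // 4 letters, 1 byte each
const rtlBytes = utf8ByteLength("خبر");     // 3 Arabic letters, 2 bytes each
const cjkBytes = utf8ByteLength("新闻");     // 2 ideographs, 3 bytes each

console.log({ latinBytes, rtlBytes, cjkBytes });
```

A byte cap tuned for Latin copy therefore admits far fewer CJK or Arabic characters than the same number suggests, which is one reason the caps are kept per script family.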
This module provides:
- classifyScript: three-way latin | cjk | rtl family classifier
  driven by the locale code. No glyphs are inspected; the BCP-47
  language tag is authoritative because every publishing pipeline
  emits one full output per language.
- SEO_BUDGETS: per-surface × per-script byte caps derived from the
  documented platform envelopes (Google ≤580 px title / ≤155 char
  description; Bing slightly more generous; Facebook ≤95 chars on
  og:title; Twitter ≤70 / ≤200; LinkedIn shares OG).
- budgetFor: typed accessor returning the byte cap for a
  (lang, surface) pair, with a uniform fallback to the strictest
  Latin budget when the locale is unknown.
- clampForBudget: script-aware truncator that prefers natural
  clause boundaries (CJK full-width punctuation, RTL sentence
  punctuation, Latin clause separators) before falling back to
  whitespace breaks. Returns the input verbatim when it already
  fits.
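The classifier and accessor described above could look roughly like the sketch below. It is not the module's actual code: the language sets, the `Surface` names, and every number in the table are placeholders chosen for illustration, and the sketch assumes that any unrecognised locale simply classifies as `latin`.

```typescript
type ScriptFamily = "latin" | "cjk" | "rtl";
type Surface = "title" | "description"; // hypothetical surface names

// Hypothetical primary-language sets; the real module may cover different locales.
const CJK_LANGS = new Set(["zh", "ja", "ko"]);
const RTL_LANGS = new Set(["ar", "he"]);

// Classify by the primary subtag of a BCP-47 tag; no glyphs are inspected.
function classifyScript(lang: string): ScriptFamily {
  const primary = lang.toLowerCase().split(/[-_]/)[0] ?? "";
  if (CJK_LANGS.has(primary)) return "cjk";
  if (RTL_LANGS.has(primary)) return "rtl";
  return "latin"; // assumed fallback for unknown locales
}

// Placeholder byte caps, NOT the module's real values.
const SEO_BUDGETS: Record<Surface, Record<ScriptFamily, number>> = {
  title: { latin: 60, cjk: 90, rtl: 90 },
  description: { latin: 155, cjk: 230, rtl: 230 },
};

// Typed accessor: byte cap for a (lang, surface) pair.
function budgetFor(lang: string, surface: Surface): number {
  return SEO_BUDGETS[surface][classifyScript(lang)];
}

console.log(budgetFor("ja", "title"), budgetFor("xx", "description"));
```

Keying the table by surface first and script family second keeps the accessor a plain double lookup, and the `Record` types make a missing cell a compile-time error.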
Pure, leaf module. No I/O, no dependencies on other aggregator
modules beyond the existing text-utils.ts clause-boundary
vocabulary.
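Since the module is pure, the clamp can be illustrated as a standalone function. The sketch below is an assumption-laden approximation, not the module's implementation: `CLAUSE_BREAKS` is a hypothetical stand-in for the clause-boundary vocabulary in text-utils.ts (which is not shown here), and the signature taking a byte cap and a script family directly is invented for the example.

```typescript
type ScriptFamily = "latin" | "cjk" | "rtl";

// Hypothetical stand-in for the clause-boundary vocabulary in text-utils.ts.
const CLAUSE_BREAKS: Record<ScriptFamily, string> = {
  latin: ".;:,!?",
  cjk: "。、，；：！？",
  rtl: "،؛؟.!",
};

const byteLength = (s: string): number => new TextEncoder().encode(s).length;

function clampForBudget(text: string, maxBytes: number, script: ScriptFamily): string {
  if (byteLength(text) <= maxBytes) return text; // already fits: return verbatim

  // Find the longest prefix of whole code points that fits the byte budget.
  const chars = Array.from(text);
  let used = 0;
  let end = 0;
  while (end < chars.length) {
    const size = byteLength(chars[end]);
    if (used + size > maxBytes) break;
    used += size;
    end++;
  }

  // Prefer the last clause boundary inside that prefix (cut before the mark).
  for (let i = end - 1; i > 0; i--) {
    if (CLAUSE_BREAKS[script].includes(chars[i])) {
      return chars.slice(0, i).join("").trimEnd();
    }
  }
  // Otherwise fall back to the last whitespace break.
  for (let i = end - 1; i > 0; i--) {
    if (/\s/.test(chars[i])) return chars.slice(0, i).join("").trimEnd();
  }
  // No boundary at all: hard cut on a code-point boundary.
  return chars.slice(0, end).join("");
}

console.log(clampForBudget("Budget vote passes, markets rally on the news", 30, "latin"));
// prints "Budget vote passes"
```

Iterating over `Array.from(text)` rather than raw string indices keeps the cut on code-point boundaries, so a surrogate pair is never split when the budget runs out mid-character.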