Zipf–Mandelbrot (8-bit)

Zipf exponent, vocabulary richness, frequency decay
distributional · dim Linguistic · 7 metrics

What It Measures

How the frequencies of byte values decay from most-common to least-common.

Counts the frequency of each of the 256 possible byte values, sorts them from most to least common, and fits the Zipf-Mandelbrot law, f(r) ∝ 1/(r + q)^α for rank r: does frequency drop off as a power of rank? Natural language follows Zipf's law closely (α ≈ 1 and q = 0, so the 2nd most common word appears about half as often as the 1st). Random data has a flat frequency profile with no decay. This geometry characterizes the "vocabulary" structure of the byte stream.
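
The counting and ranking step can be sketched in a few lines of Python. This is a minimal sketch, not the framework's implementation: the function name `rank_frequencies` is hypothetical, and dropping byte values that never occur (rather than ranking them with a count of zero) is an assumption.

```python
from collections import Counter


def rank_frequencies(data: bytes) -> list:
    """Counts of the observed byte values, sorted from most to least common.

    Assumption: byte values that never occur are left out of the ranking.
    """
    counts = Counter(data)  # iterating over bytes yields ints in 0..255
    return sorted(counts.values(), reverse=True)


print(rank_frequencies(b"abracadabra"))  # [5, 2, 2, 1, 1]
```

The resulting list is the rank-frequency curve: position 0 is rank 1, position 1 is rank 2, and so on.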

Metrics

zipf_alpha

The Zipf exponent: how steeply does frequency decay with rank? Alpha = 0 means flat (all bytes equally common). Alpha = 1 is Zipf's law (natural language). Poker Hands (3.87), DNA Thermus (3.77), and Collatz Gap Lengths (3.62) score highest: extremely steep decay, where a few values dominate completely. Sensor event streams (3.04) are similarly top-heavy. Collatz parity scores 0.0 (two values only, not enough for a power-law fit).
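
One common way to estimate such an exponent is an ordinary least-squares line through log(frequency) versus log(rank); the negated slope is alpha. The sketch below assumes that approach with q fixed at 0, which the text above does not confirm. The name `zipf_alpha` and the `min_distinct` cutoff of 3 are hypothetical, chosen so that a two-valued stream returns 0.0 as described for Collatz parity.

```python
import math
from collections import Counter


def zipf_alpha(data: bytes, min_distinct: int = 3) -> float:
    """Negated slope of the least-squares line through (log rank, log count)."""
    counts = sorted(Counter(data).values(), reverse=True)
    if len(counts) < min_distinct:
        return 0.0  # assumed fallback: too few distinct values to fit
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return max(0.0, -sxy / sxx)


# Synthetic stream whose counts are exactly 60/rank: 60, 30, 20, 15, 12.
zipf_like = b"".join(bytes([i]) * (60 // (i + 1)) for i in range(5))
print(round(zipf_alpha(zipf_like), 6))   # 1.0
print(zipf_alpha(bytes(range(256))))     # 0.0 (flat distribution)
```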

zipf_r_squared

How well does the Zipf-Mandelbrot model actually fit? De Bruijn and Gray Code Counter (1.0) score a perfect fit: a uniform distribution is a trivial special case (alpha = 0, perfect fit). Intermittency Type-III (0.99), Divisor count (0.99), and logistic near-full chaos (0.99) also fit well. Collatz parity scores 0.0 (too few unique values). A high alpha with low r_squared means the signal is concentrated but not in a power-law way, which is useful for distinguishing genuine Zipf behavior from arbitrary concentration.
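
A fit-quality score of this kind can be sketched as the coefficient of determination of a straight line in log-log space. This is an illustration under stated assumptions, not the framework's code: `zipf_r_squared` is a hypothetical name, q is fixed at 0, a flat distribution is defined to score 1.0, and fewer than three distinct values score 0.0, mirroring the De Bruijn and Collatz parity cases described above.

```python
import math
from collections import Counter


def zipf_r_squared(data: bytes, min_distinct: int = 3) -> float:
    """R^2 of the least-squares line through (log rank, log count)."""
    counts = sorted(Counter(data).values(), reverse=True)
    if len(counts) < min_distinct:
        return 0.0  # assumed convention: too few values to judge a fit
    if len(set(counts)) == 1:
        return 1.0  # assumed convention: flat curve is a perfect alpha = 0 fit
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # For a one-variable linear fit, R^2 is the squared correlation.
    return (sxy * sxy) / (sxx * syy)


# Concentrated but not power-law: three dominant values, three rare ones.
stepped = b"".join(bytes([i]) * n for i, n in enumerate([100, 100, 100, 1, 1, 1]))
print(zipf_r_squared(stepped) < 0.9)  # True
```

The stepped example is the situation the paragraph above warns about: the counts are highly concentrated, yet a power law describes them poorly.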

mandelbrot_q

The Mandelbrot offset parameter: how much do the low ranks deviate from pure Zipf? Large q means the most common values are less dominant than Zipf would predict, so the top of the frequency curve is flattened. Solar wind IMF, Solar wind speed, and Sunspot are quoted at 10.0, described as the maximum q (their distributions have a plateau at the top before the power-law tail kicks in), although the Atlas Rankings below top out at 100.0 (Arnold Cat Map, Collatz Trajectory, EEG Eyes Open). Logistic chaos and constants score 0.0.
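
One simple way to estimate an offset like q is a grid search: for each candidate q, fit a line through log(count) versus log(rank + q) and keep the q with the smallest squared error. The sketch below assumes that procedure; the source does not say how the fit is actually done. `fit_zipf_mandelbrot`, `q_max`, and `q_step` are hypothetical names, and the default `q_max=10.0` simply follows the "maximum q" quoted above.

```python
import math
from collections import Counter


def _line_fit(xs, ys):
    """Least-squares slope and sum of squared errors for y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, sse


def fit_zipf_mandelbrot(data: bytes, q_max: float = 10.0, q_step: float = 0.5):
    """Return (q, alpha) minimising log-space error over a grid of q values."""
    counts = sorted(Counter(data).values(), reverse=True)
    if len(counts) < 3:
        return 0.0, 0.0  # assumed fallback for too few distinct values
    ys = [math.log(c) for c in counts]
    best = None
    for i in range(int(round(q_max / q_step)) + 1):
        q = i * q_step
        xs = [math.log(rank + q) for rank in range(1, len(counts) + 1)]
        slope, sse = _line_fit(xs, ys)
        if best is None or sse < best[2]:  # ties keep the smaller q
            best = (q, -slope, sse)
    return best[0], best[1]


# Synthetic counts proportional to 1/(rank + 5)^2: a flattened top.
plateau = b"".join(bytes([r]) * round(360000 / (r + 5) ** 2) for r in range(1, 9))
q, alpha = fit_zipf_mandelbrot(plateau)
print(q, round(alpha, 2))
```

A finer grid or a proper nonlinear optimiser would give a smoother estimate; the grid keeps the sketch dependency-free.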

gini_coefficient

Income-inequality measure applied to byte frequencies. 0.0 means perfect equality (all observed bytes equally common). Values near 1.0 mean maximal inequality (one byte takes almost all the count while the others barely appear). Devil's Staircase (0.98), Rainfall (0.97), and Aubry-André Critical (0.96) are the most unequal, with Forest fire (0.95) and Neural net pruned (0.94) close behind: a handful of values dominate. Constants score 0.0 (only one value, and there is no inequality when there is only one entity).
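
The standard Gini formula over sorted counts can be sketched as follows. The name `gini_coefficient` is hypothetical, and restricting the calculation to byte values that actually occur is an assumption, inferred from constants and the period-2 logistic orbit both scoring 0.0.

```python
from collections import Counter


def gini_coefficient(data: bytes) -> float:
    """Gini coefficient of the counts of observed byte values."""
    counts = sorted(Counter(data).values())  # ascending order
    n = len(counts)
    if n == 0:
        return 0.0
    total = sum(counts)
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with i starting at 1
    weighted = sum(i * c for i, c in enumerate(counts, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


print(gini_coefficient(b"\x00" * 100))  # 0.0 (a single value)

# One dominant value plus 99 values seen once each.
skewed = bytes([0]) * 1000 + bytes(range(1, 100))
print(round(gini_coefficient(skewed), 2))  # 0.9
```

With n observed values the coefficient cannot exceed (n - 1)/n, so a score of exactly 1.0 is only approached, never reached.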

hapax_ratio

Fraction of distinct byte values that appear exactly once (hapax legomena). Devil's Staircase (0.50), von Mangoldt Function (0.35), and Aubry-André Critical (0.32) score highest; Rainfall (0.31), Accel sit (0.30), and EEG tumor (0.26) are also high. In these sources many byte values appear only once, indicating a sparse tail. Logistic chaos, Henon map, and Tent map score 0.0 (chaotic maps visit enough values often enough that none are unique). Signals with a high hapax ratio have "rare words", a linguistic fingerprint of sparse, heavy-tailed data.
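
The definition translates directly into code. In this small sketch the name `hapax_ratio` is hypothetical, and returning 0.0 for empty input is an assumption.

```python
from collections import Counter


def hapax_ratio(data: bytes) -> float:
    """Fraction of distinct byte values that occur exactly once."""
    counts = Counter(data)
    if not counts:
        return 0.0  # assumed convention for empty input
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


# Five distinct letters, of which "c" and "d" occur once.
print(hapax_ratio(b"abracadabra"))  # 0.4
```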

bigram_predictability

log₂(16) minus the conditional entropy H(X_{t+1} | X_t) computed over a 16-symbol coarse-graining of the byte stream. Zero means the next coarse symbol is fully unpredictable from the current one (max-entropy transitions); log₂(16)=4 means the next symbol is determined by the current one. Period-locked logistic orbits, Constants, Collatz Gap Lengths, Square Wave, Forest Fire, and Rainfall all score near the 4.0 ceiling — every transition is deterministic given the coarse state. PRNGs and crypto bottom out around 0.01 (Wichmann-Hill, XorShift32, White Noise, Pi Digits, MINSTD, glibc LCG, AES Encrypted, Arnold Cat Map). Correlates −0.93 with Predictability:cond_entropy_k1. Despite the "8-bit"/"16-bit" geometry labels, the metric uses a fixed 16-symbol quantization to keep the 16×16 transition matrix (256 cells) well-sampled at typical data lengths.
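
The definition above can be sketched as follows. The name `bigram_predictability` is hypothetical, and mapping each byte to one of 16 equal-width bins (its high nibble) is an assumption; the text does not say how the 16-symbol coarse-graining is done.

```python
import math
import random
from collections import Counter


def bigram_predictability(data: bytes, n_symbols: int = 16) -> float:
    """log2(n_symbols) minus H(X_{t+1} | X_t) over a coarse-grained stream."""
    if len(data) < 2:
        raise ValueError("need at least two bytes to form a transition")
    # Assumed coarse-graining: equal-width bins over 0..255.
    symbols = [b * n_symbols // 256 for b in data]
    pair_counts = Counter(zip(symbols, symbols[1:]))
    total = sum(pair_counts.values())
    row_totals = Counter()
    for (current, _next), c in pair_counts.items():
        row_totals[current] += c
    # H(next | current) = -sum p(current, next) * log2 p(next | current)
    cond_entropy = 0.0
    for (current, _next), c in pair_counts.items():
        cond_entropy -= (c / total) * math.log2(c / row_totals[current])
    return math.log2(n_symbols) - cond_entropy


print(bigram_predictability(bytes(1000)))        # 4.0 (constant stream)
print(bigram_predictability(b"\x00\xff" * 500))  # 4.0 (square wave)

rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(20000))
print(bigram_predictability(noise))  # small positive value
```

The noise score is small but not exactly zero: with a finite sample the estimated conditional entropy falls slightly short of 4 bits.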

entropy_nonstationarity

Standard deviation of the windowed (window=2048, step=1024) conditional-entropy sequence. Captures whether the bigram predictability itself drifts over time: it is large when the transition law is non-stationary. Financial indices are quoted as leading by a wide margin (Nikkei Returns 1.12, NASDAQ 1.10, NYSE 1.10) because their bigram statistics shift across regimes, with Middle-Square (0.92) following (the classical PRNG's recurrence periods produce non-stationary transitions); the Atlas Rankings below instead list Fibonacci Tight-Binding (0.73), Gaussian Collatz Orbit (0.60), and Kepler Exoplanet (0.57) at the top. Speech tokens, Kepler Exoplanet, Quantum Walk, and Rössler Hyperchaos also score mid-range. Zero (no variation) for all constants, period-locked logistic orbits, De Bruijn Sequence, Gray Code Counter, and L-System Dragon, whose bigram statistics are stationary by construction. The metric's distribution is heavy-tailed (kurtosis = 17).
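
The windowing can be sketched as below, using the stated window of 2048 and step of 1024. Everything else is an assumption: the names `conditional_entropy` and `entropy_nonstationarity` are hypothetical, the coarse-graining uses 16 equal-width bins, the population standard deviation is used, and streams too short for two full windows return 0.0.

```python
import math
import random
import statistics
from collections import Counter


def conditional_entropy(symbols) -> float:
    """H(X_{t+1} | X_t) in bits for a sequence of discrete symbols."""
    pair_counts = Counter(zip(symbols, symbols[1:]))
    total = sum(pair_counts.values())
    row_totals = Counter()
    for (current, _next), c in pair_counts.items():
        row_totals[current] += c
    return -sum(
        (c / total) * math.log2(c / row_totals[current])
        for (current, _next), c in pair_counts.items()
    )


def entropy_nonstationarity(data: bytes, window: int = 2048, step: int = 1024,
                            n_symbols: int = 16) -> float:
    """Standard deviation of conditional entropy across sliding windows."""
    symbols = [b * n_symbols // 256 for b in data]  # assumed equal-width bins
    entropies = [
        conditional_entropy(symbols[start:start + window])
        for start in range(0, len(symbols) - window + 1, step)
    ]
    if len(entropies) < 2:
        return 0.0  # assumed fallback: not enough windows to measure drift
    return statistics.pstdev(entropies)


print(entropy_nonstationarity(bytes(8192)))  # 0.0 (stationary by construction)

# A regime change: constant first half, random second half.
rng = random.Random(0)
drifting = bytes(4096) + bytes(rng.getrandbits(8) for _ in range(4096))
print(entropy_nonstationarity(drifting) > 1.0)  # True
```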

Atlas Rankings

bigram_predictability
Source | Domain | Value
Logistic r=3.5 (Period-4) | chaos | 4.0000
Logistic r=3.83 (Period-3 Window) | chaos | 4.0000
Logistic r=3.74 (Period-5 Window) | chaos | 4.0000
···
Euler-Mascheroni γ Digits | number_theory | 0.0099
MT19937 (Mersenne Twister) | binary | 0.0105
Wichmann-Hill | binary | 0.0105

entropy_nonstationarity
Source | Domain | Value
Fibonacci Tight-Binding | quantum | 0.7319
Gaussian Collatz Orbit | number_theory | 0.5980
Kepler Exoplanet | astro | 0.5730
···
Logistic r=3.5 (Period-4) | chaos | 0.0000
Logistic r=3.83 (Period-3 Window) | chaos | 0.0000
Logistic r=3.2 (Period-2) | chaos | 0.0000

gini_coefficient
Source | Domain | Value
Devil's Staircase | exotic | 0.9774
Rainfall (ORD Hourly) | climate | 0.9667
Aubry-André Critical | quantum | 0.9629
···
Logistic r=3.2 (Period-2) | chaos | 0.0000
Logistic r=3.5 (Period-4) | chaos | 0.0000
De Bruijn Sequence | number_theory | 0.0000

hapax_ratio
Source | Domain | Value
Devil's Staircase | exotic | 0.5035
von Mangoldt Function | number_theory | 0.3496
Aubry-André Critical | quantum | 0.3175
···
Henon Map | chaos | 0.0000
Rossler Attractor | chaos | 0.0000
Gzip (level 9) | binary | 0.0000

mandelbrot_q
Source | Domain | Value
Arnold Cat Map | chaos | 100.0000
Collatz Trajectory | number_theory | 100.0000
EEG Eyes Open | medical | 100.0000
···
Lotka-Volterra | bio | 0.0000
Continued Fractions | number_theory | 0.0000
Sine Wave | waveform | 0.0000

zipf_alpha
Source | Domain | Value
Poker Hands | exotic | 3.8710
DNA Thermus | bio | 3.7681
Collatz Gap Lengths | number_theory | 3.6218
···
Gray Code Counter | exotic | 0.0000
De Bruijn Sequence | number_theory | 0.0000
Dice Rolls | exotic | 0.0268

zipf_r_squared
Source | Domain | Value
Gray Code Counter | exotic | 1.0000
De Bruijn Sequence | number_theory | 1.0000
Intermittency Type-III | chaos | 0.9945
···
Phyllotaxis | bio | 0.0170
Circle Map Quasiperiodic | chaos | 0.0171
Sawtooth Wave | waveform | 0.0326

When It Lights Up

Zipf-Mandelbrot (8-bit) is the framework's vocabulary profiler at single-byte resolution. The combination of alpha (decay steepness), r_squared (fit quality), and gini (concentration) gives a three-dimensional characterization of the frequency curve that entropy alone collapses to a single number. In the atlas, rainfall and forest fire cluster together on the high-gini, high-alpha, high-hapax corner — both are "natural language-like" in having a few dominant values and a long sparse tail. PRNGs and De Bruijn occupy the opposite corner: flat frequencies, low gini, zero hapax.
