Sparse, adaptive, scalable histogram in TypeScript. On GitHub.
This TypeScript library provides a data structure for efficiently characterizing datasets through histograms.
Sparstogram is a histogram that maintains a complete or sparse approximation of the data frequency. The representation will be complete if the number of distinct values is less than or equal to maxCentroids. Otherwise, the values will be compressed to a smaller number of centroids in a way that minimizes loss while preserving distribution features like peaks and tails. The histogram can also maintain a set of quantile markers, which are updated with each new value (e.g. the median can be tracked efficiently). The structure is adaptive in range and resolution, and can be used for streaming or large data digests.
The implementation uses B+Trees, self-balancing structures that scale efficiently, to maintain the centroids and their priority scores. As the number of unique data values grows beyond the configured maxCentroids, the loss returned from the add method will begin to be non-zero, and represents how well the data compresses. If and when this loss grows beyond some threshold, the user can choose to increase the maxCentroids value to maintain higher accuracy. Conversely, maxCentroids can be dynamically shrunk to reduce memory, at the expense of more approximation.
The compression algorithm uses a curvature-aware scoring system that preferentially preserves regions where the empirical cumulative distribution function (CDF) bends sharply—such as peaks, valleys, and distribution tails—rather than simply merging the closest pairs. This approach is inspired by optimal k-point discrete approximations and Wasserstein-1 (Earth Mover's Distance) minimizing quantizers.
Histograms are often used for visualizing the frequency of occurrences, but many programmers don't realize how useful histograms are as an analytical data structure. A histogram is a powerful statistical tool that represents the frequency distribution of a dataset, which is useful across a variety of disciplines. There are many practical reasons for perceiving the distribution, outliers, skewness, range, magnitude, and central tendency of data. In just about any circumstance where a signal is mixed with noise, histograms are invaluable for separating the two, whether by establishing a cut-off quantile or by computing k-means.
In image processing, histograms are crucial for scaling pixel values. By examining the histogram of an image's pixel intensities, you can accurately adjust the contrast and brightness, enhancing the image quality by stretching or compressing the range of pixel values to utilize the full spectrum more effectively.
In performance optimization, system monitoring, and distributed systems, histograms play a vital role in determining timing thresholds. By generating a histogram of response times or execution times of a system, you can identify common performance bottlenecks and outliers. This enables you to set appropriate thresholds for alerts or to optimize system performance by focusing on the most impactful areas. For instance, understanding the distribution of database query times can help in optimizing indexes or rewriting queries to reduce the tail latencies that affect user experience.
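For instance, here is a minimal sketch of latency monitoring (the add and quantileAt calls are shown in the usage section below; the 200-centroid budget and 500ms threshold are illustrative choices, not recommendations):

import { Sparstogram } from "sparstogram";

const latencies = new Sparstogram(200); // illustrative centroid budget

function recordRequest(durationMs: number) {
  latencies.add(durationMs);
}

function tailLatencyAlert(): boolean {
  // Alert when the 99th-percentile latency exceeds an illustrative 500ms budget
  return latencies.quantileAt(0.99).value > 500;
}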
Sparstogram is inspired by the paper "Approximate Frequency Counts over Data Streams" by Gurmeet Singh Manku and Rajeev Motwani. Though similar in spirit, this implementation incorporates several enhancements, described in the sections that follow.
The Sparstogram maintains centroids in two B+Tree indices:
- `_centroids`: maintains centroids sorted by their value, enabling efficient range queries, quantile lookups, and neighbor access
- `_losses`: maintains centroid pairs sorted by their compression score, allowing O(1) identification of the best pair to merge

Each centroid stores a value, a count, and a variance.
When the number of centroids exceeds maxCentroids, the algorithm must merge adjacent pairs to reduce memory usage. The challenge is choosing which pairs to merge to minimize information loss while preserving important distribution features.
Rather than simply merging the closest pairs (which would smooth out peaks and tails), Sparstogram uses a curvature-aware scoring system inspired by optimal transport theory and data-aware discretization methods.
The key insight is that the empirical CDF should be approximated more finely where it "bends" sharply (high curvature) and can tolerate coarser approximation in flat regions (low curvature). This is analogous to adaptive mesh refinement in numerical analysis (refining where the solution changes rapidly) and non-uniform quantization in signal processing, both discussed further below.
For each adjacent pair of centroids (a, b), we estimate local CDF curvature by examining density changes with their neighbors:
density(u, v) = (count_u + count_v) / |value_v - value_u|
left_curvature = |density(l, a) - density(a, b)| // where l is left neighbor
right_curvature = |density(a, b) - density(b, r)| // where r is right neighbor
curvature = 0.5 * (left_curvature + right_curvature)
This approximates the second derivative of the empirical CDF at the seam between a and b.
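As a concrete sketch in TypeScript (illustrative, not the library's internal code; the Centroid shape here is an assumption):

// Minimal centroid shape used for illustration
interface Centroid { value: number; count: number; }

// Pooled density over the gap between two centroids
const density = (u: Centroid, v: Centroid): number =>
  (u.count + v.count) / Math.abs(v.value - u.value);

// Approximate the CDF's second derivative at the seam between a and b;
// l and r are the left/right neighbors (undefined at the ends)
function seamCurvature(
  l: Centroid | undefined, a: Centroid, b: Centroid, r: Centroid | undefined
): number {
  const mid = density(a, b);
  const left = l ? Math.abs(density(l, a) - mid) : 0;
  const right = r ? Math.abs(mid - density(b, r)) : 0;
  return 0.5 * (left + right);
}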
The score used to prioritize merge candidates is:
score = baseLoss × (ε + curvature)
Where ε is a small positive constant that keeps baseLoss relevant when curvature is exactly zero.
Pairs with lower scores (flat regions with low loss) are merged first; pairs with higher scores (high-curvature regions such as peaks, valleys, and tails) are preserved longer.
This means broad, flat regions are compacted aggressively, while peaks, valleys, and tails retain resolution.
The base loss between two centroids combines geometric and statistical information:
baseLoss = (count_a + count_b) * |value_a - value_b| + combinedVariance(a, b)
The weighted distance term captures the "transport cost" of moving mass, while the combined variance term accounts for information loss from merging distributions. The variance calculation properly handles the degrees of freedom and between-group variance:
combinedVariance = (SS_a + SS_b + SS_between) / (n_total - 1)
Where SS_between captures the variance introduced by the difference in means.
When merging two centroids, the new centroid is placed at the weighted median (the value of whichever centroid has more mass) rather than the weighted mean. This "micro-recentering" better preserves the distribution structure, especially for skewed or multi-modal data within the pair.
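Putting these pieces together, here is an illustrative sketch of scoring and merging a candidate pair, following the formulas above (names like VCentroid and EPSILON are assumptions for the sketch, not the library's API):

interface VCentroid { value: number; count: number; variance: number; }

// Pooled variance per the sum-of-squares formulas (see Variance Tracking below)
function combinedVariance(a: VCentroid, b: VCentroid): number {
  const ssA = a.variance * (a.count - 1);
  const ssB = b.variance * (b.count - 1);
  const ssBetween = (a.count * b.count * (a.value - b.value) ** 2) / (a.count + b.count);
  return (ssA + ssB + ssBetween) / (a.count + b.count - 1);
}

// Transport cost of moving mass plus statistical information loss
function baseLoss(a: VCentroid, b: VCentroid): number {
  return (a.count + b.count) * Math.abs(a.value - b.value) + combinedVariance(a, b);
}

// Lower scores merge first; curvature raises the score to protect peaks and tails
const EPSILON = 1e-9; // illustrative choice
function mergeScore(a: VCentroid, b: VCentroid, curvature: number): number {
  return baseLoss(a, b) * (EPSILON + curvature);
}

// Micro-recentering: place the merged centroid at the heavier centroid's value
function mergeCentroids(a: VCentroid, b: VCentroid): VCentroid {
  return {
    value: a.count >= b.count ? a.value : b.value,
    count: a.count + b.count,
    variance: combinedVariance(a, b),
  };
}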
The tightnessJ property provides a cheap, incrementally-maintained proxy for Wasserstein-1 (Earth Mover's) distance:
tightnessJ = Σ min(count_i, count_{i+1}) × |value_{i+1} - value_i|
This metric is maintained incrementally as values are added and centroids are merged, so it can be read in O(1).
Note that tightnessJ is a heuristic approximation, not an exact W1 distance. It's unnormalized and scale-dependent, making it best suited for relative monitoring (e.g., tracking how compression affects distribution spread) rather than as an absolute error bound.
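As a sketch of the definition (the library maintains tightnessJ incrementally; this batch recomputation over value-sorted centroids is only illustrative):

function computeTightnessJ(sorted: { value: number; count: number }[]): number {
  let sum = 0;
  for (let i = 0; i + 1 < sorted.length; i++) {
    // min(count_i, count_{i+1}) × |value_{i+1} - value_i|
    sum += Math.min(sorted[i].count, sorted[i + 1].count)
      * Math.abs(sorted[i + 1].value - sorted[i].value);
  }
  return sum;
}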
In a typical representation of a histogram, values are evenly spaced. Really, each value represents a range of values that are grouped. For instance, if the values are 4ms, 5ms, and 6ms, in reality each bucket represents an interval (3.5ms-4.5ms, 4.5ms-5.5ms, ...).
In computers, histograms are typically represented with an array, where every element represents the count in that range (or "entry"). There are a few noteworthy suppositions: the range of values must be known in advance, the resolution (bucket width) is fixed and uniform, and memory is consumed for every bucket in the range, even empty ones.
In a Sparstogram, no entries are stored until they are added, and no unused space is maintained between entries (AKA centroids). The same histogram information as above is represented in a Sparstogram as only three ordered entries, each holding a value and its count.
Unlike the traditional representation, the Sparstogram dynamically adapts itself in entries, resolution and range. A limit can be imposed on a Sparstogram through maxCentroids. Once that number of unique values has been reached, additional entries will result in compression so as not to exceed that many entries.
Compression involves merging two adjacent entries into one, with a new value (at the weighted median), combined count, and a computed variance. The pair to merge is chosen based on the lowest curvature-aware score, prioritizing merges in flat regions while preserving peaks, valleys, and tails. This information is maintained in an internal priority queue so as not to require searching.
If maxCentroids were set to 2 on the same Sparstogram, asking it to shrink to only 2 entries, two adjacent entries would be merged into a single centroid carrying their combined count and a computed variance.
Note that the weighted mean information is preserved through the count-weighted centroid values, while distribution shape information is captured through variance. To reduce the amount of distribution information lost, a variance is computed to capture the notion that the entry has "spread" and is no longer a point. The countAt, rankAt, and other accessor methods will interpolate based on that variance, assuming a normal distribution. As a result, though the true distribution is lost in the compression, the general effect of there being multiple entries is preserved, and important distribution features (like modes) are preferentially retained by the curvature-aware compression algorithm.
Create an instance with a maxCentroids budget and feed it some data: time spent in a user interface, financial data, pixel brightnesses, audio samples, whatever. The result will be the frequency of each value, from which you can make judgements. You can maintain the median over a streaming dataset, for instance, or compute the 75th percentile of something to discover outliers or eliminate noise.
To install the Sparstogram library, use the following command:
npm install sparstogram
Or, if you prefer using pnpm:
pnpm add sparstogram
import { Sparstogram } from "sparstogram";
// Initialize a Sparstogram with a maximum of 100 centroids and a quantile marker for maintaining the median
const histogram = new Sparstogram(100, [0.5]);
// Add a series of values to the histogram
histogram.add(5.0);
histogram.add(2.5);
const loss = histogram.add(3.7);
// Dynamically adjust maxCentroids based on the application's precision requirements
if (loss > 3.5) {
  histogram.maxCentroids = 150;
}
// Retrieve the median value
const median = histogram.quantileAt(0.5).value;
// Or the same using the marker (faster, maintained incrementally)
const fastMedian = histogram.markerAt(0).value;
// Get the total count of values in the histogram
const totalCount = histogram.count;
// Get the tightness metric (Wasserstein-1 proxy)
const tightness = histogram.tightnessJ;
// Find the rank of a specific value
const rank = histogram.rankAt(3.7);
// Find the value at the given rank
const quantile = histogram.valueAt(2);
const value = quantile.value;
// Get interpolated count at a specific value
const count = histogram.countAt(3.7);
// Iterate all centroids
for (const entry of histogram.ascending()) {
  console.log(entry);
}
// Iterate from the median backwards
for (const entry of histogram.descending({ markerIndex: 0 })) {
  console.log(entry);
}
// Iterate from a value forwards
for (const entry of histogram.ascending({ value: 2.5 })) {
  console.log(entry);
}
// Find the local maxima with smoothing window for distribution analysis and clustering
for (const peak of histogram.peaks(3)) {
  console.log(`Peak from ${peak.start} to ${peak.end}, max count: ${peak.max}`);
}
// Merge another histogram into this one
const histogram1 = new Sparstogram(100);
const histogram2 = new Sparstogram(100);
// Add data to both
histogram1.add(1.0);
histogram1.add(2.0);
histogram2.add(3.0);
histogram2.add(4.0);
// Merge histogram2 into histogram1
histogram1.mergeFrom(histogram2);
// Now histogram1 contains all data from both
console.log(histogram1.count); // 4
// Add multiple centroids at once (more efficient for bulk operations)
const centroids = [
  { value: 1.0, count: 10, variance: 0.5 },
  { value: 2.0, count: 20, variance: 0.3 },
  { value: 3.0, count: 15, variance: 0.4 }
];
const maxLoss = histogram.append(...centroids);
console.log(`Max loss from batch: ${maxLoss}`);
The curvature-aware compression in Sparstogram is inspired by optimal transport theory, particularly Wasserstein-1 (Earth Mover's Distance) minimization. The W1 distance between two probability distributions can be thought of as the minimum "cost" of transforming one distribution into another, where cost is the amount of mass moved times the distance moved.
When compressing a histogram, we're essentially solving a discrete optimal transport problem: how to reduce k centroids to (k-1) centroids while minimizing the W1 distance to the original distribution. The curvature-aware approach approximates this by keeping fine resolution where the CDF bends sharply, and merging mass where the CDF is nearly flat and moving mass costs little.
This is related to several research areas. The problem of approximating a continuous or high-resolution discrete distribution with k points has been studied extensively, notably in quantization and clustering. Sparstogram differs from these approaches in that it operates incrementally (streaming) and adaptively adjusts its representation as data arrives, rather than requiring multiple passes over the data.
The curvature-aware approach is analogous to adaptive mesh refinement in numerical analysis and computational physics, where resolution is refined in regions of rapid change and coarsened where the solution is smooth.
In Sparstogram, curvature serves as the "error estimate" that drives where to preserve detail.
The compression algorithm relates to rate-distortion concepts from signal processing: the maxCentroids parameter controls the "rate" (memory usage), while the curvature-aware scoring minimizes "distortion" (information loss). The tightnessJ metric serves as a distortion measure, allowing users to monitor the rate-distortion trade-off.
Sparstogram is designed for streaming data scenarios where data arrives continuously, memory is bounded, and each value can be examined only once.
This places Sparstogram in the family of streaming algorithms and online algorithms:
| Algorithm | Space | Query Time | Update Time | Accuracy |
|---|---|---|---|---|
| Sparstogram | O(k) | O(log k) | O(log k) | Curvature-adaptive |
| t-digest | O(k) | O(k) | O(log k) | Quantile-adaptive |
| Q-digest | O(log u) | O(log u) | O(log u) | Uniform error |
| GK (Greenwald-Khanna) | O(1/ε log εn) | O(log 1/ε) | O(log 1/ε) | ε-approximate |
| KLL sketch | O(1/ε log n) | O(1) | O(1) amortized | ε-approximate |
Where k is the number of centroids (maxCentroids), u is the size of the value universe, ε is the error parameter, and n is the number of values observed.
Sparstogram's advantages: curvature-adaptive accuracy that preserves distribution features, and a maxCentroids budget that can be tuned at runtime. Trade-offs: no formal ε-approximation guarantees (see Limitations below).
Each centroid maintains a variance, which serves multiple purposes: it contributes to the merge loss calculation, and it drives interpolation when querying within a compressed centroid.
When merging centroids a and b, the variance contributes to the base loss:
SS_a = variance_a × (count_a - 1) // sum of squares
SS_b = variance_b × (count_b - 1)
SS_between = (count_a × count_b × (value_a - value_b)²) / (count_a + count_b)
combined_variance = (SS_a + SS_b + SS_between) / (count_a + count_b - 1)
This properly accounts for the degrees of freedom in each group's sum of squares, and for the between-group variance introduced by the difference in means.
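As a worked example (illustrative numbers): merging a = (value 2, count 4, variance 1) with b = (value 4, count 2, variance 0.5) gives SS_a = 1 × 3 = 3, SS_b = 0.5 × 1 = 0.5, and SS_between = (4 × 2 × (2 − 4)²) / 6 ≈ 5.33, so combined_variance = (3 + 0.5 + 5.33) / 5 ≈ 1.77. Note that this is larger than either input variance, reflecting the spread between the two means.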
When querying a value that falls between centroids, or within a compressed centroid, Sparstogram assumes the mass follows a normal distribution with the centroid's mean and variance:
density(x) = (1 / (σ√(2π))) × exp(-0.5 × ((x - μ) / σ)²)
CDF(x) = 0.5 × (1 + erf((x - μ) / (σ√2)))
This allows countAt, rankAt, and quantile queries to interpolate smoothly within a compressed centroid's spread rather than treating it as a point mass.
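A sketch of the interpolation math (illustrative; JavaScript's Math has no erf, so the Abramowitz-Stegun 7.1.26 approximation is used here):

// Abramowitz & Stegun 7.1.26 approximation of erf (max error ~1.5e-7)
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  x = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * x);
  const poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
    - 0.284496736) * t + 0.254829592) * t;
  return sign * (1 - poly * Math.exp(-x * x));
}

// Fraction of a centroid's mass at or below x, assuming Normal(mu, sigma^2)
function normalCdf(x: number, mu: number, sigma: number): number {
  if (sigma === 0) return x < mu ? 0 : 1; // uncompressed point mass
  return 0.5 * (1 + erf((x - mu) / (sigma * Math.SQRT2)));
}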
The normal distribution assumption is a simplification, but works well in practice. Other interpolation approaches exist, with different accuracy and cost trade-offs.
The normal distribution strikes a balance between accuracy, computational efficiency, and theoretical justification (Central Limit Theorem suggests compressed centroids approximate normal distributions).
Sparstogram is well-suited for distributed and parallel computing:
Multiple Sparstogram instances can process different data partitions independently:
// On worker 1
const hist1 = new Sparstogram(1000);
for (const value of partition1) hist1.add(value);
// On worker 2
const hist2 = new Sparstogram(1000);
for (const value of partition2) hist2.add(value);
// On coordinator
const combined = new Sparstogram(2000);
combined.mergeFrom(hist1);
combined.mergeFrom(hist2);
For large-scale systems, use tree-based aggregation:
Level 0: [H1] [H2] [H3] [H4] [H5] [H6] [H7] [H8] (workers)
Level 1: [H12] [H34] [H56] [H78] (reducers)
Level 2: [H1234] [H5678] (reducers)
Level 3: [H12345678] (final)
Each merge operation is efficient and the variance tracking ensures accuracy is maintained.
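That reduction might look like the following sketch (using the mergeFrom API shown above; error handling kept minimal):

// Pairwise tree reduction of per-worker histograms into a single digest
function treeReduce(level: Sparstogram[], maxCentroids: number): Sparstogram {
  if (level.length === 0) throw new Error("no histograms to reduce");
  while (level.length > 1) {
    const next: Sparstogram[] = [];
    for (let i = 0; i < level.length; i += 2) {
      const combined = new Sparstogram(maxCentroids);
      combined.mergeFrom(level[i]);
      if (i + 1 < level.length) combined.mergeFrom(level[i + 1]);
      next.push(combined);
    }
    level = next; // each pass halves the number of histograms
  }
  return level[0];
}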
For streaming systems with time windows:
const WINDOW_SIZE = 60_000; // window width in ms (illustrative: one-minute windows)
const windows = new Map<number, Sparstogram>();

function processValue(timestamp: number, value: number) {
  const windowId = Math.floor(timestamp / WINDOW_SIZE);
  if (!windows.has(windowId)) {
    windows.set(windowId, new Sparstogram(100));
  }
  windows.get(windowId)!.add(value);
}
// Merge windows for longer-term analysis
function getHourlyAggregate(hour: number) {
  const hourlyHist = new Sparstogram(500);
  for (let min = 0; min < 60; min++) {
    const windowId = hour * 60 + min;
    if (windows.has(windowId)) {
      hourlyHist.mergeFrom(windows.get(windowId)!);
    }
  }
  return hourlyHist;
}
No formal error bounds: Unlike GK or KLL, Sparstogram doesn't provide formal ε-approximation guarantees. The tightnessJ metric is heuristic.
Normal distribution assumption: Interpolation assumes normality, which may not hold for all data patterns.
Local curvature only: The curvature metric uses only immediate neighbors, which may miss global structure.
Single dimension: Currently handles univariate data only (though could be extended to multiple independent histograms).
K-means integration: Automatically detect and maintain cluster centers based on peak detection.
Allan variance: For time-series data, AVAR could provide better stability and noise characterization.
Adaptive maxCentroids: Automatically adjust based on tightnessJ to maintain quality targets.
Formal error bounds: Develop theoretical bounds on quantile approximation error as a function of maxCentroids and curvature.
Non-parametric interpolation: Alternative to normal distribution assumption, perhaps based on observed local shape.
Multi-dimensional extension: Joint histograms or copula-based approaches for multivariate data.
Incremental computation: Maintain running statistics (mean, std dev, skewness, kurtosis) incrementally.
Contributions to the Digitree Sparstogram library are welcome! Here's how you can contribute:
- Follow the code style configured in .editorconfig
- Build with yarn build or npm run build
- Test with yarn test or npm test

This project is licensed under the MIT License - see the LICENSE file for details.