The initial maximum number of centroids to store. This must be at least 1, and can be changed at any time. Note that this does not allocate anything; it is the threshold at which centroids will begin to be collapsed.
markers: number[] (optional) — Persisted quantiles to maintain. These don't have to be in order, but they must be between 0 and 1. The quantiles are updated with each new value and can be retrieved with atMarker.
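For orientation, a minimal construction sketch; the class name Histogram and its import path are assumptions, and only the maxCentroids and markers options come from the docs above:

```ts
import { Histogram } from "streaming-histogram"; // hypothetical import

// Keep at most 100 centroids; persistently track the median and quartiles.
const h = new Histogram({ maxCentroids: 100, markers: [0.5, 0.25, 0.75] });
```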
The current number of distinct values (centroids) in the histogram.
This may be fewer than the number of distinct values added if maxCentroids
is smaller than that number.
The current total count accumulated in the histogram (the sum of the counts of all values).
The current maximum number of centroids to store before compression.
Adds a value to the histogram. If you want to dynamically limit loss, monitor the returned loss and adjust maxCentroids accordingly.
The loss incurred by compression, if any
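Continuing the construction sketch above, a sketch of the loss-monitoring pattern just described (the method name add, its single-value signature, and the writability of maxCentroids are assumptions; the docs do say maxCentroids can be changed at any time):

```ts
const LOSS_THRESHOLD = 0.01; // hypothetical tolerance
const incoming = [12, 7, 7, 42, 3.5]; // example data
for (const value of incoming) {
  const loss = h.add(value);
  if (loss !== undefined && loss > LOSS_THRESHOLD) {
    h.maxCentroids *= 2; // relax compression to bound further loss
  }
}
```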
Adds one or more centroids to the histogram. All centroids are added before compressing, to incur the least loss. If you want to reduce memory usage, or to monitor loss as you go, iterate and add centroids one at a time rather than passing them all to this method.
The maximal loss incurred by compression, if any
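A sketch contrasting the bulk and one-at-a-time styles (the method name append, its call signatures, and the centroid shape { value, count } are all assumptions):

```ts
const centroids = [{ value: 1, count: 3 }, { value: 2, count: 5 }];

// Bulk: every centroid is added before compression, for the least loss.
h.append(...centroids);

// One at a time: lower peak memory, and loss can be observed per centroid.
for (const c of centroids) {
  const loss = h.append(c);
  // ...react to loss here if desired...
}
```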
Returns the interpolated count of values at a given value in the histogram
The value to find the count for
The count of the value in the histogram (interpolated, but in whole counts)
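A usage sketch (the method name countAt is an assumption):

```ts
// Interpolated number of observations at the value 42, in whole counts.
const n = h.countAt(42);
```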
Returns the quantile marker at a given index, as given by markers at construction (e.g. with markers [0.5, 0.25, 0.75]: 0 = median, 1 = lower quartile, 2 = upper quartile).
The index of the marker to find the value for
The Quantile information at the given marker in the histogram
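Continuing the construction sketch, where markers was [0.5, 0.25, 0.75] (atMarker is named in the option docs above; the exact return shape is not shown here):

```ts
const median = h.atMarker(0); // marker 0 tracks quantile 0.5
const q1 = h.atMarker(1);     // marker 1 tracks quantile 0.25
const q3 = h.atMarker(2);     // marker 2 tracks quantile 0.75
```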
Merges all the centroids from another histogram into this one. All centroids are added before compressing, to incur the least loss. If you want to reduce memory usage, or to monitor loss, use an iterator with sequential calls to append rather than this method.
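A merge sketch (the method name merge is an assumption; append is referenced in the description above):

```ts
const other = new Histogram({ maxCentroids: 100 });
other.add(99);
h.merge(other); // all of other's centroids are added before h is compressed
```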
Computes the peaks in the histogram. This is a simple peak-detection algorithm that finds local maxima in the histogram, for detecting frequency clusters. A peak is defined as a local maximum where the count is greater than the counts of the centroids on either side. If there are too few centroids to allow for the given smoothing, no peaks will be returned.
The number of centroids in each "direction" to consider for smoothing the peaks (default 3)
Iterator for the computed Peak entries in the histogram
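A peak-iteration sketch (the method name peaks, the iterability of its result, and the Peak entry shape are assumptions):

```ts
// Smooth over 5 centroids in each direction instead of the default 3.
for (const peak of h.peaks(5)) {
  console.log(peak); // one local maximum per entry
}
```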
Returns the value at a given quantile in the histogram (0.5 = median)
The quantile (0-1) to find the value for
The centroid at the given quantile in the histogram
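A quantile-lookup sketch (the method name atQuantile is an assumption):

```ts
const p50 = h.atQuantile(0.5);  // median
const p99 = h.atQuantile(0.99); // 99th percentile
```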
Returns the rank of a value in the histogram: the count of all values less than or equal to the given value. This method interpolates the rank between and beyond centroids based on the normal distribution of each centroid.
The value to find the rank for.
The rank of the value in the histogram
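A rank-lookup sketch (the method name rankOf is an assumption):

```ts
// How many observed values are less than or equal to 42 (interpolated)?
const rank = h.rankOf(42);
```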
Returns the centroid at a given rank in the histogram (count / 2 = median)
The rank (accumulated count from start if positive, end if negative) to find the value for
The Quantile information at the given rank in the histogram - always returns a positive rank
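A rank-indexing sketch (the method name atRank and the total-count property name count are assumptions):

```ts
const lowest = h.atRank(1);             // accumulated from the start
const highest = h.atRank(-1);           // negative ranks accumulate from the end
const approxMedian = h.atRank(h.count / 2); // count / 2 = median, per the docs
```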
A histogram that maintains a complete or sparse approximation of the data frequency. The representation is complete if the number of distinct values is less than or equal to maxCentroids; otherwise, the values are compressed into a smaller number of centroids in a way that minimizes the loss. The histogram can also maintain a set of quantile markers, which are updated with each new value (e.g. the median can be maintained efficiently). The structure is adaptive in range and resolution, and can be used for streaming data.

The histogram is roughly based on the ideas in the paper "Approximate Frequency Counts over Data Streams" by Gurmeet Singh Manku and Rajeev Motwani. The implementation uses B+Trees, a self-balancing structure, to efficiently maintain the centroids and losses, so it should scale efficiently.
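To make the compression behavior concrete, a sketch under the same naming assumptions as the examples above (size per the property described earlier):

```ts
const small = new Histogram({ maxCentroids: 4 });
for (const v of [1, 2, 3, 4, 5, 6]) small.add(v);
// Six distinct values were added but only four centroids are allowed: nearby
// values have been collapsed into weighted centroids to minimize loss.
console.log(small.size); // at most 4
```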