DataSketches KLL Sketch module
This module provides Apache Druid aggregators based on the numeric quantiles sketches KllFloatsSketch and KllDoublesSketch from the Apache DataSketches library. The KLL quantiles sketch is a mergeable streaming algorithm that estimates the distribution of values and approximately answers queries about the rank of a value, the probability mass function (PMF) or histogram of the distribution, the cumulative distribution function (CDF), and quantiles (median, min, max, 95th percentile, and so on). See Quantiles Sketch Overview. This document applies to both KllFloatsSketch and KllDoublesSketch; the examples use only one of them.
There are three major modes of operation:
- Ingesting sketches built outside of Druid (for example, with Pig or Hive)
- Building sketches from raw data during ingestion
- Building sketches from raw data at query time
To use this aggregator, make sure you include the extension in your config file:
druid.extensions.loadList=["druid-datasketches"]
For additional sketch types supported in Druid, see DataSketches extension.
Aggregator
The result of the aggregation is a KllFloatsSketch or KllDoublesSketch that is the union of all sketches either built from raw data or read from the segments.
{
"type" : "KllDoublesSketch",
"name" : <output_name>,
"fieldName" : <metric_name>,
"k": <parameter that controls size and accuracy>
}
Property | Description | Required? |
---|---|---|
type | Either "KllFloatsSketch" or "KllDoublesSketch" | yes |
name | A String for the output (result) name of the calculation. | yes |
fieldName | A String for the name of the input field, which may contain sketches or raw numeric values. | yes |
k | Parameter that determines the accuracy and size of the sketch. Higher k means higher accuracy but more space to store sketches. Must be from 8 to 65535. See KLL Sketch Accuracy and Size. | no, defaults to 200 |
maxStreamLength | The number of items that can be presented to each sketch before it may need to move from off-heap to on-heap memory. This is relevant to query types that use off-heap memory, including TopN and GroupBy. Ideally, set it high enough that most sketches can stay off-heap. | no, defaults to 1000000000 |
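As an illustration of the properties above, the following is a minimal sketch of a native timeseries query that builds a KllDoublesSketch from raw data at query time. The data source name, interval, output name, and input field name (all prefixed with "example_" or chosen for illustration) are hypothetical and not taken from this document. The value of k is an arbitrary choice within the documented range of 8 to 65535, and maxStreamLength is shown at its documented default.

```json
{
  "queryType": "timeseries",
  "dataSource": "example_datasource",
  "granularity": "all",
  "intervals": ["2024-01-01/2024-01-02"],
  "aggregations": [
    {
      "type": "KllDoublesSketch",
      "name": "example_latency_sketch",
      "fieldName": "example_latency_ms",
      "k": 400,
      "maxStreamLength": 1000000000
    }
  ]
}
```

The same aggregator object should also be usable in the metricsSpec of an ingestion spec when building sketches from raw data during ingestion. A larger k trades more storage per sketch for higher accuracy, so the right value depends on how much error your use case can tolerate.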