
SQL aggregation functions

Apache Druid supports two query languages: Druid SQL and native queries. This document describes the SQL language.

You can use aggregation functions in the SELECT clause of any Druid SQL query.

Filter any aggregator using the FILTER clause, for example:

SELECT 
  SUM(added) FILTER(WHERE channel = '#en.wikipedia')
FROM wikipedia

The FILTER clause limits an aggregation query to only the rows that match the filter. Druid translates the FILTER clause to a native filtered aggregator. Two aggregators in the same SQL query may have different filters.
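
For example, the following query computes two filtered sums against the same wikipedia datasource, each with its own filter (the '#fr.wikipedia' channel value is illustrative):

SELECT
  SUM(added) FILTER(WHERE channel = '#en.wikipedia') AS en_added,
  SUM(added) FILTER(WHERE channel = '#fr.wikipedia') AS fr_added
FROM wikipedia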

When no rows are selected, aggregation functions return their initial value. This can occur in the following cases:

  • When no rows match the filter while aggregating values across an entire table without a grouping, or
  • When using filtered aggregations within a grouping.

The initial value varies by aggregator. COUNT and the approximate count distinct sketch functions always return 0 as the initial value.
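
For example, if the filter below matches no rows (the channel value is deliberately nonexistent), the aggregation returns the COUNT initial value of 0 rather than null:

SELECT COUNT(*) FILTER(WHERE channel = '#no.such.channel') AS matched_rows
FROM wikipedia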

Of the aggregation functions supported by Druid, only COUNT, ARRAY_AGG, and STRING_AGG accept the DISTINCT keyword.
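
For example, both aggregations in this sketch deduplicate the channel values of the wikipedia datasource used above before aggregating:

SELECT
  COUNT(DISTINCT channel) AS channel_count,
  STRING_AGG(DISTINCT channel, ',') AS channel_list
FROM wikipedia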

The order of aggregation operations across segments is not deterministic. This means that non-commutative aggregation functions can produce inconsistent results across the same query.

Because floating-point addition is not associative, functions that operate on an input type of float or double may also return slightly different results across multiple runs of the same query. If precisely the same value is desired across runs, consider using the ROUND function to smooth out the inconsistencies between queries.
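
For example, rounding an average to two decimal places produces a stable result across runs even if the underlying double-precision sum varies slightly:

SELECT ROUND(AVG(added), 2) AS avg_added
FROM wikipedia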

The following aggregation functions are available. Each entry gives the function, usage notes, and the default (initial) value returned when no rows are selected:

  • COUNT(*): Counts the number of rows. Default: 0.
  • COUNT(DISTINCT expr): Counts distinct values of expr. When useApproximateCountDistinct is set to "true" (the default), this is an alias for APPROX_COUNT_DISTINCT; the specific algorithm depends on the value of druid.sql.approxCountDistinct.function. In this mode, you can use strings, numbers, or prebuilt sketches. If counting prebuilt sketches, the prebuilt sketch type must match the selected algorithm. When useApproximateCountDistinct is set to "false", the computation is exact. In this case, expr must be string or numeric, since exact counts are not possible using prebuilt sketches. In exact mode, only one distinct count per query is permitted unless useGroupingSetForExactDistinct is enabled. Counts each distinct value in a multi-value row separately. Default: 0.
  • SUM(expr): Sums numbers. Default: null if druid.generic.useDefaultValueForNull=false, otherwise 0.
  • MIN(expr): Takes the minimum of numbers. Default: null if druid.generic.useDefaultValueForNull=false, otherwise 9223372036854775807 (maximum LONG value).
  • MAX(expr): Takes the maximum of numbers. Default: null if druid.generic.useDefaultValueForNull=false, otherwise -9223372036854775808 (minimum LONG value).
  • AVG(expr): Averages numbers. Default: null if druid.generic.useDefaultValueForNull=false, otherwise 0.
  • APPROX_COUNT_DISTINCT(expr): Counts distinct values of expr using an approximate algorithm. The expr can be a regular column or a prebuilt sketch column. The specific algorithm depends on the value of druid.sql.approxCountDistinct.function. By default, this is APPROX_COUNT_DISTINCT_BUILTIN. If the DataSketches extension is loaded, you can set it to APPROX_COUNT_DISTINCT_DS_HLL or APPROX_COUNT_DISTINCT_DS_THETA. When run on prebuilt sketch columns, the sketch column type must match the implementation of this function. For example, when druid.sql.approxCountDistinct.function is set to APPROX_COUNT_DISTINCT_BUILTIN, this function runs on prebuilt hyperUnique columns, but not on prebuilt HLLSketchBuild columns.
  • APPROX_COUNT_DISTINCT_BUILTIN(expr): Usage note: consider using APPROX_COUNT_DISTINCT_DS_HLL instead, which offers better accuracy in many cases. Counts distinct values of expr using Druid's built-in "cardinality" or "hyperUnique" aggregators, which implement a variant of HyperLogLog. The expr can be a string, a number, or a prebuilt hyperUnique column. Results are always approximate, regardless of the value of useApproximateCountDistinct.
  • APPROX_QUANTILE(expr, probability, [resolution]): Deprecated. Use APPROX_QUANTILE_DS instead, which provides a superior distribution-independent algorithm with formal error guarantees. Computes approximate quantiles on numeric or approxHistogram expressions. probability should be between 0 and 1, exclusive. resolution is the number of centroids to use for the computation; higher resolutions give more precise results but also have higher overhead. If not provided, the default resolution is 50. Load the approximate histogram extension to use this function. Default: NaN.
  • APPROX_QUANTILE_FIXED_BUCKETS(expr, probability, numBuckets, lowerLimit, upperLimit, [outlierHandlingMode]): Computes approximate quantiles on numeric or fixed buckets histogram expressions. probability should be between 0 and 1, exclusive. The numBuckets, lowerLimit, upperLimit, and outlierHandlingMode parameters are described in the fixed buckets histogram documentation. Load the approximate histogram extension to use this function. Default: 0.0.
  • BLOOM_FILTER(expr, numEntries): Computes a bloom filter from values produced by expr, where numEntries is the maximum number of distinct values before the false positive rate increases. See the bloom filter extension documentation for additional details. Default: empty base64-encoded bloom filter STRING.
  • VAR_POP(expr): Computes the population variance of expr. See the stats extension documentation for additional details. Default: null if druid.generic.useDefaultValueForNull=false, otherwise 0.
  • VAR_SAMP(expr): Computes the sample variance of expr. See the stats extension documentation for additional details. Default: null if druid.generic.useDefaultValueForNull=false, otherwise 0.
  • VARIANCE(expr): Computes the sample variance of expr. See the stats extension documentation for additional details. Default: null if druid.generic.useDefaultValueForNull=false, otherwise 0.
  • STDDEV_POP(expr): Computes the population standard deviation of expr. See the stats extension documentation for additional details. Default: null if druid.generic.useDefaultValueForNull=false, otherwise 0.
  • STDDEV_SAMP(expr): Computes the sample standard deviation of expr. See the stats extension documentation for additional details. Default: null if druid.generic.useDefaultValueForNull=false, otherwise 0.
  • STDDEV(expr): Computes the sample standard deviation of expr. See the stats extension documentation for additional details. Default: null if druid.generic.useDefaultValueForNull=false, otherwise 0.
  • EARLIEST(expr): Returns the earliest value of expr, which must be numeric. If expr comes from a relation with a timestamp column (like __time in a Druid datasource), the "earliest" is taken from the row with the overall earliest non-null value of the timestamp column. If the earliest non-null value of the timestamp column appears in multiple rows, the expr may be taken from any of those rows. If expr does not come from a relation with a timestamp, then it is simply the first value encountered. Default: null if druid.generic.useDefaultValueForNull=false, otherwise 0.
  • EARLIEST(expr, maxBytesPerString): Like EARLIEST(expr), but for strings. The maxBytesPerString parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory. Default: null if druid.generic.useDefaultValueForNull=false, otherwise ''.
  • EARLIEST_BY(expr, timestampExpr): Returns the earliest value of expr, which must be numeric. The earliest value of expr is taken from the row with the overall earliest non-null value of timestampExpr. If the earliest non-null value of timestampExpr appears in multiple rows, the expr may be taken from any of those rows. Default: null if druid.generic.useDefaultValueForNull=false, otherwise 0.
  • EARLIEST_BY(expr, timestampExpr, maxBytesPerString): Like EARLIEST_BY(expr, timestampExpr), but for strings. The maxBytesPerString parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory. Default: null if druid.generic.useDefaultValueForNull=false, otherwise ''.
  • LATEST(expr): Returns the latest value of expr, which must be numeric. If expr comes from a relation with a timestamp column (like __time in a Druid datasource), the "latest" is taken from the row with the overall latest non-null value of the timestamp column. If the latest non-null value of the timestamp column appears in multiple rows, the expr may be taken from any of those rows. If expr does not come from a relation with a timestamp, then it is simply the last value encountered. Default: null if druid.generic.useDefaultValueForNull=false, otherwise 0.
  • LATEST(expr, maxBytesPerString): Like LATEST(expr), but for strings. The maxBytesPerString parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory. Default: null if druid.generic.useDefaultValueForNull=false, otherwise ''.
  • LATEST_BY(expr, timestampExpr): Returns the latest value of expr, which must be numeric. The latest value of expr is taken from the row with the overall latest non-null value of timestampExpr. If the overall latest non-null value of timestampExpr appears in multiple rows, the expr may be taken from any of those rows. Default: null if druid.generic.useDefaultValueForNull=false, otherwise 0.
  • LATEST_BY(expr, timestampExpr, maxBytesPerString): Like LATEST_BY(expr, timestampExpr), but for strings. The maxBytesPerString parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory. Default: null if druid.generic.useDefaultValueForNull=false, otherwise ''.
  • ANY_VALUE(expr): Returns any value of expr, including null. expr must be numeric. Because any value is acceptable, Druid can optimize this aggregation by returning the first value it encounters (including null). Default: null if druid.generic.useDefaultValueForNull=false, otherwise 0.
  • ANY_VALUE(expr, maxBytesPerString): Like ANY_VALUE(expr), but for strings. The maxBytesPerString parameter determines how much aggregation space to allocate per string. Strings longer than this limit are truncated. This parameter should be set as low as possible, since high values will lead to wasted memory. Default: null if druid.generic.useDefaultValueForNull=false, otherwise ''.
  • GROUPING(expr, expr...): Returns a number indicating which groupBy dimensions are included in a row when using GROUPING SETS. Refer to the additional documentation on how to infer this number. Default: N/A.
  • ARRAY_AGG(expr, [size]): Collects all values of expr into an ARRAY, including null values, with an optional size limit in bytes on the aggregation (default 1024 bytes). If the aggregated array grows larger than the maximum size in bytes, the query fails. Use of ORDER BY within the ARRAY_AGG expression is not currently supported, and the ordering of results within the output array may vary depending on processing order. Default: null.
  • ARRAY_AGG(DISTINCT expr, [size]): Collects all distinct values of expr into an ARRAY, including null values, with an optional size limit in bytes on the aggregation (default 1024 bytes) per aggregate. If the aggregated array grows larger than the maximum size in bytes, the query fails. Use of ORDER BY within the ARRAY_AGG expression is not currently supported, and the ordering of results is based on the default for the element type. Default: null.
  • ARRAY_CONCAT_AGG(expr, [size]): Concatenates all array values of expr into a single ARRAY, with an optional size limit in bytes on the aggregation (default 1024 bytes). Input expr must be an array. Null expr values are ignored, but any null values within an expr are included in the resulting array. If the aggregated array grows larger than the maximum size in bytes, the query fails. Use of ORDER BY within the ARRAY_CONCAT_AGG expression is not currently supported, and the ordering of results within the output array may vary depending on processing order. Default: null.
  • ARRAY_CONCAT_AGG(DISTINCT expr, [size]): Concatenates all distinct values across all array values of expr into a single ARRAY, with an optional size limit in bytes on the aggregation (default 1024 bytes) per aggregate. Input expr must be an array. Null expr values are ignored, but any null values within an expr are included in the resulting array. If the aggregated array grows larger than the maximum size in bytes, the query fails. Use of ORDER BY within the ARRAY_CONCAT_AGG expression is not currently supported, and the ordering of results is based on the default for the element type. Default: null.
  • STRING_AGG(expr, separator, [size]): Collects all values of expr into a single STRING, ignoring null values. Each value is joined by the separator, which must be a literal STRING. An optional size in bytes can be supplied to limit the aggregation size (default 1024 bytes). If the aggregated string grows larger than the maximum size in bytes, the query fails. Use of ORDER BY within the STRING_AGG expression is not currently supported, and the ordering of results within the output string may vary depending on processing order. Default: null if druid.generic.useDefaultValueForNull=false, otherwise ''.
  • STRING_AGG(DISTINCT expr, separator, [size]): Collects all distinct values of expr into a single STRING, ignoring null values. Each value is joined by the separator, which must be a literal STRING. An optional size in bytes can be supplied to limit the aggregation size (default 1024 bytes). If the aggregated string grows larger than the maximum size in bytes, the query fails. Use of ORDER BY within the STRING_AGG expression is not currently supported, and the ordering of results is based on the default STRING ordering. Default: null if druid.generic.useDefaultValueForNull=false, otherwise ''.
  • BIT_AND(expr): Performs a bitwise AND operation on all input values. Default: null if druid.generic.useDefaultValueForNull=false, otherwise 0.
  • BIT_OR(expr): Performs a bitwise OR operation on all input values. Default: null if druid.generic.useDefaultValueForNull=false, otherwise 0.
  • BIT_XOR(expr): Performs a bitwise XOR operation on all input values. Default: null if druid.generic.useDefaultValueForNull=false, otherwise 0.
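
As an illustration of the EARLIEST and LATEST aggregators listed above, the following sketch (again using the wikipedia example datasource, whose __time column supplies the time ordering) returns the first and last added value per channel:

SELECT
  channel,
  EARLIEST(added) AS first_added,
  LATEST(added) AS last_added
FROM wikipedia
GROUP BY channel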

Sketch functions

These functions create sketch objects that you can use to perform fast, approximate analyses. For advice on choosing approximate aggregation functions, check out our approximate aggregations documentation. To operate on sketch objects, also see the DataSketches post aggregator functions.

HLL sketch functions

Load the DataSketches extension to use the following functions.

  • APPROX_COUNT_DISTINCT_DS_HLL(expr, [lgK, tgtHllType]): Counts distinct values of expr, which can be a regular column or an HLL sketch column. Results are always approximate, regardless of the value of useApproximateCountDistinct. The lgK and tgtHllType parameters, like their equivalents in the native aggregator, are described in the HLL sketch documentation. See also COUNT(DISTINCT expr). Default: 0.
  • DS_HLL(expr, [lgK, tgtHllType]): Creates an HLL sketch on the values of expr, which can be a regular column or a column containing HLL sketches. The lgK and tgtHllType parameters are described in the HLL sketch documentation. Default: '0' (STRING).
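
For example, assuming the DataSketches extension is loaded, the following query counts distinct channel values with an HLL sketch; the lgK argument of 12 is illustrative, and both parameters are optional:

SELECT APPROX_COUNT_DISTINCT_DS_HLL(channel, 12) AS distinct_channels
FROM wikipedia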

Theta sketch functions

Load the DataSketches extension to use the following functions.

  • APPROX_COUNT_DISTINCT_DS_THETA(expr, [size]): Counts distinct values of expr, which can be a regular column or a Theta sketch column. Results are always approximate, regardless of the value of useApproximateCountDistinct. The size parameter is described in the Theta sketch documentation. See also COUNT(DISTINCT expr). Default: 0.
  • DS_THETA(expr, [size]): Creates a Theta sketch on the values of expr, which can be a regular column or a column containing Theta sketches. The size parameter is described in the Theta sketch documentation. Default: '0.0' (STRING).
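
Similarly, assuming the DataSketches extension is loaded, a Theta sketch based distinct count looks like the following; the size argument of 16384 is illustrative and optional:

SELECT APPROX_COUNT_DISTINCT_DS_THETA(channel, 16384) AS distinct_channels
FROM wikipedia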

Quantiles sketch functions

Load the DataSketches extension to use the following functions.

  • APPROX_QUANTILE_DS(expr, probability, [k]): Computes approximate quantiles on numeric or Quantiles sketch expressions. The probability value should be between 0 and 1, exclusive. The k parameter is described in the Quantiles sketch documentation. See the known issue with this function. Default: NaN.
  • DS_QUANTILES_SKETCH(expr, [k]): Creates a Quantiles sketch on the values of expr, which can be a regular column or a column containing quantiles sketches. The k parameter is described in the Quantiles sketch documentation. See the known issue with this function. Default: '0' (STRING).
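
For example, assuming the DataSketches extension is loaded, this sketch estimates the 95th percentile of added from the wikipedia datasource; the k argument of 128 is illustrative and optional:

SELECT APPROX_QUANTILE_DS(added, 0.95, 128) AS added_p95
FROM wikipedia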

T-Digest sketch functions

Load the T-Digest extension to use the following functions. See the T-Digest extension documentation for additional details.

  • TDIGEST_QUANTILE(expr, quantileFraction, [compression]): Builds a T-Digest sketch on values produced by expr and returns the value for the given quantile. The compression parameter (default 100) determines the accuracy and size of the sketch; higher compression means higher accuracy but more space to store sketches. Default: Double.NaN.
  • TDIGEST_GENERATE_SKETCH(expr, [compression]): Builds a T-Digest sketch on values produced by expr. The compression parameter (default 100) determines the accuracy and size of the sketch; higher compression means higher accuracy but more space to store sketches. Default: empty base64-encoded T-Digest sketch STRING.
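
For example, assuming the T-Digest extension is loaded, the following query estimates the median of added; the compression argument of 100 is illustrative and matches the default:

SELECT TDIGEST_QUANTILE(added, 0.5, 100) AS added_median
FROM wikipedia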
โ† Scalar functionsMulti-value string functions โ†’
  • Sketch functions
    • HLL sketch functions
    • Theta sketch functions
    • Quantiles sketch functions
    • T-Digest sketch functions

Technologyโ€‚ยทโ€‚Use Casesโ€‚ยทโ€‚Powered by Druidโ€‚ยทโ€‚Docsโ€‚ยทโ€‚Communityโ€‚ยทโ€‚Downloadโ€‚ยทโ€‚FAQ

โ€‚ยทโ€‚โ€‚ยทโ€‚โ€‚ยทโ€‚
Copyright ยฉ 2022 Apache Software Foundation.
Except where otherwise noted, licensed under CC BY-SA 4.0.
Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.