
Datasources

Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource, and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true during data ingestion, where ingestion is always creating or writing into a table datasource. But at query time, there are many other types of datasources available.

The word "datasource" is generally spelled dataSource (with a capital S) when it appears in API requests and responses.

Datasource type

table

SQL:

SELECT column1, column2 FROM "druid"."dataSourceName"

Native:

{
  "queryType": "scan",
  "dataSource": "dataSourceName",
  "columns": ["column1", "column2"],
  "intervals": ["0000/3000"]
}

The table datasource is the most common type. This is the kind of datasource you get when you perform data ingestion. Table datasources are split up into segments, distributed around the cluster, and queried in parallel.

In Druid SQL, table datasources reside in the druid schema. This is the default schema, so table datasources can be referenced as either druid.dataSourceName or simply dataSourceName.
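
For instance, both of the following statements reference the same table (keeping the placeholder name dataSourceName from the example above):

-- The "druid" schema is the default, so both forms are equivalent.
SELECT COUNT(*) FROM "druid"."dataSourceName"
SELECT COUNT(*) FROM "dataSourceName"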

In native queries, table datasources can be referenced using their names as strings (as in the example above), or by using JSON objects of the form:

"dataSource": {
  "type": "table",
  "name": "dataSourceName"
}

To see a list of all table datasources, use the SQL query SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'.

lookup

SQL:

SELECT k, v FROM lookup.countries

Native:

{
  "queryType": "scan",
  "dataSource": {
    "type": "lookup",
    "lookup": "countries"
  },
  "columns": ["k", "v"],
  "intervals": ["0000/3000"]
}

Lookup datasources correspond to Druid's key-value lookup objects. In Druid SQL, they reside in the lookup schema. They are preloaded in memory on all servers, so they can be accessed rapidly. They can be joined onto regular tables using the join operator.

Lookup datasources are key-value oriented and always have exactly two columns: k (the key) and v (the value). Both are always strings.

To see a list of all lookup datasources, use the SQL query SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'.

Performance tip: Lookups can be joined with a base table either using an explicit join, or by using the SQL LOOKUP function. However, the join operator must evaluate the condition on each row, whereas the LOOKUP function can defer evaluation until after an aggregation phase. This means that the LOOKUP function is usually faster than joining to a lookup datasource.
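
As a sketch of the difference, assume a hypothetical events table with a country_code column, joined against the countries lookup from above:

-- Join form: the join condition must be evaluated for each input row.
SELECT countries.v AS country, COUNT(*) AS cnt
FROM events
INNER JOIN lookup.countries ON events.country_code = countries.k
GROUP BY countries.v

-- LOOKUP function form: usually faster, since the key-to-value mapping
-- can be deferred until after aggregation.
SELECT LOOKUP(country_code, 'countries') AS country, COUNT(*) AS cnt
FROM events
GROUP BY LOOKUP(country_code, 'countries')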

Refer to the Query execution page for more details on how queries are executed when you use lookup datasources.

union

SQL:

SELECT column1, column2
FROM (
  SELECT column1, column2 FROM table1
  UNION ALL
  SELECT column1, column2 FROM table2
  UNION ALL
  SELECT column1, column2 FROM table3
)

Native:

{
  "queryType": "scan",
  "dataSource": {
    "type": "union",
    "dataSources": ["table1", "table2", "table3"]
  },
  "columns": ["column1", "column2"],
  "intervals": ["0000/3000"]
}

Unions allow you to treat two or more tables as a single datasource. In SQL, this is done with the UNION ALL operator applied directly to tables, called a "table-level union". In native queries, this is done with a "union" datasource.

With SQL table-level unions, the same columns must be selected from each table in the same order, and those columns must either have the same types, or types that can be implicitly cast to each other (such as different numeric types). For this reason, it is more robust to write your queries to select specific columns.
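
As a sketch, consider two hypothetical tables, flights_2023 and flights_2024, that both contain tail_number and distance columns. Selecting the columns by name, in the same order from each table, keeps the union well-formed regardless of how the tables are otherwise laid out:

SELECT tail_number, distance FROM flights_2023
UNION ALL
SELECT tail_number, distance FROM flights_2024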

With the native union datasource, the tables being unioned do not need to have identical schemas. If they do not fully match up, then columns that exist in one table but not another will be treated as if they contained all null values in the tables where they do not exist.

In either case, features like expressions, column aliasing, JOIN, GROUP BY, ORDER BY, and so on cannot be used with table unions.

Refer to the Query execution page for more details on how queries are executed when you use union datasources.

inline

Native:

{
  "queryType": "scan",
  "dataSource": {
    "type": "inline",
    "columnNames": ["country", "city"],
    "rows": [
      ["United States", "San Francisco"],
      ["Canada", "Calgary"]
    ]
  },
  "columns": ["country", "city"],
  "intervals": ["0000/3000"]
}

Inline datasources allow you to query a small amount of data that is embedded in the query itself. They are useful when you want to write a query on a small amount of data without loading it first. They are also useful as inputs into a join. Druid also uses them internally to handle subqueries that need to be inlined on the Broker. See the query datasource documentation for more details.

There are two fields in an inline datasource: an array of columnNames and an array of rows. Each row is an array that must be exactly as long as the list of columnNames. The first element in each row corresponds to the first column in columnNames, and so on.

Inline datasources are not available in Druid SQL.

Refer to the Query execution page for more details on how queries are executed when you use inline datasources.

query

SQL:

-- Uses a subquery to count hits per page, then takes the average.
SELECT
  AVG(hits) AS average_hits_per_page
FROM
  (SELECT page, COUNT(*) AS hits FROM site_traffic GROUP BY page)

Native:

{
  "queryType": "timeseries",
  "dataSource": {
    "type": "query",
    "query": {
      "queryType": "groupBy",
      "dataSource": "site_traffic",
      "intervals": ["0000/3000"],
      "granularity": "all",
      "dimensions": ["page"],
      "aggregations": [
        { "type": "count", "name": "hits" }
      ]
    }
  },
  "intervals": ["0000/3000"],
  "granularity": "all",
  "aggregations": [
    { "type": "longSum", "name": "hits", "fieldName": "hits" },
    { "type": "count", "name": "pages" }
  ],
  "postAggregations": [
    { "type": "expression", "name": "average_hits_per_page", "expression": "hits / pages" }
  ]
}

Query datasources allow you to issue subqueries. In native queries, they can appear anywhere that accepts a dataSource. In SQL, they can appear in the following places, always surrounded by parentheses:

  • The FROM clause: FROM (<subquery>).
  • As inputs to a JOIN: <table-or-subquery-1> t1 INNER JOIN <table-or-subquery-2> t2 ON t1.<col1> = t2.<col2>.
  • In the WHERE clause: WHERE <column> { IN | NOT IN } (<subquery>). These are translated to joins by the SQL planner.
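
For example, here is a sketch of the WHERE ... IN form, reusing the site_traffic table from the earlier example and an arbitrary threshold. The SQL planner rewrites the IN clause as a join against the subquery's result:

-- Keep only pages that received at least 100 hits overall.
SELECT page, COUNT(*) AS total_hits
FROM site_traffic
WHERE page IN (
  SELECT page FROM site_traffic GROUP BY page HAVING COUNT(*) >= 100
)
GROUP BY page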

Performance tip: In most cases, subquery results are fully buffered in memory on the Broker and then further processing occurs on the Broker itself. This means that subqueries with large result sets can cause performance bottlenecks or run into memory usage limits on the Broker. See the Query execution page for more details on how subqueries are executed and what limits will apply.

join

SQL:

-- Joins "sales" with the "store_to_country" lookup (using "store" as the
-- join key) to get sales by country.
SELECT
  store_to_country.v AS country,
  SUM(sales.revenue) AS country_revenue
FROM
  sales
  INNER JOIN lookup.store_to_country ON sales.store = store_to_country.k
GROUP BY
  store_to_country.v

Native:

{
  "queryType": "groupBy",
  "dataSource": {
    "type": "join",
    "left": "sales",
    "right": {
      "type": "lookup",
      "lookup": "store_to_country"
    },
    "rightPrefix": "r.",
    "condition": "store == \"r.k\"",
    "joinType": "INNER"
  },
  "intervals": ["0000/3000"],
  "granularity": "all",
  "dimensions": [
    { "type": "default", "outputName": "country", "dimension": "r.v" }
  ],
  "aggregations": [
    { "type": "longSum", "name": "country_revenue", "fieldName": "revenue" }
  ]
}

Join datasources allow you to do a SQL-style join of two datasources. Stacking joins on top of each other allows you to join arbitrarily many datasources.

In Druid 25.0.0, joins are implemented with a broadcast hash-join algorithm. This means that all datasources other than the leftmost "base" datasource must fit in memory. It also means that the join condition must be an equality. This feature is intended mainly to allow joining regular Druid tables with lookup, inline, and query datasources.
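
To sketch what stacking looks like, the following SQL reuses the sales table and store_to_country lookup from the example above, plus a subquery. The regular table stays leftmost as the "base" datasource, and the lookup and subquery results on the right must fit in memory:

-- Two stacked joins: a lookup and a subquery, both broadcast to the base table.
SELECT
  sales.store,
  store_to_country.v AS country,
  store_totals.total_revenue
FROM sales
INNER JOIN lookup.store_to_country ON sales.store = store_to_country.k
INNER JOIN (
  SELECT store, SUM(revenue) AS total_revenue FROM sales GROUP BY store
) store_totals ON sales.store = store_totals.store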

Refer to the Query execution page for more details on how queries are executed when you use join datasources.

Joins in SQL

SQL joins take the form:

<o1> [ INNER | LEFT [OUTER] ] JOIN <o2> ON <condition>

The condition must involve only equalities, but functions are okay, and there can be multiple equalities ANDed together. Conditions like t1.x = t2.x, or LOWER(t1.x) = t2.x, or t1.x = t2.x AND t1.y = t2.y can all be handled. Conditions like t1.x <> t2.x cannot currently be handled.
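
For instance, a condition with multiple equalities ANDed together might look like the following sketch (assuming a hypothetical country column on the site_traffic table from earlier):

-- Multiple equalities ANDed together; each side of each equality is a column.
SELECT t1.page, t2.hits_in_country
FROM site_traffic t1
INNER JOIN (
  SELECT page, country, COUNT(*) AS hits_in_country
  FROM site_traffic
  GROUP BY page, country
) t2 ON t1.page = t2.page AND t1.country = t2.country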

Note that Druid SQL is less rigid than what native join datasources can handle. In cases where a SQL query does something that is not allowed as-is with a native join datasource, Druid SQL will generate a subquery. This can have a substantial effect on performance and scalability, so it is something to watch out for. Some examples of when the SQL layer will generate subqueries include:

  • Joining a regular Druid table to itself, or to another regular Druid table. The native join datasource can accept a table on the left-hand side, but not the right, so a subquery is needed (see the sketch after this list).

  • Join conditions where the expressions on either side are of different types.

  • Join conditions where the right-hand expression is not a direct column access.
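
A sketch of the first case, joining a table to itself (assuming a hypothetical session_id column on site_traffic); Druid SQL plans the right-hand occurrence of the table as a subquery, since a native join's right-hand side cannot be a plain table:

-- Self-join: the planner wraps the right-hand "site_traffic" in a subquery
-- (a native query datasource) before building the join.
SELECT t1.page, t2.page AS other_page
FROM site_traffic t1
INNER JOIN site_traffic t2 ON t1.session_id = t2.session_id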

For more information about how Druid translates SQL to native queries, refer to the Druid SQL documentation.

Joins in native queries

Native join datasources have the following properties. All are required.

  • left: Left-hand datasource. Must be of type table, join, lookup, query, or inline. Placing another join as the left datasource allows you to join arbitrarily many datasources.
  • right: Right-hand datasource. Must be of type lookup, query, or inline. Note that this is more rigid than what Druid SQL requires.
  • rightPrefix: String prefix that will be applied to all columns from the right-hand datasource, to prevent them from colliding with columns from the left-hand datasource. Can be any string, so long as it is nonempty and is not a prefix of the string __time. Any columns from the left-hand side that start with your rightPrefix will be shadowed. It is up to you to provide a prefix that will not shadow any important columns from the left side.
  • condition: Expression that must be an equality where one side is an expression of the left-hand side, and the other side is a simple column reference to the right-hand side. Note that this is more rigid than what Druid SQL requires: here, the right-hand reference must be a simple column reference; in SQL it can be an expression.
  • joinType: INNER or LEFT.

Join performance

Joins are a feature that can significantly affect performance of your queries. Some performance tips and notes:

  1. Joins are especially useful with lookup datasources, but in most cases, the LOOKUP function performs better than a join. Consider using the LOOKUP function if it is appropriate for your use case.
  2. When using joins in Druid SQL, keep in mind that it can generate subqueries that you did not explicitly include in your queries. Refer to the Druid SQL documentation for more details about when this happens and how to detect it.
  3. One common reason for implicit subquery generation is if the types of the two halves of an equality do not match. For example, since lookup keys are always strings, the join druid.d JOIN lookup.l ON d.field = l.field will perform best if d.field is a string (see the sketch after this list).
  4. As of Druid 25.0.0, the join operator must evaluate the condition for each row. In the future, we expect to implement both early and deferred condition evaluation, which we expect to improve performance considerably for common use cases.
  5. Currently, Druid does not support pushing down predicates (condition and filter) past a Join (i.e. into Join's children). Druid only supports pushing predicates into the join if they originated from above the join. Hence, the location of predicates and filters in your Druid SQL is very important. Also, as a result of this, comma joins should be avoided.
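
For example, if the sales table had a hypothetical numeric store_id column, casting it explicitly would keep both sides of the equality strings, matching the lookup's string keys:

-- Lookup keys are always strings; cast the numeric key so the two halves
-- of the equality have the same type.
SELECT
  store_to_country.v AS country,
  SUM(sales.revenue) AS country_revenue
FROM sales
INNER JOIN lookup.store_to_country
  ON CAST(sales.store_id AS VARCHAR) = store_to_country.k
GROUP BY store_to_country.v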

Future work for joins

Joins are an area of active development in Druid. The following features are missing today but may appear in future versions:

  • Reordering of predicates and filters (pushing up and/or pushing down) to get the most performant plan.
  • Preloaded dimension tables that are wider than lookups (i.e. supporting more than a single key and single value).
  • RIGHT OUTER and FULL OUTER joins. Currently, they are partially implemented. Queries will run but results will not always be correct.
  • Performance-related optimizations as mentioned in the previous section.
  • Join algorithms other than broadcast hash-joins.
  • Join condition on a column compared to a constant value.
  • Join conditions on a column containing a multi-value dimension.
โ† TroubleshootingJoins โ†’
  • Datasource type
    • table
    • lookup
    • union
    • inline
    • query
    • join
