Virtual columns

Apache Druid supports two query languages: Druid SQL and native queries. This document describes the native language. For information about functions available in SQL, refer to the SQL documentation.

Virtual columns are queryable column "views" created from a set of columns during a query.

A virtual column can potentially draw from multiple underlying columns, although a virtual column always presents itself as a single column.

Virtual columns can be referenced by their output names to be used as dimensions or as inputs to filters and aggregators.

Each Apache Druid query can accept a list of virtual columns as a parameter. The following scan query is provided as an example:

{
 "queryType": "scan",
 "dataSource": "page_data",
 "columns":[],
 "virtualColumns": [
    {
      "type": "expression",
      "name": "fooPage",
      "expression": "concat('foo' + page)",
      "outputType": "STRING"
    },
    {
      "type": "expression",
      "name": "tripleWordCount",
      "expression": "wordCount * 3",
      "outputType": "LONG"
    }
  ],
 "intervals": [
   "2013-01-01/2019-01-02"
 ]
}
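
A virtual column defined in one query type works the same way in the others. As a minimal sketch, the following groupBy query reuses the fooPage virtual column from the scan example above, referencing it by its output name both as a grouping dimension and as a filter input (the filter value fooHome is hypothetical):

{
  "queryType": "groupBy",
  "dataSource": "page_data",
  "intervals": ["2013-01-01/2019-01-02"],
  "granularity": "all",
  "virtualColumns": [
    {
      "type": "expression",
      "name": "fooPage",
      "expression": "concat('foo', page)",
      "outputType": "STRING"
    }
  ],
  "filter": {
    "type": "selector",
    "dimension": "fooPage",
    "value": "fooHome"
  },
  "dimensions": ["fooPage"],
  "aggregations": [
    { "type": "count", "name": "rows" }
  ]
}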

Virtual column types

Expression virtual column

Expression virtual columns use Druid's native expression system to define query-time transforms over inputs from one or more columns.

The expression virtual column has the following syntax:

{
  "type": "expression",
  "name": <name of the virtual column>,
  "expression": <row expression>,
  "outputType": <output value type of expression>
}
| property | description | required? |
| --- | --- | --- |
| type | Must be "expression" to indicate that this is an expression virtual column. | yes |
| name | The name of the virtual column. | yes |
| expression | An expression that takes a row as input and outputs a value for the virtual column. | yes |
| outputType | The expression's output will be coerced to this type. Can be LONG, FLOAT, DOUBLE, STRING, ARRAY types, or COMPLEX types. | no, default is FLOAT |
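
For instance, here is a minimal sketch of an expression virtual column that draws from two hypothetical numeric columns, gross and discount. Without the explicit outputType of DOUBLE, the result would be coerced to the FLOAT default:

{
  "type": "expression",
  "name": "netRevenue",
  "expression": "gross - discount",
  "outputType": "DOUBLE"
}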

Nested field virtual column

The nested field virtual column is an optimized virtual column that can provide direct access into various paths of a COMPLEX<json> column, including using their indexes.

This virtual column is used for the SQL operators JSON_VALUE (if processFromRaw is set to false) or JSON_QUERY (if processFromRaw is true). It accepts 'JSONPath' or 'jq' syntax string representations of paths, or a parsed list of "path parts", to determine what to select from the column.

You can define a nested field virtual column with any of the following equivalent syntaxes. The examples all produce the same output value; each shows a different way to access the nested value. The first uses a JSONPath syntax path, the second a jq path, and the third pathParts.

    {
      "type": "nested-field",
      "columnName": "shipTo",
      "outputName": "v0",
      "expectedType": "STRING",
      "path": "$.phoneNumbers[1].number"
    }
    {
      "type": "nested-field",
      "columnName": "shipTo",
      "outputName": "v1",
      "expectedType": "STRING",
      "path": ".phoneNumbers[1].number",
      "useJqSyntax": true
    }
    {
      "type": "nested-field",
      "columnName": "shipTo",
      "outputName": "v2",
      "expectedType": "STRING",
      "pathParts": [
        {
          "type": "field",
          "field": "phoneNumbers"
        },
        {
          "type": "arrayElement",
          "index": 1
        },
        {
          "type": "field",
          "field": "number"
        }
      ]
    }
| property | description | required? |
| --- | --- | --- |
| type | Must be "nested-field" to indicate that this is a nested field virtual column. | yes |
| columnName | The name of the COMPLEX<json> input column. | yes |
| outputName | The name of the virtual column. | yes |
| expectedType | The native Druid output type of the column. Druid coerces the output to this type if it does not match the underlying data. Can be STRING, LONG, FLOAT, DOUBLE, or COMPLEX<json>. Extracting ARRAY types is not yet supported. | no, default is STRING |
| pathParts | The parsed path parts used to locate the nested values. path is translated into pathParts internally. One of path or pathParts must be set. | no, if path is defined |
| processFromRaw | If set to true, the virtual column processes the "raw" JSON data to extract values rather than using an optimized "literal" value selector. This option allows extracting non-literal values (such as nested JSON objects or arrays) as COMPLEX<json> at the cost of much slower performance. | no, default is false |
| path | 'JSONPath' (or 'jq') syntax path. One of path or pathParts must be set. | no, if pathParts is defined |
| useJqSyntax | If true, parse path using 'jq' syntax instead of 'JSONPath'. | no, default is false |
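
Putting it together, here is a sketch of a scan query that selects the nested phone number through the v0 virtual column from the first example above; the datasource name orders is hypothetical. This mirrors the kind of access the SQL operator JSON_VALUE performs:

{
  "queryType": "scan",
  "dataSource": "orders",
  "intervals": ["2013-01-01/2019-01-02"],
  "virtualColumns": [
    {
      "type": "nested-field",
      "columnName": "shipTo",
      "outputName": "v0",
      "expectedType": "STRING",
      "path": "$.phoneNumbers[1].number"
    }
  ],
  "columns": ["v0"]
}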

Nested path part

Specify pathParts as an array of objects that describe each component of the path to traverse. Each object can take the following properties:

| property | description | required? |
| --- | --- | --- |
| type | Must be 'field' or 'arrayElement'. Use field when accessing a specific field in a nested structure. Use arrayElement when accessing a specific integer position of an array (zero-based). | yes |
| field | The name of the field in a 'field' type path part. | yes, if type is 'field' |
| index | The array element index, if type is 'arrayElement'. | yes, if type is 'arrayElement' |

See Nested columns for more information on ingesting and storing nested data.

List filtered virtual column

This virtual column provides a virtual-column alternative to the 'list filtered' dimension spec. It has optimized access to the underlying column value indexes, which can provide a small performance improvement in some cases.

    {
      "type": "mv-filtered",
      "name": "filteredDim3",
      "delegate": "dim3",
      "values": ["hello", "world"],
      "isAllowList": true
    }
| property | description | required? |
| --- | --- | --- |
| type | Must be "mv-filtered" to indicate that this is a list filtered virtual column. | yes |
| name | The output name of the virtual column. | yes |
| delegate | The name of the multi-value STRING input column to filter. | yes |
| values | Set of STRING values to allow or deny. | yes |
| isAllowList | If true, the output of the virtual column is limited to the set specified by values; otherwise it provides all values except those specified. | no, default is true |
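
As a usage sketch, the following groupBy query groups on the filtered virtual column, so the result rows can only contain the allowed dim3 values "hello" and "world"; the datasource name example_data is hypothetical:

{
  "queryType": "groupBy",
  "dataSource": "example_data",
  "intervals": ["2013-01-01/2019-01-02"],
  "granularity": "all",
  "virtualColumns": [
    {
      "type": "mv-filtered",
      "name": "filteredDim3",
      "delegate": "dim3",
      "values": ["hello", "world"],
      "isAllowList": true
    }
  ],
  "dimensions": ["filteredDim3"],
  "aggregations": [
    { "type": "count", "name": "rows" }
  ]
}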
โ† String comparatorsSpatial filters โ†’
  • Virtual column types
    • Expression virtual column
    • Nested field virtual column
    • List filtered virtual column

Technologyโ€‚ยทโ€‚Use Casesโ€‚ยทโ€‚Powered by Druidโ€‚ยทโ€‚Docsโ€‚ยทโ€‚Communityโ€‚ยทโ€‚Downloadโ€‚ยทโ€‚FAQ

โ€‚ยทโ€‚โ€‚ยทโ€‚โ€‚ยทโ€‚
Copyright ยฉ 2022 Apache Software Foundation.
Except where otherwise noted, licensed under CC BY-SA 4.0.
Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.