Ingestion spec reference

All ingestion methods use ingestion tasks to load data into Druid. Streaming ingestion uses ongoing supervisors that run and supervise a set of tasks over time. Native batch and Hadoop-based ingestion use a one-time task. All types of ingestion use an ingestion spec to configure ingestion.

Ingestion specs consist of three main components:

  • dataSchema, which configures the datasource name, primary timestamp, dimensions, metrics, and transforms and filters (if needed).
  • ioConfig, which tells Druid how to connect to the source system and how to parse data. For more information, see the documentation for each ingestion method.
  • tuningConfig, which controls various tuning parameters specific to each ingestion method.

Example ingestion spec for task type index_parallel (native batch):

{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "timestampSpec": {
        "column": "timestamp",
        "format": "auto"
      },
      "dimensionsSpec": {
        "dimensions": [
          "page",
          "language",
          { "type": "long", "name": "userId" }
        ]
      },
      "metricsSpec": [
        { "type": "count", "name": "count" },
        { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
        { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
      ],
      "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "intervals": [
          "2013-08-31/2013-09-01"
        ]
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "local",
        "baseDir": "examples/indexing/",
        "filter": "wikipedia_data.json"
      },
      "inputFormat": {
        "type": "json",
        "flattenSpec": {
          "useFieldDiscovery": true,
          "fields": [
            { "type": "path", "name": "userId", "expr": "$.user.id" }
          ]
        }
      }
    },
    "tuningConfig": {
      "type": "index_parallel"
    }
  }
}

The specific options supported by these sections will depend on the ingestion method you have chosen. For more examples, refer to the documentation for each ingestion method.
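
As a rough illustration of how the three components fit together, the sketch below assembles them into a task payload shaped like the example above. The helper name build_task is hypothetical and not part of Druid. The sketch only builds and serializes the JSON; it does not submit anything to a cluster.

```python
import json


def build_task(data_schema, io_config, tuning_config, task_type="index_parallel"):
    """Hypothetical helper: wrap the three spec components in a task payload."""
    return {
        "type": task_type,
        "spec": {
            "dataSchema": data_schema,
            "ioConfig": io_config,
            "tuningConfig": tuning_config,
        },
    }


task = build_task(
    data_schema={
        "dataSource": "wikipedia",
        "timestampSpec": {"column": "timestamp", "format": "auto"},
    },
    io_config={
        "type": "index_parallel",
        "inputSource": {
            "type": "local",
            "baseDir": "examples/indexing/",
            "filter": "wikipedia_data.json",
        },
        "inputFormat": {"type": "json"},
    },
    tuning_config={"type": "index_parallel"},
)

# Serialize the payload the way it would appear in a spec file.
print(json.dumps(task, indent=2))
```

Keeping each component in its own variable makes it easier to swap the ioConfig or tuningConfig when moving between ingestion methods.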

You can also load data visually, without the need to write an ingestion spec, using the "Load data" functionality available in Druid's web console. Druid's visual data loader supports Kafka, Kinesis, and native batch mode.

dataSchema

The dataSchema spec changed in 0.17.0. All ingestion methods except Hadoop ingestion support the new spec. See the Legacy dataSchema spec for the old spec.

The dataSchema is a holder for the following components:

  • datasource name
  • primary timestamp
  • dimensions
  • metrics
  • transforms and filters (if needed).

An example dataSchema is:

"dataSchema": {
  "dataSource": "wikipedia",
  "timestampSpec": {
    "column": "timestamp",
    "format": "auto"
  },
  "dimensionsSpec": {
    "dimensions": [
      "page",
      "language",
      { "type": "long", "name": "userId" }
    ]
  },
  "metricsSpec": [
    { "type": "count", "name": "count" },
    { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
    { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
  ],
  "granularitySpec": {
    "segmentGranularity": "day",
    "queryGranularity": "none",
    "intervals": [
      "2013-08-31/2013-09-01"
    ]
  }
}

dataSource

The dataSource is located in dataSchema → dataSource and is simply the name of the datasource that data will be written to. An example dataSource is:

"dataSource": "my-first-datasource"

timestampSpec

The timestampSpec is located in dataSchema → timestampSpec and is responsible for configuring the primary timestamp. An example timestampSpec is:

"timestampSpec": {
  "column": "timestamp",
  "format": "auto"
}

Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order: first flattenSpec (if any), then timestampSpec, then transformSpec, and finally dimensionsSpec and metricsSpec. Keep this in mind when writing your ingestion spec.

A timestampSpec can have the following components:

| Field | Description | Default |
| --- | --- | --- |
| column | Input row field to read the primary timestamp from. Regardless of the name of this input field, the primary timestamp will always be stored as a column named __time in your Druid datasource. | timestamp |
| format | Timestamp format. Options are: iso (ISO8601 with 'T' separator, like "2000-01-01T01:02:03.456"); posix (seconds since epoch); millis (milliseconds since epoch); micro (microseconds since epoch); nano (nanoseconds since epoch); auto (automatically detects ISO, with either 'T' or space separator, or millis format); any Joda DateTimeFormat string. | auto |
| missingValue | Timestamp to use for input records that have a null or missing timestamp column. Should be in ISO8601 format, like "2000-01-01T01:02:03.456", even if you have specified something else for format. Since Druid requires a primary timestamp, this setting can be useful for ingesting datasets that do not have any per-record timestamps at all. | none |

You can use the timestamp in an expression as __time because Druid parses the timestampSpec before applying transforms. You can also set the expression name to __time to replace the value of the timestamp.

Treat __time as a millisecond timestamp: the number of milliseconds since Jan 1, 1970 at midnight UTC.
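
To make the relationship between the format options and the millisecond __time value concrete, here is a minimal sketch that converts a raw value to milliseconds since the epoch. The function names are hypothetical, the code is not Druid's parser, it treats ISO strings without a zone as UTC (an assumption), and it does not handle Joda DateTimeFormat strings.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def iso_to_millis(text):
    """Parse an ISO8601 string ('T' or space separator) into epoch millis."""
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Assumption for this sketch: a missing zone means UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    # Integer division by a timedelta avoids floating-point rounding.
    return (parsed - EPOCH) // timedelta(milliseconds=1)


def to_time_millis(value, fmt="auto"):
    """Hypothetical helper: map a raw timestamp to a __time-style value."""
    if fmt == "iso":
        return iso_to_millis(value)
    if fmt == "posix":
        return int(value) * 1000
    if fmt == "millis":
        return int(value)
    if fmt == "micro":
        return int(value) // 1000
    if fmt == "nano":
        return int(value) // 1_000_000
    if fmt == "auto":
        # Mirrors the description of auto: millis if numeric, otherwise ISO.
        try:
            return int(value)
        except ValueError:
            return iso_to_millis(value)
    raise ValueError(f"unsupported format in this sketch: {fmt}")


print(to_time_millis("2000-01-01T01:02:03.456", "iso"))
print(to_time_millis(946688523, "posix"))
```

Both calls describe the same instant, so they print the same millisecond value.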

dimensionsSpec

The dimensionsSpec is located in dataSchema → dimensionsSpec and is responsible for configuring dimensions. An example dimensionsSpec is:

"dimensionsSpec" : {
  "dimensions": [
    "page",
    "language",
    { "type": "long", "name": "userId" }
  ],
  "dimensionExclusions" : [],
  "spatialDimensions" : []
}

Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order: first flattenSpec (if any), then timestampSpec, then transformSpec, and finally dimensionsSpec and metricsSpec. Keep this in mind when writing your ingestion spec.

A dimensionsSpec can have the following components:

| Field | Description | Default |
| --- | --- | --- |
| dimensions | A list of dimension names or objects. You cannot include the same column in both dimensions and dimensionExclusions. If dimensions and spatialDimensions are both null or empty arrays, Druid treats all columns other than timestamp or metrics that do not appear in dimensionExclusions as String-typed dimension columns. See inclusions and exclusions for details. As a best practice, put the most frequently filtered dimensions at the beginning of the dimensions list. In this case, also consider partitioning by those same dimensions. | [] |
| dimensionExclusions | The names of dimensions to exclude from ingestion. Only names are supported here, not objects. This list is only used if the dimensions and spatialDimensions lists are both null or empty arrays; otherwise it is ignored. See inclusions and exclusions below for details. | [] |
| spatialDimensions | An array of spatial dimensions. | [] |
| includeAllDimensions | Set includeAllDimensions to true to ingest both the explicit dimensions in the dimensions field and other dimensions that the ingestion task discovers from input data. In this case, the explicit dimensions appear first, in the order that you specify them, and the dynamically discovered dimensions come after. This flag can be especially useful with auto schema discovery using flattenSpec. If this is not set and the dimensions field is not empty, Druid ingests only the explicit dimensions. If this is not set and the dimensions field is empty, Druid ingests all discovered dimensions. | false |

Dimension objects

Each dimension in the dimensions list can either be a name or an object. Providing a name is equivalent to providing a string type dimension object with the given name, e.g. "page" is equivalent to {"name": "page", "type": "string"}.
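
The name shorthand can be pictured as a small normalization step. The sketch below is only an illustration of the equivalence described above; normalize_dimension is a hypothetical name, not a Druid function.

```python
def normalize_dimension(dim):
    """Hypothetical helper: expand a dimension name into a dimension object."""
    if isinstance(dim, str):
        # A bare name is shorthand for a string-typed dimension object.
        return {"name": dim, "type": "string"}
    # Objects keep their own settings; type falls back to string if omitted.
    return {"type": "string", **dim}


dimensions = ["page", "language", {"type": "long", "name": "userId"}]
normalized = [normalize_dimension(d) for d in dimensions]
print(normalized)
```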

Dimension objects can have the following components:

| Field | Description | Default |
| --- | --- | --- |
| type | Either string, long, float, double, or json. | string |
| name | The name of the dimension. This will be used as the field name to read from input records, as well as the column name stored in generated segments. Note that you can use a transformSpec if you want to rename columns during ingestion time. | none (required) |
| createBitmapIndex | For string typed dimensions, whether or not bitmap indexes should be created for the column in generated segments. Creating a bitmap index requires more storage, but speeds up certain kinds of filtering (especially equality and prefix filtering). Only supported for string typed dimensions. | true |
| multiValueHandling | Specify the type of handling for multi-value fields. Possible values are sorted_array, sorted_set, and array. sorted_array and sorted_set order the array upon ingestion. sorted_set removes duplicates. array ingests data as-is. | sorted_array |
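
The three multiValueHandling modes described in the table can be sketched on a plain Python list. This is an illustration of the documented behavior only; the function name is hypothetical and Druid's own ordering rules for strings may differ from Python's default sort.

```python
def apply_multi_value_handling(values, mode="sorted_array"):
    """Hypothetical helper: mimic the documented multi-value handling modes."""
    if mode == "sorted_array":
        return sorted(values)        # ordered, duplicates kept
    if mode == "sorted_set":
        return sorted(set(values))   # ordered, duplicates removed
    if mode == "array":
        return list(values)          # ingested as-is
    raise ValueError(f"unknown multiValueHandling mode: {mode}")


tags = ["b", "a", "b"]
for mode in ("sorted_array", "sorted_set", "array"):
    print(mode, apply_multi_value_handling(tags, mode))
```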

Inclusions and exclusions

Druid will interpret a dimensionsSpec in two possible ways: normal or schemaless.

Normal interpretation occurs when either dimensions or spatialDimensions is non-empty. In this case, the combination of the two lists will be taken as the set of dimensions to be ingested, and the list of dimensionExclusions will be ignored.

Schemaless interpretation occurs when both dimensions and spatialDimensions are empty or null. In this case, the set of dimensions is determined in the following way:

  1. First, start from the set of all root-level fields from the input record, as determined by the inputFormat. "Root-level" includes all fields at the top level of a data structure, but does not include fields nested within maps or lists. To extract these, you must use a flattenSpec. All fields of non-nested data formats, such as CSV and delimited text, are considered root-level.
  2. If a flattenSpec is being used, the set of root-level fields includes any fields generated by the flattenSpec. The useFieldDiscovery parameter determines whether the original root-level fields will be retained or discarded.
  3. Any field listed in dimensionExclusions is excluded.
  4. The field listed as column in the timestampSpec is excluded.
  5. Any field used as an input to an aggregator from the metricsSpec is excluded.
  6. Any field with the same name as an aggregator from the metricsSpec is excluded.
  7. All other fields are ingested as string typed dimensions with the default settings.

Note: Fields generated by a transformSpec are not currently considered candidates for schemaless dimension interpretation.
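
The schemaless steps above can be sketched as a filter over the root-level field names. The function below is a hypothetical illustration, not Druid code. It assumes the caller already has the list of root-level fields (including any produced by a flattenSpec) and applies only the exclusion rules in steps 3 through 7.

```python
def schemaless_dimensions(root_fields, timestamp_column, metrics_spec, exclusions=()):
    """Hypothetical helper: pick dimensions the way schemaless mode is described."""
    excluded = set(exclusions)          # step 3: dimensionExclusions
    excluded.add(timestamp_column)      # step 4: timestampSpec column
    for aggregator in metrics_spec:
        if "fieldName" in aggregator:
            excluded.add(aggregator["fieldName"])  # step 5: aggregator inputs
        excluded.add(aggregator["name"])           # step 6: aggregator names
    # Step 7: everything left becomes a string typed dimension.
    return [field for field in root_fields if field not in excluded]


metrics_spec = [
    {"type": "count", "name": "count"},
    {"type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added"},
    {"type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted"},
]
root_fields = ["timestamp", "page", "language", "userId",
               "bytes_added", "bytes_deleted", "comment"]

print(schemaless_dimensions(root_fields, "timestamp", metrics_spec, ["comment"]))
```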

metricsSpec

The metricsSpec is located in dataSchema → metricsSpec and is a list of aggregators to apply at ingestion time. This is most useful when rollup is enabled, since it's how you configure ingestion-time aggregation.

An example metricsSpec is:

"metricsSpec": [
  { "type": "count", "name": "count" },
  { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
  { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
]

Generally, when rollup is disabled, you should have an empty metricsSpec (because without rollup, Druid does not do any ingestion-time aggregation, so there is little reason to include an ingestion-time aggregator). However, in some cases, it can still make sense to define metrics: for example, if you want to create a complex column as a way of pre-computing part of an approximate aggregation, this can only be done by defining a metric in a metricsSpec.
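
To show what ingestion-time aggregation means in practice, the following sketch rolls up rows that share the same timestamp and dimension values, applying a count and a set of sums. It is a simplified model under stated assumptions: roll_up is a hypothetical name, timestamps are assumed to be already truncated, and null handling is ignored.

```python
def roll_up(rows, dimensions, sum_metrics):
    """Hypothetical helper: combine rows with equal __time and dimension values.

    sum_metrics maps an output metric name to the input field it sums,
    similar in spirit to a doubleSum aggregator's name and fieldName.
    """
    groups = {}
    for row in rows:
        key = (row["__time"],) + tuple(row.get(d) for d in dimensions)
        totals = groups.setdefault(
            key, {"count": 0, **{name: 0.0 for name in sum_metrics}}
        )
        totals["count"] += 1  # count aggregator: number of input rows
        for name, field in sum_metrics.items():
            totals[name] += float(row.get(field, 0))
    return [
        {"__time": key[0], **dict(zip(dimensions, key[1:])), **totals}
        for key, totals in groups.items()
    ]


rows = [
    {"__time": 0, "page": "A", "bytes_added": 10},
    {"__time": 0, "page": "A", "bytes_added": 20},
    {"__time": 0, "page": "B", "bytes_added": 5},
]
print(roll_up(rows, ["page"], {"bytes_added_sum": "bytes_added"}))
```

Three input rows become two stored rows here, which is why the count metric is useful for recovering the number of original events.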

granularitySpec

The granularitySpec is located in dataSchema → granularitySpec and is responsible for configuring the following operations:

  1. Partitioning a datasource into time chunks (via segmentGranularity).
  2. Truncating the timestamp, if desired (via queryGranularity).
  3. Specifying which time chunks of segments should be created, for batch ingestion (via intervals).
  4. Specifying whether ingestion-time rollup should be used or not (via rollup).

Other than rollup, these operations are all based on the primary timestamp.

An example granularitySpec is:

"granularitySpec": {
  "segmentGranularity": "day",
  "queryGranularity": "none",
  "intervals": [
    "2013-08-31/2013-09-01"
  ],
  "rollup": true
}

A granularitySpec can have the following components:

| Field | Description | Default |
| --- | --- | --- |
| type | uniform | uniform |
| segmentGranularity | Time chunking granularity for this datasource. Multiple segments can be created per time chunk. For example, when set to day, the events of the same day fall into the same time chunk, which can optionally be further partitioned into multiple segments based on other configurations and input size. Any granularity can be provided here. Note that all segments in the same time chunk should have the same segment granularity. | day |
| queryGranularity | The resolution of timestamp storage within each segment. This must be equal to, or finer than, segmentGranularity. This will be the finest granularity that you can query at and still receive sensible results, but note that you can still query at anything coarser than this granularity. For example, a value of minute means that records will be stored at minutely granularity, and can be sensibly queried at any multiple of minutes (including minutely, 5-minutely, hourly, etc). Any granularity can be provided here. Use none to store timestamps as-is, without any truncation. Note that rollup will be applied if it is set, even when the queryGranularity is set to none. | none |
| rollup | Whether to use ingestion-time rollup or not. Note that rollup is still effective even when queryGranularity is set to none. Rows are rolled up if they have exactly the same timestamp. | true |
| intervals | A list of intervals defining time chunks for segments. Specify interval values using ISO8601 format. For example, ["2021-12-06T21:27:10+00:00/2021-12-07T00:00:00+00:00"]. If you omit the time, the time defaults to "00:00:00". Druid breaks the list up and rounds off the list values based on the segmentGranularity. If null or not provided, batch ingestion tasks generally determine which time chunks to output based on the timestamps found in the input data. If specified, batch ingestion tasks may be able to skip a determining-partitions phase, which can result in faster ingestion. Batch ingestion tasks may also be able to request all their locks up-front instead of one by one. Batch ingestion tasks throw away any records with timestamps outside of the specified intervals. Ignored for any form of streaming ingestion. | null |
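
The difference between queryGranularity and segmentGranularity can be sketched with simple arithmetic on millisecond timestamps. The names below are hypothetical, and the sketch assumes UTC and covers only a few fixed-length granularities, whereas Druid accepts others as well.

```python
# Fixed-length granularities in milliseconds (a subset, for illustration).
GRANULARITY_MILLIS = {
    "second": 1_000,
    "minute": 60_000,
    "hour": 3_600_000,
    "day": 86_400_000,
}


def truncate(time_millis, granularity):
    """Truncate a timestamp as queryGranularity would; 'none' keeps it as-is."""
    if granularity == "none":
        return time_millis
    step = GRANULARITY_MILLIS[granularity]
    return time_millis - time_millis % step


def time_chunk(time_millis, segment_granularity):
    """Return the (start, end) millis of the time chunk holding the timestamp."""
    start = truncate(time_millis, segment_granularity)
    return (start, start + GRANULARITY_MILLIS[segment_granularity])


event = 946688523456  # 2000-01-01T01:02:03.456Z
print(truncate(event, "minute"))
print(time_chunk(event, "day"))
```

With queryGranularity set to minute, the stored timestamp loses its seconds and milliseconds, while the day time chunk decides which set of segments the row belongs to.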

transformSpec

The transformSpec is located in dataSchema → transformSpec and is responsible for transforming and filtering records during ingestion time. It is optional. An example transformSpec is:

"transformSpec": {
  "transforms": [
    { "type": "expression", "name": "countryUpper", "expression": "upper(country)" }
  ],
  "filter": {
    "type": "selector",
    "dimension": "country",
    "value": "San Serriffe"
  }
}

Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order: first flattenSpec (if any), then timestampSpec, then transformSpec, and finally dimensionsSpec and metricsSpec. Keep this in mind when writing your ingestion spec.

Transforms

The transforms list allows you to specify a set of expressions to evaluate on top of input data. Each transform has a "name" which can be referred to by your dimensionsSpec, metricsSpec, etc.

If a transform has the same name as a field in an input row, then it will shadow the original field. Transforms that shadow fields may still refer to the fields they shadow. This can be used to transform a field "in-place".

Transforms do have some limitations. They can only refer to fields present in the actual input rows; in particular, they cannot refer to other transforms. And they cannot remove fields, only add them. However, they can shadow a field with another field containing all nulls, which will act similarly to removing the field.

Druid currently includes one kind of built-in transform, the expression transform. It has the following syntax:

{
  "type": "expression",
  "name": "<output name>",
  "expression": "<expr>"
}

The expression is a Druid query expression.

Filter

The filter conditionally filters input rows during ingestion. Only rows that pass the filter will be ingested. Any of Druid's standard query filters can be used. Note that within a transformSpec, the transforms are applied before the filter, so the filter can refer to a transform.
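
The ordering rules for transforms and filters can be modeled in a few lines. In this sketch, Python callables stand in for Druid expressions and filters, and apply_transform_spec is a hypothetical name. It illustrates three documented points: transforms read only the input row, a transform can shadow an input field, and the filter runs after the transforms.

```python
def apply_transform_spec(row, transforms, row_filter):
    """Hypothetical helper: apply transforms, then filter; None means dropped."""
    # Every transform reads the original input row, never another transform.
    computed = {name: fn(row) for name, fn in transforms.items()}
    # A transform with the same name as an input field shadows that field.
    transformed = {**row, **computed}
    # The filter sees the transformed row, so it can refer to a transform.
    return transformed if row_filter(transformed) else None


transforms = {"countryUpper": lambda r: r["country"].upper()}


def is_san_serriffe(r):
    # Stand-in for a selector filter on dimension "country".
    return r.get("country") == "San Serriffe"


print(apply_transform_spec({"country": "San Serriffe"}, transforms, is_san_serriffe))
print(apply_transform_spec({"country": "Ruritania"}, transforms, is_san_serriffe))
```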

Legacy dataSchema spec

The dataSchema spec changed in 0.17.0. All ingestion methods except Hadoop ingestion support the new spec. See dataSchema for the new spec.

The legacy dataSchema spec has two more components in addition to the ones listed in the dataSchema section above:

  • input row parser
  • flattening of nested data (if needed)

parser (Deprecated)

In the legacy dataSchema, the parser is located in dataSchema → parser and is responsible for configuring a wide variety of items related to parsing input records. The parser is deprecated; use inputFormat instead. For details about inputFormat and supported parser types, see the "Data formats" page.

For details about major components of the parseSpec, refer to their subsections:

  • timestampSpec, responsible for configuring the primary timestamp.
  • dimensionsSpec, responsible for configuring dimensions.
  • flattenSpec, responsible for flattening nested data formats.

An example parser is:

"parser": {
  "type": "string",
  "parseSpec": {
    "format": "json",
    "flattenSpec": {
      "useFieldDiscovery": true,
      "fields": [
        { "type": "path", "name": "userId", "expr": "$.user.id" }
      ]
    },
    "timestampSpec": {
      "column": "timestamp",
      "format": "auto"
    },
    "dimensionsSpec": {
      "dimensions": [
        "page",
        "language",
        { "type": "long", "name": "userId" }
      ]
    }
  }
}

flattenSpec

In the legacy dataSchema, the flattenSpec is located in dataSchema → parser → parseSpec → flattenSpec and is responsible for bridging the gap between potentially nested input data (such as JSON or Avro) and Druid's flat data model. See Flatten spec for more details.

ioConfig

The ioConfig influences how data is read from a source system, such as Apache Kafka, Amazon S3, a mounted filesystem, or any other supported source system. The inputFormat property applies to all ingestion methods except Hadoop ingestion, which still uses the parser in the legacy dataSchema. The rest of ioConfig is specific to each individual ingestion method. An example ioConfig to read JSON data is:

"ioConfig": {
    "type": "<ingestion-method-specific type code>",
    "inputFormat": {
      "type": "json"
    },
    ...
}

For more details, see the documentation provided by each ingestion method.

tuningConfig

Tuning properties are specified in a tuningConfig, which goes at the top level of an ingestion spec. Some properties apply to all ingestion methods, but most are specific to each individual ingestion method. An example tuningConfig that sets all of the shared, common properties to their defaults is:

"tuningConfig": {
  "type": "<ingestion-method-specific type code>",
  "maxRowsInMemory": 1000000,
  "maxBytesInMemory": <one-sixth of JVM memory>,
  "indexSpec": {
    "bitmap": { "type": "roaring" },
    "dimensionCompression": "lz4",
    "metricCompression": "lz4",
    "longEncoding": "longs"
  },
  <other ingestion-method-specific properties>
}

| Field | Description | Default |
| --- | --- | --- |
| type | Each ingestion method has its own tuning type code. You must specify the type code that matches your ingestion method. Common options are index, hadoop, kafka, and kinesis. | |
| maxRowsInMemory | The maximum number of records to store in memory before persisting to disk. Note that this is the number of rows post-rollup, and so it may not be equal to the number of input records. Ingested records will be persisted to disk when either maxRowsInMemory or maxBytesInMemory is reached (whichever happens first). | 1000000 |
| maxBytesInMemory | The maximum aggregate size of records, in bytes, to store in the JVM heap before persisting. This is based on a rough estimate of memory usage. Ingested records will be persisted to disk when either maxRowsInMemory or maxBytesInMemory is reached (whichever happens first). maxBytesInMemory also includes heap usage of artifacts created from intermediary persists. This means that after every persist, the amount of maxBytesInMemory available until the next persist decreases. If the sum of bytes of all intermediary persisted artifacts exceeds maxBytesInMemory, the task fails. Setting maxBytesInMemory to -1 disables this check, meaning Druid will rely entirely on maxRowsInMemory to control memory usage. Setting it to zero means the default value will be used (one-sixth of JVM heap size). Note that the estimate of memory usage is designed to be an overestimate, and can be especially high when using complex ingest-time aggregators, including sketches. If this causes your indexing workloads to persist to disk too often, you can set maxBytesInMemory to -1 and rely on maxRowsInMemory instead. | One-sixth of max JVM heap size |
| skipBytesInMemoryOverheadCheck | The calculation of maxBytesInMemory takes into account overhead objects created during ingestion and each intermediate persist. Setting this to true excludes the bytes of these overhead objects from the maxBytesInMemory check. | false |
| indexSpec | Defines segment storage format options to use at indexing time. | See indexSpec for more information. |
| indexSpecForIntermediatePersists | Defines segment storage format options to use at indexing time for intermediate persisted temporary segments. | See indexSpec for more information. |
| Other properties | Each ingestion method has its own list of additional tuning properties. See the documentation for each method for a full list: Kafka indexing service, Kinesis indexing service, Native batch, and Hadoop-based. | |
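
The interaction between maxRowsInMemory and maxBytesInMemory can be summarized as a small decision function. This is a sketch of the documented rules only: the function names are hypothetical, the comparison with >= is an assumption about what "reached" means, and the sketch does not model the heap used by intermediary persisted artifacts.

```python
def effective_max_bytes(max_bytes_in_memory, jvm_heap_bytes):
    """Resolve the special values: -1 disables the check, 0 means the default."""
    if max_bytes_in_memory == -1:
        return None                   # byte check disabled
    if max_bytes_in_memory == 0:
        return jvm_heap_bytes // 6    # default: one-sixth of the JVM heap
    return max_bytes_in_memory


def should_persist(rows_in_memory, bytes_in_memory,
                   max_rows_in_memory, max_bytes_in_memory, jvm_heap_bytes):
    """Hypothetical helper: persist when either limit is reached first."""
    if rows_in_memory >= max_rows_in_memory:
        return True
    limit = effective_max_bytes(max_bytes_in_memory, jvm_heap_bytes)
    return limit is not None and bytes_in_memory >= limit


heap = 6 * 2**30  # assume a 6 GiB heap for the example
print(effective_max_bytes(0, heap))
print(should_persist(10, 2**30, 1_000_000, 0, heap))
print(should_persist(10, 2**30, 1_000_000, -1, heap))
```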

indexSpec

The indexSpec object can include the following properties:

| Field | Description | Default |
| --- | --- | --- |
| bitmap | Compression format for bitmap indexes. Should be a JSON object with type set to roaring or concise. For type roaring, the boolean property compressRunOnSerialization (defaults to true) controls whether or not run-length encoding will be used when it is determined to be more space-efficient. | {"type": "roaring"} |
| dimensionCompression | Compression format for dimension columns. Options are lz4, lzf, zstd, or uncompressed. | lz4 |
| stringDictionaryEncoding | Encoding format for STRING value dictionaries used by STRING and COMPLEX<json> columns. Example to enable front coding: {"type":"frontCoded", "bucketSize": 4}. bucketSize is the number of values to place in a bucket to perform delta encoding. Must be a power of 2, maximum is 128. Defaults to 4. See Front coding for more information. | {"type":"utf8"} |
| metricCompression | Compression format for primitive type metric columns. Options are lz4, lzf, zstd, uncompressed, or none (which is more efficient than uncompressed, but not supported by older versions of Druid). | lz4 |
| longEncoding | Encoding format for long-typed columns. Applies regardless of whether they are dimensions or metrics. Options are auto or longs. auto encodes the values using an offset or a lookup table, depending on column cardinality, and stores them with variable size. longs stores each value as-is, using 8 bytes. | longs |
| jsonCompression | Compression format to use for nested column raw data. Options are lz4, lzf, zstd, or uncompressed. | lz4 |

Front coding

Front coding is an experimental feature starting in version 25.0. Front coding is an incremental encoding strategy that Druid can use to store STRING and COMPLEX<json> columns. It allows Druid to create smaller UTF-8 encoded segments with very little performance cost.
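
To give a feel for why incremental encoding shrinks a sorted string dictionary, here is a textbook-style sketch of front coding with buckets. It is not Druid's implementation: it works on Python strings rather than UTF-8 bytes, stores each value relative to the previous value in its bucket (an assumption), and the function names are hypothetical. Druid's actual on-disk layout may differ.

```python
import os


def front_encode(sorted_values, bucket_size=4):
    """Encode sorted strings as buckets of (shared_prefix_length, suffix)."""
    buckets = []
    for start in range(0, len(sorted_values), bucket_size):
        bucket = []
        previous = ""  # the first value of each bucket is stored in full
        for value in sorted_values[start:start + bucket_size]:
            shared = len(os.path.commonprefix([previous, value]))
            bucket.append((shared, value[shared:]))
            previous = value
        buckets.append(bucket)
    return buckets


def front_decode(buckets):
    """Rebuild the original strings from the encoded buckets."""
    values = []
    for bucket in buckets:
        previous = ""
        for shared, suffix in bucket:
            previous = previous[:shared] + suffix
            values.append(previous)
    return values


words = ["druid", "druidic", "drum", "drums"]
encoded = front_encode(words, bucket_size=4)
print(encoded)
print(front_decode(encoded) == words)
```

Because each bucket starts with a full value, a reader can decode any bucket without touching the ones before it, which is the trade-off that bucketSize controls in this sketch.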

You can enable front coding with all types of ingestion. For information on defining an indexSpec in a query context, see SQL-based ingestion reference.

Front coding is new in Druid 25.0, so the current recommendation is to enable it in a staging environment and fully test your use case before using it in production. Segments created with front coding enabled are not compatible with Druid versions older than 25.0.

Beyond these properties, each ingestion method has its own specific tuning properties. See the documentation for each ingestion method for details.

Copyright © 2022 Apache Software Foundation.
Except where otherwise noted, licensed under CC BY-SA 4.0.
Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.