
S3-compatible

S3 extension

This extension allows you to do two things:

  • Ingest data from files stored in S3.
  • Write segments to deep storage in S3.

To use this Apache Druid extension, include druid-s3-extensions in the extensions load list.
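
As a minimal sketch, assuming you manage extensions through common.runtime.properties, the load list would include the extension (any other extensions you already load would appear alongside it):

    druid.extensions.loadList=["druid-s3-extensions"]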

Reading data from S3

Use a native batch Parallel task with an S3 input source to read objects directly from S3.

Alternatively, use a Hadoop task, and specify S3 paths in your inputSpec.

To read objects from S3, you must supply connection information in configuration.
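
For illustration, a minimal ioConfig fragment for a Parallel task using the S3 input source might look like the sketch below; the bucket, object path, and JSON input format are assumptions for the example, not requirements:

    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "s3",
        "uris": ["s3://example-bucket/path/to/data.json"]
      },
      "inputFormat": {
        "type": "json"
      }
    }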

Deep Storage

S3-compatible deep storage means either AWS S3 itself or a compatible service, such as Google Cloud Storage, that exposes the same API as S3.

S3 deep storage needs to be explicitly enabled by setting druid.storage.type=s3. Only after setting the storage type to S3 will any of the settings below take effect.

To use S3 for deep storage, you must supply connection information in configuration and set the additional configuration specific to deep storage, described below.

Deep storage specific configuration

Property | Description | Default
---|---|---
druid.storage.bucket | Bucket to store in. | Must be set.
druid.storage.baseKey | A prefix string that will be prepended to the object names for the segments published to S3 deep storage. | Must be set.
druid.storage.type | Global deep storage provider. Must be set to s3 to make use of this extension. | Must be set (likely s3).
druid.storage.archiveBucket | S3 bucket name for archiving when running the archive task. | none
druid.storage.archiveBaseKey | S3 object key prefix for archiving. | none
druid.storage.disableAcl | Boolean flag for how object permissions are handled. To use ACLs, set this property to false. To use Object Ownership, set it to true. The permission requirements for ACLs and Object Ownership are different. For more information, see S3 permissions settings. | false
druid.storage.useS3aSchema | If true, use the "s3a" filesystem when using Hadoop-based ingestion. If false, the "s3n" filesystem will be used. Only affects Hadoop-based ingestion. | false
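
Putting the required properties together, a minimal deep storage sketch for common.runtime.properties might look like the following; the bucket name and prefix are placeholders:

    druid.storage.type=s3
    druid.storage.bucket=example-druid-bucket
    druid.storage.baseKey=druid/segments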

Configuration

S3 authentication methods

You can provide credentials to connect to S3 in a number of ways, whether for deep storage or as an ingestion source.

The configuration options are listed in order of precedence. For example, if you would like to use profile information given in ~/.aws/credentials, do not set druid.s3.accessKey and druid.s3.secretKey in your Druid config file, because they would take precedence.

Order | Type | Details
---|---|---
1 | Druid config file | Based on your runtime.properties, if it contains values for druid.s3.accessKey and druid.s3.secretKey
2 | Custom properties file | Based on a custom properties file where you can supply sessionToken, accessKey, and secretKey values. This file is provided to Druid through the druid.s3.fileSessionCredentials property
3 | Environment variables | Based on the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
4 | Java system properties | Based on the JVM properties aws.accessKeyId and aws.secretKey
5 | Profile information | Based on credentials you may have on your Druid instance (generally in ~/.aws/credentials)
6 | ECS container credentials | Based on environment variables available on AWS ECS (AWS_CONTAINER_CREDENTIALS_RELATIVE_URI or AWS_CONTAINER_CREDENTIALS_FULL_URI), as described in the EC2ContainerCredentialsProviderWrapper documentation
7 | Instance profile information | Based on the instance profile you may have attached to your Druid instance

For more information, refer to the Amazon Developer Guide.

Alternatively, you can bypass this chain by specifying an access key and secret key using a Properties Object inside your ingestion specification.

Use the property druid.startup.logging.maskProperties to mask credential information in Druid logs, for example ["password", "secretKey", "awsSecretAccessKey"].
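
For example, a sketch of static credentials supplied through the Druid config file, combined with log masking, might look like this; the key values are placeholders:

    druid.s3.accessKey=YOUR_ACCESS_KEY
    druid.s3.secretKey=YOUR_SECRET_KEY
    druid.startup.logging.maskProperties=["password", "secretKey", "awsSecretAccessKey"]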

S3 permissions settings

To manage the permissions for objects in an S3 bucket, you can use either ACLs or Object Ownership. The permissions required for each method are different.

By default, Druid uses ACLs. With ACLs, any object that Druid puts into the bucket inherits the ACL settings from the bucket.

You can switch from using ACLs to Object Ownership by setting druid.storage.disableAcl to true. The bucket owner owns any object that gets created, so you need to use S3's bucket policies to manage permissions.

Note that this setting only affects Druid's behavior. Changing S3 to use Object Ownership requires additional configuration. For more information, see the AWS documentation on Controlling ownership of objects and disabling ACLs for your bucket.

ACL permissions

If you're using ACLs, Druid needs the following permissions:

  • s3:GetObject
  • s3:PutObject
  • s3:DeleteObject
  • s3:GetBucketAcl
  • s3:PutObjectAcl

Object Ownership permissions

If you're using Object Ownership, Druid needs the following permissions:

  • s3:GetObject
  • s3:PutObject
  • s3:DeleteObject
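
As an illustration only, an IAM policy granting the ACL-based permission set above might look like the sketch below; the bucket name is a placeholder, and for Object Ownership you would drop the two ACL-related actions:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:GetObject",
            "s3:PutObject",
            "s3:DeleteObject",
            "s3:GetBucketAcl",
            "s3:PutObjectAcl"
          ],
          "Resource": [
            "arn:aws:s3:::example-druid-bucket",
            "arn:aws:s3:::example-druid-bucket/*"
          ]
        }
      ]
    }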

AWS region

The AWS SDK requires that a target region be specified. You can set the region by using the JVM system property aws.region or the environment variable AWS_REGION.

For example, to set the region to 'us-east-1' through system properties:

  • Add -Daws.region=us-east-1 to the jvm.config file for all Druid services.
  • Add -Daws.region=us-east-1 to druid.indexer.runner.javaOpts in the MiddleManager configuration so that the property is passed through to the Peon (worker) processes.
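
Concretely, in jvm.config for each Druid service the flag goes on its own line:

    -Daws.region=us-east-1

and in the MiddleManager runtime.properties it is appended to the existing javaOpts value (any flags you already pass there are omitted from this sketch):

    druid.indexer.runner.javaOpts=-Daws.region=us-east-1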

Connecting to S3 configuration

Property | Description | Default
---|---|---
druid.s3.accessKey | S3 access key. See S3 authentication methods for more details. | Can be omitted according to the authentication method chosen.
druid.s3.secretKey | S3 secret key. See S3 authentication methods for more details. | Can be omitted according to the authentication method chosen.
druid.s3.fileSessionCredentials | Path to a properties file containing sessionToken, accessKey, and secretKey values, one key/value pair per line (format key=value). See S3 authentication methods for more details. | Can be omitted according to the authentication method chosen.
druid.s3.protocol | Communication protocol to use when sending requests to AWS. Either http or https. This configuration is ignored if druid.s3.endpoint.url is set to a URL with a different protocol. | https
druid.s3.disableChunkedEncoding | Disables chunked encoding. See the AWS documentation for details. | false
druid.s3.enablePathStyleAccess | Enables path-style access. See the AWS documentation for details. | false
druid.s3.forceGlobalBucketAccessEnabled | Enables global bucket access. See the AWS documentation for details. | false
druid.s3.endpoint.url | Service endpoint, either with or without the protocol. | None
druid.s3.endpoint.signingRegion | Region to use for SigV4 signing of requests (e.g. us-west-1). | None
druid.s3.proxy.host | Proxy host to connect through. | None
druid.s3.proxy.port | Port on the proxy host to connect through. | None
druid.s3.proxy.username | User name to use when connecting through a proxy. | None
druid.s3.proxy.password | Password to use when connecting through a proxy. | None
druid.storage.sse.type | Server-side encryption type. Should be one of s3, kms, or custom. See the Server-side encryption section below for more details. | None
druid.storage.sse.kms.keyId | AWS KMS key ID. Used only when druid.storage.sse.type is kms; can be empty to use the default key ID. | None
druid.storage.sse.custom.base64EncodedKey | Base64-encoded key. Must be specified if druid.storage.sse.type is custom. | None
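
For an S3-compatible service other than AWS S3 (for example, a self-hosted object store), a hedged sketch of the endpoint-related properties might look like this; the endpoint URL and signing region are placeholders, and path-style access is only needed if your service requires it:

    druid.s3.endpoint.url=https://s3.example.internal:9000
    druid.s3.endpoint.signingRegion=us-east-1
    druid.s3.enablePathStyleAccess=true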

Server-side encryption

You can enable server-side encryption by setting druid.storage.sse.type to a supported type of server-side encryption. The currently supported types are:

  • s3: Server-side encryption with S3-managed encryption keys
  • kms: Server-side encryption with AWS KMS–Managed Keys
  • custom: Server-side encryption with Customer-Provided Encryption Keys
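
For example, enabling KMS-managed encryption amounts to two properties; the key ID below is a placeholder:

    druid.storage.sse.type=kms
    druid.storage.sse.kms.keyId=YOUR_KMS_KEY_ID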