Tutorial: Run with Docker

This quickstart guides you through the steps to download the Apache Druid image from Docker Hub and deploy it on a single machine using Docker and Docker Compose. After you finish the initial setup, the cluster will be ready to load data.

Before beginning the quickstart, it is helpful to read the general Druid overview and the ingestion overview, because the tutorials refer to concepts discussed on those pages. It also helps to be familiar with Docker.

This tutorial assumes you will download the required files from GitHub. The files are also available in a Druid installation and in the Druid sources.

Prerequisites

  • Docker

Docker memory requirements

The default docker-compose.yml launches eight containers: ZooKeeper, PostgreSQL, and six Druid containers based on the micro-quickstart configuration. Each Druid service is configured to use up to 7 GiB of memory (6 GiB of direct memory and 1 GiB of heap). However, the quickstart will not use all the available memory.

For this setup, Docker needs at least 6 GiB of memory available for the Druid cluster. For Docker Desktop on macOS, adjust the memory settings in the Docker Desktop preferences. If a container crashes with error code 137, you likely don't have enough memory allocated to Docker.
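To check how much memory the Docker daemon can use before launching the cluster, you can query docker info. A minimal sketch, using the standard Docker CLI format template:

    docker info --format '{{.MemTotal}}'   # total memory available to Docker, in bytes

If the reported value is below roughly 6 GiB, raise the limit in Docker Desktop first.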

You can modify the value of DRUID_SINGLE_NODE_CONF in the Docker environment to use a different single-server configuration. For example, to use the nano quickstart, set DRUID_SINGLE_NODE_CONF=nano-quickstart.

Getting started

Create a directory to hold the Druid Docker files.

The Druid source code contains an example docker-compose.yml that pulls images from Docker Hub. It is suitable as an example environment and for experimenting with Docker-based Druid configurations and deployments. Download this file to the directory you created above.
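For example, assuming the compose file lives at distribution/docker/ in the apache/druid GitHub repository (verify the path and branch against the repository before running), the setup might look like:

    mkdir druid-docker && cd druid-docker
    curl -O https://raw.githubusercontent.com/apache/druid/master/distribution/docker/docker-compose.yml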

Compose file

The example docker-compose.yml will create a container for each Druid service, as well as ZooKeeper and a PostgreSQL container as the metadata store.

It will also create a named volume druid_shared as deep storage, used to keep and share segments and task logs among the Druid services. The volume is mounted at /opt/shared in the containers.
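To see where Docker keeps this shared volume on the host, you can inspect it. Note that Docker Compose normally prefixes volume names with the project name, so the exact name may differ from druid_shared:

    docker volume ls                      # find the full volume name, e.g. <project>_druid_shared
    docker volume inspect <volume-name>   # the Mountpoint field shows the host path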

Environment file

The Druid docker-compose.yml example uses an environment file to specify the complete Druid configuration, including the environment variables described in Configuration. This file is named environment by default and must be in the same directory as the docker-compose.yml file. Download the example environment file to the directory you created above. The options in this file work well for trying out Druid and following this tutorial.
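Assuming the environment file sits next to the compose file in the repository (the same caveat as above about verifying path and branch applies), you could download it with:

    curl -O https://raw.githubusercontent.com/apache/druid/master/distribution/docker/environment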

The single-file approach is inadequate for a production system. Instead, we suggest using either DRUID_COMMON_CONFIG and DRUID_CONFIG_${service} or specially tailored, service-specific environment files.
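As a sketch of the service-specific approach, each service entry in docker-compose.yml could point at its own environment files via the standard env_file key (the file names here are hypothetical):

    broker:
      ...
      env_file:
        - environment-common
        - environment-broker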

Configuration

Configuration of the Druid Docker container is done via environment variables set within the container. Docker Compose passes the values from the environment file into the container. The variables may additionally specify paths to the standard Druid configuration files, which must be available within the container.

The default values are fine for the quickstart. Production systems will want to modify the defaults.

Basic configuration:

  • DRUID_MAXDIRECTMEMORYSIZE -- set the Java maximum direct memory size. Default is 6 GiB.
  • DRUID_XMX -- set Java Xmx, the maximum heap size. Default is 1 GiB.

Production configuration:

  • DRUID_CONFIG_COMMON -- full path to a file for Druid common properties
  • DRUID_CONFIG_${service} -- full path to a file for Druid service properties
  • JAVA_OPTS -- set Java options

Logging configuration:

  • DRUID_LOG4J -- set the entire log4j.xml configuration verbatim.
  • DRUID_LOG_LEVEL -- override the default Log4j log level.
  • DRUID_SERVICE_LOG4J -- set the entire log4j.xml configuration verbatim for a specific service.
  • DRUID_SERVICE_LOG_LEVEL -- override the default Log4j log level in the service-specific log4j configuration.

Advanced memory configuration:

  • DRUID_XMS -- set Java Xms, the initial heap size. Default is 1 GiB.
  • DRUID_MAXNEWSIZE -- set Java max new size
  • DRUID_NEWSIZE -- set Java new size
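As an illustration, the memory variables above are plain KEY=value lines in the environment file. The values below are deliberately small ones you might use on a memory-constrained laptop, not tuned defaults:

    DRUID_XMS=512m
    DRUID_XMX=512m
    DRUID_MAXDIRECTMEMORYSIZE=2g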

In addition to the special environment variables, the script that launches Druid in the container treats any environment variable starting with the druid_ prefix as command-line configuration. For example, the environment variable

druid_metadata_storage_type=postgresql

is translated into the following option in the Java launch command for the Druid process in the container:

-Ddruid.metadata.storage.type=postgresql

Note that Druid uses port 8888 for the console. This port is also used by Jupyter and other tools. To avoid conflicts, you can change the port in the ports section of the docker-compose.yml file. For example, to expose the console on port 9999 of the host:

    container_name: router
    ...
    ports:
      - "9999:8888"

Launching the cluster

cd into the directory that contains the configuration files. This is the directory you created above, or distribution/docker/ in your Druid installation directory if you installed Druid locally.

Run docker-compose up to launch the cluster with a shell attached, or docker-compose up -d to run the cluster in the background.
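For example, to start the cluster in the background and confirm that all containers are running:

    docker-compose up -d
    docker-compose ps   # each service should be listed as Up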

Once the cluster has started, you can navigate to the web console at http://localhost:8888. The Druid router process serves the UI.

[Screenshot: the Druid web console]

It takes a few seconds for all the Druid processes to fully start up. If you open the console immediately after starting the services, you may see some errors that you can safely ignore.

Using the cluster

From here you can follow along with the Quickstart. For production use, refine your docker-compose.yml file to add any additional external service dependencies as necessary.

You can explore the Druid containers using Docker to start a shell:

docker exec -ti <id> sh

where <id> is a container ID found with docker ps. Druid is installed in /opt/druid. The script that consumes the environment variables described above and launches Druid is located at /druid.sh.
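For example, to list the running containers and open a shell in the router (the container names come from the container_name entries in the compose file, so they may differ in your setup):

    docker ps --format '{{.ID}}  {{.Names}}'
    docker exec -ti router sh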

Run docker-compose down to shut down the cluster. Your data is persisted as a set of Docker volumes and will be available when you restart your Druid cluster.
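Note that docker-compose down stops and removes the containers but keeps the named volumes. If you want to start completely fresh, the -v flag also removes the volumes, deleting the metadata store and the druid_shared deep storage:

    docker-compose down -v   # warning: deletes all stored data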

โ† Transforming input dataKerberized HDFS deep storage โ†’
  • Prerequisites
    • Docker memory requirements
  • Getting started
    • Compose file
    • Environment file
    • Configuration
  • Launching the cluster
  • Using the cluster

Technologyโ€‚ยทโ€‚Use Casesโ€‚ยทโ€‚Powered by Druidโ€‚ยทโ€‚Docsโ€‚ยทโ€‚Communityโ€‚ยทโ€‚Downloadโ€‚ยทโ€‚FAQ

โ€‚ยทโ€‚โ€‚ยทโ€‚โ€‚ยทโ€‚
Copyright ยฉ 2022 Apache Software Foundation.
Except where otherwise noted, licensed under CC BY-SA 4.0.
Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.