Quickstart (local)

This quickstart gets you started with Apache Druid and introduces you to Druid ingestion and query features. For this tutorial, we recommend a machine with at least 6 GB of RAM.

In this quickstart, you'll do the following:

  • install Druid
  • start up Druid services
  • use SQL to ingest and query data

Druid supports a variety of ingestion options. Once you're done with this tutorial, refer to the Ingestion page to determine which ingestion method is right for you.

Requirements

You can follow these steps on a relatively modest machine, such as a workstation or virtual server with 16 GiB of RAM. As noted above, 6 GB is the recommended minimum.

The software requirements for the installation machine are:

  • Linux, macOS, or another Unix-like OS. (Windows is not supported.)
  • Java 8u92+ or Java 11
  • Python 2 or Python 3

Druid relies on the environment variables JAVA_HOME or DRUID_JAVA_HOME to find Java on the machine. You can set DRUID_JAVA_HOME if there is more than one instance of Java. To verify Java requirements for your environment, run the bin/verify-java script.
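
As a rough illustration of the lookup described above, the following sketch prints which Java home would be used. It assumes that DRUID_JAVA_HOME takes precedence over JAVA_HOME when both are set. resolve_java_home is a hypothetical helper written for this example, not part of the Druid distribution.

```shell
#!/bin/sh
# Hypothetical helper: print the Java home that would be picked,
# assuming DRUID_JAVA_HOME wins over JAVA_HOME when both are set.
resolve_java_home() {
  if [ -n "${DRUID_JAVA_HOME:-}" ]; then
    printf '%s\n' "$DRUID_JAVA_HOME"
  elif [ -n "${JAVA_HOME:-}" ]; then
    printf '%s\n' "$JAVA_HOME"
  else
    echo "neither DRUID_JAVA_HOME nor JAVA_HOME is set" >&2
    return 1
  fi
}

# Example (the path is a placeholder):
#   export DRUID_JAVA_HOME=/path/to/jdk-11
#   resolve_java_home && bin/verify-java
```

Setting DRUID_JAVA_HOME rather than JAVA_HOME leaves the Java used by other tools on the machine unchanged.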

Before installing a production Druid instance, be sure to review the security overview. In general, avoid running Druid as the root user. Consider creating a dedicated user account for running Druid.

Install Druid

Download the 25.0.0 release from Apache Druid.

In your terminal, extract the file and change directories to the distribution directory:

tar -xzf apache-druid-25.0.0-bin.tar.gz
cd apache-druid-25.0.0

The distribution directory contains LICENSE and NOTICE files and subdirectories for executable files, configuration files, sample data and more.

Start up Druid services

Start up Druid services using the automatic single-machine configuration. This configuration includes default settings that are appropriate for this tutorial, such as loading the druid-multi-stage-query extension by default so that you can use the MSQ task engine.

You can view that setting and others in the configuration files in the conf/druid/auto directory.

From the apache-druid-25.0.0 package root, run the following command:

./bin/start-druid

This brings up instances of ZooKeeper and the Druid services and may use up to 80% of the total available system memory. To explicitly set the total memory available to Druid, pass a value for the memory parameter, e.g. ./bin/start-druid -m 16g or ./bin/start-druid --memory 16g.

$ ./bin/start-druid
[Tue Nov 29 16:31:06 2022] Starting Apache Druid.
[Tue Nov 29 16:31:06 2022] Open http://localhost:8888/ in your browser to access the web console.
[Tue Nov 29 16:31:06 2022] Or, if you have enabled TLS, use https on port 9088.
[Tue Nov 29 16:31:06 2022] Starting services with log directory [/apache-druid-25.0.0/log].
[Tue Nov 29 16:31:06 2022] Running command[zk]: bin/run-zk conf
[Tue Nov 29 16:31:06 2022] Running command[broker]: bin/run-druid broker /apache-druid-25.0.0/conf/druid/single-server/quickstart '-Xms1187m -Xmx1187m -XX:MaxDirectMemorySize=791m'
[Tue Nov 29 16:31:06 2022] Running command[router]: bin/run-druid router /apache-druid-25.0.0/conf/druid/single-server/quickstart '-Xms128m -Xmx128m'
[Tue Nov 29 16:31:06 2022] Running command[coordinator-overlord]: bin/run-druid coordinator-overlord /apache-druid-25.0.0/conf/druid/single-server/quickstart '-Xms1290m -Xmx1290m'
[Tue Nov 29 16:31:06 2022] Running command[historical]: bin/run-druid historical /apache-druid-25.0.0/conf/druid/single-server/quickstart '-Xms1376m -Xmx1376m -XX:MaxDirectMemorySize=2064m'
[Tue Nov 29 16:31:06 2022] Running command[middleManager]: bin/run-druid middleManager /apache-druid-25.0.0/conf/druid/single-server/quickstart '-Xms64m -Xmx64m' '-Ddruid.worker.capacity=2 -Ddruid.indexer.runner.javaOptsArray=["-server","-Duser.timezone=UTC","-Dfile.encoding=UTF-8","-XX:+ExitOnOutOfMemoryError","-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager","-Xms256m","-Xmx256m","-XX:MaxDirectMemorySize=256m"]'
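
To get a feel for the 80% ceiling mentioned above, this sketch does the arithmetic for a given amount of total memory. eighty_percent_mb is a hypothetical helper. It only illustrates the upper bound and does not reproduce how the script sizes each service.

```shell
#!/bin/sh
# Hypothetical helper: 80% of a total memory figure given in MiB,
# using integer arithmetic (fractions are truncated).
eighty_percent_mb() {
  echo $(( $1 * 80 / 100 ))
}

eighty_percent_mb 16384   # 16 GiB machine, prints 13107
```

If that upper bound is more than you want to give Druid, pass an explicit value with -m or --memory as described above.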

All persistent state, such as the cluster metadata store and segments for the services, is kept in the var directory under the Druid root directory, apache-druid-25.0.0. Each service writes to a log file under var/sv.

At any time, you can revert Druid to its original, post-installation state by deleting the entire var directory. You may want to do this, for example, between Druid tutorials or after experimentation, to start with a fresh instance.

To stop Druid at any time, use CTRL+C in the terminal. This exits the bin/start-druid script and terminates all Druid processes.
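
The reset described above comes down to deleting one directory once Druid has stopped. A minimal sketch follows. reset_druid_state is a hypothetical helper, and the check for bin/start-druid is an extra guard added for this example so that the function refuses to delete a var directory that does not belong to a Druid installation.

```shell
#!/bin/sh
# Hypothetical helper: remove the var directory under a Druid root.
# Run it only after Druid has been stopped.
reset_druid_state() {
  druid_root=${1:-.}
  # Guard: refuse to act on something that does not look like a Druid root.
  if [ ! -x "$druid_root/bin/start-druid" ]; then
    echo "no bin/start-druid under $druid_root; nothing removed" >&2
    return 1
  fi
  rm -rf -- "$druid_root/var"
}

# Example (the path is a placeholder):
#   reset_druid_state /path/to/apache-druid-25.0.0
```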

Open the web console

After the Druid services finish startup, open the web console at http://localhost:8888.

It may take a few seconds for all Druid services to finish starting, including the Druid router, which serves the console. If you attempt to open the web console before startup is complete, you may see errors in the browser. Wait a few moments and try again.
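
Instead of retrying by hand, you can poll until the Router answers. The sketch below assumes that the Router on port 8888 serves a /status/health endpoint and that curl is installed. wait_for_druid is a hypothetical helper. It only confirms that the Router is responding, not that every service has finished starting.

```shell
#!/bin/sh
# Hypothetical helper: poll a URL until it answers successfully.
#   $1 = URL, $2 = max attempts, $3 = seconds between attempts
wait_for_druid() {
  url=$1
  attempts=${2:-30}
  delay=${3:-2}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP error responses.
    if curl -fsS -o /dev/null "$url" 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example:
#   wait_for_druid http://localhost:8888/status/health && echo "console ready"
```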

In this quickstart, you use the web console to perform ingestion. The MSQ task engine specifically uses the Query view to edit and run SQL queries. For a complete walkthrough of the Query view as it relates to the multi-stage query architecture and the MSQ task engine, see UI walkthrough.

Load data

The Druid distribution bundles the wikiticker-2015-09-12-sampled.json.gz sample dataset that you can use for testing. The sample dataset is located in the quickstart/tutorial/ folder, accessible from the Druid root directory, and represents Wikipedia page edits for a given day.

Follow these steps to load the sample Wikipedia dataset:

  1. In the Query view, click Connect external data.

  2. Select the Local disk tile and enter the following values:

    • Base directory: quickstart/tutorial/

    • File filter: wikiticker-2015-09-12-sampled.json.gz

    The UI takes the base directory and the file filter as separate fields. Because the file filter accepts wildcards, you can specify multiple files for ingestion at once.

  3. Click Connect data.

  4. On the Parse page, you can examine the raw data and perform the following optional actions before loading data into Druid:

    • Expand a row to see the corresponding source data.
    • Customize how the data is handled by selecting from the Input format options.
    • Adjust the primary timestamp column for the data. Druid requires data to have a primary timestamp column (internally stored in a column called __time). If your dataset doesn't have a timestamp, Druid uses the default value of 1970-01-01 00:00:00.

  5. Click Done. You're returned to the Query view that displays the newly generated query. The query inserts the sample data into the table named wikiticker-2015-09-12-sampled.

    REPLACE INTO "wikiticker-2015-09-12-sampled" OVERWRITE ALL
    WITH input_data AS (SELECT *
    FROM TABLE(
      EXTERN(
        '{"type":"local","baseDir":"quickstart/tutorial/","filter":"wikiticker-2015-09-12-sampled.json.gz"}',
        '{"type":"json"}',
        '[{"name":"time","type":"string"},{"name":"channel","type":"string"},{"name":"cityName","type":"string"},{"name":"comment","type":"string"},{"name":"countryIsoCode","type":"string"},{"name":"countryName","type":"string"},{"name":"isAnonymous","type":"string"},{"name":"isMinor","type":"string"},{"name":"isNew","type":"string"},{"name":"isRobot","type":"string"},{"name":"isUnpatrolled","type":"string"},{"name":"metroCode","type":"long"},{"name":"namespace","type":"string"},{"name":"page","type":"string"},{"name":"regionIsoCode","type":"string"},{"name":"regionName","type":"string"},{"name":"user","type":"string"},{"name":"delta","type":"long"},{"name":"added","type":"long"},{"name":"deleted","type":"long"}]'
         )
       ))
    SELECT
      TIME_PARSE("time") AS __time,
      channel,
      cityName,
      comment,
      countryIsoCode,
      countryName,
      isAnonymous,
      isMinor,
      isNew,
      isRobot,
      isUnpatrolled,
      metroCode,
      namespace,
      page,
      regionIsoCode,
      regionName,
      user,
      delta,
      added,
      deleted
    FROM input_data
    PARTITIONED BY DAY
    

  6. Optionally, click Preview to see the general shape of the data before you ingest it.

  7. Edit the first line of the query and change the default destination datasource name from wikiticker-2015-09-12-sampled to wikipedia.

  8. Click Run to execute the query. The task may take a minute or two to complete. When done, the task displays its duration and the number of rows inserted into the table. The view is set to automatically refresh, so you don't need to refresh the browser to see the status change.

    A successful task means that Druid data servers have picked up one or more segments.
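
If you keep the generated query in a file, the rename in step 7 can also be done outside the console. This sketch assumes the query text matches the one shown above. retarget_query is a hypothetical helper, and the file names in the usage comment are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the destination datasource on the first
# line of a query read from stdin. Only line 1 is touched, so the
# file name inside EXTERN(...) is left alone.
retarget_query() {
  sed '1s/"wikiticker-2015-09-12-sampled"/"wikipedia"/'
}

# Example:
#   retarget_query < generated-query.sql > wikipedia-query.sql
```

Restricting the substitution to the first line matters because the same string also appears in the input file filter, which must keep pointing at the sample file.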

Query data

Once the ingestion job is complete, you can query the data.

In the Query view, run the following query to produce a list of top channels:

SELECT
  channel,
  COUNT(*)
FROM "wikipedia"
GROUP BY channel
ORDER BY COUNT(*) DESC
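
The same query can also be sent over HTTP instead of through the console. The sketch below assumes that the Router on port 8888 forwards POST requests on /druid/v2/sql and that curl and python3 are installed. sql_payload, druid_sql, and the DRUID_URL variable are hypothetical names introduced for this example.

```shell
#!/bin/sh
# Build the JSON request body {"query": "<sql>"} from SQL on stdin.
# python3 is used only to escape the SQL text safely as a JSON string.
sql_payload() {
  python3 -c 'import json, sys; print(json.dumps({"query": sys.stdin.read()}))'
}

# POST a SQL statement read from stdin and print the response.
# DRUID_URL is a hypothetical override; the default matches this quickstart.
druid_sql() {
  sql_payload | curl -sS -X POST \
    -H 'Content-Type: application/json' \
    --data-binary @- \
    "${DRUID_URL:-http://localhost:8888}/druid/v2/sql"
}

# Example:
#   druid_sql <<'SQL'
#   SELECT channel, COUNT(*) FROM "wikipedia"
#   GROUP BY channel ORDER BY COUNT(*) DESC
#   SQL
```

Building the body with a JSON encoder rather than by string concatenation keeps the double quotes around "wikipedia" from breaking the request.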

Congratulations! You've gone from downloading Druid to querying data with the MSQ task engine in just one quickstart.

Next steps

See the following topics for more information:

  • Druid SQL overview or the Query tutorial to learn about how to query the data you just ingested.
  • Ingestion overview to explore options for ingesting more data.
  • Tutorial: Load files using SQL to learn how to generate a SQL query that loads external data into a Druid datasource.
  • Tutorial: Load data with native batch ingestion to load and query data with Druid's native batch ingestion feature.
  • Tutorial: Load stream data from Apache Kafka to load streaming data from a Kafka topic.
  • Extensions for details on Druid extensions.

Remember that after stopping Druid services, you can start clean next time by deleting the var directory from the Druid root directory and running the bin/start-druid script again. You may want to do this before using other data ingestion tutorials, since they use the same Wikipedia datasource.

โ† Introduction to Apache DruidSingle server deployment โ†’
  • Requirements
  • Install Druid
  • Start up Druid services
  • Open the web console
  • Load data
  • Query data
  • Next steps

Technologyโ€‚ยทโ€‚Use Casesโ€‚ยทโ€‚Powered by Druidโ€‚ยทโ€‚Docsโ€‚ยทโ€‚Communityโ€‚ยทโ€‚Downloadโ€‚ยทโ€‚FAQ

โ€‚ยทโ€‚โ€‚ยทโ€‚โ€‚ยทโ€‚
Copyright © 2022 Apache Software Foundation.
Except where otherwise noted, licensed under CC BY-SA 4.0.
Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.