
Aliyun OSS

Alibaba Cloud is the third-largest cloud infrastructure provider in the world. It provides its own storage solution known as OSS (Object Storage Service). This document describes how to use OSS as Druid deep storage.

Installation

Use the pull-deps tool shipped with Druid to install the aliyun-oss-extensions extension on middle manager and historical nodes:

java -classpath "{YOUR_DRUID_DIR}/lib/*" org.apache.druid.cli.Main tools pull-deps -c org.apache.druid.extensions.contrib:aliyun-oss-extensions:{YOUR_DRUID_VERSION}

Enabling

After installation, add aliyun-oss-extensions to druid.extensions.loadList in common.runtime.properties, then restart the middle manager and historical nodes.
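
For example, if your cluster already loads other extensions, append aliyun-oss-extensions to the existing list (the other entries shown here are illustrative; keep whatever your cluster already loads):

druid.extensions.loadList=["druid-hdfs-storage", "druid-kafka-indexing-service", "aliyun-oss-extensions"]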

Configuration

First, add the following OSS configurations to common.runtime.properties:

|Property|Description|Required|
|--------|-----------|--------|
|druid.oss.accessKey|The AccessKey ID of the account used to access the OSS bucket|yes|
|druid.oss.secretKey|The AccessKey Secret of the account used to access the OSS bucket|yes|
|druid.oss.endpoint|The endpoint URL of your OSS storage. If your Druid cluster is hosted in the same Alibaba Cloud region as your OSS bucket, it's recommended to use the internal network endpoint URL, so that any inbound and outbound traffic to the OSS bucket is free of charge.|yes|

To use OSS as deep storage, add the following configurations:

|Property|Description|Required|
|--------|-----------|--------|
|druid.storage.type|Global deep storage provider. Must be set to oss to make use of this extension.|yes|
|druid.storage.oss.bucket|Storage bucket name.|yes|
|druid.storage.oss.prefix|Folder where segments will be published. druid/segments is recommended.|no|

If OSS is used as deep storage for segment files, it's also recommended to store indexing task logs in OSS. To do this, add the following configurations:

|Property|Description|Required|
|--------|-----------|--------|
|druid.indexer.logs.type|Task log provider. Must be set to oss to make use of this extension.|yes|
|druid.indexer.logs.oss.bucket|The bucket used to keep logs. It can be the same as druid.storage.oss.bucket.|yes|
|druid.indexer.logs.oss.prefix|Folder where log files will be published. druid/logs is recommended.|no|
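
Putting these together, a minimal sketch of the relevant common.runtime.properties entries might look like the following. The bucket name, keys, and endpoint are placeholders; the endpoint shown assumes a cluster hosted in the same region (Hangzhou) as the bucket, per the internal-endpoint recommendation above.

druid.oss.accessKey=YOUR_ACCESS_KEY
druid.oss.secretKey=YOUR_SECRET_KEY
druid.oss.endpoint=oss-cn-hangzhou-internal.aliyuncs.com

druid.storage.type=oss
druid.storage.oss.bucket=your-druid-bucket
druid.storage.oss.prefix=druid/segments

druid.indexer.logs.type=oss
druid.indexer.logs.oss.bucket=your-druid-bucket
druid.indexer.logs.oss.prefix=druid/logs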

Reading data from OSS

Currently, the web console does not support ingestion from OSS, but you can ingest from OSS by submitting an ingestion task that uses the OSS input source. The following tables describe the input source's configuration.

OSS Input Source

|Property|Description|Required|
|--------|-----------|--------|
|type|This should be oss.|yes|
|uris|JSON array of URIs where OSS objects to be ingested are located. For example: oss://{your_bucket}/{source_file_path}|One of uris, prefixes, or objects must be set|
|prefixes|JSON array of URI prefixes for the locations of OSS objects to be ingested. Empty objects starting with one of the given prefixes will be skipped.|One of uris, prefixes, or objects must be set|
|objects|JSON array of OSS objects to be ingested.|One of uris, prefixes, or objects must be set|
|properties|Properties object for overriding the default OSS configuration. See below for more information.|no (defaults will be used if not given)|

OSS Object

|Property|Description|Default|Required|
|--------|-----------|-------|--------|
|bucket|Name of the OSS bucket|None|yes|
|path|The path where data is located.|None|yes|

Properties Object

|Property|Description|Default|Required|
|--------|-----------|-------|--------|
|accessKey|The Password Provider or plain text string of this OSS input source's access key|None|yes|
|secretKey|The Password Provider or plain text string of this OSS input source's secret key|None|yes|
|endpoint|The endpoint of this OSS input source|None|no|
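
Instead of plain-text strings, accessKey and secretKey can each be given as a Druid password provider. A minimal sketch using the environment-variable password provider (the variable names are illustrative):

"properties" : {
  "accessKey" : { "type" : "environment", "variable" : "OSS_ACCESS_KEY" },
  "secretKey" : { "type" : "environment", "variable" : "OSS_SECRET_KEY" }
}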

Reading from a file
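
This example ingests the file rollup-data.json, which ships with Druid under the quickstart/tutorial directory. If the file is not in your OSS bucket yet, one way to upload it is with Alibaba Cloud's ossutil command-line tool (not part of Druid; an illustrative sketch that assumes ossutil is already configured with your credentials):

ossutil cp quickstart/tutorial/rollup-data.json oss://{YOUR_BUCKET_NAME}/druid/rollup-data.json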

Once rollup-data.json has been uploaded to the folder druid in your OSS bucket (the bucket your Druid cluster is configured to use), you can use the uris property of the OSS input source to read it, as shown:

{
  "type" : "index_parallel",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "rollup-tutorial-from-oss",
      "timestampSpec": {
        "column": "timestamp",
        "format": "iso"
      },
      "dimensionsSpec" : {
        "dimensions" : [
          "srcIP",
          "dstIP"
        ]
      },
      "metricsSpec" : [
        { "type" : "count", "name" : "count" },
        { "type" : "longSum", "name" : "packets", "fieldName" : "packets" },
        { "type" : "longSum", "name" : "bytes", "fieldName" : "bytes" }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "week",
        "queryGranularity" : "minute",
        "intervals" : ["2018-01-01/2018-01-03"],
        "rollup" : true
      }
    },
    "ioConfig" : {
      "type" : "index_parallel",
      "inputSource" : {
        "type" : "oss",
        "uris" : [
          "oss://{YOUR_BUCKET_NAME}/druid/rollup-data.json"
        ]
      },
      "inputFormat" : {
        "type" : "json"
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index_parallel",
      "maxRowsPerSegment" : 5000000,
      "maxRowsInMemory" : 25000
    }
  }
}

Post the above ingestion task spec to http://{YOUR_ROUTER_IP}:8888/druid/indexer/v1/task, and the indexing service will create an ingestion task to ingest the file.
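
For example, with curl, assuming the spec above is saved locally as rollup-ingestion-spec.json (a hypothetical file name):

curl -X POST -H 'Content-Type: application/json' -d @rollup-ingestion-spec.json http://{YOUR_ROUTER_IP}:8888/druid/indexer/v1/task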

Reading files in folders

To read all files in the same folder, use the prefixes property to specify the folder where Druid can find input files, instead of listing file URIs one by one.

...
    "ioConfig" : {
      "type" : "index_parallel",
      "inputSource" : {
        "type" : "oss",
        "prefixes" : [
          "oss://{YOUR_BUCKET_NAME}/2020", "oss://{YOUR_BUCKET_NAME}/2021"
        ]
      },
      "inputFormat" : {
        "type" : "json"
      },
      "appendToExisting" : false
    }
...

The spec above tells the ingestion task to read all files under the 2020 and 2021 folders.

Reading from other buckets

If you want to read files in a bucket different from the one Druid is configured to use, set the objects property of the OSS input source when submitting the task, as below:

...
    "ioConfig" : {
      "type" : "index_parallel",
      "inputSource" : {
        "type" : "oss",
        "objects" : [
          {"bucket": "YOUR_BUCKET_NAME", "path": "druid/rollup-data.json"}
        ]
      },
      "inputFormat" : {
        "type" : "json"
      },
      "appendToExisting" : false
    }
...

Reading with customized accessKey

If the default druid.oss.accessKey cannot access a bucket, use the properties object to supply custom credentials, as below:

...
    "ioConfig" : {
      "type" : "index_parallel",
      "inputSource" : {
        "type" : "oss",
        "objects" : [
          {"bucket": "YOUR_BUCKET_NAME", "path": "druid/rollup-data.json"}
        ],
        "properties": {
          "endpoint": "YOUR_ENDPOINT_OF_BUCKET",
          "accessKey": "YOUR_ACCESS_KEY",
          "secretKey": "YOUR_SECRET_KEY"
        }
      },
      "inputFormat" : {
        "type" : "json"
      },
      "appendToExisting" : false
    }
...

This properties object can be used with any of the uris, objects, or prefixes configurations above.

Troubleshooting

When using OSS as deep storage or reading from OSS, most problems users encounter are related to OSS permissions. Refer to the official OSS permission troubleshooting document for solutions.

โ† GCE ExtensionsPrometheus Emitter โ†’
  • Installation
  • Enabling
  • Configuration
  • Reading data from OSS
    • OSS Input Source
    • Reading from a file
    • Reading files in folders
    • Reading from other buckets
    • Reading with customized accessKey
  • Troubleshooting

Technologyโ€‚ยทโ€‚Use Casesโ€‚ยทโ€‚Powered by Druidโ€‚ยทโ€‚Docsโ€‚ยทโ€‚Communityโ€‚ยทโ€‚Downloadโ€‚ยทโ€‚FAQ

โ€‚ยทโ€‚โ€‚ยทโ€‚โ€‚ยทโ€‚
Copyright ยฉ 2022 Apache Software Foundation.
Except where otherwise noted, licensed under CC BY-SA 4.0.
Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.