Apache Parquet Extension

This Apache Druid (incubating) module extends Druid's Hadoop-based indexing to ingest data directly from offline Apache Parquet files.

Note: druid-parquet-extensions depends on the druid-avro-extensions module, so be sure to include both.
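For example, both modules can be enabled together through druid.extensions.loadList in common.runtime.properties. This is a minimal sketch; a real deployment will usually list additional extensions as well:

druid.extensions.loadList=["druid-avro-extensions", "druid-parquet-extensions"]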

Parquet Hadoop Parser

This extension provides two ways to parse Parquet files:

  • parquet - using a simple conversion contained within this extension
  • parquet-avro - conversion to avro records with the parquet-avro library and using the druid-avro-extensions module to parse the avro data

The conversion method is selected by the parser type, and the matching Hadoop input format must also be set in the ioConfig:

  • org.apache.druid.data.input.parquet.DruidParquetInputFormat for parquet
  • org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat for parquet-avro

Both parse options support auto field discovery and flattening if provided with a flattenSpec whose format is parquet or avro, respectively. Parquet nested list and map logical types should operate correctly with JSON path expressions for all supported types. parquet-avro sets the Hadoop job property parquet.avro.add-list-element-records to false (it normally defaults to true) in order to 'unwrap' primitive list elements into multi-value dimensions.

The parquet parser supports int96 Parquet values, while parquet-avro does not. The two parsers may also differ subtly in how they evaluate the JSON path expressions of a flattenSpec.

We suggest using parquet over parquet-avro to allow ingesting data beyond the schema constraints of Avro conversion. However, parquet-avro was the original basis for this extension, and as such it is a bit more mature.
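As an illustration of the flattening and list-unwrapping behavior described above, consider a hypothetical Parquet record with a nested field and a list of primitives (the record layout and field names here are assumptions for illustration, not part of this extension):

{
  "timestamp": "2018-01-01T00:00:00Z",
  "nestedData": { "dim1": "x" },
  "listDim": ["a", "b", "c"]
}

With useFieldDiscovery set to true and a flattenSpec path field such as $.nestedData.dim1 (named nestedDim), the ingested row would typically carry nestedDim = "x" as a regular dimension, while the top-level listDim would be discovered as a multi-value dimension containing "a", "b", and "c".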

Field | Type | Description | Required
type | String | Choose parquet or parquet-avro to determine how Parquet files are parsed. | yes
parseSpec | JSON Object | Specifies the timestamp and dimensions of the data, and optionally a flattenSpec. Valid parseSpec formats are timeAndDims, parquet, and avro (if used with Avro conversion). | yes
binaryAsString | Boolean | Specifies whether binary Parquet columns that are not logically marked as string or enum types should be converted to strings anyway. | no (default == false)
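For instance, binaryAsString sits alongside type and parseSpec at the parser level. A minimal sketch (not a complete spec):

"parser": {
  "type": "parquet",
  "binaryAsString": true,
  "parseSpec": { ... }
}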

When the time dimension is a DateType column, a format should not be supplied. When the format is UTF8 (String), either auto or an explicitly defined format is required.
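For example (the column names and date format below are hypothetical), a DateType column needs only the column name, while a UTF8 string column needs auto or an explicit format:

"timestampSpec": {
  "column": "date_col"
}

"timestampSpec": {
  "column": "timestamp",
  "format": "yyyy-MM-dd HH:mm:ss"
}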

Examples

parquet parser, parquet parseSpec

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "parquet",
          "flattenSpec": {
            "useFieldDiscovery": true,
            "fields": [
              {
                "type": "path",
                "name": "nestedDim",
                "expr": "$.nestedData.dim1"
              },
              {
                "type": "path",
                "name": "listDimFirstItem",
                "expr": "$.listDim[1]"
              }
            ]
          },
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}

parquet parser, timeAndDims parseSpec

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "dim1",
              "dim2",
              "dim3",
              "listDim"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}

parquet-avro parser, avro parseSpec

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet-avro",
        "parseSpec": {
          "format": "avro",
          "flattenSpec": {
            "useFieldDiscovery": true,
            "fields": [
              {
                "type": "path",
                "name": "nestedDim",
                "expr": "$.nestedData.dim1"
              },
              {
                "type": "path",
                "name": "listDimFirstItem",
                "expr": "$.listDim[1]"
              }
            ]
          },
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}

For additional details see Hadoop ingestion and general ingestion spec documentation.
