Compressed Big Decimal

Overview

Compressed Big Decimal is an extension that provides a mutable big decimal value type, which can accumulate values without losing precision or reallocating memory. It supports exact-precision arithmetic on large numbers in applications that require a high level of accuracy, such as financial applications and currency-based transactions. This avoids rounding errors that could otherwise cause large amounts of money to be lost.

Accumulation requires that the two numbers have the same scale, but does not require that they have the same size. If the value being accumulated has a larger underlying array than the value receiving the result, the higher-order bits are dropped, similar to what happens when adding a long to an int and storing the result in an int. A compressed big decimal holds its data in an embedded array.
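To illustrate why accumulation needs a shared scale, the sketch below stores each value as an integer count of 10^-scale units, so that accumulating is plain integer addition. It is only a Python model of the idea, not the extension's implementation. The helper names `to_unscaled` and `from_unscaled` and the choice of scale 9 are assumptions made for this example.

```python
from decimal import Decimal

SCALE = 9  # assumed scale: nine digits after the decimal point


def to_unscaled(text, scale=SCALE):
    """Convert a decimal string to an integer number of 10**-scale units."""
    shifted = Decimal(text).scaleb(scale)
    if shifted != shifted.to_integral_value():
        raise ValueError(f"{text} has more than {scale} fractional digits")
    return int(shifted)


def from_unscaled(unscaled, scale=SCALE):
    """Convert an unscaled integer back to a Decimal with the given scale."""
    return Decimal(unscaled).scaleb(-scale)


# Accumulate two values that share the same scale using integer addition only.
accumulator = to_unscaled("10.000000000")
accumulator += to_unscaled("5000000000.000000005")
total = from_unscaled(accumulator)
print(total)
```

Because both operands use the same scale, no digit is rounded away during the addition. A value with more fractional digits than the scale allows is rejected in this sketch rather than silently rounded.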

Compressed Big Decimal is an exact-number complex type based on Java's BigDecimal. It supports all the functionality supported by Java BigDecimal. Java BigDecimal is immutable, so accumulating with it creates a new object for every operation, which can cause heavy garbage collection. Compressed Big Decimal avoids this by mutating the value held in the accumulator.

Main enhancements provided by this extension:

Functionality: a mutable big decimal type with greater precision
Accuracy: a greater level of accuracy in decimal arithmetic

Operations

To use this extension, make sure to include druid-compressed-bigdecimal in the extensions load list (`druid.extensions.loadList`) in your config file.

Configuration

There are currently no configuration properties specific to Compressed Big Decimal.

Limitations

Compressed Big Decimal does not produce a correct result when the value being accumulated has a larger underlying array than the value receiving the result. In that case the higher-order bits are dropped, similar to what happens when adding a long to an int and storing the result in an int.
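The truncation described above can be modeled in a few lines of Python. This sketch assumes the underlying array is made of 32-bit words and that overflow wraps in two's complement, as it does when a Java long is narrowed to an int. The function `wrap_to_words` is a hypothetical helper written for this illustration and is not part of the extension.

```python
def wrap_to_words(value, size, word_bits=32):
    """Keep only the low-order size * word_bits bits of value (two's complement)."""
    bits = size * word_bits
    wrapped = value & ((1 << bits) - 1)   # drop the higher-order bits
    if wrapped >= 1 << (bits - 1):        # reinterpret the top bit as the sign
        wrapped -= 1 << bits
    return wrapped


# Analogous to adding a long to an int and storing the result in an int.
int_result = wrap_to_words(2**31, size=1)

# Unscaled form of 15000000010.000000005 at scale 9.
unscaled_total = 15000000010000000005
fits = wrap_to_words(unscaled_total, size=3) == unscaled_total       # 96 bits is enough
truncated = wrap_to_words(unscaled_total, size=2) != unscaled_total  # 64 bits is not
print(int_result, fits, truncated)
```

Under these assumptions, a size that is too small for the accumulated total returns a wrong value without raising an error, so the size should be chosen with the largest expected sum in mind.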

Ingestion Spec:

Most properties in the ingestion spec are derived from Ingestion Spec / Data Formats.

property	description	required?
metricsSpec	Metrics specification. When specifying metric details such as name and type, set the type to a compressedBigDecimal aggregator, for example compressedBigDecimalSum.	yes

Query spec:

Most properties in the query spec are derived from the groupBy and timeseries queries; see the documentation for these query types.

property	description	required?
queryType	This String should always be either "groupBy" or "timeseries"; this is the first thing Druid looks at to figure out how to interpret the query.	yes
dataSource	A String or Object defining the data source to query, very similar to a table in a relational database. See DataSource for more information.	yes
dimensions	A JSON list of DimensionSpec (note that this property is optional)	no
limitSpec	See LimitSpec	no
having	See Having	no
granularity	A period granularity; See Period Granularities	yes
filter	See Filters	no
aggregations	See Aggregations. Each aggregation must specify type, scale, and size for the Compressed Big Decimal type, as follows: `"aggregations": [{"type": "compressedBigDecimalSum", "name": "..", "fieldName": "..", "scale": [Numeric], "size": [Numeric]}]`. Refer to the query examples in the Examples section.	yes
postAggregations	Supports only aggregations as input; See Post Aggregations	no
intervals	A JSON Object representing ISO-8601 Intervals. This defines the time ranges to run the query over.	yes
context	An additional JSON Object which can be used to specify certain flags.	no
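As a small sketch of how the properties in the table fit together, the following Python builds a native query as a dictionary and serializes it to JSON. The helper `compressed_big_decimal_aggregator` is a hypothetical convenience function written for this example. The property names and the aggregator type come from this page, and the values are illustrative.

```python
import json


def compressed_big_decimal_aggregator(name, field_name, scale, size,
                                      agg_type="compressedBigDecimalSum"):
    """Build one entry for the "aggregations" list of a native query."""
    return {
        "type": agg_type,
        "name": name,
        "fieldName": field_name,
        "scale": scale,
        "size": size,
    }


# Required properties only: queryType, dataSource, granularity,
# aggregations, and intervals.
query = {
    "queryType": "timeseries",
    "dataSource": "invoices",
    "granularity": "ALL",
    "aggregations": [
        compressed_big_decimal_aggregator("revenue", "saleAmount", scale=9, size=3)
    ],
    "intervals": ["2020-12-08T00:00:00.000Z/P1D"],
}
print(json.dumps(query, indent=2))
```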

Examples

Consider the following data:

Date	Item	SaleAmount

20201208,ItemA,0.0
20201208,ItemB,10.000000000
20201208,ItemA,-1.000000000
20201208,ItemC,9999999999.000000000
20201208,ItemB,5000000000.000000005
20201208,ItemA,2.0
20201208,ItemD,0.0

IngestionSpec syntax:

{
	"type": "index_parallel",
	"spec": {
		"dataSchema": {
			"dataSource": "invoices",
			"timestampSpec": {
				"column": "timestamp",
				"format": "yyyyMMdd"
			},
			"dimensionsSpec": {
				"dimensions": [{
					"type": "string",
					"name": "itemName"
				}]
			},
			"metricsSpec": [{
				"name": "saleAmount",
				"type": "compressedBigDecimalSum",
				"fieldName": "saleAmount"
			}],
			"transformSpec": {
				"filter": null,
				"transforms": []
			},
			"granularitySpec": {
				"type": "uniform",
				"rollup": false,
				"segmentGranularity": "DAY",
				"queryGranularity": "none",
				"intervals": ["2020-12-08/2020-12-09"]
			}
		},
		"ioConfig": {
			"type": "index_parallel",
			"inputSource": {
				"type": "local",
				"baseDir": "/home/user/sales/data/staging/invoice-data",
				"filter": "invoice-001.20201208.txt"
			},
			"inputFormat": {
				"type": "tsv",
				"delimiter": ",",
				"skipHeaderRows": 0,
				"columns": [
					"timestamp",
					"itemName",
					"saleAmount"
				]
			}
		},
		"tuningConfig": {
			"type": "index_parallel"
		}
	}
}

SQL-based ingestion sample query:

REPLACE INTO "bigdecimal" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"local","baseDir":"/home/user/sales/data/staging/invoice-data","filter":"invoice-001.20201208.txt"}',
      '{"type":"csv","findColumnsFromHeader":false,"columns":["timestamp","itemName","saleAmount"]}',
      '[{"name":"timestamp","type":"string"},{"name":"itemName","type":"string"},{"name":"saleAmount","type":"double"}]'
    )
  ) 
)
SELECT
  TIME_PARSE(TRIM("timestamp")) AS "__time",
  "itemName",
  BIG_SUM("saleAmount") AS "amount"
FROM "ext"
GROUP BY TIME_PARSE(TRIM("timestamp")), "itemName"
PARTITIONED BY DAY

Group By Query example

Calculate the total sales across all items, using granularity ALL and no dimensions.

Query syntax:

{
    "queryType": "groupBy",
    "dataSource": "invoices",
    "granularity": "ALL",
    "dimensions": [
    ],
    "aggregations": [
        {
            "type": "compressedBigDecimalSum",
            "name": "revenue",
            "fieldName": "saleAmount",
            "scale": 9,
            "size": 3
        }
    ],
    "intervals": [
        "2020-12-08T00:00:00.000Z/P1D"
    ]
}

Result:

[ {
  "version" : "v1",
  "timestamp" : "2020-12-08T00:00:00.000Z",
  "event" : {
    "revenue" : 15000000010.000000005
  }
} ]

Had you used doubleSum instead of compressedBigDecimalSum, the result would be:

[ {
  "timestamp" : "2020-12-08T00:00:00.000Z",
  "result" : {
    "revenue" : 1.500000001E10
  }
} ]

As shown above, the fractional precision is lost, which could lead to a loss of money.
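The same effect can be reproduced outside Druid. The sketch below sums the sample sale amounts from this section twice: once as binary floating-point values, which is the representation a double-based sum relies on, and once as exact decimals. It uses Python's standard `decimal` module as a stand-in for exact decimal arithmetic and does not call the extension.

```python
from decimal import Decimal

# SaleAmount column of the sample data, kept as text so no digits are lost on input.
sale_amounts = [
    "0.0",
    "10.000000000",
    "-1.000000000",
    "9999999999.000000000",
    "5000000000.000000005",
    "2.0",
    "0.0",
]

# Each float() conversion rounds to the nearest representable double.
double_total = sum(float(amount) for amount in sale_amounts)

# Decimal keeps every digit of every input.
decimal_total = sum(Decimal(amount) for amount in sale_amounts)

print(double_total)
print(decimal_total)
```

The trailing 0.000000005 is smaller than the gap between neighboring doubles near five billion, so it is already gone when the value is converted, before any addition takes place.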

TimeSeries Query Example

Query syntax:

{
    "queryType": "timeseries",
    "dataSource": "invoices",
    "granularity": "ALL",
    "aggregations": [
        {
            "type": "compressedBigDecimalSum",
            "name": "revenue",
            "fieldName": "saleAmount",
            "scale": 9,
            "size": 3
        }
    ],
    "filter": {
        "type": "not",
        "field": {
            "type": "selector",
            "dimension": "itemName",
            "value": "ItemD"
        }
    },
    "intervals": [
        "2020-12-08T00:00:00.000Z/P1D"
    ]
}

Result:

[ {
  "timestamp" : "2020-12-08T00:00:00.000Z",
  "result" : {
    "revenue" : 15000000010.000000005
  }
} ]
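Native queries like the ones in this section are sent to Druid as JSON over HTTP. The sketch below prepares such a request and shows one way to read the response without converting the sum back into a double on the client. The URL assumes a Broker listening on localhost:8082, and `build_query_request` is a hypothetical helper; adjust both for your cluster. The request is built but not sent, so the example runs without a cluster.

```python
import json
import urllib.request
from decimal import Decimal

# Assumed endpoint: a Broker on localhost:8082. Adjust for your deployment.
DRUID_QUERY_URL = "http://localhost:8082/druid/v2"


def build_query_request(query, url=DRUID_QUERY_URL):
    """Wrap a native query dict in an HTTP POST request without sending it."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


timeseries_query = {
    "queryType": "timeseries",
    "dataSource": "invoices",
    "granularity": "ALL",
    "aggregations": [{
        "type": "compressedBigDecimalSum",
        "name": "revenue",
        "fieldName": "saleAmount",
        "scale": 9,
        "size": 3,
    }],
    "intervals": ["2020-12-08T00:00:00.000Z/P1D"],
}
request = build_query_request(timeseries_query)

# Parsing the result shown above: parse_float=Decimal keeps every digit,
# while the default parser would turn the number into a double.
result_text = ('[{"timestamp": "2020-12-08T00:00:00.000Z", '
               '"result": {"revenue": 15000000010.000000005}}]')
exact_rows = json.loads(result_text, parse_float=Decimal)
lossy_rows = json.loads(result_text)
print(exact_rows[0]["result"]["revenue"], lossy_rows[0]["result"]["revenue"])
```

To run the query against a live cluster, pass `request` to `urllib.request.urlopen` and parse the response body the same way.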

Supported Query Functions

Native aggregation functions:

compressedBigDecimalSum
compressedBigDecimalMin
compressedBigDecimalMax

SQL aggregation functions:

big_sum()
big_min()
big_max()
