SQL aggregation functions
Apache Druid supports two query languages: Druid SQL and native queries. This document describes the SQL language.
You can use aggregation functions in the SELECT clause of any Druid SQL query.
Of the aggregation functions supported by Druid, only COUNT, ARRAY_AGG, and STRING_AGG accept the DISTINCT keyword.
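To illustrate, the following sketch applies DISTINCT to each of those three functions, using the `wikipedia` datasource and `channel` column that appear in the example later in this document:

```sql
SELECT
  COUNT(DISTINCT channel)           AS distinct_channels,
  ARRAY_AGG(DISTINCT channel)       AS channel_list,
  STRING_AGG(DISTINCT channel, ',') AS channel_csv
FROM wikipedia
```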
The order of aggregation operations across segments is not deterministic, so non-commutative aggregation functions can produce inconsistent results for the same query. Functions that operate on an input type of "float" or "double" may also see differences in aggregation results across multiple query runs for this reason. If you need precisely the same value across multiple query runs, consider using the ROUND function to smooth out the inconsistencies between queries.
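For example, the following sketch wraps a floating-point sum in ROUND to keep results stable across runs; `some_double_col` is a hypothetical double-typed column standing in for any float or double expression:

```sql
SELECT ROUND(SUM(some_double_col), 2) AS stable_sum
FROM wikipedia
```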
Filter aggregations
Filter any aggregator using the FILTER clause, for example:
```sql
SELECT
  SUM(added) FILTER(WHERE channel = '#en.wikipedia')
FROM wikipedia
```
The FILTER clause limits an aggregation query to only the rows that match the filter. Druid translates the FILTER clause to a native filtered aggregator. Two aggregators in the same SQL query may have different filters.
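For example, the following sketch computes two sums over the same table with different filters; the '#fr.wikipedia' channel value is illustrative:

```sql
SELECT
  SUM(added) FILTER(WHERE channel = '#en.wikipedia') AS en_added,
  SUM(added) FILTER(WHERE channel = '#fr.wikipedia') AS fr_added
FROM wikipedia
```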
When no rows are selected, aggregation functions return their initial value. This can occur in the following cases:
- When no rows match the filter while aggregating values across an entire table without a grouping, or
- When using filtered aggregations within a grouping.
The initial value varies by aggregator. COUNT and the approximate count distinct sketch functions always return 0 as the initial value.
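As a sketch of this behavior, assuming no row in `wikipedia` has the channel value 'no-such-channel', the first column returns COUNT's initial value of 0 while the second returns SUM's initial value of null:

```sql
SELECT
  COUNT(*)   FILTER(WHERE channel = 'no-such-channel') AS zero_count,
  SUM(added) FILTER(WHERE channel = 'no-such-channel') AS null_sum
FROM wikipedia
```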
General aggregation functions
Function | Notes | Default |
---|---|---|
COUNT(*) | Counts the number of rows. | 0 |
COUNT([DISTINCT] expr) | Counts the values of expr. By default, DISTINCT serves as an alias for APPROX_COUNT_DISTINCT (useApproximateCountDistinct=true). The specific algorithm depends on the value of druid.sql.approxCountDistinct.function. In this mode, you can use strings, numbers, or prebuilt sketches. If counting prebuilt sketches, the prebuilt sketch type must match the selected algorithm. When useApproximateCountDistinct=false, returns the exact computation. In this case, expr must be string or numeric, since exact counts are not possible using prebuilt sketches. In exact mode, only one distinct count per query is permitted unless useGroupingSetForExactDistinct is enabled. Counts each distinct value in a multi-value row separately. | 0 |
SUM(expr) | Sums numbers. | null |
MIN(expr) | Takes the minimum of numbers. | null |
MAX(expr) | Takes the maximum of numbers. | null |
AVG(expr) | Averages numbers. | null |
APPROX_COUNT_DISTINCT(expr) | Counts distinct values of expr using an approximate algorithm. The expr can be a regular column or a prebuilt sketch column.The specific algorithm depends on the value of druid.sql.approxCountDistinct.function . By default, this is APPROX_COUNT_DISTINCT_BUILTIN . If the DataSketches extension is loaded, you can set it to APPROX_COUNT_DISTINCT_DS_HLL or APPROX_COUNT_DISTINCT_DS_THETA .When run on prebuilt sketch columns, the sketch column type must match the implementation of this function. For example: when druid.sql.approxCountDistinct.function is set to APPROX_COUNT_DISTINCT_BUILTIN , this function runs on prebuilt hyperUnique columns, but not on prebuilt HLLSketchBuild columns. | |
APPROX_COUNT_DISTINCT_BUILTIN(expr) | Usage note: consider using APPROX_COUNT_DISTINCT_DS_HLL instead, which offers better accuracy in many cases.Counts distinct values of expr using Druid's built-in "cardinality" or "hyperUnique" aggregators, which implement a variant of HyperLogLog. The expr can be a string, a number, or a prebuilt hyperUnique column. Results are always approximate, regardless of the value of useApproximateCountDistinct . | |
APPROX_QUANTILE(expr, probability, [resolution]) | Deprecated. Use APPROX_QUANTILE_DS instead, which provides a superior distribution-independent algorithm with formal error guarantees.Computes approximate quantiles on numeric or approxHistogram expressions. probability should be between 0 and 1, exclusive. resolution is the number of centroids to use for the computation. Higher resolutions will give more precise results but also have higher overhead. If not provided, the default resolution is 50. Load the approximate histogram extension to use this function. | NaN |
APPROX_QUANTILE_FIXED_BUCKETS(expr, probability, numBuckets, lowerLimit, upperLimit, [outlierHandlingMode]) | Computes approximate quantiles on numeric or fixed buckets histogram expressions. probability should be between 0 and 1, exclusive. The numBuckets , lowerLimit , upperLimit , and outlierHandlingMode parameters are described in the fixed buckets histogram documentation. Load the approximate histogram extension to use this function. | 0.0 |
BLOOM_FILTER(expr, numEntries) | Computes a bloom filter from values produced by expr , with numEntries maximum number of distinct values before false positive rate increases. See bloom filter extension documentation for additional details. | Empty base64 encoded bloom filter STRING |
VAR_POP(expr) | Computes variance population of expr . See stats extension documentation for additional details. | null |
VAR_SAMP(expr) | Computes variance sample of expr . See stats extension documentation for additional details. | null |
VARIANCE(expr) | Computes variance sample of expr . See stats extension documentation for additional details. | null |
STDDEV_POP(expr) | Computes standard deviation population of expr . See stats extension documentation for additional details. | null |
STDDEV_SAMP(expr) | Computes standard deviation sample of expr . See stats extension documentation for additional details. | null |
STDDEV(expr) | Computes standard deviation sample of expr . See stats extension documentation for additional details. | null |
EARLIEST(expr, [maxBytesPerValue]) | Returns the earliest value of expr. If expr comes from a relation with a timestamp column (like __time in a Druid datasource), the "earliest" is taken from the row with the overall earliest non-null value of the timestamp column. If the earliest non-null value of the timestamp column appears in multiple rows, the expr may be taken from any of those rows. If expr does not come from a relation with a timestamp, then it is simply the first value encountered. If expr is a string or complex type, maxBytesPerValue bytes of space are allocated for the aggregation. Strings longer than this limit are truncated. Set maxBytesPerValue as low as possible, since high values lead to wasted memory. If maxBytesPerValue is omitted, it defaults to 1024. | null |
EARLIEST_BY(expr, timestampExpr, [maxBytesPerValue]) | Returns the earliest value of expr. The earliest value of expr is taken from the row with the overall earliest non-null value of timestampExpr. If the earliest non-null value of timestampExpr appears in multiple rows, the expr may be taken from any of those rows. If expr is a string or complex type, maxBytesPerValue bytes of space are allocated for the aggregation. Strings longer than this limit are truncated. Set maxBytesPerValue as low as possible, since high values lead to wasted memory. If maxBytesPerValue is omitted, it defaults to 1024. Use EARLIEST instead of EARLIEST_BY on a table that has rollup enabled and was created with any variant of EARLIEST, LATEST, EARLIEST_BY, or LATEST_BY. In these cases, the intermediate type already stores the timestamp, and Druid ignores the value passed in timestampExpr. | null |
LATEST(expr, [maxBytesPerValue]) | Returns the latest value of expr. The expr must come from a relation with a timestamp column (like __time in a Druid datasource), and the "latest" is taken from the row with the overall latest non-null value of the timestamp column. If the latest non-null value of the timestamp column appears in multiple rows, the expr may be taken from any of those rows. If expr is a string or complex type, maxBytesPerValue bytes of space are allocated for the aggregation. Strings longer than this limit are truncated. Set maxBytesPerValue as low as possible, since high values lead to wasted memory. If maxBytesPerValue is omitted, it defaults to 1024. | null |
LATEST_BY(expr, timestampExpr, [maxBytesPerValue]) | Returns the latest value of expr. The latest value of expr is taken from the row with the overall latest non-null value of timestampExpr. If the overall latest non-null value of timestampExpr appears in multiple rows, the expr may be taken from any of those rows. If expr is a string or complex type, maxBytesPerValue bytes of space are allocated for the aggregation. Strings longer than this limit are truncated. Set maxBytesPerValue as low as possible, since high values lead to wasted memory. If maxBytesPerValue is omitted, it defaults to 1024. Use LATEST instead of LATEST_BY on a table that has rollup enabled and was created with any variant of EARLIEST, LATEST, EARLIEST_BY, or LATEST_BY. In these cases, the intermediate type already stores the timestamp, and Druid ignores the value passed in timestampExpr. | null |
ANY_VALUE(expr, [maxBytesPerValue, [aggregateMultipleValues]]) | Returns any value of expr, including null. This aggregator can simplify and optimize performance by returning the first encountered value (including null). If expr is a string or complex type, maxBytesPerValue bytes of space are allocated for the aggregation. Strings longer than this limit are truncated. Set maxBytesPerValue as low as possible, since high values lead to wasted memory. If maxBytesPerValue is omitted, it defaults to 1024. aggregateMultipleValues is an optional boolean flag that controls the behavior when aggregating a multi-value dimension. It defaults to true, which returns the stringified array for a multi-value dimension. Set it to false to return the first value instead. | null |
GROUPING(expr, expr...) | Returns a number to indicate which groupBy dimension is included in a row, when using GROUPING SETS . Refer to additional documentation on how to infer this number. | N/A |
ARRAY_AGG([DISTINCT] expr, [size]) | Collects all values of the specified expression into an array. To include only unique values, specify DISTINCT . size determines the maximum aggregation size in bytes and defaults to 1024 bytes. If the resulting array exceeds the size limit, the query fails. ORDER BY is not supported. The order of elements in the output array may vary depending on the processing order. | null |
ARRAY_CONCAT_AGG([DISTINCT] expr, [size]) | Concatenates array inputs into a single array. To include only unique values, specify DISTINCT . expr must be an array. size determines the maximum aggregation size in bytes and defaults to 1024 bytes. If the resulting array exceeds the size limit, the query fails. Druid ignores null array expressions, but null values within arrays are included in the output. ORDER BY is not supported. The order of elements in the output array may vary depending on the processing order. | null |
STRING_AGG([DISTINCT] expr, [separator, [size]]) | Collects all values (or all distinct values) of expr into a single STRING, ignoring null values. Each value is joined by an optional separator , which must be a literal STRING. If the separator is not provided, strings are concatenated without a separator.An optional size in bytes can be supplied to limit aggregation size (default of 1024 bytes). If the aggregated string grows larger than the maximum size in bytes, the query will fail. Use of ORDER BY within the STRING_AGG expression is not currently supported, and the ordering of results within the output string may vary depending on processing order. | null |
LISTAGG([DISTINCT] expr, [separator, [size]]) | Synonym for STRING_AGG . | null |
BIT_AND(expr) | Performs a bitwise AND operation on all input values. | null |
BIT_OR(expr) | Performs a bitwise OR operation on all input values. | null |
BIT_XOR(expr) | Performs a bitwise XOR operation on all input values. | null |
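As an illustration of several of the functions above, the following sketch assumes hypothetical `page` and `user` string columns on the `wikipedia` datasource used elsewhere in this document:

```sql
SELECT
  channel,
  LATEST_BY(page, __time, 1024)       AS latest_page,
  STRING_AGG(DISTINCT "user", ',', 2048) AS editors
FROM wikipedia
GROUP BY channel
```

Note that maxBytesPerValue and the STRING_AGG size limit are set explicitly here; both default to 1024 bytes when omitted.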
Sketch functions
These functions create sketch objects that you can use to perform fast, approximate analyses. For advice on choosing approximate aggregation functions, check out our approximate aggregations documentation. To operate on sketch objects, see the scalar DataSketches post aggregator functions.
HLL sketch functions
Load the DataSketches extension to use the following functions.
Function | Notes | Default |
---|---|---|
APPROX_COUNT_DISTINCT_DS_HLL(expr, [lgK, tgtHllType]) | Counts distinct values of expr , which can be a regular column or an HLL sketch column. Results are always approximate, regardless of the value of useApproximateCountDistinct . The lgK and tgtHllType parameters here are, like the equivalents in the aggregator, described in the HLL sketch documentation. See also COUNT(DISTINCT expr) . | 0 |
DS_HLL(expr, [lgK, tgtHllType]) | Creates an HLL sketch on the values of expr , which can be a regular column or a column containing HLL sketches. The lgK and tgtHllType parameters are described in the HLL sketch documentation. | '0' (STRING) |
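A minimal sketch of both HLL functions, using the `channel` column from the example earlier in this document; the lgK and tgtHllType arguments shown are illustrative:

```sql
SELECT
  APPROX_COUNT_DISTINCT_DS_HLL(channel, 14, 'HLL_6') AS approx_channels,
  DS_HLL(channel)                                    AS channel_sketch
FROM wikipedia
```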
Theta sketch functions
Load the DataSketches extension to use the following functions.
Function | Notes | Default |
---|---|---|
APPROX_COUNT_DISTINCT_DS_THETA(expr, [size]) | Counts distinct values of expr , which can be a regular column or a Theta sketch column. Results are always approximate, regardless of the value of useApproximateCountDistinct . The size parameter is described in the Theta sketch documentation. See also COUNT(DISTINCT expr) . | 0 |
DS_THETA(expr, [size]) | Creates a Theta sketch on the values of expr , which can be a regular column or a column containing Theta sketches. The size parameter is described in the Theta sketch documentation. | '0.0' (STRING) |
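For example, the following sketch assumes a hypothetical `user` string column; the size argument is illustrative:

```sql
SELECT APPROX_COUNT_DISTINCT_DS_THETA("user", 16384) AS approx_users
FROM wikipedia
```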
Quantiles sketch functions
Load the DataSketches extension to use the following functions.
Function | Notes | Default |
---|---|---|
APPROX_QUANTILE_DS(expr, probability, [k]) | Computes approximate quantiles on numeric or Quantiles sketch expressions. The probability value should be between 0 and 1, exclusive. The k parameter is described in the Quantiles sketch documentation.See the known issue with this function. | NaN |
DS_QUANTILES_SKETCH(expr, [k]) | Creates a Quantiles sketch on the values of expr , which can be a regular column or a column containing quantiles sketches. The k parameter is described in the Quantiles sketch documentation.See the known issue with this function. | '0' (STRING) |
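For example, the following sketch computes an approximate 95th percentile of the `added` column from the example earlier in this document; the k argument is illustrative:

```sql
SELECT APPROX_QUANTILE_DS(added, 0.95, 128) AS p95_added
FROM wikipedia
```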
Tuple sketch functions
Load the DataSketches extension to use the following functions.
Function | Notes | Default |
---|---|---|
DS_TUPLE_DOUBLES(expr[, nominalEntries]) | Creates a Tuple sketch on a precomputed sketch column expr , where the precomputed Tuple sketch contains an array of double values as its Summary Object. The nominalEntries override parameter is optional and described in the Tuple sketch documentation. | |
DS_TUPLE_DOUBLES(dimensionColumnExpr, metricColumnExpr1[, metricColumnExpr2, ...], [nominalEntries]) | Creates a Tuple sketch on raw data. The Tuple sketch will contain an array of double values as its Summary Object based on the dimension value of dimensionColumnExpr and the numeric metric values contained in one or more metricColumnExpr columns. If the last value of the array is a numeric literal, Druid assumes that the value is an override parameter for nominal entries. | |
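As a sketch of the raw-data form, the following assumes hypothetical `user` (dimension) and `delta` (metric) columns alongside the `added` column from the example earlier in this document:

```sql
SELECT DS_TUPLE_DOUBLES("user", added, delta) AS user_tuple_sketch
FROM wikipedia
```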
T-Digest sketch functions
Load the T-Digest extension to use the following functions. See the T-Digest extension for additional details and for more information on these functions.
Function | Notes | Default |
---|---|---|
TDIGEST_QUANTILE(expr, quantileFraction, [compression]) | Builds a T-Digest sketch on values produced by expr and returns the value for the quantile. Compression parameter (default value 100) determines the accuracy and size of the sketch. Higher compression means higher accuracy but more space to store sketches. | Double.NaN |
TDIGEST_GENERATE_SKETCH(expr, [compression]) | Builds a T-Digest sketch on values produced by expr . Compression parameter (default value 100) determines the accuracy and size of the sketch. Higher compression means higher accuracy but more space to store sketches. | Empty base64 encoded T-Digest sketch STRING |
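For example, the following sketch computes an approximate median of the `added` column; the compression argument shown is the default:

```sql
SELECT TDIGEST_QUANTILE(added, 0.5, 100) AS median_added
FROM wikipedia
```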