Bloom Filter

To use the Apache Druid® Bloom filter extension, include druid-bloom-filter in the extensions load list. See Loading extensions for more information.

This extension adds the ability to construct Bloom filters from query results and to filter query results by testing them against a Bloom filter. A Bloom filter is a probabilistic data structure for testing set membership. Bloom filters are a good choice when an explicit filter is impractical, such as filtering a query against a set of millions of values.

Following are some characteristics of Bloom filters:

  • Bloom filters are significantly more space-efficient than HashSets.
  • Because they are probabilistic, false positives are possible: the test() function might return true for an element that was never added to the filter.
  • False negatives are not possible: if an element has been added to the filter, test() always returns true for it.
  • The false positive probability of this implementation is fixed at 5%. Increasing the number of entries that the filter can hold decreases the false positive rate for a given number of insertions, in exchange for a larger overall size.
  • Bloom filters are sensitive to the number of inserted elements. You must specify the expected number of entries at creation time. If the number of insertions exceeds the specified number of entries, the false positive probability increases accordingly.
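The properties above can be illustrated with a toy Bloom filter. This is a minimal sketch for illustration only, not Druid's BloomKFilter (which uses Murmur3 hashing); the class and hashing scheme here are hypothetical.

```java
import java.util.BitSet;

// Toy Bloom filter: illustrates "no false negatives, possible false positives".
// Not the BloomKFilter implementation; the hashing here is a simple placeholder.
public class ToyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    ToyBloomFilter(int numBits, int numHashes) {
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.bits = new BitSet(numBits);
    }

    // Derive the i-th bit position for a value from its hash code.
    private int position(String value, int i) {
        int h = value.hashCode() * 31 + i * 0x9E3779B9;
        return Math.floorMod(h, numBits);
    }

    void add(String value) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(value, i));
        }
    }

    // May return true for a value that was never added (false positive),
    // but always returns true for a value that was added (no false negatives).
    boolean test(String value) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(value, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ToyBloomFilter filter = new ToyBloomFilter(1024, 3);
        filter.add("value 1");
        filter.add("value 2");
        System.out.println(filter.test("value 1")); // always true once added
    }
}
```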

This extension is based on org.apache.hive.common.util.BloomKFilter from hive-storage-api. Internally, this implementation uses Murmur3 as the hash algorithm.

The following Java example shows how to construct a BloomKFilter externally:

import java.io.ByteArrayOutputStream;
import org.apache.commons.codec.binary.Base64;
import org.apache.hive.common.util.BloomKFilter;

// Create a filter sized for an expected 1,500 entries.
BloomKFilter bloomFilter = new BloomKFilter(1500);
bloomFilter.addString("value 1");
bloomFilter.addString("value 2");
bloomFilter.addString("value 3");
// Serialize the filter and Base64-encode the bytes for use in a query.
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
BloomKFilter.serialize(byteArrayOutputStream, bloomFilter);
String base64Serialized = Base64.encodeBase64String(byteArrayOutputStream.toByteArray());

You can then use the Base64 encoded string in JSON-based or SQL-based queries in Druid.
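As a sketch of that step, the encoded string can be spliced into a SQL query text. The table name druid.foo and column name dim1 below are hypothetical examples, and buildQuery is a helper invented for illustration.

```java
// Sketch: embed a Base64-serialized BloomKFilter in a Druid SQL query string.
// "druid.foo" and "dim1" are hypothetical; buildQuery is an illustrative helper.
public class BloomQueryString {
    static String buildQuery(String base64Serialized) {
        return "SELECT COUNT(*) FROM druid.foo "
            + "WHERE bloom_filter_test(dim1, '" + base64Serialized + "')";
    }

    public static void main(String[] args) {
        // A real value would come from serializing a BloomKFilter as shown above.
        System.out.println(buildQuery("<serialized_bytes_for_BloomKFilter>"));
    }
}
```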

Filter queries with a Bloom filter

JSON specification

{
  "type" : "bloom",
  "dimension" : <dimension_name>,
  "bloomKFilter" : <serialized_bytes_for_BloomKFilter>,
  "extractionFn" : <extraction_fn>
}
| Property | Description | Required |
|----------|-------------|----------|
| type | Filter type. Set to bloom. | Yes |
| dimension | Dimension to filter over. | Yes |
| bloomKFilter | Base64 encoded binary representation of org.apache.hive.common.util.BloomKFilter. | Yes |
| extractionFn | Extraction function to apply to the dimension values. | No |

Serialized format for BloomKFilter

Serialized BloomKFilter format:

  • 1 byte for the number of hash functions.
  • 1 big-endian integer for the number of longs in the bitset.
  • Big-endian longs in the BloomKFilter bitset.

org.apache.hive.common.util.BloomKFilter provides a method to serialize Bloom filters to outputStream.
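The layout above can be sketched with a DataOutputStream, which writes in big-endian order. This is an illustrative sketch of the documented byte layout; BloomKFilter.serialize remains the authoritative implementation, and the class and method names here are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch of the serialized BloomKFilter layout:
// 1 byte hash-function count, big-endian int long count, big-endian longs.
public class BloomSerializeSketch {
    static byte[] serialize(int numHashFunctions, long[] bitsetLongs) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream data = new DataOutputStream(out); // big-endian by default
        data.writeByte(numHashFunctions);     // 1 byte: number of hash functions
        data.writeInt(bitsetLongs.length);    // 4 bytes: number of longs in the bitset
        for (long word : bitsetLongs) {
            data.writeLong(word);             // 8 bytes each: the bitset words
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] serialized = serialize(3, new long[]{1L, 2L});
        System.out.println(serialized.length); // 1 + 4 + 2*8 = 21 bytes
    }
}
```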

Filter SQL queries

You can use Bloom filters in SQL WHERE clauses with the bloom_filter_test operator:

SELECT COUNT(*) FROM druid.foo WHERE bloom_filter_test(<expr>, '<serialized_bytes_for_BloomKFilter>')

Expression and virtual column support

The Bloom filter extension also adds a Bloom filter Druid expression which shares syntax with the SQL operator.

bloom_filter_test(<expr>, '<serialized_bytes_for_BloomKFilter>')

Bloom filter query aggregator

You can create an input for a BloomKFilter from a Druid query with the bloom aggregator. Set a reasonable value for the maxNumEntries parameter: it specifies the maximum number of distinct entries the Bloom filter can represent without an increased false positive rate. To choose an appropriate value, first estimate the column's cardinality, for example with a query that uses one of the unique count sketches.
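The effect of exceeding the sizing can be sketched with the standard Bloom filter formulas. These are the general textbook formulas, not BloomKFilter's exact internals, and the class and method names are invented for the example; the point is that inserting far more than maxNumEntries values drives the false positive rate well above the nominal 5%.

```java
// Standard Bloom filter sizing math (general formulas, not BloomKFilter internals).
public class BloomSizingSketch {
    // p = (1 - e^(-k*n/m))^k for m bits, k hashes, n inserted elements.
    static double falsePositiveRate(long numBits, int numHashes, long numInserted) {
        double exponent = -(double) numHashes * numInserted / numBits;
        return Math.pow(1.0 - Math.exp(exponent), numHashes);
    }

    public static void main(String[] args) {
        // A filter sized for n = 1500 entries at p = 5% needs roughly
        // m = -n * ln(p) / ln(2)^2 bits and k = m/n * ln(2) hash functions.
        long bits = (long) Math.ceil(-1500 * Math.log(0.05) / (Math.log(2) * Math.log(2)));
        int hashes = (int) Math.round((double) bits / 1500 * Math.log(2));
        System.out.printf("bits=%d hashes=%d%n", bits, hashes);
        // At the designed capacity the rate is near 5%; at 4x the capacity
        // it degrades sharply.
        System.out.printf("FPP at 1500 inserts: %.3f%n", falsePositiveRate(bits, hashes, 1500));
        System.out.printf("FPP at 6000 inserts: %.3f%n", falsePositiveRate(bits, hashes, 6000));
    }
}
```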

JSON specification

{
  "type": "bloom",
  "name": <output_field_name>,
  "maxNumEntries": <maximum_number_of_elements_for_BloomKFilter>,
  "field": <dimension_spec>
}
| Property | Description | Required |
|----------|-------------|----------|
| type | Aggregator type. Set to bloom. | Yes |
| name | Output field name. | Yes |
| field | DimensionSpec to add to org.apache.hive.common.util.BloomKFilter. | Yes |
| maxNumEntries | Maximum number of distinct values supported by org.apache.hive.common.util.BloomKFilter. Defaults to 1500. | No |

Example

The following example shows a timeseries query object with a bloom aggregator:

{
  "queryType": "timeseries",
  "dataSource": "wikiticker",
  "intervals": [ "2015-09-12T00:00:00.000/2015-09-13T00:00:00.000" ],
  "granularity": "day",
  "aggregations": [
    {
      "type": "bloom",
      "name": "userBloom",
      "maxNumEntries": 100000,
      "field": {
        "type": "default",
        "dimension": "user",
        "outputType": "STRING"
      }
    }
  ]
}

Example response:

[
  {
    "timestamp": "2015-09-12T00:00:00.000Z",
    "result": { "userBloom": "BAAAJhAAAA..." }
  }
]

We recommend ordering by an alternative aggregation method instead of ordering results by a Bloom filter aggregator. Ordering results by a Bloom filter aggregator can be resource-intensive because Druid performs an expensive linear scan of the filter to approximate the count of items added to the set by counting the number of set bits.

SQL Bloom filter aggregator

You can compute Bloom filters in SQL expressions with the BLOOM_FILTER aggregator. For example:

SELECT BLOOM_FILTER(<expression>, <max number of entries>) FROM druid.foo WHERE dim2 = 'abc'

Druid serializes Bloom filter results in a SQL response into a Base64 string. You can use the resulting string in subsequent queries as a filter.