Deep Dive Toolkit

Bloom Filter

A Bloom filter is a memory-efficient way to answer one question: “Have we maybe seen this item before?”
It is useful when you want to avoid expensive lookups for items that definitely do not exist.

The simple idea

A Bloom filter is a probabilistic data structure.

It can answer:

Definitely not present
Maybe present

It cannot answer:

Definitely present

This is the key tradeoff.

A Bloom filter can have false positives, but it does not have false negatives.

What that means

If the Bloom filter says:

No

Then the item is definitely not in the set.

If it says:

Maybe

Then the item might be in the set, so you still check the database or storage.

When to use Bloom filters

Good interview signals:

many requests are for missing data,
database lookups are expensive,
you need large-scale deduplication,
you want a memory-efficient pre-check,
false positives are acceptable.

Examples:

URL shortener: check if a custom alias may already exist,
web crawler: check if URL was already seen,
cache penetration: avoid DB lookups for invalid keys,
username availability check,
blocked IP or URL lookup,
duplicate event detection.

Cache penetration example

Imagine users request many fake IDs:

/item/abc123
/item/fake999
/item/not-real

If these IDs do not exist, cache will miss and the database gets hit again and again.

A Bloom filter can sit before the database:

Request key → Bloom filter
             → definitely not exist: return 404
             → maybe exists: check cache or DB

This protects the database from useless lookups.

False positives

A false positive means:

Bloom filter says “maybe present”
but the item is actually not present.

This is usually okay.

It only causes an extra database lookup.

That is much better than a false negative, which would incorrectly say an existing item does not exist.

Deletion problem

Standard Bloom filters do not support easy deletion.

If your data changes a lot, you may need:

counting Bloom filter,
periodic rebuild,
short-lived Bloom filters,
or a different data structure.

Common mistakes

Saying Bloom filter gives exact answers.
Forgetting false positives.
Using it as the source of truth.
Using it when false positives are not acceptable.
Ignoring deletion.
Adding it when normal cache or database index is enough.

Final takeaway

Bloom filters are useful when many requests are for things that do not exist.

A strong answer is:

I would use a Bloom filter to quickly reject keys that definitely do not exist. False positives are acceptable because they only cause an extra database lookup. The database remains the source of truth.