Bloom Filter

Share
Bloom Filter
A Bloom filter is a memory-efficient way to answer one question: “Have we maybe seen this item before?”
It is useful when you want to avoid expensive lookups for items that definitely do not exist.

The simple idea

A Bloom filter is a probabilistic data structure.

It can answer:

Definitely not present
Maybe present

It cannot answer:

Definitely present

This is the key tradeoff.

A Bloom filter can have false positives, but it does not have false negatives.

What that means

If the Bloom filter says:

No

Then the item is definitely not in the set.

If it says:

Maybe

Then the item might be in the set, so you still check the database or storage.

When to use Bloom filters

Good interview signals:

  • many requests are for missing data,
  • database lookups are expensive,
  • you need large-scale deduplication,
  • you want a memory-efficient pre-check,
  • false positives are acceptable.

Examples:

  • URL shortener: check if a custom alias may already exist,
  • web crawler: check if URL was already seen,
  • cache penetration: avoid DB lookups for invalid keys,
  • username availability check,
  • blocked IP or URL lookup,
  • duplicate event detection.

Cache penetration example

Imagine users request many fake IDs:

/item/abc123
/item/fake999
/item/not-real

If these IDs do not exist, cache will miss and the database gets hit again and again.

A Bloom filter can sit before the database:

Request key → Bloom filter
             → definitely not exist: return 404
             → maybe exists: check cache or DB

This protects the database from useless lookups.

False positives

A false positive means:

Bloom filter says “maybe present”
but the item is actually not present.

This is usually okay.

It only causes an extra database lookup.

That is much better than a false negative, which would incorrectly say an existing item does not exist.

Deletion problem

Standard Bloom filters do not support easy deletion.

If your data changes a lot, you may need:

  • counting Bloom filter,
  • periodic rebuild,
  • short-lived Bloom filters,
  • or a different data structure.

Common mistakes

  • Saying Bloom filter gives exact answers.
  • Forgetting false positives.
  • Using it as the source of truth.
  • Using it when false positives are not acceptable.
  • Ignoring deletion.
  • Adding it when normal cache or database index is enough.

Final takeaway

Bloom filters are useful when many requests are for things that do not exist.

A strong answer is:

I would use a Bloom filter to quickly reject keys that definitely do not exist. False positives are acceptable because they only cause an extra database lookup. The database remains the source of truth.