On this page:
Sets
Structure
16.1 Bit vector
16.2 k hash functions
16.3 Bloom filter
8.6

Assignment 16: Bloom Filters

Goals: Practice working with arrays and hashes.

You should submit one .java file containing the solution to this problem.

Be sure to properly test your code and write purpose statements for your methods. A lack of tests and documentation will result in a lower grade! Remember that testing requires you to make some examples of data in an examples class.

Sets

While we haven’t discussed them much, sets are pervasive in computer science, and are primarily used to ask a perrenial question: is some thing in some collection of things?

Implementing a data stucture that performs this operation is incredibly simple (an IList can do it), but doing it smartly is not as easy. Smartly can mean a whole bunch of things, but largely has to do with efficiency. Namely, can the answer be found quickly, and can it be done without taking up too much memory?

The solution to the first problem is often solved with a hash of some sort, but the second question is not as simple. If your set contains a billion objects, how do you store all that information without running out of memory?

Enter Bloom filters!

Structure

A Bloom filter is a data structure that uses very little memory and can definitively tell you if an element is not part of a set, but can potentially return false positives. Elements can also be added to a Bloom filter.

A Bloom filter is stored as a bit vector (array of booleans) of some size \(m\) which are all initalized to 0 (false). When an element is added, \(k\) hash functions are applied to it, resulting in numbers \(k_1, k_2, .... k_k\), and all those bits modulo \(m\) are set to 1 (true) in the bit vector.

When one wants to know if an object is in the collection of data, one applies the aformentioned \(k\) hash functions to that object and sees if all of those bits modulo \(m\) are set to 1 in the bit vector.

16.1 Bit vector

For your first task, implement a bit vector class. It should take in a non-negative integer (enforce this) in its constructor, which determines the size of the bit vector. It should support the operations of getting the ith bit, setting the ith bit, and flipping the ith bit. All bits should have the default value of 0. Use a boolean[] to implement this, and note that the latter two of these operations are mutative; they should not return a new bit vector, but modify the existing one.

16.2 k hash functions

This strategy for hashing is non-ideal because it isn’t guaranteed to produce well distributed hash codes.

Being able to create an arbitrary number of hash functions is difficult (but doable), so you’re going to use Random numbers to simulate this behavior. Write a static method that given an object o, a non-negative integer k, and a positive integer m, produces k integers between and including 0 and m-1, using o’s hashcode as the random number seed (the number passed to Random’s constructor).

Use the nextInt method k times on the same Random object, and be sure to give the method m as a bound. Accumulate these results in an array list (or array) and return it.

16.3 Bloom filter

Finally, implement the Bloom filter class itself. It should be constructed with a non-negative integer k and a positive integer m, representing the number of "hash functions" and size of the bit vector, respectively.

The class should support the aforementioned behavior, including returning a boolean to indicate whether or not an element is definitely not in or probably in the collection, and also being able to add an element to the collection. Also include methods that inform the user the likelihood of a false positive, the number of times an element has been added, and one which gives them the ability to clear the data set (which should also reset the count of elements added to 0).

Additionally, include in a block comment why iteration over the data set and removal of individual elements from the data set are impossible.