Select Connection: INPUT[inlineListSuggester(optionQuery(#permanent_note), optionQuery(#literature_note), optionQuery(#fleeting_note)):connections]

A hash table is an effective data structure for implementing dictionaries. The average time to search is 

A hash table generalizs the simpler notion of an ordinary array.

Direct-address Table

→ we can take advantage of it when we can afford to allocate an array that has one position for every possible key

Suppose that we need a dynamic set in which each element has a key drawn from the universe , where  is not too large. To represent the dynamic set we use a direct-address table (i.e. array), in which each slot corresponds to a key in 

slot  points to an element in the set with key 

In some applications rather than storing the key and the data in an external object, we can store the object in the slot itself.

Hash Tables

→ direct-address tables can’t be used for large universes

Hash table requires much less storage when the set  of keys stored in a dictionary is much smaller than the universe . Memory →  Searching requires  time

We use a hash function  to compute the slot number from the key . The hash function maps the universe of keys into the slot of a hash table . The hash function reduces the range of array indices, instead of a size of , the array can have size .

Collisions

Problem: Two keys may hash to the same slot → collision Impossible to avoid due to the Pigeonhole Principle, so we need a way to resolve collisions.

Independent uniform hashing

→ what it means for a hash function to be “random”

Independent Uniform Hash Function

For each possible input , the output  is an element randomly and independently chosen uniformly from the range → implementing hash tables with this function, we say we are using independent uniform hashing (not implementable)

Separate chaining

The input set of  elements is divided randomly into  size. A hash function determines which subset an element belongs to and each subset is managed independently as a list.

An array with eight buckets. Bucket 2 links to a chain of two nodes. Bucket 5 links to a single node.

chaining → each nonempty slot points to a linked list. Slot  contains a pointer to the head of the list of all stored elements with hash value 

insertion:  at worst assuming the element is not already present searching: proportonial to the length of the list deletion:  if lists are doubly linked

Analysis

Define the load factor  for hash table  as , with  number of slots and  number of elements.

The worst case is in which all keys hash the same slot, creating a list of length . Searching is 

The average case depends on how well the hash function distributes the set of keys. We assume that we are using independent uniform hashing. Because hashes of distinct keys are assumed to be independent, independent uniform hashing is universal → chance of collide is . For  denote the length of the list  by  so that . Then .

Theorem 1 (Unsuccessful Search)

In hash table with chaining, an unsuccessful search takes  time on average, with the assumption of independent uniform hashing

Theorem 2 (Successful Search)

In hash table with chaining, a successful search takes  on average

So, concluding:

  • searching →  on average
  • deletion →  at worst → we can support all dictionary operations in  time on average

Open Addressing

store all entries in a single array The problem is that when inserting the bucket may be full, so we need to search another bucket. The process of finding an available bucket is called probing, while the bucket order is the probe sequence 1.

Pros:

  • easy
  • cache-friendly Cons:
  • prone to clustering

Hash Functions

takes a value (generic) and outputs a fixed-size integer hash code requirements:

  • deterministic
  • uniform
  • fast

They are different ones, one is FNV-1a.

pub trait Hashable {
    fn hash(&self) -> u32;
}
 
impl Hashable for String {
    fn hash(&self) -> u32 {
        let mut hash = 2166136261;
        for c in self.as_bytes() {
            hash = (hash ^ (*c as u32)).wrapping_mul(16777619);
        }
        hash
    }
}

Footnotes

  1. Same concept as Multi-Probing in LSH implementations