# GetWiki

### hash table

[ temporary import ]
- the content below is remote from Wikipedia
- it has been imported raw for GetWiki

Hash table
• Type: unordered associative array
• Invented: 1953
• Space: O(n) average, O(n) worst case
• Search: O(1) average, O(n) worst case
• Insert: O(1) average, O(n) worst case
• Delete: O(1) average, O(n) worst case

Figure: A small phone book as a hash table.

In computing, a hash table (hash map) is a data structure that implements an associative array abstract data type, a structure that can map keys to values. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.

Ideally, the hash function will assign each key to a unique bucket, but most hash table designs employ an imperfect hash function, which might cause hash collisions where the hash function generates the same index for more than one key. Such collisions must be accommodated in some way.

In a well-dimensioned hash table, the average cost (number of instructions) for each lookup is independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key-value pairs, at amortized constant average cost per operation (Charles E. Leiserson, "Amortized Algorithms, Table Doubling, Potential Method", Lecture 13, MIT 6.046J/18.410J Introduction to Algorithms, Fall 2005; Donald Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching, 2nd ed., 1998, ISBN 978-0-201-89685-5, pp. 513–558).
In many situations, hash tables turn out to be on average more efficient than search trees or any other table lookup structure. For this reason, they are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches, and sets.

## Hashing

The idea of hashing is to distribute the entries (key/value pairs) across an array of buckets. Given a key, the algorithm computes an index that suggests where the entry can be found:
index = f(key, array_size)
Often this is done in two steps:
hash = hashfunc(key)
index = hash % array_size
In this method, the hash is independent of the array size, and it is then reduced to an index (a number between 0 and array_size − 1) using the modulo operator (%).

In the case that the array size is a power of two, the remainder operation is reduced to masking, which improves speed, but can increase problems with a poor hash function (see the JDK HashMap hashCode implementation).
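The two-step scheme above can be sketched in Python as follows. This is only an illustration: the 32-bit FNV-1a function stands in for hashfunc, and the names fnv1a_32, index_modulo and index_mask are assumptions of this sketch, not part of any particular library.

```python
def fnv1a_32(key: str) -> int:
    """32-bit FNV-1a hash of a string (an assumed stand-in for hashfunc)."""
    h = 0x811C9DC5                          # FNV offset basis
    for byte in key.encode("utf-8"):
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF   # multiply by the FNV prime, keep 32 bits
    return h


def index_modulo(key: str, array_size: int) -> int:
    """Reduce the hash to the range 0 .. array_size - 1 with the modulo operator."""
    return fnv1a_32(key) % array_size


def index_mask(key: str, array_size: int) -> int:
    """Same reduction by bit masking; only valid when array_size is a power of two."""
    assert array_size > 0 and array_size & (array_size - 1) == 0
    return fnv1a_32(key) & (array_size - 1)
```

For a power-of-two size such as 16, both reductions return the same index; masking simply keeps the low bits of the hash, which is why a hash function with weak low bits suffers more in that setting.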

### Choosing a hash function

A good hash function and implementation algorithm are essential for good hash table performance, but may be difficult to achieve [citation needed] (see the python/cpython repository on GitHub, retrieved 2018-09-19). A basic requirement is that the function should provide a uniform distribution of hash values. A non-uniform distribution increases the number of collisions and the cost of resolving them. Uniformity is sometimes difficult to ensure by design, but may be evaluated empirically using statistical tests, e.g., a Pearson's chi-squared test for discrete uniform distributions (Karl Pearson, "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling", Philosophical Magazine, Series 5, 50 (302): 157–175, 1900, doi:10.1080/14786440009463897; Robin Plackett, "Karl Pearson and the Chi-Squared Test", International Statistical Review, 51 (1): 59–72, 1983, doi:10.2307/1402731).
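As a rough illustration of the empirical check mentioned above, the following Python sketch computes Pearson's chi-squared statistic for the bucket counts produced by a hash function. The function name and parameters are assumptions of this sketch; turning the statistic into a p-value would need a chi-squared distribution table or a statistics library, which is left out here.

```python
from collections import Counter


def chi_squared_statistic(keys, hashfunc, buckets):
    """Pearson's chi-squared statistic of bucket counts against a uniform distribution."""
    counts = Counter(hashfunc(key) % buckets for key in keys)
    expected = len(keys) / buckets          # expected entries per bucket if uniform
    return sum(
        (counts.get(b, 0) - expected) ** 2 / expected
        for b in range(buckets)
    )
```

For a hash function that behaves uniformly, the statistic is expected to be close to the number of buckets minus one (the degrees of freedom); a much larger value suggests a skewed distribution.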
The distribution needs to be uniform only for table sizes that occur in the application. In particular, if one uses dynamic resizing with exact doubling and halving of the table size, then the hash function needs to be uniform only when the size is a power of two. Here the index can be computed as some range of bits of the hash function. On the other hand, some hashing algorithms prefer to have the size be a prime number (Thomas Wang, "Prime Double Hash Table", March 1997). The modulus operation may provide some additional mixing; this is especially useful with a poor hash function.

For open addressing schemes, the hash function should also avoid clustering, the mapping of two or more keys to consecutive slots. Such clustering may cause the lookup cost to skyrocket, even if the load factor is low and collisions are infrequent. The popular multiplicative hash is claimed to have particularly poor clustering behavior.

Cryptographic hash functions are believed to provide good hash functions for any table size, either by modulo reduction or by bit masking [citation needed]. They may also be appropriate if there is a risk of malicious users trying to sabotage a network service by submitting requests designed to generate a large number of collisions in the server's hash tables. However, the risk of sabotage can also be avoided by cheaper methods (such as applying a secret salt to the data, or using a universal hash function). A drawback of cryptographic hashing functions is that they are often slower to compute, which means that in cases where the uniformity for any size is not necessary, a non-cryptographic hashing function might be preferable [citation needed].

### Perfect hash function

If all keys are known ahead of time, a perfect hash function can be used to create a perfect hash table that has no collisions. If minimal perfect hashing is used, every location in the hash table can be used as well.

Perfect hashing allows for constant time lookups in all cases. This is in contrast to most chaining and open addressing methods, where the time for lookup is low on average, but may be very large, O(n), for instance when all the keys hash to a few values.
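One simple (and not especially efficient) way to obtain a perfect hash function for a small, fixed key set is to try seeds of a seeded hash until no two keys share a slot. The sketch below assumes a salted BLAKE2b digest from Python's hashlib as the seeded hash; slot_for and find_perfect_seed are hypothetical helper names, and practical perfect-hash generators use more refined constructions.

```python
import hashlib


def slot_for(key: str, seed: int, table_size: int) -> int:
    """Slot of a key under the seeded hash (salted BLAKE2b, reduced by modulo)."""
    digest = hashlib.blake2b(
        key.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % table_size


def find_perfect_seed(keys, table_size, max_seed=10_000):
    """Return the first seed that maps every key to a distinct slot, or None."""
    for seed in range(max_seed):
        slots = {slot_for(key, seed, table_size) for key in keys}
        if len(slots) == len(keys):          # no two keys collided
            return seed
    return None
```

When table_size equals the number of keys, a successful seed gives a minimal perfect hash: every slot is used exactly once.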

## Key statistics

A critical statistic for a hash table is the load factor, defined as

$\text{load factor} = \frac{n}{k}$

where
• n is the number of entries occupied in the hash table.
• k is the number of buckets.

As the load factor grows larger, the hash table becomes slower, and it may even fail to work (depending on the method used). The expected constant time property of a hash table assumes that the load factor is kept below some bound. For a fixed number of buckets, the time for a lookup grows with the number of entries, and therefore the desired constant time is not achieved. In some implementations, the solution is to automatically grow (usually, double) the size of the table when the load factor bound is reached, which forces all entries to be re-hashed. As a real-world example, the default load factor for a HashMap in Java 10 is 0.75, which "offers a good trade-off between time and space costs" (Javadoc for HashMap in Java 10).

Second to the load factor, one can examine the variance of the number of entries per bucket. For example, two tables both have 1,000 entries and 1,000 buckets; one has exactly one entry in each bucket, the other has all entries in the same bucket. Clearly the hashing is not working in the second one.

A low load factor is not especially beneficial. As the load factor approaches 0, the proportion of unused areas in the hash table increases, but there is not necessarily any reduction in search cost. This results in wasted memory.
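The two statistics above can be computed directly from the number of entries in each bucket. The following minimal Python sketch assumes the table is summarized as a list of per-bucket entry counts; the function names are illustrative only.

```python
from statistics import pvariance


def load_factor(bucket_sizes):
    """n / k: total number of entries divided by the number of buckets."""
    return sum(bucket_sizes) / len(bucket_sizes)


def bucket_variance(bucket_sizes):
    """Population variance of the number of entries per bucket."""
    return pvariance(bucket_sizes)
```

Applied to the example of 1,000 entries in 1,000 buckets, both tables have a load factor of 1.0, but the evenly filled table has variance 0 while the table with everything in one bucket has variance 999.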

## Collision resolution

Hash collisions are practically unavoidable when hashing a random subset of a large set of possible keys. For example, if 2,450 keys are hashed into a million buckets, even with a perfectly uniform random distribution, according to the birthday problem there is approximately a 95% chance of at least two of the keys being hashed to the same slot.

Therefore, almost all hash table implementations have some collision resolution strategy to handle such events. Some common strategies are described below. All these methods require that the keys (or pointers to them) be stored in the table, together with the associated values.
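The 95% figure can be checked numerically. The sketch below assumes a perfectly uniform hash and computes the probability that at least two keys share a bucket; collision_probability is a name chosen for this example.

```python
import math


def collision_probability(keys: int, buckets: int) -> float:
    """Probability that at least two of `keys` uniformly hashed keys share a bucket."""
    # Work with the log of the probability that all keys land in distinct buckets,
    # which avoids underflow when multiplying many factors close to 1.
    log_all_distinct = sum(math.log1p(-i / buckets) for i in range(keys))
    return 1.0 - math.exp(log_all_distinct)
```

For 2,450 keys and 1,000,000 buckets the result is close to 0.95; the classic case of 23 people and 365 birthdays gives a value just above 0.5.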

### Separate chaining

Figure: Hash collision resolved by separate chaining.

In the method known as separate chaining, each bucket is independent, and has some sort of list of entries with the same index. The time for hash table operations is the time to find the bucket (which is constant) plus the time for the list operation.

In a good hash table, each bucket has zero or one entries, and sometimes two or three, but rarely more than that. Therefore, structures that are efficient in time and space for these cases are preferred. Structures that are efficient for a fairly large number of entries per bucket are not needed or desirable. If these cases happen often, the hashing function needs to be fixed [citation needed].
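A minimal sketch of separate chaining in Python is shown below, assuming Python lists as the per-bucket chains and the built-in hash() as the hash function. The class name and its methods are illustrative, and resizing is deliberately left out.

```python
class ChainedHashTable:
    """Fixed-size hash table that resolves collisions by separate chaining."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]   # one chain per bucket

    def _bucket(self, key):
        # Finding the bucket takes constant time.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite its value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for k, v in self._bucket(key):   # linear scan of one short chain
            if k == key:
                return v
        return default

    def remove(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return True
        return False
```

With 8 buckets, the integer keys 1, 9 and 17 all land in bucket 1, so that bucket's chain holds three entries while lookups of other keys are unaffected.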

#### Separate chaining with list head cells

Figure: Hash collision by separate chaining with head records in the bucket array.

Some chaining implementations store the first record of each chain in the slot array itself (Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein, Introduction to Algorithms, 2nd ed., MIT Press and McGraw-Hill, 2001, ISBN 978-0-262-53196-2, Chapter 11: Hash Tables, pp. 221–252). The number of pointer traversals is decreased by one for most cases. The purpose is to increase cache efficiency of hash table access.

The disadvantage is that an empty bucket takes the same space as a bucket with one entry. To save space, such hash tables often have about as many slots as stored entries, meaning that many slots have two or more entries [citation needed].

#### Separate chaining with other structures

Instead of a list, one can use any other data structure that supports the required operations. For example, by using a self-balancing binary search tree, the theoretical worst-case time of common hash table operations (insertion, deletion, lookup) can be brought down to O(log n) rather than O(n). However, this introduces extra complexity into the implementation, and may cause even worse performance for smaller hash tables, where the time spent inserting into and balancing the tree is greater than the time needed to perform a linear search on all of the elements of a list (Mark Probst, "Linear vs Binary Search", 2010-04-30). A real-world example of a hash table that uses a self-balancing binary search tree for buckets is the HashMap class in Java version 8 ("How does a HashMap work in JAVA", coding-geek.com).

The variant called array hash table uses a dynamic array to store all the entries that hash to the same slot (Nikolas Askitis and Justin Zobel, "Cache-conscious Collision Resolution in String Hash Tables", Proceedings of the 12th International Conference, String Processing and Information Retrieval (SPIRE 2005), vol. 3772/2005, October 2005, pp. 91–102, ISBN 978-3-540-29740-6, doi:10.1007/11575832_11; Nikolas Askitis and Ranjan Sinha, "Engineering scalable, cache and space efficient tries for strings", The VLDB Journal, 17 (5): 633–660, 2010, ISSN 1066-8888, doi:10.1007/s00778-010-0183-9; Nikolas Askitis, "Fast and Compact Hash Tables for Integer Keys", Proceedings of the 32nd Australasian Computer Science Conference (ACSC 2009), vol. 91, 2009, pp. 113–122, ISBN 978-1-920682-72-9).
Each newly inserted entry gets appended to the end of the dynamic array that is assigned to the slot. The dynamic array is resized in an exact-fit manner, meaning it is grown only by as many bytes as needed. Alternative techniques such as growing the array by block sizes or pages were found to improve insertion performance, but at a cost in space. This variation makes more efficient use of CPU caching and the translation lookaside buffer (TLB), because slot entries are stored in sequential memory positions. It also dispenses with the next pointers that are required by linked lists, which saves space. Despite frequent array resizing, space overheads incurred by the operating system such as memory fragmentation were found to be small [citation needed].

An elaboration on this approach is the so-called dynamic perfect hashing (Erik Demaine and Jeff Lind, 6.897: Advanced Data Structures, MIT Computer Science and Artificial Intelligence Laboratory, Spring 2003), where a bucket that contains k entries is organized as a perfect hash table with k² slots. While it uses more memory (n² slots for n entries in the worst case, and n × k slots in the average case), this variant has guaranteed constant worst-case lookup time, and low amortized time for insertion.

It is also possible to use a fusion tree for each bucket, achieving constant time for all operations with high probability (Dan E. Willard, "Examining computational geometry, van Emde Boas trees, and hashing from the perspective of the fusion tree", SIAM Journal on Computing, 29 (3): 1030–1049, 2000, doi:10.1137/S0097539797322425, MR 1740562).

### Open addressing

Figure: Hash collision resolved by open addressing with linear probing (interval=1). Note that "Ted Baker" has a unique hash, but nevertheless collided with "Sandra Dee", which had previously collided with "John Smith".

In another strategy, called open addressing, all entry records are stored in the bucket array itself. When a new entry has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some probe sequence, until an unoccupied slot is found. When searching for an entry, the buckets are scanned in the same sequence, until either the target record is found, or an unused array slot is found, which indicates that there is no such key in the table (Aaron M. Tenenbaum, Yedidyah Langsam, and Moshe J. Augenstein, Data Structures Using C, Prentice Hall, 1990, ISBN 978-0-13-199746-2, pp. 456–461, 472).
The name "open addressing" refers to the fact that the location ("address") of the item is not determined by its hash value. (This method is also called closed hashing; it should not be confused with "open hashing" or "closed addressing" that usually mean separate chaining.)
Well-known probe sequences include:
• Linear probing, in which the interval between probes is fixed (usually 1)
• Quadratic probing, in which the interval between probes is increased by adding the successive outputs of a quadratic polynomial to the starting value given by the original hash computation
• Double hashing, in which the interval between probes is computed by a second hash function
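As an illustration of the first probe sequence in the list, here is a minimal Python sketch of open addressing with linear probing (interval 1). The class and method names are assumptions of this sketch; deletion and resizing are omitted, so the table simply raises an error when it is full.

```python
class LinearProbingTable:
    """Open-addressing hash table with linear probing (interval 1)."""

    _EMPTY = object()                    # sentinel marking an unused slot

    def __init__(self, capacity=8):
        self.keys = [self._EMPTY] * capacity
        self.values = [None] * capacity

    def _probe(self, key):
        """Yield slot indices, starting at the hashed-to slot and wrapping around."""
        capacity = len(self.keys)
        start = hash(key) % capacity
        for step in range(capacity):
            yield (start + step) % capacity

    def put(self, key, value):
        for i in self._probe(key):
            if self.keys[i] is self._EMPTY:      # unoccupied slot found: insert here
                self.keys[i] = key
                self.values[i] = value
                return
            if self.keys[i] == key:              # key already stored: overwrite
                self.values[i] = value
                return
        raise RuntimeError("table is full")

    def get(self, key, default=None):
        for i in self._probe(key):
            if self.keys[i] is self._EMPTY:      # unused slot: key is not in the table
                return default
            if self.keys[i] == key:
                return self.values[i]
        return default
```

With 8 slots, inserting the integer keys 1, 9 and 17 fills slots 1, 2 and 3; key 2 then hashes to an occupied slot even though no other key shares its hash, and ends up in slot 4.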

#### Coalesced hashing

A hybrid of chaining and open addressing, coalesced hashing links together chains of nodes within the table itself. Like open addressing, it achieves space usage and (somewhat diminished) cache advantages over chaining. Like chaining, it does not exhibit clustering effects; in fact, the table can be efficiently filled to a high density. Unlike chaining, it cannot have more elements than table slots.

#### Cuckoo hashing

Another alternative open-addressing solution is cuckoo hashing, which ensures constant lookup and deletion time in the worst case, and constant amortized time for insertions (with low probability that the worst case will be encountered). It uses two or more hash functions, which means any key/value pair could be in two or more locations. For lookup, the first hash function is used; if the key/value is not found, then the second hash function is used, and so on.

If a collision happens during insertion, then the key is re-hashed with the second hash function to map it to another bucket. If all hash functions are used and there is still a collision, then the key it collided with is removed to make space for the new key, and the old key is re-hashed with one of the other hash functions, which maps it to another bucket. If that location also results in a collision, then the process repeats until there is no collision or the process traverses all the buckets, at which point the table is resized. By combining multiple hash functions with multiple cells per bucket, very high space utilization can be achieved [citation needed].
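The insertion procedure described above can be sketched as follows. This Python version assumes two tables, two hash functions derived from the built-in hash(), one cell per bucket, and a fixed limit on displacements (max_kicks) in place of detecting a full traversal; all names are illustrative.

```python
class CuckooHashTable:
    """Cuckoo hashing sketch with two hash functions and one cell per bucket."""

    def __init__(self, capacity=11, max_kicks=32):
        self.capacity = capacity
        self.max_kicks = max_kicks
        self.tables = [[None] * capacity, [None] * capacity]

    def _slot(self, which, key):
        # Two hypothetical hash functions, distinguished by the `which` tag.
        return hash((which, key)) % self.capacity

    def get(self, key, default=None):
        for which in (0, 1):                 # a key can only be in one of two slots
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return default

    def put(self, key, value):
        for which in (0, 1):                 # overwrite if the key is already stored
            i = self._slot(which, key)
            entry = self.tables[which][i]
            if entry is not None and entry[0] == key:
                self.tables[which][i] = (key, value)
                return
        entry, victim_table = (key, value), 0
        for _ in range(self.max_kicks):
            for which in (0, 1):             # try both candidate slots first
                i = self._slot(which, entry[0])
                if self.tables[which][i] is None:
                    self.tables[which][i] = entry
                    return
            # Both slots are occupied: evict one occupant and re-place it next round.
            i = self._slot(victim_table, entry[0])
            entry, self.tables[victim_table][i] = self.tables[victim_table][i], entry
            victim_table = 1 - victim_table
        self._grow()                         # too many displacements: resize
        self.put(*entry)

    def _grow(self):
        old = [e for table in self.tables for e in table if e is not None]
        self.capacity = self.capacity * 2 + 1
        self.tables = [[None] * self.capacity, [None] * self.capacity]
        for k, v in old:
            self.put(k, v)
```

A lookup inspects at most two slots regardless of how full the table is, which is where the constant worst-case lookup time comes from; the cost is shifted to insertion, which may displace several entries or trigger a resize.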

#### Hopscotch hashing

Another alternative open-addressing solution is hopscotch hashing (Maurice Herlihy, Nir Shavit, and Moran Tzafrir, "Hopscotch Hashing", DISC '08: Proceedings of the 22nd international symposium on Distributed Computing, Springer-Verlag, Berlin, Heidelberg, 2008, pp. 350–364, CiteSeerX 10.1.1.296.8742), which combines the approaches of cuckoo hashing and linear probing, yet seems in general to avoid their limitations. In particular it works well even when the load factor grows beyond 0.9. The algorithm is well suited for implementing a resizable concurrent hash table.

The hopscotch hashing algorithm works by defining a neighborhood of buckets near the original hashed bucket, where a given entry is always found. Thus, search is limited to the number of entries in this neighborhood, which is logarithmic in the worst case, constant on average, and with proper alignment of the neighborhood typically requires one cache miss. When inserting an entry, one first attempts to add it to a bucket in the neighborhood. However, if all buckets in this neighborhood are occupied, the algorithm traverses buckets in sequence until an open slot (an unoccupied bucket) is found (as in linear probing). At that point, since the empty bucket is outside the neighborhood, items are repeatedly displaced in a sequence of hops. (This is similar to cuckoo hashing, but with the difference that in this case the empty slot is being moved into the neighborhood, instead of items being moved out with the hope of eventually finding an empty slot.) Each hop brings the open slot closer to the original neighborhood, without invalidating the neighborhood property of any of the buckets along the way. In the end, the open slot has been moved into the neighborhood, and the entry being inserted can be added to it [citation needed].

#### Robin Hood hashing

One interesting variation on double-hashing collision resolution is Robin Hood hashing (Pedro Celis, "Robin Hood hashing", Technical Report CS-86-14, Computer Science Department, University of Waterloo, 1986; Emmanuel Goossaert, "Robin Hood hashing", 2013).
The idea is that a new key may displace a key already inserted, if its probe count is larger than that of the key at the current position. The net effect of this is that it reduces worst case search times in the table. This is similar to ordered hash tables (Ole Amble and Don Knuth, "Ordered hash tables", Computer Journal, 17 (2): 135, 1974, doi:10.1093/comjnl/17.2.135), except that the criterion for bumping a key does not depend on a direct relationship between the keys. Since both the worst case and the variation in the number of probes are reduced dramatically, an interesting variation is to probe the table starting at the expected successful probe value and then expand from that position in both directions (Alfredo Viola, "Exact distribution of individual displacements in linear probing hashing", Transactions on Algorithms (TALG), 1 (2): 214–242, October 2005, doi:10.1145/1103963.1103965).
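The displacement rule can be sketched in a few lines of Python. For brevity this sketch uses linear probing rather than double hashing as the underlying probe sequence, stores the probe count next to each entry, and omits deletion and resizing; the class name and layout are assumptions made for illustration.

```python
class RobinHoodTable:
    """Robin Hood hashing sketch on top of linear probing."""

    def __init__(self, capacity=16):
        self.slots = [None] * capacity       # each slot holds (key, value, probe_count)

    def put(self, key, value):
        capacity = len(self.slots)
        entry = (key, value, 0)
        i = hash(key) % capacity
        for _ in range(capacity):
            resident = self.slots[i]
            if resident is None:             # free slot: place the entry being carried
                self.slots[i] = entry
                return
            if resident[0] == entry[0]:      # same key: overwrite, keep its probe count
                self.slots[i] = (entry[0], entry[1], resident[2])
                return
            if resident[2] < entry[2]:       # carried entry has probed further: swap
                self.slots[i], entry = entry, resident
            entry = (entry[0], entry[1], entry[2] + 1)
            i = (i + 1) % capacity
        raise RuntimeError("table is full")

    def get(self, key, default=None):
        capacity = len(self.slots)
        i = hash(key) % capacity
        for distance in range(capacity):
            resident = self.slots[i]
            # A resident with a smaller probe count means the key cannot be further on.
            if resident is None or resident[2] < distance:
                return default
            if resident[0] == key:
                return resident[1]
            i = (i + 1) % capacity
        return default
```

Because entries that have probed far take slots from entries that have probed little, probe counts stay close to each other, and an unsuccessful lookup can stop as soon as it meets a resident with a smaller probe count than its own distance.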
External Robin Hood hashing is an extension of this algorithm where the table is stored in an external file and each table position corresponds to a fixed-sized page or bucket with B records (Pedro Celis, "External Robin Hood Hashing", Technical Report TR246, Computer Science Department, Indiana University, March 1988).

### 2-choice hashing

2-choice hashing employs two different hash functions, h1(x) and h2(x), for the hash table. Both hash functions are used to compute two table locations. When an object is inserted in the table, it is placed in the table location that contains fewer objects (with the default being the h1(x) table location if there is equality in bucket size). 2-choice hashing employs the principle of the power of two choices.
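A minimal Python sketch of this placement rule follows. It assumes chained buckets and lets the caller supply h1 and h2; the defaults derived from the built-in hash() are placeholders chosen for this example, not recommended hash functions.

```python
class TwoChoiceTable:
    """2-choice hashing sketch: insert into the emptier of two candidate buckets."""

    def __init__(self, num_buckets=8, h1=None, h2=None):
        self.buckets = [[] for _ in range(num_buckets)]
        # Hypothetical default hash functions; callers may pass their own.
        self.h1 = h1 or (lambda key: hash((1, key)))
        self.h2 = h2 or (lambda key: hash((2, key)))

    def _locations(self, key):
        n = len(self.buckets)
        return self.h1(key) % n, self.h2(key) % n

    def get(self, key, default=None):
        for i in self._locations(key):       # the key can only be in these two buckets
            for k, v in self.buckets[i]:
                if k == key:
                    return v
        return default

    def put(self, key, value):
        i1, i2 = self._locations(key)
        for i in (i1, i2):                   # overwrite if the key is already stored
            for j, (k, _) in enumerate(self.buckets[i]):
                if k == key:
                    self.buckets[i][j] = (key, value)
                    return
        # Choose the bucket with fewer entries; ties go to the h1 location.
        target = i1 if len(self.buckets[i1]) <= len(self.buckets[i2]) else i2
        self.buckets[target].append((key, value))
```

Lookups have to check both candidate buckets, which is the price paid for the more even bucket sizes.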

## Dynamic resizing

When an insert is made such that the number of entries in a hash table exceeds the product of the load factor and the current capacity, the hash table will need to be rehashed. Rehashing includes increasing the size of the underlying data structure and mapping existing items to new bucket locations. In some implementations, if the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.

To limit the proportion of memory wasted due to empty buckets, some implementations also shrink the size of the table (followed by a rehash) when items are deleted. In terms of space–time tradeoffs, this operation is similar to the deallocation in dynamic arrays.

### Resizing by copying all entries

A common approach is to automatically trigger a complete resizing when the load factor exceeds some threshold r_max. Then a new larger table is allocated, each entry is removed from the old table, and inserted into the new table. When all entries have been removed from the old table then the old table is returned to the free storage pool. Likewise, when the load factor falls below a second threshold r_min, all entries are moved to a new smaller table.

For hash tables that shrink and grow frequently, the resizing downward can be skipped entirely. In this case, the table size is proportional to the maximum number of entries that ever were in the hash table at one time, rather than the current number. The disadvantage is that memory usage will be higher, and thus cache behavior may be worse. For best control, a "shrink-to-fit" operation can be provided that does this only on request.

If the table size increases or decreases by a fixed percentage at each expansion, the total cost of these resizings, amortized over all insert and delete operations, is still a constant, independent of the number of entries n and of the number m of operations performed.

For example, consider a table that was created with the minimum possible size and is doubled each time the load factor exceeds some threshold. If m elements are inserted into that table, the total number of extra re-insertions that occur in all dynamic resizings of the table is at most m − 1. In other words, dynamic resizing roughly doubles the cost of each insert or delete operation.
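The amortized cost of doubling can be explored with a small simulation. The Python sketch below assumes one specific policy that the text does not fix: the table starts with capacity 1 and doubles only when it is completely full. It counts how many entries are re-inserted in total while m elements are added.

```python
def count_reinsertions(m: int) -> int:
    """Total entries moved by resizes while inserting m elements into a doubling table."""
    capacity, size, moved = 1, 0, 0
    for _ in range(m):
        if size == capacity:     # table is full: double it and re-insert every entry
            moved += size
            capacity *= 2
        size += 1
    return moved
```

Under this policy the count is exactly m − 1 when m is a power of two (for example, 7 re-insertions for 8 insertions), and it stays proportional to m for other values, which is what keeps the amortized cost per insertion constant.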

### Alternatives to all-at-once rehashing

Some hash table implementations, notably in real-time systems, cannot pay the price of enlarging the hash table all at once, because it may interrupt time-critical operations. If one cannot avoid dynamic resizing, a solution is to perform the resizing gradually.

Disk-based hash tables almost always use some alternative to all-at-once rehashing, since the cost of rebuilding the entire table on disk would be too high.

#### Incremental resizing

One alternative to enlarging the table all at once is to perform the rehashing gradually:
• During the resize, allocate the new hash table, but keep the old table unchanged.
• In each lookup or delete operation, check both tables.
• Perform insertion operations only in the new table.
• At each insertion also move r elements from the old table to the new table.
• When all elements are removed from the old table, deallocate it.
To ensure that the old table is completely copied over before the new table itself needs to be enlarged, it is necessary to increase the size of the table by a factor of at least (r + 1)/r during resizing.
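The steps listed above can be sketched as follows. This Python version assumes chained buckets, a resize that starts when the number of entries reaches the bucket count, and a growth factor of 2 (which satisfies the (r + 1)/r bound for any r ≥ 1); the class and attribute names are illustrative.

```python
class IncrementalResizeTable:
    """Chained hash table that moves r old entries to the new table per insertion."""

    def __init__(self, capacity=4, r=2):
        self.r = r
        self.new = [[] for _ in range(capacity)]
        self.old = None          # old table; present only while a resize is in progress
        self.cursor = 0          # index of the next old bucket to drain
        self.size = 0

    @staticmethod
    def _find(table, key):
        bucket = table[hash(key) % len(table)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                return bucket, i
        return bucket, -1

    def _tables(self):
        return [t for t in (self.new, self.old) if t is not None]

    def get(self, key, default=None):
        for table in self._tables():             # check both tables
            bucket, i = self._find(table, key)
            if i >= 0:
                return bucket[i][1]
        return default

    def delete(self, key):
        for table in self._tables():             # check both tables
            bucket, i = self._find(table, key)
            if i >= 0:
                del bucket[i]
                self.size -= 1
                return True
        return False

    def put(self, key, value):
        self.delete(key)                         # drop any existing copy of the key
        if self.old is None and self.size >= len(self.new):
            # Start a resize: keep the old table, allocate a table twice as large.
            self.old, self.cursor = self.new, 0
            self.new = [[] for _ in range(2 * len(self.old))]
        self.new[hash(key) % len(self.new)].append((key, value))   # insert in new only
        self.size += 1
        self._migrate(self.r)

    def _migrate(self, budget):
        while self.old is not None and budget > 0:
            if self.cursor == len(self.old):     # old table fully drained: drop it
                self.old = None
            elif self.old[self.cursor]:
                k, v = self.old[self.cursor].pop()
                self.new[hash(k) % len(self.new)].append((k, v))
                budget -= 1
            else:
                self.cursor += 1
```

No single insertion moves more than r entries, so the cost of a resize is spread over the following insertions instead of being paid all at once.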

#### Monotonic keys

If it is known that key values will always increase (or decrease) monotonically, then a variation of consistent hashing can be achieved by keeping a list of the single most recent key value at each hash table resize operation. Upon lookup, keys that fall in the ranges defined by these list entries are directed to the appropriate hash function (and indeed hash table), both of which can be different for each range. Since it is common to grow the overall number of entries by doubling, there will only be O(log(N)) ranges to check, and binary search time for the redirection would be O(log(log(N))). As with consistent hashing, this approach guarantees that any key's hash, once issued, will never change, even when the hash table is later grown.

#### Linear hashing

Linear hashing (Witold Litwin, "Linear hashing: A new tool for file and table addressing", Proc. 6th Conference on Very Large Databases, 1980, pp. 212–223) is a hash table algorithm that permits incremental hash table expansion. It is implemented using a single hash table, but with two possible lookup functions.

#### Hashing for distributed hash tables

Another way to decrease the cost of table resizing is to choose a hash function in such a way that the hashes of most values do not change when the table is resized. Such hash functions are prevalent in disk-based and distributed hash tables, where rehashing is prohibitively costly.

The problem of designing a hash such that most values do not change when the table is resized is known as the distributed hash table problem.

The four most popular approaches are rendezvous hashing, consistent hashing, the content addressable network algorithm, and Kademlia distance.
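Of the approaches named above, rendezvous hashing is perhaps the shortest to sketch. In the Python example below each key is assigned to the node with the highest score, where the score is a hash of the node name combined with the key; the use of SHA-256 and the "node:key" string format are assumptions of this sketch.

```python
import hashlib


def rendezvous_node(key: str, nodes):
    """Return the node with the highest hash score for this key."""
    def score(node):
        digest = hashlib.sha256(f"{node}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    return max(nodes, key=score)
```

When a node is removed, only the keys that were assigned to that node move elsewhere; every other key keeps its highest-scoring node, so its placement does not change.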

## Performance analysis

In the simplest model, the hash function is completely unspecified and the table does not resize. With an ideal hash function, a table of size k with open addressing has no collisions and holds up to k elements with a single comparison for successful lookup, while a table of size k with chaining and n keys has the minimum $\max(0, n-k)$ collisions and $\Theta(1 + \frac{n}{k})$ comparisons for lookup. With the worst possible hash function, every insertion causes a collision, and hash tables degenerate to linear search, with $\Theta(n)$ amortized comparisons per insertion and up to n comparisons for a successful lookup.

Adding rehashing to this model is straightforward. As in a dynamic array, geometric resizing by a factor of b implies that only $\frac{n}{b^i}$ keys are inserted i or more times, so that the total number of insertions is bounded above by $\frac{bn}{b-1}$, which is $\Theta(n)$. By using rehashing to maintain $n$ …

- content above as imported from Wikipedia
- time: 6:34pm EDT - Sun, Sep 22 2019