Software Product Development | Software Testing Tutorial

Showing posts with label Hash. Show all posts

Tuesday, January 5, 2010

Open Addressing - a way to deal with collisions in hashing

Open addressing, or closed hashing, is a method of collision resolution in hash tables. With this method a hash collision is resolved by probing, or searching through alternate locations in the array (the probe sequence) until either the target record is found, or an unused array slot is found, which indicates that there is no such key in the table.
In this technique all elements are stored in the hash table itself. That is, each table entry contains either an element or NIL. When searching for element (or empty slot), we systematically examine slots until we find an element (or empty slot). There are no lists and no elements stored outside the table. This implies that table can completely "fill up"; the load factor α can never exceed 1.
Advantage of this technique is that it avoids pointers (pointers need space too). Instead of chasing pointers, we compute the sequence of slots to be examined.
To perform insertion, we successively examine or probe, the hash table until we find an empty slot. The sequence of slots probed "depends upon the key being inserted." To determine which slots to probe, the hash function includes the probe number as a second input.
Well known probe sequences include:
* Linear probing in which the interval between probes is fixed (usually 1).
* Quadratic probing in which the interval between probes increases by some constant (usually 1) after each probe.
* Double hashing in which the interval between probes is computed by another hash function.

Drawbacks :
- Open addressing schemes is that the number of stored entries cannot exceed the number of slots in the bucket array.
- Open addressing schemes also put more stringent requirements on the hash function.
- Open addressing only saves memory if the entries are small.
- Normal open addressing is a poor choice for large elements, since these elements fill entire cache lines, and a large amount of space is wasted on large empty table slots.

Separate Chaining

A scheme in which each position in the hash table has a list to handle collisions. Each position may be just a link to the list (direct chaining) or may be an item and a link, essentially, the head of a list. In the latter, one item is in the table, and other colliding items are in the list.
Also known as external chaining.

Separate chaining with list heads :
Some chaining implementations store the first record of each chain in the slot array itself. The purpose is to increase cache efficiency of hash table access. In order to save memory space, such hash tables are often designed to have about as many slots as stored entries, meaning that many slots will have two or more entries.

Separate chaining with other structures :
Instead of a list, one can use any other data structure that supports the required operations. By using a self-balancing tree, for example, the theoretical worst-case time of a hash table can be brought down to O(log n) rather than O(n). However, this approach is worth the trouble and extra memory cost only if long delays must be avoided at all costs or if one expects to have many entries hashed to the same slot.
The variant called array hashing uses a dynamic array to store all the entries that hash to the same bucket.Each inserted entry gets appended to the end of the dynamic array that is assigned to the hashed slot.

Monday, January 4, 2010

Overview of Collision Resolution in Hashing

Collisions are practically unavoidable when hashing a random subset of a large set of possible keys.Therefore, most hash table implementations have some collision resolution strategy to handle such events.

- Load factor : The performance of most collision resolution methods does not depend directly on the number n of stored entries, but depends strongly on the table's load factor, the ratio n/s between n and the size s of its bucket array. With a good hash function, the average look up cost is nearly constant as the load factor increases from 0 up to 0.7 or so. Beyond that point, the probability of collisions and the cost of handling them increases.
- Separate chaining : Each slot of the bucket array is a pointer to a linked list that contains the key-value pairs that hashed to the same location. Look-up requires scanning the list for an entry with the given key. Insertion requires appending a new entry record to either end of the list in the hashed slot. Deletion requires searching the list and removing the element.
* Chained hash tables with linked lists are popular because they require only basic data structures with simple algorithms, and can use simple hash functions that would be unsuitable for other methods.
* Chained hash tables remain effective even when the number of entries n is much higher than the number of slots.
* For separate-chaining, the worst-case scenario is when all entries were inserted into the same bucket, in which case the hash table is ineffective and the cost is that of searching the bucket data structure.
* Chained hash tables also inherit the disadvantages of linked lists.

Saturday, January 2, 2010

Properties of Hash Functions

A hash function is a function that takes a relatively arbitrary amount of input and
produces an output of fixed size. The properties of some hash functions can be
used to greatly increase the security of a system administrator’s network; when
implemented correctly they can verify the integrity and source of a file, network
packet, or any arbitrary data.
Good hash functions, in the original sense of the term, are usually required to satisfy certain properties listed below :

- A hash is a one-way function.
- The cost of computing a hash function must be small enough to make a hashing-based solution advantageous with regard to alternative approaches.
- A hash procedure must be deterministic — meaning that for a given input value it must always generate the same hash value. In other words, it must be a function of the hashed data, in the mathematical sense of the term.
- A good hash function should map the expected inputs as evenly as possible over its output range i.e. every hash value in the output range should be generated with roughly the same probability.
- The range of hash values may be different for each run of the program, or may change along the same run. In those situations, one needs a hash function which takes two parameters — the input data z, and the number n of allowed hash values.
- A hash function that is used to search for similar (as opposed to equivalent) data must be as continuous as possible; two inputs that differ by a little should be mapped to equal or nearly equal hash values.

Software Product Development | Software Testing Tutorial | Software Process

Tuesday, January 5, 2010

Open Addressing - a way to deal with collisions in hashing

Separate Chaining

Monday, January 4, 2010

Overview of Collision Resolution in Hashing

Saturday, January 2, 2010

Properties of Hash Functions

Facebook activity

Feed

Search Site

Popular Posts

Blog Archive

Contributors

Followers

Pageviews