Software Product Development | Software Testing Tutorial

Showing posts with label Hashing. Show all posts

Tuesday, January 5, 2010

Open Addressing - a way to deal with collisions in hashing

Open addressing, or closed hashing, is a method of collision resolution in hash tables. With this method a hash collision is resolved by probing, or searching through alternate locations in the array (the probe sequence) until either the target record is found, or an unused array slot is found, which indicates that there is no such key in the table.
In this technique all elements are stored in the hash table itself. That is, each table entry contains either an element or NIL. When searching for element (or empty slot), we systematically examine slots until we find an element (or empty slot). There are no lists and no elements stored outside the table. This implies that table can completely "fill up"; the load factor α can never exceed 1.
Advantage of this technique is that it avoids pointers (pointers need space too). Instead of chasing pointers, we compute the sequence of slots to be examined.
To perform insertion, we successively examine or probe, the hash table until we find an empty slot. The sequence of slots probed "depends upon the key being inserted." To determine which slots to probe, the hash function includes the probe number as a second input.
Well known probe sequences include:
* Linear probing in which the interval between probes is fixed (usually 1).
* Quadratic probing in which the interval between probes increases by some constant (usually 1) after each probe.
* Double hashing in which the interval between probes is computed by another hash function.

Drawbacks :
- Open addressing schemes is that the number of stored entries cannot exceed the number of slots in the bucket array.
- Open addressing schemes also put more stringent requirements on the hash function.
- Open addressing only saves memory if the entries are small.
- Normal open addressing is a poor choice for large elements, since these elements fill entire cache lines, and a large amount of space is wasted on large empty table slots.

Separate Chaining

A scheme in which each position in the hash table has a list to handle collisions. Each position may be just a link to the list (direct chaining) or may be an item and a link, essentially, the head of a list. In the latter, one item is in the table, and other colliding items are in the list.
Also known as external chaining.

Separate chaining with list heads :
Some chaining implementations store the first record of each chain in the slot array itself. The purpose is to increase cache efficiency of hash table access. In order to save memory space, such hash tables are often designed to have about as many slots as stored entries, meaning that many slots will have two or more entries.

Separate chaining with other structures :
Instead of a list, one can use any other data structure that supports the required operations. By using a self-balancing tree, for example, the theoretical worst-case time of a hash table can be brought down to O(log n) rather than O(n). However, this approach is worth the trouble and extra memory cost only if long delays must be avoided at all costs or if one expects to have many entries hashed to the same slot.
The variant called array hashing uses a dynamic array to store all the entries that hash to the same bucket.Each inserted entry gets appended to the end of the dynamic array that is assigned to the hashed slot.

Monday, January 4, 2010

Overview of Collision Resolution in Hashing

Collisions are practically unavoidable when hashing a random subset of a large set of possible keys.Therefore, most hash table implementations have some collision resolution strategy to handle such events.

- Load factor : The performance of most collision resolution methods does not depend directly on the number n of stored entries, but depends strongly on the table's load factor, the ratio n/s between n and the size s of its bucket array. With a good hash function, the average look up cost is nearly constant as the load factor increases from 0 up to 0.7 or so. Beyond that point, the probability of collisions and the cost of handling them increases.
- Separate chaining : Each slot of the bucket array is a pointer to a linked list that contains the key-value pairs that hashed to the same location. Look-up requires scanning the list for an entry with the given key. Insertion requires appending a new entry record to either end of the list in the hashed slot. Deletion requires searching the list and removing the element.
* Chained hash tables with linked lists are popular because they require only basic data structures with simple algorithms, and can use simple hash functions that would be unsuitable for other methods.
* Chained hash tables remain effective even when the number of entries n is much higher than the number of slots.
* For separate-chaining, the worst-case scenario is when all entries were inserted into the same bucket, in which case the hash table is ineffective and the cost is that of searching the bucket data structure.
* Chained hash tables also inherit the disadvantages of linked lists.

Saturday, January 2, 2010

Geometric Hashing Technique

Geometric Hashing : It is a method for efficiently finding two-dimensional objects represented by discrete points that have undergone an affine transformation. The models are assumed to be known in advance, and thus allow pre-processing, as follows:

- Pick a reference frame.
- Compute the 3D orthonormal basis associated with this reference frame and its shape signature (e.g. triangle sides length).
- Compute the coordinates of all the other points (in a pre-specified neighborhood) in this reference frame.
- Use each coordinate as an address to the hash (look-up) table. Store the entry [protein id, ref. frame, shape sign., point] at the hash table address.
- Repeat above steps for each model reference frame (non-collinear triplet of model points).

Recognition stage of the algorithm uses the hash table, prepared in the pre-processing step. The matching of a target object is accomplished as follows:

- For each reference frame of the target:
- Compute the 3D orthonormal basis and the shape signature associated with it.
- Compute the coordinates of all other points in the current reference frame.
- Use each coordinate to access the hash-table and retrieve all the records [protein id, ref. frame, shape sign.,point].
- For records with matching shape signature "vote" for the pair [protein, ref. frame].
- Compute the transformations of the ``high scoring'' hypotheses. For each hypothesis one can also register the pairs of matching points. This match list along with the transformation comprise a seed match.

Cryptographic Hash Function

Hashing as a tool to associate one set or bulk of data with an identifier has many different forms of application in the real-world. Cyptographic Hashing is used for data/user verification and authentication. A strong cryptographic hash function has the property of being very difficult to reverse the result of the hash and hence reproduce the original piece of data. Cryptographic hash functions are used to hash user's passwords and have the hash of the passwords stored on a system rather than having the password itself stored. Cryptographic hash functions are also seen as irreversible compression functions, being able to represent large quantities of data with a signal ID, they are useful in seeing whether or not the data has been tampered with, and can also be used as data one signs in order to prove authenticity of a document via other cryptographic means.
The ideal cryptographic hash function has four main properties:
* it is easy to compute the hash value for any given message.
* it is infeasible to find a message that has a given hash.
* it is infeasible to modify a message without changing its hash.
* it is infeasible to find two different messages with the same hash.
A cryptographic hash function must be able to withstand all known types of cryptanalytic attack. As a minimum, it must have the following properties:
* Preimage resistance: Given a hash h it should be hard to find any message m such that h = hash(m). This concept is related to that of one way function. Functions that lack this property are vulnerable to preimage attacks.
* Second preimage resistance : Given an input m1, it should be hard to find another input, m2 (not equal to m1) such that hash(m1) = hash(m2). This property is sometimes referred to as weak collision resistance. Functions that lack this property are vulnerable to second preimage attacks.
* Collision resistance : It should be hard to find two different messages m1 and m2 such that hash(m1) = hash(m2). Such a pair is called a (cryptographic) hash collision, and this property is sometimes referred to as strong collision resistance. It requires a hash value at least twice as long as what is required for preimage-resistance, otherwise collisions may be found by a birthday attack.

Properties of Hash Functions

A hash function is a function that takes a relatively arbitrary amount of input and
produces an output of fixed size. The properties of some hash functions can be
used to greatly increase the security of a system administrator’s network; when
implemented correctly they can verify the integrity and source of a file, network
packet, or any arbitrary data.
Good hash functions, in the original sense of the term, are usually required to satisfy certain properties listed below :

- A hash is a one-way function.
- The cost of computing a hash function must be small enough to make a hashing-based solution advantageous with regard to alternative approaches.
- A hash procedure must be deterministic — meaning that for a given input value it must always generate the same hash value. In other words, it must be a function of the hashed data, in the mathematical sense of the term.
- A good hash function should map the expected inputs as evenly as possible over its output range i.e. every hash value in the output range should be generated with roughly the same probability.
- The range of hash values may be different for each run of the program, or may change along the same run. In those situations, one needs a hash function which takes two parameters — the input data z, and the number n of allowed hash values.
- A hash function that is used to search for similar (as opposed to equivalent) data must be as continuous as possible; two inputs that differ by a little should be mapped to equal or nearly equal hash values.

Sunday, December 27, 2009

Introduction to Hashing

Hashing is a method to store data in an array so that storing, searching, inserting and deleting data is fast. For this every record needs an unique key.The basic idea is not to search for the correct position of a record with comparisons but to compute the position within the array. The function that returns the position is called the 'hash function' and the array is called a 'hash table'.

Main idea: Use an array of size m and the key k as address of the array.
A hash function h is used to map keys to [0::m-1].
Key technical issues :
- What is a good h? A good function avoids (but does not eliminate)collisions, and is quick to compute.
- How do we resolving collisions? Retrieval time is a function of collisions.
- What if we run out of space in the table?
- Can we rearranging keys upon an insertion?

A hash function is any well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array. The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes.A hash function may map two or more keys to the same hash value.Hash functions are related to (and often confused with) check sums,check digits, fingerprints, randomization functions, error correcting codes, and cryptographic hash functions. Although these concepts overlap to some extent, each has its own uses and requirements and is designed and optimized differently.

A hash table or hash map is a data structure that uses a hash function to efficiently map certain identifiers or keys (e.g., person names) to associated values (e.g., their telephone numbers).n general, a hashing function may map several different keys to the same index. Therefore, each slot of a hash table is associated with (implicitly or explicitly) a set of records, rather than a single record. For this reason, each slot of a hash table is often called a bucket, and hash values are also called bucket indices.

Software Product Development | Software Testing Tutorial | Software Process

Tuesday, January 5, 2010

Open Addressing - a way to deal with collisions in hashing

Separate Chaining

Monday, January 4, 2010

Overview of Collision Resolution in Hashing

Saturday, January 2, 2010

Geometric Hashing Technique

Cryptographic Hash Function

Properties of Hash Functions

Sunday, December 27, 2009

Introduction to Hashing

Facebook activity

Feed

Search Site

Popular Posts

Blog Archive

Contributors

Followers

Pageviews