Subscribe by Email

Friday, January 29, 2010

Bad Block Recovery - Disk Management

Disks have moving parts and some tolerances, they are prone to failure. Most disks even come from the factory with bad blocks. Depending on the disk and controller in use, these blocks are handled in a variety of ways.

- Simple disks such as disks with IDE controllers, bad blocks are handled manually.
The MS-DOS format command does a logical format, scans the disk to find bad blocks. If format finds a bad block, it writes a special value into the corresponding FAT entry to tell the allocation routines not to use that block.

- More sophisticated disks, such as SCSI disks used in high-end PCs and most workstations, are smarter about bad block recovery. The controller has a list of bad blocks on the disk and this list is initialized during low-level format at factory, and is updated over the life of the disk. Low-level formatting also sets aside spare sectors not visible to operating system. The controller can be told to replace each bad sector logically with one of the spare sectors. This scheme is called sector sparing.

A typical bad sector transaction :
- The operating system tries to read logical block 87.
- The controller calculates the ECC and finds that the sector is bad. It reports this finding to the operating system.
- The next time the system is rebooted, a special command is run to tell the SCSI controller to replace the bad sector with a spare.
- After this, whenever the system requests logical block 87, the request is translated into the replacement sector's address by the controller.

An alternative to sector sparing, some controllers can be instructed to replace a bad block by sector slipping.
The replacement of a bad block generally is not a totally automatic process, because the data in the bad block usually are lost. Thus, whatever file was using that block must be repaired, and that requires manual intervention.

Boot Block - Disk Management

A program at some fixed location on a hard disk, floppy disk or other media, which is loaded when the computer is turned on or rebooted and which controls the next phase of loading the actual operating system. The loading and execution of the boot block is usually controlled by firmware in ROM or PROM.
There is an initial program that is run whenever a computer starts running. This initial program initializes all aspects of the system, from CPU registers to device controllers and the contents of the main memory, and then starts the operating system.
Generally, the bootstrap is stored in read-only-memory (ROM) because ROM needs no initialization, and is at a fixed location that the processor can start executing when powered up or reset. The problem is that changing this bootstrap code requires changing ROM hardware chips. For this reason, most systems store a tiny bootstrap loader program in the boot ROM, whose only job is to bring in a full bootstrap program from disk.
The code in the boot ROM instructs the disk controller to read the boot blocks into memory, and then starts executing the code. The full bootstrap program is more sophisticated than the bootstrap loader in the boot ROM, and is able to load the entire operating system from a non-fixed location on disk, and to start the operating system running.

Thursday, January 28, 2010

High Level Formatting

The second formatting step is high-level formatting. This is the process of creating the disk's logical structures such as the file allocation table and root directory. The high-level format uses the structures created by the low-level format to prepare the disk to hold files using the chosen file system.

Method of formatting a hard disk drive that initializes portions of the hard disk drive and creates important file system areas on the disk. A good example of a high-level format is using the format command in MS-DOS.

A high-level format is commonly done if a user wishes to erase the hard disk drive and reinstall the operating system back onto the hard disk drive. If errors are present on the hard disk drive, or a high-level format is unable to be completed, a low-level format may need to be done first.

For a hard disk, there is an intermediate task that is performed between the two formatting steps: partitioning. For this reason, combined with the incredible complexity of modern hard disks, they are low-level formatted by the manufacturer, and high-level formatting is done by the DOS FORMAT command (or equivalent). Floppy disks require no intermediate step, and due to their relative simplicity, they are both low-level and high-level formatted at the same time by default when you use the FORMAT command.

Disk Formatting and Low level formatting of floppy disks

Disk formatting is the initial part of the process for preparing a hard disk or other storage medium for its first use. The disk formatting includes setting up an empty file system. A disk formatting may setup multiple file systems by formatting partitions for each file system. Disk formatting is also part of a process involving rebuilding an entire disk from scratch.

There are two steps involved in formatting magnetic media such as floppy disks and hard disks.

Low-level formatting of floppy disks : The first step involves the creation of the actual structures on the surface of the media that are used to hold the data. This means recording the tracks and marking the start of each sector on each track. This is called low-level formatting, and sometimes is called "true formatting" since it is actually recording the format that will be used to store information on the disk. Once the floppy disk has been low-level formatted, the locations of the tracks on the disk are fixed in place. Since floppies use a stepper motor to drive the head actuator, the floppy drive must be aligned properly in order to read the tracks on the disk. Sometimes the heads of a particular drive can become out of alignment relative to where they should be; when this happens you may notice that a disk formatted on the misaligned drive will work in that drive but not in others, and vice-versa.
Since floppy disks tend to be put together cheaply these days and many of them are getting rather old, it is generally preferable to always low-level format a disk in the drive you plan to use to write to it.

C-SCAN Scheduling and LOOK Scheduling

Circular SCAN is a variant of SCAN that is designed to provide a more uniform wait time. Like SCAN, C-SCAN moves the head from one end of the disk to the other, servicing requests along the way. When the head reaches the other end, however, it immediately returns to the beginning of the disk, without servicing any requests on the return trip. The C-SCAN scheduling algorithm essentially treats the cylinders as a circular list that wraps around from the final cylinder to the first one.

Both SCAN and C-SCAN move the disk arm across the full width of the disk. In practice, neither of the algorithm is implemented this way. More commonly, the arm goes only as far as the final request in each direction. Then, it reverses the direction immediately, without first going all the way to the end of the disk. These versions of SCAN and C-SCAN are called LOOK and C-LOOK scheduling.

How to select a Disk Scheduling algorithm ?
With any scheduling algorithm, performance depends heavily on the number and types of requests. Suppose that the queue has just one request, then, all the scheduling algorithms are forced to behave the same, because they have no choice except one. They all behave like FCFS scheduling.
The requests for disk service can be greatly influenced by the file allocation method. The location of directories and index blocks is also important. Since every file must be opened to be used, and opening a file requires searching the directory structure, the directories will be accessed frequently.
The disk scheduling algorithm should be written as a separate module so that it can be replaced with a different algorithm if necessary. Either SSTF or LOOK is a reasonable choice for default algorithm.

Wednesday, January 27, 2010

Disk Scheduling Algorithm - SCAN Scheduling

In this algorithm the head always constantly moves from the most inner cylinder to the outer cylinder, then it changes its direction back towards the center. As the head moves, if there is a request for the current disk position it is satisfied. The throughput is better than in FIFO. The SCAN algorithm is more fair that SSTF. This algorithm is named after the behavior of a building elevator, where the elevator continues to travel in its current direction (up or down) until empty, stopping only to let individuals off or to pick up new individuals heading in the same direction.

SCAN Disk Scheduling Algorithm

The arm movement is thus always less than twice the number of total cylinders then, for both versions of the elevator algorithm. The variation has the advantage to have a smaller variance in response time. The algorithm is also relatively simple.
However, the elevator algorithm is not always better than Shortest seek first, which is slightly closer to optimal, but can result in high variance in response time and even in starvation when new requests continually get serviced prior to existing requests.
Anti-starvation techniques can be applied to the shortest seek time first algorithm to guarantee an optimum response time.

Disk Scheduling Algorithm - SSTF Scheduling

Shortest seek first (or shortest seek time first) is a secondary storage scheduling algorithm to determine the motion of the disk's arm and head in servicing read and write requests. This is a direct improvement upon a first-come first-served (FIFO) algorithm. The drive maintains an incoming buffer of requests, and tied with each request is a cylinder number of the request. Lower cylinder numbers indicate that the cylinder is closest to the spindle, and higher numbers indicate the cylinder is further away.

The shortest seek first algorithm determines which request is closest to the current position of the head, and then services that request next. The shortest seek first algorithm has the direct benefit of simplicity and is clearly advantageous in comparison to the FCFS method, in that overall arm movement is reduced, resulting in lower average response time.

However, since the buffer is always getting new requests, these can skew the service time of requests that may be farthest away from the disk head's current location, if the new requests are all close to the current location; in fact, starvation may result, with the faraway requests never being able to make progress.
SSTF scheduling is essentially a form of shortest-job-first (SJF) scheduling, and like SJF scheduling, it may cause starvation of some requests.

Disk Scheduling Algorithm - FCFS Scheduling

Scheduling is a key concept in computer multitasking, multiprocessing operating system and real-time operating system designs. The scheduler is concerned mainly with :
* CPU utilization - to keep the CPU as busy as possible.
* Throughput - number of process that complete their execution per time unit.
* Turnaround - total time between submission of a process and its completion.
* Waiting time - amount of time a process has been waiting in the ready queue.
* Response time - amount of time it takes from when a request was submitted until the first response is produced.
* Fairness - Equal CPU time to each thread.

Scheduling Algorithms :
CPU scheduling deals with the problem of deciding which of the processes in the ready queue is to be allocated the CPU. There are many different CPU scheduling algorithms.

- First Come First Served (FCFS): The disk controller processes the I/O requests in the order in which they arrive, thus moving backwards and forwards across the surface of the disk to get to the next requested location each time. Since no reordering of request takes place the head may move almost randomly across the surface of the disk. This policy aims to minimize response time with little regard for throughput.
Illustration shows total head movement of 640 cylinders.
queue = 98, 183, 37, 122, 14, 124, 65, 67
Head starts at 53.

First Come First Served Disk Scheduling Algorithm

The problem with this schedule is illustrated by the wide swing from 122 to 14 and then back to 124. If the requests for cylinders 37 and 14 could be serviced together, before or after the requests at 122 and 124, the total head movement could be decreased substantially, and performance could be improved.

Secondary Storage Structure - Disk Structure

Secondary storage structure helps to :
- Describe the physical structure of secondary and tertiary storage devices and the resulting effects on the uses of the devices.
- Explain the performance characteristics of mass-storage devices.
- Discuss operating-system services provided for mass storage, including RAID and HSM.

Disk provide the bulk of secondary storage for modern computer systems. Disk drives are addressed as large 1-dimensional arrays of logical blocks, where the logical block is the smallest unit of transfer. The 1-dimensional array of logical blocks is mapped onto the sectors of the disk sequentially.
- Sector 0 is the first sector of the first track on the outermost cylinder.
- Mapping proceeds in order through that track, then the rest of the tracks in that cylinder, and then through the rest of the cylinders from outermost to innermost.
- The operating system is responsible for using hardware efficiently — for the disk drives, this means having a fast access time and disk bandwidth.
- Access time has two major components.
* Seek time is the time for the disk arm to move the heads to the cylinder containing the desired sector.
* Rotational latency is the additional time waiting for the disk to rotate the desired sector to the disk head.
- Minimize seek time.
- Seek time seek distance.
- Disk bandwidth is the total number of bytes transferred, divided by the total time between the first request for service and the completion of the last transfer.
- Several algorithms exist to schedule the servicing of disk I/O requests.

Tuesday, January 26, 2010

Enabling a product to work across different locales - what is the need ?

Nowadays, in order to have a successful software product, there is a dire need to ensure that you release the product in different languages. You might argue that your main market is in the United States, and there are enough English speakers the world over to make your product successful. However, consider the following factors -
- Do a quick survey to determine potential users for your products in different countries
- Look at your competitors and see how many different countries they are sold in
- Consider the problem of negotiating deals with hardware sellers to include your software in their hardwares as part of OEM deals (they would want the ability to include the same software in their different country deals)
- The incremental benefit to having multiple language sales is more than the cost involved
- You can seriously cause harm to your image if you don't have a global presence. With an increasing percentage of people in countries being bilingual or from another country of origin (such as the increasing Hispanic population in the US, and the movement of people within the EU), it gets difficult for you to justify yourself as a serious software product if you don't have language versions.

Now, how do you actually go ahead with ensuring that your product can be localized easily and is properly available in different languages ?
Well, you do something called software internationalization; a process that ensures that your software application works in different languages and regions without having to make changes at the time of use. You enable the software during the design time to work with different languages, typically by adding something called language packs that allows the same software to pick up the relevant language packs, and in many cases, enabling the software to change languages easily during use. The engineering work needed to enable this to happen is a lot more complicated, and will be detailed over the next few posts.

Friday, January 22, 2010

Introduction to Swapping

When the physical memory in the system runs out and a process needs to bring a page into memory then the operating system must decide what to do. It must fairly share the physical pages in the system between the processes running in the system, therefore it may need to remove one or more pages from the system to make room for the new page to be brought into memory. How virtual pages are selected for removal from physical memory affects the efficiency of the system.

Swapping is the one of the efficient regular and authentic approach of memory management. It is the process of swapping of higher priority process on the lower priority process.
Advantages of swapping are as follows :-
1. Higher degree of multi-programming.
2. Dynamic relocation.
3. Greater memory utilization.
4. Priority based scheduling.
5. Less wastage of CPU time.
6. Higher performance.

If the page to be removed came from an image or data file and has not been written to then the page does not need to be saved. Instead it can be discarded and if the process needs that page again it can be brought back into memory from the image or data file again. However, if the page has been written to then the operating system must preserve the contents of that page so that it can be accessed at a later time.

Linux uses a page aging technique to fairly choose pages which might be removed from the system. This scheme involves every page in the system having an age which changes as the page is accessed. The more that a page is accessed, the younger it is; the less that it is accessed the older it becomes. Old pages are good candidates for swapping.

Demand Segmentation

Although demand paging is considered the most efficient virtual memory system, a significant amount of hardware is required to implement it. When this hardware is lacking, less efficient means are sometimes devised to provide virtual memory. A case in point is demand segmentation.
Operating system allocates memory in segments, rather than in pages. It keeps track of these segments through segment descriptors, which include information about the segment's size, protections, and location. A process does not need to have all its segments in memory to execute. Instead, the segment descriptor contains a valid bit for each segment to indicate whether the segment is currently in memory. If the segment is in memory, the access continues unhindered. If the segment is not in memory, a trap to the operating system occurs. Operating system then swaps out a segment to secondary storage, and brings in the entire requested segment. The interrupted instruction then continues.
To determine which segment to replace in case of segment fault, operating system uses another bit in the segment descriptor called an accessed bit. It is set whenever any byte in the segment is either read or written. A queue is kept containing an entry for each segment in memory. After every time slice, the operating system places at the head of the queue any segments with a set access bit.
It then clears all access bits. In this way, the queue stays ordered with the most recently used segments at the head.
Demand segmentation requires considerable overhead. Thus, demand segmentation is not an optimal means for making best use of the resources of a computer system.

Thursday, January 21, 2010

Thrashing and its Causes

It is technically possible to reduce the number of allocated frames to the minimum, there is some number of pages that are in active use. If the process does not have this number of frames, it will very quickly page fault. At this point, it must replace some page. However, since all its pages are in active use, it must replace a page that will be needed again right away. Consequently it very quickly faults again, and again, and again. The process continues to fault, replacing pages for which it will then fault and bring back in right away. This high paging activity is called thrashing. A process is thrashing if it is spending more time paging than executing.
Thrashing results in severe performance problems. If you consider the scenario below you will understand how early paging systems behaved :
CPU utilization is monitored by the operating system. If the system finds that CPU utilization is too low, multiprogramming is increased by adding a new process; the algorithm used replaces pages without considering their linked processes. However, a process may end up needing more frames and takes pages away from other processes, causing faulting. The processes from which those pages were taken away in turn will pull pages from other processes increasing the degree of faulting. As processes queue up for the paging device and end up waiting for pages, CPU utilization decreases, in turn causing a push to increase the degree of multiprogramming. This process will keep on happening with CPU utilization decreasing even further and the CPU scheduler trying to increase multiprogramming even further. This leads to thrashing and consequent decrease in system throughput (accompanied by a large increase in page fault rate).

Page Buffering Algorithm

There are some other procedures that are often used in addition to a specific page replacement algorithm.

- System commonly keep a pool of free frames. When a page fault occurs, a victim frame is chosen as before. However, the desired page is read into a free frame from the pool before the victim is written out. This procedure allows the process to restart as soon as possible, without waiting for the victim page to be written out. When the victim is later written out, its frame is added to the free-frame pool.
- Whenever the paging device is idle, a modified page is selected and is written to the disk. Its modify bit is then reset. This scheme increases the probability that a page will be clean when it is selected for replacement, and will not need to be written out.
- Another modification is to keep a pool of free frames, but to remember which page was in each frame. Since the frame contents are not modified when a frame is written to the disk, the old page can be reused directly from the free-frame pool if it is needed before that frame is reused. No I/O is needed in this case. When a page fault occurs, we first check whether the desired page is in the free-frame pool. If it is not, we must select a free frame and read into it.

This technique is used in VAX/VMS system, with a FIFO replacement algorithm. When the FIFO replacement algorithm mistakenly replaces a page that is still in active use, that page is quickly retrieved from the free-frame buffer, and no I/O is necessary.

Monday, January 18, 2010

Working Set Page Replacement

The working set of a process is the set of pages expected to be used by that process during some time interval.
While the "working set model" isn't a page replacement algorithm in the strict sense (it's actually a kind of mid-term scheduler), several page replacement algorithms attempt to approximate the working set model, attempting to keep all pages in the working set in RAM.

For a window size T (measured in memory references), the working set at time t is the set of pages which were referenced in the interval (t -T + 1, t). A page may be replaced when it no longer belongs to the working set (this is not necessarily when a page fault occurs.)

Given a program with 7 virtual pages {a,b,...,g} and the reference sequence
a b a c g a f c g a f d b g
with a window of 4 references.
Working Set Replacement

A variant of the basic working set replacement, which replaces pages only when there is a page fault, could do the following on a page fault:
- If all pages belong to the working set (i.e., have been accessed in the window W time units prior to the page fault) then increase the working set by 1 page.
- If one or more pages do not belong to the working set (i.i., have not been referenced in the window W time units prior to the page fault) then decrease the working set by discarding the last recently used page. If there is more than one page not in the working set, discard the 2 pages which have been least recently used.

Aging Page Replacement Algorithm

The aging algorithm is a descendant of the NFU algorithm, with modifications to make it aware of the time span of use.Instead of just incrementing the counters of pages referenced, putting equal emphasis on page references regardless of the time, the reference counter on a page is first shifted right (divided by 2), before adding the referenced bit to the left of that binary number.Suppose that after the first clock tick the R bits for pages 0 to 5 have the values 1,0, 1, 0, 1, and 1, respectively (page 0 is 1, page 1 is 0, page 2 is 1, etc.). In other words, between tick 0 and tick 1, pages 0, 2, 4, and 5 were referenced, setting their R bits to 1, while the other ones remain 0.


When a page fault occurs, the page whose counter is the lowest is removed. It
is clear that a page that has not been referenced for, say, four clock ticks will have four leading zeros in its counter and thus will have a lower value than a counter that has not been referenced for three clock ticks.

Aging differs from LRU in the sense that aging can only keep track of the references in the latest 16/32 (depending on the bit size of the processor's integers) time intervals. Consequently, two pages may have referenced counters of 00000000, even though one page was referenced 9 intervals ago and the other 1000 intervals ago. Generally speaking, knowing the usage within the past 16 intervals is sufficient for making a good decision as to which page to swap out. Thus, aging can offer near-optimal performance for a moderate price.

Random and Not Frequently Used (NFU) Page replacement Algorithms

- Random replacement algorithm replaces a random page in memory. This eliminates the overhead cost of tracking page references. Usually it fares better than FIFO, and for looping memory references it is better than LRU, although generally LRU performs better in practice. OS/390 uses global LRU approximation and falls back to random replacement when LRU performance degenerates, and the Intel i860 processor used a random replacement policy.

- The Not Frequently Used (NFU) page replacement algorithm requires a counter, and every page has one counter of its own which is initially set to 0. At each clock interrupt,the operating system scans all the pages in memory. For each page, the R
bit, which is 0 or 1, is added to the counter. In effect, the counters are an attempt to keep track of how often each page has been referenced. When a page fault
occurs, the page with the lowest counter is chosen for replacement.
The main problem with NFU is that it never forgets anything. For example, in a multi-pass compiler, pages that were heavily used during pass 1 may still have a high count well into later passes. In fact, if pass 1 happens to have the longest execution time of all the passes, the pages containing the code for subsequent passes may always have lower counts than the pass 1 pages. Consequently, the operating system will remove useful pages instead of pages no longer in use.

Fortunately, a small modification to NFU makes it able to simulate LRU quite well. The modification has two parts. First, the counters are each shifted right 1 bit before the R bit is added in. Second, the R bit is added to the leftmost, rather than the rightmost bit.

Least Recently Used Page Replacement Algorithm - LRU

The least recently used (LRU) algorithm applies to a cache memory having a number of independently replaceable pages. When the cache is full and a memory reference is not found in any page in the cache, the page to be replaced is the one which has not been accessed for the longest amount of time. An ordered history of page numbers, called an LRU stack, must be generated for successive memory references.
Although LRU is theoretically realizable, it is not cheap. To fully implement LRU, it is necessary to maintain a linked list of all pages in memory, with the most recently used page at the front and the least recently used page at the rear. The difficulty is that the list must be updated on every memory reference. Finding a page in the list, deleting it, and then moving it to the front is a very time consuming operation, even in hardware.

However, LRU can be implemented using special hardware. Suppose the hardware has a 64-bit counter that is incremented at every instruction. Whenever a page is accessed, it gains a value equal to the counter at the time of page access. Whenever a page needs to be replaced, the operating system selects the page with the lowest counter and swaps it out. With present hardware, this is not feasible because the required hardware counters do not exist.
One important advantage of the LRU algorithm is that it is amenable to full statistical analysis. It has been proven, for example, that LRU can never result in more than N-times more page faults than OPT algorithm, where N is proportional to the number of pages in the managed pool.
On the other hand, LRU's weakness is that its performance tends to degenerate under many quite common reference patterns.

Sunday, January 17, 2010

In a product development cycle, how to get engineering and product management on the same page

As mentioned in the previous post, one of the biggest problems in a product development cycle is the objective of ensuring that the engineering teams and product management are in agreement about the resource commitment. Product Management talks about getting new features in place, while engineering has to content with having to dedicate resources to efforts for handling infrastructural and legacy tasks.
What are some of the tasks that an engineering team has to do which are not related to new feature work:
1. Test features from earlier versions and fix issues in these
2. Incorporate new components - you could be incorporating common components (and once you are building a product, there are many features that can be done by utilizing common components such as disc burning engines, installer technologies such as MSI, Installshield, Visual Studio etc); over a period of time, such components need to be upgraded and there will need to be more testing required
3. Spending dedicated time to improve the quality. Companies with large products such as Microsoft, Apple, Adobe, etc defer a number of bugs in their products. Some of these bugs are deferred since they are not critical, and there is a need to release the products. However, over a period of time, it may turn out that the overall impact of these deferred bugs can be high, and may need to spend some dedicated time to fix these bugs.

Now, what needs to be done in such cases since there needs to be an agreement between the engineering teams and product managers.
First, before starting on a new cycle, the team needs to spend around a week (with multiple meetings) to work through what all is planned for the next cycle, and this includes discussing what needs to be the focus of the release. A good way to sort through the legacy features is to emphasize that legacy features are non-negotiable and need to be tested since they impact customers directly. In a number of cases, without this discussion, the Product Management team has not thought through the need to test legacy features and when this discussion happens, this issue will get cleared.
Secondly, when the discussions about the features for the cycle starts, the team will need to work through setting aside dedicated time for infrastructural items, and the engineering team will need to push a bit hard on this issue; and in most reasonable cases, there will be an agreement between the engineering and the product management team as long as the time needed for this infrastructural work does not take too long.
In all such cases, focusing on a mix of the needed technical architecture work and new features should help to resolve such issues.

Getting the priorities right in feature development - what are problems in terms of balance between new features and other work ?

One of the biggest problems that a product development team runs into during the process of product development is about how to prioritize features. Typically for very small products, where the team size is low (less than 5-10 people), there may not be a separate position for a Product Manager, and the work of defining what the Product should be, what features it should contain, and so on, is fairly easy since you have effectively the same person running the engineering and product management activities. However, if you leave aside such cases, you have the normal case of a product that has a separate Product Manager (who works with the engineering teams, and not works for engineering, in order to ensure that there is a clear separation between engineering and the people who drive features).
Now, as I mentioned in the beginning, the problem is in terms of defining the features that need to be added to the product (this is true whether the product is a new product, or whether the product is a newer version of an existing product). A classic Product Manager would want to get new features added that are:
1. Meant to appeal to people from the press and technical reviewers who write about the new product and help in forming pubic opinion about the product. If the product is in a competitive area, then it is critical to get a good series of reviews
2. Mean to ensure that there is not a major discrepancy between the product and competitive products, and more important, the product should have all the latest features that are present with competitors
3. Ensure that existing users get some great new features that will encourage them to update their existing product (for existing products, getting existing users to update is a huge chunk of the business).
So what is the problem in all these ? Consider a team that is working with an older version of Visual Studio or some tool, and needs to upgrade. Now, upgrading the entire project in terms of code can take some time and effort, and this time and effort needs to come out of the same effort that is being used for generating new features. In addition, time needs to be spent on testing existing areas of the product that are not being modified; since they form a part of the product that is being released, they need to be tested to ensure that such functionality is not broken.
The typical problem that arises from these conflicting needs is that, with a given set of resources, not everything that the Product Manager requires will be built. The amount of difference between what the Product Manager wants and what the Product Manager will get accounts for the tension in the team, and can lead to tense discussions (except when you have a seasoned Product Manager who knows the situation, and even while pushing for new features, accepts that a certain amount of time will be utilized for other work).
In the next article on this subject, I will talk about what can be done to make this process much smoother.

Friday, January 15, 2010

Clock Page Replacement Algorithm

Although second chance is a reasonable algorithm, it is unnecessarily inefficient because it is constantly moving pages around on its list. A better approach is to keep all the page frames on a circular list in the form of a clock.

The clock algorithm keeps a circular list of pages in memory, with the "hand" (iterator) pointing to the oldest page in the list. When a page fault occurs and no empty frames exist, then the R (referenced) bit is inspected at the hand's location.
* If its R bit is 0, the page is evicted, the new page is inserted into the clock in its place, and the hand is advanced one position.
* If R is 1, it is cleared and the hand is advanced to the next page. This process is repeated until a page is found with R = 0.

Clock Page Replacement Algorithm

Variants on Clock :
- Clock-Pro keeps a circular list of information about recently-referenced pages, including all M pages in memory as well as the most recent M pages that have been paged out. This extra information on paged-out pages, like the similar information maintained by ARC, helps it work better than LRU on large loops and one-time scans.
- WSclock: The "aging" algorithm and the "WSClock" algorithm are probably the most important page replacement algorithms in practice.
- CAR is a page replacement algorithm that has performance comparable to ARC, and substantially outperforms both LRU and CLOCK. The algorithm CAR is self-tuning and requires no user-specified magic parameters.

Second Chance Page Replacement Algorithm

- A simple modification to FIFO that avoids the problem of throwing out a heavily used page is to inspect the R bit of the oldest page.
- If it is 0, the page is both old and unused, so it is replaced immediately.
- If the R bit is 1, the bit is cleared, the page is put onto the end of the list of pages, and its load time is updated as though it had just arrived in memory. Then the search continues. The operation of this algorithm, called second chance.
* Suppose that a page fault occurs at time 20. The oldest page is A, which arrived at time 0, when the process started.
* If A has the R bit cleared, it is evicted from memory. On the other hand, if the R bit is set, A is put onto the end of the list and its ``load time'' is reset to the current time (20). The R bit is also cleared. The search for a suitable page continues with B.
- What second chance is doing is looking for an old page that has not been referenced in the previous clock interval. If all the pages have been referenced, second chance degenerates into pure FIFO.

Second Chance Page Replacement Algorithm
Operation of second chance.
(a) Pages sorted in FIFO order.
(b) Page list if a page fault occurs at time 20 and A has its R bit set. The numbers above the pages are their loading times.

Thursday, January 14, 2010

First-In First-Out Page Replacement Algorithm

The first-in, first-out (FIFO) page replacement algorithm is a low-overhead algorithm that requires little book-keeping on the part of the operating system. The operating system maintains a list of all pages currently in memory, with the page at the head of the list the oldest one and the page at the tail the most recent arrival. On a page fault, the page at the head is removed and the new page added to the tail of the list.
It is not strictly necessary to record the time when a page is brought in. We can create a FIFO queue to hold all the pages in memory. We replace the page at the head of the queue. When a page is brought into memory, we insert it at the tail of the queue.

While FIFO is cheap and intuitive, it performs poorly in practical application. Thus, it is rarely used in its unmodified form. The page replaced may be an initialization module that was used a long time ago and is no longer needed. On the other hand, it could contain a heavily used variable that was initialized early and is in constant use.
FIFO page replacement algorithm is used by the VAX/VMS operating system, with some modifications. Partial second chance is provided by skipping a limited number of entries with valid translation table references, and additionally, pages are displaced from process working set to a system wide pool from which they can be recovered if not already re-used.

Not Recently Used (NRU) Page Replacement Algorithm

The not recently used (NRU) page replacement algorithm is an algorithm that favors keeping pages in memory that have been recently used. This algorithm works on the following principle: when a page is referenced, a referenced bit is set for that page, marking it as referenced. Similarly, when a page is modified (written to), a modified bit is set. The setting of the bits is usually done by the hardware, although it is possible to do so on the software level as well.

- When a page fault occurs, the operating system inspects all the pages and divides them into four categories based on the current values of their R (referenced; read or written) and M (modified) bits:
* Class 0: not referenced, not modified.
* Class 1: not referenced, modified.
* Class 2: referenced, not modified.
* Class 3: referenced, modified.
- Although class 1 pages seems, at first glance, impossible, they occur when a class 3 page has its R bit cleared by a clock interrupt. Clock interrupts do not clear the M bit because this information is needed to know whether the page has to be rewritten to disk or not. Clearing R but not M leads to a class 1 page.
- The NRU (Not Recently Used) algorithm removes a page at random from the lowest numbered nonempty class.
- It is better to remove a modified page that has not been referenced in at least one clock tick (typically 20 msec) than a clean page that is in heavy use.
- The main attraction of NRU is that it is easy to understand, moderately efficient to implement, and gives a performance that, while certainly not optimal, may be adequate.

When to get Usability Reviews of your software done

Usability Review is an extremely important part of getting a software to market. By usability review, we mean that we want to ensure that the software we are building is being tested by end users before release. A software that is built on the basis of requirements given by Product Management and coded by the development team could be way off the mark in terms of usability unless there has been a systematic effort to ensure that some end-users have had time to play with the software and their interactions have been recorded.
Now, the best way to get people to give their feedback on a feature is to ask them to actually use the feature from within the software. This assumes that enough of the product schedule has happened that you have an application which is stable enough to get it reviewed by a end-user, and yet you need enough time in the schedule left that you can incorporate any changes suggested by the users. Sometimes what happens is that in the quest to get a better working software to show to end-users, the time period in which this usability testing is carried out is so late that it gets difficult to incorporate any feedback without impacting the schedule. The actual time period for this testing needs to be worked out pretty carefully.

Tuesday, January 12, 2010

Optimal Page Replacement Algorithm

Each page can be labeled with the number of instructions that will be executed before that page is first referenced. The optimal page algorithm simply says that the page with the highest label should be removed. If one page will not be used for 8 million instructions and another page will not be used for 6 million instructions, removing the former pushes the page fault that will fetch it back as far into the future as possible.
The only problem with this algorithm is that it is unrealizable. At the time of the page fault, the operating system has no way of knowing when each of the pages will be referenced next. Still, by running a program on a simulator and keeping track of all page references, it is possible to implement optimal page replacement on the second run by using the page reference information collected during the first run.

Despite this limitation, algorithms exist that can offer near-optimal performance — the operating system keeps track of all pages referenced by the program, and it uses those data to decide which pages to swap in and out on subsequent runs. This algorithm can offer near-optimal performance, but not on the first run of a program, and only if the program's memory reference pattern is relatively consistent each time it runs.

In short :
* Belady's optimal algorithm for the minimum number of page faults.
* Replace the page that will be referenced furthest in the future or not at all.
* Problem: we cannot implement it, because we cannot predict the future.
* This is the best case.
* Can use it to compare against other algorithms.

Page Replacement

Page replacement algorithms decide which memory pages to page out (swap out, write to disk) when a page of memory needs to be allocated. Paging happens when a page fault occurs and a free page cannot be used to satisfy the allocation, either because there are none, or because the number of free pages is lower than some threshold. Requirements for page replacement algorithms have changed due to differences in operating system kernel architectures. In particular, most modern OS kernels have unified virtual memory and file system caches, requiring the page replacement algorithm to select a page from among the pages of both user program virtual address spaces and cached files.

Replacement algorithms can be local or global.
When a process incurs a page fault, a local page replacement algorithm selects for replacement some page that belongs to that same process (or a group of processes sharing a memory partition). A global replacement algorithm is free to select any page in memory.
There are a variety of page replacement algorithms :
- Optimal page replacement algorithm
- Not Recently Used (NRU)
- First-in, first-out (FIFO)
- Second-chance
- Clock
- Least recently used (LRU)
- Random replacement algorithm
- Not frequently used (NFU)
- Aging

Monday, January 11, 2010

Performance of Demand Paging

Advantages of Demand Paging :
* Only loads pages that are demanded by the executing process.
* As there is more space in main memory, more processes can be loaded reducing context switching time which utilizes large amounts of resources.
* Less loading latency occurs at program start-up, as less information is accessed from secondary storage and less information is brought into main memory.
* Does not need extra hardware support than what paging needs, since protection fault can be used to get page fault.

Disadvantages of Demand Paging :
* Individual programs face extra latency when they access a page for the first time. So demand paging may have lower performance than anticipatory paging algorithms such as pre-paging.
* Programs running on low-cost, low-power embedded systems may not have a memory management unit that supports page replacement.
* Memory management with page replacement algorithms becomes slightly more complex.
* Possible security risks, including vulnerability to timing attacks.

Performance Of Demand Paging :
Let p be the probability of a page fault (0<=p<=1). We would expect p to be close to zero i.e. there will be only few page faults. The effective access time is then :
effective access time = (1-p) * ma + p * page fault time
To compute the effective access time, we must know how much time is needed to service a page fault. A page fault causes the following sequence to occur :
- Trap to the operating system.
- Save the user registers and process state.
- Determine that the interrupt was a page fault.
- Check that the page reference was legal and determine the location of the page on the disk.
- Issue a read from the disk to a free frame.
- While waiting, allocate the CPU to some other user.
- Interrupt from the disk.
- Save the registers and process state for the other user.
- Determine that the interrupt was from the disk.
- Correct the page table and other tables to show that the desired page is now in memory.
- Wait for the CPU to be allocated to this process again.
- Restore the user registers, process state, and new page table, then resume interrupted instruction.

Overview of Demand Paging

Demand paging follows that pages should only be brought into memory if the executing process demands them. This is often referred to as lazy evaluation as only those pages demanded by the process are swapped from secondary storage to main memory. Contrast this to pure swapping, where all memory for a process is swapped from secondary storage to main memory during the process start-up.

In pure demand paging a page is never moved from the backing store into main memory until that page is referenced. It is the responsibility of the operating system to check where the page is in main memory and OS uses an internal table for this. Operating system reads that page after finding it, and in order to reflect change, the page table is updated. So by using this process it is possible to run a process even its entire memory image is not taken from backing store into main memory.
In this way demand paging is a better approach then paging and it also increases the degree of multiprogramming and allows a process to run even it exceed the physical space allocated for it.

An invalid page is one that currently resides in secondary memory. When a process tries to access a page, the following steps are generally followed:
- Attempt to access page.
- If page is valid (in memory) then continue processing instruction as normal.
- If page is invalid then a page-fault trap occurs.
- Check if the memory reference is a valid reference to a location on secondary memory. If not, the process is terminated (illegal memory access). Otherwise, we have to page in the required page.
- Schedule disk operation to read the desired page into main memory.
- Restart the instruction that was interrupted by the operating system trap.

Saturday, January 9, 2010

Process Management

A process is a sequential program in execution. The components of a process are the following:
* The object program to be executed.
* The data on which the program will execute.
* Resources required by the program.
* The status of the process's execution.

Process management is an integral part of any modern day operating system. The OS must allocate resources to processes, enable processes to share and exchange information, protect the resources of each process from other processes and enable synchronization among processes. To meet these requirements, the OS must maintain a data structure for each process, which describes the state and resource ownership of that process and which enables the OS to exert control over each process.

An operating system kernel that allows multi-tasking needs processes to have certain states. Names for these states are not standardized, but they have similar functionality.
- Executing: the process is currently running and has control of a CPU.
- Waiting: the process is currently able to run, but must wait until a CPU becomes available.
- Blocked: the process is currently waiting on I/O, either for input to arrive or output to be sent.
- Suspended: the process is currently able to run, but for some reason the OS has not placed the process on the ready queue.
- Ready: the process is in memory, will execute given CPU time.

Introduction to Monitors

Monitors :
- Consist of private data and operations on that data.
- It can contain types, constants, variables and procedures.
- Only the procedures explicitly marked can be seen outside the monitor.
- The monitor body allows the private data to be initialized.
- The compiler enforces mutual exclusion on a particular monitor.
- Each monitor has a boundary queue, and processes wanting to call a monitor routine
join this queue if the monitor is already in use.
- Monitors are an improvement over conditional critical regions because all the code
that accesses the shared data is localized.

To allow a process to wait within the monitor, a condition variable must be declared, as: condition x,y.
Condition variable can only be used with the operations wait and signal.
The operation x.wait()means that the process invoking this operation is suspended until another process invokes. Also called delay function.
The operation x.signal()resumes exactly one suspended process. If no process is suspended, then the signal operation has no effect.Also called resume function.

Monitor Implementation using Semaphores
- For each condition variable x, we have:
semaphore x-sem; // (initially = 0)
int x-count = 0;
- The operation x.wait can be implemented as:
if (next-count > 0)
- The operation x.signal can be implemented as:
if (x-count > 0)

Critical Regions

A critical region is a section of code that is always executed under mutual exclusion. Critical regions shift the responsibility for enforcing mutual exclusion from the programmer(where it resides when semaphores are used) to the compiler.
They consist of two parts:
- Variables that must be accessed under mutual exclusion.
- A new language statement that identifies a critical region in which the variables
are accessed.
v : shared T;
region v do
All critical regions that are ‘tagged’ with the same variable have compiler-enforced mutual exclusion so that only one of them can be executed at a time:
Process A:
region V1 do
{Do some stuff.}
region V2 do
{Do more stuff.}

Process B:
region V1 do
{Do other stuff.}
Here process A can be executing inside its V2 region while process Bis executing inside its V1 region, but if they both want to execute inside their respective V1 regions only one will be permitted to proceed.

Friday, January 8, 2010


Semaphores can best be described as counters used to control access to shared resources by multiple processes. They are most often used as a locking mechanism to prevent processes from accessing a particular resource while another process is performing operations on it.
A semaphore is a protected variable or abstract data type which constitutes the classic method for restricting access to shared resources such as shared memory in a parallel programming environment. A counting semaphore is a counter for a set of available resources, rather than a locked/unlocked flag of a single resource.
A semaphore S is an integer variable that, apart from initialization, is accessed only through two standard atomic operations : wait and signal. These operations were originally termed P(for wait) and V(for signal). The classical definitions of wait and signal are :
wait(S) : while S<=0 do no-op;

signal(S) : S:= S+1;

Modifications to the integer value of the semaphore in the wait and signal operations must be executed indivisibly.
- Semaphores can be used to deal with the n-process critical- section problem.
- Semaphores can be used to solve various synchronization problems.

A mutex is a binary semaphore (a semaphore with capacity of 1), usually including extra features like ownership, priority inversion protection or recursivity. The differences between mutexes and semaphores are operating system dependent, though mutexes are implemented by specialized and faster routines. Mutexes are meant to be used for mutual exclusion (post/release operation is restricted to thread which called pend/acquire) only and binary semaphores are meant to be used for event notification (post-ability from any thread) and mutual exclusion.

Wednesday, January 6, 2010

Packet Sniffing

Packet sniffing is listening (with software) to the raw network device for packets that interest you. When your software sees a packet that fits certain criteria, it logs it to a file. The most common criteria for an interesting packet is one that contains words like "login" or "password."
Packet sniffing is a form of wire-tap applied to computer networks instead of phone networks. It came into vogue with Ethernet, which is known as a "shared medium" network. This means that traffic on a segment passes by all hosts attached to that segment. Ethernet cards have a filter that prevents the host machine from seeing traffic addressed to other stations. Sniffing programs turn off the filter, and thus see everyone's traffic.
A packet sniffer, sometimes referred to as a network monitor or network analyzer, can be used legitimately by a network or system administrator to monitor and troubleshoot network traffic. Using the information captured by the packet sniffer an administrator can identify erroneous packets and use the data to pinpoint bottlenecks and help maintain efficient network data transmission.
The versatility of packet sniffers means they can be used to:
* Analyze network problems.
* Detect network intrusion attempts.
* Gain information for effecting a network intrusion.
* Monitor network usage.
* Gather and report network statistics.
* Filter suspect content from network traffic.
* Spy on other network users and collect sensitive information such as passwords .
* Reverse engineer proprietary protocols used over the network.
* Debug client/server communications.
* Debug network protocol implementations.

Least Slack Time (LST) Scheduling

Slack time is defined as the temporal difference between the deadline, the ready time and the run time.More formally, the slack time for a process is defined as:
(d − t) − c'
where d is the process deadline, t is the real time since the cycle start, and c' is the remaining computation time.

Least Slack Time (LST) scheduling is a scheduling algorithm. It assigns priority based on the slack time of a process. Slack time is the amount of time left after a job if the job was started now. This algorithm is also known as Least Laxity First. Its most common use is in embedded systems, especially those with multiple processors. It imposes the simple constraint that each process on each available processor possesses the same run time, and that individual processes do not have an affinity to a certain processor. This is what lends it a suitability to embedded systems.
LST scheduling is most useful in systems comprising mainly aperiodic tasks, because no prior assumptions are made on the events' rate of occurrence. The main weakness of LST is that it does not look ahead, and works only on the current system state. Thus, during a brief overload of system resources, LST can be sub-optimal.

Tuesday, January 5, 2010

Open Addressing - a way to deal with collisions in hashing

Open addressing, or closed hashing, is a method of collision resolution in hash tables. With this method a hash collision is resolved by probing, or searching through alternate locations in the array (the probe sequence) until either the target record is found, or an unused array slot is found, which indicates that there is no such key in the table.
In this technique all elements are stored in the hash table itself. That is, each table entry contains either an element or NIL. When searching for element (or empty slot), we systematically examine slots until we find an element (or empty slot). There are no lists and no elements stored outside the table. This implies that table can completely "fill up"; the load factor α can never exceed 1.
Advantage of this technique is that it avoids pointers (pointers need space too). Instead of chasing pointers, we compute the sequence of slots to be examined.
To perform insertion, we successively examine or probe, the hash table until we find an empty slot. The sequence of slots probed "depends upon the key being inserted." To determine which slots to probe, the hash function includes the probe number as a second input.
Well known probe sequences include:
* Linear probing in which the interval between probes is fixed (usually 1).
* Quadratic probing in which the interval between probes increases by some constant (usually 1) after each probe.
* Double hashing in which the interval between probes is computed by another hash function.

Drawbacks :
- Open addressing schemes is that the number of stored entries cannot exceed the number of slots in the bucket array.
- Open addressing schemes also put more stringent requirements on the hash function.
- Open addressing only saves memory if the entries are small.
- Normal open addressing is a poor choice for large elements, since these elements fill entire cache lines, and a large amount of space is wasted on large empty table slots.

Separate Chaining

A scheme in which each position in the hash table has a list to handle collisions. Each position may be just a link to the list (direct chaining) or may be an item and a link, essentially, the head of a list. In the latter, one item is in the table, and other colliding items are in the list.
Also known as external chaining.

Separate chaining with list heads :
Some chaining implementations store the first record of each chain in the slot array itself. The purpose is to increase cache efficiency of hash table access. In order to save memory space, such hash tables are often designed to have about as many slots as stored entries, meaning that many slots will have two or more entries.

Separate chaining with other structures :
Instead of a list, one can use any other data structure that supports the required operations. By using a self-balancing tree, for example, the theoretical worst-case time of a hash table can be brought down to O(log n) rather than O(n). However, this approach is worth the trouble and extra memory cost only if long delays must be avoided at all costs or if one expects to have many entries hashed to the same slot.
The variant called array hashing uses a dynamic array to store all the entries that hash to the same bucket.Each inserted entry gets appended to the end of the dynamic array that is assigned to the hashed slot.

Monday, January 4, 2010

Overview of Collision Resolution in Hashing

Collisions are practically unavoidable when hashing a random subset of a large set of possible keys.Therefore, most hash table implementations have some collision resolution strategy to handle such events.

- Load factor : The performance of most collision resolution methods does not depend directly on the number n of stored entries, but depends strongly on the table's load factor, the ratio n/s between n and the size s of its bucket array. With a good hash function, the average look up cost is nearly constant as the load factor increases from 0 up to 0.7 or so. Beyond that point, the probability of collisions and the cost of handling them increases.
- Separate chaining : Each slot of the bucket array is a pointer to a linked list that contains the key-value pairs that hashed to the same location. Look-up requires scanning the list for an entry with the given key. Insertion requires appending a new entry record to either end of the list in the hashed slot. Deletion requires searching the list and removing the element.
* Chained hash tables with linked lists are popular because they require only basic data structures with simple algorithms, and can use simple hash functions that would be unsuitable for other methods.
* Chained hash tables remain effective even when the number of entries n is much higher than the number of slots.
* For separate-chaining, the worst-case scenario is when all entries were inserted into the same bucket, in which case the hash table is ineffective and the cost is that of searching the bucket data structure.
* Chained hash tables also inherit the disadvantages of linked lists.

Saturday, January 2, 2010

Geometric Hashing Technique

Geometric Hashing : It is a method for efficiently finding two-dimensional objects represented by discrete points that have undergone an affine transformation. The models are assumed to be known in advance, and thus allow pre-processing, as follows:

- Pick a reference frame.
- Compute the 3D orthonormal basis associated with this reference frame and its shape signature (e.g. triangle sides length).
- Compute the coordinates of all the other points (in a pre-specified neighborhood) in this reference frame.
- Use each coordinate as an address to the hash (look-up) table. Store the entry [protein id, ref. frame, shape sign., point] at the hash table address.
- Repeat above steps for each model reference frame (non-collinear triplet of model points).

Recognition stage of the algorithm uses the hash table, prepared in the pre-processing step. The matching of a target object is accomplished as follows:

- For each reference frame of the target:
- Compute the 3D orthonormal basis and the shape signature associated with it.
- Compute the coordinates of all other points in the current reference frame.
- Use each coordinate to access the hash-table and retrieve all the records [protein id, ref. frame, shape sign.,point].
- For records with matching shape signature "vote" for the pair [protein, ref. frame].
- Compute the transformations of the ``high scoring'' hypotheses. For each hypothesis one can also register the pairs of matching points. This match list along with the transformation comprise a seed match.

Cryptographic Hash Function

Hashing as a tool to associate one set or bulk of data with an identifier has many different forms of application in the real-world. Cyptographic Hashing is used for data/user verification and authentication. A strong cryptographic hash function has the property of being very difficult to reverse the result of the hash and hence reproduce the original piece of data. Cryptographic hash functions are used to hash user's passwords and have the hash of the passwords stored on a system rather than having the password itself stored. Cryptographic hash functions are also seen as irreversible compression functions, being able to represent large quantities of data with a signal ID, they are useful in seeing whether or not the data has been tampered with, and can also be used as data one signs in order to prove authenticity of a document via other cryptographic means.
The ideal cryptographic hash function has four main properties:
* it is easy to compute the hash value for any given message.
* it is infeasible to find a message that has a given hash.
* it is infeasible to modify a message without changing its hash.
* it is infeasible to find two different messages with the same hash.
A cryptographic hash function must be able to withstand all known types of cryptanalytic attack. As a minimum, it must have the following properties:
* Preimage resistance: Given a hash h it should be hard to find any message m such that h = hash(m). This concept is related to that of one way function. Functions that lack this property are vulnerable to preimage attacks.
* Second preimage resistance : Given an input m1, it should be hard to find another input, m2 (not equal to m1) such that hash(m1) = hash(m2). This property is sometimes referred to as weak collision resistance. Functions that lack this property are vulnerable to second preimage attacks.
* Collision resistance : It should be hard to find two different messages m1 and m2 such that hash(m1) = hash(m2). Such a pair is called a (cryptographic) hash collision, and this property is sometimes referred to as strong collision resistance. It requires a hash value at least twice as long as what is required for preimage-resistance, otherwise collisions may be found by a birthday attack.

Properties of Hash Functions

A hash function is a function that takes a relatively arbitrary amount of input and
produces an output of fixed size. The properties of some hash functions can be
used to greatly increase the security of a system administrator’s network; when
implemented correctly they can verify the integrity and source of a file, network
packet, or any arbitrary data.
Good hash functions, in the original sense of the term, are usually required to satisfy certain properties listed below :

- A hash is a one-way function.
- The cost of computing a hash function must be small enough to make a hashing-based solution advantageous with regard to alternative approaches.
- A hash procedure must be deterministic — meaning that for a given input value it must always generate the same hash value. In other words, it must be a function of the hashed data, in the mathematical sense of the term.
- A good hash function should map the expected inputs as evenly as possible over its output range i.e. every hash value in the output range should be generated with roughly the same probability.
- The range of hash values may be different for each run of the program, or may change along the same run. In those situations, one needs a hash function which takes two parameters — the input data z, and the number n of allowed hash values.
- A hash function that is used to search for similar (as opposed to equivalent) data must be as continuous as possible; two inputs that differ by a little should be mapped to equal or nearly equal hash values.

Facebook activity