
Tuesday, August 11, 2009

Introduction to Data Mining

Data mining is the process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.
Put another way, data mining is the analysis of (often large) observational data sets to find
unsuspected relationships and to summarize the data in novel ways that are
both understandable and useful to the data owner.
It is an interdisciplinary field that brings together techniques from
machine learning, pattern recognition, statistics, databases, and visualization to
address the problem of extracting information from large databases.

1. Commercial View:
- Lots of data is being collected and warehoused.
* Web data, e-commerce.
* Purchases at department/grocery stores.
* Bank/credit card transactions.
- Computers have become cheaper and more powerful.
- Competitive pressure is strong.
* Provide better, customized services for an edge.
2. Scientific View:
- Data is collected and stored at enormous speeds (GB/hour).
* Remote sensors on a satellite.
* Telescopes scanning the skies.
* Microarrays generating gene expression data.
* Scientific simulations generating terabytes of data.
- Traditional techniques are infeasible for raw data.
- Data mining may help scientists:
* in classifying and segmenting data.
* in hypothesis formation.

Data mining derives its name from the similarities between searching for valuable business information in a large database — for example, finding linked products in gigabytes of store scanner data — and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material, or intelligently probing it to find exactly where the value resides. Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:
- Automated prediction of trends and behaviors.
- Automated discovery of previously unknown patterns.
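To make the "linked products" example concrete, here is a minimal sketch of one way to look for products that are often bought together: count how many baskets contain each pair of products. The basket contents and the function name `pair_support` are made up for illustration, and real scanner data would need a more scalable approach than counting every pair in memory.

```python
from collections import Counter
from itertools import combinations

# Hypothetical scanner data: each set is one shopping basket.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"beer", "chips", "milk"},
]

def pair_support(baskets):
    """Return the fraction of baskets that contain each unordered product pair."""
    counts = Counter()
    for basket in baskets:
        # Sort so that ("bread", "milk") and ("milk", "bread") are the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: n / total for pair, n in counts.items()}

support = pair_support(baskets)

# Rank pairs by support, highest first; ties are broken alphabetically.
top_pairs = sorted(support.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
for pair, share in top_pairs:
    print(pair, share)
```

Pairs with high support are candidates for "linked products"; deciding whether a link is meaningful would still need further checks, such as comparing against how often each product sells on its own.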

Larger databases make both capabilities more effective, whether the growth is in columns or in rows:
* More columns : Analysts must often limit the number of variables they examine when doing hands-on analysis due to time constraints. Yet variables that are discarded because they seem unimportant may carry information about unknown patterns. High performance data mining allows users to explore the full depth of a database, without preselecting a subset of variables.
* More rows : Larger samples yield lower estimation errors and variance, and allow users to make inferences about small but important segments of a population.
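The "more rows" point can be checked with a small simulation. For independent observations with standard deviation σ, the standard error of a sample mean is σ/√n, so the estimate tightens as the sample grows. The sketch below is only an illustration under assumed settings: it draws from a standard normal distribution, and the function name `spread_of_sample_means`, the sample sizes, the number of trials, and the seed are all arbitrary choices.

```python
import random
import statistics

def spread_of_sample_means(sample_size, trials=200, seed=0):
    """Estimate how much the sample mean varies across repeated samples."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(trials):
        # Draw one sample from a standard normal population (mean 0, sd 1).
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    # The standard deviation of the means approximates the standard error.
    return statistics.stdev(means)

small = spread_of_sample_means(100)    # theory: about 1/sqrt(100)  = 0.10
large = spread_of_sample_means(2500)   # theory: about 1/sqrt(2500) = 0.02

print("spread with 100 rows: ", small)
print("spread with 2500 rows:", large)
```

With 25 times as many rows, the spread of the estimate should shrink by roughly a factor of five, which is the sense in which larger samples give lower estimation error.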
