Data Mining

Concepts and Techniques

1st Edition - August 25, 2000
Latest edition
Authors: Jiawei Han, Micheline Kamber
Language: English

Description

Here's the resource you need if you want to apply today's most powerful data mining techniques to meet real business challenges. Data Mining: Concepts and Techniques equips you with a sound understanding of data mining principles and teaches you proven methods for knowledge discovery in large corporate databases.

Written expressly for database practitioners and professionals, this book begins with a conceptual introduction designed to get you up to speed. This is followed by comprehensive, state-of-the-art coverage of data mining concepts and techniques. Each chapter functions as a stand-alone guide to a critical topic, presenting proven algorithms and sound implementations ready to be used directly or with strategic modification against live data. Wherever possible, the authors raise and answer questions of utility, feasibility, optimization, and scalability, keeping your eye on the issues that will affect your project's results and your overall success. Data Mining: Concepts and Techniques is the master reference that practitioners and researchers have long been seeking. It is also the obvious choice for academic and professional classrooms.

Key features

Offers a comprehensive, practical look at the concepts and techniques you need to know to get the most out of real business data.
Organized as a series of stand-alone chapters so you can begin anywhere and immediately apply what you learn.
Presents dozens of algorithms and implementation examples, all in easily understood pseudo-code and suitable for use in real-world, large-scale data mining projects.
Provides in-depth, practical coverage of essential data mining topics, including OLAP and data warehousing, data preprocessing, concept description, association rules, classification and prediction, and cluster analysis.
Addresses advanced topics such as mining object-relational databases, spatial databases, multimedia databases, time-series databases, text databases, the World Wide Web, and applications in several fields.

Readership

Database professionals and researchers in industry and academia, including graduate students (and possibly undergraduates), who need a good reference or handbook on data mining.

Table of contents

1 Introduction
1.1 What motivated data mining? Why is it important?
1.2 So, what is data mining?
1.3 Data mining - on what kind of data?
1.3.1 Relational databases
1.3.2 Data warehouses
1.3.3 Transactional databases
1.3.4 Advanced database systems and advanced database applications
1.4 Data mining functionalities - what kinds of patterns can be mined?
1.4.1 Concept/class description: characterization and discrimination
1.4.2 Association analysis
1.4.3 Classification and prediction
1.4.4 Cluster analysis
1.4.5 Outlier analysis
1.4.6 Evolution analysis
1.5 Are all of the patterns interesting?
1.6 Classification of data mining systems
1.7 Major issues in data mining
1.8 Summary

2 Data Warehouse and OLAP Technology for Data Mining
2.1 What is a data warehouse?
2.1.1 Differences between operational database systems and data warehouses
2.1.2 But, why have a separate data warehouse?
2.2 A multidimensional data model
2.2.1 From tables and spreadsheets to data cubes
2.2.2 Stars, snowflakes, and fact constellations: schemas for multidimensional databases
2.2.3 Examples for defining star, snowflake and fact constellation schemas
2.2.4 Measures: their categorization and computation
2.2.5 Introducing concept hierarchies
2.2.6 OLAP operations in the multidimensional data model
2.2.7 A starnet query model for querying multidimensional databases
2.3 Data warehouse architecture
2.3.1 Steps for the design and construction of data warehouses
2.3.2 A three-tier data warehouse architecture
2.3.3 Types of OLAP servers: ROLAP versus MOLAP versus HOLAP
2.4 Data warehouse implementation
2.4.1 Efficient computation of data cubes
2.4.2 Indexing OLAP data
2.4.3 Efficient processing of OLAP queries
2.4.4 Metadata repository
2.4.5 Data warehouse back-end tools and utilities
2.5 Further development of data cube technology
2.5.1 Discovery-driven exploration of data cubes
2.5.2 Complex aggregation at multiple granularities: multifeature cubes
2.5.3 Other developments
2.6 From data warehousing to data mining
2.6.1 Data warehouse usage
2.6.2 From on-line analytical processing to on-line analytical mining
2.7 Summary

3 Data Preparation
3.1 Why preprocess the data?
3.2 Data cleaning
3.2.1 Missing values
3.2.2 Noisy data
3.2.3 Inconsistent data
3.3 Data integration and transformation
3.3.1 Data integration
3.3.2 Data transformation
3.4 Data reduction
3.4.1 Data cube aggregation
3.4.2 Dimensionality reduction
3.4.3 Data compression
3.4.4 Numerosity reduction
3.5 Discretization and concept hierarchy generation
3.5.1 Discretization and concept hierarchy generation for numeric data
3.5.2 Concept hierarchy generation for categorical data
3.6 Summary

4 Data Mining Primitives, Languages, and System Architectures
4.1 Data mining primitives: what defines a data mining task?
4.1.1 Task-relevant data
4.1.2 The kind of knowledge to be mined
4.1.3 Background knowledge: concept hierarchies
4.1.4 Interestingness measures
4.1.5 Presentation and visualization of discovered patterns
4.2 A data mining query language
4.2.1 Syntax for task-relevant data specification
4.2.2 Syntax for specifying the kind of knowledge to be mined
4.2.3 Syntax for concept hierarchy specification
4.2.4 Syntax for interestingness measure specification
4.2.5 Syntax for pattern presentation and visualization specification
4.2.6 Putting it all together - an example of a DMQL query
4.2.7 Other data mining languages and the standardization of data mining primitives
4.3 Designing graphical user interfaces based on a data mining query language
4.4 Architecture of data mining systems
4.5 Summary

5 Concept Description: Characterization and Comparison
5.1 What is concept description?
5.2 Data generalization and summarization-based characterization
5.2.1 Attribute-oriented induction
5.2.2 Efficient implementation of attribute-oriented induction
5.2.3 Presentation of the derived generalization
5.3 Analytical characterization: analysis of attribute relevance
5.3.1 Why perform attribute relevance analysis?
5.3.2 Methods of attribute relevance analysis
5.3.3 Analytical characterization: an example
5.4 Mining class comparisons: discriminating between different classes
5.4.1 Class comparison methods and implementations
5.4.2 Presentation of class comparison descriptions
5.4.3 Class description: presentation of both characterization and comparison
5.5 Mining descriptive statistical measures in large databases
5.5.1 Measuring the central tendency
5.5.2 Measuring the dispersion of data
5.5.3 Graph displays of basic statistical class descriptions
5.6 Discussion
5.6.1 Concept description: a comparison with typical machine learning methods
5.6.2 Incremental and parallel mining of concept description
5.7 Summary

6 Mining Association Rules in Large Databases
6.1 Association rule mining
6.1.1 Market basket analysis: a motivating example for association rule mining
6.1.2 Basic concepts
6.1.3 Association rule mining: a road map
6.2 Mining single-dimensional Boolean association rules from transactional databases
6.2.1 The Apriori algorithm: finding frequent itemsets using candidate generation
6.2.2 Generating association rules from frequent itemsets
6.2.3 Improving the efficiency of Apriori
6.2.4 Mining frequent itemsets without candidate generation
6.2.5 Iceberg queries
6.3 Mining multilevel association rules from transaction databases
6.3.1 Multilevel association rules
6.3.2 Approaches to mining multilevel association rules
6.3.3 Checking for redundant multilevel association rules
6.4 Mining multidimensional association rules from relational databases and data warehouses
6.4.1 Multidimensional association rules
6.4.2 Mining multidimensional association rules using static discretization of quantitative attributes
6.4.3 Mining quantitative association rules
6.4.4 Mining distance-based association rules
6.5 From association mining to correlation analysis
6.5.1 Strong rules are not necessarily interesting: an example
6.5.2 From association analysis to correlation analysis
6.6 Constraint-based association mining
6.6.1 Metarule-guided mining of association rules
6.6.2 Mining guided by additional rule constraints
6.7 Summary

7 Classification and Prediction
7.1 What is classification? What is prediction?
7.2 Issues regarding classification and prediction
7.2.1 Preparing data for classification and prediction
7.2.2 Comparing classification methods
7.3 Classification by decision tree induction
7.3.1 Decision tree induction
7.3.2 Tree pruning
7.3.3 Extracting classification rules from decision trees
7.3.4 Enhancements to basic decision tree induction
7.3.5 Scalability and decision tree induction
7.3.6 Integrating data warehousing techniques and decision tree induction
7.4 Bayesian classification
7.4.1 Bayes theorem
7.4.2 Naïve Bayesian classification
7.4.3 Bayesian belief networks
7.4.4 Training Bayesian belief networks
7.5 Classification by backpropagation
7.5.1 A multilayer feed-forward neural network
7.5.2 Defining a network topology
7.5.3 Backpropagation
7.5.4 Backpropagation and interpretability
7.6 Classification based on concepts from association rule mining
7.7 Other classification methods
7.7.1 k-nearest neighbor classifiers
7.7.2 Case-based reasoning
7.7.3 Genetic algorithms
7.7.4 Rough set approach
7.7.5 Fuzzy set approaches
7.8 Prediction
7.8.1 Linear and multiple regression
7.8.2 Nonlinear regression
7.8.3 Other regression models
7.9 Classifier accuracy
7.9.1 Estimating classifier accuracy
7.9.2 Increasing classifier accuracy
7.9.3 Is accuracy enough to judge a classifier?
7.10 Summary

8 Cluster Analysis
8.1 What is cluster analysis?
8.2 Types of data in clustering analysis
8.2.1 Interval-scaled variables
8.2.2 Binary variables
8.2.3 Nominal, ordinal, and ratio-scaled variables
8.2.4 Variables of mixed types
8.3 A categorization of major clustering methods
8.4 Partitioning methods
8.4.1 Classical partitioning methods: k-means and k-medoids
8.4.2 Partitioning methods in large databases: from k-medoids to CLARANS
8.5 Hierarchical methods
8.5.1 Agglomerative and divisive hierarchical clustering
8.5.2 BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
8.5.3 CURE: Clustering Using REpresentatives
8.5.4 CHAMELEON: A hierarchical clustering algorithm using dynamic modeling
8.6 Density-based methods
8.6.1 DBSCAN: A density-based clustering method based on connected regions with sufficiently high density
8.6.2 OPTICS: Ordering Points to Identify the Clustering Structure
8.6.3 DENCLUE: Clustering based on density distribution functions
8.7 Grid-based methods
8.7.1 STING: A STatistical INformation Grid approach
8.7.2 WaveCluster: Clustering using wavelet transformations
8.7.3 CLIQUE: Clustering high-dimensional space
8.8 Model-based clustering methods
8.8.1 Statistical approach
8.8.2 Neural network approach
8.9 Outlier analysis
8.9.1 Statistical-based outlier detection
8.9.2 Distance-based outlier detection
8.9.3 Deviation-based outlier detection
8.10 Summary

9 Mining Complex Types of Data
9.1 Multidimensional analysis and descriptive mining of complex data objects
9.1.1 Generalization of structured data
9.1.2 Aggregation and approximation in spatial and multimedia data generalization
9.1.3 Generalization of object identifiers and class/subclass hierarchies
9.1.4 Generalization of class composition hierarchies
9.1.5 Construction and mining of object cubes
9.1.6 Generalization-based mining of plan databases by divide-and-conquer
9.2 Mining spatial databases
9.2.1 Spatial data cube construction and spatial OLAP
9.2.2 Spatial association analysis
9.2.3 Spatial clustering methods
9.2.4 Spatial classification and spatial trend analysis
9.2.5 Mining raster databases
9.3 Mining multimedia databases
9.3.1 Similarity search in multimedia data
9.3.2 Multidimensional analysis of multimedia data
9.3.3 Classification and prediction analysis of multimedia data
9.3.4 Mining associations in multimedia data
9.4 Mining time-series and sequence data
9.4.1 Trend analysis
9.4.2 Similarity search in time-series analysis
9.4.3 Sequential pattern mining
9.4.4 Periodicity analysis
9.5 Mining text databases
9.5.1 Text data analysis and information retrieval
9.5.2 Text mining: keyword-based association and document classification
9.6 Mining the World-Wide Web
9.6.1 Mining the Web's link structures to identify authoritative Web pages
9.6.2 Automatic classification of Web documents
9.6.3 Construction of a multilayered Web information base
9.6.4 Web usage mining
9.7 Summary

10 Data Mining Applications and Trends in Data Mining
10.1 Data mining applications
10.1.1 Data mining for biomedical and DNA data analysis
10.1.2 Data mining for financial data analysis
10.1.3 Data mining for the retail industry
10.1.4 Data mining for the telecommunication industry
10.2 Data mining system products and research prototypes
10.2.1 How to choose a data mining system
10.2.2 Examples of commercial data mining systems
10.3 Additional themes on data mining
10.3.1 Visual and audio data mining
10.3.2 Scientific and statistical data mining
10.3.3 Theoretical foundations of data mining
10.3.4 Data mining and intelligent query answering
10.4 Social impacts of data mining
10.4.1 Is data mining a hype or a persistent, steadily growing business?
10.4.2 Is data mining merely managers' business or everyone's business?
10.4.3 Is data mining a threat to privacy and data security?
10.5 Trends in data mining
10.6 Summary

Appendix A An Introduction to Microsoft's OLE DB for Data Mining
Appendix B An Introduction to DBMiner
Bibliography

Product details

Edition: 1
Latest edition
Published: August 25, 2000
Language: English

About the authors

Jiawei Han

Jiawei Han is Professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign. Well known for his research in the areas of data mining and database systems, he has received many awards for his contributions in the field, including the 2004 ACM SIGKDD Innovations Award. He has served as Editor-in-Chief of ACM Transactions on Knowledge Discovery from Data, and on editorial boards of several journals, including IEEE Transactions on Knowledge and Data Engineering and Data Mining and Knowledge Discovery.

Affiliations and expertise

Professor, Department of Computer Science, University of Illinois, Urbana-Champaign, USA

Micheline Kamber

Micheline Kamber is a researcher with a passion for writing in easy-to-understand terms. She has a master's degree in computer science (specializing in artificial intelligence) from Concordia University, Canada.

Affiliations and expertise

Simon Fraser University, Burnaby, Canada
