Introduction to Data Mining

王朝百科 · Anonymous author · 2010-05-11


Introduction to Data Mining (English edition). Authors: Pang-Ning Tan, Michael Steinbach, Vipin Kumar (USA)

Publisher: 人民邮电出版社 (Posts & Telecom Press)

Publication date: 2006-01-01

Word count: 713,000

Edition: 1

Pages: 516

Paper: offset

ISBN: 9787115141446

Binding: paperback

Category: Books >> Computers/Networking >> Databases >> Data Warehousing and Data Mining

List price: ¥59.00

Editorial Review

"This is a brand-new data mining textbook and deserves a strong recommendation."

——Jiawei Han, Professor, University of Illinois

This book provides a comprehensive introduction to data mining, covering five topics: data, classification, association analysis, clustering, and anomaly detection. Except for anomaly detection, each topic spans two chapters: the first covers basic concepts, representative algorithms, and evaluation techniques, while the second discusses advanced concepts and algorithms. Readers thus gain a thorough understanding of the fundamentals of data mining while also being introduced to important advanced topics.

The book serves as the data mining textbook at the University of Minnesota and Michigan State University. Thanks to its distinctive approach, it was adopted by many well-known universities, including Stanford University and the University of Texas at Austin, even before its official publication.

Features

· Unlike many comparable books, this one focuses on how data mining knowledge can be used to solve a wide range of practical problems.

· It requires minimal prerequisites: no database background, and only a little statistics or mathematics.

· It contains numerous figures, comprehensive examples, and a rich set of exercises, and it uses examples, concise descriptions of key algorithms, and exercises to focus as directly as possible on the main concepts of data mining.

· It offers extensive instructor support, including lecture slides, suggested student projects, data mining resources (such as data mining algorithms and data sets), and online tutorials that use real data sets and data analysis software to walk through some of the data mining techniques covered in the book.

· Solutions to the exercises are provided to instructors who adopt the book as a textbook.

Synopsis

This book provides a comprehensive introduction to data mining and aims to equip readers with the knowledge needed to apply data mining to real problems. It covers five topics: data, classification, association analysis, clustering, and anomaly detection. Except for anomaly detection, each topic spans two chapters: the first presents basic concepts, representative algorithms, and evaluation techniques, while the second explores advanced concepts and algorithms in greater depth. The goal is to give readers a thorough grounding in the fundamentals of data mining while also exposing them to important advanced topics. The book additionally provides a wealth of examples, figures, and exercises.

The book is suitable as a textbook for upper-level undergraduate and graduate data mining courses in related disciplines, and as a reference for practitioners engaged in data mining research and application development.

About the Authors

Pang-Ning Tan is an assistant professor in the Department of Computer Science and Engineering at Michigan State University, where he teaches courses on data mining and database systems. Previously, he was a research associate at the Army High Performance Computing Research Center at the University of Minnesota (2002-2003).

Michael Steinbach is a research associate and PhD candidate in the Department of Computer Science and Engineering at the University of Minnesota.

Vipin Kumar is the head of the Department of Computer Science and Engineering at the University of Minnesota and formerly served as director of the Army High Performance Computing Research Center. He holds a PhD from the University of Maryland, is an internationally recognized authority on data mining and high-performance computing, and is an IEEE Fellow.

Table of Contents

1 Introduction
1.1 What Is Data Mining?
1.2 Motivating Challenges
1.3 The Origins of Data Mining
1.4 Data Mining Tasks
1.5 Scope and Organization of the Book
1.6 Bibliographic Notes
1.7 Exercises
2 Data
2.1 Types of Data
2.1.1 Attributes and Measurement
2.1.2 Types of Data Sets
2.2 Data Quality
2.2.1 Measurement and Data Collection Issues
2.2.2 Issues Related to Applications
2.3 Data Preprocessing
2.3.1 Aggregation
2.3.2 Sampling
2.3.3 Dimensionality Reduction
2.3.4 Feature Subset Selection
2.3.5 Feature Creation
2.3.6 Discretization and Binarization
2.3.7 Variable Transformation
2.4 Measures of Similarity and Dissimilarity
2.4.1 Basics
2.4.2 Similarity and Dissimilarity between Simple Attributes
2.4.3 Dissimilarities between Data Objects
2.4.4 Similarities between Data Objects
2.4.5 Examples of Proximity Measures
2.4.6 Issues in Proximity Calculation
2.4.7 Selecting the Right Proximity Measure
2.5 Bibliographic Notes
2.6 Exercises
3 Exploring Data
3.1 The Iris Data Set
3.2 Summary Statistics
3.2.1 Frequencies and the Mode
3.2.2 Percentiles
3.2.3 Measures of Location: Mean and Median
3.2.4 Measures of Spread: Range and Variance
3.2.5 Multivariate Summary Statistics
3.2.6 Other Ways to Summarize the Data
3.3 Visualization
3.3.1 Motivations for Visualization
3.3.2 General Concepts
3.3.3 Techniques
3.3.4 Visualizing Higher-Dimensional Data
3.3.5 Do's and Don'ts
3.4 OLAP and Multidimensional Data Analysis
3.4.1 Representing Iris Data as a Multidimensional Array
3.4.2 Multidimensional Data: The General Case
3.4.3 Analyzing Multidimensional Data
3.4.4 Final Comments on Multidimensional Data Analysis
3.5 Bibliographic Notes
3.6 Exercises
4 Classification: Basic Concepts, Decision Trees, and Model Evaluation
4.1 Preliminaries
4.2 General Approach to Solving a Classification Problem
4.3 Decision Tree Induction
4.3.1 How a Decision Tree Works
4.3.2 How to Build a Decision Tree
4.3.3 Methods for Expressing Attribute Test Conditions
4.3.4 Measures for Selecting the Best Split
4.3.5 Algorithm for Decision Tree Induction
4.3.6 An Example: Web Robot Detection
4.3.7 Characteristics of Decision Tree Induction
4.4 Model Overfitting
4.4.1 Overfitting Due to Presence of Noise
4.4.2 Overfitting Due to Lack of Representative Samples
4.4.3 Overfitting and the Multiple Comparison Procedure
4.4.4 Estimation of Generalization Errors
4.4.5 Handling Overfitting in Decision Tree Induction
4.5 Evaluating the Performance of a Classifier
4.5.1 Holdout Method
4.5.2 Random Subsampling
4.5.3 Cross-Validation
4.5.4 Bootstrap
4.6 Methods for Comparing Classifiers
4.6.1 Estimating a Confidence Interval for Accuracy
4.6.2 Comparing the Performance of Two Models
4.6.3 Comparing the Performance of Two Classifiers
4.7 Bibliographic Notes
4.8 Exercises
5 Classification: Alternative Techniques
5.1 Rule-Based Classifier
5.1.1 How a Rule-Based Classifier Works
5.1.2 Rule-Ordering Schemes
5.1.3 How to Build a Rule-Based Classifier
5.1.4 Direct Methods for Rule Extraction
5.1.5 Indirect Methods for Rule Extraction
5.1.6 Characteristics of Rule-Based Classifiers
5.2 Nearest-Neighbor Classifiers
5.2.1 Algorithm
5.2.2 Characteristics of Nearest-Neighbor Classifiers
5.3 Bayesian Classifiers
5.3.1 Bayes Theorem
5.3.2 Using the Bayes Theorem for Classification
5.3.3 Naïve Bayes Classifier
5.3.4 Bayes Error Rate
5.3.5 Bayesian Belief Networks
5.4 Artificial Neural Network (ANN)
5.4.1 Perceptron
5.4.2 Multilayer Artificial Neural Network
5.4.3 Characteristics of ANN
5.5 Support Vector Machine (SVM)
5.5.1 Maximum Margin Hyperplanes
5.5.2 Linear SVM: Separable Case
5.5.3 Linear SVM: Nonseparable Case
5.5.4 Nonlinear SVM
5.5.5 Characteristics of SVM
5.6 Ensemble Methods
5.6.1 Rationale for Ensemble Method
5.6.2 Methods for Constructing an Ensemble Classifier
5.6.3 Bias-Variance Decomposition
5.6.4 Bagging
5.6.5 Boosting
5.6.6 Random Forests
5.6.7 Empirical Comparison among Ensemble Methods
5.7 Class Imbalance Problem
5.7.1 Alternative Metrics
5.7.2 The Receiver Operating Characteristic Curve
5.7.3 Cost-Sensitive Learning
5.7.4 Sampling-Based Approaches
5.8 Multiclass Problem
5.9 Bibliographic Notes
5.10 Exercises
6 Association Analysis: Basic Concepts and Algorithms
6.1 Problem Definition
6.2 Frequent Itemset Generation
6.2.1 The Apriori Principle
6.2.2 Frequent Itemset Generation in the Apriori Algorithm
6.2.3 Candidate Generation and Pruning
6.2.4 Support Counting
6.2.5 Computational Complexity
6.3 Rule Generation
6.3.1 Confidence-Based Pruning
6.3.2 Rule Generation in Apriori Algorithm
6.3.3 An Example: Congressional Voting Records
6.4 Compact Representation of Frequent Itemsets
6.4.1 Maximal Frequent Itemsets
6.4.2 Closed Frequent Itemsets
6.5 Alternative Methods for Generating Frequent Itemsets
6.6 FP-Growth Algorithm
6.6.1 FP-Tree Representation
6.6.2 Frequent Itemset Generation in FP-Growth Algorithm
6.7 Evaluation of Association Patterns
6.7.1 Objective Measures of Interestingness
6.7.2 Measures beyond Pairs of Binary Variables
6.7.3 Simpson's Paradox
6.8 Effect of Skewed Support Distribution
6.9 Bibliographic Notes
6.10 Exercises
7 Association Analysis: Advanced Concepts
7.1 Handling Categorical Attributes
7.2 Handling Continuous Attributes
7.2.1 Discretization-Based Methods
7.2.2 Statistics-Based Methods
7.2.3 Non-discretization Methods
7.3 Handling a Concept Hierarchy
7.4 Sequential Patterns
7.4.1 Problem Formulation
7.4.2 Sequential Pattern Discovery
7.4.3 Timing Constraints
7.4.4 Alternative Counting Schemes
7.5 Subgraph Patterns
7.5.1 Graphs and Subgraphs
7.5.2 Frequent Subgraph Mining
7.5.3 Apriori-like Method
7.5.4 Candidate Generation
7.5.5 Candidate Pruning
7.5.6 Support Counting
7.6 Infrequent Patterns
7.6.1 Negative Patterns
7.6.2 Negatively Correlated Patterns
7.6.3 Comparisons among Infrequent Patterns, Negative Patterns, and Negatively Correlated Patterns
7.6.4 Techniques for Mining Interesting Infrequent Patterns
7.6.5 Techniques Based on Mining Negative Patterns
7.6.6 Techniques Based on Support Expectation
7.7 Bibliographic Notes
7.8 Exercises
8 Cluster Analysis: Basic Concepts and Algorithms
8.1 Overview
8.1.1 What Is Cluster Analysis?
8.1.2 Different Types of Clusterings
8.1.3 Different Types of Clusters
8.2 K-means
8.2.1 The Basic K-means Algorithm
8.2.2 K-means: Additional Issues
8.2.3 Bisecting K-means
8.2.4 K-means and Different Types of Clusters
8.2.5 Strengths and Weaknesses
8.2.6 K-means as an Optimization Problem
8.3 Agglomerative Hierarchical Clustering
8.3.1 Basic Agglomerative Hierarchical Clustering Algorithm
8.3.2 Specific Techniques
8.3.3 The Lance-Williams Formula for Cluster Proximity
8.3.4 Key Issues in Hierarchical Clustering
8.3.5 Strengths and Weaknesses
8.4 DBSCAN
8.4.1 Traditional Density: Center-Based Approach
8.4.2 The DBSCAN Algorithm
8.4.3 Strengths and Weaknesses
8.5 Cluster Evaluation
8.5.1 Overview
8.5.2 Unsupervised Cluster Evaluation Using Cohesion and Separation
8.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix
8.5.4 Unsupervised Evaluation of Hierarchical Clustering
8.5.5 Determining the Correct Number of Clusters
8.5.6 Clustering Tendency
8.5.7 Supervised Measures of Cluster Validity
8.5.8 Assessing the Significance of Cluster Validity Measures
8.6 Bibliographic Notes
8.7 Exercises
9 Cluster Analysis: Additional Issues and Algorithms
9.1 Characteristics of Data, Clusters, and Clustering Algorithms
9.1.1 Example: Comparing K-means and DBSCAN
9.1.2 Data Characteristics
9.1.3 Cluster Characteristics
9.1.4 General Characteristics of Clustering Algorithms
9.2 Prototype-Based Clustering
9.2.1 Fuzzy Clustering
9.2.2 Clustering Using Mixture Models
9.2.3 Self-Organizing Maps (SOM)
9.3 Density-Based Clustering
9.3.1 Grid-Based Clustering
9.3.2 Subspace Clustering
9.3.3 DENCLUE: A Kernel-Based Scheme for Density-Based Clustering
9.4 Graph-Based Clustering
9.4.1 Sparsification
9.4.2 Minimum Spanning Tree (MST) Clustering
9.4.3 OPOSSUM: Optimal Partitioning of Sparse Similarities Using METIS
9.4.4 Chameleon: Hierarchical Clustering with Dynamic Modeling
9.4.5 Shared Nearest Neighbor Similarity
9.4.6 The Jarvis-Patrick Clustering Algorithm
9.4.7 SNN Density
9.4.8 SNN Density-Based Clustering
9.5 Scalable Clustering Algorithms
9.5.1 Scalability: General Issues and Approaches
9.5.2 BIRCH
9.5.3 CURE
9.6 Which Clustering Algorithm?
9.7 Bibliographic Notes
9.8 Exercises
10 Anomaly Detection
10.1 Preliminaries
10.1.1 Causes of Anomalies
10.1.2 Approaches to Anomaly Detection
10.1.3 The Use of Class Labels
10.1.4 Issues
10.2 Statistical Approaches
10.2.1 Detecting Outliers in a Univariate Normal Distribution
10.2.2 Outliers in a Multivariate Normal Distribution
10.2.3 A Mixture Model Approach for Anomaly Detection
10.2.4 Strengths and Weaknesses
10.3 Proximity-Based Outlier Detection
10.3.1 Strengths and Weaknesses
10.4 Density-Based Outlier Detection
10.4.1 Detection of Outliers Using Relative Density
10.4.2 Strengths and Weaknesses
10.5 Clustering-Based Techniques
10.5.1 Assessing the Extent to Which an Object Belongs to a Cluster
10.5.2 Impact of Outliers on the Initial Clustering
10.5.3 The Number of Clusters to Use
10.5.4 Strengths and Weaknesses
10.6 Bibliographic Notes
10.7 Exercises

 