Preface

1 Statistical Learning as a Regression Problem
  1.1 Getting Started
  1.2 Setting the Regression Context
  1.3 The Transition to Statistical Learning
    1.3.1 Some Goals of Statistical Learning
    1.3.2 Statistical Inference
    1.3.3 Some Initial Cautions
    1.3.4 A Cartoon Illustration
    1.3.5 A Taste of Things to Come
  1.4 Some Initial Concepts and Definitions
    1.4.1 Overall Goals
    1.4.2 Loss Functions and Related Concepts
    1.4.3 Linear Estimators
    1.4.4 Degrees of Freedom
    1.4.5 Model Evaluation
    1.4.6 Model Selection
    1.4.7 Basis Functions
  1.5 Some Common Themes
  1.6 Summary and Conclusions

2 Regression Splines and Regression Smoothers
  2.1 Introduction
  2.2 Regression Splines
    2.2.1 Applying a Piecewise Linear Basis
    2.2.2 Polynomial Regression Splines
    2.2.3 Natural Cubic Splines
    2.2.4 B-Splines
  2.3 Penalized Smoothing
    2.3.1 Shrinkage
    2.3.2 Shrinkage and Statistical Inference
    2.3.3 Shrinkage: So What?
  2.4 Smoothing Splines
    2.4.1 An Illustration
  2.5 Locally Weighted Regression as a Smoother
    2.5.1 Nearest Neighbor Methods
    2.5.2 Locally Weighted Regression
  2.6 Smoothers for Multiple Predictors
    2.6.1 Smoothing in Two Dimensions
    2.6.2 The Generalized Additive Model
  2.7 Smoothers with Categorical Variables
    2.7.1 An Illustration
  2.8 Locally Adaptive Smoothers
  2.9 The Role of Statistical Inference
    2.9.1 Some Apparent Prerequisites
    2.9.2 Confidence Intervals
    2.9.3 Statistical Tests
    2.9.4 Can Asymptotics Help?
  2.10 Software Issues
  2.11 Summary and Conclusions

3 Classification and Regression Trees (CART)
  3.1 Introduction
  3.2 An Overview of Recursive Partitioning with CART
    3.2.1 Tree Diagrams
    3.2.2 Classification and Forecasting with CART
    3.2.3 Confusion Tables
    3.2.4 CART as an Adaptive Nearest Neighbor Method
    3.2.5 What CART Needs to Do
  3.3 Splitting a Node
  3.4 More on Classification
    3.4.1 Fitted Values and Related Terms
    3.4.2 An Example
  3.5 Classification Errors and Costs
    3.5.1 Default Costs in CART
    3.5.2 Prior Probabilities and Costs
  3.6 Pruning
    3.6.1 Impurity Versus Rα(T)
  3.7 Missing Data
    3.7.1 Missing Data with CART
  3.8 Statistical Inference with CART
  3.9 Classification Versus Forecasting
  3.10 Varying the Prior, Costs, and the Complexity Penalty
  3.11 An Example with Three Response Categories
  3.12 CART with Highly Skewed Response Distributions
  3.13 Some Cautions in Interpreting CART Results
    3.13.1 Model Bias
    3.13.2 Model Variance
  3.14 Regression Trees
    3.14.1 An Illustration
    3.14.2 Some Extensions
    3.14.3 Multivariate Adaptive Regression Splines (MARS)
  3.15 Software Issues
  3.16 Summary and Conclusions

4 Bagging
  4.1 Introduction
  4.2 Overfitting and Cross-Validation
  4.3 Bagging as an Algorithm
    4.3.1 Margins
    4.3.2 Out-Of-Bag Observations
  4.4 Some Thinking on Why Bagging Works
    4.4.1 More on Instability in CART
    4.4.2 How Bagging Can Help
    4.4.3 A Somewhat More Formal Explanation
  4.5 Some Limitations of Bagging
    4.5.1 Sometimes Bagging Does Not Help
    4.5.2 Sometimes Bagging Can Make the Bias Worse
    4.5.3 Sometimes Bagging Can Make the Variance Worse
    4.5.4 Losing the Trees for the Forest
    4.5.5 Bagging Is Only an Algorithm
  4.6 An Example
  4.7 Bagging a Quantitative Response Variable
  4.8 Software Considerations
  4.9 Summary and Conclusions

5 Random Forests
  5.1 Introduction and Overview
    5.1.1 Unpacking How Random Forests Works
  5.2 An Initial Illustration
  5.3 A Few Formalities
    5.3.1 What Is a Random Forest?
    5.3.2 Margins and Generalization Error for Classifiers in General
    5.3.3 Generalization Error for Random Forests
    5.3.4 The Strength of a Random Forest
    5.3.5 Dependence
    5.3.6 Implications
  5.4 Random Forests and Adaptive Nearest Neighbor Methods
  5.5 Taking Costs into Account in Random Forests
    5.5.1 A Brief Illustration
  5.6 Determining the Importance of the Predictors
    5.6.1 Contributions to the Fit
    5.6.2 Contributions to Forecasting Skill
  5.7 Response Functions
    5.7.1 An Example
  5.8 The Proximity Matrix
    5.8.1 Clustering by Proximity Values
    5.8.2 Using Proximity Values to Impute Missing Data
    5.8.3 Using Proximities to Detect Outliers
  5.9 Quantitative Response Variables
  5.10 Tuning Parameters
  5.11 An Illustration Using a Binary Response Variable
  5.12 An Illustration Using a Quantitative Response Variable
  5.13 Software Considerations
  5.14 Summary and Conclusions
    5.14.1 Problem Set 1
    5.14.2 Problem Set 2
    5.14.3 Problem Set 3

6 Boosting
  6.1 Introduction
  6.2 Adaboost
    6.2.1 A Toy Numerical Example of Adaboost
    6.2.2 A Statistical Perspective on Adaboost
  6.3 Why Does Adaboost Work So Well?
    6.3.1 Least Angle Regression (LARS)
  6.4 Stochastic Gradient Boosting
    6.4.1 Tuning Parameters
    6.4.2 Output
  6.5 Some Problems and Some Possible Solutions
    6.5.1 Some Potential Problems
    6.5.2 Some Potential Solutions
  6.6 Some Examples
    6.6.1 A Garden Variety Data Analysis
    6.6.2 Inmate Misconduct Again
    6.6.3 Homicides and the Impact of Executions
    6.6.4 Imputing the Number of Homeless
    6.6.5 Estimating Conditional Probabilities
  6.7 Software Considerations
  6.8 Summary and Conclusions

7 Support Vector Machines
  7.1 A Simple Didactic Illustration
  7.2 Support Vector Machines in Pictures
    7.2.1 Support Vector Classifiers
    7.2.2 Support Vector Machines
  7.3 Support Vector Machines in Statistical Notation
    7.3.1 Support Vector Classifiers
    7.3.2 Support Vector Machines
    7.3.3 SVM for Regression
  7.4 A Classification Example
    7.4.1 SVM Analysis with a Linear Kernel
    7.4.2 SVM Analysis with a Radial Kernel
    7.4.3 Varying Tuning Parameters
    7.4.4 Taking the Costs of Classification Errors into Account
    7.4.5 Comparisons to Logistic Regression
  7.5 Software Considerations
  7.6 Summary and Conclusions

8 Broader Implications and a Bit of Craft Lore
  8.1 Some Fundamental Limitations of Statistical Learning
  8.2 Some Assets of Statistical Learning
    8.2.1 The Attitude Adjustment
    8.2.2 Selectively Better Performance
    8.2.3 Improving Other Procedures
  8.3 Some Practical Suggestions
    8.3.1 Matching Tools to Jobs
    8.3.2 Getting to Know Your Software
    8.3.3 Not Forgetting the Basics
    8.3.4 Getting Good Data
    8.3.5 Being Sensitive to Overtuning
    8.3.6 Matching Your Goals to What You Can Credibly Do
  8.4 Some Concluding Observations

References
Index