## Annotated List of Exercises for Statistical Modeling: A Fresh Approach

June 13, 2010

At the start of the term:

### 1 Chapter 1: Statistical Models

This is an introductory chapter that sketches out broad concepts.

An analogy between models and scientific “laws”: 1.10

#### 1.1 Computing: Introduction to R

• General R syntax: 1.2, 1.3, 1.4, 1.5, 1.6
• R syntax for functions (not essential): 1.7 [CURLY BRACKETS need fixing]
• Simple computations in R: 1.1
• Computer numbers. Scientific notation: 1.8, 1.9; machine precision: 1.11

### 2 Chapter 2: Data

#### 2.1 Drills and Exercises

• Standard case-variable format for data: 2.8
• Distinguishing between variables and cases: 2.1, 2.4, 2.9
• Levels and coding: 2.5, 2.14
• Categorical vs quantitative variables, with an explanation of ordinal, interval, and ratio: 2.10
• Bias in sampling: 2.7. An example about investment funds: 2.11. An example about labor statistics: 2.12.
• Creating small data sets: 2.15

#### 2.2 In-Class Activities

• Bias in sampling: picking a random interval: 2.13, and a similar example involving choosing library books off a shelf (an R simulation): [Get prob from 2009-11-24 library-books.tex and select-books.r]

#### 2.3 Computing in R

• Counting (using table) and setting conditions: 2.2
• Basic computations on data frames: 2.3
• Creating CSV files: 2.csv
• General R syntax and errors: 2.6

### 3 Chapter 3: Describing Variation

#### 3.1 Drills and Exercises

• “Guesstimating” centers and spreads of everyday quantities: 3.23, 3.25
• Estimation by eye of simple descriptive statistics. From graphs: 3.18. From a table of quintiles: 3.24.
• Computing percentiles and coverage intervals: 3.1
• Appropriate displays for different variable types: 3.13, 3.31, 3.32
• Boxplots: 3.9. Outliers and the whiskers: 3.14. Interpretation: 3.17.
• Comparing different forms of displays of distributions: 3.10a, 3.10b, 3.10c, 3.16
• Density plots and areas: 3.15
• Bias: 3.8
• Hand calculation of simple descriptive statistics: 3.11, 3.12
• Relating the different descriptors to one another: 3.53
• Body temperature: 3.50
• Mutual fund growth and sampling bias: 3.51

#### 3.2 Statistical Practice

• A simple introductory experiment involving elevators: 3.36 [Note to instructors: This is useful for getting students to think about experiments and for having them enter data into a spreadsheet. They will do this badly, making no clear distinction between cases and variables and using the spreadsheet as if it were a tablecloth. This is an opportunity to put the students on course. In class, I go over the spreadsheets that the students submit and point out the ways in which they violate the case-variable paradigm.]
• Where “average” is misleading: 3.29
• “Outliers” and skew distributions: the log transform: 3.30
• Relationship between the 5-number summary and the distribution: 3.33

#### 3.3 Computing

• Basic operations on data frames and variables: 3.2, 3.3, 3.4
• Calculating descriptive statistics: 3.5
• Calculating percentiles: the p and q operators in R: 3.54 [Note to instructors: The ISM software adds two operators, pdata and qdata, to R. These are analogous to the operators on probability distributions, e.g., pnorm and qnorm. You could, of course, use quantile from the base R distribution to compute quantiles. But I don’t see any reason to give a completely different name, e.g., quantile, to an operator that is conceptually related to another set of operators like qnorm.]
• Outliers and the 1.5 IQR rule of thumb: 3.6
• Extracting subsets of data: 3.7
• Drawing graphics. These relate to the speeders.CV data: 3.21, 3.22, 3.26, 3.27
• Drawing compound graphics (e.g., adding more than one graphic to a plot): 3.35
• Programming project (for students who want to write functions in R): 3.34
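
The percentile and outlier computations listed above can be sketched in base R. This is only an illustration on a made-up numeric vector `x`; the ISM operators pdata and qdata are deliberately not called, since their signatures are not given here, so `ecdf` and `quantile` stand in for them.

```r
# Hypothetical sample; any numeric vector would do.
set.seed(1)
x <- rnorm(200, mean = 100, sd = 15)

# For a probability model, pnorm and qnorm are inverses of each other.
p <- pnorm(115, mean = 100, sd = 15)   # value -> cumulative probability
q <- qnorm(p, mean = 100, sd = 15)     # cumulative probability -> value

# Data analogues in base R: ecdf() plays the "p" role,
# quantile() plays the "q" role.
p_data <- ecdf(x)(115)                 # fraction of cases at or below 115
q_data <- quantile(x, probs = 0.75)    # 75th percentile of the sample

# The 1.5 IQR rule of thumb for flagging possible outliers.
quartiles <- quantile(x, probs = c(0.25, 0.75))
fence <- 1.5 * IQR(x)
outliers <- x[x < quartiles[1] - fence | x > quartiles[2] + fence]
```

The point of the pairing is that the "p" direction takes a value and returns a fraction, while the "q" direction takes a fraction and returns a value.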

#### 3.4 Quick Quiz

3.19, 3.20 [Need to modify]

### 4 Chapter 4: The Language of Models

#### 4.1 Exercises

• Identifying response and explanatory variables: 4.1, 4.2
• Model design and the graphical “shape” of models: 4.3, 4.4; in the context of educational testing: 4.6
• Estimating coefficients by eye: 4.5

#### 4.2 Statistical Practice

• Interpreting poll results: 4.7

#### 4.3 Elaborations

• Three-way interactions: 4.10
• Threshold regression: 4.11

#### 4.4 Computation in R

• Graphing model values: 4.graph
• Graphics involving a conditioning variable: 4.8

### 5 Chapter 5: Model Formulas and Coefficients

#### 5.1 Exercises

• Coefficients and model formulas: 5.1, 5.2
• Coefficients and grand and group-wise means: 5.3, 5.4, 5.17
• Model design and graphs of model values: 5.11, 5.12, 5.13
• Model values and residuals from graphs of models: 5.5
• Estimating coefficients from graphs of models: 5.8, 5.9
• Choosing the correct model design and interpreting model coefficients: 5.6, 5.14
• Units of coefficients: 5.7
• Calculating model values from coefficients: 5.10

#### 5.2 Statistical Practice

• Project: Used-car prices: 5.15
• Translating a news report on drinking behavior into modeling terms: 5.16

### 6 Chapter 6: Fitting Models to Data

#### 6.1 Exercises

• Computing sums of squares of residuals: 6.2, 6.3
• Sizes of residuals: 6.4
• Overall review: 6.1

#### 6.2 Statistical Practice

• Finding the speed of sound: 6.5 (a statistical issue is whether to include an intercept in the model)
• Looking for patterns in residuals: 6.7

#### 6.3 Elaboration

• Coming soon: Elasticities and proportions (using logs), dropping the units (using standardized data and ranks).

### 7 Chapter 7: Measuring Correlation

Quick Quiz: 7.9

#### 7.1 Exercises

• Calculating R² as a ratio of variances: 7.1
• Partitioning of variance by a model: 7.2
• Properties of R²: 7.4, 7.6, 7.7
• Nesting: 7.3, 7.8, 7.10
• “Bigger” models reduce the residuals: 7.5
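
A short sketch of the ideas in this list, using simulated data (the variables `x`, `y`, and `noise` are made up for illustration): R² computed as a ratio of variances, the partitioning of the response variance, and the way a nested, "bigger" model cannot have a larger sum of squared residuals.

```r
set.seed(2)
n <- 50
x <- runif(n)
y <- 3 + 2 * x + rnorm(n)
noise <- rnorm(n)              # an explanatory variable unrelated to y

small <- lm(y ~ x)
big   <- lm(y ~ x + noise)     # "small" is nested in "big"

# R-squared as the ratio of the variance of the fitted values
# to the variance of the response.
r2_ratio  <- var(fitted(small)) / var(y)
r2_report <- summary(small)$r.squared

# With an intercept in the model, the variance partitions exactly.
var_parts <- var(fitted(small)) + var(resid(small))

# The bigger model never has a larger sum of squared residuals.
ssr_small <- sum(resid(small)^2)
ssr_big   <- sum(resid(big)^2)
```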

#### 7.2 Statistical Practice

• The population of US Congressional districts: 7.11
• Is larger R² better? 7.12

#### 7.3 Elaboration

• R² and the intercept term: 7.13

#### 7.4 In-Class Activity

• Constructing an exactly redundant model term (exact multicollinearity) and exploring the consequences of allowing such redundancy in a model: 7.14
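
One way to set up such an activity is sketched below. The data are simulated and the variable names are hypothetical; this is not necessarily how exercise 7.14 itself is constructed.

```r
set.seed(3)
n <- 30
a <- rnorm(n)
b <- rnorm(n)
redundant <- 2 * a - 3 * b     # an exact linear combination of a and b
y <- 1 + a + b + rnorm(n)

with_redundancy    <- lm(y ~ a + b + redundant)
without_redundancy <- lm(y ~ a + b)

# lm() detects the redundancy and reports NA for the aliased coefficient.
coef(with_redundancy)

# The fitted model values are unchanged by the redundant term.
same_fit <- all.equal(fitted(with_redundancy), fitted(without_redundancy))
```

Students can then change the order of the terms in the formula and see that a different coefficient becomes NA while the fitted values stay the same.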

### 8 Chapter 8: Total and Partial Relationships

#### 8.1 Exercises

• Computing a partial change: 8.1. In the context of bond ratings and interest rates, 8.5. In the context of used-car prices, using contour plots to display the response and two explanatory variables: 8.10
• Total and partial change in poll results: 8.8
• Simpson’s paradox: 8.4. An example from economics and the Phillips Curve: 8.3
• How the model design relates to what is being compared to what, in the context of educational assessment: 8.12
• Interpreting a newspaper description of the relationship between earnings of college graduates and the eliteness of the school they go to: 8.13

#### 8.2 Statistical Practice

• Whether to look at a partial change or a total change depends on the question being asked: 8.2
• An example relating to health-care expenditures and outcomes: 8.9

### 9 Chapter 9: Model Vectors

Graph paper and protractor/rulers at the same scale: http://www.macalester.edu/~kaplan/ISM/graph-paper.pdf. You can print out the first page on transparency paper to produce protractors for a class.

#### 9.1 Exercises

• Linear combinations (adding vectors): 9.1, 9.2
• Orthogonality: 9.3, 9.4, 9.12
• Computing square length, dot product, dimension: 9.6; computing angles with a dot product: 9.8
• Translating model terms to model vectors: 9.7 (This is a very long exercise and ought to be divided into parts.), 9.9
• Angles between vectors: 9.10. Cosine of important angles: 9.5
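
The vector computations in this list take only a few lines of R. The vectors below are arbitrary examples chosen so that the answers are easy to check by hand.

```r
u <- c(3, 4)
v <- c(4, -3)
w <- c(1, 1)
e <- c(1, 0)

# Square length and length of a vector.
sq_len_u <- sum(u * u)         # 3^2 + 4^2
len_u    <- sqrt(sq_len_u)

# Dot product; a dot product of zero means the vectors are orthogonal.
dot_uv <- sum(u * v)

# Angle between two vectors, in degrees, from the dot product.
angle_deg <- function(a, b) {
  cos_theta <- sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))
  acos(cos_theta) * 180 / pi
}
angle_uv <- angle_deg(u, v)    # orthogonal vectors
angle_we <- angle_deg(w, e)    # the diagonal versus a horizontal axis
```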

#### 9.2 Activities

• Measuring angles and lengths, comparing a ruler/protractor to the dot product: 9.8 (You will need to print out some protractors on transparency paper. A PDF file containing protractors and graph paper at the same scale is available at http://www.macalester.edu/~kaplan/ISM/graph-paper.pdf. When the problem is being displayed, the browser window can be sized so that the graph matches the scale on the ruler/protractor.)

### 10 Chapter 10: Statistical Geometry

#### 10.1 Exercises

• Properties of case space vs variable space: 10.1
• Projection and extraction of the coefficient: 10.2, 10.7
• Properties of the model triangle: 10.3, 10.4, 10.5; and R²: 10.6, 10.8

#### 10.2 Elaborations

• The intercept and the sum of residuals: 10.9
• Projections, algebraically (via the dot product): 10.11, 10.12

#### 10.3 Activities

• Points in case and in variable space: 10.10. This exercise helps students to see that case and variable space provide different representations of the same data.

### 11 Chapter 11: Geometry with Multiple Vectors

#### 11.1 Exercises

• Showing how all space can be reached by a suitable combination: 11.2
• Computing the relationship between the response, the fitted model values, and the residuals: 11.4
• Translating a description into a vector diagram: 11.5, 11.6

#### 11.2 Activities

• Fitting by hand compared to fitting by software: 11.13 (The second part of this, involving R², is a little more advanced.)
• Linear combinations: 11.7
• Contrasting fitting in case and variable space: 11.8, 11.9, 11.11
• Sum of squares of residuals: 11.10, 11.12
• Visually fitting models using the vsolve “grow-that-vector” software. Instructions: 11.vsolve; in 2 dimensions: 11.14, redundancy: 11.15; in 3 dimensions: 11.16, redundancy: 11.17

### 12 Chapter 12: Modeling Randomness

#### 12.1 Exercises

Quick Quiz: 12.20

Fun to show that intuition doesn’t correctly represent joint probabilities: 12.21 from Kahneman and Tversky

### 13 Chapter 13: Geometry of Random Vectors

#### 13.2 Activities

REDUNDANCY EXERCISE FROM 2009-10-08, s2009-29?

### 14 Chapter 14: Confidence in Models

#### 14.1 Exercises

• Vocabulary of confidence intervals: 14.1
• How the margin of error depends on the sample size n: 14.4
• Coverage: 14.5
• Constructing a confidence interval on a coefficient from a regression report: 14.6, 14.7
• How the margin of error depends on the confidence level: qualitatively 14.3, quantitatively 14.2

#### 14.2 Statistical Practice

• Error bars and confidence intervals: 14.12 (reflecting the diverse practices in the scientific literature). [More problems from Trish and from those medical reports downloaded from NYT in late Dec. 2009.]
• Contrasting a confidence interval on a sample mean with a coverage interval on a distribution of individuals: in the context of a claim about weight loss 14.13; an extension of this using the actual data from the weight loss clinic 14.14
• How many digits to report? 14.9
• How a measurement that’s not reliable for an individual can be reliable for a group: 14.10
• Accuracy versus precision: 14.11
• Finding a confidence interval using probability models: an example from Darwin: 14.22.
• Finding a confidence interval on a calculated estimate: an example from Robert Hooke’s 17th century observations: 14.23
• Fitting a power law. Context: Wind power generation: 14.24
• PLANNED: Using the 0-1 encoding to get standard errors on sample proportions and on differences of proportions. Comparing this to the Wald interval via a formula. (It’s useful to have the formula when all you have is the reported sample proportion and sample size.)
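
The comparison described in the planned item above might look like the following sketch. It assumes simulated 0-1 data with a made-up true proportion; the only thing it is meant to show is how the standard error from a model fitted to the 0-1 encoding relates to the Wald formula.

```r
set.seed(4)
n <- 400
y <- rbinom(n, size = 1, prob = 0.3)   # hypothetical 0-1 encoded outcome
p_hat <- mean(y)

# Standard error from fitting the all-cases-the-same model to the 0-1 variable.
mod <- lm(y ~ 1)
se_model <- summary(mod)$coefficients[1, "Std. Error"]

# Standard error from the Wald formula, which needs only p_hat and n.
se_wald <- sqrt(p_hat * (1 - p_hat) / n)

# The two differ only by a factor of sqrt(n / (n - 1)),
# which is negligible unless n is small.
ratio <- se_model / se_wald

# 95% Wald confidence interval on the proportion.
ci_wald <- p_hat + c(-1, 1) * qnorm(0.975) * se_wald
```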

#### 14.3 Computation in R

• Constructing sampling distributions on the sample mean and sample standard deviation: 14.19
• Bootstrapping the standard error of the mean: 14.8
• Bootstrapping other simple statistics (median, sd, 75th percentile): 14.17
• A simulation for exploring the effects of sample size, collinearity, and residual size on the standard error: 14.21
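
A minimal sketch of bootstrapping the standard error of the mean, written with base R only. The sample `x` is simulated here; in the exercises it would come from a data set.

```r
set.seed(5)
x <- rexp(100, rate = 1 / 10)          # hypothetical skewed sample

# Resample the cases with replacement many times,
# computing the mean of each resample.
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))

# The spread of the resampled means estimates the standard error.
se_boot <- sd(boot_means)

# For comparison, the usual formula for the standard error of the mean.
se_formula <- sd(x) / sqrt(length(x))
```

Replacing `mean` with `median`, `sd`, or a quantile gives the bootstrap for other simple statistics, for which no equally simple formula is available.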

#### 14.4 Elaboration

• Geometry: Why collinearity among explanatory vectors will increase standard errors: 14.20. PLANNED: A new version of this that uses a simpler diagram. With the current diagram, it helps to walk students through the problem as an activity, showing them what the coefficients would be for various locations of variable A, and explaining why the contour lines describe the values of the coefficients.
• Confidence intervals on a sample proportion: Comparing the Wald and modified Wald methods: 14.18

#### 14.5 In-class Activities

• The standard error of the mean versus sample size n: 14.16
• Coverage exercise
• How sampling distributions depend on the sample size: 14.19

### 15 Chapter 15: The Logic of Hypothesis Testing

#### 15.1 Exercises

• Outcomes of a hypothesis test: 15.1

#### 15.2 Statistical Practice

• Significance and power estimated from conditional sampling distribution graphs: 15.3, 15.4, 15.6
• Settings for hypothesis tests: presidential elections 15.2 (This is implicitly about multiple tests.)
• Calculating p-values given the form of the sampling distribution: binomial 15.7, Poisson 15.5
• Power and the alternative hypothesis: 15.8

#### 15.3 Elaborations

• Are the heights of husbands and wives related? 15.9 (This problem also deals with the question of the unit of analysis and “independent” measurements.)

#### 15.4 In-Class Activities

• The sampling distribution of R² and F under the null hypothesis: 15.10 (This can also be assigned as homework, but it helps to be able to walk students through it in class.)

### 16 Chapter 16: Hypothesis Testing on Whole Models

#### 16.1 Exercises

• Meaning of the p-value: 16.1
• Testing the mean: 16.3

#### 16.2 Statistical Practice

• A case study (about zebra mussels) that examines how covariates can be useful by eating variance: 16.8
• Interpreting models: 16.2
• Adjusting p-values: 16.4, 16.11

#### 16.3 Elaborations

• Mean and variance as outputs from modeling: 16.7
• Consequences of a larger sample size: 16.5
• t and F distributions: 16.6
• Testing the Bonferroni correction: 16.12
• Hypothesis testing and units: 16.13
• False discovery rate and q-values in microarray analysis (in genetics): 16.14

#### 16.4 In-Class Activities

• A demonstration of shuffling and how it implements the null hypothesis: 16.15
• Working through an ANOVA calculation by hand: 16.9
• How random terms affect R²: 16.10
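
The shuffling demonstration can be sketched in a few lines of base R. The data here are simulated, and this is only one possible version of the demonstration, not necessarily the one in 16.15. Shuffling the explanatory variable breaks any connection between it and the response, so each shuffled fit is a draw from a world where the null hypothesis is true.

```r
set.seed(6)
n <- 40
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)        # hypothetical data with a real relationship

# R-squared from the actual data.
r2_obs <- summary(lm(y ~ x))$r.squared

# R-squared from many fits in which x has been shuffled.
r2_null <- replicate(1000, summary(lm(y ~ sample(x)))$r.squared)

# The p-value: how often the shuffled fits do at least as well as the real one.
p_value <- mean(r2_null >= r2_obs)
```

A histogram of `r2_null` with `r2_obs` marked on it makes the logic of the test visible.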

### 17 Chapter 17: Hypothesis Testing on Parts of Models

#### 17.1 Exercises

• The structure of an ANOVA table: 17.2
• Degrees of freedom: 17.83H, 17.42H [*edit*]. The intercept: 17.516H [*edit*].
• Null hypotheses for the different coefficients in a model: 17.1
• PLANNED: Is there evidence for the short-term and long-term Phillips curve hypotheses? (See 8.3 F2006/inflation and analyze the data there.)

#### 17.3 Statistical Practice

• Interpretation of ANOVA tables: 17.6, 17.9; order dependence: 17.10; ecology: 17.12
• Significant versus substantial: 17.7
• Covariates and nesting: 17.11, [*edit*] There’s more variance structure in this model, so the standard error is wrong. 17.14.
• Randomization, the null hypothesis, and p-values: 17.15
• Analysis/critique of a news report: School test scores 17.19

#### 17.4 Elaborations

• ANOVA combining multiple model terms: 17.16
• Sampling bias in survival studies (with a simulation): 17.18

### 18 Chapter 18: Models of Yes/No Variables

#### 18.1 Exercises

• Link values and probability values. An abstract setting: 18.1. In the context of the space shuttle Challenger: 18.2.
• Deviance and degrees of freedom: 18.3.
• Likelihood: 18.7. An example of the fitting process for logistic models, etc.: 18.4.
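
The relationship between link values and probability values in a logistic model can be illustrated directly. The link values below are arbitrary; the sketch assumes the logistic link and checks the hand-written formulas against `plogis` and `qlogis` in base R.

```r
link <- c(-2, 0, 2)            # arbitrary link values (log odds)

# Link value -> probability, using the logistic transformation.
prob <- 1 / (1 + exp(-link))

# Probability -> link value, using the log of the odds.
link_back <- log(prob / (1 - prob))

# Base R provides both directions as plogis() and qlogis().
prob_builtin <- plogis(link)
link_builtin <- qlogis(prob)
```

A link value of 0 corresponds to a probability of one half, and equal steps in the link value correspond to unequal steps in probability.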

#### 18.2 Statistical Practice

• Calculating odds ratios using results from the National Osteoporosis Risk Assessment: 18.5.
• Comparing the effects of explanatory variables measured on different scales: 18.6.
• COMING: Modeling whether a driver gets a ticket or a warning for speeding.

### 21 Review and Exam Problems

These problems combine materials from multiple chapters in a manner suitable for exams and other reviews.

INSTRUCTORS: A larger set of review and exam problems is available. Contact kaplan@macalester.edu.

#### 21.1 R-Quiz

A quiz on basic operations in R. This is helpful in getting students to memorize the basic commands so that they can use them more fluently.

Quiz Study Guide: R-quiz-study-guide

The quiz itself is mainly a reprise of the study guide. To obtain it, contact kaplan@macalester.edu.

Mid.3

#### 21.3 End of Semester Review

A few problems .... Others are available to instructors upon request.

• Knowing what kind of numbers statistical values will be: Rev.14
• Skin temperature: Rev.5
• Heating degree days: Rev.4
• Based on the Netflix data set: Rev.2

### 22 General Elaborations

• Graphics. Re-ordering categorical variables for displays: 4.9.