Statistical Modeling: A Fresh Approach

February 3, 2010

1 Chapter 1: Statistical Models

1.1 Computing: Introduction to R

2 Chapter 2: Data

2.1 Drills and Exercises

2.2 In-Class Activities

2.3 Computing in R

3 Chapter 3: Describing Variation

3.1 Drills and Exercises

3.2 Statistical Practice

3.3 Computing

3.4 Quick Quiz

3.5 Class Activities

4 Chapter 4: The Language of Models

4.1 Exercises

4.2 Statistical Practice

4.3 Elaborations

4.4 Computation in R

5 Chapter 5: Model Formulas and Coefficients

5.1 Exercises

5.2 Statistical Practice

6 Chapter 6: Fitting Models to Data

6.1 Exercises

6.2 Statistical Practice

6.3 Elaboration

7 Chapter 7: Measuring Correlation

7.1 Exercises

7.2 Statistical Practice

7.3 Elaboration

7.4 In-Class Activity

8 Chapter 8: Total and Partial Relationships

8.1 Exercises

8.2 Statistical Practice

8.3 Elaborations

9 Chapter 9: Model Vectors

9.1 Exercises

9.2 Activities

10 Chapter 10: Statistical Geometry

10.1 Exercises

10.2 Elaborations

10.3 Activities

11 Chapter 11: Geometry with Multiple Vectors

11.1 Exercises

11.2 Activities

12 Chapter 12: Modeling Randomness

12.1 Exercises

13 Chapter 13: Geometry of Random Vectors

13.1 Exercises

13.2 Activities

14 Chapter 14: Confidence in Models

14.1 Exercises

14.2 Statistical Practice

14.3 Computation in R

14.4 Elaboration

14.5 In-class Activities

15 Chapter 15: The Logic of Hypothesis Testing

15.1 Exercises

15.2 Statistical Practice

15.3 Elaborations

15.4 In-Class Activities

16 Chapter 16: Hypothesis Testing on Whole Models

16.1 Exercises

16.2 Statistical Practice

16.3 Elaborations

16.4 In-Class Activities

17 Chapter 17: Hypothesis Testing on Parts of Models

17.1 Exercises

17.2 In-Class Activity

17.3 Statistical Practice

17.4 Elaborations

18 Chapter 18: Models of Yes/No Variables

18.1 Exercises

18.2 Statistical Practice

19 Chapter 19: Causation

20 Chapter 20: Experiment

21 Review and Exam Problems

21.1 R-Quiz

21.2 Mid-Semester Review

21.3 End of Semester Review

22 General Elaborations

At the start of the term:

- A “Knowledge Survey” pre-test: KnowledgeSurvey
- A concise reference sheet for R commands: http://www.macalester.edu/%7Ekaplan/ISM/r-commands.pdf
- Instructions for using ISM.Rdata prelimR
- Instructions for using AcroScore: HTML version prelim3
- Login link to AcroScore for students: http://datavis.math.macalester.edu:8080/AcroScore08
- Login link to AcroScore for INSTRUCTORS: http://datavis.math.macalester.edu:8080/AcroScore08/AcroScoreInstructor.html

This is an introductory chapter that sketches out broad concepts.

Chapter reading questions: ch1read

An analogy between models and scientific “laws”: 1.10,

- General R syntax: 1.2, 1.3, 1.4, 1.5, 1.6
- R syntax for functions (not essential) 1.7 [CURLY BRACKETS need fixing]
- Simple computations in R: 1.1,
- Computer numbers. Scientific notation 1.8, 1.9; machine precision 1.11,
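The points about scientific notation and machine precision can be shown at the R prompt. The following is a minimal sketch using only base R; the particular numbers are chosen for illustration and are not taken from the exercises.

```r
# Scientific notation: 1.5e3 means 1.5 times 10^3
x <- 1.5e3
x == 1500                                        # TRUE

# Machine precision: the gap between 1 and the next larger double
eps <- .Machine$double.eps

# Decimal fractions are stored approximately, so exact comparison can fail
sum_exact <- (0.1 + 0.2 == 0.3)                  # FALSE with IEEE 754 doubles
sum_close <- isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: compares within a tolerance
```

Comparing with all.equal rather than == is the usual way to sidestep rounding in the last digits.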

Chapter reading questions: ch2read

- Standard case variable format for data: 2.8,
- Distinguishing between variables and cases: 2.1, 2.4, 2.9,
- Levels and coding: 2.5, 2.14,
- Categorical vs quantitative variables, with an explanation of ordinal, interval, and ratio: 2.10
- Bias in sampling: 2.7. An example about investment funds: 2.11. An example about labor statistics: 2.12.
- Creating small data sets: 2.15,

- Bias in sampling: picking a random interval: 2.13, and a similar example involving choosing library books off a shelf (an R simulation): [Get prob from 2009-11-24 library-books.tex and select-books.r]

- Counting (using table) and setting conditions: 2.2,
- Basic computations on data frames: 2.3,
- Creating CSV files: 2.csv
- General R syntax and errors: 2.6,

Chapter reading questions: ch3read

- “Guesstimating” centers and spreads of everyday quantities: 3.23, 3.25
- Estimation by eye of simple descriptive statistics. From graphs: 3.18. From a table of quintiles: 3.24.
- Computing percentiles and coverage intervals: 3.1,
- Appropriate displays for different variable types: 3.13, 3.31, 3.32,
- Boxplots: 3.9. Outliers and the whiskers: 3.14. Interpretation: 3.17.
- Comparing different forms of displays of distributions: 3.10a, 3.10b, 3.10c, 3.16,
- Density plots and areas: 3.15,
- Bias: 3.8,
- Hand calculation of simple descriptive statistics: 3.11, 3.12,
- Relating the different descriptors to one another: 3.53
- Body temperature: 3.50
- Mutual fund growth and sampling bias: 3.51

- A simple introductory experiment involving elevators: 3.36 [Note to instructors: This is useful to get students thinking about experiment and also to have them enter data into a spreadsheet. They will do this badly, making no clear distinction between cases and variables, using the spreadsheet as if it were a table cloth. This is an opportunity to put the students on course. I go over the spreadsheets that the students submit in class and point out the ways in which they violate the case-variable paradigm.]
- What does “typical” mean in an advertisement? An example about a weight-loss advertisement. 3.28
- Where “average” is misleading: 3.29
- “Outliers” and skew distributions: the log transform: 3.30
- Relationship between the 5-number summary and the distribution. 3.33

- Basic operations on data frames and variables: 3.2, 3.3, 3.4,
- Calculating descriptive statistics: 3.5,
- Calculating percentiles: the p and q operators in R: 3.54 [Note to instructors: The ISM software adds two operators, pdata and qdata, to R. These are analogous to the operators on probability distributions, e.g., pnorm and qnorm. You could, of course, use quantile in the base R distribution to compute quantiles. But I don’t see any reason to give a completely different name, e.g., quantile, to an operator that is conceptually related to another set of operators like qnorm.]
- Outliers and the 1.5 IQR rule of thumb: 3.6,
- Extracting subsets of data: 3.7,
- Drawing graphics. These relate to the speeders.CV data: 3.21, 3.22, 3.26, 3.27
- Drawing compound graphics — e.g., adding more than one graphic to a plot. 3.35
- Programming project (for students who want to write functions in R): 3.34
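For instructors working without the ISM software, the percentile and outlier computations in this list can be approximated in base R. This sketch deliberately does not call pdata or qdata, since their signatures are not given here; quantile and a proportion of cases stand in for them, and the built-in mtcars data frame is an arbitrary choice.

```r
x <- mtcars$mpg

# "q"-style question: which value sits at the 25th percentile of the data?
q25 <- quantile(x, probs = 0.25)

# "p"-style question: what fraction of cases fall at or below a given value?
p20 <- mean(x <= 20)

# The same pair of questions for a theoretical normal distribution
q_norm <- qnorm(0.975)   # value with 97.5% of the probability below it
p_norm <- pnorm(q_norm)  # back from the value to the probability

# 1.5 IQR rule of thumb for flagging possible outliers
quartiles <- quantile(x, probs = c(0.25, 0.75))
fence <- 1.5 * diff(quartiles)
outliers <- x[x < quartiles[1] - fence | x > quartiles[2] + fence]
```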

Chapter reading questions: ch4read

- Identifying response and explanatory variables: 4.1, 4.2
- Model design and the graphical “shape” of models: 4.3, 4.4, in the context of educational testing 4.6,
- Estimating coefficients by eye: 4.5

- Interpreting poll results: 4.7

Chapter reading questions: ch5read

- Coefficients and model formulas: 5.1, 5.2
- Coefficients and grand and group-wise means: 5.3, 5.4, 5.17
- Model design and graphs of model values: 5.11, 5.12, 5.13
- Model values and residuals from graphs of models: 5.5
- Estimating coefficients from graphs of models: 5.8, 5.9,
- Choosing the correct model design and interpreting model coefficients: 5.6, 5.14
- Units of coefficients: 5.7
- Calculating model values from coefficients: 5.10

- Project: Used-car prices : 5.15
- Translating a news report on drinking behavior into modeling terms: 5.16

Chapter reading questions: ch6read

- Computing sums of squares of residuals: 6.2, 6.3
- Sizes of residuals: 6.4
- Reading residuals from a graph: 6.6 (also about outliers).
- Overall review: 6.1

- Finding the speed of sound: 6.5 (a statistical issue is whether to include an intercept in the model)
- Looking for patterns in residuals: 6.7

- Coming soon: Elasticities and proportions (using logs), dropping the units (using standardized data and ranks).

Chapter reading questions: ch7read

Quick Quiz: 7.9

- Calculating R² as a ratio of variances: 7.1
- Partitioning of variance by a model: 7.2
- Properties of R²: 7.4, 7.6, 7.7
- Nesting: 7.3, 7.8, 7.10
- “Bigger” models reduce the residuals: 7.5
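The variance-ratio and partitioning ideas in these exercises can be checked numerically. The following is a minimal sketch, assuming a model with an intercept and using the built-in mtcars data; the choice of variables is illustrative only.

```r
mod <- lm(mpg ~ wt + hp, data = mtcars)

# R^2 as a ratio of variances: fitted model values over the response
r2_ratio <- var(fitted(mod)) / var(mtcars$mpg)
r2_report <- summary(mod)$r.squared

# Partitioning: with an intercept in the model, the variance of the response
# splits into the variance of the fitted values plus that of the residuals
lhs <- var(mtcars$mpg)
rhs <- var(fitted(mod)) + var(resid(mod))

# A smaller model nested in the first cannot have a larger R^2
r2_small <- summary(lm(mpg ~ wt, data = mtcars))$r.squared
```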

- R² and the intercept term: 7.13

- Constructing an exactly redundant model term (exact multicollinearity) and exploring the consequences of allowing such redundancy in a model. 7.14
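One way to see the consequences of exact redundancy is to build a redundant variable by hand and fit with lm. In this sketch the variable name combo and the weights 2 and 3 are hypothetical, chosen only to make an exact linear combination of two existing variables.

```r
d <- mtcars
# Hypothetical redundant variable: an exact linear combination of wt and hp
d$combo <- 2 * d$wt + 3 * d$hp

mod <- lm(mpg ~ wt + hp + combo, data = d)
coefs <- coef(mod)   # lm() reports NA for the term it cannot separate out

# The fitted values match those of the model without the redundant term
mod0 <- lm(mpg ~ wt + hp, data = d)
```

The NA coefficient is how lm signals that the term adds nothing to the subspace already spanned by the other model vectors.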

Chapter reading questions: ch8read

- Computing a partial change: 8.1. In the context of bond ratings and interest rates, 8.5. In the context of used-car prices, using contour plots to display the response and two explanatory variables: 8.10
- Total and partial change in poll results: 8.8,
- Simpson’s paradox. 8.4. An example from economics and the Phillips Curve: 8.3
- How the model design relates to what is being compared to what: Context of educational assessment 8.12,
- Interpreting a newspaper description of the relationship between earnings of college graduates and the eliteness of the school they go to: 8.13,

- Asking students to find an example of Simpson’s paradox: 8.7
- Misleading aggregation in Simpson’s paradox: 8.6
- Whether to look at a partial change or a total change depends on the question being asked: 8.2
- An example relating to health-care expenditures and outcomes: 8.9
- The “ecological paradox” 8.11

Chapter reading questions: ch9read

Graph paper and protractor/rulers at the same scale: http://www.macalester.edu/~kaplan/ISM/graph-paper.pdf. You can print out the first page on transparency paper to produce protractors for a class.

- Linear combinations (adding vectors): 9.1, 9.2
- Orthogonality: 9.3, 9.4, 9.12
- Computing square length, dot product, dimension: 9.6, computing angles with a dot product 9.8
- Translating model terms to model vectors: 9.7 (This is a very long exercise and ought to be divided into parts.), 9.9
- Angles between vectors: 9.10 Cosine of important angles: 9.5
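The square length, dot product, and angle computations above can be sketched in a few lines of base R. The helper names dot and angle_deg and the example vectors are hypothetical, introduced here only for illustration.

```r
a <- c(3, 4)
b <- c(4, -3)
v <- c(1, 1)

# Dot product of two vectors of the same dimension
dot <- function(x, y) sum(x * y)

sq_length_a <- dot(a, a)   # square length of a; the length is its square root
a_dot_b <- dot(a, b)       # zero, so a and b are orthogonal

# Angle from cos(theta) = (x . y) / (|x| |y|), reported in degrees
angle_deg <- function(x, y) {
  acos(dot(x, y) / sqrt(dot(x, x) * dot(y, y))) * 180 / pi
}
angle_ab <- angle_deg(a, b)              # 90 degrees, up to rounding
angle_v_axis <- angle_deg(v, c(1, 0))    # 45 degrees, up to rounding
```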

- Measuring angles and lengths, comparing a ruler/protractor to the dot product: 9.8 (You will need to print out some protractors on transparency paper. A PDF file containing protractors and graph paper at the same scale is available at http://www.macalester.edu/~kaplan/ISM/graph-paper.pdf. When the problem is being displayed, the browser window can be sized so that the graph matches the scale on the ruler protractor.)

Chapter reading questions: ch10read

- Properties of case space vs variable space: 10.1
- Projection and extraction of the coefficient: 10.2, 10.7
- Properties of the model triangle: 10.3, 10.4, 10.5, and R² 10.6, 10.8

- The intercept and the sum of residuals: 10.9
- Projections, algebraically (via the dot product): 10.11, 10.12
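The algebraic projection can be compared directly with a software fit. This is a sketch under the assumption of a single explanatory vector and no intercept; the vectors A and B are made-up numbers.

```r
B <- c(2, 5, 1, 4)     # hypothetical response vector
A <- c(1, 2, 0, 3)     # hypothetical explanatory vector

# Coefficient on A when projecting B onto A: (A . B) / (A . A)
coef_A <- sum(A * B) / sum(A * A)
fitted_vec <- coef_A * A
resid_vec <- B - fitted_vec

# lm() without an intercept carries out the same projection
coef_lm <- coef(lm(B ~ A - 1))[["A"]]

# The residual is orthogonal to A, so the model triangle is a right triangle
resid_dot_A <- sum(resid_vec * A)
```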

- Points in case and in variable space: 10.10 This exercise helps students to see that case and variable space provide different representations of the same data.

Chapter reading questions: ch11read

- Showing how all space can be reached by a suitable combination: 11.2
- Simpson’s paradox: 11.1
- Computing the relationship between the response, the fitted model values, and the residuals: 11.4
- Translating a description into a vector diagram: 11.5, 11.6

- Fitting by hand compared to fitting by software: 11.13 (The second part of this, involving R², is a little more advanced.)
- Linear combinations, 11.7
- Contrasting fitting in case and variable space: 11.8, 11.9, 11.11
- Sum of squares of residuals: 11.10, 11.12
- Visually fitting models using the vsolve “grow-that-vector” software: Instructions: 11.vsolve; in 2 dimensions: 11.14, redundancy 11.15; in 3 dimensions: 11.16, redundancy 11.17

Chapter reading questions: ch12read

- Basic operations on probability distributions: 12.1, 12.7, 12.12, 12.14, 12.15, 12.18, 12.23, an introduction to the computer commands 12.24
- Coverage intervals on probability distributions: 12.2, 12.3
- Families of probability distributions: parameters 12.4, 12.5, binomial or not 12.11, 12.19
- Poisson: 12.6
- Estimation by eye: Density 12.13, Cumulative 12.8, 12.28
- Sums and averages of random variables, z-scores 12.9, 12.10, 12.27
- Percentiles and the parameters: 12.16, 12.17, 12.25, 12.26; Density and the parameters: 12.22
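The basic operations on probability distributions listed above map onto R's p and q functions. The following sketch uses base R only; the parameter values (mean 100, sd 15, and so on) are arbitrary choices, not values from the exercises.

```r
# 95% coverage interval for a normal distribution with mean 100 and sd 15
ci_norm <- qnorm(c(0.025, 0.975), mean = 100, sd = 15)

# Check the coverage by going back from values to probabilities
coverage <- diff(pnorm(ci_norm, mean = 100, sd = 15))

# z-score: how many standard deviations a value lies from the mean
z <- (130 - 100) / 15

# The binomial and Poisson families follow the same naming pattern
p_binom <- pbinom(3, size = 10, prob = 0.5)   # P(X <= 3)
p_pois <- ppois(2, lambda = 4)                # P(X <= 2)
```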

Quick Quiz: 12.20,

Fun to show that intuition doesn’t correctly represent joint probabilities: 12.21 from Kahneman and Tversky

Chapter reading questions: ch13read

- Taking your class for a random walk: notes at http://www.macalester.edu/~kaplan/ISM/RandomWalkSeminar/notes.pdf. This also includes some software for simulating the walks visually on the computer, as described in the notes http://www.macalester.edu/~kaplan/ISM/RandomWalkSeminar/random-walks.r. “Source” the software into R.

REDUNDANCY EXERCISE FROM 2009-10-08, s2009-29?

Chapter reading questions: ch14read

- Vocabulary of confidence intervals 14.1
- How the margin of error depends on the sample size n: 14.4
- Coverage: 14.5
- Constructing a confidence interval on a coefficient from a regression report: 14.6, 14.7
- How the margin of error depends on the confidence level: qualitatively 14.3, quantitatively 14.2

- Error bars and confidence intervals. 14.12 (reflecting the diverse practices in the scientific literature.) [[[More problems from Trish and from those medical reports downloaded from NYT in late Dec. 2009.]]]
- Contrasting a confidence interval on a sample mean with a coverage interval on a distribution of individuals: in the context of a claim about weight loss 14.13; an extension of this using the actual data from the weight loss clinic 14.14
- How many digits to report? 14.9
- How a measurement that’s not reliable for an individual can be reliable for a group: 14.10
- Accuracy versus precision: 14.11
- Finding a confidence interval using probability models: an example from Darwin: 14.22.
- Finding a confidence interval on a calculated estimate: an example from Robert Hooke’s 17th century observations: 14.23
- Fitting a power law. Context: Wind power generation: 14.24
- PLANNED: Using the 0-1 encoding to get standard errors on sample proportions and on differences of proportions. Comparing this to the Wald interval via a formula. (It’s useful to have the formula when all you have is the reported sample proportion and sample size.)
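The planned comparison between the 0-1 encoding and the Wald formula might look something like the following. This is a sketch, not the planned exercise itself; the counts (40 yes out of 100) are hypothetical.

```r
# Hypothetical sample: 40 "yes" outcomes out of 100 cases, coded 0/1
y <- rep(c(1, 0), times = c(40, 60))
n <- length(y)
p_hat <- mean(y)

# Standard error from fitting the all-cases-the-same model to the 0-1 variable
se_lm <- summary(lm(y ~ 1))$coefficients[1, "Std. Error"]

# Standard error from the Wald formula, using only p_hat and n
se_wald <- sqrt(p_hat * (1 - p_hat) / n)

# The two differ only by the n versus n - 1 divisor in the variance
ratio <- se_lm / se_wald   # equals sqrt(n / (n - 1))
```

For samples of realistic size the two standard errors are nearly identical, which is why the formula is a reasonable fallback when only the reported proportion and sample size are available.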

- Constructing sampling distributions on the sample mean and sample standard deviation: 14.19
- Bootstrapping the standard error of the mean: 14.8
- Bootstrapping other simple statistics (median, sd, 75th percentile): 14.17
- A simulation for exploring the effects of sample size, collinearity, and residual size on the standard error: 14.21
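The bootstrapping exercises above rest on one idea: resample the cases with replacement and recompute the statistic. A minimal sketch in base R follows, assuming the built-in mtcars data and 2000 resampling trials, both arbitrary choices.

```r
set.seed(1)                      # make the resampling reproducible
x <- mtcars$mpg
n <- length(x)

# Resample the cases with replacement and recompute the mean each time
boot_means <- replicate(2000, mean(sample(x, size = n, replace = TRUE)))
se_boot <- sd(boot_means)

# Formula-based standard error of the mean, for comparison
se_formula <- sd(x) / sqrt(n)

# The same resampling works for statistics that lack a simple formula
boot_medians <- replicate(2000, median(sample(x, size = n, replace = TRUE)))
se_median <- sd(boot_medians)
```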

- Geometry: Why collinearity among explanatory vectors will increase standard errors: 14.20 PLANNED: A new version of this that uses a simpler diagram. With the current diagram, it helps to walk students through the problem as an activity, showing them what the coefficients would be for various different locations of variable A, and explaining why the contour lines describe the values of the coefficients.
- Confidence intervals on a sample proportion: Comparing the Wald and modified Wald method 14.18
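A small numerical comparison shows why a modified interval is worth teaching. This sketch assumes that "modified Wald" means the common adjustment of adding two successes and two failures before applying the Wald formula; the exercise may define it differently. The counts are hypothetical.

```r
x <- 2    # hypothetical number of successes
n <- 20   # hypothetical number of trials
z <- qnorm(0.975)

# Wald interval from the raw sample proportion
p_hat <- x / n
wald <- p_hat + c(-1, 1) * z * sqrt(p_hat * (1 - p_hat) / n)

# Assumed modification: add two successes and two failures first
p_adj <- (x + 2) / (n + 4)
modified <- p_adj + c(-1, 1) * z * sqrt(p_adj * (1 - p_adj) / (n + 4))
```

With these counts the lower limit of the Wald interval is negative, which is impossible for a proportion, while the modified interval stays above zero.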

- The standard error of the mean versus sample size n: 14.16
- Coverage exercise
- How sampling distributions depend on the sample size: 14.19

Chapter reading questions: ch15read

- Outcomes of a hypothesis test: 15.1

- Significance and power estimated from conditional sampling distribution graphs: 15.3, 15.4, 15.6
- Settings for hypothesis tests: presidential elections 15.2 (This is implicitly about multiple tests.)
- Calculating p-values given the form of the sampling distribution: binomial 15.7, Poisson 15.5
- Power and the alternative hypothesis: 15.8,
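When the sampling distribution under the null hypothesis is binomial or Poisson, the p-value is a tail probability. The sketch below uses base R with hypothetical counts and one-sided tests; the exercises may call for different settings.

```r
# Binomial: 16 successes in 20 trials under a null of p = 0.5
# One-sided p-value is P(X >= 16)
p_binom <- pbinom(15, size = 20, prob = 0.5, lower.tail = FALSE)

# binom.test() reports the same one-sided value
p_check <- binom.test(16, 20, p = 0.5, alternative = "greater")$p.value

# Poisson: 9 events observed when the null rate is 4 per interval
# One-sided p-value is P(X >= 9)
p_pois <- ppois(8, lambda = 4, lower.tail = FALSE)
```

Note the off-by-one in the first argument: with lower.tail = FALSE, pbinom and ppois give the probability of strictly more than the stated value.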

- Are the heights of husbands and wives related? 15.9 (This problem also deals with the question of the unit of analysis and “independent” measurements.)

- The sampling distribution of R² and F under the null hypothesis: 15.10 (This can also be assigned as homework, but it helps to be able to walk students through it in class.)

Chapter reading questions: ch16read

- A case study (about zebra mussels) that examines how covariates can be useful by eating variance: 16.8
- Interpreting models: 16.2
- Adjusting p values 16.4, 16.11

- Mean and variance as outputs from modeling: 16.7
- Consequences of a larger sample size: 16.5
- t and F distributions: 16.6
- Testing the Bonferroni correction: 16.12
- Hypothesis testing and units: 16.13
- False detection rate and q-values in microarray analysis (in genetics): 16.14
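Adjusting p-values for multiple tests can be tried out with p.adjust from base R. In this sketch the 20 tests are simulated so that the null hypothesis is true in every one. The Benjamini-Hochberg adjustment is shown as a related false-discovery-rate method; it is an assumption that this is close to what the q-value exercise has in mind.

```r
set.seed(2)
# 20 tests in which the null hypothesis holds: both groups are random noise
p_values <- replicate(20, t.test(rnorm(15), rnorm(15))$p.value)

# Bonferroni: multiply each p-value by the number of tests, capped at 1
p_bonf <- p.adjust(p_values, method = "bonferroni")

# Benjamini-Hochberg adjustment, aimed at the false discovery rate
p_bh <- p.adjust(p_values, method = "BH")

# How many tests look "significant" at 0.05 before and after adjustment
n_raw <- sum(p_values < 0.05)
n_bonf <- sum(p_bonf < 0.05)
```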

- A demonstration of shuffling and how it implements the null hypothesis: 16.15
- Working through an ANOVA calculation by hand: 16.9
- How random terms affect R²: 16.10
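The shuffling demonstration can be reproduced in a few lines. This is a minimal sketch in base R, assuming the built-in mtcars data and 1000 shuffling trials; neither is taken from the exercise.

```r
set.seed(3)
observed <- summary(lm(mpg ~ wt, data = mtcars))$r.squared

# Shuffling the explanatory variable breaks any link to the response,
# so each trial is a draw from a world where the null hypothesis holds
null_r2 <- replicate(1000, {
  shuffled <- sample(mtcars$wt)
  summary(lm(mtcars$mpg ~ shuffled))$r.squared
})

# p-value: fraction of shuffled trials with R^2 at least as large as observed
p_value <- mean(null_r2 >= observed)
```

A histogram of null_r2 with the observed value marked makes the logic of the test visible to students.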

Chapter reading questions: ch17read

- The structure of an ANOVA table: 17.2
- Degrees of freedom: 17.83H, 17.42H. The intercept: 17.516H.
- Null hypotheses for the different coefficients in a model: 17.1
- PLANNED: Is there evidence for the short-term and long-term Phillips curve hypotheses? (See 8.3 F2006/inflation and analyze the data there.)

- Geometry of ANOVA 17.20

- Interpretation of ANOVA tables: 17.6, 17.9, order dependence 17.10, ecology 17.12,
- Significant versus substantial: 17.7,
- Covariates and nesting: 17.11. There’s more variance structure in this model, so the standard error is wrong: 17.14.
- Randomization, the null hypothesis, and p-values: 17.15
- Analysis/critique of a news report: School test scores 17.19

- ANOVA combining multiple model terms: 17.16
- Sampling bias in survival studies (with a simulation): 17.18

Chapter reading questions: ch18read

- Link values and probability values. An abstract setting: 18.1. In the context of the space shuttle Challenger: 18.2.
- Deviance and degrees of freedom: 18.3.
- Likelihood: 18.7. An example of the fitting process for logistic models, etc.: 18.4.
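The relationship between link values and probability values, and the quantities a logistic fit reports, can be sketched in base R. The mtcars example (transmission type against weight) is an arbitrary stand-in and is unrelated to the Challenger data used in the exercise.

```r
# Link value (log odds) to probability and back
link <- c(-2, 0, 1.5)
prob <- plogis(link)           # 1 / (1 + exp(-link))
back <- qlogis(prob)           # log(prob / (1 - prob))

# Fitting a logistic model: am is a 0/1 variable in mtcars
mod <- glm(am ~ wt, data = mtcars, family = binomial)
link_hat <- predict(mod, type = "link")       # model values on the link scale
prob_hat <- predict(mod, type = "response")   # model values as probabilities

# Deviance and residual degrees of freedom reported by the fit
dev <- deviance(mod)
df_resid <- df.residual(mod)
```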

- Calculating odds ratios using results from the National Osteoporosis Risk Assessment: 18.5.
- Comparing the effects of explanatory variables measured on different scales: 18.6.
- COMING: Modeling whether a driver gets a ticket or a warning for speeding.

Chapter reading questions: ch19read

Chapter reading questions: ch20read

These problems combine materials from multiple chapters in a manner suitable for exams and other reviews.

INSTRUCTORS: A larger set of Review and Exam problems is available. Contact mailto:kaplan@macalester.edu.

A quiz on basic operations in R. This is helpful in getting students to memorize the basic commands so that they can use them more fluently.

Quiz Study Guide: R-quiz-study-guide

The quiz itself, which is mainly a reprise of the study guide. Contact mailto:kaplan@macalester.edu.

A few problems .... Others are available to instructors upon request.

- Knowing what kind of numbers statistical values will be. Rev.14
- Skin temperature: Rev.5
- Heating degree days: Rev.4
- Based on the Netflix data set: Rev.2

- Graphics. Re-ordering categorical variables for displays: 4.9.