Annotated List of Exercises for
Statistical Modeling: A Fresh Approach
Daniel Kaplan
June 13, 2010
At the start of the term:
1 Chapter 1: Statistical Models
This is an introductory chapter that sketches out broad concepts.
Chapter reading questions:
ch1read
An analogy between models and scientific “laws”: 1.10
1.1 Computing: Introduction to R
- General R syntax: 1.2, 1.3, 1.4, 1.5, 1.6
- R syntax for functions (not essential) 1.7
[CURLY BRACKETS need fixing]
- Simple computations in R: 1.1
- Computer numbers. Scientific notation: 1.8, 1.9; machine precision: 1.11
2 Chapter 2: Data
Chapter reading questions: ch2read
2.1 Drills and Exercises
- Standard case variable format for data: 2.8
- Distinguishing between variables and cases: 2.1, 2.4, 2.9
- Levels and coding: 2.5, 2.14
- Categorical vs quantitative variables, with an explanation of ordinal, interval,
and ratio: 2.10
- Bias in sampling: 2.7.
An example about investment funds: 2.11.
An example about labor statistics: 2.12.
- Creating small data sets: 2.15
2.2 In-Class Activities
- Bias in sampling: picking a random interval: 2.13,
and a similar example involving choosing library books off a shelf (an R
simulation): [Get prob from 2009-11-24 library-books.tex and select-books.r]
2.3 Computing in R
- Counting (using table) and setting conditions: 2.2
- Basic computations on data frames: 2.3
- Creating CSV files: 2.csv
- General R syntax and errors: 2.6
3 Chapter 3: Describing Variation
Chapter reading questions: ch3read
3.1 Drills and Exercises
- “Guesstimating” centers and spreads of everyday quantities: 3.23, 3.25
- Estimation by eye of simple descriptive statistics. From graphs: 3.18.
From a table of quintiles: 3.24.
- Computing percentiles and coverage intervals: 3.1
- Appropriate displays for different variable types: 3.13, 3.31, 3.32
- Boxplots: 3.9.
Outliers and the whiskers: 3.14.
Interpretation: 3.17.
- Comparing different forms of displays of distributions: 3.10a, 3.10b, 3.10c, 3.16
- Density plots and areas: 3.15
- Bias: 3.8
- Hand calculation of simple descriptive statistics: 3.11, 3.12
- Relating the different descriptors to one another: 3.53
- Body temperature: 3.50
- Mutual fund growth and sampling bias: 3.51
3.2 Statistical Practice
- A simple introductory experiment involving elevators: 3.36
[Note to instructors: This is useful to get students thinking about experiments
and also to have them enter data into a spreadsheet. They will do this
badly, making no clear distinction between cases and variables and using the
spreadsheet as if it were a tablecloth. This is an opportunity to put the
students on course. I go over the spreadsheets that the students submit
in class and point out the ways in which they violate the case-variable
paradigm.]
- What does “typical” mean in an advertisement? An example about a
weight-loss advertisement. 3.28
- Where “average” is misleading: 3.29
- “Outliers” and skew distributions: the log transform: 3.30
- Relationship between the 5-number summary and the distribution. 3.33
3.3 Computing
- Basic operations on data frames and variables: 3.2, 3.3, 3.4
- Calculating descriptive statistics: 3.5
- Calculating percentiles: the p and q operators in R: 3.54
[Note to instructors: The ISM software adds two operators, pdata and
qdata, to R. These are analogous to the operators on probability distributions,
e.g., pnorm and qnorm. You could, of course, use quantile in the base R
distribution to compute quantiles. But I don’t see any reason to give a
completely different name, e.g., quantile, to an operator that is conceptually
related to another set of operators like qnorm.]
- Outliers and the 1.5 IQR rule of thumb: 3.6
- Extracting subsets of data: 3.7
- Drawing graphics. These relate to the speeders.CV data: 3.21, 3.22, 3.26, 3.27
- Drawing compound graphics — e.g., adding more than one graphic to a
plot. 3.35
- Programming project (for students who want to write functions in R):
3.34
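As a hedged illustration of the p/q pairing described in the note on percentiles, the sketch below uses only base R: quantile plays the "q" role (probability to value) and ecdf plays the "p" role (value to probability), mirroring qnorm and pnorm. The data vector x is made up for illustration. The ISM pdata and qdata operators are deliberately not used, since their exact signatures are not given in this list.

```r
# Made-up data, for illustration only
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 15, 21)

# "q" direction: probability -> value
q_emp <- quantile(x, probs = 0.5)        # empirical median of x
q_thy <- qnorm(0.5, mean = 0, sd = 1)    # median of a standard normal

# "p" direction: value -> probability
p_emp <- ecdf(x)(9)                      # fraction of cases at or below 9
p_thy <- pnorm(0, mean = 0, sd = 1)      # probability at or below 0 for a standard normal

# A 50% coverage interval runs from the 25th to the 75th percentile
coverage50 <- quantile(x, probs = c(0.25, 0.75))
```

Note that quantile interpolates between cases by default, so different percentile conventions can give slightly different answers on small data sets.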
3.4 Quick Quiz
3.19, 3.20
[Need to modify]
3.5 Class Activities
4 Chapter 4: The Language of Models
Chapter reading questions: ch4read
4.1 Exercises
- Identifying response and explanatory variables: 4.1, 4.2
- Model design and the graphical “shape” of models: 4.3, 4.4; in the context of educational testing: 4.6
- Estimating coefficients by eye: 4.5
4.2 Statistical Practice
- Interpreting poll results: 4.7
4.3 Elaborations
- Three-way interactions: 4.10
- Threshold regression: 4.11
4.4 Computation in R
- Graphing model values: 4.graph
- Graphics involving a conditioning variable: 4.8
5 Chapter 5: Model Formulas and Coefficients
Chapter reading questions: ch5read
5.1 Exercises
- Coefficients and model formulas: 5.1, 5.2
- Coefficients and grand and group-wise means: 5.3, 5.4, 5.17
- Model design and graphs of model values: 5.11, 5.12, 5.13
- Model values and residuals from graphs of models: 5.5
- Estimating coefficients from graphs of models: 5.8, 5.9
- Choosing the correct model design and interpreting model coefficients: 5.6, 5.14
- Units of coefficients: 5.7
- Calculating model values from coefficients: 5.10
5.2 Statistical Practice
- Project: Used-car prices: 5.15
- Translating a news report on drinking behavior into modeling terms:
5.16
6 Chapter 6: Fitting Models to Data
Chapter reading questions: ch6read
6.1 Exercises
- Computing sums of squares of residuals: 6.2, 6.3
- Sizes of residuals: 6.4
- Reading residuals from a graph: 6.6 (also about outliers)
- Overall review: 6.1
6.2 Statistical Practice
- Finding the speed of sound: 6.5
(a statistical issue is whether to include an intercept in the model)
- Looking for patterns in residuals: 6.7
6.3 Elaboration
- Coming soon: Elasticities and proportions (using logs), dropping the units
(using standardized data and ranks).
7 Chapter 7: Measuring Correlation
Chapter reading questions: ch7read
Quick Quiz: 7.9
7.1 Exercises
- Calculating R² as a ratio of variances: 7.1
- Partitioning of variance by a model: 7.2
- Properties of R²: 7.4, 7.6, 7.7
- Nesting: 7.3, 7.8, 7.10
- “Bigger” models reduce the residuals: 7.5
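The idea of R² as a ratio of variances, and of a model partitioning the variance of the response, can be sketched in a few lines of base R. This is a minimal illustration on simulated data, not taken from the exercises; the variable names are hypothetical, and the identities shown assume the model includes an intercept.

```r
# Simulated data, for illustration only
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 3 + 2 * x + rnorm(n)

mod <- lm(y ~ x)

# R^2 as the variance of the fitted model values over the variance of the response
r2_ratio <- var(fitted(mod)) / var(y)

# The same number as reported by the regression summary
r2_report <- summary(mod)$r.squared

# Partitioning: response variance = fitted variance + residual variance
partition_gap <- var(y) - (var(fitted(mod)) + var(resid(mod)))
```

Here partition_gap should be zero up to floating-point rounding.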
7.2 Statistical Practice
- The population of US Congressional districts: 7.11
- Is larger R² better? 7.12
7.3 Elaboration
- R² and the intercept term: 7.13
7.4 In-Class Activity
- Constructing an exactly redundant model term (exact multicollinearity)
and exploring the consequences of allowing such redundancy in a model.
7.14
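One possible way to set up such a redundancy demonstration in base R is sketched below. The data are simulated and the names a, b, and redundant are hypothetical. The third explanatory variable is an exact linear combination of the other two, so lm cannot assign it a coefficient of its own.

```r
# Simulated data, for illustration only
set.seed(2)
n <- 20
a <- rnorm(n)
b <- rnorm(n)
redundant <- a + b            # an exact linear combination of a and b
y <- 1 + 2 * a - b + rnorm(n)

mod_small <- lm(y ~ a + b)
mod_big   <- lm(y ~ a + b + redundant)

coef(mod_big)                 # lm() reports NA for the redundant term

# The fitted model values are unchanged by adding the redundant term
max_diff <- max(abs(fitted(mod_small) - fitted(mod_big)))
```

Changing the order of the terms in the formula changes which coefficient is reported as NA, which can be a useful point for class discussion.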
8 Chapter 8: Total and Partial Relationships
Chapter reading questions: ch8read
8.1 Exercises
- Computing a partial change: 8.1.
In the context of bond ratings and interest rates, 8.5.
In the context of used-car prices, using contour plots to display the response
and two explanatory variables: 8.10
- Total and partial change in poll results: 8.8
- Simpson’s paradox: 8.4.
An example from economics and the Phillips Curve: 8.3
- How the model design relates to what is being compared to what, in the context
of educational assessment: 8.12
- Interpreting a newspaper description of the relationship between earnings
of college graduates and the eliteness of the school they go to: 8.13
8.2 Statistical Practice
- Asking students to find an example of Simpson’s paradox: 8.7
- Misleading aggregation in Simpson’s paradox: 8.6
- Whether to look at a partial change or a total change depends on the
question being asked: 8.2
- An example relating to health-care expenditures and outcomes: 8.9
- The “ecological paradox” 8.11
8.3 Elaborations
9 Chapter 9: Model Vectors
Chapter reading questions: ch9read
Graph paper and protractor/rulers at the same scale: http://www.macalester.edu/~kaplan/ISM/graph-paper.pdf. You can
print out the first page on transparency paper to produce protractors for a
class.
9.1 Exercises
- Linear combinations (adding vectors): 9.1, 9.2
- Orthogonality: 9.3, 9.4, 9.12
- Computing square length, dot product, dimension: 9.6; computing angles with a dot product: 9.8
- Translating model terms to model vectors: 9.7 (this is a very long exercise and ought to be divided into parts), 9.9
- Angles between vectors: 9.10
Cosine of important angles: 9.5
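The square length, dot product, and angle computations listed above can be sketched in base R as follows. The helper names dot, sq_length, and angle_deg are hypothetical, introduced only for this illustration, and the vectors are made up.

```r
# Made-up vectors, for illustration only
a <- c(1, 0)
b <- c(0, 2)
c_vec <- c(1, 1)

dot <- function(u, v) sum(u * v)          # dot product
sq_length <- function(u) dot(u, u)        # square length

# Angle from the dot product: cos(theta) = u.v / (|u| |v|)
angle_deg <- function(u, v) {
  cos_theta <- dot(u, v) / sqrt(sq_length(u) * sq_length(v))
  acos(cos_theta) * 180 / pi
}

angle_ab <- angle_deg(a, b)       # a and b are orthogonal
angle_ac <- angle_deg(a, c_vec)   # c_vec lies halfway between the two axes
```

Students can check these values against a protractor measurement of the same vectors drawn on graph paper.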
9.2 Activities
- Measuring angles and lengths, comparing a ruler/protractor to the dot
product: 9.8
(You will need to print out some protractors on transparency paper. A
PDF file containing protractors and graph paper at the same scale is
available at http://www.macalester.edu/~kaplan/ISM/graph-paper.pdf.
When the problem is being displayed, the browser window can be sized so
that the graph matches the scale on the ruler protractor.)
10 Chapter 10: Statistical Geometry
Chapter reading questions:
ch10read
10.1 Exercises
- Properties of case space vs variable space: 10.1
- Projection and extraction of the coefficient: 10.2, 10.7
- Properties of the model triangle: 10.3, 10.4, 10.5; and R²: 10.6, 10.8
10.2 Elaborations
- The intercept and the sum of residuals: 10.9
- Projections, algebraically (via the dot product): 10.11,
10.12
10.3 Activities
- Points in case and in variable space: 10.10
This exercise helps students to see that case and variable space provide
different representations of the same data.
11 Chapter 11: Geometry with Multiple Vectors
Chapter reading questions:
ch11read
11.1 Exercises
- Showing how all space can be reached by a suitable combination: 11.2
- Simpson’s paradox: 11.1
- Computing the relationship between the response, the fitted model values,
and the residuals: 11.4
- Translating a description into a vector diagram: 11.5,
11.6
11.2 Activities
- Fitting by hand compared to fitting by software: 11.13
(The second part of this, involving R², is a little more advanced.)
- Linear combinations: 11.7
- Contrasting fitting in case and variable space: 11.8, 11.9, 11.11
- Sum of squares of residuals: 11.10, 11.12
- Visually fitting models using the vsolve “grow-that-vector” software. Instructions: 11.vsolve; in 2 dimensions: 11.14, redundancy: 11.15; in 3 dimensions: 11.16, redundancy: 11.17
12 Chapter 12: Modeling Randomness
Chapter reading questions:
ch12read
12.1 Exercises
- Basic operations on probability distributions: 12.1, 12.7, 12.12, 12.14, 12.15, 12.18, 12.23; an introduction to the computer commands: 12.24
- Coverage intervals on probability distributions: 12.2, 12.3
- Families of probability distributions: parameters 12.4, 12.5; binomial or not 12.11, 12.19
- Poisson: 12.6
- Estimation by eye: Density 12.13; Cumulative 12.8, 12.28
- Sums and averages of random variables, z-scores: 12.9, 12.10, 12.27
- Percentiles and the parameters: 12.16, 12.17, 12.25, 12.26. Density and the parameters: 12.22
Quick Quiz: 12.20
Fun to show that intuition doesn’t correctly represent joint probabilities: 12.21 (from Kahneman and Tversky)
13 Chapter 13: Geometry of Random Vectors
Chapter reading questions:
ch13read
13.1 Exercises
13.2 Activities
REDUNDANCY EXERCISE FROM 2009-10-08, s2009-29?
14 Chapter 14: Confidence in Models
Chapter reading questions:
ch14read
14.1 Exercises
- Vocabulary of confidence intervals 14.1
- How the margin of error depends on the sample size n: 14.4
- Coverage: 14.5
- Constructing a confidence interval on a coefficient from a regression report: 14.6, 14.7
- How the margin of error depends on the confidence level: qualitatively 14.3, quantitatively 14.2
14.2 Statistical Practice
- Error bars and confidence intervals: 14.12 (reflecting the diverse practices in the scientific literature).
[More problems from Trish and from those medical reports downloaded from NYT in late Dec. 2009.]
- Contrasting a confidence interval on a sample mean with a coverage interval
on a distribution of individuals: in the context of a claim about weight loss
14.13;
an extension of this using the actual data from the weight loss clinic
14.14
- How many digits to report? 14.9
- How a measurement that’s not reliable for an individual can be reliable for
a group: 14.10
- Accuracy versus precision: 14.11
- Finding a confidence interval using probability models: an example from
Darwin: 14.22.
- Finding a confidence interval on a calculated estimate: an example from
Robert Hooke’s 17th century observations: 14.23
- Fitting a power law. Context: Wind power generation: 14.24
- PLANNED: Using the 0-1 encoding to get standard errors on sample
proportions and on differences of proportions. Comparing this to the Wald
interval via a formula. (It’s useful to have the formula when all you have
is the reported sample proportion and sample size.)
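For the planned comparison between the 0-1 encoding and the Wald formula, one possible sketch in base R is given below. It is only an assumption about how such an exercise might be set up: the data are made up, and the intercept-only model y ~ 1 is used as the simplest way to get a standard error on a sample proportion.

```r
# Made-up yes/no outcomes coded as 0-1, for illustration only
y <- c(rep(1, 30), rep(0, 70))
n <- length(y)
p_hat <- mean(y)                          # sample proportion

# Standard error from fitting an intercept-only model to the 0-1 variable
mod <- lm(y ~ 1)
se_model <- summary(mod)$coefficients["(Intercept)", "Std. Error"]

# Wald standard error from the formula, needing only p_hat and n
se_wald <- sqrt(p_hat * (1 - p_hat) / n)

# Approximate 95% Wald interval on the proportion
wald_interval <- p_hat + c(-1, 1) * qnorm(0.975) * se_wald
```

The two standard errors differ only by a factor of sqrt(n / (n - 1)), because the regression report divides by n - 1 rather than n when estimating the variance.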
14.3 Computation in R
- Constructing sampling distributions on the sample mean and sample standard
deviation: 14.19
- Bootstrapping the standard error of the mean: 14.8
- Bootstrapping other simple statistics (median, sd, 75th percentile): 14.17
- A simulation for exploring the effects of sample size, collinearity, and
residual size on the standard error: 14.21
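A minimal sketch of bootstrapping a standard error in base R follows. It assumes a simulated sample rather than the data used in the exercises, and it uses sample and replicate directly rather than any course-specific resampling helpers.

```r
# Simulated sample, for illustration only
set.seed(3)
x <- rnorm(50, mean = 10, sd = 2)
n <- length(x)

# Resample the cases with replacement and recompute the mean each time
boot_means <- replicate(2000, mean(sample(x, size = n, replace = TRUE)))

se_boot    <- sd(boot_means)       # bootstrap standard error of the mean
se_formula <- sd(x) / sqrt(n)      # textbook formula, for comparison

# Swapping in another statistic only changes the function applied to each resample
boot_medians   <- replicate(2000, median(sample(x, size = n, replace = TRUE)))
se_boot_median <- sd(boot_medians)
```

The bootstrap and formula values for the mean should be close but not identical, since the bootstrap result carries its own simulation noise.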
14.4 Elaboration
- Geometry: Why collinearity among explanatory vectors will increase standard
errors: 14.20
PLANNED: A new version of this that uses a simpler diagram. With the
current diagram, it helps to walk students through the problem as an
activity, showing them what the coefficients would be for various different
locations of variable A, and explaining why the contour lines describe the
values of the coefficients.
- Confidence intervals on a sample proportion: Comparing the Wald and
modified Wald method 14.18
14.5 In-class Activities
- The standard error of the mean versus sample size n: 14.16
- Coverage exercise
- How sampling distributions depend on the sample size: 14.19
15 Chapter 15: The Logic of Hypothesis Testing
Chapter reading questions:
ch15read
15.1 Exercises
- Outcomes of a hypothesis test: 15.1
15.2 Statistical Practice
- Significance and power estimated from conditional sampling distribution graphs: 15.3, 15.4, 15.6
- Settings for hypothesis tests: presidential elections 15.2
(This is implicitly about multiple tests.)
- Calculating p-values given the form of the sampling distribution: binomial 15.7, Poisson 15.5
- Power and the alternative hypothesis: 15.8
15.3 Elaborations
- Are the heights of husbands and wives related? 15.9
(This problem also deals with the question of the unit of analysis and of “independent”
measurements.)
15.4 In-Class Activities
- The sampling distribution of R² and F under the null hypothesis: 15.10
(This can also be assigned as homework, but it helps to be able to walk
students through it in class.)
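One way to generate such a sampling distribution in class is sketched below in base R. This is an assumed setup, not the procedure from the exercise: the response is drawn at random on each trial so that it is unrelated to the explanatory variable, and the helper one_trial is hypothetical.

```r
set.seed(4)
n <- 30
x <- rnorm(n)                  # one explanatory variable, held fixed across trials

one_trial <- function() {
  y <- rnorm(n)                # response unrelated to x, so the null hypothesis holds
  s <- summary(lm(y ~ x))
  c(r2 = s$r.squared, f = unname(s$fstatistic["value"]))
}

# 1000 trials: one row per trial, with columns r2 and f
null_dist <- t(replicate(1000, one_trial()))

mean_r2 <- mean(null_dist[, "r2"])            # theory: about 1 / (n - 1) for one term
q95_f   <- quantile(null_dist[, "f"], 0.95)   # compare to qf(0.95, 1, n - 2)
```

A histogram of null_dist[, "r2"] shows students that R² is rarely exactly zero even when there is no relationship at all.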
16 Chapter 16: Hypothesis Testing on Whole Models
Chapter reading questions:
ch16read
16.1 Exercises
- Meaning of the p-value: 16.1
- Testing the mean: 16.3
16.2 Statistical Practice
- A case study (about zebra mussels) that examines how covariates can be
useful by eating variance: 16.8
- Interpreting models: 16.2
- Adjusting p-values: 16.4, 16.11
16.3 Elaborations
- Mean and variance as outputs from modeling: 16.7
- Consequences of a larger sample size: 16.5
- t and F distributions: 16.6
- Testing the Bonferroni correction: 16.12
- Hypothesis testing and units: 16.13
- False discovery rate and q-values in microarray analysis (in genetics): 16.14
16.4 In-Class Activities
- A demonstration of shuffling and how it implements the null hypothesis:
16.15
- Working through an ANOVA calculation by hand: 16.9
- How random terms affect R²: 16.10
17 Chapter 17: Hypothesis Testing on Parts of Models
Chapter reading questions:
ch17read
17.1 Exercises
- The structure of an ANOVA table: 17.2
- Degrees of freedom: 17.83H, 17.42H. The intercept: 17.516H
- Null hypotheses for the different coefficients in a model: 17.1
- PLANNED: Is there evidence for the short-term and long-term Phillips
curve hypotheses? (See 8.3 F2006/inflation and analyze the data there.)
17.2 In-Class Activity
17.3 Statistical Practice
- Interpretation of ANOVA tables: 17.6, 17.9; order dependence 17.10; ecology 17.12
- Significant versus substantial: 17.7
- Covariates and nesting: 17.11. There’s more variance structure in this model, so the standard
error is wrong: 17.14.
- Randomization, the null hypothesis, and p-values: 17.15
- Analysis/critique of a news report: School test scores 17.19
17.4 Elaborations
- ANOVA combining multiple model terms: 17.16
- Sampling bias in survival studies (with a simulation): 17.18
18 Chapter 18: Models of Yes/No Variables
Chapter reading questions:
ch18read
18.1 Exercises
- Link values and probability values. An abstract setting: 18.1.
In the context of the space shuttle Challenger: 18.2.
- Deviance and degrees of freedom: 18.3.
- Likelihood: 18.7.
An example of the fitting process for logistic models, etc.: 18.4.
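The relationship between link values and probability values in a logistic model can be sketched with base R alone. The link values below are arbitrary numbers chosen for illustration, not taken from any of the exercises.

```r
# Link (log-odds) values to probabilities with the logistic function
link_values <- c(-2, 0, 2)
probs <- 1 / (1 + exp(-link_values))

# Base R provides the same transformation as plogis(), and its inverse as qlogis()
probs_builtin <- plogis(link_values)
back_to_link  <- qlogis(probs)

# Odds corresponding to each probability; exp(link value) gives the same odds
odds <- probs / (1 - probs)
```

A link value of 0 corresponds to a probability of one half, and equal steps in the link value multiply the odds by a constant factor.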
18.2 Statistical Practice
- Calculating odds ratios using results from the National Osteoporosis Risk
Assessment: 18.5.
- Comparing the effects of explanatory variables measured on different scales:
18.6.
- COMING: Modeling whether a driver gets a ticket or a warning for
speeding.
19 Chapter 19: Causation
Chapter reading questions:
ch19read
20 Chapter 20: Experiment
Chapter reading questions:
ch20read
21 Review and Exam Problems
These problems combine materials from multiple chapters in a manner suitable for
exams and other reviews.
INSTRUCTORS: A larger set of Review and Exam problems is available.
Contact mailto:kaplan@macalester.edu.
21.1 R-Quiz
A quiz on basic operations in R. This is helpful in getting students to memorize the
basic commands so that they can use them more fluently.
Quiz Study Guide: R-quiz-study-guide
The quiz itself, which is mainly a reprise of the study guide, is available on request. Contact mailto:kaplan@macalester.edu.
21.2 Mid-Semester Review
Mid.3
21.3 End of Semester Review
A few problems .... Others are available to instructors upon request.
- Knowing what kind of numbers statistical values will be. Rev.14
- Skin temperature: Rev.5
- Heating degree days: Rev.4
- Based on the Netflix data set: Rev.2
22 General Elaborations
- Graphics. Re-ordering categorical variables for displays: 4.9.