Tuesday, September 22, 2009

Homework assignment, due Tuesday, September 29

The data for all assignments are at http://www.stat.columbia.edu/~gelman/arm/examples/

1. Exercise 3.1: The folder pyth contains outcome y and inputs x1, x2 for 40 data points, with a further 20 points with the inputs but no observed outcome. Save the file to your working directory and read it into R using the read.table() function.

(a) Use R to fit a linear regression model predicting y from x1, x2, using the first
40 data points in the file. Summarize the inferences and check the fit of your
model.

(b) Display the estimated model graphically as in Figure 3.2.

(c) Make a residual plot for this model. Do the assumptions appear to be met?

(d) Make predictions for the remaining 20 data points in the file. How confident
do you feel about these predictions?
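A minimal R sketch of parts (a), (c), and (d). The filename pyth.dat and the header row are assumptions about how the file is saved locally; adjust to match the file you downloaded:

```r
# Sketch only -- "pyth.dat" and the header row are assumptions about the file.
pyth <- read.table("pyth.dat", header = TRUE)

fit <- lm(y ~ x1 + x2, data = pyth[1:40, ])  # (a) fit on the 40 complete rows
summary(fit)                                 # coefficient estimates and standard errors

plot(fitted(fit), resid(fit))                # (c) residuals vs. fitted values
abline(h = 0, lty = 2)

# (d) point predictions and 95% prediction intervals for the 20 incomplete rows
predict(fit, newdata = pyth[41:60, ], interval = "prediction")
```

The width of the prediction intervals from the last call is one way to judge how confident you should feel about the predictions in part (d).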

2. Exercise 3.2(a): Suppose that, for a certain population, we can predict log earnings from log height as follows:

• A person who is 66 inches tall is predicted to have earnings of $30,000.

• Every increase of 1% in height corresponds to a predicted increase of 0.8% in
earnings.

• The earnings of approximately 95% of people fall within a factor of 1.1 of
predicted values.

(a) Give the equation of the regression line and the residual standard deviation of the regression.
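One way to set up the arithmetic in R, using natural logs; each line follows directly from one of the three bullet points above:

```r
b <- 0.8                        # slope: a 1% increase in height -> 0.8% increase in earnings
a <- log(30000) - b * log(66)   # the line passes through (log(66), log(30000))
sigma <- log(1.1) / 2           # 95% within a factor of 1.1 ~= within 2 sd on the log scale
c(intercept = a, resid.sd = sigma)
```

This gives an intercept of roughly 6.96 and a residual standard deviation of roughly 0.048 on the log scale.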

3. Exercise 3.3: In this exercise you will simulate two variables that are statistically independent of each other to see what happens when we run a regression of one on the other.

(a) First generate 1000 data points from a normal distribution with mean 0 and
standard deviation 1 by typing var1 <- rnorm(1000,0,1) in R. Generate
another variable in the same way (call it var2). Run a regression of one
variable on the other. Is the slope coefficient statistically significant?

(b) Now run a simulation repeating this process 100 times. This can be done
using a loop. From each simulation, save the z-score (the estimated coefficient
of var1 divided by its standard error). If the absolute value of the z-score
exceeds 2, the estimate is statistically significant. Here is code to perform the
simulation:

library(arm)   # provides se.coef()
z.scores <- rep(NA, 100)
for (k in 1:100) {
  var1 <- rnorm(1000, 0, 1)
  var2 <- rnorm(1000, 0, 1)
  fit <- lm(var2 ~ var1)
  z.scores[k] <- coef(fit)[2] / se.coef(fit)[2]
}

How many of these 100 z-scores are statistically significant?
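One way to count them, using the z.scores vector produced by the loop above:

```r
sum(abs(z.scores) > 2)   # number of "significant" slopes out of 100 simulations
```

Since the two variables are independent, about 5% of the z-scores should exceed 2 in absolute value by chance alone.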
