Wednesday, September 30, 2009

Homework assignment, due Tuesday 6 October

1. Exercise 4.1: Logarithmic transformation and regression: consider the following regression:

log(weight) = −3.5 + 2.0 log(height) + error,

with errors that have standard deviation 0.25. Weights are in pounds and heights
are in inches.

(a) Fill in the blanks: approximately 68% of the persons will have weights within
a factor of ___ and ___ of their predicted values from the regression.

(b) Draw the regression line and scatterplot of log(weight) versus log(height) that
make sense and are consistent with the fitted model. Be sure to label the axes
of your graph.
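A hedged hint for part (a): the regression is on the log scale, so a residual of plus or minus one standard deviation (0.25) corresponds to multiplying or dividing the predicted weight by exp(0.25). A quick check in R:

```r
# +/- 1 residual sd on the log scale back-transforms to multiplicative factors
round(exp(c(-0.25, 0.25)), 2)  # 0.78 and 1.28
```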

2. Exercise 4.2: The folder earnings has data from the Work, Family, and Well-Being Survey. Pull out the data on earnings, sex, height, and weight.

(a) In R, check the dataset and clean any unusually coded data.

(b) Fit a linear regression model predicting earnings from height. What transformation should you perform in order to interpret the intercept from this model as average earnings for people with average height?

(c) Fit some regression models with the goal of predicting earnings from some combination of sex, height, and weight. Be sure to try various transformations and interactions that might make sense. Choose your preferred model and justify.

(d) Interpret all model coefficients.

3. Exercise 4.3: Plotting linear and nonlinear regressions: we downloaded data with weight (in pounds) and age (in years) from a random sample of American adults. We first created new variables: age10 = age/10 and age10.sq = (age/10)^2, and indicators age18.29, age30.44, age45.64, and age65up for four age categories. We then fit some regressions, with the following results:

lm(formula = weight ~ age10)
coef.est coef.se
(Intercept) 161.0 7.3
age10 2.6 1.6
n = 2009, k = 2
residual sd = 119.7, R-Squared = 0.00

lm(formula = weight ~ age10 + age10.sq)
coef.est coef.se
(Intercept) 96.2 19.3
age10 33.6 8.7
age10.sq -3.2 0.9
n = 2009, k = 3
residual sd = 119.3, R-Squared = 0.01

lm(formula = weight ~ age30.44 + age45.64 + age65up)
coef.est coef.se
(Intercept) 157.2 5.4
age30.44TRUE 19.1 7.0
age45.64TRUE 27.2 7.6
age65upTRUE 8.5 8.7
n = 2009, k = 4
residual sd = 119.4, R-Squared = 0.01

(a) On a graph of weights versus age (that is, weight on y-axis, age on x-axis), draw the fitted regression line from the first model.

(b) On the same graph, draw the fitted regression line from the second model.

(c) On another graph with the same axes and scale, draw the fitted regression line from the third model. (It will be discontinuous.)
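A sketch (not the assignment's required answer) of how the fitted lines could be drawn in R from the coefficient estimates printed above; the raw data are not reproduced here, and the axis ranges are guesses:

```r
# fitted lines from the three models; x is age in years, the models use age/10
par(mfrow=c(1,2))

# (a) linear and (b) quadratic fits on one graph
curve(161.0 + 2.6*(x/10), from=18, to=90,
      xlab="age", ylab="weight", ylim=c(140, 200), main="Models 1 and 2")
curve(96.2 + 33.6*(x/10) - 3.2*(x/10)^2, add=TRUE, lty=2)

# (c) the discontinuous fit from the age-category model
plot(c(18, 90), c(140, 200), type="n", xlab="age", ylab="weight",
     main="Model 3")
segments(18, 157.2,        29, 157.2)         # ages 18-29: baseline
segments(30, 157.2 + 19.1, 44, 157.2 + 19.1)  # ages 30-44
segments(45, 157.2 + 27.2, 64, 157.2 + 27.2)  # ages 45-64
segments(65, 157.2 +  8.5, 90, 157.2 +  8.5)  # ages 65+
```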

Sunday, September 27, 2009

Who has babies when?

Hi all. Take a look at this (including the linked article and the blog comments). These are the kind of issues that arise in statistical analysis in social science.

Tuesday, September 22, 2009

some links concerning R and Emacs

You can read this and type the commands yourself in R:
http://mephisto.unige.ch/pub/biotree/ED-Pavie/articles/R-FrancoisMarin_modulad-2007.pdf

For those of you who want to try Emacs with ESS (not required at all!), you can go to:
http://vgoulet.act.ulaval.ca/ressources/emacs/
and choose the windows or mac os version depending on your computer.
(download the corresponding .emacs file as well)

some R code for Conference #1

##### install new packages, in particular "arm"
chooseCRANmirror()
install.packages()

##### test
library("arm")
x <- c(1,2,3)
y <- c(1,2,5)
fit <- lm (y ~ x)
display (fit)

########################
##### exercise 2.4 #####
########################

mean.men <- 69.1
std.men <- 2.9

mean.women <- 63.7
std.women <- 2.7


# check syntax of function rnorm
?rnorm

# try the process once:
# take a sample of 100 men
# take a sample of 100 women
# take the means and the differences

sample.men <- rnorm(100, mean=mean.men, sd=std.men)
sample.women <- rnorm(100, mean=mean.women, sd=std.women)

mean(sample.men)
mean(sample.women)

## repeat the process 1000 times
# initialise an empty vector with 1000 missing values
x.y <- rep(NA,1000)
# check the length and the first 5 observations
length(x.y)
x.y[1:5]

# for loop to repeat the previous process
# each result is stored in one of the 1000 "spots" of the vector
for (i in 1:1000) {
  x.y[i] <- mean(rnorm(100, mean=mean.men, sd=std.men)) -
            mean(rnorm(100, mean=mean.women, sd=std.women))
}

# check the result
summary(x.y)
hist(x.y)
mean(x.y)
sd(x.y)

# theoretical values:
mean.men - mean.women
(sqrt(std.men^2 + std.women^2))/10
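As an aside, the for loop above can be written more compactly with replicate(), which evaluates an expression repeatedly and collects the results; the two versions are equivalent:

```r
mean.men <- 69.1; std.men <- 2.9
mean.women <- 63.7; std.women <- 2.7

# replicate() evaluates the expression 1000 times
x.y <- replicate(1000,
  mean(rnorm(100, mean=mean.men, sd=std.men)) -
  mean(rnorm(100, mean=mean.women, sd=std.women)))

mean(x.y)  # should be close to 69.1 - 63.7 = 5.4
sd(x.y)    # should be close to sqrt(2.9^2 + 2.7^2)/10, about 0.4
```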

########################
##### exercise 3.4 #####
########################

# this package lets you read datasets in other formats (Stata, ...)
library("foreign")

# get the data on Andrew Gelman's web page
# http://www.stat.columbia.edu/~gelman/arm/examples/child.iq/
# store it somewhere
# print the current working directory
getwd()
# change your working directory if needed or include the full path to the dataset that you just saved
# (if you use the full path, use / instead of \ )
# setwd("C:/blabla/blablabla/")

iq.data <- read.dta("child.iq.dta")
# or:
# iq.data <- read.dta("C:/blabla/blablabla/child.iq.dta")

# check what's inside this dataset
str(iq.data)
summary(iq.data)

# fit the model
iq.fit <- lm(ppvt ~ momage, data=iq.data)
display(iq.fit)

# plot some stuff
plot(iq.data$momage, iq.data$ppvt, xlab="mother's age", ylab="kid's score", cex=.25)
abline(iq.fit)

# another fit with a quadratic function of momage
iq.fit2 <- lm(ppvt ~ momage + I(momage^2), data=iq.data)
display(iq.fit2)
# the corresponding plot
curve(cbind(1,x,x^2) %*% coef(iq.fit2), add=TRUE, col="red")

# check for heteroskedasticity
plot(iq.data$momage, resid(iq.fit))

# including mother's education
# several ways...

iq.data$momeduc <- as.factor(iq.data$educ_cat)
str(iq.data)
summary(iq.data)
iq.fit.m1 <- lm(ppvt ~ momage + momeduc, data=iq.data)
display(iq.fit.m1)

iq.fit.m2 <- lm(ppvt ~ momage + educ_cat, data=iq.data)
display(iq.fit.m2)

iq.fit.m3 <- lm(ppvt ~ momage + momeduc + momage:momeduc, data=iq.data)
display(iq.fit.m3)

iq.fit.m4 <- lm(ppvt ~ momeduc + momage:momeduc -1, data=iq.data)
display(iq.fit.m4)

colors <- ifelse(iq.data$momeduc==1, "black",
          ifelse(iq.data$momeduc==2, "blue",
          ifelse(iq.data$momeduc==3, "red", "green")))
plot(iq.data$momage, iq.data$ppvt, xlab="mother's age", ylab="kid's score", cex=1, col=colors)

# unsolved problem with the colors, if anyone has any idea... (did not have time to debug this...)
par(mfrow=c(2,2))
for (i in 1:4){
  plot(iq.data$momage[iq.data$momeduc==i], iq.data$ppvt[iq.data$momeduc==i],
       xlab="mother's age", ylab="kid's score", cex=1)
}
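One guess at a simpler approach for the colors (untested against these data, so treat it as a suggestion): index a vector of colors by the original numeric educ_cat instead of using the nested ifelse on the factor:

```r
# map education category 1..4 directly to a color
# (assumes educ_cat takes integer values 1 to 4)
educ.colors <- c("black", "blue", "red", "green")
# with the data loaded, this would replace the nested ifelse:
# colors <- educ.colors[iq.data$educ_cat]
# plot(iq.data$momage, iq.data$ppvt, xlab="mother's age",
#      ylab="kid's score", cex=1, col=colors)
educ.colors[c(1, 3)]  # indexing example: "black" "red"
```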

################################
##### end of the exercises #####
################################


### beauty and teaching evaluation
# create a graph with R which is consistent with the table

b <- 0.13
a <- 4.01
par(mfrow=c(1,1))
plot(c(-10,10), c(-10, 10), type="n")
abline(a,b)

# equivalent way to draw the same line (without add=TRUE this opens a new plot)
curve(cbind(1,x) %*% c(a,b))

beauty.n <- 463
beauty.resid.sd <- 0.55
beauty.R2 <- .04

beauty.resid <- rnorm(beauty.n, mean=0, sd=beauty.resid.sd)
hist(beauty.resid)
sd(beauty.resid)


beauty.sd <- sd(beauty.resid)*sqrt(beauty.R2/(1-beauty.R2))/b
beauty.sd
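The formula for beauty.sd inverts the one-predictor R-squared identity, R2 = (b*sd.x)^2 / ((b*sd.x)^2 + sd.resid^2), solved for sd.x. A quick check that the implied sd reproduces the assumed R2, using the stated values rather than the simulated residuals:

```r
b <- 0.13; resid.sd <- 0.55; R2 <- 0.04
beauty.sd <- resid.sd * sqrt(R2/(1 - R2)) / b
# plug back into the R-squared identity
implied.R2 <- (b*beauty.sd)^2 / ((b*beauty.sd)^2 + resid.sd^2)
round(implied.R2, 2)  # 0.04
```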

sample.beauty <- rnorm(beauty.n, mean=0, sd=beauty.sd)
sample.eval <- a + b*sample.beauty + beauty.resid

plot(sample.beauty, sample.eval, cex=.25)
fit.beauty <- lm(sample.eval ~ sample.beauty)
display(fit.beauty)
abline(fit.beauty)

# you can add dashed lines for the standard deviation...

### Earnings and height
# create a graph with R which is consistent with the table

earnings.n <- 10000
sample.height.in <- rnorm(earnings.n, mean=69.1, sd=2.9)
sample.height.cm <- sample.height.in*2.54
hist(sample.height.cm)  # note: the object sample.height does not exist
earnings.resid.eur.sd <- 13000
earnings.resid.eur <- rnorm(earnings.n, mean=0, sd=earnings.resid.eur.sd)
sd(earnings.resid.eur)
sample.earnings.eur <- 340*sample.height.cm -41000 + earnings.resid.eur
sample.earnings.usd <- sample.earnings.eur*1.5

plot(sample.height.cm, sample.earnings.eur, cex=.25)
fit.earnings.eur <- lm(sample.earnings.eur ~ sample.height.cm)
display(fit.earnings.eur)
abline(fit.earnings.eur)

plot(sample.height.in, sample.earnings.usd, cex=.25)
fit.earnings.usd <- lm(sample.earnings.usd ~ sample.height.in)
display(fit.earnings.usd)
abline(fit.earnings.usd)
41000*1.5     # expected intercept (in absolute value) in dollars
340*2.54*1.5  # expected slope in dollars per inch
13000*1.5     # expected residual sd in dollars

Homework assignment, due Tuesday 29 September

The data for all assignments are at http://www.stat.columbia.edu/~gelman/arm/examples/

1. Exercise 3.1: The folder pyth contains outcome y and inputs x1, x2 for 40 data points, with a further 20 points with the inputs but no observed outcome. Save the file to your working directory and read it into R using the read.table() function.

(a) Use R to fit a linear regression model predicting y from x1, x2, using the first
40 data points in the file. Summarize the inferences and check the fit of your
model.

(b) Display the estimated model graphically as in Figure 3.2.

(c) Make a residual plot for this model. Do the assumptions appear to be met?

(d) Make predictions for the remaining 20 data points in the file. How confident
do you feel about these predictions?

2. Exercise 3.2(a): Suppose that, for a certain population, we can predict log earnings from log height as follows:

• A person who is 66 inches tall is predicted to have earnings of $30,000.

• Every increase of 1% in height corresponds to a predicted increase of 0.8% in
earnings.

• The earnings of approximately 95% of people fall within a factor of 1.1 of
predicted values.

(a) Give the equation of the regression line and the residual standard deviation of the regression.
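A hedged hint for the residual standard deviation: "within a factor of 1.1" means within plus or minus log(1.1) on the log scale, and 95% coverage corresponds to roughly plus or minus 2 standard deviations, so:

```r
log(1.1)      # half-width of the 95% interval on the log scale
log(1.1) / 2  # approximate residual sd, about 0.048
```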

3. Exercise 3.3: In this exercise you will simulate two variables that are statistically independent of each other to see what happens when we run a regression of one on the other.

(a) First generate 1000 data points from a normal distribution with mean 0 and
standard deviation 1 by typing var1 <- rnorm(1000,0,1) in R. Generate
another variable in the same way (call it var2). Run a regression of one
variable on the other. Is the slope coefficient statistically significant?

(b) Now run a simulation repeating this process 100 times. This can be done
using a loop. From each simulation, save the z-score (the estimated coefficient
of var1 divided by its standard error). If the absolute value of the z-score
exceeds 2, the estimate is statistically significant. Here is code to perform the
simulation:

z.scores <- rep (NA, 100)
for (k in 1:100) {
  var1 <- rnorm (1000,0,1)
  var2 <- rnorm (1000,0,1)
  fit <- lm (var2 ~ var1)
  z.scores[k] <- coef(fit)[2]/se.coef(fit)[2]
}

How many of these 100 z-scores are statistically significant?
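To answer that question, count the z-scores that exceed 2 in absolute value; since var1 and var2 are independent, about 5% should come out "significant" by chance. A self-contained version of the simulation (using vcov() for the standard error so that the arm package is not needed; se.coef(fit)[2] gives the same number):

```r
z.scores <- rep(NA, 100)
for (k in 1:100) {
  var1 <- rnorm(1000, 0, 1)
  var2 <- rnorm(1000, 0, 1)
  fit <- lm(var2 ~ var1)
  # slope estimate divided by its standard error
  z.scores[k] <- coef(fit)[2] / sqrt(diag(vcov(fit)))[2]
}
sum(abs(z.scores) > 2)  # typically around 5 of 100
```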

R code for the height and earnings example

## Read & clean the data
# Data are at http://www.stat.columbia.edu/~gelman/arm/examples/earnings

library ("arm")
heights <- read.dta ("http://www.stat.columbia.edu/~gelman/arm/examples/earnings/heights.dta")
attach.all (heights)

# recode sex variable
male <- 2 - sex

# (for simplicity) remove cases with missing data
ok <- !is.na (earn+height+sex) & earn>0
heights.clean <- as.data.frame (cbind (earn, height, male)[ok,])
n <- nrow (heights.clean)
attach.all (heights.clean)
height.jitter.add <- runif (n, -.2, .2)

## Model fit
lm.earn <- lm (earn ~ height)
display (lm.earn)
sim.earn <- sim (lm.earn)
beta.hat <- coef(lm.earn)

## Figure 4.1 (left)
par (mar=c(6,6,4,2)+.1)
plot (height + height.jitter.add, earn, xlab="height", ylab="earnings", pch=20, mgp=c(4,2,0), yaxt="n", col="gray10",
main="Fitted linear model")
axis (2, c(0,100000,200000), c("0","100000","200000"), mgp=c(4,1.1,0))
for (i in 1:20){
  curve (sim.earn$coef[i,1] + sim.earn$coef[i,2]*x, lwd=.5, col="gray", add=TRUE)
}
curve (beta.hat[1] + beta.hat[2]*x, add=TRUE, col="red")

## Figure 4.1 (right)
par (mar=c(6,6,4,2)+.1)
plot (height + height.jitter.add, earn, xlab="height", ylab="earnings", pch=20, mgp=c(4,2,0), yaxt="n", col="gray10",
main="Fitted linear model",xlim=c(0,80),ylim=c(-200000,200000))
axis (2, c(-100000,0,100000), c("-100000","0","100000"), mgp=c(4,1.1,0))
for (i in 1:20){
  curve (sim.earn$coef[i,1] + sim.earn$coef[i,2]*x, lwd=.5, col="gray", add=TRUE)
}
curve (beta.hat[1] + beta.hat[2]*x, add=TRUE, col="red")


## Log transformation

log.earn <- log(earn)
earn.logmodel.1 <- lm(log.earn ~ height)
display(earn.logmodel.1)

# Figure 4.3

sim.logmodel.1 <- sim (earn.logmodel.1)
beta.hat <- coef (earn.logmodel.1)

par (mar=c(6,6,4,2)+.1)
plot (height + runif(n,-.2,.2), log.earn, xlab="height", ylab="log (earnings)", pch=20, yaxt="n", mgp=c(4,2,0), col="gray10",
main="Log regression, plotted on log scale")
axis (2, seq(6,12,2), mgp=c(4,1.1,0))
for (i in 1:20)
curve (sim.logmodel.1$coef[i,1] + sim.logmodel.1$coef[i,2]*x, lwd=.5, col="gray", add=TRUE)
curve (beta.hat[1] + beta.hat[2]*x, add=TRUE, col="red")

par (mar=c(6,6,4,2)+.1)
plot (height + runif(n,-.2,.2), earn, xlab="height", ylab="earnings", pch=20, yaxt="n", mgp=c(4,2,0), col="gray10",
main="Log regression, plotted on original scale")
axis (2, c(0,100000,200000), c("0","100000","200000"), mgp=c(4,1.1,0))
for (i in 1:20)
curve (exp(sim.logmodel.1$coef[i,1] + sim.logmodel.1$coef[i,2]*x), lwd=.5, col="gray", add=TRUE)
curve (exp(beta.hat[1] + beta.hat[2]*x), add=TRUE, col="red")

## Log-base-10 transformation

log10.earn <- log10(earn)
earn.log10model <- lm(log10.earn ~ height)
display(earn.log10model)

## Log scale regression model

earn.logmodel.2 <- lm(log.earn ~ height + male)
display(earn.logmodel.2)

## Including interactions

earn.logmodel.3 <- lm(log.earn ~ height + male + height:male)
display(earn.logmodel.3)

## Linear transformations

z.height <- (height - mean(height))/sd(height)
earn.logmodel.4 <- lm(log.earn ~ z.height + male + z.height:male)
display(earn.logmodel.4)

## Log-log model

log.height <- log(height)
earn.logmodel.5 <- lm(log.earn ~ log.height + male)
display(earn.logmodel.5)

R code for the beauty and teaching evaluations example

# Load in the "arm" package. (Always do this at the start of your R session.)
library (arm)

# The data came as an Excel file (ProfEvaltnsBeautyPublic.xls)
# I went into Excel and saved it as a .csv file (comma-separated version)

# Read the data into R, including the variable names (headers)
beauty.data <- read.table ("http://www.stat.columbia.edu/~gelman/arm/examples/beauty/ProfEvaltnsBeautyPublic.csv", header=TRUE, sep=",")
#beauty.data <- read.table ("ProfEvaltnsBeautyPublic.csv", header=TRUE,sep=",")

# Attach the data so they are accessible using their variable names
attach.all (beauty.data)

###############################################################################
# Do more beautiful profs get higher evaluations?
###############################################################################

# Rename the two variables for convenience
beauty <- btystdave
eval <- courseevaluation

# Make a scatterplot
plot (beauty, eval)

# Make a scatterplot, labeling the axes
plot (beauty, eval, xlab="beauty", ylab="average teaching evaluation")

# Fit a linear regression
lm.1 <- lm (eval ~ beauty)
display (lm.1)

# Extract the information: fitted line is y = a + b*x + error
# Errors have mean 0, standard deviation sigma
a <- coef (lm.1)["(Intercept)"]
b <- coef (lm.1)["beauty"]
sigma <- sigma.hat (lm.1)

# Plot the regression line on top of the scatterplot
curve (a + b*x, add=TRUE)

# Add dotted lines to show +/- 1 standard deviation
curve (a + b*x + sigma, lty=2, add=TRUE)
curve (a + b*x - sigma, lty=2, add=TRUE)

###############################################################################
# Do things differ for male and female profs? Parallel regression lines
###############################################################################

lm.2 <- lm (eval ~ beauty + female)
display (lm.2)

# Set up a 2x2 grid of plots
par (mfrow=c(2,2))

# Make separate plots for men and women
b0 <- coef (lm.2)["(Intercept)"]
b1 <- coef (lm.2)["beauty"]
b2 <- coef (lm.2)["female"]

plot (beauty[female==0], eval[female==0], xlim=range(beauty), ylim=range(eval),
xlab="beauty", ylab="average teaching evaluation", main="Men")
curve (b0 + b1*x + b2*0, add=TRUE)

plot (beauty[female==1], eval[female==1], xlim=range(beauty), ylim=range(eval),
xlab="beauty", ylab="average teaching evaluation", main="Women")
curve (b0 + b1*x + b2*1, add=TRUE)

# Display both sexes on the same plot
# First make the plot with type="n" (which displays axes but does not plot
# the points), then plot the points and lines separately for each sex
plot (beauty, eval, xlab="beauty", ylab="average teaching evaluation",
main="Both sexes", type="n")
points (beauty[female==0], eval[female==0], col="blue")
curve (b0 + b1*x + b2*0, add=TRUE, col="blue")
points (beauty[female==1], eval[female==1], col="red")
curve (b0 + b1*x + b2*1, add=TRUE, col="red")

###############################################################################
# Do things differ for male and female profs? Non-parallel regression lines
###############################################################################

lm.3 <- lm (eval ~ beauty + female + beauty*female)
display (lm.3)

# Set up a new 2x2 grid of plots
par (mfrow=c(2,2))

# Display the parallel regression lines in gray and the non-parallel lines
# in heavy black
b0.new <- coef (lm.3)["(Intercept)"]
b1.new <- coef (lm.3)["beauty"]
b2.new <- coef (lm.3)["female"]
b3.new <- coef (lm.3)["beauty:female"]

plot (beauty[female==0], eval[female==0], xlim=range(beauty), ylim=range(eval),
xlab="beauty", ylab="average teaching evaluation", main="Men")
curve (b0 + b1*x + b2*0, lwd=.5, col="gray", add=TRUE)
curve (b0.new + b1.new*x + b2.new*0 + b3.new*x*0, lwd=2, col="black", add=TRUE)

plot (beauty[female==1], eval[female==1], xlim=range(beauty), ylim=range(eval),
xlab="beauty", ylab="average teaching evaluation", main="Women")
curve (b0 + b1*x + b2*1, lwd=.5, col="gray", add=TRUE)
curve (b0.new + b1.new*x + b2.new*1 + b3.new*x*1, lwd=2, col="black", add=TRUE)

###############################################################################
# More models
###############################################################################

lm.4 <- lm (eval ~ beauty + female + beauty*female + age)
display (lm.4)

lm.5 <- lm (eval ~ beauty + female + beauty*female + minority)
display (lm.5)

lm.6 <- lm (eval ~ beauty + female + beauty*female + nonenglish)
display (lm.6)

lm.7 <- lm (eval ~ beauty + female + beauty*female + nonenglish + lower)
display (lm.7)

Thursday, September 17, 2009

How to install packages

if you run into the following problem:
> library ("arm")
Erreur dans library("arm") : aucun package nommé 'arm' n'est trouvé
(in English: "Error in library("arm"): there is no package called 'arm'" -- you may see the English message instead)

it's probably because you have not installed the "arm" package yet.

To do so, open R and use the GUI (Graphical User Interface) as follows:
- click on the "Packages" menu
- select "choose mirror"
- select one in France
- click on the "Packages" menu again
- select "install packages"
- select "arm"

If you have any trouble with the GUI, you can directly type in the R console:
> chooseCRANmirror()
(R is case sensitive! -- upper-case letters are not interpreted the same way as lower-case letters)

then type:
> install.packages()
then select "arm"

or directly, type:
> install.packages("arm")

Tuesday, September 15, 2009

First lecture, including roach and rat images

Here are the images from the 15 sept lecture.

Homework assignment, due mardi 22 sept

The data for all assignments are at http://www.stat.columbia.edu/~gelman/arm/examples/

1. Follow the instructions to download and configure R and a text editor. (You don't need to download or set up Bugs.)

To make sure everything's working, go into R and type the following:

library ("arm")
x <- c(1,2,3)
y <- c(1,2,5)
fit <- lm (y ~ x)
display (fit)

The following should then appear in your R console:

lm(formula = y ~ x)
coef.est coef.se
(Intercept) -1.33 1.25
x 2.00 0.58
n = 3, k = 2
residual sd = 0.82, R-Squared = 0.92

If you cannot get this to work, please speak with Romain right away!

2. Exercise 2.1: A test is graded from 0 to 50, with an average score of 35 and a standard deviation of 10. For comparison to other tests, it would be convenient to rescale to a mean of 100 and standard deviation of 15.

(a) How can the scores be linearly transformed to have this new mean and standard
deviation?

(b) There is another linear transformation that also rescales the scores to have mean 100 and standard deviation 15. What is it, and why would you not want to use it for this purpose?

3. Exercise 2.4: Distribution of averages and differences: the heights of men in the United States are approximately normally distributed with mean 69.1 inches and standard deviation 2.9 inches. The heights of women are approximately normally distributed with mean 63.7 inches and standard deviation 2.7 inches. Let x be the average height of 100 randomly sampled men, and y be the average height of 100 randomly sampled women. In R, create 1000 simulations of x − y and plot their histogram. Using the simulations, compute the mean and standard deviation of the distribution of x − y and compare to their exact values.

4. Exercise 3.4 (a,b): The child.iq folder contains a subset of the children and mother data discussed earlier in the chapter. You have access to children’s test scores at age 3, mother’s education, and the mother’s age at the time she gave birth for a sample of 400 children. The data are a Stata file which you can read into R by saving in your working directory and then typing the following:

library ("foreign")
iq.data <- read.dta ("child.iq.dta")

(a) Fit a regression of child test scores on mother’s age, display the data and fitted model, check assumptions, and interpret the slope coefficient. When do you recommend mothers should give birth? What are you assuming in making these recommendations?

(b) Repeat this for a regression that further includes mother’s education, interpreting both slope coefficients in this model. Have your conclusions about the
timing of birth changed?

Course blog

Here we will post material for my Fall 2009 introductory statistics course at Sciences Po.

We will post class notes, R code, homework assignments, and other things.

Books for the course:

Andrew Gelman and Jennifer Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models. We will cover chapters 2-10.

John Fox, An R and S-Plus Companion to Applied Regression. A useful book for learning R.

Also, you should set up your computer using the instructions on this page. But you don't have to set up WinBugs and OpenBugs.