Friday, December 18, 2009

Pieces of R code for conference #10

######################
### conference #10 ###
######################

library("arm")

############
### 10.2 ###
############

Bypass <- read.table("bypass.data.txt", header=T)
str(Bypass)

### b
logit1 <- glm(new ~ I(age<80)
, family=binomial(link="logit"), data=Bypass)

display(logit1)

logit2 <- glm(new ~ I(age<80) + severity
, family=binomial(link="logit"), data=Bypass)

display(logit2, digits=3)

logit3 <- glm(new ~ I(age<80) + severity
, family=binomial(link="logit"), data=Bypass, subset=(age>=70 & age <=90))

display(logit3, digits=3)

logit4 <- glm(new ~ I(age<80) + severity
, family=binomial(link="logit"), data=Bypass, subset=(age>=75 & age <=85))

display(logit4, digits=3)

invlogit(coef(logit4)[1] + coef(logit4)[2] + coef(logit4)[3]*mean(Bypass$severity[Bypass$age>=75 & Bypass$age<80]))
invlogit(coef(logit4)[1] + coef(logit4)[3]*mean(Bypass$severity[Bypass$age>=80 & Bypass$age<=85]))

invlogit(coef(logit4)[1] + coef(logit4)[2] + coef(logit4)[3]*mean(Bypass$severity[Bypass$age>=75 & Bypass$age<80]))*(1-invlogit(coef(logit4)[1] + coef(logit4)[2] + coef(logit4)[3]*mean(Bypass$severity[Bypass$age>=75 & Bypass$age<80])))

invlogit(coef(logit4)[1] + coef(logit4)[3]*mean(Bypass$severity[Bypass$age>=80 & Bypass$age<=85]))*(1-invlogit(coef(logit4)[1] + coef(logit4)[3]*mean(Bypass$severity[Bypass$age>=80 & Bypass$age<=85])))
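
# the two invlogit() lines above give the predicted probability of receiving the new
# procedure just below and just above the age-80 cutoff (severity held at its within-group
# mean), and the p*(1-p) lines are the corresponding binomial variances
# a rough sketch of the estimated jump in that probability at the cutoff:
p.below80 <- invlogit(coef(logit4)[1] + coef(logit4)[2] + coef(logit4)[3]*mean(Bypass$severity[Bypass$age>=75 & Bypass$age<80]))
p.above80 <- invlogit(coef(logit4)[1] + coef(logit4)[3]*mean(Bypass$severity[Bypass$age>=80 & Bypass$age<=85]))
p.below80 - p.above80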

lm1 <- lm(stay ~ severity, data=Bypass)
display(lm1)

lm2 <- lm(stay ~ severity + age, data=Bypass)
display(lm2)

lm3 <- lm(stay ~ severity + age + I(age^2), data=Bypass)
display(lm3)

plot(stay ~ severity, data=Bypass, subset=new==0, ylim=c(15,45))
points(stay ~ severity, pch=3, data=Bypass, subset=new==1)

plot(stay ~ age, data=Bypass, subset=new==0, ylim=c(15,45), xlim=c(50,90))
points(stay ~ age, pch=3, data=Bypass, subset=new==1)

plot(severity ~ age, data=Bypass, subset=(new==1), xlim=range(Bypass$age), ylim=range(Bypass$severity))
points(severity ~ age, data=Bypass, subset=(new==0), pch=3)


### BONUS c ###

# without severity
Bypass$age80 <- Bypass$age - 80

TE1 <- lm(stay ~ new + age80 + age80:new, data=Bypass)
display(TE1)


# using the age threshold as a local IV
library("sem")

RDD.tsls <- tsls(stay ~ new + age80 , ~ I(age<80) + age80:I(age<80) + age80 , data=Bypass)
summary(RDD.tsls)

RDD.tsls2 <- tsls(stay ~ new + age80 + age80:new, ~ I(age<80) + age80:I(age<80) + age80 , data=Bypass)
summary(RDD.tsls2)

RDD.tsls3 <- tsls(stay ~ new + age80 , ~ I(age<80) + age80 , data=Bypass)
summary(RDD.tsls3)


# example of a corresponding Wald estimate
RDD.itt <- lm(stay ~ I(age<80) + age80 + age80:I(age<80), data=Bypass)
display(RDD.itt)

RDD.comply <- lm(new ~ I(age<80) + age80 + age80:I(age<80), data=Bypass)
display(RDD.comply)

RDD.Wald <- coef(RDD.itt)[2]/coef(RDD.comply)[2]
RDD.Wald


# using severity
TE <- lm(stay ~ new + severity + age, data=Bypass)
display(TE)


############
### 10.3 ###
############

### a

n <- 1000
ni <- 500

h0 <- rnorm(n, mean=69.1, sd=2.9)
h1 <- h0 + rnorm(n, mean=2, sd=.5)

milk0 <- rbinom(n, size=1, prob=.3)
milk1 <- ifelse(milk0==1, 1, rbinom(n, size=1, prob=.5))

never <- milk0==0 & milk1==0
always <- milk0==1 & milk1==1
complier <- milk0==0 & milk1==1

sum(never)
sum(always)
sum(complier)

voucher <- rep(c(1,0), c(ni, n-ni))

milk <- ifelse(voucher, milk1, milk0)
h <- ifelse(milk, h1, h0)


ITT <- lm(h ~ voucher)
display(ITT)

comply <- lm(milk ~ voucher)
display(comply)

LATE.Wald <- coef(ITT)[2]/coef(comply)[2]
LATE.Wald

library("sem")

LATE.tsls <- tsls(h ~ milk, ~ voucher )
summary(LATE.tsls)
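
# quick check (a sketch): the 2SLS point estimate should match the Wald estimate above
# up to numerical precision; the second tsls coefficient is the one on milk
coef(LATE.tsls)[2] - LATE.Wald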


### b
n.b <- 1000

never.b <- rep(c(1, 0, 0), c(100, 100, 800))
always.b <- rep(c(0, 1, 0), c(100, 100, 800))
complier.b <- rep(c(0, 0, 1), c(100, 100, 800))

h0.b <- rnorm(n.b, mean=69.1, sd=2.9)
h1.b <- ifelse(complier.b, h0.b + rnorm(n.b, mean=2, sd=.5), h0.b + rnorm(n.b, mean=1, sd=.5))

voucher.b <- rbinom(n.b, 1, .5)

milk.b <- ifelse(never.b, 0, ifelse(always.b, 1, ifelse(voucher.b, 1, 0)))
h.b <- ifelse(milk.b, h1.b, h0.b)


ITT.b <- lm(h.b ~ voucher.b)
display(ITT.b)

comply.b <- lm(milk.b ~ voucher.b)
display(comply.b)

LATE.Wald.b <- coef(ITT.b)[2]/coef(comply.b)[2]
LATE.Wald.b

LATE.tsls.b <- tsls(h.b ~ milk.b, ~ voucher.b )
summary(LATE.tsls.b)




### c
n.c <- 1000

never.c <- rep(c(1, 0, 0), c(400, 400, 200))
always.c <- rep(c(0, 1, 0), c(400, 400, 200))
complier.c <- rep(c(0, 0, 1), c(400, 400, 200))

h0.c <- rnorm(n.c, mean=69.1, sd=2.9)
h1.c <- ifelse(complier.c, h0.c + rnorm(n.c, mean=2, sd=.5), h0.c + rnorm(n.c, mean=1, sd=.5))

voucher.c <- rbinom(n.c, 1, .5)

milk.c <- ifelse(never.c, 0, ifelse(always.c, 1, ifelse(voucher.c, 1, 0)))
h.c <- ifelse(milk.c, h1.c, h0.c)


ITT.c <- lm(h.c ~ voucher.c)
display(ITT.c)

comply.c <- lm(milk.c ~ voucher.c)
display(comply.c)

LATE.Wald.c <- coef(ITT.c)[2]/coef(comply.c)[2]
LATE.Wald.c

LATE.tsls.c <- tsls(h.c ~ milk.c, ~ voucher.c )
summary(LATE.tsls.c)

Thursday, December 10, 2009

First 5 problems of practice final exam for Sciences Po course, for mardi, 15 dec, 2009

You may use the textbook and one feuille of notes (both sides). No other references. You can use a pocket calculator but no computer.

Also please note that all these examples are fake; they are not based on real data.

Below are the first five problems; I'll send more soon.

Problem 1.

1000 people are surveyed and asked how many political activities they have done in the previous year. 300 people say “none,” 300 say “one,” 200 say “two,” 100 say “three,” and 100 say “4 or more.”

Give an estimate and standard error of the average number of political activities done by people in the population. (Hint: to compute the standard error of the mean, you need to estimate the standard deviation of the responses. If you draw a histogram of the responses, that should help.)

Problem 2.

A survey is done asking people if they support government-supplied child care. 60% of the men in the survey answer yes (with a standard error of 5%), and 70% of the women answer yes (with a standard error of 5%).

Give an estimate and standard error for the difference in support between men and women.

Assuming the data were collected using a simple random sample, how many people responded to the survey?

How large would the sample size have to be for the difference in support between men and women to have a standard error of 1%?

Problem 3.

A survey is done asking people if they support affirmative action for ethnic minorities. 30% of whites say yes (with a standard error of 3%), 80% of blacks say yes (with a standard error of 5%), and 70% of others say yes (with a standard error of 10%).

The population of interest is 70% white, 20% black, and 10% other. Assuming the survey is a simple random sample, give an estimate and standard error of the percentage of people in the population who support affirmative action for ethnic minorities.

Problem 4.

1100 people are surveyed: 500 say they voted for John McCain and 600 say they voted for Barack Obama. Of the McCain voters, 40% support the health-care bill that is being debated in Congress. Of the Obama voters, 70% support the bill. (People who did not vote in the past election are excluded from this study.)

Suppose you take the data (y=1 for supporting the bill and 0 for opposing it) and run a linear regression, predicting y from a variable called “obama,” which is 1 for people who say they voted for Obama and 0 for people who say they voted for McCain.

The regression output looks like this:

lm(formula = y ~ obama)
coef.est coef.se
(Intercept) 0.__ 0.__
obama 0.__ 0.__
---
n = ____, k = _
residual sd = 0.__, R-Squared = 0.__

Fill in the blanks.

Problem 5.

Students in a laboratory experiment are given the choice of which of two video games to play, a storybook adventure or a shoot-em-up game. A logistic regression is fit predicting the students’ choice (y=0 if they choose the adventure or 1 if they choose the shooting game), given their age in years. (The students’ ages in the experiment range from 18 to 25 years.)

The logistic regression output looks like this:

glm(formula = y ~ age, family = binomial(link = "logit"))
coef.est coef.se
(Intercept) 5.64 1.61
age -0.30 0.08
---
n = 230, k = 2
residual deviance = 263.5, null deviance = 281.0 (difference = 17.4)

From this model, what is the probability of choosing the shooting game if you are 20 years old?

Graph the fitted regression line (for x ranging from 18 to 25), and put in a bunch of dots representing values of x and y that are consistent with the fitted line.

Wednesday, December 9, 2009

Research article to be discussed mardi 15 dec

Regression models for decision making: a cost-benefit analysis of incentives in telephone surveys. Journal of Business and Economic Statistics (2003). (Andrew Gelman, Matt Stevens, and Valerie Chan)

Saturday, December 5, 2009

some R code for conference #10

######################
### conference #10 ###
######################

library("arm")

###########
### 9.8 ###
###########


n.C <- 100
n.T <- 100

x.C <- rnorm(n.C, mean=10, sd=4)
x.T <- rnorm(n.T, mean=10, sd=4)

e.C <- rnorm(n.C, mean=0, sd=1)
e.T <- rnorm(n.T, mean=0, sd=1)

y.C <- 1 + 2*x.C + e.C

y.Ta <- 1 + 2*x.T + e.T
y.Tb <- 1 + 2*x.T + e.T + 10
y.Tc <- 1 + 2*x.T + e.T + .5*x.T

plot(x.C, y.C, pch=1)
abline(lm(y.C ~ x.C))

points(x.T, y.Ta, pch=3)
abline(lm(y.Ta ~ x.T))


plot(x.C, y.C, pch=1)
abline(lm(y.C ~ x.C))

points(x.T, y.Tb, pch=3)
abline(lm(y.Tb ~ x.T))

plot(x.C, y.C, pch=1)
abline(lm(y.C ~ x.C))

points(x.T, y.Tc, pch=3)
abline(lm(y.Tc ~ x.T))


###########
### 9.9 ###
###########

### a

n.C <- 100
n.T <- 100

x.C <- rnorm(n.C, mean=10, sd=4)
x.T1 <- rnorm(n.T, mean=20, sd=4)

e.C <- rnorm(n.C, mean=0, sd=1)
e.T <- rnorm(n.T, mean=0, sd=1)

y.C <- 1 + 2*x.C + e.C
y.T1 <- 1 + 2*x.T1 + e.T + 10

y.a <- c(y.C, y.T1)
x.a <- c(x.C, x.T1)
T.a <- rep(c(0,1), c(n.C, n.T))


fit1 <- lm(y.a ~ T.a)
display(fit1)

fit2 <- lm(y.a ~ T.a + x.a)
display(fit2)

plot(x.C, y.C, pch=1, xlim=c(0,40), ylim=c(0,70))
abline(lm(y.C ~ x.C))

points(x.T1, y.T1, pch=3)
abline(lm(y.T1 ~ x.T1))

abline(h=mean(y.C))
abline(h=mean(y.T1))


### b

y.C2 <- 1 + 2*x.C + .2*I(x.C^2) + e.C
y.T2 <- 1 + 2*x.T1 + .2*I(x.T1^2) + e.T + 10

y.b <- c(y.C2, y.T2)
x.b <- c(x.C, x.T1)
T.b <- rep(c(0,1), c(n.C, n.T))


plot(x.C, y.C2, pch=1, xlim=c(0,40), ylim=c(0,200))
points(x.T1, y.T2, pch=3)

abline(lm(y.C2 ~ x.C))
abline(lm(y.T2 ~ x.T1))

fit3 <- lm(y.b ~ T.b + x.b)
display(fit3)

fit3b <- lm(y.b ~ T.b + x.b + T.b*x.b)
display(fit3b)

fit4 <- lm(y.b ~ T.b + x.b + I(x.b^2))
display(fit4)

Wednesday, December 2, 2009

Homework assignment due mardi, 8 dec

1. Exercise 10.2: Regression discontinuity analysis: suppose you are trying to evaluate the effect of a new procedure for coronary bypass surgery that is supposed to help with the post-operative healing process. The new procedure is risky, however, and is rarely performed in patients who are over 80 years old. Data from this (hypothetical) example are displayed in Figure 10.10.

(a) Does this seem like an appropriate setting in which to implement a regression discontinuity analysis?

(b) The folder bypass contains data for this example: stay is the length of hospital stay after surgery, age is the age of the patient, and new is the indicator variable indicating that the new surgical procedure was used. Pre-operative disease severity (severity) was unobserved by the researchers but we have access to it for illustrative purposes. Can you find any evidence using these data that the regression discontinuity design is inappropriate?

2. Exercise 10.3: Instrumental variables: come up with a hypothetical example in which it would be appropriate to estimate treatment effects using an instrumental variables strategy. For simplicity, stick to an example with a binary instrument and binary treatment variable.

(a) Simulate data for this imaginary example assuming all the assumptions are met. Estimate the local average treatment effect (LATE) for the data by dividing the intent-to-treat effect by the percentage of compliers. Show that two-stage least squares yields the same point estimate.

(b) Now simulate data in which the exclusion restriction is not met (so, for instance, those whose treatment level is left unaffected by the instrument have a treatment effect of half the magnitude of the compliers) but the instrument is strong (say, 80% of the population are compliers) and see how far off your estimate is.

(c) Finally simulate data in which the exclusion restriction is violated in the same way, however, the instrument is weak (only 20% of the population are compliers) and see how far off your estimate is.

Monday, November 30, 2009

Research article to be discussed mardi 8 dec 2009

Can losing lead to winning?, by Jonah Berger and Devin Pope.

Also see discussion here.

Friday, November 27, 2009

Homework assignment due mardi, 1 dec

1. Exercise 9.7: Gain-score models: in the discussion of gain-score models in Section 9.3, we noted that if we include the pre-treatment measure of the outcome in a gain score model, the coefficient on the treatment indicator will be the same as if we had just run a standard regression of the outcome on the treatment indicator and the pre-treatment measure. Show why this is true.

2. Exercise 9.8: Assume that linear regression is appropriate for the regression of an outcome, y, on treatment indicator, T , and a single confounding covariate, x. Sketch hypothetical data (plotting y versus x, with treated and control units indicated by circles and dots, respectively) and regression lines (for treatment and control group) that represent each of the following situations:

(a) No treatment effect,

(b) Constant treatment effect,

(c) Treatment effect increasing with x.

3. Exercise 9.9: Consider a study with an outcome, y, a treatment indicator, T , and a single confounding covariate, x. Draw a scatterplot of treatment and control observations that demonstrates each of the following:

(a) A scenario where the difference in means estimate would not capture the true treatment effect but a regression of y on x and T would yield the correct estimate.

(b) A scenario where a linear regression would yield the wrong estimate but a nonlinear regression would yield the correct estimate.

Sunday, November 22, 2009

some R code for conference #8

#####################
### Conference #8 ###
#####################

library("arm")

###########
### 9.4 ###
###########

categoryZ <- 1:8
nbpersZ <- c(300, 300, 500, 500, 200, 200, 200, 200)
xZ <- rep(c(0,1),4)
xZ
TZ <- rep(rep(c(0,1),c(2,2)),2)
TZ
y0Z <- rep(c(4,10), c(4,4))
y1Z<- rep(c(6,12), c(4,4))

base <- data.frame(category = rep(categoryZ, nbpersZ),
x = rep(xZ, nbpersZ),
T = rep(TZ, nbpersZ),
y0 = rep(y0Z, nbpersZ),
y1 = rep(y1Z, nbpersZ))
str(base)
base$y <- ifelse(base$T == 0, base$y0, base$y1)
base$z <- ifelse(base$y0 == 4, 0, 1)

### a

# ATE = 2

# estimations

fit1 <- lm(y ~ T, data=base)
display(fit1)

mean(base$y[base$T==1]) - mean(base$y[base$T==0])

fit2 <- lm(y ~ T + x, data=base)
display(fit2)


fit3 <- lm(y ~ T + x + T:x, data=base)
display(fit3)

m0 <- mean(base$y[base$T==1 & base$x==0]) - mean(base$y[base$T==0 & base$x==0])
m1 <- mean(base$y[base$T==1 & base$x==1]) - mean(base$y[base$T==0 & base$x==1])
m0
m1


####
fit4 <- lm(y ~ T + z + T:z, data=base)
display(fit4)

Saturday, November 21, 2009

R codes for book

Go to my webpage, then to the link for my book with Jennifer, then click on Data and software, then Directories with data, then you can download the zip files with the data and the code for the book.

Recently in class we've been discussing the examples from chapters 7 and 8.

Friday, November 20, 2009

Homework assignment due mardi, 24 nov

1. Exercise 9.1: Suppose you are interested in the effect of the presence of vending machines in schools on childhood obesity. What randomized experiment would you want to
do (in a perfect world) to evaluate this question?

2. Exercise 9.4 (b,c,d): See statement of Exercise 9.4 from last week's homework assignment. Then:

(b) Is it plausible to believe that these data came from a randomized experiment? Defend your answer.

(c) Another population quantity is the mean of y for those who received the treatment minus the mean of y for those who did not. What is the relation between this quantity and the average treatment effect?

(d) For these data, is it plausible to believe that treatment assignment is ignorable given sex? Defend your answer.

3. Exercise 9.5: For the hypothetical study in the previous exercise, figure out the estimate and the standard error of the coefficient of T in a regression of y on T and x.

4. Exercise 9.6: You are consulting for a researcher who has performed a randomized trial where the treatment was a series of 26 weekly therapy sessions, the control was no therapy, and the outcome was self-report of emotional state one year later. However, most people in the treatment group did not attend every therapy session. In fact there was a good deal of variation in the number of therapy sessions actually attended. The researcher is concerned that her results represent “watered down” estimates because of this variation and suggests adding in another predictor to the model: number of therapy sessions attended. What would you advise her?

Tuesday, November 17, 2009

some R code for conference #7

#####################
### conference #7 ###
#####################

library("arm")
library("car")

###########
### 2.3 ###
###########

n.sum <- 20
n.sim <- 1000

simCLT <- replicate(n.sim, sum(runif(n.sum)))
hist(simCLT, probability=TRUE, xlim=c(n.sum/2-3*sqrt(n.sum/12), n.sum/2+3*sqrt(n.sum/12)))
curve(dnorm(x, mean=n.sum/2, sd=sqrt(n.sum/12)), add=T)

# use number of bins according to car:
hist(simCLT, probability=TRUE, xlim=c(n.sum/2-3*sqrt(n.sum/12), n.sum/2+3*sqrt(n.sum/12)), nclass=n.bins(simCLT))
curve(dnorm(x, mean=n.sum/2, sd=sqrt(n.sum/12)), add=T)


###########
### 8.1 ###
###########

### y = 3 + 0.1*x1 + 0.5*x2 + 5*error, with error ~ t(df=4), so the error term has scale 5 as in the exercise

### a

n.obs <- 100

x1 <- 1:n.obs
x2 <- rbinom(n.obs, size=1, prob=0.5)
error <- rt(n.obs, df=4)
true.coef <- c(3, 0.1, 0.5, 5)
y <- cbind(rep(1, n.obs), x1, x2, error) %*% true.coef
y[1:10]
plot(x1, y)

fit1 <- lm(y ~ x1 + x2)
display(fit1)
hat.coef <- coef(fit1)
hat.coef.se <- se.coef(fit1)
hat.coef.se <- sqrt(diag(vcov(fit1)))
cover.68 <- abs(hat.coef - true.coef[1:3]) < hat.coef.se
cover.68

### b

sim.T <- function(c.int, c1, c2, scale, nobs){
x1 <- 1:nobs
x2 <- rbinom(nobs, size=1, prob=0.5)
error <- rt(nobs, df=4)
true.coef <- c(c.int, c1, c2, scale)
y <- cbind(rep(1, nobs), x1, x2, error) %*% true.coef
fit1 <- lm(y ~ x1 + x2)
hat.coef <- coef(fit1)
hat.coef.se <- sqrt(diag(vcov(fit1)))
abs(hat.coef - true.coef[1:3]) < hat.coef.se
}

res.sim <- replicate(1000, sim.T(3, 0.1, 0.5, 5, 100))

#dim(res.sim)
#is.array(res.sim)

apply(res.sim, 1, mean)
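# each row of res.sim corresponds to one coefficient (intercept, x1, x2), so these
# proportions estimate the actual coverage of the +/- 1 se intervals (nominal level: 68%)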


### c
library("hett")

fit2 <- tlm(y ~ x1 + x2, start=list(dof=4), estDof=FALSE)
summary(fit2)
fit2
hat.coef <- coef(fit2$loc.fit)
hat.coef.se <- sqrt(diag(summary.tlm(fit2)$loc.summary[["cov.unscaled"]]))
abs(hat.coef - true.coef[1:3]) < hat.coef.se

sim.T2 <- function(c.int, c1, c2, scale, nobs){
x1 <- 1:nobs
x2 <- rbinom(nobs, size=1, prob=0.5)
error <- rt(nobs, df=4)
true.coef <- c(c.int, c1, c2, scale)
y <- cbind(rep(1, nobs), x1, x2, error) %*% true.coef
fit2 <- tlm(y ~ x1 + x2, start=list(dof=4), estDof=FALSE)
hat.coef <- coef(fit2$loc.fit)
hat.coef.se <- sqrt(diag(summary.tlm(fit2)$loc.summary[["cov.unscaled"]]))
abs(hat.coef - true.coef[1:3]) < hat.coef.se
}

res.sim2 <- replicate(1000, sim.T2(3, 0.1, 0.5, 5, 100))

#dim(res.sim)
#is.array(res.sim)

apply(res.sim2, 1, mean)


### compare standard error of the parameter estimates

sim.T.se <- function(c.int, c1, c2, scale, nobs){
x1 <- 1:nobs
x2 <- rbinom(nobs, size=1, prob=0.5)
error <- rt(nobs, df=4)
true.coef <- c(c.int, c1, c2, scale)
y <- cbind(rep(1, nobs), x1, x2, error) %*% true.coef
fit1 <- lm(y ~ x1 + x2)
se.coef(fit1)
}
sim.T2.se <- function(c.int, c1, c2, scale, nobs){
x1 <- 1:nobs
x2 <- rbinom(nobs, size=1, prob=0.5)
error <- rt(nobs, df=4)
true.coef <- c(c.int, c1, c2, scale)
y <- cbind(rep(1, nobs), x1, x2, error) %*% true.coef
fit2 <- tlm(y ~ x1 + x2, start=list(dof=4), estDof=FALSE)
sqrt(diag(summary.tlm(fit2)$loc.summary[["cov.unscaled"]]))
}

sim.se1 <- replicate(1000, sim.T.se(3, 0.1, 0.5, 5, 100))
sim.se2 <- replicate(1000, sim.T2.se(3, 0.1, 0.5, 5, 100))
apply(sim.se1, 1, mean)
apply(sim.se2, 1, mean)

Sunday, November 15, 2009

Homework assignment due mardi, 17 nov

1. Exercise 9.3: Suppose you are a consultant for a researcher who is interested in investigating the effects of teacher quality on student test scores. Use the strategy of mapping this question to a randomized experiment to help define the question more clearly. Write a memo to the researcher asking for needed clarifications to this study proposal.

2. Exercise 9.4 (a): The table below describes a hypothetical experiment on 2400 persons. Each row of the table specifies a category of person, as defined by his or her pre-treatment predictor x, treatment indicator T, and potential outcomes y0, y1. (For simplicity, we assume unrealistically that all the people in this experiment fit into these eight categories.)

Category #persons.in.category x T y0 y1
1 300 0 0 4 6
2 300 1 0 4 6
3 500 0 1 4 6
4 500 1 1 4 6
5 200 0 0 10 12
6 200 1 0 10 12
7 200 0 1 10 12
8 200 1 1 10 12

In making the table we are assuming omniscience, so that we know both y0 and y1 for all observations. But the (nonomniscient) investigator would only observe x, T, and yT for each unit. (For example, a person in category 1 would have x=0, T=0, y=4, and a person in category 3 would have x=0, T=1, y=6.)

(a) What is the average treatment effect in this population of 2400 persons?

Tuesday, November 10, 2009

Voter turnout in the United States

As requested, here is a link to the research on the political attitudes of nonvoters and the potential effects of increased voter turnout.

Research article to be discussed mardi, 17 nov

Follow this link and read the linked article in Behavioral and Brain Sciences.

Also this.

Sunday, November 8, 2009

Research article to be discussed mardi 10 nov

The Accident Externality from Driving, by Aaron Edlin and Pinar Mandic.

We [Edlin and Karaca-Mandic] estimate auto accident externalities (more specifically insurance externalities) using panel data on state-average insurance premiums and loss costs. Externalities appear to be substantial in traffic-dense states: in California, for example, we find that the increase in traffic density from a typical additional driver increases total statewide insurance costs of other drivers by $1,725-$3,239 per year, depending on the model. High-traffic density states have large economically and statistically significant externalities in all specifications we check. In contrast, the accident externality per driver in low-traffic states appears quite small. On balance, accident externalities are so large that a correcting Pigouvian tax could raise $66 billion annually in California alone, more than all existing California state taxes during our study period, and over $220 billion per year nationally.

Homework assignment, due mardi, 10 nov

1. Exercise 2.3: Demonstration of the Central Limit Theorem: let x = x1 + · · · + x20, the sum of 20 independent Uniform(0,1) random variables. In R, create 1000 simulations of x and plot their histogram. On the histogram, overlay a graph of the normal density function. Comment on any differences between the histogram and the curve.

2. Exercise 8.1: Fitting the wrong model: suppose you have 100 data points that arose from the following model: y = 3+0.1x1+0.5x2+error, with errors having a t distribution with mean 0, scale 5, and 4 degrees of freedom. We shall explore the implications of fitting a standard linear regression to these data.

(a) Simulate data from this model. For simplicity, suppose the values of x1 are simply the integers from 1 to 100, and that the values of x2 are random and equally likely to be 0 or 1. (In R, you can define x.1 <- 1:100, simulate x.2 using rbinom(), then create the linear predictor, and finally simulate the random errors in y using the rt() function.) Fit a linear regression (with normal errors) to these data and see if the 68% confidence intervals for the regression coefficients (for each, the estimates ±1 standard error) cover the true values.

(b) Put the above step in a loop and repeat 1000 times. Calculate the confidence
coverage for the 68% intervals for each of the three coefficients in the model.

(c) Repeat this simulation, but instead fit the model using t errors (see Exercise 6.6).

Friday, November 6, 2009

Some R code for conference #6


#####################
### conference #6 ###
#####################

library("arm")

###########
### 7.1 ###
###########

### a

f.shots <- function(p.success=.6, hits.stop=2){
shots <- 0
missed <- 0
success <- 0
while (missed < hits.stop){
if (rbinom(1,1,p.success)==1) {
shots <- shots+1
success <- success + 1
missed <- 0
} else {
shots <- shots+1
missed <- missed + 1
}
}
c(shots=shots,success=success)
}
f.shots()

### b
sim.shots <- rep(NA,1000)
mytime <- proc.time()
for (i in 1:1000){
sim.shots[i] <- f.shots()["shots"]
}
proc.time() - mytime

mean(sim.shots)
sd(sim.shots)

# slightly faster with "replicate"
mytime <- proc.time()
sim.shots2 <- replicate(1000, f.shots()["shots"])
proc.time() - mytime

mean(sim.shots2)
sd(sim.shots2)


# theoretical results:
# let p = .6 be the probability of a successful shot and q = 1 - p = .4 the probability of a miss
# a sequence with s successes is (an optional single miss, then a hit) repeated s times,
# followed by the two final misses, so P(successes = s) = q^2 * (p*(1+q))^s,
# a geometric distribution with parameter q^2 = .16
# hence E(successes) = p*(1+q)/q^2 = 5.25 and E(shots) = (1+q)/q^2 = 8.75,
# to be compared with mean(sim.shots) above and the simulation below

sim.shots.success <- data.frame(shots=rep(NA,1000), success=rep(NA,1000))
for (i in 1:1000){
sim.shots.success[i,] <- f.shots()
}
str(sim.shots.success)
sim.shots.success[1:10,]
sim.shots.success$prop.success <- sim.shots.success$success/sim.shots.success$shots

mean(sim.shots.success$shots)
mean(sim.shots.success$success)
sd(sim.shots.success$shots)
sd(sim.shots.success$success)

### c
plot(sim.shots.success$shots, sim.shots.success$prop.success)

#############
### BONUS ###
#############

### if what we want is the total number of non missed shots
f.hits <- function(p.success=.6, hits.stop=2){
hits <- 0
missed <- 0
while (missed < hits.stop){
if (rbinom(1,1,p.success)==1) {
hits <- hits+1
missed <- 0
} else {
missed <- missed + 1
}
}
hits
}
f.hits()

sim.hits <- rep(NA,10000)
mytime <- proc.time()
for (i in 1:10000){
sim.hits[i] <- f.hits()
}
proc.time() - mytime

mean(sim.hits)
sd(sim.hits)
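
# quick check (a sketch) of the distribution of hits against the geometric distribution
# derived in the comments below: P(H = n) should be .84^n * .16
data.frame(hits = 0:10,
simulated = as.vector(table(factor(sim.hits, levels=0:10)))/10000,
theoretical = round(dgeom(0:10, prob=.16), 3))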

## mat10000 <- matrix(c(.6,2), nrow=2, ncol=10000)
## mytime <- proc.time()
## sim.hits2 <- apply(mat10000, 2, f.hits)
## proc.time() - mytime

## mean(sim.hits2)
## sd(sim.hits2)

# theoretical results in that case:
# compute the probability to take n hits by induction
# p(0)=.4*.4
# a sequence leading to n hits starts either by 1 or by 0
# the set of sequences leading to n hits and starting with a 0 is the same as the set of sequences leading to n hits and starting with a 1, but adding a 0 at the beginning of the sequence
# n -> n+1
# the set of sequences leading to n+1 hits and starting with a 1 is the same as the set of sequences leading to n hits and starting with a 1, but adding a 1 at the beginning plus the set of sequences starting with a 1 but adding 10 at the beginning

# let H be the random variable equal to the number of hits
# carrying out the induction: a sequence with n hits is (an optional single miss, then a hit)
# repeated n times, followed by the two final misses, so
# p(n) = P(H = n) = .4^2 * (.6*(1+.4))^n = .16 * .84^n
# i.e. H follows a geometric distribution with parameter q^2 = .16, hence
# E(H) = .84/.16 = 5.25 and sd(H) = sqrt(.84)/.16 = 5.73,
# which should be close to mean(sim.hits) and sd(sim.hits) above

###########
### 7.3 ###
###########

# let's assume all random variables are independent, continuous and normally distributed

f.savings <- function(m.unit=5, sd.unit=4, m.market=40000, s.market=10000){
market.size <- round(rnorm(1, mean=m.market, sd=s.market),0)
if (market.size<0) {0 <- market.size}
savings <- rnorm(market.size, mean=m.unit, sd=sd.unit)
sum(savings)
}
f.savings()

total.savings <- rep(NA,1000)
for (i in 1:1000){
total.savings[i] <- f.savings()
}
mean(total.savings)
sd(total.savings)

# the order of magnitude should be about s.market*m.unit=10000*5
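# more precisely, for a random sum, var(total) = E(N)*var(unit) + var(N)*E(unit)^2,
# so the theoretical sd of the total savings is roughly
sqrt(40000*4^2 + 10000^2*5^2)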

# another story:
# let's assume that there are 40000 firms
# each firm will independently enter the market with probability 1/2501
# each firm which enters the market will buy 2501 widgets
# otherwise it will buy 0 widgets

f.savings2 <- function(m.unit=5, sd.unit=4){
market.size <- 2501*rbinom(1,40000,1/2501)
savings <- rnorm(market.size, mean=m.unit, sd=sd.unit)
sum(savings)
}
f.savings2()
total.savings2 <- rep(NA,1000)
for (i in 1:1000){
total.savings2[i] <- f.savings2()
}
mean(total.savings2)
sd(total.savings2)

###########
### 7.8 ###
###########

unidim.sim <- function(estimate, sd.estimate, df, n){
X <- rchisq(n, df=df)
rnorm(n, mean=estimate, sd=sd.estimate*sqrt(df/X))
}
unidim.sim(estimate=600, sd.estimate=400, df=50, n=5)

plot(unidim.sim(estimate=600, sd.estimate=400, df=50, n=1000), unidim.sim(estimate=3, sd.estimate=1, df=100, n=1000))

cost.eff <- rep(NA,1000)
for (i in 1:1000){
cost.eff[i] <- mean(unidim.sim(estimate=600, sd.estimate=400, df=50, n=1000))/mean(unidim.sim(estimate=3, sd.estimate=1, df=100, n=1000))
}

hist(cost.eff)
quantile(cost.eff,probs=c(.025, .25, .75, .975))
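# the inner pair (.25, .75) gives the 50% interval and the outer pair (.025, .975) the 95% interval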

cost.eff2 <- rep(NA,1000)
for (i in 1:1000){
cost.eff2[i] <- mean(unidim.sim(estimate=600, sd.estimate=400, df=50, n=1000))/mean(unidim.sim(estimate=3, sd.estimate=2, df=100, n=1000))
}

hist(cost.eff2)
quantile(cost.eff2,probs=c(.025, .25, .75, .975))

Monday, October 26, 2009

Hints for conference #6

# You can complete the following pieces of code

###########
### 7.1 ###
###########

### a

f.shots <- function(p.success=.6, hits.stop=2){
shots <- 0
missed <- 0
success <- 0
while ( ### condition to complete ### ){

### actions to complete here ###

}

# result:
c(shots=shots,success=success)
}
f.shots()

### b
# initialize a vector of size 1000 with missing values
sim.shots <- rep(NA,1000)
# fill in the missing values...

mean(sim.shots)
sd(sim.shots)

# do the same with both shots and successes:
sim.shots.success <- data.frame(shots=rep(NA,1000), success=rep(NA,1000))


mean(sim.shots.success$shots)
mean(sim.shots.success$success)
sd(sim.shots.success$shots)
sd(sim.shots.success$success)

### c
# plot the result


###########
### 7.3 ###
###########

# let's assume all random variables are independent, continuous and normally distributed

f.savings <- function(m.unit=5, sd.unit=4, m.market=40000, s.market=10000){
# calculate market size:
market.size <-

# calculate the vector of savings and sum it

}
f.savings()

# initialize
total.savings <- rep(NA,1000)
# fill in

mean(total.savings)
sd(total.savings)

# another story:
# let's assume that there are 40000 firms
# each firm will independently enter the market with probability 1/2501
# each firm which enters the market will buy 2501 widgets
# otherwise it will buy 0 widgets

# do the same

###########
### 7.8 ###
###########

# create function according to the procedure p.143
# use rchisq and rnorm
unidim.sim <- function(estimate, sd.estimate, df, n){
}

# try it
unidim.sim(estimate=600, sd.estimate=400, df=50, n=5)

# plot the result with 1000 points

# 1000 simulations:
cost.eff <- rep(NA,1000)
# fill in

# histogram
hist(cost.eff)

# use function quantile for the confidence

# try again with different values

Saturday, October 24, 2009

some R code for conference #5


### conf #5

library("arm")

###########
### 5.5 ###
###########

### a

# plot of invlogit
curve(invlogit(x), from=-10, to=10)

# plot of invlogit(-24 +.4*x) with different ranges of x
curve(invlogit(-24 +.4*x), from=0, to=100)
curve(invlogit(-24 +.4*x), from=45, to=75)
curve(invlogit(-24 +.4*x), from=30, to=90)

midterm <- rnorm(50, mean=60, sd=15)
midterm <- ifelse(midterm<0, 0, ifelse(midterm>100, 100, midterm))

jitter.binary <- function(a, jitt=.05){
ifelse(a==0, runif(length(a), 0, jitt), runif(length(a), 1-jitt, 1))
}

# generate consistent final results using the model and midterm results
# do not just use the predicted value...
final.bad <- ifelse(invlogit(-24 + .4*midterm)>.5, 1, 0)
final.good <- rbinom(length(midterm), 1, invlogit(-24 + .4*midterm))

# little test to check the syntax above
# test <- c(rep(.1,100),rep(.9,100))
# rbinom(length(test), 1, test)

plot(midterm,jitter.binary(final.good), cex=.25)
curve(invlogit(-24 + .4*x), add=T)
abline(v=60)

plot(midterm,jitter.binary(final.bad), cex=.25)
curve(invlogit(-24 + .4*x), add=T)
abline(v=60)

### b

# with standardized midterm scores z = (midterm - 60)/15 the model becomes
# Pr(pass) = invlogit(-24 + 0.4*(60 + 15*z)) = invlogit(0 + 6*z),
# so the intercept becomes about 0 and the slope about 6

# test the transformation
logit1 <- glm(final.good ~ midterm, family=binomial(link="logit"))
display(logit1)

midterm.cs <- (midterm-mean(midterm))/sd(midterm)
logit2 <- glm(final.good ~ midterm.cs, family=binomial(link="logit"))
display(logit2)

### c
newpred <- rnorm(length(midterm), 0, 1)
logit3 <- glm(final.good ~ midterm + newpred, family=binomial(link="logit"))
display(logit3)

###########
### 5.6 ###
###########

curve(dlogis(x), from=-6, to=6)

curve(dlogis(x, location=4.5), from=-2, to=11)
abline(h=0)
#abline(v=0)
polygon(c(0,seq(0,12, length=100),12), c(0,dlogis(seq(0,12, length=100), location=4.5),0), col="grey")
polygon(c(0,seq(0,12, length=10),12), c(0,dlogis(seq(0,12, length=10), location=4.5),0), col="grey")
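
# the shaded area to the right of 0 is Pr(y=1) for this person: the latent variable is
# 1 + 2*1 + 3*0.5 = 4.5 plus a logistic error, and y equals 1 exactly when it is positive
invlogit(4.5)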

Homework assignment, due mardi, 27 oct

1. Exercise 7.1: Discrete probability simulation: suppose that a basketball player has a 60% chance of making a shot, and he keeps taking shots until he misses two in a row. Also assume his shots are independent (so that each shot has 60% probability of success, no matter what happened before).

(a) Write an R function to simulate this process.

(b) Put the R function in a loop to simulate the process 1000 times. Use the simulation to estimate the mean, standard deviation, and distribution of the total number of shots that the player will take.

(c) Using your simulations, make a scatterplot of the number of shots the player will take and the proportion of shots that are successes.

2. Exercise 7.3: Propagation of uncertainty: we use a highly idealized setting to illustrate the use of simulations in combining uncertainties. Suppose a company changes its technology for widget production, and a study estimates the cost savings at $5 per unit, but with a standard error of $4. Furthermore, a forecast estimates the size of the market (that is, the number of widgets that will be sold) at 40,000, with a standard error of 10,000. Assuming these two sources of uncertainty are independent, use simulation to estimate the total amount of money saved by the new product (that is, savings per unit, multiplied by size of the market).

3. Exercise 7.8: Inference for the ratio of parameters: a (hypothetical) study compares the costs and effectiveness of two different medical treatments.

• In the first part of the study, the difference in costs between treatments A and B is estimated at $600 per patient, with a standard error of $400, based on a regression with 50 degrees of freedom.

• In the second part of the study, the difference in effectiveness is estimated at 3.0 (on some relevant measure), with a standard error of 1.0, based on a regression with 100 degrees of freedom.

• For simplicity, assume that the data from the two parts of the study were collected independently. Inference is desired for the incremental cost-effectiveness ratio: the difference between the average costs of the two treatments, divided by the difference between their average effectiveness. (This problem is discussed further by Heitjan, Moskowitz, and Whang, 1999.)

(a) Create 1000 simulation draws of the cost difference and the effectiveness difference, and make a scatterplot of these draws.

(b) Use simulation to come up with an estimate, 50% interval, and 95% interval for the incremental cost-effectiveness ratio.

(c) Repeat this problem, changing the standard error on the difference in effectiveness to 2.0.

Sunday, October 18, 2009

Some R code for conference #4

### conf #4

library("arm")
library ("foreign")

### a
# do not convert categorical variables into factors in order to have numerical variables for the analysis
Nes <- read.dta("nes5200_processed_voters_realideo.dta", convert.factors=F)

# same but conversion into factors in order to have the labels
Nes.fact <- read.dta("nes5200_processed_voters_realideo.dta")
str(Nes)
str(Nes.fact)

summary(Nes.fact)
#labels(Nes)["educ1"]

ok <- !is.na(Nes$age + Nes$female + Nes$black + Nes$educ1 + Nes$income + Nes$partyid7 + Nes$presvote + Nes$real_ideo) & Nes$year==1992 & Nes$presvote<3
Nes.red <- Nes[ok, c("age", "female", "black", "educ1", "income", "partyid7", "presvote", "real_ideo")]
summary(Nes.red)
summary(Nes.red$presvote[Nes.red$black==1])
table(Nes.red$income[Nes.red$presvote==1])
table(Nes.red$income[Nes.red$presvote==2])
table(Nes.red$presvote, Nes.red$partyid7)

Nes.red$presvote <- Nes.red$presvote -1

logit1 <- glm(presvote ~ income, family=binomial(link="logit"), data=Nes.red)
display(logit1)


logit2 <- glm(presvote ~ income + age + female + black + educ1 + partyid7 + real_ideo, family=binomial(link="logit"), data=Nes.red)
display(logit2, digits=4)

# drop income and female
# interact black, educ1, partyid7 and real_ideo with each other
logit3 <- glm(presvote ~ age + black + educ1 + partyid7 + real_ideo +
black:educ1 + partyid7:educ1 + real_ideo:educ1 +
black:partyid7 + black:real_ideo + partyid7:real_ideo
, family=binomial(link="logit"), data=Nes.red)
display(logit3, digits=4)

logit4 <- glm(presvote ~ age + black + income + real_ideo + income:real_ideo
, family=binomial(link="logit"), data=Nes.red)
display(logit4, digits=4)

# standardize and refit to know the respective magnitudes

# fitted values -- 2 ways (at least)
# invlogit(predict(logit1))-logit1$fitted.values

# residual plot
resid <- logit4$y - logit4$fitted.values
plot(logit4$fitted.values, logit4$y - logit4$fitted.values)


# binned residual plot
# 40 bins of equal size
resid.quant <- quantile(logit4$fitted.values, probs=seq(0,1,length=41))
# is.vector(bin.resid)
length(resid.quant)
resid.quant
mean.bin <- rep(NA, 40)
for (k in 1:40){
mean.bin[k] <- mean(resid[logit4$fitted.values>=resid.quant[k] &logit4$fitted.values<=resid.quant[k+1] ])
}
plot(resid.quant[-1], mean.bin, ylim=c(-.3,.3))

# var.bin <- rep(NA, 40)
# for (k in 1:40){
# var.bin[k] <- var(logit4$fitted.values[logit4$fitted.values>=resid.quant[k] &logit4$fitted.values<=resid.quant[k+1] ])
# }

var.bin <- rep(NA, 40)
for (k in 1:40){
var.bin[k] <- mean((logit4$fitted.values[logit4$fitted.values>=resid.quant[k] &logit4$fitted.values<=resid.quant[k+1] ])*(1-logit4$fitted.values[logit4$fitted.values>=resid.quant[k] &logit4$fitted.values<=resid.quant[k+1] ]))
}


# fitted.sort <- sort(logit4$fitted.values)
# lines(fitted.sort, 2*sqrt(fitted.sort*(1-fitted.sort)/948*40))

lines(resid.quant[-1], 2*sqrt(var.bin/948*40))
lines(resid.quant[-1], -2*sqrt(var.bin/948*40))
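
# note: the arm package also provides a ready-made version of this plot:
# binnedplot(logit4$fitted.values, resid)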

Tuesday, October 13, 2009

Research articles to be discussed mardi 20 oct

Three articles on statistical significance:

Of beauty, sex, and power: statistical challenges in estimating small effects. (Andrew Gelman and David Weakliem)

Letter to the editors regarding some papers of Dr. Satoshi Kanazawa. (Andrew Gelman)

The difference between "significant" and "not significant" is not itself statistically significant. (Andrew Gelman and Hal Stern)

Homework assignment, due mardi 20 oct

1. Exercise 5.5: In a class of 50 students, a logistic regression is performed of course grade (pass or fail) on midterm exam score (continuous values with mean 60 and standard deviation 15). The fitted model is Pr(pass) = logit−1(−24 + 0.4x).

(a) Graph the fitted model. Also on this graph put a scatterplot of hypothetical data consistent with the information given.

(b) Suppose the midterm scores were transformed to have a mean of 0 and standard deviation of 1. What would be the equation of the logistic regression using these transformed scores as a predictor?

(c) Create a new predictor that is pure noise (for example, in R you can create
newpred <- rnorm (n,0,1)). Add it to your model. How much does the deviance decrease?

2. Exercise 5.6: Latent-data formulation of the logistic model: take the model
Pr(y = 1) = invlogit (1 + 2x1 + 3x2) and consider a person for whom x1 = 1 and x2 = 0.5. Sketch the distribution of the latent data for this person. Figure out the probability that y=1 for the person and shade the corresponding area on your graph.

3. Exercise 5.7: Limitations of logistic regression: consider a dataset with n = 20 points, a single predictor x that takes on the values 1, . . . , 20, and binary data y. Construct data values y1, . . . , y20 that are inconsistent with any logistic regression on x. Fit a logistic regression to these data, plot the data and fitted curve, and explain why you can say that the model does not fit the data.

Powerpoint for fifth lecture

class5.pdf

Powerpoints for second, third, and fourth lectures

class2.pdf

class3.pdf

class4.pdf

Thursday, October 8, 2009

Homework assignment, due mardi 13 oct

1. Exercise 5.1: The folder nes contains the survey data of presidential preference and income for the 1992 election analyzed in Section 5.1, along with other variables including sex, ethnicity, education, party identification, and political ideology.

(a) Fit a logistic regression predicting support for Bush given all these inputs. Consider how to include these as regression predictors and also consider possible interactions.

(b) Evaluate and compare the different models you have fit. Consider coefficient estimates and standard errors, residual plots, and deviances.

(c) For your chosen model, discuss and compare the importance of each input variable in the prediction.

2. Exercise 5.2: Without using a computer, sketch the following logistic regression lines:
(a) Pr(y = 1) = invlogit (x)
(b) Pr(y = 1) = invlogit (2 + x)
(c) Pr(y = 1) = invlogit (2x)
(d) Pr(y = 1) = invlogit (2 + 2x)
(e) Pr(y = 1) = invlogit (−2x)

3. Exercise 5.3: You are interested in how well the combined earnings of the parents in a child’s family predicts high school graduation. You are told that the probability a child graduates from high school is 27% for children whose parents earn no income and is 88% for children whose parents earn $60,000. Determine the logistic regression model that is consistent with this information. (For simplicity you may want to assume that income is measured in units of $10,000).
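
For Exercise 5.3, a quick R sketch of the arithmetic (taking income in units of $10,000, so that $60,000 corresponds to income = 6):

b <- (qlogis(.88) - qlogis(.27))/6 # slope: difference of the logits over 6 income units
a <- qlogis(.27) # intercept: logit of 27% at zero income
c(a, b) # roughly Pr(graduate) = invlogit(-1.0 + 0.5*income)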

Tuesday, October 6, 2009

pieces of R code for conference #3

########################
##### Exercise 4.1 #####
########################

### a
exp(-.25)
exp(.25)

### b
a <- -3.5
b <- 2
error.sd <- .25
earnings.n <- 1000

height.in <- rnorm(earnings.n, mean=69.1, sd=2.9)

l.height.in <- log(height.in)

l.weight.error <- rnorm(earnings.n, mean=0, sd=error.sd)

l.weight.lbs <- a + b*l.height.in + l.weight.error

display(lm(l.weight.lbs ~ l.height.in))

plot(l.height.in, l.weight.lbs, cex=.25,
xlab="log(height) in inches",
ylab="log(weight) in pounds")
curve(cbind(1,x) %*% c(a,b), add=T)
curve(cbind(1,x) %*% c(a+error.sd,b), add=T, lty=2)
curve(cbind(1,x) %*% c(a-error.sd,b), add=T, lty=2)

########################
##### Exercise 4.2 #####
########################
library ("foreign")
library("arm")

### a
Heights <- read.dta ("heights.dta")
str(Heights)
summary(Heights)

##### piece of code taken and adapted from earnings_setup.R #####

attach(Heights)

# create variables for age and ethnicity categories

age <- 90 - yearbn # survey was conducted in 1990
age[age<18] <- NA
age.category <- ifelse (age<35, 1, ifelse (age<50, 2, 3))
eth <- ifelse (race==2, 1, ifelse (hisp==1, 2, ifelse (race==1, 3, 4)))
male <- 2 - sex

# (for simplicity) remove cases with missing data
# and restrict to people with positive earnings born after 1925

ok <- !is.na (earn+height+sex+age) & earn>0 & yearbn>25
heights.clean <- as.data.frame (cbind (earn, height, sex, race, hisp, ed, age, age.category, eth, male)[ok,])
n <- nrow (heights.clean)

###############################
detach(Heights)
remove(age, age.category, eth, male)
attach(heights.clean)

### b
earnings.fit1 <- lm(earn ~ height)
display(earnings.fit1)

c.height <- height - mean(height)
earnings.fit2 <- lm(earn ~ c.height)
display(earnings.fit2)

### c
l.earn=log(earn)
l.height=log(height)
age2=age*age
earnings.fit3 <- lm(l.earn ~ male + height + age)
display(earnings.fit3, digits=4)

earnings.fit4 <- lm(l.earn ~ male + height + age + male:height)
display(earnings.fit4, digits=4)

earnings.fit5 <- lm(l.earn ~ male + c.height + age + male:c.height)
display(earnings.fit5, digits=4)

z.height=c.height/sd(height)
age.25 <- age-25
earnings.fit6 <- lm(l.earn ~ male + z.height + age.25 + male:z.height + male:age.25)
display(earnings.fit6, digits=4)

earnings.fit7 <- lm(l.earn ~ male + z.height + age.25 + male:z.height + male:age.25 + age.25:z.height)
display(earnings.fit7, digits=4)


########################
##### Exercice 4.3 #####
########################


plot(c(1.8,8), c(100,200),
xlab="age/10",
ylab="weight (in pounds)",
type='n'
)
curve(cbind(1,x) %*% c(161,2.6), add=T)
curve(cbind(1,x,x^2) %*% c(96.2,33.6,-3.2), add=T, col="red")
curve(cbind(1,(x>=3.0 & x<4.5),(x>=4.5 & x<6.5),(x>=6.5) ) %*% c(157.2,19.1, 27.2, 8.5), add=T, col="green")

Friday, October 2, 2009

some R code for conference #2

### Conference de methode #2

library("arm")

###########
### 3.1 ###
###########
getwd()
setwd("/home/argon/ScPo")

# first try... does not work
#Pyth <- read.table("exercise2.1.dat")
#str(Pyth)
#Pyth


Pyth <- read.table("exercise2.1.dat", header=T)
str(Pyth)
Pyth



summary(Pyth[1:40,])

fit.Pyth <- lm(y ~ x1 + x2, data=Pyth[1:40,])
display(fit.Pyth)

## try to understand what kind of objects are fit.Pyth and coef(fit.Pyth)
attributes(fit.Pyth)
attributes(coef(fit.Pyth))
class(coef(fit.Pyth))
is.vector(coef(fit.Pyth))
# coef(fit.Pyth) is a named vector
coef(fit.Pyth)["x1"]
coef(fit.Pyth)[2]

## 2 graph windows instead of one
par(mfrow=c(1,2))


plot(Pyth$x1[1:40], Pyth$y[1:40], cex=.25)
# curve takes a function of 'x' as an argument
# here this function is given as the product of 2 vectors in the sense of the matrix multiplication
# the choice of the intercepts is explained below
curve(cbind(1,x,mean(Pyth$x2[1:40])) %*% coef(fit.Pyth), add=T )

plot(Pyth$x2[1:40], Pyth$y[1:40], cex=.25)
curve(cbind(1,mean(Pyth$x1[1:40]),x) %*% coef(fit.Pyth), add=T )

plot(Pyth$x1[1:40], Pyth$x2[1:40], cex=.25)

# residual plots versus x1, x2, and predicted values
summary(residuals(fit.Pyth))
str(residuals(fit.Pyth))

plot(Pyth$x1[1:40], residuals(fit.Pyth), cex=.25)
abline(c(0,0))
abline(c(sd(residuals(fit.Pyth)),0), lty=2)
abline(c(-sd(residuals(fit.Pyth)),0), lty=2)
# errors are bigger at both ends of x1

plot(Pyth$x2[1:40], residuals(fit.Pyth), cex=.25)
abline(c(0,0))
abline(c(sd(residuals(fit.Pyth)),0), lty=2)
abline(c(-sd(residuals(fit.Pyth)),0), lty=2)
# not very centered

plot(fitted.values(fit.Pyth)[1:40], residuals(fit.Pyth), cex=.25)
abline(c(0,0))
abline(c(sd(residuals(fit.Pyth)),0), lty=2)
abline(c(-sd(residuals(fit.Pyth)),0), lty=2)
# quite ok

pred.Pyth <- predict(fit.Pyth, Pyth[41:60,], interval="prediction", level=.95)
pred.Pyth

### choice of the intercepts for the two separate graphs of y against x1 and x2
# the graph of x1 against x2 as well as a regression of x1 on x2 seem to indicate that they are independent
# so it makes sense to use the coefficients of the regression of y against x1 and x2
# to draw two separate graphs of y on x1 and x2
# the slopes are given by the regression but the intercepts have to be recalculated
# for each graph we want the line to be centered.
#
# the fitted model is:
#   y = b0 + b1*x1 + b2*x2 + residual
# with estimated coefficients b0, b1, b2 given by coef(fit.Pyth)
#
# graph of y against x1:
# the line is of the form:
#   y = c1 + b1*x1
# we define x1.bar, x2.bar and y.bar as the sample means of x1, x2 and y
# we want the line to be centered in the sense that
#   c1 + b1*x1.bar = y.bar
# since there is a constant term in the regression, we know that:
#   y.bar = b0 + b1*x1.bar + b2*x2.bar
# therefore we take
#   c1 = b0 + b2*x2.bar
# which is exactly what cbind(1, x, mean(Pyth$x2[1:40])) %*% coef(fit.Pyth) draws above
#
# for the graph of y against x2, it's symmetrical



###########
### 3.2 ###
###########

#(a)
# log(earnings) = alpha + beta*log(height) + error
# alpha_hat + beta_hat*log(66) = log(30000), with beta_hat = .8
log(30000)-.8*log(66)
# alpha_hat = 6.96
# about 95% of people are within a factor of 1.1 of the predicted values, so on the log scale
# 2*sigma_hat = log(1.1)
log(1.1)/2
# sigma_hat = .048

###########
### 3.3 ###
###########

var1 <- rnorm(1000, 0,1)
var2 <- rnorm(1000, 0,1)
display(lm(var2 ~ var1))

z.scores <- rep(NA,1000)
for (k in 1:1000){
var1 <- rnorm(1000, 0,1)
var2 <- rnorm(1000, 0,1)
fit <- lm(var2 ~ var1)
z.scores[k] <- coef(fit)[2]/se.coef(fit)[2]
}



length(which(abs(z.scores)>2))
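
# under the null of no relation between var1 and var2, roughly 5% of the 1000
# z-scores should exceed 2 in absolute value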

Research article to be discussed mardi 13 oct

The Effect of Geography-Based Group Threat on Voter Mobilization: A Field Experiment, by Ryan Enos. Click to download. Here's the abstract:

The effect of group threat on voter mobilization has been tested using observational data across a number of different geographies and units of analysis. Previous studies have yielded inconsistent findings. To date, no study of voter mobilization has directly manipulated group threat using a controlled experiment. I take advantage of the unique racial geography of Los Angeles County, California, which brings different racial/ethnic groups into close, yet spatially separated, proximity. This geography allows for a randomized, controlled experiment to directly test the effects of stimulating racial threat on voter turnout. A test of 3,666 African American and Hispanic voters shows an average treatment effect of 2.3 percentage points. The effect is 50% larger for African Americans than Hispanics. These results suggest that even low propensity voters are aware of the geographic proximity of other groups and can be motivated to participate by this awareness.


Also see discussion here about the ethics of the experiment.

Thursday, October 1, 2009

Research articles to be discussed mardi 6 oct

Is Well-being U-Shaped over the Life Cycle?, by David Blanchflower and Andrew Oswald. Follow the link above to download the article. Here's the abstract:

We present evidence that psychological well-being is U-shaped through life. A difficulty with research on this issue is that there are likely to be omitted cohort effects (earlier generations may have been born in, say, particularly good or bad times). First, using data on 500,000 randomly sampled Americans and West Europeans, the paper designs a test that can control for cohort effects. Holding other factors constant, we show that a typical individual’s happiness reaches its minimum -- on both sides of the Atlantic and for both males and females -- in middle age. Second, evidence is provided for the existence of a similar U-shape through the life-course in East European, Latin American and Asian nations. Third, a U-shape in age is found in separate well-being regression equations in 72 developed and developing nations. Fourth, using measures that are closer to psychiatric scores, we document a comparable well-being curve across the life cycle in two other data sets: (i) in GHQ-N6 mental health levels among a sample of 16,000 Europeans, and (ii) in reported depression and anxiety levels among 1 million U.K. citizens. Fifth, we discuss some apparent exceptions, particularly in developing nations, to the U-shape. Sixth, we note that American male birth-cohorts seem to have become progressively less content with their lives. Our paper’s results are based on regression equations in which other influences, such as demographic variables and income, are held constant.


Association between political ideology and health in Europe, by S. V. Subramanian, Tim Huijts, and Jessica M. Perkins. Follow the link above to download the article. Here's the abstract:

Studies have largely examined the association between political ideology and health at the aggregate/ecological level. Using individual-level data from 29 European countries, we investigated whether self-reports of political ideology and health are associated. In adjusted models, we found an inverse association between political ideology and self-rated poor health; for a unit increase in the political ideology scale (towards right) the odds ratio (OR) for reporting poor health decreased (OR 0.95, 95% confidence interval 0.94–0.96). Although political ideology per se is unlikely to have a causal link to health, it could be a marker for health-promoting latent attitudes, values and beliefs.

Wednesday, September 30, 2009

Homework assignment, due mardi 6 oct

1. Exercise 4.1: Logarithmic transformation and regression: consider the following regression:

log(weight) = −3.5 + 2.0 log(height) + error,

with errors that have standard deviation 0.25. Weights are in pounds and heights
are in inches.

(a) Fill in the blanks: approximately 68% of the persons will have weights within
a factor of and of their predicted values from the regression.

(b) Draw the regression line and scatterplot of log(weight) versus log(height) that
make sense and are consistent with the fitted model. Be sure to label the axes
of your graph.

2. Exercise 4.2: The folder earnings has data from the Work, Family, and Well-Being Survey. Pull out the data on earnings, sex, height, and weight.

(a) In R, check the dataset and clean any unusually coded data.

(b) Fit a linear regression model predicting earnings from height. What transformation should you perform in order to interpret the intercept from this model as average earnings for people with average height?

(c) Fit some regression models with the goal of predicting earnings from some combination of sex, height, and weight. Be sure to try various transformations and interactions that might make sense. Choose your preferred model and justify.

(d) Interpret all model coefficients.

3. Exercise 4.3: Plotting linear and nonlinear regressions: we downloaded data with weight (in pounds) and age (in years) from a random sample of American adults. We first created new variables: age10 = age/10 and age10.sq = (age/10)^2, and indicators age18.29, age30.44, age45.64, and age65up for four age categories. We then fit some regressions, with the following results:

lm(formula = weight ~ age10)
coef.est coef.se
(Intercept) 161.0 7.3
age10 2.6 1.6
n = 2009, k = 2
residual sd = 119.7, R-Squared = 0.00

lm(formula = weight ~ age10 + age10.sq)
coef.est coef.se
(Intercept) 96.2 19.3
age10 33.6 8.7
age10.sq -3.2 0.9
n = 2009, k = 3
residual sd = 119.3, R-Squared = 0.01

lm(formula = weight ~ age30.44 + age45.64 + age65up)
coef.est coef.se
(Intercept) 157.2 5.4
age30.44TRUE 19.1 7.0
age45.64TRUE 27.2 7.6
age65upTRUE 8.5 8.7
n = 2009, k = 4
residual sd = 119.4, R-Squared = 0.01

(a) On a graph of weights versus age (that is, weight on y-axis, age on x-axis), draw the fitted regression line from the first model.

(b) On the same graph, draw the fitted regression line from the second model.

(c) On another graph with the same axes and scale, draw the fitted regression line from the third model. (It will be discontinuous.)

Sunday, September 27, 2009

Who has babies when?

Hi all. Take a look at this (including the linked article and the blog comments). These are the kind of issues that arise in statistical analysis in social science.

Tuesday, September 22, 2009

some links concerning R and Emacs

you can read this and type the different commands yourself in R:
http://mephisto.unige.ch/pub/biotree/ED-Pavie/articles/R-FrancoisMarin_modulad-2007.pdf

for those of you who want to try emacs with ESS (not required at all !!!), you can go to:
http://vgoulet.act.ulaval.ca/ressources/emacs/
and choose the windows or mac os version depending on your computer.
(download the corresponding .emacs file as well)

some R code for Conference #1

##### install new packages in particular "arm"
chooseCRANmirror()
install.packages()

##### test
library("arm")
x <- c(1,2,3)
y <- c(1,2,5)
fit <- lm (y ~ x)
display (fit)

########################
##### exercise 2.4 #####
########################

mean.men <- 69.1
std.men <- 2.9

mean.women <- 63.7
std.women <- 2.7


# check syntax of function rnorm
?rnorm

# try the process once:
# take a sample of 100 men
# take a sample of 100 women
# take the means and the differences

sample.men <- rnorm(100, mean=mean.men, sd=std.men)
sample.women <- rnorm(100, mean=mean.women, sd=std.women)

mean(sample.men)
mean(sample.women)

## repeat the process 1000 times
# initialise an empty vector with 1000 missing values
x.y <- rep(NA,1000)
# check the length and the first 5 observations
length(x.y)
x.y[1:5]

# for loop to repeat the previous process
# each result is stored in one of the 1000 "spots" of the vector
for (i in 1:1000) {
x.y[i] <- mean(rnorm(100, mean=mean.men, sd=std.men))-mean(rnorm(100, mean=mean.women, sd=std.women))
}

# check the result
summary(x.y)
hist(x.y)
mean(x.y)
sd(x.y)

# theoretical values:
mean.men - mean.women
(sqrt(std.men^2 + std.women^2))/10
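# the second line is sqrt(std.men^2/100 + std.women^2/100), the standard deviation of
# the difference between the means of two independent samples of size 100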

########################
##### exercise 3.4 #####
########################

# this library allows reading datasets in other formats (Stata, ...)
library("foreign")

# get the data on Andrew Gelman's web page
# http://www.stat.columbia.edu/~gelman/arm/examples/child.iq/
# store it somewhere
# place of the working directory
getwd()
# change your working directory if needed or include the full path to the dataset that you just saved
# (if you use the full path, use / instead of \ )
# setwd("C:/blabla/blablabla/")

iq.data <- read.dta("child.iq.dta")
# or:
# iq.data <- read.dta("C:/blabla/blablabla/child.iq.dta")

# check what's inside this dataset
str(iq.data)
summary(iq.data)

# fit the model
iq.fit <- lm(ppvt ~ momage, data=iq.data)
display(iq.fit)

# plot some stuff
plot(iq.data$momage, iq.data$ppvt, xlab="mother's age", ylab="kid's score", cex=.25)
abline(iq.fit)

# another fit with a quadratic function of momage
iq.fit2 <- lm(ppvt ~ momage + I(momage^2), data=iq.data)
display(iq.fit2)
# the corresponding plot
curve(cbind(1,x,x^2) %*% coef(iq.fit2), add=TRUE, col="red")

# check for heteroskedasticity
plot(iq.data$momage, resid(iq.fit))

# including mother's education
# several ways...

iq.data$momeduc <- as.factor(iq.data$educ_cat)
str(iq.data)
summary(iq.data)
iq.fit.m1 <- lm(ppvt ~ momage + momeduc, data=iq.data)
display(iq.fit.m1)

iq.fit.m2 <- lm(ppvt ~ momage + educ_cat, data=iq.data)
display(iq.fit.m2)

iq.fit.m3 <- lm(ppvt ~ momage + momeduc + momage:momeduc, data=iq.data)
display(iq.fit.m3)

iq.fit.m4 <- lm(ppvt ~ momeduc + momage:momeduc -1, data=iq.data)
display(iq.fit.m4)

colors <- ifelse(iq.data$momeduc==1, "black",
ifelse(iq.data$momeduc==2, "blue",
ifelse(iq.data$momeduc==3, "red", "green")
)
)
plot(iq.data$momage, iq.data$ppvt, xlab="mother's age", ylab="kid's score", cex=1, col=colors)

# unsolved problem with the colors, if anyone has any idea... (did not have time to debug this...)
par(mfrow=c(2,2))
for (i in 1:4){
plot(iq.data$momage[iq.data$momeduc==i], iq.data$ppvt[iq.data$momeduc==i], xlab="mother's age", ylab="kid's score", cex=1)
}
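
# one guess at a fix for the colours (untested here): index a vector of colours
# by the factor level instead of nesting ifelse() calls
educ.palette <- c("black", "blue", "red", "green")
colors2 <- educ.palette[as.numeric(iq.data$momeduc)]
plot(iq.data$momage, iq.data$ppvt, xlab="mother's age", ylab="kid's score", cex=1, col=colors2)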

################################
##### end of the exercises #####
################################


### beauty and teaching evaluation
# create a graph with R which is consistent with the table

b <- 0.13
a <- 4.01
par(mfrow=c(1,1))
plot(c(-10,10), c(-10, 10), type="n")
abline(a,b)

curve(cbind(1,x) %*% c(a,b))

beauty.n <- 463
beauty.resid.sd <- 0.55
beauty.R2 <- .04

beauty.resid <- rnorm(beauty.n, mean=0, sd=beauty.resid.sd)
hist(beauty.resid)
sd(beauty.resid)


# back out the sd of the beauty scores from R^2 = b^2*sd(x)^2/(b^2*sd(x)^2 + sigma^2),
# which gives sd(x) = sigma*sqrt(R^2/(1-R^2))/b
beauty.sd <- sd(beauty.resid)*sqrt(beauty.R2/(1-beauty.R2))/b
beauty.sd

sample.beauty <- rnorm(beauty.n, mean=0, sd=beauty.sd)
sample.eval <- a + b*sample.beauty + beauty.resid

plot(sample.beauty, sample.eval, cex=.25)
fit.beauty <- lm(sample.eval ~ sample.beauty)
display(fit.beauty)
abline(fit.beauty)

# you can add dashed lines for the residual standard deviation...
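# for example (sigma.hat() from the arm package returns the residual standard deviation):
sigma.beauty <- sigma.hat(fit.beauty)
abline(coef(fit.beauty)[1] + sigma.beauty, coef(fit.beauty)[2], lty=2)
abline(coef(fit.beauty)[1] - sigma.beauty, coef(fit.beauty)[2], lty=2)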

### Earnings and height
# create a graph with R which is consistent with the table

earnings.n <- 10000
sample.height.in <- rnorm(earnings.n, mean=69.1, sd=2.9)
sample.height.cm <- sample.height.in*2.54
hist(sample.height.in)
earnings.resid.eur.sd <- 13000
earnings.resid.eur <- rnorm(earnings.n, mean=0, sd=earnings.resid.eur.sd)
sd(earnings.resid.eur)
sample.earnings.eur <- 340*sample.height.cm -41000 + earnings.resid.eur
sample.earnings.usd <- sample.earnings.eur*1.5

plot(sample.height.cm, sample.earnings.eur, cex=.25)
fit.earnings.eur <- lm(sample.earnings.eur ~ sample.height.cm)
display(fit.earnings.eur)
abline(fit.earnings.eur)

plot(sample.height.in, sample.earnings.usd, cex=.25)
fit.earnings.usd <- lm(sample.earnings.usd ~ sample.height.in)
display(fit.earnings.usd)
abline(fit.earnings.usd)
# check how the fitted coefficients convert from euros & cm to dollars & inches:
41000*1.5      # intercept (in absolute value), in dollars
340*2.54*1.5   # slope, in dollars per inch
13000*1.5      # residual sd, in dollars

Homework assignment, due Tuesday 29 Sept

The data for all assignments are at http://www.stat.columbia.edu/~gelman/arm/examples/

1. Exercise 3.1: The folder pyth contains outcome y and inputs x1, x2 for 40 data points, with a further 20 points with the inputs but no observed outcome. Save the file to your working directory and read it into R using the read.table() function.

(a) Use R to fit a linear regression model predicting y from x1, x2, using the first 40 data points in the file. Summarize the inferences and check the fit of your model.

(b) Display the estimated model graphically as in Figure 3.2.

(c) Make a residual plot for this model. Do the assumptions appear to be met?

(d) Make predictions for the remaining 20 data points in the file. How confident do you feel about these predictions?
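
A sketch of the workflow in R (the file name below is a guess; use whatever the file in the pyth folder is actually called):

pyth <- read.table("pyth.dat", header=TRUE)   # adjust the file name if needed
fit.pyth <- lm(y ~ x1 + x2, data=pyth[1:40,])
display(fit.pyth)

# (c) residual plot
plot(fitted(fit.pyth), resid(fit.pyth), xlab="fitted values", ylab="residuals")
abline(h=0, lty=2)

# (d) predictions for the 20 points with no observed outcome
predict(fit.pyth, newdata=pyth[41:60,], interval="prediction")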

2. Exercise 3.2(a): Suppose that, for a certain population, we can predict log earnings from log height as follows:

• A person who is 66 inches tall is predicted to have earnings of $30,000.

• Every increase of 1% in height corresponds to a predicted increase of 0.8% in earnings.

• The earnings of approximately 95% of people fall within a factor of 1.1 of predicted values.

(a) Give the equation of the regression line and the residual standard deviation of the regression.

3. Exercise 3.3: In this exercise you will simulate two variables that are statistically independent of each other to see what happens when we run a regression of one on the other.

(a) First generate 1000 data points from a normal distribution with mean 0 and standard deviation 1 by typing var1 <- rnorm(1000,0,1) in R. Generate another variable in the same way (call it var2). Run a regression of one variable on the other. Is the slope coefficient statistically significant?

(b) Now run a simulation repeating this process 100 times. This can be done using a loop. From each simulation, save the z-score (the estimated coefficient of var1 divided by its standard error). If the absolute value of the z-score exceeds 2, the estimate is statistically significant. Here is code to perform the simulation:

z.scores <- rep (NA, 100)
for (k in 1:100) {
var1 <- rnorm (1000,0,1)
var2 <- rnorm (1000,0,1)
fit <- lm (var2 ~ var1)
z.scores[k] <- coef(fit)[2]/se.coef(fit)[2]
}

How many of these 100 z-scores are statistically significant?
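
Once the loop has run, one line is enough to count them:

sum(abs(z.scores) > 2)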

R code for the height and earnings example

## Read & clean the data
# Data are at http://www.stat.columbia.edu/~gelman/arm/examples/earnings

library ("arm")
heights <- read.dta ("http://www.stat.columbia.edu/~gelman/arm/examples/earnings/heights.dta")
attach.all (heights)

# recode sex variable
male <- 2 - sex

# (for simplicity) remove cases with missing data
ok <- !is.na (earn+height+sex) & earn>0
heights.clean <- as.data.frame (cbind (earn, height, male)[ok,])
n <- nrow (heights.clean)
attach.all (heights.clean)
height.jitter.add <- runif (n, -.2, .2)

## Model fit
lm.earn <- lm (earn ~ height)
display (lm.earn)
sim.earn <- sim (lm.earn)
beta.hat <- coef(lm.earn)

## Figure 4.1 (left)
par (mar=c(6,6,4,2)+.1)
plot (height + height.jitter.add, earn, xlab="height", ylab="earnings", pch=20, mgp=c(4,2,0), yaxt="n", col="gray10",
main="Fitted linear model")
axis (2, c(0,100000,200000), c("0","100000","200000"), mgp=c(4,1.1,0))
for (i in 1:20){
curve (sim.earn$coef[i,1] + sim.earn$coef[i,2]*x, lwd=.5, col="gray", add=TRUE)}
curve (beta.hat[1] + beta.hat[2]*x, add=TRUE, col="red")

## Figure 4.1 (right)
par (mar=c(6,6,4,2)+.1)
plot (height + height.jitter.add, earn, xlab="height", ylab="earnings", pch=20, mgp=c(4,2,0), yaxt="n", col="gray10",
main="Fitted linear model",xlim=c(0,80),ylim=c(-200000,200000))
axis (2, c(-100000,0,100000), c("-100000","0","100000"), mgp=c(4,1.1,0))
for (i in 1:20){
curve (sim.earn$coef[i,1] + sim.earn$coef[i,2]*x, lwd=.5, col="gray", add=TRUE)}
curve (beta.hat[1] + beta.hat[2]*x, add=TRUE, col="red")


## Log transformation

log.earn <- log(earn)
earn.logmodel.1 <- lm(log.earn ~ height)
display(earn.logmodel.1)

# Figure 4.3

sim.logmodel.1 <- sim (earn.logmodel.1)
beta.hat <- coef (earn.logmodel.1)

par (mar=c(6,6,4,2)+.1)
plot (height + runif(n,-.2,.2), log.earn, xlab="height", ylab="log (earnings)", pch=20, yaxt="n", mgp=c(4,2,0), col="gray10",
main="Log regression, plotted on log scale")
axis (2, seq(6,12,2), mgp=c(4,1.1,0))
for (i in 1:20)
curve (sim.logmodel.1$coef[i,1] + sim.logmodel.1$coef[i,2]*x, lwd=.5, col="gray", add=TRUE)
curve (beta.hat[1] + beta.hat[2]*x, add=TRUE, col="red")

par (mar=c(6,6,4,2)+.1)
plot (height + runif(n,-.2,.2), earn, xlab="height", ylab="earnings", pch=20, yaxt="n", mgp=c(4,2,0), col="gray10",
main="Log regression, plotted on original scale")
axis (2, c(0,100000,200000), c("0","100000","200000"), mgp=c(4,1.1,0))
for (i in 1:20)
curve (exp(sim.logmodel.1$coef[i,1] + sim.logmodel.1$coef[i,2]*x), lwd=.5, col="gray", add=TRUE)
curve (exp(beta.hat[1] + beta.hat[2]*x), add=TRUE, col="red")

## Log-base-10 transformation

log10.earn <- log10(earn)
earn.log10model <- lm(log10.earn ~ height)
display(earn.log10model)

## Log scale regression model

earn.logmodel.2 <- lm(log.earn ~ height + male)
display(earn.logmodel.2)

## Including interactions

earn.logmodel.3 <- lm(log.earn ~ height + male + height:male)
display(earn.logmodel.3)

## Linear transformations

z.height <- (height - mean(height))/sd(height)
earn.logmodel.4 <- lm(log.earn ~ z.height + male + z.height:male)
display(earn.logmodel.4)

## Log-log model

log.height <- log(height)
earn.logmodel.5 <- lm(log.earn ~ log.height + male)
display(earn.logmodel.5)

R code for the beauty and teaching evaluations example

# Load in the "arm" package. (Always do this at the start of your R session.)
library (arm)

# The data came as an Excel file (ProfEvaltnsBeautyPublic.xls)
# I went into Excel and saved it as a .csv file (comma-separated values)

# Read the data into R, including the variable names (headers)
beauty.data <- read.table ("http://www.stat.columbia.edu/~gelman/arm/examples/beauty/ProfEvaltnsBeautyPublic.csv", header=TRUE, sep=",")
#beauty.data <- read.table ("ProfEvaltnsBeautyPublic.csv", header=TRUE,sep=",")

# Attach the data so they are accessible using their variable names
attach.all (beauty.data)

###############################################################################
# Do more beautiful profs get higher evaluations?
###############################################################################

# Rename the two variables for convenience
beauty <- btystdave
eval <- courseevaluation

# Make a scatterplot
plot (beauty, eval)

# Make a scatterplot, labeling the axes
plot (beauty, eval, xlab="beauty", ylab="average teaching evaluation")

# Fit a linear regression
lm.1 <- lm (eval ~ beauty)
display (lm.1)

# Extract the information: fitted line is y = a + b*x + error
# Errors have mean 0, standard deviation sigma
a <- coef (lm.1)["(Intercept)"]
b <- coef (lm.1)["beauty"]
sigma <- sigma.hat (lm.1)

# Plot the regression line on top of the scatterplot
curve (a + b*x, add=TRUE)

# Add dotted lines to show +/- 1 standard deviation
curve (a + b*x + sigma, lty=2, add=TRUE)
curve (a + b*x - sigma, lty=2, add=TRUE)

###############################################################################
# Do things differ for male and female profs? Parallel regression lines
###############################################################################

lm.2 <- lm (eval ~ beauty + female)
display (lm.2)

# Set up a 2x2 grid of plots
par (mfrow=c(2,2))

# Make separate plots for men and women
b0 <- coef (lm.2)["(Intercept)"]
b1 <- coef (lm.2)["beauty"]
b2 <- coef (lm.2)["female"]

plot (beauty[female==0], eval[female==0], xlim=range(beauty), ylim=range(eval),
xlab="beauty", ylab="average teaching evaluation", main="Men")
curve (b0 + b1*x + b2*0, add=TRUE)

plot (beauty[female==1], eval[female==1], xlim=range(beauty), ylim=range(eval),
xlab="beauty", ylab="average teaching evaluation", main="Women")
curve (b0 + b1*x + b2*1, add=TRUE)

# Display both sexes on the same plot
# First make the plot with type="n" (which displays axes but does not plot
# the points), then plot the points and lines separately for each sex
plot (beauty, eval, xlab="beauty", ylab="average teaching evaluation",
main="Both sexes", type="n")
points (beauty[female==0], eval[female==0], col="blue")
curve (b0 + b1*x + b2*0, add=TRUE, col="blue")
points (beauty[female==1], eval[female==1], col="red")
curve (b0 + b1*x + b2*1, add=TRUE, col="red")

###############################################################################
# Do things differ for male and female profs? Non-parallel regression lines
###############################################################################

lm.3 <- lm (eval ~ beauty + female + beauty*female)
display (lm.3)

# Set up a new 2x2 grid of plots
par (mfrow=c(2,2))

# Display the parallel regression lines in gray and the non-parallel lines
# in heavy black
b0.new <- coef (lm.3)["(Intercept)"]
b1.new <- coef (lm.3)["beauty"]
b2.new <- coef (lm.3)["female"]
b3.new <- coef (lm.3)["beauty:female"]

plot (beauty[female==0], eval[female==0], xlim=range(beauty), ylim=range(eval),
xlab="beauty", ylab="average teaching evaluation", main="Men")
curve (b0 + b1*x + b2*0, lwd=.5, col="gray", add=TRUE)
curve (b0.new + b1.new*x + b2.new*0 + b3.new*x*0, lwd=2, col="black", add=TRUE)

plot (beauty[female==1], eval[female==1], xlim=range(beauty), ylim=range(eval),
xlab="beauty", ylab="average teaching evaluation", main="Women")
curve (b0 + b1*x + b2*1, lwd=.5, col="gray", add=TRUE)
curve (b0.new + b1.new*x + b2.new*1 + b3.new*x*1, lwd=2, col="black", add=TRUE)

###############################################################################
# More models
###############################################################################

lm.4 <- lm (eval ~ beauty + female + beauty*female + age)
display (lm.4)

lm.5 <- lm (eval ~ beauty + female + beauty*female + minority)
display (lm.5)

lm.6 <- lm (eval ~ beauty + female + beauty*female + nonenglish)
display (lm.6)

lm.7 <- lm (eval ~ beauty + female + beauty*female + nonenglish + lower)
display (lm.7)

Thursday, September 17, 2009

How to install packages

if you run into the following problem:
> library ("arm")
Erreur dans library("arm") : aucun package nommé 'arm' n'est trouvé
(or, in English: Error in library("arm") : there is no package called 'arm')

it's probably because you have not installed the "arm" package yet.

To do so, open R and use the GUI (Graphical User Interface) as follows:
- click on the "Packages" menu
- select "choose mirror"
- select one in France
- click on the "Packages" menu again
- select "install packages"
- select "arm"

If you have any trouble with the GUI, you can directly type in the R console:
> chooseCRANmirror()
(R is case sensitive! -- meaning upper case letters are not interpreted the same way as lower case letters)

then type:
> install.packages()
then select "arm"

or directly, type:
> install.packages("arm")

Tuesday, September 15, 2009

First lecture, including roach and rat images

Here are the images from the 15 sept lecture.

Homework assignment, due Tuesday 22 Sept

The data for all assignments are at http://www.stat.columbia.edu/~gelman/arm/examples/

1. Follow the instructions to download and configure R and a text editor. (You don't need to download or set up Bugs.)

To make sure everything's working, go into R and type the following:

library ("arm")
x <- c(1,2,3)
y <- c(1,2,5)
fit <- lm (y ~ x)
display (fit)

The following should then appear in your R console:

lm(formula = y ~ x)
coef.est coef.se
(Intercept) -1.33 1.25
x 2.00 0.58
n = 3, k = 2
residual sd = 0.82, R-Squared = 0.92

If you cannot get this to work, please speak with Romain right away!

2. Exercise 2.1: A test is graded from 0 to 50, with an average score of 35 and a standard deviation of 10. For comparison to other tests, it would be convenient to rescale to a mean of 100 and standard deviation of 15.

(a) How can the scores be linearly transformed to have this new mean and standard deviation?

(b) There is another linear transformation that also rescales the scores to have mean 100 and standard deviation 15. What is it, and why would you not want to use it for this purpose?
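
Whatever transformation you propose, you can check it numerically in R; a and b below are placeholders for your own answer:

scores <- rnorm(1000, mean=35, sd=10)   # simulated scores with the original mean and sd
a <- 0                                  # replace with your intercept
b <- 1                                  # replace with your slope
new.scores <- a + b*scores
mean(new.scores)                        # should be close to 100
sd(new.scores)                          # should be close to 15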

3. Exercise 2.4: Distribution of averages and differences: the heights of men in the United States are approximately normally distributed with mean 69.1 inches and standard deviation 2.9 inches. The heights of women are approximately normally distributed with mean 63.7 inches and standard deviation 2.7 inches. Let x be the average height of 100 randomly sampled men, and y be the average height of 100 randomly sampled women. In R, create 1000 simulations of x − y and plot their histogram. Using the simulations, compute the mean and standard deviation of the distribution of x − y and compare to their exact values.

4. Exercise 3.4 (a,b): The child.iq folder contains a subset of the children and mother data discussed earlier in the chapter. You have access to children’s test scores at age 3, mother’s education, and the mother’s age at the time she gave birth for a sample of 400 children. The data are a Stata file which you can read into R by saving in your working directory and then typing the following:

library ("foreign")
iq.data <- read.dta ("child.iq.dta")

(a) Fit a regression of child test scores on mother’s age, display the data and fitted model, check assumptions, and interpret the slope coefficient. When do you recommend mothers should give birth? What are you assuming in making these recommendations?

(b) Repeat this for a regression that further includes mother’s education, interpreting both slope coefficients in this model. Have your conclusions about the timing of birth changed?

Course blog

Here we will post material for my Fall, 2009, introductory statistics course at Sciences Po.

We will post class notes, R code, homework assignments, and other things.

Books for the course:

Andrew Gelman and Jennifer Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models. We will cover chapters 2-10.

John Fox, An R and S-Plus Companion to Applied Regression. A useful book for learning R.

Also, you should set up your computer using the instructions on this page. But you don't have to set up WinBugs and OpenBugs.