Monday, November 30, 2009

Research article to be discussed Tuesday, December 8, 2009

Can Losing Lead to Winning? by Jonah Berger and Devin Pope.

Also see discussion here.

Friday, November 27, 2009

Homework assignment due Tuesday, December 1

1. Exercise 9.7: Gain-score models: in the discussion of gain-score models in Section 9.3, we noted that if we include the pre-treatment measure of the outcome in a gain score model, the coefficient on the treatment indicator will be the same as if we had just run a standard regression of the outcome on the treatment indicator and the pre-treatment measure. Show why this is true.
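
As a quick numerical check of this claim (not a substitute for the algebra the exercise asks for), you can simulate a pre-test, a treatment indicator, and an outcome, then compare the coefficient on the treatment indicator in the two regressions; the variable names and numbers below are invented for the illustration:

n <- 100
pre <- rnorm(n, 50, 10)               # pre-treatment measure of the outcome
T <- rbinom(n, 1, .5)                 # treatment indicator
y <- 5 + 2*T + 0.7*pre + rnorm(n)     # outcome
fit.gain <- lm(I(y - pre) ~ T + pre)  # gain-score model with the pre-treatment measure included
fit.std <- lm(y ~ T + pre)            # standard regression of y on T and the pre-treatment measure
coef(fit.gain)["T"]
coef(fit.std)["T"]                    # same coefficient (and standard error) on T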

2. Exercise 9.8: Assume that linear regression is appropriate for the regression of an outcome, y, on a treatment indicator, T, and a single confounding covariate, x. Sketch hypothetical data (plotting y versus x, with treated and control units indicated by circles and dots, respectively) and regression lines (for the treatment and control groups) that represent each of the following situations (a rough R sketch of one such plot follows the list):

(a) No treatment effect,

(b) Constant treatment effect,

(c) Treatment effect increasing with x.
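
Here is a rough R sketch of case (b), a constant treatment effect, to suggest what such a plot can look like; the numbers are invented, and a hand-drawn sketch is just as good for the homework:

n <- 50
x <- runif(n, 0, 10)
T <- rbinom(n, 1, .5)
y <- 1 + 0.8*x + 2*T + rnorm(n, 0, 0.8)   # treatment shifts y by 2 at every value of x
plot(x, y, pch=ifelse(T==1, 1, 20))       # circles = treated, dots = controls
abline(lm(y[T==1] ~ x[T==1]))             # regression line for the treatment group
abline(lm(y[T==0] ~ x[T==0]), lty=2)      # regression line for the control group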

3. Exercise 9.9: Consider a study with an outcome, y, a treatment indicator, T, and a single confounding covariate, x. Draw a scatterplot of treatment and control observations that demonstrates each of the following (a rough R sketch of scenario (a) follows the list):

(a) A scenario where the difference in means estimate would not capture the true treatment effect but a regression of y on x and T would yield the correct estimate.

(b) A scenario where a linear regression would yield the wrong estimate but a nonlinear regression would yield the correct estimate.
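
And a rough sketch of scenario (a), where x is distributed differently in the two groups, so the raw difference in means mixes the treatment effect with the effect of x, while the regression on x and T recovers the constant effect; again the numbers are invented:

n <- 100
T <- rbinom(n, 1, .5)
x <- rnorm(n, mean=ifelse(T==1, 7, 3), sd=1)   # treated units tend to have higher x
y <- 1 + 0.8*x + 2*T + rnorm(n, 0, 0.5)        # true treatment effect = 2
mean(y[T==1]) - mean(y[T==0])                  # difference in means: biased upward
coef(lm(y ~ x + T))["T"]                       # regression estimate: close to 2
plot(x, y, pch=ifelse(T==1, 1, 20))            # circles = treated, dots = controls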

Sunday, November 22, 2009

some R code for conference #8

#####################
### Conference #8 ###
#####################

library("arm")

###########
### 9.4 ###
###########

categoryZ <- 1:8                                        # the 8 categories of exercise 9.4
nbpersZ <- c(300, 300, 500, 500, 200, 200, 200, 200)    # number of persons per category
xZ <- rep(c(0,1), 4)                                    # pre-treatment predictor x
xZ
TZ <- rep(rep(c(0,1), c(2,2)), 2)                       # treatment indicator T
TZ
y0Z <- rep(c(4,10), c(4,4))                             # potential outcome under control
y1Z <- rep(c(6,12), c(4,4))                             # potential outcome under treatment

base <- data.frame(category = rep(categoryZ, nbpersZ),
                   x  = rep(xZ, nbpersZ),
                   T  = rep(TZ, nbpersZ),
                   y0 = rep(y0Z, nbpersZ),
                   y1 = rep(y1Z, nbpersZ))
str(base)
base$y <- ifelse(base$T == 0, base$y0, base$y1)   # observed outcome
base$z <- ifelse(base$y0 == 4, 0, 1)              # indicator of the high-y0 group (categories 5-8)

### a

# ATE = 2 (y1 - y0 = 2 in every category)

# estimations

fit1 <- lm(y ~ T, data=base)
display(fit1)

mean(base$y[base$T==1]) - mean(base$y[base$T==0])   # difference in means, same as the coefficient on T in fit1

fit2 <- lm(y ~ T + x, data=base)
display(fit2)


fit3 <- lm(y ~ T + x + T:x, data=base)
display(fit3)

m0 <- mean(base$y[base$T==1 & base$x==0]) - mean(base$y[base$T==0 & base$x==0])   # difference in means within x==0
m1 <- mean(base$y[base$T==1 & base$x==1]) - mean(base$y[base$T==0 & base$x==1])   # difference in means within x==1
m0   # equals the coefficient on T in fit3
m1   # equals the sum of the T and T:x coefficients in fit3


####
fit4 <- lm(y ~ T + z + T:z, data=base)
display(fit4)
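
# a note on what these fits illustrate (my reading of the example):
# treatment assignment here depends on the potential outcomes (62.5% of the y0=4 group
# is treated versus 50% of the y0=10 group), so the raw difference in means (fit1) does
# not recover the ATE of 2; adjusting for x (fit2, fit3) does not help, because x is
# unrelated to both the treatment and the potential outcomes; adjusting for z (fit4)
# does recover 2 within each stratum, but z is built from y0, which a real
# (nonomniscient) investigator never observes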

Saturday, November 21, 2009

R code for the book

Go to my webpage, then to the link for my book with Jennifer, then click on Data and software, then Directories with data, then you can download the zip files with the data and code for the book.

Recently in class we've been discussing the examples from chapters 7 and 8.

Friday, November 20, 2009

Homework assignment due Tuesday, November 24

1. Exercise 9.1: Suppose you are interested in the effect of the presence of vending machines in schools on childhood obesity. What randomized experiment would you want to do (in a perfect world) to evaluate this question?

2. Exercise 9.4 (b,c,d): See statement of Exercise 9.4 from last week's homework assignment. Then:

(b) Is it plausible to believe that these data came from a randomized experiment? Defend your answer.

(c) Another population quantity is the mean of y for those who received the treatment minus the mean of y for those who did not. What is the relation between this quantity and the average treatment effect?

(d) For these data, is it plausible to believe that treatment assignment is ignorable given sex? Defend your answer.

3. Exercise 9.5: For the hypothetical study in the previous exercise, figure out the estimate and the standard error of the coefficient of T in a regression of y on T and x.

4. Exercise 9.6: You are consulting for a researcher who has performed a randomized trial where the treatment was a series of 26 weekly therapy sessions, the control was no therapy, and the outcome was self-report of emotional state one year later. However, most people in the treatment group did not attend every therapy session. In fact there was a good deal of variation in the number of therapy sessions actually attended. The researcher is concerned that her results represent “watered down” estimates because of this variation and suggests adding in another predictor to the model: number of therapy sessions attended. What would you advise her?

Tuesday, November 17, 2009

some R code for conference #7

#####################
### conference #7 ###
#####################

library("arm")
library("car")

###########
### 2.3 ###
###########

n.sum <- 20
n.sim <- 1000

simCLT <- replicate(n.sim, sum(runif(n.sum)))
hist(simCLT, probability=TRUE, xlim=c(n.sum/2-3*sqrt(n.sum/12), n.sum/2+3*sqrt(n.sum/12)))
curve(dnorm(x, mean=n.sum/2, sd=sqrt(n.sum/12)), add=T)

# use number of bins according to car:
hist(simCLT, probability=TRUE, xlim=c(n.sum/2-3*sqrt(n.sum/12), n.sum/2+3*sqrt(n.sum/12)), nclass=n.bins(simCLT))
curve(dnorm(x, mean=n.sum/2, sd=sqrt(n.sum/12)), add=T)


###########
### 8.1 ###
###########

### model: y = 3 + 0.1*x1 + 0.5*x2 + 5*error, where error ~ t(df=4),
### so the error term 5*error has mean 0, scale 5, and 4 degrees of freedom

### a

n.obs <- 100

x1 <- 1:n.obs
x2 <- rbinom(n.obs, size=1, prob=0.5)
error <- rt(n.obs, df=4)
true.coef <- c(3, 0.1, 0.5, 5)
y <- cbind(rep(1, n.obs), x1, x2, error) %*% true.coef
y[1:10]
plot(x1, y)

fit1 <- lm(y ~ x1 + x2)
display(fit1)
hat.coef <- coef(fit1)
hat.coef.se <- se.coef(fit1)            # standard errors via arm
hat.coef.se <- sqrt(diag(vcov(fit1)))   # equivalently, by hand
cover.68 <- abs(hat.coef - true.coef[1:3]) < hat.coef.se   # do the 68% intervals cover the true values?
cover.68

### b

# one simulation: generate data from the model and check 68% coverage under lm
sim.T <- function(c.int, c1, c2, scale, nobs){
  x1 <- 1:nobs
  x2 <- rbinom(nobs, size=1, prob=0.5)
  error <- rt(nobs, df=4)
  true.coef <- c(c.int, c1, c2, scale)
  y <- cbind(rep(1, nobs), x1, x2, error) %*% true.coef
  fit1 <- lm(y ~ x1 + x2)
  hat.coef <- coef(fit1)
  hat.coef.se <- sqrt(diag(vcov(fit1)))
  abs(hat.coef - true.coef[1:3]) < hat.coef.se
}

res.sim <- replicate(1000, sim.T(3, 0.1, 0.5, 5, 100))

#dim(res.sim)
#is.array(res.sim)

apply(res.sim, 1, mean)   # coverage rate of the 68% intervals for each coefficient


### c
library("hett")

fit2 <- tlm(y ~ x1 + x2, start=list(dof=4), estDof=FALSE)
summary(fit2)
fit2
hat.coef <- coef(fit2$loc.fit)
hat.coef.se <- sqrt(diag(summary.tlm(fit2)$loc.summary[["cov.unscaled"]]))
abs(hat.coef - true.coef[1:3]) < hat.coef.se

sim.T2 <- function(c.int, c1, c2, scale, nobs){
  x1 <- 1:nobs
  x2 <- rbinom(nobs, size=1, prob=0.5)
  error <- rt(nobs, df=4)
  true.coef <- c(c.int, c1, c2, scale)
  y <- cbind(rep(1, nobs), x1, x2, error) %*% true.coef
  fit2 <- tlm(y ~ x1 + x2, start=list(dof=4), estDof=FALSE)
  hat.coef <- coef(fit2$loc.fit)
  hat.coef.se <- sqrt(diag(summary.tlm(fit2)$loc.summary[["cov.unscaled"]]))
  abs(hat.coef - true.coef[1:3]) < hat.coef.se
}

res.sim2 <- replicate(1000, sim.T2(3, 0.1, 0.5, 5, 100))

#dim(res.sim)
#is.array(res.sim)

apply(res.sim2, 1, mean)   # coverage rate of the 68% intervals under the t model


### compare standard error of the parameter estimates

sim.T.se <- function(c.int, c1, c2, scale, nobs){
  x1 <- 1:nobs
  x2 <- rbinom(nobs, size=1, prob=0.5)
  error <- rt(nobs, df=4)
  true.coef <- c(c.int, c1, c2, scale)
  y <- cbind(rep(1, nobs), x1, x2, error) %*% true.coef
  fit1 <- lm(y ~ x1 + x2)
  se.coef(fit1)
}
sim.T2.se <- function(c.int, c1, c2, scale, nobs){
  x1 <- 1:nobs
  x2 <- rbinom(nobs, size=1, prob=0.5)
  error <- rt(nobs, df=4)
  true.coef <- c(c.int, c1, c2, scale)
  y <- cbind(rep(1, nobs), x1, x2, error) %*% true.coef
  fit2 <- tlm(y ~ x1 + x2, start=list(dof=4), estDof=FALSE)
  sqrt(diag(summary.tlm(fit2)$loc.summary[["cov.unscaled"]]))
}

sim.se1 <- replicate(1000, sim.T.se(3, 0.1, 0.5, 5, 100))
sim.se2 <- replicate(1000, sim.T2.se(3, 0.1, 0.5, 5, 100))
apply(sim.se1, 1, mean)   # average standard errors under lm
apply(sim.se2, 1, mean)   # average standard errors under the t model

Sunday, November 15, 2009

Homework assignment due Tuesday, November 17

1. Exercise 9.3: Suppose you are a consultant for a researcher who is interested in investigating the effects of teacher quality on student test scores. Use the strategy of mapping this question to a randomized experiment to help define the question more clearly. Write a memo to the researcher asking for needed clarifications to this study proposal.

2. Exercise 9.4 (a): The table below describes a hypothetical experiment on 2400 persons. Each row of the table specifies a category of person, as defined by his or her pre-treatment predictor x, treatment indicator T, and potential outcomes y0, y1. (For simplicity, we assume unrealistically that all the people in this experiment fit into these eight categories.)

Category  #persons in category   x  T  y0  y1
   1              300            0  0   4   6
   2              300            1  0   4   6
   3              500            0  1   4   6
   4              500            1  1   4   6
   5              200            0  0  10  12
   6              200            1  0  10  12
   7              200            0  1  10  12
   8              200            1  1  10  12

In making the table we are assuming omniscience, so that we know both y0 and y1 for all observations. But the (nonomniscient) investigator would only observe x, T, and y^T (that is, y0 if T=0 and y1 if T=1) for each unit. (For example, a person in category 1 would have x=0, T=0, y=4, and a person in category 3 would have x=0, T=1, y=6.)

(a) What is the average treatment effect in this population of 2400 persons?
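
One quick way to set this up in R and compute the average treatment effect directly (the conference #8 code above builds the same dataset; the object names here are just for illustration):

npers <- c(300, 300, 500, 500, 200, 200, 200, 200)
y0 <- rep(c(4, 10), c(4, 4))
y1 <- rep(c(6, 12), c(4, 4))
sum(npers*(y1 - y0))/sum(npers)   # average treatment effect over all 2400 persons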

Tuesday, November 10, 2009

Voter turnout in the United States

As requested, here is a link to the research on the political attitudes of nonvoters and the potential effects of increased voter turnout.

Research article to be discussed Tuesday, November 17

Follow this link and read the linked article in Behavioral and Brain Sciences.

Also this.

Sunday, November 8, 2009

Research article to be discussed Tuesday, November 10

The Accident Externality from Driving, by Aaron Edlin and Pinar Karaca-Mandic.

We [Edlin and Karaca-Mandic] estimate auto accident externalities (more specifically insurance externalities) using panel data on state-average insurance premiums and loss costs. Externalities appear to be substantial in traffic-dense states: in California, for example, we find that the increase in traffic density from a typical additional driver increases total statewide insurance costs of other drivers by $1,725-$3,239 per year, depending on the model. High-traffic density states have large economically and statistically significant externalities in all specifications we check. In contrast, the accident externality per driver in low-traffic states appears quite small. On balance, accident externalities are so large that a correcting Pigouvian tax could raise $66 billion annually in California alone, more than all existing California state taxes during our study period, and over $220 billion per year nationally.

Homework assignment due Tuesday, November 10

1. Exercise 2.3: Demonstration of the Central Limit Theorem: let x = x1 + · · · + x20, the sum of 20 independent Uniform(0,1) random variables. In R, create 1000 simulations of x and plot their histogram. On the histogram, overlay a graph of the normal density function. Comment on any differences between the histogram and the curve.

2. Exercise 8.1: Fitting the wrong model: suppose you have 100 data points that arose from the following model: y = 3+0.1x1+0.5x2+error, with errors having a t distribution with mean 0, scale 5, and 4 degrees of freedom. We shall explore the implications of fitting a standard linear regression to these data.

(a) Simulate data from this model. For simplicity, suppose the values of x1 are simply the integers from 1 to 100, and that the values of x2 are random and equally likely to be 0 or 1. (In R, you can define x.1 <- 1:100, simulate x.2 using rbinom(), then create the linear predictor, and finally simulate the random errors in y using the rt() function.) Fit a linear regression (with normal errors) to these data and see if the 68% confidence intervals for the regression coefficients (for each, the estimates ±1 standard error) cover the true values.

(b) Put the above step in a loop and repeat 1000 times. Calculate the confidence coverage for the 68% intervals for each of the three coefficients in the model.

(c) Repeat this simulation, but instead fit the model using t errors (see Exercise 6.6).

Friday, November 6, 2009

Some R code for conference #6


#####################
### conference #6 ###
#####################

library("arm")

###########
### 7.1 ###
###########

### a

# simulate one shooting session: shoot with success probability p.success
# until hits.stop consecutive misses, then return the number of shots and successes
f.shots <- function(p.success=.6, hits.stop=2){
  shots <- 0
  missed <- 0
  success <- 0
  while (missed < hits.stop){
    if (rbinom(1,1,p.success)==1) {
      shots <- shots+1
      success <- success + 1
      missed <- 0
    } else {
      shots <- shots+1
      missed <- missed + 1
    }
  }
  c(shots=shots, success=success)
}
f.shots()

### b
sim.shots <- rep(NA,1000)
mytime <- proc.time()
for (i in 1:1000){
sim.shots[i] <- f.shots()["shots"]
}
proc.time() - mytime

mean(sim.shots)
sd(sim.shots)

# slightly faster with "replicate"
mytime <- proc.time()
sim.shots2 <- replicate(1000, f.shots()["shots"])
proc.time() - mytime

mean(sim.shots2)
sd(sim.shots2)


# theoretical results: p = .6 is the probability of success
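# a quick derivation of the theoretical mean (my reconstruction, with q = 1 - p = .4):
# let E0 = expected number of remaining shots after 0 consecutive misses, E1 after 1 miss;
# then E0 = 1 + p*E0 + q*E1 and E1 = 1 + p*E0, which gives E0 = (1+q)/q^2,
# and the expected number of successes is p*(1+q)/q^2
p <- .6; q <- 1 - p
(1+q)/q^2     # expected number of shots, 8.75; compare with mean(sim.shots)
p*(1+q)/q^2   # expected number of successes, 5.25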





sim.shots.success <- data.frame(shots=rep(NA,1000), success=rep(NA,1000))
for (i in 1:1000){
sim.shots.success[i,] <- f.shots()
}
str(sim.shots.success)
sim.shots.success[1:10,]
sim.shots.success$prop.success <- sim.shots.success$success/sim.shots.success$shots

mean(sim.shots.success$shots)
mean(sim.shots.success$success)
sd(sim.shots.success$shots)
sd(sim.shots.success$success)

### c
plot(sim.shots.success$shots, sim.shots.success$prop.success)

#############
### BONUS ###
#############

### if what we want is only the total number of successful (non-missed) shots
f.hits <- function(p.success=.6, hits.stop=2){
  hits <- 0
  missed <- 0
  while (missed < hits.stop){
    if (rbinom(1,1,p.success)==1) {
      hits <- hits+1
      missed <- 0
    } else {
      missed <- missed + 1
    }
  }
  hits
}
f.hits()

sim.hits <- rep(NA,10000)
mytime <- proc.time()
for (i in 1:10000){
sim.hits[i] <- f.hits()
}
proc.time() - mytime

mean(sim.hits)
sd(sim.hits)

## mat10000 <- matrix(c(.6,2), nrow=2, ncol=10000)
## mytime <- proc.time()
## sim.hits2 <- apply(mat10000, 2, f.hits)
## proc.time() - mytime

## mean(sim.hits2)
## sd(sim.hits2)

# theoretical results in that case:
# let H be the random variable equal to the number of hits; compute P(H = n) by induction
# P(H = 0) = .4*.4 = q^2 (two straight misses)
# a sequence of shots leading to n hits starts either with a hit or with a miss;
# the sequences with n hits that start with a miss are exactly the sequences with n hits
# that start with a hit, with one extra miss added at the beginning
# n -> n+1: the sequences with n+1 hits that start with a hit are the sequences with
# n hits that start with a hit, with either a hit or a hit-then-miss added at the beginning
# so each extra hit multiplies the probability by p*(1+q), giving (my reading of the induction)
# P(H = n) = (p*(1+q))^n * q^2 = .84^n * .16, a geometric distribution with mean p*(1+q)/q^2 = 5.25
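
# numeric check of the geometric result against the simulation (my addition):
p <- .6; q <- 1 - p
p*(1+q)/q^2                                              # theoretical E(H) = 5.25; compare with mean(sim.hits)
round(dgeom(0:5, prob=q^2), 3)                           # theoretical P(H=0), ..., P(H=5)
round(sapply(0:5, function(n) mean(sim.hits == n)), 3)   # simulated proportions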











###########
### 7.3 ###
###########

# let's assume all random variables are independent, continuous and normally distributed

f.savings <- function(m.unit=5, sd.unit=4, m.market=40000, s.market=10000){
  market.size <- round(rnorm(1, mean=m.market, sd=s.market), 0)   # number of widgets sold
  if (market.size < 0) market.size <- 0                           # truncate at zero
  savings <- rnorm(market.size, mean=m.unit, sd=sd.unit)          # saving per widget
  sum(savings)                                                    # total savings
}
f.savings()

total.savings <- rep(NA,1000)
for (i in 1:1000){
total.savings[i] <- f.savings()
}
mean(total.savings)
sd(total.savings)

# sd(total.savings) should be roughly s.market*m.unit = 10000*5 = 50,000 (uncertainty about the market size dominates)

# another story:
# let's assume that there are 40000 firms
# each firm will independently enter the market with probability 1/2501
# each firm which enters the market will buy 2501 widgets
# otherwise it will buy 0 widgets

f.savings2 <- function(m.unit=5, sd.unit=4){
  market.size <- 2501*rbinom(1, 40000, 1/2501)
  savings <- rnorm(market.size, mean=m.unit, sd=sd.unit)
  sum(savings)
}
f.savings2()
total.savings2 <- rep(NA,1000)
for (i in 1:1000){
total.savings2[i] <- f.savings2()
}
mean(total.savings2)
sd(total.savings2)
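
# note (my check, not part of the original exercise): in this second story the market size
# again has mean 2501*40000/2501 = 40000, and its standard deviation is
2501*sqrt(40000*(1/2501)*(1 - 1/2501))   # about 10000, the same as s.market above
# so the mean and sd of total savings should be comparable to the first simulation,
# even though the distribution of the market size is now very different (lumpy)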

###########
### 7.8 ###
###########

# draw n simulations of an uncertain estimate with t-distributed uncertainty:
# estimate + sd.estimate * t_df, using the normal / chi-squared representation of the t
unidim.sim <- function(estimate, sd.estimate, df, n){
  X <- rchisq(n, df=df)
  rnorm(n, mean=estimate, sd=sd.estimate*sqrt(df/X))
}
unidim.sim(estimate=600, sd.estimate=400, df=50, n=5)

plot(unidim.sim(estimate=600, sd.estimate=400, df=50, n=1000), unidim.sim(estimate=3, sd.estimate=1, df=100, n=1000))

cost.eff <- rep(NA,1000)
for (i in 1:1000){
cost.eff[i] <- mean(unidim.sim(estimate=600, sd.estimate=400, df=50, n=1000))/mean(unidim.sim(estimate=3, sd.estimate=1, df=100, n=1000))
}

hist(cost.eff)
quantile(cost.eff,probs=c(.025, .25, .75, .975))

cost.eff2 <- rep(NA,1000)
for (i in 1:1000){
cost.eff2[i] <- mean(unidim.sim(estimate=600, sd.estimate=400, df=50, n=1000))/mean(unidim.sim(estimate=3, sd.estimate=2, df=100, n=1000))
}

hist(cost.eff2)
quantile(cost.eff2,probs=c(.025, .25, .75, .975))