Thursday, October 8, 2009

Homework assignment, due Tuesday, October 13

1. Exercise 5.1: The folder nes contains the survey data of presidential preference and income for the 1992 election analyzed in Section 5.1, along with other variables including sex, ethnicity, education, party identification, and political ideology.

(a) Fit a logistic regression predicting support for Bush given all these inputs. Consider how to include these as regression predictors and also consider possible interactions.

(b) Evaluate and compare the different models you have fit. Consider coefficient estimates and standard errors, residual plots, and deviances.

(c) For your chosen model, discuss and compare the importance of each input variable in the prediction.

2. Exercise 5.2: Without using a computer, sketch the following logistic regression lines (an optional R check is sketched after this assignment):
(a) Pr(y = 1) = invlogit (x)
(b) Pr(y = 1) = invlogit (2 + x)
(c) Pr(y = 1) = invlogit (2x)
(d) Pr(y = 1) = invlogit (2 + 2x)
(e) Pr(y = 1) = invlogit (−2x)

3. Exercise 5.3: You are interested in how well the combined earnings of the parents in a child’s family predicts high school graduation. You are told that the probability a child graduates from high school is 27% for children whose parents earn no income and is 88% for children whose parents earn $60,000. Determine the logistic regression model that is consistent with this information. (For simplicity you may want to assume that income is measured in units of $10,000).
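To check your hand sketches for Exercise 5.2 afterwards, the five curves can be plotted in R, for example with invlogit() from the arm package (a quick sketch, not part of the assignment):

library("arm")
curve(invlogit(x), from=-10, to=10, ylim=c(0,1), xlab="x", ylab="Pr(y=1)")
curve(invlogit(2 + x), add=TRUE, lty=2)
curve(invlogit(2*x), add=TRUE, lty=3)
curve(invlogit(2 + 2*x), add=TRUE, lty=4)
curve(invlogit(-2*x), add=TRUE, lty=5)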

Tuesday, October 6, 2009

pieces of R code for conference #3

########################
##### Exercise 4.1 #####
########################

library("arm")   # needed below for display()

### a
exp(-.25)
exp(.25)
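# on the original scale, about 68% of weights fall within a factor of exp(-.25) ≈ .78
# to exp(.25) ≈ 1.28 of their predicted values (one residual sd on the log scale)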

### b
a <- -3.5
b <- 2
error.sd <- .25
earnings.n <- 1000

height.in <- rnorm(earnings.n, mean=69.1, sd=2.9)

l.height.in <- log(height.in)

l.weight.error <- rnorm(earnings.n, mean=0, sd=error.sd)

l.weight.lbs <- a + b*l.height.in + l.weight.error

display(lm(l.weight.lbs ~ l.height.in))
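# the estimates should recover the values used in the simulation:
# intercept close to -3.5, slope close to 2, residual sd close to .25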

plot(l.height.in, l.weight.lbs, cex=.25,
xlab="log(height) in inches",
ylab="log(weight) in pounds")
curve(cbind(1,x) %*% c(a,b), add=T)
curve(cbind(1,x) %*% c(a+error.sd,b), add=T, lty=2)
curve(cbind(1,x) %*% c(a-error.sd,b), add=T, lty=2)
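# the solid line is the assumed regression line of log(weight) on log(height);
# the dashed lines are that line shifted up and down by one residual sd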

########################
##### Exercise 4.2 #####
########################
library ("foreign")
library("arm")

### a
Heights <- read.dta ("heights.dta")
str(Heights)
summary(Heights)

##### piece of code taken and adapted from earnings_setup.R #####

attach(Heights)

# create variables for age and ethnicity categories

age <- 90 - yearbn # survey was conducted in 1990
age[age<18] <- NA
age.category <- ifelse (age<35, 1, ifelse (age<50, 2, 3))
eth <- ifelse (race==2, 1, ifelse (hisp==1, 2, ifelse (race==1, 3, 4)))
male <- 2 - sex
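# male recodes sex into a 0/1 indicator (assuming sex is coded 1=male, 2=female);
# eth collapses race and hisp into four categories (presumably black, hispanic, white, other)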

# (for simplicity) remove cases with missing data
# and restrict to people with positive earnings born after 1925

ok <- !is.na (earn+height+sex+age) & earn>0 & yearbn>25
heights.clean <- as.data.frame (cbind (earn, height, sex, race, hisp, ed, age, age.category, eth, male)[ok,])
n <- nrow (heights.clean)

###############################
detach(Heights)
remove(age, age.category, eth, male)
attach(heights.clean)

### b
earnings.fit1 <- lm(earn ~ height)
display(earnings.fit1)

c.height <- height - mean(height)
earnings.fit2 <- lm(earn ~ c.height)
display(earnings.fit2)
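# centering height leaves the slope unchanged, but the intercept now has a
# natural interpretation: the predicted earnings at the average height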

### c
l.earn <- log(earn)
l.height <- log(height)   # defined but not used in the fits below
age2 <- age*age           # defined but not used in the fits below
earnings.fit3 <- lm(l.earn ~ male + height + age)
display(earnings.fit3, digits=4)

earnings.fit4 <- lm(l.earn ~ male + height + age + male:height)
display(earnings.fit4, digits=4)

earnings.fit5 <- lm(l.earn ~ male + c.height + age + male:c.height)
display(earnings.fit5, digits=4)

z.height <- c.height/sd(height)
age.25 <- age-25
earnings.fit6 <- lm(l.earn ~ male + z.height + age.25 + male:z.height + male:age.25)
display(earnings.fit6, digits=4)

earnings.fit7 <- lm(l.earn ~ male + z.height + age.25 + male:z.height + male:age.25 + age.25:z.height)
display(earnings.fit7, digits=4)
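# a quick residual check for one of the candidate models (a sketch; any of the
# fits above can be substituted for earnings.fit6):
plot(fitted(earnings.fit6), residuals(earnings.fit6), cex=.25,
     xlab="fitted values", ylab="residuals (log earnings)")
abline(h=0, lty=2)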


########################
##### Exercise 4.3 #####
########################


# the x axis is age/10, so the range 1.8 to 8 corresponds to ages 18 to 80
plot(c(1.8,8), c(100,200),
xlab="age/10",
ylab="weight (in pounds)",
type='n'
)
# linear fit: weight = 161.0 + 2.6*age10
curve(cbind(1,x) %*% c(161,2.6), add=T)
# quadratic fit: weight = 96.2 + 33.6*age10 - 3.2*age10^2
curve(cbind(1,x,x^2) %*% c(96.2,33.6,-3.2), add=T, col="red")
# fit with discrete age categories 30-44, 45-64, 65+ (18-29 is the baseline)
curve(cbind(1,(x>=3.0 & x<4.5),(x>=4.5 & x<6.5),(x>=6.5)) %*% c(157.2,19.1,27.2,8.5), add=T, col="green")
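# optionally label the three fits (a sketch; the labels are my own shorthand,
# and the colors match the curve() calls above):
legend("bottomright", lty=c(1,1,1), col=c("black","red","green"),
       legend=c("linear in age/10", "quadratic in age/10", "age categories"))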

Friday, October 2, 2009

some R code for conference #2

### Methods conference #2

library("arm")

###########
### 3.1 ###
###########
getwd()
setwd("/home/argon/ScPo")

# first try... does not work: without header=TRUE the column names
# are read in as a row of data and the columns become factors
#Pyth <- read.table("exercise2.1.dat")
#str(Pyth)
#Pyth


Pyth <- read.table("exercise2.1.dat", header=T)
str(Pyth)
Pyth
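# only the first 40 rows have an observed y; rows 41-60 contain x1 and x2 only
# and are the cases to be predicted at the end of the exercise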



summary(Pyth[1:40,])

fit.Pyth <- lm(y ~ x1 + x2, data=Pyth[1:40,])
display(fit.Pyth)

## try to understand what kind of objects are fit.Pyth and coef(fit.Pyth)
attributes(fit.Pyth)
attributes(coef(fit.Pyth))
class(coef(fit.Pyth))
is.vector(coef(fit.Pyth))
# coef(fit.Pyth) is a named vector
coef(fit.Pyth)["x1"]
coef(fit.Pyth)[2]

## two plotting panels side by side instead of one
par(mfrow=c(1,2))


plot(Pyth$x1[1:40], Pyth$y[1:40], cex=.25)
# curve() takes an expression in 'x' as its first argument;
# here that expression is written as the matrix product of the row vector
# cbind(1, x, mean(x2)) with the coefficient vector
# the choice of the intercepts is explained below
curve(cbind(1,x,mean(Pyth$x2[1:40])) %*% coef(fit.Pyth), add=T )

plot(Pyth$x2[1:40], Pyth$y[1:40], cex=.25)
curve(cbind(1,mean(Pyth$x1[1:40]),x) %*% coef(fit.Pyth), add=T )

plot(Pyth$x1[1:40], Pyth$x2[1:40], cex=.25)

# residual plots versus x1, x2, and predicted values
summary(residuals(fit.Pyth))
str(residuals(fit.Pyth))

plot(Pyth$x1[1:40], residuals(fit.Pyth), cex=.25)
abline(c(0,0))
abline(c(sd(residuals(fit.Pyth)),0), lty=2)
abline(c(-sd(residuals(fit.Pyth)),0), lty=2)
# errors are bigger at both ends of x1

plot(Pyth$x2[1:40], residuals(fit.Pyth), cex=.25)
abline(c(0,0))
abline(c(sd(residuals(fit.Pyth)),0), lty=2)
abline(c(-sd(residuals(fit.Pyth)),0), lty=2)
# not very centered

plot(fitted.values(fit.Pyth)[1:40], residuals(fit.Pyth), cex=.25)
abline(c(0,0))
abline(c(sd(residuals(fit.Pyth)),0), lty=2)
abline(c(-sd(residuals(fit.Pyth)),0), lty=2)
# quite ok

pred.Pyth <- predict(fit.Pyth, Pyth[41:60,], interval="prediction", level=.95)
pred.Pyth
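# pred.Pyth is a matrix with columns "fit", "lwr" and "upr": the point
# predictions and the lower/upper bounds of the 95% prediction intervals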

### choice of the intercepts for the two separate graphs of y against x1 and x2
# the graph of x1 against x2 (as well as a regression of x1 on x2) suggests that they are
# roughly independent, so it makes sense to use the coefficients from the regression of y
# on x1 and x2 to draw two separate graphs of y against x1 and of y against x2.
# the slopes are given by the regression, but the intercepts have to be recalculated:
# for each graph we want the line to be centered.
#
# the fitted model is:
#   y = b0 + b1*x1 + b2*x2 + residual
#
# the estimated coefficients b0, b1, b2 are the elements of coef(fit.Pyth)
#
# graph of y against x1:
# the line is of the form:
#   y = c1 + b1*x1
#
# we define
#   x1.bar = mean(x1), x2.bar = mean(x2), y.bar = mean(y)  (over the 40 observed rows)
#
# we want the line to be centered in the sense that
#   c1 + b1*x1.bar = y.bar
#
# since there is a constant term in the regression, the residuals average to zero, so we know that:
#   y.bar = b0 + b1*x1.bar + b2*x2.bar
#
# therefore:
#   c1 + b1*x1.bar = b0 + b1*x1.bar + b2*x2.bar
#
# and we want to find c1 such that this holds
#
# therefore we take
#   c1 = b0 + b2*x2.bar
# which is exactly what cbind(1, x, mean(Pyth$x2[1:40])) %*% coef(fit.Pyth) computes above

# for the graph of y against x2, the construction is symmetric:
#   c2 = b0 + b1*x1.bar
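# a quick numerical check of the centering argument (a sketch):
b <- coef(fit.Pyth)
c1 <- b[1] + b[3]*mean(Pyth$x2[1:40])
c1 + b[2]*mean(Pyth$x1[1:40])   # should equal the mean of y over the first 40 rows
mean(Pyth$y[1:40])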


###########
### 3.2 ###
###########

#(a)
# log(earnings) = alpha + beta*log(height) + error
# a person 66 inches tall is predicted to earn $30,000: alpha_hat + beta_hat*log(66) = log(30000)
# a 1% increase in height corresponds to a 0.8% increase in earnings: beta_hat = .8
log(30000)-.8*log(66)
# alpha_hat = 6.96
# about 95% of earnings fall within a factor of 1.1 of the predicted values,
# so exp(2*sigma_hat) = 1.1, i.e. 2*sigma_hat = log(1.1) and sigma_hat ≈ .048
log(1.1)/2

###########
### 3.3 ###
###########

var1 <- rnorm(1000, 0,1)
var2 <- rnorm(1000, 0,1)
display(lm(var2 ~ var1))

z.scores <- rep(NA,1000)
for (k in 1:1000){
var1 <- rnorm(1000, 0,1)
var2 <- rnorm(1000, 0,1)
fit <- lm(var2 ~ var1)
z.scores[k] <- coef(fit)[2]/se.coef(fit)[2]
}



length(which(abs(z.scores)>2))
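# under the null (var1 and var2 are independent), about 5% of the 1000 z-scores
# should exceed 2 in absolute value, i.e. roughly 50 "significant" slopes by chance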

Research article to be discussed Tuesday, October 13

The Effect of Geography-Based Group Threat on Voter Mobilization: A Field Experiment, by Ryan Enos. Click to download. Here's the abstract:

The effect of group threat on voter mobilization has been tested using observational data across a number of different geographies and units of analysis. Previous studies have yielded inconsistent findings. To date, no study of voter mobilization has directly manipulated group threat using a controlled experiment. I take advantage of the unique racial geography of Los Angeles County, California, which brings different racial/ethnic groups into close, yet spatially separated, proximity. This geography allows for a randomized, controlled experiment to directly test the effects of stimulating racial threat on voter turnout. A test of 3,666 African American and Hispanic voters shows an average treatment effect of 2.3 percentage points. The effect is 50% larger for African Americans than Hispanics. These results suggest that even low propensity voters are aware of the geographic proximity of other groups and can be motivated to participate by this awareness.


Also see discussion here about the ethics of the experiment.

Thursday, October 1, 2009

Research articles to be discussed Tuesday, October 6

Is Well-being U-Shaped over the Life Cycle?, by David Blanchflower and Andrew Oswald. Follow the link above to download the article. Here's the abstract:

We present evidence that psychological well-being is U-shaped through life. A difficulty with research on this issue is that there are likely to be omitted cohort effects (earlier generations may have been born in, say, particularly good or bad times). First, using data on 500,000 randomly sampled Americans and West Europeans, the paper designs a test that can control for cohort effects. Holding other factors constant, we show that a typical individual’s happiness reaches its minimum -- on both sides of the Atlantic and for both males and females -- in middle age. Second, evidence is provided for the existence of a similar U-shape through the life-course in East European, Latin American and Asian nations. Third, a U-shape in age is found in separate well-being regression equations in 72 developed and developing nations. Fourth, using measures that are closer to psychiatric scores, we document a comparable well-being curve across the life cycle in two other data sets: (i) in GHQ-N6 mental health levels among a sample of 16,000 Europeans, and (ii) in reported depression and anxiety levels among 1 million U.K. citizens. Fifth, we discuss some apparent exceptions, particularly in developing nations, to the U-shape. Sixth, we note that American male birth-cohorts seem to have become progressively less content with their lives. Our paper’s results are based on regression equations in which other influences, such as demographic variables and income, are held constant.


Association between political ideology and health in Europe, by S. V. Subramanian, Tim Huijts, and Jessica M. Perkins. Follow the link above to download the article. Here's the abstract:

Studies have largely examined the association between political ideology and health at the aggregate/ecological level. Using individual-level data from 29 European countries, we investigated whether self-reports of political ideology and health are associated. In adjusted models, we found an inverse association between political ideology and self-rated poor health; for a unit increase in the political ideology scale (towards right) the odds ratio (OR) for reporting poor health decreased (OR 0.95, 95% confidence interval 0.94–0.96). Although political ideology per se is unlikely to have a causal link to health, it could be a marker for health-promoting latent attitudes, values and beliefs.