Monday, October 26, 2009

Hints for conference #6

# You can complete the following pieces of code

###########
### 7.1 ###
###########

### a

f.shots <- function(p.success=.6, hits.stop=2){
shots <- 0
missed <- 0
success <- 0
while ( ### condition to complete ### ){

### actions to complete here ###

}

# result:
c(shots=shots,success=success)
}
f.shots()
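# one possible completion, as a sketch (the loop body can be written in other ways):
f.shots <- function(p.success=.6, hits.stop=2){
  shots <- 0
  missed <- 0
  success <- 0
  while (missed < hits.stop){
    shots <- shots + 1
    if (runif(1) < p.success){
      success <- success + 1
      missed <- 0
    } else {
      missed <- missed + 1
    }
  }
  # result:
  c(shots=shots, success=success)
}
f.shots()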

### b
# initialize a vector of size 1000 with missing values
sim.shots <- rep(NA,1000)
# fill in the missing values...
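# for example (a sketch):
for (k in 1:1000){
  sim.shots[k] <- f.shots()["shots"]
}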

mean(sim.shots)
sd(sim.shots)

# do the same with both shots and successes:
sim.shots.success <- data.frame(shots=rep(NA,1000), success=rep(NA,1000))
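# one way to fill it in (a sketch):
for (k in 1:1000){
  sim.shots.success[k,] <- f.shots()
}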


mean(sim.shots.success$shots)
mean(sim.shots.success$success)
sd(sim.shots.success$shots)
sd(sim.shots.success$success)

### c
# plot the result
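# a sketch: total number of shots against the proportion of successful shots
plot(sim.shots.success$shots, sim.shots.success$success/sim.shots.success$shots,
  xlab="number of shots", ylab="proportion of successes", cex=.25)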


###########
### 7.3 ###
###########

# let's assume all random variables are independent, continuous and normally distributed

f.savings <- function(m.unit=5, sd.unit=4, m.market=40000, s.market=10000){
# calculate market size:
market.size <- ### to complete ###

# calculate vector of savings and sum it

}
f.savings()
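# a possible completion (a sketch; it reads the comments literally: draw the market
# size, then a vector of per-unit savings, and sum it -- drawing the total directly as
# rnorm(1,m.unit,sd.unit)*rnorm(1,m.market,s.market) is a common alternative):
f.savings <- function(m.unit=5, sd.unit=4, m.market=40000, s.market=10000){
  # calculate market size:
  market.size <- max(0, round(rnorm(1, m.market, s.market)))
  # calculate vector of savings and sum it
  sum(rnorm(market.size, m.unit, sd.unit))
}
f.savings()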

# initialize
total.savings <- rep(NA,1000)
# fill in
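# for example (a sketch):
for (k in 1:1000){
  total.savings[k] <- f.savings()
}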

mean(total.savings)
sd(total.savings)

# another story:
# let's assume that there are 40000 firms
# each firm will independently enter the market with probability 1/2501
# each firm which enters the market will buy 2501 widgets
# otherwise it will buy 0 widgets

# do the same
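# a sketch under those assumptions (f.savings.firms is my own name for the variant):
# the number of entering firms is binomial, and each entrant buys 2501 widgets
f.savings.firms <- function(m.unit=5, sd.unit=4, n.firms=40000, p.enter=1/2501){
  market.size <- 2501 * rbinom(1, n.firms, p.enter)
  sum(rnorm(market.size, m.unit, sd.unit))
}
total.savings.firms <- rep(NA,1000)
for (k in 1:1000){
  total.savings.firms[k] <- f.savings.firms()
}
mean(total.savings.firms)
sd(total.savings.firms)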

###########
### 7.8 ###
###########

# create a function following the procedure on p.143
# use rchisq and rnorm
unidim.sim <- function(estimate, sd.estimate, df, n){
}
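# a sketch of the procedure as I understand it: draw sigma from its scaled
# chi-square distribution, then draw the estimate given sigma
unidim.sim <- function(estimate, sd.estimate, df, n){
  sd.sim <- sd.estimate * sqrt(df / rchisq(n, df))
  rnorm(n, estimate, sd.sim)
}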

# try it
unidim.sim(estimate=600, sd.estimate=400, df=50, n=5)

# plot the result with 1000 points
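# for example (a sketch; cost.diff and eff.diff are just illustrative names):
cost.diff <- unidim.sim(estimate=600, sd.estimate=400, df=50, n=1000)
eff.diff <- unidim.sim(estimate=3, sd.estimate=1, df=100, n=1000)
plot(cost.diff, eff.diff, cex=.25)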

# 1000 simulations:
cost.eff <- rep(NA,1000)
# fill in
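# for example (a sketch), reusing the draws above:
for (k in 1:1000){
  cost.eff[k] <- cost.diff[k] / eff.diff[k]
}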

# histogram
hist(cost.eff)

# use the quantile function for the confidence intervals
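# for example:
quantile(cost.eff, c(.25, .75))     # 50% interval
quantile(cost.eff, c(.025, .975))   # 95% interval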

# try again with different values

Saturday, October 24, 2009

some R code for conference #5


### conf #5

library("arm")

###########
### 5.5 ###
###########

### a

# plot of invlogit
curve(invlogit(x), from=-10, to=10)

# plot of invlogit(-24 +.4*x) with different ranges of x
curve(invlogit(-24 +.4*x), from=0, to=100)
curve(invlogit(-24 +.4*x), from=45, to=75)
curve(invlogit(-24 +.4*x), from=30, to=90)

midterm <- rnorm(50, mean=60, sd=15)
midterm <- ifelse(midterm<0, 0, ifelse(midterm>100, 100, midterm))

# for plotting binary data: map 0's to uniform(0, jitt) and 1's to uniform(1-jitt, 1)
jitter.binary <- function(a, jitt=.05){
ifelse(a==0, runif(length(a), 0, jitt), runif(length(a), 1-jitt, 1))
}

# generate final (pass/fail) outcomes consistent with the model and the midterm scores
# do not just threshold the predicted probability (final.bad below); draw from the model (final.good)
final.bad <- ifelse(invlogit(-24 + .4*midterm)>.5, 1, 0)
final.good <- rbinom(length(midterm), 1, invlogit(-24 + .4*midterm))

# little test to check the syntax above
# test <- c(rep(.1,100),rep(.9,100))
# rbinom(length(test), 1, test)

plot(midterm,jitter.binary(final.good), cex=.25)
curve(invlogit(-24 + .4*x), add=T)
abline(v=60)

plot(midterm,jitter.binary(final.bad), cex=.25)
curve(invlogit(-24 + .4*x), add=T)
abline(v=60)

### b

# analytically: if midterm.cs = (midterm - 60)/15, then midterm = 15*midterm.cs + 60, so
# -24 + .4*midterm = -24 + .4*(15*midterm.cs + 60) = 6*midterm.cs
# i.e. the model in the standardized scores is Pr(pass) = invlogit(6*midterm.cs)
# (the fit below uses the empirical mean and sd of midterm, so the estimated
# coefficients should come out close to 0 and 6)

# test the transformation
logit1 <- glm(final.good ~ midterm, family=binomial(link="logit"))
display(logit1)

midterm.cs <- (midterm-mean(midterm))/sd(midterm)
logit2 <- glm(final.good ~ midterm.cs, family=binomial(link="logit"))
display(logit2)

### c
newpred <- rnorm(length(midterm), 0, 1)
logit3 <- glm(final.good ~ midterm + newpred, family=binomial(link="logit"))
display(logit3)

###########
### 5.6 ###
###########

curve(dlogis(x), from=-6, to=6)

curve(dlogis(x, location=4.5), from=-2, to=11)
abline(h=0)
#abline(v=0)
polygon(c(0,seq(0,12, length=100),12), c(0,dlogis(seq(0,12, length=100), location=4.5),0), col="grey")
# same shading with a much coarser grid (the polygon approximation becomes visible):
polygon(c(0,seq(0,12, length=10),12), c(0,dlogis(seq(0,12, length=10), location=4.5),0), col="grey")
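# the shaded area is Pr(latent > 0) = Pr(y=1); as a numerical check:
1 - plogis(0, location=4.5)
invlogit(4.5)   # same thing, about .989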

Homework assignment, due Tuesday, 27 Oct

1. Exercise 7.1: Discrete probability simulation: suppose that a basketball player has a 60% chance of making a shot, and he keeps taking shots until he misses two in a row. Also assume his shots are independent (so that each shot has 60% probability of success, no matter what happened before).

(a) Write an R function to simulate this process.

(b) Put the R function in a loop to simulate the process 1000 times. Use the simulation to estimate the mean, standard deviation, and distribution of the total number of shots that the player will take.

(c) Using your simulations, make a scatterplot of the number of shots the player will take and the proportion of shots that are successes.

2. Exercise 7.3: Propagation of uncertainty: we use a highly idealized setting to illustrate the use of simulations in combining uncertainties. Suppose a company changes its technology for widget production, and a study estimates the cost savings at $5 per unit, but with a standard error of $4. Furthermore, a forecast estimates the size of the market (that is, the number of widgets that will be sold) at 40,000, with a standard error of 10,000. Assuming these two sources of uncertainty are independent, use simulation to estimate the total amount of money saved by the new product (that is, savings per unit, multiplied by size of the market).

3. Exercise 7.8: Inference for the ratio of parameters: a (hypothetical) study compares the costs and effectiveness of two different medical treatments.

• In the first part of the study, the difference in costs between treatments A and B is estimated at $600 per patient, with a standard error of $400, based on a regression with 50 degrees of freedom.

• In the second part of the study, the difference in effectiveness is estimated at 3.0 (on some relevant measure), with a standard error of 1.0, based on a regression with 100 degrees of freedom.

• For simplicity, assume that the data from the two parts of the study were collected independently. Inference is desired for the incremental cost-effectiveness ratio: the difference between the average costs of the two treatments, divided by the difference between their average effectiveness. (This problem is discussed further by Heitjan, Moskowitz, and Whang, 1999.)

(a) Create 1000 simulation draws of the cost difference and the effectiveness difference, and make a scatterplot of these draws.

(b) Use simulation to come up with an estimate, 50% interval, and 95% interval for the incremental cost-effectiveness ratio.

(c) Repeat this problem, changing the standard error on the difference in effectiveness to 2.0.

Sunday, October 18, 2009

Some R code for conference #4

### conf #4

library("arm")
library ("foreign")

### a
# do not convert categorical variables into factors, so that the variables stay numeric for the analysis
Nes <- read.dta("nes5200_processed_voters_realideo.dta", convert.factors=F)

# same but conversion into factors in order to have the labels
Nes.fact <- read.dta("nes5200_processed_voters_realideo.dta")
str(Nes)
str(Nes.fact)

summary(Nes.fact)
#labels(Nes)["educ1"]

ok <- !is.na(Nes$age + Nes$female + Nes$black + Nes$educ1 + Nes$income + Nes$partyid7 + Nes$presvote + Nes$real_ideo) & Nes$year==1992 & Nes$presvote<3
Nes.red <- Nes[ok, c("age", "female", "black", "educ1", "income", "partyid7", "presvote", "real_ideo")]
summary(Nes.red)
summary(Nes.red$presvote[Nes.red$black==1])
table(Nes.red$income[Nes.red$presvote==1])
table(Nes.red$income[Nes.red$presvote==2])
table(Nes.red$presvote, Nes.red$partyid7)

# recode presvote from 1/2 to 0/1 for the logistic regression
Nes.red$presvote <- Nes.red$presvote - 1

logit1 <- glm(presvote ~ income, family=binomial(link="logit"), data=Nes.red)
display(logit1)


logit2 <- glm(presvote ~ income + age + female + black + educ1 + partyid7 + real_ideo, family=binomial(link="logit"), data=Nes.red)
display(logit2, digits=4)

# drop income and female
# interact black, educ1, partyid7 and real_ideo with each other
logit3 <- glm(presvote ~ age + black + educ1 + partyid7 + real_ideo +
black:educ1 + partyid7:educ1 + real_ideo:educ1 +
black:partyid7 + black:real_ideo + partyid7:real_ideo
, family=binomial(link="logit"), data=Nes.red)
display(logit3, digits=4)

logit4 <- glm(presvote ~ age + black + income + real_ideo + income:real_ideo
, family=binomial(link="logit"), data=Nes.red)
display(logit4, digits=4)

# standardize and refit to compare the respective magnitudes of the coefficients
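# a sketch for one input (the same pattern applies to the others; dividing by
# two standard deviations makes the coefficient comparable to those of binary inputs):
Nes.red$z.age <- (Nes.red$age - mean(Nes.red$age)) / (2*sd(Nes.red$age))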

# fitted values -- 2 ways (at least)
# invlogit(predict(logit1))-logit1$fitted.values

# residual plot
resid <- logit4$y - logit4$fitted.values
plot(logit4$fitted.values, resid)


# binned residual plot
# 40 bins with (roughly) equal numbers of points, cut at the quantiles of the fitted values
resid.quant <- quantile(logit4$fitted.values, probs=seq(0,1,length=41))
# is.vector(resid.quant)
length(resid.quant)
resid.quant
# average residual within each bin
mean.bin <- rep(NA, 40)
for (k in 1:40){
mean.bin[k] <- mean(resid[logit4$fitted.values>=resid.quant[k] & logit4$fitted.values<=resid.quant[k+1]])
}
plot(resid.quant[-1], mean.bin, ylim=c(-.3,.3))

# var.bin <- rep(NA, 40)
# for (k in 1:40){
# var.bin[k] <- var(logit4$fitted.values[logit4$fitted.values>=resid.quant[k] &logit4$fitted.values<=resid.quant[k+1] ])
# }

# within-bin average of p*(1-p), the theoretical variance of a binary outcome's residual
var.bin <- rep(NA, 40)
for (k in 1:40){
in.bin <- logit4$fitted.values>=resid.quant[k] & logit4$fitted.values<=resid.quant[k+1]
var.bin[k] <- mean(logit4$fitted.values[in.bin]*(1-logit4$fitted.values[in.bin]))
}


# fitted.sort <- sort(logit4$fitted.values)
# lines(fitted.sort, 2*sqrt(fitted.sort*(1-fitted.sort)/948*40))

# +/- 2 standard errors for a bin mean: var.bin divided by the bin size (948 observations / 40 bins)
lines(resid.quant[-1], 2*sqrt(var.bin/948*40))
lines(resid.quant[-1], -2*sqrt(var.bin/948*40))
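# for comparison, the arm package provides a ready-made binned residual plot:
binnedplot(logit4$fitted.values, resid)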

Tuesday, October 13, 2009

Research articles to be discussed Tuesday, 20 Oct

Three articles on statistical significance:

Of beauty, sex, and power: statistical challenges in estimating small effects. (Andrew Gelman and David Weakliem)

Letter to the editors regarding some papers of Dr. Satoshi Kanazawa. (Andrew Gelman)

The difference between "significant" and "not significant" is not itself statistically significant. (Andrew Gelman and Hal Stern)

Homework assignment, due Tuesday, 20 Oct

1. Exercise 5.5: In a class of 50 students, a logistic regression is performed of course grade (pass or fail) on midterm exam score (continuous values with mean 60 and standard deviation 15). The fitted model is Pr(pass) = logit⁻¹(−24 + 0.4x).

(a) Graph the fitted model. Also on this graph put a scatterplot of hypothetical data consistent with the information given.

(b) Suppose the midterm scores were transformed to have a mean of 0 and standard deviation of 1. What would be the equation of the logistic regression using these transformed scores as a predictor?

(c) Create a new predictor that is pure noise (for example, in R you can create
newpred <- rnorm (n,0,1)). Add it to your model. How much does the deviance decrease?

2. Exercise 5.6: Latent-data formulation of the logistic model: take the model
Pr(y = 1) = invlogit (1 + 2x1 + 3x2) and consider a person for whom x1 = 1 and x2 = 0.5. Sketch the distribution of the latent data for this person. Figure out the probability that y=1 for the person and shade the corresponding area on your graph.

3. Exercise 5.7: Limitations of logistic regression: consider a dataset with n = 20 points, a single predictor x that takes on the values 1, . . . , 20, and binary data y. Construct data values y1, . . . , y20 that are inconsistent with any logistic regression on x. Fit a logistic regression to these data, plot the data and fitted curve, and explain why you can say that the model does not fit the data.
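A sketch of one possible construction for this exercise (the data values and the name fit.57 are my own choice):

x <- 1:20
y <- c(rep(0,5), rep(1,10), rep(0,5))  # successes in the middle, failures at both ends
fit.57 <- glm(y ~ x, family=binomial(link="logit"))
plot(x, y)
curve(invlogit(coef(fit.57)[1] + coef(fit.57)[2]*x), add=TRUE)  # invlogit is in arm
# a logistic curve in x is monotone, so it cannot capture this up-then-down pattern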

Powerpoint for fifth lecture

class5.pdf

Powerpoints for second, third, and fourth lectures

class2.pdf

class3.pdf

class4.pdf

Thursday, October 8, 2009

Homework assignment, due Tuesday, 13 Oct

1. Exercise 5.1: The folder nes contains the survey data of presidential preference and income for the 1992 election analyzed in Section 5.1, along with other variables including sex, ethnicity, education, party identification, and political ideology.

(a) Fit a logistic regression predicting support for Bush given all these inputs. Consider how to include these as regression predictors and also consider possible interactions.

(b) Evaluate and compare the different models you have fit. Consider coefficient estimates and standard errors, residual plots, and deviances.

(c) For your chosen model, discuss and compare the importance of each input variable in the prediction.

2. Exercise 5.2: Without using a computer, sketch the following logistic regression lines (R code to check your sketches afterwards appears after the list):
(a) Pr(y = 1) = invlogit (x)
(b) Pr(y = 1) = invlogit (2 + x)
(c) Pr(y = 1) = invlogit (2x)
(d) Pr(y = 1) = invlogit (2 + 2x)
(e) Pr(y = 1) = invlogit (−2x)
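To check the hand sketches afterwards (a quick sketch; invlogit is in arm):

library("arm")
curve(invlogit(x), from=-6, to=6, ylim=c(0,1))
curve(invlogit(2 + x), add=TRUE, col="red")
curve(invlogit(2*x), add=TRUE, col="blue")
curve(invlogit(2 + 2*x), add=TRUE, col="green")
curve(invlogit(-2*x), add=TRUE, col="orange")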

3. Exercise 5.3: You are interested in how well the combined earnings of the parents in a child’s family predicts high school graduation. You are told that the probability a child graduates from high school is 27% for children whose parents earn no income and is 88% for children whose parents earn $60,000. Determine the logistic regression model that is consistent with this information. (For simplicity you may want to assume that income is measured in units of $10,000).
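A quick check of the arithmetic, with income in units of $10,000 (qlogis is the logit function in base R):

qlogis(.27)                      # about -0.99: the intercept
(qlogis(.88) - qlogis(.27)) / 6  # about 0.50: the slope per $10,000 of income
# so Pr(graduation) = invlogit(-0.99 + 0.50*income) matches both given probabilities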

Tuesday, October 6, 2009

pieces of R code for conference #3

########################
##### Exercise 4.1 #####
########################

### a
# an error sd of .25 on the log scale means roughly 68% of weights fall within
# multiplicative factors of exp(-.25) and exp(.25) of their predicted values:
exp(-.25)   # about 0.78
exp(.25)    # about 1.28

### b
a <- -3.5
b <- 2
error.sd <- .25
earnings.n <- 1000

height.in <- rnorm(earnings.n, mean=69.1, sd=2.9)

l.height.in <- log(height.in)

l.weight.error <- rnorm(earnings.n, mean=0, sd=error.sd)

l.weight.lbs <- a + b*l.height.in + l.weight.error

display(lm(l.weight.lbs ~ l.height.in))

plot(l.height.in, l.weight.lbs, cex=.25,
xlab="log(height) in inches",
ylab="log(weight) in pounds")
curve(cbind(1,x) %*% c(a,b), add=T)
curve(cbind(1,x) %*% c(a+error.sd,b), add=T, lty=2)
curve(cbind(1,x) %*% c(a-error.sd,b), add=T, lty=2)

########################
##### Exercise 4.2 #####
########################
library ("foreign")
library("arm")

### a
Heights <- read.dta ("heights.dta")
str(Heights)
summary(Heights)

##### piece of code taken and adapted from earnings_setup.R #####

attach(Heights)

# create variables for age and ethnicity categories

age <- 90 - yearbn # survey was conducted in 1990
age[age<18] <- NA
age.category <- ifelse (age<35, 1, ifelse (age<50, 2, 3))
eth <- ifelse (race==2, 1, ifelse (hisp==1, 2, ifelse (race==1, 3, 4)))
male <- 2 - sex

# (for simplicity) remove cases with missing data
# and restrict to people with positive earnings born after 1925

ok <- !is.na (earn+height+sex+age) & earn>0 & yearbn>25
heights.clean <- as.data.frame (cbind (earn, height, sex, race, hisp, ed, age, age.category, eth, male)[ok,])
n <- nrow (heights.clean)

###############################
detach(Heights)
remove(age, age.category, eth, male)
attach(heights.clean)

### b
earnings.fit1 <- lm(earn ~ height)
display(earnings.fit1)

c.height <- height - mean(height)
earnings.fit2 <- lm(earn ~ c.height)
display(earnings.fit2)

### c
l.earn <- log(earn)
l.height <- log(height)
age2 <- age*age
earnings.fit3 <- lm(l.earn ~ male + height + age)
display(earnings.fit3, digits=4)

earnings.fit4 <- lm(l.earn ~ male + height + age + male:height)
display(earnings.fit4, digits=4)

earnings.fit5 <- lm(l.earn ~ male + c.height + age + male:c.height)
display(earnings.fit5, digits=4)

z.height <- c.height/sd(height)
age.25 <- age-25
earnings.fit6 <- lm(l.earn ~ male + z.height + age.25 + male:z.height + male:age.25)
display(earnings.fit6, digits=4)

earnings.fit7 <- lm(l.earn ~ male + z.height + age.25 + male:z.height + male:age.25 + age.25:z.height)
display(earnings.fit7, digits=4)


########################
##### Exercise 4.3 #####
########################


plot(c(1.8,8), c(100,200),
xlab="age",
ylab="weight (in pounds)",
type='n'
)
curve(cbind(1,x) %*% c(161,2.6), add=T)
curve(cbind(1,x,x^2) %*% c(96.2,33.6,-3.2), add=T, col="red")
curve(cbind(1,(x>=3.0 & x<4.5),(x>=4.5 & x<6.5),(x>=6.5) ) %*% c(157.2,19.1, 27.2, 8.5), add=T, col="green")

Friday, October 2, 2009

some R code for conference #2

### Methods conference #2

library("arm")

###########
### 3.1 ###
###########
getwd()
setwd("/home/argon/ScPo")

# first try... does not work
#Pyth <- read.table("exercise2.1.dat")
#str(Pyth)
#Pyth


Pyth <- read.table("exercise2.1.dat", header=T)
str(Pyth)
Pyth



summary(Pyth[1:40,])

fit.Pyth <- lm(y ~ x1 + x2, data=Pyth[1:40,])
display(fit.Pyth)

## try to understand what kind of objects are fit.Pyth and coef(fit.Pyth)
attributes(fit.Pyth)
attributes(coef(fit.Pyth))
class(coef(fit.Pyth))
is.vector(coef(fit.Pyth))
# coef(fit.Pyth) is a named vector
coef(fit.Pyth)["x1"]
coef(fit.Pyth)[2]

## two plot panels side by side instead of one
par(mfrow=c(1,2))


plot(Pyth$x1[1:40], Pyth$y[1:40], cex=.25)
# curve() takes an expression in 'x' as its argument
# here the expression is a matrix product: a row of regressors times the coefficient vector
# the choice of the intercepts is explained below
curve(cbind(1,x,mean(Pyth$x2[1:40])) %*% coef(fit.Pyth), add=T )

plot(Pyth$x2[1:40], Pyth$y[1:40], cex=.25)
curve(cbind(1,mean(Pyth$x1[1:40]),x) %*% coef(fit.Pyth), add=T )

plot(Pyth$x1[1:40], Pyth$x2[1:40], cex=.25)

# residual plots versus x1, x2, and predicted values
summary(residuals(fit.Pyth))
str(residuals(fit.Pyth))

plot(Pyth$x1[1:40], residuals(fit.Pyth), cex=.25)
abline(c(0,0))
abline(c(sd(residuals(fit.Pyth)),0), lty=2)
abline(c(-sd(residuals(fit.Pyth)),0), lty=2)
# errors are bigger at both ends of x1

plot(Pyth$x2[1:40], residuals(fit.Pyth), cex=.25)
abline(c(0,0))
abline(c(sd(residuals(fit.Pyth)),0), lty=2)
abline(c(-sd(residuals(fit.Pyth)),0), lty=2)
# not very centered

plot(fitted.values(fit.Pyth)[1:40], residuals(fit.Pyth), cex=.25)
abline(c(0,0))
abline(c(sd(residuals(fit.Pyth)),0), lty=2)
abline(c(-sd(residuals(fit.Pyth)),0), lty=2)
# quite ok

pred.Pyth <- predict(fit.Pyth, Pyth[41:60,], interval="prediction", level=.95)
pred.Pyth

### choice of the intercepts for the two separate graphs of y against x1 and x2
# the graph of x1 against x2, as well as a regression of x1 on x2, seems to indicate that they are independent
# so it makes sense to use the coefficients of the regression of y on x1 and x2
# to draw two separate graphs of y against x1 and y against x2
# the slopes are given by the regression, but the intercepts have to be recalculated:
# for each graph we want the line to be centered.
#
# the fitted model is:
# y = b0 + b1*x1 + b2*x2 + residual
#
# the estimated coefficients are b0, b1, b2 = coef(fit.Pyth)

# graph of y against x1:
# the line is of the form:
# y = c1 + b1*x1

# we define
# x1.bar = mean(x1), x2.bar = mean(x2), y.bar = mean(y)

# we want the line to be centered in the sense that
# c1 + b1*x1.bar = y.bar
# since there is a constant term in the regression, we know that:
# y.bar = b0 + b1*x1.bar + b2*x2.bar

# therefore:
# c1 + b1*x1.bar = b0 + b1*x1.bar + b2*x2.bar

# and we want to find c1 such that this holds

# therefore we take
# c1 = b0 + b2*x2.bar
# which is exactly what cbind(1, x, mean(Pyth$x2[1:40])) %*% coef(fit.Pyth) draws

# for the graph of y against x2, it's symmetrical:
# c2 = b0 + b1*x1.bar



###########
### 3.2 ###
###########

#(a)
# log(earnings) = alpha + beta*log(height) + error
# alpha_hat + beta_hat * log(66)=log(30000)
# beta_hat=.8
log(30000)-.8*log(66)
# alpha_hat=6.96
# 2*sigma_hat=1.1
# sigma_hat=.55

###########
### 3.3 ###
###########

var1 <- rnorm(1000, 0,1)
var2 <- rnorm(1000, 0,1)
display(lm(var2 ~ var1))

z.scores <- rep(NA,1000)
for (k in 1:1000){
var1 <- rnorm(1000, 0,1)
var2 <- rnorm(1000, 0,1)
fit <- lm(var2 ~ var1)
z.scores[k] <- coef(fit)[2]/se.coef(fit)[2]
}



length(which(abs(z.scores)>2))
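# under the null of no relation between var1 and var2, |z| > 2 should occur in
# roughly 5% of the 1000 simulations, i.e. about 50 times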

Research article to be discussed Tuesday, 13 Oct

The Effect of Geography-Based Group Threat on Voter Mobilization: A Field Experiment, by Ryan Enos. Click to download. Here's the abstract:

The effect of group threat on voter mobilization has been tested using observational data across a number of different geographies and units of analysis. Previous studies have yielded inconsistent findings. To date, no study of voter mobilization has directly manipulated group threat using a controlled experiment. I take advantage of the unique racial geography of Los Angeles County, California, which brings different racial/ethnic groups into close, yet spatially separated, proximity. This geography allows for a randomized, controlled experiment to directly test the effects of stimulating racial threat on voter turnout. A test of 3,666 African American and Hispanic voters shows an average treatment effect of 2.3 percentage points. The effect is 50% larger for African Americans than Hispanics. These results suggest that even low propensity voters are aware of the geographic proximity of other groups and can be motivated to participate by this awareness.


Also see discussion here about the ethics of the experiment.

Thursday, October 1, 2009

Research articles to be discussed Tuesday, 6 Oct

Is Well-being U-Shaped over the Life Cycle?, by David Blanchflower and Andrew Oswald. Follow the link above to download the article. Here's the abstract:

We present evidence that psychological well-being is U-shaped through life. A difficulty with research on this issue is that there are likely to be omitted cohort effects (earlier generations may have been born in, say, particularly good or bad times). First, using data on 500,000 randomly sampled Americans and West Europeans, the paper designs a test that can control for cohort effects. Holding other factors constant, we show that a typical individual’s happiness reaches its minimum -- on both sides of the Atlantic and for both males and females -- in middle age. Second, evidence is provided for the existence of a similar U-shape through the life-course in East European, Latin American and Asian nations. Third, a U-shape in age is found in separate well-being regression equations in 72 developed and developing nations. Fourth, using measures that are closer to psychiatric scores, we document a comparable well-being curve across the life cycle in two other data sets: (i) in GHQ-N6 mental health levels among a sample of 16,000 Europeans, and (ii) in reported depression and anxiety levels among 1 million U.K. citizens. Fifth, we discuss some apparent exceptions, particularly in developing nations, to the U-shape. Sixth, we note that American male birth-cohorts seem to have become progressively less content with their lives. Our paper’s results are based on regression equations in which other influences, such as demographic variables and income, are held constant.


Association between political ideology and health in Europe, by S. V. Subramanian, Tim Huijts, and Jessica M. Perkins. Follow the link above to download the article. Here's the abstract:

Studies have largely examined the association between political ideology and health at the aggregate/ecological level. Using individual-level data from 29 European countries, we investigated whether self-reports of political ideology and health are associated. In adjusted models, we found an inverse association between political ideology and self-rated poor health; for a unit increase in the political ideology scale (towards right) the odds ratio (OR) for reporting poor health decreased (OR 0.95, 95% confidence interval 0.94–0.96). Although political ideology per se is unlikely to have a causal link to health, it could be a marker for health-promoting latent attitudes, values and beliefs.