Title: | Political Science Computational Laboratory |
---|---|
Description: | Bayesian analysis of item-response theory (IRT) models, roll call analysis; computing highest density regions; maximum likelihood estimation of zero-inflated and hurdle models for count data; goodness-of-fit measures for GLMs; data sets used in writing and teaching; seats-votes curves. |
Authors: | Simon Jackman [aut, cre] , Alex Tahk [ctb] , Achim Zeileis [ctb] , Christina Maimone [ctb] , James Fearon [ctb], Zoe Meers [ctb] |
Maintainer: | Simon Jackman <[email protected]> |
License: | GPL-2 |
Version: | 1.5.9 |
Built: | 2024-11-11 03:59:42 UTC |
Source: | https://github.com/atahk/pscl |
Absentee ballot outcomes contrasted with machine ballots, cast in Pennsylvania State Senate elections, selected districts, 1982-1993.
data(absentee)
data(absentee)
A data frame with 22 observations on the following 8 variables.
year
a numeric vector, year of election, 19xx
district
a numeric vector, Pennsylvania State Senate district
absdem
a numeric vector, absentee ballots cast for the Democratic candidate
absrep
a numeric vector, absentee ballots cast for the Republican candidate
machdem
a numeric vector, votes cast on voting machines for the Democratic candidate
machrep
a numeric vector, votes cast on voting machines for the Republican candidate
dabs
a numeric vector, Democratic margin among absentee ballots
dmach
a numeric vector, Democratic margin among ballots case on voting machines
In November 1993, the state of Pennsylvania conducted elections for its state legislature. The result in the Senate election in the 2nd district (based in Philadelphia) was challenged in court, and ultimately overturned. The Democratic candidate won 19,127 of the votes cast by voting machine, while the Republican won 19,691 votes cast by voting machine, giving the Republican a lead of 564 votes. However, the Democrat won 1,396 absentee ballots, while the Republican won just 371 absentee ballots, more than offsetting the Republican lead based on the votes recorded by machines on election day. The Republican candidate sued, claiming that many of the absentee ballots were fraudulent. The judge in the case solicited expert analysis from Orley Ashenfelter, an economist at Princeton University. Ashenfelter examined the relationship between absentee vote margins and machine vote margins in 21 previous Pennsylvania Senate elections in seven districts in the Philadelphia area over the preceding decade.
Ashenfelter, Orley. 1994. Report on Expected Absentee Ballots. Typescript. Department of Economics, Princeton University.
Ashenfelter, Orley, Phillip Levine and David Zimmerman. 2003. Statistics and Econometrics: Methods and Applications. New York: John Wiley and Sons.
Jackman, Simon. 2009. Bayesian Analysis for the Social Sciences. Wiley: Hoboken, New Jersey. Examples 2.13, 2.14, 2.15.
data(absentee) summary(absentee) denom <- absentee$absdem + absentee$absrep y <- (absentee$absdem - absentee$absrep)/denom * 100 denom <- absentee$machdem + absentee$machrep x <- (absentee$machdem - absentee$machrep)/denom *100 ols <- lm(y ~ x, subset=c(rep(TRUE,21),FALSE) ## drop data point 22 ) ## predictions for disputed absentee point yhat22 <- predict(ols, newdata=list(x=x[22]), se.fit=TRUE, interval="prediction") tstat <- (y[22]-yhat22$fit[,"fit"])/yhat22$se.fit cat("tstat on actual outcome for obs 22:",tstat,"\n") cat(paste("Pr(t>",round(tstat,2),") i.e., one-sided:\n",sep="")) cat(1-pt(tstat,df=yhat22$df),"\n") ## make a picture xseq <- seq(min(x)-.1*diff(range(x)), max(x)+.1*diff(range(x)), length=100) yhat <- predict(ols,interval="prediction", newdata=list(x=xseq)) plot(y~x, type="n", axes=FALSE, ylim=range(yhat,y), xlim=range(xseq),xaxs="i", xlab="Democratic Margin, Machine Ballots (Percentage Points)", ylab="Democratic Margin, Absentee Ballots (Percentage Points)") polygon(x=c(xseq,rev(xseq)), ## overlay 95% prediction CI y=c(yhat[,"lwr"],rev(yhat[,"upr"])), border=FALSE, col=gray(.85)) abline(ols,lwd=2) ## overlay ols points(x[-22],y[-22],pch=1) ## data points(x[22],y[22],pch=16) ## disputed data point text(x[22],y[22], "Disputed\nElection", cex=.75, adj=1.25) axis(1) axis(2)
data(absentee) summary(absentee) denom <- absentee$absdem + absentee$absrep y <- (absentee$absdem - absentee$absrep)/denom * 100 denom <- absentee$machdem + absentee$machrep x <- (absentee$machdem - absentee$machrep)/denom *100 ols <- lm(y ~ x, subset=c(rep(TRUE,21),FALSE) ## drop data point 22 ) ## predictions for disputed absentee point yhat22 <- predict(ols, newdata=list(x=x[22]), se.fit=TRUE, interval="prediction") tstat <- (y[22]-yhat22$fit[,"fit"])/yhat22$se.fit cat("tstat on actual outcome for obs 22:",tstat,"\n") cat(paste("Pr(t>",round(tstat,2),") i.e., one-sided:\n",sep="")) cat(1-pt(tstat,df=yhat22$df),"\n") ## make a picture xseq <- seq(min(x)-.1*diff(range(x)), max(x)+.1*diff(range(x)), length=100) yhat <- predict(ols,interval="prediction", newdata=list(x=xseq)) plot(y~x, type="n", axes=FALSE, ylim=range(yhat,y), xlim=range(xseq),xaxs="i", xlab="Democratic Margin, Machine Ballots (Percentage Points)", ylab="Democratic Margin, Absentee Ballots (Percentage Points)") polygon(x=c(xseq,rev(xseq)), ## overlay 95% prediction CI y=c(yhat[,"lwr"],rev(yhat[,"upr"])), border=FALSE, col=gray(.85)) abline(ols,lwd=2) ## overlay ols points(x[-22],y[-22],pch=1) ## data points(x[22],y[22],pch=16) ## disputed data point text(x[22],y[22], "Disputed\nElection", cex=.75, adj=1.25) axis(1) axis(2)
Ordinal ratings (faculty evaluations) of applicants to a Political Science PhD Program.
data(admit)
data(admit)
A data frame with 106 observations on the following 6 variables.
score
an ordered factor with levels 1
< 2
< 3
< 4
< 5
gre.quant
applicant's score on the quantitative section of the GRE; the maximum score is 800
gre.verbal
applicant's score on the verbal section of the GRE; the maximum score is 800
ap
1 if the applicant indicated an interest in American politics; 0 otherwise
pt
1 if the applicant indicated an interest in Political Theory; 0 otherwise
female
1 for female applicants; 0 otherwise
Jackman, Simon. 2004. "What Do We Learn From Graduate Admissions Committees?: A Multiple-Rater, Latent Variable Model, with Incomplete Discrete and Continuous Indicators." Political Analysis. 12(4):400-424.
data(admit) summary(admit) ## ordered probit model op1 <- MASS::polr(score ~ gre.quant + gre.verbal + ap + pt + female, Hess=TRUE, data=admit, method="probit") summary(op1) hitmiss(op1) logLik(op1) pR2(op1)
data(admit) summary(admit) ## ordered probit model op1 <- MASS::polr(score ~ gre.quant + gre.verbal + ap + pt + female, Hess=TRUE, data=admit, method="probit") summary(op1) hitmiss(op1) logLik(op1) pR2(op1)
The results of 239 published opinion polls measuring vote intentions (1st preference vote intention in a House of Representatives election) between the 2004 and 2007 Australian Federal elections, from 4 survey houses.
data(AustralianElectionPolling)
data(AustralianElectionPolling)
A data frame with 239 observations on the following 14 variables.
ALP
a numeric vector, percentage of respondents reported as intending to vote for the Australian Labor Party
Lib
a numeric vector, percentage of respondents reported as intending to vote for the Liberal Party
Nat
a numeric vector, percentage of respondents reported as intending to vote for the National Party
Green
a numeric vector, percentage of respondents reported as intending to vote for the Greens
FamilyFirst
a numeric vector, percentage of respondents reported as intending to vote for the Family First party
Dems
a numeric vector, percentage of respondents reported as intending to vote for the Australian Democrats
OneNation
a numeric vector, percentage of respondents reported as intending to vote for One Nation
DK
a numeric vector, percentage of respondents reported as expressing no preference or a “don't know” response
sampleSize
a numeric vector, reported sample size of the poll
org
a factor with levels Galaxy
, Morgan, F2F
, Newspoll
, Nielsen
and Morgan, Phone
, indicating the survey house and/or mode of the poll
startDate
a Date, reported start of the field period
endDate
a Date, reported end of the field period
source
a character vector, source of the poll report
remark
a character vector, remarks noted by author and/or research assistant coders
Morgan uses two modes: phone and face-to-face.
The 2004 Australian election was on October 9; the ALP won 37.6% of the 1st preferences cast in elections for the House of Representatives. The ALP won the 2007 election (November 24) with 43.4% of 1st preferences.
The ALP changed leaders twice in the 2004-07 inter-election period spanned by these data: (1) Mark Latham resigned the ALP leadership on January 18 2005 and was replaced by Kim Beazley; (2) Beazley lost the ALP leadership to Kevin Rudd on December 4, 2006.
The then Prime Minister, John Howard, announced the November 2007 election on October 14, 2007.
See the source
variable. Andrea Abel assisted with the data collection.
Jackman, Simon. 2009. Bayesian Analysis for the Social Sciences. Wiley: Hoboken, New Jersey. Example 9.3.
data(AustralianElectionPolling) if(require(lattice)) { lattice::xyplot(ALP ~ startDate | org, data=AustralianElectionPolling, layout=c(1,5), type="b", xlab="Start Date", ylab="ALP") } ## test for house effects y <- AustralianElectionPolling$ALP/100 v <- y*(1-y)/AustralianElectionPolling$sampleSize w <- 1/v m1 <- mgcv::gam(y ~ s(as.numeric(startDate)), weight=w, data=AustralianElectionPolling) m2 <- update(m1, ~ . + org) anova(m1,m2)
data(AustralianElectionPolling) if(require(lattice)) { lattice::xyplot(ALP ~ startDate | org, data=AustralianElectionPolling, layout=c(1,5), type="b", xlab="Start Date", ylab="ALP") } ## test for house effects y <- AustralianElectionPolling$ALP/100 v <- y*(1-y)/AustralianElectionPolling$sampleSize w <- 1/v m1 <- mgcv::gam(y ~ s(as.numeric(startDate)), weight=w, data=AustralianElectionPolling) m2 <- update(m1, ~ . + org) anova(m1,m2)
Aggregate data on the 24 elections to Australia's House of Representatives, 1949 to 2016.
data(AustralianElections)
data(AustralianElections)
A data frame with the following variables:
date
date of election, stored using the
Date
class
Seats
numeric, number of seats in the House of Representatives
Uncontested
numeric, number of uncontested seats
ALPSeats
numeric, number of seats won by the Australian Labor Party
LPSeats
numeric, number of seats won by the Liberal Party
NPSeats
numeric, number of seats won by the National Party (previously known as the Country Party)
OtherSeats
numeric, number of seats won by other parties and/or independent candidates
ALP
numeric, percentage of first preference votes cast for Australian Labor Party candidates
ALP2PP
numeric, percentage of the two-party preferred vote won by Australian Labor Party candidates
LP
numeric, percent of first preference votes cast for Liberal Party candidates
NP
numeric, percent of first preference votes cast for National Party (Country Party) candidates
DLP
numeric, percent of first preference votes cast for Democratic Labor Party candidates
Dem
numeric, percent of first preference votes cast for Australian Democrat candidates
Green
numeric, percent of first preference votes cast for Green Party candidates
Hanson
numeric, percent of first preference votes cast for candidates from Pauline Hanson's One Nation party
Com
numeric, percent of first preference votes cast for Communist Party candidates
AP
numeric, percent of first preference votes cast for Australia Party candidates
Informal
numeric, percent of ballots cast that are spoiled, blank, or otherwise uncountable (usually because of errors in enumerating preferences)
Turnout
numeric, percent of enrolled voters recorded as having turned out to vote (Australia has compulsory voting)
The Liberal National Party of Queensland formed in 2008 after a merger of the Liberal Party and the National Party. In all elections following 2008, they have been categorised under LP
.
Australian Electoral Commission. https://www.aec.gov.au.
Jackman, Simon. 2009. Bayesian Analysis for the Social Sciences. Wiley: Hoboken, New Jersey. Example 3.5.
data(AustralianElections) attach(AustralianElections) alpSeatShare <- ALPSeats/Seats alpVoteShare <- ALP2PP/100 ## log-odds transforms x <- log(alpVoteShare/(1-alpVoteShare)) y <- log(alpSeatShare/(1-alpSeatShare)) ols <- lm(y~x) ## Tufte-style seats-votes regression xseq <- seq(-4.5,4.5,length=500) yhat <- coef(ols)[1] + coef(ols)[2]*xseq yhat <- exp(yhat)/(1+exp(yhat)) xseq <- exp(xseq)/(1+exp(xseq)) ## seats vote curve plot(x=alpVoteShare, y=alpSeatShare, xlab="ALP Vote Share", ylab="ALP Seat Share") lines(xseq,yhat,lwd=2) abline(h=.5,lty=2) abline(v=.5,lty=2)
data(AustralianElections) attach(AustralianElections) alpSeatShare <- ALPSeats/Seats alpVoteShare <- ALP2PP/100 ## log-odds transforms x <- log(alpVoteShare/(1-alpVoteShare)) y <- log(alpSeatShare/(1-alpSeatShare)) ols <- lm(y~x) ## Tufte-style seats-votes regression xseq <- seq(-4.5,4.5,length=500) yhat <- coef(ols)[1] + coef(ols)[2]*xseq yhat <- exp(yhat)/(1+exp(yhat)) xseq <- exp(xseq)/(1+exp(xseq)) ## seats vote curve plot(x=alpVoteShare, y=alpSeatShare, xlab="ALP Vote Share", ylab="ALP Seat Share") lines(xseq,yhat,lwd=2) abline(h=.5,lty=2) abline(v=.5,lty=2)
Compute and optionally plot highest density regions for the Beta distribution.
betaHPD(alpha,beta,p=.95,plot=FALSE,xlim=NULL,debug=FALSE)
betaHPD(alpha,beta,p=.95,plot=FALSE,xlim=NULL,debug=FALSE)
alpha |
scalar, first shape parameter of the Beta density. Must be greater than 1, see details |
beta |
scalar, second shape parameter of the Beta density. Must be greater than 1, see details |
p |
scalar, content of HPD, must lie between 0 and 1 |
plot |
logical flag, if |
xlim |
numeric vector of length 2, the limits of the density's
support to show when plotting; the default is |
debug |
logical flag, if |
The Beta density arises frequently in Bayesian models of
binary events, rates, and proportions, which take on values in the
open unit interval. For instance, the Beta density is a conjugate prior
for the unknown success probability in binomial trials. With shape
parameters and
, the Beta density is
unimodal.
In general, suppose
is a random variable with density
. A highest
density region (HDR) of
with content
is a set
with the
following properties:
and
For a unimodal
Beta density (the class of Beta densities handled by this function),
a HDR of content is simply an interval
.
This function uses numerical methods to solve for the
end points of a HDR for a Beta density with user-specified shape
parameters, via repeated calls to the functions dbeta
,
pbeta
and qbeta
. The function
optimize
is used to find points and
such that
subject to the constraint
where is a Beta density with shape
parameters
and
.
In the special case of , the end points
of a HDR with content
are given by the
quantiles of the Beta density, and are computed with the
qbeta
function.
Again note that the function will only compute a HDR for a unimodal
Beta density, and exit with an error if alpha<=1 | beta <=1
.
Note that the uniform density results with ,
which does not have a unique HDR with content
. With shape parameters
and
(or
vice-versa, respectively), the Beta density is infinite at 0 (or 1,
respectively), but still integrates to one, and so a HDR is still
well-defined (but not implemented here, at least not yet).
Similarly, with
the Beta density is
infinite at both 0 and 1, but integrates to one, and again a HDR of
content
is well-defined in this case, but will be a set of
two disjoint intervals (again, at present, this function does not
cover this case).
If the numerical optimization is successful an vector of length 2,
containing and
, defined above. If the optimization
fails for whatever reason, a vector of
NAs
is returned.
The function will also produce a plot of the density with area under
the density supported by the HDR shaded, if the user calls the
function with plot=TRUE
; the plot will appear on the current
graphics device.
Debugging messages are printed to the console if the debug
logical flag is set to TRUE
.
Simon Jackman [email protected]. Thanks to John Bullock who discovered a bug in an earlier version.
betaHPD(4,5) betaHPD(2,120) betaHPD(120,45,p=.75,xlim=c(0,1))
betaHPD(4,5) betaHPD(2,120) betaHPD(120,45,p=.75,xlim=c(0,1))
A sample of 915 biochemistry graduate students.
data(bioChemists)
data(bioChemists)
art
count of articles produced during last 3 years of Ph.D.
fem
factor indicating gender of student, with levels Men and Women
mar
factor indicating marital status of student, with levels Single and Married
kid5
number of children aged 5 or younger
phd
prestige of Ph.D. department
ment
count of articles produced by Ph.D. mentor during last 3 years
Long, J. Scott. 1990. The origins of sex differences in science. Social Forces. 68(3):1297-1316.
Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, California: Sage.
Election returns and identifying information, California's 53 congressional districts in the 2006 Congressional elections.
data(ca2006)
data(ca2006)
A data frame with 53 observations on the following 11 variables.
district
numeric, number of Congressional district
D
numeric, number of votes for the Democratic candidate
R
numeric, votes for the Republican candidate
Other
numeric, votes for other candidates
IncParty
character, party of the incumbent (or retiring member),
D
or R
IncName
character, last name of the incumbent, character
NA
if no incumbent running
open
logical, TRUE
if no incumbent running
contested
logical, TRUE
if both major parties
ran candidates
Bush2004
numeric, votes for George W. Bush (R) in the district in the 2004 presidential election
Kerry2004
numeric, votes for John Kerry (D) in 2004
Other2004
numeric votes for other candidates in 2004
Bush2000
numeric, votes for George W. Bush in 2000
Gore2000
numeric, votes for Al Gore (D) in 2000
2006 data from the California Secretary of State's web site, https://www.sos.ca.gov/elections/prior-elections/statewide-election-results/general-election-november-7-2006/statement-vote. 2004 and 2000 presidential vote in congressional districts from the 2006 Almanac of American Politics.
Thanks to Arthur Aguirre for the updated links, above.
Michael Baraon and Richard E. Cohen. 2006. The Almanac of American Politics, 2006. National Journal Group: Washington, D.C.
data(ca2006) ## 2006 CA congressional vote against 2004 pvote y <- ca2006$D/(ca2006$D+ca2006$R) x <- ca2006$Kerry2004/(ca2006$Kerry2004+ca2006$Bush2004) pch <- rep(19,length(y)) pch[ca2006$open] <- 1 col <- rep("black",length(y)) col[11] <- "red" ## Pembo (R) loses to McNerney (D) plot(y~x,pch=pch, col=col, xlim=range(x,y,na.rm=TRUE), ylim=range(x,y,na.rm=TRUE), xlab="Kerry Two-Party Vote, 2004", ylab="Democratic Two-Party Vote Share, 2006") abline(0,1) abline(h=.5,lty=2) abline(v=.5,lty=2) legend(x="topleft", bty="n", col=c("red","black","black"), pch=c(19,19,1), legend=c("Seat Changing Hands", "Seat Retained by Incumbent Party", "Open Seat (no incumbent running)") )
data(ca2006) ## 2006 CA congressional vote against 2004 pvote y <- ca2006$D/(ca2006$D+ca2006$R) x <- ca2006$Kerry2004/(ca2006$Kerry2004+ca2006$Bush2004) pch <- rep(19,length(y)) pch[ca2006$open] <- 1 col <- rep("black",length(y)) col[11] <- "red" ## Pembo (R) loses to McNerney (D) plot(y~x,pch=pch, col=col, xlim=range(x,y,na.rm=TRUE), ylim=range(x,y,na.rm=TRUE), xlab="Kerry Two-Party Vote, 2004", ylab="Democratic Two-Party Vote Share, 2006") abline(0,1) abline(h=.5,lty=2) abline(v=.5,lty=2) legend(x="topleft", bty="n", col=c("red","black","black"), pch=c(19,19,1), legend=c("Seat Changing Hands", "Seat Retained by Incumbent Party", "Open Seat (no incumbent running)") )
Add summaries of each roll call vote to a rollcall
object.
computeMargins(object, dropList = NULL)
computeMargins(object, dropList = NULL)
object |
an object of class |
dropList |
a |
The subsetting implied by the dropList
is first applied
to the rollcall
object, via dropRollCall
.
Then, for each remaining roll call vote, the number of legislators
voting “Yea”, “Nay”, and not voting are computed, using
the encoding information in the codes
component of the
rollcall
object via the convertCodes
function. The matrix of vote counts are added to the rollcall
object as a component voteMargins
.
An object of class rollcall
, with a component
voteMargins
that is a matrix with four columns:
Yea |
number of legislators voting “Yea” |
Nay |
number of legislators voting “Nay” |
NA |
number of legislators not voting “Nay” |
Min |
the number of legislators voting on the losing side of the roll call |
Simon Jackman [email protected]
dropRollCall
on specifying a dropList
.
The vote-specific marginals produced by this function are used by as
dropRollCall
, summary.ideal
and
predict.ideal
.
data(s109) tmp <- computeMargins(s109) dim(tmp$voteMargins) ## 645 by 4 tmp <- computeMargins(s109, dropList=list(codes="notInLegis",lop=0)) dim(tmp$voteMargins) ## 544 by 4
data(s109) tmp <- computeMargins(s109) dim(tmp$voteMargins) ## 645 by 4 tmp <- computeMargins(s109, dropList=list(codes="notInLegis",lop=0)) dim(tmp$voteMargins) ## 544 by 4
Sets constraints on specified item parameters in Bayesian analysis of roll call data by generating appropriate priors and start values for Markov chain Monte Carlo iterations.
constrain.items(obj, dropList = list(codes = "notInLegis", lop = 0), x, d = 1)
constrain.items(obj, dropList = list(codes = "notInLegis", lop = 0), x, d = 1)
obj |
an object of class |
dropList |
a |
x |
a |
d |
numeric, positive integer, the number of dimensions for which to set up the priors and start values. |
constrain.items
and its cousin,
constrain.legis
are usefully thought of as
“pre-processor” functions, generating priors and start
values for both the item parameters and the ideal points. For the
items specified in x
, the prior mean for each dimension is set
to the value given in x
, and the prior precision for each
dimension is set to 1e12
(i.e., a near-degenerate
“spike” prior). For the other items, the priors are set to a
mean of 0 and precision 0.01. All of the ideal points are given
normal priors with mean 0, precision 1.
Start values are also generated for both ideal points and item
parameters. The start values for the items specified in x
are
set to the values specified in x
. The list resulting from
constrain.items
can then be given as the value for the
parameters priors
and startvals
when ideal
is run. The user is responsible for ensuring that a sufficient number
of items are constrained such that when ideal
is run,
the model parameters are identified.
dropRollCall
is first called to generate the desired
roll call matrix. The entries of the roll call matrix are mapped to
c(0,1,NA)
using the codes
component of the
rollcall
object
. See the discussion in the
documentation of ideal
for details on the generation of
start values.
a list with elements:
xp |
prior means for ideal points. A matrix of dimensions number
of legislators in |
xpv |
prior meansprecisions for ideal points. A matrix of dimensions number
of legislators in |
bp |
prior means for item parameters. A matrix of dimensions number
of items or votes in |
bpv |
prior meansprecisions for item parameters. A matrix of dimensions number
of items or votes in |
xstart |
start values for ideal points. A matrix of dimensions number
of legislators in |
bstart |
start values for ideal points. A matrix of dimensions number
of items or votes in |
rollcall
, ideal
,
constrain.legis
## Not run: data(s109) f <- system.file("extdata","id1.rda",package="pscl") load(f) id1sum <- summary(id1,include.beta=TRUE) suspect1 <- id1sum$bSig[[1]]=="95 close60 <- id1sum$bResults[[1]][,"Yea"] < 60 close40 <- id1sum$bResults[[1]][,"Yea"] > 40 suspect <- suspect1 & close60 & close40 id1sum$bResults[[1]][suspect,] suspectVotes <- dimnames(id1sum$bResults[[1]][suspect,])[[1]] ## constraints on 2d model, ## close rollcall poorly fit by 1d model ## serves as reference item for 2nd dimension cl <- constrain.items(s109, x=list("2-150"=c(0,7), "2-169"=c(7,0)), d=2) id1Constrained <- ideal(s109, d=2, meanzero=TRUE, priors=cl, startvals=cl, maxiter=1e5, burnin=1e3, thin=1e2) summary(id1Constrained,include.beta=TRUE) ## End(Not run)
## Not run: data(s109) f <- system.file("extdata","id1.rda",package="pscl") load(f) id1sum <- summary(id1,include.beta=TRUE) suspect1 <- id1sum$bSig[[1]]=="95 close60 <- id1sum$bResults[[1]][,"Yea"] < 60 close40 <- id1sum$bResults[[1]][,"Yea"] > 40 suspect <- suspect1 & close60 & close40 id1sum$bResults[[1]][suspect,] suspectVotes <- dimnames(id1sum$bResults[[1]][suspect,])[[1]] ## constraints on 2d model, ## close rollcall poorly fit by 1d model ## serves as reference item for 2nd dimension cl <- constrain.items(s109, x=list("2-150"=c(0,7), "2-169"=c(7,0)), d=2) id1Constrained <- ideal(s109, d=2, meanzero=TRUE, priors=cl, startvals=cl, maxiter=1e5, burnin=1e3, thin=1e2) summary(id1Constrained,include.beta=TRUE) ## End(Not run)
Sets constraints on specified legislators for ideal point estimation by generating appropriate priors and start values.
constrain.legis(obj, dropList = list(codes = "notInLegis", lop = 0), x, d = 1)
constrain.legis(obj, dropList = list(codes = "notInLegis", lop = 0), x, d = 1)
obj |
an object of class |
dropList |
a |
x |
a |
d |
the number of dimensions for which to set up the priors and start values. |
constrain.items
and its cousin,
constrain.legis
are usefully thought of as
“pre-processor” functions, implementing identification
constraints for the ideal point model by generating priors and start
values for both the item parameters and the ideal points.
For the legislators specified in x
, the prior mean for each
dimension is set to the specified value and the prior precision for
each dimension is set to 1e12
(i.e., a near-degenerate
“spike” prior, and, for all practical purposes, constraining
that parameter to a fixed value). For the other legislators, the
priors on their ideal points are set to a mean of 0 and a small
precision of .01, corresponding to a prior variance of 100, or a prior
95 percent confidence interval of -20 to 20. All of the item
parameter priors are set to mean 0, precision 0.01.
Start values are also generated for both ideal points and item
parameters. The start values for the legislators named in
x
are set to the values specified in x
. The list
resulting from constrain.legis
can then be given as the value
for the parameters priors
and startvals
when
ideal
is run. constrain.legis
requires that
d+1
constraints be specified; if the constrained ideal points
points are linearly independent, then the parameters of the
item-response model are (at least locally) identified. For instance,
when fitting a 1 dimensional model, constraining the ideal points of
two legislators is sufficient to globally identify the model
parameters.
dropRollCall
is first called to generate the desired
roll call matrix. The entries of the roll call matrix are mapped to
c(0,1,NA)
using the codes
component of the
rollcall
object
. See the discussion in the
documentation of ideal
for details on the generation of
start values.
a list with elements:
xp |
prior means for ideal points. A matrix of dimensions number
of legislators in |
xpv |
prior meansprecisions for ideal points. A matrix of
dimensions number of legislators in |
bp |
prior means for item parameters. A matrix of
dimensions number of items or votes in |
bpv |
prior meansprecisions for item parameters. A matrix of
dimensions number of items or votes in |
x |
start values for ideal points. A matrix of dimensions
number of legislators in |
b |
start values for ideal points. A matrix of dimensions
number of items or votes in |
rollcall
, ideal
,
constrain.items
.
See pmatch
on how supplied names are matched against the
names in the rollcall
object.
data(s109) cl <- constrain.legis(s109, x=list("KENNEDY"=-1, "ENZI"=1), d=1) ## Not run: ## too long for examples id1Constrained <- ideal(s109, d=1, priors=cl, ## use cl startvals=cl, ## use cl maxiter=5000, burnin=500, thin=25) summary(id1Constrained) cl2 <- constrain.legis(s109, x=list("KENNEDY"=c(-1,0), "ENZI"=c(1,0), "CHAFEE"=c(0,-.5)), d=2) id2Constrained <- ideal(s109, d=2, priors=cl2, ## priors (w constraints) startvals=cl2, ## start value (w constraints) store.item=TRUE, maxiter=5000, burnin=500, thin=25) summary(id2Constrained,include.items=TRUE) ## End(Not run)
data(s109) cl <- constrain.legis(s109, x=list("KENNEDY"=-1, "ENZI"=1), d=1) ## Not run: ## too long for examples id1Constrained <- ideal(s109, d=1, priors=cl, ## use cl startvals=cl, ## use cl maxiter=5000, burnin=500, thin=25) summary(id1Constrained) cl2 <- constrain.legis(s109, x=list("KENNEDY"=c(-1,0), "ENZI"=c(1,0), "CHAFEE"=c(0,-.5)), d=2) id2Constrained <- ideal(s109, d=2, priors=cl2, ## priors (w constraints) startvals=cl2, ## start value (w constraints) store.item=TRUE, maxiter=5000, burnin=500, thin=25) summary(id2Constrained,include.items=TRUE) ## End(Not run)
Convert roll call matrix to binary form using encoding information.
convertCodes(object, codes = object$codes)
convertCodes(object, codes = object$codes)
object |
|
codes |
list, mapping entries in the |
See rollcall
for details on the form of the
codes
list.
a matrix
with dimensions equal to the dimensions of
the votes
component of the rollcall
object.
Any entries in the votes
matrix that can
not be mapped into c(0,1,NA)
using the information in
codes
are mapped to NA
, with an informative message
sent to the console.
Simon Jackman [email protected]
data(s109) mat <- convertCodes(s109) table(mat,exclude=NULL)
data(s109) mat <- convertCodes(s109) table(mat,exclude=NULL)
Drop user-specified elements of rollcall object, returning a roll call object.
dropRollCall(object, dropList,debug=FALSE)
dropRollCall(object, dropList,debug=FALSE)
object |
an object of class |
dropList |
a
|
debug |
|
It is often desirable to restrict the analysis of roll call
data in various ways. For one thing, unanimous votes provide no
information discriminating among legislators: hence,
summary and analysis should almost always use dropList=list(lop=0)
.
See the examples for other possibilities, limited only by the
information supplied in legis.data
and votes.data
.
An object of class rollcall
with components modified/added by
the subsetting indicated in the dropList
.
With the exception of codes
, each component of
dropList
generates a vector of mode logical
,
either with respect to legislators or votes. These logical
vectors are then combined element-wise, such that if any one of
the subsetting restrictions is TRUE
for a particular
legislator or vote, then that legislator or vote is dropped. Some
summaries are reported to the console along the way if debug=TRUE
.
dropRollCall
adds a component
named dropInfo
to the rollcall
object it returns. This
component is itself a list containing named components
a vector of mode logical
, with each
element TRUE
if the legislator is retained in the returned
rollcall
object.
a vector of mode logical
, with each element
TRUE
if the corresponding is retained in the returned
rollcall
object.
the dropList
supplied as input to
dropRollCall
.
If the input rollcall
object is itself the product of a call to
dropRollCall
, the dropInfo
component on output is a list
with named components
the dropInfo
component of the input
rollcall
object.
the dropInfo
list created by the current call to
dropRollCall
.
Functions like summary.rollcall
try to handle this information
sensibly.
When dropList
uses the dropLegis
or dropVotes
components then dropList
should be constructed via the
alist
command; this ensures that the
dropLegis
and dropVotes
components of
dropList
are objects of mode expression
, and
eval
uated to mode logical
in the
legis.data
and vote.data
environments
by
the function, if possible (rather than being evaluated immediately in
the environment calling dropRollCall
or constructing
dropList
). See the examples. This is not entirely
satisfactory, and behavior more like the subset
argument in
function lm
would be preferable.
Simon Jackman [email protected]
dropUnanimous
, summary.rollcall
,
ideal
, alist
.
data(s109) s109.working <- dropRollCall(s109, dropList=list(lop=0)) summary(s109.working) s109.working <- dropRollCall(s109, dropList=list(lop=0, code="notInLegis")) summary(s109.working) s109.working <- dropRollCall(s109, dropList=list(lop=3, code="notInLegis")) summary(s109.working) ## note use of alist, since dropLegis is an expression dropList <- alist(lop=3, dropLegis=party!="D", code="notInLegis") s109.working <- dropRollCall(s109,dropList=dropList,debug=TRUE) summary(s109.working) s109.working <- dropRollCall(s109.working,dropList=list(legisMin=25)) summary(s109.working) ## Not run: ## read 102nd House from Poole web site h102 <- readKH("ftp://voteview.ucsd.edu/dtaord/hou102kh.ord") ## drop President from roll call matrix h102 <- dropRollCall(h102, dropList=alist(dropLegis=state=="USA")) summary(h102) ## End(Not run)
data(s109) s109.working <- dropRollCall(s109, dropList=list(lop=0)) summary(s109.working) s109.working <- dropRollCall(s109, dropList=list(lop=0, code="notInLegis")) summary(s109.working) s109.working <- dropRollCall(s109, dropList=list(lop=3, code="notInLegis")) summary(s109.working) ## note use of alist, since dropLegis is an expression dropList <- alist(lop=3, dropLegis=party!="D", code="notInLegis") s109.working <- dropRollCall(s109,dropList=dropList,debug=TRUE) summary(s109.working) s109.working <- dropRollCall(s109.working,dropList=list(legisMin=25)) summary(s109.working) ## Not run: ## read 102nd House from Poole web site h102 <- readKH("ftp://voteview.ucsd.edu/dtaord/hou102kh.ord") ## drop President from roll call matrix h102 <- dropRollCall(h102, dropList=alist(dropLegis=state=="USA")) summary(h102) ## End(Not run)
Drop unanimous votes from rollcall objects and rollcall matrices.
dropUnanimous(obj, lop = 0)
dropUnanimous(obj, lop = 0)
obj |
|
lop |
numeric, non-negative integer, less than number of
legislators represented in |
Unanimous votes are the equivalent of test items that all subjects score “correct” (or all subjects scores “incorrect”); since there is no variation among the legislators/subjects, these votes/items provide no information as to latent traits (ideology, preferences, ability). A reasonably large number of rollcalls in any contemporary U.S. Congress are unanimous.
Specific methods are provided for objects of class
rollcall
or matrix
.
A rollcall
object or a matrix
depending on
the class of object
.
Simon Jackman [email protected]
dropRollCall
,
rollcall
, summary.rollcall
, ideal
data(s109) s109.working <- dropUnanimous(s109) summary(s109.working)
data(s109) s109.working <- dropUnanimous(s109) summary(s109.working)
Batting averages for 18 major league baseball players, first 45 at bats of the 1970 season.
data(EfronMorris)
data(EfronMorris)
name
character, name of player
team
character, team of player, abbreviated
league
character, National League or American League
r
numeric, hits in 1st 45 at bats
y
numeric, r
/45, batting average over 1st
45 at bats
n
numeric, number of at bats, remainder of 1970 season
p
numeric, batting average over remainder of 1970 season
Efron, Bradley and Carl Morris. 1975. Data Analysis Using Stein's Estimator and Its Generalizations. Journal of the American Statistical Association. 70:311-319.
data(EfronMorris) attach(EfronMorris) plot(p~y, xlim=range(p,y), ylim=range(p,y), xlab="Batting Average, 1st 45 at bats", ylab="Batting Average, Remainder of Season") abline(0,1)
data(EfronMorris) attach(EfronMorris) plot(p~y, xlim=range(p,y), ylim=range(p,y), xlab="Batting Average, 1st 45 at bats", ylab="Batting Average, Remainder of Season") abline(0,1)
Given a fitted model of class ideal
, return the
rollcall
object that was used in the model fitting
(i.e., apply all subsetting and recoding implied by the dropList
passed to ideal
).
extractRollCallObject(object)
extractRollCallObject(object)
object |
an object of class |
This function is used by many post-estimation commands that operate on
objects of class ideal
. The function inspects the
call
attribute of the
ideal
object, extracting the name of the
rollcall
object and the dropList
, then hands them
over to dropRollCall
.
An object of class rollcall
Simon Jackman [email protected]
rollcall
; see dropRollCall
for
details on the form of a dropList
.
data(s109) f = system.file("extdata","id1.rda",package="pscl") load(f) tmp <- extractRollCallObject(id1) summary(tmp) v <- convertCodes(tmp) ## roll call matrix per se
data(s109) f = system.file("extdata","id1.rda",package="pscl") load(f) tmp <- extractRollCallObject(id1) summary(tmp) v <- convertCodes(tmp) ## roll call matrix per se
Cross-tabulations of actual outcomes against predicted
outcomes for discrete data models, with summary statistics such as
percent correctly predicted (PCP) under fitted and null models. For
models with binary responses (generalized linear models with
family=binomial
), the user can specific a classification
threshold for the predicted probabilities.
hitmiss(obj, digits = max(3, getOption("digits") - 3), ...) ## S3 method for class 'glm' hitmiss(obj,digits=max(3,getOption("digits")-3), ..., k=.5)
hitmiss(obj, digits = max(3, getOption("digits") - 3), ...) ## S3 method for class 'glm' hitmiss(obj,digits=max(3,getOption("digits")-3), ..., k=.5)
obj |
a fitted model object, such as a |
digits |
number of digits to display in on-screen output |
... |
additional arguments passed to or from other functions |
k |
classification threshold for binary models |
For models with binary responses, the user can specify a
parameter 0 < k
< 1; if the predicted probabilities exceed this
threshold then the model is deemed to have predicted y=1, and
otherwise to have predicted y=0. Measures like percent correctly
predicted are crude summaries of model fit; the cross-tabulation of
actual against predicted is somewhat more informative, providing a
little more insight as to where the model fits less well.
For hitmiss.glm
, a vector of length 3:
pcp |
Percent Correctly Predicted |
pcp0 |
Percent Correctly Predicted among y=0 |
pcp1 |
Percent Correctly Predicted among y=1 |
To-do: The glm
method should also handle binomial data presented
as two-vector success/failures counts; and count data with
family=poisson
, the glm.nb
models and zeroinfl
and hurdle
etc. We should also make the output a class with
prettier print methods, i.e., save the cross-tabulation in the
returned object etc.
Simon Jackman [email protected]
pR2
for pseudo r-squared; predict
;
extractAIC
. See also the ROCR package and the lroc
function in the epicalc package for ROC computations for assessing binary classifications.
data(admit) ## ordered probit model op1 <- MASS::polr(score ~ gre.quant + gre.verbal + ap + pt + female, Hess=TRUE, data=admit, method="probit") hitmiss(op1)
data(admit) ## ordered probit model op1 <- MASS::polr(score ~ gre.quant + gre.verbal + ap + pt + female, Hess=TRUE, data=admit, method="probit") hitmiss(op1)
Fit hurdle regression models for count data via maximum likelihood.
hurdle(formula, data, subset, na.action, weights, offset, dist = c("poisson", "negbin", "geometric"), zero.dist = c("binomial", "poisson", "negbin", "geometric"), link = c("logit", "probit", "cloglog", "cauchit", "log"), control = hurdle.control(...), model = TRUE, y = TRUE, x = FALSE, ...)
hurdle(formula, data, subset, na.action, weights, offset, dist = c("poisson", "negbin", "geometric"), zero.dist = c("binomial", "poisson", "negbin", "geometric"), link = c("logit", "probit", "cloglog", "cauchit", "log"), control = hurdle.control(...), model = TRUE, y = TRUE, x = FALSE, ...)
formula |
symbolic description of the model, see details. |
data , subset , na.action
|
arguments controlling formula processing
via |
weights |
optional numeric vector of weights. |
offset |
optional numeric vector with an a priori known component to be included in the linear predictor of the count model. See below for more information on offsets. |
dist |
character specification of count model family. |
zero.dist |
character specification of the zero hurdle model family. |
link |
character specification of link function in the binomial
zero hurdle (only used if |
control |
a list of control arguments specified via
|
model , y , x
|
logicals. If |
... |
arguments passed to |
Hurdle count models are two-component models with a truncated count component for positive counts and a hurdle component that models the zero counts. Thus, unlike zero-inflation models, there are not two sources of zeros: the count model is only employed if the hurdle for modeling the occurrence of zeros is exceeded. The count model is typically a truncated Poisson or negative binomial regression (with log link). The geometric distribution is a special case of the negative binomial with size parameter equal to 1. For modeling the hurdle, either a binomial model can be employed or a censored count distribution. The outcome of the hurdle component of the model is the occurrence of a non-zero (positive) count. Thus, for most models, positive coefficients in the hurdle component indicate that an increase in the regressor increases the probability of a non-zero count. Binomial logit and censored geometric models as the hurdle part both lead to the same likelihood function and thus to the same coefficient estimates. A censored negative binomial model for the zero hurdle is only identified if there is at least one non-constant regressor with (true) coefficient different from zero (and if all coefficients are close to zero the model can be poorly conditioned).
The formula
can be used to specify both components of the model:
If a formula
of type y ~ x1 + x2
is supplied, then the same
regressors are employed in both components. This is equivalent to
y ~ x1 + x2 | x1 + x2
. Of course, a different set of regressors
could be specified for the zero hurdle component, e.g.,
y ~ x1 + x2 | z1 + z2 + z3
giving the count data model y ~ x1 + x2
conditional on (|
) the zero hurdle model y ~ z1 + z2 + z3
.
Offsets can be specified in both parts of the model pertaining to count and
zero hurdle model: y ~ x1 + offset(x2) | z1 + z2 + offset(z3)
, where
x2
is used as an offset (i.e., with coefficient fixed to 1) in the
count part and z3
analogously in the zero hurdle part. By the rule
stated above y ~ x1 + offset(x2)
is expanded to
y ~ x1 + offset(x2) | x1 + offset(x2)
. Instead of using the
offset()
wrapper within the formula
, the offset
argument
can also be employed which sets an offset only for the count model. Thus,
formula = y ~ x1
and offset = x2
is equivalent to
formula = y ~ x1 + offset(x2) | x1
.
All parameters are estimated by maximum likelihood using optim
,
with control options set in hurdle.control
.
Starting values can be supplied, otherwise they are estimated by glm.fit
(the default). By default, the two components of the model are estimated separately
using two optim
calls. Standard errors are derived numerically using
the Hessian matrix returned by optim
. See
hurdle.control
for details.
The returned fitted model object is of class "hurdle"
and is similar
to fitted "glm"
objects. For elements such as "coefficients"
or
"terms"
a list is returned with elements for the zero and count components,
respectively. For details see below.
A set of standard extractor functions for fitted model objects is available for
objects of class "hurdle"
, including methods to the generic functions
print
, summary
, coef
,
vcov
, logLik
, residuals
,
predict
, fitted
, terms
,
model.matrix
. See predict.hurdle
for more details
on all methods.
An object of class "hurdle"
, i.e., a list with components including
coefficients |
a list with elements |
residuals |
a vector of raw residuals (observed - fitted), |
fitted.values |
a vector of fitted means, |
optim |
a list (of lists) with the output(s) from the |
control |
the control arguments passed to the |
start |
the starting values for the parameters passed to the |
weights |
the case weights used, |
offset |
a list with elements |
n |
number of observations (with weights > 0), |
df.null |
residual degrees of freedom for the null model (= |
df.residual |
residual degrees of freedom for fitted model, |
terms |
a list with elements |
theta |
estimate of the additional |
SE.logtheta |
standard error(s) for |
loglik |
log-likelihood of the fitted model, |
vcov |
covariance matrix of all coefficients in the model (derived from the
Hessian of the |
dist |
a list with elements |
link |
character string describing the link if a binomial zero hurdle model is used, |
linkinv |
the inverse link function corresponding to |
converged |
logical indicating successful convergence of |
call |
the original function call, |
formula |
the original formula, |
levels |
levels of the categorical regressors, |
contrasts |
a list with elements |
model |
the full model frame (if |
y |
the response count vector (if |
x |
a list with elements |
Achim Zeileis <[email protected]>
Cameron, A. Colin and Pravin K. Trivedi. 1998. Regression Analysis of Count Data. New York: Cambridge University Press.
Cameron, A. Colin and Pravin K. Trivedi 2005. Microeconometrics: Methods and Applications. Cambridge: Cambridge University Press.
Mullahy, J. 1986. Specification and Testing of Some Modified Count Data Models. Journal of Econometrics. 33:341–365.
Zeileis, Achim, Christian Kleiber and Simon Jackman 2008. “Regression Models for Count Data in R.” Journal of Statistical Software, 27(8). URL https://www.jstatsoft.org/v27/i08/.
hurdle.control
, glm
,
glm.fit
, glm.nb
,
zeroinfl
## data data("bioChemists", package = "pscl") ## logit-poisson ## "art ~ ." is the same as "art ~ . | .", i.e. ## "art ~ fem + mar + kid5 + phd + ment | fem + mar + kid5 + phd + ment" fm_hp1 <- hurdle(art ~ ., data = bioChemists) summary(fm_hp1) ## geometric-poisson fm_hp2 <- hurdle(art ~ ., data = bioChemists, zero = "geometric") summary(fm_hp2) ## logit and geometric model are equivalent coef(fm_hp1, model = "zero") - coef(fm_hp2, model = "zero") ## logit-negbin fm_hnb1 <- hurdle(art ~ ., data = bioChemists, dist = "negbin") summary(fm_hnb1) ## negbin-negbin ## (poorly conditioned zero hurdle, note the standard errors) fm_hnb2 <- hurdle(art ~ ., data = bioChemists, dist = "negbin", zero = "negbin") summary(fm_hnb2)
## data data("bioChemists", package = "pscl") ## logit-poisson ## "art ~ ." is the same as "art ~ . | .", i.e. ## "art ~ fem + mar + kid5 + phd + ment | fem + mar + kid5 + phd + ment" fm_hp1 <- hurdle(art ~ ., data = bioChemists) summary(fm_hp1) ## geometric-poisson fm_hp2 <- hurdle(art ~ ., data = bioChemists, zero = "geometric") summary(fm_hp2) ## logit and geometric model are equivalent coef(fm_hp1, model = "zero") - coef(fm_hp2, model = "zero") ## logit-negbin fm_hnb1 <- hurdle(art ~ ., data = bioChemists, dist = "negbin") summary(fm_hnb1) ## negbin-negbin ## (poorly conditioned zero hurdle, note the standard errors) fm_hnb2 <- hurdle(art ~ ., data = bioChemists, dist = "negbin", zero = "negbin") summary(fm_hnb2)
Various parameters that control fitting of hurdle regression models
using hurdle
.
hurdle.control(method = "BFGS", maxit = 10000, trace = FALSE, separate = TRUE, start = NULL, ...)
hurdle.control(method = "BFGS", maxit = 10000, trace = FALSE, separate = TRUE, start = NULL, ...)
method |
characters string specifying the |
maxit |
integer specifying the |
trace |
logical or integer controlling whether tracing information on
the progress of the optimization should be produced (passed to |
separate |
logical. Should the estimation of the parameters in the truncated count component and hurdle zero component be carried out separately? See details. |
start |
an optional list with elements |
... |
arguments passed to |
All parameters in hurdle
are estimated by maximum likelihood
using optim
with control options set in hurdle.control
.
Most arguments are passed on directly to optim
, only trace
is also
used within hurdle
and separate
/start
control how optim
is called.
Starting values can be supplied via start
or estimated by
glm.fit
(default).
If separate = TRUE
(default) the likelihoods of the truncated count component and the
hurdle zero component will be maximized separately, otherwise the joint likelihood is
set up and maximized. Standard errors are derived numerically using
the Hessian matrix returned by optim
. To supply starting
values, start
should be a list with elements "count"
and "zero"
and potentially "theta"
(a named vector, for models with negative binomial components
only) containing the starting values for the coefficients of the corresponding component
of the model.
A list with the arguments specified.
Achim Zeileis <[email protected]>
data("bioChemists", package = "pscl") ## default start values fm1 <- hurdle(art ~ fem + ment, data = bioChemists, dist = "negbin", zero = "negbin") ## user-supplied start values and other options fm2 <- hurdle(art ~ fem + ment, data = bioChemists, dist = "negbin", zero = "negbin", trace=TRUE, separate=FALSE, start = list(count = c(0.3, -0.2, 0), zero = c(4, -2, 0.8), theta = c(count = 2, zero = 0.1)))
data("bioChemists", package = "pscl") ## default start values fm1 <- hurdle(art ~ fem + ment, data = bioChemists, dist = "negbin", zero = "negbin") ## user-supplied start values and other options fm2 <- hurdle(art ~ fem + ment, data = bioChemists, dist = "negbin", zero = "negbin", trace=TRUE, separate=FALSE, start = list(count = c(0.3, -0.2, 0), zero = c(4, -2, 0.8), theta = c(count = 2, zero = 0.1)))
Wald test of the null hypothesis that no zero hurdle is required in hurdle regression models for count data.
hurdletest(object, ...)
hurdletest(object, ...)
object |
A fitted model object of class |
... |
arguments passed to |
If the same count distribution and the same set of regressors is used in the hurdle model for both, the count component and the zero hurdle component, then a test of pairwise equality between all coefficients from the two components assesses the null hypothesis that no hurdle is needed in the model.
The function hurdletest
is a simple convenience interface to
the function linearHypothesis
from the car
packages that can be employed to carry out a Wald test for this
hypothesis.
An object of class "anova"
as returned by linearHypothesis
.
Achim Zeileis <[email protected]>
Cameron, A. Colin and Pravin K. Trivedi. 1998. Regression Analysis of Count Data. New York: Cambridge University Press.
Cameron, A. Colin and Pravin K. Trivedi 2005. Microeconometrics: Methods and Applications. Cambridge: Cambridge University Press.
data("bioChemists", package = "pscl") fm <- hurdle(art ~ ., data = bioChemists, dist = "negbin", zero = "negbin") hurdletest(fm)
data("bioChemists", package = "pscl") fm <- hurdle(art ~ ., data = bioChemists, dist = "negbin", zero = "negbin") hurdletest(fm)
Analysis of rollcall
data via the spatial voting model;
equivalent to a 2 parameter item-response model to educational testing data. Model fitting via Markov chain Monte Carlo (MCMC).
ideal(object, codes = object$codes, dropList = list(codes = "notInLegis", lop = 0), d = 1, maxiter = 10000, thin = 100, burnin = 5000, impute = FALSE, normalize = FALSE, meanzero = normalize, priors = NULL, startvals = "eigen", store.item = FALSE, file = NULL, verbose=FALSE, use.voter=NULL)
ideal(object, codes = object$codes, dropList = list(codes = "notInLegis", lop = 0), d = 1, maxiter = 10000, thin = 100, burnin = 5000, impute = FALSE, normalize = FALSE, meanzero = normalize, priors = NULL, startvals = "eigen", store.item = FALSE, file = NULL, verbose=FALSE, use.voter=NULL)
object |
an object of class |
codes |
a |
dropList |
a |
d |
numeric, (small) positive integer (default = 1), dimensionality of the ability space (or "policy space" in the rollcall context). |
maxiter |
numeric, positive integer, multiple of |
thin |
numeric, positive integer, thinning interval used for recording MCMC iterations. |
burnin |
number of MCMC iterations to run before recording. The
iteration numbered |
impute |
|
normalize |
|
meanzero |
to be deprecated/ignored; use |
priors |
a
None of the components should contain |
startvals |
either a string naming a method for generating start
values, valid options are |
store.item |
|
file |
string, file to write MCMC output. Default is
|
verbose |
logical, default
is |
use.voter |
A vector of logicals of length |
The function fits a d
+1 parameter item-response model to
the roll call data object, so in one dimension the model reduces
to the two-parameter item-response model popular in educational testing.
See References.
Identification: The model parameters are not identified without the user supplying some restrictions on the model parameters; i.e., translations, rotations and re-scalings of the ideal points are observationally equivalent, via offsetting transformations of the item parameters. It is the user's responsibility to impose these identifying restrictions if desired. The following brief discussion provides some guidance.
For one-dimensional models (i.e., d=1
), a simple route to
identification is the normalize
option, by imposing the restriction that the means of the posterior densities of the ideal points (ability parameters) have mean zero and standard deviation one, across legislators (test-takers). This normalization supplies
local identification (that is, identification up to a 180 degree rotation of
the recovered dimension).
Near-degenerate “spike” priors
(priors with arbitrarily large precisions) or the
constrain.legis
option on any two legislators' ideal points
ensures global identification in one dimension.
Identification in higher dimensions can be obtained by supplying
fixed values for d+1
legislators' ideal points, provided the
supplied fixed points span a d
-dimensional space (e.g., three
supplied ideal points form a triangle in d=2
dimensions), via
the constrain.legis
option. In this case the function
defaults to vague normal priors on the unconstrained ideal points, but at each iteration the sampled
ideal points are transformed back into the space of identified
parameters, applying the linear transformation that maps the
d+1
fixed ideal points from their sampled values to their
fixed values. Alternatively, one can impose
restrictions on the item parameters via
constrain.items
. See the examples in the documentation
for the constrain.legis
and
constrain.items
.
Another route to identification is via post-processing. That
is, the user can run ideal
without any identification
constraints. This does not pose any formal/technical problem in a
Bayesian analysis. The fact that the posterior density may have
multiple modes doesn't imply that the posterior is improper or that
it can't be explored via MCMC methods. – but then use the function
postProcess
to map the MCMC output from the space of
unidentified parameters into the subspace of identified parameters.
See the example in the documentation for the
postProcess
function.
When the normalize
option is set to TRUE
, an
unidentified model is run, and the ideal
object is
post-processed with the normalize
option, and then returned
to the user (but again, note that the normalize
option is
only implemented for unidimensional models).
Start values. Start values can be supplied by the user, or generated by the function itself.
The default method, corresponding to startvals="eigen"
, first
forms a n
-by-n
correlation matrix from the
double-centered roll call matrix (subtracting row means, and column
means, adding in the grand mean), and then extracts the first
d
principal components (eigenvectors), scaling the
eigenvectors by the square root of their corresponding eigenvector.
If the user is imposing constraints on ideal points (via
constrain.legis
), these constraints are applied to the
corresponding elements of the start values generated from the
eigen-decomposition. Then, to generate start
values for the rollcall/item parameters, a series of
binomial
glms
are
estimated (with a probit link
), one for
each rollcall/item, . The votes on the
-th rollcall/item are binary responses (presumed to be
conditionally independent given each legislator's latent
preference), and the (constrained or unconstrained) start values for
legislators are used as predictors. The estimated coefficients from
these probit models are used as start values for the item
discrimination and difficulty parameters (with the intercepts from
the probit GLMs multiplied by -1 so as to make those coefficients
difficulty parameters).
The default eigen
method generates extremely good start
values for low-dimensional models fit to recent U.S. congresses,
where high rates of party line voting result in excellent fits from
low dimensional models. The eigen
method may be
computationally expensive or lead to memory errors for
rollcall
objects with large numbers of legislators.
The random
method generates start values via iid sampling
from a N(0,1) density, via rnorm
, imposing any
constraints that may have been supplied via
constrain.legis
, and then uses the probit method
described above to get start values for the rollcall/item
parameters.
If startvals
is a list
, it must contain the named
components x
and/or b
, or named components that
(uniquely) begin with the letters x
and/or b
. The
component x
must be a vector or a matrix of dimensions equal to
the number of individuals (legislators) by d
. If supplied,
startvals$b
must be a matrix with dimension number of items
(votes) by d
+1. The x
and b
components cannot
contain NA
. If x
is not supplied when startvals
is a list, then start values are generated using the default
eiegn
method described above, and start values for the
rollcall/item parameters are regenerated using the probit method,
ignoring any user-supplied values in startvals$b
. That is,
user-supplied values in startvals$b
are only used when
accompanied by a valid set of start values for the ideal points in
startvals$x
.
Implementation via Data Augmentation. The MCMC algorithm
for this problem consists of a Gibbs sampler for the ideal points
(latent traits)
and item parameters, conditional on latent data , generated
via a data augmentation (DA) step. That is, following Albert (1992) and
Albert and Chib (1993), if
we sample from the
truncated normal density
and for we sample
where is an indicator function evaluating to one
if its argument is true and zero otherwise. Given the
latent
, the conditional distributions for
and
are extremely simple to sample from; see the
references for details.
This data-augmented Gibbs sampling strategy is easily implemented, but can sometimes require many thousands of samples in order to generate tolerable explorations of the posterior densities of the latent traits, particularly for legislators with short and/or extreme voting histories (the equivalent in the educational testing setting is a test-taker who gets almost every item right or wrong).
a list
of class ideal
with named components
n |
|
m |
|
d |
|
x |
a three-dimensional |
beta |
a three-dimensional |
xbar |
a |
betabar |
a |
args |
calling arguments, evaluated in the frame calling |
call |
an object of class |
Simon Jackman [email protected], with help from Christina Maimone and Alex Tahk.
Albert, James. 1992. Bayesian Estimation of normal ogive item response curves using Gibbs sampling. Journal of Educational Statistics. 17:251-269.
Albert, James H. and Siddhartha Chib. 1993. Bayesian Analysis of Binary and Polychotomous Response Data. Journal of the American Statistical Association. 88:669-679.
Clinton, Joshua, Simon Jackman and Douglas Rivers. 2004. The Statistical Analysis of Roll Call Data. American Political Science Review. 98:335-370.
Jackman, Simon. 2009. Bayesian Analysis for the Social Sciences. Wiley: Hoboken, New Jersey.
Jessee, Stephen. 2016. (How) Can We Estimate the Ideology of Citizens and Political Elites on the Same Scale? American Journal of Political Science.
Patz, Richard J. and Brian W. Junker. 1999. A Straightforward Approach to Markov Chain Monte Carlo Methods for Item Response Models. Journal of Education and Behavioral Statistics. 24:146-178.
Rivers, Douglas. 2003. “Identification of Multidimensional Item-Response Models.” Typescript. Department of Political Science, Stanford University.
van Dyk, David A and Xiao-Li Meng. 2001. The art of data augmentation (with discussion). Journal of Computational and Graphical Statistics. 10(1):1-111.
rollcall
, summary.ideal
,
plot.ideal
, predict.ideal
.
tracex
for graphical display of MCMC iterative
history.
idealToMCMC
converts the MCMC iterates in an
ideal
object to a form that can be used by the coda
library.
constrain.items
and
constrain.legis
for implementing identifying
restrictions.
postProcess
for imposing identifying restrictions
ex post.
MCMCirt1d
and
MCMCirtKd
in the MCMCpack
package provide similar functionality to ideal
.
## Not run: ## long run, many iterations data(s109) n <- dim(s109$legis.data)[1] x0 <- rep(0,n) x0[s109$legis.data$party=="D"] <- -1 x0[s109$legis.data$party=="R"] <- 1 id1 <- ideal(s109, d=1, startvals=list(x=x0), normalize=TRUE, store.item=TRUE, maxiter=260E3, burnin=10E3, thin=100) ## End(Not run)
## Not run: ## long run, many iterations data(s109) n <- dim(s109$legis.data)[1] x0 <- rep(0,n) x0[s109$legis.data$party=="D"] <- -1 x0[s109$legis.data$party=="R"] <- 1 id1 <- ideal(s109, d=1, startvals=list(x=x0), normalize=TRUE, store.item=TRUE, maxiter=260E3, burnin=10E3, thin=100) ## End(Not run)
Converts the x element of an ideal
object to an MCMC object,
as used in the coda package.
idealToMCMC(object, burnin=NULL)
idealToMCMC(object, burnin=NULL)
object |
an object of class |
burnin |
of the recorded MCMC samples, how many to discard as
burnin? Default is |
A mcmc
object as used by the
coda package, starting at
iteration start
, drawn from the x
component of the
ideal
object.
When specifying a value of burnin
different from that used
in fitting the ideal
object, note a distinction
between the iteration numbers of the stored iterations, and the
number of stored iterations. That is, the n
-th iteration
stored in an ideal
object will not be iteration
n
if the user specified thin>1
in the call to
ideal
. Here, iterations are tagged with their
iteration number. Thus, if the user called ideal
with
thin=10
and burnin=100
then the stored iterations are
numbered 100, 110, 120, ...
. Any future subsetting via a
burnin
refers to this iteration number.
data(s109) f = system.file("extdata",package="pscl","id1.rda") load(f) id1coda <- idealToMCMC(id1) summary(id1coda)
data(s109) f = system.file("extdata",package="pscl","id1.rda") load(f) id1coda <- idealToMCMC(id1) summary(id1coda)
Density, distribution function, quantile function, and highest density region calculation for the inverse-Gamma distribution with parameters alpha
and beta
.
densigamma(x,alpha,beta) pigamma(q,alpha,beta) qigamma(p,alpha,beta) rigamma(n,alpha,beta) igammaHDR(alpha,beta,content=.95,debug=FALSE)
densigamma(x,alpha,beta) pigamma(q,alpha,beta) qigamma(p,alpha,beta) rigamma(n,alpha,beta) igammaHDR(alpha,beta,content=.95,debug=FALSE)
x , q
|
vector of quantiles |
p |
vector of probabilities |
n |
number of random samples in |
alpha , beta
|
rate and shape parameters of the inverse-Gamma density, both positive |
content |
scalar, 0 < |
debug |
logical; if TRUE, debugging information from the search for the HDR is printed |
The inverse-Gamma density arises frequently in Bayesian
analysis of normal data, as the (marginal) conjugate prior for
the unknown variance parameter. The inverse-Gamma density
for with parameters
and
is
where is the
gamma
function
and so ensures integrates to one. The
inverse-Gamma density has a mean at
for
and has variance
for
. The inverse-Gamma density has a unique mode at
.
The evaluation of the density,
cumulative distribution function and quantiles is done by
calls to the dgamma
, pgamma
and igamma
functions, with the arguments appropriately transformed. That
is, note
that if then
.
Highest Density Regions. In general, suppose
has a density
, where
. Then a
highest density region (HDR) for
with content
is a region (or set of regions)
such that:
and
For a continuous, unimodal
density defined with respect to a single parameter (like the
inverse-Gamma case considered here with parameters ), a HDR region
of content
(with
) is a unique, closed
interval on the real half-line.
This function uses numerical methods to solve for the
boundaries of a HDR with content
for the
inverse-Gamma density, via repeated calls the functions
densigamma
, pigamma
and
qigamma
. In particular, the function uniroot
is used to
find points and
such that
subject to the constraint
densigamma
gives the density, pigamma
the
distribution function, qigamma
the quantile function,
rigamma
generates random samples, and igammaHDR
gives
the lower and upper limits of the HDR, as defined above (NA
s if
the optimization is not successful).
The densigamma
is named so as not to conflict with the
digamma
function in the R base
package
(the derivative of the gamma
function).
Simon Jackman [email protected]
gamma
, dgamma
,
pgamma
, qgamma
, uniroot
alpha <- 4 beta <- 30 summary(rigamma(n=1000,alpha,beta)) xseq <- seq(.1,30,by=.1) fx <- densigamma(xseq,alpha,beta) plot(xseq,fx,type="n", xlab="x", ylab="f(x)", ylim=c(0,1.01*max(fx)), yaxs="i", axes=FALSE) axis(1) title(substitute(list(alpha==a,beta==b),list(a=alpha,b=beta))) q <- igammaHDR(alpha,beta,debug=TRUE) xlo <- which.min(abs(q[1]-xseq)) xup <- which.min(abs(q[2]-xseq)) plotZero <- par()$usr[3] polygon(x=xseq[c(xlo,xlo:xup,xup:xlo)], y=c(plotZero, fx[xlo:xup], rep(plotZero,length(xlo:xup))), border=FALSE, col=gray(.45)) lines(xseq,fx,lwd=1.25) ## Not run: alpha <- beta <- .1 xseq <- exp(seq(-7,30,length=1001)) fx <- densigamma(xseq,alpha,beta) plot(xseq,fx, log="xy", type="l", ylim=c(min(fx),1.01*max(fx)), yaxs="i", xlab="x, log scale", ylab="f(x), log scale", axes=FALSE) axis(1) title(substitute(list(alpha==a,beta==b),list(a=alpha,b=beta))) q <- igammaHDR(alpha,beta,debug=TRUE) xlo <- which.min(abs(q[1]-xseq)) xup <- which.min(abs(q[2]-xseq)) plotZero <- min(fx) polygon(x=xseq[c(xlo,xlo:xup,xup:xlo)], y=c(plotZero, fx[xlo:xup], rep(plotZero,length(xlo:xup))), border=FALSE, col=gray(.45)) lines(xseq,fx,lwd=1.25) ## End(Not run)
alpha <- 4 beta <- 30 summary(rigamma(n=1000,alpha,beta)) xseq <- seq(.1,30,by=.1) fx <- densigamma(xseq,alpha,beta) plot(xseq,fx,type="n", xlab="x", ylab="f(x)", ylim=c(0,1.01*max(fx)), yaxs="i", axes=FALSE) axis(1) title(substitute(list(alpha==a,beta==b),list(a=alpha,b=beta))) q <- igammaHDR(alpha,beta,debug=TRUE) xlo <- which.min(abs(q[1]-xseq)) xup <- which.min(abs(q[2]-xseq)) plotZero <- par()$usr[3] polygon(x=xseq[c(xlo,xlo:xup,xup:xlo)], y=c(plotZero, fx[xlo:xup], rep(plotZero,length(xlo:xup))), border=FALSE, col=gray(.45)) lines(xseq,fx,lwd=1.25) ## Not run: alpha <- beta <- .1 xseq <- exp(seq(-7,30,length=1001)) fx <- densigamma(xseq,alpha,beta) plot(xseq,fx, log="xy", type="l", ylim=c(min(fx),1.01*max(fx)), yaxs="i", xlab="x, log scale", ylab="f(x), log scale", axes=FALSE) axis(1) title(substitute(list(alpha==a,beta==b),list(a=alpha,b=beta))) q <- igammaHDR(alpha,beta,debug=TRUE) xlo <- which.min(abs(q[1]-xseq)) xup <- which.min(abs(q[2]-xseq)) plotZero <- min(fx) polygon(x=xseq[c(xlo,xlo:xup,xup:xlo)], y=c(plotZero, fx[xlo:xup], rep(plotZero,length(xlo:xup))), border=FALSE, col=gray(.45)) lines(xseq,fx,lwd=1.25) ## End(Not run)
On October 11, 2002, the United States Senate voted 77-23 to authorize the use of military force against Iraq. This data set lists the “Ayes” and “Nays” for each Senator and some covariates.
data(iraqVote)
data(iraqVote)
A data frame with 100 observations on the following 6 variables.
y
a numeric vector, the recorded vote (1 if Aye, 0 if Nay)
state.abb
two letter abbreviation for each state
name
senator name, party and state, e.g., AKAKA (D HI)
rep
logical, TRUE
for Republican senators
state.name
name of state
gorevote
numeric, the vote share recorded by Al Gore in the corresponding state in the 2000 Presidential election
The only Republican to vote against the resolution was Lincoln Chafee (Rhode Island); Democrats split 29-22 in favor of the resolution.
Keith Poole, 107th Senate Roll Call Data. https://voteview.com/static/data/out/votes/S107_votes.ord The Iraq vote is vote number 617.
David Leip's Atlas of U.S. Presidential Elections. https://uselectionatlas.org
Jackman, Simon. 2009. Bayesian Analysis for the Social Sciences. Wiley: Chichester. Example 8.3.
data(iraqVote) ## probit model glm1 <- glm(y ~ gorevote + rep, data=iraqVote, family=binomial(link=probit))
data(iraqVote) ## probit model glm1 <- glm(y ~ gorevote + rep, data=iraqVote, family=binomial(link=probit))
A rollcall object containing 99 rollcalls from the 2nd session of the 110th U.S. Senate, designated by National Journal as the "key votes" of 2007. These data were used to by National Journal to rate (then Senator) Barack Obama was the "most liberal senator" in 2007.
data(nj07)
data(nj07)
A rollcall
object containing the recorded votes,
plus information identifying the legislators and the
rollcalls.
Note the coding scheme used by Poole and Rosenthal; Yea (1,2,3), Nay (4,5,6) etc.
Keith Poole's web site: https://legacy.voteview.com/senate110.htm
Originally scraped from the Senate's web site by Jeff Lewis.
Josh Clinton compiled the list of National Journal key votes.
Clinton, Joshua and Simon Jackman. 2009. To Simulate or NOMINATE? Legislative Studies Quarterly. V34(4):593-621.
Jackman, Simon. 2009. Bayesian Analysis for the Social Sciences. Wiley: Hoboken, New Jersey. Example 9.2.
require(pscl) data(nj07) is(nj07,"rollcall") ## TRUE nj07 ## print method for class rollcall names(nj07) names(nj07$vote.data) table(nj07$vote.data$policyArea) summary(nj07) ## summary method summary(nj07,verbose=TRUE)
require(pscl) data(nj07) is(nj07,"rollcall") ## TRUE nj07 ## print method for class rollcall names(nj07) names(nj07$vote.data) table(nj07$vote.data$policyArea) summary(nj07) ## summary method summary(nj07,verbose=TRUE)
Nicely formatted tables, with row or column marginals etc.
ntable(x,y=NULL, percent=1,digits=2, row=FALSE,col=FALSE)
ntable(x,y=NULL, percent=1,digits=2, row=FALSE,col=FALSE)
x |
vector or |
y |
vector of |
percent |
integer, 1 for row percentages (default), 2 for column percentages |
digits |
integer, digits to print after decimal place (default is 2) |
row |
logical, if |
col |
logical, if |
A wrapper function to prop.table
that produces prettier
looking results.
nothing returned; the function prints the table and exits silently.
Jim Fearon [email protected]
data(bioChemists) attach(bioChemists) ntable(fem) ntable(fem,mar,row=TRUE) ntable(fem,mar,per=2,col=TRUE) ntable(fem,mar,per=2,row=TRUE,col=TRUE)
data(bioChemists) attach(bioChemists) ntable(fem) ntable(fem,mar,row=TRUE) ntable(fem,mar,per=2,col=TRUE) ntable(fem,mar,per=2,row=TRUE,col=TRUE)
Compares the log-likelihoods of a negative binomial regression model and a Poisson regression model.
odTest(glmobj, alpha=.05, digits = max(3, getOption("digits") - 3))
odTest(glmobj, alpha=.05, digits = max(3, getOption("digits") - 3))
glmobj |
an object of class |
alpha |
significance level of over-dispersion test |
digits |
number of digits in printed output |
The negative binomial model relaxes the assumption in the
Poisson model that the (conditional) variance equals the (conditional)
mean, by estimating one extra parameter. A likelihood ratio (LR) test
can be used to test the null hypothesis that the restriction implicit
in the Poisson model is true. The LR test-statistic has a non-standard
distribution, even asymptotically, since the negative binomial
over-dispersion parameter (called theta
in glm.nb
) is
restricted to be positive. The asymptotic distribution of the LR
(likelihood ratio) test-statistic has probability mass of one half at
zero, and a half distribution above
zero. This means that if testing at the
= .05
level, one should not reject the null unless the LR test statistic
exceeds the critical value associated with the
= .10 level; this LR test involves just one parameter restriction, so
the critical value of the test statistic at the
= .05 level
is 2.7, instead of the usual 3.8 (i.e., the .90 quantile of the
distribution, versus the .95 quantile).
A Poisson model is run using glm
with family set to
link{poisson}
, using the formula
in the negbin
model object passed as input. The logLik
functions are
used to extract the log-likelihood for each model.
None; prints results and returns silently
Simon Jackman [email protected]. John Fox noted an error in an earlier version.
A. Colin Cameron and Pravin K. Trivedi (1998) Regression analysis of count data. New York: Cambridge University Press.
Lawless, J. F. (1987) Negative Binomial and Mixed Poisson Regressions. The Canadian Journal of Statistics. 15:209-225.
data(bioChemists) modelnb <- MASS::glm.nb(art ~ ., data=bioChemists, trace=TRUE) odTest(modelnb)
data(bioChemists) modelnb <- MASS::glm.nb(art ~ ., data=bioChemists, trace=TRUE) odTest(modelnb)
Numeric codes and names of 85 political parties appearing in Poole and Rosenthal's collection of U.S. Congressional roll calls.
data(partycodes)
data(partycodes)
code
integer, numeric code for legislator appearing in Poole and Rosenthal rollcall data files
party
character, name of party
The function readKH
converts the integer codes
into strings, via a table lookup in this data frame.
Keith Poole's website: https://legacy.voteview.com/PARTY3.HTM
Plot of the results of an ideal point estimation contained
in an object of class ideal
.
## S3 method for class 'ideal' plot(x, conf.int=0.95, burnin=NULL, ...) plot1d(x, d=1, conf.int=0.95, burnin=NULL, showAllNames = FALSE, ...) plot2d(x, d1=1, d2=2, burnin=NULL, overlayCuttingPlanes=FALSE, ...)
## S3 method for class 'ideal' plot(x, conf.int=0.95, burnin=NULL, ...) plot1d(x, d=1, conf.int=0.95, burnin=NULL, showAllNames = FALSE, ...) plot2d(x, d1=1, d2=2, burnin=NULL, overlayCuttingPlanes=FALSE, ...)
x |
an object of class |
conf.int |
for "ideal" objects with 1 dimension estimated, the
level of the confidence interval to plot around the
posterior mean for each legislator. If 2 or more dimensions were
estimated, |
d |
integer, which dimension to display in a 1d plot, if the object is a multidimensional ideal object |
.
burnin |
of the recorded MCMC samples, how many to discard as
burnin? Default is |
showAllNames |
|
d1 |
integer, the number of the first dimension to plot when
plotting multi-dimensional |
d2 |
integer, the number of the second dimension to plot when
plotting multi-dimensional |
overlayCuttingPlanes |
logical, if |
... |
other parameters to be passed through to plotting functions. |
If the ideal
object comes from fitting a d=1
dimensional model, then plot.ideal
plots the mean of the
posterior density over each legislator's ideal point, accompanied by a
conf.int
confidence interval. In this case, plot.ideal
is simply a wrapper function to plot1d
.
If the ideal
object has d=2
dimensions, then
plot2d
is called, which plots the (estimated) mean of
the posterior density of each legislator's ideal point (i.e., the
ideal point/latent trait is a point in 2-dimensional Euclidean space,
and the posterior density for each ideal point is a bivariate
density). Single dimension summaries of the estimated ideal points
(latent traits) can be obtained for multidimensional
ideal
objects by passing the ideal
object
directly to plot1d
with d
set appropriately.
If the ideal
object has d>2
dimensions, a
scatterplot matrix is produced via pairs
, with the
posterior means of the ideal points (latent traits) plotted against
one another, dimension by dimension.
For unidimensional and two-dimensional models, if party information is
available in the rollcall
object contained in the ideal
object, legislators from different parties are plotted in different
colors. If the ideal
object has more than 2 dimensions,
plot.ideal()
produces a matrix of plots of the mean ideal
points of each dimension against the posterior mean ideal points of
the other dimensions.
When specifying a value of burnin
different from that used
in fitting the ideal
object, note a distinction
between the iteration numbers of the stored iterations, and the
number of stored iterations. That is, the n
-th iteration
stored in an ideal
object will not be iteration
n
if the user specified thin>1
in the call to
ideal
. Here, iterations are tagged with their
iteration number. Thus, if the user called ideal
with
thin=10
and burnin=100
then the stored iterations are
numbered 100, 110, 120, ...
. Any future subsetting via a
burnin
refers to this iteration number.
ideal
;
tracex
for trace plots, a graphical aid useful in
diagnosing convergence of the MCMC algorithms.
## Not run: data(s109) id1 <- ideal(s109, d=1, normalize=TRUE, store.item=TRUE, maxiter=500, ## short run for examples burnin=100, thin=10) plot(id1) id2 <- ideal(s109, d=2, store.item=TRUE, maxiter=11e2, burnin=1e2, verbose=TRUE, thin=25) plot(id2,overlayCuttingPlanes=TRUE) id2pp <- postProcess(id2, constraints=list(BOXER=c(-1,0), INHOFE=c(1,0), CHAFEE=c(0,.25))) plot(id2pp,overlayCuttingPlanes=TRUE) ## End(Not run)
## Not run: data(s109) id1 <- ideal(s109, d=1, normalize=TRUE, store.item=TRUE, maxiter=500, ## short run for examples burnin=100, thin=10) plot(id1) id2 <- ideal(s109, d=2, store.item=TRUE, maxiter=11e2, burnin=1e2, verbose=TRUE, thin=25) plot(id2,overlayCuttingPlanes=TRUE) id2pp <- postProcess(id2, constraints=list(BOXER=c(-1,0), INHOFE=c(1,0), CHAFEE=c(0,.25))) plot(id2pp,overlayCuttingPlanes=TRUE) ## End(Not run)
Plot classification success rates by legislators, or by roll calls, using predictions from ideal.
## S3 method for class 'predict.ideal' plot(x, type = c("legis", "votes"),...)
## S3 method for class 'predict.ideal' plot(x, type = c("legis", "votes"),...)
x |
an object of class |
type |
string; one of |
... |
further arguments passed to or from other methods. |
type="legis"
produces a plot of the “percent correctly
predicted” for each legislator/subject
(using the classification threshold set in
predict.ideal
) against the estimated ideal point of each
legislator/subject (the estimated mean of the posterior density of the
ideal point), dimension at
a time. If the legislators' party affiliations are available in the
rollcall
object that was passed to ideal
,
then legislators from the same party are plotted with a unique color.
type="votes"
produces a plot of classification rates for each
roll call, by the percentage of legislators voting for the losing
side. The x
-ordinate is jittered for clarity.
After drawing plots on the current device, exits silently returning invisible(NULL)
.
Simon Jackman [email protected]
data(s109) f = system.file("extdata","id1.rda",package="pscl") load(f) phat <- predict(id1) plot(phat,type="legis") plot(phat,type="votes")
data(s109) f = system.file("extdata","id1.rda",package="pscl") load(f) phat <- predict(id1) plot(phat,type="legis") plot(phat,type="votes")
Plot seats-votes curves produced by seatsVotes
## S3 method for class 'seatsVotes' plot(x, type = c("seatsVotes", "density"), legend = "bottomright", transform=FALSE, ...)
## S3 method for class 'seatsVotes' plot(x, type = c("seatsVotes", "density"), legend = "bottomright", transform=FALSE, ...)
x |
an object of class |
type |
character, partially matching the options above; see details |
legend |
where to put the legend when plotting with
|
transform |
logical, whether to transform the vote shares for
|
... |
arguments passed to or from other functions (e.g.,
options for the |
A seats-votes curve (with various annotations) is produced
with option type="seatsVotes"
.
A density plot of the vote shares is
produced with type="density"
. A bimodal density corresponds to
an electoral system with a proliferation of safe seats for both
parties, and a seats-votes curve that is relatively flat (or
“unresponsive”) in the neighborhood of average district-level
vote shares of 50 percent. The density is fitted using the defaults
in the density
function, but with the density constrained
to fall to zero at the extremes of the data, via the from
and
to
options to density. A rug
is added to the
density plot.
If transform=TRUE
, the vote shares are transformed prior to
plotting, so as to reduce the extent to which extreme vote shares
close to zero or 1 determine the shape of the density (i.e., this
option is available only for plots of type="density"
). The
transformation is a sinusoidal function, a scaled
“log-odds/inverse-log-odds” function mapping from (0,1) to
(0,1): i.e., where
is the log-odds transformation
,
is a scaling parameter
set to
, and
is the inverse-log-odds
transformation
. Note that this transformation is cosmetic, with
the effect of assigning more of the graphing region to be devoted to
marginal seats.
The function performs the requested plots and exits silently with
invisible{NULL}
.
Simon Jackman [email protected]
data(ca2006) x <- ca2006$D/(ca2006$D+ca2006$R) sv <- seatsVotes(x, desc="Democratic Vote Shares, California 2006 congressional elections") plot(sv) plot(sv,type="density") plot(sv,type="density",transform=TRUE)
data(ca2006) x <- ca2006$D/(ca2006$D+ca2006$R) sv <- seatsVotes(x, desc="Democratic Vote Shares, California 2006 congressional elections") plot(sv) plot(sv,type="density") plot(sv,type="density",transform=TRUE)
Interviewers administering the 2000 American National Election Studies assigned an ordinal rating to each respondent's "general level of information" about politics and public affairs.
data(politicalInformation)
data(politicalInformation)
A data frame with 1807 observations on the following 8 variables.
y
interviewer rating, a factor with levels Very Low
Fairly Low
Average
Fairly High
Very High
collegeDegree
a factor with levels No
Yes
female
a factor with levels No
Yes
age
a numeric vector, respondent age in years
homeOwn
a factor with levels No
Yes
govt
a factor with levels No
Yes
length
a numeric vector, length of ANES pre-election interview in minutes
id
a factor, unique identifier for each interviewer
Seven respondents have missing data on the ordinal interviewer rating.
The covariates age
and length
also have some missing data.
The National Election Studies (www.electionstudies.org). THE 2000 NATIONAL ELECTION STUDY [dataset]. Ann Arbor, MI: University of Michigan, Center for Political Studies [producer and distributor].
Jackman, Simon. 2009. Bayesian Analysis for the Social Sciences. Wiley: Hoboken, New Jersey.
data(politicalInformation) table(politicalInformation$y,exclude=NULL) op <- MASS::polr(y ~ collegeDegree + female + log(age) + homeOwn + govt + log(length), data=politicalInformation, Hess=TRUE, method="probit")
data(politicalInformation) table(politicalInformation$y,exclude=NULL) op <- MASS::polr(y ~ collegeDegree + female + log(age) + homeOwn + govt + log(length), data=politicalInformation, Hess=TRUE, method="probit")
Remap the MCMC iterates in an ideal
object via an affine
transformation, imposing identifying restrictions ex post (aka
post-processing).
postProcess(object, constraints="normalize", debug = FALSE)
postProcess(object, constraints="normalize", debug = FALSE)
object |
an object of class |
constraints |
list of length |
debug |
logical flag for verbose output, used for debugging |
Item-response models are unidentified without restrictions on the
underlying parameters. Consider the d=1
dimensional case. The
model is
Any linear transformation of the latent traits, say,
can be exactly offset by applying the appropriate linear transformations to the item/bill parameters, meaning that there is no unique set of values for the model parameters that will maximize the likelihood function. In higher dimensions, the latent traits can also be transformed via any arbitrary rotation, dilation and translation, with offsetting transformations applied to the item/bill parameters.
One strategy in MCMC is to ignore the lack of identification at run
time, but apply identifying restrictions ex post,
“post-processing” the MCMC output, iteration-by-iteration. In
a d
-dimensional IRT model, a sufficient condition for global
identification is to fix d+1
latent traits, provided the
constrained latent traits span the d
dimensional latent space.
This function implements this strategy. The user supplies a set of
constrained ideal points in the constraints
list. The function
then processes the MCMC output in the ideal
object
, finding the transformation that maps the current
iteration's sampled values for x
(latent traits/ideal points)
into the sub-space of identified parameters defined by the fixed
points in constraints
; i.e., what is the affine transformation
that maps the unconstrained ideal points into the constraints? Aside
from minuscule numerical inaccuracies resulting from matrix inversion
etc, this transformation is exact: after post-processing, the
d+1
constrained points do not vary over the MCMC iterations.
The remaining n-d-1
ideal points are subject to (posterior)
uncertainty; the “random tour” of the joint parameter space of
these parameters produced by the MCMC algorithm has been mapped into a
subspace in which the parameters are globally identified.
If the ideal
object was produced with store.item
set to TRUE
, then the item parameters are also post-processed,
applying the inverse transformation. Specifically, recall that the
IRT model is
where in this formulation is a vector of
length
d+1
, including a -1
to put a constant term into
the model (i.e., the intercept or difficulty parameter
is part of ). Let
denote the
non-singular,
d+1
-by-d+1
matrix that maps the
into the space of identified parameters. Recall that this
transformation is computed iteration by iteration. Then each
is transformed to
and
is transformed to
,
.
Local identification can be obtained for a one-dimensional model by
simply imposing a normalizing restriction on the ideal points: this
normalization (mean zero, standard deviation one) is the default
behavior, but (a) is only sufficient for local identification when
the rollcall
object was fit with d=1
; (b) is not
sufficient for even local identification when d>1
, with
further restrictions required so as to rule out other forms of
invariance (e.g., translation, or "dimension-switching", a
phenomenon akin to label-switching in mixture modeling).
The default is to impose dimension-by-dimension normalization with
respect to the means of the marginal posterior densities of the
ideal points, such that the these means (the usual Bayes estimates
of the ideal points) have mean zero and standard deviation one
across legislators. An offsetting transformation is applied to the
items parameters as well, if they are saved in the ideal
object.
Specifically, in one-dimension, the two-parameter IRT model is
If we normalize the
to
then the offsetting
transformations for the item/bill parameters are
and
.
An object of class ideal
, with components suitably
transformed and recomputed (i.e., x
is transformed and
xbar
recomputed, and if the
ideal
object was fit with store.item=TRUE
,
beta
is transformed and betabar
is recomputed).
Applying transformations to obtain identification can
sometimes lead to surprising results. Each data point makes the same
likelihood contributions with either the identified or unidentified
parameters. But, in general, predictions generated with the
parameters set to their posterior means will differ depending on
whether one uses the identified subset of parameters or the
unidentified parameters. For this reason, caution should be used when
using a function such as predict
after post-processing
output from ideal
. A better strategy is to compute the
estimand of interest at each iteration and then take averages over
iterations.
When specifying a value of burnin
different from that used in
fitting the ideal
object, note a distinction between
the iteration numbers of the stored iterations, and the number of
stored iterations. That is, the n
-th iteration stored in an
ideal
object will not be iteration n
if the
user specified thin>1
in the call to ideal
.
Here, iterations are tagged with their iteration number. Thus, if
the user called ideal
with thin=10
and
burnin=100
then the stored iterations are numbered 100,
110, 120, ...
. Any future subsetting via a burnin
refers to
this iteration number.
Simon Jackman [email protected]
Hoff, Peter, Adrian E. Raftery and Mark S. Handcock. 2002. Latent Space Approaches to Social Network Analysis. Journal of the American Statistical Association 97:1090–1098.
Edwards, Yancy D. and Greg M. Allenby. 2003. Multivariate Analysis of Multiple Response Data. Journal of Marketing Research 40:321–334.
Rivers, Douglas. 2003. “Identification of Multidimensional Item-Response Models.” Typescript. Department of Political Science, Stanford University.
data(s109) f = system.file("extdata",package="pscl","id1.rda") load(f) id1Local <- postProcess(id1) ## default is to normalize summary(id1Local) id1pp <- postProcess(id1, constraints=list(BOXER=-1,INHOFE=1)) summary(id1pp) ## two-dimensional fit f = system.file("extdata",package="pscl","id2.rda") load(f) id2pp <- postProcess(id2, constraints=list(BOXER=c(-1,0), INHOFE=c(1,0), CHAFEE=c(0,.25))) tracex(id2pp,d=1:2, legis=c("BOXER","INHOFE","COLLINS","FEINGOLD","COLEMAN", "CHAFEE","MCCAIN","KYL"))
data(s109) f = system.file("extdata",package="pscl","id1.rda") load(f) id1Local <- postProcess(id1) ## default is to normalize summary(id1Local) id1pp <- postProcess(id1, constraints=list(BOXER=-1,INHOFE=1)) summary(id1pp) ## two-dimensional fit f = system.file("extdata",package="pscl","id2.rda") load(f) id2pp <- postProcess(id2, constraints=list(BOXER=c(-1,0), INHOFE=c(1,0), CHAFEE=c(0,.25))) tracex(id2pp,d=1:2, legis=c("BOXER","INHOFE","COLLINS","FEINGOLD","COLEMAN", "CHAFEE","MCCAIN","KYL"))
compute various pseudo-R2 measures for various GLMs
pR2(object, ...)
pR2(object, ...)
object |
a fitted model object for which |
... |
additional arguments to be passed to or from functions |
Numerous pseudo r-squared measures have been proposed for generalized linear models, involving a comparison of the log-likelihood for the fitted model against the log-likelihood of a null/restricted model with no predictors, normalized to run from zero to one as the fitted model provides a better fit to the data (providing a rough analogue to the computation of r-squared in a linear regression).
A vector of length 6 containing
llh |
The log-likelihood from the fitted model |
llhNull |
The log-likelihood from the intercept-only restricted model |
G2 |
Minus two times the difference in the log-likelihoods |
McFadden |
McFadden's pseudo r-squared |
r2ML |
Maximum likelihood pseudo r-squared |
r2CU |
Cragg and Uhler's pseudo r-squared |
Simon Jackman [email protected]
Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent Variables. Sage. pp104-106.
data(admit) ## ordered probit model op1 <- MASS::polr(score ~ gre.quant + gre.verbal + ap + pt + female, Hess=TRUE, data=admit, method="probit") pR2(op1)
data(admit) ## ordered probit model op1 <- MASS::polr(score ~ gre.quant + gre.verbal + ap + pt + female, Hess=TRUE, data=admit, method="probit") pR2(op1)
Methods for extracting information from fitted hurdle
regression model objects of class "hurdle"
.
## S3 method for class 'hurdle' predict(object, newdata, type = c("response", "prob", "count", "zero"), na.action = na.pass, at = NULL, ...) ## S3 method for class 'hurdle' residuals(object, type = c("pearson", "response"), ...) ## S3 method for class 'hurdle' coef(object, model = c("full", "count", "zero"), ...) ## S3 method for class 'hurdle' vcov(object, model = c("full", "count", "zero"), ...) ## S3 method for class 'hurdle' terms(x, model = c("count", "zero"), ...) ## S3 method for class 'hurdle' model.matrix(object, model = c("count", "zero"), ...)
## S3 method for class 'hurdle' predict(object, newdata, type = c("response", "prob", "count", "zero"), na.action = na.pass, at = NULL, ...) ## S3 method for class 'hurdle' residuals(object, type = c("pearson", "response"), ...) ## S3 method for class 'hurdle' coef(object, model = c("full", "count", "zero"), ...) ## S3 method for class 'hurdle' vcov(object, model = c("full", "count", "zero"), ...) ## S3 method for class 'hurdle' terms(x, model = c("count", "zero"), ...) ## S3 method for class 'hurdle' model.matrix(object, model = c("count", "zero"), ...)
object , x
|
an object of class |
newdata |
optionally, a data frame in which to look for variables with which to predict. If omitted, the original observations are used. |
type |
character specifying the type of predictions or residuals, respectively. For details see below. |
na.action |
function determining what should be done with missing values
in |
at |
optionally, if |
model |
character specifying for which component of the model the terms or model matrix should be extracted. |
... |
currently not used. |
A set of standard extractor functions for fitted model objects is available for
objects of class "hurdle"
, including methods to the generic functions
print
and summary
which print the estimated
coefficients along with some further information. The summary
in particular
supplies partial Wald tests based on the coefficients and the covariance matrix
(estimated from the Hessian in the numerical optimization of the log-likelihood).
As usual, the summary
method returns an object of class "summary.hurdle"
containing the relevant summary statistics which can subsequently be printed
using the associated print
method.
The methods for coef
and vcov
by default
return a single vector of coefficients and their associated covariance matrix,
respectively, i.e., all coefficients are concatenated. By setting the model
argument, the estimates for the corresponding model component can be extracted.
Both the fitted
and predict
methods can
compute fitted responses. The latter additionally provides the predicted density
(i.e., probabilities for the observed counts), the predicted mean from the count
component (without zero hurdle) and the predicted ratio of probabilities for
observing a non-zero count. The latter is the ratio of probabilities for a non-zero
implied by the zero hurdle component and a non-zero count in the non-truncated
count distribution. See also Appendix C in Zeileis et al. (2008).
The residuals
method can compute raw residuals
(observed - fitted) and Pearson residuals (raw residuals scaled by
square root of variance function).
The terms
and model.matrix
extractors can
be used to extract the relevant information for either component of the model.
A logLik
method is provided, hence AIC
can be called to compute information criteria.
Achim Zeileis <[email protected]>
Zeileis, Achim, Christian Kleiber and Simon Jackman 2008. “Regression Models for Count Data in R.” Journal of Statistical Software, 27(8). URL https://www.jstatsoft.org/v27/i08/.
data("bioChemists", package = "pscl") fm <- hurdle(art ~ ., data = bioChemists) plot(residuals(fm) ~ fitted(fm)) coef(fm) coef(fm, model = "zero") summary(fm) logLik(fm)
data("bioChemists", package = "pscl") fm <- hurdle(art ~ ., data = bioChemists) plot(residuals(fm) ~ fitted(fm)) coef(fm) coef(fm, model = "zero") summary(fm) logLik(fm)
Compute predicted probabilities from an ideal
object. This predict method uses the posterior mean values of
and
to make predictions.
## S3 method for class 'ideal' predict(object, cutoff=.5, burnin=NULL, ...) ## S3 method for class 'predict.ideal' print(x,digits=2,...)
## S3 method for class 'ideal' predict(object, cutoff=.5, burnin=NULL, ...) ## S3 method for class 'predict.ideal' print(x,digits=2,...)
object |
an object of class |
cutoff |
numeric, a value between 0 and 1, the threshold to be used for classifying predicted probabilities of a Yea votes as predicted Yea and Nay votes. |
burnin |
of the recorded MCMC samples, how many to discard as
burnin? Default is |
x |
object of class |
digits |
number of digits in printed object |
... |
further arguments passed to or from other methods. |
Predicted probabilities are computed using the mean of the posterior
density of
of (ideal points, or latent ability) and
(bill or
item parameters). The percentage correctly predicted
are determined by counting the percentages of votes with predicted
probabilities of a Yea vote greater than or equal to the
cutoff
as the
threshold.
An object of class predict.ideal
, containing:
pred.probs |
the calculated predicted probability for each legislator for each vote. |
prediction |
the calculated prediction (0 or 1) for each legislator for each vote. |
correct |
for each legislator for each vote, whether the prediction was correct. |
legis.percent |
for each legislator, the percent of votes correctly predicted. |
vote.percent |
for each vote, the percent correctly predicted. |
yea.percent |
the percent of yea votes correctly predicted. |
nay.percent |
the percent of nay votes correctly predicted. |
party.percent |
the average value of the percent correctly
predicted by legislator, separated by party, if party information
exists in the |
overall.percent |
the total percent of votes correctly predicted. |
ideal |
|
desc |
string, the descriptive text from the
|
When specifying a value of burnin
different from that used
in fitting the ideal
object, note a distinction
between the iteration numbers of the stored iterations, and the
number of stored iterations. That is, the n
-th iteration
stored in an ideal
object will not be iteration
n
if the user specified thin>1
in the call to
ideal
. Here, iterations are tagged with their
iteration number. Thus, if the user called ideal
with
thin=10
and burnin=100
then the stored iterations are
numbered 100, 110, 120, ...
. Any future subsetting via a
burnin
refers to this iteration number.
ideal
, summary.ideal
, plot.predict.ideal
data(s109) f <- system.file("extdata","id1.rda",package="pscl") load(f) phat <- predict(id1) phat ## print method
data(s109) f <- system.file("extdata","id1.rda",package="pscl") load(f) phat <- predict(id1) phat ## print method
Methods for extracting information from fitted zero-inflated
regression model objects of class "zeroinfl"
.
## S3 method for class 'zeroinfl' predict(object, newdata, type = c("response", "prob", "count", "zero"), na.action = na.pass, at = NULL, ...) ## S3 method for class 'zeroinfl' residuals(object, type = c("pearson", "response"), ...) ## S3 method for class 'zeroinfl' coef(object, model = c("full", "count", "zero"), ...) ## S3 method for class 'zeroinfl' vcov(object, model = c("full", "count", "zero"), ...) ## S3 method for class 'zeroinfl' terms(x, model = c("count", "zero"), ...) ## S3 method for class 'zeroinfl' model.matrix(object, model = c("count", "zero"), ...)
## S3 method for class 'zeroinfl' predict(object, newdata, type = c("response", "prob", "count", "zero"), na.action = na.pass, at = NULL, ...) ## S3 method for class 'zeroinfl' residuals(object, type = c("pearson", "response"), ...) ## S3 method for class 'zeroinfl' coef(object, model = c("full", "count", "zero"), ...) ## S3 method for class 'zeroinfl' vcov(object, model = c("full", "count", "zero"), ...) ## S3 method for class 'zeroinfl' terms(x, model = c("count", "zero"), ...) ## S3 method for class 'zeroinfl' model.matrix(object, model = c("count", "zero"), ...)
object , x
|
an object of class |
newdata |
optionally, a data frame in which to look for variables with which to predict. If omitted, the original observations are used. |
type |
character specifying the type of predictions or residuals, respectively. For details see below. |
na.action |
function determining what should be done with missing values
in |
at |
optionally, if |
model |
character specifying for which component of the model the terms or model matrix should be extracted. |
... |
currently not used. |
A set of standard extractor functions for fitted model objects is available for
objects of class "zeroinfl"
, including methods to the generic functions
print
and summary
which print the estimated
coefficients along with some further information. The summary
in particular
supplies partial Wald tests based on the coefficients and the covariance matrix
(estimated from the Hessian in the numerical optimization of the log-likelihood).
As usual, the summary
method returns an object of class "summary.zeroinfl"
containing the relevant summary statistics which can subsequently be printed
using the associated print
method.
The methods for coef
and vcov
by default
return a single vector of coefficients and their associated covariance matrix,
respectively, i.e., all coefficients are concatenated. By setting the model
argument, the estimates for the corresponding model components can be extracted.
Both the fitted
and predict
methods can
compute fitted responses. The latter additionally provides the predicted density
(i.e., probabilities for the observed counts), the predicted mean from the count
component (without zero inflation) and the predicted probability for the zero
component. The residuals
method can compute
raw residuals (observed - fitted) and Pearson residuals (raw residuals scaled by
square root of variance function).
The terms
and model.matrix
extractors can
be used to extract the relevant information for either component of the model.
A logLik
method is provided, hence AIC
can be called to compute information criteria.
Achim Zeileis <[email protected]>
data("bioChemists", package = "pscl") fm_zip <- zeroinfl(art ~ ., data = bioChemists) plot(residuals(fm_zip) ~ fitted(fm_zip)) coef(fm_zip) coef(fm_zip, model = "count") summary(fm_zip) logLik(fm_zip)
data("bioChemists", package = "pscl") fm_zip <- zeroinfl(art ~ ., data = bioChemists) plot(residuals(fm_zip) ~ fitted(fm_zip)) coef(fm_zip) coef(fm_zip, model = "count") summary(fm_zip) logLik(fm_zip)
Compute predicted probabilities from fitted models, optionally at new covariate values.
predprob(obj, ...)
predprob(obj, ...)
obj |
fitted model object |
... |
other arguments |
See documentation for specific methods.
A matrix of predicted probabilities, each row a vector of
predicted probabilities over the range of responses seen in the data
(i.e., min(y):max(y)
), conditional on the values of covariates.
Simon Jackman [email protected]
predprob.glm
, predprob.zeroinfl
## Not run: data("bioChemists") zip <- zeroinfl(art ~ . | ., data = bioChemists, EM = TRUE) phat <- predprob(zip) newdata <- expand.grid(list(fem="Men",mar="Married", kid5=1,phd=3.103, ment=0:77)) phat <- predprob(zip, newdata = newdata) ## End(Not run)
## Not run: data("bioChemists") zip <- zeroinfl(art ~ . | ., data = bioChemists, EM = TRUE) phat <- predprob(zip) newdata <- expand.grid(list(fem="Men",mar="Married", kid5=1,phd=3.103, ment=0:77)) phat <- predprob(zip, newdata = newdata) ## End(Not run)
Obtains predicted probabilities from a fitted generalized linear model object.
## S3 method for class 'glm' predprob(obj, newdata = NULL, at = NULL, ...)
## S3 method for class 'glm' predprob(obj, newdata = NULL, at = NULL, ...)
obj |
a fitted object of class inheriting from |
newdata |
optionally, a data frame in which to look for variables with which to predict. If omitted, the fitted linear predictors are used. |
at |
an optional numeric vector at which the probabilities are
evaluated. By default |
... |
arguments passed to or from other methods |
This method is only defined for glm objects with
family=binomial
or family=poisson
, or negative
binomial count models fit with the glm.nb
function in library(MASS)
.
A matrix of predicted probabilities. Each row in the matrix is a
vector of probabilities, assigning predicted probabilities over the range of
responses actually observed in the data. For instance, for models
with family=binomial
, the matrix has two columns for the "zero"
(or failure) and "one" (success) outcomes, respectively, and
trivially, each row in the matrix sums to 1.0.
For counts fit with family=poisson
or via glm.nb
, the
matrix has length(0:max(y))
columns. Each observation
used in fitting the model generates a row to the returned matrix; alternatively, if
newdata
is supplied, the returned matrix will have as many rows
as in newdata
.
Simon Jackman [email protected]
data(bioChemists) glm1 <- glm(art ~ ., data=bioChemists, family=poisson, trace=TRUE) ## poisson GLM phat <- predprob(glm1) apply(phat,1,sum) ## almost all 1.0
data(bioChemists) glm1 <- glm(art ~ ., data=bioChemists, family=poisson, trace=TRUE) ## poisson GLM phat <- predprob(glm1) apply(phat,1,sum) ## almost all 1.0
Computes predicted probabilities of a “Yea” vote conditional on the posterior means of the legislators' ideal points and vote-specific parameters.
## S3 method for class 'ideal' predprob(obj, ...)
## S3 method for class 'ideal' predprob(obj, ...)
obj |
An object of class |
... |
Arguments to be passed to other functions |
This is a wrapper function to predict.ideal
, extracting
just the predicted probabilities component of the object returned by
that function. Predicted probabilities can and are generated for each
voting decision, irrespective of whether the legislator actually voted
on any particular roll call.
A matrix
of dimension n
(number of legislators)
by m
(number of roll call votes).
Simon Jackman [email protected]
ideal
, predprob
, predict.ideal
f <- system.file("extdata","id1.rda",package="pscl") load(f) phat <- predprob(id1) dim(phat)
f <- system.file("extdata","id1.rda",package="pscl") load(f) phat <- predprob(id1) dim(phat)
Democratic share of the presidential vote, 1932-2016, in each state and the District of Columbia.
data(presidentialElections)
data(presidentialElections)
state
character, name of state
demVote
numeric, percent of the vote for president won by the Democratic candidate
year
numeric, integer
south
logical, TRUE
if state is one of the 11 states
of the former Confederacy
1,047 observations, unbalanced panel data in long format. Hawaii and Alaska contribute data from 1960 onwards the District of Columbia contributes data from 1964 onward; Alabama has missing data for 1948 and 1964.
David Leip's Atlas of U.S. Presidential Elections https://uselectionatlas.org
data(presidentialElections) if(require(lattice)) { lattice::xyplot(demVote ~ year | state, panel=lattice::panel.lines, ylab="Democratic Vote for President (percent)", xlab="Year", data=presidentialElections, scales=list(y=list(cex=.6),x=list(cex=.35)), strip=strip.custom(par.strip.text=list(cex=.6))) } ## Obama vs Kerry, except DC y08 <- presidentialElections$year==2008 y04 <- presidentialElections$year==2004 tmpData <- merge(y=presidentialElections[y08,], x=presidentialElections[y04,], by="state") tmpData <- tmpData[tmpData$state!="DC",] xlim <- range(tmpData$demVote.x,tmpData$demVote.y) col <- rep("black",dim(tmpData)[1]) col[tmpData$south.x] <- "red" plot(demVote.y ~ demVote.x, xlab="Kerry Vote Share, 2004 (percent)", ylab="Obama Vote Share, 2008 (percent)", xlim=xlim, ylim=xlim, type="n", las=1, data=tmpData) abline(0,1,lwd=2,col=gray(.65)) ols <- lm(demVote.y ~ demVote.x, data=tmpData) abline(ols,lwd=2) text(tmpData$demVote.x, tmpData$demVote.y, tmpData$state, col=col, cex=.65) legend(x="topleft", bty="n", lwd=c(2,2), col=c(gray(.65),"black"), legend=c("No Change from 2004","Regression")) legend(x="bottomright", bty="n", text.col=c("red","black"), legend=c("South","Non-South"))
data(presidentialElections) if(require(lattice)) { lattice::xyplot(demVote ~ year | state, panel=lattice::panel.lines, ylab="Democratic Vote for President (percent)", xlab="Year", data=presidentialElections, scales=list(y=list(cex=.6),x=list(cex=.35)), strip=strip.custom(par.strip.text=list(cex=.6))) } ## Obama vs Kerry, except DC y08 <- presidentialElections$year==2008 y04 <- presidentialElections$year==2004 tmpData <- merge(y=presidentialElections[y08,], x=presidentialElections[y04,], by="state") tmpData <- tmpData[tmpData$state!="DC",] xlim <- range(tmpData$demVote.x,tmpData$demVote.y) col <- rep("black",dim(tmpData)[1]) col[tmpData$south.x] <- "red" plot(demVote.y ~ demVote.x, xlab="Kerry Vote Share, 2004 (percent)", ylab="Obama Vote Share, 2008 (percent)", xlim=xlim, ylim=xlim, type="n", las=1, data=tmpData) abline(0,1,lwd=2,col=gray(.65)) ols <- lm(demVote.y ~ demVote.x, data=tmpData) abline(ols,lwd=2) text(tmpData$demVote.x, tmpData$demVote.y, tmpData$state, col=col, cex=.65) legend(x="topleft", bty="n", lwd=c(2,2), col=c(gray(.65),"black"), legend=c("No Change from 2004","Regression")) legend(x="bottomright", bty="n", text.col=c("red","black"), legend=c("South","Non-South"))
Deaths by year, by corp, from horse kicks.
data(prussian)
data(prussian)
A data frame with 280 observations on the following 3 variables.
y
a numeric vector, count of deaths
year
a numeric vector, 18XX, year of observation
corp
a factor
, corp of Prussian Army generating
observation
von Bortkiewicz, L. 1898. Das Gesetz der Kleinen Zahlen. Leipzig: Teubner.
data(prussian) corpP <- glm(y ~ corp, family=poisson,data=prussian) summary(corpP)
data(prussian) corpP <- glm(y ~ corp, family=poisson,data=prussian) summary(corpP)
Creates a rollcall
object from the flat file format for roll
call data used by
Keith Poole and Howard Rosenthal.
readKH(file, dtl=NULL, yea=c(1,2,3), nay=c(4,5,6), missing=c(7,8,9), notInLegis=0, desc=NULL, debug=FALSE)
readKH(file, dtl=NULL, yea=c(1,2,3), nay=c(4,5,6), missing=c(7,8,9), notInLegis=0, desc=NULL, debug=FALSE)
file |
string, name of a file or URL holding KH data |
dtl |
string, name of a file or URL holding KH |
yea |
numeric, possibly a vector, code(s) for a Yea vote in the
rollcall context (or a correct answer in the educational testing
context). Default is |
nay |
numeric, possibly a vector, code(s) for a Nay vote in the
rollcall context (or an incorrect answer in the educational testing
context). Default is |
missing |
numeric and/or |
notInLegis |
numeric or |
desc |
string, describing the data, e.g., |
debug |
logical, print debugging information for net connection |
Keith Poole and Howard Rosenthal have gathered an impressive collection of roll call data, spanning every roll call cast in the United States Congress. This effort continues now as a real-time exercise, via a collaboration with Jeff Lewis (109th Congress onwards). Nolan McCarty collaborated on the compilation of roll call data for the 102nd through 108th Congress.
This function relies on some hard-coded features of Poole-Rosenthal
flat files, and assumes that the file
being supplied has the following
structure (variable, start-end columns):
4-8
9-10
11-12
13-20
21-23
26-36
37 to end-of-record
This function reads data files in that format, and creates a
rollcall
, for which there are useful methods such as
summary.rollcall
. The legis.data
component of the
rollcall
object is a data.frame
which
contains:
state
a 2-character string abbreviation of each legislator' state
icpsrState
a 2-digit numeric code for each legislator's state, as used by the Inter-university Consortium for Political and Social Research (ICPSR)
cd
numeric, the number of each legislator's congressional district within each state; this is always 0 for members of the Senate
icpsrLegis
a unique numeric identifier for each legislator assigned by the ICPSR, as corrected by Poole and Rosenthal.
partyName
character string, the name of each legislator's political party
party
numeric, code for each legislator's political party; see https://legacy.voteview.com/PARTY3.HTM
The rownames
attribute of this data frame is a
concatenation of the legislators' names, party abbreviations (for
Democrats and Republicans) and state, and (where appropriate), a
district number; e.g., Bonner (R AL-1)
. This tag is also
provided in the legis.name
component of the returned rollcall
object.
Poole and Rosenthal also make dtl
files available for
Congresses 1 through 106. These files contain information about the
votes themselves, in a multiple-line per vote ascii
format, and
reside in the dtl
director of Poole's web site, e.g.,
https://legacy.voteview.com/k7ftp/dtl/102s.dtl is the dtl
file for the 102nd Senate. The default is to presume that no such
file exists. When a dtl
file is available, and is read, the
votes.data
attribute of the resulting rollcall
object is a data.frame
with one record per vote, with
the following variables:
date
vector of class Date
, date of the
rollcall, if available; otherwise NULL
description
vector of mode character
, descriptive
text
The dtl
files are presumed to have the date of the rollcall in
the first line of text for each roll call, and lines 3 onwards contain
descriptive text.
Finally, note also that the Poole/Rosenthal data sets often include the U.S. President as a pseudo-legislator, adding the announced positions of a president or the administration to the roll call matrix. This adds an extra “legislator” to the data set and can sometimes produce surprising results (e.g., a U.S. Senate of 101 senators), and a “legislator” with a surprisingly low party loyalty score (since the President/administration only announces positions on a relatively small fraction of all Congressional roll calls).
an object of class rollcall
, with components created
using the identifying information in the Poole/Rosenthal files. If
the function can not read the file (e.g., the user specified a URL and
the machine is not connected to the Internet), the function fails with
an error message (set debug=TRUE
to help resolve these issues).
Simon Jackman [email protected]
Poole, Keith and Howard Rosenthal. 1997. Congress: A Political-Economic History of Roll Call Voting. New York: Oxford University Press.
Poole, Keith. https://legacy.voteview.com
Rosenthal, Howard L. and Keith T. Poole. United States Congressional Roll Call Voting Records, 1789-1990: Reformatted Data [computer file]. 2nd ICPSR release. Pittsburgh, PA: Howard L. Rosenthal and Keith T. Poole, Carnegie Mellon University, Graduate School of Industrial Administration [producers], 1991. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2000. https://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/09822
## Not run: h107 <- readKH("https://voteview.com/static/data/out/votes/H107_votes.ord", desc="107th U.S. House of Representatives") s107 <- readKH("https://voteview.com/static/data/out/votes/S107_votes.ord", desc="107th U.S. Senate") ## End(Not run)
## Not run: h107 <- readKH("https://voteview.com/static/data/out/votes/H107_votes.ord", desc="107th U.S. House of Representatives") s107 <- readKH("https://voteview.com/static/data/out/votes/S107_votes.ord", desc="107th U.S. Senate") ## End(Not run)
Voter turnout data spanning 85 cable TV systems, randomly allocated to a voter mobilization experiment targeting 18-19 year olds with "Rock the Vote" television advertisements
data(RockTheVote)
data(RockTheVote)
A data frame with 85 observations on the following 6 variables.
strata
numeric, experimental strata
treated
numeric, 1 if a treated cable system, 0 otherwise
r
numeric, number of 18 and 19 year olds turning out
n
numeric, number of 19 and 19 year olds registered
p
numeric, proportion of 18 and 19 year olds turning out
treatedIndex
numeric, a counter indexing the 42 treated units
Green and Vavreck (2008) implemented a cluster-randomized experimental design in assessing the effects of a voter mobilization treatment in the 2004 U.S. Presidential election. The clusters in this design are geographic areas served by a single cable television system. So as to facilitate analysis, the researchers restricted their attention to small cable systems whose reach is limited to a single zip code. Further, since the experiment was fielded during the last week of the presidential election, the researchers restricted their search to cable systems that were not in the 16 hotly-contested “battleground” states (as designated by the Los Angeles Times).
Eighty-five cable systems were available for randomization and were assigned to treatment after stratification on previous turnout levels in presidential elections (as determined from analysis of the corresponding states' voter registration files). Each cable system was matched with one or sometimes two other cable systems in the same state, yielding 40 strata. Then within each strata, cable systems were randomly assigned to treatment and control conditions. Strata 3, 8 and 25 have two control cable systems and 1 treated system each, while strata 6 and 20 have two treated cable systems and one control system. The remaining 35 strata have 1 treated cable system and 1 control system. In this way there are 38 + 4 = 42 treated systems, spanning 40 experiment strata.
The treatment involved researchers purchasing prime-time advertising spots on four channels in the respective cable system in which the researchers aired voter mobilization ads. The ads were produced by Rock the Vote, targeted at younger voters, and aired four times per night, per channel, over the last eight days of the election campaign. After the election, public records were consulted to assemble data on turnout levels in the treated and control cable systems. In the analysis reported in Green and Vavreck (2008), the researchers focused on turnout among registered voters aged 18 and 19 years old.
Green, Donald P. and Lynn Vavreck. 2008. Analysis of Cluster-Randomized Experiments: A Comparison of Alternative Estimation Approaches. Political Analysis 16:138-152.
Jackman, Simon, 2009. Bayesian Analysis for the Social Sciences. Wiley: Hoboken, New Jersey. Example 7.9.
data(RockTheVote) ## estimate MLEs of treatment effects deltaFunction <- function(data){ model <- glm(cbind(r,n-r)~treated, data=data, family=binomial) c(coef(model)[2], confint(model)[2,]) } tmp <- by(RockTheVote, as.factor(RockTheVote$strata), deltaFunction) tmp <- matrix(unlist(tmp),ncol=3,byrow=TRUE) indx <- order(tmp[,1]) plot(y=1:40, x=tmp[indx,1], pch=16,cex=1.25, xlim=range(tmp), ylab="", axes=FALSE, xlab="Estimated Treatment Effect (MLEs, Logit Scale)") text(y=1:40, x=par()$usr[1], pos=4, as.character((1:40)[indx]), cex=.5) segments(x0=tmp[indx,2], x1=tmp[indx,3], y0=1:40, y1=1:40) axis(1) axis(3) abline(v=0)
data(RockTheVote) ## estimate MLEs of treatment effects deltaFunction <- function(data){ model <- glm(cbind(r,n-r)~treated, data=data, family=binomial) c(coef(model)[2], confint(model)[2,]) } tmp <- by(RockTheVote, as.factor(RockTheVote$strata), deltaFunction) tmp <- matrix(unlist(tmp),ncol=3,byrow=TRUE) indx <- order(tmp[,1]) plot(y=1:40, x=tmp[indx,1], pch=16,cex=1.25, xlim=range(tmp), ylab="", axes=FALSE, xlab="Estimated Treatment Effect (MLEs, Logit Scale)") text(y=1:40, x=par()$usr[1], pos=4, as.character((1:40)[indx]), cex=.5) segments(x0=tmp[indx,2], x1=tmp[indx,3], y0=1:40, y1=1:40) axis(1) axis(3) abline(v=0)
Create a rollcall
object, used for the analysis of legislative
voting or, equivalently, item-response modeling of binary data
produced by standardized tests, etc.
rollcall(data, yea=1, nay=0, missing=NA, notInLegis=9, legis.names=NULL, vote.names=NULL, legis.data=NULL, vote.data=NULL, desc=NULL, source=NULL)
rollcall(data, yea=1, nay=0, missing=NA, notInLegis=9, legis.names=NULL, vote.names=NULL, legis.data=NULL, vote.data=NULL, desc=NULL, source=NULL)
data |
voting decisions (for roll calls), or test results (for
IRT). Can be in one of two forms. First, |
yea |
numeric, possibly a vector, code(s) for a Yea vote in the rollcall context, or a correct answer in the educational testing context. Default is 1. |
nay |
numeric, possibly a vector, code(s) for a Nay vote in the rollcall context, or an incorrect answer in the educational testing context. Default is 0. |
missing |
numeric or |
notInLegis |
numeric or |
legis.names |
a vector of names of the legislators or
individuals. If |
vote.names |
a vector of names or labels for the votes or items.
If |
legis.data |
a |
vote.data |
a |
desc |
character, a string providing an (optional) description of
the data being used. If |
source |
character, a string providing an (optional) description of where the roll call data originated (e.g., a URL or a short-form reference). Used in print and summary methods. |
See below for methods that operate on objects of class
rollcall
.
An object of class rollcall
, a list with the following components:
votes |
a |
codes |
a |
n |
numeric, number of legislators, equal to |
m |
numeric, number of votes, equal to |
legis.data |
user-supplied data on legislators/test-subjects. |
vote.data |
user-supplied data on rollcall votes/test-items. |
desc |
any user-supplied description. If no description was provided,
defaults |
source |
any user-supplied source information (e.g., a url or a
short-form reference). If no description is provided, |
readKH
for creating objects from files (possibly over
the web), in the format used for data from the United States Congress
used by Keith Poole and Howard Rosenthal (and others).
summary.rollcall
, ideal
for model fitting.
## generate some fake roll call data set.seed(314159265) fakeData <- matrix(sample(x=c(0,1),size=5000,replace=TRUE), 50,100) rc <- rollcall(fakeData) is(rc,"rollcall") ## TRUE rc ## print the rollcall object on screen data(sc9497) ## Supreme Court example data rc <- rollcall(data=sc9497$votes, legis.names=sc9497$legis.names, desc=sc9497$desc) summary(rc,verbose=TRUE) ## Not run: ## s107 ## could use readKH for this dat <- readLines("sen107kh.ord") dat <- substring(dat,37) mat <- matrix(NA,ncol=nchar(dat[1]),nrow=length(dat)) for(i in 1:103){ mat[i,] <- as.numeric(unlist(strsplit(dat[i], split=character(0)))) } s107 <- rollcall(mat, yea=c(1,2,3), nay=c(4,5,6), missing=c(7,8,9), notInLegis=0, desc="107th U.S. Senate", source="http://voteview.ucsd.edu") summary(s107) ## End(Not run)
## generate some fake roll call data set.seed(314159265) fakeData <- matrix(sample(x=c(0,1),size=5000,replace=TRUE), 50,100) rc <- rollcall(fakeData) is(rc,"rollcall") ## TRUE rc ## print the rollcall object on screen data(sc9497) ## Supreme Court example data rc <- rollcall(data=sc9497$votes, legis.names=sc9497$legis.names, desc=sc9497$desc) summary(rc,verbose=TRUE) ## Not run: ## s107 ## could use readKH for this dat <- readLines("sen107kh.ord") dat <- substring(dat,37) mat <- matrix(NA,ncol=nchar(dat[1]),nrow=length(dat)) for(i in 1:103){ mat[i,] <- as.numeric(unlist(strsplit(dat[i], split=character(0)))) } s107 <- rollcall(mat, yea=c(1,2,3), nay=c(4,5,6), missing=c(7,8,9), notInLegis=0, desc="107th U.S. Senate", source="http://voteview.ucsd.edu") summary(s107) ## End(Not run)
A sample rollcall object, generated using a collection of the rollcalls of the 109th U.S. Senate (2005-2006).
data(s109)
data(s109)
A rollcall
object containing the recorded votes of the 109th
U.S. Senate, plus information identifying the legislators and the
rollcalls.
Note the coding scheme used by Poole and Rosenthal; Yea (1), Nay (6) etc.
Keith Poole's web site: https://legacy.voteview.com/senate109.htm
Originally scraped from the Senate's web site by Jeff Lewis (UCLA).
Information identifying the votes is available at https://voteview.com/static/data/out/rollcalls/S109_rollcalls.csv
require(pscl) data(s109) is(s109,"rollcall") ## TRUE s109 ## print method for class rollcall summary(s109) ## summary method summary(s109,verbose=TRUE) ## Not run: ## how s109 was created require(pscl) s109 <- readKH("https://voteview.com/static/data/out/votes/S109_votes.ord", desc="109th U.S. Senate", debug=TRUE) url <- "https://voteview.com/static/data/out/rollcalls/S109_rollcalls.csv" s109$vote.data <- data.frame(read.csv(file=url,header=TRUE)) s109$vote.data$date <- as.Date(s109$vote.data$date, format=" dimnames(s109$votes)[[2]] <- paste(s109$vote.data$session, s109$vote.data$number,sep="-") ## End(Not run)
require(pscl) data(s109) is(s109,"rollcall") ## TRUE s109 ## print method for class rollcall summary(s109) ## summary method summary(s109,verbose=TRUE) ## Not run: ## how s109 was created require(pscl) s109 <- readKH("https://voteview.com/static/data/out/votes/S109_votes.ord", desc="109th U.S. Senate", debug=TRUE) url <- "https://voteview.com/static/data/out/rollcalls/S109_rollcalls.csv" s109$vote.data <- data.frame(read.csv(file=url,header=TRUE)) s109$vote.data$date <- as.Date(s109$vote.data$date, format=" dimnames(s109$votes)[[2]] <- paste(s109$vote.data$session, s109$vote.data$number,sep="-") ## End(Not run)
This data set provides information on the United States Supreme Court from 1994-1997. Votes included are non-unanimous.
data(sc9497)
data(sc9497)
A list containing the elements:
a matrix of the votes, 0=Nay, 1=Yea,
NA
=Abstained or missing data. The matrix columns are labeled with
vote.names
and the rows are labeled with legis.names
.
a vector of the names of the nine Justices sitting on the court at this time.
NULL; exists for consistency with House and Senate data sets.
NULL; exists for consistency with House and Senate data sets.
NULL; exists for consistency with House data sets.
NULL; exists for consistency with House and Senate data sets.
a vector of strings numbering the cases simply to distinguish them from one another.
a description of the data set.
Harold J. Spaeth (1999). United States Supreme Court Judicial Database, 1953-1997 Terms. Ninth edition. Inter-university Consortium for Political and Social Research. Ann Arbor, Michigan. doi:10.3886/ICPSR09422.v9
Convert a vector of vote shares into a seats-vote curve object, providing estimates of partisan bias.
seatsVotes(x, desc = NULL, method = "uniformSwing")
seatsVotes(x, desc = NULL, method = "uniformSwing")
x |
a vector of vote shares for a specific party (either proportions or percentages) |
desc |
descriptive text |
method |
how to simulate a seats-vote curve; the only supported
method at this stage is |
Simulation methods are required to induce a seats-votes curve given a vector of vote shares from one election. The uniform swing method simply slides the empirical distribution function of the vote shares “up” and “down”, computing the proportion of the vote shares that lie above .5 (by construction, the winning percentage in a two-party election) for each new location of the vector of vote shares. That is, as the empirical CDF of the observed vote shares slides up or down, more or less seats cross the .5 threshold. A seats-votes curve is formed by plotting the seat share above .5 as a function of the average district-level vote share (a weakly monotone function, since the empirical CDF constitutes a set of sufficient statistics for this problem). The simulation is run so as to ensure that average district-level vote shares range between 0 and 1.
The extent to which the seats-votes curve departs from symmetry is known as bias. More specifically, the vertical displacement of the seats-votes curve from .5 when average district-level vote share is .5 is conventionally reported as an estimate of the bias of the electoral system.
Different methods produce different estimates of seats-votes curves and summary estimands such as bias. The uniform swing method is completely deterministic and does not produce any uncertainty assessment (e.g., confidence intervals etc).
An object of class seatsVotes
, with components
s |
Estimated seat shares over the range of simulated average, district-level vote shares |
v |
Simulated average district-level vote shares |
x |
observed seat shares, with missing data removed |
desc |
user-supplied descriptive character string |
call |
a list of class |
Additional methods to come later.
Simon Jackman [email protected]
Tufte, Edward R. 1973. The Relationship Between Seats and Votes in Two-Party Systems. American Political Science Review. 67(2):540-554.
Gelman, Andrew and Gary King. 1990. Estimating the Consequences of Electoral Redistricting. Journal of the American Statistical Association. 85:274-282.
Jackman, Simon. 1994. Measuring Electoral Bias: Australia, 1949-93. British Journal of Political Science. 24(3):319-357.
plot.seatsVotes
for plotting methods.
data(ca2006) x <- ca2006$D/(ca2006$D+ca2006$R) sv <- seatsVotes(x, desc="Democratic Vote Shares, California 2006 congressional elections")
data(ca2006) x <- ca2006$D/(ca2006$D+ca2006$R) sv <- seatsVotes(x, desc="Democratic Vote Shares, California 2006 congressional elections")
Monte Carlo estimation of pi
simpi(n)
simpi(n)
n |
integer, number of Monte Carlo samples, defaults to 1000 |
A crude Monte Carlo estimate of can be formed as
follows. Sample from the unit square many times (i.e., each sample is
formed with two independent draws from a uniform density on the unit
interval). Compute the proportion
of sampled points that
lie inside a unit circle centered on the origin; such points
have distance from the origin
less than 1. Four times
is a
Monte Carlo estimate of
. This function is a wrapper to
a simple C function, bringing noticeable speed gains and memory
efficiencies over implementations in native R.
Contrast this Monte Carlo method with Buffon's needle and refinements thereof (see the discussion in Ripley (1987, 193ff).
the Monte Carlo estimate of
Simon Jackman [email protected]
Ripley, Brain D. 1987 [2006]. Stochastic Simulation. Wiley: Hoboken, New Jersey.
seed <- round(pi*10000) ## hah hah hah m <- 6 z <- rep(NA,m) lim <- rep(NA,m) for(i in 1:m){ cat(paste("simulation for ",i,"\n")) set.seed(seed) timings <- system.time(z[i] <- simpi(10^i)) print(timings) cat("\n") lim[i] <- qbinom(prob=pi/4,size=10^i,.975)/10^i * 4 } ## convert to squared error z <-(z - pi)^2 lim <- (lim - pi)^2 plot(x=1:m, y=z, type="b", pch=16, log="y", axes=FALSE, ylim=range(z,lim), xlab="Monte Carlo Samples", ylab="Log Squared Error") lines(1:m,lim,col="blue",type="b",pch=1) legend(x="topright", legend=c("95% bound", "Realized"), pch=c(1,16), lty=c(1,1), col=c("blue","black"), bty="n") axis(1,at=1:m, labels=c(expression(10^{1}), expression(10^{2}), expression(10^{3}), expression(10^{4}), expression(10^{5}), expression(10^{6}))) axis(2)
seed <- round(pi*10000) ## hah hah hah m <- 6 z <- rep(NA,m) lim <- rep(NA,m) for(i in 1:m){ cat(paste("simulation for ",i,"\n")) set.seed(seed) timings <- system.time(z[i] <- simpi(10^i)) print(timings) cat("\n") lim[i] <- qbinom(prob=pi/4,size=10^i,.975)/10^i * 4 } ## convert to squared error z <-(z - pi)^2 lim <- (lim - pi)^2 plot(x=1:m, y=z, type="b", pch=16, log="y", axes=FALSE, ylim=range(z,lim), xlab="Monte Carlo Samples", ylab="Log Squared Error") lines(1:m,lim,col="blue",type="b",pch=1) legend(x="topright", legend=c("95% bound", "Realized"), pch=c(1,16), lty=c(1,1), col=c("blue","black"), bty="n") axis(1,at=1:m, labels=c(expression(10^{1}), expression(10^{2}), expression(10^{3}), expression(10^{4}), expression(10^{5}), expression(10^{6}))) axis(2)
Numeric codes and names of 50 states and the District of Columbia, required to parse Keith Poole and Howard Rosenthal's collections of U.S. Congressional roll calls.
data(state.info)
data(state.info)
icpsr
integer, numeric code for state used by the Inter-university Consortium for Political and Social Research
state
character, name of state or Washington
D.C.
year
numeric or NA
, year of statehood
The function readKH
converts the integer ICPSR codes
into strings, via a table lookup in this data frame. Another table
lookup in state.abb
provides the 2-letter abbreviation
commonly used in identifying American legislators, e.g.,
KENNEDY, E (D-MA)
.
Various ICPSR codebooks. e.g., doi:10.3886/ICPSR00002.v5
Provides a summary of the output from ideal point estimation contained
in an object of class ideal
.
## S3 method for class 'ideal' summary(object, prob=.95, burnin=NULL, sort=TRUE, include.beta=FALSE,...)
## S3 method for class 'ideal' summary(object, prob=.95, burnin=NULL, sort=TRUE, include.beta=FALSE,...)
object |
an object of class |
prob |
scalar, a proportion between 0 and 1, the content of the highest posterior density (HPD) interval to compute for the parameters |
burnin |
of the recorded MCMC samples, how many to discard as
burnin? Default is |
sort |
logical, default is |
include.beta |
whether or not to calculate summary statistics of
beta, if beta is available. If the item parameters were not stored
in the |
... |
further arguments passed to or from other functions |
The test of whether a given discrimination parameter is
distinguishable from zero first checks to see if the two most extreme
quantiles
are symmetric around .5 (e.g., as are the default
value of .025 and .975). If so, the corresponding quantiles of the
MCMC samples for each discrimination parameter are inspected to see if
they have the same sign. If they do, then the corresponding
discrimination parameter is flagged as distinguishable from zero;
otherwise not.
An item of class summary.ideal
with elements:
object |
the name of the ideal object as an
|
xm |
|
xsd |
|
xHDR |
|
bm |
|
bsd |
|
bHDR |
|
bSig |
a |
party.quant |
if party information is available through the
|
When specifying a value of burnin
different from that used
in fitting the ideal
object, note a distinction
between the iteration numbers of the stored iterations, and the
number of stored iterations. That is, the n
-th iteration
stored in an ideal
object will not be iteration
n
if the user specified thin>1
in the call to
ideal
. Here, iterations are tagged with their
iteration number. Thus, if the user called ideal
with
thin=10
and burnin=100
then the stored iterations are
numbered 100, 110, 120, ...
. Any future subsetting via a
burnin
refers to this iteration number.
Simon Jackman [email protected]
f <- system.file("extdata","id1.rda",package="pscl") load(f) summary(id1) ## Not run: data(s109) cl2 <- constrain.legis(s109, x=list("KENNEDY (D MA)"=c(-1,0), "ENZI (R WY)"=c(1,0), "CHAFEE (R RI)"=c(0,-.5)), d=2) id2Constrained <- ideal(s109, d=2, priors=cl2, ## priors (w constraints) startvals=cl2, ## start value (w constraints) store.item=TRUE, maxiter=5000, burnin=500, thin=25) summary(id2Constrained, include.items=TRUE) ## End(Not run)
f <- system.file("extdata","id1.rda",package="pscl") load(f) summary(id1) ## Not run: data(s109) cl2 <- constrain.legis(s109, x=list("KENNEDY (D MA)"=c(-1,0), "ENZI (R WY)"=c(1,0), "CHAFEE (R RI)"=c(0,-.5)), d=2) id2Constrained <- ideal(s109, d=2, priors=cl2, ## priors (w constraints) startvals=cl2, ## start value (w constraints) store.item=TRUE, maxiter=5000, burnin=500, thin=25) summary(id2Constrained, include.items=TRUE) ## End(Not run)
Provides a summary of the information about votes, legislators, etc in
a rollcall
object.
## S3 method for class 'rollcall' summary(object, dropList=NULL, verbose=FALSE,debug=FALSE,...) ## S3 method for class 'summary.rollcall' print(x, digits=1, ...)
## S3 method for class 'rollcall' summary(object, dropList=NULL, verbose=FALSE,debug=FALSE,...) ## S3 method for class 'summary.rollcall' print(x, digits=1, ...)
object |
an |
dropList |
a |
verbose |
logical, if |
debug |
logical, if |
x |
an object of class |
digits |
number of decimal places in printed display |
... |
further arguments passed to or from other methods. |
An object of class summary.rollcall
with the following elements
(depending on the logical flag verbose
):
n |
number of legislators in the |
m |
number of roll call votes in the |
codes |
a |
allVotes |
a matrix containing a tabular breakdown of all votes
in the |
partyTab |
a tabular breakdown of the legislators' party
affiliations, after processing the |
lopSided |
a tabular summary of the frequency of lop-sided roll
call votes in the |
legisTab |
a tabular summary of each legislators' voting history |
partyLoyalty |
the proportion of times that each legislator votes the way that a majority of his or her fellow partisans did, provided party affiliations are available |
voteTab |
a tabular summary of each rollcall's votes |
call |
the |
set.seed(314159265) fakeData <- matrix(sample(x=c(0,1),size=1000,replace=TRUE), 10,100) rc <- rollcall(fakeData) rc data(sc9497) rc <- rollcall(sc9497) summary(rc) data(s109) summary(s109) summary(s109,verbose=TRUE)
set.seed(314159265) fakeData <- matrix(sample(x=c(0,1),size=1000,replace=TRUE), 10,100) rc <- rollcall(fakeData) rc data(sc9497) rc <- rollcall(sc9497) summary(rc) data(s109) summary(s109) summary(s109,verbose=TRUE)
Produces a trace plot of the MCMC samples from the posterior density of
legislators' ideal
points.
tracex(object, legis=NULL, d=1, conf.int=0.95, multi = FALSE, burnin=NULL,span=.25, legendLoc="topright")
tracex(object, legis=NULL, d=1, conf.int=0.95, multi = FALSE, burnin=NULL,span=.25, legendLoc="topright")
object |
an object of class |
legis |
a vector of either the names of legislators (or
|
d |
numeric, either a scalar or a vector of length two, the dimension(s) to be traced. |
conf.int |
numeric, the level of the confidence interval on the posterior mean to be plotted. |
multi |
logical, multiple plotting panels, one per legislators? If |
burnin |
of the recorded MCMC samples, how many to discard as
burnin? Default is |
span |
numeric, a proportion, the |
legendLoc |
numeric or character, and possibly a vector,
specifying where to place the legend when |
Produces a trace plot showing the history of the MCMC
iterations for the ideal point of each of the legislators (partially)
named in legis
. For d=1
, each trace plot includes a
trace over iterations, the cumulative mean, a moving average, the
MCMC-based estimate of the mean of the posterior, and a confidence
interval (specified by conf.int
) around the mean of the
posterior (using the estimated quantiles
) of
the respective MCMC iterates). All of these values are calculated
discarding the initial burnin
iterations.
When d
is a vector of length two, a 2-dimensional trace plot is
displayed, with the d[1]
dimension on the horizontal axis, and
the d[2]
dimension on the vertical axis.
When d=1
, a legend will be placed on the plot; the option
legendLoc
controls the placing of the legend. legendLoc
may be a vector, specifying a unique legend location for each
requested trace plots. If
legendLoc
is of length 1, it will be rep
licated
to have length equal to the number of requested trace plots.
ideal
; pmatch
for matching
legislators' names. See legend
for valid options to
legendLoc
.
data(s109) f <- system.file("extdata","id1.rda",package="pscl") load(f) tracex(id1,legis="KENN") ## n.b., no such legislator named Thomas Bayes tracex(id1,legis=c("KENN","BOX","KYL","Thomas Bayes")) f <- system.file("extdata","id2.rda",package="pscl") load(f) tracex(id2,d=1,legis=c("KENNEDY","BOXER","KYL","Thomas Bayes")) tracex(id2,d=2,legis=c("KENNEDY","BOXER","KYL","Thomas Bayes")) tracex(id2,d=1:2, legis=c("KENNEDY","BOXER","KYL","Thomas Bayes")) ## partial matching tracex(id2,d=1:2, legis=c("KENN","BOX","BID","SNO","SPEC","MCCA","KYL", "Thomas Bayes"), multi=TRUE)
data(s109) f <- system.file("extdata","id1.rda",package="pscl") load(f) tracex(id1,legis="KENN") ## n.b., no such legislator named Thomas Bayes tracex(id1,legis=c("KENN","BOX","KYL","Thomas Bayes")) f <- system.file("extdata","id2.rda",package="pscl") load(f) tracex(id2,d=1,legis=c("KENNEDY","BOXER","KYL","Thomas Bayes")) tracex(id2,d=2,legis=c("KENNEDY","BOXER","KYL","Thomas Bayes")) tracex(id2,d=1:2, legis=c("KENNEDY","BOXER","KYL","Thomas Bayes")) ## partial matching tracex(id2,d=1:2, legis=c("KENN","BOX","BID","SNO","SPEC","MCCA","KYL", "Thomas Bayes"), multi=TRUE)
Electoral returns, selected constituencies, 1992 general election for the British House of Commons
data(UKHouseOfCommons)
data(UKHouseOfCommons)
A data frame with 521 observations on the following 12 variables.
constituency
a character vector, name of the House of Commons constituency
county
a character vector, county of the House of Commons constituency
y1
a numeric vector, log-odds of Conservative to LibDem vote share
y2
a numeric vector, log-odds of Labor to LibDem vote share
y1lag
a numeric vector, y1
from previous election
y2lag
a numeric vector, y2
from previous election
coninc
a numeric vector, 1 if the incumbent is a Conservative, 0 otherwise
labinc
a numeric vector, 1 if the incumbent is from the Labor Party, 0 otherwise
libinc
a numeric vector, 1 if the incumbent is from the LibDems, 0 otherwise
v1
a numeric vector, Conservative vote share (proportion of 3 party vote)
v2
a numeric vector, Labor vote share (proportion of 3 party vote)
v3
a numeric vector, LibDem vote share (proportion of 3 party vote)
These data span only 521 of the 621 seats in the House of Commons at the time of 1992 election. Seats missing either a Conservative, Labor, or a LibDem candidate appear to have been dropped.
The original Katz and King data set does not have case labels. I used matches to an additional data source to recover a set of constituency labels for these data; labels could not recovered for two of the constituencies.
Jonathan Katz; Gary King. 1999. "Replication data for: A Statistical Model of Multiparty Electoral Data", http://hdl.handle.net/1902.1/QIGTWZYTLZ
Katz, Jonathan and Gary King. 1999. “A Statistical Model for Multiparty Electoral Data”. American Political Science Review. 93(1): 15-32.
Jackman, Simon. 2009. Bayesian Analysis for the Social Sciences. Wiley: Chichester. Example 6.9.
data(UKHouseOfCommons) tmp <- UKHouseOfCommons[,c("v1","v2","v3")] summary(apply(tmp,1,sum)) col <- rep("black",dim(tmp)[1]) col[UKHouseOfCommons$coninc==1] <- "blue" col[UKHouseOfCommons$labinc==1] <- "red" col[UKHouseOfCommons$libinc==1] <- "orange" library(vcd) vcd::ternaryplot(tmp, dimnames=c("Cons","Lab","Lib-Dem"), labels="outside", col=col, pch=1, main="1992 UK House of Commons Election", cex=.75)
data(UKHouseOfCommons) tmp <- UKHouseOfCommons[,c("v1","v2","v3")] summary(apply(tmp,1,sum)) col <- rep("black",dim(tmp)[1]) col[UKHouseOfCommons$coninc==1] <- "blue" col[UKHouseOfCommons$labinc==1] <- "red" col[UKHouseOfCommons$libinc==1] <- "orange" library(vcd) vcd::ternaryplot(tmp, dimnames=c("Cons","Lab","Lib-Dem"), labels="outside", col=col, pch=1, main="1992 UK House of Commons Election", cex=.75)
Cross-national data on relative size of the trade unions and predictors, in 20 countries. Two of the predictors are highly collinear, and are the source of a debate between Stephens and Wallerstein (1991), later reviewed by Western and Jackman (1994).
data(unionDensity)
data(unionDensity)
union
numeric, percentage of the total number of wage and salary earners plus the unemployed who are union members, measured between 1975 and 1980, with most of the data drawn from 1979
left
numeric, an index tapping the extent to which parties of the left have controlled governments since 1919, due to Wilensky (1981).
size
numeric, log of labor force size, defined as the number of wage and salary earners, plus the unemployed
concen
numeric, percentage of employment, shipments, or production accounted for by the four largest enterprises in a particular industry, averaged over industries (with weights proportional to the size of the industry) and the resulting measure is normalized such that the United States scores a 1.0, and is due to Pryor (1973). Some of the scores on this variable are imputed using procedures described in Stephens and Wallerstein (1991, 945).
Pryor, Frederic. 1973. Property and Industrial Organization in Communist and Capitalist Countries. Bloomington: Indiana University Press.
Stephens, John and Michael Wallerstein. 1991. Industrial Concentration, Country Size and Trade Union Membership. American Political Science Review 85:941-953.
Western, Bruce and Simon Jackman. 1994. Bayesian Inference for Comparative Research. American Political Science Review 88:412-423.
Wilensky, Harold L. 1981. Leftism, Catholicism, Democratic Corporatism: The Role of Political Parties in Recent Welfare State Development. In The Development of Welfare States in Europe and America, ed. Peter Flora and Arnold J. Heidenheimer. New Brunswick: Transaction Books.
Jackman, Simon. 2009. Bayesian Analysis for the Social Sciences. Wiley: Hoboken, New Jersey.
data(unionDensity) summary(unionDensity) pairs(unionDensity, labels=c("Union\nDensity", "Left\nGovernment", "log Size of\nLabor Force", "Economic\nConcentration"), lower.panel=function(x,y,digits=2){ r <- cor(x,y) par(usr=c(0,1,0,1)) text(.5,.5, format(c(r,0.123456789),digits=digits)[1], cex=1.5) } ) ols <- lm(union ~ left + size + concen, data=unionDensity) summary(ols)
data(unionDensity) summary(unionDensity) pairs(unionDensity, labels=c("Union\nDensity", "Left\nGovernment", "log Size of\nLabor Force", "Economic\nConcentration"), lower.panel=function(x,y,digits=2){ r <- cor(x,y) par(usr=c(0,1,0,1)) text(.5,.5, format(c(r,0.123456789),digits=digits)[1], cex=1.5) } ) ols <- lm(union ~ left + size + concen, data=unionDensity) summary(ols)
Extract the information in a roll call matrix as a series of vectors with voting decision, a unique identifier for the legislator and a unique identifier for the roll call.
vectorRepresentation(object, dropList = list(codes = c("missing", "notInLegis")))
vectorRepresentation(object, dropList = list(codes = c("missing", "notInLegis")))
object |
an object of class |
dropList |
a |
It is often the case that roll call matrices are sparse, say, when the roll call matrix has an “overlapping generations” structure; e.g., consider forming data by pooling across a long temporal sequence of legislatures such that relatively few of the legislators in the data set actually vote on any given roll call. In such a case, representing the data as a roll call matrix is not particularly helpful nor efficient, either for data summaries or modeling.
A matrix
with z
rows, where z
is the
number of non-missing entries in object$votes
, with
‘missingness’ defined by the codes
component of the
dropList
. The matrix has 3 columns:
vote |
the voting decision, either a |
i |
the row of the roll call matrix |
j |
the column of the roll call matrix |
Simon Jackman [email protected]
data(s109) y <- vectorRepresentation(s109) apply(y,2,table,exclude=NULL)
data(s109) y <- vectorRepresentation(s109) apply(y,2,table,exclude=NULL)
Survey data containing self-reports of vote choice in the 1992 U.S. Presidential election, with numerous covariates, from the 1992 American National Election Studies.
data(vote92)
data(vote92)
A data frame with 909 observations on the following 10 variables.
vote
a factor with levels Perot
Clinton
Bush
dem
a numeric vector, 1 if the respondent reports identifying with the Democratic party, 0 otherwise.
rep
a numeric vector, 1 if the respondent reports identifying with the Republican party, 0 otherwise
female
a numeric vector, 1 if the respondent is female, 0 otherwise
persfinance
a numeric vector, -1 if the respondent reports that their personal financial situation has gotten worse over the last 12 months, 0 for no change, 1 if better
natlecon
a numeric vector, -1 if the respondent reports that national economic conditions have gotten worse over the last 12 months, 0 for no change, 1 if better
clintondis
a numeric vector, squared difference between respondent's self-placement on a scale measure of political ideology and the respondent's placement of the Democratic candidate, Bill Clinton
bushdis
a numeric vector, squared ideological distance of the respondent from the Republican candidate, President George H.W. Bush
perotdis
a numeric vector, squared ideological distance of the respondent from the Reform Party candidate, Ross Perot
These data are unweighted. Refer to the original data source for weights that purport to correct for non-representativeness and non-response.
Alvarez, R. Michael and Jonathan Nagler. 1995. Economics, issues and the Perot candidacy: Voter choice in the 1992 Presidential election. American Journal of Political Science. 39:714-44.
Miller, Warren E., Donald R. Kinder, Steven J. Rosenstone and the National Election Studies. 1999. National Election Studies, 1992: Pre-/Post-Election Study. Center for Political Studies, University of Michigan: Ann Arbor, Michigan.
Inter-University Consortium for Political and Social Research. Study Number 1112. doi:10.3886/ICPSR01112.
Jackman, Simon. 2009. Bayesian Analysis for the Social Sciences. Wiley: Hoboken, New Jersey. Examples 8.7 and 8.8.
data(vote92) summary(vote92)
data(vote92) summary(vote92)
Compares two models fit to the same data that do not nest via Vuong's non-nested test.
vuong(m1, m2, digits = getOption("digits"))
vuong(m1, m2, digits = getOption("digits"))
m1 |
model 1, an object inheriting from class |
m2 |
model 2, as for model 1 |
digits |
significant digits in printed result |
The Vuong non-nested test is based on a comparison of the predicted probabilities of two models that do not nest. Examples include comparisons of zero-inflated count models with their non-zero-inflated analogs (e.g., zero-inflated Poisson versus ordinary Poisson, or zero-inflated negative-binomial versus ordinary negative-binomial). A large, positive test statistic provides evidence of the superiority of model 1 over model 2, while a large, negative test statistic is evidence of the superiority of model 2 over model 1. Under the null that the models are indistinguishable, the test statistic is asymptotically distributed standard normal.
Let be the predicted probabilities from model 1, evaluated conditional on the estimated MLEs. Let
be the corresponding probabilities from model 2. Then the Vuong statistic is
where
and
is the sample standard deviation of
.
Two finite sample corrections are often considered, based on the Akaike (AIC) and Schwarz (BIC) penalty terms, based on the complexity of the two models. These corrections sometimes generate conflicting conclusions.
The function will fail if the models do not contain identical values
in their respective components named y
(the value of the
response being modeled).
nothing returned, prints 3 test statistics and values and
exits silently.
Simon Jackman [email protected]
Vuong, Q.H. 1989. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica. 57:307-333.
## Not run: data("bioChemists") ## compare Poisson GLM and ZIP glm1 <- glm(art ~ ., data = bioChemists, family = poisson) zip <- zeroinfl(art ~ . | ., data = bioChemists, EM = TRUE) vuong(glm1, zip) ## compare negbin with zero-inflated negbin nb1 <- MASS::glm.nb(art ~ ., data=bioChemists) zinb <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin", EM = TRUE) vuong(nb1, zinb) ## End(Not run)
## Not run: data("bioChemists") ## compare Poisson GLM and ZIP glm1 <- glm(art ~ ., data = bioChemists, family = poisson) zip <- zeroinfl(art ~ . | ., data = bioChemists, EM = TRUE) vuong(glm1, zip) ## compare negbin with zero-inflated negbin nb1 <- MASS::glm.nb(art ~ ., data=bioChemists) zinb <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin", EM = TRUE) vuong(nb1, zinb) ## End(Not run)
Fit zero-inflated regression models for count data via maximum likelihood.
zeroinfl(formula, data, subset, na.action, weights, offset, dist = c("poisson", "negbin", "geometric"), link = c("logit", "probit", "cloglog", "cauchit", "log"), control = zeroinfl.control(...), model = TRUE, y = TRUE, x = FALSE, ...)
zeroinfl(formula, data, subset, na.action, weights, offset, dist = c("poisson", "negbin", "geometric"), link = c("logit", "probit", "cloglog", "cauchit", "log"), control = zeroinfl.control(...), model = TRUE, y = TRUE, x = FALSE, ...)
formula |
symbolic description of the model, see details. |
data , subset , na.action
|
arguments controlling formula processing
via |
weights |
optional numeric vector of weights. |
offset |
optional numeric vector with an a priori known component to be included in the linear predictor of the count model. See below for more information on offsets. |
dist |
character specification of count model family (a log link is always used). |
link |
character specification of link function in the binary zero-inflation model (a binomial family is always used). |
control |
a list of control arguments specified via
|
model , y , x
|
logicals. If |
... |
arguments passed to |
Zero-inflated count models are two-component mixture models combining a point mass at zero with a proper count distribution. Thus, there are two sources of zeros: zeros may come from both the point mass and from the count component. Usually the count model is a Poisson or negative binomial regression (with log link). The geometric distribution is a special case of the negative binomial with size parameter equal to 1. For modeling the unobserved state (zero vs. count), a binary model is used that captures the probability of zero inflation. in the simplest case only with an intercept but potentially containing regressors. For this zero-inflation model, a binomial model with different links can be used, typically logit or probit.
The formula
can be used to specify both components of the model:
If a formula
of type y ~ x1 + x2
is supplied, then the same
regressors are employed in both components. This is equivalent to
y ~ x1 + x2 | x1 + x2
. Of course, a different set of regressors
could be specified for the count and zero-inflation component, e.g.,
y ~ x1 + x2 | z1 + z2 + z3
giving the count data model y ~ x1 + x2
conditional on (|
) the zero-inflation model y ~ z1 + z2 + z3
.
A simple inflation model where all zero counts have the same
probability of belonging to the zero component can by specified by the formula
y ~ x1 + x2 | 1
.
Offsets can be specified in both components of the model pertaining to count and
zero-inflation model: y ~ x1 + offset(x2) | z1 + z2 + offset(z3)
, where
x2
is used as an offset (i.e., with coefficient fixed to 1) in the
count component and z3
analogously in the zero-inflation component. By the rule
stated above y ~ x1 + offset(x2)
is expanded to
y ~ x1 + offset(x2) | x1 + offset(x2)
. Instead of using the
offset()
wrapper within the formula
, the offset
argument
can also be employed which sets an offset only for the count model. Thus,
formula = y ~ x1
and offset = x2
is equivalent to
formula = y ~ x1 + offset(x2) | x1
.
All parameters are estimated by maximum likelihood using optim
,
with control options set in zeroinfl.control
.
Starting values can be supplied, estimated by the EM (expectation maximization)
algorithm, or by glm.fit
(the default). Standard errors
are derived numerically using the Hessian matrix returned by optim
.
See zeroinfl.control
for details.
The returned fitted model object is of class "zeroinfl"
and is similar
to fitted "glm"
objects. For elements such as "coefficients"
or
"terms"
a list is returned with elements for the zero and count component,
respectively. For details see below.
A set of standard extractor functions for fitted model objects is available for
objects of class "zeroinfl"
, including methods to the generic functions
print
, summary
, coef
,
vcov
, logLik
, residuals
,
predict
, fitted
, terms
,
model.matrix
. See predict.zeroinfl
for more details
on all methods.
An object of class "zeroinfl"
, i.e., a list with components including
coefficients |
a list with elements |
residuals |
a vector of raw residuals (observed - fitted), |
fitted.values |
a vector of fitted means, |
optim |
a list with the output from the |
control |
the control arguments passed to the |
start |
the starting values for the parameters passed to the |
weights |
the case weights used, |
offset |
a list with elements |
n |
number of observations (with weights > 0), |
df.null |
residual degrees of freedom for the null model (= |
df.residual |
residual degrees of freedom for fitted model, |
terms |
a list with elements |
theta |
estimate of the additional |
SE.logtheta |
standard error for |
loglik |
log-likelihood of the fitted model, |
vcov |
covariance matrix of all coefficients in the model (derived from the
Hessian of the |
dist |
character string describing the count distribution used, |
link |
character string describing the link of the zero-inflation model, |
linkinv |
the inverse link function corresponding to |
converged |
logical indicating successful convergence of |
call |
the original function call, |
formula |
the original formula, |
levels |
levels of the categorical regressors, |
contrasts |
a list with elements |
model |
the full model frame (if |
y |
the response count vector (if |
x |
a list with elements |
Achim Zeileis <[email protected]>
Cameron, A. Colin and Pravin K. Trevedi. 1998. Regression Analysis of Count Data. New York: Cambridge University Press.
Cameron, A. Colin and Pravin K. Trivedi. 2005. Microeconometrics: Methods and Applications. Cambridge: Cambridge University Press.
Lambert, Diane. 1992. “Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing.” Technometrics. 34(1):1-14. doi:10.2307/1269547
Zeileis, Achim, Christian Kleiber and Simon Jackman 2008. “Regression Models for Count Data in R.” Journal of Statistical Software, 27(8). URL https://www.jstatsoft.org/v27/i08/.
zeroinfl.control
, glm
,
glm.fit
, glm.nb
,
hurdle
## data data("bioChemists", package = "pscl") ## without inflation ## ("art ~ ." is "art ~ fem + mar + kid5 + phd + ment") fm_pois <- glm(art ~ ., data = bioChemists, family = poisson) fm_qpois <- glm(art ~ ., data = bioChemists, family = quasipoisson) fm_nb <- MASS::glm.nb(art ~ ., data = bioChemists) ## with simple inflation (no regressors for zero component) fm_zip <- zeroinfl(art ~ . | 1, data = bioChemists) fm_zinb <- zeroinfl(art ~ . | 1, data = bioChemists, dist = "negbin") ## inflation with regressors ## ("art ~ . | ." is "art ~ fem + mar + kid5 + phd + ment | fem + mar + kid5 + phd + ment") fm_zip2 <- zeroinfl(art ~ . | ., data = bioChemists) fm_zinb2 <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin")
## data data("bioChemists", package = "pscl") ## without inflation ## ("art ~ ." is "art ~ fem + mar + kid5 + phd + ment") fm_pois <- glm(art ~ ., data = bioChemists, family = poisson) fm_qpois <- glm(art ~ ., data = bioChemists, family = quasipoisson) fm_nb <- MASS::glm.nb(art ~ ., data = bioChemists) ## with simple inflation (no regressors for zero component) fm_zip <- zeroinfl(art ~ . | 1, data = bioChemists) fm_zinb <- zeroinfl(art ~ . | 1, data = bioChemists, dist = "negbin") ## inflation with regressors ## ("art ~ . | ." is "art ~ fem + mar + kid5 + phd + ment | fem + mar + kid5 + phd + ment") fm_zip2 <- zeroinfl(art ~ . | ., data = bioChemists) fm_zinb2 <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin")
Various parameters that control fitting of zero-inflated regression models
using zeroinfl
.
zeroinfl.control(method = "BFGS", maxit = 10000, trace = FALSE, EM = FALSE, start = NULL, ...)
zeroinfl.control(method = "BFGS", maxit = 10000, trace = FALSE, EM = FALSE, start = NULL, ...)
method |
characters string specifying the |
maxit |
integer specifying the |
trace |
logical or integer controlling whether tracing information on
the progress of the optimization should be produced (passed to |
EM |
logical. Should starting values be estimated by the EM (expectation maximization) algorithm? See details. |
start |
an optional list with elements |
... |
arguments passed to |
All parameters in zeroinfl
are estimated by maximum likelihood
using optim
with control options set in zeroinfl.control
.
Most arguments are passed on directly to optim
, only trace
is also
used within zeroinfl
and EM
/start
control the choice
of starting values for calling optim
.
Starting values can be supplied, estimated by the EM (expectation maximization)
algorithm, or by glm.fit
(the default). Standard errors are
derived numerically using
the Hessian matrix returned by optim
. To supply starting
values, start
should be a list with elements "count"
and "zero"
and potentially "theta"
(for negative binomial components only) containing
the starting values for the coefficients of the corresponding component of the
model.
A list with the arguments specified.
Achim Zeileis <[email protected]>
## Not run: data("bioChemists", package = "pscl") ## default start values fm1 <- zeroinfl(art ~ ., data = bioChemists) ## use EM algorithm for start values fm2 <- zeroinfl(art ~ ., data = bioChemists, EM = TRUE) ## user-supplied start values fm3 <- zeroinfl(art ~ ., data = bioChemists, start = list(count = c(0.7, -0.2, 0.1, -0.2, 0, 0), zero = -1.7)) ## End(Not run)
## Not run: data("bioChemists", package = "pscl") ## default start values fm1 <- zeroinfl(art ~ ., data = bioChemists) ## use EM algorithm for start values fm2 <- zeroinfl(art ~ ., data = bioChemists, EM = TRUE) ## user-supplied start values fm3 <- zeroinfl(art ~ ., data = bioChemists, start = list(count = c(0.7, -0.2, 0.1, -0.2, 0, 0), zero = -1.7)) ## End(Not run)