PHD PROGRAMME IN MATHEMATICS AND APPLICATIONS

LOG-LINEAR MODELS
AND TORIC IDEALS

Fabio Rapallo

Thesis advisor: Prof. GIOVANNI PISTONE (Politecnico di Torino)
PhD programme coordinator: Prof. CLAUDIO PEDRINI (Università di Genova)

XV Cycle
Administrative seat: Università di Genova
Partner institutions: Università di Torino, Politecnico di Torino

Thesis submitted for the degree of Dottore di Ricerca in Mathematics and Applications, April 2003
Contents

Introduction  3

1 Statistical and algebraic background  7
  1.1 Log-linear models  8
    1.1.1 Definitions and basic properties  8
    1.1.2 Maximum likelihood estimates  12
    1.1.3 Goodness of fit tests  14
    1.1.4 An important remark  16
    1.1.5 Exact methods  17
  1.2 Algebraic theory of toric ideals  21
    1.2.1 Definitions and first properties  22
    1.2.2 Computation of toric ideals. First method  24
    1.2.3 Navigating inside the set of solutions of an integer linear system  25
    1.2.4 Computation of toric ideals. Second method  27

2 The Diaconis-Sturmfels algorithm  29
  2.1 Hypergeometric distribution  29
  2.2 The algorithm  31
  2.3 Rates of convergence  36
  2.4 Two examples  36

3 Two-way contingency tables  39
  3.1 Two-way contingency tables: Independence  40
  3.2 Some models for incomplete and square tables  44
    3.2.1 Incomplete tables  44
    3.2.2 The quasi-independence model  45
    3.2.3 The symmetry model  47
    3.2.4 The quasi-symmetry model  47
    3.2.5 Computational notes  48
  3.3 Examples  49
  3.4 An application. The h-sample problem  52

4 Rater agreement models  55
  4.1 A medical problem  55
  4.2 Multidimensional rater agreement models  56
  4.3 Exact inference  61
  4.4 Conditioning  62
  4.5 Computations  64
  4.6 Example  70

5 Sufficiency and estimation  73
  5.1 Introduction  74
  5.2 Sufficiency and sampling  75
  5.3 Geometry of toric models  78
  5.4 Algebraic representation of the sufficient statistics  84
  5.5 Minimum-variance estimator  85
  5.6 Algebraic method  87

A Programs  89

Bibliography  98
Introduction
It is useful to start this work with two quotations from a survey paper by Agresti (2001) which recently appeared in “Statistics in Medicine”.
“For the null hypothesis of independence and conditional independence,
the relevant conditional distribution is simple, which is why ordinary Monte
Carlo applies so easily. This is not true of more complex hypotheses, such
as quasi-independence, quasi-symmetry, and others for square and triangular
contingency tables that were until recently given little attention in the exact
literature.”
“If statisticians cannot agree how to analyze even a 2 × 2 table, with no
one approach being obviously best, what hope is there for a consensus on more
complex analyses? The probability is probably decreasing to 0 with time, . . .”
These two claims concisely convey how much statisticians need simple and comprehensive procedures for analyzing complex problems involving contingency tables.
In this thesis we use techniques from Computational Commutative Algebra
to solve complex problems in estimation and hypothesis testing for discrete
data.
The application of Computational Commutative Algebra in statistics and probability is a recent topic; a first book in the field is Pistone et al. (2001a).

In this thesis, starting from the work by Diaconis & Sturmfels (1998), we study log-linear models through polynomial algebra, with important applications to statistical problems, such as goodness-of-fit tests for contingency tables. In particular, we apply the algebraic theory of toric ideals to study a Markov Chain Monte Carlo method for sampling from conditional distributions, through the notion of Markov basis. Throughout the thesis we will consider a wide range of statistical problems: models for two-way and multi-way contingency tables; rater agreement models; contingency tables with structural zeros. In some cases we will characterize the relevant Markov basis, while in
other cases we will give results which lead to a fundamental simplification of
the symbolic computations. A number of examples will show the relevance,
the efficiency and the versatility of these algebraic procedures.
The relevance of the topics considered here is attested by the increasing number of specific workshops, conferences and lectures on this subject. Moreover, the number of people interested in the field is growing rapidly, and includes Professor Bernd Sturmfels and Professor Stephen Fienberg.
In Chapter 1, we introduce some background material about log-linear models and toric ideals, which play a key role in the analysis of contingency tables. In Chapter 2, we present a slight generalization of the Diaconis-Sturmfels algorithm, while in Chapter 3 we analyze the case of two-way contingency tables, with some results for the characterization of the relevant toric ideals and some other results useful to simplify their computation. In Chapter 4 we study an application of the previous theory to rater agreement models, which are widely used in biomedical and pharmaceutical research. We also give results for the characterization of toric ideals for multi-way contingency tables. Finally, in Chapter 5 we study the algebraic properties of the notions of sufficiency and maximum likelihood estimation in statistical models on a finite sample space, with results about the computation of the sufficient statistic of a model and its geometric representation. The programs, both symbolic and numerical, presented in Appendix A show the practical applicability of the algorithms, together with the examples on real data sets presented in all chapters.
The results presented in Chapters 3, 4 and 5 represent original work, as
well as the generalization of the algorithm presented in Chapter 2 and the
programs in Appendix A.
A part of the thesis has been published in preliminary versions. In particular, a part of Chapter 2 is in press in the “Scandinavian Journal of Statistics”, while a part of Chapters 4 and 5 is the subject of a preprint of the Department of Mathematics of Genova. Two papers are currently submitted to international journals.
The topics presented here have been the subject of several presentations at conferences and international workshops. Among others, we cite here the “Grostat” series and the meetings of the Italian Statistical Society.
Moreover, the tutorial for the course “Statistical Toric Models”, at the “First
International School on Algebraic Statistics”, has been based on the work
presented in this thesis.
Finally, the theory and the applications developed in Chapter 4 have been the starting point for an ongoing cooperation with the pharmaceutical company “Schering” in Berlin.
Acknowledgements
During the writing of this work, over the last two years, I have benefited from the help of many people, both scientifically and personally.
In alphabetical order, a minimal list is formed by Daniele De Martini (University of “Piemonte Orientale”, Novara), Franco Fagnola (University of Genova), Mauro Gasparini (Politecnico di Torino), my advisor Giovanni Pistone
(Politecnico di Torino), Eva Riccomagno (University of Warwick), Lorenzo
Robbiano (University of Genova), Maria Piera Rogantin (University of Genova), Ivano Repetto (University of Genova), Richard Vonk (Schering, Berlin).
Outside academia, I want to acknowledge the constant support of my cousin Matteo Cella and of Andrea Giacobbe. Their support has been essential in the most critical moments.
Chapter 1

Statistical and algebraic background
In this chapter we introduce the main background theory we need, both from
the statistical and the algebraic point of view. In Section 1.1 we recall the
properties of log-linear models for the analysis of qualitative data, with particular attention to maximum likelihood estimation and goodness of fit tests, using the classical asymptotic approach. Moreover, we emphasize
how simple algebraic objects, such as power products and binomials, naturally
appear in the classical analysis of log-linear models. Finally, in Section 1.1.5,
we review the main exact methods for hypothesis testing for contingency tables.
Exact methods date back to the work of Fisher in the forties, when an exact test for independence in 2 × 2 contingency tables was introduced. A wide range of exact procedures for inference on contingency tables is now available, and in most cases the increased speed and power of computers has played a fundamental role in the practical applicability of such procedures.
We describe here the main methods for exact analysis of contingency tables.
In Section 1.2, we introduce the algebraic theory of toric ideals, with special attention to the statistical meaning of the algebraic procedures. A brief review of some relevant concepts from Computational Commutative Algebra is included in order to keep the exposition as self-contained as possible. In particular, we deal with the computation of toric ideals, which will represent a key step in the algorithms presented in the next chapters.
In this chapter, the proofs of the results are omitted, with references to the literature, in particular Haberman (1974), Bishop et al. (1975) and Agresti
(2002) for the theory of log-linear models, and Kreuzer & Robbiano (2000)
and Bigatti & Robbiano (2001) for the theory of toric ideals.
1.1 Log-linear models

1.1.1 Definitions and basic properties
The material presented in this section is mainly taken from classical works in
the field of the log-linear models. The aim of this section is to show that in the
classical representation of the log-linear models, i.e. using vector space theory,
polynomial equations play a fundamental role. Some simple examples of this
section will be used many times throughout the thesis. The main references
are Haberman (1974), Bishop et al. (1975), Agresti (1996), Fienberg (1980)
and Dobson (1990). In particular we follow the notation in Haberman (1974),
as it allows us to give a unified presentation of the theory of log-linear models
in a general setting. Some examples and useful remarks are taken from the
other references.
Log-linear models are statistical models for the analysis of qualitative data; the sample space is therefore finite and the data are usually presented in a contingency table.
Definition 1.1 A contingency table is a collection f of frequencies (f (x) ∈
N : x ∈ X ), where X is the sample space.
A contingency table contains the joint observations of d random variables
on N subjects. A first classification of contingency tables can be made looking
at the sample space X . In particular, the contingency table is said to be a d-way
contingency table if f is the cross-classification of the subjects with respect to
d random variables. Consequently, the sample space X is a finite subset of Nd ,
possibly with a suitable coding of the levels of the random variables. Moreover,
the contingency table f is complete if the sample space is a Cartesian product
of one-dimensional sample spaces and is incomplete otherwise.
Log-linear models are used to describe the stochastic structure of the joint
probability distribution underlying the sample, i.e. to determine stochastic relations between the observed variables, such as independence, non-interaction,
symmetry and others more complex relations.
If we denote the cardinality of the sample space X by q, the table f is an element of the q-dimensional space R^q, with the standard inner product defined by
\[ \langle f_1 , f_2 \rangle = \sum_{x \in \mathcal{X}} f_1(x) f_2(x) \]
and the canonical basis (e_x)_{x ∈ X}.
In a log-linear model, f is assumed to be the realization of a random vector
F with mean m = (m(x) : x ∈ X ), where m(x) > 0 for all x ∈ X . Here
and throughout we denote the observed tables with lowercase letters and the
random vectors with the corresponding uppercase letters. Thus for each x ∈ X ,
log m(x) is well defined. Here we assume that the probabilities of the sample
points are positive, and that the means m(x) > 0 are a parameterization of
the model. Therefore, if µ = (log m(x) : x ∈ X ), then µ ∈ Rq . In a log-linear
model it is assumed that µ ∈ M , where M is a p-dimensional linear manifold
contained in Rq . When q = p, we say that the corresponding log-linear model
is saturated.
Example 1.2 Consider a 2-way contingency table with two levels for each random variable, i.e. a 2 × 2 contingency table. The space R^4 is spanned by
these four orthogonal vectors:
\[ v_1 = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, \quad
   v_2 = \begin{pmatrix} 1 & 1 \\ -1 & -1 \end{pmatrix}, \quad
   v_3 = \begin{pmatrix} 1 & -1 \\ 1 & -1 \end{pmatrix}, \quad
   v_4 = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}. \]
We can consider a model M such that M = span(v_1, v_2, v_3). In such a case, Bishop et al. (1975) showed that the model can be written in the form
\[ \log m(i,j) = \lambda + \lambda^X_i + \lambda^Y_j \qquad (1.1) \]
for i, j = 1, 2, with the constraints
\[ \lambda^X_1 + \lambda^X_2 = \lambda^Y_1 + \lambda^Y_2 = 0 . \qquad (1.2) \]
If we denote by mi+ and m+j the marginal totals, one can write
\[ m_{i+} = e^{\lambda} e^{\lambda^X_i} \left( e^{\lambda^Y_1} + e^{\lambda^Y_2} \right) , \qquad
   m_{+j} = e^{\lambda} e^{\lambda^Y_j} \left( e^{\lambda^X_1} + e^{\lambda^X_2} \right) , \]
\[ N = e^{\lambda} \left( e^{\lambda^X_1} + e^{\lambda^X_2} \right) \left( e^{\lambda^Y_1} + e^{\lambda^Y_2} \right) . \]
Hence we obtain
\[ m(i,j) = m_{i+} m_{+j} / N \qquad (1.3) \]
where m_{i+} and m_{+j} are the marginal totals. Moreover, since v_4 is orthogonal to M, we obtain the relation
\[ \log \frac{m(1,1)\, m(2,2)}{m(1,2)\, m(2,1)} = 0 \qquad (1.4) \]
from which we obtain
\[ m(1,1)\, m(2,2) - m(1,2)\, m(2,1) = 0 . \qquad (1.5) \]
Equation (1.5) represents an algebraic relation between the means of the cells
and in particular it is a binomial. We will see later the relevance of such
binomials in our framework.
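As a quick numerical check (not part of the original derivation), the following Python sketch builds the means of a 2 × 2 table from the log-linear form (1.1), with arbitrary illustrative parameter values satisfying the constraints (1.2), and verifies that the binomial relation (1.5) holds:

```python
import math

# Illustrative parameters satisfying the constraints (1.2):
# lambda^X_1 + lambda^X_2 = lambda^Y_1 + lambda^Y_2 = 0
lam = 0.5
lam_x = [0.3, -0.3]
lam_y = [-0.7, 0.7]

# Means from the log-linear form (1.1): log m(i,j) = lam + lam_x[i] + lam_y[j]
m = [[math.exp(lam + lam_x[i] + lam_y[j]) for j in range(2)] for i in range(2)]

# Binomial relation (1.5): m(1,1) m(2,2) - m(1,2) m(2,1) = 0
residual = m[0][0] * m[1][1] - m[0][1] * m[1][0]
print(abs(residual) < 1e-12)  # True
```

Any choice of the parameters gives a residual that vanishes up to floating-point rounding, since both products equal exp(2λ + λ^X_1 + λ^X_2 + λ^Y_1 + λ^Y_2).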
A detailed representation of the vector space bases of log-linear models is
presented in Collombier (1980), with special attention to the meaning of such
representation in terms of hypothesis testing.
For the statistical inference, it is now essential to specify the underlying
probability distribution of the table f. The main models in the literature are the Poisson model and the multinomial model.
The Poisson model
In the Poisson model, the elements of F are independent Poisson random
variables with E[F (x)] = m(x) for all x ∈ X and m(x) > 0 for all x ∈ X . The
log-likelihood is
\[ \log L(f, \mu) = \sum_{x \in \mathcal{X}} \left( f(x) \log m(x) - m(x) \right)
   = \langle f , \mu \rangle - \sum_{x \in \mathcal{X}} \exp(\mu(x)) . \]
If P_M is the orthogonal projection from R^q onto M, then since µ ∈ M and P_M is a symmetric operator, we have
\[ \log L(f, \mu) = \langle P_M f , \mu \rangle - \sum_{x \in \mathcal{X}} \exp(\mu(x)) . \qquad (1.6) \]
Therefore, the family of Poisson models such that µ ∈ M is an exponential family. By classical results about exponential families, the projection P_M f is a complete minimal sufficient statistic; see for example Lehmann (1983). Moreover, any nonsingular linear transformation of P_M f is again a complete minimal sufficient statistic. Traditionally, Equation (1.6) is presented as the following theorem.
Theorem 1.3 Under the Poisson model, the sufficient statistic is expressible as a linear combination of the cell counts (f(x) : x ∈ X).
For the proof, see Agresti (2002), page 136.
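To make Theorem 1.3 concrete for the independence model of Example 1.2, the following sketch (with an arbitrary illustrative table) computes the grand total and the marginal totals as a single linear function T = Af of the cell counts; these margins carry the same information as P_M f, since the rows of A span the model space M:

```python
import numpy as np

# Cell counts f = (f(1,1), f(1,2), f(2,1), f(2,2)) of an illustrative 2x2 table
f = np.array([13, 5, 8, 21])

# Rows of A span the model space M of Example 1.2:
# grand total N, row totals f(1,+), f(2,+), column totals f(+,1), f(+,2)
A = np.array([
    [1, 1, 1, 1],   # N
    [1, 1, 0, 0],   # f(1,+)
    [0, 0, 1, 1],   # f(2,+)
    [1, 0, 1, 0],   # f(+,1)
    [0, 1, 0, 1],   # f(+,2)
])

T = A @ f           # the sufficient statistic as a linear function of the counts
print(T)            # [47 18 29 21 26]
```

This matrix A is exactly the kind of model matrix that reappears in Section 1.1.4 and, later, in the algebraic treatment of toric ideals.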
The random vector F has mean m and variance-covariance matrix D(µ), where D(µ) is the diagonal matrix with the variances m(x) on the main diagonal and zeros elsewhere.
We denote convergence in probability and convergence in distribution by the symbols $\xrightarrow{p}$ and $\xrightarrow{d}$, respectively. Moreover, the symbol N denotes the Gaussian distribution. The first asymptotic results are summarized in the following proposition.
Proposition 1.4 If (F^{(N)})_{N>0} is a sequence of random vectors from a Poisson model with respective means (m^{(N)})_{N>0} and if N^{-1} m^{(N)} → m, then
\[ N^{-1} F^{(N)} \xrightarrow{p} m \qquad (1.7) \]
and
\[ N^{-1/2} \left( F^{(N)} - m^{(N)} \right) \xrightarrow{d} N(0, D(\mu)) . \qquad (1.8) \]
For the proof, see Haberman (1974), page 7.
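Proposition 1.4 can also be illustrated by a small simulation (an illustration only, with an arbitrarily chosen limiting mean vector m): as N grows, the normalized Poisson table N^{-1} F^{(N)} concentrates around m, as stated in (1.7):

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed for reproducibility
m = np.array([0.1, 0.2, 0.3, 0.4])       # assumed limiting means, m > 0

for N in (100, 10_000, 1_000_000):
    F = rng.poisson(N * m)               # Poisson table with mean m^(N) = N m
    print(N, np.abs(F / N - m).max())    # max deviation of N^{-1} F^(N) from m
```

The printed deviations shrink roughly at the rate N^{-1/2} predicted by (1.8).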
The multinomial model
In the multinomial model, the vector F consists of one or more independent multinomial random vectors (F_i : i ∈ X_k) with means (m_i : i ∈ X_k). The sets X_k, k = 1, ..., r, are disjoint and have union X, and m(x) > 0 for all x ∈ X. We suppose that the model M is such that for each k = 1, ..., r
\[ \nu^{(k)} = \{ I_{\mathcal{X}_k}(x) : x \in \mathcal{X} \} \in M , \]
where I_{X_k} denotes the indicator function of the set X_k.
For the determination of the complete minimal sufficient statistic, we consider the decomposition of M into M_1 and M_2 = M − M_1, where M_1 is the linear manifold generated by (ν^{(k)} : k = 1, ..., r) and M_2 is its orthogonal complement in M. We denote the sample size for (f(x) : x ∈ X_k) by N_k. The log-likelihood is
\[ \log L(f, \mu) = \langle P_{M_2} f , P_{M_2} \mu \rangle
   - \sum_{k=1}^{r} N_k \log \left( \langle m(P_{M_2} \mu) , \nu^{(k)} \rangle \right)
   + \sum_{k=1}^{r} \log N_k ! \, . \qquad (1.9) \]
Thus, P_{M_2} f is a complete minimal sufficient statistic. As any nonsingular linear transformation of P_{M_2} f is again a complete minimal sufficient statistic, we can use the statistic P_M f as in the Poisson model. The random vector F has mean m and variance-covariance matrix
\[ \Sigma(\mu) = D(\mu) \left( I - P^{*}_{M_1}(\mu) \right) \]
where P^{*}_{M_1} is the orthogonal projection from R^q onto M_1 and I is the identity matrix. We summarize in the following proposition the basic convergence results.
Proposition 1.5 If (F^{(N)})_{N>0} is a sequence of random vectors from a multinomial model with respective means (m^{(N)})_{N>0} and if N^{-1} m^{(N)} → m, then
\[ N^{-1} F^{(N)} \xrightarrow{p} m \qquad (1.10) \]
and
\[ N^{-1/2} \left( F^{(N)} - m^{(N)} \right) \xrightarrow{d}
   N \left( 0, D(\mu) \left( I - P^{*}_{M_1}(\mu) \right) \right) . \qquad (1.11) \]
For the proof, see Haberman (1974), page 13.
The close connection between the Poisson model and the multinomial model is analyzed in Haberman (1974), where the author proves that many results about maximum likelihood estimation and hypothesis testing are the same under both the Poisson model and the multinomial model.
Other models can be considered, such as the product-multinomial model
and the conditional Poisson model. See Haberman (1974) and Agresti (2002)
for details. Moreover, the different meanings of the models from an applied point of view are widely discussed in Chapters 2 and 3 of Fienberg (1980).
1.1.2 Maximum likelihood estimates
We start by considering the problem of maximum likelihood estimation for the
Poisson model. The maximum likelihood estimate µ̂ of µ is the value such that
\[ \log L(f, \hat{\mu}) = \sup_{\mu \in M} \log L(f, \mu) . \qquad (1.12) \]
We briefly recall the basic theorems about the existence and the uniqueness of the maximum likelihood estimate.
Theorem 1.6 If a maximum likelihood estimate µ̂ exists, then it is unique
and satisfies the equation
\[ P_M \hat{m} = P_M f . \qquad (1.13) \]
Conversely, if for some µ̂ ∈ M with m̂ = exp(µ̂) Equation (1.13) is satisfied, then µ̂ is the maximum likelihood estimate of µ.
Theorem 1.7 A necessary and sufficient condition for the maximum likelihood estimate µ̂ of µ to exist is that there exists δ ∈ M^⊥ such that f(x) + δ(x) > 0 for all x ∈ X.
Corollary 1.8 If f (x) > 0 for all x ∈ X , then the maximum likelihood estimate µ̂ exists.
Theorem 1.9 A necessary and sufficient condition for the maximum likelihood estimate µ̂ to exist is that there does not exist µ ∈ M with µ ≠ 0, µ(x) ≤ 0 for all x ∈ X, and ⟨f, µ⟩ = 0.
For the multinomial model, the relevant theorem is the following, which
states that the maximum likelihood estimate is the same under the Poisson
model and the multinomial model.
Theorem 1.10 If µ̂(m) is the maximum likelihood estimate for a multinomial
model with µ ∈ M and if µ̂ is the corresponding maximum likelihood estimate
for a Poisson model with µ ∈ M , then µ̂(m) = µ̂, in the sense that when one
side of the equation exists, then the other side exists and the two sides are
equal.
For the proof of all the above theorems, see Haberman (1974), pages 35-41.
In view of this theorem, the necessary and sufficient conditions stated for
the Poisson model also apply to the multinomial model.
In many cases, the maximum likelihood estimate cannot be computed directly, and we resort to numerical approximations. Among the numerical methods, the most important in this field are the Newton-Raphson algorithm and the Iterative Proportional Fitting algorithm. Details on such methods can be found in Haberman (1974) and Agresti (2002). We will come back later to the numerical computation of the maximum likelihood estimate.
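As a minimal sketch of the second method, the following Python code runs Iterative Proportional Fitting for the two-way independence model, alternately rescaling the fitted table to match the observed row and column margins; for this simple model a constant start reaches the closed form m(i,j) = f_{i+} f_{+j}/N of Equation (1.3) after one full cycle (the table and the number of cycles are illustrative):

```python
import numpy as np

f = np.array([[10., 20.], [30., 40.]])   # observed two-way table (illustrative)
m = np.ones_like(f)                      # constant starting table

for _ in range(5):                       # a few IPF cycles; one already suffices here
    m *= (f.sum(axis=1) / m.sum(axis=1))[:, None]   # match row totals
    m *= (f.sum(axis=0) / m.sum(axis=0))[None, :]   # match column totals

closed_form = np.outer(f.sum(axis=1), f.sum(axis=0)) / f.sum()
print(np.allclose(m, closed_form))  # True
```

For more complex log-linear models, without a closed-form solution, the same margin-matching cycle is iterated to convergence.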
Now, we analyze the asymptotic behavior of the maximum likelihood estimate.
Suppose that (F^{(N)})_{N>0} is a sequence of random vectors with means
(m(N ) )N >0 . We denote by (µ(N ) )N >0 the sequence of the logarithms of the
means. Using the same notation as above, the main asymptotic results are
summarized in the following theorem.
Theorem 1.11 Let m^* = lim_{N→+∞} N^{-1} m^{(N)}, assume m^* > 0, and let µ^* = log m^*. The following relations hold as N goes to +∞:
\[ \hat{\mu}^{(N)} - \mu^{(N)} \xrightarrow{p} 0 , \qquad (1.14) \]
\[ N^{-1} \hat{m}^{(N)} \xrightarrow{p} m^{*} , \qquad (1.15) \]
\[ N^{1/2} \left( \hat{\mu}^{(N)} - \mu^{(N)} \right) \xrightarrow{d}
   N \left( 0, \left( P^{*}_{M}(\mu^{*}) - P^{*}_{M_1}(\mu^{*}) \right) D^{-1}(\mu^{*}) \right) , \qquad (1.16) \]
\[ N^{1/2} \left( \hat{m}^{(N)} - m^{(N)} \right) \xrightarrow{d}
   N \left( 0, D(\mu^{*}) \left( P^{*}_{M}(\mu^{*}) - P^{*}_{M_1}(\mu^{*}) \right) \right) . \qquad (1.17) \]
For the proof, see Agresti (2002), page 584.
The previous result, and in particular Equations (1.16) and (1.17), allows us to compute asymptotic confidence intervals for any linear combination of the components of µ and m, and simple hypothesis tests for such quantities, applying the theory described in Lehmann (1983) and Lehmann (1987).
Remark 1.12 For historical reasons, the log-linear models are usually expressed in terms of the expected cell counts rather than in terms of the cell probabilities. However, the representation of our models in terms of the cell probabilities follows by noting that in our notation f is the vector of the observed frequencies, while N^{-1} f is the vector of the empirical probabilities. Thus, m^{(N)} represents the vector of the expected frequencies, and N^{-1} m^{(N)} represents the vector of the cell probabilities, as does m^*. Sometimes in the thesis we will refer to the vector of the cell probabilities as p = (p(x) : x ∈ X).
1.1.3 Goodness of fit tests
Hypothesis tests for log-linear models are generally based on test statistics
which have an asymptotic chi-squared distribution under the null hypothesis.
We consider here tests of the form H0 : µ ∈ M0 ⊂ M1 against the alternative
hypothesis µ ∈ M1 . Such tests are commonly called goodness of fit tests, in
the sense that they test if a smaller model M0 can replace a bigger model
M1 without a significant loss of information. Moreover, in many situations
these tests give some information about the stochastic relations between the
variables. The goodness of fit tests are based on the Pearson test statistic
\[ C(\hat{\mu}^{(1)}, \hat{\mu}^{(0)}) = \sum_{x \in \mathcal{X}}
   \frac{ \left( \hat{m}^{(1)}(x) - \hat{m}^{(0)}(x) \right)^2 }{ \hat{m}^{(0)}(x) } \qquad (1.18) \]
or on the log-likelihood ratio test statistic
\[ \Delta(\hat{\mu}^{(1)}, \hat{\mu}^{(0)}) = \log L(f, \hat{\mu}^{(0)}) - \log L(f, \hat{\mu}^{(1)}) . \qquad (1.19) \]
Suppose that M0 has dimension p0 and M1 has dimension p1. The relevant result about hypothesis testing is the following.
Theorem 1.13 Consider a sequence of null hypotheses H0 : µ^{(N)} ∈ M0 and alternatives H1 : µ^{(N)} ∈ M1. Suppose that µ̂^{(N,1)} and µ̂^{(N,0)} are the maximum likelihood estimates of µ^{(N)} under H1 and H0, respectively. Then under H0 the statistics −2∆ and C are asymptotically equivalent and
\[ \lim_{N \to +\infty} P[ -2\Delta(\hat{\mu}^{(N,1)}, \hat{\mu}^{(N,0)}) > c^{*}(\alpha) ]
   = \lim_{N \to +\infty} P[ C(\hat{\mu}^{(N,1)}, \hat{\mu}^{(N,0)}) > c^{*}(\alpha) ] = \alpha , \qquad (1.20) \]
where c^{*}(α) is the upper α-quantile of the chi-square distribution with p1 − p0 degrees of freedom.
For the proof, see Bishop et al. (1975), page 514.
If M1 is the saturated model, we can apply the fact that m̂^{(1)} = f, and the test statistics in Equations (1.18) and (1.19) assume their most familiar expressions
\[ C(f, \hat{\mu}^{(0)}) = \sum_{x \in \mathcal{X}}
   \frac{ \left( f(x) - \hat{m}^{(0)}(x) \right)^2 }{ \hat{m}^{(0)}(x) } \qquad (1.21) \]
and
\[ \Delta(f, \hat{\mu}^{(0)}) = \sum_{x \in \mathcal{X}}
   f(x) \log \frac{ \hat{m}^{(0)}(x) }{ f(x) } . \qquad (1.22) \]
A detailed discussion of these test statistics can be found in Dobson (1990) or Haberman (1978). The small sample properties of the asymptotic tests are presented in Fienberg (1980), Appendix IV. We will consider the small sample properties in detail later in the thesis.
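As a small numerical illustration (with an arbitrary 2 × 2 table), the following sketch computes the Pearson statistic (1.21) and the quantity −2∆ obtained from (1.22), commonly denoted G², for the test of independence against the saturated model; here m̂^{(0)} is the closed-form independence fit m_{i+} m_{+j}/N of Equation (1.3):

```python
import numpy as np

f = np.array([[25., 15.], [10., 30.]])             # observed table (illustrative)
N = f.sum()
m0 = np.outer(f.sum(axis=1), f.sum(axis=0)) / N    # independence fit, Equation (1.3)

C = ((f - m0) ** 2 / m0).sum()                     # Pearson statistic (1.21)
G2 = 2 * (f * np.log(f / m0)).sum()                # -2*Delta from (1.22)
df = (2 - 1) * (2 - 1)                             # p1 - p0 degrees of freedom
print(round(C, 3), round(G2, 3), df)               # → 11.429 11.738 1
```

As Theorem 1.13 suggests, the two statistics take close values, and both are compared with the chi-square distribution with p1 − p0 degrees of freedom.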
Remark 1.14 In applications, the goodness of fit tests are performed also in cases where the maximum likelihood estimate does not exist. In such cases we can use the notion of extended maximum likelihood estimate; see Appendix B of Haberman (1974) for details.
1.1.4 An important remark
As we have seen in Example 1.2, the vector space theory leads to a canonical representation of the log-linear models in the form
\[ \log m = A \Lambda , \qquad (1.23) \]
where A is the model matrix and Λ is the vector of the model parameters. In many cases, see for example Haberman (1978), the matrix A has non-negative integer entries and the model can be written in the form
\[ \log m(i) = \sum_{h=1}^{s} A(h, i) \, \lambda(h) . \qquad (1.24) \]
This is the classical representation of the log-linear models, and it allows a simple interpretation of the statistical relations among the variables; see Fienberg (1980), Fingleton (1984) and Agresti (1996). We present here a detailed analysis of the independence model for two categorical variables. This model is simple enough to be analyzed in a few pages in full detail, and it contains all the relevant algebraic objects we have pointed out in the previous discussion. More complex models, involving a greater number of categorical variables and different stochastic relations among the variables, will be presented in the next chapters.
Example 1.15 Consider the cross-classification of two categorical random
variables X and Y with supports {1, . . . , I} and {1, . . . , J}, respectively.
The saturated model is written in the form
\[ \log m(i,j) = \lambda + \lambda^X_i + \lambda^Y_j + \lambda^{XY}_{ij} \]
with the constraints
\[ \sum_{i=1}^{I} \lambda^X_i = \sum_{j=1}^{J} \lambda^Y_j
   = \sum_{i=1}^{I} \lambda^{XY}_{ij} = \sum_{j=1}^{J} \lambda^{XY}_{ij} = 0 , \]
while the independence model, where we impose no interaction between the variables, assumes the form
\[ \log m(i,j) = \lambda + \lambda^X_i + \lambda^Y_j \]
with the constraints
\[ \sum_{i=1}^{I} \lambda^X_i = \sum_{j=1}^{J} \lambda^Y_j = 0 . \]
This representation follows immediately from the vector space representation presented in Example 1.2. Note that a vector space basis is easy to find in the two-way case, while in the multi-way case some difficulties arise.

Of great interest from our point of view is the multiplicative form of the log-linear model, obtained by exponentiating the log-linear form. For the saturated model we obtain the expression
\[ m(i,j) = \zeta_0 \, \zeta^X_i \, \zeta^Y_j \, \zeta^{XY}_{ij} \qquad (1.25) \]
and for the independence model we can write
\[ m(i,j) = \zeta_0 \, \zeta^X_i \, \zeta^Y_j , \qquad (1.26) \]
where the ζ parameters satisfy appropriate constraints. Note that our notation is coherent with the notation in Chapter 6 of Pistone et al. (2001a).
Manipulating Equation (1.26), it is easy to check that under the independence model the means m(i, j) must satisfy the relations
\[ m(i_1, j_1)\, m(i_2, j_2) - m(i_1, j_2)\, m(i_2, j_1) = 0 \qquad (1.27) \]
for all i_1, i_2 ∈ {1, ..., I} and j_1, j_2 ∈ {1, ..., J}. Of course, the same relations hold for the cell probabilities p(i, j). Thus, the independence model is represented by the zero set of the polynomial system formed by the binomials in Equation (1.27). We will see in the next section the correspondence between algebraic restrictions of this form and geometric restrictions on the space of the mean parameters, or equivalently on the space of the cell probabilities.
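A direct check of the relations (1.27), with arbitrarily chosen positive ζ parameters in the multiplicative form (1.26), can be sketched as follows:

```python
import itertools

I, J = 3, 4
zeta0 = 2.0
zx = [0.5, 1.0, 1.5]            # zeta^X_i, arbitrary positive values
zy = [0.2, 0.4, 0.8, 1.6]       # zeta^Y_j, arbitrary positive values

# Means in the multiplicative form (1.26)
m = [[zeta0 * zx[i] * zy[j] for j in range(J)] for i in range(I)]

# Every 2x2 minor of the table of means vanishes, as in Equation (1.27)
ok = all(
    abs(m[i1][j1] * m[i2][j2] - m[i1][j2] * m[i2][j1]) < 1e-9
    for i1, i2 in itertools.combinations(range(I), 2)
    for j1, j2 in itertools.combinations(range(J), 2)
)
print(ok)  # True
```

The binomials in (1.27) are precisely the 2 × 2 minors of the table of means, which is the prototype of the toric ideals studied in Section 1.2.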
The multiplicative form of the log-linear models has been widely studied
in Goodman (1979a), Goodman (1979b) and Goodman (1985). We will see in
the next sections the relevance of this representation.
1.1.5 Exact methods
Exact statistics can be useful in situations where the asymptotic assumptions are not met, so that the asymptotic p-values are not close approximations of the true p-values. Standard asymptotic methods involve the assumption that the test statistic follows a particular distribution when the sample size is sufficiently large. When the sample size is not large, asymptotic results may not be valid, with the asymptotic p-values differing substantially from the exact p-values. Asymptotic results may also be unreliable when the distribution of
the data is sparse. Examples can be found in Agresti (1996) and Bishop et al.
(1975). We will also see some examples in our framework later in the thesis.
Exact computations are based on the statistical theory of exact conditional
inference for contingency tables, reviewed by Agresti (2001).
Following such exact conditional methods, one computes the exact distribution of the test statistic given the sufficient statistic. These procedures often arise for independence and goodness of fit tests, as well as in the construction of uniformly most powerful tests and accurate confidence interval computations through the Rao-Blackwell theorem; see Lehmann (1987). Thus, the analysis is restricted to the set of tables with the same value t of the sufficient statistic T_N as the observed table:
\[ \mathcal{F}_t = \{ f : \mathcal{X} \longrightarrow \mathbb{N} \; : \; T_N(f) = t \} . \qquad (1.28) \]
This set is often called the reference set. Throughout the thesis, we will assume that F_t is a finite set.
In the literature there exist many exact algorithms for a wide number of exact tests, such as the Pearson chi-square, the likelihood-ratio chi-square, the Mantel-Haenszel chi-square, Fisher's exact test and McNemar's test, but in general such algorithms are limited to two-way contingency tables. A number of exact tests for contingency tables are presented, for example, in Agresti (2001).
These algorithms are also implemented in statistical software; see for example SAS/STAT User's Guide (2000). These exact algorithms compute exact p-values for general I × J tables using the network algorithm developed by Mehta & Patel (1983). This algorithm provides a substantial advantage over direct enumeration of the reference set F_t, which can be very time-consuming and is feasible only for small problems. Refer to Agresti (1992) for a review of algorithms for the computation of exact p-values, and to Mehta et al. (1991) for information on the performance of the network algorithm.
Limiting ourselves to the basic steps of the algorithm, the network algorithm proceeds as follows. Corresponding to the reference set Ft , the algorithm
forms a directed acyclic network consisting of nodes in a number of stages. A
path through the network corresponds to a distinct table in the reference set.
The distances between nodes are defined so that the total distance of a path
through the network is the corresponding value of the test statistic. At each
node, the algorithm computes the shortest and longest path distances for all
the paths that pass through that node. The exact computation of these quan-
1.1. LOG-LINEAR MODELS
19
tities for statistics which are linear combinations of the cells counts is presented
in Agresti et al. (1990). For statistics of other forms, the computation of an
upper bound for the longest path and a lower bound for the shortest path
are needed. The longest and shortest path distances or bounds for a node
are compared to the value of the test statistic to determine whether all paths
through the node contribute to the p−value, none of the paths through the
node contribute to the p−value, or neither of these situations occurs. If all
paths through the node contribute, the p−value is incremented accordingly,
and these paths are eliminated from further analysis. If no paths contribute,
these paths are eliminated from the analysis. Otherwise, the algorithm continues, still processing this node and the associated paths. The algorithm finishes
when every node has been either accounted for, incrementing the p−value accordingly, or eliminated.
For each possible table, the algorithm compares its test statistic value with
the corresponding value for the observed table. If the value for the table is
greater than or equal to the observed test statistic value, it increments the
exact p−value by the probability of that table, which is calculated under the
null hypothesis using the multinomial frequency distribution.
For several tests encountered in the analysis of categorical data, the test
statistic is nonnegative, and large values of the test statistic indicate a departure from the null hypothesis. Such tests include the Pearson chi-square, the
likelihood-ratio chi-square, and others, such as the Mantel-Haenszel chi-square, Fisher’s exact test for tables larger than 2 × 2, and McNemar’s test.
The exact p−value for these tests is the sum of probabilities for those tables
having a test statistic greater than or equal to the value of the observed test
statistic.
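As a minimal illustration of this computation (our own sketch, not part of the algorithms cited above), the following Python function computes an exact one-sided p−value for a 2 × 2 table by direct enumeration of the reference set; here the first cell count plays the role of the test statistic, and each table's probability is hypergeometric.

```python
from math import comb

def exact_pvalue_2x2(a, b, c, d):
    """One-sided exact p-value for the 2x2 table [[a, b], [c, d]]:
    enumerate all tables with the same row and column totals and sum
    the hypergeometric probabilities of those at least as extreme as
    the observed one (here: first cell count >= a)."""
    r1, r2 = a + b, c + d          # row totals
    c1 = a + c                     # first column total
    denom = comb(r1 + r2, c1)      # normalizing constant
    p = 0.0
    # x runs over the admissible values of the first cell
    for x in range(max(0, c1 - r2), min(r1, c1) + 1):
        if x >= a:                 # as or more extreme than observed
            p += comb(r1, x) * comb(r2, c1 - x) / denom
    return p
```

For Fisher's classical tea-tasting table [[3, 1], [1, 3]] this gives 17/70 ≈ 0.243; the loop visits the whole reference set, which is feasible only because the table is tiny.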
These recently developed algorithms for two-way tables, together with improvements in computer power, now make it feasible to perform exact computations for data sets where previously only asymptotic methods could be
applied. Nevertheless, there are still large problems that may require a prohibitive amount of time and memory for exact computations.
As noted by Diaconis & Sturmfels (1998), the reference set Ft is very
large, and thus difficult to enumerate, even for very small data sets where the
asymptotic approximation is highly inadequate.
Recently, a software package has been developed in order to compute efficiently
the cardinality of the reference set Ft also in non-trivial cases. This software, called LattE, an acronym for “Lattice point Enumeration”, is presented in
De Loera et al. (2003) and uses non-standard algebraic results about convex
polytopes. For example, using this software, the authors are able to compute the cardinality of the reference set Ft for Table 1.1 with the two-dimensional
marginal totals as components of the sufficient statistic, concluding that this
cardinality is equal to 441, while for a table whose two-dimensional marginal
totals are displayed in Table 1.2, the cardinality of the reference set is about
2.2498 × 10^40 .
        X2                         X2
X1     1    2    3       X1      1    2    3
1     96   72  161       1     186  127   51
2     10    7    6       2      11    7    3
3      1    1    2       3       0    1    0
       X3 = 1                   X3 = 2

Table 1.1: A 3 × 3 × 2 example.
        X2
X1         1         2         3
1     164424    324745    127239
2     262784    601074   9369116
3     149654   7618489   1736281

        X3
X1         1         2         3
1     163445     49395    403568
2    1151824    767866   8313284
3   16099500   6331023   1563901

        X3
X2         1         2         3
1     184032    123585    269245
2     886393   6722333    935582
3    1854344    302366   9075926

Table 1.2: Two-dimensional marginal totals for a 3 × 3 × 3 example.
As the software LattE is currently under development, only a demo version
is available to us; for this reason we present examples taken from De Loera
et al. (2003) instead of new ones.
However, it is worth noting that for the statistical applications, it is often
sufficient to have an approximation of the cardinality of Ft .
When asymptotic methods may not be sufficient for such large problems,
we can use Monte Carlo estimation of the exact p−values. No formula exists
that can predict in advance how much time and memory are needed
to compute an exact p−value for a given problem. The time and memory
required depend on several factors, including which test is being performed,
the total sample size, the number of rows and columns, and the specific arrangement of the observations into table cells. Moreover, for a fixed sample
size, time and memory requirements tend to increase as the number of rows
and columns increases, since this corresponds to an increase in the number
of tables in the reference set. Also for a fixed sample size, time and memory
requirements increase as the marginal row and column totals become more
homogeneous, see Agresti et al. (1990) for details.
1.2 Algebraic theory of toric ideals
Following the description of the log-linear models in Section 1.1, we remark
that binomials and power products play a fundamental role in the formal
representation of the log-linear models.
Experts in Commutative Algebra have extensively studied the algebraic
relations among power products, leading to the theory of toric ideals. In this section
we briefly review the theory of toric ideals, with particular emphasis on the
results most relevant to the statistical applications. Moreover, we present
some recent computational techniques which make the computation of toric
ideals feasible.
In the following, some basic notions of Computational Commutative Algebra are needed. In particular, we will use the notions of polynomial ideal,
Gröbner basis, elimination ideal, and saturation. The reader can refer to Kreuzer
& Robbiano (2000) or Cox et al. (1992). Many of the applications
of the theory of toric ideals in Probability and Statistics come from the fundamental work by Sturmfels (1996).
For the theory of toric ideals we mainly refer to the papers Bigatti &
Robbiano (2001) and Bigatti et al. (1999). The computations in the examples
are obtained with the free software CoCoA, see Capani et al. (2000).
1.2.1 Definitions and first properties
Let K be a numeric field and let K[y1 , . . . , ys ] and K[ξ1 , . . . , ξq ] be the polynomial rings in the indeterminates y1 , . . . , ys and ξ1 , . . . , ξq , respectively. Recall
that a term in a polynomial ring is a power product and that a binomial is a
difference of terms. We denote by Ts the set of all terms in s indeterminates.
Definition 1.16 Let t1 , . . . , tq be terms in K[y1 , . . . , ys ]. The toric ideal I =
I(t1 , . . . , tq ) ⊆ K[ξ1 , . . . , ξq ] associated to the set {t1 , . . . , tq } is the ideal of all
polynomials p such that p(t1 , . . . , tq ) = 0.
Note that different sets of power products can produce the same toric ideal.
This remark will be analyzed from the point of view of probability models later
in the thesis.
In the following, we use the multi-index notation, i.e. y d means y1d1 · · · ysds .
Given D = {d1 , . . . , dq } ⊂ Ns such that y di = ti for i = 1, . . . , q, consider the
semigroup homomorphism
T : Nq −→ Ns ,   ei 7−→ di .   (1.29)
We can associate to this map a ring homomorphism
π : K[ξ1 , . . . , ξq ] −→ K[y1 , . . . , ys ] ,   ξi 7−→ ti .   (1.30)
With these definitions, we refer to the toric ideal I as the toric ideal associated to T or the toric ideal associated to π. It is worth noting that the toric
ideal I is the kernel of the ring homomorphism π.
Note that we have π(ξ u ) = y v = y T (u) , with u ∈ Nq and v ∈ Ns such that
v = u1 d1 + · · · + uq dq .
The following results collect the main properties of toric ideals that are of
statistical interest.
Proposition 1.17 Let T : Nq −→ Ns be a semigroup homomorphism and
let us denote by π : K[ξ1 , . . . , ξq ] −→ K[y1 , . . . , ys ] the corresponding ring
homomorphism. The toric ideal I associated to π is generated, as vector space,
by the binomials of the set
{ξ u − ξ v : T (u) = T (v)} .
(1.31)
Proof. Choose a term-ordering τ on Tq and suppose there exists a polynomial g in I which cannot be written as a linear combination of polynomials in
{ξ u −ξ v : T (u) = T (v)}. Choose g 6= 0 such that the leading term LT (g) = ξ a
is minimal with respect to τ .
As g ∈ ker(π), we have g(t1 , . . . , tq ) = 0. In particular, the term y T (a)
must be eliminated and in g there is another monomial ξ b ≺τ ξ a such that
T (b) = T (a).
Now, the polynomial g 0 = g − ξ a + ξ b is non-zero, LT (g 0 ) ≺τ LT (g), and it
cannot be written as a linear combination of polynomials in {ξ u − ξ v : T (u) =
T (v)}. This is a contradiction, because we supposed LT (g) minimal.
Now we define
T̃ : Zq −→ Zs   (1.32)
the extension of T to the group Zq . As any u ∈ Zq can be written in the
form u = u+ − u− , where u+ and u− are the positive and negative parts of u
respectively, the above result together with the standard theory of Gröbner
bases leads to the following proposition.
Proposition 1.18 For any term-ordering τ on Tq , there exists a finite set of
vectors Gτ ⊂ ker(T̃ ) such that the reduced Gröbner basis of I with respect to τ
is
{ξ u+ − ξ u− : u ∈ Gτ } .   (1.33)
Proof. The result follows from Proposition 1.17 and from standard results
about Gröbner bases.
Theorem 1.19 Let t1 , . . . , tq be terms in the indeterminates y1 , . . . , ys and let
J be the ideal in K[ξ1 , . . . , ξq , y1 , . . . , ys ] generated by {ξ1 − t1 , . . . , ξq − tq }. We
have that
1. I = J ∩ K[ξ1 , . . . , ξq ];
2. if Gτ is a Gröbner basis of J with respect to an elimination ordering τ for
{y1 , . . . , ys }, then I is generated by the polynomials in Gτ ∩ K[ξ1 , . . . , ξq ].
For the proof, see Bigatti & Robbiano (2001), Proposition 1.4.
Moreover, nice properties of toric ideals are summarized in the following
theorem. We will see later in the thesis the relevance of these properties.
Theorem 1.20 In the previous setting, consider any grading on the polynomial ring K[ξ1 , . . . , ξq , y1 , . . . , ys ] such that the degrees of the yj are arbitrary
integers and deg(ξi ) = deg(ti ), i = 1, . . . , q. Then
1. the ideal I is prime;
2. the ideal I is generated by pure binomials, i.e.
I = Ideal(t1 − t2 : π(t1 ) = π(t2 ), gcd(t1 , t2 ) = 1) ;
(1.34)
3. the ideal I is homogeneous.
For the proof, see Bigatti & Robbiano (2001), Theorem 1.6.
The variety Variety(I) is the geometric counterpart of the ideal I. A
point (ξ1 , . . . , ξq ) belongs to Variety(I) if and only if p(ξ1 , . . . , ξq ) = 0 for all
polynomials p ∈ I. This condition can be verified by making use of a system of
generators of the ideal I. For details about the relationships between ideals
and varieties, see for example Cox et al. (1992).
1.2.2 Computation of toric ideals. First method
A simple algorithm to compute the Gröbner basis of the toric ideal I =
I(t1 , . . . , tq ) is based on the elimination algorithm, as described in Theorem
1.19.
We first consider the homomorphism π : K[ξ1 , . . . , ξq ] −→ K[y1 , . . . , ys ]
defined by ξi 7−→ ti ; then, we consider the ideal J in K[ξ1 , . . . , ξq , y1 , . . . , ys ]
generated by the set of polynomials
{ξ1 − t1 , . . . , ξq − tq } .
(1.35)
Using the results from Theorem 1.19, we have that a (reduced) Gröbner basis
of I is obtained by computing a (reduced) Gröbner basis of J with respect to
an elimination term-ordering for y1 , . . . , ys and keeping only the polynomials
not involving the y’s.
Example 1.21 Consider the polynomial ring K[y1 , . . . , y5 ] and the following
set of six terms:
{y1 y4 , y1 y5 , y2 y4 , y2 y5 , y3 y4 , y3 y5 }
(1.36)
We will consider this example later in the thesis, viewed from a statistical
point of view.
In order to compute the toric ideal I = I(y1 y4 , y1 y5 , y2 y4 , y2 y5 , y3 y4 , y3 y5 )
we introduce the polynomial ring K[ξ1 , . . . , ξ6 ] and we define the ring homomorphism π as described in Equation (1.30). The toric ideal of interest I is
the kernel of π. To compute this kernel, we define the ideal
J = Ideal(ξ1 − y1 y4 , ξ2 − y1 y5 , ξ3 − y2 y4 , ξ4 − y2 y5 , ξ5 − y3 y4 , ξ6 − y3 y5 )
in the polynomial ring K[ξ1 , . . . , ξ6 , y1 , . . . , y5 ]. Using CoCoA, the Gröbner
basis of this ideal with respect to an elimination term ordering for the indeterminates y1 , . . . , y5 is
G(J ) ={−y3 y5 + ξ6 , −y3 y4 + ξ5 , −y2 y5 + ξ4 , −y2 y4 + ξ3 ,
− y1 y5 + ξ2 , −y1 y4 + ξ1 , ξ5 y5 − ξ6 y4 , ξ3 y5 − ξ4 y4 , ξ1 y5 − ξ2 y4 ,
ξ4 y3 − ξ6 y2 , ξ2 y3 − ξ6 y1 , ξ2 y2 − ξ4 y1 , ξ3 y3 − ξ5 y2 , ξ1 y3 − ξ5 y1 ,
ξ1 y2 − ξ3 y1 , ξ4 ξ5 − ξ3 ξ6 , ξ2 ξ5 − ξ1 ξ6 , ξ2 ξ3 − ξ1 ξ4 }
and the only polynomials which do not involve the y’s are the last three.
Thus
I = Ideal(ξ4 ξ5 − ξ3 ξ6 , ξ2 ξ5 − ξ1 ξ6 , ξ2 ξ3 − ξ1 ξ4 )
(1.37)
and the Gröbner basis of the relevant toric ideal consists of three binomials.
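As a quick numerical sanity check (our own, not from the thesis), the three binomials indeed vanish under the substitution ξi 7−→ ti , which is the defining property of the toric ideal in Definition 1.16:

```python
# Evaluate the terms t1, ..., t6 at arbitrary values of y1, ..., y5 and
# check that each generator of I vanishes, i.e. lies in ker(pi).
y1, y2, y3, y4, y5 = 2, 3, 5, 7, 11              # arbitrary test values
t = [y1*y4, y1*y5, y2*y4, y2*y5, y3*y4, y3*y5]   # values of xi_1, ..., xi_6

generators = [
    lambda x: x[3]*x[4] - x[2]*x[5],   # xi4*xi5 - xi3*xi6
    lambda x: x[1]*x[4] - x[0]*x[5],   # xi2*xi5 - xi1*xi6
    lambda x: x[1]*x[2] - x[0]*x[3],   # xi2*xi3 - xi1*xi4
]
values = [g(t) for g in generators]
print(values)  # [0, 0, 0]
```

Of course, vanishing at one point does not prove membership in the kernel; this is only a cheap consistency check of the symbolic computation performed with CoCoA.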
This first method, although intuitive, is not the best from the computational point of view. Better methods for computing toric ideals can be found in
Bigatti et al. (1999). These methods are based on the theory of saturation.
We will illustrate such methods in Section 1.2.4.
1.2.3 Navigating inside the set of solutions of an integer linear system
The saturation of an ideal with respect to a polynomial is given by the following construction. Consider a polynomial p in K[ξ1 , . . . , ξr ] and an ideal
B ⊂ K[ξ1 , . . . , ξr ]. The ideal I is the saturation of B with respect to p if
I = Elim(v, B + Ideal(pv − 1))
(1.38)
where v is a new indeterminate and the computations are made in the polynomial ring K[ξ1 , . . . , ξr , v].
Let t1 , . . . , tq be power products in the polynomial ring K[y1 , . . . , ys ], i.e.
ti = y1a1i · · · ysasi ,   i = 1, . . . , q .   (1.39)
We can define the matrix A ∈ Matq×s (N) associated to the power products.
The matrix A has non-negative entries. Conversely, any matrix A defines a
set of power products using Equation (1.39).
Definition 1.22 Let A ∈ Matq×s (N). We define the toric ideal associated to
A as the toric ideal IA = I(t1 , . . . , tq ), where tq is defined in Equation (1.39).
Now, we investigate the relation between the toric ideal IA and the kernel
of A viewed as Z-module. First, we need some technical definitions.
Consider the homogeneous system of Diophantine equations associated to
A:

a11 z1 + · · · + a1q zq = 0
···
as1 z1 + · · · + asq zq = 0   (1.40)
where aij is the (i, j)-th element of the matrix A. The set of integer solutions
ker(A) of such a system is a Z-module. In the sequel, we will use the decomposition of a vector into its positive and negative parts, that is v = v + − v − .
Now, let Bin(K[ξ1 , . . . , ξq ]) be the set of all binomials in K[ξ1 , . . . , ξq ] and
define the maps ρ0 : Zq −→ K[ξ1 , . . . , ξq ] and η 0 : Bin(K[ξ1 , . . . , ξq ]) −→ Zq as
ρ0 (u) = ξ u+ − ξ u−   (1.41)
and
η 0 (ξ u − ξ v ) = u − v .   (1.42)
Definition 1.23 Let B = {v1 , . . . , vr } be a set of vectors. If B generates
ker(A) as Z-module, the ideal I(B) = I(ρ0 (v1 ), . . . , ρ0 (vr )) is called a lattice
ideal associated to ker(A).
Let π = ξ1 · · · ξq be the product of all the ξ indeterminates. The main
theorem about the application of the saturation for computing toric ideals is
the following.
Theorem 1.24 Let B = {v1 , . . . , vr } ⊆ ker(A). The following conditions are
equivalent.
a) I(B) : π ∞ = I(A);
b) I(B)Pπ = I(A)Pπ ;
c) B is a set of generators of ker(A), i.e. I(B) is a lattice ideal.
For the proof, see Bigatti & Robbiano (2001), Theorem 2.10. For further
details about Theorem 1.24, see also Sturmfels (1996).
Now, let A ∈ Matq×s (N) and let b = (b1 , . . . , bs ) be a vector in Ns . We
want to find the non-negative solutions of the system

a11 z1 + · · · + a1q zq = b1
···
as1 z1 + · · · + asq zq = bs   (1.43)
The following result easily leads to a method for navigating inside the set
of solutions of a Diophantine system.
Theorem 1.25 Let S be a Diophantine system as defined in Equation (1.43).
Let ti , i = 1, . . . , q be the power products associated to the matrix A and
t = y1b1 · · · ysbs . Moreover, let (α1 , . . . , αq ) ∈ Nq . Let J be the ideal in the
polynomial ring K[ξ1 , . . . , ξq , y1 , . . . , ys ] generated by the set of binomials {ξ1 −
t1 , . . . , ξq − tq }. The following conditions are equivalent.
a) The vector (α1 , . . . , αq ) is a solution of S.
b) There is an equality of power products t1α1 · · · tqαq = y1b1 · · · ysbs .
c) The binomial y1b1 · · · ysbs − ξ1α1 · · · ξqαq is in J .
For the proof, see Bigatti & Robbiano (2001), Proposition 3.2.
We will see an important example of this kind of navigation in the following
chapter, in the framework of contingency tables.
1.2.4 Computation of toric ideals. Second method
In view of the results in Section 1.2.3, and in particular Theorem 1.24, we can
compute the toric ideal with the following steps:
• consider the matrix representation A of the power products;
• compute a lattice basis B of ker(A) and then the lattice ideal I(B);
• compute the saturation of I(B) with respect to the polynomial π = ξ1 · · · ξq .
Example 1.26 Consider again the problem of Example 1.21. Now we apply
the saturation-based algorithm.
The matrix representation of the power products in Equation (1.36) is

        1 0 0 1 0
        1 0 0 0 1
        0 1 0 1 0
A =     0 1 0 0 1          (1.44)
        0 0 1 1 0
        0 0 1 0 1
A lattice basis of ker(A) is formed by the two vectors
(1, −1, −1, 1, 0, 0) and (1, −1, 0, 0, −1, 1) .
Thus, the lattice ideal I(B) is Ideal(ξ1 ξ4 − ξ2 ξ3 , ξ1 ξ6 − ξ2 ξ5 ). The ideal J =
I(B) + (πv − 1) is generated by the polynomials
G(J ) ={−ξ2 ξ5 + ξ1 ξ6 , −ξ2 ξ3 + ξ1 ξ4 , −ξ12 ξ3 ξ4 ξ62 v + 1,
− ξ12 ξ32 ξ63 v + ξ5 , −ξ13 ξ42 ξ62 v + ξ2 , ξ4 ξ5 − ξ3 ξ6 }
(1.45)
and, eliminating the v indeterminate, we obtain the same result as in Example
1.21, but with a notable computational improvement.
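The first step, computing a basis of ker(A), only needs exact linear algebra. The following self-contained Python sketch (our own) verifies that the two vectors of the lattice basis above solve the homogeneous system associated to the matrix A of Equation (1.44), and that they span the whole kernel, since rank(A) = 4 and hence dim ker(A) = 6 − 4 = 2:

```python
from fractions import Fraction

# the matrix of Equation (1.44): one row per power product
A = [[1, 0, 0, 1, 0],
     [1, 0, 0, 0, 1],
     [0, 1, 0, 1, 0],
     [0, 1, 0, 0, 1],
     [0, 0, 1, 1, 0],
     [0, 0, 1, 0, 1]]

def in_kernel(A, z):
    """z solves the homogeneous system (1.40): the z-weighted sum of the
    rows of A is the zero vector."""
    return all(sum(zi * row[k] for zi, row in zip(z, A)) == 0
               for k in range(len(A[0])))

def rank(M):
    """Rank over Q via Gaussian elimination with exact fractions."""
    M = [[Fraction(x) for x in row] for row in M]
    r = 0
    for col in range(len(M[0])):
        piv = next((i for i in range(r, len(M)) if M[i][col] != 0), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        for i in range(len(M)):
            if i != r and M[i][col] != 0:
                factor = M[i][col] / M[r][col]
                M[i] = [a - factor * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

v1 = (1, -1, -1, 1, 0, 0)
v2 = (1, -1, 0, 0, -1, 1)
```

Since v1 and v2 are clearly linearly independent and the kernel has dimension 2, they indeed form a lattice-spanning set in the sense of Definition 1.23 (up to the usual caveat that a Q-basis of the kernel need not generate it over Z; here it does).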
We will consider again Examples 1.21 and 1.26, after the discussion of the
Diaconis-Sturmfels algorithm.
Note that the first step of the algorithm, i.e. the computation of a basis
of ker(A), can be carried out with standard linear algebra, with a notable
saving of time. A comparison of the different timings between the algorithms
introduced here can be found in Bigatti et al. (1999). The saturation-based
algorithm is implemented in the CoCoA function Toric, which computes a
system of generators of the toric ideal, starting from the matrix representation
of the power products. We will use CoCoA for all symbolic computations
presented in this thesis.
Chapter 2
The Diaconis-Sturmfels algorithm
In the present chapter we present an algorithm for sampling from conditional
distributions on finite sample spaces, first introduced by Diaconis & Sturmfels
(1998). While in the original paper by Diaconis & Sturmfels (1998) the attention is restricted to exponential models, in this chapter we show that the
algorithm can be applied to a more general class of probability models, namely
all probability models which satisfy a linearity property of the sufficient statistic.
In Section 2.1, we introduce the multivariate hypergeometric distribution,
showing how that distribution is essential in the problems of estimation and
hypothesis testing. In Section 2.2, we describe the algorithm, and we discuss
the relevance of the use of toric ideals. The algorithm is a Markov chain
Monte Carlo algorithm, based on the key notion of Markov basis. The links between the
theorems proving the validity of the algorithm and the theory of toric ideals
presented in Chapter 1 will be pointed out. In Section 2.3 we briefly present
some results about the rates of convergence of the Markov chain and in Section
2.4 we present two simple examples.
2.1 Hypergeometric distribution
This Section is a slight generalization of Diaconis & Sturmfels (1998), Section
1, where this theory is presented under the exponential model.
Let X be a finite set and let
P[x] = ϕ(T (x)) ,   x ∈ X ,   (2.1)
be a probability model with sufficient statistic T : X −→ Ns and such that the
distribution of a sample of independent and identically P-distributed random
variables X = (X1 , . . . , XN ) is of the form
P∗N = ψ(TN ) ,   where   TN = T (X1 ) + · · · + T (XN ) .   (2.2)
This means that the sufficient statistic of X is the sum of the sufficient statistics
of the one-dimensional random variables Xk , k = 1, . . . , N .
This model contains the usual exponential models of the form
Pθ [x] = c(θ)e<θ,T (x)> ,
but other models are included, such as those of the form
Pθ [(x, y)] = c(θ) eθx+θ²y .
We denote
Yt = { (x1 , . . . , xN ) : T (x1 ) + · · · + T (xN ) = t } ,   (2.3)
i.e., the set of all samples with fixed value t of the sufficient statistic TN . It is
known that the distribution of X given {TN = t} is uniform on Yt . In fact
P∗N [X = x | TN = t] = P∗N [x] / ∑{y : TN (y)=t} P∗N [y] = 1/#{x : TN (x) = t} = 1/#Yt
for all x = (x1 , . . . , xN ) ∈ Yt .
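This conditional uniformity is easy to check numerically in a toy case (our own illustration, not from the text): for iid Bernoulli variables with T (x) = x, every sequence with a given sum receives the same conditional probability, whatever the value of the parameter.

```python
from itertools import product

# X = {0, 1}, T(x) = x, N = 4: condition on the sufficient statistic
# T_N = sum of the sample.  The conditional distribution must be uniform
# on Y_t regardless of the (unknown) success probability p.
p, N, t = 0.3, 4, 2
prob = lambda x: (p ** sum(x)) * ((1 - p) ** (N - sum(x)))

fiber = [x for x in product((0, 1), repeat=N) if sum(x) == t]
total = sum(prob(x) for x in fiber)
cond = [prob(x) / total for x in fiber]
print(len(fiber))  # 6 sequences, each with conditional probability 1/6
```

Here #Yt = C(4, 2) = 6, and the conditional probabilities are all equal to 1/#Yt , as the derivation above predicts; repeating with a different p gives the same conditional distribution.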
As presented in the previous chapter, standard statistical techniques, such
as bootstrap methods, need uniformly distributed samples on Yt . Even in the
simplest cases, the set Yt is very large and difficult to enumerate, and it is not
easy to sample directly from Yt with classical Monte Carlo methods. Refer in
particular to the discussion in Chapter 1, Section 1.1.5.
As we are in the finite case, the problem can be reduced to a more suitable
form, in terms of counting. Then, we consider the reference set Ft introduced
in Chapter 1
Ft = { f : X −→ N : ∑x∈X f (x)T (x) = t } ,   (2.4)
i.e., the set of all frequency tables obtained from samples with value t of the
sufficient statistic TN .
The relation between Yt and Ft is given by the map F : Yt −→ Ft defined
by
F (x1 , . . . , xN ) = ∑x∈X ( ∑Nk=1 I{xk =x} ) ex ,   (2.5)
which associates to every sample the corresponding frequency table. Here
(ex )x∈X denotes the canonical basis of RX . Note that the additivity condition
(2.2), together with the conditions imposed in the definitions of Yt and Ft , is
essential for the following composition: the map F sends Yt onto Ft , and the
sufficient statistic TN on Yt factors through Ft , since TN = ( f 7−→ ∑x∈X f (x)T (x) ) ◦ F ,
with values in Ns .
The image probability of P∗N [ · | TN = t] induced by F is
Ht (f ) = P∗N [F −1 (f ) | TN = t] = #{x : F (x) = f } / #Yt ,   (2.6)
which is by definition the hypergeometric distribution on Ft . Simple computations show that
Ht (f ) = (N ! / #Yt ) ∏x∈X (f (x)!)−1 .   (2.7)
For further details on the hypergeometric distribution, see Bishop et al. (1975) and
Agresti (2002). However, the MCMC methods presented in this thesis do not need
the computation of #Yt , as will be clear in the next Section.
2.2 The algorithm
As pointed out in Chapter 1, in order to perform hypothesis testing for contingency tables, the usual approach is the asymptotic one, which involves chi-squared distributions; but in many cases, especially when the table is sparse,
the chi-squared approximation may not be adequate (for further details on
this topic see, for example, Appendix IV of Fienberg (1980)). We can obtain approximations of test statistics via Monte Carlo methods, drawing an iid
hypergeometric sample of contingency tables in the reference set Ft .
The problem is then reduced to sampling from the hypergeometric distribution on Ft , or equivalently from the uniform distribution on Yt . The literature suggests avoiding the enumeration problem via the use of Markov Chain
Monte Carlo (MCMC) methods. In particular, we are interested in the Metropolis–
Hastings algorithm, which is based on a set of moves for constructing the relevant Markov chain. A review of the Metropolis–Hastings algorithm can be
found in Chib & Greenberg (1995).
The following is a key definition in that theory.
Definition 2.1 A Markov basis of Ft is a set of functions m1 , . . . , mL : X −→
Z, called moves, such that for any 1 ≤ i ≤ L
∑x∈X mi (x)T (x) = 0 ,   (2.8)
where T is the sufficient statistic, and for any f, f 0 ∈ Ft there exists a sequence
of moves (mi1 , . . . , miA ) and a sequence (ε1 , . . . , εA ) with εj = ±1 such that
f 0 = f + ∑Aj=1 εj mij   (2.9)
and
f + ∑aj=1 εj mij ≥ 0   (2.10)
for all 1 ≤ a ≤ A.
Condition (2.8) implies that a move is a table with integer entries (possibly
negative), such that the value of the sufficient statistic is unchanged by every
move in {m1 , . . . , mL }. Note once again the importance
of the linearity condition for the sufficient statistic T . In particular, if the
components of the sufficient statistic are the margins, then every move is a
table with null margins.
For example, the following is a move for the 3 × 3 tables:

 0  +2   0
+1  −1   0
 0  −1  +1
From this definition it is clear that a Markov basis is the main tool to define
a random-walk-like Markov chain on Ft . It is well known that a connected,
reversible and aperiodic Markov chain converges to its stationary distribution.
In particular, we use here the following theorem.
Theorem 2.2 Given a Markov basis {m1 , . . . , mL }, generate a Markov chain
on Ft by choosing I uniformly in {1, . . . , L}. If the chain is currently at g ∈ Ft ,
determine the set of j ∈ Z such that g + jmI ∈ Ft . Choose j in this set with
probability proportional to
∏x∈CI {(g(x) + jmI (x))!}−1   (2.11)
with CI = {x : mI (x) > 0}. This is a connected, reversible, aperiodic Markov
chain with stationary distribution Ht .
Proof.
It is easy to see that the product in Equation (2.11) is proportional
to the stationary distribution Ht , and then the Markov chain is reversible with
respect to Ht . As {m1 , . . . , mL } is a Markov basis, from property (2.9) it follows that
the Markov chain is connected. To show that the chain is aperiodic, choose a
move and apply it until some cell count is zero, obtaining a strictly positive
holding probability.
In practice, to obtain a sample from the distribution of interest σ(f ) on
Ft , the Markov chain is run as follows:
(a) at time 0 the chain is at f ;
(b) choose a move m uniformly in the Markov basis and ε = ±1 with probability 1/2 each, independently of m;
(c) if f + εm ≥ 0 then move the chain from f to f + εm with probability
min{σ(f + εm)/σ(f ), 1}; in all other cases, stay at f .
The transition matrix of this Markov chain is given by
P (f, f + m) = (1/2L) min{σ(f + m)/σ(f ), 1}   if f + m ≥ 0
P (f, f − m) = (1/2L) min{σ(f − m)/σ(f ), 1}   if f − m ≥ 0
for all m ∈ {m1 , . . . , mL }, and P (f, f ) is chosen in order to have a stochastic
matrix. For example, if σ is the uniform distribution, we have
P (f, f + m) = 1/(2L)   if f + m ≥ 0
P (f, f − m) = 1/(2L)   if f − m ≥ 0
and P (f, f ) is simply the ratio of the number of inadmissible moves among
{±m1 , . . . , ±mL } to the total number of moves.
Moreover, note that in the hypergeometric case the computation of the
ratio σ(f + m)/σ(f ) leads to some simplifications. In fact
Ht (f + m)/Ht (f ) = ∏x∈X f (x)! / ∏x∈X (f (x) + m(x))! = ∏x : m(x)≠0 f (x)!/(f (x) + m(x))! .   (2.12)
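The cancellation in Equation (2.12) is what makes each step of the chain cheap: only the cells touched by the move enter the acceptance ratio. The following Python sketch (our own, for two-way tables with fixed margins and the usual ±1 moves on 2 × 2 minors) implements one Metropolis step with the hypergeometric distribution as target:

```python
import random
from math import factorial

def metropolis_step(table, rng):
    """One step of the Diaconis-Sturmfels walk on a two-way table with
    fixed row and column sums, targeting the hypergeometric distribution.
    The proposal adds +/-1 on a random 2x2 minor; by the cancellation in
    Equation (2.12), only the four touched cells enter the ratio."""
    I, J = len(table), len(table[0])
    i1, i2 = rng.sample(range(I), 2)
    j1, j2 = rng.sample(range(J), 2)
    eps = rng.choice((1, -1))
    cells = [(i1, j1, eps), (i2, j2, eps), (i1, j2, -eps), (i2, j1, -eps)]
    if any(table[i][j] + d < 0 for i, j, d in cells):
        return table                      # inadmissible move: stay at f
    ratio = 1.0                           # Ht(f + m) / Ht(f), Eq. (2.12)
    for i, j, d in cells:
        ratio *= factorial(table[i][j]) / factorial(table[i][j] + d)
    if rng.random() < min(ratio, 1.0):    # Metropolis acceptance
        for i, j, d in cells:
            table[i][j] += d
    return table
```

Running the chain from, say, the first layer of Table 1.1 leaves the row and column margins fixed and the entries non-negative at every step, as the theory requires.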
Let us recall here the basic convergence theorem for the Metropolis–Hastings algorithm working in our framework.
Theorem 2.3 Let σ(f ) be a positive function on Ft . Given a Markov basis
{m1 , . . . , mL }, the Markov chain generated following the described algorithm is
connected, reversible and aperiodic on Ft , with stationary distribution proportional to σ(f ).
Proof.
The proof is the same as for Theorem 2.2, considering a general
stationary distribution σ(f ) instead of the hypergeometric distribution Ht .
Now, consider an indeterminate for each sample point, that is, q indeterminates ξ1 , . . . , ξq , and let K[ξ1 , . . . , ξq ] be the polynomial ring in the indeterminates ξ1 , . . . , ξq with coefficients in the numeric field K. Moreover, let
M = {m1 , . . . , mL } be a set of moves. We define the ideals
IT = Ideal(ξ a − ξ b : T (a) = T (b))   (2.13)
IM = Ideal(ξ mi+ − ξ mi− : i = 1, . . . , L)   (2.14)
Note that the ideal IT is a toric ideal as T is a semigroup homomorphism.
The application of the theory of toric ideals comes from the following result.
Theorem 2.4 The Markov chain with moves in M is connected if and only
if IM = IT .
Proof. “only if”: Notice that IM ⊆ IT is obvious. Moreover, if the
Markov chain is connected, then for any a, b with T (a) = T (b) the term
ξ a can be written as
ξ a = ξ b + ∑Aj=1 εj (ξ mij+ − ξ mij− )   (2.15)
for a suitable sequence of moves (mi1 , . . . , miA ) in M and a suitable sequence
of signs (ε1 , . . . , εA ). This implies that ξ a − ξ b ∈ IM , and then IT ⊆ IM .
“if”: For all g, g 0 such that ∑x∈X (g(x) − g 0 (x))T (x) = 0, there is a representation of the form
ξ g − ξ g0 = ∑Lj=1 εj ξ hj (ξ fij+ − ξ fij− )   (2.16)
with εj = ±1 for all j. If L = 1, the above equation reduces to the connectedness condition. If L > 1, the result follows by induction. In fact, from
Equation (2.16) it follows that either ξ g = ξ hr ξ fir+ or ξ g = ξ hr ξ fir− for some
r = 1, . . . , L. Suppose for example ξ g = ξ hr ξ fir− . Then g − fir− is non-negative
and so g + fir is non-negative. Subtracting ξ hr (ξ fir+ − ξ fir− ) from both sides and
using hr + fir+ = g + fir , we obtain an expression for ξ g+fir − ξ g0 having length
L − 1. By induction, g + fir can be connected to g 0 in L − 1 steps. The proof
is now complete.
Theorem 2.4 also gives the relation between moves and binomials. If m is
a move, m = m+ − m− , the corresponding binomial is ξ m+ − ξ m− , and vice
versa. For example, the move presented above for the 3 × 3 tables,

 0  +2   0
+1  −1   0
 0  −1  +1

is represented by the binomial
gm = ξ12² ξ21 ξ33 − ξ22 ξ32
in the polynomial ring Q[ξ11 , ξ12 , ξ13 , ξ21 , ξ22 , ξ23 , ξ31 , ξ32 , ξ33 ].
Thus, in view of Theorem 2.4, in order to compute a Markov basis, one
computes the toric ideal associated to the sufficient statistic and defines
the moves from the binomials following the above rule.
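The rule relating moves and binomials is purely mechanical. A small Python helper (our own) makes the correspondence explicit, encoding a move as a dictionary from cells to integers and a binomial by the exponent dictionaries of its two terms:

```python
def move_to_binomial(m):
    """Split a move (cell -> integer) into the exponent dictionaries of
    the two terms of the binomial xi^{m+} - xi^{m-}."""
    plus = {x: v for x, v in m.items() if v > 0}
    minus = {x: -v for x, v in m.items() if v < 0}
    return plus, minus

def binomial_to_move(plus, minus):
    """Inverse correspondence: m = m+ - m-."""
    cells = set(plus) | set(minus)
    return {x: plus.get(x, 0) - minus.get(x, 0) for x in cells}

# the 3x3 move shown above
m = {(1, 2): 2, (2, 1): 1, (3, 3): 1, (2, 2): -1, (3, 2): -1}
plus, minus = move_to_binomial(m)
print(plus)   # {(1, 2): 2, (2, 1): 1, (3, 3): 1}  -> xi12^2 xi21 xi33
print(minus)  # {(2, 2): 1, (3, 2): 1}             -> xi22 xi32
```

The two functions are mutually inverse on moves with no zero entries, mirroring the maps ρ0 and η 0 of Section 1.2.3.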
Note that the theory of Markov bases does not need the notion of Gröbner
basis; it is sufficient to consider ideals and sets of generators. Gröbner
bases will be used below as a computational tool. Moreover, the above theory
can be applied to contingency tables observing that: 1) the (reduced)
Gröbner basis of a toric ideal is formed only by binomials; 2) applying the
division algorithm to a monomial with respect to a binomial Gröbner basis,
the dividends and the Normal Form are all monomials: this means that at
every step of the algorithm we have a monomial, or equivalently, the algebraic
rules transform a contingency table into another contingency table.
2.3 Rates of convergence
In general, the Markov chain described in Section 2.2 needs some running time
to reach the stationary distribution.
There are no general results for the rates of convergence of the chain,
but there are a number of specific results in special frameworks.
Roughly speaking, the rate of convergence depends on the diameter γ of the
random walk, i.e., the diameter of the graph having the points of Ft as vertices,
in which two points f and f 0 are connected if and only if f 0 can be reached from f
in one step.
Here is an example of such results, taken from Diaconis & Sturmfels
(1998). Consider a sample space of the form {1, . . . , I} × {1, . . . , J}, the row
sums and the column sums as components of the sufficient statistic, and let U
be the uniform distribution on the reference set Ft .
Theorem 2.5 In the previous settings, let K(f, g) be the connected Markov
chain on Ft based on the moves of the form
+1 −1
−1 +1
for any 2 × 2 minor of the table and zero otherwise. Then
kK h (f, ·) − UkT V ≤ A1 exp(−A2 c)   (2.17)
for h = cγ 2 and suitable positive constants A1 , A2 , where k · kT V denotes the
total variation distance.
Refer to Diaconis & Sturmfels (1998), Section 2.3, for a general survey of
other results and pointers to the literature. However, in the case of sample spaces
with a more complex geometrical structure, no results are available.
2.4 Two examples
We conclude this chapter with the illustration of two examples.
Example 2.6 Consider a sample space formed by 6 points, say
X = {a(1), . . . , a(6)}
and consider the 5-dimensional sufficient statistic with components
T1 = F (a(1)) + F (a(2))
T2 = F (a(3)) + F (a(4))
T3 = F (a(5)) + F (a(6))
T4 = F (a(1)) + F (a(3)) + F (a(5))
T5 = F (a(1)) + F (a(2)) + F (a(6))
The matrix representation of this sufficient statistic produces the matrix
analyzed in Example 1.26 in Chapter 1. The Gröbner basis of the toric ideal
is
{ξ4 ξ5 − ξ3 ξ6 , ξ2 ξ5 − ξ1 ξ6 , ξ2 ξ3 − ξ1 ξ4 }
corresponding to the three moves
(0, 0, 1, −1, −1, 1)
(1, −1, 0, 0, −1, 1)
(1, −1, −1, 1, 0, 0)
We will consider again this example in the next chapter in the framework of
the independence model.
The above example belongs to a special class where the components of
the sufficient statistic are the counts over subsets of the sample space. The
following example is not in that class, and the statistical literature calls such
a model a logistic regression model.
Example 2.7 Suppose that the sample space is

X = {(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2)} ⊂ N²

and consider the 5-dimensional sufficient statistic with components

T1 = F(1, 1) + F(1, 2)
T2 = F(2, 1) + F(2, 2)
T3 = F(3, 1) + F(3, 2)
T4 = F(1, 2) + F(2, 2) + F(3, 2)
T5 = F(1, 2) + 2F(2, 2) + 3F(3, 2)
Using the matrix representation of this sufficient statistic and computing the
toric ideal associated to that matrix, we obtain the following Gröbner basis

{ξ11 ξ22² ξ31 − ξ12 ξ21² ξ32}

corresponding to a Markov basis with only one move

+1 −1
−2 +2
+1 −1
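Since the sufficient statistic is linear in the counts, membership of the move in the kernel can be checked directly. Below is a minimal sketch (our own transcription, assuming the cells are ordered (1,1), (1,2), (2,1), (2,2), (3,1), (3,2)) verifying that the move leaves every component Ti unchanged:

```python
# Matrix representation of T1,...,T5; columns are indexed by the cells
# (1,1), (1,2), (2,1), (2,2), (3,1), (3,2) of the sample space.
A = [
    [1, 1, 0, 0, 0, 0],  # T1: first-row total
    [0, 0, 1, 1, 0, 0],  # T2: second-row total
    [0, 0, 0, 0, 1, 1],  # T3: third-row total
    [0, 1, 0, 1, 0, 1],  # T4: second-column counts
    [0, 1, 0, 2, 0, 3],  # T5: second-column counts weighted by the row index
]
move = [1, -1, -2, 2, 1, -1]  # the single move, read row by row

image = [sum(a * m for a, m in zip(row, move)) for row in A]
print(image)  # [0, 0, 0, 0, 0]: the move preserves the sufficient statistic
```

Every component of the image is zero, so adding or subtracting the move never changes the value of the sufficient statistic.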
Chapter 3
Two-way contingency tables
In the present chapter, we restrict our attention to two-way contingency tables. In particular, we consider the application of the algorithm presented in
the previous chapter to a wide class of log-linear models. Throughout this
chapter we will present some new results which make explicit the computation
of Markov bases for some classical log-linear models for two-way contingency
tables, such as quasi-independence and quasi-symmetry. For a discussion on
the meaning of the different log-linear models in the two-way case, the reader
can refer to Fienberg (1980), Fingleton (1984) and Agresti (2002), where a
number of applications of log-linear models to different problems coming from
biology, psychology, econometrics are presented.
The notation is the same as in the previous chapters, except for the cell
frequencies, where we will use the standard notation of the analysis of contingency
tables; for instance, we will write fij instead of f(i, j). This choice also allows
us to save space in the proofs.
In Section 3.1 we adapt the notation of Chapters 1 and 2 to the framework
of contingency tables, starting with the independence model for complete
tables, as it represents an easy prototype, both from the algebraic and the
statistical point of view. In Section 3.2 we present new characterizations or
computations of the Markov bases for the most widely used models for two-way
contingency tables, including models for tables with structural zeros. In Section 3.3
we analyze three examples of goodness of fit tests for the quasi-independence
and the quasi-symmetry models. The first one is a standard case where the
asymptotic approximation is accurate; the second one is a case where the Monte
Carlo p-value leads to a different decision with respect to the asymptotic p-value;
the third one, from Agresti (1996), shows a case where the asymptotics dramatically fail.
Finally, Section 3.4 shows the application of the methodology presented here
to a class of problems which frequently arise in biological and medical
research.
3.1 Two-way contingency tables: Independence
In this section we consider I × J contingency tables formed by the counts
generated from two categorical variables.
Let us denote the observed count for the (i, j) cell by fij and the totals
for the ith row and jth column by fi+ and f+j respectively. Finally, let

N = Σ_{i=1}^{I} fi+ = Σ_{j=1}^{J} f+j

be the sample size. We emphasize that this notation extends easily to
multi-way contingency tables, and the topics of this section can also be extended
to the multi-way case. We will see examples of log-linear models for multi-way
contingency tables in Chapter 4.
Using the same notation as in Chapters 1 and 2, the sample space is X =
{(i, j) : 1 ≤ i ≤ I, 1 ≤ j ≤ J} and the vectors r = (f1+ , . . . , fI+ ) and
c = (f+1 , . . . , f+J ) are the components of the sufficient statistic T . In the
independence case, the probability model can be written as

∏_{i=1}^{I} pr(i)^{fi+} · ∏_{j=1}^{J} pc(j)^{f+j} ,        (3.1)

where pr(i) and pc(j), with i = 1, . . . , I and j = 1, . . . , J, are the parameters
of the marginal distributions.
Every margin, as a component of the sufficient statistic, is additive, so
this probability model is coherent with the above theory. If we denote the set
of all contingency tables with fixed margins by F(r,c), the following result holds
true.
Proposition 3.1 The probability distribution on F(r,c) is the hypergeometric
distribution, and its explicit formula is

H(r,c)(f) = ∏_{j=1}^{J} ( f+j choose f1j , . . . , fIj ) / ( N choose f1+ , . . . , fI+ ) .

Proof. The probability of (f11 , . . . , f1J) in the first row of the table is

( f+1 choose f11 ) · · · ( f+J choose f1J ) / ( N choose f1+ ) = ∏_{j=1}^{J} ( f+j choose f1j ) / ( N choose f1+ )        (3.2)

and by straightforward computation the result is proved.
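The formula of Proposition 3.1 can be verified numerically on a small reference set. The sketch below (an assumed 2 × 2 example with margins r = (2, 2) and c = (2, 2); the function names are ours) enumerates all tables with those margins and checks that the hypergeometric probabilities sum to one:

```python
from math import factorial

def multinomial(n, parts):
    # multinomial coefficient n! / (k1! ... km!)
    m = factorial(n)
    for p in parts:
        m //= factorial(p)
    return m

def hypergeom(table, r, c):
    # H_(r,c)(f): product over columns of multinomial coefficients,
    # divided by the multinomial coefficient of the row sums
    N = sum(r)
    num = 1
    for j in range(len(c)):
        num *= multinomial(c[j], [row[j] for row in table])
    return num / multinomial(N, r)

# enumerate all 2x2 tables with row sums (2,2) and column sums (2,2)
r, c = (2, 2), (2, 2)
fiber = []
for f11 in range(min(r[0], c[0]) + 1):
    f12, f21 = r[0] - f11, c[0] - f11
    f22 = r[1] - f21
    if min(f12, f21, f22) >= 0:
        fiber.append([[f11, f12], [f21, f22]])

total = sum(hypergeom(t, r, c) for t in fiber)
print(len(fiber), total)  # 3 tables, total probability 1.0
```

The three tables correspond to f11 = 0, 1, 2, with probabilities 1/6, 4/6, 1/6.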
As it is difficult to count the tables in F(r,c) or to produce a complete
list of all the tables in F(r,c), we approximate the distribution of a generic test
statistic C conditional on the sufficient statistic using the algorithm described
in Chapter 2.
Recalling the definition of Markov basis, see Chapter 2, Definition 2.1, it is
clear that the Markov basis for the independence model has to be formed by
moves with null margins. Moreover, the set of moves must give a connected
Markov chain.
Then, using this Markov basis, one can apply the algorithm described in
the previous chapter in order to obtain a Monte Carlo approximation of the
distribution of the test statistic.
The likelihood of any I × J contingency table f can be written in monomial
form

ξ11^{f11} · · · ξIJ^{fIJ} ,        (3.3)

where the ξij’s are functions of the model parameters. The sufficient statistic
in the independence case is a map

T : N^{I×J} −→ N^{I+J}        (3.4)
f ↦ (f1+ , . . . , fI+ , f+1 , . . . , f+J )        (3.5)
Then, for I × J contingency tables the ring homomorphism of interest is

π : K[ξ11 , . . . , ξIJ ] −→ K[y1+ , . . . , yI+ , y+1 , . . . , y+J ]        (3.6)
ξij ↦ yi+ y+j        (3.7)
The following result characterizes a well-known object, namely the set of 2 × 2
minors, as the Gröbner basis of the toric ideal I associated to π.
Theorem 3.2 Given any term-ordering τ, the reduced Gröbner basis of I is

Gτ = {ξil ξjk − ξik ξjl : 1 ≤ i < j ≤ I, 1 ≤ k < l ≤ J}        (3.8)

Proof. First, we prove that Gτ is a set of generators for ker(π). Every
polynomial of the form ξil ξjk − ξik ξjl is in ker(π), so Ideal(Gτ) ⊆ ker(π). For
the converse, consider a monic polynomial g ∈ ker(π). Letting ξ^a = LT(g) be the
leading term of g and π(ξ^a) = y^u, there exists in g at least another term ξ^b
such that π(ξ^b) = y^u. The polynomial ξ^a − ξ^b is such that the exponents verify

Σ_{i=1}^{I} aij = Σ_{i=1}^{I} bij        for all j = 1, . . . , J

and

Σ_{j=1}^{J} aij = Σ_{j=1}^{J} bij        for all i = 1, . . . , I .

With these constraints, after long but straightforward computations one proves
that ξ^a − ξ^b ∈ Ideal(Gτ). Now, the polynomial g′ = g − ξ^a + ξ^b is also in ker(π)
and LT(g′) ≺τ LT(g). Then, the result is proved by applying the same procedure
to g′, and so on, for a finite number of times.
The proof that Gτ is a Gröbner basis is a simple application of the notion
of syzygy, which is a standard tool in Commutative Algebra. The reader can
find the theory of syzygies in Kreuzer & Robbiano (2000), Section 2.3.
The following example is the continuation of Example 2.6 of Chapter 2.
Example 3.3 The reduced Gröbner basis for the 3 × 2 contingency tables
gives the three moves

+1 −1
−1 +1
 0  0

+1 −1
 0  0
−1 +1

 0  0
+1 −1
−1 +1
This example can be used to show that the standard analysis of contingency
tables with the use of the theory of vector spaces is not sufficient in order to
find a Markov basis, see also Pistone et al. (2001a). The matrix representation
of the homomorphism T in Equation (3.5) is

     | 1 1 0 0 0 0 |
     | 0 0 1 1 0 0 |
S1 = | 0 0 0 0 1 1 |        (3.9)
     | 1 0 1 0 1 0 |
     | 0 1 0 1 0 1 |

where the columns correspond to the cells (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2).
Note that we use the notation S1 in order to preserve here the notation of
the statistical literature, but the matrix S1 is just the matrix A introduced in
Chapter 1, Section 1.2.3.
We extend its action from R⁶ to R⁵ and compute its kernel with the system
of linear equations

a11 + a12 = 0
a21 + a22 = 0
a31 + a32 = 0        (3.10)
a11 + a21 + a31 = 0
a12 + a22 + a32 = 0
As the system has rank 4, the kernel has dimension 2. Two solutions for this
system are
(a11 , a12 , a21 , a22 , a31 , a32 ) = (1, −1, 0, 0, −1, 1) ,
(3.11)
(a11 , a12 , a21 , a22 , a31 , a32 ) = (0, 0, 1, −1, −1, 1) .
(3.12)
Then, a basis of the orthogonal of Z1 = S1^T is given by the columns of

     |  1  0 |
     | −1  0 |
Z2 = |  0  1 |        (3.13)
     |  0 −1 |
     | −1 −1 |
     |  1  1 |
but the polynomials ξ11 ξ32 − ξ12 ξ31 and ξ21 ξ32 − ξ22 ξ31 are not a Markov basis.
For example, starting from the table

3 5
6 4        (3.14)
0 0

the Markov chain based on the two moves

+1 −1
 0  0
−1 +1

and

 0  0
+1 −1
−1 +1
does not produce any other tables. In fact, the polynomial ξ11 ξ22 − ξ12 ξ21
represents a non-trivial move, but it does not lie in the ideal Ideal(ξ11 ξ32 −
ξ12 ξ31 , ξ21 ξ32 − ξ22 ξ31 ). Although this table is statistically trivial, this example
allows us to write explicitly the homomorphism in a simple case. In other
models there are more interesting examples, as we will see in Section 3.3.
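The failure of the two-move chain can also be checked by brute force. A small sketch (our own enumeration; the variable names are illustrative) lists all tables in the reference set of the table in (3.14) and verifies that every proposal built from the two kernel moves, with either sign, would drive an entry of the zero third row negative:

```python
r, c = (8, 10, 0), (9, 9)  # margins of the table in (3.14)

# enumerate all 3x2 nonnegative integer tables with these margins
fiber = []
for f11 in range(r[0] + 1):
    f12 = r[0] - f11
    f21 = c[0] - f11          # the third row is forced to (0, 0)
    f22 = c[1] - f12
    if min(f12, f21, f22) >= 0 and f21 + f22 == r[1]:
        fiber.append(((f11, f12), (f21, f22), (0, 0)))

print(len(fiber))  # 9 tables in the reference set

# the two moves coming from the vector-space kernel basis
m1 = ((1, -1), (0, 0), (-1, 1))
m2 = ((0, 0), (1, -1), (-1, 1))
start = ((3, 5), (6, 4), (0, 0))

def apply_move(t, m, sign):
    return tuple(tuple(t[i][j] + sign * m[i][j] for j in range(2)) for i in range(3))

def nonneg(t):
    return all(x >= 0 for row in t for x in row)

# every proposal from the starting table touches the zero third row
stuck = all(not nonneg(apply_move(start, m, s)) for m in (m1, m2) for s in (1, -1))
print(stuck)  # True: the two-move chain never leaves the starting table
```

Nine tables share the margins of (3.14), but with only these two moves the chain stays forever at the starting table.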
Using the saturation-based algorithm presented in Chapter 1, Section 1.2.4,
the Markov basis can be found through the computation of the Gröbner basis
of the ideal
Ideal(ξ11 ξ32 − ξ12 ξ31 , ξ21 ξ32 − ξ22 ξ31 , hu − 1) ∩ K[ξ11 , ξ12 , ξ21 , ξ22 , ξ31 , ξ32 ] (3.15)
where h is the polynomial formed by the product of all the ξ indeterminates
and u is an auxiliary indeterminate.
This operation leads to the three moves described above. Refer to Chapter 1
for details on the computation of toric ideals as well as for a discussion on the
symbolic software we use.
Finally, note that in the independence case the Markov basis consists of all
tables of the type

+1 −1
−1 +1

for any 2 × 2 minor of the table and zero otherwise. Then, the computation of
the ratio H(r,c)(f + m)/H(r,c)(f) in the rejection probability of the Metropolis–
Hastings algorithm is again simplified. In fact, referring to Equation (2.12),
we obtain

H(r,c)(f + m)/H(r,c)(f) = ∏_{i,j : mij ≠ 0} fij! / (fij + mij)! =        (3.16)

= ∏_{i,j : mij = +1} fij! / (fij + 1)! · ∏_{i,j : mij = −1} fij! / (fij − 1)! =        (3.17)

= ∏_{i,j : mij = +1} (fij + 1)^{−1} · ∏_{i,j : mij = −1} fij        (3.18)

and only four numbers are involved.
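The resulting Metropolis step is only a few lines of code. The following sketch (a minimal implementation of the Diaconis–Sturmfels walk for the independence model; the function names are ours) proposes a random 2 × 2 minor move and accepts it with the simplified four-number ratio, so that the margins stay fixed at every step:

```python
import random

def mh_step(f, rng):
    """One Metropolis step on tables with fixed margins, using the basic
    moves +1 -1 / -1 +1 on a random 2x2 minor and the simplified
    hypergeometric ratio of (3.16)-(3.18)."""
    I, J = len(f), len(f[0])
    i = sorted(rng.sample(range(I), 2))
    j = sorted(rng.sample(range(J), 2))
    eps = rng.choice((1, -1))
    plus = [(i[0], j[0]), (i[1], j[1])]   # cells that gain one unit
    minus = [(i[0], j[1]), (i[1], j[0])]  # cells that lose one unit
    if eps == -1:
        plus, minus = minus, plus
    if any(f[a][b] == 0 for a, b in minus):
        return f  # the move would create a negative entry: stay
    # H(f+m)/H(f): product of the two decreased cells over the two
    # increased cells plus one -- only four numbers are involved
    ratio = 1.0
    for a, b in minus:
        ratio *= f[a][b]
    for a, b in plus:
        ratio /= f[a][b] + 1
    if rng.random() < min(1.0, ratio):
        g = [row[:] for row in f]
        for a, b in plus:
            g[a][b] += 1
        for a, b in minus:
            g[a][b] -= 1
        return g
    return f

rng = random.Random(0)
f = [[3, 5], [6, 4], [1, 2]]
rows = [sum(r) for r in f]
cols = [sum(c) for c in zip(*f)]
for _ in range(1000):
    f = mh_step(f, rng)
print(rows == [sum(r) for r in f], cols == [sum(c) for c in zip(*f)])
```

After any number of steps the chain has visited only tables with the original margins, as the printed checks confirm.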
3.2 Some models for incomplete and square tables
In the previous sections we considered the independence model, as it
represents an easy prototype. We now analyze the Markov bases for other
models, with special attention to the most widely used models for two-way
contingency tables. A survey of these models in the framework of log-linear
models can be found in Haberman (1978).
3.2.1 Incomplete tables
Incomplete tables are contingency tables with structural zeros. In this case
the sample space X is a subset of {(i, j) : 1 ≤ i ≤ I, 1 ≤ j ≤ J}. A complete
description of a Markov basis for incomplete tables is not easy. As an example,
consider a 3 × 3 table with the missing entry (1, 3), again with the row sums
and the column sums as components of the sufficient statistic. Using CoCoA
for computing a Gröbner basis of the corresponding toric ideal, we obtain these
five moves in the Markov basis:
+1 −1  •
−1 +1  0
 0  0  0

+1 −1  •
 0  0  0
−1 +1  0

 0  0  •
+1 −1  0
−1 +1  0

 0  0  •
+1  0 −1
−1  0 +1

 0  0  •
 0 +1 −1
 0 −1 +1
In this case, the Markov basis seems to be obtained from the corresponding
Markov basis of the complete table by deleting the moves which involve the
missing entry. But this is false in general, as we show in the following example.
If we consider a 3 × 3 table with all diagonal cells as missing entries, the
algebraic computation shows that the Markov basis has only one move:

 • −1 +1
+1  • −1
−1 +1  •
In this example, the ring homomorphism of interest is

π : K[ξ12 , ξ13 , ξ21 , ξ23 , ξ31 , ξ32 ] −→ K[y1+ , y2+ , y3+ , y+1 , y+2 , y+3 ]        (3.19)
ξij ↦ yi+ y+j        (3.20)

with i ≠ j.
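A quick sketch (with 0 standing in for the structural zeros marked "•") confirms that this single move has null row and column sums, as required for it to preserve the sufficient statistic:

```python
# The single move for the 3x3 table with all diagonal cells missing;
# 0 stands in for the structural zeros marked "." in the text.
move = [
    [0, -1, +1],
    [+1, 0, -1],
    [-1, +1, 0],
]
row_sums = [sum(r) for r in move]
col_sums = [sum(c) for c in zip(*move)]
print(row_sums, col_sums)  # all zero: the move preserves both margins
```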
3.2.2 The quasi-independence model
The quasi-independence model is often useful for square tables. The log-linear
form of this model is

log mij = λ + λi^X + λj^Y + δi I{i=j}        (3.21)

where the λi^X’s are the effects of the first variable X with values 1, . . . , I, the
λj^Y ’s are the effects of the second variable Y with values 1, . . . , I and the δi’s
are the effects of the diagonal cells. Here the indicator I{i=j} equals 1 when
i = j and 0 otherwise. The usual constraints for this model are

Σ_{i=1}^{I} λi^X = 0        and        Σ_{j=1}^{I} λj^Y = 0 .        (3.22)
This model expresses independence except for the diagonal cells, which
are fitted exactly. In this model, the components of the sufficient statistic
are the row sums and the column sums together with the diagonal entries
d = (f11 , . . . , fII). As the diagonal entries are themselves components of the
sufficient statistic, the model fits these cells exactly, and therefore no move of
the Markov basis can modify the diagonal. This observation leads us to the
following result.
Proposition 3.4 The reduced Gröbner basis of the toric ideal I for the
quasi-independence model is the Gröbner basis for the corresponding incomplete
table with missing diagonal entries.
Proof. An equivalent minimal sufficient statistic has components d, r̃ =
(f̃1+ , . . . , f̃I+) and c̃ = (f̃+1 , . . . , f̃+I), where f̃i+ = fi1 + . . . + fi,i−1 + fi,i+1 +
. . . + fiI is the i-th row sum with the i-th diagonal cell excluded, and the same
definition holds for the column sums f̃+j. With this sufficient statistic, the
equations defining the kernel of π are

ξij = ỹi+ ỹ+j        for i, j = 1, . . . , I, i ≠ j        (3.23)
ξii = fii        for i = 1, . . . , I        (3.24)
A Gröbner basis G1 of the ideal J1 = Ideal(ξij − ỹi+ ỹ+j , i, j = 1, . . . , I, i ≠ j)
is the Gröbner basis coming from the incomplete table with missing diagonal
entries. A Gröbner basis G2 of the ideal J2 = Ideal(ξii − fii , i = 1, . . . , I) is
{ξii − fii , i = 1, . . . , I}.
Now, observe that the first group of equations involves only the indeterminates
ξij, i ≠ j, ỹi+ and ỹ+j, i, j = 1, . . . , I, whereas the second group of equations
involves the indeterminates ξii and fii, i = 1, . . . , I. As a consequence, it is a
standard fact in polynomial algebra that a Gröbner basis of the ideal
J = J1 + J2 is G1 ∪ G2.
As we have previously seen, the Gröbner basis of the toric ideal I = J ∩
K[ξij : i ≠ j] is obtained from G1 ∪ G2 by deleting the polynomials involving
the y and f indeterminates. So, the polynomials in G2 are all deleted and the
result is proved.
As the computation of a Gröbner basis of a toric ideal is very computer
intensive and contingency tables require a great number of indeterminates and
equations, the above result is especially useful from a computational point of
view, as it reduces the number of equations used in the computation. Moreover,
note that the above proposition can be extended to other models where some
entries are sufficient statistics.
3.2.3 The symmetry model
Following the notation of log-linear models, the symmetry model can be
written in the form

log mij = λ + λi + λj + λij        (3.25)

with λij = λji for all i, j = 1, . . . , I and with the constraint Σ_{i=1}^{I} λi = 0.
This model implies marginal homogeneity. The sufficient statistic is s = (sij =
fij + fji : 1 ≤ i ≤ j ≤ I).
The Markov basis for the symmetry model is characterized by the following
result.
Theorem 3.5 Given any term-ordering τ, the reduced Gröbner basis of I for
the symmetry model is

Gτ = {ξij − ξji : 1 ≤ i < j ≤ I}        (3.26)

Proof. Writing the kernel of the ring homomorphism explicitly, we obtain
a polynomial system which contains the pairs of equations

ξij = sij
ξji = sij        (3.27)

for all 1 ≤ i < j ≤ I and the equations

ξii = sii        (3.28)

for all i = 1, . . . , I. As the corresponding polynomials are a Gröbner basis,
straightforward elimination of the sij indeterminates gives the result.
3.2.4 The quasi-symmetry model
The quasi-symmetry model has the log-linear form

log mij = λ + λi^X + λj^Y + λij        (3.29)

with λij = λji for all i, j = 1, . . . , I and with the constraints

Σ_{i=1}^{I} λi^X = 0        and        Σ_{j=1}^{I} λj^Y = 0 .        (3.30)

This model is more general than the previous one, as it does not imply marginal
homogeneity. The components of the sufficient statistic are the row sums, the
column sums and s = (sij = fij + fji : 1 ≤ i ≤ j ≤ I). In this model the
Markov basis can be computed with CoCoA. For example, for a 4 × 4 table,
the seven moves are
 0  0  0  0
 0  0 +1 −1
 0 −1  0 +1
 0 +1 −1  0

 0  0 +1 −1
 0  0 −1 +1
−1 +1  0  0
+1 −1  0  0

 0  0 +1 −1
 0  0  0  0
−1  0  0 +1
+1  0 −1  0

 0 +1  0 −1
−1  0 +1  0
 0 −1  0 +1
+1  0 −1  0

 0 +1  0 −1
−1  0  0 +1
 0  0  0  0
+1 −1  0  0

 0 −1 +1  0
+1  0  0 −1
−1  0  0 +1
 0 +1 −1  0

 0 −1 +1  0
+1  0 −1  0
−1 +1  0  0
 0  0  0  0
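Each quasi-symmetry move must leave the sufficient statistic unchanged: null row sums, null column sums, and, since every pair sum sij = fij + fji is fixed, an antisymmetric move matrix. A sketch of this check (our helper function, shown on one triangle-type move):

```python
def is_qs_move(m):
    """A quasi-symmetry move must preserve row sums, column sums and the
    symmetric pair sums s_ij = f_ij + f_ji; the last condition means the
    move matrix is antisymmetric."""
    n = len(m)
    zero_rows = all(sum(r) == 0 for r in m)
    zero_cols = all(sum(c) == 0 for c in zip(*m))
    antisym = all(m[i][j] == -m[j][i] for i in range(n) for j in range(n))
    return zero_rows and zero_cols and antisym

triangle = [  # a "triangle" move acting on indices 1, 2, 3 of a 4x4 table
    [0, +1, -1, 0],
    [-1, 0, +1, 0],
    [+1, -1, 0, 0],
    [0, 0, 0, 0],
]
print(is_qs_move(triangle))  # True
```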
It should be noted that the Markov basis is easily computable also in those
models where the maximum likelihood equations are not in explicit form and
a numerical method, such as the Newton–Raphson method, must be used to
calculate the maximum likelihood estimate; see Chapter 1.
3.2.5 Computational notes
The application of the above theory leads to two computational problems.
First, the computation of the Gröbner bases of toric ideals involves a great
number of indeterminates, and this number increases quickly with the dimensions
of the table. For example, the computation of the Gröbner basis of
the toric ideal in the independence model for complete I × J tables involves
IJ + I + J indeterminates.
For other models the number of indeterminates is even greater: for example,
the quasi-symmetry model for an I × I table involves I² + 2I + I(I + 1)/2
indeterminates. The most common symbolic software packages, such as CoCoA
or Maple, run into computational problems when the number of indeterminates
is too high. So, for large tables the theorems of the previous sections are
essential, as they give the Gröbner basis while avoiding the computation.
However, we have found in our computations that CoCoA, with its function
Toric for the computation of toric ideals, is the package which allows one to
work with the greatest number of indeterminates.
Second, the number of moves increases quickly as I and J increase.
Considering once again the independence model, the number of polynomials in the
Gröbner basis, or equivalently the number of moves of the Markov basis, is
(I choose 2)·(J choose 2). In Table 3.1, we report the number of moves of the
Markov basis for the independence model.
                              J
 I      2     3     4     5     6     7     8     9    10
 2      1     3     6    10    15    21    28    36    45
 3      3     9    18    30    45    63    84   108   135
 4      6    18    36    60    90   126   168   216   270
 5     10    30    60   100   150   210   280   360   450
 6     15    45    90   150   225   315   420   540   675
 7     21    63   126   210   315   441   588   756   945
 8     28    84   168   280   420   588   784  1008  1260
 9     36   108   216   360   540   756  1008  1296  1620
10     45   135   270   450   675   945  1260  1620  2025

Table 3.1: Number of moves for the independence model.
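The entries of Table 3.1 come directly from the counting formula; a one-line sketch with the standard library binomial coefficient:

```python
from math import comb

def n_moves(I, J):
    # one move for every 2x2 minor of the I x J table
    return comb(I, 2) * comb(J, 2)

print(n_moves(3, 2), n_moves(4, 3), n_moves(10, 10))  # 3 18 2025
```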
In this case, the main problem is how to store the results of the symbolic
computations in matrix form in order to use them with statistical software.
3.3 Examples
In this section we present three different examples of application of the
Diaconis-Sturmfels algorithm to goodness of fit tests for log-linear models.
Example 3.6 Consider the categorical data in Fingleton (1984), page 142,
reported in Table 3.2. These data describe the place of residence of a
subgroup of British migrants in 1966 and 1971. The data are simplified in
order to consider only four places, labelled A, B, C and D. For the
quasi-symmetry model, the Pearson statistic is C = 2.5826. If we compare this
value with the chi-squared distribution with (I − 1)(I − 2)/2 = 3 degrees of
freedom, we obtain an asymptotic approximate p-value of 0.4607.
Running the Markov chain based on the seven moves found in Section 3.2
for the quasi-symmetry model, with 50,000 burn-in steps and sampling every
50 steps for a total of 10,000 values, the Monte Carlo approximate p-value
is 0.4714. As the sample size is large, the classical asymptotic approximation
works well and the two approximations are similar.
                        1971
              A      B      C      D
1966    A   118     12      7     23
        B    14   2127     86    130
        C     8     69   2548    107
        D    12    110     88   7712

Table 3.2: British migrants data.
The Monte Carlo sample size is chosen in order to have a 95%-confidence
interval for the exact p-value shorter than 0.02. In fact, assuming near
independence, the width of the 95%-confidence interval is 2 · 1.96 · √(p(1 − p)/B),
where p is the estimated p-value and B is the size of the Monte Carlo sample.
In the inequality 2 · 1.96 · √(p(1 − p)/B) < 0.02, the left hand side attains its
maximum at p = 1/2, so the inequality reduces to 2 · 1.96 · √(1/(4B)) < 0.02.
From this last inequality we easily obtain B > 9,604.
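The bound on B can be recovered in one line; a sketch under the worst case p = 1/2:

```python
# worst case p = 1/2: 2 * 1.96 * sqrt(1/(4B)) < 0.02  <=>  B > (1.96 / 0.02)**2
bound = (1.96 / 0.02) ** 2
print(bound)  # approximately 9604, hence B > 9,604
```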
Example 3.7 As a second example, we present a quasi-independence model
for complete tables. We analyze the medical data in Agresti (1996), page 242,
reported in Table 3.3.
                  Y
           1    2    3    4
X    1    22    2    2    0
     2     5    7   14    0
     3     0    2   36    0
     4     0    1   17   10

Table 3.3: Carcinoma data.
This table shows ratings by two pathologists, X and Y, who separately
classified 118 slides regarding the presence and extent of carcinoma of the
uterine cervix. The rating scale has four ordered categories, 1, 2, 3 and 4.
Under the quasi-independence model, the Pearson statistic is C = 11.5, leading
to an asymptotic approximate p-value of 0.0423. Applying the algebraic MCMC
algorithm as in the previous example, the Monte Carlo approximate p-value
is 0.0080. If we perform a 1%-level goodness of fit test, the two approaches
therefore lead to different conclusions.
Example 3.8 Finally, we consider again the data of the previous example,
but under the quasi-symmetry model. The Pearson statistic for this table
under the quasi-symmetry model is C = 0.6348. Agresti (1996) compares this
value with the chi-squared distribution with 2 degrees of freedom, obtaining
an asymptotic approximate p-value of 0.7280. Running the Markov chain as
above, the Monte Carlo approximate p-value is exactly 1, and in this case the
two approximations are not similar. Here we use the definition of the p-value
as the smallest significance level at which the model is rejected with probability 1.
But there is more. Using the seven moves of the Markov basis, we can
compute for this table the exact distribution of C under the model. In fact,
starting from the original table (denoted here by f1), only the last move

 0 −1 +1  0
+1  0 −1  0
−1 +1  0  0
 0  0  0  0

with negative sign, can be applied, and this move gives the table f2 below.
Again, only the last move can be used: if we choose the positive sign we come
back to the table f1, whereas if we choose the negative sign we obtain the table
f3.

      22  3  1  0              22  4  0  0
f2 =   4  7 15  0         f3 =  3  7 16  0        (3.31)
       1  1 36  0               2  0 36  0
       0  1 17 10               0  1 17 10

Now, only the last move can be applied, with positive sign, coming back to
table f2.
There are only three tables with non-negative entries and with the appropriate
values of the sufficient statistic, and it is not difficult to compute the
corresponding hypergeometric probabilities. Applying formula (3.16), we
obtain the equations H(f2)/H(f1) = 4/9 and H(f3)/H(f2) = 1/32, together
with the constraint H(f1) + H(f2) + H(f3) = 1. Solving this simple system,
we find H(f1) = 72/105, H(f2) = 32/105, H(f3) = 1/105. The Pearson
statistic values for these three tables are C(f1) = 0.6348, C(f2) = 1.8405,
C(f3) = 12.3206.
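The small system can be solved in exact arithmetic; a sketch with Python fractions (note that Fraction reduces 72/105 to 24/35):

```python
from fractions import Fraction

r21 = Fraction(4, 9)   # H(f2)/H(f1), from formula (3.16)
r32 = Fraction(1, 32)  # H(f3)/H(f2)

# normalization: H(f1) * (1 + r21 + r21*r32) = 1
h1 = 1 / (1 + r21 + r21 * r32)
h2, h3 = r21 * h1, r32 * r21 * h1
print(h1, h2, h3)  # 24/35 32/105 1/105, i.e. 72/105, 32/105, 1/105
```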
Moreover, following Example 3.3, the size of the matrix S1 is 16 × 18 and
the dimension of its kernel is 3. Then, a basis of the kernel as a vector space is
given by three linearly independent moves, for instance the first three ones. If
the last move is excluded, the Markov chain starting from f1 cannot reach the
other two tables.
3.4 An application. The h-sample problem
In this section we present a first application of the theory above, and in
particular of the exact test for the independence model.
We want to compute exact p-values for tests on the equality of h multinomial
distributions, through the test of independence on h × k tables. It is
worth noting that when the p-value is not computed exactly, classical
approximations based on the asymptotic chi-squared distribution of the test statistic
are used. As for log-linear models, asymptotic approximations are inaccurate
in many situations; indeed, the problem of computing exact p-values has been
widely considered in the recent literature.
Mehta & Patel (1983) provide a network algorithm to inspect the orbit of
h × k contingency tables with fixed margins. But the entire set of tables is
often too large, and consequently approximation methods have been proposed.
Baglivo et al. (1992) used an FFT algorithm to compute the exact p-value on
h × k tables with fixed margins. Hirji & Johnson (1996) compared the two
algorithms above and concluded that the first is superior to the second with
respect to computing speed and accuracy.
The algorithm proposed by Mehta et al. (1988) provides an approximation
of the exact p-value and is based on a complicated Monte Carlo method
which generates random tables providing statistics belonging to the critical
region of the test.
Recently, Strawderman & Wells (1998) used saddlepoint approximations to
the exact distribution of the conditional maximum likelihood estimate (MLE)
for several 2 × 2 tables, providing approximate p-values, power calculations
and confidence intervals. Booth & Butler (1999) proposed a Monte Carlo
importance sampling scheme for conducting exact conditional tests in log-linear
models. Moreover, Mehta et al. (2000) proposed a Monte Carlo method for
approximating the exact p-values in conditional logistic regression and foresee
the applicability of algebraic methods in this framework. The main problem
in Mehta et al. (1988), and in most of the cited works, was how to generate
random tables with fixed margins. Here we give a new solution to the h-sample
problem applying the technique described in Diaconis & Sturmfels (1998), and
the computer program source can be written in a few lines of statements.
Moreover, we provide a solution not only for 2 × k tables but, more generally,
for h × k tables.
The h-sample problem can be stated as follows. Let X1 , . . . , Xh be random
variables with values in {1, . . . , k} defined on h different groups, and consider
h samples drawn from these groups: (X1,1 , . . . , X1,n1) from the multinomial
distribution D1, . . ., (Xh,1 , . . . , Xh,nh) from the multinomial distribution Dh.
Let N = Σ_{i=1}^{h} ni be the total sample size. The distribution Di has parameters
pi1 , . . . , pik, for i = 1, . . . , h, with the constraints pij ≥ 0 and Σ_{j=1}^{k} pij = 1 for
all i = 1, . . . , h. The usual test in this situation is the test for equality of the
proportions, where the null hypothesis is

H0 : p1j = . . . = phj        for j = 1, . . . , k        (3.32)

against the composite alternative hypothesis of different proportions.
The components of the sufficient statistic are the sums of the observations
in each group and the sums of observations for each possible value of the
variables, that is

S = ( Σ_{u,j} I{X1,u = j} , . . . , Σ_{u,j} I{Xh,u = j} , Σ_{i,u} I{Xi,u = 1} , . . . , Σ_{i,u} I{Xi,u = k} )        (3.33)

where IA is the indicator function of the event A. Of course, the sample can
be summarized in a contingency table with h rows and k columns; we denote
this table by F. In other words, Fij = Σ_{u=1}^{ni} I{Xi,u = j}. The table F has
raw parameters p*ij = pij ni /N. Moreover, we denote by fi+ and by f+j the
row sums and the column sums, respectively. The sufficient statistic S is
represented by the margins of this table and the maximum likelihood estimate
of the parameters is

p̂*ij = fi+ f+j /N²        for i = 1, . . . , h , j = 1, . . . , k .        (3.34)
For this hypothesis the most commonly used test statistics are the Pearson
statistic C = Σ_{i,j} (Fij − N p̂*ij)² / (N p̂*ij) and the log-likelihood ratio
L = 2 Σ_{i,j} Fij log(Fij / (N p̂*ij)); see for example Agresti (1996).
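The Pearson statistic under the estimates (3.34) is straightforward to compute; a sketch (the function name is ours) using the expected counts N p̂*ij = fi+ f+j /N:

```python
def pearson_statistic(F):
    """Pearson C for an h x k table under H0, with expected counts
    N * p_hat_ij = f_i+ f_+j / N from (3.34)."""
    N = sum(sum(row) for row in F)
    rows = [sum(row) for row in F]
    cols = [sum(col) for col in zip(*F)]
    C = 0.0
    for i, row in enumerate(F):
        for j, f in enumerate(row):
            e = rows[i] * cols[j] / N  # expected count under H0
            C += (f - e) ** 2 / e
    return C

print(pearson_statistic([[10, 0], [0, 10]]))  # 20.0
```

For the 2 × 2 example all expected counts are 5, so each cell contributes 25/5 = 5 and C = 20.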
There are many nonparametric algorithms, based on Monte Carlo
approximations of the p-value, which do not involve limit theorems. The algorithm
presented here is still valid when the test statistic is of the form

T(F) = Σ_{i,j} aij(Fij) ,        (3.35)
where the aij’s are real valued functions on the set N of the nonnegative
integers.
As usual in this framework, we consider a problem of conditional inference,
where only the tables with the same margins as the original table are
relevant. So, we restrict the analysis to the reference set Ft. Using the
algorithm described above with the Markov basis for the independence model,
the approximate Monte Carlo p-value is simply the ratio of the number of
sampled tables whose value of the test statistic is greater than or equal to that
of the original table, over the total number B of sampled tables.
Note that for the two-sample problem there is an algorithm proposed by
Mehta et al. (1988), which is an importance sampling algorithm based
on the network representation of the reference set Ft. It was also used to
approximate the power of the test, but it is difficult to rewrite it for the
h-sample problem. Moreover, our method allows us to approximate the p-value
as stated before, without also considering the hypergeometric probability
weights of the sampled tables.
It is interesting to note that Mehta et al. (2000) also considered the
possibility of applying algebraic methods: “An investigation is needed to establish
conditions under which the probability of connectedness of the Markov
chain is high”. But the theorems presented in Diaconis & Sturmfels (1998),
and described in Chapter 2 of this thesis, state that the algebraic algorithm
described above defines a connected Markov chain.
As in this kind of application only the independence model is involved, it
is worth noting that the algebraic step of the algorithm does not require
symbolic computation. In fact, the relevant Markov bases are completely
characterized in Theorem 3.2.
Finally, we emphasize that classical methods, for example Mehta et al.
(1988), propose a solution based on sampling a few well chosen tables with
great computational effort. On the other hand, our method easily generates
tables from the reference set.
Chapter 4
Rater agreement models
In this chapter we apply the algebraic technique for sampling from
conditional distributions presented in Chapter 2 in order to define an exact Markov
Chain Monte Carlo (MCMC) algorithm for conditional hypothesis testing in
problems of rater agreement. Moreover, we extend the results and the
computations carried out in Chapter 3 to the multi-dimensional case, i.e. to
the case of d-way tables, with d > 2. These new results allow us to compute
Markov bases also in non-trivial cases, and some of the most important cases
are presented explicitly.
In Section 4.1, we present the medical framework where models for rater
agreement apply. In Section 4.2 we review the main methods for measuring
rater agreement, with particular attention to the multi-observer case. The
methods considered here are Cohen's κ and the goodness of fit tests for
appropriate log-linear models. In Section 4.3, we describe the conditional inference for these problems, using the Diaconis-Sturmfels algorithm and the
notion of Markov basis, while in Section 4.4 a careful presentation of the conditioning structures arising from complex problems is given. Section 4.5 deals
with the computation of some relevant Markov bases coming from agreement
problems, through some new theoretical results useful for simplifying the symbolic computations. Finally, in Section 4.6, an example on a real data set
shows the practical applicability of the algebraic procedures.
4.1 A medical problem
The material presented in this chapter was inspired by medical research on medical imaging drugs and biological products, in particular contrast agents, see CDER (2000). Contrast agents are drugs that improve the visualization of
tissues, organs or physiologic processes by increasing the relative difference of
imaging signal intensities in adjacent parts of the body.
From the statistical point of view, the effectiveness of a contrast agent is assessed by evaluating, through quantitative statistical studies, the agent's ability to provide a better reading of the medical images. The assessment of the validity of a new product over the existing ones is performed in general by comparing radiographs, computed tomography scans and magnetic resonance images taken with the new product and with the old product, when available.
In this medical setting, the most commonly used statistical models are models for agreement among raters. For the case of two raters the proposed methods are Cohen's κ and similar indices, see Landis & Koch (1975) for a survey. More recently, the application of log-linear models, in particular the quasi-independence model and the quasi-symmetry model, has been proposed, see for example Agresti (1992).
In some particular clinical settings, Government Agencies can require models for the agreement among many raters, or the analysis of samples stratified with respect to other covariates, for example when the study includes subjects that represent the full spectrum of the population. For example, in a study intended to demonstrate the validity of a contrast agent to assess bronchiectasis, the sample should be stratified with respect to the age classes of the patients and to the presence or absence of the main related pulmonary disorders, e.g., chronic bronchitis, pneumonia, asthma, cystic fibrosis, see CDER (2000).
In these situations, the use of conditional inference is common practice. In fact, conditional inference seems preferable to an unconditional approach, see Klar et al. (2000) for a discussion and further references. In addition, different conditioning structures arise in the inferential procedures, as will become clear in the next sections. For example, it is usual to define different indices of agreement for each pair of observers, e.g., different κ's for each pair of observers and for each stratum. As will be clear in the following, such a study can be much more informative than the study of an overall index of agreement for the sample.
4.2 Multidimensional rater agreement models
Suppose that d observers rate the same set of N subjects.
In this section, we briefly review the main methods for measuring rater
agreement, starting with the case of d = 2 observers, say A and B. We present here a wide range of models for binary, nominal and ordinal rating scales, and we denote the set of possible responses by {1, . . . , I}. We do not consider continuous responses, for which parametric or nonparametric ANOVA methods can be applied, see for example Landis & Koch (1975).
The data are summarized in a two-way contingency table. We denote by πij = P{A = i, B = j} the probability that observer A classifies the subject in category i and observer B in category j, with i, j = 1, . . . , I. Moreover, we denote by {πi+ , i = 1, . . . , I} and {π+j , j = 1, . . . , I} the marginal distributions of the observers A and B, respectively. Note that in the previous chapters the cell probabilities were denoted by p, as usual in the analysis of contingency tables, but in the rater agreement literature the notation π for the probabilities is standard.
Probably, the most widely used index of agreement for nominal and ordinal categories is Cohen's κ, defined as

$$\kappa = \frac{\sum_{i=1}^{I} \pi_{ii} - \sum_{i=1}^{I} \pi_{i+}\pi_{+i}}{1 - \sum_{i=1}^{I} \pi_{i+}\pi_{+i}}$$

leading to the model

$$\begin{cases} \pi_{ij} = (1-\kappa)\,\pi_{i+}\pi_{+j} & \text{for } i \neq j \\ \pi_{ii} = (1-\kappa)\,\pi_{i+}\pi_{+i} + \kappa/I & \text{for } i = j \end{cases} \qquad (4.1)$$
This index measures the observer agreement beyond chance in the sense
that the index equals zero if the agreement is equal to that expected by chance
and computed under the independence assumption, see Landis & Koch (1975)
for further details or the original work Cohen (1960).
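As an illustration, the plug-in estimate of this index is straightforward to compute from a table of counts. The following Python sketch (our own illustration, not code from the thesis) implements the definition above:

```python
def cohen_kappa(counts):
    """Plug-in estimate of Cohen's kappa from a square table of counts,
    where counts[i][j] = number of subjects rated i by observer A and
    j by observer B."""
    I = len(counts)
    n = sum(sum(row) for row in counts)
    p = [[counts[i][j] / n for j in range(I)] for i in range(I)]
    p_row = [sum(p[i][j] for j in range(I)) for i in range(I)]  # pi_{i+}
    p_col = [sum(p[i][j] for i in range(I)) for j in range(I)]  # pi_{+j}
    observed = sum(p[i][i] for i in range(I))            # observed agreement
    chance = sum(p_row[i] * p_col[i] for i in range(I))  # agreement by chance
    return (observed - chance) / (1 - chance)

print(cohen_kappa([[10, 0], [0, 10]]))  # perfect agreement: 1.0
print(cohen_kappa([[1, 1], [1, 1]]))    # agreement at chance level: 0.0
```

The index is 1 under perfect agreement and 0 when the raters agree only as much as independence would predict, matching the interpretation given above.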
Another version of the κ index implies marginal homogeneity and its definition assumes the form

$$\kappa_h = \frac{\sum_{i=1}^{I} \pi_{ii} - \sum_{i=1}^{I} (\pi_{i+}+\pi_{+i})^2/4}{1 - \sum_{i=1}^{I} (\pi_{i+}+\pi_{+i})^2/4}. \qquad (4.2)$$
There are also several versions of weighted κ for ordinal scales, see Landis & Koch (1975) or Fleiss (1981) for details and discussion. As the Markov bases do not change when weights are considered, we restrict our presentation to the case of standard weights, as in (4.1) and (4.2). Moreover, some authors define separate κ's for each pair of responses, see for example Agresti (1992).
An agreement index is a normalized index such that it is equal to 1 in the case of perfect agreement and equal to 0 in the case of independence. For d > 2 raters, one can find different versions of Cohen's κ. Here we use the following formula

$$\kappa^{(d)} = \frac{\binom{d}{2}^{-1}\left\{\sum_{h<k=1}^{d}\ \sum_{i_h=i_k=1}^{I} \pi_{i_1,\dots,i_h,\dots,i_k,\dots,i_d} - \sum_{h<k=1}^{d}\sum_{i=1}^{I} \pi_k(i)\pi_h(i)\right\}}{1 - \binom{d}{2}^{-1}\sum_{h<k=1}^{d}\sum_{i=1}^{I} \pi_k(i)\pi_h(i)} \qquad (4.3)$$
where πk (i) is the proportion of the i-th level in the k-th marginal distribution,
see Landis & Koch (1975).
With the marginal homogeneity hypothesis we obtain the formula:

$$\kappa_h^{(d)} = \frac{\binom{d}{2}^{-1}\left(\sum_{h<k=1}^{d}\ \sum_{i_h=i_k=1}^{I} \pi_{i_1,\dots,i_h,\dots,i_k,\dots,i_d}\right) - \sum_{i=1}^{I} \pi(i)^2/d^2}{1 - \sum_{i=1}^{I} \pi(i)^2/d^2} \qquad (4.4)$$

where $\pi(i) = \sum_{h=1}^{d} \pi_h(i)$ is the proportion of ratings in the i-th category by the d observers.
It is known that the maximum likelihood estimates of the κ parameters are obtained simply by substituting the observed proportions for the theoretical proportions in the above formulae. We denote by κ̂ the maximum likelihood estimators.
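For instance, a plug-in computation of the multi-rater estimate along the lines of Equation (4.3) can be sketched in Python as follows. This is our own reading of the formula (not code from the thesis), in which the first inner sum is taken as the diagonal of each (h, k) two-way margin:

```python
from itertools import combinations

def kappa_d(p, d, I):
    """Plug-in multi-rater kappa (pairwise form): p maps index tuples
    (i_1, ..., i_d), with entries in 0..I-1, to cell proportions
    summing to 1."""
    # one-dimensional marginal distributions pi_k(i)
    marg = [[0.0] * I for _ in range(d)]
    for cell, prob in p.items():
        for k in range(d):
            marg[k][cell[k]] += prob
    pairs = list(combinations(range(d), 2))
    # averaged pairwise observed agreement (diagonals of two-way margins)
    obs = sum(prob for cell, prob in p.items()
              for h, k in pairs if cell[h] == cell[k]) / len(pairs)
    # averaged pairwise chance agreement
    chance = sum(marg[h][i] * marg[k][i]
                 for h, k in pairs for i in range(I)) / len(pairs)
    return (obs - chance) / (1 - chance)

# three observers, two levels, perfect agreement
print(kappa_d({(0, 0, 0): 0.5, (1, 1, 1): 0.5}, 3, 2))  # 1.0
```

With d = 2 the function reduces to the usual Cohen's κ of Equation (4.1).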
As the interpretation of Cohen's κ is not always straightforward, see Fleiss (1981), the use of log-linear models for describing observer agreement has more recently been proposed. For the specific application of log-linear models to measuring agreement, Agresti (1992) illustrates the use of the quasi-independence and the quasi-symmetry models. In particular, the p-value of the goodness of fit test is used to measure the agreement.
Let {nij | i, j = 1, . . . , I} and {mij | i, j = 1, . . . , I} denote the observed cell counts and the expected cell counts, respectively. The quasi-independence model assumes the form

$$\log m_{ij} = \mu + \lambda_i^A + \lambda_j^B + \delta_i I_{\{i=j\}} \qquad (4.5)$$

where μ is a global mean parameter, $\lambda_i^A, \lambda_j^B$, i, j = 1, . . . , I, are the marginal effects with the constraints $\sum_{i=1}^I \lambda_i^A = 0$, $\sum_{j=1}^I \lambda_j^B = 0$, and $\delta_i$, i = 1, . . . , I, are the main diagonal parameters.
The expression of the independence model, which gives the expected agreement by chance, has the form

$$\log m_{ij} = \mu + \lambda_i^A + \lambda_j^B.$$
As pointed out in Agresti (1992), the quasi-independence model decomposes the agreement between the two raters into two parts: a first term which represents the agreement by chance and a second term which represents the agreement beyond chance. For further details on the use of the quasi-independence model for measuring rater agreement, see Bishop et al. (1975) or Tanner & Young (1985).
We can also define a simplified quasi-independence model with a unique parameter δ for the main diagonal cells instead of the sequence {δi , i = 1, . . . , I}. This makes the fit of the main diagonal unsaturated and the model has the form

$$\log m_{ij} = \mu + \lambda_i^A + \lambda_j^B + \delta I_{\{i=j\}}. \qquad (4.6)$$
Another log-linear model to assess rater agreement is the quasi-symmetry model, defined by the equations

$$\log m_{ij} = \mu + \lambda_i^A + \lambda_j^B + \lambda_{ij} \qquad (4.7)$$

with the constraints $\sum_{i=1}^I \lambda_i^A = 0$, $\sum_{j=1}^I \lambda_j^B = 0$, and $\lambda_{ij} = \lambda_{ji}$ for all i, j = 1, . . . , I, see Agresti (1992). Other related models can also be found in Becker (1990).
The extension of the theory of log-linear models to higher dimensional tables is presented for example in Fienberg (1980). However, as pointed out in the introduction, concepts such as independence and conditional independence have been studied extensively in the statistical literature, while the notions of quasi-independence, symmetry and quasi-symmetry are still underdeveloped.
Some models of quasi-independence for many raters can be found in Tanner
& Young (1985). A generalization of the quasi-symmetry model to higher
order tables is presented in Rudas & Bergsma (2002) for marginal log-linear
models. Roughly speaking, marginal models are generalizations of classical
log-linear models and they are defined by restrictions on appropriate marginal
distributions of the table, see Bergsma & Rudas (2002) for details.
We consider here 4 models. First, the quasi-independence model given by

$$\log m_{i_1 \dots i_d} = \lambda + \lambda_{i_1}^{A_1} + \dots + \lambda_{i_d}^{A_d} + \delta_i I_{\{i_1 = \dots = i_d\}} \qquad (4.8)$$

see Tanner & Young (1985). From this model we can also generalize the simplified quasi-independence model by considering δi = δ for all i = 1, . . . , I.
Tanner & Young (1985) also proposed a non-homogeneous quasi-independence model, defined by measuring the agreement among the $\binom{d}{k}$ subsets of k raters. The model is specified by the equation:

$$\log m_{i_1 \dots i_d} = \lambda + \lambda_{i_1}^{A_1} + \dots + \lambda_{i_d}^{A_d} + \delta_{i_1 \dots i_d} \qquad (4.9)$$

with

$$\delta_{i_1 \dots i_d} = I_1 \delta_1 + \dots + I_{\binom{d}{k}} \delta_{\binom{d}{k}}$$

where $I_g = 1$ for those cells corresponding to agreement among the k observers of the g-th k-tuple of observers and $I_g = 0$ otherwise. In particular, when k = 2 this model measures the pairwise agreement.
The generalization of the quasi-symmetry model is less straightforward. The model in Equation (4.10) below is a prototype and generalizes a model presented in Rudas & Bergsma (2002):

$$\log m_{i_1 \dots i_d} = \lambda + \lambda_{i_1}^{A_1} + \dots + \lambda_{i_d}^{A_d} + \sum_{h<k=1,\dots,d} \lambda_{i_k i_h}^{A_k A_h} \qquad (4.10)$$

with $\lambda_{i_k i_h}^{A_k A_h} = \lambda_{i_h i_k}^{A_h A_k}$ for all h, k = 1, . . . , d. In the same paper other models are introduced. Our choice is motivated by the fact that in medical applications the hypothesis of pairwise agreement is often more useful than an overall quasi-symmetry condition among the d observers. More restrictive quasi-symmetry models can be defined by adding to the previous model other parameters for higher-order symmetries.
In some clinical studies the repeated image evaluation by the same reader is of great interest. In fact, medical studies usually require the assessment of both the inter-reader variability and the intra-reader variability. In particular, intra-reader agreement models are useful for the study of category distinguishability and for a neat definition of the rating scales, see Darroch & McCloud (1986), where quantitative measures are welcome, see CDER (2000). However, the resulting statistical structure and modeling for intra-reader agreement are the same as in the previous setting of inter-reader agreement.
Finally, in the case where a truth standard is available, the analysis usually concerns non-square tables, in most cases 2 × I tables. In such situations, the statistical analysis is usually performed using Receiver Operating Characteristic (ROC) curves, see for example Hanley (1998). We do not consider this application here, limiting ourselves to square tables.
4.3 Exact inference
In this section we introduce the application of the algebraic algorithm for sampling from conditional distributions described in Chapter 2 in order to define exact inferential procedures for Cohen's κ and for the log-linear models presented in the previous section. As mentioned above, we define our procedures in the framework of conditional inference.
In the recent literature, some methods for estimating κ indices with non-asymptotic techniques are referred to as methods of generalized estimating equations, see Klar et al. (2000) and Williamson et al. (1994) for details. In this last paper, a simple simulation study is given. This approach is a generalization of Cohen's κ and it is useful in particular when each subject is rated by a different number of observers. Other studies approach the problem from the perspective of latent class analysis, see Guggenmoos-Holzmann & Vonk (1998).
Some procedures for hypothesis testing on κ have recently been developed, see for example Donner et al. (2000), but they are limited to two raters or dichotomous variables. Other works on the same subject can be found in Donner & Klar (1996). Note that under the conditional approach the distribution of the estimator κ̂ is difficult to compute. In fact, even its support depends on the observed marginal distributions.
Here, we use the algebraic algorithms presented previously in order to define
appropriate MCMC approximations of the test statistics. Note that specific
work on the quasi-independence model can also be found in Smith et al. (1996).
The method we present here applies to general multi-dimensional tables, with both more than 2 levels of the rating scale and more than 2 raters. Let X = {1, . . . , I}^d be the sample space. Denote the cell probabilities by π and the sample proportions by p. Moreover, let f be a generic contingency table, that is, a function f : X → N.
As discussed in the previous chapters, for the log-linear models the goodness of fit tests are performed conditionally on the sufficient statistics, whereas in the κ approach the distribution of the estimator κ̂ is computed conditionally on the one-dimensional margins of the tables. These statements are well known in the two-observer case. In the multi-observer case the situation is less well understood and it will be discussed extensively later.
We call conditioning statistic the function T : X → N^s under which we
compute the conditional distributions. In the log-linear case, this is equivalent
to the notion of sufficient statistic, but in a general approach this terminology
could lead to some misunderstandings, especially when we compute conditional
distributions of the estimators.
Now, the p-values and the conditional distributions of the estimators can be computed by applying the algorithm described in Chapter 2.
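To fix ideas, the whole procedure can be sketched for the simplest case, the independence model on a two-way table, where the moves are the basic 2×2 moves and the stationary distribution is the hypergeometric one. The code below is a self-contained Python illustration of ours (function names and Monte Carlo parameters are hypothetical, and no burn-in or thinning is included, unlike the runs reported later in this chapter):

```python
import random

def chi2_stat(table):
    """Pearson chi-squared statistic against the independence fit
    m_ij = (row_i * col_j) / n; assumes strictly positive margins."""
    n = sum(map(sum, table))
    r = [sum(row) for row in table]
    c = [sum(col) for col in zip(*table)]
    return sum((table[i][j] - r[i] * c[j] / n) ** 2 / (r[i] * c[j] / n)
               for i in range(len(r)) for j in range(len(c)))

def ds_pvalue(table, steps=20000, seed=0):
    """Sketch of the Diaconis-Sturmfels chain for the independence model:
    propose a random basic move and accept it with the Metropolis
    probability given by the ratio of hypergeometric weights 1/prod(f!)."""
    rng = random.Random(seed)
    rows, cols = len(table), len(table[0])
    cur = [row[:] for row in table]
    obs = chi2_stat(table)
    exceed = 0
    for _ in range(steps):
        i1, i2 = rng.sample(range(rows), 2)
        j1, j2 = rng.sample(range(cols), 2)
        eps = rng.choice([-1, 1])
        a = cur[i1][j1] + eps
        b = cur[i2][j2] + eps
        c = cur[i1][j2] - eps
        d = cur[i2][j1] - eps
        if min(a, b, c, d) >= 0:
            if eps == 1:   # ratio of factorials simplifies to a product
                ratio = cur[i1][j2] * cur[i2][j1] / (a * b)
            else:
                ratio = cur[i1][j1] * cur[i2][j2] / (c * d)
            if rng.random() < min(1.0, ratio):
                cur[i1][j1], cur[i2][j2] = a, b
                cur[i1][j2], cur[i2][j1] = c, d
        if chi2_stat(cur) >= obs - 1e-9:
            exceed += 1
    return exceed / steps

# A perfectly balanced table has chi-squared 0, so every sampled table
# is at least as extreme and the estimated p-value is 1.
print(ds_pvalue([[5, 5], [5, 5]], steps=1000))  # 1.0
```

For the other conditioning statistics of this chapter, only the list of moves changes; the Monte Carlo step is identical.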
4.4 Conditioning
The problem of finding the conditioning statistic T for the models introduced above is relatively easy for d = 2. In fact, the distribution of the estimator κ̂ is computed with fixed margins if we allow different marginal distributions, i.e., using the coefficient defined in Equation (4.1). In this case the conditioning statistic is

$$T = \left( \sum_{j=1}^{I} f_{ij},\ i = 1,\dots,I; \quad \sum_{i=1}^{I} f_{ij},\ j = 1,\dots,I \right).$$
Using the definition of κh of Equation (4.2), where we impose marginal homogeneity, the appropriate conditioning statistic is

$$T = \left( \sum_{j=1}^{I} f_{ij} + \sum_{r=1}^{I} f_{ri},\ i = 1,\dots,I \right).$$
For the goodness of fit tests for log-linear models, we compute the distribution of the test statistics conditionally on the sufficient statistic, see for example Fienberg (1980) for details and formulae. For example, the conditioning statistic for the quasi-independence model as defined in Equation (4.5) is given by

$$T = \left( \sum_{j=1}^{I} f_{ij},\ i = 1,\dots,I; \quad \sum_{i=1}^{I} f_{ij},\ j = 1,\dots,I; \quad f_{ii},\ i = 1,\dots,I \right).$$
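As a concrete illustration (our own, with a hypothetical function name), the conditioning statistic just displayed is simply the vector of row sums, column sums and diagonal cells; two tables lie in the same reference set exactly when this vector coincides:

```python
def conditioning_statistic_qi(f):
    """Conditioning statistic T for the two-way quasi-independence model:
    row sums, column sums and main-diagonal cells of the table f."""
    I = len(f)
    rows = [sum(f[i]) for i in range(I)]
    cols = [sum(f[i][j] for i in range(I)) for j in range(I)]
    diag = [f[i][i] for i in range(I)]
    return rows + cols + diag

t = [[3, 1, 0],
     [0, 2, 1],
     [1, 0, 4]]
print(conditioning_statistic_qi(t))
# [4, 3, 5, 4, 3, 5, 3, 2, 4]
```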
The situation presented above for the two-observer case is much less clear
in the multi-observer case.
Analyzing Equations (4.3) and (4.4), it follows that the conditioning statistics for the Cohen κ problem are those coming from the complete independence model for the estimation of κ(d) and from the complete independence
model together with the marginal homogeneity for the estimation of κh(d) . In
the first case, the conditioning statistic T is

$$T = (f_k(i),\ i = 1,\dots,I,\ k = 1,\dots,d)$$

where $f_k(i) = \sum_{i_h,\, h \neq k} f_{i_1,\dots,i_h,\dots,i_d}\, I_{\{i_k=i\}}$ is the i-th count of the k-th marginal distribution, while in the second case the conditioning statistic T is

$$T = \left( \sum_{k=1}^{d} f_k(i),\ i = 1,\dots,I \right).$$
In the log-linear models perspective, the conditioning statistic is

$$T = (f_k(i),\ i = 1,\dots,I,\ k = 1,\dots,d,\ f_{i\dots i},\ i = 1,\dots,I) \qquad (4.11)$$
for the quasi-independence model described by Equation (4.8), while for the
non-homogeneous model we need to know the same conditioning statistic for all
marginal two-way tables. Note that here our attention is restricted to the case
k = 2 in Equation (4.9), i.e. to the pairwise agreement, but the conditioning
structure is the same also for the agreement among h > 2 observers.
For the quasi-symmetry model in Equation (4.10) the appropriate conditioning statistic is given by

$$T = \Big( f_k(i),\ i = 1,\dots,I,\ k = 1,\dots,d,\ \sum_{i_k \neq i_j, i_h} \big( f_{i_1 \dots i_j \dots i_h \dots i_d} + f_{i_1 \dots i_h \dots i_j \dots i_d} \big),\ h \geq k = 1,\dots,I \Big). \qquad (4.12)$$
In other words, the quasi-symmetry model presented here considers the quasi-symmetry of all two-way marginal tables. This is sufficient because many agreement models refer to pairwise agreement.
The marginal two-way agreement in a d-dimensional table is sometimes preferable to an overall test of quasi-independence or quasi-symmetry. In Agresti (1992) a discussion of this topic is presented for the quasi-symmetry models. In particular, the author's attention is addressed to the study of the degrees of freedom of the test statistics, and he concludes that, in some cases, when the constraints imposed in the model reduce the degrees of freedom, the test can be too conservative. Although we use a non-asymptotic approach, even in the exact framework the most complex models can lead to results with a difficult interpretation, especially when the sample size is small and there are sampling zeros. To show this, we present a simple example on a two-way table. Consider the data shown in Table 4.1.
               Y
 X        1    2    3    4
 1       22    2    2    0
 2        5    7   14    0
 3        0    2   36    0
 4        0    1   17   10

Table 4.1: Carcinoma data.
These data have already been used in Example 3.8, Chapter 3. If we analyze these data under the quasi-symmetry model, there are only three tables in the appropriate reference set Ft and the exact p-value of the goodness of fit test is 1. But this result can be criticized because of the small cardinality of Ft. Moreover, if we slightly modify this table, setting to 0 the (1, 3) entry of the table and then considering the data in Table 4.2, the reference set contains only the original table and no other ones. In such cases, the inference is probably infeasible.
               Y
 X        1    2    3    4
 1       22    2    0    0
 2        5    7   14    0
 3        0    2   36    0
 4        0    1   17   10

Table 4.2: Carcinoma data modified.
4.5 Computations
In this section, we compute the relevant Markov bases for the statistical models introduced above using the same technique as in Chapter 3. Here, the
computations are much more complex and new results are presented.
As pointed out in the previous chapter, in order to find Markov bases it
is not necessary to consider Gröbner bases. We can consider any finite set of
generators of the relevant ideals. Given an ideal, it is possible to determine
a minimal set of generators, that is, a set of polynomials which generates the
ideal and has minimal cardinality, see Kreuzer & Robbiano (2000). Roughly
speaking, a minimal set of generators is another basis of the ideal. For the
application of the Diaconis-Sturmfels algorithm, the relevant object is the toric
ideal IT coming from the matrix representation AT of the sufficient statistic,
and the algorithm is independent from the particular representation of such
ideal. While in simple cases there is no considerable difference between the
Gröbner basis and the corresponding minimal set of generators of an ideal,
in complex cases a minimal set of generators is preferable, because of the
significant reduction in the number of binomials, leading to a simplification of
the Monte Carlo step of the algorithm.
Nevertheless, Gröbner bases are essential in the symbolic computation of toric ideals, as they have nice algebraic properties: namely, they guarantee the finiteness of the division algorithm needed in a great number of algebraic algorithms, including that for the computation of toric ideals.
Here, we use the same notation as in Chapters 2 and 3, i.e., we use the ξ indeterminates for the cell probabilities and the y indeterminates for the components of the sufficient statistic, using these indeterminates as parameters.
The minimal set of generators is computed in CoCoA through the function
MinGens, starting from the Gröbner basis. Note that in the following we leave
the resulting Markov bases in terms of binomials in order to save space.
For the κ(d) coefficient in the case of 3 observers and a response variable with 3 levels, we simply write in CoCoA the matrix representation of the conditioning statistic and a few lines of statements as follows.
Mat1:= [
[1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0],
[0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0],
[0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1],
[1,1,1,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0],
[0,0,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0,0],
[0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,1,1,1],
[1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1]
];
T1:=Toric(Mat1);   -- toric ideal of the conditioning statistic matrix
MB1:=MinGens(T1);  -- minimal set of generators, i.e. the Markov basis
We obtain 162 moves of the form $\xi_{121}\xi_{211} - \xi_{111}\xi_{221}$. As we can see, the algebraic complexity of the problem increases rapidly with the number of observers. For example, the above Markov basis consists of 162 moves, while the Markov basis for the corresponding problem with two observers consists of 9 moves.
If we consider the distribution of κ̂h(d) , the problem is implemented with
Mat2:= [
[3,2,2,2,1,1,2,1,1,2,1,1,1,0,0,1,0,0,2,1,1,1,0,0,1,0,0],
[0,1,0,1,2,1,0,1,0,1,2,1,2,3,2,1,2,1,0,1,0,1,2,1,0,1,0],
[0,0,1,0,0,1,1,1,2,0,0,1,0,0,1,1,1,2,1,1,2,1,1,2,2,2,3]
];
T2:=Toric(Mat2);
MB2:=MinGens(T2);
and we obtain 44 moves: 17 moves of degree one, that is, coming from binomials of degree one, and 27 moves of degree two. From a statistical point of view, the degree of a move represents the number of counts moved from one cell to another, see the correspondence between binomials and moves given in Chapter 2.
For higher-dimensional problems, i.e. with a greater number of observers
and/or with a greater number of levels of the response variable, it is enough to
add appropriate rows and columns to the previous matrices of the conditioning
statistics.
Note that the matrix representation of the conditioning statistic allows us
to write easily the expression of the underlying statistical model. In fact, as
each column represents a raw probability and each row represents a component
of the conditioning statistic, it is easy to write the multiplicative form of the
model. For example, for the previous model we have

$$\pi_{ijk} = y_i y_j y_k \qquad (4.13)$$

with i, j, k = 1, . . . , I, giving $\pi_{111} = y_1^3$ in the first column, $\pi_{112} = y_1^2 y_2$ in the second column, and so on.
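This reading of the matrix columns is easy to automate. The sketch below (an illustration of ours, not code from the thesis) rebuilds the monomials from the first two columns of the matrix Mat2 above, recovering π111 = y1³ and π112 = y1²y2:

```python
def multiplicative_model(A, cells):
    """Read the multiplicative model off the matrix representation of the
    conditioning statistic: column c gives the exponents of the y's in the
    monomial for the cell labelled cells[c]."""
    monomials = {}
    for c, cell in enumerate(cells):
        factors = []
        for r in range(len(A)):
            e = A[r][c]
            if e == 1:
                factors.append("y%d" % (r + 1))
            elif e > 1:
                factors.append("y%d^%d" % (r + 1, e))
        monomials[cell] = "*".join(factors)
    return monomials

# first two columns of Mat2 above (cells 111 and 112)
A = [[3, 2],
     [0, 1],
     [0, 0]]
print(multiplicative_model(A, ["111", "112"]))
# {'111': 'y1^3', '112': 'y1^2*y2'}
```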
Now, consider the log-linear models presented in Section 4.2. For the models below we do not write explicitly the matrix representations of the conditioning statistics, but we report the results directly. For the quasi-independence model and the quasi-symmetry model for two-way contingency tables, the computation of the Markov bases has been presented in the previous chapter, so we restrict our attention to higher-dimensional tables. In particular we compute, as an example, the Markov bases for a 3 × 3 × 3 table.
For the quasi-independence model defined in Equation (4.8) with conditioning statistic given in Equation (4.11), we find a Markov basis with 105
moves, all of degree two.
It is worth noting that in the quasi-independence model we have single cells
as components of the sufficient statistic. So, in the computation of the Markov bases we can apply the following result, which is a generalization of Proposition 3.4, Chapter 3, which holds for two-way tables.
Theorem 4.1 The Gröbner basis of the toric ideal $I_T$ for the quasi-independence model is the Gröbner basis for the corresponding incomplete table with missing diagonal entries under the complete independence model.

Proof. The conditioning statistic for the quasi-independence model is

$$T = (f_k(i),\ i = 1,\dots,I,\ k = 1,\dots,d,\ f_{i\dots i},\ i = 1,\dots,I).$$

Consider the equivalent conditioning statistic

$$T^* = (f_k^*(i),\ i = 1,\dots,I,\ k = 1,\dots,d,\ f_{i\dots i},\ i = 1,\dots,I)$$

where $f_k^*(i)$ denotes the marginal sum excluding the main diagonal cells. Using the conditioning statistic $T^*$, the set of binomials coming from the ring homomorphism $\xi_u \mapsto y^{T^*(u)}$ is formed by two separate subsets of equations. The first subset does not involve the indeterminates $\xi_{i\dots i}$ and $y_{i\dots i}$, and is formed by the equations of the complete independence model for the table with missing diagonal entries, i.e. the binomials are

$$\xi_{i_1 \dots i_d} - \prod_{k=1}^{d} y_{k,i_k}$$

for all non-diagonal cells. The second subset is formed by the binomials $\xi_{i\dots i} - y_{i\dots i}$ for i = 1, . . . , I.

Now, eliminating the y indeterminates, the Gröbner basis coming from the first subset of equations is the Gröbner basis for the complete independence model for the table with missing diagonal entries, while the binomials of the second subset are all deleted. This proves the theorem.
Usually, because of the reduction in the number of indeterminates, Theorem 4.1 speeds up the symbolic computation, with a great saving of time, especially for high-dimensional tables.
Moreover, we can also apply the following result, if we have the Markov
basis for the complete independence model.
Theorem 4.2 The Gröbner basis of the toric ideal $I_T$ for the quasi-independence model is the Gröbner basis of the elimination ideal obtained from the toric ideal of the complete independence model by eliminating the indeterminates corresponding to the diagonal cells.

Proof. Let T be the conditioning statistic of the quasi-independence model defined in Equation (4.11) and let $T'$ be the conditioning statistic of the complete independence model for the same table. As the diagonal cells are components of the conditioning statistic T, the binomials $\xi_{i\dots i} - y_{i\dots i}$, i = 1, . . . , I, are generators of the ideal $J_T$. The ideal $J_T$ is then formed by the set of binomials corresponding to the complete independence model

$$\xi_{i_1 \dots i_d} - \prod_{k=1}^{d} y_{k,i_k} \qquad (4.14)$$

for all cells, and by the set of binomials $\xi_{i\dots i} - y_{i\dots i}$, i = 1, . . . , I. The binomials in Equation (4.14) are the generators of the ideal $J_{T'}$ of the complete independence model, and the corresponding toric ideal is computed as $I_{T'} = \mathrm{Elim}(y, J_{T'})$.

Following the theory in Chapter 2 of Kreuzer & Robbiano (2000), by the linearity of the last binomials, we can replace $\xi_{i\dots i}$ with $y_{i\dots i}$ in the binomials (4.14). Eliminating the y indeterminates from $J_T$, the binomials of the form $\xi_{i\dots i} - y_{i\dots i}$ are all deleted and for the first set we obtain $\mathrm{Elim}(y, J_{T'})$. As we can replace $\xi_{i\dots i}$ with $y_{i\dots i}$, the following chain of equalities holds:

$$I_T = \mathrm{Elim}(y, J_T) = \mathrm{Elim}((y, \xi_{i\dots i}), J_T) = \mathrm{Elim}((y, \xi_{i\dots i}), J_{T'}) = \mathrm{Elim}(\xi_{i\dots i}, I_{T'}).$$

The proof is now complete.
We also consider the same problem under the simplified quasi-independence
model, as defined in Equation (4.6) and we obtain a Markov basis formed by
164 moves: the 105 moves of the quasi-independence model above, together
with 38 moves of degree 3 and 21 moves of degree 4.
For the non-homogeneous model defined in Equation (4.9) a slightly more complex situation arises. In fact, the Markov basis is formed by 3436 moves: 52 moves of degree 3; 774 moves of degree 4; 1064 moves of degree 5; 1503 moves of degree 6; 18 moves of degree 7; 27 moves of degree 8.
Under the quasi-symmetry model defined in Equation (4.10) with the conditioning statistic specified in Equation (4.12), we have a Markov basis formed
by 219 moves: 13 moves of degree 3; 90 moves of degree 4; 54 moves of degree
5; 54 moves of degree 6; 18 moves of degree 8.
To illustrate the computational growth when moving from two- to three-dimensional tables, we give in Table 4.3 the number of moves for the 3-observer problems and the corresponding moves for the 2-observer problems of the previous models.
model                  2 observers   3 observers
κ                            9            162
κ with marg. hom.            9             44
q.i.                         1            105
simplified q.i.              9            164
non-hom. q.i.                1           3436
q.s.                         1            219

Table 4.3: Number of moves for a 3-level response and 2 or 3 observers.
In addition, it is interesting to show the importance of considering minimal
sets of generators rather than Gröbner bases. This is clearly shown in Table 4.4,
where we compare the number of moves coming from the Gröbner bases and
the number of moves coming from the minimal sets of generators. Moreover,
as the complexity of a move is represented by its degree, we also compare the
maximum degree of the moves.
model                   GB    Max Degree    MSG    Max Degree
κ                      168         3        162         2
κ with marg. hom.       44         2         44         2
q.i.                   121         4        105         2
simplified q.i.        210         7        164         4
non-hom. q.i.         5995        17       3436         8
q.s.                   421        13        219         8

Table 4.4: Number of moves and maximum degree from Gröbner bases (GB) and from minimal sets of generators (MSG).
Finally, notice that non-trivial moves are involved. For example, the move of degree 13 coming from the Gröbner basis for the quasi-symmetry model is represented by the polynomial

$$-\xi_{133}\,\xi_{212}^{2}\,\xi_{221}^{2}\,\xi_{322}^{3}\,\xi_{323}^{2}\,\xi_{331}^{3} + \xi_{122}\,\xi_{213}\,\xi_{222}^{2}\,\xi_{231}\,\xi_{313}\,\xi_{321}^{4}\,\xi_{332}^{2}\,\xi_{333}$$

and it corresponds to the move displayed in Table 4.5. This move is not needed if we look at the minimal set of generators instead of the Gröbner basis.
          k = 1                   k = 2                   k = 3
            j                       j                       j
 i      1    2    3      i      1    2    3      i      1    2    3
 1      0    0    0      1      0   +1    0      1      0    0   -1
 2      0   -2   +1      2     -2   +2    0      2     +1    0    0
 3      0   +4   -3      3      0   -3   +2      3     +1   -2   +1

Table 4.5: Move of degree 13 from the Gröbner basis for the quasi-symmetry model.
4.6 Example

In this section we apply the methodology described in this chapter to a data set taken from Dillon & Mulani (1984), shown in Table 4.6. This data set gives the ratings of 164 subjects by three observers. The rating scale has 3 levels: Positive (+), Neutral (N) and Negative (−).
           A3 = +                  A3 = N                  A3 = −
             A2                      A2                      A2
 A1      +    N    −     A1      +    N    −     A1      +    N    −
 +      56   12    1     +       5   14    2     +       0    0    1
 N       1    2    1     N       3   20    1     N       0    4    1
 −       0    1    0     −       0    4    7     −       1    2   24

Table 4.6: Data set from Dillon & Mulani (1984).
For all models, we compute the non-asymptotic p-values of the tests by running the Markov chain based on the appropriate Markov basis. Each Markov chain samples 50,000 tables, with 50,000 burn-in steps and sampling every 50 tables, i.e., with the same Monte Carlo parameters as in the example of Chapter 3.
We consider two tests, a first test without the constraint of marginal homogeneity with null hypothesis H0 : κ(3) = 0 against the one-sided alternative
model                   df    asymptotic    MCMC
κ                        –     <0.0001      0
κ with marg. hom.        –     <0.0001      0
q.i.                    17      0.0031      0.0105
simplified q.i.         19      0.0001      0.0001
non-hom. q.i.           17      0.0016      0.0081
q.s.                    11      0.1296      0.1624

Table 4.7: Comparison of the results.
hypothesis H1 : κ(3) > 0, and a second test with the constraint of marginal homogeneity, with null hypothesis H0 : κh(3) = 0 against the one-sided alternative hypothesis H1 : κh(3) > 0. In the case of log-linear models we use the Pearson chi-squared statistic C for testing the goodness of fit of the 4 models presented in Section 4.2.
A few lines of statements are sufficient to produce the output in approximately ten seconds of CPU time on a standard PC, except for the non-homogeneous quasi-independence model, which requires some additional time
for the symbolic computation.
In Table 4.7 we compare the non-asymptotic MCMC p-values with the
classical ones based on asymptotically normal or chi-square distributions.
From Table 4.7, we remark that the differences between the two approaches
are small in those cases where there is strong evidence against the null hypothesis, while there are notable differences when the significance of the test is less evident. We conclude that there are significant differences between the asymptotic
p-values and the MCMC ones, especially when the exact p-values are near
the usual α-levels of the tests and the sample size is small. These differences
can also lead to different conclusions on the rejection of the null hypothesis.
Chapter 5
Sufficiency and estimation
The problems of sufficiency and estimation for log-linear models are important topics in the analysis of discrete data and, especially, in the analysis of
contingency tables. In Chapter 1 we presented an introduction to log-linear models, with special attention to their structure. In particular, the literature
suggests the maximum likelihood (ML) approach, as introduced in Chapter 1.
In the past few years, applications of new algebraic non-linear techniques to Probability and Statistics have been presented. Here we follow the
polynomial representation of random variables on discrete sample spaces as
introduced in Pistone et al. (2001a) and we study some algebraic and geometrical properties of the class of models introduced as toric models in Pistone
et al. (2001b), showing the links between toric models and log-linear models.
Polynomial algebra is used to describe the geometrical structure of the
statistical toric models on finite sample spaces. The first works in this direction, although limited to the analysis of graphical models, are Geiger et al. (2002)
and Garcia et al. (2003). Moreover, the work of Diaconis & Sturmfels (1998)
presented in Chapter 2 makes it possible to sample from conditional
distributions, and we will use this technique in order to actually compute the
minimum-variance unbiased estimator of the cell counts.
In this chapter we consider a general finite sample space and we use algebraic techniques in order to obtain a description of the notion of sufficiency and
to define an unbiased minimum-variance estimator. This estimator is defined
for all observed tables, even in the non-negative case, i.e. it avoids problems
with structural zeros. Moreover, we show that it is well-defined also for sparse
tables, i.e. it avoids problems with sampling zeros. Starting from this theory, we also define an algorithm to actually compute the estimate, using the
technique described in Chapter 2.
Working in the non-negative case, in Section 2 we introduce the class of
toric models, and we give a polynomial representation of the notion of sufficient
statistic, first for a single observation and then for an independent sample,
generalizing some known results. In Section 3 we study the geometry of toric
models and we show that toric models are disjoint unions of exponential models.
Moreover, we show that the parameterization plays a fundamental role, and we
define a special parameterization such that the model contains all the possible
exponential models. In Section 4 we define an unbiased minimum-variance
estimator of the cell counts, and we show its uniqueness and its relationship with
the ML estimate. Finally, in Section 5 we define an MCMC algorithm for the
computation of the minimum-variance estimate.
5.1 Introduction
Consider a statistical model on a finite sample space X . Although in contingency tables the sample space is usually a Cartesian product (for example, in
the two-way case it is a space of the form {(i, j) | i = 1, . . . , I, j = 1, . . . , J}), we
assume here, without loss of generality, a generic list

X = {a(1), . . . , a(k)} .

A probability distribution on X is characterized by the parameters pi = P[a(i)]
with i = 1, . . . , k. Roughly speaking, a model M is a subset of the space of
parameters. As in the previous chapters, we use the vector notation, that is, p
is the vector (p1 , . . . , pk ). The parameter space is given by the simplex pi ≥ 0
for i = 1, . . . , k and ∑_{i=1}^{k} pi = 1.
If (x1 , . . . , xN ) is a sample of size N drawn from a vector (X1 , . . . , XN ) of
independent and identically P-distributed random variables with values in X ,
the ML approach considers the likelihood function L(p; x1 , . . . , xN ) and the
maximum likelihood estimate p̂ of p is given by
L(p̂; x1 , . . . , xN ) = sup_{p∈M} L(p; x1 , . . . , xN ) .
The ML estimate may not exist or it may be not unique for some observed
tables. The problems of non-existence or non-uniqueness are related to the
presence of sampling zeros or structural zeros.
If we restrict to the case of positive parameters, i.e. pi > 0
for all i = 1, . . . , k, and the conditions for the existence and uniqueness of
the ML estimate are satisfied, the ML estimate is found by differentiating the
likelihood function.
In the log-linear models for contingency tables, the likelihood equations
can be classified in two types:
(a) the equation system has an explicit solution p̂ and no approximation is
needed. This is the case, for example, of the independence model;
(b) the equation system has no explicit solution and we must use a
numerical method in order to approximate the solution. This is the case,
for example, of the quasi-independence model.
For a discussion on this topic, see Section 3.7 of Kreuzer & Robbiano (2000),
where some symbolic methods for solving systems of polynomial equations are
presented.
As outlined in Chapter 1, in case (b) the most widely used methods for approximating the ML estimate are the Newton–Raphson method and
Iterative Proportional Fitting. For details about these methods, see
Haberman (1974), Chapter 3. Many optimized versions of such algorithms are
implemented in various software packages. For example, a modified Newton–
Raphson method is used in PROC CATMOD of the SAS software; see the SAS/STAT
User's Guide (2000) for a concise discussion of the computations and for further references on numerics. Details on the applications of the Newton–Raphson algorithm and Fisher scoring are given, for example, in McCullagh & Nelder
(1989), Chapter 2. Again, the main problem in the application of numerical
methods is the presence of sampling or structural zeros in the contingency
table. The algorithms then need slight modifications of the observed counts,
in order to eliminate the null entries; see for example the recent algorithms in
Duffy (2002).
When the ML estimate exists, it is in general a biased estimate of the cell
counts, and in some applications one may ask for an unbiased estimate. The
minimum-variance unbiased estimator can be used in such situations, and also
when the ML estimate does not exist or is not unique.
5.2 Sufficiency and sampling
Let us consider a statistical model on the finite sample space X of the form

px = Pφ[x] = φ(T(x)),    x ∈ X    (5.1)
where T : X −→ Ns is the vector of the integer-valued components of the sufficient
statistic. Here φ ∈ Φ, where Φ is a subset of the functions from Ns to R. In point
of fact, the range of T is a finite subset T ⊂ Ns . Thus, the set of functions
from T to R has a vector space structure with a finite basis. This remark will
be useful in the next sections.
The set Φ defines a subset M of the space of the probabilities with sufficient
statistic T , and we refer to this subset as a statistical model. In other words,
M is the subset defined through

M = {p : p = φ ◦ T, φ ∈ Φ} .
We specialize Equation (5.1) by assuming that there exists a parameterization with parameters ζ1 , . . . , ζs such that the probabilities assume the form

px = L(ζ, x) / ∑_{x∈X} L(ζ, x) ,    (5.2)

with L(ζ, x) = ζ^{T(x)}. Here T(x) = (T1(x), . . . , Ts(x)) and ∑_{x∈X} L(ζ, x) is the
normalizing constant. As in the previous chapters, we use the vector notation,
e.g. ζ^{T(x)} means ζ1^{T1(x)} · · · ζs^{Ts(x)}. The algebraic structure of the functions φ and
T in Equation (5.1) will be discussed later in Section 5.4. Following Pistone
et al. (2001a) and Pistone et al. (2001b), we can state this definition.
Definition 5.1 If a statistical model M consists of all likelihoods of the form
in Equation (5.2), then we say that the model is a toric model.
The term "toric" comes from Commutative Algebra, because of the algebraic structure of the probabilities. As we have seen in Chapter 1, toric ideals
describe the algebraic relations among power products, and in toric models the
probabilities are expressed in terms of power products. See also Sturmfels
(1996) and Bigatti & Robbiano (2001), where toric ideals are studied in detail.
In Definition 5.1, the ζ parameters are unrestricted, except for the non-negativity constraint. Note that in general a toric model is bigger than the exponential
model, as we do not assume positive probabilities; with the constraint
px > 0 for all x ∈ X it is a log-linear model. Let M>0 be the subset of the
toric model M with the restriction px > 0 for all x ∈ X . Then

log px = ∑_j (log ζj) Tj(x) + log ζ0 ,    (5.3)
where

ζ0 = ( ∑_{x∈X} L(ζ, x) )^{−1}    (5.4)

is the normalizing constant. M>0 is log-linear, hence it is an
exponential model with sufficient statistic T and canonical parameters log ζj ,
j = 1, . . . , s.
Note that the representation of the toric model in Equation (5.2) refers
to the notion of multiplicative form of a log-linear model, see for example
Goodman (1979a).
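For concreteness, Equation (5.2) can be evaluated numerically. The following Python sketch (the sample space, the statistic T and the parameter values are invented for illustration) computes the toric probabilities ζ^{T(x)} divided by the normalizing constant ∑_{x∈X} L(ζ, x).

```python
from math import prod

def toric_probabilities(T, zeta):
    """Probabilities of Equation (5.2): T is the list of the integer
    vectors T(x), zeta the vector of non-negative parameters."""
    L = [prod(z ** t for z, t in zip(zeta, tx)) for tx in T]  # L(zeta, x)
    total = sum(L)                           # normalizing constant
    return [l / total for l in L]

# Toy model on a 3-point sample space with s = 2 parameters.
T = [(1, 0), (0, 1), (1, 1)]
p = toric_probabilities(T, (2.0, 3.0))       # L = (2, 3, 6), total = 11
```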
A first example of a toric model is a model with sufficient statistic consisting
of the counts over the sets A1 , . . . , As ⊆ X , possibly overlapping:

T : x −→ (IA1 (x), . . . , IAs (x)) .

In most examples in the literature,

X ⊆ {1, . . . , I1 } × · · · × {1, . . . , Id }

is a d-way array, possibly incomplete.
A second example is a log-linear model of the type

log Pψ(x) = ∑_{i=1}^{s} ψi Ti(x) − k(ψ)    (5.5)

with integer-valued Ti 's and parameters ψ = (ψ1 , . . . , ψs ). In such a case each

Ti(x) = ∑_{a∈X} Ti(a) Ia(x)    (5.6)

is an integer combination of all the counts, see Fienberg (1980).
Note that in Equation (5.1) the strict positivity implied by the log-linear
model in Equation (5.5) is not assumed.
Consider now the problem of sampling. As the probabilities can be written
as

px = ζ0 ζ1^{T1(x)} · · · ζs^{Ts(x)} ,    (5.7)

the probability of a sample (x1 , . . . , xN ) is

px1 · · · pxN = ζ0^N ζ1^{∑_i T1(xi)} · · · ζs^{∑_i Ts(xi)} ,    (5.8)

i.e., the sufficient statistic for the sample of size N is the sum of the sufficient
statistics of the N components of the sample. Note that this result is formally
the same as in the positive case, where the theory of exponential models applies.
The j-th component of the sufficient statistic can be represented as

∑_i Tj(xi) = ∑_{a∈X} Tj(a) Fa(x1 , . . . , xN )    (5.9)

where

Fa(x1 , . . . , xN ) = ∑_{j=1}^{N} Ia(xj)    (5.10)
is the count of the cell a. This result generalizes the corresponding theorem
holding in the framework of log-linear models, which states that the sufficient
statistics for all log-linear models are linear combinations of the cell counts.
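A minimal Python sketch (with a made-up three-point sample space and a made-up statistic T) illustrates Equations (5.9)–(5.10): the sufficient statistic of the sample coincides with the integer combination of the cell counts.

```python
from collections import Counter

T = {"a": (1, 0), "b": (0, 1), "c": (1, 1)}   # made-up statistic T(x)
sample = ["a", "c", "c", "b", "a", "c"]

# Left-hand side of (5.9): sum of T over the sample points.
lhs = [sum(T[x][j] for x in sample) for j in range(2)]

# Right-hand side of (5.9): sum over the cells of T_j(a) * F_a,
# with F_a the count of the cell a as in (5.10).
F = Counter(sample)
rhs = [sum(T[a][j] * F[a] for a in T) for j in range(2)]

assert lhs == rhs
```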
5.3 Geometry of toric models
Now, define the matrix AT in the following way: AT is a matrix with k rows
and s columns whose generic element AT(i, j) is Tj(a(i)) for all i = 1, . . . , k and
j = 1, . . . , s.
If ηj = E[Tj] are the mean parameters and p is the row vector of the
probabilities, then

η = p AT .    (5.11)
The matrix AT can also be used in order to describe the geometric structure
of the statistical model. First, we can state the following result.
Proposition 5.2 Choose a parameter ζj and take the set X′ of points x ∈ X
such that Tj(x) > 0. Suppose that X′ ≠ X . If we set ζj = 0, we obtain a model
on the remaining sample points and this model is again a toric model.
Proof. Without loss of generality, suppose j = 1, with T1(a(i)) = 0 for
i = 1, . . . , k′ and T1(a(i)) > 0 for i = k′ + 1, . . . , k. The matrix AT can be
partitioned as

AT = [ 0   A′T ]
     [ ∗    ∗  ]    (5.12)

where the zero block collects the first k′ entries of the first column and ∗ denotes non-zero entries. Now, the matrix A′T is non-zero and it is
the representation of a toric model on the first k′ sample points.
Thus, the geometric structure of the toric model comprises an exponential model
together with at most s toric models with (s − 1) parameters on appropriate subsets of
the sample space X . Moreover, by applying the above proposition recursively we
obtain the following result.
Theorem 5.3 Let ζj1 , . . . , ζjr be a set of parameters and take the set X′ of
points x ∈ X such that Tjb(x) > 0 for at least one b ∈ {1, . . . , r}, while Tjb(x) = 0
for all b = 1, . . . , r and x ∈ X − X′. Suppose that X′ ≠ X . If we set ζjb = 0
for all b = 1, . . . , r, we obtain a toric model on the remaining sample points.

Proof. Apply Proposition 5.2 r times.
These results lead to a geometrical characterization of a toric model.
Theorem 5.4 A toric model is the disjoint union of exponential models.
Proof. The result follows from the application of Theorem 5.3 for all
possible sets of parameters ζj1 , . . . , ζjr for which X′ ≠ X .
Example 5.5 Consider the independence model for 3 × 2 tables analyzed in
Example 3.3 of Chapter 3. The matrix representation of the sufficient statistic
is

     ⎛ 1 0 0 1 0 ⎞
     ⎜ 1 0 0 0 1 ⎟
AT = ⎜ 0 1 0 1 0 ⎟
     ⎜ 0 1 0 0 1 ⎟
     ⎜ 0 0 1 1 0 ⎟
     ⎝ 0 0 1 0 1 ⎠
and the toric model is, apart from the normalizing constant,

p11 = ζ1 ζ4
p12 = ζ1 ζ5
p21 = ζ2 ζ4
p22 = ζ2 ζ5
p31 = ζ3 ζ4
p32 = ζ3 ζ5
If ζ > 0 we have the log-linear model. Moreover, we can choose the following
sets X′ in Theorem 5.3:

• a) X′ = {(1, 1), (1, 2)}, corresponding to ζ1 = 0;
  b) X′ = {(2, 1), (2, 2)}, corresponding to ζ2 = 0;
  c) X′ = {(3, 1), (3, 2)}, corresponding to ζ3 = 0.
  In these cases we obtain the independence model for the 2 × 2 tables X \ X′;

• d) X′ = {(1, 1), (2, 1), (3, 1)}, corresponding to ζ4 = 0;
  e) X′ = {(1, 2), (2, 2), (3, 2)}, corresponding to ζ5 = 0.
  In these cases we obtain the multinomial model for the second and the first
  column, respectively;

• f) X′ = {(1, 1), (1, 2), (2, 1), (2, 2)}, corresponding to ζ1 = ζ2 = 0;
  g) X′ = {(1, 1), (1, 2), (3, 1), (3, 2)}, corresponding to ζ1 = ζ3 = 0;
  h) X′ = {(2, 1), (2, 2), (3, 1), (3, 2)}, corresponding to ζ2 = ζ3 = 0.
  In these cases we obtain the Bernoulli model for the third, the second and
  the first row, respectively;

• i) X′ = {(1, 1), (1, 2), (2, 1), (3, 1)}, corresponding to ζ1 = ζ4 = 0;
  j) X′ = {(1, 1), (1, 2), (2, 2), (3, 2)}, corresponding to ζ1 = ζ5 = 0;
  k) X′ = {(2, 1), (2, 2), (1, 1), (3, 1)}, corresponding to ζ2 = ζ4 = 0;
  l) X′ = {(2, 1), (2, 2), (1, 2), (3, 2)}, corresponding to ζ2 = ζ5 = 0;
  m) X′ = {(3, 1), (3, 2), (1, 1), (2, 1)}, corresponding to ζ3 = ζ4 = 0;
  n) X′ = {(3, 1), (3, 2), (1, 2), (2, 2)}, corresponding to ζ3 = ζ5 = 0.
  In these cases we obtain 6 Bernoulli models. These models are the 6 models
  on the columns of the 3 independence models found above;

• corresponding to the six conditions ζ1 = ζ2 = ζ4 = 0, ζ1 = ζ2 = ζ5 = 0,
  ζ1 = ζ3 = ζ4 = 0, ζ1 = ζ3 = ζ5 = 0, ζ2 = ζ3 = ζ4 = 0 and ζ2 = ζ3 = ζ5 = 0
  we obtain 6 trivial distributions on one sample point.
The toric model is then formed by 21 models: the log-linear model on 6 points,
3 models on 4 points, 2 models on 3 points, 9 models on 2 points and 6 trivial
models on 1 point. Apart from the trivial models, the remaining 15 models
are represented in Table 5.1, where 0 means structural zero, while ∗ denotes
the cells with non-zero probability.
The procedure in Theorem 5.3, applied as in Example 5.5, can also be used
to define the admissible sets of structural zeros.
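The enumeration of Example 5.5 can also be checked mechanically. The following Python sketch (an illustration, not part of the thesis' software; it assumes that a model is identified by its support) sets subsets of ζ1 , . . . , ζ5 to zero and collects the distinct non-empty supports, recovering the 21 models and their breakdown by number of points.

```python
from itertools import combinations
from collections import Counter

# Cells of the 3x2 table; parameters 0, 1, 2 are the row parameters
# zeta1..zeta3, parameters 3, 4 are the column parameters zeta4, zeta5.
cells = [(i, j) for i in range(3) for j in range(2)]

def support(zero):
    # Cell (i, j) survives iff neither its row parameter (index i) nor
    # its column parameter (index 3 + j) is set to zero.
    return frozenset((i, j) for (i, j) in cells
                     if i not in zero and 3 + j not in zero)

supports = set()
for r in range(6):
    for zero in combinations(range(5), r):
        s = support(set(zero))
        if s:                       # discard the empty model
            supports.add(s)

sizes = Counter(len(s) for s in supports)
# 21 models: 1 on 6 points, 3 on 4, 2 on 3, 9 on 2, 6 on 1
```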
∗ ∗    0 0    ∗ ∗    ∗ ∗    0 ∗
∗ ∗    ∗ ∗    0 0    ∗ ∗    0 ∗
∗ ∗    ∗ ∗    ∗ ∗    0 0    0 ∗

∗ 0    0 0    0 0    ∗ ∗    0 0
∗ 0    0 0    ∗ ∗    0 0    0 ∗
∗ 0    ∗ ∗    0 0    0 0    0 ∗

0 0    0 ∗    ∗ 0    0 ∗    ∗ 0
∗ 0    0 0    0 0    0 ∗    ∗ 0
∗ 0    0 ∗    ∗ 0    0 0    0 0

Table 5.1: Nontrivial models.
Definition 5.6 A subset X′ ⊂ X is an admissible set of structural zeros if
there exist parameters ζi1 , . . . , ζir such that the condition of Theorem 5.3 holds.
Now, we consider the behavior of the η-parameterization in the different
exponential models. Starting from the representation of the η parameters as
functions of the ζ parameters in Equation (5.11), we can easily prove that the
η parameterization is coherent, as stated in the following proposition.
Proposition 5.7 The η parameters for the reduced models are the same as in
the exponential case, provided that a component is fixed to zero.
Proof.
For the proof, it is enough to combine the linear relation between η
and ζ given in Equation (5.11) and Theorem 5.3.
Remark 5.8 The definition of a toric model as a disjoint union of exponential
models has a counterpart in differential geometry, namely in the definition of
differential manifolds with corners. For the construction of a differential
manifold with corners, consider the positive quadrant in Rk . It is an open
variety and its boundary is formed by k quadrants of dimension k − 1. Add
one or more of these quadrants to the variety and iterate the procedure k times.
On the added boundary, derivatives are defined as one-sided limits and in general
they differ from the limits of the derivatives computed in Rk . Every real variety
diffeomorphic to a variety defined as above is a differential manifold with corners.
For details about differential manifolds with corners, see Margalef Roig &
Outerelo Dominguez (1992).
Now, we state a differential relationship between the ζ parameters and the
η parameters.

Theorem 5.9 In the positive case the following equation holds:

ζj ∂ζ0/∂ζj = −ηj ζ0^2    (5.13)

for all j = 1, . . . , s.

Proof. By straightforward computation, we have

∂ζ0/∂ζj = − ( ∑_{x∈X} Tj(x) ζ^{T(x)} / ζj ) / ( ∑_{x∈X} ζ^{T(x)} )^2 = − ζ0^2 ηj / ζj ,

and the proof is complete.
Note that the previous equation holds in all the exponential models contained
in the toric model.
The geometric representation of the structural zeros presented in the
previous definition needs some further discussion. Let us consider a simple
example.
Example 5.10 Consider the simple example of the independence model for
2 × 2 contingency tables. Using the notations of the previous chapters, a first
parameterization is

p11 = ζ0
p12 = ζ0 ζ1
p21 = ζ0 ζ2
p22 = ζ0 ζ1 ζ2    (5.14)
This parameterization is used in Pistone et al. (2001a). A second parameterization is the following one, based on the matrix representation of the sufficient
statistic as explained in Chapter 3:

p11 = ζ1 ζ3
p12 = ζ1 ζ4
p21 = ζ2 ζ3
p22 = ζ2 ζ4    (5.15)
Eliminating the ζ indeterminates in the above systems of equations, in both
cases we obtain only one polynomial, namely the binomial

p11 p22 − p12 p21 = 0 .    (5.16)
However, the toric models in Equation (5.14) and in Equation (5.15) are different.
In fact, the second parameterization contains, for example, the Bernoulli model
on the two points (1, 1) and (1, 2), while the first parameterization does not.
Now, consider the binomial equations obtained from a toric model by elimination of the ζ indeterminates, as described in Chapter 1. In this way we
obtain a set of binomials which defines a statistical model, but in general the
binomial form differs from the parametric form. The binomial form is independent of the parameterization and it can allow some boundaries excluded
from the parametric form, as in the previous example. In the following, we
will use the terms "parametric toric model" and "binomial toric model".
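As a quick numerical check (the parameter values below are illustrative; the second system is taken with p22 = ζ2 ζ4, the standard independence form), both parameterizations satisfy the binomial (5.16):

```python
def param_A(z0, z1, z2):
    """First parameterization, Equation (5.14)."""
    return z0, z0 * z1, z0 * z2, z0 * z1 * z2   # p11, p12, p21, p22

def param_B(z1, z2, z3, z4):
    """Second parameterization, with p22 = z2 * z4."""
    return z1 * z3, z1 * z4, z2 * z3, z2 * z4   # p11, p12, p21, p22

for p11, p12, p21, p22 in (param_A(0.1, 2.0, 3.0),
                           param_B(0.2, 0.5, 1.5, 2.5)):
    # binomial (5.16) vanishes for every choice of the zeta parameters
    assert abs(p11 * p22 - p12 * p21) < 1e-12
```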
Moreover, we can also restate the definition of a set of structural zeros for a
toric model in binomial form.

Definition 5.11 Let B be the set of binomials defined by elimination as described in Chapter 2. A subset X′ ⊂ X is an admissible set of structural zeros
independent of the parameterization if the polynomial system B = 0 together
with px = 0 for all x ∈ X′ has a non-negative normalized solution.
In general, it is difficult to find a parameterization such that the parametric
form of the toric model contains all the exponential sub-models.
Definition 5.12 If a parameterization defines a parametric toric model which
contains all the exponential sub-models, we say that the parameterization is a
full parameterization.
In view of Example 5.10 and Definition 5.12, it follows that not all matrix
representations of the sufficient statistic are equivalent. In general there is an
infinite number of non-negative integer-valued bases of the sub-space spanned
by T , but not all of them contain all the exponential sub-models. This happens because the columns of the matrix AT are defined in the vector space
framework, but in the power product representation we need non-negative
exponents, and then we need linear combinations with non-negative integer coefficients. In general it is difficult to find a full parameterization, but it is easy
to characterize all the parameterizations which lead to a given binomial toric
model.
Theorem 5.13 Let AT be the model matrix. If v1 , . . . , vs is a non-negative
integer system of generators of the range of AT , then

px = ζ1^{v1(x)} · · · ζs^{vs(x)}    (5.17)

for x ∈ X is a parameterization, and all parameterizations of this kind lead to
the same binomial toric model.

Proof. Let Av be the matrix Av = (v1 , . . . , vs ). The kernels of the
systems Av = 0 and AT = 0 are the same. As a consequence of the construction of the toric ideals presented in Section 2 of Chapter
1, the toric models defined through AT and Av are represented by the same
binomials.
Among other choices, one can consider the range of AT as a lattice and then
consider a Hilbert basis of such a lattice.

Definition 5.14 Let S ⊆ Nk be the set of integer solutions of a diophantine
system AT p = 0. A set of integer vectors H = {v1 , . . . , vh } is a Hilbert basis
of S if for all β ∈ S

β = ∑_{v∈H} cv v    (5.18)

where cv ∈ N.

It is known that such a set H exists and is unique. The number of elements
in H in general differs from the dimension of the range of AT as a vector sub-space. For details about the properties of Hilbert bases and the algorithms
for their computation, see Sturmfels (1993) and Kreuzer & Robbiano (2003).
The application of the theory of Hilbert bases to the geometrical description of
toric models is currently in progress.
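A toy illustration of Definition 5.14 in Python (the diophantine equation and the candidate basis are invented for the example, and the search bound is arbitrary): for the solutions of a + b − 2c = 0 in N^3, the set H = {(2, 0, 1), (0, 2, 1), (1, 1, 1)} acts as a Hilbert basis, as a brute-force check over small solutions confirms.

```python
from itertools import product

# Candidate Hilbert basis for S = {(a, b, c) in N^3 : a + b = 2c}.
H = [(2, 0, 1), (0, 2, 1), (1, 1, 1)]

def decomposable(beta):
    """Brute-force search for coefficients c_v in N as in (5.18)."""
    for coeffs in product(range(7), repeat=len(H)):
        if all(sum(c * v[i] for c, v in zip(coeffs, H)) == beta[i]
               for i in range(3)):
            return True
    return False

# Every solution with entries <= 6 decomposes over H.
sols = [s for s in product(range(7), repeat=3) if s[0] + s[1] == 2 * s[2]]
assert all(decomposable(s) for s in sols)
```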
5.4 Algebraic representation of the sufficient statistics
Let us now introduce the algebraic representation of the sufficient statistic for
toric models. Let T = T (X ) be the image of the sufficient statistic T , i.e. the
set of all values of the sufficient statistic. As noted in Section 5.2, the set T is
a finite subset of Ns .
Proposition 5.15 Let (T1 (X ), . . . , Ts (X )) = T ⊆ Ns be the range of the sufficient statistic. As T is a finite subset of Qs , there exists a list of monomials

{t^α : α ∈ L}    (5.19)

such that every function φ : T −→ R is expressed as

φ(t) = ∑_{α∈L} φα t^α .    (5.20)

For the proof, see e.g. Pistone et al. (2001a). Implemented algorithms can
be used to derive such a list.
The model M has the polynomial representation

px = φ(T1 (x), . . . , Ts (x)) = ∑_{α∈L} φα T^α(x) .    (5.21)
The functions T^α , with α ∈ L, are a sufficient statistic equivalent to T1 , . . . , Ts .
If the φα are unrestricted except for the normalizing constraint, this means that
the model contains all the probabilities with sufficient statistic T . The advantage of this representation is the linearity of Equation (5.21). This remark prompts
the following definition.
Definition 5.16 The set T1 , . . . , Ts is a basis for the statistical model M if

P∗N (x1 , . . . , xN ) = φN ( ∑_i Tj(xi), j = 1, . . . , s )    (5.22)

for all N ≥ 2.

The T^α in Equation (5.21) are a basis for all the models with sufficient statistic
T.
5.5 Minimum-variance estimator
In this section we define an unbiased minimum-variance estimator of the cell
counts. This estimator is defined as the mean of the hypergeometric distribution whose parameters depend on the observed value of the sufficient statistic
TN . Moreover, we show that, under some extra conditions, this estimate is
equal to the ML estimate of the vector of parameters N p.
In the classical log-linear theory, where the attention is restricted to the
positive case, the most commonly used theorem of existence and uniqueness
of the ML estimate is based on the vector space representation of log-linear
models, see Chapter 1 for an introduction and Haberman (1974), Theorem 2.2,
for a detailed proof.
From this point of view, an unbiased minimum-variance estimator has two
advantages. It is well defined also in those cases where the ML
estimator does not exist or its computation is difficult. Moreover, in some
applications, it could be desirable to use an unbiased estimator instead of a
biased one.
The difficulty of the computation of the minimum-variance estimate will
be avoided with an algebraic algorithm presented in the next section.
From Definition 5.16 it follows that the basis is the algebraic counterpart
of the complete sufficient statistic.
Definition 5.17 Define

F̂x = Eφ [Fx (X1 , . . . , XN ) | TN^α , α ∈ L] .    (5.23)

F̂x is the minimum-variance unbiased estimator of the expected cell counts
N px . In fact, Eφ [F̂x ] = Eφ [Fx ] = N px , and F̂x is minimum-variance as it is a
function of the sufficient statistic. Moreover, if the TN^α are a basis, see Definition 5.16,
then the estimator F̂x is a linear combination of the TN^α .
Theorem 5.18 If T1 , . . . , Ts are a vector space basis of the space of the sufficient statistic, the estimator F̂x is the unique unbiased estimator of the count
N px which is a function of T1 , . . . , Ts .

Proof. Following the technique used in Lehmann (1983), let F̃x be another
estimator of N px which is a function of T1 , . . . , Ts . We have

F̂x = ∑_α cα T^α   and   F̃x = ∑_α dα T^α .

Thus, we can write

0 = Eφ [F̂x − F̃x ] = ∑_α (cα − dα ) Eφ [T^α ]

for all φ. As the φα are unrestricted, the previous equation leads to

0 = Eφ [F̂x − F̃x ] = ∑_α φα E0 [(F̂x − F̃x ) T^α ] ,

where E0 is the mean with respect to a dominating measure. As the terms
E0 [(F̂x − F̃x ) T^α ] must all be zero, we conclude that F̃x = F̂x .
Remark 5.19 Note that the result above can be easily proved if we restrict to
the model M>0 : in the exponential model the sufficient statistic is complete,
and the result in Theorem 5.18 is a consequence of the completeness.
The estimator F̂x exists even when the ML estimate is not well defined. The
link between this estimator and the ML estimator is given by the following
result.

Proposition 5.20 If the ML estimate exists and the vector F̂ /N belongs to the
model, then the ML estimate N p̂ is equal to F̂ .
Proof.
A classical result from Birch (1963), generalized by Bishop et al.
(1975), states that there is a unique configuration of the cell probabilities which
lies in the model and verifies the constraints of the sufficient statistic, and this
configuration is the ML estimate of the cell probabilities.
5.6 Algebraic method
From the above theory, together with the use of the algorithm presented in
Chapter 2, we can derive a Monte Carlo method for the computation of the
estimates F̂x for all x ∈ X . As in the previous chapters, let Yt be the set of all
samples with fixed value t of the sufficient statistic TN and let Ft be the set of all
frequency tables obtained from samples with value t of the sufficient statistic
TN . As P∗N_φ = φN (TN ), the distribution of the sample given {TN = t} is
uniform on Yt . The distribution of F = (Fx )x∈X is the image of the uniform
distribution, i.e. it is the hypergeometric distribution Ht on Ft , see Chapter 2 for
details.
Given t, we define a Markov chain with sample space Ft and with stationary
distribution Ht . The Markov chain is a random walk on Ft with moves in an
appropriate Markov basis M = {m1 , . . . , mL }, computed starting from the
matrix representation of the sufficient statistic T . Thus, as we are interested
in the mean value of the cell counts, the Markov chain is constructed following
a Metropolis-Hastings-type algorithm and it is performed as follows:

(a) let S = 0;

(b) at time 0 the chain is in f ;

(c) choose a move m uniformly in the Markov basis and ε = ±1 with probability 1/2 each, independently of m;

(d) if f + εm ≥ 0 then move the chain from f to f + εm with probability

min{Ht (f + εm)/Ht (f ), 1} .    (5.24)

In all other cases, stay at f ;

(e) let S = S + f ;

(f) repeat B times the steps (c)–(e);

(g) the (approximated) mean is S/B.
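Steps (a)–(g) can be sketched in a few lines of Python. The example below is illustrative and not part of the thesis' software: it takes a 2 × 2 table with fixed margins, whose Markov basis reduces to the single move (+1, −1, −1, +1), and a stationary distribution Ht(f) proportional to 1/∏x fx!, as in the hypergeometric case.

```python
import math
import random

def weight(f):
    """Unnormalized H_t(f) = 1 / prod of the factorials of the counts."""
    return 1.0 / math.prod(math.factorial(n) for n in f)

move = (1, -1, -1, 1)        # the basic move, cells read row by row
f = [3, 2, 1, 4]             # starting table (step (b))
S = [0.0] * 4                # running sum of the cell counts (step (a))
B = 20000
random.seed(0)
for _ in range(B):
    eps = random.choice((-1, 1))                       # step (c)
    g = [fi + eps * mi for fi, mi in zip(f, move)]
    # step (d): accept with probability min{H_t(g)/H_t(f), 1}
    if min(g) >= 0 and random.random() < min(1.0, weight(g) / weight(f)):
        f = g
    S = [s + fi for s, fi in zip(S, f)]                # step (e)
mean = [s / B for s in S]                              # step (g)
```

Every table visited by the chain keeps the margins of the starting table, so the approximated mean does as well.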
The convergence of this Markov chain has been proved in Chapter 2. As
in many cases we are interested in the computation of the estimates for
two-way or multi-way contingency tables, we can use the results for the
computation of the Markov bases presented in Chapters 3 and 4.
Appendix A
Programs
In this Appendix we show the programs, both symbolic and numerical, needed
for the implementation of the MCMC algorithm for the goodness-of-fit tests
for log-linear models presented in Chapters 3 and 4. We will use the data and
the quasi-symmetry model for 4 × 4 tables in Example 3.8, Chapter 3, as the main
example.
The first step of the algorithm needs the computation of the relevant
Markov basis for the model. In order to do that, we use CoCoA. We set
an appropriate polynomial ring, we write the matrix representation of the
sufficient statistic and we use the CoCoA function Toric.
We use a polynomial ring with an indeterminate for each sample point, i.e.,
in our example, with 16 indeterminates. Thus, we write:
Use A::=Q[x[1..16]];
T:=Mat[[1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0],
[0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0],
[0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1],
[1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0],
[0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0],
[0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0],
[0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1],
[1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
[0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0],
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1],
[0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0],
[0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0],
[0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0],
[0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0],
[0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0],
[0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0]];
I:=Toric(T);
G:=GBasis(I);
G;
The first 4 rows of the matrix represent the row sums, the second 4 rows
represent the column sums, the next 4 rows fix the diagonal cells and the
remaining 6 rows represent the sums of the pairs of cells symmetric with
respect to the diagonal.
Two remarks are useful for the implementation of this part of the algorithm.
First, we use line vectors for the sample points, i.e. we use the indeterminates
ξ1 , . . . , ξ16 instead of the two-way representation ξ1,1 , . . . , ξ4,4 . This choice
does not present any problems in the multi-way case, as shown by the examples
in Chapter 4. Second, if we want to work with a minimal set of generators
instead of a Gröbner basis of the ideal, it is sufficient to
replace the command G:=GBasis(I) with the command G:=MinGens(I).
Now, we have a Gröbner basis of the relevant toric ideal. In our example,
the CoCoA output is
[x[2]x[8]x[13] - x[4]x[5]x[14], x[2]x[7]x[9] - x[3]x[5]x[10],
-x[7]x[12]x[14] + x[8]x[10]x[15], x[3]x[12]x[13] - x[4]x[9]x[15],
x[2]x[7]x[12]x[13] - x[4]x[5]x[10]x[15],
x[3]x[8]x[10]x[13] - x[4]x[7]x[9]x[14],
-x[3]x[5]x[12]x[14] + x[2]x[8]x[9]x[15]]
Now, we use these binomials for the definition of the Markov basis. We
transform each binomial into a vector, as described in Chapter 2. We can use
the following simple CoCoA function:
Define GensToMat(G);
  -- input: a list of binomials;
  -- output: a matrix containing the vector representation
  --         of the binomials;
  L:=Len(G);
  M:=NewMat(L,NumIndets(),0);
  For I:=1 To L Do
    ListMon:=Monomials(G[I]);
    L1:=Log(ListMon[1]);  -- exponent vector of the first monomial
    L2:=Log(ListMon[2]);  -- exponent vector of the second monomial
    M[I]:=L1-L2;
  End;
  Return M;
EndDefine;
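In a general-purpose language the same transformation can be sketched as follows (an illustration, not the thesis code): here each square-free binomial is represented by the 1-based index sets of the variables in its two monomials, a representation we introduce only for this example.

```python
# A minimal Python sketch mirroring the CoCoA function GensToMat: turn
# square-free binomials, given as (positive indices, negative indices)
# pairs in 1-based indexing, into move vectors.
def gens_to_mat(binomials, num_indets):
    moves = []
    for pos, neg in binomials:
        row = [0] * num_indets
        for j in pos:
            row[j - 1] += 1   # exponent from the first monomial
        for j in neg:
            row[j - 1] -= 1   # exponent from the second monomial
        moves.append(row)
    return moves

# First binomial of the Groebner basis: x[2]x[8]x[13] - x[4]x[5]x[14]
m = gens_to_mat([((2, 8, 13), (4, 5, 14))], 16)
print(m[0])
# -> [0, 1, 0, -1, -1, 0, 0, 1, 0, 0, 0, 0, 1, -1, 0, 0]
```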
Applying this function to the Gröbner basis computed above, we obtain
the following matrix
Mat[[0, 1, 0, -1, -1, 0, 0, 1, 0, 0, 0, 0, 1, -1, 0, 0],
[0, 1, -1, 0, -1, 0, 1, 0, 1, -1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, -1, 0, -1, 0, 1, 0, 1, -1, 0],
[0, 0, 1, -1, 0, 0, 0, 0, -1, 0, 0, 1, 1, 0, -1, 0],
[0, 1, 0, -1, -1, 0, 1, 0, 0, -1, 0, 1, 1, 0, -1, 0],
[0, 0, 1, -1, 0, 0, -1, 1, -1, 1, 0, 0, 1, -1, 0, 0],
[0, -1, 1, 0, 1, 0, 0, -1, -1, 0, 0, 1, 0, 1, -1, 0]]
Here each row represents a move. This matrix is a concise representation
of the 7 moves found in Chapter 3, Section 3.2.4.
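Each move must leave all the sufficient statistics unchanged, i.e., it must lie in the kernel of the constraint matrix. For the first move this can be checked directly with a short Python sketch (an illustration; NumPy is our assumption):

```python
import numpy as np

# First move of the matrix above, reshaped as a 4x4 table of increments.
move = np.array([0, 1, 0, -1, -1, 0, 0, 1,
                 0, 0, 0, 0, 1, -1, 0, 0]).reshape(4, 4)

assert (move.sum(axis=1) == 0).all()   # row sums preserved
assert (move.sum(axis=0) == 0).all()   # column sums preserved
assert (np.diag(move) == 0).all()      # diagonal cells preserved
assert ((move + move.T) == 0).all()    # symmetric pairs (i,j)+(j,i) preserved
```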
Now, we can use this matrix for the second step of the algorithm, i.e., the
simulation step. For this step we use the software Matlab. After the CoCoA
computation, we need an easy ASCII manipulation in order to put the matrix
of moves in a suitable input format for Matlab.
function simul=simul(tab,mle,m,N,bis,step)
%tab  = the observed table (as a row vector);
%mle  = maximum likelihood estimates of the cell counts;
%m    = the matrix representation of the moves;
%N    = number of MCMC replicates;
%bis  = number of burn-in steps;
%step = reduction-of-correlation step;
nc=length(tab);
p=0;
c=zeros(N,1);
chisqref=chisq(tab,mle);
m=[m; -m];                      % use each move with both signs
nmoves=size(m,1);
numit=N*step+bis;
for i=1:numit
  r=ceil(nmoves*rand(1));       % choose a move at random
  tabp=tab+m(r,:);              % proposed table
  if tabp>=zeros(1,nc)          % true only if all cells are non-negative
    mhr=1;                      % Metropolis-Hastings ratio
    for j=1:nc
      if m(r,j)~=0
        mhr=mhr*prod(1:tab(j))/prod(1:tabp(j));
      end;
    end;
    alpha=rand(1);
    if mhr>=alpha
      tab=tabp;                 % accept the proposed table
    end;
  end;
  if (rem(i,step)==0) & (i>bis)
    c((i-bis)/step)=chisq(tab,mle);
    if c((i-bis)/step)>=chisqref
      p=p+1;
    end;
  end;
end;
p=p/N;
simul=p;
The value of the Pearson statistic is computed through the following function:
function chisq=chisq(tab,MLE)
c=0;
l=length(tab);
for i=1:l
  if MLE(i)~=0                  % skip structural zeros
    c=c+(tab(i)-MLE(i))*(tab(i)-MLE(i))/MLE(i);
  end;
end;
chisq=c;
Some remarks on the parameters of the Matlab simulation function are in order.
• the maximum likelihood estimates of the cell counts mle are computed
with statistical software, for example SAS, see SAS/STAT User’s Guide
(2000), unless a closed form is available;
• the number of Monte Carlo replicates N is chosen according to the
precision we want to reach. In our examples we have used N = 50,000, and
the motivation of this choice is discussed in Example 3.6, Chapter 3;
• the number of burn-in steps bis and the reduction-of-correlation step
step should be chosen for each particular problem, based on inspection
of the empirical distribution of the random variables and of the traces of
the auto-correlation function, respectively.
Finally, with a slight modification of the above program, one can obtain
the approximate distribution of the test statistic.
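A Python re-implementation of the same Metropolis-Hastings walk, modified in this way, could be sketched as follows (an illustration, not the thesis code; all function and variable names are ours, and log-factorials are used for numerical stability):

```python
import numpy as np
from math import lgamma

def pearson(tab, mle):
    # Pearson chi-square statistic, skipping structural zeros
    mask = mle != 0
    return float((((tab - mle) ** 2)[mask] / mle[mask]).sum())

def simul(tab, mle, moves, n_rep, burn_in, step, rng):
    tab = np.asarray(tab, dtype=float)
    moves = np.vstack([moves, -moves])   # use each move with both signs
    ref = pearson(tab, mle)
    chi = np.empty(n_rep)
    for i in range(1, n_rep * step + burn_in + 1):
        prop = tab + moves[rng.integers(len(moves))]
        if (prop >= 0).all():
            # hypergeometric M-H ratio: product of tab_j!/prop_j!
            # over the cells changed by the move
            log_mhr = sum(lgamma(t + 1) - lgamma(q + 1)
                          for t, q in zip(tab, prop) if t != q)
            if np.log(rng.random()) <= log_mhr:
                tab = prop               # accept the proposed table
        if i > burn_in and (i - burn_in) % step == 0:
            chi[(i - burn_in) // step - 1] = pearson(tab, mle)
    # Monte Carlo p-value and the sampled values of the statistic
    return float((chi >= ref).mean()), chi

# Hypothetical usage: independence in a 2x2 table, whose single basic
# move is (+1, -1, -1, +1).
rng = np.random.default_rng(0)
tab = np.array([10., 2., 3., 15.])
r = tab.reshape(2, 2)
mle = (np.outer(r.sum(1), r.sum(0)) / tab.sum()).ravel()
p, chi = simul(tab, mle, np.array([[1, -1, -1, 1]]), 200, 100, 2, rng)
```

The array chi then contains the approximate conditional distribution of the statistic, which can be inspected directly instead of being summarized by the p-value alone.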
Bibliography
Agresti, A. (1992). Modelling patterns of agreement and disagreement.
Statistical Methods in Medical Research 1, 201–218.
Agresti, A. (1996). An Introduction to Categorical Data Analysis. New
York: Wiley.
Agresti, A. (2001). Exact inference for categorical data: Recent advances
and continuing controversies. Statist. Med. 20, 2709–2722.
Agresti, A. (2002). Categorical Data Analysis. New York: Wiley, 2nd ed.
Agresti, A., Mehta, C. R. & Patel, N. R. (1990). Exact inference for
contingency tables with ordered categories. J. Amer. Statist. Assoc. 85,
453–458.
Baglivo, J., Olivier, D. & Pagano, M. (1992). Methods for exact
goodness-of-fit tests. J. Amer. Statist. Assoc. 87, 464–469.
Becker, M. P. (1990). Quasisymmetric models for the analysis of square
contingency tables. J. R. Statist. Soc. 52, 369–378.
Bergsma, W. P. & Rudas, T. (2002). Marginal models for categorical data.
Ann. Statist. 30, 140–159.
Bigatti, A., La Scala, R. & Robbiano, L. (1999). Computing toric
ideals. J. Symb. Comput. 27, 351–365.
Bigatti, A. & Robbiano, L. (2001). Toric ideals. Mat. Contemp. 21, 1–25.
Birch, M. W. (1963). Maximum likelihood in three-way contingency tables.
J. R. Statist. Soc. 25, 220–233.
Bishop, Y. M., Fienberg, S. & Holland, P. W. (1975). Discrete multivariate analysis: theory and practice. Cambridge: MIT Press.
Booth, J. G. & Butler, R. W. (1999). An importance sampling algorithm
for exact conditional tests in log-linear models. Biometrika 86, 321–331.
Capani, A., Niesi, G. & Robbiano, L. (2000). CoCoA, a system for doing
Computations in Commutative Algebra. Available via anonymous ftp from
cocoa.dima.unige.it, 4th ed.
CDER (2000). Developing Medical Imaging Drugs and Biological products.
Guidance for industry. U.S. Department of Health and Human Services,
Food and Drug Administration.
Chib, S. & Greenberg, E. (1995). Understanding the Metropolis–Hastings
algorithm. Amer. Statist. 49, 327–335.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational
and Psychological Measurement 20, 37–46.
Collombier, D. (1980). Recherches sur l’Analyse des Tables de Contingence.
PhD thesis, Université Paul Sabatier de Toulouse.
Cox, D., Little, J. & O’Shea, D. (1992). Ideals, Varieties, and Algorithms.
New York: Springer Verlag.
Darroch, J. N. & McCloud, P. I. (1986). Category distinguishability and
observer agreement. Australian Journal of Statistics 28, 371–388.
De Loera, J. A., Hemmecke, R., Tauzer, J. & Yoshida, R. (2003).
Effective lattice point counting in rational convex polytopes. Preprint.
Diaconis, P. & Sturmfels, B. (1998). Algebraic algorithms for sampling
from conditional distributions. Ann. Statist. 26, 363–397.
Dillon, W. R. & Mulani, N. (1984). A probabilistic latent class model
for assessing inter-judge reliability. Multivariate Behavioral Research 19,
438–458.
Dobson, A. J. (1990). An Introduction to Generalized Linear Models. London: Chapman & Hall.
Donner, A. & Klar, N. (1996). The statistical analysis of kappa statistics
in multiple samples. Journal of Clinical Epidemiology 49, 1053–1058.
Donner, A., Shoukri, M. M., Klar, N. & Bartfay, E. (2000). Testing
the equality of two dependent kappa statistics. Statist. Med. 19, 373–387.
Duffy, D. (2002). The gllm package. Available from
http://cran.r-project.org, 0th ed.
Fienberg, S. (1980). The Analysis of Cross-Classified Categorical Data.
Cambridge: MIT Press.
Fingleton, B. (1984). Models of Category Counts. Cambridge: Cambridge
University Press.
Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions. New
York: Wiley, 2nd ed.
Garcia, L. D., Stillman, M. & Sturmfels, B. (2003). Algebraic geometry of Bayesian networks. Preprint.
Geiger, D., Meek, C. & Sturmfels, B. (2002). On the toric algebra of
graphical models. Preprint.
Goodman, L. A. (1979a). Multiplicative models for square contingency tables
with ordered categories. Biometrika 66, 413–418.
Goodman, L. A. (1979b). Simple models for the analysis of association in
cross-classifications having ordered categories. J. Amer. Statist. Assoc. 74,
537–552.
Goodman, L. A. (1985). The analysis of cross-classified data having ordered
and/or unordered categories: Association models, correlation models, and
asymmetry models for contingency tables with or without missing entries.
Ann. Statist. 13, 10–59.
Guggenmoos-Holzmann, I. & Vonk, R. (1998). Kappa-like indices of
agreement viewed from a latent class perspective. Statist. Med. 17, 797–
812.
Haberman, S. J. (1974). The Analysis of Frequency Data. Chicago and
London: The University of Chicago Press.
Haberman, S. J. (1978). Analysis of Qualitative Data. New York: Academic
Press.
Hanley, J. A. (1998). Receiver Operating Characteristic (ROC) curves.
In Encyclopedia of Biostatistics, P. Armitage & T. Colton, eds. Wiley, pp.
3738–3745.
Hirji, K. F. & Johnson, T. D. (1996). A comparison of algorithms for
exact analysis of unordered 2 × k contingency tables. Comput. Statist. Data
Anal. 21, 419–429.
Klar, N., Lipsitz, S. R. & Ibrahim, J. G. (2000). An estimating equation
approach for modelling Kappa. Biom. J. 42, 45–58.
Kreuzer, M. & Robbiano, L. (2000). Computational Commutative Algebra
1. New York: Springer.
Kreuzer, M. & Robbiano, L. (2003). Computational Commutative Algebra
2. New York: Springer. In preparation.
Landis, R. J. & Koch, G. G. (1975). A review of statistical methods in
the analysis of data arising from observer reliability studies, Parts I and II.
Statist. Neerlandica 29, 101–123, 151–161.
Lehmann, E. L. (1983). Theory of Point Estimation. New York: John Wiley
& Sons.
Lehmann, E. L. (1987). Testing Statistical Hypotheses. New York: John
Wiley & Sons.
Margalef Roig, J. & Outerelo Dominguez, E. (1992). Differential
Topology. No. 173 in North-Holland Mathematics Studies. Amsterdam:
North-Holland Publishing Co.
McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models. London: Chapman & Hall, 2nd ed.
Mehta, C. R. & Patel, N. R. (1983). A network algorithm for performing
Fisher’s exact test in r × c contingency tables. J. Amer. Statist. Assoc. 78,
427–434.
Mehta, C. R., Patel, N. R. & Senchaudhuri, P. (1988). Importance
sampling for estimating exact probabilities in permutational inference. J.
Am. Statist. Assoc. 83, 999–1005.
Mehta, C. R., Patel, N. R. & Senchaudhuri, P. (1991). Exact stratified linear rank tests for binary data. In Computing Science and Statistics:
Proceedings of the 23rd Symposium on the Interface, E. M. Keramidas, ed.
American Mathematical Society, pp. 200–207.
Mehta, C. R., Patel, N. R. & Senchaudhuri, P. (2000). Efficient Monte
Carlo methods for conditional logistic regression. J. Am. Statist. Assoc. 95,
99–108.
Pistone, G., Riccomagno, E. & Wynn, H. P. (2001a). Algebraic Statistics: Computational Commutative Algebra in Statistics. Boca Raton: Chapman&Hall/CRC.
Pistone, G., Riccomagno, E. & Wynn, H. P. (2001b). Computational
commutative algebra in discrete statistics. In Algebraic Methods in Statistics
and Probability, M. A. G. Viana & D. S. P. Richards, eds., vol. 287 of
Contemporary Mathematics. American Mathematical Society, pp. 267–282.
Rudas, T. & Bergsma, W. P. (2002). On generalised symmetry. Tech. rep.,
Paul Sabatier University, Toulouse, France. Preprint.
SAS/STAT User’s Guide (2000). Version 8. Cary, NC: SAS Institute, 1st ed.
Smith, P. W. F., Forster, J. J. & McDonald, J. W. (1996). Monte
Carlo exact tests for square contingency tables. J. R. Statist. Soc. 159,
309–321.
Strawderman, R. L. & Wells, M. T. (1998). Approximately exact inference for the common odds ratio in several 2 × 2 tables. J. Am. Statist.
Assoc. 93, 1294–1307.
Sturmfels, B. (1993). Algorithms in Invariant Theory. Texts and Monographs in Symbolic Computation. New York: Springer.
Sturmfels, B. (1996). Gröbner bases and convex polytopes, vol. 8 of University lecture series (Providence, R.I.). American Mathematical Society.
Tanner, M. A. & Young, M. A. (1985). Modelling agreement among
raters. J. Am. Statist. Assoc. 80, 175–180.
Williamson, J. M., Manatunga, A. K. & Lipsitz, S. R. (1994). Modeling kappa for measuring dependent categorical agreement data. Biostatistics
1, 191–202.