Bayesian test of homogeneity for Markov chains

Bayesian test of homogeneity for Markov chains

S'rA :,kS l . : 3 :~illl ~t: Cll & li ELSEVIER Statistics & Probability Letters 31 ( 1997 ) 333 - 338 Bayesian test of homogeneity for Markov ch...

345KB Sizes 0 Downloads 34 Views

S'rA

:,kS l . : 3 :~illl ~t:

Cll &

li

ELSEVIER

Statistics & Probability Letters 31 ( 1997 ) 333 - 338

Bayesian test of homogeneity for Markov chains J & 6 m e A. D u p u i s L S T A , Universitd Paris VI and Crest, Insee, France Received December 1994

Abstract

The test we develop expresses the null hypothesis in terms of proximity of the distribution of a Markov chain (yt) to the subspace ~ of homogeneous Markov chains. The distance we use is the Kullback distance which turns out to be conceptually appropriate. Departure from the point null hypothesis allows us to formulate the question of interest in meaningful terms, but implementing this approach comes up against a scaling problem. In this paper, we propose a new approach in order to solve this scaling problem by formulating the proximity to homogeneity as a percentage of the maximum distance to ocg.

Keywords: Entropy; Homogeneity; Kullback-Leibler distance; Longitudinal data; Proximity; Test

I. Introduction

In many fields, it is common to follow up over time individuals previously sampled in a population of interest and to observe their states at different particular discrete time moments. In this paper we focus on phenomena assumed to be directed by a first-order Markovian process. This modeling has been used for instance in Biology when estimating the movement probabilities of animal populations moving in a geographically stratified space (Seber, 1982; Dupuis, 1995), in Botany when studying growth of the diameter of trees (Roberts and Hruska, 1986), as well as in Social Sciences for the study of interindustry job mobility (Bartholomew, 1982). Obviously other models could be considered depending on the problem at hand; Gilula and Haberman (1994) give a review of the many approaches for modelling polytomous data correlated over time. Taking into account the Markovian assumption brings us to express the question of interest in terms of homogeneity of the chain. When no time-dependent exogenous variable is available, which is the only situation we consider here, Bayesian test of homogeneity has not been developed. The aim of this paper is to develop such a test by settling on a particular representation of the null hypothesis. Let us denote by qt(r,s) the probability of being in s at time t + 1 given that the chain is in r at time t. We consider that the exact equalities:

qt(r,s) = qt,(r,s)

for all states (r,s) and all times (t,t')

0167-7152/97/$17.00 (~) 1997 Elsevier Science B.V. All rights reserved P l l S 0 1 6 7 - 7 1 5 2 ( 96 ) 0 0 0 4 7 - 8

(1.1)

334

J.A. Dupuis I Statistics & Probability Letters 31 (1997) 333-338

represent the question of interest very crudely since, in practice, experimenters will mostly consider the phenomenon as time homogeneous as soon as (1.1) is approximatively satisfied. Departure of the point null hypothesis formulation allows to formulate in meaningful terms the question of interest. Moreover this approach avoids the theoretical drawbacks and criticisms of the point null hypothesis (Vaderman, 1987). It has been first implemented by Ferrandiz (1982), who also argues about the extreme sensitivity of the posterior probability associated with a sharp null hypothesis. This formulation has also been used by McCulloch and Ross± (1992) when dealing with nonlinear hypotheses, and by Mengersen and Robert (1996) when testing the number of components of a mixture of two distributions. We express the null hypothesis (here proximity to homogeneity), in terms of distance between distributions; more precisely, if f(y[O) represents the distribution of the process (Yt) with 0 = (qt(r,s)), we consider the distance from f(y[O) to the distributions family Jg of homogeneous Markov chains. If we denote by g(yl2) the distribution of a generic element of ~ where 2 = (2(r,s)), the null hypothesis we consider is Ho: a(f(ylO), g(ylO±)) ~<~ where O± = Arg m!n a(f(y[O), g(yl2)).

(1.2)

The distance measure d employed here is the Kullback-Leibler distance for which the projection g(y[O±) of f(y[O) on ~ has a closed and meaningful form (see Section 3.2). In Section 3.1 we motivate the entropic choice. The response to the test obviously depends upon the choice of e which thus needs to be scaled. We propose, in Section 3.3, a new approach for solving this scaling problem is given by taking advantage of the fact that the quantity d(f(y[O),9(y[O±)) is bounded. Comments and extensions are given in Section 4.

2. Notations and assumptions Let y = {yt : t = 1,..., m} be a first-order nonhomogeneous Markov chain with finite state space denoted by K, and with an initial state Yl assumed to be fixed. We denote by k the cardinal of K, and by m = m - 1 the number of transitions of the chain y. The density (associated to the counting measure) of the chain (Yt), parametrized by 0 = (qt(r,s)), is denoted by f(y[O). The data, consist of n independent realizations of the process (Yt), is denoted by y; so y = (Yl ..... Yi . . . . . Yn) where Yi = ( Y ( i , 1 ) . . . . . Y(i,t) . . . . . Y(i,m)). We assume that:

qt(r,s) > 0 for all t and for all (r,s) C K 2.

(2.1)

Let E, = {(xl . . . . . xj . . . . . x,) E ~k I ~ = l xj = 1,xj > 0}. Then we have O = Ea where d = k ( m - 2 ) + 1; note that O is an open convex subset of ~a. We denote by de' the set of Markov chains of this type and by aft the subset of J¢ composed of homogeneous Markov chains. The density of a chain (Yt) of our, parametrized by 2 = (2(r,s)), is denoted by 0(y[2). Note that 2 E A = Ek*. For k and m fixed we denote by ~ the set of the possible paths with m transitions of elements of K. Note that Card q / = k m-l. We assume that the parameters qt(r) = (qt(r, 1)..... qt(r, k)) are a priori independent. Settling on Dirichlet prior distributions, we assume that qt(r) ~ ~k(at(r, 1) ..... at(r,k)), where the hyperparameters (at(r,s))s=l,, are assumed to be known. In this paper we consider a loss which allows the experimenter to penalize differently the bad decisions (Berger, 1985). If the cost due to rejecting H0 wrongly is co, and the cost due to accepting it wrongly is cl, the optimal Bayesian procedure is to accept the null hypothesis when its posterior probability is larger than 1/(c + 1) where c = co~el.

J.A. Dupuis / Statistics & Probability Letters 31 (1997) 333-338

335

3. The Kullback-Leibler distance approach

3.1. The entropic motivation The entropy of a random variable y with distribution f(ylO) is defined as follows: Ho(y) = -~-o log f(y[O). The Kullback-Leibler distance is related to the entropy since for all (0, ~) E 02, it is defined as:

d(f(ylO),f(yl~)) = I:0 log f(YlO)'~ f(yl~)J = -EHo(y) + ~-ologf(yl~)]. Note that, since we assume that the qt(r,s)'s are strictly positive, entropy Ho(y) as well as d(f(ylO ), f ( y [ ~ ) ) are always defined and finite. Moreover Ho(y) is strictly positive. The fact that the Kullback-Leibler distance is not symmetric is not here a disadvantage, since the formulation of the test is by nature asymmetric. Indeed, we are concerned with measuring the Kullback distance from the (true) density f(y[O) of the Markov chain (Yt) to the space 9f, which is clearly the quantity which is measured by d(f(ylO),9(y[O±)), since the logarithm of the deviation between f(y[O) and O(y[O±)is averaged with respect to the (true) distribution of (yt), namely f(y[O). This quantity, denoted by d(f(yIO),~,ug) is called the Kullback proximity of f(y[O) to (see e.g. Gourirroux and Monfort, 1989). Note that 0± defined by (1.2) should not be confused with the I-divergence of f(y[O) on J f (see e.g. Csiszar, 1975) defined as 0± = Arg minz~Ad(9(yl2),f(ylO)). Entropy is best interpreted as a measure of heterogeneity for multinomial distribution (with unordered cells), as in Kullback (1954). Indeed it is well known that the entropy of a random variable x which follows a multinomial distribution Jl[(1, (Pb..., Pj .... , pq)) is maximum if and only if pj = 1/q for all j, that is when the distribution is homogeneous; thus, its maximum value is then logq. On the contrary Hp(x) is minimum when x is not random and in this case Hp(x) = 0. The quantity logq - Hp(x) could be therefore interpreted as a measure of the homogeneity of the distribution of x, as well as a measure of the distance from the distribution of x to the distribution associated to the homogeneity, we denote by f(x[p*) where

since

q

[logq] - Hp(x) = Z

PJ log ~

[

f(xIp)

]

= ~:p log f(xlp*)] = d(f(xlP)'f(xlP*))"

(3.1)

j=l

The concept of entropy and, consequently the use of Kullback-Leibler distance, are also appropriate for the problem we consider. The quantity d(f(ylO),9(ylO±)) is thus clearly similar to d(f(x[p),f(xlp*)) when focusing on time homogeneity issues; it can be perceived from two perspectives. The first is to think of this quantity as a parameter of the distribution of (Yt) which measures the (time) homogeneity of the chain, in the same way the quantity l o g / q - Hp(x) measures the homogeneity of the distribution of x (see (3.1)). The second perspective is to think of it as a statistical measure of the error made by using g(yl0±) instead of f(ylO) (that is by accepting H0 wrongly); this would be appealing if the selection model is a first step before an estimation one.

3.2. Calculating the projection of f(ylO) on .~ Proposition 3.1. The minimizin9 problem Arg min). d ( f (y I0), O(yl£)), under the k constraints ~,~1~ 2(r, s)=1,

has a unique solution.

J.A. Dupuis I Statistics & Probability Letters 31 (1997) 333-338

336

Proof. Let k and m be fixed. First of all, note that A is an convex open set of ~d, with d = k k. For any given 0, the function ). ---, d(f(y[O),g(yl2)) is indefinitely derivable with respect to 2. Therefore a point 2 - - ( 2 ( r , s ) ) solution of the minimizing problem should necessarily satisfy E0 log

Ow

f(ylO ) + 0(Yl2)

Z VrSr] = 0 / rEK

(3.2)

where Sr = - 1 + ~--~s~,r2(r,s) and the partial derivatives are with respect to the variables w which represent either the 2(r,s)'s or the Lagrange multipliers v/s. Now we have

a f(ylO) ~2(r,s----~Eolog~#(y12 where N(r,~)(y) = ~,~1 from r --* s in the chain given by

a Y-o~

1 logg(yl2) -

2(r,s)~-°N(~'~)(Y)

D(yt = r,y(t+l) = s) is the random variable which counts (Yt). It is straightforward to verify that the system (3.2)

O±(r,s) - EoN(r,s)(y) EoN(r)(y)

where

the number of transitions has a unique solution O±

N(~)(y) = ~--~N(r,~)(y). sex

(3.3)

Note that, taking into the assumptions qt(r,s) > 0, the quantities O±(r,s) are always defined since EoN(~)(y) 0. Now it is straightforward to check that then the function d(f(y[O),O(Yl2)) is convex with respect to 2. Then 0± = (O±(r,s)) is the unique solution of the minimizing problem. Note that O±(r,s) has a meaningful form since it is the ratio of the expected number of transitions r ~ s of the chain y over the expected number of passages to the chain in r. []

3.3. Scalin9 the threshold Let us recall that the homogeneity test we consider is formulated as: H0:

a(f(y[O),~)<<.~

The response to the test depends obviously on the choice of the threshold e. Different approaches have been viewed for calibrating the Kullback-Leibler distance. Most often a calibration is first "decided" with a usual distribution (e.g. Bernouilli, or Poisson distribution) and the calibration of e is derived from a one-to-one correspondence between both calibrations (see for details McCulloch, 1989). Here scaling e is based upon the fact that suP0eo d(f(y]O ), ~ ) is finite, which is a direct consequence of the following Proposition 3.2. For m and k fixed, we denote by Hk*m the maximum entropy of y, namely: suPoeoHo(y ).

Proposition 3.2. For all 0 c=0, d(f(ylO ), :¢g) < Hk*,m Proof. As

Ho(y)

> 0, we have for all 0:

d(f(ylO),9(ylO±)) = Eo [log ~ ] We now show that

Eolog(1/O(ylO±)) is

- Ho(y) < Eo [logg(yiO±) ] . bounded by Hk,m. We have

~-ologg(ylO±)= Z f(YlO)log ( I-[ O±(r,s)N~rm(Y))= Z[logO±(r,s)]~-oN(r,s)(y). yEq/

\ (r,s)

/

(r,s)

(3.4)

337

ZA. Dupuis / Statistics & Probability Letters 31 (1997) 333-338

Taking into account that

EoN(r,s)(y) : Ok(r,s)~-oN(r)(y)

we have

~_~[logO±(r,s)][EoN(r,s)(Y)=Z(~o[N(r)(Y)]~_~O-L(r,s)logO±(r,s) ) • (r,s)

rcK

sEK

As Y'~s~x O±(r,s)log(1/Ok(r,s)) is bounded by logk (which is the maximum entropy of a multinomial distribution with k cells) it follows that

~0 [ 1 o g ~ ] since

~~exN(~)(y)

(3.5,

~(logk)Z~-oN(r)(y)=Hk*,m rEK

= m

and

Hk*m=

mlogk.

[]

For k and m fixed let us denote Sk,m = suP0eo d(fo(y), and is finite. The test is thus reformulated as follows: H0:

d(J(ylO),Jg)~vSk, m

~).

From Proposition 3.2, this upper bound exists

where 7 E]0, I[.

(3.6)

The choice of the degree of proximity to homogeneity is now formulated through 7, that is, as a percentage of the maximum distance to homogeneity. Note that, when Ho is accepted, it could be of interest to compute, for a fixed ratio of co~c1, the minimum 7, beyond which H0 is accepted. We point out that generally Sm,k has not a closed form expression and thus numerical methods should be implemented to assess this quantity.

4. Comments Obviously the quantity n(H0ly) cannot be analytically computed but it is easily approximated by classical n Monte-Carlo simulation methods. Let us denote wt(r,s)= ~i=1 ~(Y(i,t)= r,Y(i,t+l)= S). Step (l) of the algorithm is (1) generate qll)(r)[y ~ ~k(at(r,s) + wt(r,s)); 1 I (2) update 7 ~ p = l 0Ho(0(P)), where DHo(0(p)) = 1 if d(f(y[O(P)), ~f)~e. Applying the strong Law of Large Numbers, it follows that

1 l

7 ~

~H°(0(P))

/-~+e~

' E~[DH°(0lY)] ----~(H0lY)

(a.s).

p=l In this paper we have considered that the initial state of the chain was fixed (thus not random). Propositions 3.1 and 3.2 also hold if it is assumed that the Markov chain has an initial distribution p = (#1 ..... #s ..... Pk). It is actually straightforward to check that Pk----# and that Ok(r,s) is not modified. Proposition 3.2 is derived from

d(j'(ylp,

0), g(zl#k, 0k)) =

d(f (yl I#), g(zl ]#±)) + Z #rd(f (y[yl

= r, 0), g(Y[yl

= r, Ok ))

rEK

= Eprd(f(y[yl =- r,O),g(ylyl

= r, 0k)) < (m -- 1)logk.

(4.1)

rEK

On the fringe of the test we construct in this paper, note that we can take advantage of the Proposition 3.2 to define a quantity that belongs to [0, 1] and that measures, in an intrinsic way, the degree of homogeneity of the Markov chain (Yt) namely Sk,,~/d(f(y[O(P)),Jg).

338

J.A. Dupuis I Statistics & Probability Letters 31 (1997) 333-338

Acknowledgements The author is grateful to Christian Robert, and George Casella for helpful discussions. The author is also grateful to CREST-INSEEfor financial support.

References Bartholomew, D.J. (1982), Stochastical Models for social Process (Wiley, New York, 3rd ed.). Berger, J.O. (1985), Statistical Decision Theory on Bayesian Analysis (Springer, New York). Csiszar, I. (1976),/-Divergence geometry of probability distributions and minimization problems, Ann. Probab. 3, 146-158. Dupuis, J.A. (1995), Bayesian estimation of movement probabilities for open populations using hidden Markov chains, to appear in: Biometrika 82, 761-772. Ferrandiz, J.R. (1985), Bayesian influence on Mahalanobis distance: an alternative approach to Bayesian model testing, in: J.M. Bernardo, M.H. De Groot, D.V. Lindley, A.F.M. Smith, eds., Bayesian Statistics, Vol. 2 (Oxford University Press, Oxford) pp. 645-654. Gilula, Z. and J.S. Haberman (1994), Conditional log-linear models for analyzing categorical panel data, J. Amer. Statist. Assoc. 89, 645-656. Gourirroux, C. and A. Monfort (1989), Statistique et ModUles Econom~triques (Economica, Paris). Kullback, S. (1954), Information Theory and Statistics (Wiley, New York). McCulloch, R.E. (1989), Local model influence, J. Amer. Statist. Assoc. 84, 473-478. McCulloch, R.E. and P.E. Rossi (1992), Bayes factors for non linear hypotheses and likelihood distributions, Biometrika 79, 663-676. Mengersen, K.L. and C.P. Robert (1996), Testing for mixtures: a Bayesian entropic approach. In: J.M Bernardo, J.O. Berger, A.P. Dawid and A.FM. Smith, eds., Bayesian Statistics 5 (Oxford University Press, Oxford) pp. 255-276. Roberts, M.R. and A.J. Hruska (1986), Predicting diameter distributions: a test of the stationary Markov model, Canal. J. For. Res. 16, 130-135. Seber, G.A.F. (1982), The Estimation of Animal Abundance and Related Parameters (Macmillan, New York, 2nd ed.). Vardeman, S.B. (1987), Comments on "Testing a point null hypothesis", J. Amer. Statist. Assoc. 82, 130-131.