Estimation in a growth curve model with singular covariance

Journal of Statistical Planning and Inference 97 (2001) 323–342

www.elsevier.com/locate/jspi

Estimation in a growth curve model with singular covariance

Chi Song Wong^{a,∗}, Hua Cheng^{b}

^a Department of Economics, Mathematics and Statistics, University of Windsor, Windsor, Ont. N9B 3P4, Canada
^b Statistician and Distribution Analyst, Analog Devices, 804 Woburn Street, MA 01887, USA

Received 26 February 1999; received in revised form 1 September 2000; accepted 13 September 2000

Abstract

Let Y be a multivariate normal random matrix with covariance A ⊗ Σ and mean μ ∈ S1□S2, where Si = {Xi bi: Ki′bi = Mi′ui for some ui} and S1□S2 is the linear span of the set of all x1□x2 with xi ∈ Si. Explicit formulae are obtained for the estimators of (μ, Σ). These estimators are investigated through a large class of loss functions and other principles. None of the matrices A, Σ, Xi, Ki and Mi is assumed to have full column rank. For robust studies, elliptical Y is considered when it is appropriate. © 2001 Elsevier Science B.V. All rights reserved.

MSC: primary 62J05; secondary 62H15

Keywords: Best quadratic unbiased estimator; Covariance structure; Image set of a generalized central Wishart random matrix; Maximum likelihood estimator; Linear regression constraint; Loss function; Mean structure; Sensible estimator

1. Introduction

In this paper, Mn×p will denote the set of all n × p matrices over the real field R, Rn will denote Mn×1, and Np will denote the set of all non-negative definite (n.n.d.) p × p matrices. For T ∈ Mn×p, T′, T⁺ and Im T will denote, respectively, the transpose, the Moore–Penrose inverse and the column space of T. Let Y be an n × p normal random matrix with mean

μ = X1ΘX2′

(1.1)

Supported by NSERC, PIN No. 7914. The paper is based on the talk of the first author in honour of Professor C.R. Rao at the Conference 'Statistical Research in the 21st Century', Montreal; it was finalized during his stay in April–May, 2000 at the Swedish University of Agricultural Sciences, Uppsala, Sweden. ∗ Correspondence address: Department of Economics, Mathematics and Statistics, University of Windsor, Windsor, Ont., N9B 3P4, Canada. E-mail addresses: [email protected] (C.S. Wong), [email protected] (H. Cheng).

0378-3758/01/$ - see front matter © 2001 Elsevier Science B.V. All rights reserved. PII: S0378-3758(00)00220-2


and covariance

Σ_Y = A ⊗ Σ,

(1.2)

where X1 ∈ Mn×q, X2 ∈ Mp×k, A ∈ Nn are known, Θ ∈ Mq×k and Σ ∈ Np are unknown, and X1, X2, A and Σ are non-zero. Note that (1.1) can be rewritten as

μ ∈ S ≡ S1□S2 (≠ {0}),

(1.3)

where S1 = Im X1, S2 = Im X2, and S1□S2 is the linear span of u□v with u ∈ S1 and v ∈ S2. Both (1.3) and (1.2) amount to a separation of the design and the population. S1 and S2 may, respectively, be referred to as the design space and the population space. Instead of taking Si = Im Xi, one may also choose

Si = {Xi bi: Ki′bi = Mi′vi for some vi},

(1.4)

where, without loss of generality, we may assume that Im Ki ⊂ Im Xi′ and Im Mi′ ⊂ Im Ki′; see Chan and Keung (1997) and Wong Chi Song (1989). Motivation for using (1.3) with (1.4) can be found in Wong Chi Song (1989) in terms of multivariate regression models and linear models with covariates. When Ki and Mi are 0, (1.3) with restriction (1.4) is nothing but (1.1). In this paper, we shall consider the normal model Y with mean structure and covariance structure given by (1.3) and (1.2). Note that (a) if p = 1 and Σ = (σ²), then Y is the usual linear model; (b) if A = In, Σ is non-singular, S1 = Rn and S2 = Rp, then Y is the usual multivariate model; (c) if A = In, Σ is non-singular and S2 = Rp, then Y is the usual multivariate linear model with or without restrictions. (a)–(c) above indicate certain areas covered by the inference of growth curve models. (a) was treated in Rao (1973) and Wong et al. (1991); (b) and (c) were treated in Rao (1973), Arnold (1981), Eaton (1983), Muirhead (1982) and Anderson (1984). Growth curve models were treated in Potthoff and Roy (1964), Rao (1973), Fujikoshi (1974), Arnold (1981), Muirhead (1982), von Rosen (1991a, b) and the references therein, all for the case where Σ is non-singular. Our main contribution here is for the case where Σ is singular. In fact, none of the matrices in this paper is assumed to have full column rank. [Recall that for principal components, we assume that Σ is non-singular and identify small eigenvalues of Σ; here we consider singular Σ and try to identify zero eigenvalues of Σ.] Our results are expressed in terms of certain orthogonal projections on Euclidean [or any finite-dimensional inner product] spaces. To obtain matrix versions of our results, one need only use the projection formula in Lemma 3.3. In this way, complications caused by the matrix representations in terms of (1.1) and (1.4) are avoided. This paper is organized into seven sections. Basic definitions and notations are presented in Section 2.
For statistical inference, we need certain results on projections and optimization (minima and maxima); these are presented in Sections 3 and 4. In particular, projections involve factorization of matrices and generalized inverses, while optimization involves differential forms and convex analysis. In Theorem 5.1, we show that among all quadratic unbiased estimators B′Y′WYB of Σ with n.n.d. W,


Σ̂(Y) in Wong Chi Song et al. (1995) is the best with respect to a large class of loss functions. Special cases of our result were obtained in Calvert and Seber (1978). All statistical principles are sensitive to the choice of the parameter space that contains the given fixed unknown parameter. To some extent, the purpose of inference is to reduce the assumed parameter space to a smaller one through various principles. By the theory of generalized Wishart distributions, we know that Im Σ̂(Y) = Im Σ with probability 1; this result generalizes an earlier result of Dykstra (1970). By this result, we reduce Y, in Section 6, to a growth curve model with the properties S1 ⊂ Im A and S2 ⊂ Im Σ. Then we obtain in (6.15) a sensible estimator of the mean in the sense of Arnold (1981): obtain in (6.10) a best linear unbiased estimator of the mean by assuming that Σ is known, and then replace Σ in (6.10) by its estimator obtained in Theorem 5.1. Through a large class of loss functions, this estimator is shown to be better than the standard estimator in (6.16) for the usual multivariate linear models where S2 = Rp; such a result is game- or decision-theoretic in nature. In Section 7, Y is reduced further to a growth curve model for which the covariance is non-singular, and an estimator of (μ, Σ) is then obtained in (7.33) and (7.35) through the maximum likelihood method. All estimators of (μ, Σ) involve A⁰ = A⁺A and Σ̂(Y)⁰ = Σ̂(Y)⁺Σ̂(Y); this distinguishes inference in a singular growth curve model from that in a non-singular one, for which A⁰ = In and Σ̂(Y)⁰ = Ip with probability 1. While our formulae are complicated, they are pretty, and the underlying estimates can easily be obtained through computer programs (using, e.g., a matrix procedure).

2. Basic definitions and notations

Let V, E be p- and n-dimensional inner product spaces over R. Then L(V, E) will denote the set of all linear transformations of V into E, equipped with the trace inner product ⟨·, ·⟩:

⟨A, B⟩ = tr(B′A),

(2.1)

where B′ is the adjoint of B. T ∈ L(V, E) will be identified with its matrix representation [T] with respect to the given bases. Thus if V = Rp and E = Rn, then L(V, E) = Mn×p and T ∈ Mn×p is the linear transformation x ↦ Tx on Rp. Most lemmas and theorems are proved in such a way that one can read matrices as operators and vice versa. For T ∈ L(V, E), Im T is the image set {T(b): b ∈ V} of T, r(T) is the rank of T, i.e., the dimension of Im T, and T⁻ will denote a generalized inverse of T. For generalized inverses, see Rao and Mitra (1971), Kruskal (1975) or Wong Chi Song (1986). When T ∈ L(E, E) is n.n.d. and α > 0, T^α will denote the αth n.n.d. root of T, T^{−α} will denote the αth n.n.d. root of T⁺, and T⁰ will denote T⁺T; thus T⁰ = T^α T^{−α} = T^{−α} T^α and T⁰ is the orthogonal projection of E onto Im T. When T is p.d., T⁰ above is nothing but the identity map on E. For T ∈ L(V, V), tr T and det T (or |T|) will denote the trace and determinant of T. Pp will denote the set of all positive definite (p.d.) T ∈ L(V, V) and Sp will denote the set of all T ∈ L(V, V) with


T′ = T. For a linear subspace S of E, the orthogonal projection of E onto S is defined as the P_S ∈ L(E, E) with (P_S)² = P_S, P_S′ = P_S and Im P_S = S. For any u ∈ E and v ∈ V, the outer product u□v is the element in L(V, E) such that (u□v)(z) = ⟨v, z⟩u for all z ∈ V. If u ∈ Rn and v ∈ Rp, then with respect to the usual bases, u□v = uv′ ∈ Mn×p. For bilinearity, the notation u□v is more convenient than uv′. For K ⊆ V, K⊥ is the set {x ∈ V: ⟨x, y⟩ = 0 for all y ∈ K}, where ⟨·, ·⟩ is the underlying inner product. For any H ⊆ E and K ⊆ V, H□K will denote the linear span of {u□v: u ∈ H, v ∈ K}. For any A ∈ L(E1, E2) and B ∈ L(V1, V2), the Kronecker product A ⊗ B is the element in L(L(V1, E1), L(V2, E2)) such that (A ⊗ B)(C) = ACB′ for all C ∈ L(V1, E1), or in matrix form,

(A ⊗ B)vec C = vec(ACB′),

(2.2)

where one need not order the indices of C = (c_jk) and A ⊗ B = (a_ij b_lk)_{(i,l),(j,k)}; see Wong Chi Song (1985) and Wong Chi Song and Liu (1995). □ is essentially a special case of ⊗, and ⊗ can be defined by the universal property; here we follow Eaton (1983) and define □ and ⊗ as above. Note that [A ⊗ B] = [A] ⊗ [B] and Im(A ⊗ B) = (Im A)□(Im B). Now let Y be a random matrix in Mn×p. Then Y is said to be multivariate elliptically contoured distributed (written Y ∼ MEC_{n×p}(μ, Σ_Y, φ)) if the characteristic function φ̂_Y of Y is given by

φ̂_Y(T) = e^{i⟨T, μ⟩}φ(u),  u = ⟨T, Σ_Y(T)⟩,  T ∈ Mn×p,   (2.3)

where μ is the mean of Y and ⟨·, ·⟩ is the trace inner product. Eq. (2.3) is still meaningful if we replace Mn×p by L(V, E). If φ(u) = e^{−u/2} for u ∈ R, then Y ∼ N(μ, Σ_Y), i.e., Y is normal with mean μ and covariance Σ_Y. By (2.3), φ′(0) = −1/2. For elliptically contoured distributions, we refer the reader to Gupta and Varga (1993). We shall assume that Y is normal except in remarks on robustness. Two random matrices are regarded as the same if they are equal to each other with probability 1.
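The matrix form (2.2) of ⊗ can be checked numerically. The sketch below is illustrative only: it uses the usual column-major vec, for which the matrix of the operator C ↦ ACB′ is kron(B, A) — one concrete choice of the index ordering that, as noted above, need not be fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))   # A in L(E1, E2)
B = rng.standard_normal((4, 5))   # B in L(V1, V2)
C = rng.standard_normal((2, 5))   # C in L(V1, E1)

vec = lambda M: M.reshape(-1, order="F")   # column-major vec

# operator definition: (A ⊗ B)(C) = A C B'
lhs = vec(A @ C @ B.T)
# matrix form w.r.t. column-major vec: kron(B, A)
rhs = np.kron(B, A) @ vec(C)
assert np.allclose(lhs, rhs)
```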

3. Lemmas on projections

Results on least squares and projections obtained in this section can be used on many occasions. For any n × n p.d. matrix D, we may define an inner product ⟨·, ·⟩ on Rn × Rn by ⟨x, y⟩ = y′Dx (and define the corresponding norm ‖·‖ by ‖x‖ = ⟨x, x⟩^{1/2}). Let x∗ = D^{1/2}x. Then ⟨x, y⟩ = y∗′x∗. Every inner product for Rn can be obtained in this way. So we shall merely present Lemmas 3.1–3.3 in the Euclidean space Rn. Further, replacing the transpose operator ′ by the adjoint operator ∗, Lemmas 3.1–3.3 can also be extended to results for an inner product space over the complex field. Let S be a linear subspace of Rn. Then for any z ∈ Rn, z + S is called a flat. Let y, z ∈ Rn and let P_{z+S}y be the vector in Rn such that P_{z+S}y ∈ z + S and ‖P_{z+S}y − y‖² = min{‖x − y‖²: x ∈ z + S}. When z = 0, P_S is the same as that in Section 2. Let e_S y = y − P_S y. We shall use SSE_{z+S}(y) to denote ‖e_S(y − z)‖² and SSR_{z+S}(y) to denote ‖P_S(y − z)‖². Although P_S and e_S have 'opposite' meanings in statistics,


the theory for one is the same as that for the other. Let U, V be flats in Rn such that V ⊂ U. We shall use SSE_{V|U}(y) to denote SSE_V(y) − SSE_U(y) and SSR_{V|U}(y) to denote SSR_U(y) − SSR_V(y). Thus SSE_{V|U}(y) = SSR_{V|U}(y). The sum of squares SSR_{V|U}(y) is used to measure the difference between the projections P_U y and P_V y. Now assume that S is spanned by a set of vectors X1, …, Xp ∈ Rn, i.e., S is the column space {Xb: b ∈ Rp} of X = [X1, X2, …, Xp] ∈ Mn×p. Then P_S y = X b̂(y) for some b̂(y) ∈ Rp. It is known that b̂(y) = (X′X)⁻X′y and

(P_S =) P_{Im X} = X(X′X)⁻X′.

(3.0)
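Formula (3.0) is easy to verify numerically even when X lacks full column rank, since any g-inverse of X′X may be used; the sketch below (with an arbitrary rank-deficient X, a hypothetical example) takes the Moore–Penrose inverse:

```python
import numpy as np

rng = np.random.default_rng(1)
# X deliberately rank-deficient: third column = sum of the first two
X = rng.standard_normal((6, 2))
X = np.hstack([X, X.sum(axis=1, keepdims=True)])

# (3.0): P_{Im X} = X (X'X)^- X'; pinv is one admissible g-inverse
P = X @ np.linalg.pinv(X.T @ X) @ X.T

assert np.allclose(P @ P, P)       # idempotent
assert np.allclose(P, P.T)         # symmetric
assert np.allclose(P @ X, X)       # fixes Im X
assert np.isclose(np.trace(P), np.linalg.matrix_rank(X))  # cf. Lemma 3.1(i)
```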

By induction, this simple formula actually gives rise to the Gram–Schmidt orthogonalization process.

Lemma 3.1. Let {Xj}_{j=1}^k be a family of matrices Xj over R such that Im X_{2i} ⊂ Im X′_{2i−1} and Im X′_{2i+1} ⊂ Im X_{2i}. Let A1 = X1′X1 (= X1′(I′I)⁻X1), A_{2i} = X_{2i}A⁻_{2i−1}X′_{2i} and A_{2i+1} = X′_{2i+1}A⁻_{2i}X_{2i+1}. Then r(Aj) = r(Xj), j = 1, 2, …, k. Hence for any X ∈ Mn×p, K ∈ Mp×s and W ∈ Mq×s: (i) r(X) = r(X(X′X)⁻X′) (= tr(X(X′X)⁻X′)); (ii) r(K) = r(K′(X′X)⁻K) if Im K ⊂ Im X′; and (iii) r(W) = r(W(K′(X′X)⁻K)⁻W′) if Im K ⊂ Im X′ and Im W′ ⊂ Im K′.

Lemma 3.2. Let X ∈ Mn×p, H ∈ Mp×s and K′ = H′X′X. Then (a) Im(XH) = (Im X) ∩ ((Im X) ∩ ker((XH)′))⊥ and (b) SSR_{Im XH}(y) = Q_{K′}(b̂(y)), y ∈ Rn, where Q_{K′}(b) = (K′b)′(K′(X′X)⁻K)⁻K′b for b ∈ Rp.

Proof. (a) follows from (i) Im(XH) ⊂ (Im X) ∩ ((Im X) ∩ ker((XH)′))⊥ and (ii) Im(XH) and (Im X) ∩ ((Im X) ∩ ker((XH)′))⊥ have the same dimension. (b) By (3.0), SSR_{Im(XH)}(y) = y′P_{Im(XH)}y = y′XH(H′X′XH)⁻(XH)′y. So (b) follows from

K′(X′X)⁻K = H′(X′X)H  and  K′b̂(y) = H′X′y.

(3.1)

Lemma 3.3. Let y ∈ Rn, X ∈ Mn×p, K ∈ Mp×s and W ∈ Mq×s be such that Im K ⊂ Im X′, a ∈ Im K′ − Im W′ and Im W′ ⊂ Im K′. Let b̂(y) = (X′X)⁻X′y and E = {Xb: b ∈ Rp, K′b = W′v + a for some v ∈ Rq}. Then (a)

P_E y = X b̂(y) − X(X′X)⁻K(K′(X′X)⁻K)⁻(K′b̂(y) − a)
+ X(X′X)⁻K(K′(X′X)⁻K)⁻W′[W(K′(X′X)⁻K)⁻W′]⁻W(K′(X′X)⁻K)⁻(K′b̂(y) − a),

and (b)

SSE_E(y) = (y′y − y′X b̂(y)) + Q_{K′,a}(b̂(y)) − Q_{K′,W′,a}(b̂(y)),


where for b ∈ Rp,

Q_{K′,a}(b) = (K′b − a)′(K′(X′X)⁻K)⁻(K′b − a)

and

Q_{K′,W′,a}(b) = (K′b − a)′(K′(X′X)⁻K)⁻W′[W(K′(X′X)⁻K)⁻W′]⁻W(K′(X′X)⁻K)⁻(K′b − a).

Proof. By a translation argument, we may assume that a = 0 and write Q_{K′,W′} for Q_{K′,W′,0}. Thus E = {Xb: b ∈ Rp, K′b = W′v for some v}. Since Im W′ ⊂ Im K′, W′ = K′B for some B ∈ Mp×q. Since Im K ⊂ Im X′, K′ = H′X′X for some H ∈ Mp×s. So E = {z ∈ Im X: H′X′z = (H′X′XB)(v) for some v} = (Im X) ∩ [ker(H′X′) + Im((H′X′)⁻H′X′XB)]. Taking orthogonal complements in Im X, we obtain (Im X) ∩ E⊥ = (Im X) ∩ (ker(H′X′))⊥ ∩ (Im[(H′X′)⁻H′X′XB])⊥. By Lemma 3.2(a),

(Im X) ∩ E⊥ = (Im XH) ∩ (Im((XH)′)⁻(XH)′XB)⊥
= {XHγ: γ ∈ Rs, (XHγ)′((XH)⁻)′(XH)′XB = 0}
= {XHγ: γ ∈ Rs, γ′(XH)′((XH)⁻)′(XH)′XB = 0}
= {XHγ: γ ∈ Rs, γ′(XH)′XB = 0}.

Thus,

(Im X) ∩ E⊥ = {X∗b∗: b∗ ∈ R^{p∗}, K∗b∗ = 0},

(3.2)

where

X∗ = XH,  b∗ = γ,  p∗ = s,  K∗ = (XB)′X∗.

(3.3)

So K∗ = H∗′X∗′X∗ for some H∗ ∈ Ms×q. Thus by (3.2), (Im X) ∩ E⊥ = (Im X∗) ∩ ker((X∗H∗)′), and therefore P_{(Im X)∩E⊥} = P_{Im X∗} − P_{(Im X∗)∩[(Im X∗)∩ker((X∗H∗)′)]⊥}. By Lemma 3.2(a), P_{(Im X)∩E⊥} = P_{Im X∗} − P_{Im(X∗H∗)}. Since E ⊂ Im X, P_E = P_{Im X} − P_{(Im X)∩E⊥}. Thus

P_E = P_{Im X} − (P_{Im X∗} − P_{Im(X∗H∗)}).   (3.4)

By (3.0) and (3.3), P_{Im X∗} = X∗(X∗′X∗)⁻X∗′ = XH(H′X′XH)⁻(XH)′. So by (3.1), P_{Im X∗} = XH(K′(X′X)⁻K)⁻(XH)′ and therefore

P_{Im(X∗H∗)} = X∗H∗(K∗(X∗′X∗)⁻K∗′)⁻(X∗H∗)′.   (3.5)

Since

XH = X(X′X)⁻K   (3.6)

and

(XH)′ = K′(X′X)⁻X′,

P_{Im X∗} = X(X′X)⁻K(K′(X′X)⁻K)⁻K′(X′X)⁻X′.   (3.7)

By (3.3) and (3.1), X∗H∗ = X∗(X∗′X∗)⁻K∗′ = XH(K′(X′X)⁻K)⁻W′. So by (3.6),

X∗H∗ = X(X′X)⁻K(K′(X′X)⁻K)⁻W′.   (3.8)

Now by (3.3) and (3.1),

K∗(X∗′X∗)⁻K∗′ = W(K′(X′X)⁻K)⁻W′.

(3.9)


By (3.8) and (3.9), we obtain from (3.5),

P_{Im(X∗H∗)} = X(X′X)⁻K(K′(X′X)⁻K)⁻W′[W(K′(X′X)⁻K)⁻W′]⁻W(K′(X′X)⁻K)⁻K′(X′X)⁻X′.

(3.10)

By (3.10), (3.7) and (3.0), we obtain from (3.4) the desired result:

P_E = X(X′X)⁻X′ − X(X′X)⁻K(K′(X′X)⁻K)⁻K′(X′X)⁻X′
+ X(X′X)⁻K(K′(X′X)⁻K)⁻W′[W(K′(X′X)⁻K)⁻W′]⁻W(K′(X′X)⁻K)⁻K′(X′X)⁻X′.

(3.11)
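As a numerical sanity check of (3.11) (with a = 0 and hypothetical X, K, W chosen so that Im K ⊂ Im X′ and Im W′ ⊂ Im K′ hold automatically), one can compare the explicit formula against a brute-force projection onto E:

```python
import numpy as np

rng = np.random.default_rng(2)
pinv = np.linalg.pinv

X = rng.standard_normal((6, 4))   # full column rank, so Im K ⊂ Im X' holds
K = rng.standard_normal((4, 2))   # r(K) = 2, so Im W' ⊂ Im K' for any W below
W = rng.standard_normal((1, 2))

G = pinv(X.T @ X)
Gi = pinv(K.T @ G @ K)            # (K'(X'X)^- K)^-
Mi = pinv(W @ Gi @ W.T)           # [W(K'(X'X)^- K)^- W']^-

# formula (3.11), with a = 0
PE = (X @ G @ X.T
      - X @ G @ K @ Gi @ K.T @ G @ X.T
      + X @ G @ K @ Gi @ W.T @ Mi @ W @ Gi @ K.T @ G @ X.T)

# brute force: E = {Xb : K'b in Im W'} = X(null((I - W'(W')^+)K'))
Pw = W.T @ pinv(W.T)
_, s, vt = np.linalg.svd((np.eye(2) - Pw) @ K.T)
N = vt[(s > 1e-10).sum():].T      # basis of the constrained b's
Z = X @ N
PE_direct = Z @ pinv(Z.T @ Z) @ Z.T

assert np.allclose(PE, PE_direct)
```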

(b) Again, we may assume that a = 0. By (3.4), Im X = E ⊕ ((Im X) ∩ E⊥) and Im X∗ = [(Im X) ∩ E⊥] ⊕ [Im(X∗H∗)]; we obtain ‖P_E y‖² = ‖P_{Im X} y‖² − ‖P_{Im X∗} y‖² + ‖P_{Im(X∗H∗)} y‖². Since SSE_E(y) = y′y − SSR_E(y) and ‖P_{Im X} y‖² = y′X b̂(y), by Lemma 3.2(b) it suffices to prove that ‖P_{Im(X∗H∗)} y‖² = Q_{K′,W′}(b̂(y)), which follows from ‖P_{Im(X∗H∗)} y‖² = y′P_{Im(X∗H∗)} y and (3.10).

Lemma 3.4. Let E ∈ Mn×s and C ∈ Nn. (a) Suppose that EE′C = C. Then (E′CE)⁺ = E′C⁺E. (b) Suppose that P ∈ Nn with P² = P. Then P(PCP)⁺ = (PCP)⁺ = (PCP)⁺P.

Lemma 3.5. Let Σ ∈ Np and let S be a linear subspace of Rp. Then (a)–(d) are equivalent: (a) S ⊂ Im Σ; (b) P_S Σ⁰ = P_S (= Σ⁰P_S); (c) Σ⁰(S) ⊂ S; (d) Σ⁰P_{(Σ^{1/2}(S⊥))⊥}Σ⁰ = Σ⁰P_{Σ^{−1/2}(S)}Σ⁰ (de Morgan law). Moreover, if (a) holds, then

(P_S Σ⁺P_S)⁺ = Σ − Σ(P_{S⊥}ΣP_{S⊥})⁺Σ.

(3.12)

Proof. That (c) and (d) are equivalent can be found in Wong Chi Song (1993), where several more conditions equivalent to (c) are given. (3.12) can be proved as follows. Since Im(f∘g) = f(Im g), by Lemma 3.5(d) and (3.0),

Σ⁰ − Σ^{1/2}P_{S⊥}(P_{S⊥}ΣP_{S⊥})⁺P_{S⊥}Σ^{1/2} = Σ^{−1/2}P_S(P_SΣ⁺P_S)⁺P_SΣ^{−1/2}.

(3.13)

Multiplying by Σ^{1/2} on the left and on the right of (3.13), we obtain (3.12) from part (b) of Lemmas 3.4 and 3.5. As an application of (3.12), suppose that Σ⁻¹ exists and take P_S = diag[Ir, 0]. Then writing

Σ = (Σ11 Σ12; Σ21 Σ22)  with Σ11 ∈ Mr×r,


we obtain from (3.12) and from

(Σ11 Σ12; Σ21 Σ22)(0 0; 0 Σ22⁻¹)(Σ11 Σ12; Σ21 Σ22) = (Σ12Σ22⁻¹Σ21 Σ12; Σ21 Σ22)

that diag[Σ11 − Σ12Σ22⁻¹Σ21, 0] = diag[((Σ⁻¹)11)⁻¹, 0], i.e.,

(Σ⁻¹)11 = (Σ11 − Σ12Σ22⁻¹Σ21)⁻¹  (= Σ11·2⁻¹),

(3.15)

a familiar formula in Anderson (1984). Thus, the de Morgan law in Lemma 3.5 is in some sense a generalization of (3.15).

4. Lemmas on optimization

The finding of optimal rules in statistics amounts to a mathematical programming problem: the search for minima or maxima of a certain real-valued function. For related results on optimization, see, e.g., Anderson and Olkin (1985).

Lemma 4.1. Let A ∈ Nn be non-zero, let S be a linear subspace of Rn and let W be the set of all W ∈ Nn such that WP_S = 0 and tr(AW) = 1. Let g(W) = tr((AW)²) and W0 = (P_{S⊥}AP_{S⊥})⁺/r(P_{S⊥}AP_{S⊥}). Then W0 ∈ W and g(W0) = min g(W).

Lemma 4.2. Let S ∈ Pp and T, P ∈ Np be such that P² = P. For G ∈ Pp, let

f(G) = −ln|G| − tr(G⁻¹S) − tr(T(PGP)⁺)

(4.1)

and

Ĝ = S + S(PSP)⁺T(PSP)⁺S.

(4.2)

Then f(Ĝ) = max f(Pp).

Proof. To get acquainted with differential forms, we first consider the important case where P = 0. Then f(G) = −ln|G| − tr(G⁻¹S), which, up to a scale of N, is (10) on p. 62 of Anderson (1984). For this case, let

g(G) = f(G⁻¹) = ln|G| − tr(GS).

(4.3)

Note that Pp is open in Sp. Let dG ∈ Sp. Recall the following differential formulae in Wong Chi Song (1985):

(d|G|)(dG) = |G| tr(G⁻¹dG)

(4.4)

and for any linear function h and constant c, d[h(G) + c](dG) = h(dG):

(4.5)

So by (4.3),

dg(G)(dG) = tr(G⁻¹dG) − tr(dG S).

(4.6)


Recall further that the quadratic form of d²g(G) at dG is given by

d(w(G))(dG),   (4.7)

where w(G) = dg(G)(dG), treating dG as a constant. So by the formula dC⁻¹(dC) = −C⁻¹dCC⁻¹ in Wong Chi Song (1985), we obtain from (4.6) that d²g(G)(dG) = −tr(G⁻¹dGG⁻¹dG), which is ≤ 0 because G ∈ Pp and dGG⁻¹dG ∈ Np. Thus −g is convex on Pp (and this is our reason for considering g). So g(G∗) = max g(Pp) if dg(G∗) = 0 as a linear operator on Sp. Thus by (4.6), G∗ = S⁻¹ maximizes g in (4.3) over Pp and therefore Ĝ = S maximizes f over Pp. [The proof provided by Anderson (1984) uses two factorization results, namely A.22 and A.17 in his book.]

Now for P ≠ 0, P = Q diag[I, 0]Q′ = Q(1)Q(1)′ for some orthogonal Q = (Q(1), Q(2)). Let

g(G) = f((QGQ′)⁻¹),  S∗ = Q′SQ,  T∗ = Q′TQ.   (4.8)

Then by the formulae (QH)⁺ = H⁺Q′ and (HQ′)⁺ = QH⁺, we obtain g(G) = ln|G| − tr(GS∗) − tr(T∗(diag[I, 0]G⁻¹diag[I, 0])⁺). Write

S∗ = (S11 S12; S21 S22),  G⁻¹ = (C11 C12; C21 C22) = (C1, C2),  T∗ = (T11 T12; T21 T22),  H = (H11 H12; H21 H22),

where H ∈ Sp. Then H ↦ H11 is linear and

g(G) = ln|G| − tr(GS∗) − tr(T11((G⁻¹)11)⁻¹).

(4.9)

Let dG ∈ Sp. Then

dg(G)(dG) = tr(G⁻¹dG) − tr(dGS∗) − tr(T11((G⁻¹)11)⁻¹(G⁻¹dGG⁻¹)11((G⁻¹)11)⁻¹)
= tr(G⁻¹dG) − tr(S∗dG) − tr[((G⁻¹)11)⁻¹T11((G⁻¹)11)⁻¹(C1′dGC1)].   (4.10)

Setting dg(G)(dG) = 0 on Sp, we obtain G⁻¹ − S∗ − C1((G⁻¹)11)⁻¹T11((G⁻¹)11)⁻¹C1′ = 0, i.e.,

(C11 C12; C21 C22) − (S11 S12; S21 S22) − (C11; C21)C11⁻¹T11C11⁻¹(C11, C12) = 0,

whence, upon a careful simplification, we obtain

G⁻¹ = S∗ + (T11  T11S11⁻¹S12; S21S11⁻¹T11  S21S11⁻¹T11S11⁻¹S12),


i.e.,

G⁻¹ = S∗ + S∗(diag[I, 0]S∗diag[I, 0])⁺T∗(diag[I, 0]S∗diag[I, 0])⁺S∗.

(4.11)

Since g(G) = f((QGQ′)⁻¹), Ĝ = QG⁻¹Q′ maximizes f over Pp if G maximizes g over Pp. Now we show that G in (4.11) maximizes g over Pp. Let {Gn} be a sequence in Pp such that lim g(Gn) = sup g(Pp) ≡ δ ∈ (−∞, ∞]. If the largest eigenvalue of Gn tends to ∞ as n → ∞, then −tr(GnS∗) and therefore g(Gn) tends to −∞ as n → ∞, a contradiction. If the smallest eigenvalue of Gn tends to 0 as n → ∞, then ln|Gn| and therefore g(Gn) tends to −∞ as n → ∞, a contradiction. So some subsequence of {Gn} tends to some G0 ∈ Pp. G0, being a maximum, must satisfy (4.9). Since (4.9) has a unique solution, G0 = G. Thus by (4.8), Ĝ ≡ QG⁻¹Q′ maximizes f over Pp. Now by (4.11), Q′Q = I and Lemma 3.4(a),

Ĝ = QG⁻¹Q′ = QS∗Q′ + QS∗Q′Q(diag[I, 0]Q′QS∗Q′Q diag[I, 0])⁺Q′QT∗Q′Q(diag[I, 0]Q′QS∗Q′Q diag[I, 0])⁺Q′QS∗Q′
= S + SQ(diag[I, 0]Q′SQ diag[I, 0])⁺Q′TQ(diag[I, 0]Q′SQ diag[I, 0])⁺Q′S
= S + S(Q diag[I, 0]Q′SQ diag[I, 0]Q′)⁺T(Q diag[I, 0]Q′SQ diag[I, 0]Q′)⁺S,

i.e., Ĝ = S + S(PSP)⁺T(PSP)⁺S, which is (4.2).

5. The best quadratic unbiased estimator of Σ

Theorem 5.1. Let Y ∼ MEC(μ, Σ_Y, φ) with μ and Σ_Y given by (1.3) and (1.2), r1 = r(P_{S1⊥}AP_{S1⊥}) (> 0),

W0 = (P_{S1⊥}AP_{S1⊥})⁺/r1,

(5.1)

and

Σ̂(y) = y′W0y,  y ∈ Mn×p.

(5.2)

Suppose that M = Im Σ is known (or observable). Then among all unbiased estimators B′Y′WYB of Σ with W ∈ Nn and B ∈ Mp×p, Σ̂(Y) is best with respect to the loss function L given by L(a, (μ, Σ)) = tr[H(a − Σ)K(a − Σ)], where H, K ∈ Np.

Proof. It is obvious that E(Σ̂(Y)) = Σ. Suppose that

B′Y′W∗YB is unbiased for Σ.   (5.3)

Since

E(Y′W∗Y) = μ′W∗μ + tr(AW∗)Σ,   (5.4)

by (5.3),

B′μ′W∗μB + tr(AW∗)B′ΣB = Σ   (5.5)

for all μ ∈ S ≡ S1□S2 and Σ with Im Σ = M. Take μ = 0. Then by (5.5) and (5.3),

B∗′ΣB∗ = Σ

C.S. Wong, H. Cheng / Journal of Statistical Planning and Inference 97 (2001) 323–342

333

for all Σ with Im Σ = M, where

B∗ = γB,  γ = (tr(AW∗))^{1/2}.

(5.7)

 

By (5.5)–(5.7), B′μ′W∗μB = 0 for μ ∈ S, which, by the n.n.d. of W∗, is equivalent to

W∗μB = 0  for μ ∈ S.

(5.8)

Let {p_i}_{i=1}^r be an orthonormal basis of M and P = [p1, p2, …, pr]. Then by (5.6) with Σ = PDP′ and n.n.d. D,

B∗′PDP′B∗ = PDP′,   (5.9)

whence by P′P = Ir we obtain, upon taking D = Ir, B∗′PP′B∗ = PP′. Thus Q ≡ P′B∗P is orthogonal. So by (5.9), Q′DQ = D for all n.n.d. D ∈ Mr×r, i.e., Q′D = DQ, whence Q = γIr for some real number γ, i.e.,

p_i′B∗p_j = γδ_ij,   (5.10)

where the δ_ij are the Kronecker symbols. Since Im(X′X) = Im X′, by (5.6),

Im Σ = Im(B∗′Σ^{1/2})   (5.11)

for all Σ with Im Σ = M. Take Σ = P_M. Then by (5.11),

B∗′(M) = M.   (5.12)

Since p_j ∈ M, by (5.12), B∗′p_j = Σ_{i=1}^r α_i p_i for some unique α_i; by (5.10), α_j = γ and α_i = 0 for i ≠ j, and since Q = γIr is orthogonal, γ = ±1, so that (replacing B∗ by −B∗ if necessary) we may take γ = 1. Thus each B∗′p_j = p_j, i.e., B∗′x = x for all x ∈ M and

B∗′P_M = P_M.

(5.13)

By (5.7) and (5.8),

γ²B′Y′W∗YB = γ²B′(Y − μ)′W∗(Y − μ)B = B∗′(Y − μ)′W∗(Y − μ)B∗.

So

γ²B′Y′W∗YB = B∗′[(Y − μ)Σ⁰]′W∗(Y − μ)Σ⁰B∗   (with probability 1)
= B∗′P_M(Y − μ)′W∗(Y − μ)P_M B∗   (Σ⁰ = P_M)
= P_M(Y − μ)′W∗(Y − μ)P_M   (use (5.13))
= [(Y − μ)Σ⁰]′W∗(Y − μ)Σ⁰   (Σ⁰ = P_M)
= Y′W∗Y   (with probability 1),

or B′Y′W∗YB = Y′WY with probability 1, where W = W∗/tr(AW∗), i.e.,

tr(AW) = 1.

(5.14)

Thus, we may assume that B = Ip, and (5.8) is then reduced to

Wμ = 0

(5.15)

for all μ ∈ S. Since S2 ≠ {0}, (5.15) implies

WP_{S1} = 0.

(5.16)


The risk for using Y′WY to estimate Σ is given by f(W) ≡ E tr[H(Y′WY − Σ)K(Y′WY − Σ)]. By (5.15),

f(W) = tr(Σ_{(Y−μ)′W(Y−μ)}(H ⊗ K)).

(5.17)

By a moment formula for Y ∼ MEC(μ, A ⊗ Σ, φ),

Σ_{(Y−μ)′W(Y−μ)} = 4φ″(0)tr(AWAW)(i(1,2)(Σ ⊗ Σ) + Σ ⊗ Σ) + [4φ″(0) − 1](tr(AW))²vec Σ(vec Σ)′,

(5.18)

where i(1,2)((a_{j1,j2,j3,j4})_{j1,j2,j3,j4}) = (a_{j1,j2,j3,j4})_{j2,j1,j3,j4}; see Wong Chi Song and Liu (1995). By (5.17) and (5.18), we obtain, upon simplification, f(W) = 4φ″(0)tr(AWAW)(tr(HΣKΣ) + tr(HΣ)tr(KΣ)) + [4φ″(0) − 1](tr(AW))²tr(HΣKΣ). So by (5.14),

f(W) = 4φ″(0)tr(AWAW)(tr(HΣKΣ) + tr(HΣ)tr(KΣ)) + [4φ″(0) − 1]tr(HΣKΣ).

Since tr(HΣKΣ) + tr(HΣ)tr(KΣ) and (4φ″(0) − 1)tr(HΣKΣ) do not depend on W and tr(HΣKΣ) + tr(HΣ)tr(KΣ) ≥ 0, it suffices to show that W0 ∈ W and g(W0) = min g(W), where g(W) = tr((AW)²) and W is the set of all W ∈ Nn that satisfy (5.14) and (5.16); and this is exactly Lemma 4.1.

A computer program for evaluating Σ̂(y) can be obtained with the help of projection formula (3.11). The use of Σ̂(Y) is not limited to estimating Σ. By the theory of generalized Wishart distributions, Im Σ̂(Y) = Im Σ with probability 1. This means that although Σ is unknown, Im Σ is observable.

6. Sensible estimators

For the normal model Y with mean μ and Σ_Y given by (1.3) and (1.2) with regression constraints (1.4), we shall find a reasonable estimator μ̂ of μ (in the sense of Arnold) and compare its performance with the usual estimator μ̃ of μ. Since A and Σ may be singular, we consider

Y∗ = A⁰Y(Σ̂(Y))⁰,  y∗ = A⁰yΣ̂(y)⁰  for y ∈ Mn×p.

(6.1)

Note that

Σ̂(Y)⁰ = Σ⁰  with probability 1.

(6.2)

So

Y∗ ∼ N(μ∗, A ⊗ Σ),

(6.3)


where

μ∗ = A⁰μΣ⁰,  μ∗ ∈ S∗ ≡ S1∗□S2∗

(6.4)

and

S1∗ = A⁰(S1),  S2∗ = Σ⁰(S2).

(6.5)

Let

Y° = Y − Y∗.   (6.6)

Then Y° = (I − A⁰ ⊗ Σ̂(Y)⁰)(Y) and Σ_{Y°} = (I − A⁰ ⊗ Σ⁰)Σ_Y(I − A⁰ ⊗ Σ⁰) = (I − A⁰ ⊗ Σ⁰)(A ⊗ Σ − A ⊗ Σ) = 0. So with probability 1, Y° = E(Y°), i.e.,

Y° = μ − μ∗  with probability 1.   (6.7)

Let e(Y∗) be a reasonable (e.g. unbiased) estimator of μ∗. Then by (6.7), e(Y∗) + Y° is a reasonable estimator of μ. So we investigate Y through Y∗. Now (6.5) gives an extra property:

S1∗ ⊂ Im A,  S2∗ ⊂ Im Σ,   (6.8)

i.e.,

S∗ ⊂ Im Σ_{Y∗}  (= Im A □ Im Σ).   (6.9)

Thus if Σ were known, by Corollary 2.1(a) of Wong Chi Song (1993), the best linear unbiased estimator of vec μ∗ is given by vec μ̂∗(Y) = P_{S∗}(I_{np} − Σ_{Y∗}(P_{S∗⊥}Σ_{Y∗}P_{S∗⊥})⁺)vec Y∗. So

vec μ̂∗(Y) = P_{S∗}Σ⁰_{Y∗}(I_{np} − Σ_{Y∗}(P_{S∗⊥}Σ_{Y∗}P_{S∗⊥})⁺)Σ⁰_{Y∗}vec Y∗   (use (6.9))
= P_{S∗}[Σ⁰_{Y∗} − Σ_{Y∗}(P_{S∗⊥}Σ_{Y∗}P_{S∗⊥})⁺]Σ_{Y∗}Σ⁺_{Y∗}vec Y∗   (use (6.3) and (6.4))
= (P_{S∗}Σ⁺_{Y∗}P_{S∗})⁺Σ⁺_{Y∗}vec Y∗   (use (3.12))
= [(P_{S1∗} ⊗ P_{S2∗})(A ⊗ Σ)⁺(P_{S1∗} ⊗ P_{S2∗})]⁺(A ⊗ Σ)⁺vec Y∗,

whence by (H ⊗ K)⁺ = H⁺ ⊗ K⁺ and (C ⊗ D)(F ⊗ G) = (CF) ⊗ (DG),

vec μ̂∗(Y) = [((P_{S1∗}A⁺P_{S1∗})⁺A⁺) ⊗ ((P_{S2∗}Σ⁺P_{S2∗})⁺Σ⁺)]vec Y∗;

thus by (2.2), the best linear unbiased estimator, μ̂∗(Y), of μ∗ is given by

μ̂∗(Y) = (P_{S1∗}A⁺P_{S1∗})⁺A⁺Y∗Σ⁺(P_{S2∗}Σ⁺P_{S2∗})⁺.

(6.10)



Here by 'a linear estimator L(Y∗)' it is meant that L is linear and is not restricted to L1 ⊗ L2, and by 'a best linear estimator L∗(Y∗)' it is meant that Σ_{L(Y∗)} − Σ_{L∗(Y∗)} is n.n.d. for all linear L. Now suppose again that Σ is unknown. We follow Arnold (1981) and find the reasonable estimator for μ∗, i.e., replace Σ in (6.10) with its estimator Σ̂(Y∗) given by Theorem 5.1:

μ̂∗(Y) = (P_{S1∗}A⁺P_{S1∗})⁺A⁺Y∗[Σ̂(Y∗)]⁺(P_{S2∗}[Σ̂(Y∗)]⁺P_{S2∗})⁺,

(6.11)


where

Σ̂(y∗) = y∗′(P_{S1∗⊥}AP_{S1∗⊥})⁺y∗/r1∗   (6.12)

and

r1∗ = r(P_{S1∗⊥}AP_{S1∗⊥}) > 0.   (6.13)

For the multivariate linear case, X2 = Ip or, equivalently, S2 = Rp, whence S2∗ = Im Σ and both (6.10) and (6.11) reduce to μ̃∗(Y):

μ̃∗(Y) = (P_{S1∗}A⁺P_{S1∗})⁺A⁺Y∗.

(6.14)

The estimators of μ that correspond to (6.11) and (6.14) are:

μ̂(Y) = Y − Y∗ + (P_{S1∗}A⁺P_{S1∗})⁺A⁺Y∗[Σ̂(Y∗)]⁺(P_{S2∗}[Σ̂(Y∗)]⁺P_{S2∗})⁺

(6.15)

and

μ̃(Y) = Y − Y∗ + (P_{S1∗}A⁺P_{S1∗})⁺A⁺Y∗,

(6.16)

where Y∗ and Σ̂(Y∗) are given by (6.1) and Theorem 5.1. By (3.0) and (6.5),

P_{S1∗} = A⁰P_{S1}(P_{S1}A⁰P_{S1})⁺P_{S1}A⁰,  P_{S2∗} = Σ⁰P_{S2}(P_{S2}Σ⁰P_{S2})⁺P_{S2}Σ⁰.

(6.17)

A computer program for evaluating μ̂(y) and μ̃(y) can be written with the help of (6.17) and the projection formula (3.11). One wonders, when S2∗ ≠ Im Σ (or S2 ≠ Rp), whether μ̂(Y) is a better estimator than μ̃(Y). For this, we shall first show that both μ̂(Y) and μ̃(Y) are unbiased for μ. Let

U∗ = (P_{S1∗}A⁺P_{S1∗})⁺A⁺Y∗,  V∗ = (P_{S1∗⊥}AP_{S1∗⊥})⁺Y∗.   (6.18)
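As the text notes, μ̂(y) and μ̃(y) are programmable via (6.17) and (3.11). A small sketch of the first projection in (6.17), read as an instance of (3.0) with X = A⁰P_{S1}, using a hypothetical singular A and design X1:

```python
import numpy as np

rng = np.random.default_rng(4)
pinv = np.linalg.pinv

n = 8
F = rng.standard_normal((n, 5))
A = F @ F.T                      # hypothetical singular n.n.d. A (rank 5)
A0 = pinv(A) @ A                 # A^0: orthogonal projection onto Im A
X1 = rng.standard_normal((n, 2))
PS1 = X1 @ pinv(X1.T @ X1) @ X1.T

# P_{S1*} from (6.17), an instance of (3.0) with X = A^0 P_{S1}
PS1star = A0 @ PS1 @ pinv(PS1 @ A0 @ PS1) @ PS1 @ A0

# direct projection onto S1* = A^0(S1)
Z = A0 @ X1
P_direct = Z @ pinv(Z.T @ Z) @ Z.T
assert np.allclose(PS1star, P_direct)
```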

Then U∗ and V∗ are normal. By (6.8) and (6.18), Σ_{U∗,V∗} = 0. So U∗ and V∗ are independent. Also

(P_{S1∗}A⁺P_{S1∗})⁺A⁺μ∗ = (P_{S1∗}A⁺P_{S1∗})⁺P_{S1∗}A⁺P_{S1∗}μ∗   (use (6.4), Lemma 3.4(b))
= [A − A(P_{S1∗⊥}AP_{S1∗⊥})⁺A]A⁺P_{S1∗}μ∗   (use (6.8) and (3.12))
= [A⁰ − A(P_{S1∗⊥}AP_{S1∗⊥})⁺]P_{S1∗}μ∗   (P_{S1∗}A⁰ = P_{S1∗} = A⁰P_{S1∗})
= P_{S1∗}μ∗   (use Lemma 3.4(b) and P_{S1∗}P_{S1∗⊥} = 0),

i.e.,

(P_{S1∗}A⁺P_{S1∗})⁺A⁺μ∗ = P_{S1∗}μ∗ = μ∗.

(6.19)

Similarly, since Im Σ̂(Y) = Im Σ with probability 1, by (6.8),

μ∗[Σ̂(Y∗)]⁺(P_{S2∗}[Σ̂(Y∗)]⁺P_{S2∗})⁺ = μ∗  with probability 1.   (6.20)

By (6.19) and (6.20), we obtain from (6.11) that with probability 1,

μ̂∗(Y) − μ∗ = (P_{S1∗}A⁺P_{S1∗})⁺A⁺(Y∗ − μ∗)[Σ̂(Y∗)]⁺(P_{S2∗}[Σ̂(Y∗)]⁺P_{S2∗})⁺ = g(U∗)h(V∗),

(6.21)


where g(U∗) = U∗ − μ∗ and

h(V∗) = [V∗′(P_{S1∗⊥}AP_{S1∗⊥})V∗/r1∗]⁺(P_{S2∗}[V∗′(P_{S1∗⊥}AP_{S1∗⊥})V∗/r1∗]⁺P_{S2∗})⁺.

Since U∗ and V∗ are independent and U∗ is unbiased for μ∗, E(g(U∗)h(V∗)) = E(g(U∗))E(h(V∗)) = (E(U∗) − μ∗)E(h(V∗)) = 0. So μ̂∗(Y) is unbiased for μ∗, whence

μ̂(Y) is unbiased for μ.

(6.22)

It is easier to show that

μ̃(Y) is unbiased for μ,

(6.23)

and it can be proved that

Σ_{μ̂(Y)} = Σ_{μ̂∗(Y)} = c(P_{S1∗}A⁺P_{S1∗})⁺ ⊗ (P_{S2∗}Σ⁺P_{S2∗})⁺   (6.24)

and

Σ_{μ̃(Y)} = Σ_{μ̃∗(Y)} = (P_{S1∗}A⁺P_{S1∗})⁺ ⊗ Σ,   (6.25)

where

c = [n − dim(S1∗) − 1]/{n − dim(S1∗) − [r(Σ) − dim(S2∗)] − 1}.   (6.26)

If S2∗ = Im Σ, then c = 1, whence by (6.11) and (6.14), μ̃(Y) and μ̂(Y) are the same. Now suppose that S2∗ is a proper subset of Im Σ and consider the loss function L:

L((μ, Σ), a) = tr[K(a − μ)′H(a − μ)],

(6.27)

where H and K are n.n.d. [⟨·, ·⟩, defined by ⟨x, y⟩ = tr(Kx′Hy), x, y ∈ Mn×p, is a pseudo-inner product, and it is an inner product if H, K are p.d.] The risk for using μ̃(Y) to estimate μ is given by R_L((μ, Σ), μ̃(Y)) ≡ E(tr[K(μ̃(Y) − μ)′H(μ̃(Y) − μ)]) = tr(Σ_{μ̃(Y)}(H ⊗ K)) = tr[((P_{S1∗}A⁺P_{S1∗})⁺ ⊗ Σ)(H ⊗ K)] = tr[((P_{S1∗}A⁺P_{S1∗})⁺H) ⊗ (ΣK)], i.e.,

R_L((μ, Σ), μ̃(Y)) = tr((P_{S1∗}A⁺P_{S1∗})⁺H)tr(ΣK).

(6.28)

The risk for using μ̂(Y) to estimate μ is given by R_L((μ, Σ), μ̂(Y)) ≡ E(tr[K(μ̂(Y) − μ)′H(μ̂(Y) − μ)]) = tr(Σ_{μ̂(Y)}(H ⊗ K)) = tr[(c(P_{S1∗}A⁺P_{S1∗})⁺ ⊗ (P_{S2∗}Σ⁺P_{S2∗})⁺)(H ⊗ K)] = c tr[((P_{S1∗}A⁺P_{S1∗})⁺H) ⊗ ((P_{S2∗}Σ⁺P_{S2∗})⁺K)], i.e.,

R_L((μ, Σ), μ̂(Y)) = c tr[(P_{S1∗}A⁺P_{S1∗})⁺H]tr[(P_{S2∗}Σ⁺P_{S2∗})⁺K];

(6.29)

thus by (6.28) and (6.29),

R_L((μ, Σ), μ̂(Y)) ≤ R_L((μ, Σ), μ̃(Y))

(6.30)


if α ≤ tr(ΣK)/tr((P_{S₂*}Σ⁺P_{S₂*})⁺K), or equivalently, n ≥ n₀, where

  n₀ ≥ [tr((P_{S₂*}Σ⁺P_{S₂*})⁺K)/tr(ΣK)](r(Σ) − r(S₂*)) + dim(S₁*) + r(Σ) − dim(S₂*) + 1.

If we restrict K = Σ⁺, or K = Σ⁺ and H = I as in Arnold (1981), then n₀ ≥ r(Σ) + r(S₂*) + 1. If H, K are p.d., then ≤ in (6.30) becomes < when n > n₀, i.e., μ̂(Y) is uniformly better than μ̃(Y).

7. Maximum likelihood estimator of (μ, Σ)

Since Im Σ is observable, we can reduce Y* in Section 6 to a non-singular linear model Z with

  Z = p′Y*q,  z = p′y*q,   (7.1)

where p (q) is the inclusion map of Im A (Im Σ) into Rⁿ (Rᵖ). As operators,

  p′p = I,  q′q = I,  pp′ = A₀,  qq′ = Σ₀;   (7.2)

as matrices, we may regard p ∈ M_{n×s}, q ∈ M_{p×r} with s = r(A), r = r(Σ). Note that

  Z ∼ N_{s×r}(ξ, A* ⊗ Σ*),  s = r(A),  r = r(Σ),   (7.3)

where

  A* = p′Ap,  ξ = p′μ*q,  Σ* = q′Σq   (7.4)

and both A* and Σ* are non-singular. By Lemma 3.4(a) and (7.2),

  A*⁻¹ = p′A⁺p,  Σ*⁻¹ = q′Σ⁺q,   (7.5)

whence

  A⁺ = pA*⁻¹p′,  Σ⁺ = qΣ*⁻¹q′,   (7.6)

and

  A*^{−1/2} = p′A^{−1/2}p,  Σ*^{−1/2} = q′Σ^{−1/2}q.   (7.7)

Since ξ ∈ (p′S₁*)(q′S₂*),

  ξ = p′P_{S₁*}ΦP_{S₂*}q  for some Φ.   (7.8)
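The reduction (7.1)–(7.5) can be realized concretely by taking p (and likewise q) to be an orthonormal eigenbasis of Im A. A hedged NumPy sketch of this convention (the helper `inclusion` and all names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def inclusion(M, rank):
    """Orthonormal eigenbasis of Im(M); plays the role of p (or q) in (7.2)."""
    w, V = np.linalg.eigh(M)
    return V[:, np.argsort(w)[::-1][:rank]]

# A singular n.n.d. A on R^5 with s = r(A) = 3.
G = rng.standard_normal((5, 3))
A = G @ G.T
s = 3
p = inclusion(A, s)                        # 5 x 3

A_plus = np.linalg.pinv(A)
A_star = p.T @ A @ p                       # nonsingular s x s, as in (7.4)

assert np.allclose(p.T @ p, np.eye(s))          # p'p = I
assert np.allclose(p @ p.T, A @ A_plus)         # pp' = A0, the projector onto Im A
assert np.allclose(np.linalg.inv(A_star), p.T @ A_plus @ p)   # (7.5)
```

Any orthonormal basis of Im A works equally well; the eigenbasis is convenient because A* then comes out diagonal.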

We shall obtain the maximum likelihood estimator of (ξ, Σ*) and use it to find an estimator of (μ, Σ). The density function of Z is f:

  f(z) = (2π)^{−rs/2}|A* ⊗ Σ*|^{−1/2} exp{−½⟨z − ξ, (A* ⊗ Σ*)⁻¹(z − ξ)⟩}
       = (2π)^{−rs/2}|A*|^{−r/2}|Σ*|^{−s/2} exp{−½ tr[(z − ξ)′A*⁻¹(z − ξ)Σ*⁻¹]}

with z ∈ M_{s×r}. So the log-likelihood function of Z = z is M:

  M(ξ, Σ*) = −(s/2) ln|Σ*| − ½ tr[(z − ξ)′A*⁻¹(z − ξ)Σ*⁻¹] + c,


where c is a constant. Now fix Σ* and let h(Φ) ≡ M(ξ, Σ*), with Φ as in (7.8). Then by (7.8),

  dh(Φ)(dΦ) = −tr[(z − ξ)′A*⁻¹(−p′P_{S₁*}dΦP_{S₂*}q)Σ*⁻¹].   (7.9)

Setting dh(Φ)(dΦ) = 0, we obtain

  P_{S₂*}qΣ*⁻¹(z − ξ)′A*⁻¹p′P_{S₁*} = 0.   (7.10)

Transposing (7.10), we obtain P_{S₁*}pA*⁻¹(z − ξ)Σ*⁻¹q′P_{S₂*} = 0, whence by (7.8),

  P_{S₁*}pA*⁻¹p′P_{S₁*}ΦP_{S₂*}qΣ*⁻¹q′P_{S₂*} = P_{S₁*}pA*⁻¹zΣ*⁻¹q′P_{S₂*},

i.e.,

  Φ = (P_{S₁*}pA*⁻¹p′P_{S₁*})⁺(P_{S₁*}pA*⁻¹zΣ*⁻¹q′P_{S₂*})(P_{S₂*}qΣ*⁻¹q′P_{S₂*})⁺.

So by Lemma 3.4(b) and (7.8) again,

  ξ = p′(P_{S₁*}pA*⁻¹p′P_{S₁*})⁺pA*⁻¹zΣ*⁻¹q′(P_{S₂*}qΣ*⁻¹q′P_{S₂*})⁺q.   (7.11)
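In the reduced coordinates (where the inclusion maps are identities), (7.11) is a two-sided weighted projection of z. A NumPy sketch (our naming throughout) that checks the first-order condition (7.10) for this closed form:

```python
import numpy as np

rng = np.random.default_rng(2)
s, r = 5, 4

def rand_pd(k):
    G = rng.standard_normal((k, k))
    return G @ G.T + k * np.eye(k)          # comfortably positive definite

def rand_proj(k, d):
    Q, _ = np.linalg.qr(rng.standard_normal((k, d)))
    return Q @ Q.T                          # orthogonal projector of rank d

A_star, Sig_star = rand_pd(s), rand_pd(r)
P1, P2 = rand_proj(s, 2), rand_proj(r, 2)   # stand-ins for P_{S1*}, P_{S2*}
z = rng.standard_normal((s, r))

Ai, Si = np.linalg.inv(A_star), np.linalg.inv(Sig_star)
pinv = np.linalg.pinv

# Two-sided weighted projection of z, the analogue of (7.11):
xi = pinv(P1 @ Ai @ P1) @ Ai @ z @ Si @ pinv(P2 @ Si @ P2)

# First-order condition (7.10): P2 Sig*^{-1} (z - xi)' A*^{-1} P1 = 0.
assert np.allclose(P2 @ Si @ (z - xi).T @ Ai @ P1, 0)
```

The pseudo-inverses are what make the formula work even when the projectors have low rank, which is the point of the paper's rank-free setting.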



Now, vary Σ* and let

  D = p′(P_{S₁*}pA*⁻¹p′P_{S₁*})⁺p,  g(Σ*) = M(ξ, Σ*),   (7.12)

where ξ is given by (7.11). Then

  2g(Σ*) = −s ln|Σ*| − tr{[z − DA*⁻¹zΣ*⁻¹q′(P_{S₂*}qΣ*⁻¹q′P_{S₂*})⁺q]′A*⁻¹
      × [z − DA*⁻¹zΣ*⁻¹q′(P_{S₂*}qΣ*⁻¹q′P_{S₂*})⁺q]Σ*⁻¹} + c.   (7.13)

Let

  z* = A*^{−1/2}z,  E = q′P_{S₂*}q  (E² = E).   (7.14)

Then by Lemma 3.4, we obtain from (7.13),

  2g(Σ*) = −s ln|Σ*| − tr{[z − DA*⁻¹zΣ*⁻¹(EΣ*⁻¹E)⁺]′A*⁻¹[z − DA*⁻¹zΣ*⁻¹(EΣ*⁻¹E)⁺]Σ*⁻¹} + c
    = −s ln|Σ*| − tr(z*′z*Σ*⁻¹ − 2z*′A*^{−1/2}DA*^{−1/2}z*Σ*⁻¹(EΣ*⁻¹E)⁺Σ*⁻¹
      + (EΣ*⁻¹E)⁺Σ*⁻¹z*′A*^{−1/2}DA*^{−1/2}z*Σ*⁻¹(EΣ*⁻¹E)⁺Σ*⁻¹) + c.   (7.15)

By Lemma 3.4(b), we have

  DA*⁻¹D = D,  (EΣ*⁻¹E)⁺Σ*⁻¹(EΣ*⁻¹E)⁺ = (EΣ*⁻¹E)⁺.   (7.16)

So by tr(AB) = tr(BA), we obtain from (7.15),

  2g(Σ*) = −s ln|Σ*| − tr{z*′z*Σ*⁻¹ − z*′A*^{−1/2}DA*^{−1/2}z*Σ*⁻¹(EΣ*⁻¹E)⁺Σ*⁻¹} + c.   (7.17)

Let

  W = A*^{−1/2}DA*^{−1/2},  S* = z*′(I − W)z*/s,  T* = z*′Wz*/s.   (7.18)

Then by (7.16),

  W² = W,  S* + T* = z*′z*/s.   (7.19)


By (7.19), we obtain from (7.17),

  2g(Σ*) = s(−ln|Σ*| − tr(Σ*⁻¹S*) − tr[T*(Σ*⁻¹ − Σ*⁻¹(EΣ*⁻¹E)⁺Σ*⁻¹)]) + c.

So by the de Morgan law in Lemma 3.5,

  2g(Σ*) = s(−ln|Σ*| − tr(Σ*⁻¹S*) − tr{T*((I − E)Σ*(I − E))⁺}) + c.   (7.20)

By Lemma 4.2 with P = I − E, we conclude from (7.20) that the maximum likelihood estimator of Σ* is Σ̄*(Z) with

  Σ̄*(z) = S* + S*{((I − E)S*(I − E))⁺T*((I − E)S*(I − E))⁺}S*.   (7.21)
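The "de Morgan" step above rests on the identity Σ*⁻¹ − Σ*⁻¹(EΣ*⁻¹E)⁺Σ*⁻¹ = ((I − E)Σ*(I − E))⁺ for an orthogonal projector E, a complementation fact closely tied to the partitioned-inverse formula. A quick numerical check (names ours):

```python
import numpy as np

rng = np.random.default_rng(3)
r_dim, d = 5, 2

G = rng.standard_normal((r_dim, r_dim))
Sig = G @ G.T + r_dim * np.eye(r_dim)       # a p.d. stand-in for Sigma*
Q, _ = np.linalg.qr(rng.standard_normal((r_dim, d)))
E = Q @ Q.T                                  # orthogonal projector, E^2 = E
I = np.eye(r_dim)

Si = np.linalg.inv(Sig)
pinv = np.linalg.pinv
lhs = Si - Si @ pinv(E @ Si @ E) @ Si
rhs = pinv((I - E) @ Sig @ (I - E))
assert np.allclose(lhs, rhs)                 # the complementation identity
```

In a basis where E = diag(I_d, 0), both sides reduce to the inverse of the Schur complement on the complementary block, which is one way to prove the identity.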



Now replacing Σ* by Σ̄*(Z) in (7.11), we obtain the maximum likelihood estimator, ξ̄(Z), of ξ:

  ξ̄(z) ≡ p′(P_{S₁*}pA*⁻¹p′P_{S₁*})⁺pA*⁻¹z[Σ̄*(z)]⁻¹q′(P_{S₂*}q[Σ̄*(z)]⁻¹q′P_{S₂*})⁺q.   (7.22)

[Note that we may fix ξ first and maximize the likelihood function over the positive definite matrices. Then by Lemma 4.2 with T = 0, we obtain another expression for Σ̄*(Z):

  Σ̄*(z) = (z − ξ̄(z))′A*⁻¹(z − ξ̄(z))/s.   (7.23)

So

  L(ξ̄(z), Σ̄*(z)) = exp(−rs/2)/[(2π)^{rs/2}|A*|^{r/2}|Σ̄*(z)|^{s/2}],   (7.24)

which tells us why the maximum likelihood ratio test statistic for testing a linear hypothesis on the mean is a ratio of two determinants.]

We shall now translate the estimate (ξ̄(z), Σ̄*(z)) of (ξ, Σ*) into an estimate, (μ̄(y), Σ̄(y)), of (μ, Σ). From (7.2) and (7.4), we obtain that Σ = qΣ*q′ and A = pA*p′. So we choose

  Σ̄(y) = qΣ̄*(z)q′.   (7.25)

By (7.21),

  Σ̄(Y) = qS*q′ + qS*{((I − E)S*(I − E))⁺T*((I − E)S*(I − E))⁺}S*q′.   (7.26)

Let

  S = qS*q′,  T = qT*q′.   (7.27)

Then by Lemma 3.4(a),

  Σ̄(Y) = S + S{[q(I − E)q′Sq(I − E)q′]⁺T[q(I − E)q′Sq(I − E)q′]⁺}S.   (7.28)

By (7.14) and (7.18), we obtain, upon simplification,

  sS = sqS*q′ = qz′A*^{−1/2}[I − A*^{−1/2}p′(P_{S₁*}pA*⁻¹p′P_{S₁*})⁺pA*^{−1/2}]A*^{−1/2}zq′
     = Σ₀y′[A⁺ − A⁺(P_{S₁*}A⁺P_{S₁*})⁺A⁺]yΣ₀,


  sT = sqT*q′ = qz′A*⁻¹p′(P_{S₁*}pA*⁻¹p′P_{S₁*})⁺pA*⁻¹zq′ = Σ₀y′A⁺(P_{S₁*}A⁺P_{S₁*})⁺A⁺yΣ₀   (7.29)

and

  q(I − E)q′ = q(I − q′P_{S₂*}q)q′ = Σ₀[I − P_{S₂*}]Σ₀ = Σ₀P_{S₂*^⊥}Σ₀.   (7.30)

By (7.28) and (7.30),

  Ψ ≡ q(I − E)q′Sq(I − E)q′ = Σ₀P_{S₂*^⊥}Σ₀SΣ₀P_{S₂*^⊥}Σ₀ = Σ₀P_{S₂*^⊥}SP_{S₂*^⊥}Σ₀.

By Lemma 3.5,

  Ψ⁺ = Σ₀(P_{S₂*^⊥}SP_{S₂*^⊥})⁺Σ₀.   (7.31)

So

  SΨ⁺TΨ⁺S = S(P_{S₂*^⊥}SP_{S₂*^⊥})⁺T(P_{S₂*^⊥}SP_{S₂*^⊥})⁺S.   (7.32)

Replacing Σ₀ in S and T by (Σ̂(Y))₀ as in Theorem 5.1, we obtain that

  Σ̄(Y) = S + S(P_{S₂*^⊥}SP_{S₂*^⊥})⁺T(P_{S₂*^⊥}SP_{S₂*^⊥})⁺S.   (7.33)
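As a sanity check on (7.33): the correction term S(P⊥SP⊥)⁺T(P⊥SP⊥)⁺S is symmetric and n.n.d. whenever S and T are, so the estimate stays a covariance-type matrix. A NumPy sketch with random placeholder matrices (not the model quantities themselves):

```python
import numpy as np

rng = np.random.default_rng(4)
p_dim, d = 4, 2

# Random n.n.d. placeholders for S and T.
G = rng.standard_normal((p_dim, p_dim)); S = G @ G.T
H = rng.standard_normal((p_dim, p_dim)); T = H @ H.T
Q, _ = np.linalg.qr(rng.standard_normal((p_dim, d)))
P_perp = np.eye(p_dim) - Q @ Q.T             # projector playing the role of P_{S2*^perp}

M = np.linalg.pinv(P_perp @ S @ P_perp)
Sigma_bar = S + S @ M @ T @ M @ S            # the form of (7.33)

assert np.allclose(Sigma_bar, Sigma_bar.T)                   # symmetric
assert np.linalg.eigvalsh(Sigma_bar).min() >= -1e-6          # n.n.d. up to roundoff
```

Symmetry and non-negative definiteness follow because the correction can be written as (MS)′T(MS) with M symmetric, a quadratic form in the n.n.d. matrix T.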

Note that by (7.28) and (7.29), s(S + T) = (Σ̂(y))₀y′A⁺y(Σ̂(y))₀. So S and T give rise to a decomposition of the 'variance' (Σ̂(y))₀y′A⁺y(Σ̂(y))₀, which is y′y if A = I and Σ is non-singular. Since

  μ = y − A₀yΣ₀ + μ*  with probability 1,   (7.34)

and by (7.1), (7.2), (7.4), (6.1) and (6.4),

  μ* = pξq′,  z = p′A₀yΣ₀q,

we have

  μ̄(y) = y − A₀y(Σ̂(y))₀ + pξ̄(z)q′.

By (7.22),

  μ̄(y) = y − A₀yΣ₀ + pp′(P_{S₁*}pA*⁻¹p′P_{S₁*})⁺pA*⁻¹p′A₀yΣ₀q[Σ̄*(z)]⁻¹q′(P_{S₂*}q[Σ̄*(z)]⁻¹q′P_{S₂*})⁺qq′,

i.e.,

  μ̄(y) = y − A₀y(Σ̂(y))₀ + A₀(P_{S₁*}A⁺P_{S₁*})⁺A⁺y[Σ̄(y)]⁺(P_{S₂*}[Σ̄(y)]⁺P_{S₂*})⁺.   (7.35)

For writing a computer program for (7.33) and (7.35), we need (6.17) and the projection formula (3.11). Discussion of the estimators of μ and Σ in (7.33) and (7.35) in terms of earlier results and related references was given in Section 1.

Acknowledgements

The authors wish to thank the referees for their helpful comments, which led to this improved version of the paper.


References

Anderson, T.W., 1984. An Introduction to Multivariate Analysis. Wiley, New York.
Anderson, T.W., Olkin, I., 1985. Maximum-likelihood estimation of parameters of a multivariate normal distribution. Linear Algebra Appl. 70, 147–171.
Arnold, S.F., 1981. The Theory of Linear Models and Multivariate Analysis. Wiley, New York.
Calvert, B., Seber, G.A.F., 1978. Minimization of functions of a positive semidefinite matrix A subject to AX = 0. J. Multivariate Anal. 8, 173–180.
Chan, N.N., Keung, T.K., 1997. Linear hypotheses made explicit. Commun. Statist. Theory Meth. 16, 387–400.
Dykstra, R.L., 1970. Establishing the positive definiteness of the sample covariance matrix. Ann. Math. Statist. 41, 2153–2154.
Eaton, M.L., 1983. Multivariate Statistics. Wiley, New York.
Fujikoshi, Y., 1974. On the asymptotic non-null distributions of the LR criterion in a general MANOVA. Canad. J. Statist. 2, 1–12.
Gupta, A.K., Varga, T., 1993. Elliptically Contoured Models in Statistics. Kluwer Academic Publishers, Boston.
Kruskal, W., 1975. The geometry of generalized inverse. J. Roy. Statist. Soc. Ser. B 37, 272–283.
Muirhead, R.J., 1982. Aspects of Multivariate Statistical Theory. Wiley, New York.
Potthoff, R.F., Roy, S.N., 1964. A generalized multivariate analysis of variance model useful especially for growth curve problems. Biometrika 51, 313–326.
Rao, C.R., 1973. Linear Statistical Inference and Its Applications. Wiley, New York.
Rao, C.R., Mitra, S.K., 1971. Generalized Inverse of Matrices and Its Applications. Wiley, New York.
von Rosen, D., 1991a. The growth curve model: a review. Commun. Statist. Theory Meth. 20 (9), 2791–2822.
von Rosen, D., 1991b. Moments of estimators. Statistics 22, 111–131.
Wong Chi Song, 1985. On the use of differentials in statistics. Linear Algebra Appl. 70, 282–299.
Wong Chi Song, 1986. Modern Analysis and Algebra. Xian University Press, Xian.
Wong Chi Song, 1989. Linear models in a general parametric form. Commun. Statist. Theory Meth. 18 (8), 3095–3115.
Wong Chi Song, 1993. Linear models in a general parametric form. Sankhya, Ser. A 55, 130–149.
Wong Chi Song, Liu Dongsheng, 1995. Moments of generalized Wishart distributions. J. Multivariate Anal. 52, 280–294.
Wong Chi Song, Masaro, Joe, Deng, Weicai, 1995. Estimating covariance in a growth curve model. Linear Algebra Appl. 214, 103–118.
Wong Chi Song, Masaro, Joe, Wang, T., 1991. Multivariate versions of Cochran theorems. J. Multivariate Anal. 39, 154–174.