[Reference] Hessian of Marginal Likelihood

November 23, 2013

Recall the expression for the i-th element of the gradient of ML w.r.t. indices.

gi(x)=12z(x)K(x)z(x)Zi(x)=zi(x)δi(x)z(x)Zi(x)=fi(x)Zi(x)

Where δi=k(xi,xj), and Zi is the derivitive of the normalization constant.

The goal today is to derive the second derivitive, H. Like the first derivitive, it will have two terms,

H=H1H2

Ultimately, we'll split the second term H2 into two sub-terms:

H=H1H2,AH2,B

Prerequisite: δixj

Recall that the elements of δi are the partial derivatives of the kernel function w.r.t. its first input. The second derivatives δi will be given by the second partial derivatives of the kernel function.

2k(xi,xj)x2i=xjmin(xi,xj)2k(xi,xj)xixj=min(xi,xj)

The derivative of the j-th element of δi is derived below.

(δi)jxk=2k(xi,xj)xixk=(xjmin(xi,xj))1k=i+min(xi,xj)1k=j

Note that this handles the special cases where k = i = j and ki,kj.

We can generalizing to the full vector δi

δixk=Ai1k=i+BikAi=(x1min(xi,x1),,xjmin(xi,xj),)Bik=(0,,min(xi,xk)k-th element,,0)

The A term handles the on-diagonal hessian terms, whereas B is included in all terms.

First term, H1=fi

We use the product rule to take the derivitive of fi=ziδiz.

fi(x)xj=(xjzi(x))δi(x)z(x)+zi(x)(xjδi(x))z(x)+zi(x)δi(x)(xjz(x))=zi(x)δi(x)z(x)+zi(x)(Ai1j=i+Bij)z(x)+zi(x)δi(x)z(x)=zi(x)δi(x)z(x)+1j=izi(x)Aiz(x)+zi(x)min(xi,xj)zj(x)+zi(x)δi(x)z(x)

where

z=z(j)=z(x)xj=xjSU1(x)Sy=SVSy=SU1UU1Sy=(SU1S)K(SU1S)y

Our goal is the generalize this to a single expression for the entire hessian matrix. Note that when ij, the third term disappears, so that term will become a diagonal matrix in the hessian expression. Let Z be the jacobian of z. We can express the hessian asWe can express the hessian as

H1=Z[Δz(11...)]+M(zz)+diag{z(x)Δz(x)}+[z(11...)][ΔZ]

where M is a matrix whose elements mij=min(xi,xj),

Δ is a matrix whose rows are composed of the δi vectors,

Δ is the matrix whose i-th row is the vector Ai,

is the Hadamard matrix product,

and diag() is an operator that converts a vector into a diagonal matrix.

Second term, Zi(x)=H2,A+H2,B

Below are the expressions for the zeroth, first, and second derivitives of Z;

Z=0.5log(det(SKS+I))Zxi=0.5Tr[U1U]2Zxixj=0.5Tr[U1xjU+U1Uxj]=0.5Tr[V(j)U(i)+U1U(ij)]=0.5{Tr[V(j)U(i)]+Tr[U1U(ij)]}=0.5Tr[A]+0.5Tr[B]

Where

A=V(j)U(i)B=U1U(ij)

These two terms correspond to the elements to the two hessian terms, H2,A and H2,B.

We'll begin by finding Tr[A] and H2,A.

Observe that we can rewrite U(i) as

U(i)=SKSU(i)=S(B+B)S

where B is the sparse matrix,

B=(0δi0).

We can exploit this sparsity to further expand U to

U(i)=(Sδi)Si+Si(Sδi)=C(i)+C(i)
where C(i)=SδiSi.

We can use this identity to expand Tr[A].

Tr[A]=Tr[V(j)U(i)]=Tr[U1U(j)U1U(i)]=Tr[U1(C(j)+C(j))U1(C(i)+C(i))]=Tr[U1(C(j)+C(j))U1(C(i)+C(i))]=Tr[(U1C(j)+U1C(j))(U1C(i)+U1C(i))]=Tr[U1C(j)U1C(i)+U1C(j)U1C(i)+U1C(j)U1C(i)+U1C(j)U1C(i)]=2Tr[U1C(j)U1C(i)]2Tr[U1C(j)U1C(i)]=2Tr[U1(SδjSj)U1(SδiSi)]2Tr[U1(SδjSj)U1(SiδiS)]=2Tr[SiU1SδjSjU1Sδi]2Tr[δiSU1SδjSjU1Si]
The last identity exploits the fact that Traces are invariant under cyclic permutations. Note that both expressions inside the trace operator are scalar products, which makes the trace operator redundant. Tr[A]=2SiU1SδjSjU1Sδi2δiSU1SδjSjU1Si=2(SiU1Sδj)(SjU1Sδi)2(δiSU1Sδj)(SjU1Si)
Here, we've regrouped the dot-products in each term to be a product of two dot-products. We can generalize this for the full hessian as follows: H2,A=(SU1SΔ)(SU1SΔ)(ΔSU1SΔ)(SU1S)

Next is the second term, Tr[B].

First lets derive U.

U(ij)=U(i)xj=xj{(Sδi)Si+Si(Sδi)}=(Sδixj)Si+Si(Sδixj)=S(Ai1i=j+Bij)Si+Si(Ai1i=j+Bij)S=1i=jSAiSi+SBijSi+1i=jSiAiS+SiBijS=1i=jSAiSi+1i=jSiAiS+Sjmin(xi,xj)Si+Simin(xi,xj)Sj=1i=j(SAiSi+SiAiS)+min(xi,xj)(SiSj+SjSi)

Now we can derive Tr[B].

Tr[B]=Tr[U1U(ij)]=Tr[U1{1i=j(SAiSi+SiAiS)+min(xi,xj)(SiSj+SjSi)}]=1i=jTr[U1SAiSi]+1i=jTr[U1SiAiS]+min(xi,xj)Tr[U1SiSj]+min(xi,xj)Tr[U1SjSi]=1i=jTr[SiU1SAi]+1i=jTr[AiSU1Si]+min(xi,xj)Tr[SjU1Si]+min(xi,xj)Tr[SiU1Sj]=21i=jSiU1SAi+2min(xi,xj)SjU1Si

This is for a single term of the Hessian. We can rewrite it to compute the entire Hessian using matrix arithmetic:

H2,B=MSU1S+(S(U1SA))I

Here, M is defined as, mij=min(xi,xj), and A is the matrix whose i-th column is Ai. Note that only the diagonal elements of the second term are preserved; in implementation, this can be implemented as

diag(sum(S .* (inv(U) * S * A)))

Full Hessian

The full Hessian is the sum of the three parts above

H=H1H2,AH2,B=Z[Δz(11...)]+M(zz)+diag{z(x)Δz(x)}+[z(11...)][ΔZ]+(SU1SΔ)(SU1SΔ)+(ΔSU1SΔ)(SU1S)MSU1S(S(U1SA))I
Posted by Kyle Simek