[Reference] Hessian of Marginal Likelihood

November 23, 2013

Recall the expression for the i-th element of the gradient of the marginal likelihood (ML) w.r.t. the indices \(x\).

\[ \begin{align} g'_i(x) &= \frac{1}{2} z(x)^\top K'(x) z(x) - Z'_i(x) \\ &= z_i(x) \delta_i^\top(x) z(x) - Z'_i(x) \\ &= f_i(x) - Z'_i(x) \end{align} \]

where \(\delta_i\) is the vector whose j-th element is \(\partial k(x_i, x_j) / \partial x_i\), and \(Z'_i\) is the derivative of the normalization constant.

The goal today is to derive the second derivative, \(H\). Like the first derivative, it will have two terms,

\[ H = H_1 - H_2 \]

Ultimately, we'll split the second term \(H_2\) into two sub-terms:

\[ H = H_1 - H_{2,A} - H_{2,B} \]

Prerequisite: \(\frac{\partial \delta_i}{\partial x_j}\)

Recall that the elements of \(\delta_i\) are the partial derivatives of the kernel function w.r.t. its first input. The derivatives of \(\delta_i\) will therefore be given by the second partial derivatives of the kernel function.

\[ \begin{align} \frac{\partial^2 k(x_i, x_j)}{\partial x_i^2} &= x_j - \min(x_i, x_j) \\ \frac{\partial^2 k(x_i, x_j)}{\partial x_i \partial x_j} &= \min(x_i, x_j) \end{align} \]

The derivative of the j-th element of \(\delta_i\) is derived below.

\[ \begin{align} \frac{\partial \left(\delta_i \right)_j}{\partial x_k} &= \frac{\partial^2 k(x_i, x_j)}{\partial x_i \partial x_k} \\ &= (x_j - \min(x_i, x_j))\mathbb{1}_{k = i} + \min(x_i, x_j) \mathbb{1}_{k = j} \end{align} \]

Note that this handles the special cases where \(k = i = j\) and where \(k \neq i,\; k \neq j\).

We can generalize this to the full vector \(\delta_i\):

\[ \begin{align} \frac{\partial \delta_i }{\partial x_k} &= A_{i} \mathbb{1}_{k = i} + B_{ik}\\ A_{i} &= (x_1 - \min(x_i, x_1), \dots, x_j - \min(x_i, x_j), \dots)^\top \\ B_{ik} &= (0, \dots, \underbrace{\min(x_i, x_k)}_\text{k-th element}, \dots, 0)^\top \end{align} \]

The \(A\) term contributes only to the on-diagonal Hessian elements, whereas the \(B\) term appears in every element.
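For concreteness, here is a minimal Octave/MATLAB sketch of \(\partial \delta_i / \partial x_k\), in the same style as the one-liner near the end of this post. The function name ddelta_dx and the assumption that x is a column vector of input locations are mine, for illustration only.

function dd = ddelta_dx(x, i, k)
  % d(delta_i)/dx_k = A_i * 1(k == i) + B_ik, per the expression above
  Ai     = x(:) - min(x(i), x(:));    % j-th element: x_j - min(x_i, x_j)
  Bik    = zeros(numel(x), 1);
  Bik(k) = min(x(i), x(k));           % only the k-th element is nonzero
  dd     = Ai * (k == i) + Bik;
end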

First term, \(H_1 = f_i'\)

We use the product rule to take the derivative of \(f_i = z_i \, \delta_i^\top z\).

\[ \begin{align} \frac{\partial f_i(x)}{\partial x_j} &= \left ( \frac{\partial}{\partial x_j} \, z_i(x) \right ) \delta_i^\top(x) z(x) + z_i(x) \left ( \frac{\partial}{\partial x_j}\delta_i^\top(x) \right ) z(x) + z_i(x) \, \delta_i^\top(x) \left ( \frac{\partial}{\partial x_j} z(x) \right ) \\ &= z_i' (x) \, \delta_i^\top(x) \, z(x) + z_i(x) \, (A_{i}\mathbb{1}_{j = i} + B_{ij})^\top \, z(x) + z_i(x) \, \delta_i^\top(x) \, z'(x) \\ &= z_i' (x) \, \delta_i^\top(x) \, z(x) + \mathbb{1}_{j = i} z_i(x) \, A_{i}^\top \, z(x) + z_i(x) \, \min(x_i, x_j) \, z_j(x) + z_i(x) \, \delta_i^\top(x) \, z'(x) \end{align} \]

where

\[ \begin{align} z' = z'_{(j)} = \frac{\partial z(x)}{\partial x_j} &= \frac{\partial}{\partial x_j} S^\top U^{-1}(x) S y \\ &= S^\top V'_{(j)} S y \\ &= -S^\top U^{-1} U' U^{-1} S y \\ &= -(S^\top U^{-1} S) K' (S^\top U^{-1} S) y \end{align} \]

with \(V'_{(j)} = \frac{\partial U^{-1}}{\partial x_j} = -U^{-1} U'_{(j)} U^{-1}\) and \(U'_{(j)} = S K'_{(j)} S^\top\).
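The full Jacobian of \(z\) (used below) can be built one column at a time from this expression. Here is a minimal Octave/MATLAB sketch, assuming Delta holds the \(\delta_i^\top\) vectors as rows (defined below) and using \(z = S^\top U^{-1} S y\), so that \(z'_{(j)} = -W K'_{(j)} z\) with \(W = S^\top U^{-1} S\); the function and variable names are illustrative, not from the post.

function Zjac = z_jacobian(S, U, Delta, z)
  % Column j is z'_(j) = -(S' U^{-1} S) K'_(j) (S' U^{-1} S) y = -W K'_(j) z.
  % K'_(j) has delta_j in its j-th row and column (the B + B' structure used later).
  n    = size(Delta, 1);
  W    = S' * (U \ S);                     % S' U^{-1} S, via a solve instead of inv()
  Zjac = zeros(n, n);
  for j = 1:n
    Kp         = zeros(n, n);
    Kp(:, j)   = Kp(:, j) + Delta(j, :)';  % delta_j in column j ...
    Kp(j, :)   = Kp(j, :) + Delta(j, :);   % ... and in row j
    Zjac(:, j) = -W * (Kp * z);
  end
end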

Our goal is to generalize the element-wise expression above into a single expression for the entire Hessian matrix. Note that the term containing \(\mathbb{1}_{j = i}\) vanishes when \(i \neq j\), so it becomes a diagonal matrix in the Hessian expression. Let \(\mathcal{Z}'\) be the Jacobian of \(z\), whose j-th column is \(z'_{(j)}\). We can express the Hessian as

\[ H_1 = \mathcal{Z}' \odot \left[ \Delta \, z \, (1 \, 1 \, ...) \right] + M \odot \left(z z^\top \right ) + \text{diag}\left\{ z(x) \odot \Delta' z(x) \right\} + \left[ z \, (1 \, 1 \, ...) \right] \odot \left[ \Delta \, \mathcal{Z}' \right] \]

where \(M\) is a matrix whose elements \(m_{ij} = \min(x_i, x_j)\),

\(\Delta\) is a matrix whose rows are composed of the \(\delta_i\) vectors,

\(\Delta'\) is the matrix whose i-th row is the vector \(A_i\),

\(\odot\) is the Hadamard matrix product,

and diag() is an operator that converts a vector into a diagonal matrix.
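A sketch of this assembly in the same Octave/MATLAB style, assuming the quantities above are precomputed as z, Zjac (\(\mathcal{Z}'\)), Delta (\(\Delta\)), and DeltaP (\(\Delta'\)); function and variable names are illustrative only.

function H1 = hessian_term1(x, z, Zjac, Delta, DeltaP)
  % Assemble H_1 term by term from the Hadamard-product expression above.
  n  = numel(x);
  M  = min(x(:) * ones(1, n), ones(n, 1) * x(:)');  % m_ij = min(x_i, x_j)
  H1 = Zjac .* ((Delta * z) * ones(1, n)) ...       % z'_ij (delta_i' z)
     + M .* (z * z') ...                            % min(x_i, x_j) z_i z_j
     + diag(z .* (DeltaP * z)) ...                  % diagonal A_i term
     + (z * ones(1, n)) .* (Delta * Zjac);          % z_i (delta_i' z'_(j))
end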

Second term, \((H_2)_{ij} = Z''_{ij}(x)\), split as \(H_2 = H_{2,A} + H_{2,B}\)

Below are the expressions for the zeroth, first, and second derivatives of \(Z\):

\[ \begin{align} Z &= 0.5 \log(\det(S K S^\top + I)) \\ \frac{\partial Z}{\partial x_i} &= 0.5 \text{Tr} \left[ U^{-1} U' \right] \\ \frac{\partial^2 Z}{\partial x_i \partial x_j} &= 0.5 \text{Tr} \left[ \frac{\partial U^{-1}}{\partial x_j} U' + U^{-1} \frac{\partial U'}{\partial x_j} \right] \\ &= 0.5 \text{Tr} \left[ V'_{(j)} U'_{(i)} + U^{-1} U''_{(ij)} \right] \\ &= 0.5 \left \{ \text{Tr} \left[ V'_{(j)} U'_{(i)} \right] + \text{Tr} \left[ U^{-1} U''_{(ij)} \right] \right\} \\ &= 0.5 \text{Tr}[A] +0.5 \text{Tr}[B] \end{align} \]

where

\[ A = V'_{(j)} U'_{(i)} \\ B = U^{-1} U''_{(ij)} \]

These two terms correspond to the elements of the two Hessian terms, \(H_{2,A}\) and \(H_{2,B}\).

We'll begin by finding \(\text{Tr}[A]\) and \(H_{2,A}\).

Observe that we can rewrite \(U'_{(i)}\) as

\[ \begin{align} U'_{(i)} &= S K'_{(i)} S^\top \\ &= S (B + B^\top) S^\top \end{align} \]

where \(B\) is the sparse matrix whose only nonzero column is the i-th, equal to \(\delta_i\) (this \(B\) is distinct from the trace term \(B\) above and from the vectors \(B_{ik}\)),

\[ \begin{align} B = \left( \begin{array}{ccccc} \mathbf{0} & \cdots & \underbrace{\delta_i}_{i\text{-th column}} & \cdots & \mathbf{0} \end{array}\right). \end{align} \]

We can exploit this sparsity to expand \(U'_{(i)}\) as

\[ \begin{align} U'_{(i)} &= \left(S \, \delta_i \right) S_i^\top + S_i \left( S \, \delta_i \right)^\top \\ &= C_{(i)} + C_{(i)}^\top \end{align} \] where \(C_{(i)} = S \, \delta_i \, S_i^\top \) and \(S_i\) denotes the i-th column of \(S\).

We can use this identity to expand \(\text{Tr}[A]\).

\[ \begin{align} \text{Tr}[A] &= \text{Tr}[V'_{(j)} U'_{(i)}] \\ &= \text{Tr}[-U^{-1} U'_{(j)}U^{-1} U'_{(i)}] \\ &= -\text{Tr}[U^{-1} \left( C_{(j)} + C_{(j)}^\top \right) U^{-1} \left( C_{(i)} + C_{(i)}^\top \right) ] \\ &= -\text{Tr}[\left( U^{-1} C_{(j)} + U^{-1}C_{(j)}^\top \right) \left(U^{-1} C_{(i)} + U^{-1}C_{(i)}^\top \right) ] \\ &= -\text{Tr}\left [U^{-1} C_{(j)} U^{-1}C_{(i)} + U^{-1} C_{(j)} U^{-1}C_{(i)}^\top + U^{-1}C_{(j)}^\top U^{-1}C_{(i)} + U^{-1}C_{(j)}^\top U^{-1}C_{(i)} ^\top\right] \\ &= -2 \text{Tr}\left [U^{-1} C_{(j)} U^{-1}C_{(i)}\right] - 2 \text{Tr}\left[U^{-1} C_{(j)} U^{-1}C_{(i)}^\top \right ] \\ &= -2 \text{Tr}\left [U^{-1} \left( S \delta_j S_j^\top \right) U^{-1} \left ( S \delta_i S_i^\top \right)\right] - 2 \text{Tr}\left[U^{-1} \left( S \delta_j S_j^\top \right) U^{-1}\left( S_i \delta_i^\top S^\top \right) \right ] \\ &= -2 \text{Tr}\left [S_i^\top U^{-1} S \delta_j S_j^\top U^{-1} S \delta_i \right] - 2 \text{Tr}\left[ \delta_i^\top S^\top U^{-1} S \delta_j S_j^\top U^{-1} S_i \right ] \\ \end{align} \]

Collapsing the four trace terms into two pairs uses the invariance of the trace under transposition and cyclic permutation; the last step uses cyclic invariance again. Note that both expressions inside the trace operators are now scalars, which makes the trace redundant.

\[ \begin{align} \text{Tr}[A] &= -2 S_i^\top U^{-1} S \delta_j S_j^\top U^{-1} S \delta_i - 2 \delta_i^\top S^\top U^{-1} S \delta_j S_j^\top U^{-1} S_i \\ &= -2 \left( S_i^\top U^{-1} S \delta_j \right) \left(S_j^\top U^{-1} S \delta_i \right) - 2 \left(\delta_i^\top S^\top U^{-1} S \delta_j \right) \left( S_j^\top U^{-1} S_i \right) \\ \end{align} \]

Here, we've regrouped each term into a product of two dot products. Recalling that \(H_{2,A}\) carries the factor of 0.5 from \(Z''\), the factors of 2 cancel, and we can generalize this to the full Hessian as follows:

\[ \begin{align} H_{2,A} &= -\left( S^\top U^{-1} S \Delta^\top \right)^\top \odot \left( S^\top U^{-1} S \Delta^\top \right) - \left(\Delta S^\top U^{-1} S \Delta^\top\right) \odot \left(S^\top U^{-1} S \right) \end{align} \]
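In the same sketch style, with Delta holding the \(\delta_i^\top\) rows as before (names illustrative, not from the post):

function H2A = hessian_term2A(S, U, Delta)
  % H_{2,A} from the Hadamard-product expression above.
  W   = S' * (U \ S);   % S' U^{-1} S
  WDt = W * Delta';     % (i, j) element: S_i' U^{-1} S delta_j
  H2A = -(WDt') .* WDt - (Delta * WDt) .* W;
end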

Next is the second term, \(\text{Tr}[B]\).

First, let's derive \(U''_{(ij)}\).

\[ \begin{align} U''_{(ij)} = \frac{\partial U_{(i)}'}{\partial x_j} &= \frac{\partial}{\partial x_j} \left \{ \left(S \, \delta_i \right) S_i^\top + S_i \left( S \, \delta_i \right)^\top \right \} \\ &= \left(S \, \frac{\partial \delta_i}{\partial x_j} \right) S_i^\top + S_i \left( S \, \frac{\partial \delta_i}{\partial x_j} \right)^\top \\ &= S \, \left(A_i \mathbb{1}_{i = j} + B_{ij} \right) S_i^\top + S_i \left(A_i^\top \mathbb{1}_{i = j} + B_{ij}^\top \right) S^\top \\ &= \mathbb{1}_{i = j} S \, A_i \, S_i^\top + S\, B_{ij} S_i^\top + \mathbb{1}_{i = j} \, S_i \, A_i^\top \, S^\top + S_i \, B_{ij}^\top \, S^\top \\ &= \mathbb{1}_{i = j} S \, A_i \, S_i^\top + \mathbb{1}_{i = j} S_i \, A_i^\top \, S^\top + \min(x_i, x_j) \, S_j \, S_i^\top + \min(x_i, x_j) \, S_i \, S_j^\top \\ &= \mathbb{1}_{i = j} \left ( S \, A_i \, S_i^\top + S_i \, A_i^\top \, S^\top \right ) + \min(x_i, x_j) \left( S_i S_j^\top + S_j S_i^\top \right) \end{align} \]
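A per-element sketch of this final expression, with S(:, i) playing the role of \(S_i\); the function name and arguments are illustrative only.

function Upp = upp_ij(x, S, i, j)
  % U''_(ij), taken directly from the last line above.
  Ai  = x(:) - min(x(i), x(:));                     % the vector A_i
  mij = min(x(i), x(j));
  Upp = (i == j) * (S * Ai * S(:, i)' + S(:, i) * Ai' * S') ...
      + mij * (S(:, i) * S(:, j)' + S(:, j) * S(:, i)');
end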

Now we can derive \(\text{Tr}[B]\).

\[ \begin{align} \text{Tr}[B] &= \text{Tr}[U^{-1} U''_{(ij)}] \\ &= \text{Tr}[U^{-1} \left \{\mathbb{1}_{i = j} \left ( S \, A_i \, S_i^\top + S_i \, A_i^\top \, S^\top \right ) + \min(x_i, x_j) \left( S_i S_j^\top + S_j S_i^\top \right) \right \} ] \\ &= \mathbb{1}_{i = j} \text{Tr}\left [ U^{-1} S A_i S_i^\top \right ] + \mathbb{1}_{i = j} \text{Tr}\left [ U^{-1} S_i A_i^\top S^\top \right ] + \min(x_i, x_j) \text{Tr}\left [ U^{-1} S_i S_j^\top \right ] + \min(x_i, x_j) \text{Tr}\left [ U^{-1} S_j S_i^\top \right ] \\ &= \mathbb{1}_{i = j} \text{Tr}\left [ S_i^\top U^{-1} S A_i \right ] + \mathbb{1}_{i = j} \text{Tr}\left [ A_i^\top S^\top U^{-1} S_i \right ] + \min(x_i, x_j) \text{Tr}\left [S_j^\top U^{-1} S_i \right ] + \min(x_i, x_j) \text{Tr}\left [ S_i^\top U^{-1} S_j \right ] \\ &= 2 \mathbb{1}_{i = j} S_i^\top U^{-1} S A_i + 2 \min(x_i, x_j) S_j^\top U^{-1} S_i \\ \end{align} \]

This gives a single element of the Hessian term. As before, the factor of 2 cancels the 0.5 from \(Z''\), and we can compute the entire term with matrix arithmetic:

\[ H_{2,B} = M \odot S^\top U^{-1} S + ( S^\top \left ( U^{-1} S \mathcal{A} \right ) ) \odot I \]

Here, M is the matrix with elements \(m_{ij} = \min(x_i, x_j)\) as before, and \(\mathcal{A}\) is the matrix whose i-th column is \(A_i\). Note that only the diagonal elements of the second term are kept; in code, this can be computed as

diag(sum(S .* (inv(U) * S * A)))
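Putting both pieces of \(H_{2,B}\) together in the same sketch style, with \(\mathcal{A}\) passed in as Amat; the names are illustrative, and the backslash solve replaces inv(U) but computes the same thing.

function H2B = hessian_term2B(x, S, U, Amat)
  % H_{2,B} = M .* (S' U^{-1} S) + diagonal part from the one-liner above.
  n   = numel(x);
  M   = min(x(:) * ones(1, n), ones(n, 1) * x(:)');  % m_ij = min(x_i, x_j)
  H2B = M .* (S' * (U \ S)) + diag(sum(S .* (U \ (S * Amat))));
end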

Full Hessian

The full Hessian combines the three parts derived above:

\[ \begin{align} H =& H_1 - H_{2,A} - H_{2,B} \\ =& \mathcal{Z}' \odot \left[ \Delta \, z \, (1 \, 1 \, ...) \right] + M \odot \left(z z^\top \right ) + \text{diag}\left\{ z(x) \odot \Delta' z(x) \right\} + \left[ z \, (1 \, 1 \, ...) \right] \odot \left[ \Delta \, \mathcal{Z}' \right] + \\ & \left( S^\top U^{-1} S \Delta^\top \right)^\top \odot \left( S^\top U^{-1} S \Delta^\top \right) + \left(\Delta S^\top U^{-1} S \Delta^\top\right) \odot \left(S^\top U^{-1} S \right) - \\ & M \odot S^\top U^{-1} S - ( S^\top \left ( U^{-1} S \mathcal{A} \right ) ) \odot I \end{align} \]
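Finally, a sketch that strings the pieces together using the helper functions sketched above. It assumes x, y, S, U, Delta (rows \(\delta_i^\top\)), and DeltaP (rows \(A_i^\top\)) are already in the workspace; all names are illustrative.

% Assemble the full Hessian from the pieces sketched above.
W    = S' * (U \ S);
z    = W * y;                              % z = S' U^{-1} S y
Zjac = z_jacobian(S, U, Delta, z);
H    = hessian_term1(x, z, Zjac, Delta, DeltaP) ...
     - hessian_term2A(S, U, Delta) ...
     - hessian_term2B(x, S, U, DeltaP');   % the A-matrix has A_i as its i-th column
% The Hessian of a scalar function is symmetric; use that as a quick sanity check.
assert(norm(H - H', 'fro') < 1e-8 * max(1, norm(H, 'fro')));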
Posted by Kyle Simek