To optimize indices, we'll need to compute the derivative of the marginal log-likelihood w.r.t. changing indices.
I first tried to derive this using the generalization of the chain rule to matrix expressions (see the Matrix Cookbook, section 2.8.1), but the computation exploded. Since the quantity we ultimately need is the derivative of a simple single-input, single-output function, we can use differentials to derive the solution instead.
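For reference, the two standard matrix-differential identities that do most of the work in this style of derivation (generic facts, not specific to this model) are
\[ d \log\lvert A \rvert = \operatorname{tr}\!\left(A^{-1}\, dA\right), \qquad d\!\left(A^{-1}\right) = -A^{-1}\,(dA)\,A^{-1}. \]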
Let the marginal likelihood as a function of indices be \(g(x)\):
Here, \(M = S^\top U^{-1} S\) (which is symmetric), and \(z = M y\).
This equation gives us a single element of the gradient, namely \(dg(x)/dx_i\). However, once \(z\) is computed, we can reuse it when recomputing (1) for all the other \(x_j\)'s. The cost of each subsequent gradient element becomes \(O(n^2)\), making the total gradient \(O(n^3)\), which is pretty good. (This assumes the \(K'\) matrices can be computed efficiently, which is true; see below.) However, we also observe that \(K'\) is sparse, with only \(O(n)\) nonzero entries, so we can use sparse multiplication to reduce the per-element running time to linear; the full gradient then takes \(O(n^2)\), assuming \(z\) is precomputed. Cool!
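To see where the linear-time claim comes from: a matrix whose nonzeros are confined to row \(i\) and column \(i\) has at most \(2n - 1\) nonzero entries, so any product against a precomputed vector, e.g.
\[ (K' z)_j = \sum_l K'_{jl}\, z_l, \]
involves only \(O(n)\) nonzero terms in total, rather than the \(O(n^2)\) a dense multiply would cost.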
First, we'll lay out the general form of \(K'\), whose elements are the full derivatives of the kernel w.r.t. \(x_k\).
The first term is nonzero only on the i-th row of K', and the second term is nonzero on the i-th column of K'. This suggests the following convenient sparse representation for K'.
Let \(\delta_i\) be the vector whose j-th element is \( \frac{\partial k(x_i, x_j)}{\partial x_i} \). Using this notation, we can rewrite \(K'\) as
where \(C = \left(0 \, \dots \, \delta_i \, \dots \, 0 \right) \).
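In other words, \(K'\) is just the sum of \(C\) and its transpose, \(K' = C + C^\top\): \(\delta_i\) fills the \(i\)-th column of \(C\), and the transposed copy fills the \(i\)-th row.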
Below we derive the derivative \(\frac{\partial k(x_i, x_j)}{\partial x_i}\) for each of the three covariance expressions.
Cubic covariance
Recall the cubic covariance expression:
Taking the derivative w.r.t. \(x_i\) gives:
Or equivalently
Linear Covariance
Recall the linear covariance expression:
Offset Covariance
Recall the offset covariance expression:
Implementation
An end-to-end version is implemented in kernel/get_model_kernel_derivative.m; see also the component functions in kernel/get_spacial_kernel_derivative.m and kernel/cubic_kernel_derivative.m.
These functions return all of the partial derivatives of the kernel matrix with respect to the first input. The i-th row of the result makes up the nonzero values in \(\frac{\partial K}{\partial x_i}\). Below is example code that computes all of the partial derivative matrices.
N = 100;
% construct indices
x = linspace(0, 10, N);
% construct derivative rows
d_kernel = get_model_kernel_derivative(...);
d_K = eval_kernel(d_kernel, x, x);
% construct dK/dx_i, for each i = 1..N
d_K_d_x = cell(1, N);
for i = 1:N
tmp = sparse(N, N);
tmp(i,:) = d_K(i,:);
tmp(:,i) = d_K(i,:)';
d_K_d_x{i} = tmp;
end
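As a quick usage sketch (here z is a random stand-in for the precomputed \(z = My\) from earlier), each sparse matrix can be applied in \(O(N)\) time:
z = randn(N, 1);                           % stand-in for the precomputed z = M*y
quad_forms = zeros(1, N);
for i = 1:N
    quad_forms(i) = z' * (d_K_d_x{i} * z); % sparse multiply: O(N) per index
end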
Directional Derivatives
I think we can get directional derivatives of \(K\) by taking the weighted sum of partial derivatives, where the weights are the component lengths of the direction vector. I have yet to confirm this beyond a hand-wavy hunch, and in practice, this might not even be needed, since computing the full gradient is so efficient.
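If that hunch holds, the computation is just a weighted sum over the sparse matrices built above (v here is a hypothetical unit direction over the indices):
v = randn(N, 1);
v = v / norm(v);                            % hypothetical unit direction vector
d_K_dir = sparse(N, N);
for i = 1:N
    d_K_dir = d_K_dir + v(i) * d_K_d_x{i};  % weighted sum of sparse partials
end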
As we saw earlier, \(\frac{\partial K}{\partial x_i}\) is sparse, and has the form in equation (4). We can use the sparsity to ultimately compute the entire gradient in a single matrix multiplication.
First, we'll rewrite \(g'\) in terms of \(\delta_i\):
We can generalize this to the entire gradient using matrix operations:
where \(\Delta\) is the matrix whose i-th row is \(\delta_i\), and \(\odot\) denotes element-wise multiplication.
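To make the shape of this concrete (showing only the quadratic-form piece; the exact constants and the remaining terms depend on (1), which isn't reproduced here), plugging the sparse form of \(K'\) into a quadratic form in \(z\) gives
\[ z^\top \frac{\partial K}{\partial x_i} z = z^\top \left(C + C^\top\right) z = 2\, z_i \left(\delta_i^\top z\right), \]
so the vector collecting all \(n\) of these terms is \(2\, z \odot (\Delta z)\): one matrix-vector product followed by an element-wise product.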
To handle multiple dimensions, simply apply this to each dimension independently and sum the results.
Posted by Kyle Simek