[Work Log] FIRE - piecewise linear inference

May 05, 2014

Finished compiling inference test with synthetic data.

Gradient evaluation is taking incredibly long, especially for a 7-dimensional model. Perhaps 100 data points is too many, but I'm guessing the cost of allocating vectors for each function evaluation is the bottleneck (GDB seems to agree).

Time to run 1 iteration, single-threaded, on bayes01 (deltas and speedups are relative to the previous row):

    Configuration                                   Time      Delta     Speedup
    Baseline (debug mode, allocation checking on)   1:28.94   --        --
    Heap checking disabled                          0:21.06   -1:07.88  4.22x
    Heap & initialization checking disabled         0:11.19   -0:09.87  1.88x
    PRODUCTION=1 (with -O2)                         0:07.24   -0:03.95  1.55x
    -O3                                             0:07.86   +0:00.62  0.92x

Used gprof with gprof2dot.py to get the following diagram: gprof.pdf.

iso_mvn_lpdf is getting hit hard:

    // Log density of an isotropic multivariate normal N(mu, epsilon^2 * I),
    // evaluated at y; mu and y point to D-element arrays.
    #include <cmath>    // log, M_PI
    #include <cstddef>  // size_t

    double iso_mvn_lpdf(const double* mu, const double* y, double epsilon, size_t D)
    {
        double accum = 0;
        double d;
        for(size_t i = 0; i < D; ++i)
        {
            d = (*mu++ - *y++) / epsilon;
            accum += d*d;
        }
        return -0.5 * (accum + D*log(2*M_PI*epsilon*epsilon));
    }
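
For reference, the quantity being computed is the isotropic multivariate normal log density (my notation, just restating what the code above does):

    \log \mathcal{N}(y;\, \mu,\, \varepsilon^2 I_D)
        = -\frac{1}{2} \left( \sum_{i=1}^{D} \left( \frac{\mu_i - y_i}{\varepsilon} \right)^2 + D \log(2\pi\varepsilon^2) \right)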

It's already pretty lean (no allocation, all C-style). But we can hoist the divide-by-epsilon out of the loop for an easy 1.8x speedup: since Σ((μᵢ − yᵢ)/ε)² = (1/ε²) Σ(μᵢ − yᵢ)², the D divides collapse into a single divide after the loop.

    double iso_mvn_lpdf(const double* mu, const double* y, double epsilon, size_t D)
    {
        double accum = 0;
        double d;
        for(size_t i = 0; i < D; ++i)
        {
            d = *mu++ - *y++;
            accum += d*d;
        }
        accum /= epsilon*epsilon;  // one divide instead of D
        return -0.5 * (accum + D*log(2*M_PI*epsilon*epsilon));
    }
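
As a sanity check that the refactor doesn't change the result, here's a minimal, hypothetical smoke test (not part of the actual test suite); it assumes the iso_mvn_lpdf above is in the same translation unit:

    // Hypothetical smoke test for iso_mvn_lpdf; compile together with the
    // function above, e.g. g++ -O2.
    #include <cmath>
    #include <cstdio>

    int main()
    {
        const double mu[2] = {1.0, 2.0};
        const double y[2]  = {1.5, 1.0};
        const double eps   = 0.5;

        // Hand-computed reference: residuals are -0.5 and 1.0, so
        // accum = (0.25 + 1.0) / eps^2 = 5.0, and the normalizer term
        // is 2 * log(2 * pi * 0.25).
        const double ref = -0.5 * (5.0 + 2.0 * std::log(2.0 * M_PI * 0.25));

        std::printf("lpdf = %.12f (reference %.12f)\n",
                    iso_mvn_lpdf(mu, y, eps, 2), ref);
        return 0;
    }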

Now the bottleneck is all the allocation, copying and freeing of kjb::Vector temporaries.

I tweaked the code for evaluating a piecewise linear function to avoid creating kjb::Vector temporaries, and running time dropped dramatically, from 11.4s to 1.26s. This is in production mode, so it's surprising that more temporaries aren't optimized out. A sketch of the kind of change follows.
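
The rewrite was roughly of this shape (a sketch with hypothetical names, not the actual FIRE code): evaluate one point at a time on raw double arrays instead of building kjb::Vector temporaries per call.

    // Sketch only: knots_x / knots_y / num_knots are hypothetical names.
    // The old version built kjb::Vector temporaries on every evaluation;
    // this version touches no heap at all.
    #include <cstddef>

    double piecewise_linear(
        const double* knots_x,   // breakpoints, strictly ascending
        const double* knots_y,   // function values at the breakpoints
        std::size_t   num_knots,
        double        x)
    {
        // Clamp outside the knot range.
        if (x <= knots_x[0])             return knots_y[0];
        if (x >= knots_x[num_knots - 1]) return knots_y[num_knots - 1];

        // Find the segment containing x (linear scan is fine for few knots).
        std::size_t i = 1;
        while (knots_x[i] < x) ++i;

        // Linear interpolation on segment [knots_x[i-1], knots_x[i]].
        const double t = (x - knots_x[i - 1]) / (knots_x[i] - knots_x[i - 1]);
        return knots_y[i - 1] + t * (knots_y[i] - knots_y[i - 1]);
    }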

gprof after eliminating temporaries: gprof.pdf.

Since one iteration takes about 1.2s, bumping up to 10 iterations.

Remaining speed-up opportunities: exploit gradient independence; parallel gradient evaluation.

Parallel gradient

Enabled 8-way parallel gradient evaluation, and got worse performance!

    single-threaded: 0:09.63
    multi-threaded:  0:15.48

This is despite top showing 550% CPU utilization.

Maybe gprof is affecting performance; the -pg instrumentation adds per-call overhead, and gprof is known not to handle multithreaded programs well.
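
For reference, the parallel gradient is conceptually just this (a sketch using std::thread with hypothetical names; the real code uses different machinery): each partial derivative is a pair of independent function evaluations, so the D dimensions can be split across workers.

    // Sketch: central-difference gradient with the dimensions split across
    // threads. Assumes f is thread-safe; each worker writes disjoint
    // entries of grad, so no locking is needed.
    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <thread>
    #include <vector>

    std::vector<double> parallel_gradient(
        const std::function<double(const std::vector<double>&)>& f,
        const std::vector<double>& x,
        double h,
        unsigned num_threads = 8)
    {
        const std::size_t D = x.size();
        std::vector<double> grad(D);

        auto worker = [&](std::size_t begin, std::size_t end)
        {
            std::vector<double> xp = x;  // per-thread scratch copy
            for (std::size_t i = begin; i < end; ++i)
            {
                xp[i] = x[i] + h;
                const double fp = f(xp);
                xp[i] = x[i] - h;
                const double fm = f(xp);
                xp[i] = x[i];
                grad[i] = (fp - fm) / (2.0 * h);
            }
        };

        std::vector<std::thread> pool;
        const std::size_t chunk = (D + num_threads - 1) / num_threads;
        for (std::size_t b = 0; b < D; b += chunk)
            pool.emplace_back(worker, b, std::min(b + chunk, D));
        for (std::thread& t : pool) t.join();
        return grad;
    }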

Tuning

To tune: step size, gradient size.

Posted by Kyle Simek