[Work Log] Linux NVidia/Cuda/X11 errors; Cuda server; Matlab integration

October 21, 2013

Getting my (cuda-equipped) desktop system running with likelihood_server.


Trouble building makefiles: a csh issue with the new makefiles. Kobus fixed it.


Still can't build makefiles. Problems in the ./lib subdirectory are apparently blocking kjb_add_makefiles from finishing.


Some makefiles were missing in ./lib. Adding them was problematic because, as before, kjb_add_makefiles wasn't finishing: dependency directories were also missing makefiles. I also had some issues in kjb/lib, which was out of date from svn and had some compile bugs I introduced in my last commit.


Compiling now, but the shader is giving errors at runtime (OpenGL can't compile it, but no error message is given).

Rebooting...


No desktop. Missing-disk issue? Tweaked fstab; still no X11. Must be a driver issue. Couldn't find the NVidia driver I downloaded from their web site months ago, so I'm using the driver from the "Ubuntu-X" PPA.


Still no X. /var/log/Xorg.0.log says the driver and client have different versions. Got a tip online:

sudo dpkg --get-selections | grep nvidia

Lots of extra crap there (old versions, experimental drivers, etc.). Ran 'sudo apt-get purge' on all the non-essentials. Now we're getting to the login screen, but login is failing (it simply returns to the login screen).


Found solution:

sudo rm ~/.Xauthority

Booting successfully.


Running likelihood_server... the shader is now compiling successfully! The previous issues must have been driver problems caused by the recent 'sudo apt-get upgrade' without a restart.

Getting segfault when dumping pixels, though.


Caused by trying to read from cuda memory as if it were main memory. We shouldn't be dumping that way if we're using the GPU (or we should use a different dump function).


Fixed: if using the gpu, copy from cuda memory to a temp buffer, then call the normal routine.
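In rough outline, the fix looks like this (a sketch only; the real routine and buffer names in the server differ, and dump_pixels_cpu here is a stand-in for the existing CPU dump function):

    #include <cuda_runtime.h>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Stand-in for the existing CPU-side dump routine, which expects host memory.
    void dump_pixels_cpu(const float* host_pixels, std::size_t num_pixels);

    // GPU path: copy the pixel buffer from device memory into a temporary
    // host buffer, then reuse the normal CPU dump routine.
    void dump_pixels_gpu(const float* device_pixels, std::size_t num_pixels)
    {
        std::vector<float> tmp(num_pixels);
        cudaError_t err = cudaMemcpy(tmp.data(), device_pixels,
                                     num_pixels * sizeof(float),
                                     cudaMemcpyDeviceToHost);
        if (err != cudaSuccess)
        {
            std::fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
            return;
        }
        dump_pixels_cpu(tmp.data(), num_pixels);
    }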


Getting nan's from the likelihood. Need to dump from the GPU likelihood, which means digging into libcudcvt. Got some build errors resulting from a boost bug from 8 months ago (which arose because I updated GCC).


Updated cudcvt to grab blurred frames for dumping (some method names changed to better match the analogous CPU likelihood in KJB).

Moved "dump" routines from Bd_mv_likelihood_cpu to the base class Bd_mv_likelihood as pure virtual function, and implemented a version in Bd_mv_likelihood_gpu to call cudcvt's new frame-grabbing code. So "dump" mode now works on both cpu and gpu version.


Dumped data from the CUDA likelihood; it looks fine. So I'm still clueless as to why we're getting nan values.

Sanity check -- check that tests in libcudcvt are still passing


The test is failing on an assert. Looks like an overzealous assert with too tight a percent-error tolerance.
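For reference, the check is a standard relative-error tolerance, something like the following (illustrative names, not the actual test code):

    #include <algorithm>
    #include <cassert>
    #include <cmath>

    // Passes if the gpu and cpu results agree to within 'pct' percent.
    inline bool within_percent(double cpu, double gpu, double pct)
    {
        const double scale = std::max(std::fabs(cpu), 1e-12); // avoid divide-by-zero
        return 100.0 * std::fabs(gpu - cpu) / scale <= pct;
    }

    // e.g. assert(within_percent(cpu_likelihood, gpu_likelihood, 1.0));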

Loosened the threshold and the test finished, but the GMM model doesn't match between the gpu and cpu versions...

What's changed? Rolling back to the first release to see if the tests pass. Is it possible I never validated this? Or maybe different GMMs are being used between the cpu and gpu versions?


Old version crashed and burned hard. GPU is returning 'inf'. No help here...


Oops, svn version has lots of uncommitted changes, including bugfixes. No wonder it was no help.


Found the issue: I introduced a regression while troubleshooting the last bug. Grrr.


Still getting 'inf'. Checked:


Modified libcudcvt test to receive arbitrary data and model images. It's giving finite results, so it must be something about how I'm calling it in the likelihood server program.


Issue: the Bd likelihood in GMM mode gives lower values for the ground-truth model than for a random model.

Bug. Fixed.


Digging deeper into 'inf' issue.

Dug all the way into the thrust::inner_product call. Probed both buffers -- they look okay. The mixture components look okay too.
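The probing was along these lines (a sketch, not the actual instrumentation; the buffer names are made up):

    #include <thrust/count.h>
    #include <thrust/device_vector.h>
    #include <cstdio>

    // Detect NaN (x != x) and +/-inf (beyond the largest finite float)
    // without relying on isfinite being visible in device code.
    struct is_not_finite
    {
        __host__ __device__ bool operator()(float x) const
        {
            return !(x == x) || x > 3.402823466e+38f || x < -3.402823466e+38f;
        }
    };

    // Count non-finite entries in a device buffer before it reaches
    // thrust::inner_product.
    void probe_buffer(const thrust::device_vector<float>& buf, const char* name)
    {
        const long bad = thrust::count_if(buf.begin(), buf.end(), is_not_finite());
        std::printf("%s: %ld non-finite values out of %lu\n",
                    name, bad, (unsigned long)buf.size());
    }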


Got it! Was able to reproduce the 'inf's in the likelihood test program in libcudcvt. It was so hard to reproduce because it only occurred when:

  1. old GMM values were used
  2. unblurred model and data were used

Regarding 2, previously I just used the unblurred model, since it was at hand from a cuda dump, but used the blurred data from a training dump.

Blurring doesn't seem to matter; both the 2.0 and 5.0 blur settings give inf.

Interestingly, the perfect model doesn't overflow, but the random (almost null) model does.

Also, for all the 'inf' cases, the cpu results aren't terribly high, so float overflow seems unlikely.

Most likely it's the conditioning, where the joint pdf is divided by the marginal. Maybe we're getting some bizarre model values that are off the charts? But using the model image with the blurred data image is no problem, which suggests the model values aren't the issue.
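For context, the conditioning step in question has the generic form (writing d for the data value and m for the model value; this is just the conditional-density identity, not the exact code):

    p(m \mid d) = p(m, d) / p(d)

so if the marginal p(d) underflows to zero in single precision while the joint stays positive, the quotient comes out as inf.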


Interesting: Tried removing border edges and the problem disappeared. Is it possible that the blurring routine is not robust to near-edge pixels?


Idea: dump the image after scale-offset, but before the reduce.


Results:

Probability maps look sensible when a 2-pixel margin is added (--data-margin=2).

(image: 2.0 blurring, 2 pixel margin)

(image: 5.0 blurring, 2 pixel margin)

Notice what happens when we disable the data margin (the image is rendered dimmer and with blue padding to emphasize the effect):

(image: 2.0 blurring, 0 pixel margin)

Notice the white pixels around the border, which presumably correspond to inf values.

Inspecting the blurred data image, it looks like these pixels are, indeed, the brightest in the image. It's possible we were lucky enough to have found the maximum range of this GMM.

It looks like evaluating the bivariate normal is underflowing. It could have been brought back down to size during conditioning, but we never made it that far. Could refactor by computing the conditional covariance matrix beforehand, instead of taking a ratio of computed values.
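A sketch of that refactor for a single bivariate component, using the standard conditional-Gaussian identities (the variable names are mine, and the real code would apply this per mixture component, with the mixture weights also reweighted by each component's marginal in x):

    #include <cmath>

    // One bivariate Gaussian component over (x, y): mean (mu_x, mu_y),
    // covariance [[s_xx, s_xy], [s_xy, s_yy]].
    struct Bivariate_gaussian
    {
        double mu_x, mu_y;
        double s_xx, s_xy, s_yy;
    };

    // Log-density of y given x, from the precomputed conditional mean and
    // variance (the Schur complement), rather than evaluating the joint and
    // the marginal separately and dividing -- the division is where the inf
    // would creep in.
    double log_conditional(const Bivariate_gaussian& g, double x, double y)
    {
        const double cond_mean = g.mu_y + (g.s_xy / g.s_xx) * (x - g.mu_x);
        const double cond_var  = g.s_yy - (g.s_xy * g.s_xy) / g.s_xx;

        const double d = y - cond_mean;
        // 1.8378770664093453 = log(2 * pi)
        return -0.5 * (1.8378770664093453 + std::log(cond_var) + d * d / cond_var);
    }

Working in log space this way also sidesteps the underflow entirely.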

TODO (new)

Posted by Kyle Simek