Learning Models of Object Structure
Fig. 1 Example of the generative image model for detected features. The rightward arrows show statistical generation and the leftward arrows show feature detection.

The goal of this project is to learn stochastic geometric models of object categories from single-view images. Our approach focuses on learning structure models that are expressible as spatially contiguous assemblages of blocks. The assemblages, or topologies, are learned across groups of images, with one or more such topologies linked to an object category (e.g., chairs). Fitting a learned topology to an image is useful both for identifying an object's class and for detailing its geometry. The latter goes beyond labeling objects, as it provides the geometric structure of particular instances.
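As a concrete illustration, a block-based topology might be represented as a set of blocks plus attachment relations between them. This is a hypothetical sketch; the class names, fields, and dimensions below are illustrative and not the paper's actual data structures or parameter values.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    # Illustrative block dimensions (units arbitrary).
    width: float
    height: float
    depth: float

@dataclass
class Topology:
    blocks: list = field(default_factory=list)
    # Attachments as pairs (i, j): block j is attached to block i.
    attachments: list = field(default_factory=list)

# A table-like topology: one top slab with four attached legs.
table = Topology(
    blocks=[Block(1.2, 0.05, 0.8)] + [Block(0.05, 0.7, 0.05) for _ in range(4)],
    attachments=[(0, i) for i in range(1, 5)],
)
```

Representing attachments explicitly is what makes the assemblage spatially contiguous: every block (other than a root) is anchored to another block rather than floating freely.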

Fig. 2 Generated samples of tables and chairs from the learned structure topology and statistical category parameters.

We learn the models using joint statistical inference over instance- and category-level parameters. The instance parameters include the specific size and shape of an object in an image and the camera capturing the view. The category parameters encompass the shared object topology and shape statistics. Together these produce an image likelihood for detected features through a generative statistical imaging model (Figure 1). The category statistics additionally define a likelihood over structure and camera instance parameters (Figure 2).
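The two levels of the generative model can be sketched in a few lines: category statistics generate instance parameters, and model predictions generate noisy detected features. Everything here is a simplified, hypothetical stand-in (the statistic names, the Gaussian forms, and the noise scale are assumptions for illustration, not the paper's actual parameterization).

```python
import math
import random

# Hypothetical category statistics: (mean, std) per block dimension.
CATEGORY_STATS = {"width": (1.0, 0.1), "height": (0.8, 0.1), "depth": (1.0, 0.1)}

def sample_instance(stats, rng):
    """Draw instance-level block dimensions from the category statistics."""
    return {name: rng.gauss(mu, sd) for name, (mu, sd) in stats.items()}

def log_likelihood(detected, predicted, noise_sd=0.05):
    """Gaussian log-likelihood of detected feature values given model predictions."""
    ll = 0.0
    for d, p in zip(detected, predicted):
        ll += (-0.5 * ((d - p) / noise_sd) ** 2
               - math.log(noise_sd * math.sqrt(2.0 * math.pi)))
    return ll

rng = random.Random(0)
instance = sample_instance(CATEGORY_STATS, rng)
```

The key design point mirrored here is the factoring: category statistics score instances, while the imaging model scores detected features given an instance, so both levels can be inferred jointly.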

For model inference and learning, we use trans-dimensional sampling to explore topology hypotheses of varying dimension. We further alternate between Metropolis-Hastings and stochastic dynamics to fit the instance and category parameters. Figure 3 illustrates the learning process with a sequence of samples for two images in a set of tables; the instance parameters and shared category parameters are inferred simultaneously.

Fig. 3 From left to right, successive random samples from 2 of 15 table instances, each after 2K iterations of model inference. The category topology and statistics are learned simultaneously from the set of images; the form of the structure is shared across instances.

Our experiments on images of furniture objects, such as tables and chairs, suggest that this is an effective approach for learning models that encode simple representations of category geometry and its statistics. We have also shown that the learned models support inferring both category and geometry on held-out single-view images. Figure 4 shows example results for learned models of tables, chairs, footstools, sofas, and desks.

This work was published in NIPS 2009, and the software is available for download. We are currently working on substantial extensions in a technical report that will ultimately be a journal submission. Comments are welcome; please send me an email.

Fig. 4 Learning the topology of furniture objects. Sets of contiguous blocks were fit across five image data sets; model fitting is done jointly within each set. The fits for these examples are shown by the blocks drawn in red. Detected edge points are shown in green.