Multiview Compressive Coding for 3D Reconstruction
Abstract
A central goal of visual recognition is to understand objects and scenes from a single image. 2D recognition has witnessed tremendous progress thanks to large-scale learning and general-purpose representations. Comparatively, 3D poses new challenges stemming from occlusions not depicted in the image. Prior works try to overcome these by inferring from multiple views or rely on scarce CAD models and category-specific priors which hinder scaling to novel settings. In this work, we explore single-view 3D reconstruction by learning generalizable representations inspired by advances in self-supervised learning. We introduce a simple framework that operates on 3D points of single objects or whole scenes coupled with category-agnostic large-scale training from diverse RGB-D videos. Our model, Multiview Compressive Coding (MCC), learns to compress the input appearance and geometry to predict the 3D structure by querying a 3D-aware decoder. MCC's generality and efficiency allow it to learn from large-scale and diverse data sources with strong generalization to novel objects imagined by DALL·E 2 or captured in-the-wild with an iPhone.
Community
Introduces Multiview Compressive Coding (MCC): encode the appearance and 3D geometry of a scene, then predict/reconstruct it using a 3D-aware decoder; a single self-supervised framework trained on RGB-D videos; dense single-view 3D reconstruction and shape completion building on advances in SSL and MAE (masked modeling). Takes RGB-D input, compresses it with transformers, then samples query points and passes them to the 3D-aware decoder (which predicts occupancy and color of new/unseen/occluded points). Generalizes to Taskonomy, CO3D, Hypersim, in-the-wild iPhone captures (with depth), and DALL-E 2 generations. The RGB-D input is unprojected to a 3D point cloud; the RGB/image encoder is a transformer/ViT (converts the image to patch embeddings); the 3D (XYZ) point encoder is a linear projection, local attention and pooling, followed by a transformer; the decoder takes the concatenated embeddings and the 3D point queries, passes them through a transformer, and outputs the full object (a binary classifier for occupancy and a 256-way classifier for color). The decoder's attention masking pattern has self-attention for queries (and embeddings); embeddings do not attend to queries (the other direction is allowed); there is also a global context/CLS token. Rotational equivariance is encouraged during training through data augmentation; the coordinate system follows CO3D. Qualitative results on novel CO3D-v2 object categories (better than PoinTr) and better quantitative results than NeRF-WCE and NerFormer; includes ablations on the encoder (decoupled) structure, encoder choices (PointNet or transformer for 3D), and decoder design. Tested on iPhone captures and DALL-E 2 generated images (depth comes from the iPhone sensor; when only RGB is available, a ViT is used for dense depth prediction). Also performs scene reconstruction from a single image (on the Hypersim dataset). Limitations: sensitive to depth quality, and high-fidelity texture is challenging. The appendix has architecture specifications and other implementation details (largely following CO3D). The code shows a reconstruction-like loss. From Meta (Jitendra Malik).
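Two of the steps in the notes above, unprojecting RGB-D to 3D points and the decoder's attention mask, can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' code: the pinhole intrinsics `fx, fy, cx, cy`, the function names, and the token ordering (encoder embeddings first, queries after) are all hypothetical, and the mask assumes one reading of the notes in which each query attends to the encoder embeddings and to itself but not to other queries.

```python
import numpy as np

def unproject_depth(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to an (H, W, 3) grid of camera-space XYZ points.

    Assumes a simple pinhole camera; fx, fy, cx, cy are hypothetical intrinsics.
    """
    h, w = depth.shape
    # u holds the pixel column index, v the pixel row index, both of shape (H, W).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

def decoder_attention_mask(num_enc, num_queries):
    """Boolean mask where mask[i, j] is True if token i may attend to token j.

    Assumed token order: num_enc encoder embeddings, then num_queries queries.
    """
    n = num_enc + num_queries
    mask = np.zeros((n, n), dtype=bool)
    # Every token (embedding or query) may attend to the encoder embeddings.
    mask[:, :num_enc] = True
    # Each query additionally attends to itself (assumption: not to other queries).
    idx = np.arange(num_enc, n)
    mask[idx, idx] = True
    return mask

# Tiny usage example: a 2x2 depth map and 2 embeddings + 2 queries.
points = unproject_depth(np.full((2, 2), 2.0), fx=1.0, fy=1.0, cx=0.0, cy=0.0)
mask = decoder_attention_mask(num_enc=2, num_queries=2)
```

With a mask of this shape, the prediction for each query point does not depend on which other queries are in the batch, so the number of query points can vary at inference time without changing individual outputs.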