TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding

¹Xi'an Jiaotong University   ²University of Illinois at Urbana-Champaign
* Equal Contribution

We propose TAMM, a novel multi-modal learning framework with two learning stages and three synergetic adapter modules. TAMM better exploits both the image and language modalities and improves 3D shape representations.

Abstract

The limited scale of current 3D shape datasets hinders advances in 3D shape understanding and motivates multi-modal learning approaches, which transfer knowledge from the data-abundant 2D image and language modalities to 3D shapes. However, even though image and language representations have been aligned by cross-modal models such as CLIP, we find that the image modality fails to contribute as much as language in existing multi-modal 3D representation learning methods. We attribute this to the domain shift in the 2D images and the distinct focus of each modality.

To more effectively leverage both modalities in pre-training, we introduce TriAdapter Multi-Modal Learning (TAMM) -- a novel two-stage learning approach based on three synergetic adapters. First, our CLIP Image Adapter mitigates the domain gap between 3D-rendered images and natural images by adapting the visual representations of CLIP to synthetic image-text pairs. Subsequently, our Dual Adapters decouple the 3D shape representation space into two complementary sub-spaces, one focusing on visual attributes and the other on semantic understanding, which ensures a more comprehensive and effective multi-modal pre-training.

Extensive experiments demonstrate that TAMM consistently enhances 3D representations across a wide range of 3D encoder architectures, pre-training datasets, and downstream tasks. Notably, we boost the zero-shot classification accuracy on Objaverse-LVIS from 46.8% to 50.7%, and improve the 5-way 10-shot linear probing classification accuracy on ModelNet40 from 96.1% to 99.0%.

Overall Pipeline

In Stage 1, TAMM fine-tunes a lightweight CLIP Image Adapter (CIA) through contrastive learning, re-aligning the image features with the text features to alleviate the domain shift originating from rendered images.
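As a concrete illustration (a minimal sketch, not the released implementation), one way Stage 1 could look in PyTorch is shown below: a small residual MLP adapter on top of frozen CLIP image features, fine-tuned with a symmetric contrastive loss against frozen CLIP text features of the rendered images' captions. The adapter width, residual ratio, temperature, and the use of pre-extracted features are all assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIPImageAdapter(nn.Module):
    """Residual bottleneck MLP applied to frozen CLIP image features (illustrative)."""
    def __init__(self, dim=512, hidden=128, residual_ratio=0.6):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, dim), nn.ReLU(inplace=True),
        )
        self.residual_ratio = residual_ratio

    def forward(self, image_feat):
        adapted = self.mlp(image_feat)
        # Blend adapted and original features so CLIP's prior is preserved.
        return self.residual_ratio * adapted + (1 - self.residual_ratio) * image_feat

def contrastive_loss(image_feat, text_feat, temperature=0.07):
    """Symmetric InfoNCE loss between L2-normalized image and text feature batches."""
    image_feat = F.normalize(image_feat, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    logits = image_feat @ text_feat.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# One training step with pre-extracted (frozen) CLIP features of rendered images and captions.
adapter = CLIPImageAdapter()
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)
rendered_image_feat = torch.randn(32, 512)  # frozen CLIP image features (placeholder)
caption_text_feat = torch.randn(32, 512)    # frozen CLIP text features (placeholder)
loss = contrastive_loss(adapter(rendered_image_feat), caption_text_feat)
loss.backward()
optimizer.step()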

In Stage 2, TAMM introduces an Image Alignment Adapter (IAA) and a Text Alignment Adapter (TAA) to decouple 3D representations into two sub-spaces, one focusing more on visual attributes and the other on semantic understanding, ensuring a more comprehensive and effective multi-modal pre-training strategy.
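The sketch below illustrates the Stage-2 objective under the same caveats: two lightweight MLP heads, standing in for IAA and TAA, project a 3D encoder feature into two sub-spaces, one aligned contrastively with the Stage-1-adapted image features and the other with CLIP text features. The head sizes and equal loss weighting are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

def contrastive_loss(a, b, temperature=0.07):
    # Symmetric InfoNCE between L2-normalized feature batches (same form as the Stage-1 sketch).
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

class AlignmentAdapter(nn.Module):
    """Small MLP head projecting a 3D shape feature into one alignment sub-space."""
    def __init__(self, in_dim=512, out_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.GELU(), nn.Linear(hidden, out_dim))

    def forward(self, shape_feat):
        return self.mlp(shape_feat)

iaa = AlignmentAdapter()  # Image Alignment Adapter: sub-space aligned with image features
taa = AlignmentAdapter()  # Text Alignment Adapter: sub-space aligned with text features

def stage2_loss(shape_feat, adapted_image_feat, text_feat):
    # Decoupled alignment: one sub-space is matched to the Stage-1-adapted image features,
    # the other to text features; the total objective here is simply their sum.
    return (contrastive_loss(iaa(shape_feat), adapted_image_feat)
            + contrastive_loss(taa(shape_feat), text_feat))

# One training step with placeholder features from a 3D point cloud encoder and frozen CLIP.
shape_feat = torch.randn(32, 512)
adapted_image_feat = torch.randn(32, 512)
text_feat = torch.randn(32, 512)
loss = stage2_loss(shape_feat, adapted_image_feat, text_feat)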

TAMM adaptively utilizes the decoupled 3D features for various downstream tasks, including linear probing classification and zero-shot classification (bottom), achieving more robust classification results.
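For the zero-shot path, one simple way to use the decoupled features is sketched below: both sub-space features are compared to CLIP text embeddings of class prompts by cosine similarity, and the two score maps are averaged. TAMM combines the sub-spaces adaptively per task, so plain averaging, the identity stand-ins for the trained heads, and the placeholder embeddings are simplifications for illustration only.

import torch
import torch.nn.functional as F

def zero_shot_logits(shape_feat, image_head, text_head, class_text_feat):
    # Cosine similarity of each sub-space feature to the class prompt embeddings,
    # averaged across the two sub-spaces (a simplified fusion for illustration).
    class_text_feat = F.normalize(class_text_feat, dim=-1)
    sims = []
    for head in (image_head, text_head):
        feat = F.normalize(head(shape_feat), dim=-1)
        sims.append(feat @ class_text_feat.t())
    return torch.stack(sims).mean(dim=0)

# Stand-ins for the trained IAA / TAA heads and for CLIP text embeddings of class prompts.
iaa_head = taa_head = torch.nn.Identity()
class_text_feat = torch.randn(40, 512)  # e.g., prompts for 40 ModelNet40 classes
shape_feat = torch.randn(8, 512)        # 3D encoder features for 8 shapes
pred = zero_shot_logits(shape_feat, iaa_head, taa_head, class_text_feat).argmax(dim=-1)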

Zero-shot Classification Results

TAMM sets a new state of the art in zero-shot classification across the Objaverse-LVIS, ModelNet40, and ScanObjectNN benchmarks.

Linear Probing Classification Results

TAMM outperforms previous methods by a large margin in linear-probing 3D classification accuracy.
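For reference, the linear-probing protocol can be sketched as follows, assuming the usual setup: the pre-trained 3D encoder and adapters are frozen, features are extracted once, and only a linear classifier is trained on top. The random placeholder features and the use of scikit-learn's logistic regression are assumptions, not the paper's exact evaluation code.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholders for frozen TAMM features of ModelNet40 shapes (e.g., 512-D per shape).
train_feat, train_label = np.random.randn(200, 512), np.random.randint(0, 40, 200)
test_feat, test_label = np.random.randn(50, 512), np.random.randint(0, 40, 50)

probe = LogisticRegression(max_iter=1000)  # the only trainable component
probe.fit(train_feat, train_label)
accuracy = probe.score(test_feat, test_label)
print(f"linear-probe accuracy: {accuracy:.3f}")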

Real World Recognition Results

Qualitative Results

BibTeX


    @article{zhang2024tamm,
      title={TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding},
      author={Zhang, Zhihao and Cao, Shengcao and Wang, Yu-Xiong},
      journal={arXiv preprint arXiv:2402.18490},
      year={2024}
    }