lsxi77777 committed

Commit 22881c1 · 1 Parent(s): 5032e08

Update README.md

Files changed (1): README.md (+7 -14)
README.md CHANGED
@@ -4,17 +4,10 @@ license: apache-2.0
 
 ## Abstract
 
- Image matching for both cross-view and cross-modality plays a critical role in multimodal perception. In practice, the
- modality gap caused by different imaging systems/styles poses great challenges to the matching task. Existing works try
- to extract invariant features for specific modalities and train on limited datasets, showing poor generalization. In
- this paper, we present MINIMA, a unified image matching framework for multiple cross-modal cases. Without pursuing fancy
- modules, our MINIMA aims to enhance universal performance from the perspective of data scaling up. For such purpose, we
- propose a simple yet effective data engine that can freely produce a large dataset containing multiple modalities, rich
- scenarios, and accurate matching labels. Specifically, we scale up the modalities from cheap but rich RGB-only matching
- data, by means of generative models. Under this setting, the matching labels and rich diversity of the RGB dataset are
- well inherited by the generated multimodal data. Benefiting from this, we construct MD-syn, a new comprehensive dataset
- that fills the data gap for general multimodal image matching. With MD-syn, we can directly train any advanced matching
- pipeline on randomly selected modality pairs to obtain cross-modal ability. Extensive experiments on in-domain and
- zero-shot matching tasks, including 19 cross-modal cases, demonstrate that our MINIMA can significantly outperform the
- baselines and even surpass modality-specific methods. The dataset and code are available
- at https://github.com/LSXI7/MINIMA .
+ Image matching for both cross-view and cross-modality plays a critical role in multimodal perception. In practice, the modality gap caused by different imaging systems/styles poses great challenges to the matching task. Existing works try to extract invariant features for specific modalities and train on limited datasets, showing poor generalization. In this paper, we present MINIMA, a unified image matching framework for multiple cross-modal cases. Without pursuing fancy modules, our MINIMA aims to enhance universal performance from the perspective of data scaling up. For such purpose, we propose a simple yet effective data engine that can freely produce a large dataset containing multiple modalities, rich scenarios, and accurate matching labels. Specifically, we scale up the modalities from cheap but rich RGB-only matching data, by means of generative models. Under this setting, the matching labels and rich diversity of the RGB dataset are well inherited by the generated multimodal data. Benefiting from this, we construct MD-syn, a new comprehensive dataset that fills the data gap for general multimodal image matching. With MD-syn, we can directly train any advanced matching pipeline on randomly selected modality pairs to obtain cross-modal ability. Extensive experiments on in-domain and zero-shot matching tasks, including 19 cross-modal cases, demonstrate that our MINIMA can significantly outperform the baselines and even surpass modality-specific methods. The dataset and code are available at https://github.com/LSXI7/MINIMA .
+
+
+
+ ## Citation
+
+ Paper: https://huggingface.co/papers/2412.19412
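
The updated abstract says that, because generated modalities inherit the RGB pair's matching labels, any matching pipeline can be trained "on randomly selected modality pairs". A minimal sketch of that sampling idea is given below; the modality list, sample layout, and field names are illustrative assumptions, not the MD-syn format or the MINIMA repository's actual API.

```python
import random

# Hypothetical sketch of training on randomly selected modality pairs.
# Modality names, dict layout, and field names are assumptions for
# illustration only, not the actual dataset schema or repository API.

MODALITIES = ["rgb", "infrared", "depth", "event", "sketch", "paint"]  # assumed set

def sample_cross_modal_pair(sample, rng=random):
    """Draw two modalities at random and reuse the RGB pair's matching labels.

    `sample` is assumed to hold one generated image per modality for each view,
    plus the correspondences annotated on the original RGB pair; every generated
    modality keeps the same geometry, so it inherits those labels unchanged.
    """
    mod_a, mod_b = rng.sample(MODALITIES, 2)
    img_a = sample["view_a"][mod_a]          # view A rendered in modality mod_a
    img_b = sample["view_b"][mod_b]          # view B rendered in modality mod_b
    labels = sample["rgb_correspondences"]   # ground truth inherited from RGB
    return img_a, img_b, labels
```

Under this reading, a matching pipeline trained on batches drawn this way sees many cross-modal combinations while only ever relying on labels annotated for the original RGB data.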