Multi-Axis Vision Transformer (MaxViT)
This model is a Multi-Axis Vision Transformer (MaxViT) trained for video generation on the UCF101 dataset. MaxViT leverages a novel attention mechanism, Multi-Axis Self-Attention (Max-SA), to balance local and global spatial interactions efficiently. It represents a significant step forward in Transformer-based vision tasks, achieving high performance through its hybrid architecture.
Motivation
Vision Transformers (ViTs) introduced a new way to handle image tasks by applying the Transformer architecture directly to sequences of image patches. However, ViTs are data-hungry, lack the inductive biases of convolutions, and often underperform without extensive pretraining.
MaxViT addresses these limitations by combining local and global attention mechanisms with hierarchical architectures, resulting in improved scalability and generalizability.
The Swin Transformer addressed part of this data hunger by computing self-attention within shifted, non-overlapping windows, enabling hierarchical feature extraction. This allowed Swin to outperform ConvNets on the ImageNet benchmark, a significant milestone for vision Transformers. However, its reliance on window-based attention sacrifices non-locality and limits model capacity, making it less effective on larger datasets such as ImageNet-21K.
In contrast, MaxViT leverages Multi-Axis Self-Attention (Max-SA) to seamlessly combine local and global interactions within a single module, overcoming these limitations by providing a global receptive field with linear computational complexity. This approach strikes a balance between capacity, generalizability, and efficiency.
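To make the two attention axes concrete, below is a minimal sketch (not the authors' implementation) of the partitioning behind Max-SA. It assumes a channels-last (B, H, W, C) feature map, and the helper names `block_partition` and `grid_partition` are chosen here purely for illustration:

```python
# Illustrative sketch of the two token groupings used by Max-SA:
# "block" attention over local P x P windows and "grid" attention over a
# dilated G x G grid. Layout assumed: channels-last (B, H, W, C).
import torch

def block_partition(x: torch.Tensor, p: int) -> torch.Tensor:
    """Split the feature map into non-overlapping P x P windows (local attention)."""
    b, h, w, c = x.shape
    x = x.view(b, h // p, p, w // p, p, c)
    # -> (num_windows, window_area, C): attention runs within each local window
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, p * p, c)

def grid_partition(x: torch.Tensor, g: int) -> torch.Tensor:
    """Group tokens into a G x G grid; each group takes one token per window,
    spaced H/G x W/G apart, giving sparse global (dilated) attention."""
    b, h, w, c = x.shape
    x = x.view(b, g, h // g, g, w // g, c)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, g * g, c)

# Example: a 28 x 28 feature map with P = G = 7 (the paper's default window/grid size)
feats = torch.randn(2, 28, 28, 64)
local_tokens = block_partition(feats, p=7)    # (32, 49, 64): 16 local windows per image
global_tokens = grid_partition(feats, g=7)    # (32, 49, 64): tokens spread across the map
print(local_tokens.shape, global_tokens.shape)
```

Running standard self-attention first on `local_tokens` and then on `global_tokens` yields a global receptive field while keeping the cost linear in the number of pixels, since each attention call only ever sees P² or G² tokens.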
Key Features of MaxViT
Local and Global Interactions:
- Combines blocked local attention and dilated global attention, ensuring a global receptive field with linear complexity.
Hybrid Design:
- Integrates convolutional operations with attention mechanisms to enhance feature extraction and model capacity.
Scalability and Simplicity:
- Designed with simplicity in mind, using modular Max-SA blocks that can be stacked hierarchically (see the usage sketch after this list).
Performance:
- Achieves competitive performance under various data regimes for tasks like classification, detection, and segmentation.
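As a quick illustration of how these stacked hybrid stages are used in practice, the snippet below instantiates the reference MaxViT-T image-classification backbone that ships with torchvision. This is only a usage sketch of the published backbone, not the video model described in this card (that training code is in the linked repository):

```python
# Usage sketch: torchvision's reference MaxViT-T backbone, which stacks the
# hybrid MBConv + block-attention + grid-attention stages described above.
import torch
from torchvision.models import maxvit_t

model = maxvit_t(weights=None)       # pass MaxVit_T_Weights.IMAGENET1K_V1 for pretrained weights
model.eval()

x = torch.randn(1, 3, 224, 224)      # the reference backbone expects 224 x 224 RGB input
with torch.no_grad():
    logits = model(x)                # (1, 1000) ImageNet class logits
print(logits.shape)
```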
Training Details
- Dataset: UCF101 Action Recognition
- Training Duration: 6.125 hours
- Hardware: NVIDIA A100 GPU (Colab Pro)
- Framework: PyTorch
- Training Code: GitHub Repository
- Epochs: 50
- Metrics:
  - MSE: 0.0049
  - SSIM: 0.7784
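The full training code is in the linked repository; the loop below is only a minimal sketch of the setup implied by these details (MSE reconstruction loss, 50 epochs). The optimizer choice and learning rate are assumptions, and `model` / `train_loader` are placeholders for the repository's MaxViT predictor and UCF101 clip loader:

```python
# Minimal training-loop sketch; optimizer and learning rate are assumptions,
# not the reported settings. `train_loader` is assumed to yield
# (context_frames, target_frames) pairs of UCF101 clips.
import torch
from torch import nn

def train(model: nn.Module, train_loader, epochs: int = 50, lr: float = 1e-4) -> None:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    criterion = nn.MSELoss()                                   # matches the reported MSE metric
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)   # assumed optimizer/lr
    for epoch in range(epochs):
        model.train()
        running = 0.0
        for context, target in train_loader:
            context, target = context.to(device), target.to(device)
            pred = model(context)                              # predict future frames from context
            loss = criterion(pred, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            running += loss.item()
        print(f"epoch {epoch + 1}: train MSE {running / len(train_loader):.4f}")
```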
Comparison with Other Models
The same task was also run with ConvLSTM and PredRNN models, which achieved better scores within 20 epochs, compared with the 50 epochs used for MaxViT.
ConvLSTM (Paper):
- Combines convolutional layers with LSTM for spatiotemporal modeling.
- Performs well without requiring extensive data due to inherent inductive biases.
PredRNN (Paper):
- Extends ConvLSTM with a recurrent memory mechanism, enhancing temporal modeling.
These models outperformed MaxViT in fewer epochs because their convolutional and recurrent inductive biases let them converge quickly, whereas Transformer-based architectures typically require more training data and more epochs to reach convergence. A minimal ConvLSTM cell is sketched below to illustrate this inductive bias.
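For illustration only, here is a minimal ConvLSTM cell (not the exact model used in the comparison). It shows how the LSTM gates become convolutions, so the recurrent state keeps a spatial layout; this locality is what helps such models learn quickly from limited data:

```python
# Minimal ConvLSTM cell sketch: LSTM gating computed with convolutions so the
# hidden and cell states remain spatial feature maps.
import torch
from torch import nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        # a single convolution produces all four gates at once
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state                                   # hidden / cell state: (B, hid_ch, H, W)
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)                  # convolutional cell update
        h = o * torch.tanh(c)
        return h, c

# Example: one step over a batch of 64 x 64 grayscale frames
cell = ConvLSTMCell(in_ch=1, hid_ch=16)
x = torch.randn(2, 1, 64, 64)
h = c = torch.zeros(2, 16, 64, 64)
h, c = cell(x, (h, c))
print(h.shape)  # torch.Size([2, 16, 64, 64])
```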
Hypothesis for MaxViT Performance
MaxViT demonstrated consistent improvement throughout the 50 training epochs without signs of plateauing. It is hypothesized that with additional training, MaxViT would achieve comparable or even better results due to its superior ability to model both local and global interactions.
Model Contributions
Versatile Backbone:
- MaxViT is suitable for a wide range of visual tasks, including video action recognition, image aesthetics assessment, and object detection.
Innovative Attention Module:
- The Max-SA module combines local and global attention with linear complexity, overcoming limitations of previous window-based or full-attention approaches.
Efficient Design:
- Modular architecture simplifies implementation and facilitates scalability across diverse datasets.
Performance Metrics
The model was evaluated on the UCF101 dataset for the video generation task described above:
- Mean Squared Error (MSE): 0.0049
- Structural Similarity Index Measure (SSIM): 0.7784
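As a sketch of how these two metrics can be computed for predicted versus ground-truth frames, the snippet below uses torchmetrics for SSIM; this is an assumption about tooling (the repository may compute SSIM differently), and the tensors are random placeholders:

```python
# Metric sketch: MSE and SSIM between predicted and ground-truth frames.
# `preds` and `targets` are random placeholders standing in for model output
# and UCF101 ground truth, both scaled to [0, 1].
import torch
import torch.nn.functional as F
from torchmetrics.image import StructuralSimilarityIndexMeasure

preds = torch.rand(8, 1, 64, 64)
targets = torch.rand(8, 1, 64, 64)

mse = F.mse_loss(preds, targets)                                   # Mean Squared Error
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)(preds, targets)
print(f"MSE:  {mse.item():.4f}")
print(f"SSIM: {ssim.item():.4f}")
```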
References
- MaxViT: Multi-Axis Vision Transformer
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting (ConvLSTM)
- PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning