Abstract
What is an image and how to extract latent features? Convolutional Networks (ConvNets) consider an image as organized pixels in a rectangular shape and extract features via convolution operations in local regions; Vision Transformers (ViTs) treat an image as a sequence of patches and extract features via an attention mechanism in a global range. In this work, we introduce a straightforward and promising paradigm for visual representation, which is called Context Clusters. Context Clusters (CoCs) view an image as a set of unorganized points and extract features via a simplified clustering algorithm. In detail, each point includes the raw feature (e.g., color) and positional information (e.g., coordinates), and a simplified clustering algorithm is employed to group and extract deep features hierarchically. Our CoCs are convolution- and attention-free, and rely only on a clustering algorithm for spatial interaction. Owing to the simple design, we show that CoCs offer gratifying interpretability via the visualization of the clustering process. Our CoCs aim at providing a new perspective on image and visual representation, which may enjoy broad applications in different domains and exhibit profound insights. Even though we are not targeting SOTA performance, CoCs still achieve comparable or even better results than ConvNets or ViTs on several benchmarks. Codes are available at: https://github.com/ma-xu/Context-Cluster.
Community
- Proposes Context Clusters (CoCs): a clustering approach to image encoding, in contrast to ConvNets or Vision Transformers (ViTs). Not SOTA, but close, and the learning process is more intuitive. Treats an image as a set of points and borrows ideas from point cloud analysis to aggregate them. A context cluster block has feature aggregation and feature dispatch layers, as opposed to SuperPixel, which groups pixels with common characteristics, or Vision GNN (ViG), which extracts graph-level features for visual tasks.
- Adds two channels for relative pixel position (scaled from the center) to each pixel, giving n points (n = number of pixels), each 5-dimensional. The number of points is reduced by joining and projecting k-nearest neighbours (for an ordered set this can also be implemented as a convolution). The network is a succession of stages, each made of a point reducer followed by context cluster blocks; each block has a context cluster operation and an MLP, both with residual connections, a layout inspired by transformer design.
- Context cluster operation: linearly project each point into a similarity space, initialize c centers evenly and set each to the average of its k nearest neighbours, compute the pair-wise cosine similarity matrix (shape c by n), and allocate each point to its closest center (some clusters may be redundant, with zero points). For the m points in a cluster (m varies per cluster), compute the aggregated feature: linearly project the point features into a value space, compute the center in this value space, and weight each point by its normalized (scaled and shifted) similarity to the cluster center (Eq. 1). Then update each point in the cluster with this aggregated vector, scaled by the point's similarity to the center (Eq. 2), and pass it through an FC layer to restore the original dimensionality. The operation runs in multiple heads whose results are concatenated (inspired by multi-head computation in transformers).
- Since computing feature similarity between every point and every center would become a bottleneck, the points are partitioned into local regions and clustered within each region (inspired by Swin Transformer). The positions of the cluster centers are fixed and clusters do not overlap (a point goes to only one cluster).
- Trained and tested on ImageNet for image classification: good throughput compared to MLP, attention, and convolution blocks; not SOTA but close. The clustering map can be visualized (like CAM class activation maps for convolution or attention/CLS-similarity maps for ViT). Also evaluated on 3D point cloud classification on ScanObjectNN, where adding it to a PointMLP backbone gives slightly better results; on object detection and instance segmentation on MS-COCO (using a Mask R-CNN head), compared against ResNet-18 (conv) and PVT-Tiny (attention); and on semantic segmentation on ADE20K. The abstract describes these results as comparable to or better than ConvNets and ViTs, not as targeting SOTA.
- Appendix A has model implementation details (hyperparameters and model architecture); Appendix B has detailed explanation for anchor points and clustering; Appendix C has more experiments (MS-COCO segmentation and detection) and visualizations (SuperPixel-like clustering results); Appendix D makes a case for generalization to more challenging image scenarios (masked, irregular, or RGB-D images).
- From Northeastern University and Adobe.
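The point construction and the single-head context cluster operation described in the notes above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the random projection matrices, the fixed `alpha`/`beta` values, the sigmoid used as the "scale and shift" normalization, and the use of grid-cell averaging to place the centers are all assumptions made for the example. Point reduction, multiple heads, and region partition are left out.

```python
import numpy as np

def image_to_points(img):
    """Turn an (H, W, 3) image into (H*W, 5) points: colour plus centred coords."""
    h, w, _ = img.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Coordinates lie in [-0.5, 0.5), i.e. relative to the image centre.
    coords = np.stack([xs / w - 0.5, ys / h - 0.5], axis=-1)
    return np.concatenate([img, coords], axis=-1).reshape(h * w, 5)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def l2_normalize(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def context_cluster(points, h, w, grid=2, alpha=1.0, beta=0.0, seed=0):
    """Single-head sketch; h and w must be divisible by grid."""
    n, d = points.shape
    rng = np.random.default_rng(seed)
    # Assumption: random matrices stand in for learned linear layers.
    w_sim = rng.normal(size=(d, d)) / np.sqrt(d)  # similarity-space projection
    w_val = rng.normal(size=(d, d)) / np.sqrt(d)  # value-space projection
    w_out = rng.normal(size=(d, d)) / np.sqrt(d)  # FC applied before the residual
    sim_feat, val_feat = points @ w_sim, points @ w_val

    def pool(feat):
        # Evenly placed centres: average the points of each grid cell.
        cells = feat.reshape(grid, h // grid, grid, w // grid, d)
        return cells.mean(axis=(1, 3)).reshape(grid * grid, d)

    c_sim, c_val = pool(sim_feat), pool(val_feat)
    # Pairwise cosine similarity, shape (c, n); each point joins one centre.
    sim = l2_normalize(c_sim) @ l2_normalize(sim_feat).T
    assign = sim.argmax(axis=0)
    weight = sigmoid(alpha * sim[assign, np.arange(n)] + beta)

    out = points.copy()
    for c in range(grid * grid):
        members = np.flatnonzero(assign == c)
        if members.size == 0:
            continue  # redundant cluster with zero points
        s = weight[members][:, None]
        # Aggregate: the centre counts with weight 1 next to the weighted members.
        g = (c_val[c] + (s * val_feat[members]).sum(axis=0)) / (1.0 + s.sum())
        # Dispatch: each member gets g scaled by its own similarity, then FC + residual.
        out[members] = points[members] + (s * g) @ w_out
    return out, assign

img = np.random.default_rng(1).random((8, 8, 3))  # toy 8x8 RGB image
pts = image_to_points(img)
out, assign = context_cluster(pts, h=8, w=8)
```

Here `assign` holds the cluster index of every point, which is the quantity one would reshape to `(h, w)` to draw a clustering map like the visualizations mentioned above.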
Links: Website (OpenReview), GitHub