Global Context Vision Mixture of Experts

GCViMoE integrates the Mixture of Experts paradigm with the Global Context Vision Transformer, achieving efficiency through selective expert activation while maintaining rich contextual features.

Introduction

We propose Global Context Vision Mixture of Experts (GCViMoE), which integrates the Mixture of Experts paradigm with the Global Context Vision Transformer (GCViT). By replacing the standard multilayer perceptron (MLP) blocks with dynamically routed expert blocks, our architecture maintains the rich contextual features and locality-aware properties of GCViT while substantially reducing computational overhead.

GCViMoE achieves efficiency through selective expert activation based on input characteristics, and is expected to deliver comparable or superior performance to state-of-the-art models at significantly lower computational cost.
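As a concrete illustration of the routed expert block, here is a minimal PyTorch sketch of a top-k mixture-of-experts MLP that could take the place of the transformer's feed-forward sub-block. The class and parameter names (MoEMlp, Expert, num_experts, top_k) are illustrative assumptions, not the repository's API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A standard transformer MLP used as one expert."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))


class MoEMlp(nn.Module):
    """Drop-in replacement for the MLP: routes each token to its top-k experts."""
    def __init__(self, dim, hidden_dim, num_experts=4, top_k=1):
        super().__init__()
        self.experts = nn.ModuleList([Expert(dim, hidden_dim) for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):                              # x: (batch, tokens, dim)
        b, n, d = x.shape
        tokens = x.reshape(-1, d)                      # flatten to (batch * tokens, dim)
        weights, indices = self.router(tokens).topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)              # normalize the selected routing scores
        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e              # tokens assigned to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(tokens[mask])
        return out.reshape(b, n, d)
```

Because only `top_k` of the `num_experts` MLPs run for any given token, the per-token compute stays close to that of a single MLP while capacity grows with the number of experts; this is the source of the efficiency gain described above.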

Figure: The main block in the GCViMoE architecture.
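In code terms, the block keeps GCViT's attention-then-MLP wiring and swaps the MLP for the routed expert module. The sketch below is an assumption about that wiring; it reuses the MoEMlp sketch above and uses plain multi-head attention as a stand-in for GCViT's local/global window attention.

```python
import torch.nn as nn


class GCViMoEBlock(nn.Module):
    """Sketch of a GCViMoE block: pre-norm attention followed by a routed expert MLP.

    Plain multi-head attention stands in for GCViT's local/global window attention,
    and MoEMlp refers to the sketch above; both are illustrative assumptions.
    """
    def __init__(self, dim, num_heads, num_experts=4, top_k=1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.moe_mlp = MoEMlp(dim, hidden_dim=4 * dim, num_experts=num_experts, top_k=top_k)

    def forward(self, x):                                 # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # attention sub-block + residual
        x = x + self.moe_mlp(self.norm2(x))                 # routed expert MLP + residual
        return x
```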

Key Features

  • Mixture of Experts Integration: Dynamically routes inputs to specialized expert networks
  • Global Context Modeling: Maintains rich contextual features from GCViT architecture
  • Computational Efficiency: Significantly reduces computational overhead through selective expert activation
  • State-of-the-art Performance: Achieves comparable or superior performance with lower computational requirements

Implementation

The project is implemented in PyTorch and includes:

  • Training scripts for the ImageNet-tiny dataset
  • Support for multi-GPU training
  • Model validation and testing utilities
  • ONNX conversion support (see the export sketch below)
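
For the ONNX conversion, the standard PyTorch route is `torch.onnx.export`; the helper below is a hedged sketch rather than the repository's utility, and the 224x224 input resolution is an assumption.

```python
import torch


def export_to_onnx(model: torch.nn.Module, path: str, resolution: int = 224) -> None:
    """Trace the model with a dummy image batch and write it to an ONNX file.

    The 224x224 RGB input is an assumed configuration, not taken from the repository.
    """
    model = model.eval()
    dummy = torch.randn(1, 3, resolution, resolution)  # one dummy RGB image for tracing
    torch.onnx.export(
        model,
        dummy,
        path,
        input_names=["input"],
        output_names=["logits"],
        dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},  # variable batch size
        opset_version=17,
    )
```

Note that data-dependent expert routing can complicate tracing-based export, since the routing decisions observed for the dummy input may be baked into the traced graph; MoE models are therefore sometimes exported with scripting or with routing handled outside the exported graph.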

For more details and code, visit the GitHub repository.