HI-MoE: Hierarchical Instance-Conditioned Mixture-of-Experts for Object Detection
Mixture-of-Experts (MoE) has transformed language models, but applying it to computer vision — particularly object detection — requires a fundamentally different approach. HI-MoE introduces hierarchical, instance-conditioned routing that matches the structure of detection tasks.
The Problem with Existing Vision MoE
Current vision MoE methods operate at the image or patch level — treating all regions equally. This is poorly aligned with object detection, where:
- The fundamental unit is an object query (candidate instance)
- Different scenes have vastly different object compositions
- Each instance may need different expertise
HI-MoE's Two-Stage Routing
| Stage | Router | Function |
|---|---|---|
| Scene Router | Lightweight | Selects scene-consistent expert subset |
| Instance Router | Per-query | Assigns each object query to experts within that subset |
This hierarchical design preserves sparse computation while better matching the heterogeneous, instance-centric structure of detection.
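The two stages above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the weight matrices `W_scene` and `W_inst`, the function name `hierarchical_route`, and the top-k sizes are all hypothetical stand-ins for whatever gating networks HI-MoE actually uses.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; -inf entries get exactly zero probability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hierarchical_route(image_feat, query_feats, W_scene, W_inst,
                       scene_topk=4, inst_topk=2):
    """Two-stage routing sketch (hypothetical shapes and names).

    Stage 1 (scene router): a lightweight linear gate over a pooled
    image feature selects a scene-consistent subset of experts.
    Stage 2 (instance router): each object query is routed to its
    top-k experts *within* that subset, keeping computation sparse.
    """
    num_experts = W_scene.shape[1]

    # Stage 1: scene-level logits over all experts -> keep a top-k subset.
    scene_logits = image_feat @ W_scene                  # shape (E,)
    subset = np.argsort(scene_logits)[-scene_topk:]      # kept expert indices

    # Stage 2: per-query logits, masked so experts outside the
    # scene subset receive zero routing probability.
    inst_logits = query_feats @ W_inst                   # shape (Q, E)
    mask = np.full(num_experts, -np.inf)
    mask[subset] = 0.0
    probs = softmax(inst_logits + mask, axis=-1)         # zero outside subset

    # Each query keeps its own top-k experts inside the subset.
    chosen = np.argsort(probs, axis=-1)[:, -inst_topk:]  # shape (Q, inst_topk)
    return subset, chosen, probs
```

Because the instance router's logits are masked to the scene subset, every query's chosen experts are guaranteed to lie inside the scene-consistent set, which is the property the table describes.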
Key Innovation
- Scene-level routing first — Determines which experts are relevant for the entire image
- Instance-level routing second — Fine-grained expert selection per object
- DETR-style architecture — Built on modern detection framework (DINO baseline)
Results
On the COCO benchmark, HI-MoE improves over:
- Dense DINO baseline
- Simpler token-level MoE approaches
- Instance-only routing methods
The paper also reports a preliminary expert-specialization analysis on LVIS for long-tail detection.
Why It Matters
- Efficient detection — Sparse computation for the increasingly important detection task
- Specialized expertise — Different experts can specialize in different object categories
- Scalability — Hierarchical routing may extend to other structured prediction tasks