HI-MoE: Hierarchical Instance-Conditioned Mixture-of-Experts for Object Detection
Mixture-of-Experts (MoE) has transformed language models, but applying it to computer vision — particularly object detection — requires a fundamentally different approach. HI-MoE introduces hierarchical, instance-conditioned routing that matches the structure of detection tasks.
The Problem with Existing Vision MoE
Current vision MoE methods operate at the image or patch level — treating all regions equally. This is poorly aligned with object detection, where:
- The fundamental unit is an object query (candidate instance)
- Different scenes have vastly different object compositions
- Each instance may need different expertise
HI-MoE's Two-Stage Routing
| Stage | Router | Function |
|---|---|---|
| Scene Router | Lightweight | Selects scene-consistent expert subset |
| Instance Router | Per-query | Assigns each object query to experts within that subset |
This hierarchical design preserves sparse computation while better matching the heterogeneous, instance-centric structure of detection.
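The two stages above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the weight matrices `W_scene` and `W_inst`, the function name `hierarchical_route`, and the top-k sizes are all hypothetical stand-ins for whatever gating networks HI-MoE actually uses.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; -inf entries get exactly zero probability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hierarchical_route(image_feat, query_feats, W_scene, W_inst,
                       scene_topk=4, inst_topk=2):
    """Two-stage routing sketch (hypothetical shapes and names).

    Stage 1 (scene router): a lightweight linear gate over a pooled
    image feature selects a scene-consistent subset of experts.
    Stage 2 (instance router): each object query is routed to its
    top-k experts *within* that subset, keeping computation sparse.
    """
    num_experts = W_scene.shape[1]

    # Stage 1: scene-level logits over all experts -> keep a top-k subset.
    scene_logits = image_feat @ W_scene                  # shape (E,)
    subset = np.argsort(scene_logits)[-scene_topk:]      # kept expert indices

    # Stage 2: per-query logits, masked so experts outside the
    # scene subset receive zero routing probability.
    inst_logits = query_feats @ W_inst                   # shape (Q, E)
    mask = np.full(num_experts, -np.inf)
    mask[subset] = 0.0
    probs = softmax(inst_logits + mask, axis=-1)         # zero outside subset

    # Each query keeps its own top-k experts inside the subset.
    chosen = np.argsort(probs, axis=-1)[:, -inst_topk:]  # shape (Q, inst_topk)
    return subset, chosen, probs
```

Because the instance router's logits are masked to the scene subset, every query's chosen experts are guaranteed to lie inside the scene-consistent set, which is the property the table describes.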
Key Innovation
- Scene-level routing first — Determines which experts are relevant for the entire image
- Instance-level routing second — Fine-grained expert selection per object
- DETR-style architecture — Built on modern detection framework (DINO baseline)
Results
On the COCO benchmark, HI-MoE improves over:
- Dense DINO baseline
- Simpler token-level MoE approaches
- Instance-only routing methods
The paper also reports a preliminary expert-specialization analysis on LVIS for long-tail detection.
Why It Matters
- Efficient detection — Sparse computation for the increasingly important detection task
- Specialized expertise — Different experts can specialize in different object categories
- Scalability — Hierarchical routing may extend to other structured prediction tasks