Chengxi Simon Zeng, Yuxuan Jiang, Aaron Zhang · Visual Information Lab, University of Bristol
SAM3 delivers Promptable Concept Segmentation (PCS) by combining semantic understanding and temporal tracking, yet its massive backbone and dense memory bank make on-device deployment impractical. EfficientSAM3 distills knowledge from SAM1, SAM2, and SAM3 into a family of lightweight student models tailored for edge hardware, without sacrificing PCS quality.
SAM3 brought promptable concept segmentation to production scale, but its computational footprint blocks latency-sensitive applications. EfficientSAM3 progressively distills SAM3 into lightweight architectures that maintain PCS quality on edge devices.
We employ a three-stage curriculum: (1) encoder distillation on SA-1B with prompt-in-the-loop supervision, (2) temporal memory distillation on SA-V using a compact Perceiver module, and (3) end-to-end fine-tuning on official SAM3 concept segmentation data. The resulting students deliver real-time segmentation, tracking, and prompt handling on resource-constrained platforms.
Stage 1: align nine student backbones (three sizes each of RepViT, TinyViT, and EfficientViT) with the SAM3 encoder on SA-1B using prompt-in-the-loop supervision (see the sketch after the tl;dr below).
Stage 2: compress SAM3's dense video memory into a compact Perceiver-based module distilled on SA-V, enabling efficient multi-frame reasoning.
Stage 3: jointly fine-tune encoder, memory, and decoder on SAM3 data to preserve promptable concept segmentation quality.
tl;dr: Stage 1 distills the encoder on SAM1 data · Stage 2 aligns memory on SAM2 data · Stage 3 fine-tunes PCS on SAM3 data.
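Below is a minimal PyTorch sketch of the Stage 1 and Stage 2 objectives. It is illustrative only: the projection layer, the `PerceiverMemory` module, and all shapes and hyperparameters are assumptions, not the released implementation. Stage 1 penalizes the distance between student features and frozen-teacher encoder features; Stage 2 reads a large bank of memory tokens into a small, fixed set of latents via cross-attention so per-frame cost stops growing with the number of stored frames.

```python
# Illustrative sketch of the Stage 1 / Stage 2 losses. All names, shapes,
# and hyperparameters are assumptions, not the released EfficientSAM3 code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def feature_distill_loss(student_feats, teacher_feats, proj):
    """Stage 1: align student encoder features with a frozen SAM3 teacher.

    A learned 1x1 projection maps the student channel width onto the
    teacher's before an MSE penalty; teacher activations are detached.
    """
    student_feats = proj(student_feats)
    if student_feats.shape[-2:] != teacher_feats.shape[-2:]:
        # Match spatial resolution if the student runs at a different stride.
        student_feats = F.interpolate(
            student_feats, size=teacher_feats.shape[-2:],
            mode="bilinear", align_corners=False,
        )
    return F.mse_loss(student_feats, teacher_feats.detach())


class PerceiverMemory(nn.Module):
    """Stage 2: compress a dense memory bank into a few learned latents.

    Cross-attention reads the (large) set of per-frame memory tokens into a
    small, fixed-size latent array, independent of how many frames are stored.
    """

    def __init__(self, dim=256, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, memory_tokens):
        # memory_tokens: (B, N, dim) with N = frames x tokens_per_frame.
        B = memory_tokens.shape[0]
        latents = self.latents.unsqueeze(0).expand(B, -1, -1)
        compressed, _ = self.cross_attn(latents, memory_tokens, memory_tokens)
        return self.norm(compressed + latents)  # (B, num_latents, dim)


if __name__ == "__main__":
    # Toy shapes: a 128-d student map distilled against a 256-d teacher map,
    # and 8 frames x 1024 tokens of memory compressed into 64 latents.
    proj = nn.Conv2d(128, 256, kernel_size=1)
    s, t = torch.randn(2, 128, 64, 64), torch.randn(2, 256, 64, 64)
    print(feature_distill_loss(s, t, proj).item())

    mem = PerceiverMemory(dim=256, num_latents=64)
    print(mem(torch.randn(2, 8 * 1024, 256)).shape)  # torch.Size([2, 64, 256])
```

In Stage 2 training, the compressed latents would be regressed against the teacher's memory readout on SA-V clips; the sketch only shows the module's forward pass.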
Coming soon: setup instructions will be provided once SAM3 is publicly available.
Coming soon: examples will be provided once code and weights are released.
Code and weights are not yet released. They will be published once SAM3 code is publicly available.
| Model | Backbone | Parameters | Stage 1 | Stage 2 | Stage 3 |
|---|---|---|---|---|---|
| ES-RV-S | RepViT-M0.9 | 5.1M | Planned | Planned | Planned |
| ES-RV-M | RepViT-M1.1 | 6.8M | Planned | Planned | Planned |
| ES-RV-L | RepViT-M2.3 | 8.2M | Planned | Planned | Planned |
| ES-TV-S | TinyViT-5M | 5.4M | Planned | Planned | Planned |
| ES-TV-M | TinyViT-11M | 11M | Planned | Planned | Planned |
| ES-TV-L | TinyViT-21M | 21M | Planned | Planned | Planned |
| ES-EV-S | EfficientViT-B0 | 0.7M | Planned | Planned | Planned |
| ES-EV-M | EfficientViT-B1 | 4.8M | Planned | Planned | Planned |
| ES-EV-L | EfficientViT-B2 | 15M | Planned | Planned | Planned |
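Until checkpoints land, the planned lineup can be mirrored programmatically. The registry below simply restates the table above; the dataclass and field names are illustrative, not the repo's API.

```python
# Hypothetical variant registry mirroring the model zoo table above;
# `StudentConfig` and its fields are illustrative, not the repo's API.
from dataclasses import dataclass


@dataclass(frozen=True)
class StudentConfig:
    backbone: str
    params_m: float  # parameter count in millions


EFFICIENTSAM3_VARIANTS = {
    "ES-RV-S": StudentConfig("RepViT-M0.9", 5.1),
    "ES-RV-M": StudentConfig("RepViT-M1.1", 6.8),
    "ES-RV-L": StudentConfig("RepViT-M2.3", 8.2),
    "ES-TV-S": StudentConfig("TinyViT-5M", 5.4),
    "ES-TV-M": StudentConfig("TinyViT-11M", 11.0),
    "ES-TV-L": StudentConfig("TinyViT-21M", 21.0),
    "ES-EV-S": StudentConfig("EfficientViT-B0", 0.7),
    "ES-EV-M": StudentConfig("EfficientViT-B1", 4.8),
    "ES-EV-L": StudentConfig("EfficientViT-B2", 15.0),
}
```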
Dataset preparation scripts for COCO, DAVIS, LVIS, SA-1B, SA-V, LVOS, MOSE, and YouTube-VOS are located under data/download_*.sh. Refer to README_dataset.md for detailed instructions.
ONNX and CoreML export pipelines are under development to unlock mobile and cross-platform deployment. Follow the repository issues for progress updates.
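Until the official exporters land, a `torch.onnx.export` pass over a student encoder is the likely shape of that pipeline. The sketch below uses a stand-in module, since no EfficientSAM3 checkpoints are published yet; the input resolution and axis names are assumptions.

```python
# Hedged sketch of an ONNX export pass; `StudentEncoder` is a stand-in for a
# distilled EfficientSAM3 backbone, since no checkpoints are released yet.
import torch
import torch.nn as nn


class StudentEncoder(nn.Module):
    """Placeholder backbone standing in for a distilled student encoder."""

    def __init__(self, dim=256):
        super().__init__()
        self.stem = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # patchify

    def forward(self, x):
        return self.stem(x)  # (B, dim, H/16, W/16) feature map


model = StudentEncoder().eval()
dummy = torch.randn(1, 3, 1024, 1024)  # SAM-style input resolution (assumed)

torch.onnx.export(
    model,
    dummy,
    "efficientsam3_encoder.onnx",
    input_names=["image"],
    output_names=["embedding"],
    dynamic_axes={"image": {0: "batch"}, "embedding": {0: "batch"}},
    opset_version=17,
)
```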
We welcome pull requests across the ecosystem.
@misc{efficientsam3,
title={EfficientSAM3: Progressive Hierarchical Knowledge Distillation (PhD) from SAM1, 2 and 3},
author={Zeng, Chengxi Simon and Jiang, Yuxuan and Zhang, Aaron},
institution={University of Bristol},
year={2025},
howpublished={https://github.com/SimonZeng7108/efficientsam3}
}