Chengxi Simon Zeng, Yuxuan Jiang, Aaron Zhang · Visual Information Lab, University of Bristol
SAM3 delivers Promptable Concept Segmentation (PCS) by combining semantic understanding with temporal tracking, yet its massive backbone and dense memory bank make on-device deployment impractical. EfficientSAM3 compresses SAM3 into a family of lightweight student models tailored for edge hardware, distilling knowledge progressively through SAM1, SAM2, and SAM3 data without sacrificing PCS quality.
SAM3 brought promptable concept segmentation to production scale, but its computational footprint blocks latency-sensitive applications. EfficientSAM3 progressively distills SAM3 into lightweight architectures that maintain PCS quality on edge devices.
We employ a three-stage curriculum: (1) encoder distillation on SA-1B with prompt-in-the-loop supervision, (2) temporal memory distillation on SA-V using a compact Perceiver module, and (3) end-to-end fine-tuning on official SAM3 concept segmentation data. The resulting students deliver real-time segmentation, tracking, and prompt handling on resource-constrained platforms.
Align nine student backbones (three sizes each of RepViT, TinyViT, and EfficientViT) with the SAM3 encoder using SA-1B and prompt-in-the-loop supervision.
Compress SAM3's dense video memory into a Perceiver-based module distilled on SA-V, enabling efficient multi-frame reasoning.
Jointly fine-tune encoder, memory, and decoder on SAM3 data to preserve promptable concept segmentation quality.
tl;dr: Stage 1 distills encoder on SAM1 data · Stage 2 aligns memory on SAM2 data · Stage 3 fine-tunes PCS on SAM3 data.
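To make the Stage 1 objective concrete, the sketch below shows a generic encoder feature-distillation step: the student backbone learns to match the frozen SAM3 teacher's image embeddings with a simple MSE loss. Names such as `teacher_encoder`, `student_encoder`, and the projection head `proj` are illustrative placeholders, not the repository's actual API; in practice, prompt-in-the-loop supervision additionally passes prompts through the frozen decoder to add mask-level losses.

```python
import torch
import torch.nn.functional as F

def stage1_distill_step(student_encoder, teacher_encoder, proj, images, optimizer):
    """One encoder-distillation step: align student features to the frozen teacher."""
    with torch.no_grad():
        teacher_feats = teacher_encoder(images)       # frozen SAM3 image embeddings
    # Project student features to the teacher's channel dimension
    # (assumes matching spatial resolution for the MSE below).
    student_feats = proj(student_encoder(images))
    loss = F.mse_loss(student_feats, teacher_feats)   # feature-alignment loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```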
pip install -e ".[stage1]"
See the installation guide for full setup instructions.
# Image prompt (point prompts): `image`, `points`, and `labels` are assumed to be
# prepared beforehand (an RGB image plus point coordinates and 0/1 point labels).
model = build_efficientsam3_image_model(
    checkpoint_path="efficient_sam3_tinyvit_s.pt",
    backbone_type="tinyvit",
    model_name="5m",
)
processor = Sam3Processor(model)
inference_state = processor.set_image(image)
masks, scores, _ = model.predict_inst(
    inference_state,
    point_coords=points,
    point_labels=labels,
)
# Text prompt
model = build_efficientsam3_image_model(
    checkpoint_path="efficient_sam3_tinyvit_m_mobileclip_s1.pt",
    backbone_type="tinyvit",
    model_name="11m",
    text_encoder_type="MobileCLIP-S1",
)
processor = Sam3Processor(model)
inference_state = processor.set_image(image)
inference_state = processor.set_text_prompt(
    inference_state,
    prompt="shoe",
)
masks, scores, _ = model.predict_inst(inference_state)
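For quick inspection, the returned masks can be overlaid on the input image. The snippet below is a minimal sketch that assumes `masks` behaves like a NumPy array of shape `(num_masks, H, W)` and `image` is an RGB `uint8` array; adapt it to the actual return types of your build.

```python
import numpy as np

def overlay_masks(image, masks, alpha=0.5, seed=0):
    """Blend each predicted mask onto an RGB uint8 image with a random color."""
    rng = np.random.default_rng(seed)
    out = image.astype(np.float32).copy()
    for mask in masks:
        color = rng.integers(0, 255, size=3).astype(np.float32)
        region = mask.astype(bool)
        out[region] = (1 - alpha) * out[region] + alpha * color
    return out.astype(np.uint8)

# Example: keep only reasonably confident masks before overlaying.
# vis = overlay_masks(np.asarray(image), masks[scores > 0.5])
```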
See image example and text prompt example for details.
Stage 1 image encoder weights (distilled from the SAM3 image encoder) and text encoder weights (distilled from the SAM3 text encoder) are now available via Google Drive and Hugging Face. Stage 2 and Stage 3 weights are coming soon.
| Model | Backbone | Parameters | Stage 1 | Stage 2 | Stage 3 |
|---|---|---|---|---|---|
| ES-RV-S | RepViT-M0.9 | 5.1M | GDrive / HF | Planned | Planned |
| ES-RV-M | RepViT-M1.1 | 6.8M | GDrive / HF | Planned | Planned |
| ES-RV-L | RepViT-M2.3 | 8.2M | GDrive / HF | Planned | Planned |
| ES-TV-S | TinyViT-5M | 5.4M | GDrive / HF | Planned | Planned |
| ES-TV-M | TinyViT-11M | 11M | GDrive / HF | Planned | Planned |
| ES-TV-L | TinyViT-21M | 21M | GDrive / HF | Planned | Planned |
| ES-EV-S | EfficientViT-B0 | 0.7M | GDrive / HF | Planned | Planned |
| ES-EV-M | EfficientViT-B1 | 4.8M | GDrive / HF | Planned | Planned |
| ES-EV-L | EfficientViT-B2 | 15M | GDrive / HF | Planned | Planned |
Note (2025/12/02): The current Stage 1 image encoder weights are distilled on 1% of the SA-1B dataset.
| Model | Backbone | Parameters | Stage 1 | Stage 2 | Stage 3 |
|---|---|---|---|---|---|
| ES-RV-S-MC-S1 | RepViT-M0.9 & MobileCLIP-S1 | 4.72M + 63.56M | GDrive / HF | Planned | Planned |
| ES-RV-M-MC-S1 | RepViT-M1.1 & MobileCLIP-S1 | 7.77M + 63.56M | GDrive / HF | Planned | Planned |
| ES-RV-L-MC-S1 | RepViT-M2.3 & MobileCLIP-S1 | 22.40M + 63.56M | GDrive / HF | Planned | Planned |
| ES-TV-S-MC-S1 | TinyViT-5M & MobileCLIP-S1 | 5.07M + 63.56M | GDrive / HF | Planned | Planned |
| ES-TV-M-MC-S1 | TinyViT-11M & MobileCLIP-S1 | 10.55M + 63.56M | GDrive / HF | Planned | Planned |
| ES-TV-L-MC-S1 | TinyViT-21M & MobileCLIP-S1 | 20.62M + 63.56M | GDrive / HF | Planned | Planned |
| ES-EV-S-MC-S1 | EfficientViT-B0 & MobileCLIP-S1 | 0.68M + 63.56M | GDrive / HF | Planned | Planned |
| ES-EV-M-MC-S1 | EfficientViT-B1 & MobileCLIP-S1 | 4.64M + 63.56M | GDrive / HF | Planned | Planned |
| ES-EV-L-MC-S1 | EfficientViT-B2 & MobileCLIP-S1 | 14.98M + 63.56M | GDrive / HF | Planned | Planned |
Note (2025/12/08): The current Stage 1 text encoder weights are distilled on 1% of the Recap-DataComp-1B dataset.
Dataset preparation scripts for COCO, DAVIS, LVIS, SA-1B, SA-V, LVOS, MOSE, and YouTube-VOS are located under data/download_*.sh. Refer to README_dataset.md for detailed instructions.
ONNX and CoreML export pipelines are under development to unlock mobile and cross-platform deployment. Follow the repository issues for progress updates.
We welcome pull requests across the ecosystem:
Organizations and projects using EfficientSAM3:
Note: If you're using EfficientSAM3 in your work, please acknowledge us in your publications or projects. We're happy to promote your work here! Contact us to be featured in this section.
@misc{zeng2025efficientsam3progressivehierarchicaldistillation,
  title={EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3},
  author={Chengxi Zeng and Yuxuan Jiang and Gao Ge and Shuai Wang and Fan Aaron Zhang},
  year={2025},
  eprint={2511.15833},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.15833},
}