Chengxi Simon Zeng, Yuxuan Jiang, Gao Ge, Shuai Wang, Fan Aaron Zhang · Visual Information Lab, University of Bristol; MultiX lab, University of Amsterdam
SAM3 delivers Promptable Concept Segmentation (PCS) by combining semantic understanding with temporal tracking, yet its massive backbone and dense memory bank make on-device deployment impractical. EfficientSAM3 distills the capabilities of SAM1, SAM2, and SAM3 into a family of lightweight student models tailored for edge hardware without sacrificing PCS quality.
SAM3 brought promptable concept segmentation to production scale, but its computational footprint blocks latency-sensitive applications. EfficientSAM3 progressively distills SAM3 into lightweight architectures that maintain PCS quality on edge devices.
We employ a three-stage curriculum: (1) encoder distillation on SA-1B with prompt-in-the-loop supervision, (2) temporal memory distillation on SA-V using a compact Perceiver module, and (3) end-to-end fine-tuning on official SAM3 concept segmentation data. The resulting students deliver real-time segmentation, tracking, and prompt handling on resource-constrained platforms.
Align nine student backbones (RepViT, TinyViT, EfficientViT) with the SAM3 encoder using SA-1B and prompt-in-the-loop supervision.
Compress SAM3's dense video memory into a Perceiver-based module distilled on SA-V, enabling efficient multi-frame reasoning.
Jointly fine-tune encoder, memory, and decoder on SAM3 data to preserve promptable concept segmentation quality.
tl;dr: Stage 1 distills encoder on SAM1 data · Stage 2 aligns memory on SAM2 data · Stage 3 fine-tunes PCS on SAM3 data.
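For a concrete picture of Stage 1, the snippet below is a minimal sketch of feature-level encoder distillation: the frozen SAM3 encoder provides target features on SA-1B images and a lightweight student is trained to match them. The function names, the plain MSE loss, and the omission of prompt-in-the-loop supervision are illustrative simplifications, not the repository's actual training code.

```python
import torch
import torch.nn.functional as F

def encoder_distillation_loss(student_feats: torch.Tensor,
                              teacher_feats: torch.Tensor) -> torch.Tensor:
    """Match student encoder features to the frozen SAM3 teacher features."""
    # Resample the student map if its spatial resolution differs from the
    # teacher's (a 1x1 projection would also be needed if channel widths
    # differ; omitted here for brevity).
    if student_feats.shape[-2:] != teacher_feats.shape[-2:]:
        student_feats = F.interpolate(student_feats,
                                      size=teacher_feats.shape[-2:],
                                      mode="bilinear", align_corners=False)
    return F.mse_loss(student_feats, teacher_feats)

# Illustrative training step (teacher frozen, student trainable):
#   with torch.no_grad():
#       t = teacher_encoder(images)   # SAM3 image encoder
#   s = student_encoder(images)       # e.g. RepViT / TinyViT / EfficientViT
#   encoder_distillation_loss(s, t).backward()
```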
pip install -e ".[stage1]"
See the installation guide for full setup instructions.
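A quick, optional way to confirm the editable install worked is to import the builders used in the examples below:

```python
# Optional sanity check after `pip install -e ".[stage1]"`
from sam3.model_builder import build_efficientsam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor
print("EfficientSAM3 imports OK")
```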
# Image prompt
from PIL import Image

from sam3.model_builder import build_efficientsam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Load model
model = build_efficientsam3_image_model(
    checkpoint_path="efficient_sam3_efficientvit_s.pt",
    backbone_type="efficientvit",
    model_name="b0",
    enable_inst_interactivity=True,
)

# Load an input image (replace with your own path)
image = Image.open("example.jpg").convert("RGB")

# Process image and predict
processor = Sam3Processor(model)
inference_state = processor.set_image(image)

# Single positive point prompt (x, y) in pixels
points = [[image.size[0] / 2, image.size[1] / 2]]
labels = [1]
masks, scores, _ = model.predict_inst(
    inference_state,
    point_coords=points,
    point_labels=labels,
)
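To do something with the prediction, a minimal follow-up like the one below picks the highest-scoring candidate and saves it as a PNG. It assumes `masks` and `scores` are index-aligned and convertible to NumPy arrays (move tensors to CPU first if needed); the output file name is arbitrary.

```python
import numpy as np
from PIL import Image

# Assumption: `masks` holds per-candidate binary masks and `scores` their
# confidences, index-aligned. Pick the best candidate and save it.
best = int(np.argmax(np.asarray(scores).reshape(-1)))
best_mask = np.asarray(masks[best]).squeeze().astype(np.uint8) * 255
Image.fromarray(best_mask).save("best_mask.png")
```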
# Text prompt
from PIL import Image

from sam3.model_builder import build_efficientsam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Load model with text encoder
model = build_efficientsam3_image_model(
    checkpoint_path="efficient_sam3_tinyvit_m_mobileclip_s1.pt",
    backbone_type="tinyvit",
    model_name="11m",
    text_encoder_type="MobileCLIP-S1",
)

# Load an input image (replace with your own path)
image = Image.open("example.jpg").convert("RGB")

# Process image and predict with text prompt
processor = Sam3Processor(model)
inference_state = processor.set_image(image)
inference_state = processor.set_text_prompt(prompt="shoe", state=inference_state)
masks = inference_state["masks"]
scores = inference_state["scores"]
print(len(scores), scores)
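Because a concept prompt can match several instances, a small post-processing step is often useful. The sketch below keeps only detections above a confidence threshold; the 0.5 cutoff and the array-like, index-aligned mask/score format are assumptions, not part of the API.

```python
import numpy as np

# Assumption: `scores` is array-like and index-aligned with `masks`.
keep = [i for i, s in enumerate(np.asarray(scores).reshape(-1)) if s > 0.5]
print(f"{len(keep)} instance(s) of 'shoe' above the 0.5 threshold")
selected_masks = [masks[i] for i in keep]
```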
See image example and text prompt example for details.
Stage 1 image encoder weights (distilled from the SAM3 image encoder) and text encoder weights (distilled from the SAM3 text encoder) are now available via Google Drive and Hugging Face. Stage 1 fine-tuned (ft) weights are also available. Stage 2 and Stage 3 weights are coming soon.
| Model | Backbone | Parameters | Stage 1 | Stage 2 | Stage 3 |
|---|---|---|---|---|---|
| ES-RV-S | RepViT-M0.9 | 4.72M | GDrive / HF | Planned | Planned |
| ES-RV-M | RepViT-M1.1 | 7.77M | GDrive / HF (ft: GDrive, HF) | Planned | Planned |
| ES-RV-L | RepViT-M2.3 | 22.40M | GDrive / HF | Planned | Planned |
| ES-TV-S | TinyViT-5M | 5.07M | GDrive / HF | Planned | Planned |
| ES-TV-M | TinyViT-11M | 10.55M | GDrive / HF (ft: GDrive, HF) | Planned | Planned |
| ES-TV-L | TinyViT-21M | 20.62M | GDrive / HF | Planned | Planned |
| ES-EV-S | EfficientViT-B0 | 0.68M | GDrive / HF | Planned | Planned |
| ES-EV-M | EfficientViT-B1 | 4.64M | GDrive / HF (ft: GDrive, HF) | Planned | Planned |
| ES-EV-L | EfficientViT-B2 | 14.98M | GDrive / HF | Planned | Planned |
Note (2025/12/02): The current Stage 1 image encoder weights are distilled on 1% of the SA-1B dataset.
Note (2026/01/11): The fine-tuned (ft) image models use geometry-prompt fine-tuning on the same 1% subset of SA-1B.
| Model | Backbone | Parameters | Stage 1 | Stage 2 | Stage 3 |
|---|---|---|---|---|---|
| ES-RV-S-MC-S1 | RepViT-M0.9 & MobileCLIP-S1 | 4.72M + 63.56M | GDrive / HF | Planned | Planned |
| ES-RV-M-MC-S1 | RepViT-M1.1 & MobileCLIP-S1 | 7.77M + 63.56M | GDrive / HF (ft: GDrive, HF) | Planned | Planned |
| ES-RV-L-MC-S1 | RepViT-M2.3 & MobileCLIP-S1 | 22.40M + 63.56M | GDrive / HF | Planned | Planned |
| ES-TV-S-MC-S1 | TinyViT-5M & MobileCLIP-S1 | 5.07M + 63.56M | GDrive / HF | Planned | Planned |
| ES-TV-M-MC-S1 | TinyViT-11M & MobileCLIP-S1 | 10.55M + 63.56M | GDrive / HF (ft: GDrive, HF) | Planned | Planned |
| ES-TV-L-MC-S1 | TinyViT-21M & MobileCLIP-S1 | 20.62M + 63.56M | GDrive / HF | Planned | Planned |
| ES-EV-S-MC-S1 | EfficientViT-B0 & MobileCLIP-S1 | 0.68M + 63.56M | GDrive / HF | Planned | Planned |
| ES-EV-M-MC-S1 | EfficientViT-B1 & MobileCLIP-S1 | 4.64M + 63.56M | GDrive / HF (ft: GDrive, HF) | Planned | Planned |
| ES-EV-L-MC-S1 | EfficientViT-B2 & MobileCLIP-S1 | 14.98M + 63.56M | GDrive / HF | Planned | Planned |
Note (2025/12/08): The current Stage 1 text encoder weights are distilled on 1% of the Recap-DataComp-1B dataset.
Note (2026/01/11): Fine-tuned (ft) text encoder models are fine-tuned on SA-Co Gold+Silver text annotations. Standalone fine-tuned text encoder weights: MobileCLIP-S0, MobileCLIP-S1, and MobileCLIP2-L.
Dataset preparation scripts for COCO, DAVIS, LVIS, SA-1B, SA-V, LVOS, MOSE, and YouTube-VOS are located under data/download_*.sh. Refer to README_dataset.md for detailed instructions.
ONNX and CoreML export pipelines are under development to unlock mobile and cross-platform deployment. Follow the repository issues for progress updates.
We welcome pull requests across the ecosystem.
Organizations and projects using EfficientSAM3:
Note: If you're using EfficientSAM3 in your work, please acknowledge us in your publications or projects. We're happy to promote your work here! Contact us to be featured in this section.
@misc{zeng2025efficientsam3progressivehierarchicaldistillation,
      title={EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3},
      author={Chengxi Zeng and Yuxuan Jiang and Gao Ge and Shuai Wang and Fan Aaron Zhang},
      year={2025},
      eprint={2511.15833},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.15833},
}