Chengxi Simon Zeng, Yuxuan Jiang, Ge Gao, Shuai Wang, Fan Aaron Zhang · Visual Information Lab, University of Bristol; MultiX lab, University of Amsterdam
SAM3 delivers Promptable Concept Segmentation (PCS) by combining semantic understanding and temporal tracking, yet its massive backbone and dense memory bank make on-device deployment impractical. EfficientSAM3 distills knowledge from SAM1, SAM2, and SAM3 into a family of lightweight student models tailored for edge hardware without sacrificing PCS quality.
v0.4.0 — Stage 3 Fine-Tuned PCS Models (2026-06-11)
EfficientSAM3 full models (EV-M, RV-M, TV-M) fine-tuned on 5% of SA-1B with complete PCS capabilities. Actual params: 89.2M / 92.7M / 95.3M (89% to 90% smaller than ImageSAM3).
v0.3.0 — SAM3.1 & HuggingFace Integration (2026-04-13)
EfficientSAM3.1 & SAM3.1-LiteText released with official HuggingFace Transformers support.
v0.2.0 — SAM3-LiteText (2026-02-18)
88% smaller text encoder with similar performance. Accepted at ICMR 2026!
SAM3 brought promptable concept segmentation to production scale, but its computational footprint blocks latency-sensitive applications. EfficientSAM3 progressively distills SAM3 into lightweight architectures that maintain PCS quality on edge devices.
We employ a three-stage curriculum: (1) encoder distillation on SA-1B with prompt-in-the-loop supervision, (2) temporal memory distillation on SA-V using a compact Perceiver module, and (3) end-to-end fine-tuning on official SAM3 concept segmentation data. The resulting students deliver real-time segmentation, tracking, and prompt handling on resource-constrained platforms.
Stage 1: Align nine student backbones (RepViT, TinyViT, EfficientViT) with the SAM3 encoder using SA-1B and prompt-in-the-loop supervision.
Stage 2: Compress SAM3's dense video memory into a Perceiver-based module distilled on SA-V, enabling efficient multi-frame reasoning.
Stage 3: Jointly fine-tune encoder, memory, and decoder on SAM3 data to preserve promptable concept segmentation quality.
tl;dr: Stage 1 distills encoder on SAM1 data · Stage 2 aligns memory on SAM2 data · Stage 3 fine-tunes PCS on SAM3 data.
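The Stage 2 idea of compressing a dense memory with a Perceiver-style module can be illustrated with a toy example. This is a minimal sketch, not the EfficientSAM3 implementation: it assumes a fixed set of latent query vectors (random here, learned in a real model) that cross-attend over the memory tokens, and it omits the learned key/value projections. The names `compress_memory` and `latents`, and all sizes, are hypothetical.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def compress_memory(memory_tokens, latents):
    """Cross-attend K latent queries over N memory tokens, returning K tokens.

    Assumption: keys and values are the raw memory tokens (no learned
    projections), which keeps the sketch short.
    """
    dim = len(latents[0])
    scale = 1.0 / math.sqrt(dim)
    summary = []
    for query in latents:
        weights = softmax([dot(query, token) * scale for token in memory_tokens])
        # Each output token is an attention-weighted average of the memory.
        summary.append([
            sum(w * token[d] for w, token in zip(weights, memory_tokens))
            for d in range(dim)
        ])
    return summary


rng = random.Random(0)
dim, num_latents = 8, 4
latents = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_latents)]
short_memory = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(16)]
long_memory = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(256)]

# The summary length equals the number of latents for both memory sizes.
print(len(compress_memory(short_memory, latents)))
print(len(compress_memory(long_memory, latents)))
```

Because the output size depends only on the number of latents, anything that consumes the compressed memory sees a fixed-size input however many frames have been stored.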
```bash
pip install -e ".[stage1]"
```
See the installation guide for full setup instructions.
```python
from sam3.model_builder import build_efficientsam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor
from PIL import Image

# Load EfficientSAM3 TV-M model (uses TinyViT vision encoder + MobileCLIP-S0 text encoder)
model = build_efficientsam3_image_model(
    checkpoint_path="efficientsam3_tinyvit.pt",
    backbone_type="tinyvit",
    model_name="11m",
    text_encoder_type="MobileCLIP-S0",
    text_encoder_context_length=16,
    load_from_HF=False,
)

# Process image
processor = Sam3Processor(model)
image = Image.open("your_image.jpg").convert("RGB")
state = processor.set_image(image)

# Text prompt segmentation
state = processor.set_text_prompt("dog", state)

# Get masks
masks = state["masks"]
scores = state["scores"]
print(f"Found {len(masks)} masks")
```
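A common follow-up to the snippet above is to keep only the confident detections. The helper below is a hypothetical sketch, not part of the `sam3` package: it assumes `masks` and `scores` are equal-length sequences whose scores can be converted with `float()`, and the 0.5 threshold is an arbitrary illustration. Stand-in data is used so the example runs without a checkpoint.

```python
def select_confident(masks, scores, threshold=0.5):
    """Return (mask, score) pairs with score >= threshold, best first."""
    pairs = [(mask, float(score)) for mask, score in zip(masks, scores)]
    kept = [pair for pair in pairs if pair[1] >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept


# Stand-in values; with a loaded model you would pass
# state["masks"] and state["scores"] instead.
demo_masks = ["mask_a", "mask_b", "mask_c"]
demo_scores = [0.91, 0.32, 0.67]

for mask, score in select_confident(demo_masks, demo_scores):
    print(f"{mask}: {score:.2f}")
```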
See the project README for more examples, including SAM3-LiteText.
EfficientSAM3 compresses both SAM3's vision encoder and text encoder into lightweight student models while maintaining competitive performance on downstream benchmarks.
| Model | Vision | Text | Transformer | Other | Params | vs ImageSAM3 | Download |
|---|---|---|---|---|---|---|---|
| EV-M | 22.2M | 42.5M | 21.0M | 3.5M | 89.2M | 90% smaller | HF |
| RV-M | 25.6M | 42.5M | 21.0M | 3.5M | 92.7M | 89% smaller | HF |
| TV-M | 28.3M | 42.5M | 21.0M | 3.5M | 95.3M | 89% smaller | HF |
Note: "Text" is the distilled text encoder. "Transformer" is the mask decoder. "Other" includes segmentation head + scoring. ImageSAM3 (for comparison): Vision: 463M + Text: 354M + Transformer: 30.3M + Other: 14.2M = **861.5M**
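The "vs ImageSAM3" column can be reproduced from the numbers above. The short check below copies the ImageSAM3 component sizes from the note and the student totals from the "Params" column; the function names are illustrative only.

```python
# Parameter counts in millions, copied from the table and note above.
IMAGE_SAM3_PARTS = {"vision": 463.0, "text": 354.0, "transformer": 30.3, "other": 14.2}
STUDENT_TOTALS = {"EV-M": 89.2, "RV-M": 92.7, "TV-M": 95.3}


def percent_smaller(student_total, teacher_total):
    """Relative size reduction of the student, as a percentage."""
    return 100.0 * (1.0 - student_total / teacher_total)


teacher_total = sum(IMAGE_SAM3_PARTS.values())
print(f"ImageSAM3 total: {teacher_total:.1f}M")

for name, total in STUDENT_TOTALS.items():
    print(f"{name}: {total:.1f}M, {percent_smaller(total, teacher_total):.1f}% smaller")
```

Rounded to whole percentages, the three students come out at 90%, 89%, and 89% smaller, matching the table.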
SAM3-LiteText keeps the SAM3 vision encoder but replaces the text encoder with lightweight MobileCLIP variants.
| Model | Vision | Text | Transformer | Other | Params | vs ImageSAM3 | Download |
|---|---|---|---|---|---|---|---|
| LiteText-S0-16 | 463.0M | 42.5M | 30.3M | 14.2M | 550.0M | 36% smaller | HF |
| LiteText-S0-32 | 463.0M | 42.5M | 30.3M | 14.2M | 550.0M | 36% smaller | HF |
| LiteText-S1-16 | 463.0M | 63.5M | 30.3M | 14.2M | 571.0M | 34% smaller | HF |
| LiteText-S1-32 | 463.0M | 63.5M | 30.3M | 14.2M | 571.0M | 34% smaller | HF |
| LiteText-L-16 | 463.0M | 123.8M | 30.3M | 14.2M | 631.3M | 27% smaller | HF |
| LiteText-L-32 | 463.0M | 123.8M | 30.3M | 14.2M | 631.3M | 27% smaller | HF |
Note: "Text" is the distilled text encoder (42.5M-123.8M). SAM3-LiteText keeps SAM3's ViT-H vision encoder (~463M) but replaces the text encoder. "Other" includes geometry encoder + segmentation head + scoring.
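Since SAM3-LiteText swaps only the text encoder, the total reduction is much smaller than the text-encoder reduction. The sketch below works through that arithmetic using the figures from the tables and notes in this document; `swapped_total` and `reduction` are illustrative names.

```python
# Parameter counts in millions, copied from the tables and notes above.
SAM3_PARTS = {"vision": 463.0, "text": 354.0, "transformer": 30.3, "other": 14.2}
LITETEXT_ENCODERS = {"S0": 42.5, "S1": 63.5, "L": 123.8}

SAM3_TOTAL = sum(SAM3_PARTS.values())


def reduction(new, old):
    """Percentage by which `new` is smaller than `old`."""
    return 100.0 * (1.0 - new / old)


def swapped_total(text_params):
    """Total size after replacing only the text encoder."""
    return SAM3_TOTAL - SAM3_PARTS["text"] + text_params


for name, text_params in LITETEXT_ENCODERS.items():
    total = swapped_total(text_params)
    print(
        f"{name}: text encoder {reduction(text_params, SAM3_PARTS['text']):.0f}% smaller, "
        f"full model {total:.1f}M ({reduction(total, SAM3_TOTAL):.0f}% smaller)"
    )
```

For the S0 variant the text encoder shrinks by about 88%, in line with the v0.2.0 release note, while the full model shrinks by only about 36% because the 463M vision encoder is unchanged.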
Stage 1 distilled image encoder and text encoder weights are provided for training. See README_stage1.md for details.
Dataset preparation scripts for COCO, DAVIS, LVIS, SA-1B, SA-V, LVOS, MOSE, and YouTube-VOS are located under data/download_*.sh. Refer to README_dataset.md for detailed instructions.
ONNX and CoreML export pipelines are under development to unlock mobile and cross-platform deployment. Follow the repository issues for progress updates.
We welcome pull requests across the ecosystem.
Organizations and projects using EfficientSAM3:
Note: If you're using EfficientSAM3 in your work, please acknowledge us in your publications or projects. We're happy to promote your work here! Contact us to be featured in this section.
```bibtex
@misc{zeng2025efficientsam3progressivehierarchicaldistillation,
  title={EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3},
  author={Chengxi Zeng and Yuxuan Jiang and Aaron Zhang},
  year={2025},
  eprint={2511.15833},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.15833},
}

@misc{zeng2026sam3litetextanatomicalstudysam3,
  title={SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation},
  author={Chengxi Zeng and Yuxuan Jiang and Ge Gao and Shuai Wang and Duolikun Danier and Bin Zhu and Stevan Rudinac and David Bull and Fan Zhang},
  year={2026},
  eprint={2602.12173},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.12173},
}
```