✓ Stage 1 image & text encoder weights released!

EfficientSAM3: Progressive Hierarchical Knowledge Distillation from SAM1, SAM2 & SAM3

Chengxi Simon Zeng, Yuxuan Jiang, Aaron Zhang · Visual Information Lab, University of Bristol

Why EfficientSAM3?

SAM3 delivers Promptable Concept Segmentation (PCS) by combining semantic understanding with temporal tracking, yet its massive backbone and dense memory bank make on-device deployment impractical. EfficientSAM3 distills knowledge from SAM1, SAM2, and SAM3 into a family of lightweight student models tailored for edge hardware, without sacrificing PCS quality.

EfficientSAM3 architecture diagram

Updates

  • 2025-12-08: Stage 1 text encoder weights released for all 3 variants (MobileCLIP-S0, MobileCLIP-S1, and MobileCLIP2-L), distilled on 1% of the Recap-DataComp-1B dataset.
  • 2025-12-02: Stage 1 image encoder weights released for all 9 variants (RepViT, TinyViT, EfficientViT), distilled on 1% of the SA-1B dataset. Available via Google Drive and Hugging Face.
  • 2025-10-18: Project announced.

Highlights

  • Image encoders distilled into RepViT, TinyViT, and EfficientViT families.
  • Text encoders distilled into MobileCLIP variants (up to 87.96% smaller than SAM3's 354M-parameter text encoder).
  • Perceiver-based memory compression aligned with SAM2 temporal tracking.
  • ONNX/CoreML support for real-time mobile, embedded, and desktop deployment.

Abstract

SAM3 brought promptable concept segmentation to production scale, but its computational footprint blocks latency-sensitive applications. EfficientSAM3 progressively distills SAM3 into lightweight architectures that maintain PCS quality on edge devices.

We employ a three-stage curriculum: (1) encoder distillation on SA-1B with prompt-in-the-loop supervision, (2) temporal memory distillation on SA-V using a compact Perceiver module, and (3) end-to-end fine-tuning on official SAM3 concept segmentation data. The resulting students deliver real-time segmentation, tracking, and prompt handling on resource-constrained platforms.

Three-Stage Progressive Distillation

Stage 1 · Compact Encoder

Align nine student backbones (RepViT, TinyViT, EfficientViT) with the SAM3 encoder using SA-1B and prompt-in-the-loop supervision.
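In code, the Stage 1 objective reduces to aligning student and teacher feature maps. The sketch below is a minimal, hypothetical illustration of such a feature-alignment loss in PyTorch; the optional projection head, the bilinear resize, and the plain MSE objective are illustrative assumptions, not the repository's exact implementation.

import torch.nn.functional as F

def encoder_distill_loss(student_feats, teacher_feats, proj=None):
    # Optional projection to match the teacher's channel width (assumed).
    if proj is not None:
        student_feats = proj(student_feats)
    # Resize the student's maps if the two backbones differ in spatial stride.
    if student_feats.shape[-2:] != teacher_feats.shape[-2:]:
        student_feats = F.interpolate(
            student_feats, size=teacher_feats.shape[-2:],
            mode="bilinear", align_corners=False,
        )
    # Regress the frozen SAM3 teacher's embeddings with an MSE objective.
    return F.mse_loss(student_feats, teacher_feats)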

Stage 2 · Temporal Memory

Compress SAM3's dense video memory into a Perceiver-based module distilled on SA-V, enabling efficient multi-frame reasoning.
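A Perceiver-style compressor of this kind can be written as a small set of learned latent queries that cross-attend over the dense memory bank, yielding a fixed-size summary regardless of video length. The module below is a hypothetical sketch (class name, sizes, and layout are assumptions), not the released Stage 2 architecture.

import torch
import torch.nn as nn

class PerceiverMemory(nn.Module):
    # Compress a long stream of memory tokens into a fixed set of latents.
    def __init__(self, dim=256, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, memory_tokens):
        # memory_tokens: (B, T*N, dim) features accumulated from past frames.
        B = memory_tokens.shape[0]
        q = self.latents.unsqueeze(0).expand(B, -1, -1)
        # Learned latent queries attend over the dense memory bank, so the
        # output size is constant no matter how many frames were stored.
        out, _ = self.cross_attn(q, memory_tokens, memory_tokens)
        return self.norm(out)  # (B, num_latents, dim)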

Stage 3 · Promptable PCS

Jointly fine-tune encoder, memory, and decoder on SAM3 data to preserve promptable concept segmentation quality.
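Conceptually, Stage 3 is a standard joint fine-tuning step in which the encoder, memory module, and decoder all receive gradients from a mask objective. The sketch below is purely illustrative; every name (stage3_finetune_step, mask_loss, the batch keys) is a placeholder, not the repository's API.

import torch

def stage3_finetune_step(encoder, memory, decoder, batch, optimizer, mask_loss):
    # One hypothetical Stage 3 step: all three modules are unfrozen and
    # trained jointly on SAM3 concept-segmentation data.
    feats = encoder(batch["frames"])               # compact image features
    mem = memory(feats)                            # Perceiver memory summary
    pred = decoder(feats, mem, batch["prompts"])   # promptable mask decoding
    loss = mask_loss(pred, batch["gt_masks"])      # e.g. focal + dice terms
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()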

tl;dr: Stage 1 distills encoder on SAM1 data · Stage 2 aligns memory on SAM2 data · Stage 3 fine-tunes PCS on SAM3 data.

Get Started

Installation

pip install -e ".[stage1]"

See the installation guide for full setup instructions.

Quick Start

import numpy as np
from PIL import Image

# Import path assumes the repository's package layout.
from efficient_sam3 import build_efficientsam3_image_model, Sam3Processor

image = Image.open("example.jpg")
points = np.array([[500, 375]])  # one foreground click, (x, y) pixels
labels = np.array([1])           # 1 = foreground, 0 = background

# Point prompt
model = build_efficientsam3_image_model(
    checkpoint_path="efficient_sam3_tinyvit_s.pt",
    backbone_type="tinyvit", model_name="5m",
)
processor = Sam3Processor(model)
inference_state = processor.set_image(image)
masks, scores, _ = model.predict_inst(
    inference_state, point_coords=points,
    point_labels=labels,
)

# Text prompt
model = build_efficientsam3_image_model(
    checkpoint_path="efficient_sam3_tinyvit_m_mobileclip_s1.pt",
    backbone_type="tinyvit", model_name="11m",
    text_encoder_type="MobileCLIP-S1",
)
processor = Sam3Processor(model)
inference_state = processor.set_image(image)
inference_state = processor.set_text_prompt(
    inference_state, prompt="shoe"
)
masks, scores, _ = model.predict_inst(inference_state)

See image example and text prompt example for details.

EfficientSAM3 Model Zoo & Weight Release

Stage 1 image encoder weights (distilled from SAM3 image encoder) and text encoder weights (distilled from SAM3 text encoder) are now available via Google Drive and Hugging Face. Stage 2 and 3 weights coming soon.

Image Encoder Models

| Model | Backbone | Parameters | Stage 1 | Stage 2 | Stage 3 |
|---|---|---|---|---|---|
| ES-RV-S | RepViT-M0.9 | 5.1M | GDrive / HF | Planned | Planned |
| ES-RV-M | RepViT-M1.1 | 6.8M | GDrive / HF | Planned | Planned |
| ES-RV-L | RepViT-M2.3 | 8.2M | GDrive / HF | Planned | Planned |
| ES-TV-S | TinyViT-5M | 5.4M | GDrive / HF | Planned | Planned |
| ES-TV-M | TinyViT-11M | 11M | GDrive / HF | Planned | Planned |
| ES-TV-L | TinyViT-21M | 21M | GDrive / HF | Planned | Planned |
| ES-EV-S | EfficientViT-B0 | 0.7M | GDrive / HF | Planned | Planned |
| ES-EV-M | EfficientViT-B1 | 4.8M | GDrive / HF | Planned | Planned |
| ES-EV-L | EfficientViT-B2 | 15M | GDrive / HF | Planned | Planned |

Note (2025/12/02): The current Stage 1 image encoder weights are distilled on 1% of the SA-1B dataset.

EfficientSAM3 Text Encoder + Image Encoder Models

| Model | Backbone | Parameters (image + text) | Stage 1 | Stage 2 | Stage 3 |
|---|---|---|---|---|---|
| ES-RV-S-MC-S1 | RepViT-M0.9 & MobileCLIP-S1 | 4.72M + 63.56M | GDrive / HF | Planned | Planned |
| ES-RV-M-MC-S1 | RepViT-M1.1 & MobileCLIP-S1 | 7.77M + 63.56M | GDrive / HF | Planned | Planned |
| ES-RV-L-MC-S1 | RepViT-M2.3 & MobileCLIP-S1 | 22.40M + 63.56M | GDrive / HF | Planned | Planned |
| ES-TV-S-MC-S1 | TinyViT-5M & MobileCLIP-S1 | 5.07M + 63.56M | GDrive / HF | Planned | Planned |
| ES-TV-M-MC-S1 | TinyViT-11M & MobileCLIP-S1 | 10.55M + 63.56M | GDrive / HF | Planned | Planned |
| ES-TV-L-MC-S1 | TinyViT-21M & MobileCLIP-S1 | 20.62M + 63.56M | GDrive / HF | Planned | Planned |
| ES-EV-S-MC-S1 | EfficientViT-B0 & MobileCLIP-S1 | 0.68M + 63.56M | GDrive / HF | Planned | Planned |
| ES-EV-M-MC-S1 | EfficientViT-B1 & MobileCLIP-S1 | 4.64M + 63.56M | GDrive / HF | Planned | Planned |
| ES-EV-L-MC-S1 | EfficientViT-B2 & MobileCLIP-S1 | 14.98M + 63.56M | GDrive / HF | Planned | Planned |

Note (2025/12/08): The current Stage 1 text encoder weights are distilled on 1% of the Recap-DataComp-1B dataset.

Datasets

Dataset preparation scripts for COCO, DAVIS, LVIS, SA-1B, SA-V, LVOS, MOSE, and YouTube-VOS are located under data/download_*.sh. Refer to README_dataset.md for detailed instructions.

Export & Deployment

ONNX and CoreML export pipelines are under development to unlock mobile and cross-platform deployment. Follow the repository issues for progress updates.
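Until the official pipelines land, a plain torch.onnx.export of a Stage 1 image encoder is a reasonable starting point. The sketch below assumes the built model exposes its backbone as an image_encoder attribute and accepts SAM-style 1024×1024 inputs; both are unverified assumptions, and the official exporter may differ.

import torch

# Hypothetical export of a Stage 1 image encoder (attribute name assumed).
encoder = build_efficientsam3_image_model(
    checkpoint_path="efficient_sam3_tinyvit_s.pt",
    backbone_type="tinyvit", model_name="5m",
).image_encoder
encoder.eval()

dummy = torch.randn(1, 3, 1024, 1024)  # SAM-style input resolution
torch.onnx.export(
    encoder, dummy, "es_tv_s_encoder.onnx",
    input_names=["image"], output_names=["embedding"],
    dynamic_axes={"image": {0: "batch"}},
    opset_version=17,
)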

Roadmap

  • ✓ Completed: Release Stage 1 image encoder weights (distilled from the SAM3 image encoder)
  • ✓ Completed: Release Stage 1 text encoder weights (distilled from the SAM3 text encoder into MobileCLIP-S1, paired with all 9 image encoder variants)
  • Planned: Release Stage 1+ fine-tuned encoder weights (prompt-in-the-loop supervised fine-tuning)
  • Planned: Release Stage 2 memory-bank-aligned models
  • Planned: Release Stage 3 fine-tuned PCS models
  • Planned: ONNX/CoreML export
  • Planned: Interactive web demo

Call for Contributions

We welcome pull requests across the ecosystem:

  • Efficient MedSAM3 integration and medical datasets
  • Gradio demos, Vercel deployments, and Hugging Face Spaces
  • Annotation tool support (X-AnyLabeling, AnyLabeling)
  • iOS, Android, and NVCC-based desktop applications

Users

Organizations and projects using EfficientSAM3:

Note: If you're using EfficientSAM3 in your work, please acknowledge us in your publications or projects. We're happy to promote your work here! Contact us to be featured in this section.

Citation

@misc{zeng2025efficientsam3progressivehierarchicaldistillation,
  title={EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3}, 
  author={Chengxi Zeng and Yuxuan Jiang and Gao Ge and Shuai Wang and Fan Aaron Zhang},
  year={2025},
  eprint={2511.15833},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.15833}, 
}