✓ Stage 1 + fine-tuned (ft) encoder weights released!

EfficientSAM3: Progressive Hierarchical Knowledge Distillation from SAM1, SAM2 & SAM3

Chengxi Simon Zeng, Yuxuan Jiang, Gao Ge, Shuai Wang, Fan Aaron Zhang · Visual Information Lab, University of Bristol; MultiX lab, University of Amsterdam

Why EfficientSAM3?

SAM3 delivers Promptable Concept Segmentation (PCS) by combining semantic understanding with temporal tracking, yet its massive backbone and dense memory bank make on-device deployment impractical. EfficientSAM3 distills SAM1, SAM2, and SAM3 into a family of lightweight student models tailored for edge hardware, without sacrificing PCS quality.

EfficientSAM3 architecture diagram

Updates

  • 2026-01-11: Stage 1 fine-tuned (ft) weights released (image encoders geometry-prompt fine-tuned on 1% of SA-1B; text encoders fine-tuned on SA-Co Gold+Silver).
  • 2025-12-08: Stage 1 text encoder weights released for all three variants (MobileCLIP-S0, MobileCLIP-S1, and MobileCLIP2-L), distilled on 1% of the Recap-DataComp-1B dataset.
  • 2025-12-02: Stage 1 image encoder weights released for all nine variants (RepViT, TinyViT, EfficientViT), distilled on 1% of the SA-1B dataset. Available via Google Drive and Hugging Face.
  • 2025-10-18: Project announced.

Highlights

  • Image encoders distilled into RepViT, TinyViT, and EfficientViT families.
  • Text encoders distilled into MobileCLIP variants (up to 87.96% smaller than SAM3's 354M-parameter text encoder), with SA-Co Gold+Silver fine-tuned (ft) weights.
  • Perceiver-based memory compression aligned with SAM2 temporal tracking.
  • ONNX/CoreML support for real-time mobile, embedded, and desktop deployment.

Abstract

SAM3 brought promptable concept segmentation to production scale, but its computational footprint blocks latency-sensitive applications. EfficientSAM3 progressively distills SAM3 into lightweight architectures that maintain PCS quality on edge devices.

We employ a three-stage curriculum: (1) encoder distillation on SA-1B with prompt-in-the-loop supervision, (2) temporal memory distillation on SA-V using a compact Perceiver module, and (3) end-to-end fine-tuning on official SAM3 concept segmentation data. The resulting students deliver real-time segmentation, tracking, and prompt handling on resource-constrained platforms.

Three-Stage Progressive Distillation

Stage 1 · Compact Encoder

Align nine student backbones (RepViT, TinyViT, EfficientViT) with the SAM3 encoder using SA-1B and prompt-in-the-loop supervision.
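
For intuition, here is a minimal PyTorch sketch of the feature-matching step at the core of encoder distillation. All names (`stage1_distill_step`, `student`, `teacher`, `proj`) are hypothetical stand-ins, not the repository's actual training code, which additionally runs prompts in the loop.

# Minimal sketch of Stage 1 feature distillation (illustrative only).
import torch
import torch.nn.functional as F

def stage1_distill_step(student, teacher, images, optimizer, proj=None):
    """One step: match student features to frozen SAM3 encoder features."""
    with torch.no_grad():
        t_feat = teacher(images)       # frozen SAM3 image encoder output
    s_feat = student(images)           # lightweight student backbone output
    if proj is not None:               # optional projection if channel dims
        s_feat = proj(s_feat)          # differ between the two encoders
    loss = F.mse_loss(s_feat, t_feat)  # align the feature maps
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()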

Stage 2 · Temporal Memory

Compress SAM3's dense video memory into a Perceiver-based module distilled on SA-V, enabling efficient multi-frame reasoning.
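
A minimal sketch of the idea, assuming a standard Perceiver-style design: a small set of learned latents cross-attends to the dense per-frame memory tokens, producing a compact memory bank. The class name, shapes, and hyperparameters below are illustrative assumptions, not the repository's actual module.

# Hedged sketch of Perceiver-style memory compression.
import torch
import torch.nn as nn

class PerceiverMemoryCompressor(nn.Module):
    def __init__(self, dim=256, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, memory_tokens):
        # memory_tokens: (B, N, dim) with N large (dense video memory)
        B = memory_tokens.shape[0]
        q = self.latents.unsqueeze(0).expand(B, -1, -1)  # (B, num_latents, dim)
        compressed, _ = self.cross_attn(q, memory_tokens, memory_tokens)
        return self.norm(compressed)                     # (B, num_latents, dim)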

Stage 3 · Promptable PCS

Jointly fine-tune encoder, memory, and decoder on SAM3 data to preserve promptable concept segmentation quality.
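
As a hedged sketch of what the joint objective could look like, the snippet below mixes a ground-truth segmentation loss with distillation against teacher mask logits; the function name and weighting are assumptions, not the official recipe.

# Illustrative Stage 3 objective (hypothetical, not the repo's training code).
import torch.nn.functional as F

def stage3_loss(student_logits, teacher_logits, gt_masks, alpha=0.5):
    # gt_masks: float tensor in [0, 1] with the same shape as the logits
    seg = F.binary_cross_entropy_with_logits(student_logits, gt_masks)
    distill = F.mse_loss(student_logits, teacher_logits)
    return alpha * seg + (1 - alpha) * distill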

tl;dr: Stage 1 distills encoder on SAM1 data · Stage 2 aligns memory on SAM2 data · Stage 3 fine-tunes PCS on SAM3 data.

Get Started

Installation

pip install -e ".[stage1]"

See the installation guide for full setup instructions.

Quick Start

# Image prompt
from PIL import Image

from sam3.model_builder import build_efficientsam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Load model
model = build_efficientsam3_image_model(
    checkpoint_path="efficient_sam3_efficientvit_s.pt",
    backbone_type="efficientvit",
    model_name="b0",
    enable_inst_interactivity=True,
)

# Process image and predict
image = Image.open("example.jpg")
processor = Sam3Processor(model)
inference_state = processor.set_image(image)

# Single positive point prompt (x, y) in pixels
points = [[image.size[0] / 2, image.size[1] / 2]]
labels = [1]
masks, scores, _ = model.predict_inst(
    inference_state,
    point_coords=points,
    point_labels=labels,
)

# Text prompt
from PIL import Image

from sam3.model_builder import build_efficientsam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Load model with text encoder
model = build_efficientsam3_image_model(
    checkpoint_path="efficient_sam3_tinyvit_m_mobileclip_s1.pt",
    backbone_type="tinyvit",
    model_name="11m",
    text_encoder_type="MobileCLIP-S1",
)

# Process image and predict with text prompt
image = Image.open("example.jpg")
processor = Sam3Processor(model)
inference_state = processor.set_image(image)
inference_state = processor.set_text_prompt(prompt="shoe", state=inference_state)
masks = inference_state["masks"]
scores = inference_state["scores"]
print(len(scores), scores)

See image example and text prompt example for details.

EfficientSAM3 Model Zoo & Weight Release

Stage 1 image encoder weights (distilled from the SAM3 image encoder) and text encoder weights (distilled from the SAM3 text encoder) are now available via Google Drive and Hugging Face, as are the Stage 1 fine-tuned (ft) weights. Stage 2 and Stage 3 weights are coming soon.
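
A hedged example of fetching a Stage 1 checkpoint programmatically with huggingface_hub and passing it to the builder from the Quick Start. The repo_id below is a placeholder; use the links in the tables below for the actual locations.

# Sketch: download a checkpoint from Hugging Face and build the model.
from huggingface_hub import hf_hub_download
from sam3.model_builder import build_efficientsam3_image_model

ckpt = hf_hub_download(
    repo_id="<org>/EfficientSAM3",            # placeholder repo id
    filename="efficient_sam3_efficientvit_s.pt",
)
model = build_efficientsam3_image_model(
    checkpoint_path=ckpt,
    backbone_type="efficientvit",
    model_name="b0",
)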

Image Encoder Models

| Model   | Backbone        | Parameters | Stage 1                       | Stage 2 | Stage 3 |
|---------|-----------------|------------|-------------------------------|---------|---------|
| ES-RV-S | RepViT-M0.9     | 4.72M      | GDrive / HF                   | Planned | Planned |
| ES-RV-M | RepViT-M1.1     | 7.77M      | GDrive / HF (ft: GDrive, HF)  | Planned | Planned |
| ES-RV-L | RepViT-M2.3     | 22.40M     | GDrive / HF                   | Planned | Planned |
| ES-TV-S | TinyViT-5M      | 5.07M      | GDrive / HF                   | Planned | Planned |
| ES-TV-M | TinyViT-11M     | 10.55M     | GDrive / HF (ft: GDrive, HF)  | Planned | Planned |
| ES-TV-L | TinyViT-21M     | 20.62M     | GDrive / HF                   | Planned | Planned |
| ES-EV-S | EfficientViT-B0 | 0.68M      | GDrive / HF                   | Planned | Planned |
| ES-EV-M | EfficientViT-B1 | 4.64M      | GDrive / HF (ft: GDrive, HF)  | Planned | Planned |
| ES-EV-L | EfficientViT-B2 | 14.98M     | GDrive / HF                   | Planned | Planned |

Note (2025-12-02): The current Stage 1 image encoder weights are distilled on 1% of the SA-1B dataset.

Note (2026-01-11): The fine-tuned (ft) image models use geometry-prompt fine-tuning on the same 1% subset of SA-1B.

EfficientSAM3 Text Encoder + Image Encoder Models

| Model         | Backbone                         | Parameters      | Stage 1                       | Stage 2 | Stage 3 |
|---------------|----------------------------------|-----------------|-------------------------------|---------|---------|
| ES-RV-S-MC-S1 | RepViT-M0.9 & MobileCLIP-S1      | 4.72M + 63.56M  | GDrive / HF                   | Planned | Planned |
| ES-RV-M-MC-S1 | RepViT-M1.1 & MobileCLIP-S1      | 7.77M + 63.56M  | GDrive / HF (ft: GDrive, HF)  | Planned | Planned |
| ES-RV-L-MC-S1 | RepViT-M2.3 & MobileCLIP-S1      | 22.40M + 63.56M | GDrive / HF                   | Planned | Planned |
| ES-TV-S-MC-S1 | TinyViT-5M & MobileCLIP-S1       | 5.07M + 63.56M  | GDrive / HF                   | Planned | Planned |
| ES-TV-M-MC-S1 | TinyViT-11M & MobileCLIP-S1      | 10.55M + 63.56M | GDrive / HF (ft: GDrive, HF)  | Planned | Planned |
| ES-TV-L-MC-S1 | TinyViT-21M & MobileCLIP-S1      | 20.62M + 63.56M | GDrive / HF                   | Planned | Planned |
| ES-EV-S-MC-S1 | EfficientViT-B0 & MobileCLIP-S1  | 0.68M + 63.56M  | GDrive / HF                   | Planned | Planned |
| ES-EV-M-MC-S1 | EfficientViT-B1 & MobileCLIP-S1  | 4.64M + 63.56M  | GDrive / HF (ft: GDrive, HF)  | Planned | Planned |
| ES-EV-L-MC-S1 | EfficientViT-B2 & MobileCLIP-S1  | 14.98M + 63.56M | GDrive / HF                   | Planned | Planned |

Note (2025-12-08): The current Stage 1 text encoder weights are distilled on 1% of the Recap-DataComp-1B dataset.

Note (2026-01-11): Fine-tuned (ft) text encoder models are fine-tuned on SA-Co Gold+Silver text annotations. Standalone fine-tuned text encoder weights: MobileCLIP-S0, MobileCLIP-S1, and MobileCLIP2-L.

Datasets

Dataset preparation scripts for COCO, DAVIS, LVIS, SA-1B, SA-V, LVOS, MOSE, and YouTube-VOS are located under data/download_*.sh. Refer to README_dataset.md for detailed instructions.

Export & Deployment

ONNX and CoreML export pipelines are under development to unlock mobile and cross-platform deployment. Follow the repository issues for progress updates.
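
Until the official pipeline lands, a minimal sketch of exporting just the image encoder with PyTorch's built-in ONNX exporter might look like the following. The `model.image_encoder` attribute name and the 1024x1024 input resolution are assumptions about the model layout, not confirmed by the repository.

# Hedged sketch: export the image encoder to ONNX.
import torch
from sam3.model_builder import build_efficientsam3_image_model

model = build_efficientsam3_image_model(
    checkpoint_path="efficient_sam3_efficientvit_s.pt",
    backbone_type="efficientvit",
    model_name="b0",
)
model.eval()

dummy = torch.randn(1, 3, 1024, 1024)     # assumed encoder input resolution
torch.onnx.export(
    model.image_encoder,                  # hypothetical submodule name
    dummy,
    "efficientsam3_image_encoder.onnx",
    input_names=["image"],
    output_names=["features"],
    opset_version=17,
    dynamic_axes={"image": {0: "batch"}},
)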

Roadmap

  • ✓ Completed: Release Stage 1 image encoder weights (distilled from the SAM3 image encoder)
  • ✓ Completed: Release Stage 1 text encoder weights (distilled from the SAM3 text encoder to MobileCLIP-S1, combined with all nine image encoder variants)
  • ✓ Completed: Release Stage 1+ fine-tuned (ft) encoder weights (geometry-prompt fine-tuning; SA-Co Gold+Silver text fine-tuning)
  • Planned: Release Stage 2 memory-bank-aligned models
  • Planned: Release Stage 3 fine-tuned PCS models
  • Planned: ONNX/CoreML export
  • Planned: Interactive web demo

Call for Contributions

We welcome pull requests across the ecosystem:

  • Efficient MedSAM3 integration and medical datasets
  • Gradio demos, Vercel deployments, and Hugging Face Spaces
  • Annotation tool support (X-AnyLabeling, AnyLabeling)
  • iOS, Android, and NVCC-based desktop applications

Users

Organizations and projects using EfficientSAM3:

Note: If you're using EfficientSAM3 in your work, please acknowledge us in your publications or projects. We're happy to promote your work here! Contact us to be featured in this section.

Citation

@misc{zeng2025efficientsam3progressivehierarchicaldistillation,
  title={EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3}, 
  author={Chengxi Zeng and Yuxuan Jiang and Gao Ge and Shuai Wang and Fan Aaron Zhang},
  year={2025},
  eprint={2511.15833},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.15833}, 
}