✓ Stage 3 Fine-tuned Models Released! (up to 90% smaller than ImageSAM3)

EfficientSAM3: Progressive Hierarchical Knowledge Distillation from SAM1, SAM2 & SAM3

Chengxi Simon Zeng, Yuxuan Jiang, Gao Ge, Shuai Wang, Fan Aaron Zhang · Visual Information Lab, University of Bristol; MultiX lab, University of Amsterdam

Why EfficientSAM3?

SAM3 delivers Promptable Concept Segmentation (PCS) by combining semantic understanding and temporal tracking, yet its massive backbone and dense memory bank make on-device deployment impractical. EfficientSAM3 distills SAM1, SAM2, and SAM3 into a family of lightweight student models tailored for edge hardware while keeping PCS quality competitive.

[Figure: EfficientSAM3 architecture diagram]

Updates

v0.4.0 — Stage 3 Fine-Tuned PCS Models (2026-06-11)

EfficientSAM3 full models (EV-M, RV-M, TV-M), fine-tuned on 5% of SA-1B, with complete PCS capabilities. Parameter counts: 89.2M / 92.7M / 95.3M (roughly 89 to 90% smaller than ImageSAM3's 861.5M).

HuggingFace · Stage 3 Guide

v0.3.0 — SAM3.1 & HuggingFace Integration (2026-04-13)

EfficientSAM3.1 & SAM3.1-LiteText released with official HuggingFace Transformers support.

HF Docs · Demo

v0.2.0 — SAM3-LiteText (2026-02-18)

Text encoder reduced by up to 88% with similar performance. Accepted at ICMR 2026!

Paper · ICMR2026

Older updates
  • 2026/01/11: Stage 1 geometry-prompt fine-tuned weights released.
  • 2025/12/08: Stage 1 text encoder weights released (MobileCLIP S0, S1, MobileCLIP2 L).
  • 2025/12/02: Stage 1 image encoder weights released (RepViT, TinyViT, EfficientViT).
  • 2025/10/18: EfficientSAM3 project announced.

Highlights

  • Image encoders distilled into RepViT, TinyViT, and EfficientViT families.
  • Text encoders distilled into MobileCLIP variants (up to 87.96% smaller than SAM3's 354M text encoder), with SA-Co Gold+Silver fine-tuned (ft) weights.
  • Perceiver-based memory compression aligned with SAM2 temporal tracking.
  • ONNX/CoreML export (in development) for real-time mobile, embedded, and desktop deployment.

Abstract

SAM3 brought promptable concept segmentation to production scale, but its computational footprint blocks latency-sensitive applications. EfficientSAM3 progressively distills SAM3 into lightweight architectures that maintain PCS quality on edge devices.

We employ a three-stage curriculum: (1) encoder distillation on SA-1B with prompt-in-the-loop supervision, (2) temporal memory distillation on SA-V using a compact Perceiver module, and (3) end-to-end fine-tuning on official SAM3 concept segmentation data. The resulting students deliver real-time segmentation, tracking, and prompt handling on resource-constrained platforms.

Three-Stage Progressive Distillation

Stage 1 · Compact Encoder

Align nine student backbones (RepViT, TinyViT, EfficientViT) with the SAM3 encoder using SA-1B and prompt-in-the-loop supervision.
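
As a rough illustration of what "aligning" a student backbone with the teacher encoder can mean, the sketch below computes a mean-squared-error loss between student and teacher feature vectors. The loss choice is an assumption (this page does not state which objective Stage 1 uses), the prompt-in-the-loop part is not modeled, and `mse_distillation_loss` is a hypothetical helper written in plain Python rather than the project's training code.

```python
def mse_distillation_loss(student_feats, teacher_feats):
    """Mean squared error between two equally shaped lists of feature vectors.

    Assumption: the student is trained to reproduce the frozen teacher's
    embeddings; the real Stage 1 objective may differ.
    """
    if len(student_feats) != len(teacher_feats):
        raise ValueError("student and teacher must have the same number of tokens")
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student_feats, teacher_feats):
        if len(s_vec) != len(t_vec):
            raise ValueError("feature dimensions must match")
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count


# Toy features: two tokens with two channels each.
teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[0.5, 0.0], [0.0, 0.5]]
loss = mse_distillation_loss(student, teacher)
print(f"distillation loss: {loss:.3f}")
```

In a real training loop the teacher features would come from the frozen SAM3 encoder and the loss would be backpropagated through the student only.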

Stage 2 · Temporal Memory

Compress SAM3's dense video memory into a Perceiver-based module distilled on SA-V, enabling efficient multi-frame reasoning.
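
The core idea of a Perceiver-style module is that a small, fixed set of latent vectors reads from a much longer token sequence via cross-attention, so downstream cost depends on the number of latents rather than the number of memory tokens. The sketch below shows one such read step in plain Python. It is a simplified illustration under stated assumptions: there are no learned query/key/value projections, the latents are fixed toy values, and `compress_memory` is a hypothetical name, not the module used in EfficientSAM3.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def compress_memory(latents, memory_tokens):
    """One cross-attention read: each latent attends over all memory tokens.

    Returns one output vector per latent, so N memory tokens are
    summarized into len(latents) vectors.
    """
    dim = len(latents[0])
    scale = 1.0 / math.sqrt(dim)
    compressed = []
    for query in latents:
        weights = softmax([dot(query, key) * scale for key in memory_tokens])
        out = [
            sum(w * tok[d] for w, tok in zip(weights, memory_tokens))
            for d in range(dim)
        ]
        compressed.append(out)
    return compressed


# Six toy memory tokens compressed into two latent slots.
MEMORY_TOKENS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0], [2.0, 1.0]]
LATENTS = [[1.0, 0.0], [0.0, 1.0]]
compressed = compress_memory(LATENTS, MEMORY_TOKENS)
print(len(MEMORY_TOKENS), "tokens ->", len(compressed), "latents")
```

Because each output is a softmax-weighted average of the memory tokens, the compressed memory has a fixed size regardless of how many frames have been stored.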

Stage 3 · Promptable PCS

Jointly fine-tune encoder, memory, and decoder on SAM3 data to preserve promptable concept segmentation quality.

tl;dr: Stage 1 distills encoder on SAM1 data · Stage 2 aligns memory on SAM2 data · Stage 3 fine-tunes PCS on SAM3 data.

Get Started

Installation

pip install -e ".[stage1]"

See the installation guide for full setup instructions.

Quick Start

from sam3.model_builder import build_efficientsam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor
from PIL import Image

# Load EfficientSAM3 TV-M model (uses TinyViT vision encoder + MobileCLIP-S0 text encoder)
model = build_efficientsam3_image_model(
    checkpoint_path="efficientsam3_tinyvit.pt",
    backbone_type="tinyvit",
    model_name="11m",
    text_encoder_type="MobileCLIP-S0",
    text_encoder_context_length=16,
    load_from_HF=False,
)

# Process image
processor = Sam3Processor(model)
image = Image.open("your_image.jpg").convert("RGB")
state = processor.set_image(image)

# Text prompt segmentation
state = processor.set_text_prompt("dog", state)

# Get masks
masks = state["masks"]
scores = state["scores"]
print(f"Found {len(masks)} masks")

See Project README for more examples including SAM3-LiteText.
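
The Quick Start returns parallel `masks` and `scores`. A common follow-up is to keep only confident detections and sort them by score. The helper below is a minimal sketch of that step: `filter_by_score` and the 0.5 threshold are illustrative assumptions, not part of the project's API, and the processor may already apply its own confidence threshold. It only assumes the two sequences can be iterated together and that each score converts with `float()`.

```python
def filter_by_score(masks, scores, threshold=0.5):
    """Pair each mask with its score, drop low-confidence ones, sort best-first.

    Hypothetical helper: works on any two parallel sequences whose
    score entries can be converted with float().
    """
    kept = [(mask, float(score)) for mask, score in zip(masks, scores)
            if float(score) >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept


# Stand-in values; in practice pass state["masks"] and state["scores"].
demo_masks = ["mask_a", "mask_b", "mask_c"]
demo_scores = [0.77, 0.32, 0.91]
for mask, score in filter_by_score(demo_masks, demo_scores):
    print(f"{mask}: {score:.2f}")
```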

EfficientSAM3 Model Zoo & Weight Release

EfficientSAM3 Full Models (Lightweight Image + Text Encoders)

EfficientSAM3 compresses both SAM3's vision encoder and text encoder into lightweight student models while maintaining competitive performance on downstream benchmarks.

Model   Vision   Text    Transformer   Other   Params   vs ImageSAM3   Download
EV-M    22.2M    42.5M   21.0M         3.5M    89.2M    90% smaller    HF
RV-M    25.6M    42.5M   21.0M         3.5M    92.7M    89% smaller    HF
TV-M    28.3M    42.5M   21.0M         3.5M    95.3M    89% smaller    HF

Note: "Text" is the distilled text encoder. "Transformer" is the mask decoder. "Other" includes the segmentation head and scoring. For comparison, ImageSAM3 has Vision 463M + Text 354M + Transformer 30.3M + Other 14.2M = 861.5M parameters.
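
The "vs ImageSAM3" column can be reproduced from the per-component counts. The sketch below sums each model's components and compares the total against ImageSAM3. All numbers are copied from the table and note above; the dictionary layout and function names are illustrative. Because the inputs are rounded to 0.1M, a recomputed total can differ from the listed one in the last digit (the RV-M components sum to 92.6M against a listed 92.7M).

```python
# Parameter counts in millions, copied from the table and note above.
IMAGE_SAM3 = {"vision": 463.0, "text": 354.0, "transformer": 30.3, "other": 14.2}

STUDENTS = {
    "EV-M": {"vision": 22.2, "text": 42.5, "transformer": 21.0, "other": 3.5},
    "RV-M": {"vision": 25.6, "text": 42.5, "transformer": 21.0, "other": 3.5},
    "TV-M": {"vision": 28.3, "text": 42.5, "transformer": 21.0, "other": 3.5},
}


def total_params(parts):
    """Sum the per-component parameter counts (in millions)."""
    return sum(parts.values())


def percent_smaller(student_total, teacher_total):
    """Relative size reduction of the student, in percent."""
    return 100.0 * (1.0 - student_total / teacher_total)


teacher_total = total_params(IMAGE_SAM3)
for name, parts in STUDENTS.items():
    total = total_params(parts)
    print(f"{name}: {total:.1f}M, {percent_smaller(total, teacher_total):.1f}% smaller")
```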

SAM3-LiteText Models (Lightweight Text Encoder Only)

SAM3-LiteText keeps the SAM3 vision encoder but replaces the text encoder with lightweight MobileCLIP variants.

Model            Vision   Text     Transformer   Other   Params   vs ImageSAM3   Download
LiteText-S0-16   463.0M   42.5M    30.3M         14.2M   550.0M   36% smaller    HF
LiteText-S0-32   463.0M   42.5M    30.3M         14.2M   550.0M   36% smaller    HF
LiteText-S1-16   463.0M   63.5M    30.3M         14.2M   571.0M   34% smaller    HF
LiteText-S1-32   463.0M   63.5M    30.3M         14.2M   571.0M   34% smaller    HF
LiteText-L-16    463.0M   123.8M   30.3M         14.2M   631.3M   27% smaller    HF
LiteText-L-32    463.0M   123.8M   30.3M         14.2M   631.3M   27% smaller    HF

Note: "Text" is the distilled text encoder (42.5M-123.8M). SAM3-LiteText keeps SAM3's ViT-H vision encoder (~463M) but replaces the text encoder. "Other" includes geometry encoder + segmentation head + scoring.
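
The LiteText figures show two different reductions: the text encoder itself shrinks a lot, but the whole model shrinks much less because the unchanged vision encoder dominates the parameter budget. The sketch below makes that explicit using the rounded values from the table above and the 354M SAM3 text encoder size stated in the Highlights; the variable names are illustrative, and results computed from rounded inputs may differ slightly from figures quoted elsewhere on this page.

```python
SAM3_TEXT_ENCODER_M = 354.0
# Components LiteText leaves unchanged: vision + transformer + other (millions).
SHARED_M = 463.0 + 30.3 + 14.2
IMAGE_SAM3_TOTAL_M = SHARED_M + SAM3_TEXT_ENCODER_M

# Distilled text encoder sizes per MobileCLIP variant (millions).
LITE_TEXT_ENCODERS_M = {"S0": 42.5, "S1": 63.5, "L": 123.8}


def reduction_pct(small, large):
    """How much smaller `small` is than `large`, in percent."""
    return 100.0 * (1.0 - small / large)


for variant, text_m in LITE_TEXT_ENCODERS_M.items():
    encoder_cut = reduction_pct(text_m, SAM3_TEXT_ENCODER_M)
    model_cut = reduction_pct(SHARED_M + text_m, IMAGE_SAM3_TOTAL_M)
    print(f"{variant}: text encoder {encoder_cut:.1f}% smaller, "
          f"full model {model_cut:.1f}% smaller")
```

This is also why the EfficientSAM3 full models, which replace the vision encoder as well, reach far larger overall reductions.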

Stage 1 Encoder Weights (for training)

Stage 1 distilled image encoder and text encoder weights for training. See README_stage1.md for details.

Browse all Stage 1 weights on HuggingFace | Training guide

Datasets

Dataset preparation scripts for COCO, DAVIS, LVIS, SA-1B, SA-V, LVOS, MOSE, and YouTube-VOS are located under data/download_*.sh. Refer to README_dataset.md for detailed instructions.

Export & Deployment

ONNX and CoreML export pipelines are under development to unlock mobile and cross-platform deployment. Follow the repository issues for progress updates.

Roadmap

  • ✓ Completed: Release Stage 1 image encoder weights (distilled from SAM3 image encoder)
  • ✓ Completed: Release Stage 1 text encoder weights (distilled from SAM3 text encoder to MobileCLIP-S1 combined with all 9 image encoder variants)
  • ✓ Completed: Release Stage 1+ fine-tuned (ft) encoder weights (geometry-prompt fine-tuning; SA-Co Gold+Silver text fine-tuning)
  • ✓ Completed: Release SAM3-LiteText weights (distilled lightweight text encoder competitive with SAM3 text encoder)
  • ✓ Completed: Release Stage 3 fine-tuned PCS models (2026/06/11)
  • Planned: Release Stage 2 memory bank aligned models
  • Planned: ONNX/CoreML export
  • Planned: Interactive web demo

Call for Contributions

We welcome pull requests across the ecosystem:

  • Efficient MedSAM3 integration and medical datasets
  • Gradio demos, Vercel deployments, and Hugging Face Spaces
  • Annotation tool support (X-AnyLabeling, AnyLabeling)
  • iOS, Android, and NVCC-based desktop applications

Users

Organizations and projects using EfficientSAM3:

Note: If you're using EfficientSAM3 in your work, please acknowledge us in your publications or projects. We're happy to promote your work here! Contact us to be featured in this section.

Citation

@misc{zeng2025efficientsam3progressivehierarchicaldistillation,
      title={EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3}, 
      author={Chengxi Zeng and Yuxuan Jiang and Aaron Zhang},
      year={2025},
      eprint={2511.15833},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.15833}, 
}

@misc{zeng2026sam3litetextanatomicalstudysam3,
      title={SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation}, 
      author={Chengxi Zeng and Yuxuan Jiang and Ge Gao and Shuai Wang and Duolikun Danier and Bin Zhu and Stevan Rudinac and David Bull and Fan Zhang},
      year={2026},
      eprint={2602.12173},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.12173}, 
}