✓ SAM3-LiteText released! (88% fewer text params)

EfficientSAM3: Progressive Hierarchical Knowledge Distillation from SAM1, SAM2 & SAM3

Chengxi Simon Zeng, Yuxuan Jiang, Ge Gao, Shuai Wang, Fan Aaron Zhang · Visual Information Lab, University of Bristol; MultiX Lab, University of Amsterdam

Why EfficientSAM3?

SAM3 delivers Promptable Concept Segmentation (PCS) by combining semantic understanding and temporal tracking, yet its massive backbone and dense memory bank make on-device deployment impractical. EfficientSAM3 compresses SAM1, SAM2, and SAM3 into a family of lightweight student models tailored for edge hardware without sacrificing PCS quality.

EfficientSAM3 architecture diagram

Updates

  • 2026-01-11: Stage 1 geometry-prompt fine-tuned (ft) weights released/updated (image encoders fine-tuned on 1% of SA-1B; text encoders fine-tuned on SA-Co Gold+Silver).
  • 2025-12-08: Stage 1 text encoder weights released for all three variants (MobileCLIP-S0, MobileCLIP-S1, and MobileCLIP2-L), distilled on 1% of the Recap-DataComp-1B dataset.
  • 2025-12-02: Stage 1 image encoder weights released for all nine variants (RepViT, TinyViT, and EfficientViT families), distilled on 1% of the SA-1B dataset. Available via Google Drive and Hugging Face.
  • 2025-10-18: Project announced.

Highlights

  • Image encoders distilled into RepViT, TinyViT, and EfficientViT families.
  • Text encoders distilled into MobileCLIP variants (up to 87.96% smaller than SAM3's 354M text encoder), with SA-Co Gold+Silver fine-tuned (ft) weights.
  • Perceiver-based memory compression aligned with SAM2 temporal tracking.
  • ONNX/CoreML support for real-time mobile, embedded, and desktop deployment.

Abstract

SAM3 brought promptable concept segmentation to production scale, but its computational footprint blocks latency-sensitive applications. EfficientSAM3 progressively distills SAM3 into lightweight architectures that maintain PCS quality on edge devices.

We employ a three-stage curriculum: (1) encoder distillation on SA-1B with prompt-in-the-loop supervision, (2) temporal memory distillation on SA-V using a compact Perceiver module, and (3) end-to-end fine-tuning on official SAM3 concept segmentation data. The resulting students deliver real-time segmentation, tracking, and prompt handling on resource-constrained platforms.

Three-Stage Progressive Distillation

Stage 1 · Compact Encoder

Align nine student backbones (RepViT, TinyViT, EfficientViT) with the SAM3 encoder using SA-1B and prompt-in-the-loop supervision.
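The Stage 1 objective can be sketched as a feature-matching loss: the student's feature map is projected up to the teacher's channel width and compared point-wise. The snippet below is an illustrative NumPy stand-in, not the actual training code; all shapes, channel counts, and the random projection are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Teacher (SAM3) and student (e.g. EfficientViT-B0) feature maps for one image.
# Spatial size and channel widths are illustrative, not the real model dims.
teacher_feat = rng.standard_normal((64, 64, 256))   # H x W x C_teacher
student_feat = rng.standard_normal((64, 64, 128))   # H x W x C_student

# A (learned, here random) linear projection lifts the student's channels to
# the teacher's width so the two maps can be compared point-wise.
proj = rng.standard_normal((128, 256)) / np.sqrt(128)

def distill_loss(student, teacher, proj):
    """Mean-squared feature-matching loss between projected student and teacher."""
    projected = student @ proj            # H x W x C_teacher
    return float(np.mean((projected - teacher) ** 2))

loss = distill_loss(student_feat, teacher_feat, proj)
print(loss)
```

In real training the projection is learned jointly with the student, and prompt-in-the-loop supervision adds mask-level terms on top of this feature alignment.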

Stage 2 · Temporal Memory

Compress SAM3's dense video memory into a Perceiver-based module distilled on SA-V, enabling efficient multi-frame reasoning.
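A minimal NumPy sketch of the Perceiver-style read that Stage 2 distills: a small set of latent queries cross-attends over the flattened memory bank, so downstream cost scales with the latent count rather than the memory length. The sizes and the single-head attention are illustrative assumptions, not the module's real configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Dense memory bank: T frames x N tokens flattened into one long sequence.
T, N, D = 8, 256, 64                     # illustrative sizes
memory = rng.standard_normal((T * N, D))  # 2048 memory tokens

# A small set of (learned, here random) latent queries cross-attends to the
# full memory, compressing 2048 tokens down to K latents.
K = 32
latents = rng.standard_normal((K, D))

def perceiver_compress(latents, memory):
    """Single cross-attention read: latents attend over the dense memory."""
    scale = 1.0 / np.sqrt(latents.shape[-1])
    attn = softmax(latents @ memory.T * scale)   # K x (T*N) attention weights
    return attn @ memory                          # K x D compressed memory

compressed = perceiver_compress(latents, memory)
print(compressed.shape)   # (32, 64)
```

The distillation target is then the teacher's behaviour when reading from the full memory, with the student reading from the K compressed latents instead.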

Stage 3 · Promptable PCS

Jointly fine-tune encoder, memory, and decoder on SAM3 data to preserve promptable concept segmentation quality.

tl;dr: Stage 1 distills the encoder on SAM1 data (SA-1B) · Stage 2 aligns memory on SAM2 data (SA-V) · Stage 3 fine-tunes PCS on SAM3 data.

Get Started

Installation

pip install -e ".[stage1]"

See the installation guide for full setup instructions.

Quick Start

# Image prompt
from PIL import Image

from sam3.model_builder import build_efficientsam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Load model
model = build_efficientsam3_image_model(
    checkpoint_path="efficient_sam3_efficientvit_s.pt",
    backbone_type="efficientvit",
    model_name="b0",
    enable_inst_interactivity=True,
)

# Load an input image (any RGB image; the path here is a placeholder)
image = Image.open("example.jpg")

# Process image and predict
processor = Sam3Processor(model)
inference_state = processor.set_image(image)

# Single positive point prompt (x, y) in pixels
points = [[image.size[0] / 2, image.size[1] / 2]]
labels = [1]
masks, scores, _ = model.predict_inst(
    inference_state,
    point_coords=points,
    point_labels=labels,
)

# Text prompt
from PIL import Image

from sam3.model_builder import build_efficientsam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Load model with text encoder
model = build_efficientsam3_image_model(
    checkpoint_path="efficient_sam3_tinyvit_m_mobileclip_s1.pt",
    backbone_type="tinyvit",
    model_name="11m",
    text_encoder_type="MobileCLIP-S1",
)

# Load an input image (the path here is a placeholder)
image = Image.open("example.jpg")

# Process image and predict with text prompt
processor = Sam3Processor(model)
inference_state = processor.set_image(image)
inference_state = processor.set_text_prompt(prompt="shoe", state=inference_state)
masks = inference_state["masks"]
scores = inference_state["scores"]
print(len(scores), scores)

See image example and text prompt example for details.
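Once predictions come back, a common next step is to keep only the highest-scoring mask. The snippet below assumes masks arrive as a (num_masks, H, W) boolean array with a matching 1-D score array (an assumption about the return format) and uses synthetic masks in that layout:

```python
import numpy as np

# Synthetic stand-ins for the model outputs: three candidate masks with
# confidence scores, in an assumed (num_masks, H, W) boolean layout.
masks = np.zeros((3, 512, 512), dtype=bool)
masks[0, 100:300, 150:350] = True
masks[1, 120:280, 170:330] = True
masks[2, 90:310, 140:360] = True
scores = np.array([0.91, 0.84, 0.77])

# Keep the highest-scoring mask and report its pixel area and image coverage.
best = masks[int(np.argmax(scores))]
area = int(best.sum())
coverage = area / best.size
print(area, round(coverage * 100, 1))   # 40000 15.3
```

The same pattern applies to the text-prompt path, where `inference_state["masks"]` may hold one mask per detected instance of the concept.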

EfficientSAM3 Model Zoo & Weight Release

Stage 1 image encoder weights (distilled from the SAM3 image encoder) and text encoder weights (distilled from the SAM3 text encoder) are available via Google Drive and Hugging Face, along with Stage 1 fine-tuned (ft) weights. Stage 2 and Stage 3 weights are coming soon.

Image Encoder Models

| Model | Backbone | Parameters | Stage 1 | Stage 2 | Stage 3 |
| --- | --- | --- | --- | --- | --- |
| ES-RV-S | RepViT-M0.9 | 4.72M | GDrive / HF | Planned | Planned |
| ES-RV-M | RepViT-M1.1 | 7.77M | GDrive / HF (ft: GDrive, HF) | Planned | Planned |
| ES-RV-L | RepViT-M2.3 | 22.40M | GDrive / HF | Planned | Planned |
| ES-TV-S | TinyViT-5M | 5.07M | GDrive / HF | Planned | Planned |
| ES-TV-M | TinyViT-11M | 10.55M | GDrive / HF (ft: GDrive, HF) | Planned | Planned |
| ES-TV-L | TinyViT-21M | 20.62M | GDrive / HF | Planned | Planned |
| ES-EV-S | EfficientViT-B0 | 0.68M | GDrive / HF | Planned | Planned |
| ES-EV-M | EfficientViT-B1 | 4.64M | GDrive / HF (ft: GDrive, HF) | Planned | Planned |
| ES-EV-L | EfficientViT-B2 | 14.98M | GDrive / HF | Planned | Planned |

Note (2025/12/02): The current Stage 1 image encoder weights are distilled on 1% of the SA-1B dataset.

Note (2026/01/11): The fine-tuned (ft) image models use geometry-prompt fine-tuning on the same 1% subset of SA-1B.

EfficientSAM3 Text Encoder + Image Encoder Models

| Model | Backbone | Parameters | Stage 1 | Stage 2 | Stage 3 |
| --- | --- | --- | --- | --- | --- |
| ES-RV-S-MC-S1 | RepViT-M0.9 & MobileCLIP-S1 | 4.72M + 63.56M | GDrive / HF | Planned | Planned |
| ES-RV-M-MC-S1 | RepViT-M1.1 & MobileCLIP-S1 | 7.77M + 63.56M | GDrive / HF (ft: GDrive, HF) | Planned | Planned |
| ES-RV-L-MC-S1 | RepViT-M2.3 & MobileCLIP-S1 | 22.40M + 63.56M | GDrive / HF | Planned | Planned |
| ES-TV-S-MC-S1 | TinyViT-5M & MobileCLIP-S1 | 5.07M + 63.56M | GDrive / HF | Planned | Planned |
| ES-TV-M-MC-S1 | TinyViT-11M & MobileCLIP-S1 | 10.55M + 63.56M | GDrive / HF (ft: GDrive, HF) | Planned | Planned |
| ES-TV-L-MC-S1 | TinyViT-21M & MobileCLIP-S1 | 20.62M + 63.56M | GDrive / HF | Planned | Planned |
| ES-EV-S-MC-S1 | EfficientViT-B0 & MobileCLIP-S1 | 0.68M + 63.56M | GDrive / HF | Planned | Planned |
| ES-EV-M-MC-S1 | EfficientViT-B1 & MobileCLIP-S1 | 4.64M + 63.56M | GDrive / HF (ft: GDrive, HF) | Planned | Planned |
| ES-EV-L-MC-S1 | EfficientViT-B2 & MobileCLIP-S1 | 14.98M + 63.56M | GDrive / HF | Planned | Planned |

Note (2025/12/08): The current Stage 1 text encoder weights are distilled on 1% of the Recap-DataComp-1B dataset.

Note (2026/01/11): Fine-tuned (ft) text encoder models are fine-tuned on SA-Co Gold+Silver text annotations. Standalone fine-tuned text encoder weights: MobileCLIP-S0, MobileCLIP-S1, and MobileCLIP2-L.

SAM3-LiteText Models

SAM3-LiteText replaces the SAM3 text encoder with a lightweight distilled text encoder, reducing text encoder parameters by up to 88% with comparable performance. See the SAM3-LiteText paper for details.

| Model | Text Encoder | Ctx | Text Params | Weights |
| --- | --- | --- | --- | --- |
| SAM3-LiteText-S0-16 | MobileCLIP-S0 | 16 | 42.54M | GDrive / HF |
| SAM3-LiteText-S1-16 | MobileCLIP-S1 | 16 | 63.53M | GDrive / HF |
| SAM3-LiteText-L-16 | MobileCLIP2-L | 16 | 123.80M | GDrive / HF |

Note: All SAM3-LiteText models use the SAM3 ViT-H image encoder (353.72M vision params). The text encoder parameters shown represent the distilled student replacing the original 353.72M text encoder.

Note (2026/02/18): SAM3-LiteText released! SAM3-LiteText reduces text encoder parameters by up to 88% with similar performance to the original text encoder. Paper available on arXiv.
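The quoted "up to 88%" reduction can be checked with quick arithmetic against the parameter counts in this section (in millions), using the 353.72M teacher figure from the note above:

```python
# Sanity-check the "up to 88%" text-encoder reduction using the parameter
# counts (in millions) listed in the SAM3-LiteText table.
SAM3_TEXT_PARAMS_M = 353.72
litetext_variants = {
    "SAM3-LiteText-S0-16": 42.54,
    "SAM3-LiteText-S1-16": 63.53,
    "SAM3-LiteText-L-16": 123.80,
}

reductions = {
    name: 100.0 * (1.0 - params / SAM3_TEXT_PARAMS_M)
    for name, params in litetext_variants.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% fewer text-encoder params")
```

The S0 variant lands at roughly 88%, which is where the headline figure comes from; the S1 and L variants trade less compression for larger text encoders.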

Datasets

Dataset preparation scripts for COCO, DAVIS, LVIS, SA-1B, SA-V, LVOS, MOSE, and YouTube-VOS are located under data/download_*.sh. Refer to README_dataset.md for detailed instructions.

Export & Deployment

ONNX and CoreML export pipelines are under development to unlock mobile and cross-platform deployment. Follow the repository issues for progress updates.

Roadmap

  • ✓ Completed: Release Stage 1 image encoder weights (distilled from the SAM3 image encoder)
  • ✓ Completed: Release Stage 1 text encoder weights (distilled from the SAM3 text encoder into MobileCLIP-S1, combined with all nine image encoder variants)
  • ✓ Completed: Release Stage 1+ fine-tuned (ft) encoder weights (geometry-prompt fine-tuning; SA-Co Gold+Silver text fine-tuning)
  • ✓ Completed: Release SAM3-LiteText weights (distilled lightweight text encoder competitive with the SAM3 text encoder)
  • Planned: Release Stage 2 memory-bank-aligned models
  • Planned: Release Stage 3 fine-tuned PCS models
  • Planned: ONNX/CoreML export
  • Planned: Interactive web demo

Call for Contributions

We welcome pull requests across the ecosystem:

  • Efficient MedSAM3 integration and medical datasets
  • Gradio demos, Vercel deployments, and Hugging Face Spaces
  • Annotation tool support (X-AnyLabeling, AnyLabeling)
  • iOS, Android, and NVCC-based desktop applications

Users

Organizations and projects using EfficientSAM3:

Note: If you're using EfficientSAM3 in your work, please acknowledge us in your publications or projects. We're happy to promote your work here! Contact us to be featured in this section.

Citation

@misc{zeng2025efficientsam3progressivehierarchicaldistillation,
      title={EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3}, 
      author={Chengxi Zeng and Yuxuan Jiang and Aaron Zhang},
      year={2025},
      eprint={2511.15833},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.15833}, 
}

@misc{zeng2026sam3litetextanatomicalstudysam3,
      title={SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation}, 
      author={Chengxi Zeng and Yuxuan Jiang and Ge Gao and Shuai Wang and Duolikun Danier and Bin Zhu and Stevan Rudinac and David Bull and Fan Zhang},
      year={2026},
      eprint={2602.12173},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.12173}, 
}