Instructions to use HuggingFaceTB/SmolVLM2-2.2B-Instruct with libraries, inference providers, notebooks, and local apps.

Libraries

How to use HuggingFaceTB/SmolVLM2-2.2B-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="HuggingFaceTB/SmolVLM2-2.2B-Instruct")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-2.2B-Instruct")
model = AutoModelForImageTextToText.from_pretrained("HuggingFaceTB/SmolVLM2-2.2B-Instruct")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use HuggingFaceTB/SmolVLM2-2.2B-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "HuggingFaceTB/SmolVLM2-2.2B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
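
The same request can also be sent from Python. The following is a minimal sketch using only the standard library. The helper names build_chat_payload and post_chat are hypothetical (they are not part of vLLM), and the sketch assumes the vLLM server started above is listening on localhost:8000.

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, image_url: str) -> dict:
    """Build an OpenAI-style chat request with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def post_chat(base_url: str, payload: dict, timeout: float = 120.0) -> dict:
    """POST the payload to <base_url>/v1/chat/completions and return the parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_chat_payload(
    "HuggingFaceTB/SmolVLM2-2.2B-Instruct",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)

# With the server running, uncomment to send the request:
# reply = post_chat("http://localhost:8000", payload)
# print(reply["choices"][0]["message"]["content"])
```

Pointing post_chat at http://localhost:30000 should work the same way for the SGLang server, since both expose the same OpenAI-compatible route.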

Use Docker

docker model run hf.co/HuggingFaceTB/SmolVLM2-2.2B-Instruct

SGLang

How to use HuggingFaceTB/SmolVLM2-2.2B-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "HuggingFaceTB/SmolVLM2-2.2B-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "HuggingFaceTB/SmolVLM2-2.2B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "HuggingFaceTB/SmolVLM2-2.2B-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "HuggingFaceTB/SmolVLM2-2.2B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use HuggingFaceTB/SmolVLM2-2.2B-Instruct with Docker Model Runner:
```
docker model run hf.co/HuggingFaceTB/SmolVLM2-2.2B-Instruct
```

Input Video length constraints

by NikhilJoson - opened Feb 21, 2025

Discussion

NikhilJoson

Feb 21, 2025

Are there any limits on the length of the input video that can be provided to SmolVLM2 2.2B?
What is the maximum video length it can handle?

mfarre

Feb 21, 2025

There is a limit of 64 frames.
SmolVLM2 samples frames at 1 FPS, up to a maximum of 64 frames.
If the video is longer than 64 seconds, it samples 64 evenly spaced frames instead.
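
As an illustration of the rule described in this reply, here is a minimal sketch in plain Python. It is not the processor's own code: the function name sample_timestamps is hypothetical, and the exact timestamps the processor picks (offsets, rounding) may differ.

```python
from typing import List


def sample_timestamps(duration_s: float, fps: float = 1.0, max_frames: int = 64) -> List[float]:
    """Return the timestamps (in seconds) at which frames would be taken.

    Sketch of the rule described above, not the library implementation.
    """
    if duration_s <= 0 or fps <= 0 or max_frames <= 0:
        return []
    # Number of frames obtained by sampling at the requested rate.
    wanted = max(1, int(duration_s * fps))
    if wanted <= max_frames:
        # Short video: one frame every 1/fps seconds.
        return [i / fps for i in range(wanted)]
    # Long video: spread exactly max_frames timestamps evenly over the clip.
    step = duration_s / max_frames
    return [i * step for i in range(max_frames)]


short_clip = sample_timestamps(30)  # 30 frames, one per second
long_clip = sample_timestamps(75)   # capped at 64 frames, about 1.17 s apart
print(len(short_clip), len(long_clip))
```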

j0yk1ll

Feb 24, 2025

edited Feb 24, 2025

@mfarre 1 FPS is a bit low for my use case. Is there an option to increase the sampling rate? Alternatively, could I sample the frames manually and use multi-image inference?

orrzohar

Feb 24, 2025

@j0yk1ll
You can adjust the FPS by setting target_fps on the video entry:

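import torch

# Assumes `processor` and `model` are already loaded as in the Transformers snippet above.
fps = 2  # example value; set this to the sampling rate you need

messages = [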
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "path_to_video.mp4", "target_fps": fps},
            {"type": "text", "text": "Describe this video in detail"}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])

Best,
Orr

RuchitRawal

Mar 2, 2025

Thanks for the awesome models!

I am trying to use the above snippet to control the number of frames the model can access.

To do this, I first calculate the desired FPS:

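import cv2

def get_desired_fps_given_video_path_and_num_frames(video_path, num_frames):
    # Read the native frame rate and total frame count from the video metadata.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    num_frames_in_video = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    # Equivalent to num_frames / duration_in_seconds.
    return fps * num_frames / num_frames_in_video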

and then configure SmolVLM2 with:

    processor.image_processor.video_sampling["max_frames"] = args.num_frames
    processor.image_processor.video_sampling["fps"] = desired_fps

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "video", "path": args.clip_path, "target_fps": desired_fps},
                {"type": "text", "text": prompt},
            ],
        },
    ]

However, the prompt that the model logs internally when it is called shows that it is still extracting 64 frames:

Desired FPS: 0.21333333333333335
User Prompt:
"User: You are provided the following series of sixty-four frames from a 0:01:15 [H:MM:SS] video.

Frame from 00:01:
Frame from 00:02:
Frame from 00:03:
Frame from 00:04:
Frame from 00:05:
Frame from 00:06:
Frame from 00:07:
Frame from 00:09:
Frame from 00:10:
Frame from 00:11:
Frame from 00:12:
Frame from 00:13:
Frame from 00:14:
Frame from 00:16:
Frame from 00:17:
Frame from 00:18:
Frame from 00:19:
Frame from 00:20:
Frame from 00:21:
Frame from 00:23:
Frame from 00:24:
Frame from 00:25:
Frame from 00:26:
Frame from 00:27:
Frame from 00:28:
Frame from 00:29:
Frame from 00:31:
Frame from 00:32:
Frame from 00:33:
Frame from 00:34:
Frame from 00:35:
Frame from 00:36:
Frame from 00:38:
Frame from 00:39:
Frame from 00:40:
Frame from 00:41:
Frame from 00:42:
Frame from 00:43:
Frame from 00:45:
Frame from 00:46:
Frame from 00:47:
Frame from 00:48:
Frame from 00:49:
Frame from 00:50:
Frame from 00:51:
Frame from 00:53:
Frame from 00:54:
Frame from 00:55:
Frame from 00:56:
Frame from 00:57:
Frame from 00:58:
Frame from 01:00:
Frame from 01:01:
Frame from 01:02:
Frame from 01:03:
Frame from 01:04:
Frame from 01:05:
Frame from 01:07:
Frame from 01:08:
Frame from 01:09:
Frame from 01:10:
Frame from 01:11:
Frame from 01:12:
Frame from 01:14:

Do you have any suggestions on where I might be going wrong?
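
For reference, the FPS helper in this post reduces to num_frames divided by the clip duration, so the logged value can be checked by hand. The sketch below assumes a 30 FPS clip with 2250 frames (75 seconds) and a request for 16 frames. None of these numbers are stated in the post; they are inferred from the logged "Desired FPS" and the 0:01:15 duration. The function name desired_fps is hypothetical.

```python
def desired_fps(native_fps: float, total_frames: int, num_frames: int) -> float:
    """Same formula as the helper above, without the OpenCV dependency."""
    return native_fps * num_frames / total_frames


# Hypothetical clip: 30 FPS and 2250 frames, i.e. 75 seconds.
fps_value = desired_fps(native_fps=30.0, total_frames=2250, num_frames=16)
duration_s = 2250 / 30.0

print(fps_value)               # about 0.2133, matching the logged value
print(fps_value * duration_s)  # about 16 frames expected at that rate
```

Under these assumptions the requested rate would yield about 16 frames, well below 64, so a prompt listing sixty-four frames suggests the setting was not being applied at all.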

orrzohar

Mar 2, 2025

edited Mar 2, 2025

In general, the SmolVLM processor will not sample more than max_frames frames. It samples at target_fps only if the resulting number of frames is less than max_frames.

Therefore, you only really need to set max_frames.

Just do:

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "video", "path": args.clip_path, "max_frames": max_frames},
                {"type": "text", "text": prompt},
            ],
        },
    ]

I don't think you need to do:

    processor.image_processor.video_sampling["max_frames"] = args.num_frames
    processor.image_processor.video_sampling["fps"] = desired_fps
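
The interaction between target_fps and max_frames described at the start of this reply can be sketched as follows. This is an illustration of the stated rule only, with a hypothetical function name (planned_frame_count); the processor's actual rounding may differ.

```python
def planned_frame_count(duration_s: float, target_fps: float, max_frames: int) -> int:
    """Number of frames implied by the rule: sample at target_fps, capped at max_frames."""
    by_rate = int(duration_s * target_fps)
    return max(1, min(by_rate, max_frames))


# 75-second clip at the default 1 FPS: the 64-frame cap applies.
print(planned_frame_count(75, 1.0, 64))
# Raising the rate alone changes nothing while the cap is still 64.
print(planned_frame_count(75, 4.0, 64))
# Lowering the cap is enough to reduce the frame count.
print(planned_frame_count(75, 1.0, 2))
```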

RuchitRawal

Mar 2, 2025

Thanks for the super quick reply! I ran it again with the "max_frames" parameter set.

However, I still get the same output:

User Prompt:
"User: You are provided the following series of sixty-four frames from a 0:01:15 [H:MM:SS] video.

Frame from 00:01:
Frame from 00:02:
Frame from 00:03:
Frame from 00:04:
Frame from 00:05:
Frame from 00:06:
Frame from 00:07:
Frame from 00:09:
Frame from 00:10:
Frame from 00:11:
Frame from 00:12:
Frame from 00:13:
Frame from 00:14:
Frame from 00:16:
Frame from 00:17:
Frame from 00:18:
Frame from 00:19:
Frame from 00:20:
Frame from 00:21:
Frame from 00:23:
Frame from 00:24:
Frame from 00:25:
Frame from 00:26:
Frame from 00:27:
Frame from 00:28:
Frame from 00:29:
Frame from 00:31:
Frame from 00:32:
Frame from 00:33:
Frame from 00:34:
Frame from 00:35:
Frame from 00:36:
Frame from 00:38:
Frame from 00:39:
Frame from 00:40:
Frame from 00:41:
Frame from 00:42:
Frame from 00:43:
Frame from 00:45:
Frame from 00:46:
Frame from 00:47:
Frame from 00:48:
Frame from 00:49:
Frame from 00:50:
Frame from 00:51:
Frame from 00:53:
Frame from 00:54:
Frame from 00:55:
Frame from 00:56:
Frame from 00:57:
Frame from 00:58:
Frame from 01:00:
Frame from 01:01:
Frame from 01:02:
Frame from 01:03:
Frame from 01:04:
Frame from 01:05:
Frame from 01:07:
Frame from 01:08:
Frame from 01:09:
Frame from 01:10:
Frame from 01:11:
Frame from 01:12:
Frame from 01:14:

orrzohar

Mar 2, 2025

OK, I tested it: max_frames should not be passed inside messages, but as an argument to apply_chat_template:

max_frames = 2

inputs = processor.apply_chat_template(
    messages,
    max_frames=max_frames,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)
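
To keep the two pieces apart, one option is to build the messages and the apply_chat_template keyword arguments separately. This is a minimal sketch: build_video_request is a hypothetical helper, not part of Transformers, and the video path and prompt are placeholders taken from the earlier snippet in this thread.

```python
def build_video_request(video_path: str, prompt: str, max_frames: int):
    """Return (messages, template_kwargs) for processor.apply_chat_template."""
    if max_frames < 1:
        raise ValueError("max_frames must be at least 1")
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "video", "path": video_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]
    template_kwargs = {
        # Passed to apply_chat_template, not placed inside messages.
        "max_frames": max_frames,
        "add_generation_prompt": True,
        "tokenize": True,
        "return_dict": True,
        "return_tensors": "pt",
    }
    return messages, template_kwargs


messages, template_kwargs = build_video_request(
    "path_to_video.mp4", "Describe this video in detail", max_frames=2
)

# With `processor`, `model`, and `torch` available as in the snippets above:
# inputs = processor.apply_chat_template(messages, **template_kwargs).to(
#     model.device, dtype=torch.bfloat16
# )
```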

RuchitRawal

Mar 2, 2025

This works! Thanks a lot for your swift responses! I appreciate it.
