Input Video length constraints
Are there any limits on the length of the input video that can be provided to SmolVLM2 2.2B?
What is the maximum video length it can handle?
There is a limit of 64 frames.
SmolVLM2 samples frames at 1 FPS, up to a maximum of 64 frames.
If the video is longer than 64 seconds, it samples 64 evenly spaced frames instead.
@mfarre 1 FPS is a bit low for my use case. Is there an option to increase the sampling rate? Alternatively, could I sample the frames manually and use multi-image inference?
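The sampling rule described above can be sketched in a few lines. This is only an illustration of the stated behavior (sample at a target rate, cap at a maximum, spread evenly when capped), not the processor's actual code; the function name and its parameters are hypothetical.

```python
def sample_frame_indices(total_frames, native_fps, target_fps=1.0, max_frames=64):
    """Return the indices of the frames to keep (illustrative, not library code)."""
    if total_frames <= 0 or native_fps <= 0:
        return []
    duration_s = total_frames / native_fps
    # Frames wanted at the target rate, capped by max_frames and by what exists.
    wanted = max(1, int(round(duration_s * target_fps)))
    n = min(wanted, max_frames, total_frames)
    if n == 1:
        return [0]
    # Evenly spaced indices from the first frame to the last frame.
    step = (total_frames - 1) / (n - 1)
    return [int(round(i * step)) for i in range(n)]


# A 30 s clip at 30 FPS stays under the cap; a 75 s clip hits it.
short_clip = sample_frame_indices(total_frames=900, native_fps=30.0)
long_clip = sample_frame_indices(total_frames=2250, native_fps=30.0)
print(len(short_clip), len(long_clip))
```

Under these assumptions, a clip shorter than 64 seconds yields one frame per second, while a longer clip always yields 64 frames with a spacing wider than one second.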
@j0yk1ll
You can adjust the FPS like this:
    import torch

    # Assumes `processor` and `model` are already loaded,
    # and `fps` is set to the sampling rate you want.
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "video", "path": "path_to_video.mp4", "target_fps": fps},
                {"type": "text", "text": "Describe this video in detail"}
            ]
        },
    ]
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device, dtype=torch.bfloat16)
    generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
    generated_texts = processor.batch_decode(
        generated_ids,
        skip_special_tokens=True,
    )
    print(generated_texts[0])
Best,
Orr
Thanks for the awesome models!
I am trying to use the snippet above to control the number of frames the model can access.
To do this, I first calculate the desired FPS using:
    import cv2

    def get_desired_fps_given_video_path_and_num_frames(video_path, num_frames):
        # Read the native frame rate and total frame count from the video.
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS)
        num_frames_in_video = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        cap.release()
        # Equivalent to num_frames / duration_in_seconds.
        return fps * num_frames / num_frames_in_video
and then control SmolVLM2 with:
    processor.image_processor.video_sampling["max_frames"] = args.num_frames
    processor.image_processor.video_sampling["fps"] = desired_fps
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "video", "path": args.clip_path, "target_fps": desired_fps},
                {"type": "text", "text": prompt},
            ],
        },
    ]
However, the log that the model prints when it is called shows that it still seems to be extracting 64 frames:
Desired FPS: 0.21333333333333335
User Prompt:
"User: You are provided the following series of sixty-four frames from a 0:01:15 [H:MM:SS] video.
Frame from 00:01:
Frame from 00:02:
Frame from 00:03:
Frame from 00:04:
Frame from 00:05:
Frame from 00:06:
Frame from 00:07:
Frame from 00:09:
Frame from 00:10:
Frame from 00:11:
Frame from 00:12:
Frame from 00:13:
Frame from 00:14:
Frame from 00:16:
Frame from 00:17:
Frame from 00:18:
Frame from 00:19:
Frame from 00:20:
Frame from 00:21:
Frame from 00:23:
Frame from 00:24:
Frame from 00:25:
Frame from 00:26:
Frame from 00:27:
Frame from 00:28:
Frame from 00:29:
Frame from 00:31:
Frame from 00:32:
Frame from 00:33:
Frame from 00:34:
Frame from 00:35:
Frame from 00:36:
Frame from 00:38:
Frame from 00:39:
Frame from 00:40:
Frame from 00:41:
Frame from 00:42:
Frame from 00:43:
Frame from 00:45:
Frame from 00:46:
Frame from 00:47:
Frame from 00:48:
Frame from 00:49:
Frame from 00:50:
Frame from 00:51:
Frame from 00:53:
Frame from 00:54:
Frame from 00:55:
Frame from 00:56:
Frame from 00:57:
Frame from 00:58:
Frame from 01:00:
Frame from 01:01:
Frame from 01:02:
Frame from 01:03:
Frame from 01:04:
Frame from 01:05:
Frame from 01:07:
Frame from 01:08:
Frame from 01:09:
Frame from 01:10:
Frame from 01:11:
Frame from 01:12:
Frame from 01:14:
Do you have any suggestions on where I might be going wrong?
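As a sanity check on the numbers in the log above: the formula fps * num_frames / num_frames_in_video reduces to num_frames divided by the duration in seconds, so a desired FPS of about 0.2133 on a 0:01:15 video corresponds to 16 frames over 75 seconds. The value 16 is inferred from that arithmetic rather than stated in the post, and the 30 FPS native rate below is an assumption used only for illustration. A minimal sketch without OpenCV:

```python
def desired_fps(native_fps, total_frames, num_frames):
    """Sampling rate that yields num_frames over the whole clip."""
    duration_s = total_frames / native_fps
    return num_frames / duration_s


# Assumed example: a 75 s clip recorded at 30 FPS, 16 frames wanted.
rate = desired_fps(native_fps=30.0, total_frames=2250, num_frames=16)
print(rate)
```

The native frame rate cancels out, so any value gives the same result for a clip of the same duration.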
In general, the SmolVLM processor will not sample more than max_frames frames. It samples at target_fps only if the resulting number of frames is less than max_frames.
Therefore, you only really need to set max_frames.
Just do:
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "video", "path": args.clip_path, "max_frames": max_frames},
                {"type": "text", "text": prompt},
            ],
        },
    ]
I don't think you need to do:
    processor.image_processor.video_sampling["max_frames"] = args.num_frames
    processor.image_processor.video_sampling["fps"] = desired_fps
Thanks for the super quick reply! I ran with setting the "max_frames" parameter.
However, I still get the same output:
User Prompt:
"User: You are provided the following series of sixty-four frames from a 0:01:15 [H:MM:SS] video.
[... the same 64 "Frame from MM:SS:" lines as in the log above, from 00:01 through 01:14 ...]
OK, I tested it: max_frames should not be passed inside messages, but as an argument to apply_chat_template:
    max_frames = 2
    inputs = processor.apply_chat_template(
        messages,
        max_frames=max_frames,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device, dtype=torch.bfloat16)
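One way to confirm that the cap took effect is to count the frame markers in the rendered prompt. The helper below is a hypothetical sketch that assumes the prompt uses the "Frame from MM:SS:" format shown in the logs earlier in this thread; how you obtain the prompt text (for example by decoding the input IDs) depends on your setup.

```python
import re

# Matches markers such as "Frame from 01:14:" and captures minutes and seconds.
FRAME_MARKER = re.compile(r"Frame from (\d+):(\d{2}):")


def frame_timestamps(prompt_text):
    """Return the timestamp, in seconds, of every frame marker in the prompt."""
    return [int(m) * 60 + int(s) for m, s in FRAME_MARKER.findall(prompt_text)]


# Example prompt in the format seen in the thread, with two frames.
example = (
    "User: You are provided the following series of two frames "
    "from a 0:01:15 [H:MM:SS] video.\n"
    "Frame from 00:01:\n"
    "Frame from 01:14:\n"
)
stamps = frame_timestamps(example)
print(len(stamps), stamps)
```

If the number of timestamps equals the max_frames you passed, the setting was picked up; if it is still 64, the argument is probably being ignored.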
This works! Thanks a lot for your swift responses! I appreciate it.