When it comes to AI and robotics, you’d probably get a more reliable answer by asking on the LeRobot Discord.
I’ve heard that on a Jetson, it’s not a good idea to allocate all the VRAM to an LLM. If that’s the case, a 7B model might be a bit too large.
Yes. But the practical answer is not “find a magical 7B–9B model that suddenly becomes fast on NX.” The practical answer is to use a smaller local model as the supervisor, keep context short, use retrieval for manuals/logs/SOPs, and only invoke a vision model when you actually need image understanding. Current Jetson deployment guidance and current small-model research both point in that direction. The biggest constraint is usually memory bandwidth and KV cache, not just raw parameter count. (Jetson AI Lab)
Why your 7B–9B tests felt bad
On Jetson, a model can “fit” and still be a bad deployment. The weights load first, then the remaining memory is consumed by runtime overhead and KV cache. Longer prompts, longer outputs, and concurrent robotics processes make this worse. NVIDIA’s Jetson benchmarking guide is explicit that the remaining GPU memory after weights is pre-allocated to KV cache, so edge failures often come from context length and runtime configuration, not only from the model family itself. (Jetson AI Lab)
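To see why context length bites so hard, you can estimate the KV cache footprint directly from the model shape. A minimal sketch, assuming the standard per-token KV formula and a Llama-3.1-8B-like GQA configuration (the 32 layers / 8 KV heads / head dim 128 numbers are illustrative assumptions, not taken from any Jetson page):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2, batch: int = 1) -> int:
    """Standard transformer KV cache size: one K and one V tensor
    per layer, per KV head, per cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes * batch

# Llama-3.1-8B-like GQA shape (assumed): 32 layers, 8 KV heads, head_dim 128, fp16
gib = kv_cache_bytes(32, 8, 128, seq_len=8192) / 1024**3  # exactly 1.0 GiB at 8K context
```

Doubling the context doubles this number, which is why capping the maximum context length matters roughly as much as picking a smaller model.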
There is also a second trap: Jetson results can collapse when the fast kernel/backend path is missing, when clocks are not pinned, or when you benchmark through a slower runtime. So published phone or server numbers often do not transfer to Jetson as-is. (Jetson AI Lab)
First, separate Xavier NX from Orin NX
This matters a lot. If by “Jetson NX” you mean Xavier NX, that is a much harder target than Orin NX. NVIDIA says Orin NX delivers up to 5x the performance of Xavier NX, and current Jetson pages position Orin NX as the compact platform for multiple concurrent AI pipelines. Current official Ollama-on-Jetson support lists AGX Orin 64GB, AGX Orin 32GB, Orin NX 16GB, and Orin Nano 8GB. Xavier NX is not on that current support list. (NVIDIA Developer)
That leads to the blunt hardware conclusion:
- Xavier NX: tiny-model territory.
- Orin NX 16GB: practical for a real local agent if you stay disciplined.
- AGX Orin 32GB/64GB: the comfortable option if you want fewer compromises. (NVIDIA Developer)
What is actually realistic by hardware tier
Xavier NX
If your final target is truly Xavier NX-class, I would not plan around dense 7B–9B models. I would treat it as a board for 0.8B–1.5B, maybe 3B only in a stripped-down text-first setup. Anecdotal Xavier NX experience in the llama.cpp community is still consistent with this: one Xavier NX report measured roughly 600 ms/token, which is far too slow for a responsive operator assistant. (GitHub)
Orin NX 16GB
This is the first small Jetson where a local robotics agent starts to make sense. Current Jetson AI Lab model pages show Jetson-ready paths for Llama 3.2 3B, Gemma 3 4B, Qwen3.5 4B, and Llama 3.1 8B, with listed memory footprints of roughly 4GB for each of the 3B–4B models and 8GB for Llama 3.1 8B. Jetson AI Lab also reports 52.58 output tok/s for Llama 3.2 3B and 28.14 output tok/s for Llama 3.1 8B on Jetson Orin with vLLM under their benchmark conditions. (Jetson AI Lab)
So on Orin NX 16GB, the useful rule is:
- 3B–4B is the safe production zone.
- 8B is possible, but with tighter margins.
- 9B multimodal is possible on paper, but it is much less comfortable once the rest of the robot stack is alive. (Jetson AI Lab)
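To get a feel for what those throughput numbers mean for an operator-facing assistant, here is a back-of-the-envelope latency estimate. The 0.5 s prefill figure is an assumption for short structured prompts, not a published benchmark number; the decode rates are the Jetson AI Lab figures quoted above:

```python
def response_latency_s(output_tokens: int, decode_tok_s: float,
                       prefill_s: float = 0.5) -> float:
    """Rough end-to-end latency: assumed prefill time plus decode time."""
    return prefill_s + output_tokens / decode_tok_s

# A 150-token operator summary:
t_3b = response_latency_s(150, 52.58)  # Llama 3.2 3B, ~3.4 s
t_8b = response_latency_s(150, 28.14)  # Llama 3.1 8B, ~5.8 s
```

The gap widens further under real robot load, which is why the 3B–4B zone is the comfortable default here.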
AGX Orin 32GB or 64GB
This is where you can stop fighting every constraint. NVIDIA’s AGX Orin pages list 32GB and 64GB modules, and current Jetson AI Lab pages show larger models explicitly positioned for AGX/Orin-class memory, including GPT OSS 20B (16GB RAM minimum, with AGX Orin as the minimum board) and the Qwen3.5 35B-A3B MoE (20GB RAM, about 30 output tok/s on Jetson Orin in their benchmark). (NVIDIA Developer)
For AGX Orin, that means you can reasonably consider:
- a strong dense 8B text model,
- a larger efficient MoE model,
- or a 20B class model if you truly need more general capability. (Jetson AI Lab)
Models I would actually recommend for your case
Your tasks are not pure free-form chat. They sound like:
- on-site diagnosis,
- operator-ready summaries,
- maybe log/manual interpretation,
- maybe image-grounded inspection.
That is a classic small planner + retrieval + optional vision problem.
Best text-first choices today
Llama 3.2 3B is one of the safest starting points. Jetson AI Lab explicitly positions it as a compact edge model for resource-constrained Jetson deployments, lists 4GB RAM / 2.0GB size, and publishes very strong Jetson Orin benchmark numbers for it. Meta’s model card also explicitly mentions dialogue, retrieval, and summarization use cases. (Jetson AI Lab)
Llama 3.1 8B is the stronger text model to try once you move up to Orin NX 16GB or AGX Orin. Jetson AI Lab lists 8GB RAM / 4.5GB size for the quantized Jetson build and reports 28.14 output tok/s on Jetson Orin with vLLM. (Jetson AI Lab)
SmolLM3-3B is a strong newer candidate outside NVIDIA’s Jetson pages. Hugging Face describes it as a 3B model with dual-mode reasoning, 6 languages, and long context, and positions it as a strong model at the 3B–4B scale. I would treat it as a serious test candidate for Orin NX 16GB. The missing piece is that I have not found Jetson-specific benchmark numbers for it yet, so this is a “worth testing” recommendation, not a “Jetson-proven” one. (Hugging Face)
Phi-4-mini-instruct is also worth testing if your workload leans toward longer logs, procedures, or technical text. Microsoft’s model card describes it as a lightweight open model with 128K context, and Microsoft also provides an ONNX variant aimed at optimized inference. Again, this is promising for edge deployment, but I have not seen Jetson-specific benchmark numbers from NVIDIA for it. (Hugging Face)
LiquidAI LFM2.5-1.2B-Instruct is one of the more interesting newer ultra-compact options. Its Hugging Face card says it is designed for on-device deployment, with 32K context and support for common inference stacks. If you need the lightest possible text planner on an NX-class device, this is one of the few truly modern models I would put high on the shortlist. (Hugging Face)
Best compact multimodal choices
If your agent has to inspect images, you should usually keep the VLM separate and call it only when needed.
Gemma 3 4B is a strong compact multimodal option. Jetson AI Lab lists it at 4GB RAM / 2.5GB size, with text-plus-image input and a 128K context window for the 4B size. (Jetson AI Lab)
Qwen3.5 4B is another strong compact multimodal option. Jetson AI Lab lists 4GB RAM / 2.5GB size, AWQ 4-bit quantization, and specifically calls out multimodal instruction following, visual understanding, and agent-style workloads on Jetson. (Jetson AI Lab)
Qwen3.5 9B is the “bigger multimodal” step up. Jetson AI Lab lists 8GB RAM / 5GB size and tool-calling support, but in your case I would only try it on a roomy Orin NX 16GB setup or AGX Orin, because the rest of the robotics stack will eat into that headroom fast. (Jetson AI Lab)
Gemma 3 1B is the fallback when memory is brutal. Jetson AI Lab lists 2GB RAM / 1.2GB size, and it is multimodal. That makes it useful as a tiny image-aware helper or as a lightweight local assistant on very constrained deployments. (Jetson AI Lab)
The best architecture for your robotics agent
This is the part that matters most.
I would not try to make one large local model do everything. I would build:
- a small text model as the main planner and explainer,
- a retrieval layer for manuals, logs, SOPs, error catalogs, and maintenance data,
- an optional VLM that only runs for image-dependent questions,
- and a deterministic robotics layer below that.
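The dispatch logic for the layers above can be sketched in a few lines. Here `retrieve`, `llm`, and `vlm` are placeholder callables you would wire to your actual retrieval index and serving stack; nothing about their signatures comes from a specific framework:

```python
from typing import Callable, Optional, Sequence

def handle_query(
    question: str,
    image: Optional[bytes],
    retrieve: Callable[[str], Sequence[str]],
    llm: Callable[[str, Sequence[str]], str],
    vlm: Callable[[str, bytes, Sequence[str]], str],
) -> str:
    """Route a question: retrieval always runs; the VLM path is taken
    only when an image is actually attached."""
    context = retrieve(question)               # manuals / logs / SOPs
    if image is not None:
        return vlm(question, image, context)   # image-grounded path, on demand
    return llm(question, context)              # default text-only planner path
```

Keeping the VLM behind that `if` is what lets the text planner own nearly all of the memory budget nearly all of the time.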
In practice, this means the LLM should answer questions like:
- “What does this alarm likely mean?”
- “Summarize the fault for the operator.”
- “Which check should happen next?”
- “Should I inspect the gripper, the vision path, or the fixture first?”
The LLM should not be the thing that continuously drives low-level control behavior. On embedded hardware, the best use of your limited compute budget is usually reasoning over structured state and retrieved facts, not generating long free-form text or directly controlling motion. That design also reduces latency pressure because the model can work on short structured inputs instead of a huge prompt dump. (ACL Anthology)
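One way to keep those inputs short and structured is to assemble the prompt from the robot’s state dict plus a few retrieved snippets rather than dumping raw logs. A minimal sketch; the field names (`alarm`, `axis_3_temp_c`) are illustrative, not from any particular robot stack:

```python
def build_prompt(state: dict, snippets: list, question: str,
                 max_snippets: int = 3) -> str:
    """Assemble a compact prompt from structured state plus a capped
    number of retrieved reference snippets."""
    lines = ["Robot state:"]
    lines += [f"  {k}: {v}" for k, v in state.items()]
    lines.append("Reference snippets:")
    lines += [f"  - {s}" for s in snippets[:max_snippets]]
    lines.append(f"Question: {question}")
    return "\n".join(lines)

prompt = build_prompt(
    {"alarm": "E-STOP 12", "axis_3_temp_c": 71},
    ["E-STOP 12: gripper overcurrent; check cable harness."],
    "What does this alarm likely mean?",
)
```

A prompt like this stays in the low hundreds of tokens, which keeps both prefill cost and KV cache growth small on the Jetson.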
My practical recommendations by board
If you end up on Xavier NX-class hardware
Use a tiny planner and accept that this is not a general LLM workstation.
My shortlist would be:
- LiquidAI LFM2.5-1.2B-Instruct
- Gemma 3 1B
- Qwen3.5 0.8B if you need a very small multimodal helper
- maybe Llama 3.2 3B for text-only planning if the rest of the system is very lean. (Hugging Face)
If you use Orin NX 16GB
This is the tier I would actually target for a fully local robotics assistant.
My shortlist would be:
- Llama 3.2 3B as the safest starting point,
- SmolLM3-3B as a newer text-first candidate,
- Gemma 3 4B if you want multimodal support in a compact footprint,
- Qwen3.5 4B if image understanding and tool use matter,
- Phi-4-mini-instruct if your workload is heavy on long technical text,
- Llama 3.1 8B only after the smaller models are working well. (Jetson AI Lab)
If you use AGX Orin 32GB or 64GB
Then you can step up without fighting the board all the time.
My shortlist would be:
- Llama 3.1 8B as the strong dense baseline,
- GPT OSS 20B if you want a broader text model and have AGX Orin,
- Qwen3.5 35B-A3B MoE if you want larger-model behavior with more efficient active parameters,
- plus a compact VLM such as Qwen3.5 4B or Gemma 3 4B for image questions. (Jetson AI Lab)
Why I would not center the design on 7B–9B dense models
Because edge deployment is now good enough that small models are often the better engineering choice. The ACL 2025 study on edge deployment found that modern small language models can outperform some 7B models on general tasks, and its measurements on Jetson Orin NX 16GB also showed why decode remains a bottleneck even when the GPU accelerates prefill substantially. That is exactly the pattern you are seeing: the big model does not feel proportionally better once it hits embedded memory and decode limits. (ACL Anthology)
Runtime choice matters more than many people expect
For final deployment, I would favor vLLM over Ollama. NVIDIA’s current Jetson guidance explicitly says Ollama is an easy local entry point but delivers roughly half the peak performance of faster runtimes such as NanoLLM, and their more serious benchmarking/tutorial flow uses vLLM. (Jetson AI Lab)
I would also keep these deployment rules:
- use quantized checkpoints that are already known to run on Jetson,
- keep prompts short,
- keep output length bounded,
- avoid enormous context windows unless the task truly needs them,
- pin clocks and use the correct Jetson-matched software stack,
- and verify that you are not silently on a slow path. (Jetson AI Lab)
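The “keep prompts short, bound the context” rules can be enforced mechanically rather than by discipline alone. A minimal sketch of a retrieval budget guard, using a crude 4-characters-per-token estimate as an assumption (swap in your actual tokenizer’s count in practice):

```python
def fit_to_budget(snippets, budget_tokens,
                  est_tokens=lambda s: len(s) // 4 + 1):
    """Keep retrieved snippets, in ranked order, until the estimated
    token budget is spent; drop the rest."""
    kept, used = [], 0
    for s in snippets:
        t = est_tokens(s)
        if used + t > budget_tokens:
            break
        kept.append(s)
        used += t
    return kept
```

Hard-capping retrieval like this, together with a bounded `max_tokens` on generation, keeps the KV cache footprint predictable across every query.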
My direct recommendation for your specific case
If the goal is a fully local robotics agent that does diagnosis, operator summaries, and possibly some image-grounded reasoning:
- Do not build the first version around Xavier NX.
- Use Orin NX 16GB as the minimum serious target.
- Use AGX Orin if budget allows and you want breathing room. (NVIDIA Developer)
Then start with:
- text brain: Llama 3.2 3B or SmolLM3-3B
- image helper: Qwen3.5 4B or Gemma 3 4B
- retrieval over manuals/logs/SOPs
- vLLM for serving. (Jetson AI Lab)
After that, only move up if the small stack clearly fails your real tasks. At that point:
- test Llama 3.1 8B on Orin NX 16GB or AGX Orin,
- or move to GPT OSS 20B / Qwen3.5 35B-A3B on AGX Orin. (Jetson AI Lab)
Bottom line
Yes. There are local models that can solve your problem.
But the winning answer is usually not “run a 7B–9B dense model on NX and hope.”
The winning answer is:
- small local planner model
- retrieval for factual depth
- optional compact VLM
- good runtime and quantization
- Orin NX 16GB minimum, AGX Orin preferred. (ACL Anthology)