proxima-ocr-d.markdown-post3.0.l

proxima-ocr-d.markdown-post3.0.l is an experimental multimodal document AI model fine-tuned from Qwen3-VL-8B-Instruct and optimized for high-precision OCR and structured document reconstruction. It converts documents into Markdown, HTML-Markdown, and hybrid enriched documentation formats that can embed inline code in several programming languages, and it reconstructs complex layouts such as tables, forms, and mathematical content.

Key Enhancements

Dynamic Markdown Reconstruction: Converts complex documents to structured Markdown or HTML-Markdown while preserving layout hierarchy, formatting consistency, semantic ordering, and section alignment.
Inline Code and Language Embedding: Embeds Python, JavaScript, LaTeX, and shell syntax directly into reconstructed documents for technical and research documentation.
High-Fidelity OCR and Visual Parsing: Recognizes text accurately across structured and unstructured scanned documents, including multi-page layout reasoning.
Complex Layout Interpretation: Interprets tables, grids, equations, graphs, multi-column layouts, and forms without structural distortion.
Document Retrieval and Semantic Linking: Chunks multi-page documents efficiently, with cross-reference recognition and content traceability.
Multimodal Long Reasoning: Supports advanced document question answering and reasoning across long input streams such as slides and manuscripts.

👉 This is a stage progression model, an intermediate checkpoint in a staged training sequence, and its output may currently contain artifacts.

Example Preview

[1] Markdown HTML: Input Image, Markdown Preview Page 1, Markdown Preview Page 2

[2] JSON Nodes: Input Image, Node Preview Page 1, Node Preview Page 2

[3] YAML Nodes: Input Image, Node Preview Page 1, Node Preview Page 2

Quick Start with Transformers

# Requires: pip install transformers qwen-vl-utils
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/proxima-ocr-d.markdown-post3.0.l", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("prithivMLmods/proxima-ocr-d.markdown-post3.0.l")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Convert to Markdown."},
        ],
    }
]

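# Render the chat template to a prompt string, then load the referenced images
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the device chosen by device_map="auto" instead of hard-coding "cuda"
inputs = inputs.to(model.device)

# Generate, then drop the prompt tokens so only the newly generated text is decoded
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
# batch_decode returns one string per input; print the single result
print(output_text[0])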
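
For multi-page documents, one possible approach is to build a separate single-turn conversation per page so that each page can be run through the Quick Start pipeline independently, which keeps the sequence length of each call bounded. The sketch below only constructs the message structures; `build_page_messages` and the file names are hypothetical and not part of any library.

```python
def build_page_messages(image_sources, prompt="Convert to Markdown."):
    """Return one single-turn conversation per page image.

    `image_sources` may hold URLs or local paths. This helper is a
    hypothetical convenience, not part of transformers or qwen_vl_utils.
    """
    conversations = []
    for source in image_sources:
        conversations.append(
            [
                {
                    "role": "user",
                    "content": [
                        {"type": "image", "image": str(source)},
                        {"type": "text", "text": prompt},
                    ],
                }
            ]
        )
    return conversations


pages = ["scan_page_1.png", "scan_page_2.png"]  # hypothetical file names
conversations = build_page_messages(pages)
print(len(conversations))
```

Each element of `conversations` has the same shape as `messages` in the Quick Start, so it can be passed through the same template, processor, and generate steps one page at a time.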

Intended Use

OCR to Markdown or HTML-Markdown conversion
Complex document reconstruction and formatting regeneration
Multi-page document reasoning and retrieval
Table extraction and structured output transformation
Mathematical OCR and LaTeX conversion
Form extraction and structured entity generation
Knowledge base indexing and large document QA
Documentation regeneration for enterprise automation
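
As an illustration of the table extraction use case, the following sketch turns the first pipe table found in a Markdown string into a list of row dictionaries. It assumes the output contains a standard pipe table with a separator row and no escaped pipes inside cells; `parse_markdown_table` is a hypothetical helper, and the sample text is made up for the example rather than taken from the model.

```python
def parse_markdown_table(markdown):
    """Parse the first pipe table in `markdown` into a list of row dicts."""
    rows = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("|") and stripped.endswith("|"):
            # Drop the outer pipes, then split the remaining text into cells
            cells = [cell.strip() for cell in stripped[1:-1].split("|")]
            rows.append(cells)
        elif rows:
            break  # the first table has ended
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator
    return [dict(zip(header, cells)) for cells in body]


# Made-up sample standing in for model output
sample = """Invoice summary

| Item | Qty |
|------|-----|
| Pen  | 3   |
| Ink  | 1   |

Thank you."""
records = parse_markdown_table(sample)
print(records)
```

All values come back as strings; converting numeric columns is left to the caller.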

Limitations

Accuracy may drop on severely damaged or poorly scanned images
Long sequences and multi-page documents require significant GPU VRAM
Accuracy varies for low-resource languages and scripts
Complex objects such as mixed-orientation blocks may require secondary post-processing
Highly irregular layouts may occasionally produce formatting misalignment
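
One simple form of secondary post-processing is to check the generated Markdown for table rows whose cell count does not match the first row of the same table, which is one way formatting misalignment can show up. The sketch below is a minimal check under that assumption; `find_ragged_table_rows` is a hypothetical helper, it only flags rows and does not repair them, and it does not handle escaped pipes inside cells.

```python
def find_ragged_table_rows(markdown):
    """Return 1-based line numbers of pipe-table rows whose cell count
    differs from the first row of the same table."""
    ragged = []
    expected = None  # cell count of the current table's first row
    for number, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        is_row = len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|")
        if is_row:
            count = len(stripped[1:-1].split("|"))
            if expected is None:
                expected = count
            elif count != expected:
                ragged.append(number)
        else:
            expected = None  # a non-table line ends the current table
    return ragged


# Made-up sample: the third line is missing its last cell
sample = """| Name | Role | Team |
|------|------|------|
| Ada  | Eng  |
| Bob  | Ops  | Infra |"""
print(find_ragged_table_rows(sample))  # prints [3]
```

Flagged line numbers can then be routed to manual review or to a second conversion pass.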

Training Details

Parameter	Value
Dataset Size	approx. 544K entries (modular combination of open-source data and synthetic document data generated with Gemini 3 Pro)
Architecture	Qwen3VLForConditionalGeneration
Training Time	approx. 17,040 seconds (4 h 44 m)
Precision	bfloat16
Hardware	4x H100 SXM (320 GB VRAM)
System Memory	752 GB RAM
CPU	80 vCPU
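
The derived figures in the table can be checked with a few lines of arithmetic. The per-GPU memory of 80 GB is an assumption based on the stated 320 GB total across four GPUs.

```python
training_seconds = 17_040  # Training Time from the table above
hours, remainder = divmod(training_seconds, 3600)
minutes = remainder // 60
print(f"{hours} h {minutes} m")  # prints 4 h 44 m

gpu_count = 4
vram_per_gpu_gb = 80  # assumption: 80 GB per H100 SXM
total_vram_gb = gpu_count * vram_per_gpu_gb
print(total_vram_gb)  # prints 320
```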