Quick start

```shell
ollama run minicpm-v
```

Available sizes
| Tag | Size | Quantization | Context | Min RAM |
|---|---|---|---|---|
| minicpm-v:latest | 5.5 GB | q4_k_m | 32K | 6.9 GB |
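Beyond the interactive CLI above, a vision model like minicpm-v can be called through Ollama's local REST API, where images are passed as base64-encoded strings in the `images` field of the `/api/generate` request. The sketch below assumes a default Ollama server on `localhost:11434`; the helper name `build_payload` and the image path `example.png` are illustrative, not part of Ollama itself.

```python
import base64
import json
import urllib.request


def build_payload(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build a request body for Ollama's /api/generate endpoint.

    Ollama expects images as base64-encoded strings in the "images" list;
    "stream": False asks for a single JSON response instead of chunks.
    """
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


if __name__ == "__main__":
    # Assumes a running Ollama server and a local file named example.png
    # (hypothetical path; replace with your own image).
    with open("example.png", "rb") as f:
        body = json.dumps(
            build_payload("minicpm-v", "Describe this image.", f.read())
        )
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])
```

The same base64 `images` convention applies to the `/api/chat` endpoint, so the payload builder transfers directly if you prefer chat-style requests.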
Strengths & Limitations

Strengths
- Strong vision-language understanding
- Multimodal capabilities (image and text input)
- Designed as an efficient multimodal LLM (MLLM) for end-side deployment
Related models

- llava (Multimodal, 12.9M pulls): 🌋 LLaVA is a novel end-to-end trained large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding. Updated to version 1.6.
- llava-llama3 (Multimodal, 2.1M pulls): A LLaVA model fine-tuned from Llama 3 Instruct with better scores in several benchmarks.
- qwen3-vl (Multimodal, 1.6M pulls): The most powerful vision-language model in the Qwen model family to date.
- qwen2.5vl (Multimodal, 1.3M pulls): Flagship vision-language model of Qwen, and a significant leap from the previous Qwen2-VL.