Improved Baselines for Visual Instruction Tuning Models

Researchers from the University of Wisconsin-Madison and Microsoft Research have developed improved baselines for visual instruction tuning that achieve state-of-the-art performance across 11 benchmarks.

Figure: LLaVA-1.5 achieves SoTA on a broad range of 11 tasks.

In their technical report "Improved Baselines with Visual Instruction Tuning", the authors make simple modifications to the LLaVA architecture that lead to significant gains. The key changes are replacing the linear connector between the visual encoder and the language model with a two-layer MLP, and incorporating academic task-oriented VQA data together with prompts that clearly specify the response format.
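As a rough illustration of the connector change, the PyTorch sketch below swaps a single linear projection for a two-layer MLP with a GELU activation. The dimensions (1024-d visual features, a 4096-d language-model hidden size, 576 patch tokens) are assumptions chosen for illustration, not necessarily the authors' exact configuration.

```python
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096   # assumed feature sizes, for illustration only

# Original LLaVA-style connector: a single linear projection
linear_projector = nn.Linear(vision_dim, llm_dim)

# LLaVA-1.5-style connector: two linear layers with a GELU in between
mlp_projector = nn.Sequential(
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

# Project a batch of visual patch tokens into the language model's embedding space
patch_tokens = torch.randn(1, 576, vision_dim)   # e.g. 24x24 patches at 336px resolution
llm_inputs = mlp_projector(patch_tokens)         # shape: (1, 576, llm_dim)
print(llm_inputs.shape)
```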

With these tweaks, the new LLaVA-1.5 model achieves top results on a diverse set of 12 evaluation benchmarks while using orders of magnitude less training data than comparable models like InstructBLIP and Qwen-VL.

Figure: Comparison with SoTA methods on 12 benchmarks. LLaVA-1.5 achieves the best performance on 11 of the 12 benchmarks and ranks second on the remaining one.

The authors attribute the strong performance to the simplicity and efficiency of LLaVA's design, which feeds all image patch features from the visual encoder directly to the language model. Although visual resamplers reduce the number of visual tokens and thus the computational cost, LLaVA converges faster and generalizes better with far less training data.
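To make that trade-off concrete, the sketch below contrasts the two token budgets: the full-patch design passes every visual token to the language model, while a resampler-style design compresses them into a small set of learned queries via cross-attention. The resampler here is a generic stub rather than the actual Q-Former or Qwen-VL resampler, and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

vision_dim, num_patches, num_queries = 1024, 576, 32   # illustrative sizes

patch_tokens = torch.randn(1, num_patches, vision_dim)

# Full-patch design (LLaVA): every visual token is handed to the language model
full_patch_tokens = patch_tokens                        # shape: (1, 576, 1024)

# Resampler-style design: learned queries cross-attend to the patch tokens,
# shrinking the visual sequence to a fixed length
queries = nn.Parameter(torch.randn(1, num_queries, vision_dim))
cross_attn = nn.MultiheadAttention(vision_dim, num_heads=8, batch_first=True)
resampled_tokens, _ = cross_attn(queries, patch_tokens, patch_tokens)  # shape: (1, 32, 1024)

print(full_patch_tokens.shape, resampled_tokens.shape)
```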

By establishing reproducible state-of-the-art baselines, this work makes large multimodal model research more accessible. The ability to train top-tier models without massive datasets or resources lowers the barrier for future open-source development.

Figure: Different formatting prompts, with an example of a clear response-format prompt (last).
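As a small sketch of how such a formatting prompt can be applied, the snippet below appends the paper's short-answer instruction to a VQA question. The helper function itself is a hypothetical illustration, not the authors' training code.

```python
# Response-format instruction for short-answer VQA data, as quoted in the paper
SHORT_ANSWER_SUFFIX = "Answer the question using a single word or phrase."

def format_vqa_prompt(question: str, short_answer: bool = True) -> str:
    """Attach an explicit response-format instruction to a VQA question (hypothetical helper)."""
    return f"{question}\n{SHORT_ANSWER_SUFFIX}" if short_answer else question

print(format_vqa_prompt("What color is the bus?"))
# What color is the bus?
# Answer the question using a single word or phrase.
```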

Looking forward, visual instruction tuning seems to have more impact on multimodal understanding than pretraining alone. However, limitations around multi-image processing, problem solving, and hallucination remain. If these improved baselines can be scaled up responsibly, they may one day power real-world assistive applications.
