AI summary: InstructBLIP, built on the pre-trained BLIP-2 models, is a general-purpose vision-language model that can solve a wide range of vision-language tasks. It introduces instruction-aware visual feature extraction, enabling the model to extract informative features tailored to the given instruction. InstructBLIP achieves state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and the larger Flamingo models.