Multimodal Web Navigation with Instruction-Finetuned Foundation Models


GPT-4: WebGUM is a multimodal agent that leverages vision-language foundation models for autonomous web navigation. By jointly finetuning an instruction-finetuned language model and a vision transformer on a large corpus of demonstrations, WebGUM improves grounded visual perception, HTML comprehension, and multi-step reasoning. The agent outperforms the previous best offline-trained methods on the MiniWoB benchmark by 31.9% and surpasses existing state-of-the-art models on the WebShop benchmark. The researchers also provide 347K high-quality demonstrations to promote further advancements in the field.
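
To make the joint-finetuning recipe concrete, here is a minimal sketch of how a WebGUM-style agent could be wired up with Hugging Face transformers: ViT patch embeddings are projected and prepended to the embedded HTML/instruction tokens of an instruction-finetuned T5, whose decoder emits the next web action as text. The specific checkpoints, the linear projection, and the simple concatenation fusion are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
from transformers import ViTModel, T5ForConditionalGeneration

class MultimodalWebAgent(nn.Module):
    """Sketch of a WebGUM-style agent (assumed architecture):
    a ViT encodes page screenshots, and its patch embeddings are
    prepended to the token embeddings of an instruction-finetuned
    T5 encoder-decoder that reads the instruction + HTML and
    decodes the next action as a text string."""

    def __init__(self,
                 vit_name="google/vit-base-patch16-224-in21k",  # assumed checkpoint
                 t5_name="google/flan-t5-base"):                # assumed checkpoint
        super().__init__()
        self.vit = ViTModel.from_pretrained(vit_name)
        self.t5 = T5ForConditionalGeneration.from_pretrained(t5_name)
        # Map ViT hidden size onto the T5 embedding dimension.
        self.proj = nn.Linear(self.vit.config.hidden_size,
                              self.t5.config.d_model)

    def forward(self, pixel_values, input_ids, attention_mask, labels=None):
        # Visual tokens: one projected embedding per image patch.
        vis = self.proj(self.vit(pixel_values=pixel_values).last_hidden_state)
        # Text tokens: embedded instruction + HTML observation.
        txt = self.t5.encoder.embed_tokens(input_ids)
        # Fuse modalities by concatenation along the sequence axis.
        inputs_embeds = torch.cat([vis, txt], dim=1)
        vis_mask = torch.ones(vis.shape[:2],
                              dtype=attention_mask.dtype,
                              device=attention_mask.device)
        mask = torch.cat([vis_mask, attention_mask], dim=1)
        # With `labels` set (e.g. the demonstrated action such as
        # "click id=6"), this returns the finetuning loss; both the
        # ViT and the T5 receive gradients, i.e. joint finetuning.
        return self.t5(inputs_embeds=inputs_embeds,
                       attention_mask=mask,
                       labels=labels)
```

Training such a model on behavioral demonstrations reduces to standard sequence-to-sequence supervision: each example pairs a (screenshot, instruction + HTML) observation with the demonstrated action string as the target.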