Researchers from Google DeepMind and the University of Tokyo have developed WebAgent, a web agent that follows natural language instructions to complete tasks on real-world websites. The system combines two large language models (LLMs) – one specialized for HTML and website navigation and one for general-purpose code generation – to overcome challenges such as very long HTML documents and an open-ended action space.
WebAgent uses a model called HTML-T5 to plan the next sub-step toward the overall instruction and to summarize long HTML documents into task-relevant snippets. It then feeds the sub-step and snippets into Flan-U-PaLM, a 540B-parameter instruction-tuned LLM capable of code generation, which generates Python programs that execute the sub-steps on the live website.
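The division of labor described above can be sketched as a simple control loop. This is a minimal illustration under stated assumptions, not WebAgent's actual code: the names `run_episode`, `plan_and_summarize`, `synthesize_program`, and `execute` are hypothetical, and the toy planner and synthesizer below are stand-ins for HTML-T5 and Flan-U-PaLM so that the loop can run without any model or browser.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    sub_instruction: str  # sub-step proposed by the planner
    snippet: str          # HTML excerpt judged relevant to that sub-step
    program: str          # Python source proposed by the code model


def run_episode(
    instruction: str,
    get_html: Callable[[], str],
    plan_and_summarize: Callable[
        [str, str, List[str]], Tuple[Optional[str], Optional[str]]
    ],
    synthesize_program: Callable[[str, str], str],
    execute: Callable[[str], None],
    max_steps: int = 10,
) -> List[Step]:
    """Hypothetical loop: plan, summarize, synthesize, execute, repeat."""
    history: List[Step] = []
    for _ in range(max_steps):
        html = get_html()  # current page source
        past = [s.sub_instruction for s in history]
        # Role played by the HTML specialist (HTML-T5 in the article).
        sub_instruction, snippet = plan_and_summarize(instruction, html, past)
        if sub_instruction is None:  # assumed convention: planner says "done"
            break
        # Role played by the code generator (Flan-U-PaLM in the article).
        program = synthesize_program(sub_instruction, snippet or "")
        execute(program)
        history.append(Step(sub_instruction, snippet or "", program))
    return history


# Toy stand-ins so the loop runs end to end.
page = {"html": "<form><input id='city'><button id='go'>Search</button></form>"}
executed: List[str] = []


def toy_planner(instruction, html, past):
    todo = ["type the city into #city", "click #go"]
    if len(past) >= len(todo):
        return None, None
    # A real summarizer would return a short excerpt, not the whole page.
    return todo[len(past)], html


def toy_synthesizer(sub_instruction, snippet):
    return f"# program for: {sub_instruction}"


steps = run_episode(
    "search for homes in Tokyo",
    get_html=lambda: page["html"],
    plan_and_summarize=toy_planner,
    synthesize_program=toy_synthesizer,
    execute=executed.append,  # records programs instead of running them
)
print([s.sub_instruction for s in steps])
```

Passing the two model roles in as plain callables mirrors the modular idea: either component could be swapped without touching the loop.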
Key results:
- Achieved a 70% success rate on tasks on real estate and social media websites, 50% higher than single-LLM approaches
- The HTML-T5 model outperformed the prior best method by 15% on the MiniWoB benchmark of 56 web tasks
- Outperformed single generalist or specialist LLMs on static HTML comprehension tasks
The modular approach allows each model to focus on its strengths – HTML-T5 handles instruction following and HTML structure while Flan-U-PaLM generates programs. HTML-T5 combines local and global attention with pre-training on HTML data to better capture document structure.
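To make the local-global idea concrete, here is a small sketch of which positions a token may attend to under one common formulation: a sliding local window plus one summary "global" token per fixed-size block, in the style of LongT5's transient-global attention. The exact mechanism inside HTML-T5 is not described in this article, so the function names, the `radius` and `block_size` parameters, and the ceil-division handling of the last block are all assumptions made for illustration.

```python
from typing import List, Tuple


def local_global_targets(
    seq_len: int, radius: int, block_size: int
) -> List[Tuple[List[int], List[int]]]:
    """For each position, return (local token positions, global block ids).

    Assumed scheme: a token sees neighbors within `radius`, plus one
    summary token for every block of `block_size` tokens.
    """
    num_blocks = (seq_len + block_size - 1) // block_size  # ceil division
    targets = []
    for i in range(seq_len):
        lo = max(0, i - radius)
        hi = min(seq_len, i + radius + 1)
        local = list(range(lo, hi))
        global_blocks = list(range(num_blocks))  # every token sees every block
        targets.append((local, global_blocks))
    return targets


def attention_cost(seq_len: int, radius: int, block_size: int) -> int:
    """Count query-key pairs under the assumed local-global scheme."""
    return sum(
        len(local) + len(blocks)
        for local, blocks in local_global_targets(seq_len, radius, block_size)
    )


sparse = attention_cost(seq_len=8, radius=1, block_size=4)
dense = 8 * 8  # full self-attention compares every pair of tokens
print(sparse, dense)  # prints "38 64"
```

The gap widens with length: the sparse count grows roughly linearly in the sequence length for a fixed window, while dense attention grows quadratically, which matters for long HTML documents.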
Key actions that WebAgent can perform:
- Fill out forms on websites by locating form elements such as text boxes, drop-downs, and checkboxes, and populating them.
- Click on buttons, links, tabs, menu items to navigate between pages and sections of a website.
- Scroll up or down on a page to bring specific elements into view.
- Interact with search bars to look up information by entering text and submitting queries.
- Scrape and extract information from webpages by locating relevant DOM elements.
- Execute JavaScript code snippets to control page behavior.
- Set the values of input elements such as date pickers, sliders, and radio buttons.
- Upload files by locating upload fields and submitting file paths programmatically.
- Download files from links and export web data.
- Automate multi-page workflows by chaining together sequences of actions.
- Extract summaries of page content by locating relevant DOM elements.
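Several of the actions above start with locating elements in the page's HTML. The sketch below shows only that locating step, using Python's standard-library `html.parser`; it is an illustration, not code produced by WebAgent, and the `FormElementFinder` class, the `find_form_elements` helper, and the sample form are hypothetical. Acting on the located elements (typing, clicking, uploading) would require a browser automation layer, which is out of scope here.

```python
from html.parser import HTMLParser
from typing import Dict, List

# Tags treated as form controls in this sketch (an assumption, not exhaustive).
FORM_TAGS = {"input", "select", "textarea", "button"}


class FormElementFinder(HTMLParser):
    """Collect the tag name and attributes of every form control seen."""

    def __init__(self) -> None:
        super().__init__()
        self.elements: List[Dict[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag in FORM_TAGS:
            record = {"tag": tag}
            # Boolean attributes such as `checked` arrive with value None.
            record.update({name: value or "" for name, value in attrs})
            self.elements.append(record)


def find_form_elements(html: str) -> List[Dict[str, str]]:
    finder = FormElementFinder()
    finder.feed(html)
    finder.close()
    return finder.elements


sample = """
<form action="/search">
  <input type="text" id="city" name="city">
  <select id="beds"><option>1</option><option>2</option></select>
  <input type="checkbox" id="pets" checked>
  <button type="submit" id="go">Search</button>
</form>
"""

elements = find_form_elements(sample)
print([(e["tag"], e["id"]) for e in elements])
```

A generated program could use a list like this to decide which element to fill or click for the current sub-step.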
Broader impact:
This work could enable more capable web agents that can assist people in completing complex online tasks. The modular design is more scalable as additional expert models can be plugged in. Code generation also provides an open action space beyond predefined actions.
However, security and misuse remain concerns if such agents are deployed autonomously without human supervision. More research is needed to ensure robust and safe web navigation across the diversity of real-world websites.
The specialized HTML-T5 model exemplifies how inductive biases can make LLMs better suited to particular domains, a direction that is likely to grow. This could reduce the need for massive general models.
Overall, the work demonstrates how combining modular LLMs with complementary skills and training can achieve better performance on complex real-world tasks. As LLMs advance, finding the right decompositions and specializations will be key to realizing their full potential.