Researchers from UC Irvine and Carnegie Mellon University have developed a novel technique that allows AI agents to successfully carry out computer tasks using only natural language instructions. Published in the paper “Language Models can Solve Computer Tasks,” this method represents a major advance towards more flexible and general-purpose AI systems.
The key innovation is an approach called Recursive Criticism and Improvement (RCI), in which a language model recursively critiques and improves its own outputs until it reaches the desired behavior. For computer tasks, RCI strengthens the model's grounding at three levels: connecting its plan to the current state of the screen (state grounding), producing actions that are actually executable (agent grounding), and refining the overall plan (task grounding).
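To make these stages concrete, here is a minimal sketch of how an agent might refine a single action through the three groundings in sequence. This is an illustration, not the paper's implementation: the `complete` function is a hypothetical stand-in for a language model API call, and the prompt wording is invented for the example.

```python
# Minimal sketch of the three RCI grounding stages for a web task.
# `complete(prompt)` is a hypothetical stand-in for an LLM API call.

def rci_ground_action(complete, task, html_state, plan):
    # Task grounding: critique and improve the high-level plan.
    plan = complete(
        f"Task: {task}\nPlan: {plan}\n"
        "Find problems with this plan, then output an improved plan."
    )
    # State grounding: choose the plan step that fits the current page.
    step = complete(
        f"Current page HTML: {html_state}\nPlan: {plan}\n"
        "Which single step of the plan should be executed now?"
    )
    # Agent grounding: rewrite the step as an executable UI action.
    action = complete(
        f"Step: {step}\n"
        "Rewrite this step as one concrete UI action, such as "
        "clicking an element or typing text."
    )
    return action
```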
For example, if an LLM generates an incorrect answer to a math problem, RCI first asks the model to review its answer and find any mistakes; the model may recognize that it failed to carry a digit properly. RCI then prompts the model to improve its answer based on this critique. By iterating this critique-and-improve cycle, RCI allows LLMs to correct their own errors and arrive at more accurate solutions.
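In code, this cycle is just a short loop. The sketch below assumes the same hypothetical `complete` function; the prompts paraphrase the pattern described above rather than quoting the paper's exact wording.

```python
def rci_refine(complete, question, answer, n_rounds=2):
    """Recursively critique and improve an answer (illustrative sketch)."""
    for _ in range(n_rounds):
        # Critique step: ask the model to find flaws in its own answer.
        critique = complete(
            f"Question: {question}\nAnswer: {answer}\n"
            "Review the answer and find any problems with it."
        )
        # Improvement step: revise the answer based on that critique.
        answer = complete(
            f"Question: {question}\nAnswer: {answer}\n"
            f"Critique: {critique}\n"
            "Based on the problems found, write an improved answer."
        )
    return answer

# Example: a model that mis-adds 17 + 25 might critique its own
# carrying error on the first pass and return 42 on the improved one.
```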
Evaluated on the MiniWoB++ benchmark of web tasks, the RCI agent achieved state-of-the-art performance, surpassing prior supervised learning, reinforcement learning, and combined approaches. Remarkably, it attained a 94% success rate using only a handful of demonstrations per task, compared to the tens of thousands required by imitation learning techniques.
The RCI prompting scheme also improved language model reasoning on arithmetic and commonsense benchmarks. When combined with existing chain-of-thought prompting, performance increased further, demonstrating that the two reasoning techniques are complementary.
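Combining the two can be as simple as eliciting a step-by-step answer first and then running an RCI pass over it. A rough sketch, again using the hypothetical `complete` function and reusing the `rci_refine` sketch from above:

```python
def cot_then_rci(complete, question):
    # Chain-of-thought: elicit step-by-step reasoning first.
    cot_answer = complete(f"{question}\nLet's think step by step.")
    # Then apply recursive criticism and improvement on top of it.
    return rci_refine(complete, question, cot_answer)
```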
The researchers suggest that as foundation model capabilities continue to grow, this instruction-following approach will be able to tackle more complex cognitive tasks on computers. Potential use cases include intelligent assistants that can automate repetitive workflows, AI co-pilots that can interpret and execute commands, and automated scientists that can conduct in silico experiments.
However, the authors caution that such autonomous systems could also enable new forms of cybercrime and technological unemployment if deployed irresponsibly. Overall, this work represents an important milestone towards realizing advanced AI agents that can flexibly accomplish a wide range of computer-based tasks.