
ChatGPT and Claude Progressing in Real-World Tasks, Say Scientists

The researchers created a tool called “AgentBench” to evaluate LLMs as agents.

An approach for evaluating the skills of large language models (LLMs) as actual agents was developed by almost two dozen researchers from Tsinghua University, Ohio State University, and the University of California, Berkeley.

The technology industry has been captivated by LLMs like OpenAI’s ChatGPT and Anthropic’s Claude over the past year, as the cutting-edge “chatbots” have proven helpful in many activities, including coding, cryptocurrency trading, and text generation.

These models are typically evaluated by their performance on plain-language assessments designed for humans or by how humanlike the text they produce is. In contrast, LLMs acting as agents have received far less research attention.

Artificial intelligence (AI) agents carry out specified tasks, such as following instructions while operating in a particular environment.

For instance, researchers frequently train an AI agent to navigate a challenging digital environment in order to examine how machine learning might be used to produce safe autonomous robots.
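As a rough illustration of that agent-environment training loop (this is a minimal sketch using the open-source Gymnasium toolkit with a random policy, not anything from the paper):

```python
import gymnasium as gym

# A simple grid-navigation environment; chosen purely for illustration,
# it is not one of the environments used in the AgentBench paper.
env = gym.make("FrozenLake-v1")

observation, info = env.reset(seed=42)

for step in range(100):
    # A trained policy would choose the action here; we sample randomly
    # just to show the interaction loop.
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)

    # Start a new episode when the current one ends.
    if terminated or truncated:
        observation, info = env.reset()

env.close()
```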

Because training models like ChatGPT and Claude is so expensive, conventional machine learning agents, like those used to navigate the digital environments mentioned above, are rarely built as LLMs. The biggest LLMs, though, have demonstrated promise as agents.

The researchers from Tsinghua, Ohio State, and UC Berkeley created AgentBench, which they claim is the first tool of its kind, to assess and compare how LLMs perform as actual agents.

The critical hurdle in developing AgentBench, according to the researchers’ preprint article, was moving beyond conventional AI learning environments, such as video games and physics simulators, and figuring out how to apply LLM capabilities to real-world issues so they could be successfully tested.

Flowchart of AgentBench’s evaluation process. Source: Liu et al

They developed a multifaceted battery of assessments that gauge a model’s capacity for challenging work in various settings.

These include planning and carrying out household cleaning chores, using an operating system, having models carry out SQL database functions, shopping online, and many other high-level tasks requiring detailed problem-solving.
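AgentBench’s actual evaluation harness is not reproduced here, but a minimal, hypothetical sketch of the database-task pattern might look like the following, where `ask_model` is a placeholder for a call to whichever LLM is under test:

```python
import sqlite3

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to the LLM being evaluated
    (API-based or open source); it should return a raw SQL statement."""
    raise NotImplementedError

# Minimal in-memory database for the illustrative task.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

task = "Return the total amount across all rows of the orders table as a single number."
expected = [(29.5,)]

# Ask the model for a SQL statement, execute it, and score the outcome.
sql = ask_model(f"Write one SQLite query for this task and nothing else:\n{task}")
try:
    result = conn.execute(sql).fetchall()
    success = result == expected
except sqlite3.Error:
    success = False

print("task solved:", success)
```

The real benchmark wraps many such tasks across its different environments and aggregates the pass rates per model.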

According to the report, the biggest, priciest models performed noticeably better than open-source models:

“[W]e have conducted a comprehensive evaluation of 25 different LLMs using AgentBench, including both API-based and open-source models. Our results reveal that top-tier models like GPT-4 can handle a wide array of real-world tasks, indicating the potential for developing a potent, continuously learning agent.”

While they acknowledged that open-source rivals still have a “long way to go,” the researchers said that “top LLMs are becoming capable of tackling complex real-world missions.”