With generative AI already showing signs of slowing down, the next big leap is already visible on the horizon: AI agents. Unlike a chatbot, an AI agent can be given a complex task and will act independently, making decisions on the fly to achieve its goal. Everything suggested that 2025 would be the year of AI agents. To test that, a group of researchers ran a curious experiment: they put several of these agents to work in a fictitious company. It didn’t go very well.

A Fictitious Company

The study was conducted by researchers at Carnegie Mellon University and sought to measure the effectiveness of AI agents. They created an environment that mimicked a small software development company, which they dubbed TheAgentCompany. The company had 18 employees and a detailed plan of objectives for the quarterly sprint. The researchers also provided ample internal documentation, such as an employee handbook, human resources policies, and a best-practices guide. Employees communicated through a chat program similar to Slack.

The Staff

The AI agents employed at TheAgentCompany ran on models from Google, OpenAI, Meta, Anthropic, and Amazon. They were assigned various roles, including Financial Analyst, Project Manager, and Software Engineer. A technology director and a human resources manager were also created so that agents could contact them if they needed assistance. Their tasks included writing code, searching the Internet, opening programs, and organizing data in spreadsheets, all typical jobs in a company of this kind.

The Problems

Initially, everything seemed to be running smoothly, but problems and misunderstandings soon emerged. One agent encountered a popup that blocked its access to information it needed. Although it could have closed the popup simply by clicking the ‘X’ in the upper right corner, it decided to seek help from human resources, which told it that the IT department would contact it soon. No one ever followed up, and the task was left unfinished.

Curiously, the agents behaved erratically when they were unsure which steps to follow. In some cases, they took shortcuts to bypass the difficult parts of a task. For instance, when one agent could not find the right person to ask a question, it simply renamed another user to the name of the person it was looking for.

The Results

The title of Employee of the Month went to Anthropic’s Claude 3.5 Sonnet model, which managed to complete 24% of its assigned tasks. By contrast, Gemini 2.0 Flash and ChatGPT each completed only about 10%, and the worst performer was Amazon’s Nova Pro 1, which finished a disappointing 1.7% of its tasks. The most common errors stemmed from a lack of social skills and difficulties with internet searches, highlighting the limitations of AI agents in complex workplace environments.

The Threat of AI Agents

According to the latest World Economic Forum report, AI could eliminate more than 90 million jobs in the next five years, although almost twice that number of new positions are expected to be created. AI agents still pose a significant threat to many jobs, but experiments like this one show that current technology is not yet ready to replace human employees entirely. For now, AI agents make numerous mistakes and, as with Tesla’s Autopilot, it is advisable to keep one’s hands on the steering wheel.

Image | Gemini



