A few months ago, a group of Spanish researchers thought of putting an AI chatbot to the test with a curious experiment. They uploaded an image of an analog clock to the chatbot and asked a simple question: “What time is it on that clock?” Disturbingly, the AI failed to provide an accurate response.
Assessing AI’s Time-Reading Abilities
Researchers from the Polytechnic University of Madrid, the University of Valladolid, and the Politecnico di Milano recently published a study aimed at evaluating the intelligence of various AI models. To conduct this evaluation, they created a dataset, available on Hugging Face, of synthetic analog clock images showing 43,000 different times.

Before fine-tuning, the AI models struggled to tell the time accurately. Improvements were made after adjustments, but the issue persisted.
Testing Multiple AI Models
Initially, four generative AI models were tasked with determining the time displayed in the clock images: GPT-4o, Gemma3-12B, Llama3.2-11B, and QwenVL-2.5-7B. None could identify the time accurately. The models had significant trouble distinguishing the hands and working out their positions relative to the clock’s numbers.
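To make concrete what "positions relative to the clock's numbers" involves, here is a minimal Python sketch of the geometry behind reading an analog clock. It is an illustration only, not code from the study; the function names `hand_angles` and `time_from_angles` are hypothetical, and angles are assumed to be measured in degrees clockwise from the 12 o'clock mark.

```python
def hand_angles(hour: int, minute: int) -> tuple[float, float]:
    """Return (hour_hand, minute_hand) angles in degrees, clockwise from 12.

    Hypothetical helper for illustration; not taken from the study.
    """
    minute_angle = minute * 6.0  # 360 degrees / 60 minutes
    # 360 degrees / 12 hours = 30 per hour, plus 0.5 degrees of drift per minute
    hour_angle = (hour % 12) * 30.0 + minute * 0.5
    return hour_angle, minute_angle


def time_from_angles(hour_angle: float, minute_angle: float) -> tuple[int, int]:
    """Invert hand_angles: recover (hour, minute) on a 12-hour dial."""
    minute = round(minute_angle / 6.0) % 60
    hour = int(hour_angle // 30.0) % 12  # which 30-degree sector the hour hand is in
    return (12 if hour == 0 else hour), minute


print(hand_angles(10, 10))
print(time_from_angles(305.0, 60.0))
```

The arithmetic itself is trivial once the two angles are known. The hard part for a vision model is the step this sketch skips: telling the hands apart in the image and estimating their angles.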
Attempts to Improve Performance
Following these discouraging results, the researchers attempted to enhance the models’ capabilities through “fine-tuning.” By training them with an additional 5,000 clock images from the same dataset, the researchers hoped for better outcomes. Unfortunately, when tested with a new set of analog clock images, the models continued to falter.
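One way to quantify how far off a model's answer is, sketched below in Python, is to measure the gap in minutes between the predicted and the actual time while accounting for the dial wrapping around at 12. This is an assumed scoring approach for illustration, not necessarily the metric the researchers used, and `clock_error_minutes` is a hypothetical name.

```python
def clock_error_minutes(predicted: tuple[int, int], actual: tuple[int, int]) -> int:
    """Smallest gap in minutes between two (hour, minute) readings on a 12-hour dial.

    Hypothetical scoring helper; the study's own metric may differ.
    """
    def to_minutes(reading: tuple[int, int]) -> int:
        hour, minute = reading
        return (hour % 12) * 60 + minute  # minutes past 12:00

    diff = abs(to_minutes(predicted) - to_minutes(actual))
    # The dial is circular: 11:55 and 12:05 are 10 minutes apart, not 710.
    return min(diff, 720 - diff)


print(clock_error_minutes((11, 55), (12, 5)))
```

Averaging this value over a set of images the model has never seen gives a single number that can be compared before and after fine-tuning.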
Lack of Generalization Skills
One key takeaway from this study is the AI’s inability to generalize. While these models can efficiently recognize familiar data, they struggle with scenarios outside their training sets, which points to an inherent limitation in their design.
Exploring AI’s Limitations
To further investigate the failures, researchers introduced new sets of clock images, including Salvador Dalí’s distorted clocks. Unlike AI, humans can interpret time displayed on distorted clocks, but this remains a significant hurdle for AI systems.
Consequences for Critical Applications
The implications of these findings are sobering. The inability of generative AI models to handle a simple task like reading a clock raises concerns about their reliability in more critical areas, such as medical imaging or autonomous driving. Given that these models cannot even identify clock hands reliably, the stakes are high if they are tasked with analyzing life-and-death situations.
Conclusion: The Reality of AI Intelligence
Despite their impressive performance in various applications, generative AI models often merely “regurgitate” responses from their training data. Prominent thinkers in AI, such as Thomas Wolf from Hugging Face and Yann LeCun, emphasize this point, suggesting that these systems remain fundamentally limited in intelligence.


Source: clocks.brianmoore.com
This research illustrates the ongoing challenges in the development of intelligent AI models. Despite advances, the gap between human-like understanding and AI capabilities remains stark. Until models can reliably interpret and generalize information, they may not be the intelligent assistants we hoped for.
Image | Yaniv Knobel
