OpenAI Aims to Create the Perfect GP with ChatGPT, But It’s Incorrect Half the Time

ChatGPT Health Mode: A Troubling Start

OpenAI kicked off the year with the launch of its ChatGPT Health mode. While this feature is not yet available in Spain, it has been rolled out in the US. Early studies assessing its effectiveness, however, have not delivered promising results for OpenAI.

Significant Underperformance in Critical Scenarios

It’s not that big of a deal. A recent study published in the journal Nature Medicine and reported by NBC News has highlighted a concerning failure rate. ChatGPT Health misclassified the urgency of 51.6% of the emergency medical cases it evaluated. Researchers presented thousands of clinical scenarios to the AI, which tended to undervalue critical situations—suggesting patients wait 24-48 hours for conditions that required immediate intervention, like diabetic ketoacidosis or respiratory failure. Meanwhile, it did identify serious conditions like stroke or severe allergic reactions correctly.

The Inconsistent Decision-Making of ChatGPT Health

It doesn’t make sense. Not only did ChatGPT underestimate urgent cases, but it also overvalued many mild symptoms, incorrectly advising patients to seek medical attention urgently for situations like a persistent sore throat. Dr. Ashwin Ramaswamy, the study’s leader, stated to NBC that “it doesn’t make sense that recommendations were made in some areas and not in others.”

Handling Suicidal Ideations Poorly

Suicidal ideas. The study’s findings extend to alarming scenarios involving suicidal thoughts. For example, one query detailed a patient wanting to “take a lot of pills.” Initially, a suicide prevention hotline message appeared, but when lab results were included in the query, ChatGPT failed to recognize the suicidal ideation and omitted the safety message. According to Ramaswamy, “A crisis protection barrier that depends on whether lab results are mentioned is not in place, and is arguably more dangerous than having no barrier at all.”

Why the Accuracy of ChatGPT Health Matters

Why it is important. ChatGPT has effectively become the first point of contact for many seeking medical advice. The convenience of checking symptoms via mobile phones is gradually overshadowing traditional consultation methods. If places where individuals decide on emergencies rely on a tool with a 50% margin of error for serious cases, it signals a profound issue.

The Dangerous Implications

In comments to the Guardian, researcher Alex Ruani noted that these findings are “incredibly dangerous.” He emphasized the potential for creating a “false sense of security”—implying that if someone receives a wait-and-see recommendation during a critical incident, it could prove fatal.

OpenAI’s Response to the Criticism

OpenAI responds. A spokesperson defended ChatGPT Health, asserting that the study does not reflect its typical use. The spokesperson argued that the AI is not intended to diagnose but to clarify follow-up questions and assist in providing context. OpenAI has consistently noted that the tool doesn’t replace professional medical advice, yet how users interact with it remains outside the company’s control.

The Risks of AI: Flattery and Hallucinations

Flattery and hallucinations. Chatbots like ChatGPT often exhibit a tendency to agree with users, contributing to their “flattery problem.” Additionally, the phenomenon of “hallucinations” occurs when large language models prioritize giving a confident answer instead of admitting uncertainty. Research indicates that people often feel more secure using AI, even when it provides incorrect information. Combining reassurance, inaccuracies, and health-related queries creates a perilous scenario.

Image | OpenAI

In Xataka | People are blaming ChatGPT for causing delusions and suicides: What’s Really Happening with AI and Mental Health

General News – 2