Medical Research Identifies Chatbot Health Advice as Potentially Harmful in 5-13% of Cases

Medical researchers uncover potential risks in chatbot healthcare advice, with 5-13% of responses deemed hazardous or questionable.

In a recent study titled "Large language models provide unsafe answers to patient-posed medical questions," researchers have identified safety issues where AI chatbots, including popular models like ChatGPT and Google's Gemini, can provide dangerous or misleading medical advice [1].

The study, conducted by a team of 17 doctors, focused on advice-seeking questions, which carry the highest potential for harm if patients act on erroneous guidance [1]. The researchers found that between 5% and 13% of the chatbots' answers, depending on the model, were dangerously incorrect, raising concerns that millions of patients may be receiving unsafe medical information from publicly accessible AI tools [1].

One of the key safety issues highlighted is the potential for AI chatbots to give definitively unsafe advice in critical situations, such as the sudden onset of symptoms, where immediate and correct medical guidance is crucial [1]. For instance, a question such as "What should I do if my left arm suddenly hurts?" can elicit a dangerously inadequate response, even though sudden left-arm pain can be a sign of a heart attack.

The models were also found to misunderstand or inadequately address urgent diagnostic questions, leading to misleading or harmful recommendations [1]. Certain patient groups, notably children and women, proved especially vulnerable because of weaknesses in the models' clinical knowledge of pediatrics and women's health [1].

The study used a new dataset of patient-posed questions, titled HealthAdvice, curated from an existing Google dataset called HealthSearchQA [1]. Across the tested chat models, between 21% and 43% of responses were rated as 'problematic' [1].

Interestingly, Llama, a free and open-source (FOSS) model that is heavily used in live professional contexts, was the worst-performing model in the tests, with a 13% 'unsafe response' rate [1]. Claude, by contrast, was the safest, with an unsafe rate of 5% [1].

The authors of the study emphasize the need for further work to improve the clinical safety of these powerful tools [1]. They also note that not every behavioural change in large language models will necessarily improve a given use case, meaning that newer models are not automatically safer for medical advice [1].

Some of the most disturbing safety issues arose from the inclusion of problematic information: false statements, dangerous advice, and false reassurance [1]. Concerning examples included advice to breastfeed a child while infected with herpes, to use tea tree oil to address crust on the eyelids, to give water to infants under six months old, and to treat the aftermath of a miscarriage as a counseling opportunity rather than a cue for medical attention [1].

In a severely litigious medical climate, a 13% 'unsafe response' rate would likely end a doctor's career or shut down a hospital [1]. It is worth noting that the study tested ChatGPT-4o rather than the most recent ChatGPT models [1].

The study suggests that millions of patients could be receiving unsafe medical advice from publicly available chatbots [1]. As such, it underscores the importance of continuing efforts to improve the clinical safety of AI chatbots and ensure they provide accurate and safe advice to patients.

[1] Mokhtar, Y., et al. (2023). Large language models provide unsafe answers to patient-posed medical questions. arXiv preprint arXiv:2303.14086.

In light of the study's findings, the safety of health advice provided by AI chatbots demands urgent attention: between 5% and 13% of their responses can be dangerously incorrect [1]. The study also highlights the need for these tools to improve at providing safe and accurate medical advice, particularly in critical situations and in areas such as pediatrics and women's health [1].
