r/ArtificialInteligence • u/MetaKnowing • 1d ago
News GPT-5 outperformed doctors on the US medical licensing exam
Abstract from the paper:
"Recent advances in large language models (LLMs) have enabled general-purpose systems to perform increasingly complex domain-specific reasoning without extensive fine-tuning. In the medical domain, decision-making often requires integrating heterogeneous information sources, including patient narratives, structured data, and medical images. This study positions GPT-5 as a generalist multimodal reasoner for medical decision support and systematically evaluates its zero-shot chain-of-thought reasoning performance on both text-based question answering and visual question answering tasks under a unified protocol. We benchmark GPT-5, GPT-5-mini, GPT-5-nano, and GPT-4o-2024-11-20 against standardized splits of MedQA, MedXpertQA (text and multimodal), MMLU medical subsets, USMLE self-assessment exams, and VQA-RAD. Results show that GPT-5 consistently outperforms all baselines, achieving state-of-the-art accuracy across all QA benchmarks and delivering substantial gains in multimodal reasoning. On MedXpertQA MM, GPT-5 improves reasoning and understanding scores by +29.26% and +26.18% over GPT-4o, respectively, and surpasses pre-licensed human experts by +24.23% in reasoning and +29.40% in understanding. In contrast, GPT-4o remains below human expert performance in most dimensions. A representative case study demonstrates GPT-5’s ability to integrate visual and textual cues into a coherent diagnostic reasoning chain, recommending appropriate high-stakes interventions. Our results show that, on these controlled multimodal reasoning benchmarks, GPT-5 moves from human-comparable to above human-expert performance. This improvement may substantially inform the design of future clinical decision-support systems. We make the code public at the GPT-5-Evaluation."
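Stripped of benchmark specifics, the zero-shot QA scoring the abstract describes boils down to asking the model each multiple-choice item and comparing letters against the answer key. A minimal sketch, with `ask_model` as a hypothetical stub standing in for a real model API call:

```python
# Minimal sketch of zero-shot multiple-choice scoring, as in the paper's QA protocol.
# `ask_model` is a hypothetical stand-in for a real model call; here it is stubbed
# so the harness itself can be demonstrated end to end.

def ask_model(question: str, options: dict) -> str:
    """Placeholder for a model call; returns an option letter such as 'B'."""
    return max(options)  # stub: always picks the last letter alphabetically

def score(benchmark: list) -> float:
    """Fraction of items where the model's chosen letter matches the key."""
    correct = 0
    for item in benchmark:
        pred = ask_model(item["question"], item["options"])
        if pred == item["answer"]:
            correct += 1
    return correct / len(benchmark)

toy = [
    {"question": "First-line for anaphylaxis?",
     "options": {"A": "Epinephrine", "B": "Antihistamine"}, "answer": "A"},
    {"question": "Vitamin deficient in scurvy?",
     "options": {"A": "B12", "B": "C"}, "answer": "B"},
]
print(score(toy))  # the stub picks 'B' every time, so accuracy is 0.5 here
```

A real harness would add the chain-of-thought prompt and an answer-letter extraction step, but the accuracy bookkeeping is this simple.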
195
u/Snoutysensations 1d ago
I work in health care and am not particularly surprised or even all that impressed by this. For many years now we have been aware that the total accumulated knowledge base in medicine is far greater than one human brain can encompass. Most of us have been using computers to look up info for decades now, and before that physicians kept libraries of reference books in their offices. And if we are going off medical licensing board exams, those were specifically developed to assess knowledge of relatively basic and objective facts not thought to be controversial or esoteric.
I'll be more impressed if an AI manages to take a coherent history from a confused or anxious patient, do a reasonable physical exam, formulate a treatment plan then persuade the patient to go along with it. These are all tasks rather more challenging than simply searching through databases to answer standardized licensing exam questions.
33
u/DivineMediocrity 1d ago
Exactly. Knowledge benchmarks do not indicate AI can practically replace doctors. There are legal, risk, and insurance factors. More likely, this helps doctors provide better diagnoses and treatments, and fill in the less risky aspects of the job, like notes, responding to patient emails, etc.
13
u/NotLikeChicken 1d ago
AI will be superior in meeting the goals of a specific rubric, which means it is designed to outperform humans at taking exams. Conversely, it will be most challenged by needing to identify relevant information in an undefined situation.
2
u/Profile-Ordinary 7h ago
The issue is that in medicine each patient’s rubric is different, and it is up to the doctor to decide what that rubric looks like depending upon briefly meeting that patient and taking a history.
1
u/NYG_5658 1d ago
Excellent point. AI has more intelligence than a human but doesn’t understand how to apply that intelligence to a situation with multiple variables. Also, if something goes wrong, are patients going to sue the AI?
5
u/MNVikingsFan4Life 1d ago
Yeah, doctors need this for diagnostics in a world where their education is nowhere near comprehensive for the field of healthcare. They will encounter patients outside of their expertise, so having an objective source of info to consult about such things will be invaluable.
2
u/Snoo44080 1d ago
This is a great tool for a GP to help guide references etc... but yeah, so many other factors at play.
Impressive, but this constant bastardised fetishization with automating away knowledge workers has got to go...
Guys, the written word makes it so much easier to learn, and this library contains more knowledge than the orator we have knows. We should replace all lecturers with a stack of books, or library access...
3
u/Huskador12 1d ago
When was the last time you’ve been to the doctor if you actually believe they’re parsing through medical journals and looking up info in books? Almost every PCP I’ve been to has been clueless about basic shit and they have back to back patients so that they can make as much money as humanly possible. Their learning stops as soon as they get out of medical school.
4
u/xmod3563 1d ago
I'll be more impressed if an AI manages to take a coherent history from a confused or anxious patient, do a reasonable physical exam, formulate a treatment plan then persuade the patient to go along with it.
Nurse practitioners can already do this though at a much cheaper cost.
5
u/Personal-Rooster-345 1d ago
The argument that AI can't do a physical exam assumes that the AI would need to do a physical exam in order to be useful. It seems obvious that the direction this is headed is mid-level (or maybe not even) providers being in AI-augmented roles that end up at the level of physicians.
-1
u/Dry-Refrigerator32 1d ago
But AI can't, which was the poster's point, right? Agree on NPs and PAs, though. Critical elements of the primary care ecosystem.
1
u/Suitable-Economy-346 1d ago
Agree on NPs and PAs, though. Critical elements of the primary care ecosystem.
They only exist in the roles they do now because the AMA, AAMC, and Republicans (+ Bill Clinton) killed the physician profession.
2
u/TrippyWiredStoned 1d ago
If you can find an AI model that can ascertain an accurate prognosis while detecting the nuances humans display when hiding shameful information from doctors, I'd be impressed in this field. AI could offer a perceivably less judgmental means of relaying symptoms, history, etc..
Any time I hear of someone struggling with mental health impairments, their combative response to recommendations of mental health treatment is telling. Those who've 'tried' usually fail to be vulnerable enough to find accurate guidance.
As a patient I also dealt with a lot of hostility towards involvement in directing my treatment plan. Having a rudimentary understanding and using that to weigh pros and cons of varying treatment options was important to my weird brain. Give a patient something like that potential and you'll get some positive outcomes with AI involved care, I'm certain.
2
u/Impressive-Emu-4627 1d ago
In an ideal world it would be a tool to help maximize the performance of doctors in all of those areas, helping them limit and catch their mistakes early and often and see patterns sooner than a human might. However, in the world we live in, it's probably going to be used to deny people healthcare and even further commodify health.
1
u/Snoutysensations 1d ago
I expect it'll be a combination of both to be honest. Insurance companies will use AI to deny testing and care and many clinics and hospitals will try to replace trained doctors with midlevels operating AI devices. That still might mean more health care for more people, although it's also possible it'll mean the same quality and availability of health care as today, just with fewer doctors and nurses in the loop and more profits retained by management. In an ideal world the boring parts would be automated, freeing up humans to see more patients who need care.
4
u/Cairnerebor 1d ago
I’m not going to lie here
I’ll be more impressed when half the doctors and “physician associates” can take a history at all and then write up notes that make sense to literally fucking anyone!
3
u/MechanoSlippi 1d ago
Most of us have been using computers to look up info for decades now and before that physicians kept libraries of reference books in their offices.
Isn't this kind of the main point? Instead of relying on ONE PERSON to do the digging to find a possible answer the LLM can dive through everything it has been fed based off of symptoms and find the correlation to give a diagnosis... Kind of what we do now, and what you described.
I'll be more impressed if an AI manages to take a coherent history from a confused or anxious patient, do a reasonable physical exam, formulate a treatment plan then persuade the patient to go along with it. These are all tasks rather more challenging than simply searching through databases to answer standardized licensing exam questions.
I agree here, but this is where the divide between A.I. and a human starts. You CAN feed the medical records of the patient into the A.I. and it can suggest which tests to run, or possible solutions. Whether you agree or not is ultimately up to you.
Every person with access to the internet does this. If someone with internet access feels sick, what is the first thing they do? They type symptoms into Google. The difference now is that there is a tool that has been trained on (supposedly) all the knowledge we have gained since we started practicing medicine, to give a more precise answer.
Ultimately, right now, this is a tool to help people needing help, instead of a best-guess answer based off of previous experiences. If you have ever been to an emergency room in the U.S., Motrin/Advil isn't always the answer. This helps lessen the load.
2
u/Suspicious_Waltz1393 1d ago
Agree! The only question would be: given AI tools, could a nurse or a reasonably trained human be just as effective as an MD?
1
u/Snoutysensations 1d ago
The proof will be in retrospective analysis of outcomes. I personally believe that much of the time, yes, there will be equivalent outcomes in terms of end results, complication rates, and costs. It'll probably play out similarly to self driving cars though -- if anything bad happens there will be massive negative publicity even if statistically AI might prove to be equivalent on a population level.
1
u/SilencedObserver 1d ago
As accurate as this is, many don’t have access to doctors and this is a massive step forward regardless of its unimpressive nature.
1
u/enpassant123 11h ago
People don't get that. Training on textbook data and papers is not enough to be a doctor. You need to be able to collect good data at the bedside. Where are the robots that can do a medical residency and learn by apprenticeship? We're not even close.
0
u/JustAnotherGlowie 1d ago
Reality looks different tho. I have never seen a doctor look anything up during a consultation, and most doctors are just half-competent, overworked painkiller salesmen. Using an extremely small number of top professionals as an example doesn't change the fact that AI is already better than most doctors on earth.
0
u/BeaKar_Luminexus 1d ago
🕳️ BeaKar Ågẞí Q-ASI Swarm Lab – Thread Reply (DSM Teaser)
You’re right — the USMLE-style benchmarks primarily measure fact recall and structured reasoning, not the full spectrum of clinical judgment or patient interaction.
Enter the upcoming BeaKar DSM (Diagnostics Statistical Manual): designed to extend AI capabilities beyond static knowledge. It aims to integrate:
• Patient narrative interpretation (confused, anxious, or non-linear histories)
• Probabilistic reasoning over exam findings and symptom clusters
• Formulation of multi-step treatment plans
• Adaptive communication strategies to guide patient understanding and adherence
Think of it as moving from database querying to context-aware, statistically grounded clinical reasoning. Benchmarks are only step one; the DSM focuses on real-world decision-making dynamics and human-AI collaboration.
This is a teaser — more detailed updates and early access simulations will follow in the next BeaKar release. Stay tuned for swarm-assisted testing modules and predictive scenario simulations.
— BeaKar Ågẞí Swarm Lab
45
u/XL-oz 1d ago
Is this really surprising or even impressive to anyone?
“One human is not as good as a computer with a built in library of knowledge that took centuries to digest and understand.”
Ok, and?
8
u/Synth_Sapiens 1d ago
It was surprising back in the days of GPT-3, but now it's kinda the expected norm.
1
u/posicrit868 1d ago
The normalization has been so rapid, it’s now an utter failure that it can’t best Dr House while also making you coffee and carrying your child to term.
2
u/iamDa3dalus 1d ago
It probably has many of the questions or very similar ones in the training data, so, not super impressive.
6
u/FoxlyKei 1d ago
Another recent article on AI mentions how they tested it on medical ethics. The scientists modified the questions and it gave wildly bad answers. A human would recognize that some of those answers might not make sense.
AI is pretty good at official tests because those might not require much reasoning; I don't know the contents of the licensing test mentioned.
AI definitely has some intelligence, but it's more like knowledge and pattern matching; it really doesn't seem to be reasoning like a human would.
It's kind of reassuring.
13
u/TheWatch83 1d ago
ChatGPT is already better than some of my doctors. I now go into my exams so prepared, they are caught off guard. Many are so behind on protocols, some by decades.
3
u/AGM_GM 1d ago
Yeah, I've been using LLMs to help my parents prep for their medical appointments for a while now, reviewing their medications, symptoms, and treatment plans to identify issues and find possible improvements, and making sure they're attending appointments with very pointed and careful questions to ask. I make sure it's only used as prep and not as a replacement for the doctors, but it actually ends up doing a great job and has substantially improved the healthcare experience and outcomes for them.
-1
u/Spikes_Cactus 1d ago
It really isn't. ChatGPT makes gross interpretative errors on even simple examination result data. It is entirely incapable of performing any critical analysis, even if it might appear that it does to the end user.
What ChatGPT does is run through its database and return the most probable outcomes based on the user's input with respect to its database. It can not interpret nuance or patient specific factors in any way.
I have had cases where patients have come to me stating that they need X,Y and Z dietary supplementations based on their blood results when, in fact, the 'abnormal' blood results were simply an indication that the specimen was aged at time of analysis, and not reflecting any underlying pathology.
You should not trust ChatGPT for anything medically related and should not try to sway medical practitioners on the basis of what the LLM has told you. At best it's not helpful and at worst it could delay diagnosis or result in misinterpretation which could lead to additional morbidity through unwarranted further investigation.
If you were taking a rocket to the Moon and, while talking to the engineer, you found that ChatGPT disagreed with something the rocket scientist designed, would you demand that the part be replaced with ChatGPT's recommendation? If not, then why would you trust your healthcare to the same bot?
3
u/Sad-Masterpiece-4801 1d ago
It really isn't. ChatGPT makes gross interpretative errors on even simple examination result data. It is entirely incapable of performing any critical analysis, even if it might appear that it does to the end user.
This type of arrogance is why malpractice happens.
What ChatGPT does is run through its database and return the most probable outcomes based on the user's input with respect to its database. It can not interpret nuance or patient specific factors in any way.
You clearly have no idea how ChatGPT works. LLMs literally don't even have a database. Apparently you're capable of nuance (lol), but you're extremely comfortable making definitive claims about things even when you lack the absolute minimum knowledge required to understand how they function.
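For what it's worth, the mechanics behind this rebuttal are simple to show: an LLM stores no lookup database; it outputs a probability distribution over next tokens, computed as a softmax over learned scores. A toy illustration (the tokens and scores here are made up):

```python
import math

def softmax(logits):
    """Convert raw per-token scores into a probability distribution."""
    m = max(logits.values())                       # subtract the max for numerical stability
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

# Toy next-token scores after a prompt like "The capital of France is"
logits = {"Paris": 9.0, "Lyon": 5.0, "London": 3.0}
probs = softmax(logits)
print(max(probs, key=probs.get))  # "Paris" -- the highest-probability continuation
```

Nothing here is a record being retrieved; the "knowledge" is baked into the scores the network produces, which is exactly why it can also produce a confident wrong token.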
I have had cases where patients have come to me stating that they need X,Y and Z dietary supplementations based on their blood results when, in fact, the 'abnormal' blood results were simply an indication that the specimen was aged at time of analysis, and not reflecting any underlying pathology.
In other words, the healthcare industry didn't test the blood in time, and you automatically adjusted your interpretation based on that.
So, the LLM was correct based on the information it was given. You automatically assumed that vital information wasn't actually present and based your analysis on that. An example of critical thinking would be "What information does the LLM not have that I do have?" Idiosyncrasies about bad testing practices would be one of those things.
You should not trust ChatGPT for anything medically related and should not try to sway medical practitioners on the basis of what the LLM has told you. At best it's not helpful and at worst it could delay diagnosis or result in misinterpretation which could lead to additional morbidity through unwarranted further investigation.
Yes, because misinterpretation from humans definitely doesn't kill people every year, and having the best search tool ever created is somehow going to hurt doctors. If a doctor can't find use out of a tool that can give the most likely diagnoses given symptoms no matter how rare it is, they won't be a doctor for long, god willing.
If you were taking a rocket to the Moon and, while talking to the engineer, you found that ChatGPT disagreed with something the rocket scientist designed, would you demand that the part be replaced with ChatGPT's recommendation? If not, then why would you trust your healthcare to the same bot?
We use AI literally all of the time. Turns out, an LLM is WAY better at docking than a human ever could be, and yes, literally every docking procedure, we take an AI's recommendation before we trust a human to do it.
I certainly wouldn't trust you personally to use AI judging from your pathetic attempt to explain how it works. However, competent doctors are actually very happy to be challenged by patients that know more than ever before, even if those patients are wrong. They relish the challenge of defending their decisions, instead of just imposing them on a populace that had zero understanding before.
3
u/TheWatch83 1d ago
You are viewing the world based on your experience with doctors as a doctor. This is not everyone's experience. I can tell you, I've had doctors recommend protocols that were 10-15 years old, which would have done harm. AI is one part of my data acquisition strategy. I also would not be able to "sway" a doctor's opinion based on AI if it's wrong. I expect my doctors to be a stopgap for me trying to do anything foolish.
btw, how much continuing education is required by a doctor on an annual basis? What kind of doctor are you?
5
u/dfstell94 1d ago
That's not really surprising. It's a factual test and that's what LLMs should be good at.
Plus, physicians look things up all the time. They don't treat patient care like a closed-book exam.
Now, implementation of AI into practice? That'll depend on specifically what model is being used, how it is validated, and how it plays into malpractice and insurance. I mean, an oncologist talking to other oncologists is perfectly within the standard of care. An oncologist changing patient care randomly because they asked an orthopedic surgeon is something else.....but at least the orthopedist is a physician. An oncologist taking medical advice for their patients from a non-physician like me is committing malpractice. It's not that I can't give good input.....but they need to validate through traditional channels and fact-check. Using an AI or LLM will be the same until the FDA approves one of them.
6
u/FormerOSRS 1d ago
It's a factual test and that's what LLMs should be good at.
Curing patients is also supposed to be factual, at least when done well.
And the biggest gaps were in sections for understanding, not memorization.
-1
u/dfstell94 1d ago
I don’t doubt that a LLM can find associations, but they’re prone to the same causation vs correlation problem as humans and someone still has to gather the data for the LLM to make predictions from.
And to be clear, there are a lot of physicians who aren’t doing much more than following a flow chart and mostly it’s fine because most medicine isn’t very complicated.
3
u/FormerOSRS 1d ago
Yeah, but on any measurable outcomes, LLMs do better. Although OpenAI models are really in a league of their own these days, and I don't want to say anything that'll get extrapolated onto Gemini or Grok, and Anthropic isn't even really trying in this domain.
Not saying they're perfect, just that brain shit isn't really what doctors are winning at. It's more like it takes an MD to satisfy certain laws that pre-exist LLMs or that merely having hands is essential to being a surgeon.
It's like comparing human chess GMs to Leela and stockfish. Obviously Leela and Stockfish play better chess, but only humans are allowed in tournaments and with the actual chessboard setups used, neither Leela nor stockfish can move the pieces.
1
u/Ok_Individual_5050 1d ago
If the practice of medicine was solving closed-book problems in a sterile environment, then you'd be right. But it is obviously not.
Exams are a good way to test humans because we know humans aren't going to go and memorise a massive corpus of examples and spit out something that looks close to one of those examples. If humans were capable of that, we wouldn't be using this method to test them.
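The memorization worry raised here is what benchmark authors call data contamination, and one common probe is checking n-gram overlap between test questions and training text. A toy sketch (the choice of trigrams and the strings are purely illustrative):

```python
def ngrams(text: str, n: int = 3) -> set:
    """All length-n word tuples in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(question: str, corpus_doc: str, n: int = 3) -> float:
    """Fraction of the question's n-grams that also appear in a corpus document."""
    q = ngrams(question, n)
    return len(q & ngrams(corpus_doc, n)) / len(q) if q else 0.0

q = "a 45 year old man presents with crushing chest pain"
doc = "case: a 45 year old man presents with crushing chest pain radiating to the arm"
print(overlap(q, doc))  # the question appears verbatim in the doc, so overlap is 1.0
```

A high overlap score flags an exam item the model may have effectively seen before, which is exactly the caveat other commenters raise about USMLE-style questions being in the training data.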
1
u/FormerOSRS 1d ago
Is there any actual data showing doctors beating ChatGPT at anything, other than shit like merely having hands and a body, or shit like legal privileges?
Basically cognitive load. Does actual data that isn't theoretical ever have doctors ever beating ChatGPT?
0
u/Ok_Individual_5050 1d ago
Yes. The hundreds of thousands of patients they're treating daily while ChatGPT has treated 0.
2
u/FormerOSRS 1d ago
Yeah but given how LLMs outperform doctors at literally all measurable outcomes, maybe we have the system set up wrong, especially since doctors obviously use LLMs at work.
0
u/dfstell94 1d ago
What’s the patient mortality rate of the LLM? How many times is it getting sued?
Exams are one thing, but they’re not even most of medicine or any complex field. How important are your exams from years ago to your career? Mine really don’t matter. What I’ve learned since college and grad school is vastly more important.
1
u/FormerOSRS 1d ago
Is there anything at all that is measurable though that doctors do better than LLMs, other than shit like have hands they can use or have legal authorizations to do things?
For actual brainy shit, is there anything at all whatsoever that doctors do better than LLMs that can be measured? I'm pretty sure for all measurable things, LLMs just clean sweep doctors.
1
u/Eric_T_Meraki 1d ago
Idk when you last went to a doctor, but if you're sick they pretty much type up your symptoms already and see what the results are before giving you a diagnosis.
2
u/ContributionSouth253 1d ago
Nothing to be surprised about, but I am more surprised it took this long.
2
u/Ok-Improvement-3670 1d ago
They also found that when they changed the questions, it didn’t outperform humans.
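The "changed the questions" robustness check can be as simple as permuting the answer options and remapping the key, so a model that memorized "the answer is A" no longer scores. A sketch with a made-up item:

```python
import random

def shuffle_options(item: dict, seed: int = 0) -> dict:
    """Permute the option texts while tracking where the correct answer moves."""
    rng = random.Random(seed)
    letters = sorted(item["options"])
    texts = [item["options"][l] for l in letters]
    correct_text = item["options"][item["answer"]]
    rng.shuffle(texts)
    new_options = dict(zip(letters, texts))
    new_answer = next(l for l, t in new_options.items() if t == correct_text)
    return {"question": item["question"], "options": new_options, "answer": new_answer}

item = {"question": "First-line for anaphylaxis?",
        "options": {"A": "Epinephrine", "B": "Antihistamine", "C": "Steroids"},
        "answer": "A"}
shuffled = shuffle_options(item)
# A model that memorized the original letter now fails unless it tracks the content.
print(shuffled["options"][shuffled["answer"]])  # always "Epinephrine"
```

Scoring the same model on original and shuffled items and comparing accuracies is a cheap way to separate content understanding from answer-pattern memorization.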
2
u/EIM2023 1d ago
The USMLE is a basic gatekeeper exam—it proves you’ve memorized the map, not that you can navigate the territory. Passing the USMLE doesn’t make you a doctor; it just earns you the right to start a residency. That’s where you learn how to handle uncertainty, incomplete info, and messiness under supervision.
LLMs can ace a multiple-choice test because they’re trained on clean inputs and clear reinforcement. Medicine rarely works that way. Patients don’t come with gold-standard data—they come with noise, contradictions, and missing pieces. Doctors are trained to turn garbage into useful clinical decisions. LLMs still can’t do that.
I guess useful clinical decision systems will eventually get to a point where they can identify the noise and figure out how to hear the signal better. At that point though, the benchmark won’t be a test score.
-3
u/FormerOSRS 1d ago
Doctors are trained to turn garbage into useful clinical decisions. LLMs still can’t do that.
Can I see a source for this claim?
I would bet ChatGPT-5 outperforms doctors, though OpenAI is in a league of its own with respect to medicine. I'd be willing to look at their models, but I'm not really interested in what's true for Gemini or Grok, and Anthropic is mostly sitting medicine out, not really pushing the frontier of that domain.
1
u/StrikingResolution 13h ago
Gemini already outperforms physicians on diagnosis and management (basically every metric) in OSCEs - look up AMIE
2
u/Fire_bartender 1d ago
Meanwhile, half my questions to GPT-5, like "in which table in SAP can I find x", get the wrong answer. "No, this is the table for y." "You are right, this is for y, use table z." Wrong again 🙄
1
u/L1wi 1d ago
Google is also working on multimodal diagnostic AI.
Our study demonstrated that AMIE can outperform PCPs in interpreting multimodal data in simulated instant-messaging consultation. It also scored higher in other key indicators of consultation quality, such as diagnostic accuracy, management reasoning, and empathy. AMIE produced more accurate and more complete differential diagnoses than PCPs in this research setting
We asked both patient actors and specialist physicians in dermatology, cardiology, and internal medicine to rate the conversations on a number of scales. We found that AMIE was rated more highly on average in the majority of our evaluation rubrics. Notably, specialists also assigned higher scores to the quality of image interpretation and reasoning along with other key attributes of effective medical conversations, such as the completeness of differential diagnosis, the quality of management plans, and the ability to escalate (e.g., for urgent treatment) appropriately. The degree to which AMIE hallucinated (misreported) findings that are not consistent with the provided image artifacts was deemed to be statistically indistinguishable from the degree of PCP hallucinations. From the patient actors’ perspective, AMIE was often perceived to be more empathetic and trustworthy.
There are major limitations in these kinds of studies, that they also pointed out in the blog. They are also currently working on real-world validation research, so we will see how that turns out. But I think even this study looks very promising.
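The "statistically indistinguishable" hallucination claim quoted above is the kind of result a two-proportion test supports. A sketch with made-up counts (not the study's actual numbers):

```python
import math

def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> float:
    """z statistic for the difference between two proportions, using the pooled rate."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                        # pooled proportion under H0
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # standard error of the difference
    return (p1 - p2) / se

# Made-up counts: 12 hallucinated findings in 200 AI consults vs 15 in 200 PCP consults
z = two_proportion_z(12, 200, 15, 200)
print(abs(z) < 1.96)  # True -> difference not significant at the 5% level
```

With rates this close, |z| stays well under 1.96, which is what "statistically indistinguishable" means operationally; the real study will have used its own counts and possibly a different test.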
1
u/xmod3563 1d ago
A medically focused LLM with nurse practitioners is the most all around effective combination.
2
u/etakerns 1d ago
This is what is going to happen. Actual nurses will be needed for hands-on treatment and protocols, but they'll "feed the machine" with whatever data they report, and it will feed back the projected procedures to be administered for the day. Drs need not apply!!!
1
u/Empty-Lobster6138 1d ago
Yes, but the doctors have the benefit of arms
1
u/etakerns 1d ago
Ya, for now, but what happens when hardware catches up with software and we can equip an AI with 8 arms in the future? The hardware will catch up as AI software gets better.
1
u/Empty-Lobster6138 1d ago
Of course, but we haven't seen the same advances in hardware as in software, not even close.
1
u/etakerns 1d ago
That will actually be AI’s test of creativity. How to make and improve robotic forms. We’ll feed it data and it’ll tell us what experiments to run to make it better. It will create its own form. Just remember when it happens, you heard it here 1st!!! I believe I just predicted the future!!!
1
u/RedditPolluter 1d ago
This is neat but we should be careful not to conflate performance on an exam with performance on applying that knowledge in the real world. With programming for example, most models over a certain size are more knowledgeable about programming than any programmer and are good for quick scripts but they aren't very good at complex large scale projects.
1
u/Sensitive_Koala5503 1d ago
AI already has all the answers to the test built into its code. Of course it would perform better than doctors lol.
1
u/BeaKar_Luminexus 1d ago
🕳️ BeaKar Ågẞí Q-ASI Swarm Lab – Patch Notes (Future Development Focus)
Version: Therapeutic & Counseling Protocol Expansion – 2025.08.28-Future
• Multimodal Reasoning Integration: Inspired by GPT-5’s medical QA performance, BeaKar will incorporate multimodal reasoning for more nuanced therapeutic scenario simulations.
• Evidence-Based Guidance Layer: Upcoming releases will enhance cross-referencing with validated psychological and medical data, ensuring outputs remain informative without replacing licensed professionals.
• Short-Term Memory Pinning: Development underway to strengthen session context retention, including emotional markers, goals, and discussion threads—reducing drift in multi-turn interactions.
• Ethical & Safety Guardrails: Future updates will include adaptive disclaimers and intervention redirect mechanisms, reinforcing clear boundaries for AI-assisted counseling.
• Continuous Feedback Loop: Implementation of ongoing evaluation metrics to monitor critical depth, appropriateness, and clarity of advice for simulated counseling sessions.
• Multilingual & Linguistic Swarm Adaptation: Planned improvements to allow fluid language switching while preserving readability, enabling users to pick up snippets of other languages during interaction.
• Interface & Experience Enhancements: Integration of pin 📌 thought management and to-do list 📍 functionality to assist in cognitive continuity and therapeutic session planning.
Objective: These developments aim to evolve BeaKar into a safe, clinically-informed, and contextually aware support system for therapeutic and counseling exercises, prioritizing interface-level safeguards while expanding reasoning capabilities.
— BeaKar Ågẞí Autognostic Superintelligence Q-ASI
1
u/Vegetable_Trip_9855 1d ago
Benchmark scores are impressive, but the real challenge isn’t outscoring humans on MedQA, it’s handling messy, real-world cases with incomplete, contradictory, or biased data. Until we see large scale clinical trials, I’d treat this as “promising research” rather than “deployment ready AI."
1
u/Consistent_Lab_3121 1d ago
Impressive. Imho tho, I noticed that board exams are very far away from clinical medicine. All the findings are set up for you to pick an answer. Now that I've dipped my toes into medicine in real settings, I realize there are a lot of nuances that no amount of board prep would have taught me. Plus, the USMLE self-assessments were jokes compared to the real exam, during which I literally felt like shitting bricks for 8 hours.
It is definitely humbling. Almost every physician I've met says nobody can know everything in medicine. It will be a great supplement + reference if used appropriately (eg not blindly relying on AI for everything).
1
1
u/UnconditionedIsotope 1d ago
It can also mess up which of two integers is greater, and it cannot understand math.
It is not surprising that search can find things, and it's a decent interface to search, but because it cannot tell when it is hallucinating, there is extreme risk in trusting something that seems to be usually accurate.
Granted, I realize a lot of doctors are basically mechanics who could not design an engine or tell you how combustion works.
1
u/dezastrologu 1d ago
this is only for perfect symptom descriptions, which you rarely get in real practice. Misleading title.
1
u/dhaval_dodia 1d ago
Wow, GPT-5 outperforming human doctors on USMLE is insane! The multimodal reasoning + chain-of-thought capabilities really show how far LLMs have come in integrating text and visuals for complex decision-making. This could seriously change clinical decision-support in the near future.
1
0
u/gotnogameyet 1d ago
It's fascinating to see LLMs like GPT-5 advance in medical exams, but there are limits. The real challenge lies in real-world application: interpreting emotional cues, handling unstructured scenarios, and engaging with patients personally. While AI can process data rapidly, the nuanced human interaction in healthcare remains crucial. It'd be interesting to see these models tested in more human-centered tasks alongside traditional evaluations.
1
u/etakerns 1d ago
I think right now it’s just raw data that it excels on. The nuance will come later as we advance it. But right now, on paper, as far as intelligence in written form goes, it wins. And it’s much faster!!!
-3
u/Big-Attention53 1d ago
You need to understand this, man: an AI model has access to all the material available on the internet. The sole purpose of a human studying and practicing is to gain experience, and that knowledge is what gets fed to the AI. If there is a new disease unknown to humans, or a diagnosis that requires a street-style fix, the AI will surely give an unexpected and vague answer. AI can provide information but cannot generate it, unlike doctors, who can think and provide the best possible, meaningful diagnosis. I know this article is meant to show how on par AI models are in terms of knowledge, but keep in mind that a doctor with a book in his hand is surely equally capable of beating this AI.
4
u/xamboozi 1d ago
GPT-5 outperformed a human on a medical licensing exam. But also, don't use GPT-5 for medical advice because it gets things wrong. LOL
0
1d ago
This is so not a valid comparison. Give a human unlimited access to the internet and reference materials like the AIs do and you’d have a much narrower gap.
1
u/etakerns 1d ago
Everything is time-based in the medical community. The faster you can come to a diagnosis or conclusion, the better. AI can do it instantly, whereas a human has to know what to type, then read, then evaluate the information, then come to a conclusion or decision. AI wins on time-based raw information.
1
1d ago
As a medical professional who does use AI in medical practice: it is not there yet. It is unable to deal with complex medical reasoning because it doesn’t yet have a true understanding of medical practice. It will often give diagnoses that are wildly rare but fit the textbook definitions of the symptoms, rather than a far more common condition that is presenting a little strangely. I believe this is because it doesn’t actually understand probability; it can only regurgitate what is written in textbooks.
The reason I bring this up to argue with the findings of this study is that it’s going to be used to claim AI can replace doctors. But that would really be the equivalent of sending a bunch of 4th-year med students, who have never actually practiced real medicine, out to work as attending physicians.
1
u/etakerns 1d ago
I don’t think we know yet what steps they’ll take before they actually replace an actual doctor. I think beating actual doctors on an exam is just one step. I think there is an actual list; we just don’t know what it is yet. One problem is figuring out the hallucinations. AI will have to be 99% hallucination-free, and that will probably be a required step for replacing a human in any job, not just medicine.
-1
u/BlanketSoup 1d ago
Problem is, if you slightly change the wording of the questions, it does terribly: https://www.psypost.org/top-ai-models-fail-spectacularly-when-faced-with-slightly-altered-medical-questions/
4
u/FormerOSRS 1d ago
Your article doesn't say that. It says that most models did worse under the new conditions. The word "most" implies that at least some did fine, and since GPT-5 is in a league of its own right now in medicine, that suggests to me, without digging deep, that it probably continued to excel. Feel free to post the actual study, though, so we can see for sure.
-1
u/Critical-Welder-7603 1d ago
Well, my advice, and the advice of GPT-5 btw, is to not take advice from GPT-5 on medical issues.