r/technology • u/reflibman • 7h ago
Machine Learning Top AI models fail spectacularly when faced with slightly altered medical questions
https://www.psypost.org/top-ai-models-fail-spectacularly-when-faced-with-slightly-altered-medical-questions/227
u/Noblesseux 7h ago
I mean yeah, a huge issue in the AI industry right now is people setting totally arbitrary metrics, training a model to do really well at those metrics, and then claiming victory. It's why you basically can't trust most of the metrics they sell to the public through glowing articles in news outlets that don't know any better; a lot of them are pretty much meaningless in the broad scope of things.
50
u/karma3000 6h ago
Overfitting.
An overfit model can't generalise to data that wasn't in its training set.
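If you want to see the idea in miniature, here's a quick sketch (scikit-learn, my own toy example, not from the article): a high-degree polynomial can nail 20 noisy training points and still do badly on held-out data from the same curve.

```python
# Toy overfitting demo: fit the same noisy sine data with a small and a huge polynomial.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 20).reshape(-1, 1)
y_train = np.sin(2 * np.pi * x_train).ravel() + rng.normal(0, 0.1, 20)
x_test = rng.uniform(0, 1, 200).reshape(-1, 1)
y_test = np.sin(2 * np.pi * x_test).ravel() + rng.normal(0, 0.1, 200)

for degree in (3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(x_train))
    test_err = mean_squared_error(y_test, model.predict(x_test))
    # The degree-15 fit typically shows near-zero training error and a much larger test error.
    print(f"degree {degree}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")
```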
24
u/Noblesseux 4h ago
Even outside of aggressive overfitting, there are a lot of situations where it's like: why are we surprised that a benchmark we made up, and that the entire industry then set as an objective, saw improving scores year over year?
This is basically just a case of Goodhart's Law ( https://en.wikipedia.org/wiki/Goodhart%27s_law ): a measure stops being a good measure once it becomes a target. When you treat passing the bar or a medical exam as an important intelligence test for computers, you inevitably end up with a bunch of computers that are very good at medical exams even if they're not getting better at other tasks.
6
u/APeacefulWarrior 3h ago
After decades of educators saying that "teaching to the test" was terrible pedagogy, they've gone and applied it to AI.
4
u/happyscrappy 2h ago
Wall Street loves a good overfit. They make a model which can't be completely understood due to complex inputs. To verify the model they backtest it against past data to see if it predicts what happened in the past. If it does then it's clearly a winner, right?
... or, more likely, it's an overfit to the past.
So I figure if you're a company looking to get valued highly by Wall Street, it's probably best to jump in with both feet on the overfitting. You'll be rewarded financially.
1
u/green_meklar 2h ago
The really ironic part is that we've known for decades that measuring intelligence in humans is very hard. I'm not sure why AI researchers think measuring intelligence in computers is somehow way easier.
-2
u/socoolandawesome 6h ago edited 6h ago
The best model they tested was OpenAI's smaller reasoning model from three generations ago, which also dropped in performance much less than the other models (same with DeepSeek R1)
I wouldn’t take much from this study.
19
u/Noblesseux 6h ago
That changes borderline nothing about the fact that all the articles fawning over ChatGPT for passing tests it was always well suited and trained to pass via pattern matching were stupid.
It doesn't matter what gen it is; AI boosters constantly do a thing where they decide some super arbitrary test or metric spells the end times for a particular profession, despite knowing very little about the field involved or the objectives in giving those tests to humans in the first place.
This study is actually more relevant than any of the nonsense people talked about, because it was designed by actual people who know what is important in the field, not arbitrarily picked out by people who know borderline nothing about healthcare. There is a very important thing to glean here that a lot of people are going to ignore because they care more about being pro-AI than about being realistic about where and how it is best used.
7
u/socoolandawesome 5h ago edited 5h ago
I mean this isn't true tho, the real-world utility of these models has clearly increased too. Yes, some companies at times have probably overfit for benchmarks, but researchers at some of these companies talk about specifically going out of their way not to do this. Consumers care about real-world utility, and to people like programmers who use it, it becomes obvious very quickly which models are benchmaxxed and which aren't.
For instance, the IMO gold medal that OpenAI recently got involved extremely complex logic proofs, and the IMO wrote completely novel problems for its competition. People thought it was a long way off before a model could get a gold medal, and that math proofs were too open-ended and complex for LLMs to be good at.
And you're also wrong that they aren't working specifically with professionals in various fields; they constantly are.
5
u/Noblesseux 5h ago edited 5h ago
> I mean this isn't true tho, the real-world utility of these models has clearly increased too.
...I'm not sure you're understanding the problem here. No one said "LLMs have no use"; I'm saying that when you build a thing that is very good at basically ignoring the core reason a test is used on humans, you cannot then claim that it's basically RIP for doctors.
We don't design tests based on a theoretical human with eidetic memory of previous tests/practice quizzes. We design tests with the intention that you're not going to remember everything and thus need to reason your way through some of them using other things you know. The whole point of professional tests is to make sure you have functional critical reasoning skills that will be relevant in actual IRL use.
Even the IMO thing is neat but not insanely meaningful; it doesn't directly communicate much beyond the fact that they designed a model that can do a particular type of task at least once through. It's an experiment they specifically trained a thing for to see if they could do it, and Google managed to do it too lol. It's largely arbitrary.
Like if I make a test to see who can get to the top of a tree and grab a coconut, and pit a human against a monkey, does it mean the monkey is smarter than the human? No, it means the monkey is well suited to the specific mechanics of that test. Now imagine someone comes in with a chainsaw, cuts the tree down, and snatches off a coconut. How do you rate their ability when they basically circumvented the point of the test?
> And you're also wrong that they aren't working specifically with professionals in various fields; they constantly are.
Don't know how to tell you this big dog, but I'm an SWE with a background in physics and math. In AI it is VERY common to make up super arbitrary tests, because practically speaking we don't actually know how to test intelligence. We can't even do it consistently in humans, let alone in AI models. People make benchmarks that current models are bad at, and then try to train the models to be better at those benchmarks. Rinse and repeat. The benchmarks often aren't meant to test the things that someone who actually does the job would say are important. For example: I don't see a portion of the SWE benchmark dealing with someone who doesn't really know what they want half-explaining a feature and you having to turn that into something buildable.
2
u/socoolandawesome 5h ago edited 5h ago
The IMO model was not a special fine-tuned model; it was a generalist model. The same model also won a gold medal in the IOI, the analogous competition for competitive coding. Google is another great AI company, although their official gold medal was less impressive since the model was given hints and a corpus of example problems in its context, though they also claimed to do it with a different model without hints. No one said mathematicians will be irrelevant once GPT-6 arrives.
No one said doctors are irrelevant now. When people talk about jobs becoming obsolete, at least for high-level jobs, they are typically talking about future models years down the line. Dario Amodei, CEO of Anthropic, said entry-level jobs are under threat in the next 5 years.
As to what you're saying about what we test for in humans, you're correct.
However I don't think people grasp that LLMs just progress in a very different way than humans. They do not start from basics like humans do in terms of how they progress in intelligence. This is not to say the models don't grasp basics eventually; I'm speaking in terms of how models are getting better and better. I'll take this from my other comment, since it explains how scaling data makes models more intelligent:
If a model only sees medical questions in a certain multiple choice format in all of its training data, it will be tripped up when that format is changed because the model is overfitted: the parameters are too tuned specifically to that format and not the general medical concepts themselves. It’s not focused on the important stuff.
Start training it with other forms of medical questions, and other medical data, in completely different structures as well, and the model starts to have its parameters store higher-level concepts about medicine itself instead of focusing on the format of the question. Diverse, high-quality data, scaled up, allows it to generalize and solidify concepts in its weights, which are ultimately expressed to us humans via its next-word prediction.
It will begin to grasp the basics and reason correctly with enough scale and diversity in data.
Although I should also say the way reasoning is taught is slightly different, as it involves RL scaling instead of pretraining scaling. You basically have it start chains of thought that break complex problems down into simpler ones, so the model is "thinking" before outputting an answer. In training you give it different questions you know the answer to, let it generate its own chain of thought, and once it gets it correct you tweak the weights to increase the probability of the correct chains of thought and decrease the probability of the incorrect chains of thought being outputted by the model. You can also do this for individual steps within the chain of thought. You then scale this across many such problems, so that it again begins to generalize its reasoning methods (chains of thought). This basically lets the model teach itself reasoning.
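A hand-wavy toy version of that loop, just to show the shape of it (made-up numbers and canned "chains of thought"; real reasoning-model RL is far more involved):

```python
# Toy REINFORCE-style update: sample a chain of thought, reward the ones whose
# final answer is correct, and nudge the "weights" (here, three logits) toward them.
import numpy as np

logits = np.zeros(3)                                 # stand-in for model weights
reaches_correct_answer = np.array([0.0, 0.0, 1.0])   # only chain 2 ends correctly

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
lr = 0.5
for _ in range(200):
    probs = softmax(logits)
    chain = rng.choice(3, p=probs)                   # model "generates" a chain of thought
    reward = reaches_correct_answer[chain]           # check the final answer
    baseline = probs @ reaches_correct_answer        # simple variance-reducing baseline
    grad = -probs
    grad[chain] += 1.0                               # gradient of log-prob of the sampled chain
    logits += lr * (reward - baseline) * grad        # raise probability of rewarded chains

print(softmax(logits))  # most of the probability mass ends up on the correct chain
```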
Again, if you don't like benchmarks, it's fairly obvious from using the models themselves that they are smarter than previous generations with whatever you throw at them. There are also benchmarks that aren't released until after the models are trained, and certain models still perform better on them.
2
u/Noblesseux 4h ago
It's a generic model... that they tweaked specifically to deal with the IMO.
> The IMO is definitely well known within the [AI research] community, including among researchers at OpenAI. So it was really inspiring to push specifically for that.
That is a quote from one of the scientists who worked on this. They specifically have a section where they talk about spending months pushing with this specific objective in mind. It's not like they just gave GPT-5 a pencil and said get on it son; this is an experimental in-house thing from a team specifically assembled to try to make ChatGPT better at this specific type of math.
> It will begin to grasp the basics and reason correctly with enough scale and diversity in data.
They'll also make shit up more (OpenAI themselves have found that as they scale up their reasoning models, they make more shit up), and the outcome you just described isn't guaranteed like it's a surefire thing. There are a million caveats and "not exactlys" that could be pinned onto how you just presented that.
Also you don't have to explain the concept of reinforcement learning and reasoning models to me, I've been an SWE for like damn near 12 years.
> Again, if you don't like benchmarks, it's fairly obvious from using the models themselves that they are smarter than previous generations with whatever you throw at them.
It would be MORE of a problem if the thing performed worse or the same on the benchmarks we made up and then spent stupid amounts of money specifically trying to address.
2
u/socoolandawesome 4h ago edited 3h ago
https://x.com/polynoamial/status/1946478249187377206
In this thread a lead researcher on it says it was not an IMO-specific model. It's a reasoning LLM that incorporates new experimental general-purpose techniques.
https://x.com/polynoamial/status/1954966398989635668
In this thread, the same researcher says they took the exact same model and used it for competitive coding and it did the best on that.
It's hard for me to see how it went beyond normal training data (which obviously includes stuff like IMO and IOI type problems) to fine-tune just for the IMO. It was not fine-tuned to only output proofs or something like that, and it was then immediately used as-is in a completely different domain.
GPT-5 made huge gains in slashing hallucination rates and it is a reasoning model, so that was an out-of-the-norm case when, I believe, o3 had slightly higher hallucination rates.
They already do grasp the basics better; each generation of models does. I'm just saying it's not working like humans, where you start from basics and fundamentals. It learns everything all at once, and then as it gets more data the concepts/algorithms all become more refined, more consistent, more robust, more reliable, including the basics (and the more complex concepts).
I wouldn't expect an SWE to know about RL unless they worked specifically on making models or are just into AI. RL for LLMs in the manner I described certainly wasn't around before this past year, when the first CoT (chain-of-thought) reasoning model was made by OpenAI and they started to describe how they did it.
Not sure what you mean by your last point or how it relates to the point of mine you're addressing.
1
u/Equivalent-You-5375 36m ago
It’s pretty clear LLMs won’t replace nearly as many jobs as these CEOs claim, even entry level. But the next form of AI definitely could.
72
u/TheTyger 6h ago
My biggest problem with most of the AI nonsense that people talk about is that the proper application of AI isn't to try and use ChatGPT for answering medical questions. The right thing is to build a model which specifically is designed to be an expert in as small a vertical slice as possible.
They should be considered to be essentially savants where you can teach them to do some reasonably specific task very effectively, and that's it. My work uses an internally designed AI model that works on a task that is specific to our industry. It is trained on information that we know is correct, and no garbage data. The proper final implementation is locked down to the sub-topics that we are confident are mastered. All responses are still verified by a human. That super specific AI model is very good at doing that specific task. It would be terrible at coding, but that isn't the job.
Using wide net AI for the purpose of anything technical is a stupid starting point, and almost guaranteed to fail.
12
u/WTFwhatthehell 3h ago
> The right thing is to build a model which specifically is designed to be an expert in as small a vertical slice as possible.
That was the standard approach for a long time but then the "generalist" models blew past most of the specialist fine-tuned models.
17
u/creaturefeature16 6h ago
Agreed. The obsession with "AGI" is an attempt to shoehorn the capacity to generalize into a tool that doesn't have that ability, since it doesn't meet the criteria for it (and never will). Generalization is an amazing ability and we still have no clue how it happens in ourselves. The hubris of assuming that if we throw enough data and GPUs at a machine learning algorithm it will just spontaneously pop up is infuriating to watch.
-7
u/socoolandawesome 6h ago
What are the criteria, if you admit you don't know what it is?
I think people fundamentally misunderstand what happens when you throw more data at a model and scale up. The more data a model is exposed to in training, the more its parameters (neurons) start to learn general, robust ideas/algorithms/patterns, because they are tuned to generalize across the data.
If a model only sees medical questions in a certain multiple choice format in all of its training data, it will be tripped up when that format is changed because the model is overfitted: the parameters are too tuned specifically to that format and not the general medical concepts themselves. It’s not focused on the important stuff.
Start training it with other forms of medical questions in completely different structures as well, and the model starts to have its parameters store higher-level concepts about medicine itself instead of focusing on the format of the question. Diverse, high-quality data allows it to generalize and solidify concepts in its weights, which are ultimately expressed to us humans via its next-word prediction.
2
u/creaturefeature16 5h ago
You're describing the machine learning version of "kicking the can down the road".
1
u/SantosL 7h ago
LLMs are not “intelligent”
-60
u/Cautious-Progress876 6h ago
They aren’t, and neither are most people. I don’t think a lot of people realize just how dumb the average person is.
62
u/WiglyWorm 6h ago
Nah dude. I get that you're edgy and cool and all that bullshit but sit down for a second.
Large Language Models turn text into tokens, digest them, and then try to figure out what tokens come next, then they convert those into text. They find the statistically most likely string of text and nothing more.
It's your phone's autocorrect, if it had been fine-tuned to make it seem like tapping the "next word" button would create an entire conversation.
They're not intelligent because they don't know things. They don't even know what it means to know things. They don't even know what things are, or what knowing is. They are a mathematical algorithm. It's no more capable of "knowing" than that division problem you got wrong in fourth grade is capable of laughing at you.
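Stripped down to a cartoon, the "predict the next token" loop is roughly this (a toy bigram counter, nothing like a real transformer, but it's the same kind of operation):

```python
# Cartoon next-token predictor: count which word follows which, then generate
# by repeatedly appending the most likely follower.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1          # e.g. "the" is followed by "cat" twice

def predict_next(token):
    return following[token].most_common(1)[0][0]

text = ["the"]
for _ in range(5):
    text.append(predict_next(text[-1]))
print(" ".join(text))  # "the cat sat on the cat" -- plausible-looking, no understanding
```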
-18
u/socoolandawesome 6h ago
What is “really knowing”? Consciousness? It's highly unlikely LLMs are conscious. But that's irrelevant for performing well on intellectual tasks; all that matters is whether they perform well.
20
u/WiglyWorm 6h ago
LLMs are no more conscious than your cell phone's predictive text.
-7
u/socoolandawesome 5h ago
I agree that’s incredibly likely. But that’s not really necessary for intelligence
15
u/WiglyWorm 5h ago
LLMs are no more intelligent than your cell phone's predictive text.
0
u/socoolandawesome 5h ago
Well that's not true. LLMs can complete a lot of intellectual tasks that autocomplete on a phone never could.
14
u/WiglyWorm 5h ago
No they can't. They've just been trained on more branches. That's not intelligent. That's math.
2
u/socoolandawesome 5h ago
No, they really can complete a lot more intellectual tasks than my phone's autocomplete. Try it out yourself and compare.
Whether it's intelligent or not is semantics, really. What matters is whether it performs or not.
6
u/notnotbrowsing 6h ago
if only they performed well...
0
u/socoolandawesome 5h ago
They do on lots of things
5
u/WiglyWorm 5h ago
They confidently proclaim to do many things well. But mostly (exclusively) they just produce a string of characters that they deem statistically likely, and then declare it to be so.
5
u/socoolandawesome 5h ago
It's got nothing to do with proclaiming. If I give it a high-school-level math problem, it's gonna get it right basically every time.
4
u/WiglyWorm 5h ago
Yes. If the same text string is repeated over and over in training, LLMs are likely to get it right. But they don't do math. Some agentic models are emerging that break prompts like those down into their component parts and process them individually, but at the outset it's like you said: most of the time. LLMs are predictive engines and they are non-deterministic. The LLM that has answered you correctly 1,999 times may suddenly give you the exact wrong answer, or hallucinate a solution that does not exist.
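The non-determinism part in a nutshell (toy numbers, just to illustrate sampling): the model scores candidate next tokens and samples from that distribution, so the same prompt doesn't always give the same answer.

```python
# Toy sampling demo: the "answer" is drawn from a probability distribution,
# so a wrong token can occasionally come out even when the right one dominates.
import numpy as np

tokens = ["42", "41", "7"]                 # candidate next tokens (made up)
logits = np.array([4.0, 3.5, 1.0])         # made-up model scores
probs = np.exp(logits)
probs /= probs.sum()                       # roughly 0.60, 0.37, 0.03

rng = np.random.default_rng()
for _ in range(5):
    print(rng.choice(tokens, p=probs))     # usually "42", sometimes not
```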
4
u/socoolandawesome 5h ago
No, you can make up some random high-school-level math problem that's guaranteed not to have been in the training data and it'll get it right, if you use one of the good models.
Maybe, but then you start approaching human error rates, which is what matters. Also, there are some problems I think it probably just will never get wrong.
0
u/blood_vein 4h ago
They are an amazing tool. But far from replacing actual highly skilled and trained professionals, such as physicians.
And software developers, for that matter
1
u/Cautious-Progress876 6h ago
I’m a defense attorney. Most of my clients have IQs in the 70-80 range. I also have a masters in computer science and know all of what you said. Again— the average person is fucking dumb, and a lot of people are dumber than even current generation LLMs. I seriously wonder how some of these people get through their days.
1
6h ago
[deleted]
-1
u/Cautious-Progress876 6h ago
No disrespect to them. They are dealing with what nature gave them. But most are barely functioning at the minimal levels of society because of a mixture of poor intelligence and poor impulse control.
Edit: still get the supermajority of their cases dismissed… the first time I deal with them. Most end up repeat flyers though.
2
u/grumboncular 6h ago
Sorry, that was an unreasonable response on my part - I may disagree with the sentiment (although I certainly don’t know what your client base is like) but that’s no reason to be rude to someone I don’t know online.
2
u/Cautious-Progress876 6h ago
I really like them, a lot. It’s nice to help people when possible, but most of them are not running on all cylinders. Part of the reason I support criminal justice reform is I believe our current system unfairly punishes people who often have little control over their own behavior. I don’t know how to fix that situation when people harm others, but our current system doesn’t do anything to help. We basically look at people who are in the “competent but barely” range of life and provide zero assistance. The difference of a few IQ points is the difference between “not criminally responsible” due to intellectual deficiency and “can be executed if the crime is bad enough.”
The majority of low level crime is not committed by evil or mean spirited people, but by people who don’t have the level of executive functioning that you and I take for granted.
Edit: wow, I need to sleep. Not going to even bother trying to correct my grammar and sentences.
2
u/grumboncular 6h ago
Sure; I’m not an expert here, but I do think you can teach people better impulse control and better judgement, as long as you have the right social conditions, too. I would bet that a combination of a better social safety net and restorative instead of retributive justice might get you further than you’d expect with that.
2
u/Cautious-Progress876 6h ago
I agree. Jail hasn’t ever helped any of my clients. No one has gone to jail, said “not again,” and kept up with it, in my experience.
Our school systems massively fail a ton of people.
2
u/Cautious-Progress876 6h ago
Also, no offense taken. I get told worse things all of the time at work (adversarial court systems have downsides). I hope your night is going well.
2
u/Nago_Jolokio 6h ago
"Think of how stupid the average person is, and realize half of them are stupider than that." –George Carlin
3
u/DaemonCRO 4h ago
All people are intelligent, it’s just that their intelligence sits somewhere on the Gaussian curve.
LLMs are simply not intelligent at all. It’s not a characteristic they have. It’s like asking how high can LLM jump. It can’t. It doesn’t do that.
0
u/CommodoreBluth 3h ago
Human beings (and other animals) take in a huge amount of sensory input from the world every single second they're awake, process it, and react/make decisions. An LLM will try to figure out the best response to a text prompt when provided one.
11
u/EvenSpoonier 3h ago
I keep saying it: you cannot expect good results from something that does not comprehend the work it is doing.
15
u/belowaverageint 2h ago
I have a relative who's a statistics professor, and he says he can fairly easily write homework problems for introductory stats that ChatGPT reliably can't solve correctly. He does it just by tweaking the problems slightly or adding a few qualifying words that change the expected outcome, which the LLM can't properly comprehend.
The outcome is that it's obvious who used an LLM to solve the problem and who didn't.
3
u/Twaam 4h ago
Meanwhile, I work in healthcare tech, and there is a giant push for AI everything, mostly for transcription and speeding up notes/niche use cases. It still makes me feel like we will have this honeymoon period and then the trust will fall off. Although providers seem to love tools like Copilot and rely heavily on them.
5
u/Moth_LovesLamp 5h ago edited 5h ago
I was trying to research new papers on dry eye and floater treatments, and ChatGPT suggested dropping pineapple juice in my eyes for the floaters.
So yeah.
5
u/gurenkagurenda 2h ago
It would have been nice to see how the modified test affected human performance as well. It’s reasonable to say that the medical reasoning is unchanged, but everyone knows that humans also exploit elimination and pattern matching in multiple choice tests, so that baseline would be really informative.
1
u/anonymousbopper767 6h ago edited 5h ago
Let's be real: most doctors fail spectacularly at anything that can't be answered by basic knowledge too. It's weird that we set a standard where AI models have to be a perfect Dr. House, but doctors being correct a fraction of that often is totally fine.
Or do we want to pretend med school isn’t the human version of model training?
16
u/RunasSudo 5h ago
This misunderstands the basis of the study, and commits the same type of fallacy the study is trying to unpick, i.e. comparing human reasoning with LLMs.
In the study, LLM accuracy falls significantly when the correct answer in an MCQ is replaced with "none of the above". You would not expect the same to happen with "most doctors", whatever their failings.
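Roughly the kind of manipulation being described, if I'm reading the study right (the example question below is made up, not from the paper):

```python
# Sketch of the study's manipulation: swap the correct option's text for
# "None of the other answers", leaving the underlying medical reasoning unchanged.
question = {
    "stem": "A patient presents with symptom X. What is the most likely diagnosis?",
    "choices": {"A": "Diagnosis 1", "B": "Diagnosis 2", "C": "Diagnosis 3", "D": "Diagnosis 4"},
    "correct": "B",
}

def make_nota_variant(q):
    variant = {**q, "choices": dict(q["choices"])}
    variant["choices"][q["correct"]] = "None of the other answers"
    # A test-taker who actually reasons can still rule out A, C and D;
    # a model that memorized the original answer string loses its anchor.
    return variant

print(make_nota_variant(question))
```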
4
u/Perfect-Resist5478 4h ago
Do… do you expect a human to have a memory capacity that compares to having access to the entire internet? Cuz I got news for you boss….
This is such a hilariously ridiculous take. I hope you enjoy your AI healthcare, cuz I know most doctors would be happy to relinquish patients who think like you do.
-1
u/ZekesLeftNipple 5h ago
Can confirm. I have an uncommon (quite rare at the time, but known about in textbooks) congenital heart condition, and as a baby I was used to train student doctors taking exams. Apparently I failed a few of them who couldn't correctly diagnose me.
-3
u/zheshelman 7h ago
“…. indicates that large language models, or LLMs, might not actually “reason” through clinical questions. Instead, they seem to rely heavily on recognizing familiar answer patterns.”
Maybe because that’s what LLMs actually do? They’re not magical.