While I love specialized AI (models built for one specific task, like AlphaFold), I consider generative AI (like ChatGPT) a useful but unreliable tool: useful for relatively simple tasks like translating or summarizing a document (though you had better check the output), unreliable for jobs that require in-depth comprehension (and the ability to say "I don't know" instead of improvising).
Technically, I'd be curious to know whether they fine-tuned Llama 3 or built a RAG on top of it, since I'm working on something similar. Either way, it's an important achievement that helps doctors keep up with the exponentially growing amount of knowledge out there!
I am always suspicious of LLM claims on benchmarks, because I don't really see how they can train the models while carefully avoiding all the benchmark questions, answers, and question templates (the same question type with slightly different parameters). That material isn't isolated to the benchmark test content; it's derived from existing medical texts, discussions, and so on, to say nothing of the endless medical exams with Q&A that are not part of the benchmarks but influenced them or overlap with them. The benchmark answers are infecting the training data, in other words.
E.g., if one trains an LLM to be knowledgeable about Harry Potter by pouring the internet through it, but excludes the Harry Potter books themselves to avoid accusations that it is just regurgitating, it can still do really well on a Harry Potter exam called "what comes next?", because of course the Harry Potter text is already everywhere on the internet, outside the books themselves.
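To make the contamination point concrete, here is a minimal sketch of the kind of n-gram overlap check labs describe running to "decontaminate" training data. Everything here (function names, the 13-word window) is illustrative, not any lab's actual pipeline:

```python
# Sketch of an n-gram contamination check: flag a training document
# if it shares any long word sequence with a benchmark item.
# The 13-gram window is a common illustrative choice, not a standard.

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_doc: str, benchmark_items: list[str],
                    n: int = 13) -> bool:
    """True if the document shares any n-gram with a benchmark
    question or answer (i.e. it should be filtered out)."""
    doc_grams = ngrams(train_doc, n)
    return any(doc_grams & ngrams(item, n) for item in benchmark_items)
```

The weakness is exactly the one raised above: a paraphrase, or the same question template with different parameters, shares no long n-gram with the benchmark and sails straight through this filter.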
That’s not how you train an expert system based on neural networks. Normally you start from a reliable source of information. For example, for a system reading X-rays to detect breast cancer, you use a database of various parameters (my system had 64 parameters, or "dimensions" as we used to call them). Think of an Excel file with 64 columns per record, in my case. You know the outcome a priori, of course: there must be a field that says whether, at the end of the day, that X-ray led to cancer or not.

Then you calculate the correlation index between each pair of columns; where two columns are strongly correlated, you delete one of them to save time and resources when training. Then you check for dominance (for example, a record that is often repeated) and delete the duplicates. At that point you split the data randomly (multiple times) and use part of it for training, keeping the rest to test the results on unknown inputs (unknown because we have deleted all the duplicates). You repeat the process multiple times to verify convergence and the reliability rate. And then comes the hard part: you find a panel of experts, have them guess whether a patient had cancer based on their experience and the same parameters the network has (though normally we humans use fewer), and compare the number of correct answers from each expert with the ones given by the system. That, roughly, is your benchmark.

With LLMs it's similar. When you create a RAG, you tokenize a bunch of very specialized documents and make them searchable by the LLM, which uses that data to compose a coherent answer. In that sense you are using not the LLM's knowledge about the topic but its ability to produce answers that are readable by a human. Plus, you don't have to retrain the whole system when you add new data. In your example, for a RAG I would have it digest the Harry Potter books and then ask questions about the books that can only be inferred, not regurgitated. But as you said, "old" benchmarks can become spurious if you spread them over the internet and then use the same internet as your information database.
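For readers who want to see the tabular pipeline described above in code, here is a rough sketch of the same steps: pruning one of each strongly correlated column pair, de-duplicating records, and repeating random train/test splits to check convergence. The 0.95 correlation threshold and the commented-out variable names are assumptions for illustration, not the original system:

```python
# Sketch of the classical pipeline: correlation pruning, de-duplication,
# then repeated random splits to estimate a reliability rate.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def prune_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one column of each strongly correlated pair."""
    corr = df.corr().abs()
    # Upper triangle only, so each pair is inspected exactly once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

def repeated_splits(X: pd.DataFrame, y: pd.Series, model, runs: int = 10):
    """Re-split and re-train several times; a low std dev across runs
    suggests the accuracy estimate has converged."""
    scores = []
    for seed in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.25, random_state=seed)
        model.fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))
    return np.mean(scores), np.std(scores)

# Hypothetical usage: `features` is the 64-column parameter table,
# `outcome` the a-priori cancer/no-cancer label.
# X = prune_correlated(features).drop_duplicates()
# y = outcome.loc[X.index]
```

The final step in the text, comparing the system's score against a panel of human experts answering from the same columns, is then just the same scoring loop run on the experts' answers.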
Ah, that's an earlier neural-network training approach. Not an LLM, which, while it is a neural network, belongs to a sub-category called transformers. An LLM absorbs all the text available: gigabytes of it, vast datasets. That's how it ends up "speaking" usually plausible English. In this particular case I bet it was trained to absorb all the medical text available. I'm just skeptical as to how they can avoid absorbing all the answers to known medical benchmarks while they do this, which of course means it then aces said benchmarks.
If someone wants their new LLM to make headlines on benchmarks, the temptation to play fast and loose and end up with data contamination is huge, I reckon.
Transformers are an architecture 😄 The concept behind them is the same: they are multi-layered networks, very deep. Yes, with language it's different in terms of gigabytes and emergent behaviors, but I still think they are going for RAG rather than trying to retrain a whole network. Having a system that can easily keep you updated on all the latest clinical trials, their level of reliability, and so on is already something that could have great impact!
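As a minimal sketch of the retrieval step in such a RAG setup: index a pile of specialized documents, pull the most relevant ones for a question, and prepend them to the prompt. TF-IDF stands in here for the usual embedding model just to keep the example self-contained, and the document snippets are placeholders, not real trial data:

```python
# Minimal RAG retrieval sketch: TF-IDF index over documents,
# cosine similarity to rank them against a question.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Trial NCT-XXXX: drug A reduced progression in cohort ...",
    "Guideline update: first-line therapy for condition B is ...",
]  # placeholder stand-ins for real clinical-trial documents

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the question."""
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, doc_vectors)[0]
    top = scores.argsort()[::-1][:k]
    return [docs[i] for i in top]

# The retrieved passages are prepended to the prompt, so the LLM
# answers from the documents rather than from whatever it memorized,
# and adding new trials means re-indexing, not retraining.
context = "\n".join(retrieve("latest first-line therapy for B?"))
```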
Well, my bet is that all they are doing is taking the current LLM training process and narrowing the input data to everything medical they can scrape, then launching that without the normal guardrails OpenAI builds in to avoid being sued for a wrong diagnosis. Hand-picking the training data to avoid pollution from benchmark-related text would be enormously time-consuming and difficult, and would work against claims of performance anyway, so my second bet is that the data is likely polluted with the benchmark material they do well on. And my last bet is that the actual training will remain proprietary, for "competitive reasons" and to avoid getting sued for the same things OpenAI is getting sued for now: use of copyrighted work without permission.
Yes, this is a cynical view, but that's what 40 years in tech does to one. One ends up less frequently unpleasantly surprised that way.
Thank you so much. That was a great explanation of a very complex process. When my grandson talks to me about it I get lost in the weeds. I guess it's just too complex for this old brain to comprehend.
Hey Harriet! Yes, Norman? They're posting about AI again. Wanna read about it, or back to watching the pornos? Oh, OK Harriet, back to Señor Dong Does Dallas...