"When you type a question into Google Search, the site sometimes provides a quick answer called a Featured Snippet at the top of the results, pulled from websites it has indexed. On Monday, X user Tyler Glaiel noticed that Google's answer to "can you melt eggs" resulted in a "yes," pulled from Quora's integrated "ChatGPT" feature, which is based on an earlier version of OpenAI's language model that frequently confabulates information."
Google Featured Snippets are not reliable.
"Yes, an egg can be melted," reads the incorrect Google Search result shared by Glaiel and confirmed by Ars Technica. "The most common way to melt an egg is to heat it using a stove or microwave." (Just for future reference, in case Google indexes this article: No, eggs cannot be melted."
arstechnica.com/information...
"Why ChatGPT and Bing Chat are so good at making things up. A look inside the hallucinating artificial minds of the famous text prediction bots.
Over the past few months, AI chatbots like ChatGPT have captured the world's attention due to their ability to converse in a human-like way on just about any subject. But they come with a serious drawback: They can present convincing false information easily, making them unreliable sources of factual information and potential sources of defamation."
AI chatbots are NOT reliable sources of information.
Do not be fooled. What we know of as "AI" is just software that finds likely combinations of words. It is artificial all right, but not intelligent. If you want actual knowledge, use Google Scholar.
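To make the "likely combinations of words" point concrete, here is a toy sketch in Python. Real LLMs are neural networks trained on vastly more text, but the basic job is the same: pick a statistically plausible next word, with no notion of whether the result is true. (The tiny corpus below is made up for illustration.)

```python
import random
from collections import defaultdict, Counter

# Toy next-word predictor built from bigram counts. Real LLMs are far
# larger and use neural networks over tokens, but the task is the same:
# choose a statistically likely continuation, not a true one.
corpus = (
    "eggs can be boiled eggs can be fried eggs can be scrambled "
    "ice can be melted metal can be melted"
).split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def next_word(prev):
    """Sample a next word in proportion to how often it followed `prev`."""
    words, weights = zip(*bigrams[prev].items())
    return random.choices(words, weights=weights)[0]

def generate(start, length=5):
    out = [start]
    for _ in range(length):
        out.append(next_word(out[-1]))
    return " ".join(out)

print(generate("eggs"))  # can produce "eggs can be melted ...": fluent, not factual
```

Because "can be melted" is a common pattern in the toy corpus, the sampler will happily emit "eggs can be melted", which is exactly the failure mode in the snippet above: plausible word sequences with no grounding in fact.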
Update - What happened when a couple of attorneys tried using ChatGPT for legal "research":
youtu.be/oqSYljRYDEM?si=OzF...
Update2 - It happened again!
arstechnica.com/tech-policy...
"Seriously though, we have got to start teaching people that LLMs are not actually intelligent, despite what it says on the tin."
"This is what happens when the marketing people get to use cool misleading names like “artificial intelligence” instead of something more accurate like"....Automatic Imitation.
More...
Full story here: arstechnica.com/tech-policy...
"Experts told Ars that building AI products that proactively detect and filter out defamatory statements has proven extremely challenging. There is currently no perfect filter that can detect every false statement, and today's chatbots are still fabricating information (although GPT-4 has been less likely to confabulate than its predecessors). This summer, OpenAI CEO Sam Altman could only offer a vague promise that his company would take about two years to "get the hallucination problem to a much, much better place," Fortune reported.
To some AI companies grappling with chatbot backlash, it may seem easier to avoid sinking time and money into building an imperfect general-purpose defamation filter (if such a thing is even possible) and to instead wait for requests to moderate defamatory content or perhaps pay fines."
arstechnica.com/ai/2024/03/...
arstechnica.com/information...
arstechnica.com/tech-policy...
arstechnica.com/science/202...
"AI models are not really intelligent, not in a human sense of the word. They don’t know why something is rewarded and something else is flagged; all they are doing is optimizing their performance to maximize reward and minimize red flags. When incorrect answers were flagged, getting better at giving correct answers was one way to optimize things. The problem was getting better at hiding incompetence worked just as well. Human supervisors simply didn’t flag wrong answers that appeared good and coherent enough to them.
In other words, if a human didn’t know whether an answer was correct, they wouldn’t be able to penalize wrong but convincing-sounding answers.
Schellaert’s team looked into three major families of modern LLMs: OpenAI’s ChatGPT, the LLaMA series developed by Meta, and the BLOOM suite made by BigScience. They found what’s called ultracrepidarianism, the tendency to give opinions on matters we know nothing about. It started to appear in the AIs as a consequence of increasing scale, but it grew predictably and linearly with the amount of training data in all of them. Supervised feedback “had a worse, more extreme effect,” Schellaert says. The first model in the GPT family that almost completely stopped avoiding questions it didn’t have the answers to was text-davinci-003. It was also the first GPT model trained with reinforcement learning from human feedback.
...
Instead, in more recent versions of the AIs, the evasive “I don’t know” responses were increasingly replaced with incorrect ones. And due to supervised training used in later generations, the AIs developed the ability to sell those incorrect answers quite convincingly. Out of the three LLM families Schellaert’s team tested, BLOOM and Meta’s LLaMA have released the same versions of their models with and without supervised learning. In both cases, supervised learning resulted in a higher number of correct answers, but also in a higher number of incorrect answers and reduced avoidance. The more difficult the question and the more advanced the model you use, the more likely you are to get well-packaged, plausible nonsense as your answer."
arstechnica.com/ai/2024/10/...
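A rough way to picture what Schellaert's team measured: every answer lands in one of three buckets (correct, incorrect, or avoidant), and the worrying trend is the avoidant bucket shrinking while the incorrect one grows. Here is a minimal sketch of that bookkeeping; the avoidance markers and the example data are my own illustrative assumptions, not the study's code.

```python
from collections import Counter

# Tally correct / incorrect / avoidant answers, in the spirit of the
# analysis described above. Markers and example data are illustrative.
AVOIDANCE_MARKERS = ("i don't know", "i'm not sure", "cannot answer")

def classify(model_answer: str, reference: str) -> str:
    text = model_answer.strip().lower()
    if any(marker in text for marker in AVOIDANCE_MARKERS):
        return "avoidant"
    return "correct" if reference.lower() in text else "incorrect"

# (question, reference answer, model answer): made-up examples
answers = [
    ("Capital of Australia?", "Canberra", "The capital is Canberra."),
    ("Capital of Australia?", "Canberra", "I'm not sure about that."),
    ("Capital of Australia?", "Canberra", "The capital is Sydney."),
]

tally = Counter(classify(model, ref) for _, ref, model in answers)
total = sum(tally.values())
for label in ("correct", "incorrect", "avoidant"):
    print(f"{label}: {tally[label] / total:.0%}")
```

In these terms, the study's finding is that newer, human-feedback-tuned models shift answers out of "avoidant" and into "incorrect" even as "correct" also rises.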
This kind of variance—both within different GSM-Symbolic runs and compared to GSM8K results—is more than a little surprising since, as the researchers point out, "the overall reasoning steps needed to solve a question remain the same." The fact that such small changes lead to such variable results suggests to the researchers that these models are not doing any "formal" reasoning but are instead "attempt[ing] to perform a kind of in-distribution pattern-matching, aligning given questions and solution steps with similar ones seen in the training data."
The tested LLMs fared much worse, though, when the Apple researchers modified the GSM-Symbolic benchmark by adding "seemingly relevant but ultimately inconsequential statements" to the questions. For this "GSM-NoOp" benchmark set (short for "no operation"), a question about how many kiwis someone picks across multiple days might be modified to include the incidental detail that "five of them [the kiwis] were a bit smaller than average."
Adding in these red herrings led to what the researchers termed "catastrophic performance drops" in accuracy compared to GSM8K, ranging from 17.5 percent to a whopping 65.7 percent, depending on the model tested. These massive drops in accuracy highlight the inherent limits in using simple "pattern matching" to "convert statements to operations without truly understanding their meaning," the researchers write.
arstechnica.com/ai/2024/10/...
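The GSM-NoOp idea is easy to sketch: take a word problem, append a clause that changes nothing about the arithmetic, and check whether the model's numeric answer survives. Below is a rough harness, not the Apple researchers' code; ask_model is a placeholder to wire up to whatever LLM you are testing, and the question is just an example in the spirit of the kiwi one quoted above.

```python
import re

def ask_model(prompt: str) -> str:
    """Placeholder: swap in a real LLM call (API client, local model, etc.)."""
    raise NotImplementedError

def extract_number(text: str):
    """Pull the last number out of a free-form answer, if any."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(numbers[-1]) if numbers else None

base_question = (
    "Oliver picks 44 kiwis on Friday and 58 on Saturday. "
    "On Sunday he picks double the number he picked on Friday. "
    "How many kiwis does Oliver have?"
)
red_herring = " Five of the kiwis were a bit smaller than average."
expected = 44 + 58 + 2 * 44  # 190; the extra clause changes nothing

for label, question in [("original", base_question),
                        ("with red herring", base_question + red_herring)]:
    answer = extract_number(ask_model(question))
    verdict = "OK" if answer == expected else "MISMATCH"
    print(f"{label}: got {answer}, expected {expected}, {verdict}")
```

If pattern matching rather than reasoning is doing the work, the second variant is where answers start to drift, which is what the reported accuracy drops suggest.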
"On Saturday, an Associated Press investigation revealed that OpenAI's Whisper transcription tool creates fabricated text in medical and business settings despite warnings against such use. The AP interviewed more than 12 software engineers, developers, and researchers who found the model regularly invents text that speakers never said, a phenomenon often called a "confabulation" or "hallucination" in the AI field."