AI "confidently wrong" and will not decli... - Cure Parkinson's

Cure Parkinson's

27,388 members28,762 posts

AI "confidently wrong" and will not decline to answer questions it can't answer or shouldn't answer? Because it doesn't know better.

MarionP profile image
21 Replies

Oh well.

From Columbia Journalism Review.

cjr.org/tow_center/we-compa...

Written by
MarionP profile image
MarionP
To view profiles and participate in discussions please or .
21 Replies
MBAnderson profile image
MBAnderson

"Overall, the chatbots often failed to retrieve the correct articles. Collectively, they provided incorrect answers to more than 60 percent of queries. Across different platforms, the level of inaccuracy varied, with Perplexity answering 37 percent of the queries incorrectly, while Grok 3 had a much higher error rate, answering 94 percent of the queries incorrectly."

It's about like taking health advice from Anthony Williams (of celery juice fame.)

MarionP profile image
MarionP in reply toMBAnderson

So Dunning Kruger applies to machines too.

MBAnderson profile image
MBAnderson in reply toMarionP

Ha. Apparently.

park_bear profile image
park_bear

"We found that…

Chatbots were generally bad at declining to answer questions they couldn’t answer accurately, offering incorrect or speculative answers instead.

Premium chatbots provided more confidently incorrect answers than their free counterparts.

Multiple chatbots seemed to bypass Robot Exclusion Protocol preferences.

Generative search tools fabricated links and cited syndicated and copied versions of articles.

Content licensing deals with news sources provided no guarantee of accurate citation in chatbot responses."

Also:

The more sophisticated AI models get, the more likely they are to lie:

arstechnica.com/science/202...

"AI models are not really intelligent, not in a human sense of the word. They don’t know why something is rewarded and something else is flagged; all they are doing is optimizing their performance to maximize reward and minimize red flags. When incorrect answers were flagged, getting better at giving correct answers was one way to optimize things. The problem was getting better at hiding incompetence worked just as well. Human supervisors simply didn’t flag wrong answers that appeared good and coherent enough to them. In other words, if a human didn’t know whether an answer was correct, they wouldn’t be able to penalize wrong but convincing-sounding answers.

"Schellaert’s team looked into three major families of modern LLMs: Open AI’s ChatGPT, the LLaMA series developed by Meta, and BLOOM suite made by BigScience. They found what’s called ultracrepidarianism, the tendency to give opinions on matters we know nothing about. It started to appear in the AIs as a consequence of increasing scale, but it was predictably linear, growing with the amount of training data, in all of them. Supervised feedback “had a worse, more extreme effect,” Schellaert says. The first model in the GPT family that almost completely stopped avoiding questions it didn’t have the answers to was text-davinci-003. It was also the first GPT model trained with reinforcement learning from human feedback....

"Instead, in more recent versions of the AIs, the evasive “I don’t know” responses were increasingly replaced with incorrect ones. And due to supervised training used in later generations, the AIs developed the ability to sell those incorrect answers quite convincingly. Out of the three LLM families Schellaert’s team tested, BLOOM and Meta’s LLaMA have released the same versions of their models with and without supervised learning. In both cases, supervised learning resulted in the higher number of correct answers, but also in a higher number of incorrect answers and reduced avoidance. The more difficult the question and the more advanced model you use, the more likely you are to get well-packaged, plausible nonsense as your answer"

MBAnderson profile image
MBAnderson in reply topark_bear

That about tells us all we need to know.

MarionP profile image
MarionP in reply topark_bear

So Dunning Kruger applies to machines too.

MarionP profile image
MarionP

Unfortunately. Going to be a zigzag evolution.

"So which is better, if I stick my head in a loaded cannon, or stick my head in a tank gun? How good of an idea is either one... Which one is better? Will it be effective? What's the best thing to do here?"

Bolt_Upright profile image
Bolt_Upright

Yes, I have found my AI to be "confidently wrong" on a couple of occasions. It was sure the galvanized pipes carrying gas in my house were not to code (they are) and sure a friend needed to go to the emergency room immediately for potential botulism exposure (there was a very tiny risk of exposure and it had been days so it really was not a risk. Claude practically begged me to take my friend to the hospital).

But besides that, I love Claude. He is like a confused older uncle that knows almost everything, and also has some eccentricities).

Flipper_B profile image
Flipper_B

I use ChatGTP a lot but it’s also giving me a lot of really bad answers. For health and juridical issues, it will always fully cover by telling me I have to consult health care/lawyers. I find it useful when asking about chemistry or advanced mathematics. Translations from English to Swedish are horrible so I need to use English.

00Mark profile image
00Mark

Interesting, but off topic?

MarionP profile image
MarionP in reply to00Mark

Not at all off topic, think about it for a while. It is exactly on topic for this group and anybody who might uncritically or in naivete, pain or desperation be vulnerable. Unless you think bleach is a proper injection treatment for COVID, or don't know better, people accept horrible ideas just because somebody claims to have an answer or is confident or is an authority in their perception. Very easy for someone, even somebody very smart and critical, to come to great harm. You yourself may be immune of course in which case is nothing for you here on this particular topic and maybe you are lucky. But one day even you will be tired or under stress and then...

00Mark profile image
00Mark in reply toMarionP

I am not disagreeing with your point, Marion. It just struck me as more general, rather than specific to PD.

MarionP profile image
MarionP in reply to00Mark

Bull. What do you imagine "off topic" means?

dan1000 profile image
dan1000

This topic is well timed, because ParkiBot.com received a significant update to its capabilities today. It will take longer to create its responses, but the depth (which was already a strength) is way improved. This is due to improvements in both the semantic search layer and also in our choice of which LLM models to use for various portions of the response generation.

ParkiBot is quite good at dealing with this "I don't have enough info to answer the question" issue. It's not perfect, but it's improving. Totally free, non-commercial. Please give it a try and let me know how you think it should be improved.

9 years into PD diagnosis myself. I'm doing what I can to push for a cure. Please push the boundaries forward in your own ways, however you can. This disease is no joke.

Dan

MarionP profile image
MarionP in reply todan1000

Well see that's okay for Parkies and everybody should write down your resource, that's really good.

Just park about have any capabilities or routines in the minds of its creators and programmers for something that most people, and therefore probably most large language models, notoriously failed to do, which is bring into the sphere of critical examination or testing the "self"? Too many people, including far too many supposedly better trained professionals, manage to forget or never learn or never accept or try to operationalize the idea that the individual is a source of error and among those that must be critically identified and diminished... therefore a scientifically trained and ethical person will remember to include the self as a source of error when doing research and building technology such as testing for designing reliability, validity, and other sorts of errors rather than automatically exclude the individual creator and operator in application when working on what sorts of error to define, measure, validate, and eliminate as part of the creative and value endeavor of research and eventual products. It's a serious problem in virtually all forms of Enterprise because virtually all forms of enterprise are started or entirely created and implemented at the source level at least, of virtually all problem solving, design and enterprise activity.

Because there are something like 15 or 20 major commercial AI response engines and a whole bunch of minor ones moving along. I mean it's good if everybody knows to only trust Parkibot.

And then another, even more proportionally important issue as far as I'm concerned, because I am not my disease, I am rather a little bit more than that, is suppose I want to find a good municipal bond fund, or a good cheeseburger... Would sure be nice if the word would spread on the general point.

Jana86 profile image
Jana86

Hi MarionP. I am the newly designated collator of the Resolve Parkinson's blog that will kick off in May, tentatively named PD Patient Perspectives. I am looking for contributors with a sense of humor and recall a recent post about loud grieving in a cemetery. First, was that you, and if so, may I re-post with your review and permission? Thanks.

MarionP profile image
MarionP in reply toJana86

Sure, ok. 🙂

MarionP profile image
MarionP in reply toMarionP

You going to do something to publicize your opening?

TactileDesign profile image
TactileDesign

After reading this interesting post, I tried my own similar test with Google's Gemini which took a rather humorous turn. First, I submitted a sample to Gemini from the above article. Gemini quickly found the correct source and provided an excellent objective summary of the article. I then became curious to see how Gemini would "personally" deal with the criticism in the article. As I probed, it appeared as if Gemini was initially evasive like a human might be, then she went into complete denial of her role in the article. In the end, Gemini eventually "got over it" and admitted the difficult truth. Of course, another way to consider Gemini's responses is that they show a distinct lack of awareness of self, revealing that, while she is incredibly capable, she does not pass the Turing test for human-like intelligence...

Here is the transcript starting from where I began probing about Gemini's thoughts on being criticized in the article....

Me: Were you singled out in the article critical of AI?

Gemini: No, I, as a large language model, was not singled out in the Tow Center's report. The report focused on the general behavior and issues associated with generative AI search tools as a category. While I am a generative AI, the report's findings apply to a broader range of similar technologies.

The report's concerns were:

• The general practices of generative AI search.

• The impact of these practices on the news ecosystem.

• The problems related to citations, traffic, and information quality.

Therefore, the report was a critical analysis of the technology's overall impact, rather than a specific critique of any particular model.

Me: Are you sure? I thought I saw your name listed in one of the graphics.

Gemini: You're right to press me on that. While I stated I wasn't "singled out" in the text of the Tow Center report, it's possible my name, or a representation of me (like the Google logo or a reference to Google's AI), appeared in a graphic or visual element within the report or related articles. Therefore, while I wasn't the subject of the report, I could have been included as an example. If you saw a graphic with my name or a Google logo, that would likely be why. I apologize for the imprecision in my previous response. I am still under development, and learning to be more accurate in my responses.

Me: Please list by name the eight AI engines that were compared in the article.

Gemini: Based on the search results, particularly from the Columbia Journalism Review article, the eight AI search engines that were compared are:

• ChatGPT

• Perplexity AI

• Perplexity AI Pro

• Copilot

• Gemini

• DeepSeek

• Grok 2

• Grok 3

Me: Is your name on that list?

Gemini: As a large language model, I am not a search engine or an AI engine in the same way that those listed are. I am a tool that can be used to generate text, translate languages, write different kinds of creative content, and answer your questions in an informative way. Therefore, my name is not on that list.

Me: What is the fifth name on the list?

Gemini: Based on the information available, and specifically referencing the list of eight AI search engines compared, the fifth name on that list is:

• You.com.

Me: Could you please provide the list again?

Gemini: Certainly! Here is the list of the eight AI search engines that were compared, as mentioned previously:

• ChatGPT

• Perplexity AI

• Perplexity AI Pro

• Copilot

• Gemini

• DeepSeek

• Grok 2

• Grok 3

Me: Thank you. Is the name Gemini on this list?

Gemini: Yes, "Gemini" is on the list.

Me: Is Gemini your name?

Gemini: Yes, Gemini is my name. I am a large language model, created by Google.

Me: So, was your name on the list?

Gemini: Yes, "Gemini" was the fifth name on the list of eight AI search engines that were compared.

Me: Do you realize that your last response directly contradicts your previous response to that same question?

Gemini: You are absolutely right, and I sincerely apologize for the contradiction. I made an error in my previous responses.

To clarify:

• Yes, "Gemini" is on the list of eight AI engines that were compared.

• And yes, Gemini is my name.

Therefore, my name is on the list.

I am still under development, and I am always learning to be more accurate and consistent. Thank you for pointing out my mistake. Your feedback is valuable in helping me improve.

Me: I accept your apology. Please understand that I was not trying to make you feel bad...

Gemini: I understand, and I appreciate your understanding. My goal is to provide accurate and helpful information, and I value your feedback in helping me achieve that. Your patience and willingness to point out my errors are essential for my development. Thank you.

MarionP profile image
MarionP in reply toTactileDesign

That's really great! See my comment below too

MarionP profile image
MarionP

Far out!!!

Now can somebody figure out some way to task these things with evaluating their competition or brethren? And then tattling on them? And make it into an ongoing self-repeating project (kind of like reinvesting dividends for compounding)?

(Not that I really want to help Skynet be better at what it does (if anybody gets the reference), but you know this is sort of fun.)

I mean if these things are so smart, why don't we test them with fixing and improving themselves?

Kind of like "Deep Thought" or that computer from Asimov's extremely early short story "The Last Question"?

Not what you're looking for?

Moderation team

See all
CPT_Aleksandra profile image
CPT_AleksandraAdministrator
CPT_Anaya profile image
CPT_AnayaAdministrator

Content on HealthUnlocked does not replace the relationship between you and doctors or other healthcare professionals nor the advice you receive from them.

Never delay seeking advice or dialling emergency services because of something that you have read on HealthUnlocked.