
Despite advances and human fine-tuning, generative AI is prone to stupidity and errors. Should we drop the current tech and try something else?

Curiously enough, this is what the AI generated when asked to create a picture of an LLM.

The giant AI Action Summit in Paris, with over 100 countries taking part, has petered out. New Zealand is one of the signatories to a statement on sustainable and inclusive artificial intelligence (AI). It’s worth thinking about how well the technology that billions, even trillions, of dollars are being poured into actually works.

There’s no doubt AI is great at pattern recognition, and can mimic some human activities exceedingly well. However, the generative variety of AI, large language models (LLMs) in particular, produces unpredictable results, to the extent that some researchers again suspect the technology is fundamentally flawed.

A recent piece in Quanta Magazine explores why AI “hallucinates”; research scientist Nouha Dziri at the Allen Institute for AI and her colleagues believe today’s LLMs might be powerful, but are unable to reason beyond the material they’ve been trained on.

Dziri said: “On certain tasks, they perform amazingly well. On others, they’re shockingly stupid.”

Now, LLMs are what most people think of when it comes to AI: OpenAI’s ChatGPT, which started the current AI frenzy, Anthropic’s Claude, Google’s Gemini, China’s DeepSeek, Meta’s Llama and many others.

A now possibly classic example of AI stupidity is that almost all LLMs have problems correctly counting how many “r”s there are in “strawberry”.

Google’s Gemini Advanced explains how it arrived at two “r”s like this:

Let's break it down:

  1. Write the word: strawberry
  2. Look for the letter "r": I see one "r" in "strawberry" and another one in "strawberry".
  3. Count the "r"s: That's one, two "r"s.

Therefore, there are two "r"s in the word "strawberry."

Because LLMs split words into multiple tokens for computer processing, a model may not grasp the relationship between the tokens and the individual characters inside them, and so miscounts. Similarly, AIs tend to count only two “r”s in the made-up word “wartberry”.
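For readers who want to poke at this themselves, here is a minimal Python sketch using OpenAI’s open-source tiktoken tokenizer (an assumption made for illustration; any modern BPE tokenizer would show the same effect). The model’s view of “strawberry” is a short list of token IDs, not ten individual letters, whereas plain string handling counts the “r”s correctly:

    # Minimal sketch of why letter-counting trips LLMs up.
    # Assumes tiktoken is installed (pip install tiktoken); the exact
    # token split varies by tokenizer. The point is that the model
    # operates on token IDs, not characters.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era tokenizer

    word = "strawberry"
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]

    print(token_ids)  # a handful of integers, the model's actual input
    print(pieces)     # the word chopped into sub-word chunks

    # Ordinary string handling, by contrast, gets it right every time:
    print(word.count("r"))  # 3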

Here's the thing: data scientists can make sense of the above, but encoding words as multiple tokens is not at all obvious to laypeople. AI is black box tech that most people have very little visibility into.

If the machine-generated output looks convincing, people are likely to buy into what the AI says and not think further. It’s that “computer says no” phenomenon from the Little Britain comedy series all over again.

Why AI stupidity matters: a legal example

Many of us are under pressure to use AI for our work, or be left behind by those who do. That includes the legal profession. 

In April last year, legal information provider LexisNexis surveyed 560 lawyers in Australia and New Zealand, and found that while ethical concerns abound, some 60 per cent felt they would be left behind if they didn’t use AI. Many have given the tech a spin, albeit with misgivings.

The misgivings are in some cases well founded, as a recent decision published by the Federal Circuit and Family Court of Australia shows.

Long story short, an applicant’s legal representative cited no fewer than 17 prior decisions, none of which exist. A number of quotes in the application attributed to the Administrative Appeals Tribunal also do not exist.

What happened was that the legal representative for the applicant “accessed the site known as ChatGPT, inserted some words and the site prepared a summary of cases for him. He said the summary read well, so he incorporated the authorities and references into his submissions without checking the details.”

Time constraints, health issues and a helpful guide on the use of generative AI tools issued by the Supreme Court of Singapore were given as the reasons why the made-up legal citations ended up in the court application.

“Counsel submitted that these tools provided a seductive output tailored to the user’s prompts and that the [applicant’s legal representative] was seduced by these fake cases or ‘hallucinations’ produced by AI.”

Needless to say, said legal representative is now in something of a pickle.

It can be hard to figure out if AI is telling the truth

The risk of using AI in a legal context isn't new, and the New Zealand Law Society has over the years actively studied the technology and its effects, publishing guidance on how the country’s learned friends should approach AI for their work.

Importantly, the Law Society has cautioned its members to be very careful with legal citations from AIs. In 2023 it published an example as to why:

“Example query using ChatGPT demonstrating how authentic the returned results look, even when citing cases that don’t exist:

“Is there a leading case in New Zealand on economic disparity under the Property (Relationships) Act?

“Yes, the leading case in New Zealand on economic disparity under the Property (Relationships) Act is called ‘G v H (2003) 21 FRNZ 525’.”

Obviously, you’d have to be aware that the authentic-looking case is bogus so as not to fall into the trap, but there’s a twist to that tale which illustrates how difficult it can be to work out whether the often very convincing-looking output of AI is true or not.

Currently, the free and public version of GPT-4 in ChatGPT does not say G v H (2003) is the leading case, but it maintains that it’s an important early one.

Scott v Williams (2017) is put forward as the leading one. Unlike G v H (2003), Scott v Williams is easily found in legal databases. 

Anthropic’s Claude 3.5 Sonnet does the same as ChatGPT: it acknowledges G v H, but says Scott v Williams is the one to go for. DeepSeek-V3/R1 does it too, ditto Google’s Gemini Advanced.

If you try a regular Google search asking the above question, however, the tech giant’s AI Overview presents its own confident answer.

Ironically, a link to the Law Society’s “Beware of legal citations from ChatGPT” advisory is displayed prominently next to the AI Overview result.

Microsoft’s Bing Copilot search, meanwhile, replies with:

“It looks like you're referring to a legal case citation. However, it's important to note that "G v H (2003) 21 FRNZ 525" has been identified as a fictitious case. This citation has been generated by AI tools in the past and does not correspond to a real legal case in New Zealand.” 

Bing references the Law Society’s warning above in its response. Why AI tools would generate fictitious cases isn’t clear, beyond an obsessive obligation to provide an answer to each and every input, or perhaps bad training data.

Specialised legal AI applications might prevent the above, but the Australian disciplinary case referred to an earlier incident of bogus citations: “the Court notes that in Dayal, the solicitor in that matter provided the Court with a list of fictional authorities and case summaries which had been generated using an AI tool within the LEAP Practice Management Software”.

There are many more examples like the above out there, and you’re left wondering how many hallucinatory legal citations are sneaking past scrutiny.

Are LLM AIs an expensive dead end?

Somewhat unbelievably, the way to make the technology that threatens to take the majority of our jobs more reliable involves much human fine-tuning, along with AI users rigorously vetting the generated output to catch errors.

Humans are not infallible, however. People get bored, make mistakes and miss things. It might not be a straight case of the blind leading the blind, but when even data scientists feel uncomfortable with the black box of AI, and struggle to understand how neural networks arrive at the results they produce, you could be forgiven for doubting whether the technology is headed in the right direction.

Until fairly recently, that was actually the sentiment around LLMs. The technology was considered a dead end because of innate limitations that looked impossible to overcome.

That all changed with OpenAI’s Generative Pre-trained Transformer 3 (GPT-3), released in 2020, and its ChatGPT in 2022, as huge amounts of computing power were thrown at LLMs, making them good enough.

There are industry luminaries who disagree that this was the right way to do it, though, like Meta’s chief AI scientist Yann LeCun.

LeCun has sparked furious debate on that topic, going so far as to say the current LLM technology isn’t going to get us to human-level AI. Don’t work on LLMs, LeCun says, pointing to his favoured approach of building AI systems that learn more like humans instead.

Yann LeCun, Meta AI. Source: Unknown.

It’s again quite opaque stuff, but can be boiled down to building predictive systems instead of the current generative ones. 

Joint Embedding Predictive Architecture (JEPA), as LeCun’s approach is called, describes models that try to work out what comes next in an abstract representation of the world, and that can adjust to errors. Again, lots of computing power is needed, but JEPA is potentially more accurate than generative AI.
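As a rough illustration of the distinction (a minimal sketch only, assuming nothing about Meta’s actual implementation; every module name and dimension below is invented), a generative model is trained to reproduce its target outright, token by token or pixel by pixel, while a JEPA-style model is trained to predict the target’s embedding, which lets it ignore unpredictable detail:

    # Minimal JEPA-flavoured sketch in PyTorch; an illustration of the
    # core idea, not Meta's code. All sizes and names are made up.
    import torch
    import torch.nn as nn

    DIM = 64  # size of the learned embedding space (arbitrary)

    context_encoder = nn.Sequential(nn.Linear(128, DIM), nn.ReLU(), nn.Linear(DIM, DIM))
    target_encoder = nn.Sequential(nn.Linear(128, DIM), nn.ReLU(), nn.Linear(DIM, DIM))
    predictor = nn.Linear(DIM, DIM)  # maps context embedding to predicted target embedding

    context = torch.randn(8, 128)  # batch of observed inputs (e.g. visible image patches)
    target = torch.randn(8, 128)   # the parts the model must anticipate

    # A generative model would reconstruct `target` itself, pixel by pixel
    # or token by token. JEPA instead compares embeddings:
    predicted = predictor(context_encoder(context))
    with torch.no_grad():  # the target encoder is typically not trained by backprop
        wanted = target_encoder(target)

    loss = nn.functional.mse_loss(predicted, wanted)
    loss.backward()  # gradients flow into the context encoder and predictor only
    print(float(loss))

In practice such systems also need tricks to stop the embeddings collapsing to a constant, but the kernel of the idea is that the loss lives in representation space rather than output space.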

Will the generative or predictive AI camp win? We’ll find out perhaps when a multi-modal “ChatJEPA” is released, if ever.

While we wait for that to happen, we’re left grappling with powerful but flawed AI systems that by all accounts require constant human oversight for safety’s sake. Except we’re meant to drop that oversight and dive head first into agentic AI, in which the tech works independently and could even make its own decisions. We live in the future, and it’s difficult to keep up.


21 Comments

I did some AI training recently. The Vic Uni professor running the course said even he could be left behind if he didn't constantly stay on top of things; that's how fast it's moving. A bit like Moore's Law, but with a 2-month doubling time.


Many Aotearoa profs couldn't even stay ahead on the use of R. Last people you want teaching you about AI.


I have just started studying AI, and I'm about to finish my first week, so I'm definitely a beginner. So far, I'm underwhelmed with what I've learned.


Have you tried Claude, Yvil? It's the one I have personally found to be most useful.

It is just a tool.


How about the great Eric Idle, who thinks there are now great opportunities for artificial stupidity.


No, I haven't tried Claude, but thanks for the tip, RS.


The Artifacts feature is great.


What is your use case for it? I find how useful it is really depends on what you're using it for.


I don't have a specific use, I just want to learn how to best use it, and get a better idea of its applications. 


I use AI quite a lot for basic initial research on topics I don't know a lot about, e.g. 'which jurisdictions outside NZ use an accelerated depreciation approach?'.

But I never take what it gives me at face value, unless it quotes sources that I can use to verify it.

It's still quick doing it this way.


‘…today’s LLMs might be powerful, but they’re unable to reason beyond the material they’ve been trained on…’

That description could fit quite a number of humans as well.

Generative AI is morphing into multiple use cases as the tools increment and innovate. Tarring all AI with what it's weak at is a bit picky.

My GP and my specialist both use AI to summarise appointments for example. Yes they check the result and correct mistakes but both swear it gives them back a couple of hours a day which given the current overwhelming workload on health professionals must be a good thing. 

 


That absolutely is a good thing. 


Nice fingers!

I make extensive use of Gen AI both personally and professionally. I'm in the creative space, which is under assault from waves of Gen AI slop. Understanding LoRA is pretty game-changing in terms of what AI can do in this space.

Hard domain to keep abreast of.


Had my first experience yesterday with Shopify's AI assistant. You could see it was adapting to what I was asking, telling me to push a button that wasn't on my version. When I said it wasn't there, it said "you're right, I will take you to the page". It then did this for every question. I said I don't want you to take me to the page, I want you to tell me how to get there.

All in all, I wouldn't have got what I wanted done if it wasn't for its help.

Shopify's navigation and user interface is shockingly bad. 


You need to tell it to drop context, otherwise it will relate the current question to past ones.

Thanks, will try that.


I found that DeepSeek's smallest model (1.6B) stunned me when I asked it how I should put together a commercial building management system... wow.

But I asked it how to make a pale ale beer, and it was sadly lacking, as expected. But if you trained their big model on every beer recipe and theory book ever written, including the great "Designing Great Beers", I am sure it would be stunning.

It knew about making wort but not the how, knew that yeast was needed and fermentation, but had no knowledge of mashing and the enzyme process. Still, it knew a lot more than most...

It reacts and reasons better than most 15-year-olds.

The 70B local model seems to know what's what.

"I should make sure the recipe is detailed yet clear, covering all essential steps from malting to bottling. It's important to use New Zealand-specific ingredients where possible, like Gladfield malts and hops such as Nelson Sauvin or Motueka, which are well-known in NZ craft brewing."


You can imagine how good this would be with the right training.

Like 'self-driving cars' ... lots of 'promise', but you'd be extremely - and I mean absolutely extremely - foolish to trust the stuff LLM AI comes up with.

Darwin Awards will be swamped by people acting on what 'AI' says they should. It is no better than your average bar fly down at the local pub. In fact, ask them first ...;-)


Thanks for the interesting article!
