
ChatGPT Accuracy Rate (Latest Data)

by Josh Howarth
July 17, 2024

This is a comprehensive list of statistics on the accuracy rate of ChatGPT.

OpenAI’s ChatGPT is getting more and more capable. However, it still displays the same disclaimer: “ChatGPT can make mistakes. Check important info.”

But how accurate is ChatGPT exactly? That question is harder to answer than you might think. ChatGPT’s accuracy can vary based on several factors, and it can even become more or less accurate over time.

In this article, we’ll dive into the nuances of ChatGPT’s accuracy.

Key ChatGPT Accuracy Rate Statistics

  • ChatGPT is accurate around 88% of the time.
  • The same model of ChatGPT can become more or less accurate over time.
  • ChatGPT is most accurate in English.
  • ChatGPT-4o is the most accurate OpenAI model released to date.

ChatGPT Accuracy Rate

According to the latest Massive Multitask Language Understanding research, ChatGPT has an 88.7% accuracy rate.


Specifically, ChatGPT-4o was recently tested using the Massive Multitask Language Understanding (MMLU) benchmark and compared against other popular large language models.

The MMLU test was introduced in 2020 as a way to quantify large language model intelligence. It involves thousands of multiple-choice questions across 57 subjects, from math to history to computer science and beyond.
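To make the scoring concrete, here is a minimal sketch of how an MMLU-style evaluation might be run. The `ask_model` function and the dataset format are illustrative assumptions, not part of the official benchmark code.

```python
# Minimal sketch of MMLU-style scoring. Each question has four answer
# options (A-D), and accuracy is simply the fraction answered correctly.
# `ask_model` is a hypothetical stand-in for a real LLM API call.

def ask_model(question: str, options: dict[str, str]) -> str:
    """Hypothetical: returns the model's chosen option letter, e.g. 'B'."""
    raise NotImplementedError

def mmlu_accuracy(questions: list[dict]) -> float:
    correct = 0
    for q in questions:
        if ask_model(q["question"], q["options"]) == q["answer"]:
            correct += 1
    return correct / len(questions)

# With four options per question, random guessing lands near 25% accuracy,
# so a 25% score on a subject is no better than chance.
```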

The original paper that introduced the MMLU test extensively tested GPT-3, an early OpenAI model that predates ChatGPT. The model performed well in some subjects, achieving an accuracy rate of 60% or higher in US foreign policy and high school psychology.

However, in subjects like college chemistry and moral scenarios, GPT-3’s accuracy rate was only around 25%. Since each question in the test has four possible answers, that is roughly the score random guessing would produce.

However, more recent models of ChatGPT — and other AI tools — perform much better. As of July 2024, the highest-scoring model is Google’s Gemini Ultra (90% accuracy), followed closely by OpenAI’s ChatGPT-4o (88.7% accuracy).


Factors Affecting ChatGPT’s Accuracy

ChatGPT’s accuracy varies based on several factors.

Some of these are actually controlled by the user. Vague prompts, for example, are less likely to produce an accurate answer than specific, well-designed prompts. Sticking to topics included in ChatGPT’s training data also usually produces more accurate responses. That includes avoiding information published after the training cutoff date of the ChatGPT model you are using.

However, some of these factors are out of your control. For example, you might think that more recent ChatGPT models are more accurate. That’s largely true, but testing has revealed some occasions where older models are actually more accurate.

Additionally, ChatGPT generally performs better in English. The more obscure a language is — in technical terms, the fewer resources it has for training data — the worse ChatGPT will perform.

Finally, and perhaps most concerning, studies have shown that the accuracy rate of specific ChatGPT models can change over time.

It’s important to remember that ChatGPT often fails to warn the user when it is unsure or incorrect about something. Instead, it may hallucinate.

ChatGPT’s Accuracy Over Time

ChatGPT’s accuracy can degrade over time (arXiv)

A study from 2023 found that ChatGPT’s accuracy can markedly decrease over time. Researchers put GPT-4 through the same test in March and then June. The model's ability to accurately identify prime numbers cratered from 84% accuracy in March to 51% in June. Both GPT-4 and GPT-3.5 were also less accurate at producing code in June than in March.

These changes in accuracy over time are referred to as “drift.” Drift could be the knock-on effect of changes made to improve the model in other areas. Even small, targeted changes to parts of the model could have unintended effects on the model’s overall performance.

But its accuracy can also improve (arXiv)

The same study found that ChatGPT’s accuracy can improve rather than decrease. GPT-3.5, for example, was actually far more accurate at identifying prime numbers in June than in March. GPT-4 also performed better at a different task in June.

The researchers in this study noted that these changes indicate the need for “continuous monitoring of LLMs” like ChatGPT.
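The paper doesn’t prescribe tooling for this, but a minimal monitoring loop might look like the sketch below: re-run a frozen test set on a schedule and flag large accuracy swings. The `evaluate` function, log format, and threshold are all illustrative assumptions.

```python
# Sketch of the "continuous monitoring" idea: periodically re-score a model
# on a frozen benchmark and alert on large accuracy changes (drift).

import json
from datetime import date

DRIFT_THRESHOLD = 0.05  # flag swings larger than 5 percentage points

def evaluate(model_name: str) -> float:
    """Hypothetical: returns accuracy (0 to 1) on a frozen test set."""
    raise NotImplementedError

def check_drift(model_name: str, log_path: str = "accuracy_log.json") -> None:
    try:
        with open(log_path) as f:
            history = json.load(f)
    except FileNotFoundError:
        history = []

    score = evaluate(model_name)
    if history and abs(score - history[-1]["accuracy"]) > DRIFT_THRESHOLD:
        print(f"Drift alert: {history[-1]['accuracy']:.1%} -> {score:.1%}")

    history.append({"date": date.today().isoformat(), "accuracy": score})
    with open(log_path, "w") as f:
        json.dump(history, f)
```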


ChatGPT-4o is more accurate than GPT-4 and GPT-3.5 at answering obscure questions (Android Authority)

In May 2024, a writer for Android Authority ran different versions of ChatGPT through a variety of tests. He found that GPT-4o was more accurate than GPT-4, which was itself more accurate than GPT-3.5. This was especially true for more obscure questions, perhaps because GPT-4o is able to search the internet.

For example, GPT-4 was able to accurately answer a question about a travel pass in Japan. GPT-3.5, however, hallucinated.

ChatGPT is becoming more accurate at difficult exams (medRxiv)

According to a study published in December 2022, ChatGPT has become successively better at tackling the United States Medical Licensing Examination (USMLE). The USMLE is a difficult three-exam program that must be passed to become a licensed doctor of medicine in the US.

Earlier GPT models achieved just 36.7% accuracy on the USMLE. GPT-3 reached 46% accuracy, a figure that rose to 50% with some additional training. The study itself found that ChatGPT achieved more than 60% — the usual passing grade — on most occasions.

GPT-4o is about 3% less accurate than GPT-4 Turbo at reading comprehension (OpenAI)

Interestingly, OpenAI’s own testing found that GPT-4o was slightly less accurate than its immediate predecessor at reading comprehension. The DROP benchmark, scored with F1, involves answering complex questions that require discrete reasoning over paragraphs of text. While GPT-4 Turbo scored 86 points, GPT-4o scored 83.4. That also makes GPT-4o less accurate at reading comprehension than Llama3 400b, though only by 0.1 points.
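For context, F1 is the harmonic mean of precision and recall; DROP computes it over the tokens of the predicted and reference answers. A simplified version, ignoring DROP’s exact normalization rules, might look like this:

```python
# Simplified token-level F1, the metric family DROP reports.
# Precision: what fraction of predicted tokens are correct.
# Recall: what fraction of reference tokens were produced.

def f1(predicted: list[str], reference: list[str]) -> float:
    common = set(predicted) & set(reference)
    if not common:
        return 0.0
    precision = len(common) / len(predicted)
    recall = len(common) / len(reference)
    return 2 * precision * recall / (precision + recall)

# Example: prediction shares 2 of 3 tokens with the reference answer.
print(f1(["twelve", "points", "total"], ["twelve", "points", "scored"]))
# -> 0.666...
```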

ChatGPT-4 hallucinates less than GPT-3.5 (PubMed)

One study, published in May 2024, tested several AI tools on their ability to support systematic reviews — in other words, to generate references to scientific literature. ChatGPT-3.5’s hallucination rate was 39.6%, noticeably higher than GPT-4’s 28.6%.

ChatGPT’s Precision

Precision is often conflated with accuracy, but the two are not the same.

A good way to distinguish between the two is to imagine an archer firing arrows at a target. If the archer hits the same spot every time, they are very precise. If that spot is the center of the target, the archer is both precise and accurate. But if that spot is far from the center, the archer is precise but inaccurate.

With AI tools, high precision means a lower likelihood of generating false positives. For example, an imprecise chatbot might answer a query incorrectly but mark that query as successfully resolved. This would be a false positive.
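To see how the two metrics diverge in the chatbot example above, here is a small numeric sketch; the counts are invented purely for illustration.

```python
# Toy confusion-matrix counts for a support chatbot (illustrative only).
tp = 80   # queries correctly marked as resolved
fp = 15   # queries marked resolved that were actually answered incorrectly
tn = 40   # queries correctly marked as unresolved
fn = 5    # resolved queries the bot failed to mark as resolved

accuracy = (tp + tn) / (tp + fp + tn + fn)   # overall correctness
precision = tp / (tp + fp)                   # trustworthiness of a "resolved" flag

print(f"accuracy:  {accuracy:.1%}")   # 85.7%
print(f"precision: {precision:.1%}")  # 84.2%
```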

GPT-4o has a precision of 86.21% (Vellum)

This makes GPT-4o the most precise AI model available, as of July 2024.


GPT-4 and GPT-3.5 are more precise than Bard (PubMed)

The three AI models were tasked with generating scientific citations for systematic reviews. ChatGPT-3.5 had a precision rate of 9.4%, GPT-4 had 13.4%, while Bard scored 0%.

ChatGPT’s Accuracy in Medical Topics

ChatGPT-3.5 is 84.8% accurate at neurolocalization (Cureus)

In 2023, ChatGPT-3.5 was tested on various questions relating to neurolocalization, or the diagnosis of conditions affecting the nervous system. A team of seven neurosurgeons evaluated the model’s responses, concluding that it had generated “completely correct” or “mostly correct” answers 84.8% of the time.

ChatGPT-4 is more accurate than the average human in medical exams (OpenAI)

OpenAI conducted extensive testing on GPT-4 before its release. The model undertook various tests, including the Medical Knowledge Self-Assessment Program Exam. Its performance varied, though it was often better than human candidates. For example, it scored 64% on the Specialty Certificate Examination Neurology Web Question Bank. In comparison, the average score of the human candidates who took that exam was 60.2%.


ChatGPT achieved a median accuracy score of 5.5 out of 6 when answering medical questions (JAMA Network)

In October 2023, a group of researchers tested the performance of GPT-3.5 and GPT-4 on a collection of 284 medical questions. The questions were generated by a group of 33 physicians. Answers were scored on a scale of 1 to 6, where 6 is completely correct.

ChatGPT achieved a median score of 5.5 across all questions, and a mean score of 4.8. On easy questions, it achieved a median score of 6.0, while hard questions resulted in a median score of 5.0.

ChatGPT-3.5 was 86.6% accurate when diagnosing common urological conditions, better than Google (MDPI)

This study, published in May 2024, compared ChatGPT-3.5 with Google Search for diagnosing urological conditions. Google Search had an accuracy of just 53.3% when tackling commonly encountered conditions, while ChatGPT-3.5 scored 86.6%.

ChatGPT-3.5 fared significantly worse when evaluating unusual disorders. It provided accurate responses just 16.6% of the time.

ChatGPT-3.5 had a median accuracy of 4 out of 6 when responding to medical test results, worse than Copilot (Nature)

This study was published in April 2024. ChatGPT-3.5, Copilot, and Gemini were tested on their responses to urea and creatinine test results. Both GPT-3.5 and Gemini scored a median of 4 out of 6. Copilot scored a median of 5.

ChatGPT was less than 66% accurate at identifying drug-drug interactions, worse than BingAI and Bard (PubMed)


ChatGPT is 72% accurate in clinical decision-making across all medical specialties (Mass General Brigham)

This study, published in August 2023, tested ChatGPT in a variety of clinical situations. It had to make the same kinds of decisions as human healthcare professionals. Overall, its responses were 72% accurate.

ChatGPT performed best at making final diagnoses, achieving 77% accuracy. It was less accurate when making clinical management decisions — for example, choosing what medications to use after deciding on a diagnosis — with just 68% of its responses being accurate.

ChatGPT is just 60% accurate at making differential diagnoses (Mass General Brigham)

A differential diagnosis is a situation where a clinician must differentiate between multiple possible conditions that present similarly. They are often difficult calls to make, and it is therefore unsurprising that ChatGPT struggled. Just 60% of its attempts were accurate.

According to one of the researchers involved in this study, this result “tells us where physicians are truly experts and adding the most value.”

ChatGPT answered 77.5% of medical questions correctly (Nature)

In this study, published in Nature in January 2024, ChatGPT-3.5 was tested on 120 questions relating to disease management. It managed to answer 77.5% of the questions correctly. However, only 61.7% of its responses were both correct and complete per professional guidelines.

Interestingly, the researchers noted that ChatGPT performed better in some topics than others. They hypothesized that this may be due to differing volumes of information about different topics in ChatGPT’s training data.

ChatGPT achieved more than 50% accuracy across all US Medical Licensing Examination exams (medRxiv)

The USMLE is a program consisting of three exams. Success is required for an individual to become a licensed doctor of medicine. In one study, published in December 2022, ChatGPT performed well on all three exams. It was more than 50% accurate on each, and often surpassed 60% accuracy. While the passing threshold varies by year, it’s usually around 60%.

ChatGPT’s Accuracy vs Other AI Models

ChatGPT-4o is 99% accurate at classification, better than competitors (Lars Wiik)

In May 2024, LLM engineer Lars Wiik tested ChatGPT-4o on a dataset he created himself. The dataset consisted of 200 sentences, each categorized into one of 50 topics. The test involved correctly assigning each sentence to its topic. ChatGPT-4o made just two errors (198 out of 200, or 99% accuracy), making it the most accurate model tested, ahead of previous versions of ChatGPT and Gemini.


ChatGPT is more accurate than PubMedGPT on a key medical exam (medRxiv)

A study published in December 2022 found that ChatGPT often achieved over 60% accuracy on the United States Medical Licensing Examination. Interestingly, this was more accurate than PubMedGPT, which was just 50.8% accurate. PubMedGPT is similar to ChatGPT, but was only trained on scientific materials. According to the authors of the study, ChatGPT’s advantage may have come from being “exposed to broader clinical content … that [is] more definitive,” rather than only being trained on often-inconclusive or ambivalent scientific literature.

ChatGPT-4o is more accurate than Claude, Gemini, and Llama in four key tests (OpenAI)

When OpenAI released ChatGPT-4o, they trumpeted its strong performance across six tests often applied to LLMs; GPT-4o led its competitors on four of them. In some cases, GPT-4o’s performance was only marginally better. For example, it achieved 88.7% accuracy in the MMLU. That is just 0.9% better than Claude3 Opus, and just 2.6% better than Llama3 400b.

In other cases, GPT-4o demonstrated substantial improvements in accuracy. In the MATH test, GPT-4o achieved 76.6% accuracy. That is around 20 percentage points better than both Gemini Pro 1.5 and Gemini Ultra 1.0.


But it is sometimes less accurate (OpenAI)

GPT-4o wasn’t always more accurate than its competitors, however. In the Multilingual GSM8K (MGSM) test — made up of arithmetic problems posed in different languages — GPT-4o was only slightly less accurate than Claude3 Opus.

ChatGPT-4 hallucinates less than a third as often as Bard (PubMed)

In this study, ChatGPT-4 and Bard were tasked with producing scientific references. Bard hallucinated a concerning 91.4% of the time. GPT-4’s hallucination rate was 28.6% — still high, but less than a third of Bard’s.

ChatGPT’s Accuracy in Different Languages

As we’ve discussed before, ChatGPT is still fundamentally an English tool. Multiple studies have demonstrated that ChatGPT performs best in English. In other languages, particularly those with fewer resources — material on which the model can train — ChatGPT struggles.

ChatGPT-4o is 99% accurate in English, but slightly less accurate in other languages (Lars Wiik)

Lars Wiik, an LLM engineer, tested various AI models on a dataset that was translated from English into various European languages. The results suggested that ChatGPT is generally very accurate, and that newer models are more accurate than older ones — although this isn’t always true. In Russian, for example, GPT-4 Turbo underperformed.


ChatGPT-4o is 1-2% less accurate than competing AIs in non-English languages (Lars Wiik)

Wiik also tested some of OpenAI’s top competitors on the same dataset. ChatGPT-4o was comparable to Gemini 1.5: GPT-4o was more accurate in English, Russian, and Finnish, and the two were even in Norwegian. However, Gemini was more accurate in Spanish, French, German, Dutch, and Portuguese.

More importantly, Claude 3 Opus outperformed GPT-4o in every language except Norwegian, where the two drew.


ChatGPT’s Accuracy at Non-Text Tasks

Originally, AI tools like ChatGPT were unimodal. That means they could only deal with text. While we are still waiting for a truly multimodal AI tool, recent models, like GPT-4o, have implemented some multimodal features.

However, some reports indicate that ChatGPT may be less accurate at these visual tasks.

ChatGPT-4’s accuracy can drop to 50% when answering image queries (OpenAI Community)

In March 2024, a user on the OpenAI forums reported encountering a curious problem. They had been using GPT-4’s vision preview model to power a bot that interpreted and answered questions sent to it in image format. Initially, the bot answered questions with 80 to 90% accuracy. But according to the user, one day the accuracy cratered to 50%.

Other users were unsure why this might be the case. One user recounted rumors that ChatGPT would become “lazy” at times. This could be related to research demonstrating that ChatGPT’s accuracy can change over time.

Conclusion

Ultimately, ChatGPT’s accuracy varies widely. Much of this variation is out of your control. However, there are ways to improve accuracy, such as submitting more specific prompts.

While ChatGPT has been shown to sometimes become less accurate over time, its accuracy has generally improved markedly in the few years since its initial release.

Continuing to improve model accuracy will likely be central to the ongoing competition between OpenAI and its competitors.

Therefore, ChatGPT may well become even more accurate in the future.