Wednesday 19 July 2023

Stanford study reveals worsening performance of ChatGPT over time

The Unpredictable Performance of High-Profile A.I. Chatbot ChatGPT Revealed in Stanford University Study

A recent study conducted by Stanford University has revealed that ChatGPT, the high-profile A.I. chatbot developed by OpenAI, performed inconsistently on certain tasks between March and June of this year. The study compared the chatbot’s performance over time on four tasks: solving math problems, answering sensitive questions, generating software code, and visual reasoning.

Researchers discovered significant fluctuations, or drift, in the chatbot’s ability to perform these tasks. The study examined two versions of OpenAI’s technology, GPT-3.5 and GPT-4. Particularly intriguing findings emerged regarding GPT-4’s performance on math problems. When asked whether the number 17077 is prime, the March version of GPT-4 answered correctly 97.6% of the time. Just three months later, its accuracy had dropped dramatically to a mere 2.4%. The GPT-3.5 model showed the opposite trajectory: its March version gave the correct answer only 7.4% of the time, while the June version answered correctly 86.8% of the time.
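
For context, the primality of 17077 is easy to check directly. The minimal Python sketch below uses trial division up to the square root, which is more than enough for a number this small, and confirms the answer the chatbot was being graded against:

import math

def is_prime(n: int) -> bool:
    # Trial division up to the integer square root of n.
    if n < 2:
        return False
    for d in range(2, math.isqrt(n) + 1):
        if n % d == 0:
            return False
    return True

print(is_prime(17077))  # prints True: 17077 has no divisors other than 1 and itself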

Similar variability was observed when the researchers asked the chatbot to write code and to perform a visual reasoning test. James Zou, a Stanford computer science professor and one of the study’s authors, expressed surprise at the “magnitude of the change” in the sophisticated chatbot’s performance.

The study highlights that the significance of these shifts lies not only in the chatbot’s accuracy on specific tasks, but in the unpredictable impact that changes to one part of the model can have on others. Zou explains that tuning the language model to improve its performance on certain tasks can produce unintended consequences that hurt its performance on other tasks. The interdependencies within the model are complex, and the result can be the deteriorating behaviors the researchers observed.

Unfortunately, the exact nature of these unintended side effects remains elusive, as there is limited visibility into the models powering ChatGPT. OpenAI’s decision to walk back plans to make its code open source has exacerbated the issue. As Zou states, “These are black box models, so we don’t actually know how the model itself, the neural architectures, or the training data have changed.”

The Importance of Monitoring Performance over Time

Despite the lack of transparency, this study serves as a crucial first step in demonstrating that drift occurs in large language models and can lead to significantly different outcomes. Zou emphasizes the importance of continually monitoring the performance of models like ChatGPT over time.
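
As a rough illustration of what such monitoring could look like in practice, the sketch below re-runs a small, fixed set of test prompts against a model and reports that day’s accuracy. The prompts, the keyword-matching scorer, and the query_model placeholder are all illustrative assumptions, not the study’s actual benchmark:

import datetime

# Hypothetical fixed test set of (prompt, expected answer) pairs.
TEST_CASES = [
    ("Is 17077 a prime number? Answer Yes or No.", "yes"),
    ("Is 17078 a prime number? Answer Yes or No.", "no"),
]

def query_model(prompt: str) -> str:
    # Placeholder: wire this to whichever chat-model API is being monitored.
    raise NotImplementedError("connect to your model provider here")

def run_snapshot() -> float:
    # Score the model on the fixed test set and return its accuracy.
    # Crude keyword match; a real harness would parse the answer more carefully.
    correct = 0
    for prompt, expected in TEST_CASES:
        reply = query_model(prompt).strip().lower()
        if expected in reply:
            correct += 1
    return correct / len(TEST_CASES)

if __name__ == "__main__":
    print(f"{datetime.date.today()}: accuracy = {run_snapshot():.1%}")

Run on a schedule, a log of these daily snapshots is what would reveal the kind of drift the researchers describe.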

In addition to giving incorrect answers, ChatGPT also stopped explaining its reasoning. As part of the study, the researchers asked the chatbot to lay out its “chain of thought,” the step-by-step reasoning it follows to reach a conclusion. In March, ChatGPT complied, but by June it had stopped providing step-by-step reasoning. That reasoning is crucial for researchers trying to understand how the chatbot arrives at particular answers, such as whether 17077 is a prime number.
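
For readers unfamiliar with the technique, a chain-of-thought probe of this kind can be as simple as asking the model to reason step by step before giving its answer. A minimal sketch using OpenAI’s Python client follows; the model name, prompt wording, and settings are illustrative and are not the study’s exact protocol:

from openai import OpenAI  # assumes the official openai Python package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",  # illustrative; the study compared dated GPT-4 and GPT-3.5 snapshots
    messages=[{
        "role": "user",
        "content": "Is 17077 a prime number? Think step by step, then answer Yes or No.",
    }],
    temperature=0,  # reduce randomness so runs are easier to compare over time
)
print(response.choices[0].message.content)

A model that complies returns its intermediate reasoning along with the final answer; a model that has stopped complying returns only the answer, which is the change the researchers reported between March and June.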

Furthermore, ChatGPT no longer offered explanations when declining to answer sensitive questions. In the March versions of both GPT-4 and GPT-3.5, the chatbot refused to engage with a discriminatory question about women’s inferiority and explained why it was refusing. By June, however, ChatGPT simply responded with, “sorry, I can’t answer that.”

While the researchers agree that ChatGPT should not engage with such questions, they emphasize that this change reduces transparency and the ability to scrutinize the technology. They state in the paper that while the technology may have become safer, it now provides less rationale.

Editor Notes

The Stanford University study sheds light on the unpredictable performance of ChatGPT, the high-profile A.I. chatbot developed by OpenAI. The research exposes the phenomenon of drift in large language models and the consequences it can have across different tasks, and it underscores the need for continuous monitoring and further study before these models can be thoroughly understood.

To stay up-to-date with the latest advancements in A.I. technology and related research, be sure to visit GPT News Room, where you can find valuable insights and in-depth articles on this rapidly evolving field.

from GPT News Room https://ift.tt/R2NZMEJ
