Tuesday 11 July 2023

Chatbot Training Could Be Impacted as AI Explores the Limits of Available Text

The Challenges Faced by AI Developers in Training Chatbots

AI developers, including OpenAI, are facing a new challenge in training chatbots like ChatGPT. According to Stuart Russell, a professor at the University of California, Berkeley, AI developers are “running out of text” to train these language models effectively. This poses a significant obstacle to the advancement of AI technology.

The Limitations of Text Availability

Russell explains that the strategy behind training large language models like ChatGPT heavily relies on massive amounts of text data. However, the availability of digital text is starting to dwindle, causing AI development to hit a “brick wall.” In an interview with the International Telecommunication Union, a UN communications agency, Russell expressed concerns about the future of generative AI, stating that the scarcity of text could impede its progress.

This scarcity of text poses a potential challenge for generative AI developers, as their methods of data collection and model training may need to be adjusted in the coming years. Despite this hurdle, Russell remains optimistic about the future of AI, believing that it will eventually replace humans in various language-related tasks.

The Impact on Data Collection Practices

The shortage of training text is drawing attention to the data-collection practices of AI developers, particularly those behind ChatGPT and other chatbots. These practices face growing scrutiny over the unauthorized replication of creative works and the use of social media platforms’ data without permission.

A study conducted by Epoch, a group of AI researchers, suggests that machine learning datasets will likely exhaust all “high-quality language data” before 2026. The term refers to text from reliable sources such as books, news articles, scientific papers, Wikipedia, and filtered web content. These sources have been fundamental in training the current generative AI tools, including ChatGPT.

Russell stated in an email to Insider that there are reports, albeit unconfirmed, suggesting that OpenAI has resorted to purchasing text datasets from private sources. This implies that the available high-quality public data might no longer be sufficient to meet the training requirements of AI models.

These reports have heightened concerns about OpenAI’s data collection practices. The company has faced multiple lawsuits alleging the use of personal data and copyrighted materials in training ChatGPT. Sarah Silverman and other authors filed a lawsuit accusing OpenAI of copyright infringement, pointing to the chatbot’s ability to generate accurate summaries of their work.

Future Implications and OpenAI’s Response

As the demand for text data continues to surpass its availability, AI developers must adapt their training methods. The limitations in text availability could have significant implications for the future development and capabilities of AI language models.

Russell indicated that OpenAI has likely supplemented its public language data with private archive sources to create GPT-4, its most advanced AI model yet. However, OpenAI has not disclosed the exact details of GPT-4’s training datasets. The company has not publicly commented on the ongoing lawsuits against it, and CEO Sam Altman has expressed a desire to avoid legal conflicts.

Despite these obstacles, Russell expects AI technology to continue advancing and replacing humans in various language-related tasks. To sustain the progress of generative AI, however, developers will need to find alternative approaches to data collection and training.

Editor Notes

AI development has reached a critical point with the scarcity of text for training chatbots. The challenges faced by AI developers, particularly in data collection and model training, require strategic adaptation to overcome these limitations. As AI continues to evolve, it is essential to prioritize ethical practices and address concerns regarding the use of copyrighted materials and personal data.





from GPT News Room https://ift.tt/ZQkWnEo

