Tuesday, 5 September 2023

Inception, MBZUAI, and Cerebras Release Open-Source ‘Jais’, a Cutting-Edge Arabic Large Language Model

Introducing Jais: A Revolutionary Arabic Large Language Model

Large language models such as GPT-3 have had a profound impact on various facets of society. These models have significantly advanced the field of natural language processing (NLP), improving the accuracy of tasks such as translation, sentiment analysis, summarization, and question-answering. Furthermore, chatbots and virtual assistants powered by large language models are becoming increasingly sophisticated and capable of handling complex conversations, making them valuable in customer support, online chat services, and even companionship for some users.

However, building Arabic Large Language Models (LLMs) presents unique challenges due to the characteristics of the Arabic language and the diversity of its dialects. Similar to large language models in other languages, Arabic LLMs may inherit biases from the training data, raising concerns regarding responsible AI use in Arabic contexts.

Introducing Jais and Jais-chat: New Arabic-Centric Large Language Models

Researchers at Inception, Cerebras, and the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) in the UAE have taken on the challenge of developing Jais, a new Arabic-centric large language model, alongside Jais-chat, an instruction-tuned variant for dialogue. Jais is a 13-billion-parameter model built on a GPT-3-style decoder-only generative pretraining architecture.
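As a concrete illustration, the sketch below shows how a released 13B causal language model of this kind is typically loaded and prompted with the Hugging Face transformers library. The model identifier "inceptionai/jais-13b", the use of trust_remote_code=True, and the generation settings are assumptions made for illustration; the official model card is the authoritative reference.

```python
# Minimal sketch: loading and prompting a Jais-style causal LM with Hugging Face
# transformers. The hub id and trust_remote_code flag are assumptions; check the
# official model card for exact usage.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "inceptionai/jais-13b"  # assumed hub id; verify before use

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # 13B parameters; half precision to fit on a large GPU
    device_map="auto",
    trust_remote_code=True,
)

prompt = "عاصمة دولة الإمارات العربية المتحدة هي"  # "The capital of the UAE is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```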

One of the primary challenges the researchers faced was obtaining high-quality Arabic data for training the model. While English data is readily available, with corpora of up to two trillion tokens, Arabic corpora are significantly smaller. A corpus is a large, structured collection of texts used in linguistics, NLP, and text analysis for research and for training language models; such collections provide valuable resources for studying language patterns, grammar, semantics, and more.
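Corpus sizes in this context are counted in tokens rather than words or characters. The sketch below shows one way to measure a corpus's token count; it uses the GPT-2 tokenizer purely for illustration, since Jais has its own bilingual vocabulary and would produce different counts.

```python
# Minimal sketch: measuring corpus size in tokens, the unit used for the figures above.
# The GPT-2 tokenizer is illustrative only; counts differ across tokenizers.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def count_tokens(documents):
    """Return the total number of tokens across an iterable of text documents."""
    return sum(len(tokenizer.encode(doc)) for doc in documents)

corpus = [
    "Large language models are trained on billions of tokens.",
    "النماذج اللغوية الكبيرة تتطلب كميات هائلة من النصوص العربية.",
]
print(count_tokens(corpus))
```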

To address this challenge, the researchers trained bilingual models, augmenting the limited Arabic pretraining data with abundant English pretraining data. Jais was pre-trained on 395 billion tokens, including 72 billion Arabic tokens and 232 billion English tokens. The team also developed a specialized Arabic text processing pipeline with thorough data filtering and cleaning to produce high-quality Arabic data.
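The article does not spell out that pipeline, so the sketch below is only a hedged illustration of the kind of filtering and cleaning such a pipeline might apply. The specific rules here (Unicode normalization, an Arabic-character ratio threshold, a minimum length, exact-duplicate removal) are illustrative assumptions, not the authors' actual processing steps.

```python
# Minimal sketch of filtering and cleaning steps an Arabic pretraining pipeline
# might apply. All thresholds and rules below are illustrative assumptions.
import re
import unicodedata

ARABIC_CHARS = re.compile(r"[\u0600-\u06FF]")

def clean_document(text: str) -> str:
    """Normalize Unicode, strip control characters, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = "".join(ch for ch in text if unicodedata.category(ch)[0] != "C" or ch in "\n\t")
    return re.sub(r"[ \t]+", " ", text).strip()

def keep_document(text: str, min_chars: int = 200, min_arabic_ratio: float = 0.5) -> bool:
    """Keep documents that are long enough and predominantly Arabic."""
    if len(text) < min_chars:
        return False
    arabic = len(ARABIC_CHARS.findall(text))
    letters = sum(ch.isalpha() for ch in text) or 1
    return arabic / letters >= min_arabic_ratio

def build_corpus(raw_documents):
    """Clean, filter, and exactly deduplicate a stream of raw documents."""
    seen = set()
    for doc in raw_documents:
        doc = clean_document(doc)
        if keep_document(doc) and doc not in seen:
            seen.add(doc)
            yield doc
```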

Outperforming Open-Source Arabic Models and Competing with State-of-the-Art English Models

The researchers claim that Jais’ pretrained and fine-tuned capabilities surpass those of all known open-source Arabic models. Moreover, Jais is comparable to state-of-the-art open-source English models trained on larger datasets. Given the safety concerns surrounding LLMs, the researchers further fine-tuned the model with safety-oriented instructions. The model also includes additional guardrails in the form of safety prompts, keyword-based filtering, and external classifiers to enhance safety and mitigate potential risks.
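To make the layered-guardrail idea concrete, here is a minimal sketch of how a safety system prompt, a keyword-based filter, and an external classifier can be composed around a raw generation call. The prompt wording, keyword list, and classifier interface are placeholders for illustration, not the released model's actual guardrails.

```python
# Minimal sketch of the three guardrail layers described above: a safety prompt,
# a keyword-based filter, and a hook for an external safety classifier.
# All strings and interfaces below are illustrative placeholders.
from typing import Callable

SAFETY_PROMPT = "You are a helpful assistant. Refuse harmful or unsafe requests."
BLOCKED_KEYWORDS = {"make a bomb", "credit card numbers"}  # placeholder list
REFUSAL = "I can't help with that request."

def keyword_flagged(text: str) -> bool:
    """Return True if the text contains any blocked keyword."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in BLOCKED_KEYWORDS)

def guarded_generate(
    user_prompt: str,
    generate: Callable[[str], str],
    classify_unsafe: Callable[[str], bool],
) -> str:
    """Apply the guardrail layers around a raw generate() call."""
    if keyword_flagged(user_prompt):  # layer 1: keyword filter on the input
        return REFUSAL
    # layer 2: prepend the safety prompt before generation
    response = generate(f"{SAFETY_PROMPT}\n\nUser: {user_prompt}\nAssistant:")
    # layer 3: external classifier (and keyword check) on the output
    if classify_unsafe(response) or keyword_flagged(response):
        return REFUSAL
    return response
```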

According to the researchers, Jais represents a significant evolution and expansion of the NLP and AI landscape in the Middle East. It advances language understanding and generation in Arabic, empowering local players with sovereign and private deployment options. This fosters a vibrant ecosystem of applications and innovation, aligned with broader strategic initiatives of digital and AI transformation to usher in a more inclusive, culturally-aware era.

Conclusion

With the introduction of Jais, the Arabic language processing capabilities of large language models have reached new heights. The researchers have successfully navigated the challenges associated with building Arabic Large Language Models and have demonstrated the efficacy of their approach. Jais not only outperforms existing open-source Arabic models but also rivals state-of-the-art open-source English models. This development paves the way for enhanced language understanding and generation in the Arabic-speaking world, while promoting responsible AI use. As we move forward, it will be crucial to continue addressing biases and promoting the ethical and inclusive deployment of large language models across different languages and cultures.

Editor’s Notes

It is exciting to see the advancements in large language models and the progress made in the Arabic language processing domain. Jais represents a significant step forward in empowering Arabic-speaking communities and promoting digital transformation in the Middle East. The responsible use of AI and addressing biases are critical considerations in the development and deployment of large language models, and it is encouraging to see these factors taken into account in Jais. The introduction of models like Jais not only enhances language capabilities but also contributes to a more inclusive and linguistically diverse AI landscape. To stay updated on the latest AI research news and projects, be sure to check out GPT News Room.
