Meet 'OpenHathi', the first ever Hindi large language model
The model is built on Meta AI's Llama2-7B architecture

Dec 13, 2023
05:29 pm

What's the story

Sarvam AI, an Indian start-up, has launched 'OpenHathi-Hi-v0.1,' the first Hindi Large Language Model (LLM) in its OpenHathi series. The budget-friendly model extends Llama2-7B and, according to the company, offers GPT-3.5-like performance for Indic languages. Founded by Pratyush Kumar and Vivek Raghavan in July 2023, Sarvam AI raised $41 million in Series A funding from Lightspeed Ventures, Peak XV Partners, and Khosla Ventures.

Details

Two-phase training process and performance

The OpenHathi model extends Llama2-7B's tokenizer to a 48K-token vocabulary and is trained in two phases. In the first phase, embedding alignment, the randomly initialized embeddings for the newly added Hindi tokens are aligned with the existing model. In the second phase, bilingual language modeling, the model learns to attend cross-lingually across tokens. Sarvam AI says the model performs on par with or better than OpenAI's GPT-3.5 on various Hindi tasks while maintaining its English proficiency.
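To make the first phase more concrete, here is a toy sketch of what extending a vocabulary and then training only the new embeddings could look like. It is not Sarvam AI's code: the function names `extend_embeddings` and `sgd_step`, the tiny sizes, the initialization scale, and the choice to freeze the original rows are all assumptions made for illustration.

```python
import random


def extend_embeddings(base, new_vocab_size, seed=0, scale=0.02):
    """Grow an embedding table to new_vocab_size rows.

    Existing rows are copied unchanged; rows for newly added tokens are
    randomly initialized (hypothetical Gaussian init, std = scale).
    Returns the table plus a per-row flag marking which rows are trainable.
    """
    n_new = new_vocab_size - len(base)
    if n_new < 0:
        raise ValueError("new_vocab_size must be >= the current vocabulary size")
    rng = random.Random(seed)
    dim = len(base[0])
    new_rows = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(n_new)]
    table = [row[:] for row in base] + new_rows
    # Assumption: only the new (e.g. Hindi) rows are updated in this phase.
    trainable = [False] * len(base) + [True] * n_new
    return table, trainable


def sgd_step(table, trainable, grads, lr=0.1):
    """Apply one gradient step, skipping rows that are frozen."""
    for i, grad in enumerate(grads):
        if trainable[i]:
            table[i] = [w - lr * g for w, g in zip(table[i], grad)]
    return table


# Toy usage: a 2-token "base" vocabulary extended to 5 tokens.
base = [[1.0, 2.0], [3.0, 4.0]]
table, trainable = extend_embeddings(base, new_vocab_size=5)
grads = [[1.0, 1.0] for _ in table]  # placeholder gradients
sgd_step(table, trainable, grads)
```

In this sketch the original rows stay fixed while the new rows move, which is one simple way to bring fresh embeddings in line with an existing model before training the rest of the network.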

What Next?

Collaboration with AI4Bharat and KissanAI

Sarvam AI developed OpenHathi in collaboration with academic partners at AI4Bharat, who provided language resources and benchmarks. The model was fine-tuned with KissanAI using data from a bot that converses with farmers in multiple languages. KissanAI recently launched Dhenu 1.0, an agriculture-focused large language model designed for Indian agricultural practices that understands English, Hindi, and Hinglish queries.

Insights

Aiming to cater to India's unique needs

Sarvam AI, whose team has a background in AI research and digital infrastructure development, focuses on India's unique needs. The start-up emphasizes integrating generative AI across various Indian languages and encourages collaborations to develop domain-specific AI models using enterprise data. OpenHathi-Hi-v0.1 is a significant step toward meeting the linguistic needs of the Indian market and highlights the potential of AI-driven language models in the country.