Large language models (LLMs) have revolutionized how we interact with technology, but their prowess has been largely confined to English. What about the millions of people who speak other languages? Researchers at INSAIT are tackling this challenge head-on with BgGPT, a project dedicated to extending the capabilities of powerful LLMs to Bulgarian. They're not just translating English LLMs; they're building a model that truly *understands* and *generates* high-quality Bulgarian text while retaining, and even improving, the original English capabilities.

This is a significant hurdle: adapting an existing LLM to a new language often degrades performance in the original language, a phenomenon known as catastrophic forgetting. To combat this, the team uses innovative techniques such as Branch-and-Merge, a continual learning strategy that minimizes performance loss while maximizing gains in the new language. They've also curated a massive dataset of over 100 billion tokens of Bulgarian and English text.

BgGPT isn't just a research project; it's a real-world application. The models power a Bulgarian chat service, making powerful AI accessible to users without specialized hardware. The team has also focused on educational applications, benchmarking BgGPT against state-of-the-art models on exam questions provided by the Bulgarian Ministry of Education. The results are impressive: BgGPT outperforms larger multilingual models like Qwen-2.5 and Llama-3.1 on several key benchmarks, demonstrating its specialized proficiency in Bulgarian.

While the primary focus is Bulgarian, the researchers believe their methods can be adapted to other lower-resource languages, opening the door to wider access to powerful AI tools. This is about more than translation; it's about bridging the digital divide and bringing the transformative power of AI to everyone, regardless of the language they speak.
🍰 Interested in building your own agents?
PromptLayer provides the tools to manage and monitor prompts with your whole team. Get started for free.
Questions & Answers
What is the Branch-and-Merge technique used in BgGPT, and how does it address catastrophic forgetting?
Branch-and-Merge is a continual learning strategy that allows LLMs to learn new languages while preserving existing capabilities. The technique works by creating separate learning pathways ('branches') for new language acquisition while maintaining the original language knowledge base, then carefully merging these pathways to create a unified model. This process involves: 1) Creating a specialized branch for Bulgarian language learning, 2) Training this branch independently to prevent interference with English capabilities, and 3) Strategically merging the branches to maintain performance in both languages. In practice, this allows BgGPT to outperform larger multilingual models while maintaining strong performance in both Bulgarian and English.
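To make the merge step concrete, here is a minimal sketch of weight-space merging between a base checkpoint and a language-adapted branch, assuming two Hugging Face checkpoints with the same architecture. The model names and the 0.5 interpolation weight are illustrative placeholders, not the exact recipe from the BgGPT work:

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical checkpoint names; substitute your own base and branch models.
base = AutoModelForCausalLM.from_pretrained("english-base-model")
branch = AutoModelForCausalLM.from_pretrained("bulgarian-branch-model")

alpha = 0.5  # mixing weight between the two branches (illustrative)
merged_state = {}
with torch.no_grad():
    branch_state = branch.state_dict()
    for name, base_param in base.state_dict().items():
        # Linear interpolation in weight space: the merged model keeps a
        # blend of the original (English) and branch (Bulgarian) parameters.
        merged_state[name] = (1 - alpha) * base_param + alpha * branch_state[name]

base.load_state_dict(merged_state)
base.save_pretrained("merged-model")
```

The full procedure described above trains the branch separately before merging, and may branch and merge repeatedly over the course of training; this snippet only shows the parameter-averaging core of a merge step.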
How are language models making AI more accessible to non-English speakers?
Language models are democratizing AI access by breaking down language barriers through specialized training and localization. They enable non-English speakers to interact with AI in their native language, access information, and utilize AI tools without requiring English proficiency. Key benefits include improved educational opportunities, better access to digital services, and more inclusive technological advancement. For example, models like BgGPT allow Bulgarian speakers to use AI chatbots, educational tools, and other applications in their native language, making advanced technology accessible to millions more users.
What are the main challenges in developing AI models for less common languages?
Developing AI models for less common languages faces several key challenges: limited available training data, resource constraints, and the risk of performance degradation in other languages. The benefits of addressing these challenges include broader global AI accessibility, preserved cultural diversity in technology, and improved local economic opportunities. Real-world applications include educational tools, customer service, and content creation in local languages. Success stories like BgGPT demonstrate that these challenges can be overcome through innovative techniques and careful dataset curation.
PromptLayer Features
Testing & Evaluation
BgGPT's evaluation against educational benchmarks and its comparison with other models align with systematic testing needs
Implementation Details
Set up automated testing pipelines using Bulgarian Ministry of Education exam questions, implement A/B testing between model versions, and track performance metrics across both languages; a minimal harness is sketched below
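As a starting point, here is an illustrative harness for such a pipeline. The dataset file, model version tags, and the `generate()` helper are placeholders for whatever inference stack you use, and scoring by substring match is a deliberate simplification:

```python
import json

def generate(model_name: str, prompt: str) -> str:
    """Placeholder: swap in your real inference call (API or local model)."""
    return ""  # dummy output so the harness runs end to end

def accuracy(questions: list[dict], model_name: str) -> float:
    correct = 0
    for q in questions:
        answer = generate(model_name, q["question"])
        # Naive check: does the expected answer appear in the model output?
        if q["expected"].strip().lower() in answer.strip().lower():
            correct += 1
    return correct / len(questions)

# Hypothetical JSON file of exam questions:
# [{"question": "...", "expected": "..."}, ...]
with open("exam_questions.json", encoding="utf-8") as f:
    questions = json.load(f)

# A/B comparison between two model versions (tags are illustrative).
for model in ("bggpt-model-a", "bggpt-model-b"):
    print(f"{model}: accuracy={accuracy(questions, model):.1%}")
```

Running the same fixed question set against every model iteration is what yields the regression signal listed under Key Benefits below.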
Key Benefits
• Standardized evaluation across model iterations
• Automated regression testing for both languages
• Quantifiable performance comparisons