OpenCoder-1.5B-Instruct
| Property | Value |
|---|---|
| Parameter Count | 1.91B |
| Model Type | Code Generation LLM |
| Architecture | Transformer-based |
| License | INF |
| Paper | OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models |
| Context Length | 4K tokens |
What is OpenCoder-1.5B-Instruct?
OpenCoder-1.5B-Instruct is part of the OpenCoder family of code generation models, designed for programming tasks in both English and Chinese. This instruction-tuned version builds on a base model pretrained on 2.5 trillion tokens, comprising 90% raw code and 10% code-related web data, and is fine-tuned on over 4.5M high-quality supervised examples for strong code generation performance.
Implementation Details
The model uses the BF16 tensor format and supports a 4K token context length. It is built on a standard Transformer architecture, and its training recipe was validated through comprehensive ablation studies of data-cleaning strategies.
- Pretrained on 2.5T tokens with optimized data distribution
- Supervised fine-tuning with 4.5M high-quality examples
- Supports both English and Chinese programming instructions
- Implements advanced file-level and repository-level deduplication
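The file-level deduplication mentioned above can be illustrated with a short, stdlib-only sketch. This is not the OpenCoder pipeline itself (which also performs fuzzy and repository-level deduplication); it only shows the basic exact-match idea: hash each file body and keep one copy per digest.

```python
import hashlib

def dedup_files(files: dict[str, str]) -> dict[str, str]:
    """Keep one copy of each unique file body, keyed by content hash.

    `files` maps a path to its source text; the first path seen for a
    given SHA-256 digest wins, mirroring simple exact file-level dedup.
    """
    seen: set[str] = set()      # digests already kept
    kept: dict[str, str] = {}
    for path, text in files.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept[path] = text
    return kept

# Illustrative corpus: two byte-identical files under different paths.
corpus = {
    "a/utils.py": "def add(a, b):\n    return a + b\n",
    "b/utils.py": "def add(a, b):\n    return a + b\n",  # exact duplicate
    "c/main.py":  "print('hello')\n",
}
unique = dedup_files(corpus)
# The duplicate body under b/utils.py is dropped; two files remain.
```

Production pipelines typically layer near-duplicate detection (e.g. MinHash) on top of exact hashing, since trivial edits defeat an exact-match filter.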
Core Capabilities
- Strong performance on HumanEval (72.5%) and MBPP (72.7%) benchmarks
- Bilingual code generation and understanding
- 4K token context window for handling larger code snippets
- Comprehensive support for various programming tasks
- Production-ready with commercial license support
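The model can be driven like any chat-tuned causal LM in Hugging Face `transformers`. Below is a minimal sketch assuming the `infly/OpenCoder-1.5B-Instruct` repository id (not stated in this card) and a machine with enough memory for BF16 weights; nothing is called at import time, so the weight download only happens when `generate_code` is invoked.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "infly/OpenCoder-1.5B-Instruct"  # assumed HF repo id

def generate_code(prompt: str, max_new_tokens: int = 256) -> str:
    """Load the model in BF16 and answer one chat-style prompt."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,   # matches the card's BF16 format
        device_map="auto",
        trust_remote_code=True,
    )
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(
        output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True
    )

# Example (downloads the weights on first call):
# print(generate_code("Write a Python function that reverses a string."))
```

Prompts can be given in English or Chinese; keep prompt plus response within the 4K token context window.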
Frequently Asked Questions
Q: What makes this model unique?
OpenCoder-1.5B-Instruct stands out for its complete transparency, including released training data, checkpoints, and extensive documentation. It's one of the few models that provides full access to its training pipeline and synthetic data generation process.
Q: What are the recommended use cases?
The model excels in code generation, code completion, and programming assistance in both English and Chinese. It's particularly effective for software development, code documentation, and educational purposes in programming.