# OctoCoder
| Property | Value |
|---|---|
| Parameter Count | 15.5B |
| License | bigcode-openrail-m |
| Paper | OctoPack: Instruction Tuning Code Large Language Models |
| Training Data | CommitPackFT & OASST |
| Supported Languages | 80+ programming languages |
## What is OctoCoder?
OctoCoder is an instruction-tuned language model designed for code generation and understanding. Built on the StarCoder architecture, it was fine-tuned on CommitPackFT, a filtered dataset of GitHub commits, together with OASST instruction data, making it adept at understanding and generating code across a wide range of programming languages.
## Implementation Details
The model uses a GPT-2-style architecture with multi-query attention and a Fill-in-the-Middle objective. It inherits the StarCoder base model's pretraining, 250k steps over 1 trillion tokens on 512 Tesla A100 GPUs across 24 days, and adds 30 instruction tuning steps over 2M tokens, run in 4 hours on 8 Tesla A100 GPUs. The model operates in bfloat16 precision.
- Architecture based on GPT-2 with multi-query attention
- Trained on CommitPackFT and OASST datasets
- Implements Fill-in-the-Middle objective
- Uses PyTorch and Megatron-LM/Transformers framework
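The Fill-in-the-Middle objective mentioned above trains the model to complete a span between a known prefix and suffix. As a minimal sketch, a FIM prompt in the StarCoder family is built by rearranging the document around sentinel tokens (the exact token strings below follow the StarCoder convention and are an assumption for this illustration):

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a Fill-in-the-Middle prompt in StarCoder-style PSM order.

    The model is asked to generate the span that belongs between
    `prefix` and `suffix`; generation continues after <fim_middle>.
    """
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"


# Example: ask the model to fill in a function body.
prompt = build_fim_prompt(
    prefix="def reverse(s):\n    ",
    suffix="\n\nprint(reverse('abc'))",
)
```

During pretraining, documents are randomly split into prefix/middle/suffix, and the model learns to reconstruct the middle, which is what enables infilling at inference time.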
## Core Capabilities
- Multi-language code generation across 80+ programming languages
- Strong performance in Python (46.2% pass@1 on HumanEvalSynthesize)
- Effective code explanation and bug fixing capabilities
- Instruction-following with specific input format requirements
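The 46.2% pass@1 figure above is an unbiased estimate of the probability that a single sampled solution passes all unit tests. A short sketch of the standard pass@k estimator (introduced with the HumanEval benchmark; `n` samples per problem, `c` of which are correct):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n generations is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k slots
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 2 samples and 1 correct, pass@1 is 0.5.
print(pass_at_k(2, 1, 1))  # → 0.5
```

The benchmark score is the mean of this estimate over all problems in the suite.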
## Frequently Asked Questions
**Q: What makes this model unique?**
OctoCoder stands out for its extensive training on real-world GitHub commits and its ability to handle multiple programming languages effectively. Its instruction-tuning approach makes it particularly well-suited for direct interaction through natural language prompts.
**Q: What are the recommended use cases?**
The model excels at code generation, bug fixing, and code explanation tasks. It is particularly strong in Python but maintains good performance across languages including JavaScript, Java, Go, C++, and Rust. For optimal results, users should format queries with a "Question: " prefix and an "Answer:" suffix.
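A minimal sketch of that prompt convention, with the Hugging Face `transformers` loading code shown only in comments (the `bigcode/octocoder` checkpoint is large, and generation settings here are illustrative assumptions):

```python
def format_prompt(instruction: str) -> str:
    """Wrap a natural-language request in the Question/Answer format
    that OctoCoder expects."""
    return f"Question: {instruction}\n\nAnswer:"


# Usage with Hugging Face transformers (not executed here; the 15.5B
# checkpoint requires substantial memory):
#
#   from transformers import AutoTokenizer, AutoModelForCausalLM
#   tok = AutoTokenizer.from_pretrained("bigcode/octocoder")
#   model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder")
#   inputs = tok(format_prompt("Write a function that reverses a string."),
#                return_tensors="pt")
#   out = model.generate(**inputs, max_new_tokens=128)
#   print(tok.decode(out[0]))

print(format_prompt("Explain what this regex does: ^a+$"))
```

The model generates its response after the "Answer:" marker, so downstream code typically strips everything up to and including that token before presenting the output.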