Maintained by: bigcode

SantaCoder

Parameters: 1.1B
Training Data: The Stack v1.1 (Python, Java, JavaScript)
License: BigCode OpenRAIL-M
Paper: SantaCoder: Don't reach for the stars!
Training Infrastructure: 96 Tesla V100 GPUs

What is SantaCoder?

SantaCoder is a specialized code generation model trained on a carefully curated dataset of Python, Java, and JavaScript code. It implements Multi-Query Attention and the innovative Fill-in-the-Middle objective, allowing it to not only generate code sequentially but also fill in missing code segments within existing structures.
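For infilling, the model expects the prefix and suffix to be wrapped in sentinel tokens before the missing middle is generated. A minimal sketch of the prompt assembly, assuming the sentinel token names from the bigcode/santacoder model card (verify them against the tokenizer's special tokens before relying on them):

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a Fill-in-the-Middle prompt in prefix-suffix-middle order.

    Sentinel token names are taken from the bigcode/santacoder model card;
    the model generates the missing middle after the final sentinel.
    """
    return f"<fim-prefix>{prefix}<fim-suffix>{suffix}<fim-middle>"


# Ask the model to fill in the body between a signature and a return statement:
prompt = build_fim_prompt(
    prefix="def fib(n):\n    ",
    suffix="\n    return a\n",
)
```

The resulting string is tokenized and passed to the model like any other prompt; the completion that follows `<fim-middle>` is the infilled code.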

Implementation Details

The model was trained for 600K steps on 236 billion tokens using a modified GPT-2 architecture. It employs a 2048-token context window and uses float16 precision for efficient computation. The training process took 6.2 days on 96 Tesla V100 GPUs, consuming approximately 2.1 × 10^21 FLOPs.
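The reported compute is consistent with the common 6·N·D rule of thumb (roughly 6 FLOPs per parameter per training token for dense transformers); a quick sanity check using the figures above:

```python
params = 1.1e9   # 1.1B parameters
tokens = 236e9   # 236B training tokens

# Rough 6·N·D estimate of total training compute for a dense transformer
flops_estimate = 6 * params * tokens

print(f"{flops_estimate:.2e}")  # prints 1.56e+21
```

This lands on the same order of magnitude as the reported 2.1 × 10^21 FLOPs; the gap is expected, since 6·N·D ignores attention costs and hardware utilization.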

  • Utilizes Multi-Query Attention for improved efficiency
  • Implements Fill-in-the-Middle objective for versatile code completion
  • Trained with near-deduplication and comment-to-code ratio filtering
  • Supports three major programming languages: Python, Java, and JavaScript

Core Capabilities

  • Code generation and completion in Python, Java, and JavaScript
  • Fill-in-the-Middle functionality for code infilling tasks
  • Achieves 18% pass@1 on the Python HumanEval benchmark
  • Strong performance on code-to-text tasks with an 18.13 BLEU score
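Here pass@1 is the fraction of HumanEval problems for which a single generated sample passes the unit tests. When multiple samples are drawn per problem, the standard unbiased estimator from the HumanEval evaluation methodology can be used; the sample counts below are illustrative, not from the paper:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: of n samples drawn, c passed the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers: 200 samples per problem, 36 of them correct
score = pass_at_k(200, 36, 1)  # ≈ 0.18, i.e. 18% pass@1
```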

Frequently Asked Questions

Q: What makes this model unique?

SantaCoder's combination of Multi-Query Attention and Fill-in-the-Middle objective, along with its focused training on three major programming languages, makes it particularly effective for code generation tasks. The model was trained with careful consideration of code quality through filtering criteria like near-deduplication and comment-to-code ratio.

Q: What are the recommended use cases?

The model excels at code completion and generation when provided with appropriate context like comments or function signatures. It's important to note that it's not an instruction-following model, so inputs should be formatted as they would appear in source code rather than as natural language commands.
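In practice this means a task description should be phrased as a comment above a function signature rather than as an instruction. A small sketch of that reformatting (the helper name is ours, for illustration):

```python
def as_code_context(description: str, signature: str) -> str:
    """Turn a natural-language task into a code-shaped prompt:
    the description becomes a comment above the function signature,
    which is the input style a non-instruction-tuned code model expects."""
    comment = "\n".join(f"# {line}" for line in description.splitlines())
    return f"{comment}\n{signature}\n"


prompt = as_code_context(
    "Return the n-th Fibonacci number.",
    "def fib(n: int) -> int:",
)
```

The resulting prompt can then be fed to the model with the standard transformers generation API (note the bigcode/santacoder checkpoint requires `trust_remote_code=True` when loading, per its model card).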
