OLMo2-8B-SuperBPE-t180k
| Property | Value |
|---|---|
| Parameter Count | 8 Billion |
| Context Length | 3,000 tokens |
| Training Tokens | 331B |
| Tokenizer | SuperBPE (200k vocabulary) |
| Paper | arXiv:2503.13423 |
What is OLMo2-8B-SuperBPE-t180k?
OLMo2-8B-SuperBPE-t180k is a language model developed by the University of Washington that introduces SuperBPE, a tokenization approach that extends traditional subword tokenization with superword tokens able to span multiple words. Because each token covers more text on average, the model achieves 27% more efficient inference than a comparable model trained with standard BPE tokenization.
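The hedged sketch below illustrates the superword idea by comparing token counts from the SuperBPE tokenizer against a conventional BPE tokenizer (GPT-2) on the same text. The Hugging Face repository ID used here is an assumption; check the Hub for the exact name.

```python
# Sketch only: compare SuperBPE token counts against a standard BPE tokenizer.
from transformers import AutoTokenizer

text = "The quick brown fox jumps over the lazy dog."

# SuperBPE tokenizer (assumed repository ID)
superbpe = AutoTokenizer.from_pretrained("UW/OLMo2-8B-SuperBPE-t180k")
# A conventional subword BPE tokenizer for comparison
bpe = AutoTokenizer.from_pretrained("gpt2")

superbpe_ids = superbpe.encode(text)
bpe_ids = bpe.encode(text)

print(f"SuperBPE tokens:     {len(superbpe_ids)}")
print(f"Standard BPE tokens: {len(bpe_ids)}")

# Decode each SuperBPE token individually; superword tokens decode to strings
# containing whitespace, i.e. they span more than one word.
for tok_id in superbpe_ids:
    print(repr(superbpe.decode([tok_id])))
```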
Implementation Details
The model is built on the OLMo2 7B architecture and uses a SuperBPE tokenizer with a 200k vocabulary. Tokenizer training includes a transition point at a vocabulary size of 180k (the t180k in the model name), where learning shifts from subword tokens to superword tokens. The context length is deliberately set to 3,000 tokens so that, measured in bytes, it matches the effective context of a traditional BPE model with a 4,096-token context. A short inspection sketch follows the feature list below.
- Advanced SuperBPE tokenization system
- 8B parameter architecture
- 3,000 token context window
- Trained on 331B tokens
- 200k vocabulary size with 180k transition point
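As a quick sanity check, the following sketch loads the tokenizer and model config to confirm the vocabulary size and configured context length. The repository ID is an assumption, and the 180k transition point is a property of tokenizer training rather than something exposed by the released files.

```python
# Sketch only: inspect vocabulary size and configured context length.
from transformers import AutoConfig, AutoTokenizer

repo_id = "UW/OLMo2-8B-SuperBPE-t180k"  # assumed repository ID

tokenizer = AutoTokenizer.from_pretrained(repo_id)
config = AutoConfig.from_pretrained(repo_id)

print(f"Vocabulary size: {len(tokenizer)}")                  # expected ~200k
print(f"Context length:  {config.max_position_embeddings}")  # expected 3,000
```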
Core Capabilities
- 27% more efficient inference compared to standard BPE models
- Effective handling of both subword and multi-word expressions
- Maintains semantic understanding while reducing token count
- Seamless integration with the HuggingFace Transformers library (see the loading example below)
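A minimal loading-and-generation sketch with the Transformers library is shown below; the repository ID, dtype, and device placement are assumptions to adapt to your setup.

```python
# Sketch only: load the model with Hugging Face Transformers and generate text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "UW/OLMo2-8B-SuperBPE-t180k"  # assumed repository ID

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # adjust to your hardware
    device_map="auto",           # requires the accelerate package
)

inputs = tokenizer("Seattle is a city in", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```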
Frequently Asked Questions
Q: What makes this model unique?
The model's distinguishing feature is its SuperBPE tokenizer, which goes beyond traditional BPE by incorporating superword tokens that can span multiple words, leading to significant efficiency gains in processing and inference.
Q: What are the recommended use cases?
This model is particularly well-suited for applications where inference efficiency is crucial, such as production deployments with resource constraints. It maintains the capabilities of traditional language models while requiring fewer computational resources.