Hymba-1.5B-Base
| Property | Value |
|---|---|
| Parameter Count | 1.52B |
| Model Type | Text Generation |
| Architecture | Hybrid Mamba-Attention |
| License | NVIDIA Open Model License |
| Paper | arXiv:2411.13676 |
What is Hymba-1.5B-Base?
Hymba-1.5B-Base is a language model developed by NVIDIA that introduces a hybrid-head architecture in which Mamba (state-space) heads and attention heads run in parallel within each layer. With 1.52B parameters, it is designed for efficient inference and, per NVIDIA's reported benchmarks, outperforms other publicly available sub-2B parameter models.
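Below is a minimal usage sketch, assuming the checkpoint is published on the Hugging Face Hub as `nvidia/Hymba-1.5B-Base` and is loaded with `trust_remote_code` since the architecture is custom; the dtype and device choices are illustrative.

```python
# Minimal text-generation sketch (assumed Hub repo id and loading options).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "nvidia/Hymba-1.5B-Base"  # assumed Hugging Face Hub repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,       # assumed dtype; adjust to your hardware
    trust_remote_code=True,           # custom hybrid architecture code
).to("cuda" if torch.cuda.is_available() else "cpu")

# Batch size 1 only: the current implementation does not support larger batches.
inputs = tokenizer(
    "The benefits of hybrid Mamba-attention models include",
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```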
Implementation Details
The model consists of 32 layers with an embedding size of 1600 and 25 attention heads. Each layer combines standard attention heads with Mamba heads that process the same input in parallel, with their outputs fused. The SSM state size is 16, and only 3 layers use full (global) attention; the remaining layers use sliding window attention. A simplified sketch of the parallel hybrid block follows the list below.
- Embedding dimension: 1600
- MLP intermediate dimension: 5504
- Utilizes Grouped-Query Attention (GQA)
- Implements Rotary Position Embeddings (RoPE)
- Prepends learnable meta tokens to the input to improve attention behavior
- Shares the KV cache across layers as well as across heads
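The following is a minimal, self-contained PyTorch sketch of the parallel hybrid-block idea only, not NVIDIA's implementation: plain multi-head attention stands in for the GQA/sliding-window attention path, a toy gated recurrence stands in for a real Mamba head, and meta tokens are omitted. The fusion step (normalize each path's output, then average, plus a residual) is a simplification of the approach described in the Hymba paper.

```python
# Simplified sketch: attention and an SSM-like path run over the same input,
# and their normalized outputs are averaged. Dimensions mirror the card
# (hidden size 1600, 25 attention heads, SSM state size 16).
import torch
import torch.nn as nn


class ToySSMHead(nn.Module):
    """A minimal gated linear recurrence standing in for a Mamba head."""

    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.in_proj = nn.Linear(d_model, d_state)
        self.gate = nn.Linear(d_model, d_state)
        self.out_proj = nn.Linear(d_state, d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        u = self.in_proj(x)
        a = torch.sigmoid(self.gate(x))            # per-step decay in (0, 1)
        h = torch.zeros_like(u[:, 0])
        states = []
        for t in range(x.size(1)):                 # sequential scan (illustrative only)
            h = a[:, t] * h + (1 - a[:, t]) * u[:, t]
            states.append(h)
        return self.out_proj(torch.stack(states, dim=1))


class ParallelHybridBlock(nn.Module):
    """Attention and SSM sublayers applied in parallel to the same input."""

    def __init__(self, d_model: int = 1600, n_heads: int = 25):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ssm = ToySSMHead(d_model)
        self.attn_out_norm = nn.LayerNorm(d_model)
        self.ssm_out_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        ssm_out = self.ssm(h)
        # Fuse the two parallel paths: normalize each output, average, add residual.
        return x + 0.5 * (self.attn_out_norm(attn_out) + self.ssm_out_norm(ssm_out))


block = ParallelHybridBlock()
y = block(torch.randn(1, 8, 1600))
print(y.shape)  # torch.Size([1, 8, 1600])
```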
Core Capabilities
- Strong performance relative to other publicly available sub-2B parameter models
- Efficient text generation with parallel processing
- Commercial-ready deployment
- Flexible adaptation for various NLP tasks
- Memory-efficient inference through KV cache sharing and sliding window attention (see the illustrative estimate after this list)
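As a rough illustration of why these choices save memory, here is a back-of-the-envelope KV-cache estimate. Only the layer count (32) and the 3 full-attention layers come from this card; the KV-head count, head dimension, window size, sharing group size, and sequence length are assumptions chosen for illustration.

```python
# Back-of-the-envelope KV-cache estimate showing how sliding-window attention
# and cross-layer cache sharing shrink memory versus full attention everywhere.

def kv_cache_bytes(num_layer_caches, kv_heads, head_dim, cached_tokens, bytes_per_elem=2):
    # 2x for keys and values, stored for every cached token in every layer cache.
    return 2 * num_layer_caches * kv_heads * head_dim * cached_tokens * bytes_per_elem

seq_len, window = 8192, 1024          # assumed context length and window size
layers, full_attn_layers = 32, 3      # from the model card
kv_heads, head_dim = 5, 64            # assumed GQA setting

# Baseline: every layer keeps a full-length cache, no sharing.
baseline = kv_cache_bytes(layers, kv_heads, head_dim, seq_len)

# Hybrid-style: 3 global layers cache the full context, the rest only a window,
# and consecutive layers share one cache (assumed groups of 2 -> ~half the caches).
global_part = kv_cache_bytes(full_attn_layers, kv_heads, head_dim, seq_len)
window_part = kv_cache_bytes((layers - full_attn_layers) // 2, kv_heads, head_dim, window)
hybrid = global_part + window_part

print(f"baseline ~{baseline / 2**20:.1f} MiB, hybrid ~{hybrid / 2**20:.1f} MiB")
```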
Frequently Asked Questions
Q: What makes this model unique?
The hybrid architecture, which runs Mamba and attention heads in parallel within each layer, together with meta tokens and cross-layer KV sharing, makes the model notably efficient for its size. The parallel design processes inputs through both attention and SSM paths simultaneously while maintaining strong performance with a relatively small parameter count.
Q: What are the recommended use cases?
Hymba-1.5B-Base is suitable for a range of text generation tasks and can be adapted for commercial applications under the NVIDIA Open Model License. Note, however, that the current implementation supports generation only at batch size 1, due to constraints involving meta tokens and sliding window attention.