shisa-v1-llama3-8b

shisa-ai

An 8B-parameter Llama 3-based model fine-tuned for Japanese-English tasks, achieving strong results on the ELYZA100 and JA MT-Bench benchmarks; trained with a learning rate of 8e-6.

| Property | Value |
| --- | --- |
| Base Model | Meta-Llama-3-8B-Instruct |
| Learning Rate | 8e-6 |
| Training Epochs | 3 |
| Model Type | LlamaForCausalLM |
| HuggingFace Link | shisa-ai/shisa-v1-llama3-8b |

What is shisa-v1-llama3-8b?

shisa-v1-llama3-8b is a fine-tuned version of Meta's Llama 3 8B model, specifically optimized for Japanese-English language tasks. The model demonstrates impressive performance across multiple benchmarks, achieving an average score of 6.59 across ELYZA100, JA MT-Bench, Rakuda, and Tengu-Bench evaluations.

Implementation Details

The model was trained using the Axolotl framework (version 0.4.0) with a sequence length of 8192 and employs advanced features such as gradient checkpointing and flash attention. Training was conducted using the ultra-orca-boros-en-ja-v1 dataset with a learning rate of 8e-6, which proved optimal among various tested configurations.

  • Uses 8-bit AdamW optimizer with linear learning rate scheduling
  • Implements gradient accumulation over 8 steps
  • Trained with mixed precision (BF16) and flash attention
  • Achieves 91.30% Japanese character accuracy
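The training setup above can be sketched as an Axolotl config fragment. The hyperparameter values reflect what this card states; the dataset path and conversation format are illustrative assumptions, not the project's exact config:

```yaml
# Sketch of an Axolotl (v0.4.0) config matching the hyperparameters above.
base_model: meta-llama/Meta-Llama-3-8B-Instruct
sequence_len: 8192
gradient_checkpointing: true
flash_attention: true
bf16: true                       # mixed-precision training
optimizer: adamw_bnb_8bit        # 8-bit AdamW
lr_scheduler: linear
learning_rate: 8e-6
gradient_accumulation_steps: 8
num_epochs: 3
datasets:
  - path: ultra-orca-boros-en-ja-v1   # dataset named in this card; exact repo path is an assumption
    type: sharegpt                    # assumed conversation format
```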

Core Capabilities

  • Strong performance on ELYZA100 (6.67 score)
  • Excellent MT-Bench results (6.95 score)
  • Robust Rakuda benchmark performance (7.05 score)
  • Competitive positioning among other Japanese-capable models

Frequently Asked Questions

Q: What makes this model unique?

The model represents a sweet spot in the performance-size trade-off, achieving strong results with only 8B parameters. Its learning rate of 8e-6 proved optimal among the several configurations tested during fine-tuning.

Q: What are the recommended use cases?

The model is particularly well-suited for Japanese-English bilingual tasks, showing strong performance in translation, comprehension, and general language understanding. It's positioned as a practical option for applications requiring reliable Japanese language capabilities without the computational overhead of larger models.
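A minimal inference sketch using Hugging Face `transformers` (a model download is required to actually generate). The `build_prompt` helper reproduces the Llama 3 Instruct chat layout by hand so the format is visible; in practice `tokenizer.apply_chat_template` handles this for you. The example question is illustrative:

```python
# Minimal inference sketch for shisa-ai/shisa-v1-llama3-8b.

def build_prompt(user_message: str) -> str:
    """Wrap a single user turn in the Llama 3 Instruct chat format."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

if __name__ == "__main__":
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "shisa-ai/shisa-v1-llama3-8b"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )

    prompt = build_prompt("日本の首都はどこですか？")  # "What is the capital of Japan?"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128)
    # Decode only the newly generated tokens, skipping the prompt.
    print(tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    ))
```

Because the heavy model loading is guarded by `__main__`, the prompt-formatting helper can be reused or tested without downloading the weights.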
