Llama-3-Instruct-8B-SPPO-Iter3
| Property | Value |
|---|---|
| Parameter Count | 8.03B |
| License | Apache-2.0 |
| Base Model | meta-llama/Meta-Llama-3-8B-Instruct |
| Research Paper | Self-Play Preference Optimization |
What is Llama-3-Instruct-8B-SPPO-Iter3?
Llama-3-Instruct-8B-SPPO-Iter3 is a language model developed by UCLA-AGI using Self-Play Preference Optimization (SPPO). It is the third iteration of SPPO fine-tuning applied to the Meta-Llama-3-8B-Instruct base model, trained using the UltraFeedback dataset to strengthen instruction-following capabilities.
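As context for how SPPO works, the following is a rough sketch of the per-iteration objective as described in the SPPO paper, not an excerpt from the training code used for this checkpoint. At iteration t, a new policy is fit by regressing the log-probability ratio on samples drawn from the current policy π_t, where P̂(y ≻ π_t | x) is the estimated probability that response y is preferred over the current policy's responses and η is a scaling hyperparameter:

$$
\pi_{t+1} \approx \arg\min_{\theta}\; \mathbb{E}_{x \sim \mathcal{X},\, y \sim \pi_t(\cdot \mid x)}
\left[\left(\log \frac{\pi_{\theta}(y \mid x)}{\pi_t(y \mid x)} \;-\; \eta\left(\hat{P}(y \succ \pi_t \mid x) - \tfrac{1}{2}\right)\right)^{2}\right]
$$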
Implementation Details
Training used a learning rate of 5e-07, the RMSProp optimizer, and a linear learning-rate schedule, and was run across 8 devices with DeepSpeed ZeRO-3 optimization. A minimal loading sketch in Python follows the list below.
- Trained on synthetic datasets derived from openbmb/UltraFeedback
- Implements three-iteration SPPO methodology
- Uses BF16 tensor type for efficient computation
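Because the checkpoint follows the standard Llama-3-Instruct format, it should load with the Hugging Face transformers library. The sketch below assumes the repository id UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3 (inferred from the model name above) and loads the weights in BF16 to match the tensor type noted in the list; adjust the id or dtype as needed.

```python
# Minimal usage sketch (assumed repo id; verify against the hosted checkpoint).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3"  # assumed repository path

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # BF16, matching the training tensor type
    device_map="auto",
)

# Llama-3-Instruct checkpoints ship a chat template; use it for prompting.
messages = [{"role": "user", "content": "Explain SPPO training in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```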
Core Capabilities
- Achieves 68.28% accuracy on IFEval (0-Shot)
- Shows 29.74% normalized accuracy on BBH (3-Shot)
- Demonstrates consistent improvement over previous iterations, with a 39.85% win rate on AlpacaEval
- Performs well on additional benchmarks, including arc_challenge (65.19%) and hellaswag (80.86%)
Frequently Asked Questions
Q: What makes this model unique?
What sets this model apart is its iterative SPPO training: preference optimization is applied over three successive iterations, yielding progressive gains in instruction following and general language understanding.
Q: What are the recommended use cases?
This model is well-suited for instruction-following tasks, general text generation, and English-language applications that require strong language understanding. Its IFEval result in particular suggests it is a good fit for scenarios where precise adherence to instructions matters.