Ichigo-llama3.1-s-instruct-v0.4-GGUF
| Property | Value |
|---|---|
| Parameter Count | 8.03B |
| License | Apache 2.0 |
| Architecture | Llama-3 |
| Paper | AudioBench Paper |
| Language | English |
What is Ichigo-llama3.1-s-instruct-v0.4-GGUF?
This is a GGUF-quantized version of the Ichigo-llama3.1 model, designed to understand both audio and text inputs. It was trained on over 1 billion tokens from the Instruction Speech WhisperVQ v4 dataset, and shows improved robustness to environmental noise along with stronger multi-turn conversation handling.
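Because GGUF files target llama.cpp-compatible runtimes, a quantization of this model can typically be loaded with llama-cpp-python. The sketch below is a minimal, text-only example; the GGUF file name is an assumption, so substitute whichever quantization you actually downloaded from the repository.

```python
# Minimal sketch: loading a GGUF quantization of this model with llama-cpp-python.
# The file name "ichigo-llama3.1-s-instruct-v0.4-Q4_K_M.gguf" is an assumption.
from llama_cpp import Llama

llm = Llama(
    model_path="ichigo-llama3.1-s-instruct-v0.4-Q4_K_M.gguf",
    n_ctx=4096,        # matches the model's maximum sequence length
    n_gpu_layers=-1,   # offload all layers to GPU if one is available
)

output = llm(
    "Summarize what WhisperVQ audio tokens are in one sentence.",
    max_tokens=128,
)
print(output["choices"][0]["text"])
```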
Implementation Details
The model builds on the Llama-3 architecture and adds audio processing through WhisperVQ integration. It achieved an MMLU score of 64.66, indicating that general language understanding is largely retained alongside the added audio capabilities. A rough PyTorch sketch of the optimizer and learning-rate schedule appears after the list below.
- Trained using FSDP2 implementation on 8x NVIDIA H100-SXM-80GB GPUs
- Implements cosine learning rate scheduling with warmup
- Uses the Adam optimizer with PyTorch's fused implementation
- Maximum sequence length of 4096 tokens
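As an illustration of those settings, here is a minimal PyTorch sketch of fused AdamW with linear warmup followed by cosine decay. The learning rate, warmup length, and step counts are placeholders, not the values used in training.

```python
import math
import torch

# Stand-in module; the actual run fine-tunes the Llama-3 backbone under FSDP2
# on 8x NVIDIA H100 GPUs with a 4096-token maximum sequence length.
model = torch.nn.Linear(4096, 4096)

# "Adam optimizer with torch fusion": PyTorch's fused AdamW kernel (CUDA only).
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-5,                          # placeholder learning rate
    fused=torch.cuda.is_available(),  # fall back to the default kernel on CPU
)

warmup_steps, total_steps = 100, 10_000  # placeholder schedule lengths

def lr_lambda(step: int) -> float:
    """Linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```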
Core Capabilities
- Dual modality processing (audio and text)
- Noise-resistant audio understanding
- Multi-turn conversation handling (see the chat sketch after this list)
- High performance on AudioBench evaluations (3.5/5 on OpenHermes)
- Competitive MMLU scores against base models
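As a sketch of the multi-turn capability listed above, the following continues the llama-cpp-python example with a two-turn text chat. Feeding real audio would additionally require encoding speech into WhisperVQ sound tokens upstream, which is not covered here; the GGUF file name remains an assumption.

```python
from llama_cpp import Llama

# Same assumed GGUF file name as in the loading example above.
llm = Llama(model_path="ichigo-llama3.1-s-instruct-v0.4-Q4_K_M.gguf", n_ctx=4096)

messages = [{"role": "user", "content": "Give me one sentence about what you can do."}]
first = llm.create_chat_completion(messages=messages, max_tokens=128)
messages.append(first["choices"][0]["message"])

# The second turn reuses the accumulated history, exercising multi-turn handling.
messages.append({"role": "user", "content": "Now phrase that same answer more formally."})
second = llm.create_chat_completion(messages=messages, max_tokens=128)
print(second["choices"][0]["message"]["content"])
```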
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its ability to process both audio and text inputs with high accuracy, while maintaining robustness against environmental noise. It's particularly notable for achieving near-parity with specialized audio models while retaining strong general language understanding capabilities.
Q: What are the recommended use cases?
The model is primarily intended for research applications, particularly in scenarios requiring both audio and text understanding. It's well-suited for multi-turn conversations involving audio inputs, speech understanding tasks, and general language processing applications.