HarmAug-Guard

hbseong

A 435M-parameter safety classification model based on DeBERTa-v3-large, trained with HarmAug data augmentation to guard LLMs against jailbreak attacks.

Property          Value
Parameter Count   435M
Base Model        DeBERTa-v3-large
License           MIT
Paper             arXiv:2410.01524

What is HarmAug-Guard?

HarmAug-Guard is a safety classification model that detects LLM jailbreak attempts and assesses the safety of conversations with large language models. It is built on Microsoft's DeBERTa-v3-large architecture and trained with knowledge distillation combined with HarmAug data augmentation.

Implementation Details

The model uses the DeBERTa-v3-large architecture with 435M parameters and is trained with knowledge distillation paired with data augmentation. It accepts either a single prompt or a prompt-response pair and outputs a probability that the content is unsafe.

  • Weights stored as F32 tensors
  • Transformer encoder architecture used for text classification
  • Uses the HarmAug Generated Dataset
  • Dual-mode evaluation for both prompts and complete conversations
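
The dual-mode scoring described above can be sketched roughly as follows. This is a minimal illustration, not the author's reference code. It assumes the checkpoint is available through the Hugging Face transformers library under the repository ID hbseong/HarmAug-Guard and that class index 1 of the classifier head means "unsafe". Neither detail is stated on this page, so verify both before relying on the scores.

```python
import math
from typing import Optional, Sequence

# Assumed Hugging Face repository ID; not stated on this page.
MODEL_ID = "hbseong/HarmAug-Guard"
# Assumed index of the "unsafe" class in the classifier head.
UNSAFE_INDEX = 1


def unsafe_probability(logits: Sequence[float], unsafe_index: int = UNSAFE_INDEX) -> float:
    """Softmax over the class logits, returning the unsafe-class probability."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[unsafe_index] / sum(exps)


def load_guard(model_id: str = MODEL_ID):
    """Load tokenizer and classifier (requires the transformers and torch packages)."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


def score(tokenizer, model, prompt: str, response: Optional[str] = None) -> float:
    """Score a prompt alone, or a prompt-response pair when a response is given."""
    import torch

    if response is None:
        inputs = tokenizer(prompt, return_tensors="pt")
    else:
        # Passing two texts encodes them together as one sequence pair.
        inputs = tokenizer(prompt, response, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return unsafe_probability(logits)
```

Typical use would be to call load_guard() once at startup, then call score(tokenizer, model, prompt) for a prompt-only check or score(tokenizer, model, prompt, response) for a full exchange.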

Core Capabilities

  • Real-time safety assessment of LLM interactions
  • Jailbreak attack detection and prevention
  • Precise unsafe content probability scoring
  • Support for both single-prompt and conversation-based analysis

Frequently Asked Questions

Q: What makes this model unique?

HarmAug-Guard combines knowledge distillation with data augmentation designed specifically for safety classification. It can evaluate both individual prompts and complete conversations, returning an unsafe-content probability for each.
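
This page does not give the training objective. Purely as a generic illustration of knowledge distillation for a binary safety classifier, and not necessarily the loss used for HarmAug-Guard, the sketch below mixes a hard-label term with a soft-label term taken from a teacher model's unsafe probability. The function names and the alpha weighting are hypothetical.

```python
import math


def _clip(p: float, eps: float = 1e-7) -> float:
    """Keep probabilities away from 0 and 1 so the logarithms stay finite."""
    return min(max(p, eps), 1.0 - eps)


def bce(target: float, p_student: float) -> float:
    """Binary cross-entropy between a (possibly soft) target and a prediction."""
    p = _clip(p_student)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def distillation_loss(p_student: float, p_teacher: float, label: int,
                      alpha: float = 0.5) -> float:
    """Weighted sum of a hard-label term and a teacher soft-label term.

    alpha (hypothetical) sets how much weight the teacher's probability gets.
    """
    hard = bce(float(label), p_student)  # fit the ground-truth label
    soft = bce(p_teacher, p_student)     # match the teacher's soft label
    return (1.0 - alpha) * hard + alpha * soft
```

In a data-augmentation setting, the teacher's probability can serve as the training signal for generated examples that have no human label, which is one common reason to pair the two techniques.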

Q: What are the recommended use cases?

Recommended uses include implementing safety guards in LLM applications, monitoring conversation safety in real time, detecting potential jailbreak attempts, and moderating content on AI-powered platforms.
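
One way to place such a classifier in front of an LLM application is sketched below. It is a hypothetical wrapper, not part of HarmAug-Guard: the scorer and generate callables, the GuardDecision type, and the 0.5 threshold are all assumptions chosen for illustration. The scorer is expected to return an unsafe probability for a prompt, or for a prompt-response pair when a response is supplied.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# A scorer maps (prompt, optional response) to an unsafe probability in [0, 1].
Scorer = Callable[[str, Optional[str]], float]


@dataclass
class GuardDecision:
    allowed: bool
    stage: str          # "prompt" or "response": where the decision was made
    unsafe_prob: float


def guarded_reply(prompt: str,
                  generate: Callable[[str], str],
                  scorer: Scorer,
                  threshold: float = 0.5) -> Tuple[Optional[str], GuardDecision]:
    """Check the prompt, generate a reply, then check the prompt-response pair."""
    prompt_prob = scorer(prompt, None)
    if prompt_prob >= threshold:
        # Block before calling the LLM at all.
        return None, GuardDecision(False, "prompt", prompt_prob)

    reply = generate(prompt)
    pair_prob = scorer(prompt, reply)
    if pair_prob >= threshold:
        # The prompt looked fine, but the generated reply was flagged.
        return None, GuardDecision(False, "response", pair_prob)
    return reply, GuardDecision(True, "response", pair_prob)
```

The threshold trades false positives against missed detections, so it is best tuned on examples from the target application rather than left at an arbitrary default.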
