Fine-tuning open-source models: is it time to move off Frontier Lab models?

Prompt Security

Prompt security is the practice of defending LLM applications against adversarial attacks that exploit the natural-language interface—including prompt injection, jailbreaking, and data leakage—ensuring AI systems behave safely and reliably in production.

What is Prompt Security?

Prompt security is the discipline of protecting large language model (LLM) applications from adversarial attacks that exploit the natural-language interface between users and the model. Because LLMs process instructions and user-supplied data through the same input channel, a malicious actor who controls part of that input can influence—or fully override—the model's intended behavior. Prompt security encompasses the full defensive stack: input validation, output filtering, access controls, version management, and continuous monitoring of every prompt-response interaction.

Key Prompt Security Threats

The OWASP Top 10 for LLM Applications places prompt-related vulnerabilities at the top of the list. The most common threats include:

Prompt injection: Malicious instructions embedded in user input that override the system prompt and redirect the model's behavior. Direct injection happens through the user interface; indirect injection hides malicious instructions in documents or web content the model retrieves. See prompt injection for a full breakdown.
Jailbreaking: Carefully crafted prompts designed to bypass a model's built-in safety guardrails, causing it to produce harmful, policy-violating, or confidential outputs. See jailbreaking.
Prompt leakage: Attacks that coerce the model into revealing its system prompt, which may contain proprietary logic, API keys, or confidential business instructions.
Data exfiltration: Using injected instructions to extract sensitive context—user data, internal knowledge-base content, or training examples—from within the model's context window.
Prompt drift: The gradual, unintentional shift in prompt behavior as models are updated or fine-tuned, opening unexpected security gaps without any deliberate attack.

Prompt Security Best Practices

Effective prompt security requires layering multiple defenses rather than relying on any single control:

Separate instructions from data: Use delimiters and structured formats to isolate the system prompt from user-supplied input, preventing instructions from bleeding across privilege boundaries.
Input validation and guardrails: Screen every user message for known injection patterns before it reaches the model. Post-generation output filters enforce content policies and prevent sensitive data from leaking back to the user. See guardrails for implementation patterns.
Prompt version control and auditing: Track every change to production prompts in a versioned registry. Unauthorized or accidental changes to system prompts are among the most common sources of security regressions. A prompt versioning system ensures all changes go through review with a full audit trail for incident response.
Observability and anomaly detection: Log all prompt-response pairs in production and monitor for unusual patterns—unexpected instruction sequences, abnormally long inputs, or outputs containing PII. Prompt observability platforms provide the visibility needed to detect and respond to prompt attacks in real time.
Least-privilege context: Limit the tools, APIs, and data the model can access to only what is needed for its specific task, reducing the blast radius of a successful injection attack.
Continuous red teaming: Proactively probe prompts with adversarial inputs across every deployment. Treat prompt security testing as a recurring engineering practice, not a one-time pre-launch check.

Prompt Security

What is Prompt Security?

Key Prompt Security Threats

Prompt Security Best Practices

Related Terms