Dataset Management
Curate your data

Build, maintain, and manage all of your datasets in one place. Create datasets from scratch or import them from production data.

Request a demo · Start for free 🍰

Dataset Management that Works

Version Control

Version your datasets and maintain a history of changes as you iterate.

Custom Dataset Creation

Create datasets from production data using custom filters like score, prompt versions, and more.

Request Integration

Add individual requests from production data to your datasets.

Collaborative Editing

Edit datasets together with your team members in a collaborative environment.

Synthetic Datasets (Coming Soon)

Generate new datasets and augment existing ones with synthetic data.

Human Grading (Coming Soon)

Invite human graders to grade your datasets for quality assurance.

Capture the value of your data

Stay organized and help your team move faster by managing all of your datasets in one place. Create datasets from scratch or import them from production data.


Bootstrap your Dataset

Leverage your production data to improve your application.

Locate the Edge Cases

Use datasets to keep track of edge cases and possible regression points.

Iteratively Build Datasets

Add production requests to a dataset one at a time. Edit ground truth values and version your changes as you go.

Collaborate with your Team

Share datasets with your team and work together to improve them.

Frequently asked questions

If you still have questions, feel free to contact us at sales@promptlayer.com

How to build datasets from production LLM traces?
Teams build datasets from production LLM traces by using an observability platform to log every input/output pair during live usage. They can then filter this logged data with metadata and search functions (e.g., filtering for high-cost runs or runs that received a specific user feedback tag) and export the resulting subset directly into a versioned evaluation dataset for testing, debugging, and continuous improvement.
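As a rough illustration (not PromptLayer's actual API), the workflow can be sketched in a few lines of Python. Here `fetch_logged_runs` and `save_dataset` are hypothetical stand-ins for whatever your observability platform exposes, and the logged runs are placeholder data so the snippet is self-contained:

```python
# Hypothetical sketch: filter logged production runs by feedback and score,
# then write the subset out as a versioned evaluation dataset (JSONL).
import json

def fetch_logged_runs():
    """Stand-in for an observability API that returns logged input/output pairs."""
    return [
        {"input": "Cancel my order", "output": "Done.", "score": 0.4,
         "metadata": {"user_feedback": "thumbs_down", "prompt_version": 7}},
        {"input": "Track my package", "output": "It arrives Tuesday.", "score": 0.9,
         "metadata": {"user_feedback": "thumbs_up", "prompt_version": 7}},
    ]

def save_dataset(name: str, version: int, examples: list) -> None:
    """Persist the filtered subset as a versioned JSONL dataset."""
    with open(f"{name}-v{version}.jsonl", "w") as f:
        for example in examples:
            f.write(json.dumps(example) + "\n")

# Keep runs that users flagged or that an automated judge scored poorly.
subset = [
    run for run in fetch_logged_runs()
    if run["metadata"].get("user_feedback") == "thumbs_down" or run["score"] < 0.5
]
save_dataset("regression-edge-cases", version=3, examples=subset)
```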
How do teams label LLM outputs at scale?
Teams label LLM outputs at scale by pairing automated judges with human-in-the-loop (HITL) review. In practice, this means running prompts against large datasets, using LLM-based judges for initial labeling, and applying HITL workflows to validate or correct the results. This approach balances speed with accuracy and produces reliable labels for evaluation and improvement.
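A judge-plus-HITL loop might look something like the sketch below; `call_llm_judge` is a hypothetical placeholder that uses a hard-coded heuristic so the snippet runs without any API keys:

```python
# Hypothetical sketch: label outputs with an LLM judge, then route
# low-confidence judgments to a human review queue.

def call_llm_judge(example: dict) -> tuple[str, float]:
    """Placeholder for an LLM-as-judge call; returns (label, confidence)."""
    looks_grounded = "refund" in example["output"].lower()
    return ("pass" if looks_grounded else "fail", 0.9 if looks_grounded else 0.6)

dataset = [
    {"input": "Can I get a refund?", "output": "Yes, refunds take 5-7 business days."},
    {"input": "Can I get a refund?", "output": "Please contact the manufacturer."},
]

auto_labeled, review_queue = [], []
for example in dataset:
    label, confidence = call_llm_judge(example)
    record = {**example, "label": label, "confidence": confidence}
    # Accept high-confidence labels; send the rest to humans for validation.
    (auto_labeled if confidence >= 0.8 else review_queue).append(record)

print(f"{len(auto_labeled)} auto-labeled, {len(review_queue)} queued for human review")
```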
How often should LLM eval datasets be updated?
Eval datasets lose reliability as prompts evolve and user behavior shifts, leading to subtle prompt decay. Teams address this by regularly refreshing datasets with recent production LLM traces to capture new edge cases and shifting user behavior. The frequency depends on the pace of application updates and user traffic, but a recurring evaluation template should be run every time a new prompt version is considered.
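One way to enforce that cadence is to gate every candidate prompt version on an eval run against a freshly refreshed dataset. In this hypothetical sketch, `refresh_dataset_from_traces` and `run_eval_template` stand in for your own tooling:

```python
# Hypothetical sketch: before promoting a new prompt version, refresh the
# eval dataset with recent production traces and re-run the eval template.

def refresh_dataset_from_traces(dataset: list, recent_traces: list) -> list:
    """Append recent production examples that are not already in the dataset."""
    known_inputs = {example["input"] for example in dataset}
    return dataset + [t for t in recent_traces if t["input"] not in known_inputs]

def run_eval_template(dataset: list, prompt_version: int) -> float:
    """Placeholder eval run; returns a pass rate for the candidate version."""
    return 0.92  # stand-in score

dataset = [{"input": "Cancel my order", "expected": "confirm cancellation"}]
recent_traces = [{"input": "Cancel my subscription", "expected": "confirm cancellation"}]

dataset = refresh_dataset_from_traces(dataset, recent_traces)
pass_rate = run_eval_template(dataset, prompt_version=8)

# Only ship the new prompt version if it clears the bar on the refreshed dataset.
if pass_rate >= 0.9:
    print(f"Prompt v8 passes ({pass_rate:.0%}) on {len(dataset)} examples")
```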
Our customer data is highly regulated. What compliance and security certifications does PromptLayer have?
We take data security very seriously and serve customers across all verticals. PromptLayer is SOC 2 Type 2 certified and compliant with GDPR, HIPAA, and CCPA. Visit our Trust Center for more information.
What are the risks of synthetic data in LLM evaluation?
The main risk of synthetic data in LLM evaluation is false confidence. Synthetic examples often fail to capture real user behavior and production edge cases, which can cause teams to over-optimize for hypothetical scenarios, resulting in a model that performs well on tests but poorly in a live environment.