Benchmarking Language Model Creativity: A Case Study on Code Generation

Published: Jul 12, 2024

Updated: Jul 12, 2024

Can AI Be Creative? Putting LLMs to the Creativity Test

Benchmarking Language Model Creativity: A Case Study on Code Generation

Yining Lu|Dixuan Wang|Tianjian Li|Dongwei Jiang|Daniel Khashabi

https://arxiv.org/abs/2407.09007v1

Summary

We often associate creativity with humans: the spark of ingenious problem-solving that leads to innovative solutions. But as artificial intelligence rapidly evolves, a question naturally arises: can AI be truly creative? Researchers exploring the creative potential of Large Language Models (LLMs) are tackling this question head-on. This study introduces a new way to benchmark LLM creativity, focusing specifically on code generation.

The researchers use a technique called “Denial Prompting” to push the boundaries of LLM creativity in code generation. The process starts by giving an LLM a coding problem. Once it produces a solution, the researchers “deny” it the techniques it used, forcing it to devise an alternative. Each subsequent round adds another constraint. As the constraints pile up, the LLM has to dig deeper into its “creative” toolbox and come up with unusual approaches.

To measure the creativity of these solutions, the researchers introduce the “NeoGauge” score. It assesses two aspects of a solution: whether it actually works, and whether it is novel compared to what humans have historically come up with. This dual approach helps distinguish true creativity from simply generating random, nonsensical code.

The study uses a dataset of coding problems called NeoCoder and tests a range of LLMs, including GPT-4. The results? While some LLMs can generate functionally correct and novel code, their creativity is still well behind that of humans. Even GPT-4, considered one of the most advanced LLMs, struggles to match the creative problem-solving exhibited by human coders. The study also explored whether “reasoning strategies” could boost LLM creativity. Techniques like Monte Carlo Tree Search were tested, but the gains were minimal, suggesting that simply enhancing reasoning isn’t enough to unlock true creative potential.

The study, while confined to code generation, opens avenues for broader AI research. As LLMs become more sophisticated, better methods for evaluating their creative capacity, like Denial Prompting and the NeoGauge score, will be crucial in shaping truly creative AI.

Questions & Answers

What is Denial Prompting and how does it work in testing AI creativity?

Denial Prompting is a technical methodology that iteratively constrains an LLM's solution space to force creative problem-solving. The process works through these steps: 1) Present the LLM with a coding problem, 2) Once solved, prohibit the use of the initial solution's techniques, 3) Add progressive constraints that eliminate previously used approaches, forcing the LLM to explore alternative solutions. For example, if an LLM initially uses a for-loop to solve a sorting problem, the next prompt might deny the use of loops, compelling the LLM to devise a recursive or functional programming solution instead. This systematic constraint application helps measure an LLM's ability to generate novel, working solutions under increasingly challenging conditions.
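The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function `denial_prompting`, the toy stand-ins `toy_solve` and `toy_detect`, and the choice to deny exactly one newly used technique per round are all assumptions made for the example. In practice `solve` would call an LLM and `detect_techniques` would be a real technique detector.

```python
from typing import Callable, FrozenSet, List, Set, Tuple


def denial_prompting(
    problem: str,
    solve: Callable[[str, Set[str]], str],
    detect_techniques: Callable[[str], Set[str]],
    max_rounds: int = 3,
) -> List[Tuple[FrozenSet[str], str]]:
    """Return (constraints in force, solution) for each round."""
    banned: Set[str] = set()
    history: List[Tuple[FrozenSet[str], str]] = []
    for _ in range(max_rounds):
        solution = solve(problem, banned)
        if not solution:
            break  # the solver ran out of ideas under these constraints
        history.append((frozenset(banned), solution))
        # Assumption: deny one technique the latest solution relied on.
        new = sorted(detect_techniques(solution) - banned)
        if not new:
            break  # nothing new to deny
        banned.add(new[0])
    return history


# Hypothetical stand-in for an LLM: returns the first canned answer
# whose technique has not been denied yet.
def toy_solve(problem: str, banned: Set[str]) -> str:
    candidates = [
        ("for loop", "total = 0\nfor x in xs: total += x"),
        ("recursion", "def s(xs): return xs[0] + s(xs[1:]) if xs else 0"),
        ("builtin sum", "total = sum(xs)"),
    ]
    for technique, code in candidates:
        if technique not in banned:
            return code
    return ""


# Hypothetical keyword-based detector; a real one would be far more careful.
def toy_detect(solution: str) -> Set[str]:
    found: Set[str] = set()
    if "for " in solution:
        found.add("for loop")
    if "def s(" in solution:
        found.add("recursion")
    if "sum(" in solution:
        found.add("builtin sum")
    return found


history = denial_prompting("Sum a list of integers.", toy_solve, toy_detect, max_rounds=5)
for constraints, solution in history:
    print(sorted(constraints), "->", solution.splitlines()[0])
```

Passing the solver and detector in as callables keeps the loop itself independent of any particular model or API.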

How is artificial intelligence changing creative problem-solving in modern applications?

Artificial intelligence is transforming creative problem-solving by offering new approaches to traditional challenges. AI systems can analyze vast amounts of data and generate multiple solution pathways that humans might not consider. In practical applications, AI assists in fields like product design, marketing campaign creation, and software development by suggesting alternative approaches and optimizing existing solutions. For businesses, this means faster innovation cycles, reduced development costs, and the ability to explore a broader range of possibilities. However, as the study summarized above suggests, AI's creative capabilities still complement rather than replace human creativity, working best in partnership with human expertise.

What are the real-world benefits of measuring AI creativity in software development?

Measuring AI creativity in software development helps organizations better understand AI's capabilities and limitations in practical applications. By using metrics like the NeoGauge score, companies can make informed decisions about where to deploy AI tools most effectively in their development process. This knowledge helps optimize resource allocation, improve code quality, and accelerate development cycles. For example, AI could be used for generating routine code segments while leaving more complex, creative problem-solving tasks to human developers. This approach leads to more efficient development workflows and better utilization of both human and AI resources.

PromptLayer Features

Testing & Evaluation
The paper's Denial Prompting technique and NeoGauge scoring system align directly with PromptLayer's testing capabilities

Implementation Details

Configure batch tests using denial prompting patterns, implement NeoGauge-style scoring metrics, set up automated regression testing pipelines
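As a rough illustration of what a "NeoGauge-style" metric could look like, the sketch below scores a batch of solutions on the two aspects the summary names: whether a solution works, and whether it uses something humans have not. This is a simplified proxy and not the paper's NeoGauge definition. The `Solution` class, the `creativity_proxy` function, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class Solution:
    passes_tests: bool          # did the generated code pass its test cases?
    techniques: FrozenSet[str]  # techniques detected in the generated code


def creativity_proxy(solutions: Iterable[Solution], human_techniques: Iterable[str]) -> float:
    """Fraction of solutions that work AND use a technique absent from human solutions."""
    batch = list(solutions)
    if not batch:
        return 0.0
    human = set(human_techniques)
    hits = sum(1 for s in batch if s.passes_tests and (s.techniques - human))
    return hits / len(batch)


# Made-up sample data, for illustration only.
samples = [
    Solution(True, frozenset({"for loop"})),                  # works, not novel
    Solution(True, frozenset({"bit manipulation"})),          # works and novel
    Solution(False, frozenset({"bit manipulation"})),         # novel but broken
    Solution(True, frozenset({"recursion", "memoization"})),  # works, partly novel
]
score = creativity_proxy(samples, human_techniques={"for loop", "recursion"})
print(score)
```

Requiring both conditions at once is what keeps broken-but-unusual code from counting as creative.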

Key Benefits

• Systematic evaluation of LLM creative capabilities
• Reproducible creativity testing framework
• Quantifiable creativity metrics across model versions

Potential Improvements

• Add specialized creativity scoring templates
• Implement constraint-based testing automation
• Develop creativity-specific evaluation dashboards

Business Value

Efficiency Gains

Automates creative capability testing across multiple LLM versions

Cost Savings

Reduces manual evaluation time for creative outputs by 60-70%

Quality Improvement

Ensures consistent creativity benchmarking across projects

Workflow Management
The sequential denial prompting process maps to PromptLayer's multi-step orchestration capabilities

Implementation Details

Create reusable templates for constraint-based prompting, implement workflow steps for progressive constraint addition, track solution versions
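A reusable template with versioned constraints might be sketched as follows, using only the Python standard library. Nothing here is PromptLayer's API: `PROMPT_TEMPLATE`, `ConstraintWorkflow`, and its methods are hypothetical names chosen for the example, and the prompt wording is an assumption.

```python
from dataclasses import dataclass, field
from string import Template
from typing import List, Tuple

# Hypothetical prompt wording; adjust to the task at hand.
PROMPT_TEMPLATE = Template(
    "Solve the following problem in Python.\n"
    "Problem: $problem\n"
    "Do not use any of these techniques: $constraints\n"
)


@dataclass
class ConstraintWorkflow:
    problem: str
    constraints: List[str] = field(default_factory=list)
    # Each entry: (version number, constraints in force, solution text)
    versions: List[Tuple[int, Tuple[str, ...], str]] = field(default_factory=list)

    def render_prompt(self) -> str:
        """Fill the template with the constraints accumulated so far."""
        listed = ", ".join(self.constraints) if self.constraints else "(none)"
        return PROMPT_TEMPLATE.substitute(problem=self.problem, constraints=listed)

    def record(self, solution: str, new_constraint: str) -> None:
        """Store the solution with the constraints it was produced under,
        then add the next constraint for the following step."""
        version = len(self.versions) + 1
        self.versions.append((version, tuple(self.constraints), solution))
        self.constraints.append(new_constraint)


workflow = ConstraintWorkflow("Sum a list of integers.")
first_prompt = workflow.render_prompt()
workflow.record("total = sum(xs)", new_constraint="builtin sum")
second_prompt = workflow.render_prompt()
print(second_prompt)
```

Storing the constraints alongside each solution makes it possible to reproduce any step of an experiment later.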

Key Benefits

• Structured creative problem-solving pipelines
• Versioned constraint management
• Reproducible creativity experiments

Potential Improvements

• Add constraint visualization tools
• Implement solution diversity tracking
• Create automated constraint generation

Business Value

Efficiency Gains

Streamlines creative testing workflows by 40%

Cost Savings

Reduces setup time for creativity experiments by 50%

Quality Improvement

Enables systematic creativity assessment across teams
