CSV Import and Export for Tests and Criteria
We've introduced CSV import and export functionality to enhance visibility and streamline the editing process of tests and criteria.
Features:
- CSV Import/Export: Accessible via the Import / Export dropdown in the tests sidebar.
- Export Templates: Press Export Tests or Export Criteria to download a template, optionally including existing examples.
- Edit Locally: Modify the CSV file locally using Excel or Google Sheets.
- Re-import: Re-upload the modified file by pressing Import Tests or Import Criteria.
- Data Synchronization:
- Update Existing Entries: Existing tests or criteria with their ID in the CSV will be updated upon import.
- Create New Entries: Any tests or criteria without an ID will be created.
- Bulk Image Uploading:
- Option 1: Use public image URLs in the CSV fields. These links act as pointers and will work during benchmarking as long as the image is accessible.
- Option 2: Zip the import CSV along with images, and reference image filenames in the CSV. These images will be uploaded to ModelBench and stored as image inputs for the associated tests (see the sketch below).
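For example, a bulk-import bundle for Option 2 could be assembled like this. This is a minimal sketch: the column names are illustrative, so take the headers from the exported template rather than from this snippet, and the referenced image file is assumed to exist locally.

```python
import csv
import zipfile

# Illustrative rows only -- real column headers come from the exported template.
rows = [
    # A row that carries an existing ID updates that test on import...
    {"id": "test_123", "name": "Refund policy", "input": "Can I get a refund after 30 days?", "image": ""},
    # ...while a row with no ID creates a brand-new test.
    {"id": "", "name": "Chart reading", "input": "What trend does this chart show?", "image": "chart.png"},
]

with open("tests.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "input", "image"])
    writer.writeheader()
    writer.writerows(rows)

# Option 2: zip the CSV together with the images it references, keeping the
# same filenames in the archive as in the CSV's image column.
with zipfile.ZipFile("tests_import.zip", "w") as zf:
    zf.write("tests.csv")
    zf.write("chart.png")  # assumed to exist in the working directory
```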
This functionality restores the efficiency of bulk editing and enhances the management of tests and criteria.
Bug Fixes and Legacy Account Migrations
We've addressed several issues to improve stability and user experience.
Fixes:
- Prompt and Chat Duplication: Resolved errors when duplicating prompts, chats, or sending chats from the playground to a new workbench prompt.
- Prompt UI Lockup: Fixed a bug where the prompt UI would become locked when moving between versions until a refresh.
- Prompts Without Criteria: Corrected a regression that prevented prompts from being saved without any criteria. Now, the benchmarks tab is simply disabled until at least one valid test is present.
These fixes enhance the overall functionality and reliability of the platform.
Benchmark Screen Enhancements
We've made visual improvements to the benchmark screen for better clarity and usability.
Enhancements:
- 1–10 Score Visualization: Scores now appear numerically and are rendered on a red-to-green scale for immediate visual feedback.
- Descriptive Test Names: Test names now appear on the left side instead of numbers, making it easier to identify each set of rounds.
These updates enhance the readability and interpretation of benchmark results.
Editable AI Judgments
You now have the ability to edit AI judgments, just as you can with human judgments.
Updates:
- Flexibility: Previously, judgments made by the AI judge were final and could not be changed.
- User Control: Modify AI judgments to correct any discrepancies or to align with your criteria.
This enhancement provides greater control over evaluation outcomes.
Introducing Human Judgment in Benchmarks
Responding to popular demand, we've added the ability for you and your teammates to manually judge benchmark runs.
Features:
- Human Judgment Default: The default option is now human judgment, allowing you to decide if a test requires expert or domain-specific knowledge.
- AI Judge Toggle: Choose to turn on the AI judge for any specific criteria if desired.
- Full Prompt for AI Judge: You now have the option to provide the full prompt to the AI judge, whereas previously it only saw the response. Be aware this may increase token usage, and an appropriate warning has been added.
This feature offers greater flexibility and control over the evaluation process.
Introducing 1–10 Score Format for Criteria
Enhance your testing flexibility with the new 1–10 score format for criteria.
Features:
- Scoring Options: In addition to pass/fail, you can now choose a 1–10 scoring option for any of your criteria.
- Subjective Judging: Ideal for creative or more subjective evaluation tasks where a nuanced score is more appropriate.
This feature allows for more granular assessment of your prompts and models.
Prompt Tests V2 Now in General Access
We're excited to launch the overhauled Prompt Tests V2. After extensive private beta testing and numerous iterations, the new version is live.
Improvements:
- Self-Contained Tests: Tests move away from the Excel-like structure; each one is now self-contained with its own set of inputs.
- Reusable Criteria: Define criteria that can be reused across tests, replacing the previous outcomes.
- Enhanced Benchmark Screen: Addresses the lack of names against examples and improves the ability to share criteria across different tests.
Migration:
- Automatic Upgrade: Users with tests in the V1 format will find their tests and any executed benchmarks automatically migrated to the new V2 format.
This update streamlines the testing process and enhances usability.
Run Workbench Prompts Directly
You can now run prompts in the workbench without needing to benchmark them first.
Features:
- New Run Pane: Add example inputs in the new Run pane beside your prompt.
- Flexible Execution: Press Run on any model with any configuration to execute your prompt.
- Post-Run Options: After running, easily press Add to Prompt to include the response in your prompt or Show Log to view a trace of the run.
This enhancement simplifies the process of testing and refining your prompts.
Image Inputs Now in General Access
After extensive testing, image inputs are now available to all users. You can add image inputs to user messages within your workbench prompts.
Details:
- Enhanced Prompts: Incorporate images directly into user messages to enrich your prompts (see the example below).
- Testing and Benchmarking: Tests and benchmarks now support image inputs. Ensure you run benchmarks with models that support images to avoid any failures.
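For context, most image-capable models routed through OpenRouter accept the OpenAI-compatible message format, where an image is a content part alongside the text. The sketch below is purely illustrative of that shape (the URL is a placeholder); ModelBench constructs the message for you when you attach an image in the workbench.

```python
# Conceptual shape of a user message carrying an image (OpenAI-compatible format).
# ModelBench assembles this when you attach an image input in the workbench.
user_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe the defect shown in this photo."},
        {"type": "image_url", "image_url": {"url": "https://example.com/defect.jpg"}},  # placeholder URL
    ],
}
```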
This feature expands the possibilities for your prompts and models.
Enhanced LLM Logging Features
We've upgraded the logging system for benchmarks and playground runs to provide more detailed insights.
New features:
- Comprehensive Tracking: Now tracking OpenRouter cost, inference speed, and tokens in/out.
- Improved Viewing: Requests and responses are separated into distinct tabs with syntax highlighting for easier analysis.
- Traces Private Beta Update: For users in the Traces Private Beta, this enhancement replaces the current trace event view.
These improvements offer deeper visibility into your runs, aiding in optimization and troubleshooting.
Multiple Bug Fixes and UI Enhancements
We've addressed several bugs and made numerous UI improvements to enhance your experience.
Fixes and improvements:
- Benchmark Error Handling: Benchmark runs that time out for any reason will now be marked as errored.
- Tool Schema Editor: Fixed an issue where the tool schema editor would go off-screen when working with large tool schemas.
- Image Synchronization: Resolved a bug where images would not synchronize between chat streams when using the Sync to Here feature in the playground.
- Performance Gains: Implemented various UI tweaks and performance enhancements across the app.
These updates aim to provide a smoother and more reliable user experience.
Introducing Minimal UI for Enhanced Focus
To help you stay focused during intensive work sessions, we've updated the interface with a new Minimal UI option.
Improvements:
- Hide Main Sidebar: You can now neatly hide the main sidebar by pressing the small arrow button near the top of the divider.
- Floating Secondary Sidebars: Secondary sidebars like the workbench items list and the benchmark execution modal now float over the application. They're only visible when needed and can be hidden with a single click.
This update provides a cleaner workspace, allowing you to concentrate on the task at hand.
AI Judge Enhancements with Outcome Improvement
We've improved our AI judging capabilities by adding an outcome improvement chain to our output judging LLM stack. This enhancement allows you to write short, concise outcomes, which we then expand using Claude 3.5 Sonnet to catch edge cases.
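Conceptually, the chain looks something like the sketch below. It uses the Anthropic Python SDK and illustrative prompts; ModelBench's internal judging prompts are not published, so treat this as a rough outline of the expand-then-judge pattern rather than the actual implementation.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def expand_outcome(short_outcome: str) -> str:
    """Step 1: expand a terse outcome into detailed judging criteria with Claude 3.5 Sonnet."""
    resp = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=400,
        messages=[{
            "role": "user",
            "content": f"Expand this evaluation outcome into precise criteria that cover edge cases:\n{short_outcome}",
        }],
    )
    return resp.content[0].text


def judge(response_text: str, criteria: str) -> str:
    """Step 2: judge a model response against the expanded criteria with the cheaper Claude 3 Haiku."""
    resp = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Criteria:\n{criteria}\n\nResponse:\n{response_text}\n\n"
                       "Does the response meet the criteria? Reply PASS or FAIL with a brief reason.",
        }],
    )
    return resp.content[0].text


print(judge("The capital of France is Paris.", expand_outcome("Mentions Paris")))
```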
Benefits:
- Efficient Outcome Writing: Save time by writing brief outcomes without compromising thoroughness.
- Cost-Effective Judging: Continue using the cost-effective Claude 3 Haiku model while achieving highly accurate judgments.
This improvement enhances the accuracy and efficiency of AI evaluations in your benchmarks.
Real-Time Benchmark Results
Experience immediate feedback with real-time benchmark results. When you run a benchmark, results now appear instantly without needing to refresh the page.
Features:
- Live Updates: Watch the benchmark grid update in real time as each test completes.
- Detailed Monitoring: Click on a specific round, test, or model to observe the inference process as it happens.
This enhancement allows for a more interactive and efficient benchmarking experience.
AI Judge Upgrade to Claude 3 Haiku
We're upgrading our AI judge to use Anthropic's Claude 3 Haiku model. Through extensive testing with ModelBench, we found that Claude 3 Haiku is as capable as, if not more capable than, the latest GPT-4 Turbo model at following instructions.
Updates:
- Enhanced Judging: Our internal judgment prompts now leverage Claude 3 Haiku for improved accuracy.
- User Feedback Welcome: If you'd like the ability to choose the judging model yourself or modify the prompts, please let us know.
This upgrade aims to provide more reliable evaluations in your benchmarks.
Introducing Prompt Versioning V1
Manage prompt changes more effectively with the new Prompt Versioning feature. Now, when you run a benchmark, the associated prompt is locked to ensure the integrity and purity of your benchmarks.
Key points:
- Prompt Locking: Benchmarked prompts are locked to prevent alterations that could affect benchmark results.
- Version Control: To make changes to a prompt, you can draft a new version, allowing you to track modifications over time.
This feature helps maintain the reliability of your benchmarks and streamlines prompt management.
Benchmark Dashboard Overhaul
We've revamped the Benchmark Dashboard to provide a more intuitive and efficient experience. The dashboard previously displayed a list of outcomes for each test and model; it now features a condensed linear grid that logically organizes tests, cases, and results.
Key improvements:
- Simplified Layout: Tests, cases, and results are presented in a clear and logical grid format.
- Detailed Insights: Click on any executed test case to view detailed results.
- Enhanced Benchmarking: Easily add multiple rounds or more models to an existing benchmark and compare results side by side.
This overhaul simplifies analysis and interpretation of your benchmark results.
Prompt Tests V1 Now in General Access
We're excited to announce that Prompt Tests V1 is now available to all users. Previously in private beta, this feature introduces a grid system where you can add tests with multiple cases.
Highlights:
- Structured Testing: Create tests with a name and multiple cases, each containing example inputs for your prompt and a desired outcome.
- AI Evaluation: Outcomes are judged by AI during the benchmark phase, providing valuable feedback on your prompts.
- Easy Synchronization: Reuse inputs or outcomes across test cases within a test by pressing Sync in any column you wish to synchronize.
This feature streamlines the process of testing and refining your prompts.
Publicly Share Playground Chats
Collaborate more effectively with the new ability to publicly share playground chats. We've added a Public Share action to the top right of the playground. Click it and press Share to create a public URL that reflects the current state of your chat stream.
Key features:
- Live Updates: The shared link stays up to date with any changes made to the chat.
- Revocable Access: Public links can be unshared at any time, giving you control over your content.
- Easy Collaboration: Users outside your organization can try out your prompts by clicking the Try in ModelBench button.
This feature makes it easier than ever to share your work and receive feedback.
Simulate Tool Use in Playground and Prompts
Enhance your testing capabilities by simulating tool use directly in the playground and prompts. When a model calls a tool, a special tool call block now appears in the message box. You can manually add a result and easily modify the call itself.
Additionally, you can now manually add tool calls to assistant messages, giving you greater control over the interaction flow.
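For reference, models routed through OpenRouter follow the OpenAI-compatible tool-calling schema, so a simulated exchange conceptually amounts to a message history like the one sketched below. The tool name, arguments, and result are hypothetical; ModelBench's tool call blocks let you edit these pieces without writing any JSON yourself.

```python
import json

# Hypothetical tool call and result, in the OpenAI-compatible shape used by
# OpenRouter models. In ModelBench you edit the call and supply the result
# through the tool call block in the message box instead.
messages = [
    {"role": "user", "content": "What's the weather in Berlin?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_1",  # placeholder call id
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": json.dumps({"city": "Berlin"}),
            },
        }],
    },
    # The manually added result becomes a tool message tied to the call id.
    {"role": "tool", "tool_call_id": "call_1", "content": json.dumps({"temp_c": 21, "sky": "clear"})},
]

print(json.dumps(messages, indent=2))
```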
Synchronize Message Histories in Playground
Simplify your comparison workflow with synchronized message histories. When comparing multiple models, you can now enable Sync Here at any message in any of the streams. All other streams will automatically synchronize up to that point.
Changes from that message onward in any chat stream will simultaneously update the other chat streams, saving you time and effort.
Compare Models Side by Side
Experiment with different models more efficiently by comparing them side by side in the playground. You can now compare any of the 180+ OpenRouter models simultaneously.
To use this feature, simply press Compare With and select any model from the dropdown.
Send Playground Chats to Prompt Workbench
Streamline your prompt development by sending playground chats directly to the Prompt Workbench. We've added a new button next to the model selector in the playground. Clicking it copies the entire message history over to a workbench prompt.
From the workbench, you can add variables to your prompt to make it dynamic and test it in different scenarios.
Organize Chats and Prompts into Projects
Enhance your workflow by organizing chats and prompts into projects. Organizations can now have one or more projects, each fully separating all chats and prompts for better management.
Create new projects easily from the sidebar project dropdown to keep your work neatly categorized.
Public Beta
We're excited to announce ModelBench is now in public beta.
Sign up for a free trial today and let us know how our chat playground and workbench speed up your prompt engineering workflows!