Releases

July 09, 2024

Load 2

Over the past several months, you let us know that Gentrace's results view was too complex. This made it hard for users to adopt Gentrace without being taught by someone else.


To solve this, Gentrace has revamped the core test result UI, decomposing the old, cluttered test result view into three different ones for better clarity.
We've also made all of the following views realtime, so that you can watch as evaluation results from LLMs, heuristics, or humans stream in.

Aggregate comparison

The new aggregate view shows the statistical differences between different versions of your LLM-based feature.

Improvements [6]
Persistent user-specific view settings, which can be saved and overriden from a URL o1 support Fixed 68 bugs Added explicit compare button
July 09, 2024

Load 3

Production evaluation graphs

Production evaluators now automatically create graphs to show how performance is trending over time.
For example, you can create a "Safety" evaluator which uses LLM-as-a-judge to score whether an output is compliant with your AI safety policy.
Then, you can see how the average output "Safety" trends over time.

Local evals & local datasets

Gentrace now allows you to more easily define local evaluations  and use completely local data  / datasets.
This makes Gentrace work better with existing unit testing frameworks and patterns. It also makes Gentrace incrementally adoptable into homegrown testing stacks.

Improvements [6]
Persistent user-specific view settings, which can be saved and overriden from a URL o1 support
July 09, 2024

Load 4

Datasets

Gentrace test cases in pipelines currently work well when only one engineer is working on the pipeline.

However, once more than one engineer is working on the same pipeline, it becomes difficult to manage test data in Gentrace. In practice, engineers end up cloning pipelines or overwriting test data, both of which have significant drawbacks.

To solve this, Gentrace has:

  • Introduced datasets, which organize test data into separate groups within a pipeline
  • Migrated existing test data into a default "Golden dataset"
  • Made existing API routes and SDK methods operate on the "Golden dataset," and created optional parameters or new versions which allow you to specify an alternative dataset.

Please give us feedback on how datasets feel.

Test result settings memory

Settings in any of the test result pages (such as hiding evaluators, collapsing inputs, outputs, or metadata, re-ordering fields, etc) are now remembered across test results in the same pipeline.
This makes it easier to see exactly what you want (and only exactly what you want), without having to redo your work every time.

Improvements [6]
Persistent user-specific view settings, which can be saved and overriden from a URL o1 support Fixed 68 bugs Added explicit compare button

Evaluate

Experiment

Compare