The deeper issue with LLM evals is that they force teams to admit something uncomfortable: most companies never really defined “quality” in the first place. With deterministic software, you could hide behind pass/fail tests. With LLMs, that illusion breaks. Now you have to decide what matters: accuracy, usefulness, tone, risk, latency, cost, refusal behavior, source faithfulness, user trust, business outcome. And those things often conflict.
Great point, and well timed. I'd add CI for regression testing when the evals are custom-built by the engineering team. For instance, you can build a gold set, label each item with its expected outcome (say, positive or negative), and run the whole set through CI on every push, verifying that each item still produces the expected result. That way, new changes can't silently break behavior that used to work.
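The gold-set idea above can be sketched as a small script that CI runs on every push. This is a minimal sketch, not a real implementation: `classify` is a hypothetical stand-in for your actual LLM call, and the gold set here is invented for illustration.

```python
# Minimal gold-set regression check for CI (sketch).
# Assumptions: `classify` wraps your real LLM pipeline, and the gold
# set is a list of (input, expected_label) pairs -- both hypothetical.

GOLD_SET = [
    ("Refund my order, it arrived broken.", "negative"),
    ("The new dashboard is fantastic!", "positive"),
]


def classify(text: str) -> str:
    """Stand-in for the real LLM call; swap in your pipeline here."""
    return "negative" if "broken" in text else "positive"


def run_regression(gold_set) -> list:
    """Return a list of failure messages; empty means no regressions."""
    failures = []
    for text, expected in gold_set:
        got = classify(text)
        if got != expected:
            failures.append(f"{text!r}: expected {expected}, got {got}")
    return failures


if __name__ == "__main__":
    failures = run_regression(GOLD_SET)
    # A non-zero exit fails the CI job, blocking the change.
    raise SystemExit("\n".join(failures) if failures else 0)
```

The key design choice is that the script exits non-zero on any mismatch, so the CI job itself becomes the gate: a pull request that regresses a gold-set item simply cannot merge.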
Tools don't just change what we can do.
Over time, they also change which cognitive muscles we keep exercising ourselves.