Why I refactored my benchmarking framework to be agent-first

I just merged PR #3 in benchctl, a benchmark orchestration tool I have been developing on my own.

I realised that developing a UI, plotting and data analysis features is not really a good focus for this project.

Now, with AI agents, that type of code is pretty cheap to produce. One could even argue that the orchestration code is also quite cheap to produce. But there is a fundamental difference between the two. The orchestration code actually repeats itself: it is what I end up reimplementing every time I do a new serious benchmark. The rest (plots, data analysis) tends to be very domain-specific, and it’s very hard to add genuinely useful primitives there that I can guarantee will be used.

For this reason, I refactored the codebase away from those features and leaned further into better benchmarking primitives.

Library-first API

The core workflow now lives in two public packages: pkg/bench (static configuration) and pkg/run (runtime execution). Both use the functional options pattern.

b := bench.New("my-bench",
    bench.WithResultsPath("./results"),
    bench.WithHost("local", bench.Local()),
    bench.WithStages(
        bench.Stage("hello",
            bench.Host("local"),
            bench.Command("echo 'Hello from benchctl library'"),
        ),
    ),
)

result, err := run.Run(ctx, b,
    run.WithEnv("NAME", "Lucca"),
    run.WithTimeout(5*time.Minute),
)

Definition options and runtime options are separate. You can re-run the same Bench with different env vars, metadata, or skipped stages without mutating the definition.
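To make that separation concrete, here is a minimal, self-contained sketch of the functional options pattern with two distinct option types. It is not benchctl's actual code: `Bench`, `BenchOption`, `RunOption`, `runConfig`, and `Summary` are hypothetical stand-ins, and the default values are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Bench is a hypothetical stand-in for a static benchmark definition.
type Bench struct {
	Name        string
	ResultsPath string
}

// BenchOption configures a Bench once, at definition time.
type BenchOption func(*Bench)

func WithResultsPath(p string) BenchOption {
	return func(b *Bench) { b.ResultsPath = p }
}

func New(name string, opts ...BenchOption) Bench {
	b := Bench{Name: name, ResultsPath: "."} // assumed default
	for _, opt := range opts {
		opt(&b)
	}
	return b
}

// runConfig holds per-run settings and is rebuilt on every Run call.
type runConfig struct {
	env     map[string]string
	timeout time.Duration
}

// RunOption configures a single run without touching the definition.
type RunOption func(*runConfig)

func WithEnv(key, value string) RunOption {
	return func(c *runConfig) { c.env[key] = value }
}

func WithTimeout(d time.Duration) RunOption {
	return func(c *runConfig) { c.timeout = d }
}

// Summary reports what a run would have used.
type Summary struct {
	Bench   string
	Env     map[string]string
	Timeout time.Duration
}

// Run receives the Bench by value, so runtime options cannot mutate it.
func Run(b Bench, opts ...RunOption) Summary {
	cfg := runConfig{env: map[string]string{}, timeout: time.Minute} // assumed default
	for _, opt := range opts {
		opt(&cfg)
	}
	return Summary{Bench: b.Name, Env: cfg.env, Timeout: cfg.timeout}
}

func main() {
	b := New("my-bench", WithResultsPath("./results"))
	// The same definition is reused with different runtime options.
	fmt.Printf("%+v\n", Run(b, WithEnv("NAME", "Lucca")))
	fmt.Printf("%+v\n", Run(b, WithTimeout(5*time.Minute)))
}
```

Because each `Run` call builds a fresh `runConfig`, settings from one run do not leak into the next.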

Now my main.go is basically a small consumer of this library: it loads a YAML config, parses CLI flags into []run.Option, and calls run.Run. No benchmark logic in the CLI.
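As a rough illustration of the "flags into options" step, the sketch below turns CLI arguments into a slice of option functions using only the standard `flag` package. The `-env` and `-timeout` flag names, the `Option` type, and the `settings` struct are all hypothetical; the real CLI flags and `run.Option` may look different.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"strings"
	"time"
)

// settings and Option are hypothetical stand-ins for the runtime config
// and for run.Option.
type settings struct {
	env     map[string]string
	timeout time.Duration
}

type Option func(*settings)

// envFlag collects repeated -env KEY=VALUE flags.
type envFlag []string

func (e *envFlag) String() string { return strings.Join(*e, ",") }

func (e *envFlag) Set(v string) error {
	if !strings.Contains(v, "=") {
		return fmt.Errorf("expected KEY=VALUE, got %q", v)
	}
	*e = append(*e, v)
	return nil
}

// parseOptions converts CLI arguments into a slice of options.
func parseOptions(args []string) ([]Option, error) {
	fs := flag.NewFlagSet("benchctl", flag.ContinueOnError)
	var envs envFlag
	fs.Var(&envs, "env", "KEY=VALUE, may be repeated")
	timeout := fs.Duration("timeout", 0, "overall timeout (0 means none)")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}

	var opts []Option
	for _, kv := range envs {
		k, v, _ := strings.Cut(kv, "=") // Set already checked for "="
		opts = append(opts, func(s *settings) { s.env[k] = v })
	}
	if *timeout > 0 {
		d := *timeout
		opts = append(opts, func(s *settings) { s.timeout = d })
	}
	return opts, nil
}

// apply folds the options into a fresh settings value.
func apply(opts []Option) settings {
	s := settings{env: map[string]string{}}
	for _, opt := range opts {
		opt(&s)
	}
	return s
}

func main() {
	opts, err := parseOptions(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("%+v\n", apply(opts))
}
```

Keeping the CLI down to "build a slice, hand it over" is what lets a library consumer or an agent assemble the same options directly in code.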

Comparison cases

The other big primitive is case-based comparison benchmarks. Cases can now be expressed and compared easily while sharing most of the code between them. For example: database A vs database B, with slightly different parameters for the bench but basically the same post-run code.

Each case gets its own env vars plus BENCHCTL_CASE_NAME. Stages run for every case unless you pin one with OnlyFor:

b := bench.New("engine-compare",
    bench.WithResultsPath("./results"),
    bench.WithCases(
        bench.NewCase("postgres", bench.Env("DB_ENGINE", "postgres")),
        bench.NewCase("mysql", bench.Env("DB_ENGINE", "mysql")),
    ),
    bench.WithStages(
        bench.Stage("run-load",
            bench.Command(`./load.sh "$DB_ENGINE"`),
        ),
        bench.Stage("postgres-extra",
            bench.OnlyFor("postgres"),
            bench.Command("./postgres-extra.sh"),
        ),
    ),
)
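One way to picture the "every case unless pinned" rule is as a filter over the case × stage matrix. The sketch below is an assumption about how such planning could work, not benchctl's implementation: the `Case`, `Stage`, and `Step` types are hypothetical, and iterating case by case (rather than stage by stage) is a guess.

```go
package main

import "fmt"

// Case and Stage are hypothetical, simplified stand-ins.
type Case struct {
	Name string
}

type Stage struct {
	Name    string
	OnlyFor []string // empty means "run for every case"
}

// Step is one (case, stage) pair that would be executed.
type Step struct {
	Case  string
	Stage string
}

// appliesTo reports whether a stage should run for the given case.
func appliesTo(s Stage, caseName string) bool {
	if len(s.OnlyFor) == 0 {
		return true
	}
	for _, name := range s.OnlyFor {
		if name == caseName {
			return true
		}
	}
	return false
}

// plan lists the steps in case-major order (an assumed ordering).
func plan(cases []Case, stages []Stage) []Step {
	var steps []Step
	for _, c := range cases {
		for _, s := range stages {
			if appliesTo(s, c.Name) {
				steps = append(steps, Step{Case: c.Name, Stage: s.Name})
			}
		}
	}
	return steps
}

func main() {
	cases := []Case{{Name: "postgres"}, {Name: "mysql"}}
	stages := []Stage{
		{Name: "run-load"},
		{Name: "postgres-extra", OnlyFor: []string{"postgres"}},
	}
	for _, st := range plan(cases, stages) {
		fmt.Printf("%s: %s\n", st.Case, st.Stage)
	}
}
```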

Case env vars also expand inside stage commands and output file names, so each variant can write distinct artifacts without copy-pasting whole benchmark configs.
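A minimal sketch of that kind of expansion, using `os.Expand` from the standard library. The helper names `caseEnv` and `expand` are hypothetical, and falling back to the process environment for unknown variables is an assumption, not confirmed benchctl behaviour.

```go
package main

import (
	"fmt"
	"os"
)

// caseEnv builds the variables visible to one case: its own env vars
// plus BENCHCTL_CASE_NAME.
func caseEnv(name string, env map[string]string) map[string]string {
	out := make(map[string]string, len(env)+1)
	for k, v := range env {
		out[k] = v
	}
	out["BENCHCTL_CASE_NAME"] = name
	return out
}

// expand replaces $VAR and ${VAR} references in s. Variables missing
// from vars fall back to the process environment (an assumption).
func expand(s string, vars map[string]string) string {
	return os.Expand(s, func(key string) string {
		if v, ok := vars[key]; ok {
			return v
		}
		return os.Getenv(key)
	})
}

func main() {
	engines := []string{"postgres", "mysql"}
	for _, engine := range engines {
		vars := caseEnv(engine, map[string]string{"DB_ENGINE": engine})
		fmt.Println(expand(`./load.sh "$DB_ENGINE"`, vars))
		fmt.Println(expand("results-${BENCHCTL_CASE_NAME}.csv", vars))
	}
}
```

With this shape, one stage definition yields a distinct command and a distinct output file per case.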

What’s next

I’m also planning to finish a skill I’m already writing so that AI agents are better at implementing and managing benchmarks. It covers things like ensuring reproducibility, considerations about whether the load generator and the target system are co-located, caveats about “what is being benchmarked?” and more.

The YAML config path is still there and still the main way I use benchctl day to day. The library export is for when the benchmark itself is code, or when an agent needs to compose workflows programmatically.