Self-supervised retrieval evaluation

Categories: miscellany, retrieval
Published: December 9, 2025
Modified: December 26, 2025

Evaluation with synthetic queries

Even if you lack real labeled data, it’s still often possible to do meaningful evaluations of retrieval performance using synthetic data.

For example, if you have a piece of text divided into segments and some way to generate a plausible query for which a given segment would be a good response, you can create an evaluation dataset like this:

# assumes `segments` (list of text segments), `create_query`, `dataset` (list), and `N` are defined
import random

for _ in range(N):
    segment = random.choice(segments)     # choose an arbitrary segment
    query = create_query(segment)         # generate a plausible query for it
    dataset.append((query, segment))      # save the (query, segment) pair

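Once you have such a dataset, a simple metric is recall@k: how often the original segment shows up in the top-k results for its synthetic query. Here is a minimal sketch, where retrieve(query, k) is a hypothetical stand-in for whatever retrieval system is under test:

def recall_at_k(dataset, retrieve, k=5):
    # dataset: list of (query, segment) pairs built as above
    # retrieve: hypothetical function returning the top-k segments for a query
    hits = sum(segment in retrieve(query, k) for query, segment in dataset)
    return hits / len(dataset)

If retrieve returns a ranked list, mean reciprocal rank can be computed from the same pairs in the same way.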
Synthetic data and property-based testing

When synthetic data is used for training or fine-tuning, it makes sense to think of it as data. But in the context of evaluation, you can also think of the whole process as a kind of stochastic property-based testing, where we verify that some circuit we think ought to exist (based on our understanding of the problem) is in fact closed as expected.

graph LR
    segment -- create_query --> query
    query -- retrieve --> segment

It turns out there’s a whole body of lore around property-based testing, often exploiting formal properties of domain objects and operations on them. E.g. (following Wlaschin 2014):

  • commutative relationships (and model-based approaches),
  • invertible operations,
  • invariance under transformation,
  • idempotence, and
  • structural induction.

The example given above fits the ‘invertible operations’ paradigm, where the operations are ‘given a query, retrieve a responsive segment’ and ‘given a segment, generate a plausible query’.
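Seen through that lens, the check for a single segment is a round-trip property, and the dataset-level evaluation is just the empirical pass rate of that property over sampled segments. A minimal sketch, again assuming hypothetical create_query and retrieve helpers:

def roundtrip_holds(segment, create_query, retrieve, k=5):
    # Property: a plausible query generated from a segment should retrieve
    # that segment among the top-k results.
    query = create_query(segment)
    return segment in retrieve(query, k)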

But really, for any property that involves a generation step, you could collect a dataset of generated examples and call it synthetic evaluation data.

References:
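
Wlaschin, Scott (2014). “Choosing properties for property-based testing”. fsharpforfunandprofit.com.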