Self-supervised evaluation

Published December 9, 2025

Retrieval evaluation with synthetic queries

Even if you lack labeled data, it’s still often possible to run meaningful evaluations with synthetic data.

For example, with a segmented text dataset you can:

loop:
    choose arbitrary segment
    query = create_query(segment)
    save (query, segment) to dataset
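For concreteness, here’s a minimal Python sketch of that loop. It uses the OpenAI client purely for illustration; the model name, prompt, and file path are placeholders, and any LLM call that turns a segment into a plausible query would do.

import json
import random

from openai import OpenAI  # assumption: any LLM client would work here

client = OpenAI()

def create_query(segment: str) -> str:
    # Ask the model for a search query that this segment should answer.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {
                "role": "system",
                "content": "Write one short search query that the following passage answers. Reply with the query only.",
            },
            {"role": "user", "content": segment},
        ],
    )
    return response.choices[0].message.content.strip()

def build_eval_set(segments: list[str], n: int, path: str = "synthetic_eval.jsonl") -> None:
    # Sample n arbitrary segments, generate a query for each, and save the pairs.
    with open(path, "w") as f:
        for segment in random.sample(segments, k=n):
            pair = {"query": create_query(segment), "segment": segment}
            f.write(json.dumps(pair) + "\n")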

Synthetic data and property-based testing

When synthetic data is used for training or fine-tuning, it makes sense to think of it as data. But in the context of evaluation, you can also think of the process above as a kind of stochastic property-based testing: we are verifying that a circuit we believe ought to exist, given our understanding of the problem, is in fact closed.

graph LR
    segment -- create_query --> query
    query -- retrieve --> segment
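To close the loop, you check how often the original segment comes back when you retrieve with its synthetic query. A minimal sketch, assuming a retrieve(query, k) function (a placeholder standing in for whatever retriever you’re evaluating) and the (query, segment) pairs generated above:

from typing import Callable

def recall_at_k(
    pairs: list[tuple[str, str]],               # (query, segment) pairs from the generation loop
    retrieve: Callable[[str, int], list[str]],  # placeholder: your retriever
    k: int = 5,
) -> float:
    # Fraction of synthetic queries whose source segment shows up in the top-k results.
    hits = sum(1 for query, segment in pairs if segment in retrieve(query, k))
    return hits / len(pairs) if pairs else 0.0

Each pair is one sampled instance of the property, and recall@k is simply the fraction of samples for which the circuit closed.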

It turns out there’s a whole lore around property-based testing, much of it exploiting formal properties of domain objects and the operations on them. Following Wlaschin (2014), commonly exploited properties include the following (a classic example is sketched after the list):

  • commutative relationships (and model-based approaches),
  • invertible operations,
  • invariance under transformation,
  • idempotence, and
  • structural induction.
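As a reference point, here is what a classic invertible-operations property looks like with the Hypothesis library (my example, not Wlaschin’s): a serialize/deserialize round trip.

import json
from hypothesis import given, strategies as st

# Property: decoding an encoded value gives back the original value.
@given(st.dictionaries(st.text(), st.integers()))
def test_json_round_trip(value):
    assert json.loads(json.dumps(value)) == value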

The retrieval example above fits the ‘invertible operations’ paradigm, where the operations are ‘given a query, retrieve a responsive segment’ and ‘given a segment, generate a plausible query’. But really, for lots of properties involving a generation step, you could collect a dataset of generated examples and call it synthetic evaluation data.
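The same pattern works for other properties on the list. For example, here is a sketch of an ‘invariance under transformation’ check, assuming a paraphrase helper (another LLM call; the name is a placeholder) and the same retrieve function as before: paraphrasing a query shouldn’t change which segment comes back first.

from typing import Callable

def paraphrase_invariance_rate(
    queries: list[str],
    paraphrase: Callable[[str], str],           # placeholder: e.g. an LLM prompted to reword the query
    retrieve: Callable[[str, int], list[str]],  # placeholder: your retriever
) -> float:
    # Fraction of queries whose top result is unchanged after paraphrasing.
    unchanged = sum(
        1 for q in queries if retrieve(q, 1) == retrieve(paraphrase(q), 1)
    )
    return unchanged / len(queries) if queries else 0.0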

References:

Wlaschin, Scott. 2014. “Choosing properties for property-based testing.” fsharpforfunandprofit.com.