graph LR
segment -- create_query --> query
query -- retrieve --> segment
Self-supervised retrieval evaluation
Evaluation with synthetic queries
Even without real labeled data, you can often run a meaningful evaluation of retrieval performance using synthetic data.
For example, if you have a text divided into segments and some way to generate, for any given segment, a plausible query to which that segment would be a good response, you can build an evaluation dataset like this:
loop N times:
    choose arbitrary segment
    query = create_query(segment)
    save (query, segment) to dataset
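The loop above might look something like the following in Python. This is only a minimal sketch: `build_eval_dataset` and `toy_create_query` are hypothetical names, and the toy query generator (first three words plus a question mark) is a stand-in assumption for whatever generator you actually use, such as a prompted language model.

```python
import random
from typing import Callable


def build_eval_dataset(
    segments: list[str],
    create_query: Callable[[str], str],
    n: int,
    seed: int = 0,
) -> list[tuple[str, str]]:
    """Sample n segments (with replacement) and pair each with a generated query."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    dataset = []
    for _ in range(n):
        segment = rng.choice(segments)
        dataset.append((create_query(segment), segment))
    return dataset


def toy_create_query(segment: str) -> str:
    # Hypothetical placeholder: turn the first three words into a "query".
    return " ".join(segment.split()[:3]) + "?"


segments = [
    "Mermaid renders diagrams from plain text.",
    "QuickCheck generates random inputs for Haskell tests.",
    "Retrieval systems rank segments against a query.",
]
dataset = build_eval_dataset(segments, toy_create_query, n=5)
```

Sampling with replacement mirrors "choose arbitrary segment"; if you would rather cover each segment at most once, `rng.sample(segments, n)` is a possible alternative.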
Synthetic data and property-based testing
When synthetic data is used for training or fine-tuning, it makes sense to think of it as data. But in the context of evaluation, you can also think of the whole process as a kind of stochastic property-based testing, where you verify that some circuit you think ought to exist (based on your understanding of the problem) does in fact close as expected.
There’s a whole lore around property-based testing, much of it exploiting formal properties of domain objects and the operations on them. Common patterns (following Wlaschin 2014) include:
- commutative relationships (and model-based approaches),
- invertible operations,
- invariance under transformation,
- idempotence, and
- structural induction.
The example given above fits the ‘invertible operations’ paradigm, where the operations are ‘given a query, retrieve a responsive segment’ and ‘given a segment, generate a plausible query’.
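As a rough illustration of checking that round trip, the sketch below scores a retriever by how often the original segment comes back among the top k results for its generated query. Everything here is an assumption made for the example: `hit_rate_at_k`, `toy_retrieve` (a word-overlap ranker), and `toy_create_query` are hypothetical stand-ins, not part of any particular library.

```python
import re
from typing import Callable

SEGMENTS = [
    "Mermaid renders diagrams from plain text.",
    "QuickCheck generates random inputs for Haskell tests.",
    "Retrieval systems rank segments against a query.",
]


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_retrieve(query: str, k: int) -> list[str]:
    # Hypothetical retriever: rank segments by word overlap with the query.
    q = tokens(query)
    ranked = sorted(SEGMENTS, key=lambda s: len(q & tokens(s)), reverse=True)
    return ranked[:k]


def toy_create_query(segment: str) -> str:
    # Hypothetical generator: first three words of the segment as a question.
    return " ".join(segment.split()[:3]) + "?"


def hit_rate_at_k(
    dataset: list[tuple[str, str]],
    retrieve: Callable[[str, int], list[str]],
    k: int = 1,
) -> float:
    """Fraction of (query, segment) pairs whose segment is in the top-k results."""
    hits = sum(1 for query, segment in dataset if segment in retrieve(query, k))
    return hits / len(dataset)


dataset = [(toy_create_query(s), s) for s in SEGMENTS]
score = hit_rate_at_k(dataset, toy_retrieve, k=1)
```

Because the toy queries are cut straight from the segments, a lexical retriever closes the loop trivially here; with a more realistic query generator the score becomes informative.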
But really, for lots of properties that involve a generation step you could collect a dataset of generated examples and call it synthetic evaluation data.
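To make that concrete for a different property, here is a small sketch of an "invariance under transformation" check: generate a transformed variant of each query and measure how often the top result stays the same. The names `invariance_rate` and `toy_retrieve` are hypothetical, and uppercasing is just one assumed example of a transformation that should not change the answer.

```python
import re
from typing import Callable

SEGMENTS = [
    "Mermaid renders diagrams from plain text.",
    "QuickCheck generates random inputs for Haskell tests.",
    "Retrieval systems rank segments against a query.",
]


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_retrieve(query: str) -> str:
    # Hypothetical retriever: return the segment sharing the most words with the query.
    q = tokens(query)
    return max(SEGMENTS, key=lambda s: len(q & tokens(s)))


def invariance_rate(
    queries: list[str],
    transform: Callable[[str], str],
    retrieve: Callable[[str], str],
) -> float:
    """Fraction of queries whose top result is unchanged by the transformation."""
    unchanged = sum(1 for q in queries if retrieve(q) == retrieve(transform(q)))
    return unchanged / len(queries)


queries = ["random inputs for Haskell", "rank segments against a query"]
rate = invariance_rate(queries, str.upper, toy_retrieve)
```

The toy retriever lowercases its input, so it passes this check by construction. A real system might not, and paraphrases or typos would probably be more revealing transformations.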
References:
- Claessen, K. and Hughes, J. (2000). QuickCheck: a lightweight tool for random testing of Haskell programs. ACM SIGPLAN Notices, Vol. 35, Issue 9, pp. 268-279.
- Es, S. (2024). All about synthetic data generation. Ragas blog.
- Esfandiarpoor, R. et al. (2025). Beyond Contrastive Learning: Synthetic Data Enables List-wise Training with Multiple Levels of Relevance. arXiv:2503.23239 [cs.IR].
- Rahmani, H. (2024). Synthetic Test Collections for Retrieval Evaluation. arXiv:2405.07767 [cs.IR].
- Wlaschin, S. (2014). Choosing properties for property-based testing. F# blog.