Let's imagine that I have an evaluation function f(x) that allows me to score a specific result.
It's not clear from this whether f(x) is a relevance score for a single found item, or a rating for a complete set of search results. If it is the former, you can construct the latter by aggregating results in some semi-arbitrary fashion, such as taking the average of f(x) over the top ten results returned by a search. The aggregation is arbitrary, but you can base it on intuitive concepts such as the number of results presented on a page. Ideally it is grounded in the higher-level goals of the system (i.e. although the system is a search system, no one has "searching" as a stand-alone goal in practice; there is always some context).
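As a rough sketch of that per-query aggregation (the names `search`, `f` and `top_k` are assumptions for illustration, not anything defined in your system): `search(query)` returns ranked results and `f(result)` scores one of them, so a per-query score can simply be the mean of f(x) over the first page of results.

```python
def score_query(search, f, query, top_k=10):
    """Aggregate per-result relevance into a single per-query score.

    `search(query)` is assumed to return results in ranked order and
    `f(result)` to return a relevance score for one result.
    """
    results = search(query)[:top_k]   # e.g. one page worth of results
    if not results:
        return 0.0                    # arbitrary choice for an empty result set
    return sum(f(r) for r in results) / len(results)
```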
To further aggregate this per-attempt value into a value per search agent, you will need to construct a test across multiple requests, covering a gamut of search terms representative of the kinds of search you expect the system to be used for. Take the mean f(x) score over all searches in the test set. This is necessary because different searches are very unlikely to be equally effective, so you want an averaged score that represents the system's performance against how it will be used in practice.
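Continuing the sketch above (again assuming hypothetical helpers: an `agent` object exposing a `search` method, and a fixed list of test queries), the per-agent score is just the mean of the per-query scores over the whole test set.

```python
def score_agent(agent, f, test_queries, top_k=10):
    """One number per agent: the mean per-query score over the test set.

    Also returns the individual per-query scores, which are useful later
    for estimating the standard error of the mean.
    """
    per_query = [score_query(agent.search, f, q, top_k) for q in test_queries]
    return sum(per_query) / len(per_query), per_query
```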
For instance, how can I tell between two versions of my agent (v1 and v2) which one is performing better?
It is the one with the better aggregated f(x) score, measured over a standardised test set. You can reduce the cost of measuring this, at the expense of accuracy, by using a smaller fair sample from the test set. Alternatively, use a small, fast test set to get a rough indication and a larger one when refining things. You will also want to understand the standard error in your measurements of aggregate f(x), so you can tell when a difference is large enough to be significant.
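One way to make that comparison concrete (a sketch only, building on the assumed `score_agent` helper above and treating per-query score differences as independent samples) is a paired comparison on the same queries: compute the mean difference and its standard error, and only declare a winner when the gap is a couple of standard errors wide.

```python
import math

def compare_agents(agent_v1, agent_v2, f, test_queries, top_k=10):
    """Paired comparison of two agent versions on the same test queries."""
    _, scores_v1 = score_agent(agent_v1, f, test_queries, top_k)
    _, scores_v2 = score_agent(agent_v2, f, test_queries, top_k)
    diffs = [b - a for a, b in zip(scores_v1, scores_v2)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    sem = math.sqrt(variance / n)   # standard error of the mean difference
    # Rough rule of thumb: a gap under ~2 standard errors is probably noise.
    return mean_diff, sem, abs(mean_diff) > 2 * sem
```

A paired comparison is used here because both agents see the identical queries, which removes query-to-query variation from the noise you are trying to see past.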
I don’t have access to the entire database of results due to the sheer volume (millions of results) and the cost involved.
Whilst this might rule out certain kinds of system validation, such as demonstrating that the current f(x) is the best possible, it does not prevent you from testing the system statistically. The important detail is to make repeated test runs with different agents equivalent in as many ways as possible: change only the internal, controllable aspects of the agent between tests, not how or what you measure.
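A minimal way to enforce that, assuming the helpers sketched earlier, is to freeze the test queries (and any sampling seed) once and reuse the identical harness for every agent under test; the query strings below are stand-in examples, not real data.

```python
import random

FIXED_SEED = 12345   # fixed so any subsampling is identical across runs
TEST_QUERIES = [     # frozen, representative query set shared by every run
    "wireless headphones",
    "python csv tutorial",
    "cheap flights to paris",
]

def evaluate(agent, f, sample_size=None):
    """Run the same measurement for any agent; only the agent changes between runs."""
    queries = TEST_QUERIES
    if sample_size is not None:  # optional cheaper run for a rough indication
        queries = random.Random(FIXED_SEED).sample(TEST_QUERIES, sample_size)
    mean_score, _ = score_agent(agent, f, queries)
    return mean_score
```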