This research develops evaluation methods that move beyond traditional scalar metrics to directly model user preferences and system robustness. The key contribution is recall-paired preference (RPP), a metric-free evaluation method that computes a preference between two ranked lists directly, simulating multiple user subpopulations per query, rather than scoring each list separately and comparing the scores. The same preference-based view generalizes to lexicographic evaluation for both precision (lexiprecision) and recall (lexirecall); these methods improve discriminative power and sensitivity compared to traditional metrics.
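The following is a minimal sketch of how such preference-based comparisons could be computed; it is an illustration under stated assumptions, not the reference implementation. It assumes that each recall level (the rank at which a ranking retrieves its i-th relevant item) stands in for one user subpopulation, that all recall levels are weighted uniformly, that unretrieved relevant items are placed just past the end of the longer list, and that rankings contain no duplicate items. The function names `relevant_positions`, `rpp`, `lexiprecision`, and `lexirecall` are hypothetical names chosen for this example.

```python
from typing import List, Sequence, Set


def relevant_positions(ranking: Sequence[str], relevant: Set[str], missing_rank: int) -> List[int]:
    """1-based ranks of relevant items in ascending order.

    Relevant items that were not retrieved are assigned `missing_rank`
    (an assumption of this sketch).
    """
    pos = [i for i, doc in enumerate(ranking, start=1) if doc in relevant]
    pos += [missing_rank] * (len(relevant) - len(pos))
    return pos


def _sign(x: int) -> int:
    return (x > 0) - (x < 0)


def _paired_positions(a, b, relevant):
    missing = max(len(a), len(b)) + 1
    return (relevant_positions(a, relevant, missing),
            relevant_positions(b, relevant, missing))


def rpp(a: Sequence[str], b: Sequence[str], relevant: Set[str]) -> float:
    """Preference in [-1, 1]; positive means `a` is preferred.

    Each recall level casts one vote for the ranking that reaches it
    earlier; votes are averaged with uniform weights (assumed).
    """
    pa, pb = _paired_positions(a, b, relevant)
    return sum(_sign(y - x) for x, y in zip(pa, pb)) / len(relevant)


def lexiprecision(a: Sequence[str], b: Sequence[str], relevant: Set[str]) -> int:
    """Compare relevant-item ranks from the top; the first difference decides."""
    pa, pb = _paired_positions(a, b, relevant)
    for x, y in zip(pa, pb):
        if x != y:
            return _sign(y - x)
    return 0


def lexirecall(a: Sequence[str], b: Sequence[str], relevant: Set[str]) -> int:
    """Compare relevant-item ranks from the bottom; the first difference decides."""
    pa, pb = _paired_positions(a, b, relevant)
    for x, y in zip(reversed(pa), reversed(pb)):
        if x != y:
            return _sign(y - x)
    return 0


relevant = {"d1", "d2", "d3"}
a = ["d1", "x", "d2", "y", "d3"]   # relevant items at ranks 1, 3, 5
b = ["x", "d1", "d2", "d3", "y"]   # relevant items at ranks 2, 3, 4

print(rpp(a, b, relevant))            # 0.0: one vote each way, one tie
print(lexiprecision(a, b, relevant))  # 1: a reaches its first relevant item sooner
print(lexirecall(a, b, relevant))     # -1: b reaches its last relevant item sooner
```

In this toy example the two rankings tie under the averaged preference, while the lexicographic comparisons each produce a strict ordering, and in opposite directions: the top-down comparison favors the list that finds a relevant item first, and the bottom-up comparison favors the list that completes recall first.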