Evaluating Information Retrieval
A story of Precision & Recall
Continuing from my latest post, this is a good chance to discuss my understanding of evaluation methodologies in Information Retrieval, presenting the concepts of precision, recall and F1-score (also known as F-score or F-measure), or at least to give it a try.
Before I proceed to a brief discussion of these terms, I should make a reasonable link to the previous post, on how I managed to design and run reproducible experiments, tackling a real-world problem with the best practices I know so far, in order to understand and create useful representations of system evaluation.
CBIR system in JSON format
Besides the information retrieval and ranking list concepts, I had to plan ahead and keep the implementation DRY, since a web interface would later integrate with the retrieval mechanism and the experimental results. I had to meet the final deadline and had never faced a similar situation before, so I structured the extracted data in JSON format: a universal data-interchange format that is not only easily handled with Python but also a web- and PHP-friendly standard.
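To make this concrete, here is a minimal sketch of what such a per-query JSON record could look like; the field names and values are my own illustration, not the exact schema of the project.

```python
import json

# Hypothetical per-query record: field names and values are illustrative only.
results = {
    "query": "building01_left_day",
    "ranking": [
        {"image": "building01_right_day", "inliers": 42},
        {"image": "building01_left_night", "inliers": 35},
        {"image": "building07_left_day", "inliers": 6},
    ],
}

# JSON keeps the same data equally readable from Python and from a PHP web interface.
with open("building01_left_day.json", "w") as f:
    json.dump(results, f, indent=2)
```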
The problem addressed can be formulated as: “Given a query image of a specific building, retrieve all images depicting the same building from a given database.” I will not dwell on complicated computer vision problems, e.g. viewpoint variations, geometric transformations etc., otherwise the big picture would be lost.
The image naming methodology (building01_left_day) proved really helpful, as I could easily set up a couple of experimental scenarios. The only values that could be used as input queries were the building ids, e.g. 01, 02, …, 60. Legit enough.
The image processing techniques were developed with the OpenCV library, and the extracted information is based on the number of “inliers”¹. This information was handled as a ranking list, sorted by descending inlier count from the most to the least relevant retrieved building. More specifically, for each query a set of ranking lists was created, which proved really helpful for calculating the Precision & Recall metrics (a sketch of how they are built follows the list):
- A ranking list of the retrieved images that are relevant to the query building.
- A ranking list of all retrieved images.
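As a rough sketch of how the two lists can be derived from the inlier counts, assuming `matches` maps image names to inlier counts and relevance is read from the file-name prefix (both are my assumptions):

```python
def build_ranking_lists(matches, query_building_id):
    """Return (relevant, total) ranking lists, both sorted by descending inliers."""
    # Total ranking list: every retrieved image, most inliers first.
    total = sorted(matches, key=matches.get, reverse=True)
    # Relevant ranking list: only images of the same building as the query,
    # identified here by the building id prefix in the file name.
    relevant = [img for img in total if img.startswith(f"building{query_building_id}")]
    return relevant, total

matches = {"building01_right_day": 42, "building07_left_day": 6, "building01_left_night": 35}
relevant, total = build_ranking_lists(matches, "01")
print(relevant)  # ['building01_right_day', 'building01_left_night']
print(total)     # ['building01_right_day', 'building01_left_night', 'building07_left_day']
```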
Precision, Recall & F-Measure
Precision is the number of retrieved images that are relevant to the query building, divided by the total number of retrieved images.
Recall is the number of retrieved images that are relevant to the query building, divided by the total number of images of that building in the database.
F-Measure provides a weighted harmonic mean of the precision and recall.
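In code the three metrics boil down to a few lines; this is a minimal sketch assuming we have already counted the relevant retrieved images and the relevant images existing in the database.

```python
def precision(relevant_retrieved, total_retrieved):
    # Fraction of the retrieved images that are actually relevant.
    return relevant_retrieved / total_retrieved if total_retrieved else 0.0

def recall(relevant_retrieved, total_relevant):
    # Fraction of the relevant images in the database that were retrieved.
    return relevant_retrieved / total_relevant if total_relevant else 0.0

def f_measure(p, r):
    # Balanced (F1) harmonic mean of precision and recall.
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Example: 10 images retrieved, 7 of them relevant, 14 relevant images exist.
p, r = precision(7, 10), recall(7, 14)
print(p, r, f_measure(p, r))  # 0.7 0.5 0.583...
```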
The inlier threshold is a restraining factor that defines the similarity cut-off for a retrieved image. The higher it gets, the fewer retrieved images are identified as most similar to the query image (higher precision, fewer false positives); namely, those building images are considered identical to the query building. Off the record, and from a computer vision perspective, 8 inliers can be considered the classifier border that separates the true negatives from the true positives (less identical vs. more identical retrieved images).
There are many ways one can improve precision and recall; extracting better image features or using a Bag of Words model are a few. However, this can be considered a computer vision problem that is still a bit unexplored here.
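The effect of the threshold can be illustrated with a tiny sweep over hypothetical inlier counts: the higher the cut-off, the fewer images survive as retrieved.

```python
# Hypothetical inlier counts for one query; the numbers are illustrative only.
matches = {"building01_right_day": 42, "building01_left_night": 35,
           "building07_left_day": 6, "building13_front_day": 3}

for threshold in (4, 8, 12):
    # Only images whose inlier count clears the threshold count as retrieved.
    retrieved = [img for img, inliers in matches.items() if inliers >= threshold]
    print(threshold, retrieved)
# Raising the threshold shrinks the retrieved set: fewer false positives and
# higher precision, at the cost of recall.
```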
Additional Evaluation Methods
Interpolated Precision vs Recall
The interpolated precision-recall curve is measured at the 11 recall levels 0.0, 0.1, …, 1.0. For each recall level, the arithmetic mean of the interpolated precision at that level is calculated over all information needs (queries).
Hands-on, this graph was a pain in the ass for me until I got a grip on its concept and implemented the theory over the real-world building results. Before the interpolated variation of the graph, I was only aware of the classic jagged Precision-Recall curve.
Problem: The glitch I had was the number of relevant images in each building query. Each building is queried with 15 images, so 14 images (the denominator of the recall fraction) are expected to be retrieved (the omitted one is the query itself). Somehow I had to squeeze 14 values into 11 mean points.
Solution: Interpolation, and scipy.interpolate in particular, was my life saviour.
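Below is a minimal sketch of the 11-point curve for a single query; the precision/recall values are made up for illustration. The textbook rule takes, at each recall level r, the highest precision achieved at any recall ≥ r, while my own implementation leaned on scipy.interpolate for the resampling, which serves the same purpose of mapping the 14 raw points onto the 11 standard levels.

```python
import numpy as np

# Hypothetical jagged precision/recall points of one query (14 relevant images
# give recall steps of 1/14); values are illustrative only.
recall = np.array([1/14, 2/14, 3/14, 4/14, 5/14, 6/14, 7/14])
precision = np.array([1.0, 1.0, 0.75, 0.8, 0.71, 0.6, 0.58])

recall_levels = np.linspace(0.0, 1.0, 11)  # 0.0, 0.1, ..., 1.0
interpolated = np.array([
    # Interpolated precision at level r: best precision at any recall >= r.
    precision[recall >= r].max() if np.any(recall >= r) else 0.0
    for r in recall_levels
])
print(interpolated)
# Averaging these 11-point curves over all queries gives the final graph.
```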
Highest F-Measure
In the following graph, the highest F-Measure percentage is measured at the peak value of the F1 score taken from the precision-recall-F1 curves of each image descriptor (8 for SIFT, 10 for SURF). It turns out that the SURF descriptor proves overall more accurate at higher inlier counts.
Mean Average Precision
What is mean average precision? Just an average precision value over all the queries of a single building, taken in aggregate.
mAP = (sum of the precision values of the 15 queries of a building) / (number of valid queries at the given inlier threshold)
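As a sketch of that formula (the per-query precision values below are hypothetical, and None marks a query that is not valid at the chosen inlier threshold):

```python
def building_map(precisions_per_query):
    """Average precision over the valid queries of a single building."""
    valid = [p for p in precisions_per_query if p is not None]
    return sum(valid) / len(valid) if valid else 0.0

# 15 queries per building in the real experiment; shortened here for brevity.
precisions_per_query = [0.8, 0.75, None, 0.9, 0.6]
print(building_map(precisions_per_query))  # 0.7625
```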
Final Words
Evaluating precision and recall could be considered an endless topic, though I believe I dived beyond its surface and got a decent grip. By learning useful representations of the data, you learn different ways of examining the problem. Model evaluation along with representation variations is a key idea behind deep learning models: finding the most appropriate representation of the input data.
The information retrieval topic helped a lot in making my first steps in machine learning, giving me an easier grasp of fields like statistics and probability, and of machine learning methodologies like regression, classification and clustering, all vital for the deep learning breakthrough.
Looking forward to a next topic about the much-promising fields of machine and deep learning. Thanks for reading.
References
- ¹ A pair of images that depict the same building should contain a large number of inliers, while a pair of images depicting different buildings would contain mainly outliers. (Wiki)
- Extraction, preprocessing and evaluation source code repository.