A test run for a Learning to Rank algorithm

Basic run for a LambdaRank algorithm

What is a learning to rank algorithm?

The problem is of the form: given a set of retrieved results, rank them by relevance.

Why you should be interested: Some use cases

When you type a query and Google suggests completed queries (customised to your usage and taste). Product ranking when shopping for items on e-commerce sites. Search in your email inbox: suppose you type “my family pictures” and it returns relevant results, in ranked order.

Interesting side note: Open source search engine: Apache Solr.

Remember we need a differentiable loss! Sorting itself is not differentiable, hence you turn it into a different problem which can be optimized.

Pointwise versus pairwise versus listwise ranking algorithms

How does the model differ in the above three cases?

Pointwise is like regression. You predict the ‘relevance score’ as a single value (basically a regression), and then later sort these values to rank the results.
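A toy sketch of the pointwise setup (the linear scorer and its weights are made up, standing in for any regression model):

```python
# Pointwise ranking sketch: predict a scalar relevance per document,
# then sort by the predicted score.
def predict_relevance(features):
    # hypothetical linear scorer with made-up weights
    weights = [0.6, -0.2, 0.1]
    return sum(w * f for w, f in zip(weights, features))

docs = {
    "doc_a": [1.0, 0.5, 2.0],
    "doc_b": [3.0, 0.1, 0.0],
    "doc_c": [0.2, 2.0, 1.0],
}
# ranking = documents sorted by predicted score, best first
ranked = sorted(docs, key=lambda d: predict_relevance(docs[d]), reverse=True)
```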

ListNet algorithm

Permutation probability: the set of scores is converted into a probability value for each possible ranking (permutation) of the list. There could be some function which does this. The scores obtained from our features, one per result, are cleverly transformed into a probability distribution over rankings, where a more suitable ranking gets a higher probability value and a less suitable ranking gets a lower one. Makes sense, eh?

$ P_s (\pi) = \prod_{j=1}^n \frac{\phi(s_{\pi(j)})}{\sum_{k=j}^n \phi(s_{\pi(k)})} $
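The formula above can be checked numerically on a toy example (taking $\phi = \exp$, a common choice; the scores are made up):

```python
import math
from itertools import permutations

def permutation_probability(scores, pi):
    """P_s(pi): product over ranks of phi(score) over the remaining items, phi = exp."""
    phi = [math.exp(scores[j]) for j in pi]
    p = 1.0
    for r in range(len(pi)):
        p *= phi[r] / sum(phi[r:])
    return p

scores = [2.0, 1.0, 0.5]  # hypothetical model scores for 3 items
probs = {pi: permutation_probability(scores, pi)
         for pi in permutations(range(3))}
# the probabilities over all n! permutations sum to 1
assert abs(sum(probs.values()) - 1.0) < 1e-9
# the ranking that sorts items by descending score is the most likely one
best = max(probs, key=probs.get)
```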

Your data would already have some relevance scores included in it.

What would be the nature of your gradients?

A neural network is used to model the scoring function, which learns how to rank using this loss value.

Explaining the algorithm: we are able to use gradient descent in this scenario too.

What is the loss function in this case?

The learning objective is defined as minimization of this loss: $ \sum_{i=1}^m L(y^{i}, z^{i}) $ where $z^{i}$ is the score list from your ‘model’ and $y^{i}$ is the relevance list.

The problem is changed to top-one probability for simplification, where we just look at the probability of those permutations where the $j^{th}$ item is ranked first.

$P_s(j)= \sum_{\pi(1)=j} P_s (\pi)$ And then the authors show, through a reduction, that this value is equivalent to the following.

So, voila, it becomes: $ P_s(j) = \frac{\phi(s_j)}{\sum_{k=1}^n\phi(s_k)} $
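A quick numerical sanity check of this reduction (again taking $\phi = \exp$; the scores are made up): summing $P_s(\pi)$ over all permutations that place item $j$ first does match the closed-form softmax:

```python
import math
from itertools import permutations

def perm_prob(scores, pi):
    # Plackett-Luce style permutation probability with phi = exp
    phi = [math.exp(scores[j]) for j in pi]
    p = 1.0
    for r in range(len(pi)):
        p *= phi[r] / sum(phi[r:])
    return p

scores = [2.0, 1.0, 0.5]
n = len(scores)
# brute force: sum P_s(pi) over all permutations with pi(1) == j
brute = [sum(perm_prob(scores, pi)
             for pi in permutations(range(n)) if pi[0] == j)
         for j in range(n)]
# closed form: with phi = exp, just a softmax of the scores
denom = sum(math.exp(s) for s in scores)
closed = [math.exp(s) / denom for s in scores]
assert all(abs(a - b) < 1e-9 for a, b in zip(brute, closed))
```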

Then we can use any metric to calculate the distance between two score lists. For example, with cross entropy, the listwise loss function becomes:

$ L(y^{i}, z^{i})= - \sum_{j=1}^n P_{y^i}(j) \log (P_{z^i}(j)) $
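As a sketch (plain Python, not the paper's code), this listwise cross-entropy over top-one distributions with $\phi = \exp$ looks like; the labels and scores below are made up:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def listnet_loss(y, z):
    """Cross entropy between top-one distributions of labels y and model scores z."""
    py, pz = softmax(y), softmax(z)
    return -sum(p * math.log(q) for p, q in zip(py, pz))

y = [3.0, 1.0, 0.0]        # graded relevance labels for one query
z_good = [2.5, 0.8, -0.3]  # scores that preserve the target order
z_bad = [-0.3, 0.8, 2.5]   # scores that reverse it
# order-preserving scores should incur a lower loss
assert listnet_loss(y, z_good) < listnet_loss(y, z_bad)
```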

which is basically analogous to treating $P(j)$ as a class probability, and then doing cross entropy over the possible classes.

Data is arranged as $\{(x^1,y^1), (x^2,y^2), \cdots, (x^m, y^m)\}$, m data pairs. For your data labels (targets/relevance), you can arrange them as, say, [3, 1, 0], and then take the softmax of these values to get relevance grades.

What kind of losses can we use with ListNet?

ListNet gave us two distributions: the predicted top-one distribution from the model's scores, and the target distribution, obtained by converting our raw relevance labels to a distribution using softmax.

This blog suggested: why don’t we use KL divergence between these two probability distributions? The original paper made use of cross entropy loss between the two.
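In fact the choice barely matters for training: KL divergence and cross entropy differ only by the entropy of the target distribution, which is constant with respect to the model, so their gradients coincide. A sketch with made-up labels and scores:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = softmax([3.0, 1.0, 0.0])  # target distribution from relevance labels
q = softmax([2.0, 0.5, 0.1])  # model's top-one distribution
entropy_p = -sum(pi * math.log(pi) for pi in p)
# KL(p || q) = CE(p, q) - H(p): they differ by a model-independent constant
assert abs(kl_divergence(p, q) - (cross_entropy(p, q) - entropy_p)) < 1e-9
```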

So we can see that this can be a neural model, or basically any model which can predict those scores.

Metrics

Simplest: Reciprocal Rank. A relevant item should be at the top; if it slips to 2nd position, it has less chance of getting exposure. The value decays as 1/rank, so it falls off quickly at first and then more slowly.
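A minimal sketch of reciprocal rank, and its average over queries (Mean Reciprocal Rank); the item ids are made up:

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """1 / (1-indexed position of the first relevant item), or 0 if absent."""
    for pos, item in enumerate(ranked_ids, start=1):
        if item == relevant_id:
            return 1.0 / pos
    return 0.0

assert reciprocal_rank(["b", "a", "c"], "a") == 0.5  # relevant item at rank 2

# Mean Reciprocal Rank averages this over queries
queries = [(["a", "b"], "a"), (["b", "a"], "a")]
mrr = sum(reciprocal_rank(ranking, target) for ranking, target in queries) / len(queries)
assert mrr == 0.75
```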

What about NDCG measures?
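A minimal NDCG sketch, using the linear-gain DCG variant with the usual log2 position discount (another common variant uses $2^{rel} - 1$ as the gain); the relevance lists are made up:

```python
import math

def dcg(relevances):
    # relevances are listed in the order the system ranked the items
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    # normalize by the DCG of the ideal (descending-relevance) ordering
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

assert ndcg([3, 2, 1]) == 1.0  # already ideally ordered
assert ndcg([1, 2, 3]) < 1.0   # reversed order scores lower
```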

What about triplet losses?

We can use a triplet loss if we want to output, for a given image, a ranking of the images closest to this particular person. Exact matching of images would require the model to be re-trained every time a new image is added (really??).

What is being learnt under the triplet loss paradigm? The function learns how to compare two things and tell whether they are similar (or similar enough) or not.
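A minimal sketch of the triplet hinge loss over embedding vectors; the embeddings and margin here are made up:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss: push d(anchor, positive) below d(anchor, negative) by margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

anchor = [1.0, 0.0]    # embedding of a query image
positive = [0.9, 0.1]  # embedding of the same person
negative = [0.0, 1.0]  # embedding of someone else
# positive already closer than negative by more than the margin: zero loss
assert triplet_loss(anchor, positive, negative) == 0.0
# swapping positive and negative violates the margin: positive loss
assert triplet_loss(anchor, negative, positive) > 0.0
```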

A side thought about random negative sampling in ranking

This appears to be a very useful tool.

By the way: what is a Siamese Network?

Reference papers:

Learning to Rank: From Pairwise Approach to Listwise Approach, Cao et al.

I also loved this blog https://theiconic.tech/learning-to-rank-is-good-for-your-ml-career-part-2-lets-implement-listnet-11af69d1704