Saturday 16 March 2013

Query Dependent Ranking

WHY?

 - The first question that might come to your mind is what is the need of this 'Query Dependent    Ranking', when we are living in the age of Google which uses Page Rank which is Query Independent Ranking. 


Brief Overview of Query Independent Ranking

- The goal of query-independent ranking is to assign a score to each page that measures the intrinsic quality or authority of the page. According to the assumptions behind hyperlink analysis a page A to which many hyperlinks point is likely to be more authoritative than a page to which few hyperlinks
point. However, an approach that just counts the hyperlinks can be easily manipulated and it ignores the fact that the pages pointing to A also can be different quality. Brin and Page proposed instead the following recursive approach: The authority score, called PageRank, of a page A depends
on the authority scores of the pages pointing to B. More specifically, the PageRank R(A) of page A is defined as
R(A) = δ/n + (1 − δ) X hyperlink (B,A) R(B)/outdegree(B),
where δ is a constant usually chosen between 0.1 and 0.2, n is the number of pages in the collection, and outdegree(B) is the number of hyperlinks on page B.

Problem with Query Independent Ranking

Considering the large differences between different queries, it might not be the best choice to use a
single ranking function to deal with all kinds of queries. Instead, one may achieve performance gain by leveraging the query differences. To consider the query difference in training, one can use a query-dependent loss function. To further consider the query difference in the test process, a query-dependent ranking function is needed. Several ways of learning a query-dependent ranking function are reviewed in this chapter, including query classification-based approach, query clustering-based approach, nearest neighbor-based approach, and two-layer learning-based approach.

- Queries in web search may vary largely in semantics and the users’ intentions they represent, in forms they appear, and in number of relevant documents they have in the document repository. For example, queries can be navigational, informational, or transactional. Queries can be personal names, product names, or terminology. Queries can be phrases, combinations of phrases, or natural language sentences. Queries can be short or long. Queries can be popular (which have many relevant
documents) or rare (which only have a few relevant documents). If one can successfully leverage query differences in the process of learning to rank, one may have the opportunities to achieve better ranking performances in both training and test processes.

QUERY DEPENDENT SEARCH comes to the rescue!!!

In query-dependent ranking each page is assigned a score that measures its quality as well as its relevance to the user query. The basic idea is to build for each query a small subgraph of the whole web graph, called a neighborhood graph, and perform hyperlinks analysis on it. Carriere and Kazman propose the following approach to construct a neighborhood graph: 
  1. A start set of documents matching the query is determined.Specifically, they propose to choose the top 200 matches returned by a search engine for the query. 
  2. This start set is augmented by its “neighborhood”, which is the set of documents that either contain hyperlinks to a document in the start set or are referenced by documents in the start set. 
  3. Each document in the start set and its neighborhood is modeled by a node in the neighborhood graph. Two nodes are considered to be unaffiliated if they are on different web hosts. There exists an edge from node A and node B if and only if A and B are unaffiliated and document A has a hyperlink to document B. Edges between affiliated nodes are not allowed to limit the manipulation of the score. Given the neighborhood graph various ranking approaches were suggested. Carriere and Kazman proposed simply to count the number of hyperlinks in the neighborhood graph.

Problem with Carriere and Kazman approach

- Computing the PageRank on the neighborhood graph produces a very similar ranking, mostly because the neighborhood graph is usually small (on the order of a thousand nodes) and the impact of the global web is lost. 

Now Kleinberg!!

-Kleinberg proposed another approach of ranking pages in the neighborhood graph. It assumes that a topic can be roughly divided into pages with good content, called authorities, and pages with good references, called hubs. To determine the authorities and hubs the algorithm iteratively computes hub and authority scores for each node. The intuition is that high authority nodes should be pointed to by many hubs and especially by the good hubs. Good hub nodes on the other side should point to high authority nodes.
This leads to the following algorithm on a neighborhood graph with node set V and edge set E. 
  1. For every node A in V , initialize its hub score Hub[A] to 1 and its authority score Aut[A] to an arbitrary value.
  2. While the vectors Hub and Aut have not converged:
  • For all A in V , Aut[A] = P(B,A)∈E Hub[B]
  • For all A in V , Hub[A] = P(A,B)∈E Aut[B]
     3.  Normalize the Hub and Aut vectors to 1.

- Using linear algebra it is straightforward to show that the Hub and Aut vectors will eventually converge, but no bound on the number of iterations required for convergence is known. In practice, they converge quickly. 
- The success of this approach is very dependent on the quality of the neighborhood graph. Specifically, the neighborhood graph needs to contain high-quality pages on the query topic. 
- If the majority of the pages is on a different topic then topic drift can occur where the top authority pages belong to the different topic. 
- Another disadvantage is that adding a few edges to the neighborhood graph can considerably change the top authority and hub pages since the neighborhood graph is usually fairly small. Thus, there is the risk of manipulation of the results by adding just a few edges. 

Many improvements and generalizations have been suggested to the above two algorithms. (for eg [2] and [3] in references. 


References:

  1. B. Amento, L. Terveen, and W. Hill. Does authority mean quality? predicting expert quality ratings of web documents. In Proceedings of the 23rd International ACM SIGIR Conference in Research and Development in Information Retrieval (SIGIR 00), pages 296–303, New York, USA, 2000. ACM Press.
  2. K. Bharat and M. Henzinger. Improved algorithms for topic distillation in hyperlinked environments. In Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’98), pages 111–104, 1998.
  3. P. Boldi, M. Santini, and S. Vigna. Pagerank as a function of the damping factor. In Proceedings of the 14th International World Wide Web Conference (WWW2005), pages 557–566, May 2005.
  4. Bian, J., Liu, T.Y., Qin, T., Zha, H.: Ranking with query-dependent loss for web search. In:Proceedings of the 3rd International Conference on Web Search and Web Data Mining (WSDM 2010), pp. 141–150 (2010)
  5. Hyperlink Analysis on the World Wide Web Monika Henzinger Google Inc 8002 Zurich, Switzerland monika@google.com.
  6. Broder, A.: A taxonomy of web search. SIGIR Forum 36(2), 3–10 (2002)




No comments:

Post a Comment