The Sogou-SRR Dataset

Chinese version

The Sogou-SRR (Search Result Relevance) dataset was constructed to support research on search engine relevance estimation and ranking tasks. The dataset consists of 6,338 queries and their corresponding top 10 search results. For each search result, the screenshot, title, snippet, HTML source code, parse tree, and URL are provided, along with a 4-grade relevance score (1-4) and the result type. The queries are sampled from the search logs of Sogou.com. Sampled queries with frequencies between 100 and 10,000 are usually regarded as torso queries, which are usually the most important concern in ranking algorithm design.

Image 1: An example from the Sogou-SRR dataset.

Data Statistics

In total, Sogou-SRR contains 6,338 queries and their corresponding top 10 search results. The length distributions of queries, titles, and snippets, as well as the distribution of relevance labels, are shown below. The average width and height of the search result screenshots are 549 and 128 pixels, respectively.

| #Query | #Search Result |
| --- | --- |
| 6,338 | 63,380 |

Image 2: Statistics of SRR dataset.

All search results are manually divided into 19 categories according to their presentation styles. The detailed descriptions of different result types are shown below.

| Result Type | Description |
| --- | --- |
| Organic Result | A single blue hyperlink with a short snippet. |
| Illustrated Vertical | Consists of a title, a snippet, and an illustration on the left of the search result. |
| Encyclopedia Vertical | Search results from encyclopedia websites, usually with a layout similar to Illustrated Verticals. |
| Image Vertical | Composed of one row of images. |
| Video Vertical | Composed of one row of video snapshots. |
| Multi-row Image Vertical | Composed of multiple rows of images. |
| Multi-row Video Vertical | Composed of multiple rows of video snapshots. |
| Tutorial Vertical | Provides instructions for answering a question, usually with diagrams showing multiple steps. |
| Forum Vertical | Search results from forum websites, usually with an image on the left and a list of hyperlinks on the right. |
| Map Vertical | Consists of a zoomed map and an input box. |
| News Vertical | An aggregation of multiple news results, one of which is shown in detail and usually illustrated with an image, while the others show only their titles. |
| Question Answering Vertical | An aggregation of multiple answers from a Community Question-Answering site, one of which is shown in detail, while the others show only their titles. |
| Textual Vertical | Hyperlinks to different channels of a website, with corresponding snippets. |
| Download Vertical | Direct download links for the software described by the query. |
| Direct Answer Vertical | Directly shows the information requested by the query. |
| Application Vertical | Embedded applications that users can interact with directly on SERPs, such as music or express delivery inquiry services. |
| Navigation Vertical | A catalog of TV series, books, and so on. |
| Shopping Vertical | Shopping search results from e-commerce websites. |
| Others | Search results that belong to none of the above categories. |

Data Instructions

The files and directories contained in Sogou-SRR are shown below. The dataset is about 3.2 GB in total when compressed.

| File or Directory | Data |
| --- | --- |
| SRR.json | All the information needed for search results in Sogou-SRR. |
| Screenshot/ | The screenshots of search results. |
| Tree/xml_raw/ | The XML files parsed directly from the HTML source code of search results. |
| Tree/xml/ | The XML files of pruned search result parse trees. |
| Tree/image/ | The images in search result parse trees. |
| Train.txt/Val.txt/Test.txt | The training/validation/testing queries. |
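
As a rough illustration of how the layout above might be used, the following Python sketch resolves the files belonging to one search result and reads a split file. It rests on assumptions that are not confirmed here: that the "screenshot" and "tree" fields of a result hold file names relative to "Screenshot/" and "Tree/xml/" (or "Tree/xml_raw/"), and that the split files list one query per line in UTF-8. The helper names `result_paths` and `read_split` and the root directory name "Sogou-SRR" are hypothetical.

```python
from pathlib import Path
import tempfile


def result_paths(root, result, pruned=True):
    """Build file paths for one search result (hypothetical helper).

    Assumes result["screenshot"] and result["tree"] are bare file names.
    """
    root = Path(root)
    tree_dir = "xml" if pruned else "xml_raw"  # pruned vs. raw parse tree
    return {
        "screenshot": root / "Screenshot" / result["screenshot"],
        "tree": root / "Tree" / tree_dir / result["tree"],
    }


def read_split(path):
    """Read a split file, assuming one query per line (unverified format)."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Demo with a temporary stand-in for Train.txt; the queries are made up.
with tempfile.TemporaryDirectory() as tmp:
    split_file = Path(tmp) / "Train.txt"
    split_file.write_text("cat\ndog\n", encoding="utf-8")
    train_queries = read_split(split_file)

paths = result_paths("Sogou-SRR", {"screenshot": "cat_0.png", "tree": "cat_0.xml"})
print(train_queries)
print(paths["screenshot"].as_posix())
print(paths["tree"].as_posix())
```

Passing `pruned=False` points the "tree" path at "Tree/xml_raw/" instead of "Tree/xml/".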

“SRR.json” is organized hierarchically. Keys in “results” denote the positions of search results (from 0 to 9); a smaller key means a higher position in the search engine's original ranking list. The key “tree” denotes the parse tree of each search result, which can be the XML file in either “Tree/xml_raw/” or “Tree/xml/”.

[...
  {
    "query": "cat",
    "results": {
      "0": {
        "screenshot": "cat_0.png",
        "title": "Cat - Wikipedia",
        "snippet": "Kingdom: Animalia Abstract...",
        "html": "<div ...>",
        "tree": "cat_0.xml",
        "url": "https://en.wikipedia.org/wiki/Cat",
        "relevance": 4,
        "result type": "Encyclopedia Vertical"
      },
      ...
      "9": {
        "screenshot": "cat_9.png",
        "title": "Adopt a cat | Blue Cross",
        "snippet": "We have lots of lovely cats...",
        "html": "<div ...>",
        "tree": "cat_9.xml",
        "url": "https://www.bluecross.org.uk/rehome/cat",
        "relevance": 3,
        "result type": "Organic Result"
      }
    }
  }
...]
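
The structure above can be traversed with a few lines of Python. This is a minimal sketch rather than an official loader: it assumes the real file is valid JSON with a top-level list of query records, and the helper names `load_srr` and `iter_results` are hypothetical. The in-memory `sample` stands in for the parsed file and contains only illustrative values.

```python
import json
from collections import Counter


def load_srr(path):
    """Parse SRR.json, assuming a top-level JSON list of query records."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def iter_results(records):
    """Yield (query, position, result) in the original ranking order."""
    for record in records:
        results = record["results"]
        # Position keys are strings ("0" to "9"), so sort them numerically.
        for key in sorted(results, key=int):
            yield record["query"], int(key), results[key]


# Stand-in for load_srr("SRR.json"); keys are deliberately out of order.
sample = [
    {
        "query": "cat",
        "results": {
            "9": {"title": "Adopt a cat | Blue Cross", "relevance": 3,
                  "result type": "Organic Result"},
            "0": {"title": "Cat - Wikipedia", "relevance": 4,
                  "result type": "Encyclopedia Vertical"},
        },
    }
]

rows = list(iter_results(sample))
label_counts = Counter(result["relevance"] for _, _, result in rows)
type_counts = Counter(result["result type"] for _, _, result in rows)

for query, position, result in rows:
    print(query, position, result["title"], result["relevance"])
print(dict(label_counts))
print(dict(type_counts))
```

The two counters tally relevance labels and result types over whatever records are passed in, which is one simple way to check a local copy against the published statistics.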

How to get Sogou-SRR

We provide a demo of Sogou-SRR that contains 10 queries to help researchers get started quickly. To obtain the full Sogou-SRR dataset, please contact us (chengluo@tsinghua.edu.cn). After you sign an online application form, we will send you the data.

Citation

If you use Sogou-SRR in your research, please cite the following paper using the BibTeX entry below. A preprint of the paper can be found here: Sogou-SRR.

@inproceedings{JointRelevanceEstimation,
 title = {Relevance Estimation with Multiple Information Sources on Search Engine Result Pages},
 author = {Zhang, Junqi and Liu, Yiqun and Ma, Shaoping and Tian, Qi},
 booktitle={Proceedings of the 2018 ACM on Conference on Information and Knowledge Management},
 year = {2018},
 numpages = {10},
 organization={ACM}
}