The Sogou-QCL Dataset

The Sogou-QCL dataset was created to support research on information retrieval and related human language technologies. It consists of 537,366 queries, more than 9 million Chinese web pages, and five kinds of relevance labels estimated by click models. In addition, a dataset of 2,000 queries with 4-level human-assessed relevance labels is available to the public for research.



Recently, a number of neural ranking frameworks have been proposed in the information retrieval field to address ad-hoc search. These models usually need a large number of query-document relevance judgments for training. However, obtaining such relevance judgments requires a lot of money and manual effort. To alleviate this problem, researchers seek to use implicit feedback from search engine users, such as clicks, to improve ranking performance.


However, adopting user clicks as supervision signals for training neural ranking models has limitations: during Web search, user clicks are biased and noisy. A number of click models have been proposed to estimate the click probability of a document from query logs, reducing the impact of these biases and inferring the document's relevance to the query. This kind of relevance is called “click model-based relevance”. Thus, in Sogou-QCL, we rely on click models to address the shortage of labeled data.


Dataset Description

The Sogou-QCL dataset is sampled from query logs. It consists of 10 bz2 files that total about 84 GB when compressed. To support research in IR and related areas, we calculated five kinds of click model-based relevance for the query-document pairs in Sogou-QCL. Each query record contains the query text, its appearance frequency, and its documents; for each document, we provide the title, content, HTML source, appearance frequency, and five click model-based relevance values. The relevance values are estimated by click models (TCM, DBN, PSCM, TACM, and UBM) from large-scale query logs collected during April 1st-18th, 2015. Sogou-QCL can be used to support a broad range of research on information retrieval and natural language understanding, such as ad-hoc retrieval and query performance prediction.
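Because the data ships as bz2-compressed parts, each part can be streamed line by line without first unpacking it to disk. The following is a minimal sketch, not an official loader: it assumes the parts are UTF-8 text, and the function name `iter_lines` is hypothetical.

```python
import bz2
from typing import Iterator


def iter_lines(path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield decoded lines from one bz2 part without unpacking it to disk.

    Assumption: the part is text in the given encoding. Undecodable bytes
    are replaced instead of raising, because the real encoding is not stated.
    """
    with bz2.open(path, mode="rt", encoding=encoding, errors="replace") as handle:
        for line in handle:
            # Drop only the trailing newline; leading tabs are kept as-is.
            yield line.rstrip("\n")
```

Called on the path of one downloaded part, the generator keeps memory use flat even though the full collection is about 84 GB compressed.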


Here are some statistics of Sogou-QCL:


Table 1: Statistics of the Sogou-QCL dataset.

In addition, we annotated a small dataset sampled from Sogou-QCL through crowdsourcing. This human-labeled dataset contains 2,000 queries, about 50 thousand documents, and 4-level relevance labels, and it will also be released to researchers.


Dataset Organization

The Sogou-QCL dataset is organized hierarchically, as follows:


		<title>名捕震关东 (豆瓣)</title>
		<content>导演 : 王响伟 /崔凤娟 /韩东编剧 : 华谊剧本工作室...</content>
		<html>html source</html>
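The document fields in the excerpt above can be pulled out with a small helper. This is a minimal sketch that relies only on the `<title>` and `<content>` tags visible in the excerpt; the rest of the record layout is not shown here, so it is ignored. The function name `extract_fields` is hypothetical.

```python
import re
from typing import Dict


def extract_fields(fragment: str) -> Dict[str, str]:
    """Return the title and content text found in one document fragment.

    Only the first occurrence of each tag is used. In the excerpt these
    fields precede <html>, so a <title> inside the raw page source is
    not picked up by mistake.
    """
    fields: Dict[str, str] = {}
    for name in ("title", "content"):
        match = re.search(rf"<{name}>(.*?)</{name}>", fragment, re.DOTALL)
        if match:
            fields[name] = match.group(1).strip()
    return fields
```

A regular expression is used here rather than an XML parser because the `<html>` field holds raw page source, which is unlikely to be well-formed XML.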

For the additional human-labeled dataset, we provide a file that contains the query ID, document ID, and relevance label of each query-document pair, as follows:


q100961	d2624294	3
q101120	d5167219	1
q100965	d2769550	0
q101122	d7859211	2
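The label file shown above can be loaded into a per-query lookup table. This is a minimal sketch, assuming three whitespace-separated columns per line as in the sample; the function name `parse_labels` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


def parse_labels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Group relevance labels by query: {query_id: {doc_id: label}}."""
    labels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        query_id, doc_id, label = parts
        labels[query_id][doc_id] = int(label)
    return dict(labels)
```

For the first sample line, `parse_labels(["q100961\td2624294\t3"])` returns `{"q100961": {"d2624294": 3}}`.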

How to get Sogou-QCL

We provide a demo of Sogou-QCL that contains 10 queries to help researchers get started quickly. To obtain the complete Sogou-QCL dataset, you need to contact us. After you sign an application form online, we can send you the data.



If you use Sogou-QCL in your research, please add the following BibTeX citation to your references. A preprint of the paper can be found here: Sogou-QCL.


