The Sogou-QCL Dataset

The Sogou-QCL dataset was created to support research on information retrieval and related human language technologies. It consists of 537,366 queries, more than 9 million Chinese web pages, and five kinds of relevance labels estimated by click models. In addition, a subset of 2,000 queries, in which every query-document pair carries a 4-level human-assessed relevance label, is available to the public for research.

Motivation

In recent years, a number of neural ranking frameworks have been proposed for ad-hoc search in the information retrieval field. These models usually need a large number of query-document relevance judgments for training, but obtaining such judgments takes considerable money, time, and manual effort. To address this problem, researchers have sought to improve ranking performance with implicit feedback from search engine users, such as clicks.

However, user clicks have limitations as supervision signals for training neural ranking models: during Web search, clicks are biased and noisy. A number of click models have been proposed to reduce the impact of these biases. They estimate from query logs the probability that a document is clicked and infer its relevance to the query. We call this kind of relevance “click model-based relevance”. In Sogou-QCL, we therefore use click models to generate relevance labels automatically and so address the shortage of labeled data.

Dataset Description

The Sogou-QCL dataset is sampled from the query logs of Sogou.com, a commercial search engine. It consists of 10 bz2 files that total about 84 GB when compressed. To support research in IR and related areas, we calculated five kinds of click model-based relevance for every query-document pair in Sogou-QCL. Each query record contains the query text, its appearance frequency, and its documents; each document contains its title, content, HTML source, appearance frequency, and five click model-based relevance values. The relevance values are estimated by five click models (TCM, DBN, PSCM, TACM, and UBM) from large-scale Sogou query logs collected from April 1 to April 18, 2015. Sogou-QCL can support a broad range of research on information retrieval and natural language understanding, such as ad-hoc retrieval and query performance prediction.
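Because the compressed parts are large, it can help to read them as a stream instead of decompressing them to disk first. The following is a minimal sketch using Python's standard bz2 module. It assumes the files are UTF-8 text, which is not stated above, and the function names iter_lines and peek are hypothetical helpers, not part of any released tooling.

```python
import bz2
from itertools import islice


def iter_lines(path, encoding="utf-8"):
    """Yield decoded lines from a bz2 file, one at a time.

    The encoding is an assumption; adjust it if the files use another one.
    """
    with bz2.open(path, mode="rt", encoding=encoding) as handle:
        for line in handle:
            yield line.rstrip("\n")


def peek(path, n=5):
    """Return the first n lines of a bz2 file for a quick look at its layout."""
    return list(islice(iter_lines(path), n))
```

Calling peek on one of the downloaded parts (the actual file names are not given here) returns its first few lines without reading the rest of the file.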

Here are some statistics of Sogou-QCL:

Table 1: Statistics of the Sogou-QCL dataset.

In addition, we sampled a set of queries from Sogou-QCL and annotated them through crowdsourcing. This human-labeled dataset contains 2,000 queries, about 50 thousand documents, and 4-level relevance labels. It will also be released to researchers together with Sogou-QCL.

Dataset Organization

The Sogou-QCL dataset is organized hierarchically, as follows:

<q>
	<query>名捕震关东</query>
	<query_frequency>21</query_frequency>
	<query_id>q812</query_id>
	<doc>
		<url>http://movie.douban.com/subject/3025211/</url>
		<doc_id>d1710504</doc_id>
		<title>名捕震关东 (豆瓣)</title>
		<content>导演 : 王响伟 /崔凤娟 /韩东编剧 : 华谊剧本工作室...</content>
		<html>html source</html>
		<doc_frequency>11</doc_frequency>
		<relevance>
			<TCM>0.37604473478</TCM>
			<DBN>0.216979172374</DBN>
			<PSCM>0.499870328303</PSCM>
			<TACM>0.499875283413</TACM>
			<UBM>0.310322121915</UBM>
		</relevance>
	</doc>
	<doc>
	  ...
	</doc>
	...
</q>
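
The record layout above can be read with a few lines of Python. The sketch below is only an illustration, not an official loader. It uses regular expressions instead of an XML parser, on the assumption that the raw page source inside the html element may not be well-formed XML. It also assumes that every field shown above is present in each document. The names tag_text and parse_record are hypothetical.

```python
import re

# The five click models named in the dataset description.
MODELS = ("TCM", "DBN", "PSCM", "TACM", "UBM")


def tag_text(name, text):
    """Return the text of the first <name>...</name> pair in text, or None."""
    match = re.search(rf"<{name}>(.*?)</{name}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_record(record):
    """Parse one <q>...</q> record into a plain dictionary.

    Assumes every field in the sample layout is present in each document.
    """
    docs = []
    for body in re.findall(r"<doc>(.*?)</doc>", record, flags=re.DOTALL):
        docs.append({
            "doc_id": tag_text("doc_id", body),
            "url": tag_text("url", body),
            "title": tag_text("title", body),
            "doc_frequency": int(tag_text("doc_frequency", body)),
            "relevance": {m: float(tag_text(m, body)) for m in MODELS},
        })
    return {
        "query": tag_text("query", record),
        "query_id": tag_text("query_id", record),
        "query_frequency": int(tag_text("query_frequency", record)),
        "docs": docs,
    }


# A shortened copy of the sample record shown above.
SAMPLE = """<q>
<query>名捕震关东</query>
<query_frequency>21</query_frequency>
<query_id>q812</query_id>
<doc>
<url>http://movie.douban.com/subject/3025211/</url>
<doc_id>d1710504</doc_id>
<title>名捕震关东 (豆瓣)</title>
<html>html source</html>
<doc_frequency>11</doc_frequency>
<relevance>
<TCM>0.37604473478</TCM>
<DBN>0.216979172374</DBN>
<PSCM>0.499870328303</PSCM>
<TACM>0.499875283413</TACM>
<UBM>0.310322121915</UBM>
</relevance>
</doc>
</q>"""

record = parse_record(SAMPLE)
print(record["query_id"], len(record["docs"]), record["docs"][0]["relevance"]["DBN"])
```

Keeping the five relevance values in a dictionary keyed by model name makes it easy to choose one click model as the training signal and compare it with the others.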

For the additional human-labeled dataset, we provide a file that lists the query id, document id, and relevance label of each query-document pair, one pair per line, as follows:

q100961	d2624294	3
q101120	d5167219	1
q100965	d2769550	0
q101122	d7859211	2
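
A label file in this format can be loaded into a nested dictionary for lookup by query and document. This is a minimal sketch that assumes the three fields are separated by whitespace (tabs or spaces) and that labels are integers, as in the lines above. The function name load_labels is hypothetical.

```python
from collections import defaultdict


def load_labels(lines):
    """Map query_id -> {doc_id: label} from whitespace-separated lines."""
    labels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        query_id, doc_id, label = parts
        labels[query_id][doc_id] = int(label)
    return dict(labels)


# The four sample lines shown above.
SAMPLE_LINES = [
    "q100961\td2624294\t3",
    "q101120\td5167219\t1",
    "q100965\td2769550\t0",
    "q101122\td7859211\t2",
]

labels = load_labels(SAMPLE_LINES)
print(labels["q100961"]["d2624294"])
```

The same function accepts an open file object, since it only iterates over lines.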

How to get Sogou-QCL

We provide a demo of Sogou-QCL that contains 10 queries to help researchers get started quickly. To obtain the full Sogou-QCL dataset, please contact us by email (chengluo@tsinghua.edu.cn). After you complete an online application form, we will send you the data.

Citation

If you use Sogou-QCL in your research, please add the following BibTeX entry to your references. A preprint of the Sogou-QCL paper is also available.

@inproceedings{Zheng:2018:SND:3209978.3210092,
 author = {Zheng, Yukun and Fan, Zhen and Liu, Yiqun and Luo, Cheng and Zhang, Min and Ma, Shaoping},
 title = {Sogou-QCL: A New Dataset with Click Relevance Label},
 booktitle = {The 41st International ACM SIGIR Conference on Research \& Development in Information Retrieval},
 series = {SIGIR '18},
 year = {2018},
 isbn = {978-1-4503-5657-2},
 location = {Ann Arbor, MI, USA},
 pages = {1117--1120},
 numpages = {4},
 url = {http://doi.acm.org/10.1145/3209978.3210092},
 doi = {10.1145/3209978.3210092},
 acmid = {3210092},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {document ranking, search evaluation, test collection},
}