The Sogou-ST Dataset

We provide the Chinese-centric Sogou-ST dataset to support research on a wide range of session-level Information Retrieval (IR) tasks. Refined from an 18-day search log from Sogou, the second-largest search engine in China, the dataset consists of 147,155 refined Web search sessions, 40,596 unique queries, 297,597 Web pages, and six kinds of weak relevance labels assessed by click models. We also sample a subset of 2,000 sessions from Sogou-ST and collect 5-level human relevance labels for the documents of the last query in each sampled session.



Recently, numerous studies have shown the advantages of considering context information in various IR tasks such as session search and query suggestion. However, the lack of suitable datasets limits the progress of related research, and few test collections are available for session-level IR. Among them, the TREC Session Tracks, which ran from 2011 to 2014, are the most widely used. They provide test collections with various forms of implicit feedback as well as human relevance labels, so that participants can optimize document ranking for the last query in a session. However, these collections were mainly gathered through user studies or crowdsourcing experiments with simulated search tasks. They therefore may not represent real-world Web search scenarios, and they contain only tens to thousands of sessions, which is usually insufficient for training more sophisticated models. The large-scale AOL search log, in contrast, was collected from real users, but it is noisy and outdated (a certain proportion of its URLs are no longer accessible). To address these issues, we provide this large-scale refined session benchmark.

近年来,有很多的研究表明在各种信息检索任务中引入上下文因素可以更好地提升系统性能, 例如会话搜索任务、查询推荐任务等等。然而,由于缺少相应合理的数据集限制了相关研究的进展。在已有的数据集中,2011-2014年由TREC Session Tracks提供的会话数据集被学术界广泛地使用。Session Tracks 提供了具有各种用户反馈信息的会话数据以及人工标注相关性标签,以使参赛者能够利用这些信息来改进会话中最后一个查询下的文档排序性能。然而这些会话数据主要是通过基于模拟搜索任务的用户实验或者众包实验收集的,数据量较小且不能反映真实的用户搜索场景。 另外,从真实用户中收集的大规模AOL日志数据,其发布年代较为久远,还包含许多噪音。 基于以上考虑,我们发布了一份全新的大规模会话数据集来支持相关的研究工作。

Dataset Description

The Sogou-ST dataset is extracted from an 18-day Sogou search log. It consists of XML-formatted session data (*.zip, about 1.5GB), a crawled Web page set (*.zip, about 1.95GB), and a human label file (*.txt, 542KB).


The raw log data contains abundant Web search sessions mingled with noise, so it is hard to use directly for research purposes. To tackle this issue, we refine the sessions through a series of procedures that filter out the noisy data step by step. These steps include filtering sessions that contain pornographic, violent, or politically sensitive content, removing sessions with long-tail queries, and discarding sessions without any clicks, among others. More details can be found in the corresponding paper.

原始的日志数据包含噪音,难以直接用于研究。为此,我们通过一系列步骤将数据进行清洗和提炼。这些步骤包括:过滤包含色情、暴力和政治敏感词的会话, 去掉包含稀有查询的会话以及过滤不含任何点击的会话等等。更多详细操作请参见相应的论文。

Basic statistics of Sogou-ST compared to some existing session datasets are as follows:


Figure 1: Format of the session data.

Dataset Organization

The session data is organized in a prettified XML format similar to that of the TREC Session Tracks, as shown below.

Sogou-ST的会话数据按照TREC Session Tracks组织为XML格式,如下图所示:

	<session num="348" starttime="1427889449.3">
		<interaction num="2" starttime="1427889470.05">
				<result rank="1">
					<title>名侦探柯南 国语版-动漫动画-全集高清正版视频在线观看-爱奇艺</title>
				<click num="1" starttime="1427889474.44">

A session consists of several search interactions together with a clicked-document list. Each interaction represents a search iteration in which a user submits an independent query and receives the top 10 documents from the search engine. For each interaction, the query text and query identifier are provided. For each document in the result list, the URL, document identifier, title, and six click-based relevance labels are given. In addition, the start timestamps of all sessions, interactions, and clicked documents are provided to support dwell-time-based models. Titles of Web pages that we failed to crawl are replaced by UNK.
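As a rough illustration of how the session XML might be read, the sketch below parses one session with Python's standard library and derives the gaps between consecutive click timestamps as a simple dwell-time proxy. Only the tags and attributes visible in the excerpt above (session, interaction, result, title, click; num, rank, starttime) are taken from the data; the exact nesting, the second click, and the function name `summarize_session` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. The nesting of <click> inside <interaction> and the
# second click are assumptions; the real files carry more fields
# (query text, URLs, document ids, click-based labels).
SAMPLE = """
<session num="348" starttime="1427889449.3">
  <interaction num="2" starttime="1427889470.05">
    <result rank="1">
      <title>UNK</title>
    </result>
    <click num="1" starttime="1427889474.44"/>
    <click num="2" starttime="1427889500.44"/>
  </interaction>
</session>
"""


def summarize_session(xml_text):
    """Return, per interaction, the result titles and gaps between clicks."""
    session = ET.fromstring(xml_text.strip())
    summary = []
    # iter() searches all descendants, so this does not depend on the
    # exact nesting depth of <result> and <click>.
    for interaction in session.iter("interaction"):
        titles = [r.findtext("title", default="UNK")
                  for r in interaction.iter("result")]
        click_times = sorted(float(c.get("starttime"))
                             for c in interaction.iter("click"))
        # Gap between consecutive clicks: a dwell-time proxy for every
        # click except the last one in the interaction.
        gaps = [round(later - earlier, 2)
                for earlier, later in zip(click_times, click_times[1:])]
        summary.append({
            "interaction": interaction.get("num"),
            "titles": titles,
            "click_gaps": gaps,
        })
    return summary


if __name__ == "__main__":
    print(summarize_session(SAMPLE))
```

The dwell time of the last click in an interaction cannot be derived this way; one option is to use the start time of the next interaction as an upper bound.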


Each file in the Web page set is named after its document identifier and contains the word sequence of the corresponding Web page's content. We apply the open-source jieba_fast tool for Chinese word segmentation.
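A minimal sketch of loading one segmented page by its document identifier follows. It assumes that the file name is exactly the identifier (no extension), that the files are UTF-8 encoded, and that tokens are separated by whitespace; none of these details is stated above, and `load_segmented_page` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def load_segmented_page(page_dir, doc_id, encoding="utf-8"):
    """Return the token list for one document, or None if no file exists."""
    # Assumption: the file is named exactly after the document identifier.
    path = Path(page_dir) / doc_id
    if not path.is_file():
        return None
    # Assumption: segmented words are whitespace-separated.
    return path.read_text(encoding=encoding).split()


if __name__ == "__main__":
    # Demo on a temporary directory standing in for the unzipped page set.
    with tempfile.TemporaryDirectory() as page_dir:
        (Path(page_dir) / "d19378").write_text("名侦探 柯南 国语版",
                                               encoding="utf-8")
        print(load_segmented_page(page_dir, "d19378"))
        print(load_segmented_page(page_dir, "d0"))  # missing file
```

Returning None for a missing file keeps pages that could not be crawled distinguishable from pages that are present but empty.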


The human label file contains the sample id, session id, query id, document id, relevance label, and a Web page validity flag (whether the Web page is valid), as follows:


1	1844	q2124	d19378	2	1
1	1844	q2124	d19375	0	0
1	1844	q2124	d19374	1	1
1	1844	q2124	d19377	2	1
1	1844	q2124	d19376	2	1
1	1844	q2124	d19371	2	1
1	1844	q2124	d19370	2	1
1	1844	q2124	d19373	1	1
1	1844	q2124	d19372	0	1
1	1844	q2124	d19369	2	1
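The rows above can be read with a few lines of Python. The sketch below assumes tab-separated columns in the order listed (sample id, session id, query id, document id, relevance label, validity flag) and assumes that a validity flag of 1 means the page is valid; the names `HumanLabel`, `parse_labels`, and `valid_labels_by_query` are illustrative only.

```python
import csv
from collections import defaultdict
from typing import NamedTuple


class HumanLabel(NamedTuple):
    sample_id: str
    session_id: str
    query_id: str
    doc_id: str
    relevance: int
    valid: bool


def parse_labels(lines):
    """Parse tab-separated label rows, skipping rows without six fields."""
    labels = []
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != 6:
            continue
        sample_id, session_id, query_id, doc_id, relevance, valid = row
        # Assumption: a validity flag of "1" marks a valid Web page.
        labels.append(HumanLabel(sample_id, session_id, query_id, doc_id,
                                 int(relevance), valid == "1"))
    return labels


def valid_labels_by_query(labels):
    """Map (session_id, query_id) to {doc_id: relevance} for valid pages."""
    grouped = defaultdict(dict)
    for label in labels:
        if label.valid:
            grouped[(label.session_id, label.query_id)][label.doc_id] = label.relevance
    return dict(grouped)


# First three rows of the excerpt above.
SAMPLE_ROWS = (
    "1\t1844\tq2124\td19378\t2\t1\n"
    "1\t1844\tq2124\td19375\t0\t0\n"
    "1\t1844\tq2124\td19374\t1\t1\n"
)

if __name__ == "__main__":
    print(valid_labels_by_query(parse_labels(SAMPLE_ROWS.splitlines())))
```

Grouping by (session id, query id) yields one relevance dictionary per labeled last query, which is the shape most ranking-evaluation code expects.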

How to get Sogou-ST

We provide a demo of Sogou-ST containing 2 sessions to help researchers get started quickly. To obtain the full Sogou-ST dataset, please contact us. After you sign an application form online, we will send you the data.



If you use Sogou-ST in your research, please add the following BibTeX citation to your references. A preprint of the paper can be found here: Sogou-ST.


  title={Sogou-ST: A New Dataset with Large-scale Refined Real-world Web Search Sessions},
  author={Chen, Jia and Mao, Jiaxin and Liu, Yiqun and Zhang, Min and Ma, Shaoping},
  booktitle={Proceedings of the 28th ACM International Conference on Information and Knowledge Management},