The Search Evaluation Dataset

This dataset was created to support research on session search evaluation. We conducted two user studies, which together contain 675 search sessions for 9 search tasks. Users’ interactions and explicit feedback were collected during the search process.


User satisfaction has received much attention in recent Web search evaluation studies and is regarded as the ground truth for designing better evaluation metrics. However, most existing studies focus on the relationship between satisfaction and evaluation metrics at the query level.

As search requests become more complex, there are many scenarios in which multiple queries and multi-round search interactions are needed (e.g., exploratory search). In those cases, the relationship between session-level search satisfaction and session search evaluation metrics remains uninvestigated.

In this study, we conduct a laboratory study in which users are required to complete several complex search tasks and provide usefulness judgments of documents as well as session-level and query-level satisfaction feedback. We then analyze how users’ perceptions of satisfaction accord with a series of session-level evaluation metrics.
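As a rough illustration of this kind of analysis, the sketch below correlates a per-session metric score with session-level satisfaction ratings. It is not the analysis from the paper: the function name `pearson`, the choice of Pearson's r, and the toy numbers are all assumptions made for illustration, and none of the values come from the dataset.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up values for illustration only (not taken from the dataset):
# one hypothetical metric score and one 1-5 satisfaction rating per session.
metric_scores = [0.2, 0.5, 0.4, 0.9, 0.7]
session_satisfaction = [1, 3, 2, 5, 4]

r = pearson(metric_scores, session_satisfaction)
print(f"Pearson r = {r:.3f}")
```

Because the satisfaction ratings are ordinal, a rank correlation may be a reasonable alternative to Pearson's r for this kind of comparison.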

Data description

This dataset contains two parts: a main user study and a comparison user study. The main user study consists of 450 search sessions for 9 tasks, and the comparison user study consists of 225 search sessions for the same 9 tasks.

For each search task, the participant first reads and memorizes the task description and then repeats it without viewing it. The participant can submit queries and click on results to collect information, as they usually do with commercial search engines. The participant is asked to mark how useful each clicked document was (4-level) and to give 5-level graded satisfaction feedback on each query. Finally, the participant is required to give an answer to the search task and overall 5-level graded satisfaction feedback on the search experience in the task. The collected measures are listed in the following table.

Measure               Type                Description
task                  9 indices           Index used to distinguish tasks
query                 text                User-submitted query
clicked_url           URL                 URL the user clicked
start/end time        numerical           Time at which the user behavior occurs
usefulness            1 (low) ~ 4 (high)  User’s usefulness feedback on a document
query satisfaction    1 (low) ~ 5 (high)  User’s satisfaction feedback on a search query
session satisfaction  1 (low) ~ 5 (high)  User’s satisfaction feedback on a search session
answer                text                User’s answer to a task
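The measures above could be held in memory roughly as sketched below. The on-disk format of the dataset is not specified here, so the class names (`Click`, `Query`, `Session`), the field names, the float timestamps, and the `validate` helper are all hypothetical; only the rating scales (1 to 4 for usefulness, 1 to 5 for satisfaction) come from the table.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Click:
    clicked_url: str
    usefulness: int  # 1 (low) to 4 (high)


@dataclass
class Query:
    text: str
    start_time: float  # assumed numeric timestamp; unit not specified
    end_time: float
    satisfaction: int  # 1 (low) to 5 (high)
    clicks: List[Click] = field(default_factory=list)


@dataclass
class Session:
    task: int  # task index; the numbering scheme is an assumption
    queries: List[Query]
    satisfaction: int  # 1 (low) to 5 (high)
    answer: str


def validate(session: Session) -> List[str]:
    """Return a list of problems found; an empty list means none."""
    problems: List[str] = []
    if not 1 <= session.satisfaction <= 5:
        problems.append(f"session satisfaction out of range: {session.satisfaction}")
    for i, query in enumerate(session.queries):
        if not 1 <= query.satisfaction <= 5:
            problems.append(f"query {i}: satisfaction out of range: {query.satisfaction}")
        if query.end_time < query.start_time:
            problems.append(f"query {i}: end time precedes start time")
        for click in query.clicks:
            if not 1 <= click.usefulness <= 4:
                problems.append(f"query {i}: usefulness out of range: {click.usefulness}")
    return problems


# Illustrative record with made-up values (not taken from the dataset).
example = Session(
    task=1,
    queries=[
        Query(
            text="example query",
            start_time=0.0,
            end_time=12.5,
            satisfaction=4,
            clicks=[Click(clicked_url="http://example.com/doc", usefulness=3)],
        )
    ],
    satisfaction=4,
    answer="example answer",
)
print(validate(example))
```

Nesting clicks under queries and queries under sessions mirrors the three feedback levels (document, query, session), which keeps each rating next to the object it describes.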

How to get the detailed dataset

We provide the data used in the paper we published at the WWW18 conference. To obtain the whole dataset, which contains the detailed user behavior, please contact us. After you sign an application form online, we can send you the data.


If you use this dataset in your research, please add the following BibTeX citation to your references. A preprint of this paper can be found here.

  author    = {Mengyang Liu and
               Yiqun Liu and
               Jiaxin Mao and
               Cheng Luo and
               Shaoping Ma},
  title     = {Towards Designing Better Session Search Evaluation Metrics},
  booktitle = {The 41st International {ACM} {SIGIR} Conference on Research {\&}
               Development in Information Retrieval, {SIGIR} 2018, Ann Arbor, MI,
               USA, July 08-12, 2018},
  pages     = {1121--1124},
  year      = {2018},
  crossref  = {DBLP:conf/sigir/2018},
  url       = {},
  doi       = {10.1145/3209978.3210097},
  timestamp = {Mon, 02 Jul 2018 08:24:13 +0200},
  biburl    = {},
  bibsource = {dblp computer science bibliography,}