

The Session Search (SS) task is a core task in NTCIR-17 that supports intensive investigation of session search, or task-oriented search.

Nowadays, users increasingly depend on search engines to gain useful information or to complete tasks. In complex search scenarios, a single query may not fully cover a user's information need. Users therefore submit multiple queries to a search system within a short time interval until they are satisfied or give up. Such a search process is called a search session or task. As users' search intents may evolve within a session, their actions and decisions are also greatly impacted. Going beyond ad-hoc search and considering the contextual information within sessions has proven effective for user intent modeling in the IR community. To this end, we proposed the Session Search (SS) task as a pilot task in NTCIR-16.

In this second year of the SS task, we retain the settings that provide (1) large-scale practical session datasets for model training and (2) both ad-hoc and session-level evaluation. We will update the test set with data collected in an upcoming field study. In addition to the settings above, we will introduce a new subtask in which participants design better session-level search effectiveness evaluation metrics. We believe this will facilitate the development of the IR community in this domain.


Following NTCIR-16 SS, we retain its two subtasks, which assess query-level and session-level search effectiveness respectively:

  • Fully Observed Session Search (FOSS): For a session of length k, we provide the full session contexts of the first (k-1) queries. Participants need to re-rank the candidate documents for the last query of the session. This setting follows the TREC Session Tracks and enables ad-hoc evaluation with metrics such as nDCG, AP, and RBP.
  • Partially Observed Session Search (POSS): In this subtask, we truncate each session before its last query. For a session with k queries (k ≥ 2), we only reserve the session contexts of the first n queries, where 1 ≤ n ≤ k-1. The value of n varies across sessions. Participants need to re-rank documents for each of the last (k-n) queries according to the partially observed contextual information from the previous search rounds. Session-level metrics such as RS-DCG and RS-RBP will be adopted to evaluate system effectiveness.

In addition, this year we will include a third subtask to facilitate the development of session-level search effectiveness metrics:

  • Session-level Search Effectiveness Estimation (SSEE): We will provide a set of web sessions with full user interaction behaviors. Participants can utilize user feedback to construct new session-level search effectiveness evaluation measures. To meta-evaluate the soundness of all proposed measures, we will compare each measure's consistency with golden user satisfaction labels by calculating correlation coefficients such as Pearson's r and Spearman's ρ.
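The SSEE meta-evaluation described above can be sketched in a few lines. This is a rough illustration, not the official tool; the per-session measure values and satisfaction labels below are made-up toy data, and the plain (tie-free) Spearman formulation is our simplification.

```python
# Minimal sketch of SSEE meta-evaluation: correlating a proposed
# session-level measure with gold satisfaction labels.

def pearson_r(xs, ys):
    """Pearson's r between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman_rho(xs, ys):
    """Spearman's rho: Pearson's r computed on ranks (no tie correction)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank + 1.0
        return r
    return pearson_r(ranks(xs), ranks(ys))

measure = [0.71, 0.42, 0.88, 0.30, 0.65]  # hypothetical per-session scores
satisfaction = [4, 2, 5, 1, 3]            # hypothetical gold labels
print(round(pearson_r(measure, satisfaction), 3))    # strong positive correlation
print(round(spearman_rho(measure, satisfaction), 3)) # 1.0: identical ordering
```

A measure whose values rank sessions in the same order as the satisfaction labels reaches ρ = 1.0 even when the raw values differ in scale, which is why both a linear (Pearson) and a rank-based (Spearman) coefficient are reported.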
|  | NTCIR-17 Session Search (SS) | TREC Session Tracks |
| --- | --- | --- |
| Number of Sessions | Training: 147,154, with human relevance labels for the last query of 2,000 sessions. FOSS testing: 1,184. POSS testing: 976. SSEE testing: 1,174. | English: 76~1,257 |
| Session Datasets | TianGong-ST; TianGong-SS-FSD / TianGong-Qref; an unreleased field study dataset | Session Track 2011-2014 |
| Document Collection | Training corpus: about 1,000,000 documents. Test corpus: a collection provided by T2Ranking, with about 2.3M web pages. |  |
| Source/Generation of session data | Refined from a real-world search log or extracted from two large-scale field studies | Generated by real search users based on manually designed topics |
| Support from log analysis for annotation? | ✓ | × |
| Support session-level evaluation? | ✓ | × |

Expected Results

This year, we plan to attract at least 10 active participating teams. As in the first year, we construct Chinese-centric session collections only. Participants can leverage the training set for either single-task or multi-task learning. The number of test sessions in each subtask is expected to exceed 400. After collecting run files from all teams, we will recruit assessors to annotate four-scale relevance labels for all pooled results. Both query-level and session-level metrics will then be adopted for system effectiveness evaluation. We will also meta-evaluate the query-level and session-level search effectiveness estimates from participants by calculating the consistency between these measure values and ground-truth user satisfaction labels. Through these efforts, we hope to foster technological breakthroughs in understanding and optimizing multi-turn search systems.

Important Dates

All deadlines are 11:59 p.m. in the Anywhere on Earth (AoE) timezone.
Session Search registration Due: June 30, 2023
Dataset Release: June 30, 2023
Formal Run: July 2023 - August 2023
Evaluation Result Release: September 1, 2023
Draft Task Overview Paper Release: September 15, 2023
Draft Participant Paper Submission Due: October 1, 2023
All Camera-ready Paper Submission Due: November 1, 2023
NTCIR-17 Conference in NII, Tokyo, Japan: December 2023

Evaluation Measures

For the FOSS subtask, we adopt query-level metrics such as nDCG, AP, and RBP.
For the POSS subtask, we use session-level metrics such as RS-DCG and RS-RBP.
For the SSEE subtask, we compare each proposed measure's consistency with golden user satisfaction labels by calculating correlation coefficients.
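For reference, the query-level nDCG used in FOSS-style evaluation can be sketched as below. This is the common exponential-gain formulation; the exact gain/discount variant used by the official tool may differ, and the relevance labels in the example are hypothetical.

```python
# Sketch of nDCG@k over a ranked list of graded relevance labels.
import math

def dcg(gains, k):
    """Discounted cumulative gain with exponential gain, log2 discount."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg(gains, k):
    """DCG normalized by the ideal (descending-sorted) ranking's DCG."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

# Hypothetical 4-scale labels, in rank order, for a list of 5 documents:
print(round(ndcg([3, 1, 0, 2, 0], k=5), 3))
```

Swapping the third- and fourth-ranked documents here would raise the score, since the discount makes early positions count more.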

The official evaluation tool is coming soon!

Data and File format

We provide three directories:

|----- ./document_collection
|          |----- [training corpus: a collection with about 1,000,000 pages; each directory contains about 10,000 files]
|          |----- test_doc [T2Ranking collection.tsv]
|----- ./sessions
|          |----- ./training
|          |          |----- training_sessions.txt
|          |----- ./testing
|                     |----- ./FOSS
|                     |          |----- testing_sessions_foss.txt
|                     |----- ./POSS
|                     |          |----- testing_sessions_poss.txt
|                     |----- ./SSEE
|                                |----- testing_sessions_ssee.txt
|----- ./training_human_labels
|          |----- human_labels.txt
|----- README.txt

1) First, in all session files, sessions are separated by two line breaks (\n\n).

2) Each training session is formatted as follows:

SessionID    87

画杨桃    q198    1427848224.93
1    d1882    404    0    -1
2    d1883    <unk>    0    -1
3    d1884    画杨桃-搜索页    0    -1
4    d1885    画杨桃    0    -1
5    d1886    人教版小学三年级下册语文《画杨桃》教学设计优质课教案    0    -1
6    d5    微信,是一个生活方式    0    -1
7    d1887    【图文】画杨桃_百度文库    0    -1
8    d1888    画杨桃课件_    0    -1
9    d1889    搜狗搜索    0    -1
10    d1890    画杨桃_三年级语文下册课件_奥数网    0    -1


画杨桃ppt课件    q199    1427848230.2
1    d1894    【图文】画杨桃ppt课件精品_百度文库    1    1427848232.105
2    d1895    《画杨桃》PPT课件    0    -1
3    d1896    《画杨桃》PPT课件6    0    -1
4    d1897    《画杨桃》公开课ppt课件(24页)-免费高速下载    0    -1
5    d1898    <unk>    0    -1
6    d1899    画杨桃ppt课件下载    0    -1
7    d1900    <unk>    0    -1
8    d1901    11画杨桃PPT课件_管理资源吧    0    -1
9    d1902    《画杨桃》ppt课件-免费高速下载    0    -1
10    d1903    《画杨桃》ppt课件【13页】-免费高速下载    0    -1


画杨桃ppt    q200    1427848257.0
1    d1904    【图文】画杨桃PPT_百度文库    1    1427848258.188
2    d1895    《画杨桃》PPT课件    0    -1
3    d1905    【图文】画杨桃ppt课件_百度文库    0    -1
4    d1897    《画杨桃》公开课ppt课件(24页)-免费高速下载    0    -1
5    d1900    <unk>    0    -1
6    d1906    《画杨桃》ppt课件(19页)-免费高速下载    0    -1
7    d1907    <unk>    0    -1
8    d1903    《画杨桃》ppt课件【13页】-免费高速下载    0    -1
9    d1896    《画杨桃》PPT课件6    0    -1
10    d1908    【精品】:画杨桃PPT    0    -1
  • The first line of a session: SessionID<tab><session ID>, such as SessionID 87.

  • The first line in a query: <query string><tab><query ID><tab><query start time>, such as 画杨桃 q198 1427848224.93.

  • Each remaining line in a query: <rank><tab><document ID><tab><document title><tab><clicked><tab><click timestamp>, such as 2 d1895 《画杨桃》PPT课件 0 -1.

  • If the title of a document is unknown, then the document title will be represented by <unk>.

  • If a document is not clicked, then the click timestamp is -1.

  • The corpus corresponding to the training sessions is a collection with about 1,000,000 pages; each directory contains about 10,000 files. The human relevance labels are in human_labels.txt.

  • Participants can sample the validation set to verify the effectiveness of the methods.
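Under the format above, a training session file can be parsed line by line. The sketch below is a rough illustration rather than an official tool; it assumes fields are tab-separated and that query strings and titles contain no tabs, and the dictionary keys are our own naming.

```python
# Rough parser for training_sessions.txt: a "SessionID" line starts a new
# session, a 3-field line is a query header, a 5-field line is a SERP entry.

def parse_sessions(text):
    sessions = []
    for line in text.splitlines():
        if not line.strip():
            continue
        parts = line.split("\t")
        if parts[0] == "SessionID":
            sessions.append({"session_id": parts[1], "queries": []})
        elif len(parts) == 3:  # <query string> <query ID> <query start time>
            sessions[-1]["queries"].append(
                {"query": parts[0], "qid": parts[1],
                 "start": float(parts[2]), "docs": []})
        elif len(parts) == 5:  # <rank> <doc ID> <title> <clicked> <click ts>
            rank, docid, title, clicked, ts = parts
            sessions[-1]["queries"][-1]["docs"].append(
                {"rank": int(rank), "docid": docid, "title": title,
                 "clicked": clicked == "1", "click_time": float(ts)})
    return sessions

# Toy input mirroring the example session above:
sample = "\n".join([
    "SessionID\t87",
    "",
    "画杨桃\tq198\t1427848224.93",
    "1\td1882\t404\t0\t-1",
    "2\td1883\t<unk>\t0\t-1",
])
parsed = parse_sessions(sample)
print(parsed[0]["session_id"], len(parsed[0]["queries"][0]["docs"]))  # 87 2
```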

3) Each testing session is formatted as follows:

SessionID    8

Tensorflow    q64324    1596009033.208
1    TensorFlow    0    -1    2
2    TensorFlow教程:TensorFlow快速入门教程(非常详细)    0    -1    0
3    TensorFlow_百度百科    0    -1    0
4    TensorFlow - 机器学习系统    0    -1    0
5    TensorFlow - 机器学习系统    0    -1    0
6    tensorflow neural network playground - A Neural Network...    0    -1    2
7    GitHub - tensorflow/tensorflow: An Open Source Machine...    0    -1    0
8    TensorFlow入门极简教程(一) - 简书    0    -1    0
9    TensorFlow 如何入门,如何快速学习? - 知乎    0    -1    0
10    终于来了!TensorFlow 2.0入门指南(上篇)_机器学习算法..._CSDN博客    0    -1    0


Pytorch    q64325    1596009037.5
  • The first line of a session: SessionID<tab><session ID>, such as SessionID 8.

  • The first line of an observed/unobserved query: <query string><tab><query ID><tab><query start time>, such as Tensorflow q64324 1596009033.208. For the SSEE task, we additionally provide query-level satisfaction: <query string><tab><query ID><tab><query start time><tab><query satisfaction score>.

  • Each remaining line in an observed query: <rank><tab><document title><tab><clicked><tab><click timestamp><tab><usefulness>, such as 1 TensorFlow 0 -1 2.

  • If the title of a document is unknown, then the document title will be represented by <unk>.

  • If a document is not clicked, then the click timestamp is -1.

  • The usefulness ratings are on a 4-point scale (0-3), annotated by the search users themselves. We provide this information to explore to what extent search systems can be improved when true, instant user feedback is available.

  • The corpus corresponding to the test sessions is T2Ranking, which contains about 2.3M unique passages from real-world search engines. Participants need to retrieve and re-rank the relevant documents from this corpus for each query.

4) Each line in human_labels.txt is formatted as: <ID><tab><training session ID><tab><query ID><tab><document ID><tab><relevance><tab><valid>.

  • Relevance is labeled at five levels: 0 for not relevant or spam, 1 for relevant, 2 for highly relevant, 3 for key, and 4 for navigational. More information can be found in the TianGong-ST paper [4].
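Reading these label lines is a one-liner split on tabs. The sketch below shows the field layout; the example values are hypothetical, and the dictionary keys are our own naming.

```python
# Sketch: parse one human_labels.txt line of the form
# <ID>\t<training session ID>\t<query ID>\t<document ID>\t<relevance>\t<valid>

def parse_label_line(line):
    lid, sid, qid, did, rel, valid = line.rstrip("\n").split("\t")
    return {"id": lid, "session_id": sid, "query_id": qid,
            "doc_id": did, "relevance": int(rel), "valid": valid}

# Hypothetical example line:
row = parse_label_line("1\t87\tq198\td1882\t0\t1")
print(row["doc_id"], row["relevance"])  # d1882 0 (0 = not relevant or spam)
```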

Submission format

1) Each team can submit up to six NEW or REP runs for each subtask.
2) The submission file should be named as [TEAMNAME]-{FOSS, POSS, SSEE}-{NEW, REP}-[1-5].txt, such as THUIR1-FOSS-NEW-1.txt. Note that for the organizers’ convenience, there should not be any hyphen-minus (-) in the TEAMNAME. A NEW run means you use a novel approach, while a REP run means you reproduce some model.
3) As for the content, the first line should be a short English description of this particular run, such as BM25F with Pseudo-Relevance Feedback. The remaining lines should be formatted as [SessionID]<tab>[QueryID]<tab>[QueryPosInSession]<tab>[DocumentID]<tab>[Rank]<tab>[Score]<tab>[RunName], such as

8    q64324    1    d1904    1    0.95    THUIR1-FOSS-NEW-1
Please do not include more than 20 candidate documents per case!

Specifically, for the SSEE subtask, the first line should be a short English description of the session-level evaluation measure. The remaining lines should be formatted as [SessionID]<tab>[Session Score]<tab>[RunName].

4) After receiving each team's runs, we will annotate the relevance of the queries and documents. We will then use these labels to calculate the evaluation metrics and announce the final rankings.
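A run file in the format above can be produced as follows. This is a sketch, not an official submission tool; the team name, run name, document IDs, and scores are placeholders.

```python
# Sketch: write a FOSS/POSS run file (description line, then one
# tab-separated result line per candidate document).

def write_run(path, description, rows, run_name):
    """rows: iterable of (session_id, query_id, query_pos, doc_id, rank, score)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(description + "\n")
        for sid, qid, pos, did, rank, score in rows:
            f.write(f"{sid}\t{qid}\t{pos}\t{did}\t{rank}\t{score}\t{run_name}\n")

# Placeholder results for one query (session 8, query q64324):
rows = [("8", "q64324", "1", "d1904", 1, 0.95),
        ("8", "q64324", "1", "d1895", 2, 0.87)]
write_run("THUIR1-FOSS-NEW-1.txt",
          "BM25F with Pseudo-Relevance Feedback",
          rows, "THUIR1-FOSS-NEW-1")
```

Note that the file name carries the team name, subtask, run type, and run index, and per the limit above each query should list at most 20 candidate documents.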

Submitting runs



Organizers

Haitao Li (Tsinghua University)

Jiannan Wang (Tsinghua University)

Jia Chen (Tsinghua University)

Weihang Su (Tsinghua University)

Beining Wang (Tsinghua University)

Fan Zhang (Wuhan University)

Qingyao Ai (Tsinghua University)

Jiaxin Mao (Renmin University of China)

Yiqun Liu (Tsinghua University)

Contact Email:
Please feel free to contact us! 😉


References

[1] Carterette, B., Kanoulas, E., Hall, M., & Clough, P. (2014). Overview of the TREC 2014 Session Track. In TREC.
[2] Yang, G. H., & Soboroff, I. (2016). TREC 2016 Dynamic Domain Track Overview. In TREC.
[3] Zhang, F., Mao, J., Liu, Y., Ma, W., Zhang, M., & Ma, S. (2020). Cascade or Recency: Constructing Better Evaluation Metrics for Session Search. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 389-398).
[4] Chen, J., Mao, J., Liu, Y., Zhang, M., & Ma, S. (2019). TianGong-ST: A New Dataset with Large-scale Refined Real-world Web Search Sessions. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (pp. 2485-2488).
[5] Liu, M., Liu, Y., Mao, J., Luo, C., & Ma, S. (2018). Towards Designing Better Session Search Evaluation Metrics. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (pp. 1121-1124).
[6] Chen, J., Wu, W., Mao, J., Wang, B., Zhang, F., & Liu, Y. (2022). Overview of the NTCIR-16 Session Search (SS) Task. In Proceedings of NTCIR-16.
