The TianGong-PDR Dataset

The TianGong-PDR dataset was created to support research on information retrieval and related human language technologies. It consists of 70 queries, 1,050 documents (15 documents for each query), 11,512 passages (paragraphs within the documents), four-grade human-assessed relevance labels for both documents and passages, and four-grade passage-level cumulative gain labels.

Dataset Description

The query-document pairs are constructed as follows. We use THUCNews, a Chinese news dataset built from the Web pages of the Sina News RSS subscription channels, as our corpus, and select queries from a 10-day query log of Sogou, a popular commercial search engine in China. We sample the search sessions in the query log in which users clicked on at least one result from the Sina News website and keep the corresponding queries. From these queries, we manually choose 70 intermediate-frequency queries as the query set.

For each query in the query set, the initial candidate document set consists of all documents in THUCNews that belong to the same domain as the query. We filter out documents with fewer than 4 or more than 20 paragraphs, since they are too short or too long relative to the rest of the corpus. We then calculate the BM25 score for each query-document pair and keep 15 documents for each query according to their BM25 scores.
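The filtering and ranking steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes that documents are already tokenized, that "according to their BM25 scores" means keeping the 15 highest-scoring documents, and that a common BM25 variant with `k1=1.2` and `b=0.75` is used. The function names `bm25_scores` and `select_candidates` and the input layout are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs_terms, k1=1.2, b=0.75):
    """Return one BM25 score per document for a tokenized query.

    k1 and b are common defaults, not values taken from the dataset paper.
    """
    n_docs = len(docs_terms)
    # Guard against an all-empty collection to avoid division by zero.
    avg_len = (sum(len(d) for d in docs_terms) / n_docs) or 1.0
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for d in docs_terms:
        df.update(set(d))
    scores = []
    for d in docs_terms:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


def select_candidates(query_terms, documents, min_paras=4, max_paras=20, top_k=15):
    """Drop documents with too few or too many paragraphs, then rank by BM25.

    documents: list of dicts with 'docid' and 'paragraphs', where each
    paragraph is a list of tokens (a hypothetical layout for this sketch).
    Returns up to top_k (docid, score) pairs, best first (assumed reading
    of the selection rule).
    """
    kept = [d for d in documents if min_paras <= len(d["paragraphs"]) <= max_paras]
    if not kept:
        return []
    docs_terms = [[tok for p in d["paragraphs"] for tok in p] for d in kept]
    scores = bm25_scores(query_terms, docs_terms)
    ranked = sorted(zip(kept, scores), key=lambda pair: pair[1], reverse=True)
    return [(d["docid"], s) for d, s in ranked[:top_k]]
```

For Chinese text, the tokenization step (not shown) would need a word segmenter; which one was used for the dataset is not stated here.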

Finally, we obtain a dataset consisting of 70 queries and 1,050 documents. We treat each paragraph within a document as a passage, which yields 11,512 passages in total. For each query-document pair and each query-passage pair, we collect a four-grade human-assessed relevance label through crowdsourcing. We also collect four-grade passage-level cumulative gain labels for each document.

For more details about this dataset, please refer to the papers listed in the Citation section.

Dataset Organization

The TianGong-PDR dataset is organized as follows:

For each query in the dataset:

{
    'qid':                 // inner-id of the search query
    'query':               // the query text
    'query_description':   // the description of search intent
    'IF':                  // the category of the query: Factual or Intellectual
    'topic':               // the topic of the query
    'docs':                // a list of documents for the query   
}

For each document in 'docs':

{
    'docid':       // inner-id of the document
    'doc_rel':     // relevance score of the document (irrelevant(0) to highly relevant(3)).
    'passages':    // a list of passage text within the document.
    'pass_rel':    // a list of relevance scores of the passages within the document (irrelevant(0) to highly relevant(3)).
    'gain':        // three lists of passage-level cumulative gain (PCG) labels of the document (No-gain(0) to High-gain(3)).
}
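As a quick way to sanity-check a record against the fields described above, a sketch like the following may help. It assumes a record has already been loaded into a Python dict (the on-disk file format is not specified here), that relevance and gain labels are numeric grades from 0 to 3, and that `pass_rel` holds one score per passage. The function name `validate_query_record` and the toy record are hypothetical and do not come from the dataset.

```python
QUERY_KEYS = {"qid", "query", "query_description", "IF", "topic", "docs"}
DOC_KEYS = {"docid", "doc_rel", "passages", "pass_rel", "gain"}


def validate_query_record(record):
    """Check one query record against the documented fields.

    Raises ValueError on a structural problem; otherwise returns a small
    summary with the query id and the number of documents and passages.
    """
    missing = QUERY_KEYS - record.keys()
    if missing:
        raise ValueError(f"query record is missing fields: {sorted(missing)}")
    n_passages = 0
    for doc in record["docs"]:
        missing = DOC_KEYS - doc.keys()
        if missing:
            raise ValueError(f"document is missing fields: {sorted(missing)}")
        # Assumption: one passage relevance score per passage.
        if len(doc["passages"]) != len(doc["pass_rel"]):
            raise ValueError(f"{doc['docid']}: passages and pass_rel differ in length")
        # The description states that 'gain' holds three lists of PCG labels.
        if len(doc["gain"]) != 3:
            raise ValueError(f"{doc['docid']}: expected three gain lists")
        # Assumption: all labels are numeric grades between 0 and 3.
        grades = [doc["doc_rel"], *doc["pass_rel"]]
        for gain_list in doc["gain"]:
            grades.extend(gain_list)
        if not all(0 <= g <= 3 for g in grades):
            raise ValueError(f"{doc['docid']}: label outside the 0-3 range")
        n_passages += len(doc["passages"])
    return {"qid": record["qid"], "n_docs": len(record["docs"]), "n_passages": n_passages}


# Toy record with invented placeholder values, for illustration only.
example = {
    "qid": "q-demo",
    "query": "example query",
    "query_description": "example description of the search intent",
    "IF": "Factual",
    "topic": "example topic",
    "docs": [
        {
            "docid": "d-demo",
            "doc_rel": 2,
            "passages": ["first paragraph", "second paragraph"],
            "pass_rel": [1, 3],
            "gain": [[1, 2], [1, 3], [0, 2]],
        }
    ],
}

summary = validate_query_record(example)
```

Running the same check over every record gives a simple way to confirm that the totals match the figures reported above (70 queries, 1,050 documents, 11,512 passages).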

How to get TianGong-PDR

We provide a demo of TianGong-PDR that contains the 15 documents for one query to help researchers get started quickly. To obtain the full TianGong-PDR dataset, please contact us (thuir_datamanage@126.com). After you sign an application form online, we will send you the data.

Citation

If you use TianGong-PDR in your research, please cite the following papers using the BibTeX entries below:

@inproceedings{wu2019investigating,
  title={Investigating passage-level relevance and its role in document-level relevance judgment},
  author={Wu, Zhijing and Mao, Jiaxin and Liu, Yiqun and Zhang, Min and Ma, Shaoping},
  booktitle={Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval},
  pages={605--614},
  year={2019}
}

@inproceedings{wu2020leveraging,
  title={Leveraging passage-level cumulative gain for document ranking},
  author={Wu, Zhijing and Mao, Jiaxin and Liu, Yiqun and Zhan, Jingtao and Zheng, Yukun and Zhang, Min and Ma, Shaoping},
  booktitle={Proceedings of The Web Conference 2020},
  pages={2421--2431},
  year={2020}
}