The TianGong-PDR Dataset

The TianGong-PDR dataset was created to support research on information retrieval and related human language technologies. The dataset consists of 70 queries, 1,050 documents (15 documents for each query), 11,512 passages (paragraphs within the documents), and four-grade human-assessed relevance labels for both documents and passages.

Dataset Description

The query-document pairs are constructed as follows. We use THUCNews, a Chinese news dataset based on web page data from the Sina News RSS subscription channel, as our corpus, and select queries from a 10-day query log of Sogou, a popular commercial search engine in China. We sample the search sessions in the query log in which users clicked on at least one result from the Sina News website and keep the corresponding queries. From these queries, 70 intermediate-frequency queries are manually chosen as the query set.

For each query in the query set, the initial candidate document set consists of all the documents in THUCNews in the same domain as the query. We filter out documents with fewer than 4 or more than 20 paragraphs, as these are too short or too long for the corpus. We then calculate the BM25 score for each query-document pair and keep 15 documents for each query according to their BM25 scores.
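The filtering and scoring steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the pipeline actually used to build the dataset: the `paragraphs` field name, the `tokenize` callable, the BM25 parameters `k1=1.2` and `b=0.75`, and the choice to keep the 15 highest-scoring documents are all hypothetical. Chinese text would also need a word segmenter in place of whitespace splitting.

```python
import math
from collections import Counter


def filter_by_paragraph_count(docs, min_paragraphs=4, max_paragraphs=20):
    """Keep documents whose paragraph count lies in [min_paragraphs, max_paragraphs]."""
    return [d for d in docs
            if min_paragraphs <= len(d["paragraphs"]) <= max_paragraphs]


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score each tokenized document against the query with a common BM25 variant.

    k1 and b are conventional defaults, assumed here for illustration.
    """
    n_docs = len(docs_tokens)
    avg_len = sum(len(t) for t in docs_tokens) / n_docs if n_docs else 0.0

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))

    scores = []
    for tokens in docs_tokens:
        tf = Counter(tokens)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def select_candidates(query_tokens, docs, tokenize, k=15):
    """Filter by paragraph count, then keep the k highest-scoring documents.

    Taking the top k is an assumption; the text only says documents are
    kept "according to their BM25 scores".
    """
    kept = filter_by_paragraph_count(docs)
    docs_tokens = [tokenize(" ".join(d["paragraphs"])) for d in kept]
    scores = bm25_scores(query_tokens, docs_tokens)
    ranked = sorted(zip(kept, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:k]]
```

A call such as `select_candidates(query_tokens, docs, tokenize=my_segmenter)` would return at most 15 documents, where `my_segmenter` stands for whatever tokenizer suits the corpus.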

Finally, we obtain a dataset consisting of 70 queries and 1,050 documents. Each paragraph within a document is used directly as a passage, giving 11,512 passages in the dataset. For each query-document and query-passage pair, we collect a four-grade human-assessed relevance label through crowdsourcing.

For more details about this dataset, you can read the paper “Investigating Passage-level Relevance and Its Role in Document-level Relevance Judgment”.

Dataset Organization

The TianGong-PDR dataset is organized as follows:

For one query in the dataset:

{
    'qid':                 // inner-id of the search query
    'query':               // the query text
    'query_description':   // the description of search intent
    'IF':                  // the category of the query: Factual or Intellectual
    'topic':               // the topic of the query
    'docs':                // a list of documents for the query   
}

For one document in 'docs':

{
    'docid':       // inner-id of the document
    'doc_rel':     // relevance score of the document (irrelevant(0) to highly relevant(3)).
    'passages':    // a list of passage texts within the document.
    'pass_rel':    // a list of relevance scores of the passages within the document (irrelevant(0) to highly relevant(3)).
}
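As a rough illustration of how the structure above might be traversed, the sketch below flattens query records into one row per query-passage pair. It assumes the records are already loaded as Python dictionaries with the field names listed above and that 'passages' and 'pass_rel' are aligned by position; the on-disk file format is not specified here, and the `toy_queries` record is made up for the example rather than taken from the dataset.

```python
def iter_passage_labels(queries):
    """Yield one dict per query-passage pair, carrying the document label along."""
    for query in queries:
        for doc in query["docs"]:
            passages, labels = doc["passages"], doc["pass_rel"]
            # Assumption: the two lists are aligned by position.
            if len(passages) != len(labels):
                raise ValueError(
                    f"doc {doc['docid']}: {len(passages)} passages "
                    f"but {len(labels)} passage labels"
                )
            for index, (text, rel) in enumerate(zip(passages, labels)):
                yield {
                    "qid": query["qid"],
                    "docid": doc["docid"],
                    "passage_index": index,
                    "passage": text,
                    "pass_rel": rel,
                    "doc_rel": doc["doc_rel"],
                }


# Toy record for illustration only; the values are invented.
toy_queries = [
    {
        "qid": 1,
        "query": "example query",
        "query_description": "example description",
        "IF": "Factual",
        "topic": "example topic",
        "docs": [
            {
                "docid": 10,
                "doc_rel": 2,
                "passages": ["first paragraph", "second paragraph"],
                "pass_rel": [0, 3],
            }
        ],
    }
]

rows = list(iter_passage_labels(toy_queries))
```

Each row keeps the document-level label next to the passage-level label, which makes it straightforward to compare the two grades for the same document.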

How to get TianGong-PDR

We provide a demo of TianGong-PDR that contains the 15 documents for one query to help researchers get started quickly. For a full copy of the TianGong-PDR dataset, please contact us (frankyzf94@gmail.com or fan-zhan16@mails.tsinghua.edu.cn). After you sign an application form online, we can send you the data.

Citation

If you use TianGong-PDR in your research, please add the following BibTeX citation to your references. A preprint of this paper can be found here: TianGong-PDR.

@inproceedings{Investigating2019WU,
 author = {Zhijing Wu and Jiaxin Mao and Yiqun Liu and Min Zhang and Shaoping Ma},
 title = {Investigating Passage-level Relevance and Its Role in Document-level Relevance Judgment},
 booktitle = {Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval},
 year = {2019},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {relevance judgment, passage-level relevance aggregation, relevance model},
}