Welcome to the NTCIR-15 We Want Web-3 Chinese Subtask!
This page contains information specific to the Chinese subtask of the NTCIR-15 WWW-3 task. For details of the task, please visit the WWW-3 task page.
Starting with We Want Web 2, we released a brand-new training dataset, Sogou-QCL, for the Chinese subtask. In We Want Web 3, we again provide this dataset to all participants. Sogou-QCL contains 0.54 million queries and more than 9 million corresponding documents. For each query-doc pair, we provide five kinds of click labels generated by different click models. For all documents, the title and the content have already been extracted with the help of our friends at Sogou.com. All you need to do is design your own ranking model!
For more details of Sogou-QCL, you may refer to our resource paper at SIGIR 2018.
- Leverage the Sogou-QCL data and Sogou-T corpus for Chinese web search
- Quantify within-site and cross-site improvements across multiple NTCIR rounds
- Through a collaboration across organizers and participants, discover what cannot be discovered in a single-site failure analysis
- Conduct cross-language web search experiments by leveraging the intersection between our Chinese and English topic sets
- Chinese ad hoc web search that spans at least three rounds of NTCIR (NTCIR-13 through NTCIR-15)
- For the Chinese subtask, some user behavior data will be provided to improve search quality.
- A one-sentence DESCRIPTION field will be provided for each query, based on which relevance assessments will be conducted.
- We have 80 queries for the Chinese subtask. About 25 topics will be shared among the different languages to support possible future cross-language research.
- The Chinese queries are sampled from the median-frequency queries collected from Sogou search logs.
- The queries are organized in XML and can be downloaded here (Chinese).
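As an illustration of working with the XML topic file, here is a minimal sketch of a loader. The tag names used (`queries`, `query`, `qid`, `content`, `description`) and the sample text are assumptions made for this example, so please check them against the file you actually download.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. The tag names below are assumptions and may not
# match the real topic file.
SAMPLE_TOPICS = """<queries>
  <query>
    <qid>0001</qid>
    <content>example query one</content>
    <description>One sentence describing the first information need.</description>
  </query>
  <query>
    <qid>0002</qid>
    <content>example query two</content>
    <description>One sentence describing the second information need.</description>
  </query>
</queries>"""


def load_topics(xml_text):
    """Return {qid: {"content": ..., "description": ...}} from topic XML."""
    root = ET.fromstring(xml_text)
    topics = {}
    for query in root.iter("query"):
        qid = query.findtext("qid", default="").strip()
        topics[qid] = {
            "content": query.findtext("content", default="").strip(),
            "description": query.findtext("description", default="").strip(),
        }
    return topics


topics = load_topics(SAMPLE_TOPICS)
```

To read the real file from disk, `ET.parse(path).getroot()` can replace `ET.fromstring(xml_text)`.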
Chinese Training Set
For the Chinese subtask in this round of We Want Web 3, we provide Sogou-QCL as the training set. Sogou-QCL contains two kinds of training data:
(1) The first set consists of traditional relevance assessments. It is made up of 1000 Chinese queries, and for each query Sogou-QCL contains about 20 query-doc relevance judgments. Each pair is annotated by three trained assessors. Sogou-QCL also provides the title and content extracted from the raw HTML.
(2) The second set consists of query click labels. Raw click logs often contain private user information, so we instead provide relevance scores estimated from the behavior of groups of users. More specifically, for each query-doc pair we provide five kinds of weak relevance labels estimated by five popular click models: UBM, DBN, TCM, PSCM, and TACM. These click models make use of rich user behavior signals such as clicks, skips, and dwell time. Sogou-QCL contains more than half a million queries and more than 9 million documents. To the best of our knowledge, this is so far the largest free training collection for Chinese ranking problems.
Handling raw HTML content is always a difficult job. We therefore provide fine-grained extracted content produced with the professional tools of Sogou.com. We hope this will save our participants some effort and help them focus on ranking model design.
- For the Chinese subtask, we adopt the new SogouT-16 as the document collection. SogouT-16 contains about 1.17B webpages sampled from the index of Sogou. Because the full SogouT-16 may be difficult for some research groups to handle (almost 80TB after decompression), we have prepared a “Category B” version, denoted “SogouT-16 B”. This subset contains about 15% of the webpages in SogouT-16 and will be used as the Web Collection. The Web Collection is free for research purposes: apply online and then send an email to Dr. Jiaxin Mao (maojiaxin AT gmail.com) to get it. SogouT-16 also has a free online retrieval and page rendering service; you will receive an account after applying for SogouT-16.
- You may feel that applying for the document collection takes too much effort. Don’t worry! We have a much easier plan for you. For SogouT-16, you only need to sign an application form online, and we can send you the original documents.
- We provide the top 1000 results for each topic. These results were obtained by our baseline system.
- The baselines are organized in standard TREC format.
- The baseline runs can be downloaded here.
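For readers new to the TREC run format, here is a minimal sketch of a parser. It assumes the usual six whitespace-separated columns (query ID, the literal `Q0`, document ID, rank, score, run tag). The sample lines, document IDs, and run tag are made up for illustration and are not taken from the actual baseline file.

```python
from collections import defaultdict

# Hypothetical sample lines in the six-column TREC run layout:
# qid Q0 docid rank score runtag
SAMPLE_RUN = """\
0001 Q0 doc-a 1 12.5 baseline
0001 Q0 doc-b 2 11.0 baseline
0002 Q0 doc-c 1 9.7 baseline
"""


def parse_trec_run(text):
    """Return {qid: [(docid, score), ...]} with documents ordered by rank."""
    rows_by_query = defaultdict(list)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        qid, _q0, docid, rank, score, _tag = parts
        rows_by_query[qid].append((int(rank), docid, float(score)))
    return {
        qid: [(docid, score) for _rank, docid, score in sorted(rows)]
        for qid, rows in rows_by_query.items()
    }


run = parse_trec_run(SAMPLE_RUN)
```

The resulting per-query lists can serve as candidate sets for a re-ranking model.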
Topics and Qrels from Previous WWW Tasks
We also provide the topics and qrels (i.e., relevance annotations) collected in the Chinese subtask of WWW-1 and WWW-2. You can use them as a training or validation set when building retrieval models.
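As one way to use the earlier qrels for validation, the sketch below scores a single ranked list with a simple nDCG@k (linear gain, log2 discount). This is only an illustrative variant and is not necessarily the measure or tool used in the official evaluation. The function names and the toy grades are assumptions, and loading the grades from a qrels file depends on that file's layout, which is not covered here.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked_docids, grades, k=10):
    """nDCG@k for one query.

    ranked_docids: document IDs in ranked order (best first).
    grades: {docid: graded relevance}; unjudged documents count as 0.
    """
    gains = [grades.get(docid, 0) for docid in ranked_docids[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy example with made-up grades for three judged documents.
toy_grades = {"doc-a": 2, "doc-b": 1, "doc-c": 0}
perfect = ndcg_at_k(["doc-a", "doc-b", "doc-c"], toy_grades)
reversed_score = ndcg_at_k(["doc-c", "doc-b", "doc-a"], toy_grades)
```

Averaging `ndcg_at_k` over all validation queries gives a single number for comparing models during development.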
- For the Chinese corpus (Sogou-T and Sogou-QCL) and the user behavior dataset, an agreement must be signed first. After we receive the agreement (soft copy), we will send you the data. Delivery of the disks takes about one week. The user behavior data can be sent to you online separately.
- We suggest that you start the data application procedure as early as possible.
You may need these links to find more information about NTCIR-15 and the WWW-3 English subtask: