Figure 0: Logo.

We provide the Chinese-centric TianGong-QRef dataset to support more in-depth investigations of users' query reformulation behaviors in Web search.

Motivation

Because the queries users submit directly affect their search experience, how queries are formulated has long been a research focus in Web search studies. As search requests become more complex and exploratory, many search sessions contain more than one query, so reformulation becomes a necessity. To help users formulate their queries in these complex search tasks, modern search engines usually provide a series of reformulation entries on search engine result pages (SERPs), e.g., query suggestions and related entities. However, few existing studies have thoroughly examined why and how users reformulate queries in these heterogeneous interfaces. Whether search engines provide sufficient assistance for users in reformulating queries therefore remains underinvestigated. To shed light on this research question, we conducted a field study to analyze fine-grained user reformulation behaviors, including the reformulation type, entry, reason, and inspiration source, under various search intents. Unlike existing efforts that rely on external assessors to make judgments, our field study collects both implicit behavior signals and explicit user feedback. We release this dataset to support more in-depth investigations of user reformulation behaviors in Web search.

Dataset Description

The dataset contains two main parts:

  • Search Behavior Log. We developed a Chrome extension that can be installed on various Chrome-based browsers and that records search-related activities when specific events, such as clicks or mouse movements, are triggered. To better understand how users reformulate a query, the extension recorded the source of each reformulation by locating the action within the current SERP. The other recorded information is as follows: 1) HTML: the URLs and HTML contents of SERPs and landing pages; 2) Mouse events: details of mouse movements, clicks, and scrolling; 3) Queries: the content of the queries that participants issued; 4) Timestamps: the starting and ending timestamps of all pages and user activities.
  • Search Feedback. Because the browser extension recorded user activities implicitly, we also developed an annotation platform to collect more explicit feedback from users. The annotation platform mainly consists of five functional screens, each of which collected some information, as shown in Table 1. While reviewing a search task, participants needed to go through these screens sequentially, but they could leave at any time and later resume annotating by re-entering from the home page. For more details, please refer to our paper.

Figure 1: Data Description.

Detailed information on user reformulation behaviors is listed as follows:

Figure 2: Reformulation Description.

Dataset Organization

The dataset is organized in a prettified JSON format, as shown below.

[		-- All sessions
    {		-- One session
        "session_id": 1,
        "user_id": 7,
        "satisfaction": 4,
        "success": 4,
        "difficulty": 0,
        "urgency": 4,	-- Whether the user urgently needed to complete this search task
        "atmosphere": 0,	-- Whether the environment was very noisy
        "trigger": 4,	-- How this search task was motivated
        "expertise": 4,	-- Whether the user was familiar with the search task before searching
        "specificity": 4,
        "queries": [	-- All queries in this session
            {		-- One query
                "query_id": 5,
                "satisfaction": 4,
                "start_timestamp": 1595991152761,
                "query_string": "百世汇通",
                "reform_type": 1,	-- Intent-level reformulation type
                "reform_reason": 4,	-- The reason for leaving this query
                "reform_entry": 1,	-- The interface of this reformulation
                "reform_inspiration": 1,	-- The inspiration source of this reformulation
                "other_reform_type": "",	-- The content for the "other" option of reformulation type
                "SERPs": [	-- Examined SERPs under the query
                    {
                        "page_id": 1,	-- The page number
                        "start_timestamp": 1595991152761,
                        "end_timestamp": 1595991172921,
                        "page_timestamps": [	-- All timestamps of this page
                            {
                                "inT": 1595991152761,	-- The timestamp when the user jumped into the page
                                "outT": 1595991154329	-- The timestamp when the user jumped out of the page
                            },
                            ...
                        ],
                        "mouse_moves": [	-- Mouse movements
                            {
                                "Sx": 545,	-- X coordinate of the starting point
                                "Sy": 231,	-- Y coordinate of the starting point
                                "St": 1595991152889,	-- Start timestamp
                                "Ex": 537,	-- X coordinate of the ending point
                                "Ey": 257,	-- Y coordinate of the ending point
                                "Et": 1595991153001,	-- End timestamp
                                "Ty": "move"	-- Type of mouse movement
                            },
                            ...
                        ],
                        "clicked_results": [	-- All clicked results
                            {
                                "type": "content",
                                "id": 1,	-- The rank of this result
                                "timestamp": 1595991154277,	-- Click timestamp
                                "pos_x": 303,	-- X coordinate of the result
                                "pos_y": 254,	-- Y coordinate of the result
                                "content": "百世快递-快递单号查询"	-- Content of this result
                            },
                            ...
                        ],
                        "top10_usefulness": [	-- The usefulness annotation for the top 10 results in this page (human label)
                            2,
                            0,
                            0,
                            2,
                            0,
                            0,
                            0,
                            0,
                            0,
                            0
                        ],
                        "clicked_others": [],	-- The other clickthrough activities by the user
                        "results": [    -- The results, including URL and title
                            [
                                "https://www.baidu.com/link?url=1QKpR-9HcZX27WvPjFWXhHgMsUgfa23QZquACETfd17&wd=&eqid=e1b9e80b00001b15000000035f20e470",
                                "百世快递-快递单号查询"
                            ],
                            ...
                        ]
                    }
                ]
            },
            ...
        ]
    },
    ...
]
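The structure above can be traversed with a few lines of Python. The following is a minimal sketch, not part of the released toolkit: the helper names (serp_dwell_ms, mouse_path_length, summarize_queries) are hypothetical, the embedded SAMPLE is a trimmed copy of the example above, and we assume that the released files are plain JSON (the "--" annotations shown above are explanatory only) and that all timestamps are Unix epoch values in milliseconds. It computes, for each query, the total time spent on its SERPs, the number of clicked results, and the length of the mouse path.

```python
import json
import math

# Trimmed copy of the example above; the real files are assumed to share this shape.
SAMPLE = """
[
  {
    "session_id": 1,
    "user_id": 7,
    "queries": [
      {
        "query_id": 5,
        "query_string": "百世汇通",
        "reform_type": 1,
        "SERPs": [
          {
            "page_id": 1,
            "page_timestamps": [
              {"inT": 1595991152761, "outT": 1595991154329}
            ],
            "mouse_moves": [
              {"Sx": 545, "Sy": 231, "St": 1595991152889,
               "Ex": 537, "Ey": 257, "Et": 1595991153001, "Ty": "move"}
            ],
            "clicked_results": [
              {"type": "content", "id": 1, "timestamp": 1595991154277}
            ]
          }
        ]
      }
    ]
  }
]
"""


def serp_dwell_ms(serp):
    """Sum (outT - inT) over every visit to one SERP (assumed milliseconds)."""
    return sum(v["outT"] - v["inT"] for v in serp.get("page_timestamps", []))


def mouse_path_length(serp):
    """Sum the straight-line length of every recorded mouse segment.

    All entries are counted regardless of their "Ty" value.
    """
    return sum(
        math.hypot(m["Ex"] - m["Sx"], m["Ey"] - m["Sy"])
        for m in serp.get("mouse_moves", [])
    )


def summarize_queries(sessions):
    """Return one summary row per query across all sessions."""
    rows = []
    for session in sessions:
        for query in session.get("queries", []):
            serps = query.get("SERPs", [])
            rows.append({
                "session_id": session["session_id"],
                "query_id": query["query_id"],
                "query_string": query["query_string"],
                "reform_type": query.get("reform_type"),
                "dwell_ms": sum(serp_dwell_ms(s) for s in serps),
                "num_clicks": sum(len(s.get("clicked_results", [])) for s in serps),
                "mouse_path": sum(mouse_path_length(s) for s in serps),
            })
    return rows


sessions = json.loads(SAMPLE)
for row in summarize_queries(sessions):
    print(row)
```

To run this on the demo or the full dataset, replace json.loads(SAMPLE) with json.load on the opened data file; the file name depends on the copy you receive.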

How to get TianGong-QRef

We provide a demo of TianGong-QRef containing 2 sessions to help researchers get started quickly. For A) the full TianGong-QRef dataset or B) the implementation of our field study platform, please contact us (thuir_datamanage@126.com, chenjia0831@gmail.com). After you sign an online application form, we will send you the data.

Field Study Toolkit

We have also released our Web search field study toolkit on GitHub. You are welcome to use it to conduct further field studies!

Citation

If you use TianGong-QRef in your research, please cite it with the following BibTeX entry. A preprint of this paper can be found here.

@inproceedings{chen2021towards,
  title={Towards a Better Understanding of Query Reformulation Behavior in Web Search},
  author={Chen, Jia and Mao, Jiaxin and Liu, Yiqun and Zhang, Fan and Zhang, Min and Ma, Shaoping},
  booktitle={Proceedings of The Web Conference 2021},
  year={2021},
  organization={ACM}
}