THUIR

Open Source Projects and Data

Toolkits

ReChorus2.0: Top-K Recommendation with Implicit Feedback

ReChorus2.0 is a modular and task-flexible PyTorch library for recommendation, designed especially for research purposes. It aims to provide researchers with a flexible framework to implement various recommendation tasks, compare different algorithms, and adapt to diverse and highly customized data inputs.

ReChorus: Top-K Recommendation with Implicit Feedback

ReChorus is a general PyTorch framework for Top-K recommendation with implicit feedback, designed especially for research purposes. It aims to provide a fair benchmark for comparing different state-of-the-art algorithms. We hope this can partially alleviate the problem that different papers adopt non-comparable experimental settings, so as to form a “Chorus” of recommendation algorithms.
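
To make the Top-K setting concrete, here is a minimal sketch in plain Python. It is not ReChorus code and does not use its API: the function names, the toy scores, and the held-out items are all hypothetical. It ranks the items a user has not yet interacted with, keeps the K best, and reports Hit Rate@K against one held-out item per user.

```python
import heapq

def top_k_unseen(scores, seen, k):
    """Return the k highest-scoring items the user has not interacted with."""
    candidates = ((item, s) for item, s in scores.items() if item not in seen)
    best = heapq.nlargest(k, candidates, key=lambda pair: pair[1])
    return [item for item, _ in best]

def hit_rate_at_k(recommendations, held_out):
    """Fraction of users whose held-out item appears in their top-k list."""
    hits = sum(1 for user, target in held_out.items()
               if target in recommendations[user])
    return hits / len(held_out)

# Hypothetical model scores per user and item (any recommender could produce these).
scores = {
    "u1": {"a": 0.9, "b": 0.4, "c": 0.1, "d": 0.7},
    "u2": {"a": 0.2, "b": 0.7, "c": 0.5, "d": 0.1},
}
seen = {"u1": {"a"}, "u2": {"b"}}   # implicit feedback used for training
held_out = {"u1": "d", "u2": "d"}   # one test interaction per user

recs = {user: top_k_unseen(scores[user], seen[user], k=2) for user in scores}
hr = hit_rate_at_k(recs, held_out)
print(recs, hr)
```

Fixing the candidate set, the value of K, and the held-out split across all compared models is the kind of shared setting a common benchmark is meant to enforce.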

ULTRA: Unbiased Learning to Rank Algorithm

ULTRA is an Unbiased Learning To Rank Algorithms toolbox that provides a codebase for experiments and research on learning to rank with human-annotated or noisy labels. With a unified data processing pipeline, ULTRA supports multiple unbiased learning-to-rank algorithms, online learning-to-rank algorithms, and neural learning-to-rank models, as well as different methods to use and simulate noisy labels (e.g., clicks) to train and test different algorithms and ranking models.
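
As a rough illustration of why clicks are noisy labels and how they can be debiased, the sketch below simulates clicks under a simple position-based assumption (a click requires that the position is examined and that the document is relevant) and then applies inverse propensity weighting. It is not ULTRA code; the function names, relevance values, and examination propensities are assumptions chosen for the example.

```python
import random

def simulate_clicks(relevance, propensity, n_sessions, seed=0):
    """Count clicks per position under an assumed position-based click model."""
    rng = random.Random(seed)
    clicks = [0] * len(relevance)
    for _ in range(n_sessions):
        for i in range(len(relevance)):
            examined = rng.random() < propensity[i]
            if examined and rng.random() < relevance[i]:
                clicks[i] += 1
    return clicks

def naive_estimate(clicks, n_sessions):
    """Click-through rate, which mixes relevance with position bias."""
    return [c / n_sessions for c in clicks]

def ipw_estimate(clicks, propensity, n_sessions):
    """Inverse-propensity-weighted clicks, unbiased if propensities are correct."""
    return [c / (n_sessions * p) for c, p in zip(clicks, propensity)]

# Hypothetical ground truth: relevance probability and examination propensity per rank.
relevance = [0.9, 0.2, 0.6, 0.4]
propensity = [1.0, 0.5, 1 / 3, 0.25]
n_sessions = 20000

clicks = simulate_clicks(relevance, propensity, n_sessions)
naive = naive_estimate(clicks, n_sessions)
ipw = ipw_estimate(clicks, propensity, n_sessions)
print([round(x, 2) for x in naive])
print([round(x, 2) for x in ipw])
```

The naive click-through rates understate the relevance of lower-ranked documents, while the weighted estimates stay close to the assumed relevance values. In practice the propensities are unknown and have to be estimated.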

EasyRL4Rec: Reinforcement Learning for Recommender Systems

EasyRL4Rec is an easy-to-use library for Reinforcement Learning (RL) based Recommender Systems, covering five public datasets, three accessible simulator-based environments, comprehensive RL-based baselines, and unified evaluation protocols to make RL-based recommendation research easier to reproduce and compare.

RUS-toolkit: Privacy-Aware Remote IR User-Study Logging Tool

A privacy-aware logging toolkit for running remote information retrieval user-study experiments, helping researchers collect interaction data from participants without compromising user privacy.

MemCraft: A Memory Plugin for OpenClaw

MemCraft is an OpenClaw memory integration plugin that connects major LLM memory baselines to OpenClaw, providing a local-first, reproducible, and extensible unified memory framework for both general users and researchers.

LegalKit: Evaluation Toolkit for Legal-Domain LLMs

LegalKit is a practical and extensible evaluation toolkit for legal-domain Large Language Models. It unifies dataset adapters, model generation, offline JSON evaluation, and LLM-as-Judge scoring into a single workflow, with an optional lightweight Web UI for non-terminal users.

Dynamic RAG Toolbox: SIGIR 2025 Tutorial on Dynamic and Parametric RAG

A clean, modular, and easy-to-use codebase developed for the SIGIR 2025 Tutorial on Dynamic and Parametric Retrieval-Augmented Generation, to reproduce, compare, and extend dynamic RAG methods such as DRAGIN (ACL 2024) and FLARE.
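
The core idea of dynamic RAG, deciding during generation whether to retrieve, can be sketched with a simple confidence trigger. The code below is a toy illustration and not the toolbox, DRAGIN, or FLARE implementation: the drafted sentences, token probabilities, threshold, and `toy_retrieve` function are hypothetical stand-ins for a language model and a retriever, and using the low-confidence tokens directly as the query is a simplification.

```python
from typing import Callable, List, Optional, Tuple

Draft = Tuple[List[str], List[float]]  # tokens and their generation probabilities

def low_confidence_tokens(draft: Draft, threshold: float) -> List[str]:
    """Tokens whose generation probability falls below the threshold."""
    tokens, probs = draft
    return [t for t, p in zip(tokens, probs) if p < threshold]

def dynamic_rag(drafts: List[Draft],
                retrieve: Callable[[str], List[str]],
                threshold: float = 0.4) -> List[Tuple[str, Optional[List[str]]]]:
    """Retrieve only for drafted sentences that contain low-confidence tokens."""
    trace = []
    for draft in drafts:
        sentence = " ".join(draft[0])
        uncertain = low_confidence_tokens(draft, threshold)
        if uncertain:
            # Simplification: query with the uncertain tokens themselves.
            trace.append((sentence, retrieve(" ".join(uncertain))))
        else:
            trace.append((sentence, None))  # confident enough, no retrieval
    return trace

def toy_retrieve(query: str) -> List[str]:
    """Hypothetical retriever: keyword lookup in a one-document corpus."""
    corpus = {"Paris": "Paris is the capital of France."}
    return [doc for key, doc in corpus.items() if key in query]

drafts = [
    (["The", "capital", "of", "France", "is", "Paris"],
     [0.9, 0.8, 0.95, 0.7, 0.9, 0.3]),
    (["It", "is", "a", "city"], [0.9, 0.9, 0.9, 0.8]),
]
trace = dynamic_rag(drafts, toy_retrieve)
for sentence, evidence in trace:
    print(sentence, "->", evidence)
```

A full system would regenerate the sentence with the retrieved evidence in context; the sketch stops at the retrieval decision so that the trigger logic stays visible.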

Parametric RAG Toolkit: SIGIR 2025 Tutorial on Dynamic and Parametric RAG

A toolkit developed for the SIGIR 2025 Tutorial on Dynamic and Parametric Retrieval-Augmented Generation, designed to help researchers and practitioners reproduce, compare, and extend Parametric RAG methods, specifically PRAG and DyPRAG.

Datasets

Description: We provide the Chinese-centric TianGong-CRL dataset to support research on epidemic-related Information Retrieval (IR) tasks and on the information needs of Chinese people in the context of COVID-19. Refined from an 82-day search log from Sogou, the second-largest search engine in China, the dataset consists of two parts. The first part provides a collection of 1,492 COVID-19-related queries and the submission frequency of these queries in each province of China over the 82-day period. The second part provides a sample of COVID-19-related search logs from the same period; to protect user privacy, only session-level data is provided. We also sample a subset of 1,700 sessions from TianGong-CRL and manually label each session with five intent labels.
Description: On Annotation Methodologies for Image Search Evaluation
Description: The Influence of Image Search Intents on User Behavior and Satisfaction
Description: Understanding Reading Attention Distribution during Relevance Judgement
Description: The Sogou-SRR (Search Result Relevance) dataset was constructed to support research on search engine relevance estimation and ranking tasks. The dataset consists of 6,338 queries and their top 10 search results. For each search result, the screenshot, title, snippet, HTML source code, parse tree, and URL are provided, along with a 4-grade relevance score (1-4) and the result type. The queries are sampled from the search logs of Sogou.com. The sampled queries, with frequencies between 100 and 10,000, are usually regarded as torso queries and are typically the most important concern in ranking algorithm design.
Description: The Sogou-QCL dataset was created to support research on information retrieval and related human language technologies. The dataset consists of 537,366 queries, more than 9 million Chinese web pages, and five kinds of relevance labels assessed by click models. In addition, a 2,000-query dataset with 4-level human-assessed relevance labels is publicly available for research.
Description: The Tiangong-ULTR (Unbiased Learning To Rank) dataset was constructed to support studies on unbiased learning to rank. It provides real click data sampled from the search logs of Sogou.com for training unbiased learning-to-rank algorithms, as well as a separate set of human-annotated data for evaluating their performance.
Description: This dataset was created to support research on search evaluation in exploratory search. We conducted a user study comprising 166 search sessions in three domains. Users' interactions and explicit feedback were collected during the search process. The clicked documents collected in the user study were annotated by external assessors.
Description: The ZhihuRec dataset is collected from Zhihu, a knowledge-sharing platform. It comprises around 100M interactions collected within 10 days, 798K users, 165K questions, 554K answers, 240K authors, 70K topics, and more than 501K user query logs. It also includes anonymized descriptions of users, answers, questions, authors, and topics. To the best of our knowledge, this is the largest real-world interaction dataset for personalized recommendation.
Description: T2Ranking is a large-scale Chinese benchmark for passage ranking, including passage retrieval and re-ranking. T2Ranking comprises more than 300K queries and over 2M unique passages from real-world search engines. Specifically, we sample question-based search queries from user logs of the Sogou search engine, a popular search system in China. For each query, we extract the content of corresponding documents from different search engines. After model-based passage segmentation and clustering-based passage de-duplication, a large-scale passage corpus is obtained. For a given query and its corresponding passages, we hire expert annotators to provide 4-level relevance judgments of each query-passage pair.
Description: EEG-SVRec is the first EEG dataset with user multidimensional affective engagement labels in short video recommendation. It can be used for a deeper exploration of affective experience and cognitive activity behind user behaviors in recommender systems.
Description: STARD is a Chinese dataset that compiles 1,543 query cases from real legal consultations and 55,348 candidate statutory articles, aimed at addressing the neglect of non-professional public queries in existing statute retrieval benchmarks, thereby more comprehensively capturing the complexity and diversity of real queries from the public.
Description: The URS (User Reported Scenario) dataset comprises 1,846 real-world conversations with 15 LLM services, contributed by 712 users from 23 countries across 6 continents. Each scenario is classified into six user intent categories. This dataset, characterized by its user-centric, multi-intent, and multi-cultural nature, provides a valuable resource for advancing user-centric evaluations of LLMs.
Description: GNN4EEG is a benchmark and toolkit focusing on Electroencephalography (EEG) classification tasks via Graph Neural Networks (GNNs), aiming to facilitate research in this direction. Researchers can freely choose their preferred GNN models, hyper-parameters, and experimental protocols. Any self-built dataset can be used for training and evaluation.
Description: LeKUBE serves as a benchmark for knowledge updating methods in the legal domain, which is distinct from general domain knowledge updating. The legal domain presents unique challenges such as legal reasoning, application of law, and the length of legal regulations. LeKUBE concentrates on these challenges, providing a comprehensive evaluation of knowledge updating methods in the legal domain across five dimensions (accuracy, generalizability, locality, retainability, and scalability).
Description: LeCaRDv2 is a large-scale Chinese legal case retrieval dataset, providing extensive query-candidate pairs with fine-grained relevance annotations to support research on legal case retrieval.
Description: The official code and data accompanying "An Intent Taxonomy of Legal Case Retrieval," providing a taxonomy and annotated search behavior data for understanding user intents in legal case retrieval.
Description: A dataset for EEG-based emotion analysis under different information environments, focused on the short video recommendation scenario.
Description: User study data on news reading behavior, collected and used in our SIGIR 2018 and WWW 2019 papers.
Description: Search logs together with query/task usefulness, satisfaction, and relevance annotations collected from a user study on search usefulness.
Description: Datasets and implementation for relevance feedback with brain signals, exploring how brain-computer signals can be used to infer document relevance.
Description: Datasets and implementation for "Jointly Learning Explainable Rules for Recommendation with Knowledge Graph" (TheWebConf 2019), combining knowledge-graph rules with recommendation models.
Description: SurGE is a benchmark and dataset for end-to-end scientific survey generation in the computer science domain, developed for the SIGIR Resource Track, requiring systems to retrieve from over 1 million papers, organize a hierarchical outline, and generate a coherent survey.
Description: LexEval is a comprehensive benchmark for evaluating large language models in the legal domain.
Description: CoLLaM is a comprehensive benchmark for large legal language models.
Description: JuDGE is a benchmark for judgment document generation, evaluating systems that automatically draft legal judgment documents.
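
Several of the dataset descriptions above mention 4-level graded relevance labels. One common way to use such labels is NDCG, sketched below. This is a generic illustration and not an official evaluation script for any of these datasets: shifting labels from 1-4 to 0-3 and using the exponential gain 2^g - 1 are assumptions, and the example ranking is made up.

```python
import math

def dcg_at_k(grades, k):
    """Discounted cumulative gain with exponential gain 2^g - 1."""
    return sum((2 ** g - 1) / math.log2(i + 2)
               for i, g in enumerate(grades[:k]))

def ndcg_at_k(grades, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0

# Hypothetical labels for one ranked result list, on a 1-4 scale.
labels_1_to_4 = [3, 4, 1, 2]
# Assumption: the lowest grade means "not relevant", so shift to 0-3.
grades = [g - 1 for g in labels_1_to_4]

score = ndcg_at_k(grades, k=4)
perfect = ndcg_at_k(sorted(grades, reverse=True), k=4)
print(round(score, 4), perfect)
```

Whether the lowest grade should contribute zero gain depends on each dataset's annotation guidelines, so the shift should be checked against the dataset documentation before reporting numbers.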

Code / Paper Implementations

Description: Code for MemoryBench, a benchmark for memory and continual learning in LLM systems. See the corresponding dataset on HuggingFace below.
Description: Source code of DRAGIN (ACL 2024 main conference Long Paper), a method for dynamically deciding when and what to retrieve during text generation based on the real-time information needs of large language models.
Description: LegalOne-R1, a family of legal foundation models for reliable legal reasoning.
Description: Code for Generalized Pseudo-Relevance Feedback, extending pseudo-relevance feedback techniques for modern retrieval pipelines.
Description: Code for Parametric Retrieval-Augmented Generation, injecting retrieved knowledge directly into LLM parameters instead of the input context.
Description: Code for Robust Fine-tuning (RbFT), improving the robustness of retrieval-augmented LLMs to noisy or irrelevant retrieved content.
Description: DecoupledRAG decouples external knowledge from the input context in Retrieval-Augmented Generation, injecting knowledge via cross-attention during inference for more efficient and robust knowledge integration.
Description: LexiLaw is a Chinese legal large language model, fine-tuned to provide legal consultation and assistance.
Description: A general framework for evaluating the performance of large language models based on the peer review mechanism among LLMs.
Description: Implementation of "Beyond User Embedding Matrix: Learning to Hash for Modeling Large-Scale Users in Recommendation" (SIGIR 2020).
Description: Source code of our MIND paper (ACL 2024 Long Paper).
Description: Code for I3 Retriever, accepted by CIKM 2023.
Description: CaseEncoder, a knowledge-enhanced pre-trained model for legal case encoding (EMNLP 2023).
Description: Official repo for "Constructing Tree-based Index for Efficient and Effective Dense Retrieval" (SIGIR 2023 Full paper).
Description: Official repo for "Structure-aware Pre-trained Language Model for Legal Case Retrieval" (SIGIR 2023 Full paper).
Description: Implementation of "Adaptive Feature Sampling for Recommendation with Missing Content Feature Values" (CIKM 2019).
Description: ACCM, an Attentional Content & Collaborate Model for recommendation.
Description: Implementation of "WG4Rec: Modeling Textual Content with Word Graph for News Recommendation" (CIKM 2021).
Description: Code for "Axiomatically Regularized Pre-training for Ad hoc Search" (SIGIR 2022).
Description: Code for "Incorporating Query Reformulating Behavior into Web Search Evaluation" (CIKM 2021).
Description: Implementation of DAF, a Difficulty-Aware Framework for churn prediction (KDD 2021).
Description: Implementation of "Efficient Heterogeneous Collaborative Filtering without Negative Sampling for Recommendation" (AAAI 2020).
Description: Implementation of "Efficient Non-Sampling Factorization Machines for Optimal Context-Aware Recommendation" (TheWebConf 2020).
Description: Implementation of Efficient Neural Matrix Factorization, also used in "Efficient Neural Matrix Factorization without Sampling for Recommendation" (TOIS).
Description: Implementation of SAMN, a Social Attentional Memory Network for recommendation.
Description: Short-term and Life-time Repeat Consumption (SLRC) model for recommendation.
Description: Implementation of "Boosting Moving Average Reversion Strategy for Online Portfolio Selection: A Meta-Learning Approach."
Description: Implementation of NARRE, a neural attentional regression model with review-level explanations for recommendation.
Description: Code for TranSIV, "Learning and Transferring Social and Item Visibilities for Personalized Recommendation" (CIKM 2017).
Description: Source code of "Incorporating Position Bias into Click-through Bipartite Graph" (CCIR 2017).
Description: A small set of Python scripts implementing different deep click models, including RNN-based and deep UBM/PSCM variants.
Description: Implementation of user click models based on the Yandex click model framework, including the Partially Sequential Click Model (PSCM, SIGIR 2015).
Description: Implementation of the motif extraction algorithm from "Different Users, Different Opinions: Predicting Search Satisfaction with Mouse Movement Information" (SIGIR 2015).
Description: A project for building click models for mobile search, based on Aleksandr Chuklin's click models project.
Description: Implementation of "Equity vs. Equality: Optimizing Ranking Fairness for Tailored Provider Needs" (SIGIR 2026).
Description: Code for Caseformer.
Description: The leaderboard of MemoryBench.

HuggingFace Projects

T2Ranking: huggingface.co/datasets/THUIR/T2Ranking

A large-scale Chinese benchmark for passage ranking with more than 300K queries and over 2M unique passages sampled from real-world search engines, distributed via the HuggingFace Datasets Hub.

Qilin: huggingface.co/datasets/THUIR/Qilin

A large-scale multimodal dataset for search, recommendation, and Retrieval-Augmented Generation (RAG) research, built from app-level user sessions with rich query, interaction, and multimodal content data.

MemoryBench: huggingface.co/datasets/THUIR/MemoryBench

A benchmark for evaluating memory and continual learning in LLM systems, testing whether they can learn from accumulated user feedback during service time. Accepted at ICML 2026 as a Spotlight paper.

MemoryBench-Full: huggingface.co/datasets/THUIR/MemoryBench-Full

An extended version of the MemoryBench dataset with additional user feedback data and simulator settings.

MemoryBench-Results: huggingface.co/datasets/THUIR/MemoryBench-Results

An artifact archive of published MemoryBench experiment runs, including model predictions, per-sample evaluation details, and aggregate summaries across multiple backbone models.

AEOLLM: huggingface.co/datasets/THUIR/AEOLLM

Datasets for the NTCIR-18 and NTCIR-19 Automatic Evaluation of LLMs (AEOLLM) tasks, supporting research on automatic evaluation methods for large language model outputs.

AEOLLM Leaderboard: huggingface.co/spaces/THUIR/AEOLLM

A live leaderboard Space tracking submissions to the NTCIR AEOLLM tasks for automatic evaluation of large language models.

MetaSyn: huggingface.co/datasets/THUIR/MetaSyn

A dataset of 442 meta-analyses from the Nature Portfolio (2015-2024) paired with a retrieval corpus of 140K+ PubMed-indexed articles, built to benchmark LLM agent pipelines on the full meta-analysis workflow of retrieval, screening, extraction, and synthesis.

Special thanks to Shuqi Zhu for the initial construction of this page.