Open Source Projects and Data

Toolkits

ReChorus2.0: Top-K Recommendation with Implicit Feedback

ReChorus2.0 is a modular and task-flexible PyTorch library for recommendation, designed primarily for research use. It aims to provide researchers with a flexible framework to implement various recommendation tasks, compare different algorithms, and adapt to diverse and highly customized data inputs.

PyTorch Top-K Recommendation

ReChorus: Top-K Recommendation with Implicit Feedback

ReChorus is a general PyTorch framework for Top-K recommendation with implicit feedback, designed primarily for research use. It aims to provide a fair benchmark for comparing state-of-the-art algorithms. We hope this can partially alleviate the problem that different papers adopt non-comparable experimental settings, so as to form a “Chorus” of recommendation algorithms.

PyTorch Top-K Recommendation
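As a rough illustration of what Top-K evaluation with implicit feedback involves, the sketch below ranks candidate items by predicted score and computes HR@K and NDCG@K for a single user. It uses only the Python standard library and is not ReChorus's API: the function names (`top_k`, `hit_ratio_at_k`, `ndcg_at_k`), the toy scores, and the held-out item are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Set


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k item ids with the highest predicted scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def hit_ratio_at_k(ranked: Sequence[str], positives: Set[str]) -> float:
    """1.0 if any held-out positive item appears in the ranked list, else 0.0."""
    return 1.0 if any(item in positives for item in ranked) else 0.0


def ndcg_at_k(ranked: Sequence[str], positives: Set[str]) -> float:
    """Binary-relevance NDCG over the ranked list."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked)
        if item in positives
    )
    ideal_hits = min(len(positives), len(ranked))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical predicted scores for one user and one held-out interacted item.
scores = {"a": 0.9, "b": 0.1, "c": 0.7, "d": 0.4}
positives = {"c"}

ranked = top_k(scores, k=2)
print(ranked, hit_ratio_at_k(ranked, positives), ndcg_at_k(ranked, positives))
```

Averaging these per-user values over all test users gives the dataset-level metric. Whether candidates are the full item set or a sampled subset is one of the experimental settings that can make results from different papers hard to compare.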

ULTRA: Unbiased Learning to Rank Algorithm

ULTRA is an Unbiased Learning To Rank Algorithms toolbox that provides a codebase for experiments and research on learning to rank with human-annotated or noisy labels. With a unified data processing pipeline, ULTRA supports multiple unbiased learning-to-rank algorithms, online learning-to-rank algorithms, and neural learning-to-rank models, as well as different methods for using and simulating noisy labels (e.g., clicks) to train and test different algorithms and ranking models.

PyTorch TensorFlow Learning to Rank
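To give a sense of why clicks are "noisy labels" and what debiasing means, here is a minimal sketch of inverse propensity weighting (IPW), one textbook approach to unbiased learning to rank. It assumes a position-based examination model with known propensities; the click log, the propensity values, and the function names are hypothetical, and this is not ULTRA's API or any specific algorithm it ships.

```python
from typing import List


def naive_click_rate(click_log: List[List[int]]) -> List[float]:
    """Per-position click rate, ignoring position bias."""
    n = len(click_log)
    return [
        sum(session[pos] for session in click_log) / n
        for pos in range(len(click_log[0]))
    ]


def ipw_relevance(click_log: List[List[int]], propensity: List[float]) -> List[float]:
    """Per-position relevance estimate with each click weighted by 1/propensity."""
    n = len(click_log)
    return [
        sum(session[pos] / propensity[pos] for session in click_log) / n
        for pos in range(len(propensity))
    ]


# Assumed probability that a user examines each rank (position-based model).
propensity = [1.0, 0.5, 0.25]

# Hypothetical log: 4 sessions over the same 3-document ranking, 1 = click.
click_log = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]

print(naive_click_rate(click_log))           # raw click rates
print(ipw_relevance(click_log, propensity))  # propensity-corrected estimates
```

In this toy log the document at the lowest rank receives the fewest raw clicks but the highest corrected estimate, because it was rarely examined. In practice the propensities are unknown and must be estimated, which is part of what unbiased learning-to-rank algorithms address.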

EasyRL4Rec: Reinforcement Learning for Recommender Systems

EasyRL4Rec is an easy-to-use library for Reinforcement Learning (RL) based Recommender Systems, covering five public datasets, three accessible simulator-based environments, comprehensive RL-based baselines, and unified evaluation protocols to make RL-based recommendation research easier to reproduce and compare.

Reinforcement Learning Recommendation

RUS-toolkit: Privacy-Aware Remote IR User-Study Logging Tool

A privacy-aware logging toolkit for running remote information retrieval user-study experiments, helping researchers collect interaction data from participants without compromising user privacy.

User Study Privacy

MemCraft: A Memory Plugin for OpenClaw

MemCraft is an OpenClaw memory integration plugin that connects major LLM memory baselines to OpenClaw, providing a local-first, reproducible, and extensible unified memory framework for both general users and researchers.

LLM Memory

LegalKit: Evaluation Toolkit for Legal-Domain LLMs

LegalKit is a practical and extensible evaluation toolkit for legal-domain Large Language Models. It unifies dataset adapters, model generation, offline JSON evaluation, and LLM-as-Judge scoring into a single workflow, with an optional lightweight Web UI for non-terminal users.

LLM Evaluation Legal

Dynamic RAG Toolbox: SIGIR 2025 Tutorial on Dynamic and Parametric RAG

A clean, modular, and easy-to-use codebase developed for the SIGIR 2025 Tutorial on Dynamic and Parametric Retrieval-Augmented Generation. It is designed to help users reproduce, compare, and extend dynamic RAG methods such as DRAGIN (ACL 2024) and FLARE.

RAG Tutorial
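The core idea behind dynamic RAG, retrieving only when the model seems to need it, can be sketched with a simple confidence threshold: if any token in a draft continuation was generated with low probability, issue a retrieval query built from those tokens. This is a deliberately simplified illustration and not the DRAGIN or FLARE implementation; the toy corpus, the token probabilities, the threshold, and all function names are hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical toy corpus; a real system would query a retriever or search index.
TOY_CORPUS = {
    "sigir": "SIGIR is a conference on information retrieval.",
    "zhihu": "Zhihu is a knowledge-sharing platform.",
}


def toy_retrieve(query: str) -> List[str]:
    """Return toy passages whose key appears among the lowercased query words."""
    words = query.lower().split()
    return [text for key, text in TOY_CORPUS.items() if key in words]


def uncertain_tokens(draft: List[Tuple[str, float]], threshold: float) -> List[str]:
    """Tokens whose generation probability falls below the threshold."""
    return [token for token, prob in draft if prob < threshold]


def maybe_retrieve(
    draft: List[Tuple[str, float]],
    retrieve: Callable[[str], List[str]],
    threshold: float = 0.5,
) -> Tuple[bool, List[str]]:
    """Trigger retrieval only if the draft contains low-confidence tokens.

    Returns (triggered, passages).
    """
    low = uncertain_tokens(draft, threshold)
    if not low:
        return False, []
    return True, retrieve(" ".join(low))


# Draft continuations as (token, probability) pairs; the values are made up.
confident = [("The", 0.95), ("sky", 0.90), ("is", 0.97), ("blue", 0.88)]
unsure = [("The", 0.95), ("paper", 0.80), ("appeared", 0.70), ("at", 0.90), ("SIGIR", 0.30)]

print(maybe_retrieve(confident, toy_retrieve))
print(maybe_retrieve(unsure, toy_retrieve))
```

Published methods differ in how they detect the information need and how they form the query, which is the kind of design difference a shared codebase makes easier to compare.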

Parametric RAG Toolkit: SIGIR 2025 Tutorial on Dynamic and Parametric RAG

A toolkit developed for the SIGIR 2025 Tutorial on Dynamic and Parametric Retrieval-Augmented Generation, designed to help researchers and practitioners reproduce, compare, and extend Parametric RAG methods, specifically PRAG and DyPRAG.

RAG Tutorial

Datasets

TianGong-CRL Dataset

Description: We provide the Chinese-centric TianGong-CRL dataset to support research on epidemic-related Information Retrieval (IR) tasks and on the information needs of Chinese people in the context of COVID-19. Refined from an 82-day search log of Sogou, the second-largest search engine in China, the dataset consists of two parts. The first part provides a collection of 1,492 COVID-19-related queries and the submission frequency of these queries in each province of China over the 82-day period. The second part provides a sample of COVID-19-related search logs from the same period; to protect user privacy, only session-level data is provided. We also sample a subset of 1,700 sessions from TianGong-CRL and manually label each session with five intent labels.

Logs

Image Annotation Dataset

Description: On Annotation Methodologies for Image Search Evaluation

Image Annotation

User Behavior Dataset

Description: The influence of image search intents on user behavior and satisfaction

Logs Annotation

Reading Attention Dataset

Description: Understanding Reading Attention Distribution during Relevance Judgement.

Annotation

Sogou-SRR Dataset

Description: The Sogou-SRR (Search Result Relevance) dataset was constructed to support research on search engine relevance estimation and ranking tasks. The dataset consists of 6,338 queries and their corresponding top-10 search results. For each search result, the screenshot, title, snippet, HTML source code, parse tree, and URL are provided, along with a 4-grade relevance score (1-4) and the result type. The queries are sampled from the search logs of Sogou.com. The sampled queries, with frequencies between 100 and 10,000, are usually regarded as torso queries, which are typically the most important concern in ranking algorithm design.

Logs

Sogou-QCL Dataset

Description: The Sogou-QCL dataset was created to support research on information retrieval and related human language technologies. The dataset consists of 537,366 queries, more than 9 million Chinese web pages, and five kinds of relevance labels assessed by click models. In addition, a 2,000-query dataset with 4-level human-assessed relevance labels is publicly available for research.

Logs

TianGong-ULTR Dataset

Description: The TianGong-ULTR (Unbiased Learning To Rank) dataset was constructed to support studies on unbiased learning to rank. It provides real click data sampled from the search logs of Sogou.com for training unbiased learning-to-rank algorithms, as well as a separate set of human-annotated data for evaluating their performance.

Logs Annotation

SearchSuccess Dataset

Description: This dataset was created to support research on search evaluation in exploratory search. We conducted a user study containing 166 search sessions in three domains. Users' interactions and explicit feedback were collected during the search process. The clicked documents collected in the user study were annotated by external assessors.

Logs

ZhihuRec Dataset

Description: The ZhihuRec dataset is collected from the knowledge-sharing platform Zhihu. It comprises around 100M interactions collected within 10 days, 798K users, 165K questions, 554K answers, 240K authors, 70K topics, and more than 501K user query logs. It also includes anonymized descriptions of users, answers, questions, authors, and topics. To the best of our knowledge, this is the largest real-world interaction dataset for personalized recommendation.

Logs

T²Ranking

Description: T²Ranking is a large-scale Chinese benchmark for passage ranking, including passage retrieval and re-ranking. T²Ranking comprises more than 300K queries and over 2M unique passages from real-world search engines. Specifically, we sample question-based search queries from user logs of the Sogou search engine, a popular search system in China. For each query, we extract the content of corresponding documents from different search engines. After model-based passage segmentation and clustering-based passage de-duplication, a large-scale passage corpus is obtained. For a given query and its corresponding passages, we hire expert annotators to provide 4-level relevance judgments of each query-passage pair.

Annotation Passage retrieval Passage re-ranking

EEG-SVRec

Description: EEG-SVRec is the first EEG dataset with user multidimensional affective engagement labels in short video recommendation. It can be used for a deeper exploration of affective experience and cognitive activity behind user behaviors in recommender systems.

EEG

STARD

Description: STARD is a Chinese dataset that compiles 1,543 query cases from real legal consultations and 55,348 candidate statutory articles, aimed at addressing the neglect of non-professional public queries in existing statute retrieval benchmarks, thereby more comprehensively capturing the complexity and diversity of real queries from the public.

Legal

URS

Description: The URS (User Reported Scenario) dataset comprises 1,846 real-world conversations with 15 LLM services, contributed by 712 users from 23 countries across 6 continents. Each scenario is classified into six user intent categories. This dataset, characterized by its user-centric, multi-intent, and multi-cultural nature, provides a valuable resource for advancing user-centric evaluations of LLMs.

User Intent

GNN4EEG

Description: GNN4EEG is a benchmark and toolkit focusing on Electroencephalography (EEG) classification tasks via Graph Neural Networks (GNNs), aiming to facilitate research in this direction. Researchers can freely choose their preferred GNN models, hyper-parameters, and experimental protocols. Any self-built dataset can be used for training and evaluation.

EEG

LeKUBE

Description: LeKUBE serves as a benchmark for knowledge updating methods in the legal domain, which is distinct from general domain knowledge updating. The legal domain presents unique challenges such as legal reasoning, application of law, and the length of legal regulations. LeKUBE concentrates on these challenges, providing a comprehensive evaluation of knowledge updating methods in the legal domain across five dimensions (accuracy, generalizability, locality, retainability, and scalability).

Legal

LeCaRDv2

Description: LeCaRDv2 is a large-scale Chinese legal case retrieval dataset, providing extensive query-candidate pairs with fine-grained relevance annotations to support research on legal case retrieval.

Legal

Legal Case Retrieval Intent Dataset

Description: The official code and data accompanying "An Intent Taxonomy of Legal Case Retrieval," providing a taxonomy and annotated search behavior data for understanding user intents in legal case retrieval.

Legal User Intent

EEG-based Short Video Emotion Dataset

Description: A dataset for EEG-based emotion analysis under different information environments, focused on the short video recommendation scenario.

EEG

News Reading User Study Data

Description: User study data on news reading behavior, collected and used in our SIGIR 2018 and WWW 2019 papers.

User Study Logs

Usefulness User Study Data

Description: Search logs together with query/task usefulness, satisfaction, and relevance annotations collected from a user study on search usefulness.

User Study Annotation Logs

Brain-Relevance-Feedback

Description: Datasets and implementation for relevance feedback with brain signals, exploring how brain-computer signals can be used to infer document relevance.

EEG Relevance Feedback

RuleRec

Description: Datasets and implementation for "Jointly Learning Explainable Rules for Recommendation with Knowledge Graph" (TheWebConf 2019), combining knowledge-graph rules with recommendation models.

Recommendation Knowledge Graph

SurGE

Description: SurGE is a benchmark and dataset for end-to-end scientific survey generation in the computer science domain, developed for the SIGIR Resource Track, requiring systems to retrieve from over 1 million papers, organize a hierarchical outline, and generate a coherent survey.

Survey Generation LLM

LexEval

Description: LexEval is a comprehensive benchmark for evaluating large language models in the legal domain.

Legal LLM Evaluation

CoLLaM

Description: CoLLaM is a comprehensive benchmark for large legal language models.

Legal LLM Evaluation

JuDGE

Description: JuDGE is a benchmark for judgment document generation, evaluating systems that automatically draft legal judgment documents.

Legal LLM Evaluation

Code / Paper Implementations

MemoryBench (Code)

Description: Code for MemoryBench, a benchmark for memory and continual learning in LLM systems. See the corresponding dataset on HuggingFace below.

LLM Memory

DRAGIN

Description: Source code of DRAGIN (ACL 2024 main conference Long Paper), a method for dynamically deciding when and what to retrieve during text generation based on the real-time information needs of large language models.

RAG LLM

LegalOne-R1

Description: LegalOne-R1, a family of legal foundation models for reliable legal reasoning.

Legal LLM

GPRF

Description: Code for Generalized Pseudo-Relevance Feedback, extending pseudo-relevance feedback techniques for modern retrieval pipelines.

Retrieval

PRAG

Description: Code for Parametric Retrieval-Augmented Generation, injecting retrieved knowledge directly into LLM parameters instead of the input context.

RAG LLM

Robust Fine-tuning (RbFT)

Description: Code for Robust Fine-tuning (RbFT), improving the robustness of retrieval-augmented LLMs to noisy or irrelevant retrieved content.

RAG LLM

DecoupledRAG

Description: DecoupledRAG decouples external knowledge from the input context in Retrieval-Augmented Generation, injecting knowledge via cross-attention during inference for more efficient and robust knowledge integration.

RAG LLM

LexiLaw

Description: LexiLaw is a Chinese legal large language model, fine-tuned to provide legal consultation and assistance.

Legal LLM

LLM_Eval

Description: A general framework for evaluating the performance of large language models based on the peer review mechanism among LLMs.

LLM Evaluation

PreHash

Description: Implementation of "Beyond User Embedding Matrix: Learning to Hash for Modeling Large-Scale Users in Recommendation" (SIGIR 2020).

Recommendation

MIND

Description: Source code of our MIND paper (ACL 2024 Long Paper).

LLM

III-Retriever

Description: Code for I3 Retriever, accepted by CIKM 2023.

Retrieval

CaseEncoder

Description: CaseEncoder, a knowledge-enhanced pre-trained model for legal case encoding (EMNLP 2023).

Legal Retrieval

JTR

Description: Official repo for "Constructing Tree-based Index for Efficient and Effective Dense Retrieval" (SIGIR 2023 Full paper).

Retrieval

SAILER

Description: Official repo for "Structure-aware Pre-trained Language Model for Legal Case Retrieval" (SIGIR 2023 Full paper).

Legal Retrieval

CC-CC

Description: Implementation of "Adaptive Feature Sampling for Recommendation with Missing Content Feature Values" (CIKM 2019).

Recommendation

ACCM

Description: ACCM, an Attentional Content & Collaborate Model for recommendation.

Recommendation

WG4Rec

Description: Implementation of "WG4Rec: Modeling Textual Content with Word Graph for News Recommendation" (CIKM 2021).

Recommendation

ARES

Description: Code for "Axiomatically Regularized Pre-training for Ad hoc Search" (SIGIR 2022).

Retrieval

Reformulation-Aware Metrics

Description: Code for "Incorporating Query Reformulating Behavior into Web Search Evaluation" (CIKM 2021).

Evaluation

DAF (Difficulty-Aware Framework)

Description: Implementation of DAF, a Difficulty-Aware Framework for churn prediction (KDD 2021).

Recommendation

EHCF

Description: Implementation of "Efficient Heterogeneous Collaborative Filtering without Negative Sampling for Recommendation" (AAAI 2020).

Recommendation

ENSFM

Description: Implementation of "Efficient Non-Sampling Factorization Machines for Optimal Context-Aware Recommendation" (TheWebConf 2020).

Recommendation

ENMF

Description: Implementation of Efficient Neural Matrix Factorization, also used in "Efficient Neural Matrix Factorization without Sampling for Recommendation" (TOIS).

Recommendation

SAMN

Description: Implementation of SAMN, a Social Attentional Memory Network for recommendation.

Recommendation

SLRC

Description: Short-term and Life-time Repeat Consumption (SLRC) model for recommendation.

Recommendation

BMAR

Description: Implementation of "Boosting Moving Average Reversion Strategy for Online Portfolio Selection: A Meta-Learning Approach."

Portfolio Selection

NARRE

Description: Implementation of NARRE, a neural attentional regression model with review-level explanations for recommendation.

Recommendation

TranSIV

Description: Code for TranSIV, "Learning and Transferring Social and Item Visibilities for Personalized Recommendation" (CIKM 2017).

Recommendation

PositionBiasInLP

Description: Source code of "Incorporating Position Bias into Click-through Bipartite Graph" (CCIR 2017).

Click Models

DEEP-CLICK-MODEL

Description: A small set of Python scripts implementing different deep click models, including RNN-based and deep UBM/PSCM variants.

Click Models

PSCMModel

Description: Implementation of user click models based on the Yandex click model framework, including the Partially Sequential Click Model (PSCM, SIGIR 2015).

Click Models

MotifExtraction

Description: Implementation of the motif extraction algorithm from "Different Users, Different Opinions: Predicting Search Satisfaction with Mouse Movement Information" (SIGIR 2015).

User Behavior

Click Models for Mobile Search

Description: A project for building click models for mobile search, based on Aleksandr Chuklin's click models project.

Click Models

EquityRank

Description: Implementation of "Equity vs. Equality: Optimizing Ranking Fairness for Tailored Provider Needs" (SIGIR 2026).

Fairness Ranking

Caseformer

Description: Code for Caseformer.

MemoryBench-LeaderBoard

Description: The leaderboard for MemoryBench.

HuggingFace Projects

T²Ranking: huggingface.co/datasets/THUIR/T2Ranking

A large-scale Chinese benchmark for passage ranking with more than 300K queries and over 2M unique passages sampled from real-world search engines, distributed via the HuggingFace Datasets Hub.

Dataset Passage Ranking

Qilin: huggingface.co/datasets/THUIR/Qilin

A large-scale multimodal dataset for search, recommendation, and Retrieval-Augmented Generation (RAG) research, built from app-level user sessions with rich query, interaction, and multimodal content data.

Dataset Search Recommendation RAG

MemoryBench: huggingface.co/datasets/THUIR/MemoryBench

A benchmark for evaluating memory and continual learning in LLM systems, testing whether they can learn from accumulated user feedback during service time. Accepted at ICML 2026 as a Spotlight paper.

Dataset LLM Memory Continual Learning

MemoryBench-Full: huggingface.co/datasets/THUIR/MemoryBench-Full

An extended version of the MemoryBench dataset with additional user feedback data and simulator settings.

Dataset LLM Memory

MemoryBench-Results: huggingface.co/datasets/THUIR/MemoryBench-Results

An artifact archive of published MemoryBench experiment runs, including model predictions, per-sample evaluation details, and aggregate summaries across multiple backbone models.

Dataset LLM Memory

AEOLLM: huggingface.co/datasets/THUIR/AEOLLM

Datasets for the NTCIR-18 and NTCIR-19 Automatic Evaluation of LLMs (AEOLLM) tasks, supporting research on automatic evaluation methods for large language model outputs.

Dataset LLM Evaluation

AEOLLM Leaderboard: huggingface.co/spaces/THUIR/AEOLLM

A live leaderboard Space tracking submissions to the NTCIR AEOLLM tasks for automatic evaluation of large language models.

Space LLM Evaluation

MetaSyn: huggingface.co/datasets/THUIR/MetaSyn

A dataset of 442 meta-analyses from the Nature Portfolio (2015-2024) paired with a retrieval corpus of 140K+ PubMed-indexed articles, built to benchmark LLM agent pipelines on the full meta-analysis workflow of retrieval, screening, extraction, and synthesis.

Dataset LLM Agents Meta-Analysis

Special thanks to Shuqi Zhu for the initial construction of this page.