USPTO-LLM: A Large Language Model-Assisted Information-enriched Chemical Reaction Dataset

Published in the Resource Track of the 2025 ACM Web Conference (WWW 2025), 2025

Over the past few years, the machine learning community has paid increasing attention to chemical reaction prediction and retrosynthesis. Despite impressive achievements, existing datasets in this field have gradually become the bottleneck of current research: limited dataset size and the lack of reaction condition information hinder the practicality of current methods. In this study, we construct an information-enriched chemical reaction dataset called USPTO-LLM with the help of large language models (LLMs). The dataset comprises over 247K chemical reactions extracted from the patent documents of the USPTO (United States Patent and Trademark Office) and includes abundant information on reaction conditions. We employ LLMs to automate and expedite data collection, with a reliable quality control process. Experiments show that USPTO-LLM helps pre-train existing retrosynthesis methods and that the condition information in the dataset improves model performance. The dataset is open-sourced at https://zenodo.org/records/14396156 and the annotation code is open-sourced at https://github.com/GONGSHUKAI/USPTO_LLM.