Science of Synthesis Datasets
Science of Synthesis Reaction Datasets: High-Quality Reaction Data in Organic Chemistry, Converted into Machine-Readable Format

Science of Synthesis (SOS), a prestigious reference work in organic chemistry, contains critical reviews of the entire field of organic and organometallic chemistry. The reactions included are those judged to be most synthetically relevant and most reliable, as selected by experts in each field.
Thieme has converted this wealth of organic synthesis knowledge into machine-readable format. These highly structured datasets can provide a crucial basis for training AI-based models or rules-based algorithms for retrosynthesis and forward-reaction prediction, as well as helping in the analysis of known chemical reactivity and chemical space.
The consistently formatted experimental procedures potentially allow the automation of synthetic chemistry. Such use cases have applications in academia as well as in industry, for example, in chemical and pharmaceutical research and in drug development. A retrosynthesis tool pre-trained using the SOS datasets is now available on the IBM RXN for Chemistry platform.
Do You Get Your Hands Dirty in a Real Lab or Are You into Data Science?
High-Quality Datasets for Virtual and Real-Life Chemistry
The high-quality Science of Synthesis datasets are an essential knowledge base for chemists working in “real life” as well as for “virtual chemists”, for researchers working in designing syntheses or in computational investigations of chemical reactivity as well as for data scientists.
Real Laboratory
Synthetic Organic Chemists, Medicinal Chemists, Process Chemists
Virtual Laboratory
Cheminformaticians, Theoretical Chemists, Data Scientists
What We Offer: Pre-Trained Retrosynthesis Models and Reaction Datasets
A high-performing model for retrosynthesis and forward-reaction prediction
In collaboration with IBM, we offer a pre-trained model for retrosynthesis and forward-reaction prediction on IBM’s RXN for Chemistry platform.
High-quality, diverse reaction datasets covering the broad scope of organic chemistry
We can provide you with the whole dataset, or just the part that focuses on your area of research. We offer flexible pricing models to cover academic or various commercial use cases.
Improved Results when Training Models for Reaction Prediction and Retrosynthesis with SOS Datasets
The better the quality of the data used for training your models, whether rules-based or AI, the better the results you obtain. Science of Synthesis provides chemical reaction and structure data in synthetic organic chemistry to an unprecedented level of accuracy and reliability. Not only have the reactions included been selected by experts as being reliable and synthetically applicable, but also the abstraction of the data has been carried out with great care, predominantly manually, to ensure the dataset is of the highest quality. Experimental procedures have been edited for clarity and checked for scientific accuracy.
In addition to this, the Science of Synthesis datasets are very diverse, covering a much wider range of chemistry than that included in, for example, publicly available datasets automatically abstracted from patents.
High-Quality, Highly Structured Reaction Data: Ready for Use
There is no need for time-consuming, expensive text and data mining: the SOS dataset is already machine-readable and ready for use. It can be employed alone or readily used to supplement other data, such as your own in-house data or commercially or publicly available datasets.
The SOS Dataset Includes:
- Over 470,000 reactions (available in V2000 BIOVIA CT RD file format and SMILES format)
- Over 2.3 million molecules (available in V2000 BIOVIA CT SD file format)
- Over 2,400 full-text files in PDF format
- Over 76,000 full-text files in XML format
- Over 60,000 experimental procedures in XML format, edited for clarity and checked for scientific accuracy
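As an illustration of what "machine-readable" means for the SMILES format, the sketch below splits a reaction SMILES string into reactants, agents, and products using only the Python standard library. It relies on the general reaction SMILES convention (`reactants>agents>products`, with `.` separating molecules). The function name `parse_reaction_smiles` and the example reaction are hypothetical and written for this illustration; the actual layout of the SOS export files is not described here, so treat one reaction SMILES per record as an assumption.

```python
from typing import NamedTuple


class Reaction(NamedTuple):
    reactants: list[str]
    agents: list[str]
    products: list[str]


def parse_reaction_smiles(rxn: str) -> Reaction:
    """Split a reaction SMILES string into its three components.

    Assumes the plain 'reactants>agents>products' convention with no
    trailing name or extension block after the SMILES itself.
    """
    parts = rxn.strip().split(">")
    if len(parts) != 3:
        raise ValueError(f"expected 'reactants>agents>products', got {rxn!r}")
    # Within each part, individual molecules are separated by '.'
    reactants, agents, products = (
        [mol for mol in part.split(".") if mol] for part in parts
    )
    return Reaction(reactants, agents, products)


# Hand-written esterification example for illustration (not taken from SOS)
example = "CC(=O)O.OCC>OS(=O)(=O)O>CC(=O)OCC.O"
rxn = parse_reaction_smiles(example)
print(rxn.reactants)
print(rxn.agents)
print(rxn.products)
```

For real work, a cheminformatics toolkit would normally be used to read the RD and SD files and to validate and canonicalize structures; this sketch only shows the shape of the data.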
The SOS Dataset Offers the Following Advantages:
Consistent and Highly Structured:
The consistent and accurate format allows rapid integration into your system without significant cleanup needed
Very diverse – covers a very broad range of organic reactions:
Allows AI models to learn from the full breadth of knowledge
Has a high proportion of unique reactions:
More balanced training data, with less skew toward over-represented reaction types
Hand-picked:
Extra quality – only that chemistry recommended by experts is included
Predominantly manually created:
Fewer errors than machine-abstracted data
Regularly updated and current:
New chemistry is always being added (approx. 10,000 reactions p.a.)
Application of SOS Datasets in Your Institution or Company to Improve your Success

Examples of organizations that would benefit from the SOS datasets include:
- Chemical institutes and companies using SOS datasets to train their in-house models or supplement existing datasets
- Software companies providing synthesis solutions to their customers
Interested? Step into Our SOS Dataset Matrix and Reveal Your Tailored Offer
We can offer the package that meets your needs, whether that be the whole dataset, or just a part of it that focuses on your area of interest. We offer flexible pricing models to cover academic or various commercial use cases.
Contact Our SOS Dataset Experts
If you are interested in using the Science of Synthesis (SOS) data, please get in touch. We will be pleased to provide you with our content! Send an email to