
Cheminformatics with Python
- 1st Edition - May 1, 2026
- Latest edition
- Authors: Zhimin Zhang, Hongmei Lu, Ming Wen
- Language: English
- Paperback ISBN: 978-0-443-29186-9
- eBook ISBN: 978-0-443-29187-6
Machine learning and deep learning are now widely used in cheminformatics, and programming skills are becoming a must for most chemists. Python has become an invaluable and…

- Familiarizes readers with well-established techniques in cheminformatics, providing an in-depth understanding of the application of deep learning in cheminformatics using Python
- Simultaneously introduces the basic principles and implementations of deep learning algorithms, demonstrating how to apply deep learning models to chemical data for prediction and classification using Python
- Provides rich case studies and practical project examples to help readers apply what they have learned to real chemical problems and data
- Accompanied by an online GitHub repository with relevant Python code for each chapter
- Chapters also have an accompanying Jupyter Notebook containing relevant data, methods, and application examples, which can be run directly to get the results
1. Introduction
Covers the history, objectives, and scope of cheminformatics, and introduces the structure of the book.
1.1 History
1.2 Aims and Scope
1.3 Structure of the Book
Part I: Python for Cheminformatics
2. Python Basics
Introduces the basic concepts and features of the Python language and system to readers.
2.1 Installation
2.2 Python Interpreter
2.3 Variables and Data Types
2.4 Data Structures
2.5 Control Flow
2.6 Functions
2.7 Classes
2.8 Modules
2.9 Standard Library
2.10 Packages
2.11 Environments
2.12 Input and Output
2.13 Exceptions
2.14 Testing
2.15 Debugging
2.16 Profiling
2.17 Source Control
2.18 Collaboration
3. Python Packages
Provides an overview of the packages that need to be installed to conduct cheminformatics research based on Python, including scientific computing, cheminformatics, machine learning, deep learning, scientific visualization, databases, and web services.
3.1 Scientific Computing
3.1.1 NumPy
3.1.2 SciPy
3.1.3 Pandas
3.1.4 Jupyter
3.2 Cheminformatics
3.2.1 OpenBabel
3.2.2 RDKit
3.2.3 Chemfp
3.2.4 DeepChem
3.3 Machine Learning and Deep Learning
3.3.1 Scikit-learn
3.3.2 PyTorch
3.3.3 PyTorch Geometric
3.3.4 Hugging Face Transformers
3.4 Scientific Visualization
3.4.1 Matplotlib
3.4.2 Seaborn
3.4.3 PyMol
3.5 Databases
3.5.1 SQLite
3.5.2 MySQL
3.5.3 PostgreSQL
3.5.4 Redis
3.5.5 Milvus
3.6 Web Services
3.6.1 Django
3.6.2 Flask
3.6.3 FastAPI
3.6.4 OpenAPI
Part II: Data and Databases
4. Representation of Instrumental Signals
Focuses on how to represent the signals generated by the analysis instruments as vectors, matrices, tensors, etc., which can be easily imported into Python for further processing.
4.1 Vector Representation
4.1.1 Spectra
4.1.2 Chromatograms
4.1.3 Electrochemical Signals
4.2 Matrix Representation
4.2.1 Data of Hyphenated Instruments
4.2.2 Images from Single-Channel Imaging Instruments
4.2.3 Fluorescence (Phosphorescence) Spectra
4.2.4 Correlation Spectra
4.3 Tensor Representation
4.3.1 Images from Multi-Channel Imaging Instruments
4.3.2 Data of Multidimensional Instrumentation
4.4 Other Representations
4.4.1 Sparse Data
5. Representation of Molecules
Introduces common molecular formats and chemical reaction formats. Focuses on how to represent molecular structures as molecular descriptors, molecular fingerprints, molecular encodings, molecular graphs, molecular tokens, learned representations, etc., which can be easily imported into Python for further processing.
5.1 Common Formats for Molecules
5.1.1 SMILES Format
5.1.2 InChI Format
5.1.3 Chemical Markup Language
5.1.4 MDL MOL Format
5.1.5 Sybyl Mol2 Format
5.1.6 XYZ Format
5.1.7 SELFIES Format
5.1.8 Crystallographic Information File
5.1.9 Protein Data Bank Format
5.1.10 FASTA Format
5.2 Common Formats for Chemical Reactions
5.2.1 CML Reaction Format
5.2.2 MDL RXN Format
5.2.3 Reaction SMILES Format
5.3 2D Representation
5.3.1 Molecular Descriptors
5.3.2 Molecular Fingerprints
5.3.3 Molecular One-Hot Encoder
5.3.4 Molecular Graphs
5.3.5 Molecular Tokenizers
5.4 3D Representation
5.4.1 Cartesian Coordinates
5.4.2 Internal Coordinates
5.4.3 3D Molecular Descriptors
5.5 Learned Representation
5.5.1 Mol2Vec
5.5.2 SMILES-BERT
5.5.3 ChemBERTa
5.5.4 Chemformer
5.5.5 MolCLR
6. Databases in Chemistry
This chapter first introduces basic database theory, history, classification, database management systems, structured query language, object-relational mapping, and database design and implementation. Then, chemical databases are classified into literature databases, spectral databases, property databases, molecular structure databases, molecular biology databases, patent databases, and standard databases. Finally, it summarizes the common chemical databases that lay the data foundation for building machine learning and deep learning models.
6.1 Basic Database Theory
6.1.1 Brief History
6.1.2 Classification
6.1.3 Database Management System
6.1.4 Structured Query Language
6.1.5 Object-Relational Mapping
6.1.6 Database Design and Implementation
6.1.7 Handling of Ultra-Large Databases
6.2 Classification of Chemical Databases
6.2.1 Literature Databases
6.2.2 Spectral Databases
6.2.3 Property Databases
6.2.4 Molecular Structure Databases
6.2.5 Molecular Biology Databases
6.2.6 Patent Databases
6.2.7 Standard Databases
6.3 Common Databases in Chemistry
6.3.1 Literature Databases of Publishers
6.3.2 Reference and Citation Databases
6.3.3 Mass Spectral Databases
6.3.4 NMR Spectroscopy Databases
6.3.5 Molecular Spectroscopy Databases
6.3.6 Atomic Spectroscopy Databases
6.3.7 Common Databases of Molecular Properties
6.3.8 Structural Databases of Small Molecules
6.3.9 Structural Databases of Macromolecules
6.3.10 Databases of Chemical Reactions
6.3.11 Common Databases in Molecular Biology
6.3.12 Google Patents
6.3.13 Standards in International Organization for Standardization
Part III: Methods
7. Instrumental Signal Processing
Presents signal processing methods for instrumental analysis that remove noise, baselines, and other interfering factors from signals, including signal smoothing, baseline correction, peak detection, peak alignment, signal correction, signal derivatives, signal transformation, and other methods.
7.1 Smoothing Methods
7.1.1 Window Moving Average Method
7.1.2 Window Moving Polynomial Least Squares Method
7.1.3 Penalized Least Squares Method
7.2 Baseline Correction Methods
7.2.1 Improved Modified Polynomial Method
7.2.2 Adaptive Iteratively Reweighted Penalized Least Squares Method
7.2.3 Morphological Penalized Least Squares Method
7.2.4 Locally Estimated Scatterplot Smoothing Method
7.3 Peak Detection Methods
7.3.1 Common Criteria for Peak Detection
7.3.2 Peak Detection Based on Peak Properties
7.3.3 Peak Detection Incorporating Continuous Wavelet Transform
7.3.4 Multiscale Peak Detection in Wavelet Space
7.4 Peak Alignment Methods
7.4.1 Dynamic Time Warping
7.4.2 Correlation Optimized Warping
7.4.3 Peak Alignment Using Heuristic Optimization Algorithms
7.4.4 FFT Cross-Correlation for Alignment
7.4.5 Retention Time Alignment in XCMS
7.5 Signal Correction Methods
7.5.1 Multiplicative Scatter Correction
7.5.2 Orthogonal Signal Correction
7.5.3 Optical Length Estimation and Correction
7.6 Derivative Methods
7.6.1 Simple Discrete Difference
7.6.2 Window Moving Polynomial Least Squares Method
7.6.3 Fractional Order Derivation
7.7 Transformation Methods
7.7.1 Convolution and Cross-Correlation
7.7.2 Hadamard Transformation
7.7.3 Fast Fourier Transform
7.7.4 Wavelet Transform
8. Multivariate Calibration and Resolution
According to their component complexity, multi-component systems are divided into white, grey, and black systems, and the quantitative analysis methods applicable to each are introduced in turn. For the white system, we introduce direct calibration and indirect calibration methods. For the grey system, the vector and matrix calibration methods are introduced. For the generalized grey system, we introduce multiple linear regression, principal component regression, partial least squares regression, model maintenance, model transformation, and variable selection methods. For the black system, iterative and non-iterative multivariate resolution methods are introduced. These methods can be used to build better multivariate calibration and multivariate discrimination models to extract accurate qualitative and quantitative information from complex systems.
8.1 Multivariate Calibration for White Analytical Systems
8.1.1 Direct Calibration Methods
8.1.2 Indirect Calibration Methods
8.1.3 Generalized Standard Addition Method
8.2 Multivariate Calibration for Grey Analytical Systems
8.2.1 Vector Calibration Methods
8.2.2 Matrix Calibration Methods
8.3 Multivariate Calibration for Generalized Grey Analytical Systems
8.3.1 Multiple Linear Regression
8.3.2 Principal Component Regression
8.3.3 Partial Least Squares Regression
8.3.4 Model Maintenance and Calibration Transfer
8.3.5 Variable Selection
8.4 Multivariate Resolution for Black Analytical Systems
8.4.1 Self-Modeling Curve Resolution
8.4.2 Iterative Target Transformation Factor Analysis
8.4.3 Evolving Factor Analysis and Related Methods
8.4.4 Heuristic Evolving Latent Projections
8.4.5 Subwindow Factor Analysis
8.4.6 Multivariate Curve Resolution Alternating Least Squares
8.4.7 Resolution by Immune Algorithms
8.4.8 Multi-Way Calibration
9. Manipulation of Molecular Structures
Focuses on how to process molecular structure information, generate molecular 3D conformations, calculate molecular descriptors, calculate molecular fingerprints, calculate similarities between molecules, perform chemical structure searches, and perform transformations or chemical reactions on molecular structures.
9.1 Working with Molecular Structures
9.1.1 Reading Molecules
9.1.2 Drawing Molecules
9.1.3 Editing Molecules
9.1.4 Writing Molecules
9.2 Conformer Generation
9.2.1 Distance Geometry
9.2.2 Experimental-Torsion Basic Knowledge Distance Geometry
9.2.3 Universal Force Field
9.2.4 Merck Molecular Force Field
9.3 Molecular Descriptor Calculation
9.3.1 Constitutional Descriptors
9.3.2 Topological Descriptors
9.3.3 Connectivity Descriptors
9.3.4 Geometric Descriptors
9.3.5 E-State Descriptors
9.3.6 Charge Descriptors
9.3.7 Kappa Shape Descriptors
9.3.8 Other Descriptors
9.4 Molecular Fingerprints
9.4.1 Morgan Fingerprint
9.4.2 MACCS Keys
9.4.3 PubChem Fingerprint
9.4.4 Daylight-like Fingerprint in RDKit
9.4.5 Atom-Pair Fingerprint
9.4.6 Topological-Torsion Fingerprint
9.4.7 OpenBabel FP Series Fingerprints
9.4.8 Mol2Vec
9.5 Molecular Similarity
9.5.1 Fingerprint Generation
9.5.2 Similarity Metrics
9.5.3 Procedure of Molecular Similarity Searching
9.5.4 Similarity Scores
9.5.5 Visualization of Similarity Contributions
9.6 Chemical Structure Searching
9.6.1 SMILES Arbitrary Target Specification
9.6.2 Full Structure Search
9.6.3 Substructure Search
9.6.4 3D Structure Search
9.7 Chemical Transformations and Reactions
9.7.1 Substructure-Based Transformations
9.7.2 Murcko Decomposition
9.7.3 RECAP and BRICS
9.7.4 Chemical Reactions Based on Reaction SMILES
9.7.5 Chemical Reactions Based on MDL RXN Files
10. Classic Machine Learning Methods
This chapter explores classic machine learning methods for statistical inference, that is, drawing conclusions from the chemical data at hand. These methods include pattern distances, similarity metrics, normalization, feature extraction, dimensionality reduction and visualization, clustering, classification, regression, model selection and evaluation, and other related methods.
10.1 Distance and Similarity
10.1.1 Distances in Pattern Space
10.1.2 Similarities in Pattern Space
10.2 Normalization and Feature Extraction
10.2.1 Range Scaling
10.2.2 Autoscaling
10.2.3 Normalization
10.2.4 Encoding Categories
10.2.5 Missing Value Imputation
10.3 Dimensionality Reduction and Visualization
10.3.1 Principal Component Analysis
10.3.2 Independent Component Analysis
10.3.3 Non-Negative Matrix Factorization
10.3.4 Multidimensional Scaling
10.3.5 Isometric Feature Mapping
10.3.6 Locally Linear Embedding
10.3.7 t-Distributed Stochastic Neighbor Embedding
10.3.8 Uniform Manifold Approximation and Projection
10.4 Clustering
10.4.1 K-Means
10.4.2 Hierarchical Clustering
10.4.3 Density-Based Spatial Clustering of Applications with Noise
10.4.4 Hierarchical Density-Based Spatial Clustering of Applications with Noise
10.5 Classification
10.5.1 Nearest Neighbors Classification
10.5.2 Logistic Regression
10.5.3 Linear Discriminant Analysis
10.5.4 Support Vector Classification
10.5.5 Gaussian Process Classification
10.5.6 Partial Least Squares Discriminant Analysis
10.5.7 Random Forest Classification
10.5.8 Boosting Classification
10.6 Regression
10.6.1 Nearest Neighbors Regression
10.6.2 Linear Regression
10.6.3 Ridge Regression
10.6.4 Least Absolute Shrinkage and Selection Operator
10.6.5 Elastic Net
10.6.6 Support Vector Regression
10.6.7 Gaussian Process Regression
10.6.8 Partial Least Squares Regression
10.6.9 Random Forest Regression
10.6.10 Boosting Regression
10.7 Model Selection and Evaluation
10.7.1 Subsets Split
10.7.2 Cross-Validation
10.7.3 Hyperparameter Optimization
10.7.4 Evaluation Metrics
10.7.5 Visualization
11. Deep Learning Methods
This chapter introduces deep learning methods for cheminformatics, including multilayer perceptrons, convolutional neural networks, recurrent neural networks, attention mechanisms and Transformers, graph neural networks, and generative networks.
11.1 Multilayer Perceptrons
11.1.1 Perceptron
11.1.2 Hidden Layers
11.1.3 Activation Functions
11.1.4 Optimizers
11.1.5 Forward Propagation
11.1.6 Loss Functions
11.1.7 Backpropagation
11.1.8 Training
11.1.9 Overfitting and Regularization
11.2 Convolutional Neural Networks
11.2.1 Translation Invariance
11.2.2 Cross-Correlation and Convolution
11.2.3 Convolutional Layers
11.2.4 Transposed Convolutional Layers
11.2.5 Padding and Stride
11.2.6 Pooling
11.2.7 Batch Normalization
11.2.8 Common CNN Models
11.2.9 Fully Convolutional Network and U-Net
11.2.10 Object Detection and Segmentation
11.3 Recurrent Neural Networks
11.3.1 Sequence Data
11.3.2 Vocabulary and Tokenization
11.3.3 Basic Recurrent Neural Network
11.3.4 Long Short-Term Memory
11.3.5 Gated Recurrent Units
11.3.6 Bidirectional RNN
11.3.7 Seq2Seq
11.3.8 Beam Search
11.4 Attention Mechanisms and Transformers
11.4.1 Queries, Keys, and Values
11.4.2 Attention Mechanism
11.4.3 Multi-Head Attention
11.4.4 Self-Attention
11.4.5 Positional Encoding
11.4.6 Transformer
11.4.7 Vision Transformer
11.4.8 Swin Transformer
11.5 Graph Neural Networks
11.5.1 Graph Structured Data
11.5.2 Tasks of GNNs
11.5.3 Message Passing
11.5.4 Pooling
11.5.5 Normalization
11.5.6 Readout
11.5.7 Common GNN Models
11.6 Generative Networks
11.6.1 Autoregressive Network
11.6.2 Variational Autoencoder
11.6.3 Generative Adversarial Network
11.6.4 Flow Network
11.6.5 Diffusion Model
11.7 Recent Advances in Deep Learning
11.7.1 Distributed Training
11.7.2 Hyperparameter Tuning
11.7.3 Large-Scale Pretraining
11.7.4 Contrastive Learning
11.7.5 Multimodal Machine Learning
11.7.6 Foundation Models
11.7.7 Explainable Artificial Intelligence
11.7.8 Reinforcement Learning
Part IV: Applications
12. Cheminformatics in Analytical Chemistry
In this chapter, we give a brief introduction to the application of cheminformatics in Analytical Chemistry, including data preprocessing, qualitative analysis, quantitative analysis, and structure elucidation.
12.1 Introduction
12.2 Data Preprocessing
12.3 Qualitative Analysis
12.4 Quantitative Analysis
12.5 Structure Elucidation
12.6 Summary and Outlook
13. Cheminformatics in Metabonomics
In this chapter, we give a brief introduction to the application of cheminformatics in Metabonomics, including raw spectral processing, statistical analysis, functional analysis, and integrative analysis.
13.1 Introduction
13.2 Raw Spectral Processing
13.3 Statistical Analysis
13.4 Functional Analysis
13.5 Integrative Analysis
13.6 Summary and Outlook
14. Cheminformatics in Drug Discovery
In this chapter, we give a brief introduction to the application of cheminformatics in drug discovery, including data sources for drugs, targets, and diseases, druglikeness, synthetic accessibility, pharmacophore, quantitative structure–activity relationship, drug–target interaction, ADMET properties, virtual screening, and molecular dynamics simulation.
14.1 Introduction
14.2 Drug Discovery Process
14.3 Data Sources for Drugs, Targets, and Diseases
14.4 Druglikeness
14.5 Synthetic Accessibility
14.6 Pharmacophore
14.7 Quantitative Structure–Activity Relationship
14.8 Drug–Target Interaction
14.9 ADMET Properties
14.10 Virtual Screening (including new paradigms such as V-SYNTHES and Deep Docking)
14.11 Molecular Dynamics Simulation
14.12 Summary and Outlook
15. Cheminformatics in Materials Science
In this chapter, we give a brief introduction to the application of cheminformatics in Materials Science, including representation of materials, generative models and methods for materials, prediction of materials properties, high-throughput screening, and inverse design of materials.
15.1 Introduction
15.2 Representation of Materials
15.3 Generative Models and Methods for Materials
15.4 Prediction of Materials Properties
15.5 High-Throughput Screening
15.6 Inverse Design of Materials
15.7 Summary and Outlook
Appendices
A: Necessary Knowledge of Mathematics
This appendix provides a rapid introduction to the basic mathematics that you will need to follow most of the technical content in this book. It covers basic linear algebraic operations for high-dimensional data elements, calculus to determine in which direction to adjust each parameter to decrease the loss function, some basic probability for reasoning under uncertainty, and basic information theory to measure and compare how much information is present in different signals.
A.1 Linear Algebra
A.1.1 Addition and Subtraction of Vectors
A.1.2 Direction and Length of Vector
A.1.3 Scalar Multiplication of Vectors
A.1.4 Inner and Outer Products Between Vectors
A.1.5 Matrix Addition and Subtraction
A.1.6 Matrix Multiplication
A.1.7 Zero Matrix and Identity Matrix
A.1.8 Transpose of a Matrix
A.1.9 Determinant of a Matrix
A.1.10 Inverse of a Matrix
A.1.11 Orthogonal Matrix
A.1.12 Trace of a Square Matrix
A.1.13 Rank of a Matrix
A.1.14 Eigenvalues and Eigenvectors of a Matrix
A.1.15 Orthogonal Similar Transformation of a Matrix
A.1.16 Singular Value Decomposition
A.1.17 Generalized Inverse
A.1.18 Derivative of a Matrix
A.1.19 Norms of Vector and Matrix
A.1.20 Concept of Tensor
A.2 Calculus
A.2.1 Differential Calculus
A.2.2 High-Dimensional Differentiation
A.2.3 Gradients and Gradient Descent
A.2.4 Multivariate Chain Rule
A.2.5 Backpropagation
A.3 Distributions
A.3.1 Bernoulli
A.3.2 Discrete Uniform
A.3.3 Continuous Uniform
A.3.4 Binomial
A.3.5 Poisson
A.3.6 Gaussian
A.3.7 Exponential Family
A.4 Information Theory
A.4.1 Entropy
A.4.2 Mutual Information
A.4.3 Kullback–Leibler Divergence
A.4.4 Cross-Entropy
B: Editors and IDEs
This appendix introduces common Python code editors and integrated development environments.
B.1 Jupyter
B.2 Spyder
B.3 VSCode
B.4 PyCharm
Zhimin Zhang
Zhimin Zhang is an Associate Professor of Analytical Chemistry at Central South University, PR China. He received his Bachelor's and Doctoral degrees from Central South University. His main research interests are chemometrics and cheminformatics, machine learning and deep learning, high-resolution mass spectrometry and its resolution methods, Raman spectroscopy and its resolution methods, and chemometric software development. In recent years, he has led 4 national and provincial research projects, including the National Natural Science Foundation of China (NSFC) Youth Fund, National Major Scientific Instrument and Equipment Development Special Task, Hunan Provincial Natural Science Foundation Youth Fund, and National Postdoctoral Fund. He has also cooperated with B&W Tek, Shimadzu, ExxonMobil, National University of Defense Technology, Yunnan Institute of Tobacco Agricultural Science, and other enterprises and research institutions in the fields of data analysis and software development. He has published more than 100 SCI papers in Analytical Chemistry, Bioinformatics, Analytica Chimica Acta, Analyst, Chemometrics and Intelligent Laboratory Systems, Journal of Chemometrics, and other journals. He has long been engaged in the development of chemometric software for analytical instrument data processing, has developed several chemometric software packages, and holds 10 computer software copyrights. The chemometric software he developed, BWIQ (http://bwtek.com/products/bwiq/), is sold worldwide together with B&W Tek Raman and NIR spectrometers. He is currently an invited reviewer for Analytical Chemistry, Chemometrics and Intelligent Laboratory Systems, Analytica Chimica Acta, Journal of Chromatography A, and Analyst.
Hongmei Lu
Hongmei Lu is a Professor of Analytical Chemistry at Central South University, PR China. She received her Bachelor's and Doctoral degrees from Central South University. She is Vice Dean of the College of Chemistry and Chemical Engineering, Furong Scholar Specially Appointed Professor, Editor of Chemometrics and Intelligent Laboratory Systems, Member of the Committee of Computational Chemistry of the Chinese Chemical Society, Executive Director of Hunan Chemical and Chemical Society, Executive Director of Hunan Provincial Inspection and Testing Society, Director of China Biological Testing and Monitoring Industry Technology Innovation Strategic Alliance, Director of National Chemistry Experimental Demonstration Center, Head of National Virtual Simulation Project, and Member of the Tenth Hunan Youth Federation. She has received the Baosteel Excellent Teacher Award and was selected for the Yuying Talent Program of Central South University. She has been awarded the second prize in Natural Science Award of Hunan Province, the third prize in Science and Technology Progress Award of Hunan Province, the third prize in Science and Technology Award of China Petroleum and Chemical Automation Industry, the first prize in Science and Technology Progress Award of Huaihua City, and the first prize of Teaching Achievement of Hunan Province. She has published more than 160 papers in international academic journals such as Anal Chem, Trend Anal Chem, Metabolomics, Bioinformatics, and J Chromatogr A. She has co-authored 3 monographs in English. She has led more than 20 research projects, including 7 National Natural Science Foundation of China projects.
She has received funding from the Biotechnology and Biological Sciences Research Council (BBSRC) and the Erasmus Mundus Program of the European Union to visit and lecture at the University of Manchester (UK), the Universities of Cadiz and Barcelona (Spain), the University of Algarve (Portugal), and the University of Bergen (Norway). In recent years, she has hosted the international conferences "6th International Conference On Separation Science and Technology" and "Chemometrics in Analytical Chemistry, 2015". She has participated in various international and domestic academic conferences and given invited presentations.