The idea of giving chemical compounds a systematic name was invented in the late eighteenth century. Back then, the number of known substances was limited, but growing quickly. In order to find a unified method of naming the substances, the element names were based on their properties. About one century later, a new system of chemical nomenclature was proposed, since molecules could have multiple names when using the original method. Today, over 150 million unique organic and inorganic substances are known. The Chemical Abstracts Service (CAS) maintains a register into which around 15,000 substances are added daily, making it clear that a systematic naming method is vital in chemistry.
Even when a systematic name exists, trivial names often remain in common use. Para-acetylaminophenol is one example. European readers will know this molecule as paracetamol, while American readers might be more familiar with the name acetaminophen. Both names are derived, in different ways, from the (old) systematic name. This example makes it clear that trivial naming does not lead to unique names.
A solution is to use a systematic name that follows the IUPAC rules. The name for paracetamol then becomes N-(4-hydroxyphenyl)acetamide. The IUPAC nomenclature is unambiguous, which means that the name corresponds to one compound only. For relatively small molecules like this one, the complexity of the IUPAC name is still reasonable. Nevertheless, the systematic name for large molecules quickly becomes unwieldy. Alternatively, paracetamol can be described by its chemical formula C8H9NO2. However, a wide range of other molecules are represented by this same formula.
Another unique and unambiguous way to specify a chemical compound is by using a graphical representation. In the accompanying structures, paracetamol is depicted as a two-dimensional (left) and a three-dimensional (right) structure. When a Natta projection is used as a two-dimensional representation, then the third dimension is not needed to visualise the stereochemistry (Kim et al., 2018). Note that N-(4-hydroxyphenyl)acetamide (shown below) is achiral.
Chemical graphs are mathematical structures consisting of an ordered pair G = (V, E), in which V is a set of vertices and E is a set of edges. This representation can be used for molecules, reactions, and crystal structures because each consists of sites (atoms, molecules, intermediates, …) and connections (bonds, reaction steps, …). The sites are the vertices in the graph, while the edges correspond to the connections. Molecular or constitutional graphs denote graphs in which individual atoms are the sites (and thus the vertices), while the edges depict the bonds. However, hydrogen atoms and their connecting bonds are often omitted. These graphs are called hydrogen-suppressed graphs or skeleton graphs. See David et al. (2020) for examples and further information.
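As an illustration of the vertex and edge sets described above, the hydrogen-suppressed graph of paracetamol can be written down in a few lines of plain Python. This is only a sketch: the atom numbering is arbitrary and chosen for this example, bond orders are ignored, and the names `vertices`, `edges` and `adjacency` are assumptions rather than part of any standard library.

```python
# Hydrogen-suppressed molecular graph of paracetamol, G = (V, E).
# The numbering of the atoms is arbitrary (an assumption of this sketch).
vertices = {
    0: "C",   # methyl carbon
    1: "C",   # carbonyl carbon
    2: "O",   # carbonyl oxygen
    3: "N",   # amide nitrogen
    4: "C",   # ring carbon bonded to the nitrogen
    5: "C",
    6: "C",
    7: "C",   # ring carbon bonded to the hydroxyl oxygen
    8: "O",   # hydroxyl oxygen
    9: "C",
    10: "C",
}

# Each edge is a bond between two heavy atoms; bond orders are not stored.
edges = {
    (0, 1), (1, 2), (1, 3), (3, 4),
    (4, 5), (5, 6), (6, 7), (7, 8),
    (7, 9), (9, 10), (10, 4),
}


def adjacency(vertices, edges):
    """Return a dict mapping every vertex to the set of its neighbours."""
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


adj = adjacency(vertices, edges)
degrees = {v: len(neighbours) for v, neighbours in adj.items()}
```

With this numbering, `degrees[1]` is 3 for the carbonyl carbon, and the terminal atoms 0, 2 and 8 each have degree 1. The eleven vertices correspond to the eight carbon, one nitrogen and two oxygen atoms of C8H9NO2.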
Computers can also read molecular information in the form of a linear string of letters and numbers. This gives the opportunity to make very compact notations, minimising the required data. Furthermore, many computational processes can interpret text strings more effectively than data in tables. In addition, the string input is often human-readable. The main advantage of line-based identifiers is their processing speed. Several line notations exist, of which the Simplified Molecular Input Line Entry System (SMILES) and the IUPAC International Chemical Identifier (InChI) are currently the most used methods.
SMILES grew into the de facto standard notation method in cheminformatics. It uses the molecular graph structure to provide an unambiguous chemical nomenclature, based on the Morgan algorithm (vide infra) (Weininger, 1988). The example molecule paracetamol can easily be converted into a SMILES string, such as CC(=O)Nc1ccc(O)cc1. Although PubChem does recognise this structure, the database reports CC(=O)NC1=CC=C(C=C1)O as the SMILES identifier. Hence, the notation is clearly not unique: more than one valid string can describe the same molecule. Several algorithms for canonicalising the SMILES string have been created in order to obtain uniqueness (Weininger et al., 1989). A unique SMILES identifier has the advantage that the string is not only synonymous with the molecule, but is also the only string that will be used for this molecule. Because multiple canonicalisation algorithms exist, uniqueness is only guaranteed when a single algorithm is used consistently. The PubChem SMILES string of paracetamol is a canonical version, while the first string was written by hand. An InChI string is unique to each molecule.
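The non-uniqueness of SMILES noted above can be checked with a small sketch. The parser below is a hypothetical helper written for this example, not a full SMILES implementation: it assumes single-letter element symbols, single-digit ring closures and no brackets, charges or stereochemistry, and it ignores bond orders. It compares simple graph invariants only, which is a necessary but not a sufficient test for two strings describing the same molecule.

```python
def parse_smiles_subset(smiles):
    """Parse a very small SMILES subset into (atoms, bonds).

    Assumptions of this sketch: single-letter elements, single-digit ring
    closures, no brackets; bond symbols are read but bond order is dropped.
    """
    atoms, bonds = [], []
    prev = None          # index of the atom the next atom will bond to
    branch_stack = []    # atoms to return to when a branch closes
    ring_open = {}       # ring-closure digit -> atom index that opened it
    for ch in smiles:
        if ch.isalpha():
            atoms.append(ch.upper())      # aromatic 'c' is stored as 'C'
            idx = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, idx))
            prev = idx
        elif ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch.isdigit():
            if ch in ring_open:
                bonds.append((ring_open.pop(ch), prev))
            else:
                ring_open[ch] = prev
        elif ch in "-=#:":
            continue                      # bond order is ignored here
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return atoms, bonds


def signature(smiles):
    """Sorted (element, degree) pairs plus the bond count of the parsed graph."""
    atoms, bonds = parse_smiles_subset(smiles)
    degree = [0] * len(atoms)
    for a, b in bonds:
        degree[a] += 1
        degree[b] += 1
    return sorted(zip(atoms, degree)), len(bonds)


manual = "CC(=O)Nc1ccc(O)cc1"
pubchem = "CC(=O)NC1=CC=C(C=C1)O"
same_invariants = signature(manual) == signature(pubchem)
```

The two strings differ as text, yet `same_invariants` is `True`: both parse to a graph with eleven heavy atoms, eleven bonds and the same multiset of element and degree pairs. A real toolkit would go further and canonicalise the graph itself.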
While string-based representations such as SMILES, or graph-based representations can be easily used by chemists, a computer needs numerical input for correlating structures and properties. This numerical input is the molecular descriptor, which is formally defined by Todeschini and Consonni (2009) as follows: 'A molecular descriptor is the final result of a logical and mathematical procedure, which transforms chemical information encoded within a symbolic representation of a molecule into a useful number, which allows mathematical treatment of chemical information'. The resulting mathematical representation (or 'number') of a molecule can be a scalar value, vector, matrix, tensor or scalar field. In cheminformatics, feature vectors are the most common numerical translation of a molecule. Depending on the information included in the molecular representation, the descriptor is called either 1D, 2D, 3D or 4D.
The chemical graph theory and line-based identifiers, which are the most frequently applied representations, produce 2D descriptors, or topological descriptors, since only topological information is included. 3D descriptors are also called geometrical descriptors, since information about bonds and angles is enclosed. When the 3D molecular descriptor can handle multiple conformers simultaneously, a fourth dimension is added.
Every component of an n-component descriptor is a bin, representing a molecular feature. This can be either a local feature (substructures, topological indices, …) or a global feature (molecular weight, volume, …). Typically, the number of features is on the order of several hundred to several thousand, depending on the information used for describing the molecule. Each feature x_k in the vector is a real number, an integer or a bit (0/1). The latter type, the binary-valued feature vector, is also called the molecular fingerprint, e.g. v_A of molecule A. Below we define a specific type of fingerprint, the extended-connectivity fingerprints.
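The difference between an integer-valued feature vector and a binary fingerprint can be made concrete with a deliberately tiny sketch. The feature list below (one bin per element symbol) is a hypothetical choice made for this illustration only; practical descriptors use hundreds or thousands of far richer features.

```python
from collections import Counter

# Heavy atoms of paracetamol (C8H9NO2 with the hydrogens suppressed).
heavy_atoms = ["C"] * 8 + ["N"] + ["O"] * 2

# Hypothetical feature list: each bin counts one element symbol.
FEATURES = ["C", "N", "O", "S"]


def count_vector(atoms, features=FEATURES):
    """Integer-valued feature vector: x_k is the count of feature k."""
    counts = Counter(atoms)
    return [counts[f] for f in features]


def fingerprint(atoms, features=FEATURES):
    """Binary-valued feature vector: bit k is 1 if feature k is present."""
    return [1 if x > 0 else 0 for x in count_vector(atoms, features)]


x_A = count_vector(heavy_atoms)   # [8, 1, 2, 0]
v_A = fingerprint(heavy_atoms)    # [1, 1, 1, 0]
```

The fingerprint `v_A` discards the counts and keeps only presence or absence, which is what makes it compact and cheap to compare.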
Extended-connectivity fingerprints (ECFP) are circular, topological fingerprints (Rogers and Hahn, 2010). This implies that only 2D information about the atom connectivity is included. The fingerprinting method is inspired by the famous Morgan algorithm, which has been important for several decades in computational operations with chemical structures.
The above image illustrates the application of the Morgan algorithm on paracetamol. In the first step, an initial graph invariant is set to 1 for every vertex, so the number of distinct invariant values at this stage is equal to one. Then, each invariant is replaced by the sum of the invariants of the neighbouring vertices from the previous iteration step. This step is repeated until the number of distinct invariant values no longer increases.
The invariants from the first iteration at which the maximal number of distinct values is reached become the final invariant values. Based on these values, priorities can be given to the different atoms, as shown below. The atoms that have the lowest final invariant value are prioritised.
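The iteration described above can be sketched in plain Python. This is a minimal version under stated assumptions: the graph is the hydrogen-suppressed graph of paracetamol with an arbitrary atom numbering, bond orders and element types are ignored, and the function name `morgan_invariants` is chosen for this example. The stopping rule used here is that iteration ends as soon as the number of distinct invariant values fails to increase.

```python
# Heavy-atom bonds of paracetamol; the atom numbering is arbitrary.
# 0 methyl C, 1 carbonyl C, 2 carbonyl O, 3 N, 4-7 and 9-10 ring C, 8 hydroxyl O.
EDGES = [
    (0, 1), (1, 2), (1, 3), (3, 4),
    (4, 5), (5, 6), (6, 7), (7, 8),
    (7, 9), (9, 10), (10, 4),
]


def build_adjacency(edges):
    """Map every vertex to the set of its neighbouring vertices."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def morgan_invariants(adj):
    """Return (invariants, iterations) following the scheme in the text."""
    invariants = {v: 1 for v in adj}      # every vertex starts at 1
    n_distinct = 1
    iterations = 0
    while True:
        # Replace each invariant by the sum of its neighbours' invariants.
        updated = {v: sum(invariants[n] for n in adj[v]) for v in adj}
        n_updated = len(set(updated.values()))
        if n_updated <= n_distinct:
            # No further discrimination: keep the previous iteration.
            return invariants, iterations
        invariants, n_distinct = updated, n_updated
        iterations += 1


adj = build_adjacency(EDGES)
final, n_iter = morgan_invariants(adj)
ranking = sorted(adj, key=lambda v: final[v])  # lowest invariant first
```

For this graph the loop stops with eight distinct values among the eleven atoms. Topologically equivalent atoms, such as the two ring carbons next to the nitrogen-bearing carbon (5 and 10 in this numbering), receive the same invariant. Because element types are ignored, the methyl carbon and the carbonyl oxygen are also not distinguished, so ties in `ranking` still need to be broken by additional atom properties.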
Respecting the priority rules, an unambiguous molecular representation can be created for the structure. It is straightforward to see how the SMILES identifier CC(=O)NC1=CC=C(C=C1)O is created (Weininger, 1988).
Fingerprints based on the Morgan algorithm are also called circular fingerprints (Rogers and Hahn, 2010; Faulon et al., 2003; von Lilienfeld et al., 2015). This name finds its origin in the fact that at every iteration step, for every atom, a circle is created with a radius of a certain number of bonds. In the Morgan algorithm, the final radius is reached when further iterations no longer distinguish additional atoms. The extended-connectivity fingerprint, by contrast, predefines the radius, so that the identifiers are not (by definition) unique. However, the uniqueness of the Morgan algorithm leads to the disadvantage that atoms in the same environment in different molecules can have different identifiers, which is very inconvenient for comparing molecules (Rogers and Hahn, 2010).
The algorithm for creating extended-connectivity fingerprints consists of three sequential stages: assignment of an initial identifier, iterative updating of this identifier and removal of duplicate identifiers. Depending on the type of identifier and the predefined radius (or diameter), the fingerprint receives a different name. In property prediction, the 'ECFP_4' fingerprint is often used (Faber et al., 2017), which captures the atom environment with a diameter of four bonds (i.e. two iterations). A fingerprint that captures the molecular functional class, with the width of the largest fragment being six bonds, is accordingly named 'FCFP_6'.
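The three stages listed above can be sketched as follows. This is a simplified illustration, not the published ECFP procedure: initial identifiers are built only from the atomic number and the heavy-atom degree, bond orders are ignored, the CRC32 checksum stands in for the hash function, and duplicates are removed by identifier value rather than by comparing the underlying substructures. The function name `ecfp_sketch` and the atom numbering are assumptions of this example.

```python
import zlib

ATOMIC_NUMBER = {"C": 6, "N": 7, "O": 8}

# Hydrogen-suppressed paracetamol; the atom numbering is arbitrary.
elements = ["C", "C", "O", "N", "C", "C", "C", "C", "O", "C", "C"]
EDGES = [
    (0, 1), (1, 2), (1, 3), (3, 4),
    (4, 5), (5, 6), (6, 7), (7, 8),
    (7, 9), (9, 10), (10, 4),
]
adj = {i: set() for i in range(len(elements))}
for a, b in EDGES:
    adj[a].add(b)
    adj[b].add(a)


def _hash(values):
    """Deterministic 32-bit identifier for a tuple of integers (stand-in hash)."""
    return zlib.crc32(repr(values).encode("ascii"))


def ecfp_sketch(elements, adj, radius=2):
    """Return the set of identifiers collected up to the given radius."""
    # Stage 1: initial identifier from simple atom properties.
    ids = {a: _hash((ATOMIC_NUMBER[elements[a]], len(adj[a]))) for a in adj}
    features = set(ids.values())
    # Stage 2: update each identifier from the atom and its neighbours.
    for layer in range(1, radius + 1):
        updated = {}
        for a in adj:
            neighbours = sorted(ids[n] for n in adj[a])
            updated[a] = _hash((layer, ids[a], *neighbours))
        ids = updated
        # Stage 3 (simplified): the set drops identifiers already present.
        features.update(ids.values())
    return features


fp_radius_2 = ecfp_sketch(elements, adj, radius=2)  # diameter 4, as in 'ECFP_4'
```

Because the radius is fixed in advance, the same local environment yields the same identifier in any molecule processed with the same settings, which is the property that makes such fingerprints convenient for comparing molecules. A radius of 2 corresponds to the diameter of four bonds mentioned above.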
PubChem 2019 update: improved access to chemical data: Kim, S., Chen, J., Cheng, T., Gindulyte, A., He, J., He, S., Li, Q., Shoemaker, B.A., Thiessen, P.A., Yu, B., Zaslavsky, L., Zhang, J. and Bolton, E.E. Nucleic Acids Res. 2018, 47, D1102-D1109.
Molecular representations in AI-driven drug discovery: a review and practical guide: David, L., Thakkar, A., Mercado, R. and Engkvist, O., J. Cheminform. 2020, 12, 56.
SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules: Weininger, D. J. Chem. Inf. Comput. Sci. 1988, 28, 31-36.
Molecular descriptors for chemoinformatics: Todeschini, R. and Consonni, V., Wiley‐VCH, Weinheim, 2009.
SMILES. 2. algorithm for generation of unique SMILES notation: Weininger, D., Weininger, A. and Weininger, J.L. J. Chem. Inf. Comput. Sci. 1989, 29, 97-101.
Extended-connectivity fingerprints: Rogers, D. and Hahn, M. J. Chem. Inf. Model. 2010, 50, 742-754.
The signature molecular descriptor. 1. using extended valence sequences in QSAR and QSPR studies: Faulon, J.-L., Visco, D.P. and Pophale, R.S. J. Chem. Inf. Comput. Sci. 2003, 43, 707-720.
Fourier series of atomic radial distribution functions: a molecular fingerprint for machine learning models of quantum chemical properties: von Lilienfeld, O.A., Ramakrishnan, R., Rupp, M. and Knoll, A. Int. J. Quantum Chem. 2015, 115, 1084-1093.
Prediction errors of molecular machine learning models lower than hybrid DFT error: Faber, F.A., Hutchison, L., Huang, B., Gilmer, J., Schoenholz, S.S., Dahl, G.E., Vinyals, O., Kearnes, S., Riley, P.F. and von Lilienfeld, O.A. J. Chem. Theory Comput. 2017, 13, 5255-5264.