Active machine learning is a subset of supervised machine learning in which the model itself helps select the data it is trained on. Two main categories can be distinguished: Bayesian optimisation and active learning.
The difference between the two techniques lies in the balance between exploration and exploitation. In Bayesian optimisation, the aim is to find a global extremum in a minimal number of iterations. To do so, the model must explore (i.e., learn how the output is distributed over the design space), but it must also exploit: in a zone with a high likelihood of containing an extremum, the model requests more data to verify whether a local or global extremum is indeed present. In active learning, the aim is instead to reduce the uncertainty over the complete design space with a minimal amount of data; exploitation therefore plays no role.
Active machine learning tools use supervised machine learning to model the relationship between input and output. This model is called the surrogate model. Gaussian process regression is a suitable surrogate model because it directly provides an uncertainty estimate alongside each prediction. This enables exploration of the design space, since the uncertainty indicates in which zones the model lacks information.
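The idea of a surrogate with built-in uncertainty can be sketched with a minimal Gaussian process regressor written from scratch in NumPy. This is a toy with a fixed RBF kernel and hand-picked hyperparameters, not a production implementation (in practice one would use a library such as scikit-learn or GPyTorch):

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two sets of 1-D points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_predict(x_train, y_train, x_query, length_scale=1.0, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP."""
    K = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_query, x_train, length_scale)
    K_ss = rbf_kernel(x_query, x_query, length_scale)
    alpha = np.linalg.solve(K, y_train)
    mean = K_s @ alpha
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    return mean, std

# Three measurements of an unknown response surface (invented data)
x_obs = np.array([0.0, 1.0, 4.0])
y_obs = np.sin(x_obs)
x_new = np.linspace(0.0, 5.0, 6)
mean, std = gp_predict(x_obs, y_obs, x_new)
# std is near zero at measured points and grows in unexplored regions
```

The predictive standard deviation is what makes exploration possible: querying where `std` is largest sends the next experiment to the least-known zone of the design space.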
However, this uncertainty is not the only criterion for selecting the next data point. A function called the acquisition function performs this selection. In Bayesian optimisation, it is common practice to employ the Expected Improvement acquisition function: the next data point is the point where the model expects the largest improvement over the current best observed value. In active learning, the next data point should be the most informative one, i.e., the one that reduces the uncertainty over the design space the most.
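Both selection strategies can be contrasted on the same Gaussian posterior. The sketch below implements the standard Expected Improvement formula for maximisation and compares it with plain uncertainty sampling; the candidate means and standard deviations are invented for illustration:

```python
import math
import numpy as np

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mean, std, best_so_far):
    """Expected Improvement for maximisation under a Gaussian posterior."""
    ei = np.zeros_like(mean)
    for i, (m, s) in enumerate(zip(mean, std)):
        if s < 1e-12:
            continue  # no uncertainty means no expected gain
        z = (m - best_so_far) / s
        ei[i] = (m - best_so_far) * normal_cdf(z) + s * normal_pdf(z)
    return ei

# Posterior over three candidate experiments (e.g. from a GP surrogate)
mean = np.array([0.8, 0.5, 0.9])
std = np.array([0.05, 0.40, 0.01])
best = 0.85  # best outcome observed so far

# Bayesian optimisation: pick the candidate with maximal EI
next_bo = int(np.argmax(expected_improvement(mean, std, best)))
# Active learning: pick the candidate with maximal predictive uncertainty
next_al = int(np.argmax(std))
```

Here the two criteria disagree: EI selects the candidate whose mean already exceeds the current best (exploitation), while uncertainty sampling selects the candidate the surrogate knows least about (pure exploration), mirroring the distinction between the two techniques described above.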
In chemical sciences and engineering, active machine learning is mainly used to find chemical reaction conditions. Shields et al. (2021) demonstrated in a landmark paper that reaction optimisation supported by Bayesian optimisation can outperform human decision-making. Their results indicate that ideal conditions for a chemical reaction can be found with far fewer experiments than is common practice. Active learning, in turn, is especially important for creating representative databases. In continuous chemical process design, for example, conditions can deviate from the optimum but must remain controlled. For this purpose, machine learning models that can predict the outcome (e.g., yield or selectivity) over a wide range of input conditions are useful. Since measurements over a wide range of experimental conditions are time- and resource-consuming, active machine learning can help cover the desired design space with a minimal number of experiments.
When people talk about artificial intelligence (AI), they usually mean specific subsets of AI, such as machine learning or deep learning. Machine learning can be defined as the collection of mathematical algorithms that learn from experience, where experience takes the form of data. Deep learning is a subset of machine learning that extracts the important features in the data by itself, using algorithms such as deep neural networks. Note that a model can only be as good as the data it is trained on: if a model is trained on bad data, its predictions will be bad (the 'garbage in, garbage out' principle).
In chemical sciences and engineering, various sources of data can be used, depending on the purpose of the model. Typical applications of chemical AI tools include structure-property relationships and chemical reaction planning. If a model is to predict molecular properties from structure, a dataset is required that contains molecules and their properties. High-quality molecular property data is relatively scarce. Large datasets (hundreds of thousands of molecules) are available for properties calculated via quantum chemistry, but these data points may lack accuracy because it is too time-consuming to compute large databases with highly accurate methods. Experimental data is even scarcer, with properties available for only around 10,000 molecules. This number varies per property, and in many cases the data points are still scattered across the scientific literature.
When it comes to chemical reaction data, the situation is different. Here, we distinguish two cases: optimising a chemical reaction and suggesting reaction conditions. In the first case, data is needed about a specific reaction type. Such reaction datasets are typically constructed experimentally, and it is now common practice to use high-throughput reaction set-ups to obtain this data. Reaction condition suggestion, by contrast, typically concerns new reactions that have never been performed before, which calls for a very large dataset with an enormous variety of chemical reactions. Such datasets exist (Reaxys, SciFinder) but are not accessible for AI purposes due to legal restrictions imposed by the publishers. Alternative datasets extracted from patents are available (Jessop et al., 2011).
It is a crucial task to convert the input data into a format that an algorithm can handle. Converting data into an effective representation is called representation learning. Sometimes the representation is trivial: if we predict the output of a process as a function of continuous process parameters, it is a logical choice to use these process parameters directly as input. Unfortunately, many chemical data types lack a natural numerical representation. Small organic molecules are the most important example and pose many challenges. The classical approach to creating a molecular feature vector is to calculate mathematical descriptors, such as the molecular weight, the number of certain functional groups, or more complex quantities such as Balaban's J index (Balaban, 1982).
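As a minimal illustration of such a descriptor-based feature vector, the toy sketch below maps hand-entered atom and functional-group counts to three descriptors. In practice a cheminformatics toolkit such as RDKit computes hundreds of descriptors directly from the molecular structure; the function and group names here are illustrative only:

```python
# Standard atomic weights for a few common elements
ATOMIC_WEIGHTS = {"C": 12.011, "H": 1.008, "O": 15.999, "N": 14.007}

def featurise(atom_counts, n_hydroxyl=0, n_carbonyl=0):
    """Return a toy feature vector: [molecular weight, #OH groups, #C=O groups]."""
    mol_weight = sum(ATOMIC_WEIGHTS[el] * n for el, n in atom_counts.items())
    return [round(mol_weight, 3), n_hydroxyl, n_carbonyl]

# Ethanol, C2H6O, with one hydroxyl group and no carbonyl
ethanol = featurise({"C": 2, "H": 6, "O": 1}, n_hydroxyl=1)
```

Every molecule is thereby reduced to a fixed-length numerical vector, which is exactly the input format that classical machine learning algorithms expect.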
A molecule can be described mathematically as a graph in which the nodes are atoms and the edges are bonds. This graph can be constructed with little effort, because it is inherently encoded in the Simplified Molecular-Input Line-Entry System (SMILES). SMILES strings are the standard way cheminformatics researchers store molecular structures. Thanks to this accessible graph-based representation, graph neural networks are among the most commonly applied deep learning architectures in chemical research.
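The graph view can be made concrete with a hand-built example for ethanol (SMILES "CCO"). In a real pipeline the graph would be derived from the SMILES string by a cheminformatics toolkit such as RDKit rather than written out by hand:

```python
# Ethanol as a graph: nodes are heavy atoms (hydrogens are implicit,
# as in the SMILES string "CCO"), edges are single bonds.
nodes = ["C", "C", "O"]   # atom symbols, indexed 0..2
edges = [(0, 1), (1, 2)]  # bonds: C-C and C-O

# Adjacency list, the kind of structure a graph neural network consumes
adjacency = {i: [] for i in range(len(nodes))}
for a, b in edges:
    adjacency[a].append(b)
    adjacency[b].append(a)
# adjacency == {0: [1], 1: [0, 2], 2: [1]}
```

A graph neural network then updates each node's feature vector by aggregating over its neighbours in this adjacency structure, which is how molecular structure enters the model.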
Bayesian reaction optimization as a tool for chemical synthesis: Shields, B.J., Stevens, J., Li, J., Parasram, M., Damani, F., Alvarado, J.I.M., Janey, J.M., Adams, R.P. and Doyle, A.G., Nature 2021, 590, 89-96.
Machine learning in chemical engineering: strengths, weaknesses, opportunities, and threats: Dobbelaere, M.R., Plehiers, P.P., Van de Vijver, R., Stevens, C.V. and Van Geem, K.M., Engineering 2021, 7, 1201-1211.
Highly discriminating distance-based topological index: Balaban, A.T., Chem. Phys. Lett. 1982, 89, 399-404.
Mining chemical information from open patents: Jessop, D.M., Adams, S.E. and Murray-Rust, P.J., J. Cheminform. 2011, 3, 40.