Data is one of the key pillars that support the use of machine intelligence in the chemical sciences and engineering (Dobbelaere et al. 2021). It is therefore crucial to have appropriate methods for data acquisition, storage, and handling. It is good practice to follow the FAIR data principles when working with scientific data.
FAIR is the abbreviation of Findability, Accessibility, Interoperability, and Reusability. These four foundational principles should become standard practice for all scientists and engineers who create or use data, whether their research is computational or experimental. The intent of these principles is to enhance the reusability of data, by both machines and human researchers.
Experimental researchers routinely note down every step of their work in a laboratory notebook. Even in the 2020s, most of these notebooks are made of paper and hence are not directly machine interpretable. Apart from paper notebooks, spreadsheets (e.g., Microsoft Excel) are another common way to store information about performed experiments. If these spreadsheets are perfectly tabulated, the data can be interpreted by a machine. Nevertheless, neither paper notebooks nor spreadsheets are ideal, because a lot of data and metadata are lost.
Electronic lab notebooks (ELNs) offer a solution that is in agreement with the FAIR data principles (Jablonka et al. 2022). Technically, an ELN is a piece of software that supports the planning, description, storage, and management of scientific data. A lot of electronic scientific equipment (e.g., analytical tools) is delivered with commercial software from the manufacturer of the instrument. This software allows data storage and visualisation, and performs tasks such as peak fitting. However, once the license has expired, the data might be lost. Furthermore, collaborators from other institutes who do not have access to the same software might not be able to view or reuse the data.
The solution is to make use of open-source ELN software. A risk here is that maintenance of an open-source software repository might stop once the project is over or when a researcher changes institutions. However, as of 2024, multiple community-supported open projects exist. In such tools, all data and metadata are stored (Tremouilhac et al. 2017). This means that instead of a user typing values into a spreadsheet, all raw data are collected. Raw data in this form can be distributed among various users and systems. Visualisation and analysis tools in the ELN software transform the machine data into something that is interpretable by a human.
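To make the idea of storing raw data together with its metadata more concrete, the following is a minimal sketch in Python. It is not the schema of Chemotion or of any other ELN; the class name `ExperimentRecord`, all field names, the example values, and the licence string are assumptions chosen purely for illustration. The record carries an identifier, a description of its origin, units, and a checksum of the raw values, and it is serialised to JSON so that it can be read without proprietary software.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class ExperimentRecord:
    """Hypothetical record layout; not the schema of any specific ELN."""

    identifier: str    # unique ID so the record can be found again
    instrument: str    # which instrument produced the raw data
    recorded_at: str   # ISO 8601 timestamp
    units: dict        # unit for every measured quantity
    raw_values: dict   # unprocessed instrument output
    licence: str = "CC-BY-4.0"  # reuse terms (assumed for illustration)
    checksum: str = ""          # filled in by seal()

    def seal(self) -> "ExperimentRecord":
        # Hash the raw values so that later modifications can be detected.
        payload = json.dumps(self.raw_values, sort_keys=True).encode("utf-8")
        self.checksum = hashlib.sha256(payload).hexdigest()
        return self


def to_json(record: ExperimentRecord) -> str:
    # Plain JSON is readable by both humans and machines.
    return json.dumps(asdict(record), indent=2, sort_keys=True)


def from_json(text: str) -> ExperimentRecord:
    return ExperimentRecord(**json.loads(text))


# Invented example values, for illustration only.
record = ExperimentRecord(
    identifier="example-0001",
    instrument="hypothetical GC detector",
    recorded_at="2024-01-15T10:30:00+00:00",
    units={"time": "min", "signal": "mV"},
    raw_values={"time": [0.0, 0.5, 1.0], "signal": [0.1, 4.2, 0.3]},
).seal()

restored = from_json(to_json(record))
print(restored == record)
```

Because the serialised record contains the units and provenance alongside the measurements, a collaborator who does not have the instrument software can still load and interpret it.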
The use of an ELN makes it possible to store all data without human bias. These data also include “failed” experiments or “negative” data points, that is, undesired or unwanted results. It has been shown that these negative data points offer additional value for AI-based research (Strieth-Kalthoff et al. 2022). Yet, these experimental results are typically discarded, or not even recorded, since they do not provide added value in a research paper. This practice needs to change for researchers to take full advantage of the capabilities of machine learning.
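The effect of discarding negative results can be illustrated with a small sketch in Python. The reaction entries below are invented, and the class `ReactionResult`, the function `label`, and the 5 % yield threshold are hypothetical choices made only for this example. The point is that a dataset filtered down to successful reactions contains a single class, so a classifier trained on it has nothing to distinguish success from failure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReactionResult:
    """Hypothetical entry for one experiment, successful or not."""

    reaction_id: str
    temperature_c: float
    yield_percent: Optional[float]  # None when no product was isolated
    note: str = ""


def label(result: ReactionResult, threshold: float = 5.0) -> int:
    """Return 1 for a 'successful' reaction and 0 for a 'failed' one.

    The threshold is an arbitrary assumption for this sketch.
    """
    if result.yield_percent is None:
        return 0
    return int(result.yield_percent >= threshold)


# Invented data: two successes and two failures.
results = [
    ReactionResult("rxn-001", 25.0, 82.0),
    ReactionResult("rxn-002", 25.0, None, "no conversion observed"),
    ReactionResult("rxn-003", 60.0, 3.0, "trace product only"),
    ReactionResult("rxn-004", 60.0, 47.0),
]

# Keeping every experiment yields both classes for a learning algorithm.
all_labels = [label(r) for r in results]

# Keeping only what would typically be reported yields one class only.
reported_only = [r for r in results if label(r) == 1]
reported_labels = [label(r) for r in reported_only]

print(sorted(set(all_labels)), sorted(set(reported_labels)))
```

Recording the failed entries with a short note, as in `rxn-002` and `rxn-003`, costs little at the bench and preserves the information that the filtered list loses.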
Machine learning in chemical engineering: strengths, weaknesses, opportunities, and threats: Dobbelaere, M.R., Plehiers, P.P., Van de Vijver, R., Stevens, C.V. and Van Geem, K.M., Engineering 2021, 7, 1201-1211.
Making the collective knowledge of chemistry open and machine actionable: Jablonka, K.M., Patiny, L. and Smit, B., Nat. Chem. 2022, 14, 365-376.
Chemotion ELN: an open source electronic lab notebook for chemists in academia: Tremouilhac, P., Nguyen, A., Huang, Y.C., Kotov, S., Lütjohann, D.S., Hübsch, F., Jung, N. and Bräse, S., J. Cheminform. 2017, 9, 54.
Machine learning for chemical reactivity: the importance of failed experiments: Strieth-Kalthoff, F., Sandfort, F., Kühnemund, M., Schäfer, F.R., Kuchen, H. and Glorius, F., Angew. Chem. Int. Ed. 2022, 61, e202204647.