Transforming chemistry with machine learning

About the author: Theo Martinot is a skilled synthetic organic chemist with over 10 years of industry experience including methodology development, total synthesis of natural products and pharmaceutical targets, and route evaluation and development.

A strong proponent of scientific innovation across all channels including laboratory automation and data rich experimentation, he drives efficiency improvements through knowledge infrastructure (Lab Equipment Integration, etc.) and supports implementation of a Design of Experiments (DoE) approach to projects in all stages of development. Theo is currently an Associate Principal Scientist of Discovery Process Chemistry at Merck & Co., Inc., Kenilworth, NJ, USA.

Internet of Things in chemistry

My first blog post on the potential of the Internet of Things (IoT) in chemistry yielded both fruitful and surprising discussions with colleagues throughout the industry. These conversations catalyzed a flurry of new questions in my mind:

  • Where else can we improve how we perform our jobs as scientists?
  • What does technology really mean in the lab?
  • What do richer data sets really enable practically (and could they be a hindrance)?
  • How else could we leverage modern technology found elsewhere in our lives to improve the quality of science we do in the laboratory?

One outcome of using the Internet of Things in the laboratory is a set of higher-quality, data-rich experiments that enable not only data-driven decisions but also higher-order analyses that have not been possible to date (e.g., principal component analysis, other multivariate techniques, and machine learning).
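To make "principal component analysis" a little more concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the three variables (temperature, time, yield) and every number are made up, the function name is hypothetical, and the eigenvector is found by simple power iteration rather than by a production linear-algebra library. It also assumes no column is constant.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, explained_fraction) for the first principal component.

    Columns are standardized first so that variables with different units
    (degrees, hours, percent) contribute on the same scale.
    """
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(dims)]
    z = [[(r[j] - means[j]) / stds[j] for j in range(dims)] for r in rows]

    # Covariance of the standardized columns (i.e. the correlation matrix).
    cov = [[sum(row[i] * row[j] for row in z) / (n - 1) for j in range(dims)]
           for i in range(dims)]

    def apply(v):
        return [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]

    # Power iteration: multiply and renormalize repeatedly to converge on
    # the eigenvector with the largest eigenvalue.
    vec = [1.0] * dims
    for _ in range(iterations):
        nxt = apply(vec)
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]

    eigenvalue = sum(v * w for v, w in zip(vec, apply(vec)))
    total = sum(cov[i][i] for i in range(dims))
    return vec, eigenvalue / total

# Made-up runs: (temperature in C, reaction time in h, yield in %).
runs = [
    [20.0, 1.0, 35.0],
    [40.0, 2.0, 52.0],
    [60.0, 2.5, 71.0],
    [80.0, 4.0, 88.0],
]
direction, explained = first_principal_component(runs)
print(direction, explained)
```

The returned direction shows which combination of variables moves together, and the explained fraction shows how much of the overall variation that single combination captures. With dozens of sensor channels instead of three columns, this is the kind of summary a person cannot produce by eye.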

Richer data sets mean a richer analysis and perhaps even unexpected discoveries. To me, and many others who have already invested a lot of effort into this, machine learning seems like the next logical step here. In other industries, like manufacturing, finance, and health care, innovators have used machine learning to improve their operations and outcomes (just have a look at the Jan. 23, 2017 issue of C&E News for an example of the many applications). I have long believed that such heuristics deserve a prominent place in advancing science.

Mining vast amounts of data

Machine learning uses vast historical data sets to build predictive models that can help drive decisions. This capability has proven to be revolutionary in circumstances where simple analytics may not have been sufficient. In my mind, such tools transcend boundaries and provide insights on correlations that may otherwise have gone unnoticed. The human mind can easily process data sets in 1 or 2 – maybe 3 – dimensions, but we struggle beyond that.

In the same way that other industries have built upon their troves of data, how can we in science do the same? For years, a large portion (if not all) of the pharmaceutical industry has used Electronic Lab Notebooks (ELNs). This software, despite its promise, disappoints because it cannot build knowledge from the vast databases it accumulates (unfortunately, most ELNs have also failed to deliver enhanced collaboration – a topic for another time).

This realization leads me to consider how techniques like machine learning and artificial intelligence can improve how we execute our science, and how the data within the ELN (specifically) and the literature (broadly) can be mined to reveal the next scientific breakthrough.

The challenge in chemistry

Before I delve into the benefits of machine learning and the dream of what it could bring, let’s dispel the fears. First off, I am a firm believer that the scientist will always be needed – not just as a pair of hands (which, actually, can be replaced by an automated system), but as an active judge and jury for the best course of action.

From personal experience, the planning, prosecution, and analysis of organic chemistry rely heavily on instinct, and I question whether a machine could ever emulate the entire process. In general, the nature of science and discovery is unstructured and unpredictable – a pursuit best suited to creative people who can respond accordingly when the unexpected happens. This may be hubris, but I don’t think that a machine will ever replace the intuition and expertise that a practicing scientist provides beyond the bench. However, I do believe that technology can play an important part in supporting and/or augmenting scientists, particularly in highlighting unexpected or otherwise invisible correlations.

And that’s exactly the type of support I’m looking for: an unbiased advisor. What we need is to artificially enhance chemists with all of the governing dynamics of physical and synthetic organic chemistry (the entire knowledge base) – a system that can be taught the principles of thermodynamics and kinetics associated with synthetic transformations and all of the associated primary literature.

When designing a synthesis, a chemist will look back from the target molecule and analyze which bonds can be formed, in what order, all the way to the starting material(s) – a practice known as retrosynthetic analysis. Today, there are many tools at hand that can help the chemist with deciding (here’s the intuition) what approach has the highest probability of success.

For example, Reaxys and SciFinder are great resources that can provide literature references and procedures to support an idea. Other tools (e.g., Chematica and ICSYNTH) provide algorithms that aid specifically in the design of syntheses. But what limits the capability of these tools is that every molecule is unique, and so each will react differently under various reaction conditions. The conditions reported in the literature for a related substrate may not work as advertised, and are likely to behave very differently on your molecule.

Furthermore, the state of the art in chemistry is ever-changing; while we all attempt to stay abreast of the latest literature, it is impossible to have all optimal starting reaction conditions in mind at all times. Note the emphasis on “starting”: once a route has been validated, the final, optimal reaction conditions will often be very different from where one started; but the idea is to always start from the position that gives the highest probability that the molecule can be made.

The opportunity for machine learning

Using tools like machine learning, could we harness the wealth of reaction data locked in our ELNs from all of our peers to inform how we synthesize new compounds? With these data, could we also build an algorithm that estimates the probability of success?

The reality of science is that experiments often fail. And, unfortunately, these negative findings frequently disappear from the published success story (otherwise, the story would be too grotesque). Yet, it is these negative findings that many of us find the most useful.

By analyzing the years of experiments captured in an ELN (the good and the bad), the algorithm could not only suggest to a scientist the optimal parameters for a specific experiment based on the substrate or product drawn (the state of the art), but also provide a probability of the reaction’s success and perhaps even an alternate route. Such an algorithm would arm the scientist with the best tools for the job.
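As a rough illustration of how past successes and failures could feed such a probability, here is a minimal sketch. Everything in it is an assumption for the sake of the example: the record format, the condition labels, the function names, and the outcomes are invented, and the estimate is a simple smoothed success rate, not a trained machine learning model.

```python
from collections import defaultdict

def success_rates(records, prior_successes=1, prior_failures=1):
    """Estimate a success probability for each set of conditions.

    records: iterable of (conditions_label, succeeded) pairs.
    The priors add one imaginary success and one imaginary failure per
    label, so a single lucky run is not reported as a 100% success rate.
    """
    counts = defaultdict(lambda: [0, 0])  # label -> [successes, failures]
    for conditions, succeeded in records:
        counts[conditions][0 if succeeded else 1] += 1
    return {
        label: (s + prior_successes) / (s + f + prior_successes + prior_failures)
        for label, (s, f) in counts.items()
    }

def best_conditions(records):
    """Return the (label, estimated_probability) pair with the highest estimate."""
    rates = success_rates(records)
    return max(rates.items(), key=lambda item: item[1])

# Invented notebook entries: both successful and failed runs are kept.
EXAMPLE_RECORDS = [
    ("conditions A", True), ("conditions A", True), ("conditions A", True),
    ("conditions A", True), ("conditions A", False),
    ("conditions B", True),
    ("conditions C", False), ("conditions C", False),
]

print(success_rates(EXAMPLE_RECORDS))
print(best_conditions(EXAMPLE_RECORDS))
```

The failed runs matter here: without the two failures recorded for "conditions C", nothing would warn the next chemist away from them. A real system would also need to account for how similar the new substrate is to the ones in the records.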

In practical terms, one could imagine an ‘autocomplete’-like interface. As we define the product or reactants in a scheme, the software could recognize the experiment and ask:

“Are you running a Suzuki coupling?”
[YES] → “Would you like to load the best conditions for your reaction?”
“Are you trying to synthesize [x]?”
[YES] → “Compounds with 80% homology have been made; would you like to review the approaches taken?”

Based on its assessment, the software could suggest optimal reaction conditions, specific synthetic routes, or even compounds to evaluate (either commercially available or proprietary). It could do all that and more, and even provide a probability analysis that aggregates public data (e.g., information available on SciFinder, Reaxys, PubMed, etc.) and internal data (i.e., the good and the bad).
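The “80% homology” prompt above could be sketched as a similarity search. The example below is a minimal illustration, assuming that each compound is described by a set of structural features and that “homology” is read as a Tanimoto similarity of at least 0.8. The feature labels, compound names, and function names are all hypothetical; a real system would more likely use fingerprints generated by a cheminformatics toolkit.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity: shared features / all features."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

def similar_compounds(query, library, threshold=0.8):
    """Return (name, similarity) pairs at or above the threshold, best first."""
    scored = [(name, tanimoto(query, features))
              for name, features in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Hypothetical in-house library: compound name -> set of structural features.
LIBRARY = {
    "compound-001": {"aryl bromide", "pyridine", "amide", "methyl ester", "fluoro"},
    "compound-002": {"aryl bromide", "ketone"},
    "compound-003": {"indole", "sulfonamide", "nitrile"},
}

# Features of the molecule the chemist has just drawn (also hypothetical).
QUERY = {"aryl bromide", "pyridine", "amide", "methyl ester"}

for name, score in similar_compounds(QUERY, LIBRARY):
    print(f"{name} is {score:.0%} similar; review the approaches taken?")
```

The threshold is a design choice: set it too high and the software stays silent, too low and it buries the chemist in loosely related precedents.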

Companies invest significantly in the development of new technology (including the discovery of new reactions). Propagating this information through the very tools that chemists use would help ensure that everyone is always doing the best science possible. Catalysis experts must be so frustrated, for example, to learn that colleagues are struggling with a reaction, only to find out that the methods those colleagues are using are decades old.

Driving science with advanced analytics

What I advocate isn’t science fiction. In fact, many people have already been working on this for years. Data is – and has always been – king. With tools like IoT and High Throughput Experimentation (HTE), data will continue its reign, but we need to think creatively about how to make today’s results and analysis still relevant and accessible tomorrow and 10 years from now.

This means providing more structure to that construct but also inventing tools that can help automate these tasks with high accuracy and precision. At a time when our industry is seeking to improve the time-to-completion of projects, I’m inspired by the possibility of advanced analytics making my colleagues and me more productive, while also driving science forward.

I welcome discussion in the comment section below.
