
Directing the power of AI towards finding compounds to combat antimicrobial resistance

The COMBINE project trained an AI model to improve the selection of compounds that can fight multi-drug resistant bacteria – a newly published peer-reviewed paper outlines how they did it.

24 February 2025
COMBINE researcher Yojana Gadiya led efforts to train a new AI model that can help to select the compounds with the most effective antimicrobial activity.

Although many companies individually store databases detailing whether various compounds are effective against drug-resistant bacteria, to date, there has been no synchronised effort to collect this data and use it to better direct antimicrobial drug discovery.

The COMBINE project decided to leverage machine learning and big data to train a new AI model that can help to select the compounds with the most effective antimicrobial activity, and a peer-reviewed paper outlining their work has just been published.

We spoke to Yojana Gadiya from the Fraunhofer ITMP Screening Port Hamburg and the University of Bonn, who led the efforts to develop this model within the project.

Hi Yojana, thanks for talking to us about this exciting result! Where did the idea for this machine learning model come from?

One of the first things that we did within COMBINE was to carry out a survey of what information is publicly available with respect to AMR. If you Google it, you will find an overview of this information compiled by the two biggest AMR communities: JPIAMR and GARDP. We took a closer look at the data found in these resources, and we noticed that they mostly covered either bacterial-strain-related models and datasets (i.e. genotypes and phenotypes) or where to find patients suffering from AMR (i.e. clinical data). One thing that was completely missing was the compound-based information, which is crucial for pre-clinical drug discovery.

Why is the compound-based information important?

By compound-based information, I mean which compounds show activity against specific bacterial strains. To some extent, each organisation has its own in-house activity database, which collects information on compounds tested for antimicrobial activity. But none of this data is aggregated in one single space.

That’s what guided us to build this whole repository, which we call the antimicrobial knowledge graph (KG), to collect all of the publicly available information on experimentally tested compounds with regard to existing bacterial strains or pathogens. Using this data, we trained machine learning models to identify antimicrobial compounds.

How does the machine learning model compare the different compounds?

We focus on the chemoinformatics – the chemistry side, so how the drug is actually formulated and what it looks like. This helps for early drug discovery research because when a researcher has a promising candidate for a drug, they can see what could be done to optimise it so that it becomes a much better drug. Our model focuses on how the compound looks in 2D structure and in 3D structure as well as in various dimensional spaces. We tried to capture all of these different aspects of the compound as well as integrating the basic classifications – like how complex the molecule is or whether it can go through a membrane or not.

How did you train the model?

The model was trained using bioactivity data present in our antimicrobial KG. This dataset is special as it maps more than 80,000 compounds to activity data across a wide spectrum of microbial (bacterial and fungal) strains. In an ideal world, each of these compounds would have been tested against each microbial strain, but given the time and cost associated with this, that is rather difficult. For this reason, we created a minimalist dataset which maps each compound to a microbial class: Gram-positive, Gram-negative, fungi, or another class of pathogens known as acid-fast. This way, we now have a much more complete picture of each compound. Using this as training data, we make the model learn correlations between a compound’s chemical properties and its microbial activity.
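
To illustrate the class-level mapping described above, here is a minimal Python sketch that collapses strain-level test results into pathogen-class labels. It is not the project's code: the strain-to-class table, the compound identifiers, the record layout and the rule "a compound is labelled with every class in which it showed activity" are all assumptions made for this example, since the interview does not spell out the aggregation rule.

```python
from collections import defaultdict

# Hypothetical strain -> pathogen class lookup (illustrative entries only).
STRAIN_CLASS = {
    "Staphylococcus aureus": "gram-positive",
    "Escherichia coli": "gram-negative",
    "Candida albicans": "fungi",
    "Mycobacterium tuberculosis": "acid-fast",
}


def class_level_labels(records):
    """Collapse (compound, strain, active) records into compound -> set of active classes."""
    labels = defaultdict(set)
    for compound, strain, active in records:
        labels[compound]  # touch the key so compounds inactive everywhere are kept
        if active:
            labels[compound].add(STRAIN_CLASS[strain])
    return dict(labels)


# Made-up records: (compound id, strain tested, was it active?)
records = [
    ("cmpd-1", "Staphylococcus aureus", True),
    ("cmpd-1", "Escherichia coli", False),
    ("cmpd-2", "Escherichia coli", True),
    ("cmpd-2", "Candida albicans", True),
    ("cmpd-3", "Mycobacterium tuberculosis", False),
]
labels = class_level_labels(records)
```

Working at class level means a compound tested on only a handful of strains still gets a usable label, which is the gap-filling effect described in the answer above.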

How did you check that the model was correct?

Many studies have already shown that, for several bacterial strains, how the drug interacts with the membrane is essential. For instance, some chemical properties are necessary to improve the permeability of the compound across a membrane, making it active against Gram-negative pathogens. The model was also able to predict this, which gives us a validation that the model is in line with established knowledge. We also compared the model's predictions with experimentally tested results from the EU-Openscreen library. Here, the model correctly predicted at least 30% of the active hits.
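
One way to read the "30% of the active hits" figure is as recall: the share of experimentally confirmed actives that the model also flagged as active. The interview does not state how the figure was computed, so the Python sketch below is only an illustration under that assumption, with made-up compound identifiers and a hypothetical helper name.

```python
def hit_recall(predicted_active, experimentally_active):
    """Fraction of experimentally confirmed actives that were also predicted active."""
    confirmed = set(experimentally_active)
    if not confirmed:
        raise ValueError("no experimentally active compounds to compare against")
    return len(confirmed & set(predicted_active)) / len(confirmed)


# Made-up identifiers: 10 confirmed actives, 3 of which the model flagged.
predicted = {"c1", "c4", "c7", "c9"}
confirmed = {"c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c10", "c11"}
recall = hit_recall(predicted, confirmed)
```

Recall alone says nothing about false positives (here "c9" was predicted active but never confirmed), so in practice it would be reported alongside a measure such as precision.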

Does this increase the efficiency of the drug development process then, because you don't have to do so many tests initially to find drug compounds that work?

Exactly – it works in a few ways. The first way that this model improves efficiency is by predicting how you can optimise your molecule. It also tells you which pathogen class your candidate is more likely to be active against – so whether that’s Gram-positive, Gram-negative, tuberculosis strains, etc. This will reduce your costs because you won’t have to buy a complete compound library and screen it to find that out (saving us about €100-120K).

It also increases efficiency by indicating what actually works. You can use the model to say that of these 200 compounds, 20 are going to work but 20 are definitely not going to work. Then you select the 20 that are going to work, plus ten that are definitely not going to work (so that you have a negative control) and then you only have to test 30 compounds in total instead of 200.
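
The selection step in this example can be sketched in a few lines of Python. The sketch assumes the model returns a score per compound where a higher score means "more likely active"; the scores, compound identifiers and function name are invented for the illustration and are not taken from the project's pipeline.

```python
def pick_test_set(scores, n_active=20, n_control=10):
    """Return the top-scoring compounds plus the lowest-scoring ones as negative controls."""
    if len(scores) < n_active + n_control:
        raise ValueError("not enough compounds to fill both groups without overlap")
    ranked = sorted(scores, key=scores.get, reverse=True)  # best score first
    likely_active = ranked[:n_active]
    negative_controls = ranked[-n_control:]
    return likely_active, negative_controls


# Made-up scores for 200 compounds, evenly spread between 0 and 1.
scores = {f"cmpd-{i:03d}": i / 199 for i in range(200)}
actives, controls = pick_test_set(scores)
to_screen = actives + controls  # 30 compounds instead of 200
```

Keeping a few confidently predicted inactives in the screen gives a negative control: if they turn out to be active in the lab, that is a sign the model's predictions should not be trusted for that library.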

Will this model be available past the project’s end?

This whole pipeline has already been deposited on GitHub, which makes the code open and accessible to others. We will be depositing this information on the EMBL platform as well, which will open this data up not only to the AMR community but also to wider research communities around the globe. What’s more, we have just published a peer-reviewed manuscript describing all of the work that has been done, along with the models. A copy of the model as well as the data repositories of the knowledge graph and the models can be found on the website for people to use.

What are the next steps?

So far, we have built and tested the model in collaboration with EU-Openscreen. After this validation run, we are confident in its predictions and are reaching out to projects such as ENABLE and others to see if the models are prospectively applicable.

COMBINE is supported by the Innovative Medicines Initiative, a partnership between the European Union and the European pharmaceutical industry.