June 7, 2024

Approaching AlphaFold 3 docking accuracy in 100 lines of code

By: Alex Rich, Ben Birnbaum, and Josh Haimson

AlphaFold 3 (AF3) is an exciting leap forward for protein-ligand docking, but the baseline it was compared to (Vina) is not truly state-of-the-art. We found that stronger baselines (shown in blue) can outperform the blind version of AF3, especially on more drug-like molecules, and can come much closer to the accuracy of the pocket-specified version of AF3.

The publication of AlphaFold 3 (AF3), and its ability to predict the structures of interacting proteins, DNA, RNA, and small molecules from nothing but sequences and SMILES strings, is an exciting leap forward in our ability to computationally predict the structure and properties of biomolecular systems. The work to understand the practical impact of this advance on day-to-day drug discovery is just beginning. To contribute to this effort, we set out to explore how the small-molecule docking results from the paper compare to existing techniques and where it might make sense to use AlphaFold 3 or similar models in small-molecule drug discovery.

The AlphaFold 3 paper states that the algorithm achieves “far greater accuracy on protein-ligand interactions than state of the art docking tools.” While this sounds compelling, the most competitive baseline used to back this statement is the open-source Vina docking algorithm, about which the paper claims “AlphaFold 3 greatly outperforms classical docking tools like Vina even while not using any structural inputs.” However, out-of-the-box Vina does not represent the highest accuracy that can be achieved with well-known existing techniques.

To better measure how AlphaFold 3 compares to existing techniques, we built a stronger baseline using open-source tools and about 100 lines of code (available here). This approach, like Vina, uses experimentally-determined structural information about the binding pocket. On the PoseBusters benchmark, it outperforms the blind docking version of AF3 (in which no information about the binding pocket is used) and comes much closer to the accuracy of the “pocket specified” version of AF3 (which is told which protein residues are near the ligand). We’ve made our baseline code available and hope it will be useful to the community as a comparison point.

This result does not diminish the usefulness of AlphaFold 3, particularly in its ability to operate with vastly less input information than traditional docking approaches that make use of experimental receptor structures. However, it suggests that in the short term, models like AlphaFold 3 may be used as a tool that complements other docking approaches, rather than wholesale replacing them. We’ll end with some more thoughts on the implications of AF3 for drug discovery — but first, let’s get into the weeds.

The PoseBusters benchmark and reported results

The docking dataset and baselines used in the AlphaFold 3 analyses come from the PoseBusters work of Buttenschoen et al., which was curated from PDB entries deposited in 2021 or later to assess the performance of ML-based docking methods. The dataset was released along with a PoseBusters Python package, which checks that docked poses are within 2 Å RMSD of the experimentally determined pose (the classic metric of docking success) as well as free from other issues that would make a pose unusable, like stereochemistry violations or severe protein-ligand clashes (“PB-valid”).
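
For readers who want to run the same checks, the released posebusters package exposes this as a small API. The sketch below assumes the package’s PoseBusters class and its bust method; the file names are placeholders:

```python
from posebusters import PoseBusters

# "redock" mode checks the pose against the crystal ligand (RMSD < 2 Å)
# in addition to the chemical/geometric validity checks behind "PB-valid".
buster = PoseBusters(config="redock")
results = buster.bust(
    mol_pred="docked_pose.sdf",     # pose produced by the docking program
    mol_true="crystal_ligand.sdf",  # experimentally determined pose
    mol_cond="receptor.pdb",        # receptor, used for clash checks
)
print(results)  # one row per pose, boolean columns for each check
```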

The results in the AF3 paper, reproduced below, compare AF3 to the Vina docking results that were reported as the top approach in the PoseBusters paper. They show AF3 achieving a 15% absolute improvement in the rate of RMSD < 2 Å, PB-valid poses, even without receiving any structural inputs, and the improvement jumps to 26.3% when AF3 is told which protein residues should be in proximity to the ligand (but still not given their experimentally determined 3D structure).

Reproduction of PoseBusters docking results from Extended Data Figure 4 of the AlphaFold 3 paper.

Building a better baseline

Is the Vina method described in the PoseBusters paper the best baseline out there? Not really. And it wasn’t meant to be: the point of the PoseBusters paper was that many deep learning docking programs fail badly, and are worse than even the most run-of-the-mill physics-inspired methods (like Vina), so there was no reason to create the strongest baseline possible.

But for comparison to AlphaFold 3, we can create a much more accurate baseline while still using fairly standard, off-the-shelf tools. We made two major changes to the Vina-based docking pipeline described in the PoseBusters paper.

  1. We ran docking from an ensemble of starting conformations of the ligand, rather than from a single starting conformation. The docking software samples different conformations on its own while docking, but generally it only modifies non-ring rotatable bonds. If the algorithm is given a poor starting conformation of a ring or another region of the molecule that it cannot modify, it has no way to sample a correct pose. Running the algorithm from multiple starting conformations (created using the cheminformatics toolkit RDKit; see the sketch after this list) and pooling the results at the end increases the chance of sampling a high-quality pose (as seen, e.g., here and here). When running from multiple starting conformations, we performed less sampling in each run (via Vina’s exhaustiveness parameter), so that the total amount of computation was roughly unchanged.
  2. We used Gnina to rescore the docked poses output by Vina and choose the best one. Gnina is a convolutional neural network trained to distinguish near-native from non-near-native docking poses, as well as to predict binding affinity. Extensive cross-validation studies have shown that it selects poses more accurately than Vina’s scoring function. It was published in early 2020 and has an earlier PDB training-data cutoff (2017-04-26, based on the IDs listed here) than the AlphaFold 3 model evaluated on PoseBusters (2019-09-30), making it a fair comparison in terms of training data access.
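
To make the first change concrete, here is a minimal sketch of generating a starting ensemble with RDKit. The conformer count, random seed, and SMILES are illustrative placeholders, not the exact settings from our pipeline:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def conformer_ensemble(smiles: str, num_confs: int = 10) -> Chem.Mol:
    """Embed multiple 3D conformers of a molecule for ensemble docking."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    params = AllChem.ETKDGv3()  # knowledge-based torsion and ring sampling
    params.randomSeed = 42      # reproducible embeddings
    AllChem.EmbedMultipleConfs(mol, numConfs=num_confs, params=params)
    AllChem.MMFFOptimizeMoleculeConfs(mol)  # quick force-field cleanup
    return mol

# Write each conformer to its own SDF so the docking program can be
# launched once per starting conformation, pooling poses at the end.
mol = conformer_ensemble("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an example
for conf in mol.GetConformers():
    writer = Chem.SDWriter(f"conf_{conf.GetId()}.sdf")
    writer.write(mol, confId=conf.GetId())
    writer.close()
```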

The code to run this baseline can be found here and uses the Gnina software package (which is built on top of Smina, a fork of Vina).
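
For orientation, the core docking call in a pipeline like this looks roughly as follows, shelling out to the gnina binary from Python. The flags shown (--autobox_ligand to derive the search box from a reference ligand, --cnn_scoring to rescore sampled poses with the CNN, --exhaustiveness to control per-run sampling) are standard gnina/smina options, but the parameter values are illustrative rather than a copy of the released script:

```python
import subprocess

def dock_one_conformer(receptor: str, conformer_sdf: str,
                       ref_ligand: str, out_sdf: str,
                       exhaustiveness: int = 4) -> None:
    """Dock a single starting conformer with gnina: Vina-style pose
    sampling, followed by CNN rescoring to rank the poses."""
    subprocess.run(
        [
            "gnina",
            "-r", receptor,                  # experimental receptor structure
            "-l", conformer_sdf,             # one RDKit starting conformer
            "--autobox_ligand", ref_ligand,  # search box from a reference
                                             #   ligand's binding pocket
            "--cnn_scoring", "rescore",      # rescore sampled poses with
                                             #   the CNN (gnina's default)
            "--exhaustiveness", str(exhaustiveness),  # reduced per-run
                                             #   sampling; many runs pooled
            "--num_modes", "9",              # poses to keep per run
            "-o", out_sdf,
        ],
        check=True,
    )
```

After running this once per starting conformer, the poses are pooled across runs and the top pose by CNN score is kept.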

Comparing our baseline to AlphaFold 3

The results of running docking with these changes on the 308 PoseBusters complexes can be seen in blue between the originally reported Vina and AlphaFold 3 results below. From left to right, the new bars show the Vina scoring function combined with an input conformational ensemble, Gnina scoring with a single input conformation, and finally Gnina scoring combined with an input conformational ensemble.

PoseBusters docking results with stronger baselines added in blue.

With these modifications, the “strong baseline” is now 19.2% above the original Vina baseline at generating poses that are RMSD < 2 Å and PB-valid. It’s 4.2% above the standard AlphaFold 3 result, but still 7.1% below the AlphaFold 3 result with additional pocket information.

Beyond these top-line performance numbers, the AlphaFold 3 paper provides only limited information about the new method’s strengths and weaknesses. One interesting figure, reproduced with our baseline results added below, shows that AF3 does particularly well on 50 “common natural ligands,” which are listed in the supplemental materials and include several nucleosides and nucleotides along with other molecules like pheophytin. It makes sense that AF3 does well here — these common natural ligands are defined as those occurring more than 100 times in the PDB, so they are likely highly represented in the AF3 training dataset. Our baseline, in contrast, struggles on these common natural ligands but performs 8.5% above AF3 on the remaining molecules, compared to a difference of 4.2% on the full dataset.

Comparison between AF3 and a strong baseline on common natural ligands (n=50) and other ligands (n=258).

Because the set of molecules that excludes common natural ligands may be more representative of typical small-molecule therapeutics, this finding raises the possibility that our strong baseline may perform even better relative to AF3 in a real-world drug discovery setting than the overall comparison suggests. But this is where the lack of open code or results in the AlphaFold 3 paper becomes an issue. How well does the AF3 model with additional pocket information perform on these two groups of ligands? And how does performance look if one slices the remaining ligands further? For example, with our baseline model we can evaluate performance on the 69 PoseBusters ligands that contain fluorine or chlorine, a subset that may more closely resemble typical drug molecules. The baseline does quite well here, with 84.1% of poses being PB-valid with RMSD < 2 Å. Does AF3 perform well too, or does it struggle, given that halogens are probably relatively rare in its training set? Without access to the raw data behind the paper’s figures, there’s no way to know.
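
For anyone reproducing this slice, selecting the halogen-containing subset takes only a few lines of RDKit; the SMILES list below is a placeholder for the actual PoseBusters ligands:

```python
from rdkit import Chem

def contains_f_or_cl(smiles: str) -> bool:
    """True if the molecule contains at least one fluorine or chlorine."""
    mol = Chem.MolFromSmiles(smiles)
    return mol is not None and any(
        atom.GetSymbol() in ("F", "Cl") for atom in mol.GetAtoms()
    )

# Example: filter a list of ligand SMILES down to the halogenated subset.
smiles_list = ["CC(=O)Nc1ccc(O)cc1", "Fc1ccc(Cl)cc1"]  # placeholder ligands
halogenated = [s for s in smiles_list if contains_f_or_cl(s)]
print(halogenated)  # ['Fc1ccc(Cl)cc1']
```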

What does AlphaFold 3 mean for the future of docking in drug discovery?

The AlphaFold 3 paper implies a simple answer to the above question: AlphaFold 3 blows alternative methods away, and once released, AlphaFold 3 or models like it should replace existing tools for most molecular docking. Our findings here are more nuanced. AlphaFold 3 is a true breakthrough for blind docking, but it doesn’t outperform what a strong baseline can achieve with receptor information. And the version of AlphaFold 3 with limited pocket information outperforms the baseline on the full PoseBusters benchmark, but it’s unclear which performs better for more drug-like molecules.

In many ways, this is a more interesting situation. As the AlphaFold 3 authors rightly state, traditional docking methods use experimentally-determined receptor structures that one never truly has in practice. But it’s not really the case that computational chemists have nothing to start from but a protein sequence, either. Many real-world drug programs have a specific binding pocket being targeted and at least one (and maybe many) experimentally-determined structures of similar ligands bound to the protein, giving important structural information that can be used to guide docking algorithms. 

It remains an open question where existing methods are still the “state-of-the-art” and where models like AlphaFold 3 perform best (or soon will). For cases where an appropriate crystal structure is unavailable, or where a pocket is known to exhibit considerable flexibility, AF3 seems likely to have a lot of impact. On the other hand, for targets with available experimental structures and relatively rigid binding sites, and especially for cases where one has experimental data from a related ligand, other docking methods may still do best for the near future. The one thing that is certain is that this field will evolve rapidly in the months and years to come. We’re excited to see where things go once the community gets access to AF3 or AF3-like models, as well as what new capabilities emerge with AlphaFold 4 and beyond.

Acknowledgements

We thank Andy Good and Abba Leffler for feedback on this blog post.