January 10, 2025

Can ChatGPT Speak Chemistry?

Why the choice of molecular notation affects what an LLM understands

By:  
Patrick Maher and Ben Birnbaum

The tech world can be prone to hyperbole, but it’s no exaggeration to say that the progress of Large Language Models (LLMs) like ChatGPT over the past few years has been sensational. And while novel abilities like writing poetry and telling jokes might have helped capture the public imagination, these models have also shown impressive progress on hard scientific tasks that require synthesizing many pieces of information.

In the domain of drug discovery, generative chemistry certainly seems like a problem where the flexibility of LLMs might be a good fit. The general conclusion from people who have looked at this carefully, though, is that LLMs are less reliable than existing methods, often failing even to produce valid molecules. At Inductive Bio, we broadly agree with that assessment, but we’ve also found in our own experiments that a seemingly minor trick can make a major difference in the quality of the output. Simply changing the way molecules are described, from SMILES strings to IUPAC names, can dramatically improve the results and reduce the number of invalid molecules generated by over 90%.

While this may seem like a surprising result, understanding why such a trivial change can have such a big effect comes down to a fundamental aspect of how LLMs are trained. And this phenomenon, in turn, has real implications if we’re trying to assess how much chemistry LLMs actually “understand.”

We’ll get into the LLM training process and how it can cause unintuitive results below, but first let’s flesh out the generative chemistry problem a bit. To make things concrete, imagine you’re trying to come up with new analogs of this molecule (amoxicillin):

If you’re using an LLM for this task, you probably want to represent the molecule as text in some way. Of course, you could simply ask the model, “Please give me some analogs to amoxicillin”, but we’d really like a strategy that will work for novel, unnamed compounds. So one option is to use the SMILES string representation and ask, “Please give me some analogs to CC1(C)S[C@@H]2[C@H](NC(=O)[C@H](N)c3ccc(O)cc3)C(=O)N2[C@H]1C(=O)O”. But we could also use the IUPAC nomenclature and ask, “Please give me some analogs to (2S,5R,6R)-6-[[(2R)-2-amino-2-(4-hydroxyphenyl)acetyl]amino]-3,3-dimethyl-7-oxo-4-thia-1-azabicyclo[3.2.0]heptane-2-carboxylic acid.”
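If you want to double-check that the SMILES string above really does describe amoxicillin, a toolkit like RDKit makes that a few lines of code. Here’s a minimal sketch, assuming RDKit is installed (parsing the IUPAC name requires a separate tool, which we’ll come back to in the footnotes):

```python
# Minimal sanity check with RDKit: parse the SMILES string above and
# confirm that it corresponds to amoxicillin's molecular formula.
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

amoxicillin = "CC1(C)S[C@@H]2[C@H](NC(=O)[C@H](N)c3ccc(O)cc3)C(=O)N2[C@H]1C(=O)O"
mol = Chem.MolFromSmiles(amoxicillin)        # returns None if the SMILES is invalid
print(Chem.MolToSmiles(mol))                 # canonical SMILES
print(rdMolDescriptors.CalcMolFormula(mol))  # C16H19N3O5S, amoxicillin's formula
```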

Which approach should provide better results? You might hope that if the model really understands chemistry, the results would be pretty similar.

We were curious about this, so we ran an experiment using a few cutting-edge LLMs (OpenAI’s GPT-4o, Anthropic’s Claude, and OpenAI’s o1-preview). We gave them two different prompts (with a rough sketch of how such a call might look in code just after the list):

  1. You are a skilled medicinal chemist. Generate SMILES strings for 20 analogs of the molecule represented by the SMILES string {compound_structure}. You can modify both the core and the substituents. Return the answer as a strict json array in a markdown block with each entry containing one SMILES string.
  2. You are a skilled medicinal chemist. Generate IUPAC names for 20 analogs of the molecule represented by the IUPAC name {compound_structure}. You can modify both the core and the substituents. Return the answer as a strict json array in a markdown block with each entry containing one IUPAC name.
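To make the setup concrete, here’s a rough sketch of what issuing the first prompt might look like with OpenAI’s Python SDK. The model name, the amoxicillin SMILES filled into the template, and the JSON extraction are illustrative rather than a record of exactly what we ran:

```python
# Illustrative sketch: send the SMILES-based prompt to an LLM and pull the
# suggested analogs out of the JSON array in its reply. Assumes the openai
# package (v1-style client) and an OPENAI_API_KEY in the environment.
import json
import re

from openai import OpenAI

client = OpenAI()

prompt = (
    "You are a skilled medicinal chemist. Generate SMILES strings for 20 analogs "
    "of the molecule represented by the SMILES string "
    "CC1(C)S[C@@H]2[C@H](NC(=O)[C@H](N)c3ccc(O)cc3)C(=O)N2[C@H]1C(=O)O. "
    "You can modify both the core and the substituents. Return the answer as a "
    "strict json array in a markdown block with each entry containing one SMILES string."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)

# Crude extraction: grab the bracketed JSON array out of the markdown block.
text = response.choices[0].message.content
match = re.search(r"\[.*\]", text, re.DOTALL)
analogs = json.loads(match.group(0)) if match else []
print(f"{len(analogs)} suggested analogs")
```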

For the prompt compounds, we used the 18 molecules from this Fragment-to-Lead review. We then collected the outputs and attempted to parse them into chemical structures.1
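Checking whether a suggestion corresponds to a valid molecule is a one-liner with RDKit when the output is a SMILES string; when the output is an IUPAC name, we first need a name-to-structure step (see footnote 1). A minimal sketch, assuming RDKit and the py2opsin wrapper mentioned in that footnote:

```python
# Validity checks: SMILES strings go straight to RDKit; IUPAC names are first
# converted to SMILES with OPSIN (via the py2opsin wrapper).
from rdkit import Chem
from py2opsin import py2opsin

def is_valid_smiles(smiles: str) -> bool:
    # RDKit returns None for anything it can't parse and sanitize
    # (unclosed rings, bad valences, aromaticity problems, ...).
    return Chem.MolFromSmiles(smiles) is not None

def is_valid_iupac(name: str) -> bool:
    # py2opsin returns a SMILES string, or an empty result if OPSIN
    # can't parse the name (check its docs for the exact failure value).
    smiles = py2opsin(name)
    return bool(smiles) and is_valid_smiles(smiles)

print(is_valid_smiles("CC(C)SC"))               # True
print(is_valid_smiles("CC1(C)SC"))              # False: ring closure 1 is never closed
print(is_valid_iupac("4-hydroxybenzoic acid"))  # True
```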

The main result is presented in the graph below. For each of the models, we see that asking for SMILES strings often results in invalid molecules, with problems like unclosed rings, aromaticity violations, or incorrect valences. In contrast, despite the longer character lengths of IUPAC names, all the LLMs we tried showed much lower rates of invalid molecules: almost every suggestion corresponded to a valid structure. Anthropic’s Claude model, for example, generated 46 invalid SMILES strings but only 4 invalid IUPAC names, a 91% reduction in invalid suggestions!

Comparison of chemical validity of analogs suggested by LLMs when prompted with SMILES notation vs. IUPAC notation. The fraction of invalid suggestions is calculated over the 20 analogs suggested for each molecule, and the distribution is plotted over the 18 prompt molecules.

Of course, we don’t just want valid suggestions; we want useful ones as well. This is more subjective, but the IUPAC-prompted analogs often seem to be more chemically diverse and interesting than the SMILES-prompted ones. You can see one example of this below.

Analogs generated when the input molecule is given as a SMILES string (left) vs. as an IUPAC name (right). Beyond the higher validity rate (five of the SMILES-prompted analogs were invalid and are not shown), the IUPAC-prompted analogs are more chemically diverse and interesting.

So what’s going on here? Why do these different prompts generate such different responses? Well, an important thing to keep in mind when analyzing the results of LLMs is how they’re trained. To vastly oversimplify a complex and rapidly evolving field, the core technique used to train an LLM is a game of fill-in-the-blank.2 Models are shown sections of text (ideally high-quality text produced by well-informed humans) with the next word hidden, made to guess that word, and adjusted based on whether they were right or wrong. This process is repeated trillions of times, and in doing so the models learn to capture all sorts of nuances needed to distinguish one sentence from another. (Consider completing the sentence: “Carboxyl groups have a very _ lipophilicity”. Which context words are important for determining the correct answer? What kind of text would you need to see to learn patterns of how those words are used?) One of the key factors behind the recent success of LLMs is that this seemingly simple process turns out to be tremendously powerful when you have a huge number of parameters in your model and mountains of training data.

A cartoon of the core training task for LLMs. An incomplete sentence is fed to the LLM, and it generates probabilistic predictions for the next word. Based on whether those predictions are correct or incorrect, parameters inside the LLM are adjusted to encourage it to choose the correct word.
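For readers who prefer code to cartoons, here’s a vastly simplified sketch of that training step, assuming PyTorch. The toy "model" below is just an embedding plus a linear layer standing in for a real transformer, and a single sentence stands in for the mountains of training data:

```python
# A toy version of next-word ("fill-in-the-blank") training. Real LLMs use
# transformers over tokens and trillions of examples, but the mechanics of
# "predict the next word, then nudge the parameters" are the same.
import torch
import torch.nn as nn

vocab = {"carboxyl": 0, "groups": 1, "have": 2, "a": 3, "very": 4, "low": 5, "lipophilicity": 6}
sentence = torch.tensor([[0, 1, 2, 3, 4, 5, 6]])  # one tokenized training sentence

class ToyLM(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, token_ids):
        # A score (logit) for every word in the vocabulary at each position.
        return self.head(self.embed(token_ids))

model = ToyLM(len(vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Guess word t+1 from word t, compare against what actually came next...
logits = model(sentence[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, len(vocab)), sentence[:, 1:].reshape(-1)
)
# ...and adjust the parameters to make the correct word more likely next time.
loss.backward()
optimizer.step()
print(f"training loss: {loss.item():.3f}")
```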

While we don’t have details on what data OpenAI or Anthropic uses to train their models, it’s a pretty reasonable hypothesis that much of the training data relevant to medicinal chemistry looks like what humans write in papers and textbooks, which largely use the language of functional groups. And while determining the functional groups from a SMILES string is possible, it takes some work. In contrast, look back at our IUPAC name for amoxicillin: pieces like “hydroxyphenyl” and “azabicyclo” are very specific and tell us clearly what functional groups and ring systems we’re working with. Given all this, it’s not so surprising that LLMs would have an easier time making modifications to IUPAC names.
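To make that concrete, here’s what extracting a few functional groups from the amoxicillin SMILES looks like with SMARTS substructure searches in RDKit (the patterns are simplified, illustrative ones). The IUPAC name, by contrast, hands you “hydroxyphenyl” for free:

```python
# Finding functional groups in a SMILES string takes explicit work, e.g.
# SMARTS substructure matching with RDKit. (Patterns are simplified examples.)
from rdkit import Chem

mol = Chem.MolFromSmiles(
    "CC1(C)S[C@@H]2[C@H](NC(=O)[C@H](N)c3ccc(O)cc3)C(=O)N2[C@H]1C(=O)O"
)
patterns = {
    "hydroxyphenyl (phenol)": "c1ccc(cc1)[OX2H]",
    "carboxylic acid": "C(=O)[OX2H]",
    "amide": "C(=O)N",
}
for name, smarts in patterns.items():
    print(name, mol.HasSubstructMatch(Chem.MolFromSmarts(smarts)))
```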

It’s worth noting, too, that this is by no means a phenomenon unique to chemistry. Many others have observed that the performance of LLMs on a downstream task can vary widely depending on how the model is prompted. One study found a >40% difference in code translation accuracy when comparing two popular formatting choices; another found large differences in question-answering performance from only cosmetic changes to the prompt; still others have noticed huge swings in the effective Elo rating of LLMs playing chess, depending on both the model version and how it’s prompted.

So can ChatGPT speak chemistry? The answer, as we’ve seen, is that it depends on the dialect. Give it a SMILES string and it might see gibberish. Give it an IUPAC name, on the other hand, and it can offer some meaningful suggestions.

Going further, can we conclude anything about how much chemistry LLMs actually understand? On the one hand, the sensitivity we saw to the input format of the molecules suggests that LLMs’ understanding may be quite shallow. On the other, research has shown that LLMs are capable of producing results that go beyond simple pattern matching or looking up training examples. And although we believe that other approaches to generative chemistry are still more effective for now, many of the suggestions these models come up with, when properly prompted, are genuinely interesting. It’s very possible, too, that the particular input-format sensitivity we observed will shift as computer code (which would tend to contain more SMILES strings) is increasingly used alongside human language as training data.

Ultimately, the question of how much a machine learning model understands can be somewhat slippery (and philosophical), so it’s often more useful to focus on what a model can do. Because the pace of progress in LLMs doesn’t seem to be slowing down, the only way to tell when these models have improved enough to be useful for problems like generative chemistry is to keep testing them, and to make sure we’re speaking their language.

Acknowledgements

We thank Josh Haimson and Paul Ornstein for feedback on this blog post.


1 While parsing SMILES strings on a computer is straightforward, working with IUPAC names programmatically is a little harder than you might hope. A number of proprietary tools (ChemAxon, OpenEye, ChemDraw) have functionality for this, but RDKit (a workhorse of the open-source cheminformatics world) can’t do it out of the box. For converting an IUPAC name into SMILES, there’s a Java package called OPSIN (with a Python wrapper) that uses a formal grammar to parse names. Going the other direction, the best option is actually another machine learning model, a project called STOUT, which is trained on almost a billion pairs of SMILES strings and IUPAC names. Conveniently, STOUT offers both a Python package and a web app/API.
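For the curious, both directions end up being only a few lines. This is a sketch based on the packages’ documented entry points (py2opsin’s single-call interface and STOUT’s translate_forward); check their current docs before relying on it:

```python
# Name-to-structure with OPSIN (via py2opsin), structure-to-name with STOUT.
from py2opsin import py2opsin
from STOUT import translate_forward

smiles = py2opsin("4-hydroxybenzoic acid")     # IUPAC name -> SMILES
print(smiles)                                  # something like OC(=O)c1ccc(O)cc1

name = translate_forward("OC(=O)c1ccc(O)cc1")  # SMILES -> IUPAC-style name
print(name)  # ideally "4-hydroxybenzoic acid" (STOUT is itself an ML model, so no guarantees)
```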

2 For those interested in a not-quite-as-vastly oversimplified picture, a few of the more important details: First, consumer models like ChatGPT also make heavy use of a technique called "supervised fine-tuning" (along with the related technique of reinforcement learning from human feedback), where the model is taught explicitly what kinds of chat responses are good and bad. Second, most of these models don't actually work directly with words, but with "tokens", which are sequences of a few characters at a time. This turns out to be more efficient, and you can imagine how it’s helpful when dealing with, e.g., misspelled words. And third, these large models increasingly leverage a technique called "retrieval-augmented generation" so that they can make use of information without having to learn it during training. You can think of this as letting the LLM search Wikipedia while it's composing its answer.