We’ve all heard a lot about using large language models (LLMs) in medicine, but it’s generally in the abstract. I’d like to explore concretely how to apply these models to answer a specific medical question. I make no claim of expertise in this topic; rather, I’d like to propose an approach and see what folks think about it (please comment below!).
#1/3: Formulate your own answer
- Don’t simply outsource all the work to the LLMs.
- The goal is to use LLMs in addition to what you would normally do to answer the question.
- Formulating your own answer involves consulting conventional resources (e.g., pharmacopoeias, textbooks, UpToDate).
- If you’re reaching for an LLM, you’re probably not entirely sure of the answer. Even so, it’s important to at least formulate an opinion.
- Formulating your own opinion helps avoid blindly following advice from the LLM.
#2/3: Narrow-spectrum medical literature LLM with dense references (OpenEvidence)
- A narrow-spectrum LLM is based entirely on high-quality medical literature (e.g., PubMed, JAMA, NEJM).
- OpenEvidence currently seems to be the best narrow-spectrum medical LLM.
- The advantage of a narrow-spectrum LLM is that it focuses the search on the highest-quality sources. The disadvantage is that it wears blinders, causing it to miss high-quality information sources outside the medical literature.
- OpenEvidence seems to have an extremely low rate of hallucinations.
#3/3: Broad-spectrum LLM with dense references (e.g., Perplexity)
- A broad-spectrum LLM examines a wider range of sources. Many of these sources are excellent and highly reliable (e.g., hospital protocols, health department websites, FDA drug package inserts). Other sources are of lower quality.
- For more cutting-edge or esoteric questions, the broad-spectrum LLM may be more effective in finding relevant information sources.
- Dense referencing is essential for any LLM, but particularly critical for a broad-spectrum LLM. To be useful, the LLM must provide frequent references to the sources it uses (ideally, line-by-line citations). This allows you to evaluate the information source and go deeper as needed.
- I’ve found Perplexity to be the most useful broad-spectrum LLM, but I’d be interested in others’ experiences (many LLMs fail to provide dense enough referencing to be useful).
- The platform may allow you to choose which underlying AI model it uses. Different models have different hallucination rates. Currently, Claude Sonnet 4 appears to have a relatively low hallucination rate; however, this will shift over time as newer models are released.
- Ensure that you formulate the question in a neutral manner. For example, if you’re determining whether a drug is useful for a certain condition, don’t ask “What are the benefits of drug X?” Instead ask, “Is drug X beneficial?” LLMs are tuned to give users the answers they seem to want, so asking in a leading manner will bias the output (a scripted illustration follows below).
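For those who prefer to query these tools programmatically rather than through the web interface, here’s a minimal sketch of the neutral-phrasing point against Perplexity’s OpenAI-compatible API. This is an illustration under assumptions, not a definitive recipe: the model name (“sonar-pro”) and the citations field are assumptions based on current documentation, and the API key is a placeholder.

```python
# Minimal sketch: neutral vs. leading phrasing of the same clinical question,
# sent through Perplexity's OpenAI-compatible API. Model name and citation
# handling are assumptions; check Perplexity's current API docs.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_PERPLEXITY_API_KEY",  # placeholder
    base_url="https://api.perplexity.ai",
)

# Leading phrasing presupposes benefit; neutral phrasing asks for both sides.
LEADING = "What are the benefits of drug X for condition Y?"
NEUTRAL = (
    "Is drug X beneficial for condition Y? "
    "Summarize the evidence for and against, with citations."
)

response = client.chat.completions.create(
    model="sonar-pro",  # assumed model name; use whichever the platform offers
    messages=[{"role": "user", "content": NEUTRAL}],
)

print(response.choices[0].message.content)

# Perplexity's responses may also carry a list of source URLs; getattr keeps
# this safe if the field is absent in your API version.
print(getattr(response, "citations", None))
```

The point of the sketch is the prompt, not the plumbing: the neutral version explicitly invites evidence against the drug, which counteracts the model’s tendency to agree with the framing of the question.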
Integration
- Compare your answer against the answers from the two LLMs.
- If all three answers are consistent, this will increase your confidence in the accuracy of the answer.
- If OpenEvidence and Perplexity disagree, don’t assume that OpenEvidence is necessarily correct. Sometimes, OpenEvidence is “too” evidence-based (i.e., it will equate an absence of evidence with evidence of absence).
- Any inconsistencies can be addressed by following the references provided by the LLMs back to the primary information sources.
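The comparison step can also be scripted, at least in part. OpenEvidence is used through its own interface, so this sketch simply poses the same neutral question to two different models via Perplexity’s API (the model names are assumptions) and prints the answers side by side for manual review.

```python
# Minimal sketch of the comparison step: ask two models the same neutrally
# phrased question and review the answers side by side. Model names are
# assumptions; OpenEvidence has no public API here, so two broad-spectrum
# models stand in.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_PERPLEXITY_API_KEY",  # placeholder
    base_url="https://api.perplexity.ai",
)

QUESTION = "Is drug X beneficial for condition Y? Cite primary sources."

def ask(model: str) -> str:
    """Send the question to one model and return its answer text."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": QUESTION}],
    )
    return resp.choices[0].message.content

# Agreement between the answers (and your own opinion) raises confidence;
# disagreement is the cue to follow the cited references to primary sources.
for model in ("sonar", "sonar-pro"):  # assumed model names
    print(f"=== {model} ===\n{ask(model)}\n")
```

Note that the script only gathers the answers; judging agreement, and chasing down the references when the answers diverge, remains a human task.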