ChatGPT demonstrates efficiency in clinical trial recruitment

September 29, 2025

3 min read

Key takeaways:

  • GPT-4 was more accurate than GPT-3.5 in identifying patients but had longer load times and higher costs.
  • The researchers suggest AI could help narrow down candidates, but selection should be done manually.

Multiple versions of ChatGPT demonstrated efficacy in identifying eligible patients for clinical trial recruitment, showing potential to complement manual chart reviews, according to a study published in Machine Learning: Health.

Improving clinical trial recruitment has been an ongoing challenge in the medical field, with nearly 20% of National Cancer Institute-affiliated trials failing to recruit enough participants, Michael Dohopolski, MD, assistant professor in the department of radiation oncology at UT Southwestern Medical Center, and colleagues wrote.






As AI tools continue to evolve, Dohopolski and colleagues examined whether ChatGPT-3.5 and ChatGPT-4 could scour electronic health records (EHRs) to find eligible candidates for a phase 2 head and neck cancer trial evaluating hypofractionated radiation therapy.

“One of the things that prompted this research was the difficulties with trial recruitment we had within our university,” Dohopolski told Healio. “It’s not that we lack trials that patients can enroll in, but sometimes we lack the personnel or resources for our research team or clinical staff to review.”

For this recruitment effort, the researchers prompted the large language models (LLMs) to review EHR data from 74 patients (eligible, n = 35; ineligible, n = 39) on 14 clinical trial eligibility criteria using one of three prompting methods:

  • structured output, in which the LLM returns a strict answer with no additional explanation;
  • chain of thought, in which the LLM must also explain its reasoning for selecting a candidate; and
  • self-discover, in which the LLM generates its own reasoning structure for selecting candidates.

The LLM responses were further optimized with guidance written by a physician (expert guidance) or generated by an LLM (AI-generated guidance), as sketched below.
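The study does not publish its prompt templates, so the following Python sketch is only an illustration, with assumed wording, of how the three prompting styles and an optional guidance block might be assembled into a single screening prompt:

```python
# Illustrative only: the study's actual prompt templates and criteria wording
# are not published in this article, so every string below is an assumption.

def build_screening_prompt(chart_excerpt: str, criterion: str,
                           style: str = "structured",
                           guidance: str = "") -> str:
    """Assemble a screening prompt for one eligibility criterion."""
    base = (
        "You are screening a patient for a clinical trial.\n"
        f"Eligibility criterion: {criterion}\n"
        f"Patient chart excerpt:\n{chart_excerpt}\n"
    )
    if style == "structured":
        # Structured output: demand a strict answer with no explanation.
        task = "Answer with exactly one word: MET or NOT_MET."
    elif style == "chain_of_thought":
        # Chain of thought: require the model to explain its reasoning first.
        task = ("Think step by step about the criterion, citing the chart, "
                "then give a final answer of MET or NOT_MET.")
    else:  # "self_discover"
        # Self-discover: let the model design its own reasoning structure.
        task = ("First outline the reasoning structure you will use to judge "
                "this criterion, then apply it and answer MET or NOT_MET.")
    if guidance:
        # Expert (physician-written) or AI-generated guidance appended to the prompt.
        task += f"\nAdditional guidance: {guidance}"
    return base + task

# Example usage with hypothetical chart and criterion text:
print(build_screening_prompt(
    chart_excerpt="T2N0 oropharyngeal SCC, ECOG 1, no prior radiation.",
    criterion="No prior head and neck radiation therapy.",
    style="chain_of_thought",
    guidance="Prior radiation to other body sites does not exclude the patient."))
```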

Researchers evaluated the performance of ChatGPT-3.5 and GPT-4 using the area under the receiver operating characteristic curve (AUC) under two definitions of eligibility: strict eligibility, in which a patient had to meet all required criteria, and proportional eligibility, in which a prediction score was generated from how many eligibility criteria were judged to be met.
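As a minimal sketch of these two scoring schemes, the example below turns hypothetical per-criterion judgments into a strict flag and a proportional score and evaluates each with scikit-learn's roc_auc_score; the data are invented and not from the study.

```python
# Sketch of the two eligibility scores described above, using made-up data.
# Each row is one patient's 14 per-criterion LLM judgments (1 = criterion met).
from sklearn.metrics import roc_auc_score

llm_judgments = [
    [1] * 14,                 # all criteria judged met
    [1] * 12 + [0, 0],        # two criteria judged not met
    [1] * 7 + [0] * 7,        # half of the criteria judged met
]
true_eligibility = [1, 1, 0]  # chart-review ground truth (hypothetical)

# Strict eligibility: a patient is flagged only if every criterion is met.
strict_scores = [int(all(row)) for row in llm_judgments]

# Proportional eligibility: the fraction of criteria judged met becomes a
# continuous prediction score, which is then evaluated with ROC AUC.
proportional_scores = [sum(row) / len(row) for row in llm_judgments]

print("strict AUC:", roc_auc_score(true_eligibility, strict_scores))
print("proportional AUC:", roc_auc_score(true_eligibility, proportional_scores))
```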

Results showed GPT-4 had higher median accuracy than GPT-3.5 (0.838 vs. 0.761).

GPT-4 was also more accurate than GPT-3.5 when testing for strict eligibility (median, 0.61 vs. 0.54) and proportional eligibility (median AUC, 0.74 vs. 0.64), with both versions performing better on the latter.

Although the most effective prompting style differed across performance metrics, the researchers found that structured output with expert guidance from a physician was the most accurate for GPT-3.5 (median, 0.91). Self-discover prompts were the most accurate for GPT-4 (median, 0.89), with chain of thought plus expert guidance performing similarly for this version.

Additionally, the researchers measured how long each screening took and the per-token cost of using each LLM.

“LLMs incur costs based on ‘tokens in, tokens out,’ which is basically the words you type in and the words the AI responds with,” Dohopolski said. “It’s not only about your prompt, which can be large at times — it’s also the chunks of a patient’s health record that gets implemented into the total prompt.”
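To make the "tokens in, tokens out" billing concrete, here is a back-of-the-envelope sketch; the token counts and per-1,000-token prices are placeholders, not the study's or OpenAI's actual figures.

```python
# Back-of-the-envelope cost model for one screening run.
# All numbers are placeholders; real prices vary by model and over time.

def screening_cost(prompt_tokens: int, completion_tokens: int,
                   price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Cost = (tokens in) * input price + (tokens out) * output price."""
    return (prompt_tokens / 1000) * price_in_per_1k + \
           (completion_tokens / 1000) * price_out_per_1k

# Hypothetical example: a long prompt dominated by chart excerpts.
cost = screening_cost(prompt_tokens=6000, completion_tokens=800,
                      price_in_per_1k=0.03, price_out_per_1k=0.06)
print(f"estimated cost per patient screening: ${cost:.2f}")
```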

Despite its better performance, GPT-4 had longer screening times than GPT-3.5 (7.9-12.4 minutes vs. 1.4-3 minutes) and higher costs ($0.15-$0.27 vs. $0.02-$0.03).

“Although there was about a 10-fold difference in cost, we’re still talking about cents,” Dohopolski noted. “That said, the proper implementation now isn’t to throw in every single patient note you have, because that would become very expensive.”

Overall, ChatGPT could effectively screen for eligible trial candidates, but it struggled to identify individuals who fulfilled every criterion. The researchers therefore suggested using LLMs to narrow down potential candidates and then selecting candidates manually.

“AI is at a point where it’s very intelligent in matters, but it doesn’t always understand the medical nuances that the clinical provider had intended when they wrote the protocol,” Dohopolski told Healio.

Dohopolski also mentioned that different LLMs require different methods of prompting when considering costs and screening time.

“There is some cost and accuracy analysis that has to be done, even for current implementations like GPT-5,” he said. “It is also important to make sure any model used is HIPAA compliant.”

Future research should examine applications of LLMs in other specialties, Dohopolski noted.

“One of the critiques about this study is that it’s focused on one trial in oncology,” he added. “But since we’ve done the initial work correctly, we can explore their usage in other trials and other scopes across medicine.”

For more information:

Michael Dohopolski, MD, can be reached at primarycare@healio.com.


