GenAI chatbots can treat clinical level mental health symptoms

For some, the title of this blog might look like ‘click-bait’, to be dismissed as a further example of the exaggeration that can surround discussions of Generative Artificial Intelligence (GenAI). For others, the statement may seem axiomatic and obvious, given that research has already suggested that chatbots are a feasible, engaging, and effective way to deliver Cognitive Behavioural Therapy (CBT; e.g., Fitzpatrick et al., 2017).

Yet the title of this blog is neither hyperbole nor self-evident. Although chatbots have previously been shown to have benefits, these tended to be rule-based agents, “limited by their reliance on explicitly programmed decision trees and restricted inputs” (Heinz et al., 2025, p.2). It is therefore of interest that a recent paper by Heinz and colleagues (2025) reported on a randomised controlled trial (RCT) demonstrating the effectiveness of a fully GenAI chatbot for treating clinical level mental health symptoms.

Within this blog, we look at the details of this study and ask where it leaves us going forward.

Is GenAI finally on the verge of transforming the way we deliver mental health care?

Methods

The authors conducted a national RCT of adults with clinically significant symptoms of major depressive disorder (MDD) or generalised anxiety disorder (GAD), or at clinically high risk for feeding and eating disorders (CHR-FED). The 210 eligible participants were stratified into one of these three groups and randomly assigned to a 4-week chatbot intervention (n = 106) or a waitlist control (n = 104).
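For readers less familiar with the design, stratified randomisation simply means randomising separately within each symptom group, so that the two arms stay balanced across MDD, GAD and CHR-FED. A minimal sketch of the idea follows; it is illustrative only, as the trial's actual allocation procedure (block sizes, software) is not described here:

```python
import random

def stratified_assign(participants, seed=2024):
    """Shuffle within each stratum, then split roughly half-and-half
    between the intervention and waitlist arms."""
    rng = random.Random(seed)
    strata = {}
    for pid, stratum in participants:             # e.g. ("P001", "MDD")
        strata.setdefault(stratum, []).append(pid)
    arms = {}
    for pids in strata.values():
        rng.shuffle(pids)
        half = len(pids) // 2
        for pid in pids[:half]:
            arms[pid] = "Therabot"
        for pid in pids[half:]:
            arms[pid] = "waitlist"
    return arms

# Hypothetical participants: two per stratum
sample = [("P001", "MDD"), ("P002", "MDD"), ("P003", "GAD"),
          ("P004", "GAD"), ("P005", "CHR-FED"), ("P006", "CHR-FED")]
print(stratified_assign(sample))
```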

Participants in the intervention group were prompted daily to interact with a chatbot (‘Therabot’) during the 4-week treatment phase. During weeks 4-8, between the post-intervention and follow-up assessments, participants were no longer prompted but were still permitted to use Therabot.

The chatbot was developed with over 100,000 human hours and utilises a generative large language model (LLM) “fine-tuned on expert-curated mental health dialogues” (p.3). Based on third-wave CBT, Therabot allowed users either to initiate a session directly in the chat interface or to reply to notifications. A prompt, the conversation history, and the most recent user message were then combined and sent to the LLM. All responses from Therabot were supervised by trained personnel post-transmission; in the event of an inappropriate response from Therabot, the participant was contacted with a correction.
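To make that pipeline concrete, here is a minimal sketch of the request flow the paper describes: assemble the prompt, the conversation history, and the latest user message; send them to the model; and route every exchange to human reviewers after transmission. All names here (`call_llm`, `flag_for_human_review`, the message format) are illustrative assumptions, not Therabot's actual code:

```python
from dataclasses import dataclass, field

def call_llm(messages: list) -> str:
    """Hypothetical stand-in for the fine-tuned LLM call."""
    return "(model response)"

def flag_for_human_review(messages: list, reply: str) -> None:
    """Hypothetical stand-in for post-transmission human supervision."""
    pass

@dataclass
class Session:
    prompt: str                                   # expert-curated instructions
    history: list = field(default_factory=list)   # prior (role, text) turns

def respond(session: Session, user_message: str) -> str:
    # Combine the prompt, the conversation history, and the newest message
    messages = [{"role": "system", "content": session.prompt}]
    messages += [{"role": r, "content": t} for r, t in session.history]
    messages.append({"role": "user", "content": user_message})
    reply = call_llm(messages)
    # Every response is logged for review by trained personnel after sending
    flag_for_human_review(messages, reply)
    session.history += [("user", user_message), ("assistant", reply)]
    return reply

print(respond(Session(prompt="You are a CBT-based therapeutic chatbot."),
              "I slept badly last night."))
```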

Primary outcomes were symptom change from baseline to post-intervention (4 weeks) and follow-up (8 weeks). Measures included the Patient Health Questionnaire (PHQ-9), the Generalised Anxiety Disorder Questionnaire (GAD-Q-IV), and the Weight Concerns Scale (WCS) within the Stanford-Washington University Eating Disorder Screen (SWED). Secondary outcomes included measures of therapeutic alliance, and of satisfaction and engagement with Therabot.

Results

Participant characteristics

Of the 210 participants recruited to the study, 125 (59.5%) identified as female and 166 (79.0%) as heterosexual. Around half of the sample (53.3%) were non-Hispanic White, and approximately 60% had a Bachelor’s degree or above. At baseline, 68% (n = 142) had clinically significant symptoms of MDD, 55% (n = 116) of GAD, and 42% (n = 89) were at clinically high risk for FED; these categories were not mutually exclusive. Withdrawal and attrition were minimal across the 8-week period (n = 7).

Main findings

Therabot users showed significantly greater reductions in depression symptoms than controls. The mean change in PHQ-9 score from baseline to post-intervention was -6.13 (SD = 6.12) in the intervention group and -2.63 (SD = 6.03) in the control group; change from baseline to follow-up was -7.93 (SD = 5.97) and -4.22 (SD = 5.94) respectively. As the authors note, a decrease of 5 or more points has been shown to constitute clinically meaningful change, a threshold the intervention group’s mean improvement exceeded at both time points while the control group’s did not.
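To give a sense of how raw change scores like these translate into the standardised effect sizes (Cohen’s d) discussed below, here is a rough back-of-envelope calculation using the 4-week PHQ-9 figures. It assumes a simple pooled-SD formulation; the paper’s own estimates come from mixed-effects models and will therefore differ:

```latex
d \;=\; \frac{\bar{\Delta}_{\text{Therabot}} - \bar{\Delta}_{\text{control}}}
             {\sqrt{\tfrac{1}{2}\left(s_1^2 + s_2^2\right)}}
  \;=\; \frac{-6.13 - (-2.63)}{\sqrt{\tfrac{1}{2}\left(6.12^2 + 6.03^2\right)}}
  \;\approx\; \frac{-3.50}{6.08}
  \;\approx\; -0.58
```

A negative d here simply reflects that scores fell further in the intervention group; it is the magnitude that carries the ‘small/medium/large’ interpretation.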

Similar patterns were observed for anxiety symptoms. The GAD-Q-IV has no established threshold for clinically meaningful change, so the Cohen’s d effect sizes are most instructive here. Both groups improved from baseline to follow-up, but the improvement was significantly larger in the intervention group (d = 0.84, 95% CI [0.38 to 1.30], p = .001 at 4 weeks; d = 0.79, 95% CI [0.32 to 1.26], p = .003 at 8 weeks). If we take the rule of thumb that a Cohen’s d of 0.8 or greater signifies a substantial difference, then these would be considered ‘large’ effects.

The WCS ranges from 0 to 100 and also has no established threshold for meaningful change. The effect sizes do suggest that the intervention group showed greater improvement in weight concerns than the control group (d = 0.82, 95% CI [0.26 to 1.37], p = .008 at 4 weeks; d = 0.63, 95% CI [0.07 to 1.18], p = .027 at 8 weeks).

With respect to secondary outcomes, the mean number of messages sent by participants was 260 (range 1 to 1,557) and the mean number of days spent interacting was 24 (range 1 to 60). For the authors, these figures suggest that, over the space of 4 weeks, participants were able to develop a working alliance comparable to that seen in an outpatient psychotherapy sample.

Therabot users showed greater reductions in depression, generalised anxiety and feeding and eating disorder symptoms at both post-intervention and follow-up in comparison to the waitlist control.

Conclusions

The key take-home message from this paper is that a GenAI chatbot can reduce clinical symptoms across several different mental health conditions. The authors suggest that Therabot’s success may be driven by three main factors:

  1. Therabot is evidence-informed, rooted in evidence-based psychotherapies and built on what we know already works.
  2. Users had unrestricted access, meaning that they could engage at any time and place. The ability to access therapeutic support wherever and whenever most needed may be a key advantage of digital therapeutics.
  3. Unlike existing chatbots for mental health treatment, Therabot was powered by GenAI, “allowing for natural, highly personalised, open-ended dialogue” (Heinz et al. 2025, p.10).

Therabot’s success may be driven by a range of different factors, including the fact that it is rooted in evidence-based psychotherapies.

Strengths and limitations

A key strength of this study is the robustness of its design. The authors conducted a national RCT, and the statistical approach looks appropriate (e.g., a Monte Carlo simulation study was used to estimate statistical power). Although such simulations are only ever as good as the assumptions underpinning them, they work well with complex designs. Missing data were also minimal throughout, including for the user satisfaction survey. The authors also recognised that waitlist control trials carry a risk of differential contact between the intervention and control groups, and attempted to mitigate this by planning equivalent contact where possible.
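For readers unfamiliar with simulation-based power analysis, the recipe is straightforward: simulate the trial many times under an assumed effect, and count how often the analysis reaches significance. The following is a toy sketch; the effect size, SD, and simple two-sample test are assumptions for illustration, whereas the authors’ simulation targeted their more complex mixed-model design:

```python
import random
import statistics

def simulate_power(n_per_arm=105, effect=3.5, sd=6.1,
                   n_sims=2000, seed=1):
    """Estimate power: simulate the trial repeatedly and count how often
    a two-sample test on the change scores reaches significance."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, sd) for _ in range(n_per_arm)]
        treated = [rng.gauss(-effect, sd) for _ in range(n_per_arm)]
        diff = statistics.mean(treated) - statistics.mean(control)
        se = (statistics.variance(treated) / n_per_arm
              + statistics.variance(control) / n_per_arm) ** 0.5
        # Normal approximation to the critical value at alpha = .05
        if abs(diff / se) > 1.96:
            hits += 1
    return hits / n_sims

print(f"Estimated power: {simulate_power():.2f}")
```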

The authors also seem to have paid attention to some of the more general methodological challenges involved in running a study of mobile/digital therapeutics. For example, Therabot ran on both Android and iOS devices. Although the research remains somewhat equivocal, studies have suggested that, in comparison to Android users, iPhone users are more likely to be younger, female, and higher in emotionality (Shaw et al., 2016). Restricting the sample to either Android or iOS could therefore have skewed the sample. The authors also “assumed participant identity to be truthful unless we detected irregularities in the data” and took steps such as preventing duplicate sign-ups and using two-factor authentication, seemingly recognising the challenges of online recruitment and the growing problem of ‘imposter participants’ (Sharma et al., 2024).

There are, however, limitations. The authors do note the short follow-up period and that longer studies are needed to assess the durability of Therabot’s effectiveness. They also recognise the potential self-selection and possible bias toward younger, technologically-minded participants who were open to AI.

Less is said by the authors about the fact that the study was not blinded and that other interventions were being delivered at the same time: of those currently receiving treatment (around 27%), 17 people were receiving both medication and psychotherapy. Further, the authors move rather rapidly over the possible self-selection and bias noted above. There is little overt recognition of the role that socio-economic status (SES) might be playing here. The baseline characteristics show that 42% of the overall sample had a Bachelor’s degree and around 17% had a Master’s degree or higher. Research continues to link academic achievement and SES, and it is therefore possible that the education profile of the sample means it was also skewed towards those with higher SES. Further reflection by the authors on the possible implications of this would have been welcome.

Heinz et al. (2025) note the potential self-selection and possible bias toward younger, technologically-minded participants who were open to AI in this study, which could impact the generalisability of the results.

Implications for practice

So where does this leave us going forward? As I write this, BBC News is running a story titled “NHS plans ‘unthinkable’ cuts to balance books”, with one “boss of a mental health trust” telling the BBC that waits for psychological therapies now exceed a year. It is here that we often situate our discussions of what GenAI may, or may not, be able to do. On the one hand, GenAI may provide solutions for a mental health infrastructure which is “inadequately resourced to meet the current and growing demand for care” (Heinz et al., 2025, p.2). On the other, there are concerns around privacy, data protection, biased datasets, widening inequalities, and generic models being inappropriately deployed. Professor Miranda Wolpert neatly summarises these debates in a recent Wellcome blog (Wolpert, 2025).

We see this now-familiar tension play out within the paper itself. The authors suggest that the study shows fine-tuned GenAI chatbots offer a feasible approach to delivering personalised mental health treatment at scale, but then add the caveat that further research with larger samples is needed to confirm effectiveness and generalisability. Elsewhere, they emphasise the need to understand GenAI’s potential role and risks in mental health treatment, and the need for guardrails and close human supervision whilst testing. Indeed, within their own study, post-transmission staff intervention was required 15 times for safety concerns and 13 times to correct inappropriate responses from Therabot.

At one level, then, the implications remain on familiar ground: GenAI holds ‘potential for change’, but safeguards are necessary when testing similar future models to ensure safety. The need for larger samples also means that chatbots like Therabot remain a long way from implementation.

The authors also note that the inner processes of GenAI models are difficult or impossible to understand analytically. This introduces a further implication for practice, in that it invites us to think about if and how we can ever move to implementation. Can the methods we currently use to conduct and evaluate research ever be made compatible with something considered “difficult or impossible to understand analytically”? And what might need to change here?

In light of concerns related to privacy, biased datasets, and widening inequalities, should we be using GenAI in mental health treatments?

Statement of interests

Robert Meadows has recently completed a British Academy funded project titled: “Chatbots and the shaping of mental health recovery”. This work was carried out in collaboration with Professor Christine Hine.

Links

Primary paper

Heinz, M. V., Mackin, D. M., Trudeau, B. M., Bhattacharya, S., Wang, Y., Banta, H. A., … & Jacobson, N. C. (2025). Randomized trial of a generative AI chatbot for mental health treatment. NEJM AI, 2(4), AIoa2400802.

Other references

Fitzpatrick, K. K., Darcy, A., & Vierhile, M. (2017). Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent (Woebot): A randomized controlled trial. JMIR Mental Health, 4(2), e7785.

Sharma, P., McPhail, S. M., Kularatna, S., Senanayake, S., & Abell, B. (2024). Navigating the challenges of imposter participants in online qualitative research: Lessons learned from a paediatric health services study. BMC Health Services Research, 24(1), 724.

Shaw, H., Ellis, D. A., Kendrick, L. R., Ziegler, F., & Wiseman, R. (2016). Predicting smartphone operating system from personality and individual differences. Cyberpsychology, Behavior, and Social Networking, 19(12), 727-732.

Wolpert, M. (2025). AI and mental health: “it could help revolutionise treatments”. Wellcome.
