5 Must-Know Mental Health Datasets for AI Research | August 2023

Aseem Srivastava
3 min read · Jun 6, 2023
(Photo by Ben White on Unsplash)

I am a research scholar, and my research has revolved around AI for Mental Health Counseling for the last three years. I have published some papers at notable AI/NLP conferences. During my research engagement, I encountered the profound challenge of finding suitable datasets to advance my work. However, within this scarcity lies an opportunity to inspire change and collaborate toward constructing a brighter future for mental health support. In this blog, we embark on a quest to shed light on available datasets, both public and obtainable through request, paving the way for pioneering advancements in AI-powered mental health care.

HOPE Dataset

HOPE contains 202 dyadic counseling conversation transcripts. Each utterance is tagged with a dialogue act. Beyond dialogue-act classification, this dataset can support many other NLP tasks in mental health care. (read more)
[ Access is available on request here ]

MEMO Dataset

MEMO contains counseling session transcripts and their counseling notes (summaries). This dataset is best suited for dialogue summarization tasks in the counseling therapy space. (read more)
[ Access is available on request here ]

DAIC-WOZ Dataset

The DAIC-WOZ dataset comprises voice and text samples from 189 interviewed participants, both depressed and control, together with their responses to the PHQ-8 depression screening questionnaire. It is commonly used in research on text-based depression detection and in multimodal architectures. (read more)
[ Access is available on request here ]
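Since PHQ-8 scores are what usually get turned into training labels, here is a minimal sketch of how that is commonly done. It relies only on the standard definition of the questionnaire: eight items, each scored 0 to 3, summed to a total between 0 and 24, with a total of 10 or more widely used as the screening cutoff. The function names are my own, and the sketch assumes you have already extracted the eight item scores for a participant; it does not reflect how the DAIC-WOZ files are laid out.

```python
from typing import Sequence

PHQ8_NUM_ITEMS = 8   # the PHQ-8 has eight questions
PHQ8_CUTOFF = 10     # commonly used screening threshold (total >= 10)


def phq8_total(item_scores: Sequence[int]) -> int:
    """Sum the eight PHQ-8 item scores (each 0-3) into a 0-24 total."""
    if len(item_scores) != PHQ8_NUM_ITEMS:
        raise ValueError(f"expected {PHQ8_NUM_ITEMS} item scores, got {len(item_scores)}")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("each PHQ-8 item score must be 0, 1, 2, or 3")
    return sum(item_scores)


def phq8_binary_label(item_scores: Sequence[int], cutoff: int = PHQ8_CUTOFF) -> int:
    """Return 1 if the total meets the cutoff, else 0."""
    return int(phq8_total(item_scores) >= cutoff)


if __name__ == "__main__":
    scores = [2, 2, 1, 1, 1, 1, 1, 1]  # made-up scores, for illustration only
    print(phq8_total(scores), phq8_binary_label(scores))
```

Whether you train on the binary label or regress on the total is a modeling choice; the cutoff is exposed as a parameter so either setup is easy to try.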

ALONE Dataset

ALONE (AdoLescents ON twittEr) is a multimodal dataset of toxic social media interactions between confirmed high school students, along with a descriptive explanation. Each instance of interaction includes tweets, images, emoji, and related metadata.
[ Email the authors to gain access for research purposes ]

Counsel-Chat Dataset

The Counsel-Chat dataset is drawn from Counselchat.com, an expert community and platform that helps counselors build their reputation and make meaningful contact with potential clients. On the site, therapists respond to questions posed by clients, and users can like the responses they find most helpful.
[ Access is publicly available here ]

PAIR Dataset

PAIR is a dataset of brief interactions between counselors and clients that portray different levels of reflective listening skill. Each interaction is in English and includes a client prompt, i.e., a client's statement of the kind usually given to a counseling trainee, paired with counseling responses that show low-, medium-, and high-quality reflections.
[ Access is publicly available here ]
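To make the "one prompt, three response qualities" structure concrete, here is a rough sketch of how such records could be flattened into (prompt, response, label) rows for a reflection-quality classifier. Everything in it is an assumption for illustration: the `ReflectionExample` class, its field names, and the placeholder strings are hypothetical and do not mirror the actual PAIR file format.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple

# Hypothetical label order: index in this tuple becomes the class id.
LEVELS: Tuple[str, ...] = ("low", "medium", "high")


@dataclass
class ReflectionExample:
    """Hypothetical container: one client prompt and one response per quality level."""
    prompt: str
    responses: Dict[str, str]  # maps "low" / "medium" / "high" to a response


def flatten(examples: List[ReflectionExample]) -> Iterator[Tuple[str, str, int]]:
    """Yield one (prompt, response, label_id) row per available quality level."""
    for example in examples:
        for label_id, level in enumerate(LEVELS):
            response = example.responses.get(level)
            if response is not None:  # skip levels missing from a record
                yield example.prompt, response, label_id


if __name__ == "__main__":
    # Placeholder text only; not taken from the dataset.
    demo = [
        ReflectionExample(
            prompt="<client prompt>",
            responses={
                "low": "<low-quality reflection>",
                "medium": "<medium-quality reflection>",
                "high": "<high-quality reflection>",
            },
        )
    ]
    for row in flatten(demo):
        print(row)
```

Keeping the prompt in every row lets a model score the response in context rather than in isolation.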

There is a whole other category of datasets in this space: those scraped from Reddit. Many such datasets are easily accessible on the Internet, but only a handful of them contain rich annotations. I am adding those annotated datasets here.

Reddit Based Datasets

  • Suicide Severity: Dataset for labeling suicidality in posts with longitudinal information, using the C-SSRS questionnaire.
    [ Public access — here ]
  • Primate2022: Dataset for labeling depression-related posts using the PHQ-9 questionnaire.
    [ Public access — here ]
