Detecting Dementia Early Using Large Language Models


As part of a project at the University of Stavanger, my fellow student Kinnan Al Amir and I developed multiple AI models to detect dementia from speech transcriptions. I was tasked with creating two deep learning models: a fine-tuned RoBERTa model and an instruction-tuned LLaMA model.

Why this matters

Watching a loved one seemingly lose all memory of you is a painful experience that the friends and family of over 50 million people have to live with [7]. Dementia affects not only memory but also thinking, and it keeps patients from living a happy life. But dementia is not a disease in itself.

Dementia is a syndrome caused by a variety of diseases, with 60-70% of cases attributed to Alzheimer's disease [7]. This makes Alzheimer's disease the most common cause of dementia. One of the signs of Alzheimer's is language impairment, which is noticeable even in the early stages of the disease [4]. Patients have difficulty finding the right words and are often frustrated with themselves, which can lead to anxiety and depression. But these difficulties in expression also give hope for early diagnosis through language analysis.

To support the development of tools that can diagnose Alzheimer's disease early, Saturnino Luz et al. [6] created the Alzheimer’s Dementia Recognition through Spontaneous Speech (ADReSS) challenge. Part of the challenge is to predict whether a patient has Alzheimer's disease based on a speech sample. The challenge provides a dataset with transcriptions of speech samples from patients with Alzheimer's disease and healthy controls.

The Cookie Theft picture

Figure 1: The Cookie Theft picture

The speech samples were taken from patients describing the Cookie Theft picture shown in Figure 1. It is part of the Boston Diagnostic Aphasia Exam [3] and is used to assess the language capabilities of a patient. We used this dataset to train multiple machine learning and deep learning models. This article will cover how I fine-tuned the large language model RoBERTa to classify the test samples with an accuracy of 87.5%.

Fine-tuning RoBERTa

By now you have read about RoBERTa a million times. I will therefore spare you the details of how it was built on Google's BERT and how it used more data to produce better results. I have always found it very useful to read through detailed notebooks that combine code and explanation, so for the rest of this article I will do just that.

Fine-tuning an LLM consists of three steps:

  1. Loading the Model.
  2. Loading the Data.
  3. Training the Model.

Loading the Model

We get the pre-trained model from Hugging Face. Here we can find the roberta-base model next to countless already fine-tuned versions of it.

from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2)

Besides the model, we also need its tokenizer. In an LLM, text is represented as tokens, and the way these tokens are generated differs between models. That's why it is important to use the model-specific tokenizer. It will be used in the next step to turn our transcriptions into their tokenized form.

Loading the data

We got the model. We got the tokenizer. Now the only thing missing before we can train the model is the data.

The data provided by the ADReSS challenge came in three files, two of which had to be merged. Why? Because the dementia cases for the training were separate from the control group data. So the first steps in preparing the data were loading, merging, and shuffling.

import pandas as pd

control_df = pd.read_csv("/content/Control_db.csv")
dementia_df = pd.read_csv("/content/Dementia_db.csv")
test_df = pd.read_csv("/content/Testing_db.csv")

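# Combine the control and dementia training data, then shuffle the rows.
# The seed (42) is an arbitrary choice so that the shuffle is reproducible.
df = pd.concat([control_df, dementia_df], ignore_index=True)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)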

X_train = df['Transcript']  # Features
y_train = df['Category']       # Target variable

X_test = test_df['Transcript']
y_test = test_df['Category']

Next, we use our tokenizer to translate our data into encodings. Note that 512 is not the number of characters in each string but the maximum number of tokens per transcription; anything longer is truncated. If you do not have a beefy graphics card, you can set the max length to something lower to save memory. Going higher is not an option, because 512 tokens is the longest input roberta-base accepts.

train_encodings = tokenizer(X_train.tolist(), truncation=True, padding=True, max_length=512)
test_encodings = tokenizer(X_test.tolist(), truncation=True, padding=True, max_length=512)
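
To make the roles of truncation, padding, and max_length more concrete, here is a toy sketch in plain Python. It is not the tokenizer's actual implementation: the helper `pad_and_truncate` and the token IDs are made up for illustration, the pad ID of 1 is an assumption about roberta-base, and a real tokenizer also takes care to keep the special end-of-sequence token when it truncates.

```python
def pad_and_truncate(batch, max_length, pad_id=1):
    """Toy version of truncation=True, padding=True (hypothetical helper)."""
    # Truncation: cut every sequence down to at most max_length tokens
    truncated = [ids[:max_length] for ids in batch]
    # Padding: extend every sequence to the longest one left in the batch
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    # Attention mask: 1 for real tokens, 0 for padding
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token IDs for three "transcriptions" of different lengths
batch = [[0, 100, 2], [0, 100, 200, 300, 400, 2], [0, 2]]
encodings = pad_and_truncate(batch, max_length=4)
print(encodings["input_ids"])
print(encodings["attention_mask"])
```

The attention mask is what later tells the model which positions are padding and should be ignored.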

The last step in data preparation is to write a Dataset interface class for it. We will later use a Trainer class to train the model, and this class expects the datasets to be of the type torch.utils.data.Dataset.

The custom Dataset class, DementiaDataset, returns one tokenized sample together with its label at a time. This lets the data loader used during training take care of batching and shuffling efficiently.

import torch

class DementiaDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = DementiaDataset(train_encodings, y_train.tolist())
test_dataset = DementiaDataset(test_encodings, y_test.tolist())


With all the ingredients prepared, we are ready to cook our model. This is also a good time to define custom metrics functions to evaluate the model on more than just accuracy. Accuracy alone can be misleading, especially in datasets that are imbalanced or have more complex success criteria. Including precision, recall, and F1 score provides a more holistic view of the model's performance.

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    # eval_pred holds the raw model outputs (logits) and the true labels
    logits, labels = eval_pred
    # Pick the class with the highest logit for each sample
    predictions = np.argmax(logits, axis=-1)
    acc = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='binary')
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
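
To illustrate why accuracy alone can mislead, here is a small sketch in plain Python. The helper `binary_metrics` and the imbalanced labels are hypothetical and only serve the example; the ADReSS data itself is not this skewed.

```python
def binary_metrics(labels, predictions):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    accuracy = sum(1 for y, p in pairs if y == p) / len(pairs)
    # Fall back to 0.0 when a metric is undefined (no positive predictions or labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical imbalanced data: 9 healthy controls (0) and 1 dementia case (1)
labels = [0] * 9 + [1]
always_healthy = [0] * 10  # A "model" that never predicts dementia
metrics = binary_metrics(labels, always_healthy)
print(metrics)
```

This useless classifier reaches 90% accuracy while its recall and F1 are zero, which is exactly the failure that the extra metrics expose.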

Finally, we are ready for training.

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='/content/result',    # Directory where the model predictions and checkpoints will be written.
    num_train_epochs=9,              # Total number of training epochs to perform.
    per_device_train_batch_size=8,   # Batch size per device during training.
    per_device_eval_batch_size=8,    # Batch size for evaluation.
    warmup_steps=500,                # Number of warmup steps for learning rate scheduler.
    weight_decay=0.01,               # Strength of weight decay.
    logging_dir='/content/logs',     # Directory for storing logs.
    logging_steps=1,                 # Log every X updates steps.
    evaluation_strategy="steps",     # Evaluation is done (and logged) every X steps. Newer transformers releases call this eval_strategy.
    eval_steps=2000,                 # Number of steps between evaluations.
    save_strategy="steps",           # The checkpoint save strategy to adopt during training.
    save_steps=2000                  # Save checkpoint every X steps.
)

trainer = Trainer(
    model=model,                     # The RoBERTa model loaded above.
    args=training_args,              # The training arguments defined above.
    train_dataset=train_dataset,     # Data used for training.
    eval_dataset=test_dataset,       # Data used for evaluation.
    compute_metrics=compute_metrics  # You can define a function to compute metrics here.
)

trainer.train()


TrainingArguments Explained

  • output_dir: Specifies the directory where the model's predictions and checkpoints will be stored. This is essential for both persisting the trained model for future use and for recovery purposes in case the training is interrupted.
  • num_train_epochs: Defines the total number of times the training dataset will be passed through the model. More epochs can lead to better learning but might cause overfitting if not coupled with appropriate regularization techniques.
  • per_device_train_batch_size and per_device_eval_batch_size: Set the number of examples to process at a time for training and evaluation, respectively. This is critical for optimizing memory usage and computational efficiency on the hardware being used.
  • warmup_steps: The number of initial steps during which the learning rate is linearly increased to its maximum value. Warmup helps mitigate the risk of model divergence early in training by gradually ramping up the learning rate.
  • weight_decay: Applies weight decay, a regularizer closely related to L2 regularization that pulls the weights toward zero with a strength set by this factor. It helps prevent the model from fitting the noise in the training data too closely.
  • logging_dir and logging_steps: Control the logging of training progress. Logs are written to the specified directory at intervals defined by logging_steps. This is useful for monitoring the training process and for debugging.
  • evaluation_strategy and eval_steps: Define how frequently the model is evaluated on the evaluation dataset. Regular evaluation helps track the model's performance on unseen data and can guide decisions regarding early stopping to avoid overfitting.
  • save_strategy and save_steps: Dictate how often the model's state is saved to a checkpoint, which is crucial for long training processes, as it allows for training to be resumed from the last checkpoint in case of an interruption.
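
The step-based arguments only make sense relative to how many optimizer steps the run actually has. The sketch below is a rough back-of-the-envelope calculation, assuming a single device and no gradient accumulation, using the 108 training samples of this dataset and the batch size and epoch count from above.

```python
import math

train_samples = 108  # Size of the training set used in this article
batch_size = 8       # per_device_train_batch_size
epochs = 9           # num_train_epochs

# Assumption: one device and no gradient accumulation,
# so every batch corresponds to one optimizer step
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

print(f"steps per epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
print(f"warmup (500 steps) finishes during training: {500 <= total_steps}")
print(f"evaluation every 2000 steps triggers during training: {2000 <= total_steps}")
```

If these assumptions hold for your setup, the run is shorter than both warmup_steps and eval_steps, so the learning rate never leaves its warmup phase and no intermediate evaluation or checkpoint is produced. On a dataset this small, much lower values may be worth trying.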

Saving the model

For future use, you can save your model and the tokenizer.
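
A minimal sketch of how saving could look is shown here. The helper `save_model_and_tokenizer` is hypothetical; it only relies on the `save_pretrained` method that Hugging Face models and tokenizers provide, and the paths in the usage comment are chosen to match the loading code further down.

```python
def save_model_and_tokenizer(model, tokenizer, model_dir, tokenizer_dir):
    """Hypothetical helper: write the model and the tokenizer to disk."""
    # save_pretrained writes the weights and config (model)
    # or the vocabulary files (tokenizer) into the given directory
    model.save_pretrained(model_dir)
    tokenizer.save_pretrained(tokenizer_dir)
    return model_dir, tokenizer_dir

# Usage with the objects from this article:
# save_model_and_tokenizer(model, tokenizer, '/content/model', '/content/tokenizer')
```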


Loading the model is equally simple.

from transformers import RobertaForSequenceClassification, RobertaTokenizer

model = RobertaForSequenceClassification.from_pretrained('/content/model')
tokenizer = RobertaTokenizer.from_pretrained('/content/tokenizer')


After waiting for almost two hours, we can evaluate our model by calling trainer.evaluate().


# Output:
{'eval_loss': 0.48935237526893616,
 'eval_accuracy': 0.875,
 'eval_f1': 0.8695652173913043,
 'eval_precision': 0.9090909090909091,
 'eval_recall': 0.8333333333333334,
 'eval_runtime': 67.8835,
 'eval_samples_per_second': 0.707,
 'eval_steps_per_second': 0.088,
 'epoch': 9.0}
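
The reported metrics are enough to work backwards to the underlying counts. As a sanity check, the sketch below searches for every combination of true positives, false positives, false negatives and true negatives over 48 test samples that reproduces the accuracy, precision and recall above. The function name is hypothetical, and "positive" here means whichever class is encoded as 1 in the Category column.

```python
def consistent_confusion_matrices(n, accuracy, precision, recall, tol=1e-6):
    """Find all (tp, fp, fn, tn) over n samples that match the given metrics."""
    matches = []
    # tp starts at 1 so that precision and recall are always defined
    for tp in range(1, n + 1):
        for fp in range(n - tp + 1):
            for fn in range(n - tp - fp + 1):
                tn = n - tp - fp - fn
                if (abs((tp + tn) / n - accuracy) < tol
                        and abs(tp / (tp + fp) - precision) < tol
                        and abs(tp / (tp + fn) - recall) < tol):
                    matches.append((tp, fp, fn, tn))
    return matches

matches = consistent_confusion_matrices(
    n=48,
    accuracy=0.875,
    precision=0.9090909090909091,
    recall=0.8333333333333334,
)
print(matches)  # (tp, fp, fn, tn) tuples
```

The only combination that fits is 20 true positives, 2 false positives, 4 false negatives and 22 true negatives, which means 42 of the 48 test samples were classified correctly.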

Our training data gave us the following accuracy matrix:

Figure 2: Accuracy matrix.

Closing thoughts

In this project, our objective was to develop a predictive model using a limited dataset of only 108 training points and 48 testing points. We found that RoBERTa's broad understanding of language significantly aided performance despite the small dataset size. However, caution must be exercised when interpreting the test results; evaluating accuracy with merely 48 samples does not provide a robust statistical basis. Ideally, cross-validation should be used to confirm the model's effectiveness in a real-world scenario, but fine-tuning a large language model (LLM) like RoBERTa across multiple data partitions is time-intensive. Given our constraints, conducting this process across 50 different partitions was impractical.
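
To put a number on how little 48 samples tell us, one option is a confidence interval for the accuracy. The sketch below uses the Wilson score interval, which is my choice for illustration and not something we computed in the project; the 42 correct predictions follow from the reported accuracy of 0.875 on 48 samples.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# 42 correct out of 48 corresponds to the reported accuracy of 0.875
low, high = wilson_interval(42, 48)
print(f"95% interval for accuracy: {low:.3f} to {high:.3f}")
```

The interval spans roughly 0.75 to 0.94, so the true accuracy could plausibly be anywhere in that range.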

I believe that there is immense potential in the field of dementia detection using LLMs. I can imagine a future where you simply download an app that asks you a series of questions and, based on your answers, recommends whether you should take action. For this future to become reality, we need more data.