DrLlama: Can Llama read medical reports?

Let’s experiment with how the open-source Llama 2 and llama.cpp perform at interpreting results from lab reports

Will Fuks
20 min read · Apr 10, 2024
Dr Llama: Will it succeed at interpreting medical reports? Image created with DALL·E.

Here’s an interesting challenge that many startups are trying to solve: given a medical report such as this one:

Lab Report created specifically for this post. Image by author.

automatically extract all biomarkers, their result values and the reference ranges that best fit the patient data. Each biomarker must also be linked to a corresponding ID available in an internal knowledge database.

So, how would you solve this one 😅? Remember that each lab may have its own report template which can also change over time!

There are probably many ways to solve it. Let’s discuss the best strategy we have found so far (better solutions may well exist). The approach, as you might expect, relies heavily on LLMs.

For those of you who prefer watching, I tried making a YouTube video of this text; the result is an exquisite disaster, but still:

And, before we begin, an important disclaimer:

Disclaimer

Please keep in mind we’ll be using a mock lab report here. In practice, do consider the risks you are taking before sending real lab reports (yours or your customers’) to any third-party services. This use case is particularly interesting for Llamas as they are open source and can run locally, so there’s no risk of data leakage.

If you decide to trust any company with your personal data — or your customers’ — consider the risks you are taking.

The Challenge

The best solution found so far is to divide this problem into 3 main parts:

1- Use an LLM to summarize all biomarkers found in the lab report.
2- Use the LLM to cast each biomarker to JSON.
3- Map the extracted biomarkers to our internal database IDs — this is the most challenging part!

Extract the summary then cast to JSON and lastly map to internal ID. Image by author.

The summary step reduces the amount of information passed to the next step. We need JSON data to interact with the backend API. Finally, we need to map the biomarkers to our own ID system to extract more information about them, such as whether they fall within the expected ranges and how the patient is evolving over time.

Let’s dissect each step and how it was solved.

Step 1 — Summary

This is the test Complete Blood Count (CBC) in our reference lab report:

The test “CBC” contains a list of biomarkers. Erythrocytes has been highlighted here. Image by author.

We want a procedure that automatically summarizes the biomarkers (“erythrocytes”, “hemoglobin” and so on), their respective result values (such as 4.89 millions/mm3) and the appropriate reference ranges (such as 4.3 to 5.7).

There is another issue we need to consider: some tests may contain biomarkers with conflicting names — for instance, on page 4 of our report the test Midstream Specimen of Urine (MSU) also contains the biomarker “erythrocytes”:

Urine tests may also contain the biomarker erythrocytes which can also be present in CBC tests. Image by author.

The problem is that when mapping erythrocytes from the CBC test and from the MSU test to our internal database, they may end up overwriting each other; the solution is to retrieve not only the biomarkers but also which test they belong to, such as “Erythrocytes — RBC” or “Erythrocytes — MSU”.

This means that the summary of step 1 could be something like this:

Biomarker: Erythrocytes
Test name: CBC
result: 4.89 millions/mm3
reference range: 4.3 to 5.7

Biomarker: Erythrocytes
Test name: MSU
result: lower than 10,000/mL
reference range: lower than 10,000/mL

It’s hard to imagine an approach based on some kind of regex operations being able to extract this information, especially across the different report templates of each laboratory.

This seems a good use case for LLMs, particularly for Llama 2 (as we can keep the patient’s data safe). Let’s test how it performs!

First, we need to extract the text from the PDF report — a common challenge when working with LLMs — and we’ll use tesseract for the OCR task:

Using tesseract and pdf2image to extract the text from our lab report pdf. Image by author.
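
Since that code is shown only as a screenshot, here’s a minimal sketch of the extraction step (assuming pytesseract and pdf2image are installed along with the tesseract and poppler binaries; the file name is made up):

import pytesseract
from pdf2image import convert_from_path

# Render each PDF page as an image, then run OCR on it with tesseract.
pages = convert_from_path('lab_report.pdf', dpi=300)  # hypothetical file name
pdf_pages = [pytesseract.image_to_string(page) for page in pages]

print(pdf_pages[0])  # extracted text of the first page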

And here’s the resulting text:

Result of the pdf text extraction. Image by author.

Notice that some of the text is out of order, which means the LLM must be able to interpret generic text and still make sense of it.

Let’s test Llama 2 70b here. Setting temperature to 0 (zero) and increasing max tokens to 4096, we send the extracted text from the first page:

Experimenting with sending first page extracted text to Llama2. Image by author.

And here’s the beginning of the result:

Results from the LLM answer. Here it scored 100% extraction success! Image by author.

It extracts 100% of the requested information; if we send the very last page, it scores around ~70%. In real applications, the medical reports the system receives are far more challenging and chaotic to process, and the success rate will vary greatly, but so far the results are quite impressive.

Well, we do not want to manually copy and paste the text extracted from each page and send it to Llama — but if we try to run a 70b model locally to automate the process, you’ll see your computer go BOOM fairly quickly. How do we deal with that?

Enter llama.cpp

When we first start working with Llama it may be confusing to understand why there’s Llama 2 and yet everybody is talking about llama.cpp. In a nutshell, running Llama 2 is staggeringly resource-intensive. If you peruse the model uploaded to Hugging Face, you’ll see that it takes ~140GB of VRAM (i.e., GPU memory) to run. Even for some startups, access to this amount of hardware may be prohibitive.

There’s one technique that can greatly help us here: quantization, which in essence casts each weight of the network to an integer instead of a 32-bit floating point number; casting the model to int8 makes it roughly 4x lighter and faster.
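
As a rough back-of-the-envelope calculation (weights only, ignoring activations and the KV cache):

params = 70e9  # Llama 2 70b

print(f'fp32: {params * 4 / 1e9:.0f} GB')  # ~280 GB
print(f'fp16: {params * 2 / 1e9:.0f} GB')  # ~140 GB, the Hugging Face figure above
print(f'int8: {params * 1 / 1e9:.0f} GB')  # ~70 GB, 4x smaller than fp32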

That is, in essence, what llama.cpp is all about: it’s built to process quantized weights, which makes the model at least manageable to run locally. The price we pay is in the quality of the answers, but the interesting aspect of this trade-off is that it’s notably asymmetrical: we lose some quality but gain orders of magnitude in resource savings, to the point where running the model locally becomes feasible.

To proceed in our quest, this phase still requires a GPU (which is available on Colab); we’ll be using an 8-bit quantized version of Llama2–7b to keep a balance between quality and requirements. The library llama-cpp-python will bridge the gap between Python and llama.cpp:

Installing and instantiating llama.cpp in Python. Image by author.

We first install llama-cpp-python, which in turn installs llama.cpp, then we download an 8-bit quantized model from Hugging Face (the gguf format is an evolution of ggml and basically refers to quantized models) and simply instantiate it in Python with all layers on the GPU.
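
As a sketch of that setup (the specific GGUF file below is an assumption; any 8-bit Llama-2-7b build from Hugging Face would do):

# pip install llama-cpp-python
# huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF llama-2-7b-chat.Q8_0.gguf --local-dir .

from llama_cpp import Llama

llm = Llama(
    model_path='llama-2-7b-chat.Q8_0.gguf',  # 8-bit quantized weights
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to the GPU
)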

Let’s send a request with the first page of our medical report to llama.cpp — we keep temperature=0 to avoid hallucinations as much as possible:

Sending request to llama.cpp. Image by author.
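
In code, the request might look roughly like this (the prompt wording is an assumption; the author’s exact prompt is in the screenshot):

prompt = (
    'Below is the text extracted from a medical PDF report. '
    'List every biomarker, its result with unit value, the appropriate '
    'reference range, and which test the biomarker belongs to.\n\n'
    f'{pdf_pages[0]}'
)

out = llm.create_chat_completion(
    messages=[
        {'role': 'system', 'content': 'You are a helpful physician assistant.'},
        {'role': 'user', 'content': prompt},
    ],
    temperature=0,  # as deterministic as possible
)
print(out['choices'][0]['message']['content'])

And here’s the beginning of the answer: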
Based on the provided medical PDF report, here is a list of biomarkers, their results with unit values, the appropriate reference values, and which test the biomarker belongs to:
Biomarker:
Name:

Test: Complete Blood Count (CBC)

Value: 4.89 millions/mm3
Reference Range: 4.30 to 5.70 millions/mm3

Biomarker:
Name: Hemoglobin (Hb)

Test: Complete Blood Count (CBC)

Value: 14.2 g/dL
Reference Range: 13.5 to 17.5 g/dL

Biomarker:
Name: Hematocrit (Hct)

Test: Complete Blood Count (CBC)

Value: 40.6 %
Reference Range: 39.0 to 50.0 %

(...)

Biomarker:
Name: Platelet Count (PLT)

Test: Complete Blood Count (CBC)

Value: 190,123 thousands/mm3
Reference Range: 150,000 to 450,000 thousands/mm3

Note: The above results are for an adult male patient born on March 11, 1867, with a date of birth of 03/11/1867, and a date of examination of 10/23/2000.

It’s actually good — unfortunately, it’s also not good 😣… the success rate has now dropped to 60%.

This is an inevitable price to pay; we are using a less potent model (7b) whose weights are quantized, so there will be a hit on quality. For testing locally and prototyping services, smaller LLMs have their place; it is nonetheless important to realize that these models, as much fun as they are to interact with, probably will not handle production. The more I worked with these systems, the more I concluded that companies that want to use Llama with customers don’t have much choice other than allocating a cluster of GPUs to run the most powerful models they have access to; otherwise the interaction with customers won’t be satisfactory. That’s the cost to be paid for safety.

We can use our model to process all pages automatically now:

Using llama.cpp to process all pages of our medical report. Image by author.
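
A minimal sketch of that loop, reusing the llm instance and the pdf_pages list from the earlier snippets (the prompt wording is again an assumption):

summaries = []
for page in pdf_pages:
    out = llm.create_chat_completion(
        messages=[
            {'role': 'system', 'content': 'You are a helpful physician assistant.'},
            {'role': 'user', 'content': 'List every biomarker, its result, reference range '
                                        f'and test name found in this report page:\n\n{page}'},
        ],
        temperature=0,
    )
    summaries.append(out['choices'][0]['message']['content'])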

It takes around just 1 minute to process the whole report (using a T4 GPU). The model doesn’t perform well on the very last page: the urine test is quite challenging for the LLM to interpret, and results drop to 38% with lots of hallucinations.

An interesting point to consider here: prompt engineering will do very little to improve results. It’s highly recommended to use OpenAI’s evals library to verify this. In our experience, there were almost no gains with most of the prompts and techniques tested. For the evaluation, it’s necessary to have a dataset (which can be small) containing the input (the extracted PDF text), the expected result and the model’s answer. The eval framework then compares the two and computes how close they are; the closer, the better.
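
As a naive illustration of that idea (this is not the evals library itself; the expected text and similarity metric below are just stand-ins):

from difflib import SequenceMatcher

# One sample: extracted pdf text, the hand-written expected summary,
# and what the model actually answered.
eval_set = [
    {
        'input': pdf_pages[0],
        'expected': 'Biomarker: Erythrocytes\nTest name: CBC\nresult: 4.89 millions/mm3\n...',
        'answer': summaries[0],
    },
]

for sample in eval_set:
    score = SequenceMatcher(None, sample['expected'], sample['answer']).ratio()
    print(f'similarity: {score:.2f}')  # the closer to 1.0, the better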

We can still continue our exploration to see where it lands. Next step, the JSON phase:

Step 2 — JSON

The result from step 1 needs to be cast into a template we can control. By doing so we can not only send the information to backend APIs but also extract the specific name of each biomarker so we can map it to our internal ID store.

The 70b model is surprisingly good at casting the input data to JSON; given this prompt:

Prompt example for casting to JSON. Image by author.
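
The prompt itself is only shown as an image; an illustrative prompt in the same spirit (not the author’s exact wording) could be:

json_prompt = f'''
Given the biomarker summary below, return a JSON object with a "biomarkers" key
containing a list of objects with the keys "name", "value", "test name" and
"reference range". Answer with valid JSON only.

{summaries[0]}
'''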

Here’s what we get:

🦙
{
  "biomarkers": [
    {
      "name": "ERYTHROCYTES",
      "value": "4.89 millions/mm3",
      "test name": "Complete Blood Count (CBC)",
      "reference range": "4.30 to 5.70"
    },
    {
      "name": "HEMOGLOBIN",
      "value": "14.2 g/dL",
      "test name": "Complete Blood Count (CBC)",
      "reference range": "13.5 to 17.5"
    },

    (...)

    {
      "name": "MEAN PLATELET VOLUME",
      "value": "9.2 fL",
      "test name": "Platelet Count",
      "reference range": "9.2 to 12.6 fL"
    }
  ]
}

Using llama.cpp, we can force the answer to follow the grammar rules of a JSON string by using the response_format option:

out = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs in JSON.",
        },
        {"role": "user", "content": pdf_page},
    ],
    response_format={
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {
                "biomarkers": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "value": {"type": ["string", "number"]},
                        },
                    },
                }
            },
            "required": ["biomarkers"],
        },
    },
    temperature=0,
)

Example of result:

llama.cpp json response. Image by author.

While it does succeed in returning a JSON response, notice that it hallucinates a lot. I’d say that at this point llama.cpp is not a viable tool for handling this challenge; still, the 70b model performs remarkably well here.

Finally, the last and most challenging step of our goal:

Step 3 — Mapping

The first biomarker in the report is “erythrocytes”. It is not always interpreted by LLMs as such; sometimes the system “dreams” the name “Red Blood Cell Count”, which is actually its equivalent (at times I even saw it returning the names in Spanish… a wild dream, to say the least 😅).

We want to extract all biomarkers from each lab report but also understand what those values mean on a semantic level. The knowledge database may have an entry like this:

{
  "biomarker": "erythrocytes",
  "id": 15,
  "reference ranges healthy cohort": "10 to 15"
}

Regardless of whether the LLM generates erythrocytes or red blood cell count, we want both to be mapped to id: 15 so we know exactly not only its name but also additional information, such as what is considered a good range for its values. This allows us to automatically determine, for each patient, the status of their health and how it’s evolving.
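
In other words, the outcome we want from the mapping step is roughly this (IDs and aliases are made up; a naive exact-match dictionary like the one below only covers known spellings, which is precisely why the approaches that follow are needed):

# Toy illustration of the mapping goal: many surface forms, one internal ID.
ALIASES = {
    'erythrocytes': 15,
    'red blood cell count': 15,
    'rbc': 15,
}

def map_biomarker(name: str) -> int:
    return ALIASES[name.lower().strip()]

assert map_biomarker('Erythrocytes') == 15
assert map_biomarker('Red Blood Cell Count') == 15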

The challenge boils down to this: on one side we have “dreamed” text that is completely open to variation and randomness; on the other side, the list of IDs that identifies those “dreams”:

Generated text from the LLM and its corresponding ID. Image by author.

One possible way to solve this problem is to prompt the LLM with the list of biomarker names we consider correct and ask the system to map the “dreamed” input. This would only work for very small lists; the bigger the list gets, the more the LLM deviates from the correct answer:

Given this list of biomarkers:
CORRECT_BIOMARKERS=
biomarker1
biomarker2
...

and this input:
INPUT_BIOMARKERS=
input biomarker 1
input biomarker 2
...

Return for each value in the input its corresponding correct value.
Answer with newline-separated rows, values separated by semicolons.

Despite being an easy-to-implement approach, it has no chance of working in production. If our list of correct biomarkers contains thousands of values, each query becomes prohibitively expensive. On top of that, the more tokens we send in the prompt, the greater the chances of the dreaded hallucinations; from what we observed in practice, this is a no-go solution.

Another interesting approach is to perform the search through embedding similarity. If we use embeddings from OpenAI this can be done quite straightforwardly (we are not risking any data leakage here, so this is safe). We basically convert the dreamed text and our correct values into embeddings and compute the dot product between them to find the highest values (as the vectors are normalized by default, the dot product equals the cosine similarity):

import os
import concurrent.futures
from functools import partial

import numpy as np
from google.colab import userdata
from openai import OpenAI


# (correct name, abbreviation) pairs used to simulate "dreamed" variations.
l = [('Human immunodeficiency virus', 'HIV'),
     ('Alpha-fetoprotein', 'AFP'),
     ('Serotonin', '5-HT'),
     ('Aspartate aminotransferase', 'AST'),
     ('Thyroid-stimulating hormone', 'TSH'),
     ('Amphetamine', 'The abbreviation used for Amphetamine on medical reports is AMPH.'),
     ('Potassium', 'K'),
     ('Carcinoembryonic antigen', 'CEA')
]

# Our "correct" biomarker names versus what the LLM might generate.
bms = '\n'.join([e[0] for e in l] +
                ['Red Blood Count - Erythrocytes', 'Midstream Specimen of Urine'])
dreamed = '\n'.join([f'{e[0]} {"(" + e[1] + ")" if e[1] else ""}' for e in l] +
                    ['RBC', 'Erythrocytes - MSU'])

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
client = OpenAI()

exec_ = concurrent.futures.ThreadPoolExecutor(32)
partial_create = partial(
    client.embeddings.create,
    model="text-embedding-ada-002"
)

def get_emb(idx, input):
    emb = partial_create(input=input)
    return idx, emb.data[0].embedding

def get_embeddings(names: list[str]):
    futures = []
    resp = []
    for idx, name in enumerate(names):
        futures.append(exec_.submit(get_emb, idx, name))
    for f in futures:
        i, emb = f.result()
        resp.append((i, emb))
    return resp

def cast_resp_to_array(resp: list[tuple[int, list[float]]]):
    emb_array = np.empty((len(resp), len(resp[0][1])))
    for idx, emb in resp:
        emb_array[idx, :] = np.array(emb)
    return emb_array

bms_list = bms.splitlines()
dreamed_list = dreamed.splitlines()

bms_embeds = get_embeddings(bms_list)
dreamed_embeds = get_embeddings(dreamed_list)

bms_array = cast_resp_to_array(bms_embeds)
dreamed_array = cast_resp_to_array(dreamed_embeds)

# Ada embeddings are unit-normalized, so the dot product is the cosine similarity.
cosine_m = bms_array.dot(dreamed_array.T)
sorted_cos = np.argsort(-cosine_m, axis=0)
print(
    [(dreamed_list[idx], bms_list[v]) for idx, v in
     enumerate(sorted_cos[0, :])]
)

Here’s the result. First column is the generated text from the LLM and second column is the mapped value:

[('Human immunodeficiency virus (HIV)', 'Human immunodeficiency virus'),
 ('Alpha-fetoprotein (AFP)', 'Alpha-fetoprotein'),
 ('Serotonin (5-HT)', 'Serotonin'),
 ('Aspartate aminotransferase (AST)', 'Aspartate aminotransferase'),
 ('Thyroid-stimulating hormone (TSH)', 'Thyroid-stimulating hormone'),
 ('Amphetamine (The abbreviation used for Amphetamine on medical reports is AMPH.)',
  'Amphetamine'),
 ('Potassium (K)', 'Potassium'),
 ('Carcinoembryonic antigen (CEA)', 'Carcinoembryonic antigen'),
 ('RBC', 'Red Blood Count - Erythrocytes'),
 ('Erythrocytes - MSU', 'Red Blood Count - Erythrocytes')]

It works quite well actually, but there are some problems. Notice the last two entries, RBC and Erythrocytes — MSU: both get mapped to Red Blood Count — Erythrocytes!!

Now this is the main issue with the embeddings search: it still lacks the “fine adjustment” that would allow it to precisely differentiate between names that are close enough. As it is, we’d lose all information from the urine tests and potentially from the red blood count as well (as the value from the urine test may override the value from the red blood count).

Embeddings search is still not precise enough to solve our challenge. Image by author.

We need a solution that resembles the embeddings approach but still offers a refined delineation between biomarker names, much like a Support Vector Machine (SVM) fully delineates the plane between the data points, maximizing their respective distances and hence the expected accuracy.

As it turns out, there’s one cool approach that might solve this problem well: we could use the embeddings strategy but train the embeddings on our own data!

The point being, we don’t need embeddings trained on the entire internet — our use case is fairly narrow. We could, therefore, build a training dataset from several lab reports, let the LLM dream at will on all this data and then map all that generated text to the expected correct result. Training a mapper model would be fairly straightforward then!

This sounds promising, but there’s one problem: building this dataset would be, to put it mildly, strenuous work. For each lab report in our dataset, we’d send the text to the LLM so it can extract the summaries, from those summaries we’d extract the generated (or hallucinated) biomarkers, and for each we’d annotate its correct ID value so the mapper model can learn the mapping.

Doing this work is inevitable, and startups in this field should allocate resources to professionals capable of doing this mapping manually (this is definitely not easy, and medical knowledge is also required).

But there’s one possible way out: we can generate this dataset using LLMs as well! The strategy would be, given the correct list of biomarker names, to ask the LLM to retrieve each biomarker’s abbreviations, which tests it may belong to, synonyms, descriptions and so on, and build a training/validation dataset to train a customized embedding model. The idea is to let the LLM dream and hallucinate at will with our biomarkers and keep annotating the data so we can train on it later.

For this part let’s use GPT, as no GPU is required and costs are still low (the entire process cost less than $1). Let’s begin by creating a list of biomarkers so that those of you who want to follow along can do so:

import re

results = []

for _ in range(3):
    completion = client.chat.completions.create(
        model='gpt-3.5-turbo-16k',
        messages=[
            {'role': 'system', 'content': 'You are a helpful physician assistant.'},
            {'role': 'user', 'content': '''
Write the name of 200 biomarkers available on medical reports.
Give each answer as the full name of the biomarker without any numbering, one per line.
'''
            }
        ]
    )
    results.append(completion.choices[0].message.content)

# Strip any numbering / parenthesized abbreviations and deduplicate.
biomarkers = list(set([
    re.sub(r'^\d+\.', '', bm).split('(')[0]
    for bm in '\n'.join(results).splitlines() if bm
]))
Now we have a list of biomarkers to work with. Image by author.

We can generate synthetic data on top of those biomarkers; first, we could let the LLM dream/hallucinate possible abbreviations for each:

def process_abbr(bm):
    completion = client.chat.completions.create(
        model='gpt-3.5-turbo-16k',
        temperature=1,
        messages=[
            {'role': 'system', 'content': 'You are a physician assistant.'},
            {'role': 'user', 'content': '''
What is the abbreviation used on medical reports for {bm}.
Answer just the abbreviation, nothing else. For instance, for Mucin 1 the answer would be MUC1
'''.format(bm=bm)
            }
        ]
    )
    return bm, completion.choices[0].message.content.strip()

Notice we use temperature=1 as we want the LLM to be as creative (or even crazy) as possible. Also, we can only send one biomarker at a time; otherwise we risk the LLM hallucinating wrong entries in our dataset. This requires concurrent processing with many workers to compensate:

import concurrent.futures

abbrs = []
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    abbrs.extend(list(executor.map(process_abbr, biomarkers[:])))
List of tuples where first value is the biomarker name and second is its abbreviation. Image by author.

In the notebook that accompanies this post there are steps to retrieve everything else: synonyms, definitions, the test names they belong to and so on. We then build training and validation datasets where the first column is the biomarker representation — as dreamed by the LLM — and the second column is the ID we want, separated by the characters @@ (a separator not expected to appear in the text).

Synthetic training dataset. First column is biomarker as returned by the LLM. Second is the ID. Image by author.
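
Writing such a file could look like this (the IDs, rows and file name are made up for illustration):

# "Dreamed" biomarker text on the left, the internal ID it should map to on the right.
rows = [
    ('RBC', 15),
    ('Red Blood Cell Count', 15),
    ('Erythrocytes - MSU', 42),
]

with open('train.txt', 'w') as f:
    for text, internal_id in rows:
        f.write(f'{text}@@{internal_id}\n')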

The synthetic dataset tries to simulate the distribution of what the LLM may generate so the trained model still has a chance of correctly mapping it.

We’ll need PyTorch to proceed.

Step 3 — Mapping — PyTorch

Two models were tested for this step. The first is a simple EmbeddingBag model. Its implementation in PyTorch is fairly simple:

import torch
from torch import nn


class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding_bag = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)
        self.fc = nn.Linear(embed_dim, num_class)

    def forward(self, text_idx, offsets):
        embedded = self.embedding_bag(text_idx, offsets)  # (B, embed_dim)
        out = self.fc(embedded)
        return out

The idea is basically to apply an embedding layer and then take the average of the results:

Input text is converted into bag of embeddings and then each group is averaged. Image by author.

As in a usual embedding model, each token gets its own embedding. The embeddings of each phrase are averaged into a final representation — hence the name Bag — which in turn is fed to a dense layer that connects to the IDs that should turn on. The cool thing about this model is that, due to the averaging, it doesn’t require a fixed context length like the one used in LLMs.
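
The predict helper further below relies on a text_pipeline / label_pipeline pair that isn’t shown in the post; a minimal sketch of that plumbing, assuming torchtext and the hypothetical train.txt file from earlier (the exact details live in the notebook), could be:

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer('basic_english')

def yield_tokens(lines):
    for line in lines:
        text, _ = line.strip().split('@@')
        yield tokenizer(text)

lines = open('train.txt').readlines()  # hypothetical training file
vocab = build_vocab_from_iterator(yield_tokens(lines), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])

def text_pipeline(text):
    return vocab(tokenizer(text))  # token ids

def label_pipeline(label):
    return int(label)  # internal biomarker ID as the class index

NUM_IDS = 500  # number of internal biomarker IDs (hypothetical)
emb_bag_model = TextClassificationModel(len(vocab), embed_dim=64, num_class=NUM_IDS)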

Despite its simplicity, its performance is surprisingly good, and it may actually be all that is needed to solve this step — as long as the training data is good enough, this model will already handle most of the potential hallucinations:

def predict(text):
    with torch.no_grad():
        emb_bag_model.eval()
        X = torch.tensor(text_pipeline(text))
        output = emb_bag_model(X, torch.tensor([0]))
        emb_bag_model.train()
        return output.argmax(1).item()
Mapping results. The model is resilient against potential hallucinations. Image by author.

Implementation details are available in the notebook.

Finally, the other model experimented with was a transformer architecture, but with the final layer representing which class to turn on instead of which token to sample. Doing so in PyTorch is straightforward (no pun intended) and is equivalent to the EmbeddingBag model with some small tweaks. The dataset is now built with the biomarker name first, followed by the abbreviations, synonyms, definitions and so on:

New training dataset. First comes the biomarker name and then the rest. Second column is its respective ID that the model must learn to map. Image by author.

The dataset implementation is the same as for the EmbeddingBag model, but the dataloader is different:

import random
import re
import torch
from torch.utils.data import Dataset, DataLoader
from torch import nn
from torch.nn import functional as F
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator


class CustomBMDataset(Dataset):
    def __init__(self, file_path, transform=None, target_transform=None):
        self.data = open(file_path).readlines()
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        bm, label = self.data[idx].strip().split('@@')
        if self.transform:
            bm = self.transform(bm)
        if self.target_transform:
            label = self.target_transform(label)
        return bm, label


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
ATT_BATCH_SIZE = 16
CONTEXT_LENGTH = 8

def att_collate_batch(batch):
    # batch = [(text, label), (text, label), ...]
    texts = torch.empty((0, CONTEXT_LENGTH), dtype=torch.int)
    labels = torch.empty((0,), dtype=torch.int)
    for text, label in batch:
        label = torch.tensor(label_pipeline(label), dtype=torch.int)
        token_ids = [e for e in text_pipeline(text) if e != 0]
        # The biomarker name sits among the first tokens; keep track of it.
        bm_idx = random.randint(0, N_PRIOR_TOKENS)
        bm = token_ids.pop(bm_idx) if len(token_ids) > N_PRIOR_TOKENS else ''

        if len(token_ids) <= CONTEXT_LENGTH:
            zrs = torch.zeros(CONTEXT_LENGTH - len(token_ids))
            text_tensor = torch.cat((torch.tensor(token_ids, dtype=torch.int), zrs), dim=-1)  # padding
        else:
            # Keep the biomarker token and sample the remaining tokens.
            token_ids = [e for e in list(set(token_ids)) if e != bm]
            zrs = [0 for _ in range(CONTEXT_LENGTH - len(token_ids))]
            token_ids += zrs
            random.shuffle(token_ids)
            text_tensor = torch.cat((torch.tensor([bm]), torch.tensor(token_ids[:CONTEXT_LENGTH - 1])), dim=-1)
        texts = torch.cat((texts, text_tensor.view(1, -1)), dim=0)
        labels = torch.cat((labels, label.view(-1)))
    return labels.to(device=device, dtype=torch.long), texts.to(device=device, dtype=torch.int)

att_train_dl = DataLoader(
    att_train_ds, batch_size=ATT_BATCH_SIZE, shuffle=True, collate_fn=att_collate_batch
)

att_val_dl = DataLoader(
    att_val_ds, batch_size=ATT_BATCH_SIZE, shuffle=True, collate_fn=att_collate_batch
)

As transformers work with a known input size CONTEXT_LENGTH, the strategy here branches into two cases: if the input text is shorter, we just tokenize it and pad what is missing with zeros; when it's longer, we keep the biomarker name (such as token 15 representing the biomarker erythrocytes) and then sample CONTEXT_LENGTH - 1 tokens from what is left:

If text is ≥ CONTEXT_LENGTH then we keep the biomarker name and sample the rest. Image by author.

And the model is implemented as:

EMB_DIM = 128
NCLASSES = len(bms_map)
PDROP = 0.1
ATT_BATCH_SIZE = 16
NHEADS = 4
HEAD_SIZE = EMB_DIM // NHEADS
VOCAB_SIZE = len(vocab)
ATT_LAYERS = 3


class Block(nn.Module):
    def __init__(self):
        super(Block, self).__init__()
        self.attention = nn.MultiheadAttention(EMB_DIM, NHEADS, batch_first=True)
        self.ffwd = nn.Sequential(
            nn.Linear(EMB_DIM, 2 * EMB_DIM),  # avoid bigger projections due to overfit
            nn.LeakyReLU(),
            nn.Linear(2 * EMB_DIM, EMB_DIM),
        )
        self.ln1 = nn.LayerNorm(EMB_DIM)
        self.ln2 = nn.LayerNorm(EMB_DIM)

    def forward(self, x):
        xnorm = self.ln1(x)
        sa, _ = self.attention(xnorm, xnorm, xnorm, need_weights=False)
        x = x + sa
        return x + self.ffwd(self.ln2(x))


class Transformer(nn.Module):
    def __init__(self):
        super(Transformer, self).__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        self.blocks = nn.Sequential(*[Block() for _ in range(ATT_LAYERS)])
        self.ln = nn.LayerNorm(EMB_DIM)
        self.lm_head = nn.Linear(EMB_DIM, NCLASSES)

    def forward(self, seq):
        x = self.tok_emb(seq)  # (B, T, E)
        x = self.blocks(x)     # (B, T, E)
        x = self.ln(x)         # (B, T, E)
        x = self.lm_head(x)    # (B, T, C)
        x = x.mean(dim=-2)     # (B, C) - average the per-token predictions
        x = torch.squeeze(x)   # (B, C)
        return x

It’s just a regular transformer implementation, but the last step takes an average of the per-token predictions to reduce the dimension to 1. Notice also that no positional embedding is added, as it is, arguably, not relevant here: since we’ll be predicting on dreamed text, words may appear at any position without changing the final interpretation.

The training/validation sequence is a standard approach:

import time

model = Transformer().to(device)

EPOCHS = 30
LR = 5e-2
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
eval_accs = []

def train(dataloader, model, optimizer, criterion):
    model.train()
    total_acc, total_count = 0, 0
    frames_to_log = 500
    start_time = time.time()
    counter = 0

    for y, X in dataloader:
        optimizer.zero_grad()
        y_hat = model(X)
        loss = criterion(y_hat, y)
        loss.backward()
        optimizer.step()
        total_acc += (y_hat.argmax(1) == y).sum().item()
        total_count += y.numel()  # batch size
        counter += 1
        if counter % frames_to_log == 0 and counter:
            elapsed = time.time() - start_time
            print(
                "| epoch {:3d} | {:5d}/{:5d} batches "
                "| accuracy {:8.3f}".format(
                    epoch, counter, len(dataloader), total_acc / total_count
                )
            )
            #total_acc, total_count = 0, 0
            start_time = time.time()


def evaluate(dataloader, model, criterion):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for y, X in dataloader:
            y_hat = model(X)
            loss = criterion(y_hat, y)
            total_acc += (y_hat.argmax(1) == y).sum().item()
            total_count += y.numel()
    print(f'Validation acc: {total_acc / total_count}')
    return total_acc / total_count


for epoch in range(1, EPOCHS + 1):
    print(f'Running epoch: {epoch}')
    epoch_start_time = time.time()
    train(att_train_dl, model, optimizer, criterion)
    accu_val = evaluate(att_val_dl, model, criterion)
    eval_accs.append(accu_val)

    print("-" * 80)
    print(
        "| end of epoch {:3d} | time: {:5.2f}s | "
        "valid accuracy {:8.3f} ".format(
            epoch, time.time() - epoch_start_time, accu_val
        )
    )
    print("-" * 80)

As the training data is small, we use a small network as well, which also makes the whole fitting run fast. Here are the results:

def att_predict(text, model):
    with torch.no_grad():
        model.eval()
        _, X = att_collate_batch([(text, 0)])
        y_hat = model(X)
        model.train()
        return y_hat.argmax().item()
The Attention framework is also capable of handling potential hallucinations and still find the correct answer. Image by author.

Interestingly enough, transformer nets can interpret medical report text and solve the mapping step as well.

With that, we conclude all three steps necessary for automatically interpreting medical reports.

The Verdict

Can Llamas interpret medical reports? I think the answer to the question is

Yes…but No….but maybe Yes!

Working on a project like this one was quite fun; seeing LLMs interpret the reports as they do is fascinating, to say the least. But some issues still remain — the mocked lab report used here is relatively simple, and despite Llama 2 being able to fully interpret the first 3 pages, performance on the last page drops to 60%. Other lab reports are even more cluttered and harder to extract text from, so accuracy is expected to go even lower. Casting to JSON (or YAML if it helps) is another challenge, which is not immune to the (in)famous hallucinations. On top of that, the mapping phase cannot currently be solved by LLMs, which is not surprising as it is more of a sequence-to-class problem than a seq2seq one.

Also, as we saw in this post, in production we don’t have much option other than running the 70b model, which is expensive — at least 150GB of VRAM is required — and no quantized models (llama.cpp) will help us here.

So the conclusion would be a probable No. But this is not the end; using an LLM with knowledge of the entire internet to process medical reports is a bit of an overkill — knowing everything about history, cars, airplanes, planets and so on is not necessary for interpreting medical reports. A possible avenue to explore, then, is to train a Llama 2 model from scratch by first collecting a humongous amount of medical reports in PDF and creating a huge training dataset to feed the model. A much smaller network would probably suffice given the smaller context to learn from — which also hints at the 150GB of VRAM no longer being necessary. As much as this idea is likely to work, creating such a dataset would be another challenge in itself and would require the cooperation of a team of physicians. As we did in this post, using GPT models to synthesize some of the data might help; still, this would reduce the work necessary, not eliminate it.

Another option is to fine-tune Llama 2. This would require less training data and is probably more feasible — still, it’s hard to guess whether its performance would be enough.

So this is the final verdict: Yes! But No :(…but maybe Yes :)!

Regardless of the conclusion, I recognize that working with those systems was quite fun — what the future holds for us is intriguing.

I hope you enjoyed this journey as much as I did. As always, I hope to see you on the next mission ;).
