This post was originally published on the Assero Blog on 23/06/2017.

Over the last few days I have been conducting a few experiments to look at a number of challenges, while also having a little bit of fun playing with some nice new shiny toys and technology.

The big goals were to see: a) how easy it would be to integrate Biomedical Concepts (BCs) with FHIR (Fast Healthcare Interoperability Resources) resources; and b) whether an SDTM domain could be generated from data held in graph form, with that data being extracted from an EHR via FHIR. For those unfamiliar with FHIR, it is HL7’s new technology and more can be found on the HL7 website.

Why do this? Well the FHIR link with Biomedical Concepts could help with integrating research with healthcare and make the integration of data far easier. As for SDTM, well if we can automate our work so much the better but I also see this as the first step on a journey towards better handling and storage of clinical trial data.

So those were the serious issues being addressed but the fun techie aspects were:

  • Gain experience using a graph database called Neo4j and using it from Ruby on Rails;
  • See how well it would integrate with a semantic database and be able to share definitions; and
  • Start getting some experience with HL7 FHIR.

I decided to do this by building a demonstration application that would:

  1. Use the existing Glandon MDR as a source of definitions. The definitions are all held in a semantic form within a triple store and include terminology, BCs, forms and domains. All can be accessed via an API.
  2. Take a form from Glandon constructed using BCs.
  3. Use FHIR to access an EHR to obtain a list of patients.
  4. Select a patient from the EHR and see if there are data (observations) that could be used to populate the form selected.
  5. Store the data thus captured in a graph form with Neo4j.
  6. Allow further subjects to be added to the data pool.
  7. Generate an SDTM domain from the data thus collected.

General Arrangement

The arrangement of the various applications

Before I started this work, I was reasonably familiar with Neo4j in that I had used it for a couple of small projects. The same cannot be said of FHIR; I knew how to type it into Google but that was as far as it went. I already have the Glandon MDR, which contains a fair amount of defined content as noted above. The overall architecture is shown here. Note that you can click on the images to get a bigger version.

The demo application is divided into four main tabs (screens) and one associated screen for setting up some configuration data – more on the configuration data later. Three of the four tabs are there for the various stages of the operation of selecting a form, acquiring the data and generating a domain. The fourth is used to display the graph and is described at the bottom of this post. The demo application is a web application which, on entering the main page, initialises itself by reading the set of available forms from the Glandon MDR and the list of patients from the EHR.

General Layout

General layout of the application showing the forms available and the four tabs

The EHR used is a sandbox facility provided by the Healthcare Services Platform Consortium using their developer sandbox functionality. This gave me an EHR loaded with 60ish patients with dummy data so I didn’t need to worry about confidentiality. In theory the application should work with any EHR that meets the FHIR specification bar the need for authentication, usernames, passwords and such like.

Form Display

A form displayed in the demonstration application

The first tab allows you to select a form and the application grabs it from the MDR along with annotations and displays it. I decided to stick with a single form at the current time just to constrain the problem space but it is the classic case of do it for 1, put it into a loop to get N. The screen shot here shows a simple form being displayed.

You can then proceed to the second tab that lists the patients. Each patient has a show button and an add button. The show button just queries the EHR for all of the patient's observations and displays the response in JSON format. I did this just to see if everything was running OK and to try out a FHIR query. The image here shows the list of patients and a typical response.

Patient List

A list of patients obtained from the EHR

Now, as you will notice there are patient names there. I put the last name out in the table just to make it interesting. Obviously this is all dummy data and I have paid scant regard to patient confidentiality but that was not really what this work was about.
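The patient list and the show button's query can be sketched as plain FHIR search URLs. This is a minimal Python illustration rather than the demo's own code (which was built with Ruby on Rails): the base URL and the function names are hypothetical, and a real sandbox would also need whatever authentication it requires.

```python
from urllib.parse import urlencode

# Hypothetical base URL; a real EHR sandbox issues its own endpoint.
FHIR_BASE = "https://example.org/fhir"


def patient_list_url(base):
    """URL for a FHIR search that returns the Patient resources."""
    return f"{base}/Patient"


def observation_search_url(base, patient_id, loinc_code=None):
    """URL for a FHIR search over one patient's observations.

    When loinc_code is given, the search is narrowed with a token
    parameter of the form system|code.
    """
    params = {"patient": patient_id}
    if loinc_code is not None:
        params["code"] = f"http://loinc.org|{loinc_code}"
    return f"{base}/Observation?{urlencode(params)}"


print(patient_list_url(FHIR_BASE))
print(observation_search_url(FHIR_BASE, "123"))
```

The response to either request is a JSON Bundle, which is what the show button displays.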

The add button is a little more specific. It looks at the form selected and examines each BC contained within the form, extracts the test code and converts it to a LOINC code. It is this conversion that drives the need for some configuration data; I need a map between CDISC terminology and LOINC because the FHIR observations use LOINC coding.

JSON Extract

An extract from a FHIR response in JSON format

This, to me, is one of the big impediments to integrating our worlds. The sample extract from one response shown to the left indicates the test codes are LOINC while the units for the actual values are coded using UCUM. So I need a simple mapping function. I took a very simple approach, just enough to get it to work, mapping a CDISC C Code to either a UCUM or a LOINC code and I store this within the graph database.

Terminology mapping

Very simple mechanism to map between terminologies
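A mapping of this kind can be sketched as a small lookup table. In the demo the map is held in the graph database; the Python dictionary below is only a stand-in, the function names are hypothetical, and the specific codes are illustrative and should be checked against the published CDISC, LOINC and UCUM releases.

```python
# Illustrative map from CDISC C-codes to an external terminology.
# Codes are examples only; verify them before any real use.
TERM_MAP = {
    "C25208": {"system": "LOINC", "code": "29463-7"},  # Weight
    "C25347": {"system": "LOINC", "code": "8302-2"},   # Height
    "C28252": {"system": "UCUM", "code": "kg"},        # Kilogram
    "C49668": {"system": "UCUM", "code": "cm"},        # Centimetre
}


def map_term(c_code, system):
    """Return the external code for a CDISC C-code, or None if unmapped."""
    entry = TERM_MAP.get(c_code)
    if entry is None or entry["system"] != system:
        return None
    return entry["code"]


def reverse_map(code, system):
    """Return the CDISC C-code for an external code, or None if unmapped."""
    for c_code, entry in TERM_MAP.items():
        if entry["system"] == system and entry["code"] == code:
            return c_code
    return None


print(map_term("C25208", "LOINC"))
print(reverse_map("kg", "UCUM"))
```

The forward direction is used when querying the EHR by test code; the reverse direction is used when a UCUM unit in a response has to be tied back to a CDISC term.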


So having mapped my CDISC Test Code to a LOINC code I can query the EHR. I ask the EHR for all observations matching the LOINC code and use the first result if any results are returned, again using FHIR to perform the query. Not a very sophisticated approach taking the first observation but I’m not really interested in the not insignificant problem of which observation do I want for the purposes of research, I am just interested in having one FHIR resource that, in theory, relates to my BC.
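The "take the first result" rule is simple enough to sketch in a few lines of Python. The sample Bundle below is hand-written for illustration and is not data from the sandbox; the function name is hypothetical.

```python
def first_observation(bundle):
    """Return the first Observation in a FHIR search Bundle, or None."""
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") == "Observation":
            return resource
    return None


# Hand-written sample of a search response, trimmed to the essentials.
sample_bundle = {
    "resourceType": "Bundle",
    "type": "searchset",
    "entry": [
        {"resource": {"resourceType": "Observation", "id": "obs-1"}},
        {"resource": {"resourceType": "Observation", "id": "obs-2"}},
    ],
}

print(first_observation(sample_bundle)["id"])
```

A Bundle with no `entry` element simply yields `None`, which corresponds to the "if any results are returned" check.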

Then I have the interesting issue of extracting the data from the FHIR response and populating the database. The approach I have taken is to copy the BC for the patient (now I call them a subject) and attach the actual data to the correct place within the BC (result, result units, date and time of observation etc). To do this I need to get the data from the FHIR resource so I hold some mappings between BC patterns and FHIR resource patterns. This worked quite nicely and is one aspect I was keen to see. We need this to be easy. Preferably we want the artefacts, BCs and FHIR resources, to be the same.
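One way to picture the mapping between BC patterns and FHIR resource patterns is as a table of paths into the Observation JSON. The sketch below is in Python and makes several assumptions: the BC slot names are invented for illustration, and the paths cover only the quantity-valued case (`valueQuantity` and `effectiveDateTime`).

```python
# Hypothetical map from BC slot names to paths within a FHIR Observation.
BC_TO_FHIR_PATHS = {
    "result_value": ["valueQuantity", "value"],
    "result_units": ["valueQuantity", "code"],
    "date_time": ["effectiveDateTime"],
}


def dig(resource, path):
    """Follow a list of keys into nested dicts; None if any key is missing."""
    current = resource
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def extract_bc_data(observation):
    """Pull the values for each BC slot out of one Observation."""
    return {slot: dig(observation, path) for slot, path in BC_TO_FHIR_PATHS.items()}


# Hand-written sample Observation, trimmed to the fields used above.
sample_observation = {
    "resourceType": "Observation",
    "effectiveDateTime": "2017-06-01T09:30:00Z",
    "valueQuantity": {"value": 72.5, "unit": "kg",
                      "system": "http://unitsofmeasure.org", "code": "kg"},
}

print(extract_bc_data(sample_observation))
```

Keeping the paths in data rather than in code is what makes it plausible to hold them alongside the BC definitions.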

So for my patient I create a subject, attach the one or more BCs to the subject (cloned from the form definition) and attach the data. One thing I should say is that I also create a study node with which I associate the subjects.
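The shape of the resulting graph can be sketched with a tiny in-memory structure. This is only a Python stand-in for Neo4j: the `Graph` class, the node labels and the relationship names are all hypothetical and are not taken from the demo.

```python
import copy
import itertools


class Graph:
    """Minimal in-memory stand-in for a graph database."""

    def __init__(self):
        self.nodes = {}  # node id -> properties
        self.edges = []  # (from id, relationship, to id)
        self._ids = itertools.count(1)

    def add_node(self, label, **props):
        node_id = next(self._ids)
        self.nodes[node_id] = {"label": label, **props}
        return node_id

    def link(self, src, rel, dst):
        self.edges.append((src, rel, dst))


def add_subject(graph, study_id, patient_id, bc_templates, data):
    """Create a subject, clone each BC for it and attach any data found."""
    subject_id = graph.add_node("Subject", patient=patient_id)
    graph.link(study_id, "HAS_SUBJECT", subject_id)
    for template in bc_templates:
        bc = copy.deepcopy(template)  # clone so the form definition is untouched
        bc_id = graph.add_node("BC", name=bc["name"])
        graph.link(subject_id, "HAS_BC", bc_id)
        for leaf in bc["leaves"]:
            leaf_id = graph.add_node("Leaf", name=leaf)
            graph.link(bc_id, "HAS_LEAF", leaf_id)
            value = data.get(bc["name"], {}).get(leaf)
            if value is not None:
                point_id = graph.add_node("DataPoint", value=value)
                graph.link(leaf_id, "HAS_VALUE", point_id)
    return subject_id


graph = Graph()
study = graph.add_node("Study", name="Demo")
weight_bc = {"name": "Weight", "leaves": ["result_value", "result_units"]}
add_subject(graph, study, "patient-1", [weight_bc],
            {"Weight": {"result_value": 72.5, "result_units": "kg"}})
print(len(graph.nodes), len(graph.edges))
```

Calling `add_subject` again with another patient adds a second subject under the same study node, which is the repeat step described next.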

I can then repeat this for further subjects adding further data to the database.

After all of this I end up with a graph containing the subject data, with the data all being linked to BCs and the metadata definitions. So can I generate an SDTM domain?

SDTM Generation

Generation of an initial SDTM domain

The fourth tab allows for this. The tab contains a list of domains from the MDR once again (these are sponsor domains based upon SDTM IG specifications). The sponsor domains have BCs associated with them, thus allowing the variables within a domain to be automatically associated with the data within the BCs. The add button performs this linking operation, reading the definitions from the MDR and then automatically linking the variables with all of the applicable data points. Once this has completed the tabulation can be generated via a single query and a small amount of presentation code.
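The tabulation step can be pictured as follows. In the demo this is a single graph query plus presentation code; the Python below is only an in-memory stand-in, and it assumes a Vital Signs style domain, an invented link table between variables and BC leaves, and an invented shape for the subject data.

```python
# Hypothetical link table: SDTM variable -> BC leaf holding its value.
VARIABLE_TO_LEAF = {
    "VSORRES": "result_value",
    "VSORRESU": "result_units",
    "VSDTC": "date_time",
}


def build_vs_rows(study_id, subjects):
    """Produce one row per subject per BC, filling linked variables."""
    rows = []
    for subject in subjects:
        for seq, bc in enumerate(subject["bcs"], start=1):
            row = {
                "STUDYID": study_id,
                "DOMAIN": "VS",
                "USUBJID": subject["id"],
                "VSSEQ": seq,
                "VSTESTCD": bc["test_code"],
            }
            for variable, leaf in VARIABLE_TO_LEAF.items():
                value = bc["data"].get(leaf)
                # Missing data points are left blank rather than guessed.
                row[variable] = "" if value is None else str(value)
            rows.append(row)
    return rows


subjects = [
    {"id": "SUBJ-001", "bcs": [
        {"test_code": "WEIGHT",
         "data": {"result_value": 72.5, "result_units": "kg", "date_time": "2017-06-01"}},
        {"test_code": "HEIGHT",
         "data": {"result_value": 180, "result_units": "cm"}},
    ]},
]

for row in build_vs_rows("DEMO", subjects):
    print(row)
```

Because the rows are driven entirely by the link table, adding a variable is a metadata change rather than a code change, which is the point of holding the links in the MDR.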

Now the population is sparse at the moment. I have made no attempt to do some of the clever stuff; those are the next steps. Variables such as --CAT, --SCAT, --LOC and --POS would come from the BC definitions and captured data if they were present. --STRESC, --STRESN and --STRESU are relatively simple derivations and, given the fact that we have UCUM units involved, I am sure there is a way in which that standard can be leveraged to allow this to be nicely automated. Then we hit the nastier timing variables but I feel a lot of these should be driven by protocol and study design elements, thus driving the need for better study specifications and machine readable artefacts there. That will be a fun step. And, of course, non-standard variables … fun, fun and more fun but a future step; yes we can all dream up reasons why not to do this.
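As a rough idea of how the standardised result variables might be derived from UCUM-coded values, here is a minimal Python sketch. It is not something the demo does: the choice of standard unit per test is an assumption, the table covers only a handful of units, and a real derivation would map the resulting unit back to a CDISC term.

```python
# Assumed standard unit and conversion factor per (test, UCUM unit) pair.
TO_STANDARD = {
    ("WEIGHT", "kg"): ("kg", 1.0),
    ("WEIGHT", "g"): ("kg", 0.001),
    ("WEIGHT", "[lb_av]"): ("kg", 0.45359237),
    ("HEIGHT", "cm"): ("cm", 1.0),
    ("HEIGHT", "m"): ("cm", 100.0),
    ("HEIGHT", "[in_i]"): ("cm", 2.54),
}


def standardise(test_code, value, ucum_unit):
    """Derive standardised result values, or None if no conversion is known."""
    key = (test_code, ucum_unit)
    if key not in TO_STANDARD:
        return None
    unit, factor = TO_STANDARD[key]
    stresn = float(value) * factor
    return {"STRESC": f"{stresn:g}", "STRESN": stresn, "STRESU": unit}


print(standardise("HEIGHT", 2, "m"))
```

Returning `None` for an unknown unit keeps the gap visible instead of silently passing the original value through.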

So this is but a first step and early days but I am rather heartened by what I have seen. It took me three to four days to do this from start to finish. One huge advantage was having solid definitions for everything from which to work, all accessible via an API, i.e. having the MDR. FHIR was pretty easy to get to grips with and the integration with BCs was readily achieved. Moving them closer to each other is an obvious step for the industry. Generating the domain, well that was relatively easy. Now I have done the easy case but I can already see how it can be extended to finding domains in a generic manner. The timing variables will be fun but are actually a different problem in a way; it’s the study design problem and how study design relates to observations. Then I will need to extend to the other domain classes but, as ever, iterate, learn, adjust and repeat.

I can also see that pooling data from several studies would be relatively easy. Adding more data from another source that employs matching BC definitions simply results in any query returning more results. This is one of the things I will go on to investigate.

So I see great promise. First steps, but important ones.


Keen readers will notice I did not mention the third tab and the study graph. This may be a little too much for some readers, so I thought I would put this in a separate section. What I also did was provide a visualisation of the graph held within the Neo4j database. These are screenshots of the stages in the process. The colour key is shown below and I have also annotated the graphs.


This is the graph after the first step of loading a form. Note we have the central study node linked to the two BCs that are then built up from a set of child nodes (based on BRIDG classes and attributes plus ISO 21090 datatypes) that eventually arrive at the leaf nodes. These are then connected to potential values if they are coded responses and linked to CDISC terminology. Here the values are the units for Weight and Height. These are then linked to the UCUM equivalent values, the mapped terms.

Step 1

Graph after adding a form

The second graph is after the addition of the data for a single patient (subject). We see the addition of the nodes at the centre for the subject attached to a patient and then a copy of the BCs (cloned from the study node). I did this so as to simplify implementation. It creates more nodes but has the advantage of carrying the metadata with the data. At the leaves you see the data points attached to the relevant BC nodes. Look at the top and you will see a BC node with four possible answers (CDISC terms) but one data node indicating the answer extracted from the EHR and linked to the UCUM term.


Graph after the addition of the data for a patient

The third graph is after the linking in of the SDTM domain. The metadata in the MDR contained within the domain and BCs allows this to be automatic. The variables are linked to the BC leaves which in turn have the data values attached. These links then allow the domain to be generated and the values to be placed into the correct place.


Graph after linking in the SDTM domain