Using OMOP for Real-World Data and Clinical Insights
During the planning process for Indgene's DataBox RWD platform, there was a good deal of discussion around the merits and challenges of using the OHDSI OMOP Common Data Model. In this post, we will summarize the content of those discussions from a data engineering and usability perspective. We will also point out some tweaks that can help mitigate challenges with the OMOP model and clarify the critical items that need to be addressed to find success with the OMOP approach.
What Is OMOP and Why Is It Important?
The Observational Medical Outcomes Partnership (OMOP) began as a partnership among the FDA, pharmaceutical companies, and healthcare providers. Today, it is no longer funded by the pharmaceutical industry or the government. Instead, it is supported by a consortium of researchers at universities and hospitals worldwide, and its tools and technology are freely available as open source. The OMOP project is centrally coordinated out of Columbia University in New York City.
In essence, OMOP consists of a data model and a set of mapping tables (“vocabularies”) that rationalize disparate clinical and observational data sets. These, along with a range of SQL-based ETL scripts and analytics routines built in R, enable these combined data sets to rigorously answer deep patient, population, and clinical questions. Here are some examples of questions one might answer using an OMOP approach:
| Merits | Challenges |
|---|---|
| Open-source data model and supporting software tools | ETL is non-trivial for new data sets |
| Can support a wide range of observational data sources and use cases | Data model is cumbersome for direct SQL users. Benefits from data marts over the OMOP model |
| Basic vocabularies are freely available to download | Data volumes can spiral upward without clear use cases to guide data pruning |
| SQL-based ETL scripts are freely available for major data sets | Deeply clinical, from the start |
| Deeply clinical, from the start | Can feel very academic |
| Enthusiastic user community | |
OMOP Sample Questions:
Of all available patients who took sertraline in the US between 2013 and 2016, what percentage experienced GI bleeding as a side effect in the 12 months after starting medication?
If a 42-year-old man in Waltham, MA, takes Risperdal, what is the likelihood that he will experience extrapyramidal effects?
Does sertraline cause GI bleeding more than fluoxetine?
What is the average cost of fMRI in hospital settings versus in outpatient clinics in Florida since 2016? What does patient insurance breakdown look like by geography, gender, and ethnicity?
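To make the first question concrete, here is a minimal sketch of how it might be expressed against two OMOP clinical tables, drug_exposure and condition_occurrence, using an in-memory SQLite database. The concept IDs, the toy rows, and the helper name outcome_rate are all assumptions made for illustration. A real analysis would also resolve the drug and outcome to full concept sets through the vocabulary tables and restrict by geography, which this sketch skips.

```python
import sqlite3

# Hypothetical concept IDs for illustration only; real IDs come from the OMOP vocabularies.
DRUG = 1001      # stands in for the drug of interest
OUTCOME = 2001   # stands in for the adverse event of interest

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE drug_exposure (
    person_id INTEGER, drug_concept_id INTEGER, drug_exposure_start_date TEXT);
CREATE TABLE condition_occurrence (
    person_id INTEGER, condition_concept_id INTEGER, condition_start_date TEXT);
""")

# Toy data, not real patients.
conn.executemany("INSERT INTO drug_exposure VALUES (?, ?, ?)", [
    (1, DRUG, "2013-03-01"),
    (1, DRUG, "2013-06-01"),   # refill; only the first exposure sets the index date
    (2, DRUG, "2014-01-15"),
    (3, DRUG, "2015-07-01"),
    (4, DRUG, "2016-02-01"),
    (5, DRUG, "2018-01-01"),   # first exposure falls outside the study window
])
conn.executemany("INSERT INTO condition_occurrence VALUES (?, ?, ?)", [
    (1, OUTCOME, "2013-09-10"),  # within 12 months of the index date
    (2, OUTCOME, "2016-05-01"),  # more than 12 months after the index date
    (3, OUTCOME, "2015-01-01"),  # before the index date
    (5, OUTCOME, "2018-02-01"),  # person is not in the cohort
])

QUERY = """
WITH first_exposure AS (
    SELECT person_id, MIN(drug_exposure_start_date) AS index_date
    FROM drug_exposure
    WHERE drug_concept_id = :drug
    GROUP BY person_id
    HAVING MIN(drug_exposure_start_date) BETWEEN :start AND :end
),
flagged AS (
    SELECT f.person_id,
           EXISTS (
               SELECT 1 FROM condition_occurrence c
               WHERE c.person_id = f.person_id
                 AND c.condition_concept_id = :outcome
                 AND c.condition_start_date > f.index_date
                 AND c.condition_start_date <= date(f.index_date, '+12 months')
           ) AS had_outcome
    FROM first_exposure f
)
SELECT COUNT(*), SUM(had_outcome) FROM flagged
"""

def outcome_rate(conn, drug, outcome, start, end):
    """Return (exposed, with_outcome, percent) for first-time users in the window."""
    exposed, with_outcome = conn.execute(
        QUERY, {"drug": drug, "outcome": outcome, "start": start, "end": end}
    ).fetchone()
    with_outcome = with_outcome or 0  # SUM yields NULL when the cohort is empty
    percent = 100.0 * with_outcome / exposed if exposed else 0.0
    return exposed, with_outcome, percent

result = outcome_rate(conn, DRUG, OUTCOME, "2013-01-01", "2016-12-31")
print(result)
```

The index date here is each person's first recorded exposure, and the 12-month follow-up window is counted from that date. Both are modeling choices that a study protocol would need to state explicitly.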
The OMOP Data Model:
The open-source data model makes jumpstarting an OMOP effort quite easy, and DDL scripts for a wide range of database platforms are available on GitHub. The OMOP data model itself is a fairly well-organized set of tables where most of the source observational data fit into clinical, health systems, or health economics data concept areas. It is easy to understand in concept but cumbersome in implementation. To use a bit of developer speak, the ETL is non-trivial, particularly where new input data sources are concerned.
From a data engineering perspective, the tables themselves are neither fully normalized to third normal form (3NF) nor fully denormalized into dimensional structures. While there are ETL specs published on GitHub to support a range of data sets such as Truven, Optum, CCAE, SEER, and so on, data engineers will typically require a significant amount of subject matter expert (SME) input to push new source data into these structures. The open-source specs will help, but good decisions along the way are more likely with clinical expertise at hand. Take visits as an example: Visits are not the same as Encounters, and EHR claims may need to be consolidated to prevent double counting. A data engineer would not typically know this and thus will need domain SME input.
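The double-counting problem can be illustrated with a small sketch. It assumes, purely for illustration, that source records for the same person with overlapping date ranges belong to one visit. The actual consolidation rule is exactly the kind of decision that needs SME input, and the function name consolidate_visits and the sample records are hypothetical.

```python
from collections import defaultdict
from datetime import date

def consolidate_visits(records):
    """Merge overlapping (person_id, start, end) records into single visits.

    Assumed rule (illustrative only): two records for the same person belong
    to the same visit when their date ranges overlap or touch on a shared day.
    """
    by_person = defaultdict(list)
    for person_id, start, end in records:
        by_person[person_id].append((start, end))

    visits = []
    for person_id in sorted(by_person):
        spans = sorted(by_person[person_id])
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Overlaps the current visit: extend it instead of counting again.
                cur_end = max(cur_end, end)
            else:
                visits.append((person_id, cur_start, cur_end))
                cur_start, cur_end = start, end
        visits.append((person_id, cur_start, cur_end))
    return visits

# Toy source records: several records can describe one hospital stay.
records = [
    (1, date(2020, 1, 5), date(2020, 1, 7)),   # facility record for a stay
    (1, date(2020, 1, 6), date(2020, 1, 6)),   # professional record during that stay
    (1, date(2020, 3, 1), date(2020, 3, 1)),   # separate later visit
    (2, date(2020, 1, 5), date(2020, 1, 5)),
    (2, date(2020, 1, 5), date(2020, 1, 5)),   # duplicate record for the same day
]

merged = consolidate_visits(records)
print(len(records), "records ->", len(merged), "visits")
```

Counting source rows directly would report five visits here, while the consolidated view reports three.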
After consolidating disparate sources of activity data into the model areas described above, “vocabularies” are where OMOP really delivers its business value. Vocabularies standardize descriptors across otherwise disparate data sets. They are mapping or bridge tables that resolve coding differences. To use a very simple example from OHDSI content (here, p. 55), note that “Paroxysmal Atrial Fibrillation” has three different codes across ICD9, ICD10, and ICD10CM:
Figure 3: Basic OMOP Vocabulary Example
The function of vocabularies is (at its simplest) to roll up these source codes into higher level "Concepts" while maintaining the availability of the original source detail. Seems simple enough, doesn't it? Now imagine that you need to do this exercise for every descriptor that exists in your data: race, gender, and ethnicity all come to mind, but how about Dx codes, drug names, drug strength, CPT4, and so on? The OMOP community has also supplied data to populate many of these standardized and/or public domain vocabularies to help this process along, and there is huge value in this. Where there is NO ready-made set of mapping values, organizations will need (again) clinical expertise to create specifications for the mapping, as well as continuous inputs to maintain and refine them.
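The roll-up can be sketched with the two vocabulary tables that carry most of this work, concept and concept_relationship, where a "Maps to" relationship links a source concept to its standard concept. The concept IDs and source codes below are made-up placeholders rather than real vocabulary content, the table definitions are trimmed to a few columns, and the helper to_standard is a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id INTEGER PRIMARY KEY, concept_name TEXT,
    vocabulary_id TEXT, concept_code TEXT, standard_concept TEXT);
CREATE TABLE concept_relationship (
    concept_id_1 INTEGER, concept_id_2 INTEGER, relationship_id TEXT);
""")

# Placeholder rows: IDs and codes are invented for illustration.
conn.executemany("INSERT INTO concept VALUES (?, ?, ?, ?, ?)", [
    (900, "Paroxysmal atrial fibrillation", "SNOMED",  "STD-1", "S"),
    (901, "Paroxysmal atrial fibrillation", "ICD9",    "A-1",   None),
    (902, "Paroxysmal atrial fibrillation", "ICD10",   "B-1",   None),
    (903, "Paroxysmal atrial fibrillation", "ICD10CM", "C-1",   None),
])
conn.executemany("INSERT INTO concept_relationship VALUES (?, ?, ?)", [
    (901, 900, "Maps to"),
    (902, 900, "Maps to"),
    (903, 900, "Maps to"),
])

def to_standard(conn, vocabulary_id, concept_code):
    """Return (concept_id, concept_name) of the standard concept, or None if unmapped."""
    return conn.execute(
        """
        SELECT std.concept_id, std.concept_name
        FROM concept src
        JOIN concept_relationship r
          ON r.concept_id_1 = src.concept_id AND r.relationship_id = 'Maps to'
        JOIN concept std
          ON std.concept_id = r.concept_id_2 AND std.standard_concept = 'S'
        WHERE src.vocabulary_id = ? AND src.concept_code = ?
        """,
        (vocabulary_id, concept_code),
    ).fetchone()

# Three different source codes resolve to the same standard concept.
mapped = [to_standard(conn, v, c)
          for v, c in [("ICD9", "A-1"), ("ICD10", "B-1"), ("ICD10CM", "C-1")]]
print(mapped)
print(to_standard(conn, "ICD10", "UNKNOWN"))  # None: needs a new mapping spec
```

The None case is where the clinical mapping work described above begins: a code with no ready-made mapping has to be specified and maintained by someone with domain knowledge.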
The vocabulary section represents a data model unto itself. It is both endlessly flexible and fairly abstracted, which detracts from its ease of use.
Figure 4: OMOP Data Model – Vocabularies
Usability – The costs of endless flexibility are often paid for by difficult implementations and/or lack of usability for the untrained. However, in an era where every organization is trying to be data driven, usability matters. Bringing the benefits of OMOP to a broader audience matters. It's not just the deep-in-the-weeds researchers who need OMOP. Increasingly, even those closer to the business side of life sciences are looking for ways to ask questions of their data without necessarily being either a clinical researcher or a data engineer. “OHDSI-in-a-box” was a start in this direction, but much more needs to be done to bridge the gap.
Data Science Workbench (DSW) approaches are finding just this sort of middle ground, enabling collaboration among data scientists while also allowing end users to access visualization-driven results. Perhaps just as important, DSWs allow for the use of a broad range of languages, while the OMOP community is heavily SQL and R focused.
Data Model Maturity – In its current form, the complexity of the OMOP model supports a very wide array of use cases but can be a struggle to use for insight generation. The concept of data discovery is a useful way to think about usability in analytics platforms. Being able to quickly ask questions of the data, while avoiding code as long as possible, will generally increase user adoption. Semantic models and targeted data marts are two ways to do this. A semantic model layered over the source model could help reduce perceived complexity for end users.
Taking this approach a step further, a data mart modeling exercise that used the OMOP model as its source could yield significant benefits:
The dimensional modeling portion of that exercise would consolidate activity and vocabularies into more browsable data structures. Instead of having Visit_Occurrences and then doing lookups to find Visit_Types, users would see a unified Visit dimension.
The resulting SQL would be simpler and easier to optimize, which better supports data discovery platforms (see Usability).
The fact modeling portion of any data mart exercise tends to help crystallize use cases and modeling around a discrete problem set, answering the question, “What are we solving for?”
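As a rough sketch of the unified Visit dimension idea, the example below layers a view over visit_occurrence and concept so that users see readable labels instead of concept IDs. The view name dim_visit, its column names, the concept IDs, and the sample rows are assumptions for illustration. A production data mart would more likely materialize this as a table with its own keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (concept_id INTEGER PRIMARY KEY, concept_name TEXT);
CREATE TABLE visit_occurrence (
    visit_occurrence_id INTEGER PRIMARY KEY, person_id INTEGER,
    visit_concept_id INTEGER, visit_type_concept_id INTEGER,
    visit_start_date TEXT, visit_end_date TEXT);

-- Placeholder concepts and visits; the IDs are invented for illustration.
INSERT INTO concept VALUES
    (1, 'Inpatient Visit'), (2, 'Outpatient Visit'), (10, 'Claim'), (11, 'EHR');
INSERT INTO visit_occurrence VALUES
    (100, 1, 1, 10, '2020-01-05', '2020-01-07'),
    (101, 1, 2, 11, '2020-03-01', '2020-03-01'),
    (102, 2, 2, 10, '2020-01-05', '2020-01-05');

-- Hypothetical dimension: one row per visit with its lookups already resolved.
CREATE VIEW dim_visit AS
SELECT v.visit_occurrence_id AS visit_key,
       v.person_id,
       vc.concept_name       AS visit_category,
       tc.concept_name       AS record_source,
       v.visit_start_date,
       v.visit_end_date
FROM visit_occurrence v
LEFT JOIN concept vc ON vc.concept_id = v.visit_concept_id
LEFT JOIN concept tc ON tc.concept_id = v.visit_type_concept_id;
""")

# End users can now browse by label without writing the lookup joins themselves.
rows = conn.execute("""
    SELECT visit_category, COUNT(*)
    FROM dim_visit
    GROUP BY visit_category
    ORDER BY visit_category
""").fetchall()
print(rows)
```

The joins to concept are written once in the view, so every downstream query against dim_visit stays a simple select and group-by.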
Summary – OMOP's promise is significant. The ability to rationalize multiple observational data sets and manage the complex terminology within this domain is incredibly useful and important. However, organizations may want to consider the level of commitment they are willing to provide to support the OMOP approach for their own internal projects. Frankly, it is not for the faint of heart and requires strong collaboration among technical and clinical teams to succeed. Organizations with robust capabilities to handle complex enterprise efforts will do well. For smaller organizations and for more targeted projects with limited data sets, a standard dimensional modeling or data mart approach will likely provide faster time-to-insight.