Usage

There is a tiny SQLite database (1.4 MB) included with inspectomop to give first-time users a limited experimental playground and the ability to run code from the examples below.

Note

Inspectomop does NOT contain EHR data from real patients. The data are entirely synthetic and come from the SynPUF dataset released by Centers for Medicaid and Medicare Services (CMS).

Connecting to a database

Inspector objects are in charge of interfacing with the backend database, extracting the available OMOP CDM tables, and performing queries.

Inspectors require a single parameter, connection_url, for instantiation:

In [1]: import inspectomop as iomop

In [2]: connection_url = iomop.test.test_connection_url()

In [3]: inspector = iomop.Inspector(connection_url)

connection_url is a database URL defined by SQLAlchemy that describes how to connect to your database. A database URL has three main components: a dialect, driver, and URL. The dialect indicates what type of backend DB you wish to connect to. You can use any supported by SQLAlchemy (MySql, SQLite, Postgres, etc.) out-of-the-box or a dialect written by a third party. See the full list here. The driver indicates which python DBAPI library you wish to use to run your queries. The SQLAlchemy dialects often contain a default DBAPI, so this may or may not be necessary depending on your configuration. Finally, the URL indicates where to look for the database and includes options for supplying a username and password.

'dialect+driver://username:password@host:port/database'

Note

See the SQLAlchemy docs on engine configuration for more details.

Here is an example URL for MySQL:

In [4]: mysql_url = 'mysql://johnny:appleseed@localhost/omop'

and one for SQLite:

In [5]: sql_url = 'sqlite:////abs/path/to/tiny_omop_test.sqlite3'

As you can see SQLite URLs are slightly different. They include an extra ‘/’ and thus will have ‘///’ for relative paths and ‘////’ for absolute paths.

Inspecting a database

Accessing tables

The tables property of an Inspector contains a dictionary of associated OMOP tables that are accessible by table name.

In [6]: inspector.tables.keys()
Out[6]: dict_keys(['attribute_definition', 'care_site', 'cdm_source', 'cohort', 'cohort_attribute', 'cohort_definition', 'concept', 'concept_ancestor', 'concept_class', 'concept_relationship', 'concept_synonym', 'condition_era', 'condition_occurrence', 'cost', 'death', 'device_exposure', 'domain', 'dose_era', 'drug_era', 'drug_exposure', 'drug_strength', 'fact_relationship', 'location', 'measurement', 'note', 'note_nlp', 'observation', 'observation_period', 'payer_plan_period', 'person', 'procedure_occurrence', 'provider', 'relationship', 'source_to_concept_map', 'specimen', 'visit_occurrence', 'vocabulary'])

In [7]: person = inspector.tables['person']

Accessing table columns

The columns in each table object are dot accessible and can be assigned to variables to construct query statements.

In [8]: from sqlalchemy import select

In [9]: person_id = person.person_id

In [10]: statement = select([person_id])

In [11]: print(statement)
SELECT main.person.person_id 
FROM main.person

Complete table descriptions

You can also get a description of all columns within a table, the data types, etc.

In [12]: inspector.table_info('person')
Out[12]: 
                         column      type  nullable  primary_key
0                     person_id   INTEGER     False         True
1             gender_concept_id   INTEGER     False        False
2                 year_of_birth   INTEGER     False        False
3                month_of_birth   INTEGER      True        False
4                  day_of_birth   INTEGER      True        False
5                birth_datetime  DATETIME      True        False
6               race_concept_id   INTEGER     False        False
7          ethnicity_concept_id   INTEGER     False        False
8                   location_id   INTEGER      True        False
9                   provider_id   INTEGER      True        False
10                 care_site_id   INTEGER      True        False
11          person_source_value      TEXT      True        False
12          gender_source_value      TEXT      True        False
13     gender_source_concept_id   INTEGER      True        False
14            race_source_value      TEXT      True        False
15       race_source_concept_id   INTEGER      True        False
16       ethnicity_source_value      TEXT      True        False
17  ethnicity_source_concept_id   INTEGER      True        False

Running built-in queries

A basic example

There are a variety of built in queries available in the Queries submodule. A typical query takes arguments for inputs (concept_ids, keywords, etc.), an Inspector to run the query against, and optionally a list of columns to subset from the default columns returned by the query.

# retrieve concepts for a list of concept_ids
In [13]: from inspectomop.queries.general import concepts_for_concept_ids

In [14]: concept_ids = [2, 3, 4, 7, 8, 10, 46287342, 46271022]

In [15]: return_columns = ['concept_name', 'concept_id']

In [16]: concepts_for_concept_ids(concept_ids, inspector, return_columns=return_columns).fetchall()
Out[16]: 
[(2, 'Gender'),
 (3, 'Race'),
 (4, 'Ethnicity'),
 (7, 'Metadata'),
 (8, 'Visit'),
 (10, 'Procedure'),
 (46271022, 'Chronic kidney disease'),
 (46287342, '2 ML Verapamil hydrochloride 2.5 MG/ML Injection')]

Note

You can get a list of columns a query returns by looking at the return_columns parameter in the docstring for each query.

Specifying how results are returned

By default all queries return a Results object. Results objects behave like database cursors and have the expected methods such as .fetchone() and .fetchall() for fetching rows.

Note

Results objects ultimately point back to the underlying DBAPI used for interacting with the DB (pymssql for SQL Server, sqlite3 for SQLite, etc). More or less these should follow the pythong DB API spec for cursor objects. Most of this is handled by SQLAlchemy. Results is a subclass of sqlalchemy.engine.ResultProxy with additional methods for working with pandas.

Fetching examples

In [17]: results = concepts_for_concept_ids(concept_ids, inspector)

#get the return column names
In [18]: results.keys()
Out[18]: 
['concept_id',
 'concept_name',
 'concept_code',
 'concept_class_id',
 'standard_concept',
 'vocabulary_id',
 'vocabulary_name']

#get one row
In [19]: results.fetchone()
Out[19]: (2, 'Gender', 'OMOP generated', 'Domain', '', 'Domain', 'OMOP Domain')

#get many rows
In [20]: two_results = results.fetchmany(2)

In [21]: len(two_results)
Out[21]: 2

#iterating over rows
In [22]: for row in results:
   ....:     print(row[:2])
   ....: 
(7, 'Metadata')
(8, 'Visit')
(10, 'Procedure')
(46271022, 'Chronic kidney disease')
(46287342, '2 ML Verapamil hydrochloride 2.5 MG/ML Injection')

Results as pandas DataFrames

Results objects also have two handy methods, .as_pandas() and .as_pandas_chunks(), for returning results as pandas DataFrames.

#return the results as as a dataframe
In [23]: results = concepts_for_concept_ids(concept_ids, inspector).as_pandas()

In [24]: results[['concept_name','vocabulary_id']]
Out[24]: 
                                       concept_name vocabulary_id
0                                            Gender        Domain
1                                              Race        Domain
2                                         Ethnicity        Domain
3                                          Metadata        Domain
4                                             Visit        Domain
5                                         Procedure        Domain
6                            Chronic kidney disease        SNOMED
7  2 ML Verapamil hydrochloride 2.5 MG/ML Injection        RxNorm

# return the results in chunks
In [25]: chunksize = 3

In [26]: results = concepts_for_concept_ids(concept_ids, inspector).as_pandas_chunks(chunksize)

In [27]: for num, chunk in enumerate(results):
   ....:     print('chunk {}'.format(num + 1))
   ....:     print(chunk['concept_name'])
   ....: 
chunk 1
0       Gender
1         Race
2    Ethnicity
Name: concept_name, dtype: object
chunk 2
0     Metadata
1        Visit
2    Procedure
Name: concept_name, dtype: object
chunk 3
0                              Chronic kidney disease
1    2 ML Verapamil hydrochloride 2.5 MG/ML Injection
Name: concept_name, dtype: object

Creating custom queries

From SQLAlchemy SQL Expressions

Statements built out of constructs from SQLAlchemy’s SQL Expression API make queries backend-neutral paving the way for sharable code that can be used in a plug-and-play fashion. While there is no guarantee that every query will work with every backend, most of the basic selects, joins, etc should run without issue.

SQLAlchemy is extremely powerful, but like any software package, has a bit of a learning curve. It is highly recommended that users read the SQL Expression Language Tutorial and note the warning below.

Below are a few simple examples of using SQLAlchemy expression language constructs for running queries on the OMOP CDM.

Warning

Tables from Inspector.tables are actually mapped to ORM objects. These are NOT the same as Table objects from the SQLAlchemy Core API, although they can be used in nearly identical fashion in SQL Expressions with the following caveat about accessing table columns:

In [28]: from sqlalchemy import alias

In [29]: p = inspector.tables['person']

In [30]: p_alias = alias(inspector.tables['person'], 'p_alias')

# p is an automapped ORM object with dot accessible columns
In [31]: p
Out[31]: sqlalchemy.ext.automap.person

In [32]: p.person_id
Out[32]: <sqlalchemy.orm.attributes.InstrumentedAttribute at 0x7fd92f702eb8>

# p_alias is an Alias object.
# Columns must be accessed using .c.column
In [33]: p_alias
Out[33]: <sqlalchemy.sql.selectable.Alias at 0x7fd92f7485c0; p_alias>

In [34]: p_alias.c.person_id
Out[34]: Column('person_id', INTEGER(), table=<p_alias>, primary_key=True, nullable=False)

# and so this fails
In [35]: p_alias.person_id
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-35-a8a43d4647a5> in <module>
----> 1 p_alias.person_id

AttributeError: 'Alias' object has no attribute 'person_id'

Explanation: Using a portion of the SQLAlchemy ORM to infer table structure was a conscious design decision. Although it makes for a bit of confusion when constructing queries with SQL expressions users that work in an interactive development environment (iPython, Jupyter Notebooks, etc.) get the benefit of dot accessible column properties. In addition, automapping alleviates compatibility issues that would inevitably arise with hard-coded table structures on future versions of the OMOP CDM.

Select all of the conditions for person 1:

In [36]: from sqlalchemy import select, and_

In [37]: c = inspector.tables['concept']

In [38]: co = inspector.tables['condition_occurrence']

In [39]: person_id = 1

In [40]: statement = select([co.condition_start_date, co.condition_concept_id, c.concept_name]).\
   ....:             where(and_(\
   ....:                 co.person_id == person_id,\
   ....:                 co.condition_concept_id == c.concept_id))
   ....: 

In [41]: print(statement)
SELECT main.condition_occurrence.condition_start_date, main.condition_occurrence.condition_concept_id, main.concept.concept_name 
FROM main.condition_occurrence, main.concept 
WHERE main.condition_occurrence.person_id = :person_id_1 AND main.condition_occurrence.condition_concept_id = main.concept.concept_id

In [42]: inspector.execute(statement).as_pandas()
Out[42]: 
   condition_start_date  ...                                       concept_name
0            2010-03-12  ...                                       Osteoporosis
1            2009-07-25  ...                                           Backache
2            2009-07-25  ...                                      Low back pain
3            2010-08-17  ...                                        Neck sprain
4            2010-11-05  ...                 Subchronic catatonic schizophrenia
5            2009-10-14  ...                                       Hypocalcemia
6            2010-03-12  ...                           Congestive heart failure
7            2010-11-05  ...                                      Schizophrenia
8            2010-03-12  ...               Antiallergenic drug adverse reaction
9            2010-04-01  ...                                   Bipolar disorder
10           2010-03-12  ...                          Pure hypercholesterolemia
11           2009-10-14  ...                                 Postoperative pain
12           2010-04-01  ...  Bipolar I disorder, single manic episode, in f...
13           2009-07-25  ...                                Menopausal syndrome
14           2009-07-25  ...                               Thoracic radiculitis
15           2010-03-12  ...                                 Retention of urine

[16 rows x 3 columns]

Count the number of inpatient and outpatient visits for each person broken down by visit type and sorted by person_id:

In [43]: from sqlalchemy import join, func

In [44]: vo = inspector.tables['visit_occurrence']

In [45]: j = join(vo, c, vo.visit_concept_id == c.concept_id)

In [46]: j2 = join(j, p, vo.person_id == p.person_id)

In [47]: visit_types = ['Inpatient Visit','Outpatient Visit']

In [48]: statement = select([p.person_id, func.count(vo.visit_occurrence_id).label('num_visits'), c.concept_name.label('visit_type')]).\
   ....:             select_from(j2).\
   ....:             where(c.concept_name.in_(visit_types)).\
   ....:             group_by(p.person_id, c.concept_name).\
   ....:             order_by(p.person_id)
   ....: 

In [49]: inspector.execute(statement).as_pandas()
Out[49]: 
   person_id  num_visits        visit_type
0          1           1   Inpatient Visit
1          1           1  Outpatient Visit
2          2           4   Inpatient Visit
3          2           2  Outpatient Visit
4          3           1  Outpatient Visit
5          5           4  Outpatient Visit
6          7          18  Outpatient Visit
7          8          11  Outpatient Visit
8          9           2  Outpatient Visit

From Strings

You can execute unaltered SQL strings directly, but remember to always used parametrized code for shared/production projects.

Warning

Only use strings for rapid prototyping and in-house projects! Executing strings directly breaks backend compatibility and can potentially lead to SQL injection attacks!

Example:

In [50]: inspector.execute('select person_id from person').as_pandas()
Out[50]: 
   person_id
0          1
1          2
2          3
3          4
4          5
5          6
6          7
7          8
8          9
9         16

Sharing custom queries as functions

Custom queries that may prove useful to the OMOP CDM community can easily be shared by wrapping them in a function and following a standard recipe. View the source code on GitHub to get a better feel of how to construct queries and contribute (via pull request or posting your function in issues).

In general, consider the following:

  • appropriately named query functions should begin with the data you intend to return and end with the data/parameters you expect as input. E.g. concepts_for_concept_ids
  • the return value for a query should always be a Results object. This provides consistency and gives the end-user control over how to process the results.
  • write a docstring following the numpydoc docstring guide to accompany your code.

Prototype:

def output_for_input(inputs, inspector, return_columns=None):
    """
    Short description.

    Longer explanation.

    Parameters
    ----------
    inputs : type
        description of inputs
    inspector : inspectomop.inspector.Inspector
    return_columns : list of str, optional
        - optional subset of columns to return from the query
        - columns : ['col_name_1', 'col_name_2']

    Returns
    -------
    results : inspectomop.results.Results
        a cursor-like object with methods such as fetchone(), fetchmany() etc.

    Notes
    -----
    Optional

    """
    columns = [] # specify return columns

    if return_columns: # filter based on end-user selection
        columns = [col for col in columns if col.name in return_columns]

    statement = select([columns]).where(inputs == criteria)

    return inspector.execute(statement)