Biomedical Informatics and Data Science Research Projects

Clinical Computing Within Yale New Haven Health System (YNHHS)

Faculty: Profs. Shiffman, Brandt, Melnick, Schonberger, Hsiao (YNHHS Chief Medical Information Officer), Thomas (YNHHS data warehouse team), Kashyap (Exec Director of Clinical Informatics for YNHHS), and Paek (Medical Director of Information Technology for Greenwich Hospital)

Description: YNHHS organizations, including Yale New Haven, Bridgeport, and Greenwich Hospitals, Yale Medical Group, and New England Medical Group, all share a single instance of the Epic database. Epic is certified and fully implemented. Its features include storage of all patient notes, lab and imaging reports, patient registries, a patient portal (MyChart), and integration with a clinical trials management system (OnCore). Epic’s transactional database is backed up nightly to a relational database, which feeds an enterprise-wide, star schema data warehouse that includes external data feeds: costs, claims, Medicare data, patient satisfaction surveys, Social Security Death Index, and cancer registry data. YNHHS has a Hadoop cluster that receives streaming sensor data from intensive care units, genomics data, and social media feeds. Collaborators include Yale's Center for Outcomes Research and Evaluation, featuring expertise in mathematical modeling and statistics. Recent projects include machine learning for prediction of sepsis and 30-day mortality in cancer patients. This rich environment has been the focus of informatics research projects, including those of: 1) Andrew Taylor, MD (MHS informatics thesis); and 2) Robert Schonberger, MD, MHS (participation in the Multicenter Perioperative Outcomes Group (MPOG) - Epic anesthesia data).

Ongoing Collaboration with the VA Connecticut Healthcare System (VACHS)

Faculty: Profs. Brandt, Justice, Cheung, Womack, Erdos, Concato, Rajeevan, and Miller

Description: Over the past two decades, there have been many collaborations between faculty at the YCMI and VACHS, as well as research opportunities for fellows. The IT group led by Prof. Erdos (Director of the VA East Coast IT Team) has moved much of the data from the national VA electronic health record (EHR) into queryable relational databases. This has allowed us to explore research questions related to the design, implementation, and use of large EHR-based clinical data registries and cohorts. Collaborations include: 1) Profs. Brandt and Justice receiving support for a medical informatics postdoctoral training program at the VACHS closely affiliated with Yale's NLM training program, 2) several projects between Profs. Brandt and Justice focusing on informatics research in clinical epidemiology and health services research, 3) Prof. Brandt leading the VACHS participation in a major national VA research project focused on the use of natural language/text mining techniques in the VA EHR, 4) Profs. Miller, Rajeevan, and Cheung providing informatics research for a major national GWAS focused on schizophrenia and bipolar disease within the VA, for which Prof. Concato is the Co-PI (Million Veteran Program - MVP), and 5) a newly funded research sub-study of the MVP project to study alcohol, smoking, and opioid use, led by Profs. Justice (Contact PI), Gerstein, Zhao, and Concato, to develop EHR-based phenotypes.

Clinical Decision Support Systems (CDS) and Clinical Practice Guideline Knowledge Management

Faculty: Profs. Shiffman, Brandt, Justice, Cheung, Miller, informatics and clinical faculty at Yale and VACHS

Description: A longstanding research activity has involved the development of systems that bring knowledge to the practicing clinician at the point of care. A related emphasis has been on development of quality clinical practice guidelines based on best evidence and delivered transparently. The GLIDES project explores the acquisition and representation of guideline knowledge, and implementation and evaluation of guidelines.

  • GLIDES (GuideLines Into Decision Support) and ERGO (Effective Representation of Guidelines With Ontologies). Funded by NLM and AHRQ, these projects developed tools and standards including GEM (the Guideline Elements Model) [19] standard for representation of guideline knowledge in XML, GLIA [20] (the Guideline Implementability Appraisal), the Yale Guideline Recommendation Corpus [21], and BRIDGE-Wiz [22]. These products have been widely applied by national and international guideline organizations [23] and have offered many opportunities for informatics fellows.
  • CDS systems have been implemented in VACHS with several fellows involved in the research. 1) The VACS (Veterans Aging Cohort Study) Index Calculator provides patients/providers with estimates of expected years of life, 2) Neuropath/CDS assists in the medical management of patients with neuropathic pain [24], and 3) TRIM (Tool to Reduce Inappropriate Medications) identifies potentially inappropriate medications in older adults and provides recommendations for improving the medication regimen.

National Genomics Data Science Initiatives

Faculty: Profs. Gerstein, Sestan and Weissman

Description: The Gerstein lab and collaborating faculty are involved in several large-scale national collaborations focused on aspects of genomics and data science. Many CBB graduate students participate in these projects, which provide a spectrum of translational bioinformatics research opportunities.

  • The 1000 Genomes Project: This is NIH's marquee effort on personal genomics, the sequencing of individual people's genomes. The project aims to sequence thousands of individuals’ genomes to get a sense of their variability. Prof. Gerstein’s group developed annotation pipelines and methods that were invaluable to the production phase of 1000 Genomes, and is involved in the 1KG SV trio project, a plan to sequence trios of individuals from multiple families to very high coverage.
  • ENCODE: As part of a multi-institutional collaboration, Prof. Gerstein’s lab is involved in annotating the human genome and developing methods for analyzing large-scale genomic experiments, focusing on pseudogene identification and annotation of the human genome in collaboration with the GENCODE team members, and elucidating transcription factor binding sites and chromatin structure based on ChIP-Seq experiments [25]. The group was involved in the ENCODE consortium’s rollout of a number of major human genome annotation papers in 2012 and 2014.
  • Brainspan/psychENCODE: In collaboration with Prof. Nenad Sestan’s lab at Yale, groups at USC, the Allen Brain Institute and elsewhere, Prof. Gerstein’s lab is developing RSEQtools, a suite of tools that performs common tasks on RNA-seq data for creating a reference atlas of gene expression in different regions of the brain that will provide valuable information to help interpret neurological function and dysfunction. This collaborative consortium has started work on PsychENCODE, a project aimed at understanding regulatory variants in the context of their functional connections to psychiatric disease involving a comprehensive examination of the genome, transcriptome, epigenome, and proteome in relation to brain function.
  • Privacy of Genomic Information & Data Science

Collaborative Translational Research in Statistical Genomics and Proteomics

Faculty: Profs. Zhao, Gerstein, Cheung, and many bioscience faculty at Yale

Description: Prof. Zhao’s laboratory, which is located immediately adjacent to the YCMI and to the Pathology Informatics Unit (on the same floor of the same building), focuses on a broad range of collaborative translational (disease focused) projects in statistical genomics and proteomics, in which numerous CBB students participate. Current research projects include: 1) data integration methods in genome-wide association studies to identify disease causal variants; 2) disease risk prediction from genomics data leveraging pleiotropy and functional annotations; 3) spatial temporal modeling of expression patterns in brain development in humans and other species; 4) cancer genomics; 5) single cell analysis; 6) microbiome analysis; 7) high dimensional longitudinal data analysis; and 8) outcome research. Prof. Zhao’s lab developed tissue specific functional annotations based on the Roadmap Epigenomics data and integrated them with various GWAS results to infer disease relevant tissues at the risk loci, to infer shared genetic contributions to diverse types of diseases (e.g. schizophrenia and immune diseases), and to improve disease risk predictions. His lab developed a suite of software to identify somatic mutation clusters in cancer. Prof. Zhao collaborates extensively with many researchers, including Prof. Gelernter on substance dependence disorders, Prof. Lifton on hypertension, Prof. Gunel on intracranial aneurysm, and Prof. Mistry on Gaucher disease.

Computational Immunology

Faculty: Profs. Kleinstein, Schatz, Kaech, Cheung, Kluger, O’Connor, and Montgomery

Description: Yale has a diverse research program in computational immunology that brings together expertise from a variety of scientific disciplines on research projects in vaccine response, host-pathogen dynamics, cell-fate choices, immune genomics, and many others. Computational Immunology is included as a distinct research area in the Immunology track of the Yale BBS program. Prof. Kleinstein has a longstanding collaboration with Prof. Schatz, Prof. Kaech, and other immunobiology-related faculty and combines techniques from dynamic modeling, systems biology, and bioinformatics to better understand the immune response. His group has developed many widely used computational methods for B cell immunoglobulin repertoire analysis through high-throughput sequencing. As part of the Program for Research on Immune Modeling and Experimentation, one of the NIAID Modeling Immunity for Biodefense centers, Prof. Kleinstein integrates mathematical modeling with experimental studies to elucidate the early immunological responses to influenza virus infection in key human cell types and in human lung tissue. As part of the NIH Human Immunology Project Consortium (HIPC), the team identifies molecular signatures associated with differential clinical outcomes of human infection and vaccination responses. Prof. Kleinstein also led the consortium-wide effort to identify common signatures of influenza vaccination responses. He currently works with Prof. Cheung to lead the HIPC data standards effort. Prof. Kluger is collaborating with Prof. Montgomery to apply deep learning methods to the analysis of single-cell immunophenotyping (CyTOF) data. Along with seminars and hands-on mini-courses, Prof. Kleinstein organized three summer schools in computational immunology at Yale, each associated with a research symposium (in 2008, 2011, and planned for June 2016).

Interdisciplinary Translational Research to Understand the Mechanisms Underlying Asthma

Faculty: Profs. Chupp, Krauthammer, Cheung, and Brandt

Description: A collaboration between Prof. Chupp (Internal Medicine, Pulmonary Critical Care) and informatics faculty (Profs. Krauthammer, Cheung and Brandt) involves translational research focusing on the genes, proteins, pathways, and regulatory networks underlying asthma (a.k.a. GenEx). A web-accessible database (YCAAD) allows Prof. Chupp’s lab members to collect data on patients with asthma, including clinical data, laboratory data, and high throughput genomic data from analysis of diverse samples (e.g., blood, sputum, and biopsy tissue). Several other informatics projects have emerged from this collaboration. One is a mobile health application being built with the Connecticut Foundation for Better Health (CTFBH) and Boehringer Ingelheim Pharmaceutical Company. It remotely integrates patients’ disease activity scores into the EMR (in this case Epic) and allows physicians to provide advice to patients on disease management. Patients receive regular alerts by email or text prompting them to enter data, and gamification features are being added. Features include integration of environmental conditions from weather stations in the patient’s zip code or location (via GPS attachment to rescue inhaler propeller) and wheeze and respiratory rate monitoring via an Automated Device for Asthma Management (ADAM).

The Yale Protein Expression Database & Related Informatics Projects

Faculty: Profs. Cheung, Krauthammer, Gerstein and Shifman

Description: The YCMI and Keck Biotechnology Center are involved in an ongoing collaboration to build, maintain, and refine the Yale Protein Expression Database (YPED) to help organize the processing of mass spectrometry proteomics data being produced in increasingly large volumes for researchers at Yale and beyond. YPED is a bioinformatics suite and database for proteomics research. It contains modules for liquid chromatography-tandem mass spectrometry (LC-MS/MS) database search results, label-based and label-free quantitative proteomic analysis, and several scoring outputs for phosphopeptide site localization. There are peptide and protein comparative analysis tools to enable pairwise analysis of distinct peptides/proteins in each sample and of overlapping peptides/proteins between all samples in multiple datasets. It has a targeted proteomics module for automated multiple reaction monitoring (MRM)/selective reaction monitoring (SRM) assay development. We linked YPED's search results and both label-based and label-free fold-change analysis to external tools/databases such as the Skyline Panorama repository for online spectra visualization. We have functionality to curate peptide identifications into an MS/MS peptide spectral library for all protein database search identification results. These features have been used in translational, disease-focused contexts. In addition, Prof. Gerstein’s lab studies protein function by understanding structure and molecular motions, taking into consideration the permitted packing geometry. An additional focus is protein (and gene) network analysis studying various aspects such as network hub bottlenecks and dynamics of networks. Finally, computational methods have been developed for deep integration of genomic, transcriptomic, and proteomic data using next generation transcriptome sequencing (RNA-Seq) and MS-based proteomics data.

Neuroinformatics as Part of the National Human Brain Project and the NIF

Faculty: Profs. Cheung, Miller, Shepherd, and Marenco

Description: As part of the national Human Brain Project, we developed informatics support of neuroscience research and computer-based modeling using the olfactory system as a pilot domain. Components of this project currently include: 1) ORDB, a database of information about olfactory receptors, 2) NeuronDB, a database of information about the compartmental properties and connectivity of different neurons, and 3) ModelDB, a database of models of neurons, neuronal compartments, and neuronal microcircuits. The project provides a focus for exploring a flexible approach for designing bioscience databases, the EAV/CR design (Entity-Attribute-Value with Classes and Relationships). This design facilitates the flexible storage and retrieval of complex bioscience data and of the biological relationships between those data items. In addition, we are participating in a national NIH-based collaboration to build a Neuroscience Information Framework (NIF) that allows researchers to flexibly search and query a wide spectrum of Internet-based neuroscience resources. One translational focus is on neuronal modeling to explore issues related to Alzheimer's Disease.
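The general idea behind an EAV/CR-style design can be illustrated with a minimal sketch. The schema below is an assumption made for illustration only and is not the schema used by ORDB, NeuronDB, or ModelDB; the table names, column names, and sample rows (build_eavcr_demo, describe, "example neuron A", and so on) are all hypothetical. It shows facts stored one row per entity-attribute pair, with each value holding either a literal or a reference to another entity (a relationship), and with classes describing which attributes exist.

```python
import sqlite3

def build_eavcr_demo():
    """Create a tiny in-memory EAV/CR-style store (illustrative schema only)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE classes    (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE attributes (id INTEGER PRIMARY KEY,
                                 class_id INTEGER REFERENCES classes(id),
                                 name TEXT);
        CREATE TABLE entities   (id INTEGER PRIMARY KEY,
                                 class_id INTEGER REFERENCES classes(id),
                                 label TEXT);
        -- One row per (entity, attribute) fact. ref_entity_id is set when the
        -- value is a relationship to another entity rather than a literal.
        CREATE TABLE eav_values (entity_id INTEGER REFERENCES entities(id),
                                 attribute_id INTEGER REFERENCES attributes(id),
                                 literal TEXT,
                                 ref_entity_id INTEGER REFERENCES entities(id));
    """)
    # Hypothetical sample data, invented for this sketch.
    con.executemany("INSERT INTO classes VALUES (?, ?)",
                    [(1, "Neuron"), (2, "Receptor")])
    con.executemany("INSERT INTO attributes VALUES (?, ?, ?)",
                    [(1, 1, "brain_region"), (2, 1, "expresses")])
    con.executemany("INSERT INTO entities VALUES (?, ?, ?)",
                    [(1, 1, "example neuron A"), (2, 2, "example receptor X")])
    con.executemany("INSERT INTO eav_values VALUES (?, ?, ?, ?)",
                    [(1, 1, "olfactory bulb", None), (1, 2, None, 2)])
    return con

def describe(con, entity_id):
    """Return {attribute name: literal or related entity label} for one entity."""
    rows = con.execute("""
        SELECT a.name, COALESCE(v.literal, e.label)
        FROM eav_values v
        JOIN attributes a ON a.id = v.attribute_id
        LEFT JOIN entities e ON e.id = v.ref_entity_id
        WHERE v.entity_id = ?
    """, (entity_id,)).fetchall()
    return dict(rows)
```

Calling describe(build_eavcr_demo(), 1) returns both the literal attribute and the relationship for the sample neuron. In a design of this kind, adding a new attribute or class means inserting rows rather than altering table definitions, which is one reason such designs are attractive for heterogeneous bioscience data.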

Natural Language Processing and Text Mining in Biomedicine

Faculty: Profs. Krauthammer, Brandt, Womack, Cheung, Justice, and Fodeh

Description: A variety of projects focus on natural language processing (NLP) and text mining in biomedicine. One set of projects is led by Prof. Krauthammer in Pathology Informatics, with an emphasis on text and image mining from the biomedical literature. An example is the Yale Image Finder system, which pioneered the use of advanced image analysis capabilities to extract image text for improving biomedical document retrieval. Another set of projects, based at the VACHS and directed by Prof. Brandt, focuses on natural language processing and text mining of the VA’s extensive EHR. Our informatics groups at VACHS and YCMI have worked with multiple informatics research teams to develop and extend open-source clinical NLP pipeline systems such as yTEX [26], a branch of cTAKES [27], and others to facilitate the detection of treatments and diagnoses (use of contraception [28], end stage liver disease [29], post-traumatic stress disorder [30], falls [31], and pseudo-seizure [32]) from clinical notes. Many of these projects were worked on by informatics trainees funded by NLM and/or VA.

Semantic Web and Ontologies in Biomedicine

Faculty: Profs. Gerstein, Cheung, Krauthammer, and Kleinstein

Description: This research continues a longstanding set of collaborative activities that focus on the integrative analysis of genomic, proteomic, neuroscience, pathway, and drug data from many different perspectives. The activities involve the use of biomedical ontologies and semantic web technologies to integrate diverse types of data from diverse types of web-accessible resources in several broad scientific domains.

  • Genomics research: We employed state-of-the-art semantic web technologies to represent and integrate a variety of genomics data including yeast genome, pseudogene and gene expression data. We expanded annotation and integration of metadata as part of the informatics effort in the Extracellular RNA Communication Consortium (ERCC). We work with the Gene Ontology Consortium (GOC) to extend the Gene Ontology (GO) to include extracellular RNA concepts and relationships and gene product annotation.
  • Neuroscience research: We converted neuroscience databases (SenseLab) into ontologies capturing knowledge about neurons and their cell membrane properties. We are also exploring how to link such neuroscience ontologies with other biomedical ontologies.
  • Translational research: We are using semantic web technologies to facilitate integration of high-throughput genomic data (e.g., microarray data and pathway data) with drug data in the context of cancer research.
  • Immunology research: As part of the Human Immunology Project Consortium (HIPC) data standards effort, we designed new data templates that have been adopted by the NIAID ImmPort repository for use by the wider scientific community. We have mapped metadata associated with these templates to existing ontologies to help standardize data annotation and semantic data integration across diverse immunology datasets. We have worked with external ontology maintainers, such as the Cell Ontology (CO), to include new concepts and improve annotation.
  • Big Data to Knowledge (BD2K) Initiative. As part of the BD2K initiative, we have been working with the Center for Extended Data Annotation and Retrieval (CEDAR) to utilize ontologies and related semantic web technologies in order to improve data submission to repositories such as NIAID ImmPort, so biomedical researchers and analysts create and use better metadata. We designed a 1-day training program on metadata management and standards for large scale experiments that will be offered as part of a computational immunology summer school being organized at Yale for June 2016. As part of bioCADDIE, we have been working on methods to improve access, discoverability and citation for immunoglobulin repertoire sequencing data.
  • Ontology-Driven NLP and Data Annotation and Integration: We incorporated biomedical ontologies into named entity recognition (as part of information extraction) and concept-based information retrieval in the context of semantic annotation and integration of biomedical data including genomics, proteomics, immunology and clinical data, and integrated ontologies into full-text searches in database engines (Microsoft SQLServer) and other full-text search technologies (e.g., Apache Solr).
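
As a rough illustration of how ontology terms can drive named entity recognition, the sketch below uses simple dictionary matching. It is an assumed, simplified approach and not the pipeline described above. The mini-ontology, its "EX:" identifiers, and the function names (build_matcher, annotate) are hypothetical and do not come from any real ontology or tool.

```python
import re

# Hypothetical mini-ontology: concept ID -> preferred label and synonyms.
# IDs and terms are invented for illustration, not taken from a real ontology.
ONTOLOGY = {
    "EX:0001": ["B cell", "B lymphocyte"],
    "EX:0002": ["influenza vaccine", "flu vaccine"],
}

def build_matcher(ontology):
    """Compile one case-insensitive regex over all labels, longest first."""
    label_to_id = {label.lower(): cid
                   for cid, labels in ontology.items()
                   for label in labels}
    # Longest labels first so longer terms win over shorter overlapping ones.
    alternatives = sorted(label_to_id, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(a) for a in alternatives) + r")\b",
        re.IGNORECASE,
    )
    return pattern, label_to_id

def annotate(text, ontology=ONTOLOGY):
    """Return (start, end, concept_id) for each ontology term found in text."""
    pattern, label_to_id = build_matcher(ontology)
    return [(m.start(), m.end(), label_to_id[m.group(1).lower()])
            for m in pattern.finditer(text)]
```

Because synonyms map to the same concept ID, documents mentioning "B lymphocyte" and "B cell" receive the same annotation, which is what makes concept-based (rather than keyword-based) retrieval possible. Production systems typically add tokenization, abbreviation handling, and disambiguation on top of matching like this.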

Quantitative Biomedicine

Faculty: Profs. Emonet, O’Hern, and Regan

Description: These research groups focus on computational studies of biological systems including proteins, cells, and tissues. The work is performed in collaboration with experimentalists from Yale’s Department of Molecular, Cellular, and Developmental Biology and Department of Molecular Biophysics and Biochemistry, as well as the Departments of Pathology, Dermatology, and Microbial Pathogenesis at Yale School of Medicine (YSM).

  • Discrete element simulations of tumor microenvironment: Prof. O’Hern with Prof. Bosenberg from the Departments of Dermatology and Pathology has developed analysis software to segment and identify melanoma and immune cells and vasculature from confocal microscopy images derived from melanoma animal models, and developed discrete element simulations that model spatiotemporal dynamics of the melanoma. These simulations allow the study of virtual tumors in three spatial dimensions, cross-validate results with tumor growth in vivo, and simulate tumor growth and metastasis.
  • Computational design of protein-protein interactions: Prof. O’Hern with Prof. Regan’s laboratory studies the malfunction of single protein-protein interactions in health and disease, work that is directly relevant to the development of small molecule inhibitors. The studies on protein-protein interactions allow for the design of new, specific protein binding modules that have a myriad of applications in medicine, for example in cell-targeting and biosensor applications, and can be considered as a versatile replacement for antibodies.
  • Dynamics studies of infectious bacteria: Prof. Emonet and his colleagues, in collaboration with Prof. Kazmierczak (Microbial Pathogenesis, YSM), study motile behaviors and flagellar properties of the infectious bacterium Pseudomonas aeruginosa. Prof. Emonet’s group has been involved in developing automated software that allows high-resolution tracking and behavioral analysis of large numbers of individual motile bacteria. Prof. O’Hern collaborates with Prof. Jacobs-Wagner’s laboratory (Molecular, Cellular and Developmental Biology) to study spatiotemporal dynamics of the cytoplasm of Escherichia coli (E. coli) bacteria with a focus on describing metabolically dormant cells. Dormant bacteria are known to be more resistant to stress including antibiotic treatment and provide a non-genetic basis for antibiotic resistance.

Bioinformatic Tools for Cancer Genomics and Epidemiology

Faculty: Profs. Townsend, Krauthammer, Gerstein and Kluger

Description: Next Generation Sequencing and other high throughput technologies have created tremendous potential for revealing the genetic basis and underlying molecular mechanisms of diverse forms of cancer. Many researchers at Yale use this data to derive novel insights into cancer diagnosis, therapy and risk prediction.

  • Prof. Townsend applies principles from evolutionary theory to study cancer whole-exome sequence data sets to reveal the level of clonal natural selection for cancer drivers. In collaboration with members of the Departments of Immunology and Pharmacology, he is using genomic sequencing and RNA-seq data to identify promising cancer immune therapeutic targets and pathways.
  • Prof. Krauthammer as Co-Director of the bioinformatics/biostatistics core of the Yale SPORE in skin cancer is supervising translational research collaborations with multiple researchers across the YSM. A special emphasis is the analysis of next generation sequencing data probing the melanoma genome, transcriptome, and epigenome. Another focus is the elucidation of kinase activation in cancer, as well as the molecular study of anticancer drug resistance.
  • Prof. Kluger implements and develops models for analysis of high dimensional data, including characterizing heterogeneity in cancers, improved detection of copy number alterations in Exome-seq data, approaches for studying the epigenetic landscape at different length scales, data organization of mutation profiles within specific cancer populations and data integration approaches for combining various data sources, such as combining 4C-seq, FISH and epigenetics for studying translocations in immune cells.
  • Prof. Gerstein’s work in cancer genomics focuses on noncoding sequence variants, and includes the development of tools such as Funseq, a tool for functionally annotating regulatory variants in cancer genome sequences, and LARVA, a tool for detecting significant mutation burdens in noncoding elements in cancer whole genomes. The group is also heavily involved in the Pan-Cancer Analysis Working Group (PCAWG), as co-leaders of the PCAWG-2 group.

Data Science Approaches for Knowledge Discovery from Vast Medical Claims and Clinical Datasets

Faculty: Profs. Krumholz, Krauthammer, Zhao, Bates, and Zhou

Description: The Yale Center for Outcomes Research and Evaluation (CORE, Prof. Krumholz, director) aims to develop novel approaches and tools to advance our ability to generate meaningful knowledge from large, complex health care data collections. The team of leaders in outcomes research, mathematical and modeling sciences, informatics, computer science, engineering, software development, economics, and implementation science works to integrate and apply methods from these fields towards complex questions using big healthcare data. CORE is committed to disseminating the tools that result, to training researchers in their use, and to collaborating externally to ensure that the work makes a difference for patients and society.  CORE has made particular strides in developing novel measurement strategies for the Centers for Medicare & Medicaid Services; supporting new approaches to assessing medical device surveillance at the Food and Drug Administration; changing the approach to clinical trial transparency through the implementation of groundbreaking partnerships and programs with industry and other organizations; developing advanced analytic approaches to medical problems with colleagues in applied mathematics and statistics; creating breakthrough research programs in China with the Chinese National Center for Cardiovascular Disease; and generating new knowledge in medicine through diverse federal grants and contracts.