Overview
Pathway analysis is a common task in genomics research and there are many available R-based software tools. Depending on the tool, it may be necessary to import the pathways, translate genes to the appropriate species, convert between symbols and IDs, and format the resulting object.
The msigdbr R package provides Molecular Signatures Database (MSigDB) gene sets typically used with the Gene Set Enrichment Analysis (GSEA) software:
- in an R-friendly “tidy” format with one gene and gene set pair per row
- for multiple frequently studied model organisms, such as mouse, rat, pig, zebrafish, fly, and yeast, in addition to the original human genes
- as gene symbols as well as NCBI Entrez and Ensembl IDs
- without accessing external resources requiring an active internet connection
Usage
Load package.
The msigdbr() function retrieves a data frame of all genes and gene sets in the database.
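As a minimal sketch of the two steps above, the following loads the package and retrieves every gene set. The object name all_gene_sets is the one used in later examples, and mouse is chosen as the species only for illustration.

```r
# Load the package.
library(msigdbr)

# Retrieve all gene sets, with the human genes converted to mouse orthologs.
# The species is an illustrative choice; any supported species can be used.
all_gene_sets <- msigdbr(species = "Mus musculus")

# Inspect the first few rows.
head(all_gene_sets)
```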
Species
The species parameter enables conversion of the original human genes to their orthologs in various model organisms, such as mouse. You can use msigdbr_species() to check the available species.
Please be aware that the orthologs are computationally predicted at the gene level. The full pathways may not be well conserved across species.
There are human and mouse versions of MSigDB. The db_species parameter specifies the database (human by default).
The genes within each gene set may originate from a species different from the database target species, as indicated by the gs_source_species and db_target_species fields.
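The following sketch shows one way to request the mouse version of the database and then inspect the two fields mentioned above. It assumes that "MM" is the accepted value of db_species for the mouse database, and the object name mm_gene_sets is chosen only for this example.

```r
library(msigdbr)

# Request the mouse version of the database for mouse genes.
# Assumption: "MM" is the accepted code for the mouse database.
mm_gene_sets <- msigdbr(db_species = "MM", species = "Mus musculus")

# The database target species should be the same for every row.
unique(mm_gene_sets$db_target_species)

# Count rows by the species each gene set originated from.
table(mm_gene_sets$gs_source_species)
```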
Collections
You can retrieve data just for a specific collection, such as the Hallmark gene sets.
h_gene_sets <- msigdbr(species = "mouse", collection = "H")
You can specify a sub-collection, such as C2 (curated) CGP (chemical and genetic perturbations) gene sets.
cgp_gene_sets <- msigdbr(species = "mouse", collection = "C2", subcollection = "CGP")
If you require more precise filtering, the msigdbr() function output is a data frame that can be manipulated using standard methods. For example, you can subset to a specific collection using dplyr.
dplyr::filter(all_gene_sets, gs_collection == "H")
Output
The msigdbr() function returns a data frame with one gene per row. Generally, the gene_symbol (gene symbol) and gs_name (gene set name) columns are the most useful.
The data frame contains the following columns for gene-level info:
- gene_symbol: Official gene symbol for the requested species.
- ncbi_gene: NCBI (formerly Entrez) ID for the requested species.
- ensembl_gene: Ensembl ID for the requested species.
- db_gene_symbol: Official gene symbol in the MSigDB database.
- db_ncbi_gene: NCBI ID in the MSigDB database.
- db_ensembl_gene: Ensembl ID in the MSigDB database.
- source_gene: Gene identifier in the original publication.
- ortholog_taxon_id: The taxon ID of the species.
- ortholog_sources: The databases that support the ortholog mapping.
- num_ortholog_sources: The number of databases that support the ortholog mapping.
The gs_* columns provide details about the gene sets:
- gs_id: Gene set systematic name.
- gs_name: Gene set standard name.
- gs_collection: The collection the gene set belongs to.
- gs_subcollection: The sub-collection the gene set belongs to.
- gs_collection_name: The name of the collection.
- gs_description: A description of the gene set.
- gs_source_species: The species from which the gene set originated.
- gs_pmid: PubMed ID.
- gs_geoid: GEO ID.
- gs_exact_source: The original source of the gene set, such as a resource identifier, a figure, or a supplementary document from a publication.
- gs_url: A URL to the original source of the gene set.
The db_* columns provide details about the MSigDB database and should be identical for the entire data frame:
- db_version: The version of the MSigDB database.
- db_target_species: The target species of the MSigDB database (HS or MM).
unique(all_gene_sets$db_version)
#> [1] "2025.1.Hs"
Helper functions
There are helper functions to assist with setting the msigdbr() parameters.
Use msigdbr_species() to check the available species. Both scientific and common names are acceptable for the msigdbr() function.
Use msigdbr_collections() to check the available collections.
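A brief sketch of calling both helpers is shown below. The object names available_species and available_collections are chosen only for this example, and the exact columns returned may differ between package versions.

```r
library(msigdbr)

# List the species that can be passed to the species parameter.
available_species <- msigdbr_species()
head(available_species)

# List the available collections and sub-collections.
available_collections <- msigdbr_collections()
head(available_collections)
```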
Pathway enrichment analysis
The msigdbr output can be used with various pathway analysis packages.
Use the gene sets data frame for clusterProfiler with genes as NCBI/Entrez IDs.
msigdbr_t2g <- dplyr::distinct(msigdbr_df, gs_name, ncbi_gene)
enricher(gene = gene_ids_vector, TERM2GENE = msigdbr_t2g, ...)
Use the gene sets data frame for clusterProfiler with genes as gene symbols.
msigdbr_t2g <- dplyr::distinct(msigdbr_df, gs_name, gene_symbol)
enricher(gene = gene_symbols_vector, TERM2GENE = msigdbr_t2g, ...)
Use the gene sets data frame for fgsea.
msigdbr_list <- split(x = msigdbr_df$gene_symbol, f = msigdbr_df$gs_name)
fgsea(pathways = msigdbr_list, ...)
Use the gene sets data frame for GSVA.
msigdbr_list <- split(x = msigdbr_df$gene_symbol, f = msigdbr_df$gs_name)
gsvapar <- gsvaParam(geneSets = msigdbr_list, ...)
gsva(gsvapar)
Earlier versions of GSVA (<1.50) only need the gsva() function.
msigdbr_list <- split(x = msigdbr_df$gene_symbol, f = msigdbr_df$gs_name)
gsva(gset.idx.list = msigdbr_list, ...)
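To make the two input shapes used above more concrete, here is a small base R sketch on toy data. The gene set and gene names are made up, and the toy msigdbr_df only stands in for real msigdbr() output. Unlike the snippets above, it removes duplicate pairs before splitting, which is a design choice for this illustration rather than a requirement.

```r
# Toy stand-in for msigdbr() output; set and gene names are hypothetical.
msigdbr_df <- data.frame(
  gs_name = c("SET_A", "SET_A", "SET_B", "SET_B", "SET_B"),
  gene_symbol = c("Gene1", "Gene2", "Gene2", "Gene3", "Gene3"),
  stringsAsFactors = FALSE
)

# Two-column term-to-gene table with duplicate pairs removed
# (a base R counterpart of dplyr::distinct()).
msigdbr_t2g <- unique(msigdbr_df[, c("gs_name", "gene_symbol")])

# Named list with one character vector of genes per gene set.
msigdbr_list <- split(x = msigdbr_t2g$gene_symbol, f = msigdbr_t2g$gs_name)
msigdbr_list
```

The data frame form suits tools that take a term-to-gene table, while the named list suits tools that take one vector of genes per pathway.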
Potential questions and concerns
Which version of MSigDB was used?
The MSigDB version is stored in the db_version column of the returned data frame. You can check the version used with unique(msigdbr_df$db_version).
Why use this package when I can download the gene sets directly from MSigDB?
This package makes it more convenient to work with MSigDB gene sets in R. This package includes Ensembl IDs. You don’t need to download the GMT files and import them. You don’t need to learn how to restructure the output to make it compatible with downstream tools. You don’t need to convert the genes to your organism if you are working with non-human data.
Can I convert between human and mouse genes simply by adjusting gene capitalization?
That will work for most, but not all, genes.
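A small base R sketch illustrates the point with two well-known genes: GAPDH, where capitalization alone works, and TP53, whose mouse ortholog is named Trp53. The helper to_mouse_style() is hypothetical and written only for this example.

```r
# Hypothetical helper: upper-case first letter, lower-case remainder.
to_mouse_style <- function(x) {
  paste0(toupper(substr(x, 1, 1)), tolower(substring(x, 2)))
}

human_symbols <- c("GAPDH", "TP53")
mouse_symbols <- c("Gapdh", "Trp53")

# Compare the capitalization-based guess with the actual mouse symbols.
matches_capitalization <- to_mouse_style(human_symbols) == mouse_symbols
matches_capitalization
#> [1]  TRUE FALSE
```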
Can I convert human genes to any organism myself instead of using this package?
One popular method is using the biomaRt package. You may still end up with dozens of homologs for some genes, so additional cleanup may be helpful.
Aren’t there already other similar tools?
There are a few resources that provide some of the msigdbr functionality and served as an inspiration for this package. The EGSEAdata Bioconductor package provides data as a list. WEHI MSigDB provides data for human and mouse as RDS files. MSigDF provides data as an R data frame. These are updated at varying frequencies and may not be based on the latest version of MSigDB.
Since 2022, the GSEA/MSigDB team provides collections that are natively mouse and don’t require orthology conversion.
What if I have other questions?
Please submit feedback and report bugs on GitHub.
Details
The Molecular Signatures Database (MSigDB) is a collection of gene sets originally created for use with the Gene Set Enrichment Analysis (GSEA) software. To cite use of the underlying MSigDB data, reference Subramanian, Tamayo, et al. (2005, PNAS) and one or more of the following as appropriate: Liberzon, et al. (2011, Bioinformatics), Liberzon, et al. (2015, Cell Systems), Castanza, et al. (2023, Nature Methods) and also the source for the gene set.
Gene homologs are provided by the HUGO Gene Nomenclature Committee at the European Bioinformatics Institute, which integrates the orthology assertions predicted for human genes by eggNOG, Ensembl Compara, HGNC, HomoloGene, Inparanoid, NCBI Gene Orthology, OMA, OrthoDB, OrthoMCL, Panther, PhylomeDB, TreeFam and ZFIN. For each human gene within each species, only the ortholog supported by the largest number of databases is used.
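The selection rule in the last sentence can be sketched in base R on toy data. This is an illustration of the idea under stated assumptions, not the package's actual implementation: the gene names and support counts are made up, and ties are resolved by simply keeping the first candidate.

```r
# Toy ortholog candidates; symbols and counts are hypothetical.
candidates <- data.frame(
  human_symbol = c("GENE1", "GENE1", "GENE2"),
  ortholog_symbol = c("Gene1a", "Gene1b", "Gene2"),
  num_ortholog_sources = c(3, 9, 5),
  stringsAsFactors = FALSE
)

# Sort by human gene, then by decreasing number of supporting databases.
ranked <- candidates[order(candidates$human_symbol,
                           -candidates$num_ortholog_sources), ]

# Keep the top-ranked ortholog for each human gene.
best <- ranked[!duplicated(ranked$human_symbol), ]
best
```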
For information on how to cite the msigdbr R package, run citation("msigdbr").