Introduction to msigdbr

About

Pathway analysis is a common task in genomics research and there are many available R-based software tools. Depending on the tool, it may be necessary to import the pathways, translate genes to the appropriate species, convert between symbols and IDs, and format the resulting object.

The msigdbr R package provides Molecular Signatures Database (MSigDB) gene sets typically used with the Gene Set Enrichment Analysis (GSEA) software:

in an R-friendly “tidy” format with one gene and gene set pair per row
for multiple frequently studied model organisms, such as mouse, rat, pig, zebrafish, fly, and yeast, in addition to the original human genes
as gene symbols as well as NCBI Entrez and Ensembl IDs
without accessing external resources requiring an active internet connection

Installation

The package can be installed from CRAN.

install.packages("msigdbr")

Usage

Basic usage

Load package.

library(msigdbr)

The msigdbr() function retrieves a data frame of all genes and gene sets in the database.

all_gene_sets <- msigdbr()
head(all_gene_sets)
#> # A tibble: 6 × 20
#>   gene_symbol ncbi_gene ensembl_gene db_gene_symbol db_ncbi_gene db_ensembl_gene
#>   <chr>       <chr>     <chr>        <chr>          <chr>        <chr>          
#> 1 ABCC4       10257     ENSG0000012… ABCC4          10257        ENSG00000125257
#> 2 ABRAXAS2    23172     ENSG0000016… ABRAXAS2       23172        ENSG00000165660
#> 3 ACTN4       81        ENSG0000013… ACTN4          81           ENSG00000130402
#> 4 ACVR1       90        ENSG0000011… ACVR1          90           ENSG00000115170
#> 5 ADAM9       8754      ENSG0000016… ADAM9          8754         ENSG00000168615
#> 6 ADAMTS5     11096     ENSG0000015… ADAMTS5        11096        ENSG00000154736
#> # ℹ 14 more variables: source_gene <chr>, gs_id <chr>, gs_name <chr>,
#> #   gs_collection <chr>, gs_subcollection <chr>, gs_collection_name <chr>,
#> #   gs_description <chr>, gs_source_species <chr>, gs_pmid <chr>,
#> #   gs_geoid <chr>, gs_exact_source <chr>, gs_url <chr>, db_version <chr>,
#> #   db_target_species <chr>

Species

The species parameter enables conversion of the original human genes to their orthologs in various model organisms, such as mouse. You can use msigdbr_species() to check the available species.

all_gene_sets <- msigdbr(species = "Mus musculus")
head(all_gene_sets)
#> # A tibble: 6 × 23
#>   gene_symbol ncbi_gene ensembl_gene db_gene_symbol db_ncbi_gene db_ensembl_gene
#>   <chr>       <chr>     <chr>        <chr>          <chr>        <chr>          
#> 1 Abcc4       239273    ENSMUSG0000… ABCC4          10257        ENSG00000125257
#> 2 Abraxas2    109359    ENSMUSG0000… ABRAXAS2       23172        ENSG00000165660
#> 3 Actn4       60595     ENSMUSG0000… ACTN4          81           ENSG00000130402
#> 4 Acvr1       11477     ENSMUSG0000… ACVR1          90           ENSG00000115170
#> 5 Adam9       11502     ENSMUSG0000… ADAM9          8754         ENSG00000168615
#> 6 Adamts5     23794     ENSMUSG0000… ADAMTS5        11096        ENSG00000154736
#> # ℹ 17 more variables: source_gene <chr>, gs_id <chr>, gs_name <chr>,
#> #   gs_collection <chr>, gs_subcollection <chr>, gs_collection_name <chr>,
#> #   gs_description <chr>, gs_source_species <chr>, gs_pmid <chr>,
#> #   gs_geoid <chr>, gs_exact_source <chr>, gs_url <chr>, db_version <chr>,
#> #   db_target_species <chr>, ortholog_taxon_id <int>, ortholog_sources <chr>,
#> #   num_ortholog_sources <dbl>

Please be aware that the orthologs are computationally predicted at the gene level. The full pathways may not be well conserved across species.
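
If you want to be more conservative about the predicted orthologs, one option is to filter on the num_ortholog_sources column. The sketch below uses a small hand-made data frame that mimics a few columns of the mouse output; the support counts and the cutoff of 3 are made up for illustration and are not recommendations from the package.

```r
# Toy rows mimicking a few columns of msigdbr(species = "Mus musculus") output.
# The num_ortholog_sources values are invented for this example.
toy_orthologs <- data.frame(
  gene_symbol = c("Abcc4", "Abraxas2", "Actn4", "Acvr1"),
  db_gene_symbol = c("ABCC4", "ABRAXAS2", "ACTN4", "ACVR1"),
  num_ortholog_sources = c(11, 2, 12, 9),
  stringsAsFactors = FALSE
)

# Hypothetical cutoff: keep orthologs supported by at least 3 databases.
min_sources <- 3
keep <- toy_orthologs$num_ortholog_sources >= min_sources
well_supported <- toy_orthologs[keep, ]

# Fraction of rows that pass the cutoff
mean(keep)
```

The same logical indexing should apply to a real msigdbr() data frame for a non-human species, since it carries the same column.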

There are human and mouse versions of MSigDB. The db_species parameter specifies the database (human by default).

all_mm_gene_sets <- msigdbr(db_species = "MM", species = "Mus musculus")
head(all_mm_gene_sets)
#> # A tibble: 6 × 20
#>   gene_symbol ncbi_gene ensembl_gene db_gene_symbol db_ncbi_gene db_ensembl_gene
#>   <chr>       <chr>     <chr>        <chr>          <chr>        <chr>          
#> 1 AU021092    239691    ENSMUSG0000… AU021092       239691       ENSMUSG0000005…
#> 2 Ahnak       66395     ENSMUSG0000… Ahnak          66395        ENSMUSG0000006…
#> 3 Alcam       11658     ENSMUSG0000… Alcam          11658        ENSMUSG0000002…
#> 4 Ankrd40     71452     ENSMUSG0000… Ankrd40        71452        ENSMUSG0000002…
#> 5 Arid1a      93760     ENSMUSG0000… Arid1a         93760        ENSMUSG0000000…
#> 6 Bckdhb      12040     ENSMUSG0000… Bckdhb         12040        ENSMUSG0000003…
#> # ℹ 14 more variables: source_gene <chr>, gs_id <chr>, gs_name <chr>,
#> #   gs_collection <chr>, gs_subcollection <chr>, gs_collection_name <chr>,
#> #   gs_description <chr>, gs_source_species <chr>, gs_pmid <chr>,
#> #   gs_geoid <chr>, gs_exact_source <chr>, gs_url <chr>, db_version <chr>,
#> #   db_target_species <chr>

The genes within each gene set may originate from a species different from the database target species, as indicated by the gs_source_species and db_target_species fields.
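
To see which gene sets come from a different species than the database target, you can compare the two fields directly. This is a minimal sketch on toy data; the gene set names are placeholders, and it assumes both columns use the same species codes (such as HS and MM), which you should confirm on your own data.

```r
# Toy gene-set-level table; names and species codes are illustrative only.
toy_sets <- data.frame(
  gs_name = c("SET_A", "SET_B", "SET_C"),
  gs_source_species = c("HS", "MM", "HS"),
  db_target_species = c("HS", "HS", "HS"),
  stringsAsFactors = FALSE
)

# Gene sets whose source species differs from the database target species
cross_species <- toy_sets[toy_sets$gs_source_species != toy_sets$db_target_species, ]

# Cross-tabulate source species against target species
species_table <- table(
  source = toy_sets$gs_source_species,
  target = toy_sets$db_target_species
)
species_table
```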

Collections

You can retrieve data just for a specific collection, such as the Hallmark gene sets.

h_gene_sets <- msigdbr(species = "mouse", collection = "H")

You can specify a sub-collection, such as C2 (curated) CGP (chemical and genetic perturbations) gene sets.

cgp_gene_sets <- msigdbr(species = "mouse", collection = "C2", subcollection = "CGP")

If you require more precise filtering, the msigdbr() function output is a data frame that can be manipulated using standard methods. For example, you can subset to a specific collection using dplyr.

dplyr::filter(all_gene_sets, gs_collection == "H")
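
The same idea extends to other conditions. The base R sketch below runs on a toy data frame with the same column names as the msigdbr() output; the gene set names and genes are placeholders chosen for illustration. It keeps several sub-collections at once and then selects gene sets by a name pattern.

```r
# Toy data frame with msigdbr-style columns; contents are illustrative only.
toy_gene_sets <- data.frame(
  gs_collection = c("H", "H", "C2", "C2"),
  gs_subcollection = c("", "", "CGP", "CP:REACTOME"),
  gs_name = c("HALLMARK_APOPTOSIS", "HALLMARK_HYPOXIA",
              "EXAMPLE_CGP_SET", "REACTOME_EXAMPLE_SET"),
  gene_symbol = c("Casp3", "Vegfa", "Trp53", "Akt1"),
  stringsAsFactors = FALSE
)

# Keep rows from several sub-collections at once
cgp_or_reactome <- subset(
  toy_gene_sets,
  gs_subcollection %in% c("CGP", "CP:REACTOME")
)

# Keep gene sets whose name matches a pattern
hypoxia_sets <- subset(toy_gene_sets, grepl("HYPOXIA", gs_name))
```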

Output

The msigdbr() function returns a data frame in which each row pairs one gene with one gene set, so a gene appears once for every gene set it belongs to. Generally, the gene_symbol (gene symbol) and gs_name (gene set name) columns are the most useful.
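
Because of this long format, simple tabulations give gene set sizes and gene membership counts. The sketch below uses a toy data frame with placeholder gene set names rather than real MSigDB content.

```r
# Toy long-format table: one row per gene and gene set combination.
toy_long <- data.frame(
  gs_name = c("SET_A", "SET_A", "SET_A", "SET_B", "SET_B"),
  gene_symbol = c("ABCC4", "ACTN4", "ACVR1", "ACTN4", "ADAM9"),
  stringsAsFactors = FALSE
)

# Number of genes in each gene set
set_sizes <- table(toy_long$gs_name)

# Number of gene sets each gene belongs to
gene_membership <- table(toy_long$gene_symbol)

set_sizes
gene_membership
```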

The data frame contains the following columns for gene-level info:

gene_symbol Official gene symbol for the requested species.
ncbi_gene NCBI (formerly Entrez) ID for the requested species.
ensembl_gene Ensembl ID for the requested species.
db_gene_symbol Official gene symbol in the MSigDB database.
db_ncbi_gene NCBI ID in the MSigDB database.
db_ensembl_gene Ensembl ID in the MSigDB database.
source_gene Gene identifier in the original publication.
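ortholog_taxon_id The taxon ID of the requested species. As the outputs above show, the ortholog_* columns appear only when genes are converted to orthologs in another species.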
ortholog_sources The databases that support the ortholog mapping.
num_ortholog_sources The number of databases that support the ortholog mapping.

The gs_* columns provide details about the gene sets:

gs_id Gene set systematic name.
gs_name Gene set standard name.
gs_collection The collection the gene set belongs to.
gs_subcollection The sub-collection the gene set belongs to.
gs_collection_name The name of the collection.
gs_description A description of the gene set.
gs_source_species The species from which the gene set originated.
gs_pmid PubMed ID.
gs_geoid GEO ID.
gs_exact_source The original source of the gene set, such as a resource identifier, a figure, or a supplementary document from a publication.
gs_url A URL to the original source of the gene set.

The db_* columns provide details about the MSigDB database and should be identical for the entire data frame:

db_version The version of the MSigDB database.
db_target_species The target species of the MSigDB database (HS or MM).

unique(all_gene_sets$db_version)
#> [1] "2025.1.Hs"

Helper functions

There are helper functions to assist with setting the msigdbr() parameters.

Use msigdbr_species() to check the available species. Both scientific and common names are acceptable for the msigdbr() function.

msigdbr_species()
#> # A tibble: 20 × 2
#>    species_name                    species_common_name                          
#>    <chr>                           <chr>                                        
#>  1 Anolis carolinensis             Carolina anole, green anole                  
#>  2 Bos taurus                      bovine, cattle, cow, dairy cow, domestic cat…
#>  3 Caenorhabditis elegans          NA                                           
#>  4 Canis lupus familiaris          dog, dogs                                    
#>  5 Danio rerio                     leopard danio, zebra danio, zebra fish, zebr…
#>  6 Drosophila melanogaster         fruit fly                                    
#>  7 Equus caballus                  domestic horse, equine, horse                
#>  8 Felis catus                     cat, cats, domestic cat                      
#>  9 Gallus gallus                   bantam, chicken, chickens, Gallus domesticus 
#> 10 Homo sapiens                    human                                        
#> 11 Macaca mulatta                  rhesus macaque, rhesus macaques, Rhesus monk…
#> 12 Monodelphis domestica           gray short-tailed opossum                    
#> 13 Mus musculus                    house mouse, mouse                           
#> 14 Ornithorhynchus anatinus        duck-billed platypus, duckbill platypus, pla…
#> 15 Pan troglodytes                 chimpanzee                                   
#> 16 Rattus norvegicus               brown rat, Norway rat, rat, rats             
#> 17 Saccharomyces cerevisiae        baker's yeast, brewer's yeast, S. cerevisiae 
#> 18 Schizosaccharomyces pombe 972h- NA                                           
#> 19 Sus scrofa                      pig, pigs, swine, wild boar                  
#> 20 Xenopus tropicalis              tropical clawed frog, western clawed frog

Use msigdbr_collections() to check the available collections.

msigdbr_collections()
#> # A tibble: 25 × 4
#>    gs_collection gs_subcollection  gs_collection_name               num_genesets
#>    <chr>         <chr>             <chr>                                   <int>
#>  1 C1            ""                "Positional"                              302
#>  2 C2            "CGP"             "Chemical and Genetic Perturbat…         3538
#>  3 C2            "CP"              "Canonical Pathways"                       19
#>  4 C2            "CP:BIOCARTA"     "BioCarta Pathways"                       292
#>  5 C2            "CP:KEGG_LEGACY"  "KEGG Legacy Pathways"                    186
#>  6 C2            "CP:KEGG_MEDICUS" "KEGG Medicus Pathways"                   658
#>  7 C2            "CP:PID"          "PID Pathways"                            196
#>  8 C2            "CP:REACTOME"     "Reactome Pathways"                      1787
#>  9 C2            "CP:WIKIPATHWAYS" "WikiPathways"                            885
#> 10 C3            "MIR:MIRDB"       "miRDB"                                  2377
#> 11 C3            "MIR:MIR_LEGACY"  "MIR_Legacy"                              221
#> 12 C3            "TFT:GTRD"        "GTRD"                                    505
#> 13 C3            "TFT:TFT_LEGACY"  "TFT_Legacy"                              610
#> 14 C4            "3CA"             "Curated Cancer Cell Atlas gene…          148
#> 15 C4            "CGN"             "Cancer Gene Neighborhoods"               427
#> 16 C4            "CM"              "Cancer Modules"                          431
#> 17 C5            "GO:BP"           "GO Biological Process"                  7583
#> 18 C5            "GO:CC"           "GO Cellular Component"                  1042
#> 19 C5            "GO:MF"           "GO Molecular Function"                  1855
#> 20 C5            "HPO"             "Human Phenotype Ontology"               5748
#> 21 C6            ""                "Oncogenic Signature"                     189
#> 22 C7            "IMMUNESIGDB"     "ImmuneSigDB"                            4872
#> 23 C7            "VAX"             "HIPC Vaccine Response"                   347
#> 24 C8            ""                "Cell Type Signature"                     866
#> 25 H             ""                "Hallmark"                                 50
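
The collections table is an ordinary data frame, so it can be summarized like any other. As a small sketch, the code below retypes five rows from the output above into a toy data frame (rather than calling the package) and totals the number of gene sets per collection.

```r
# A few rows copied by hand from the msigdbr_collections() output shown above.
toy_collections <- data.frame(
  gs_collection = c("C1", "C4", "C4", "C4", "H"),
  gs_subcollection = c("", "3CA", "CGN", "CM", ""),
  num_genesets = c(302L, 148L, 427L, 431L, 50L),
  stringsAsFactors = FALSE
)

# Total number of gene sets per top-level collection
totals <- aggregate(num_genesets ~ gs_collection, data = toy_collections, FUN = sum)
totals
```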

Pathway enrichment analysis

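The msigdbr output can be used with various pathway analysis packages. In the snippets below, msigdbr_df stands for a data frame returned by msigdbr(), the gene vectors are your own input, and ... stands for any additional arguments required by the analysis function.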

Use the gene sets data frame for clusterProfiler with genes as NCBI/Entrez IDs.

msigdbr_t2g <- dplyr::distinct(msigdbr_df, gs_name, ncbi_gene)
enricher(gene = gene_ids_vector, TERM2GENE = msigdbr_t2g, ...)

Use the gene sets data frame for clusterProfiler with genes as gene symbols.

msigdbr_t2g <- dplyr::distinct(msigdbr_df, gs_name, gene_symbol)
enricher(gene = gene_symbols_vector, TERM2GENE = msigdbr_t2g, ...)
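
To make the two-column term-to-gene format more concrete, here is a simplified over-representation test in base R using the hypergeometric distribution. It is only a sketch of the general idea and is not clusterProfiler's implementation; the gene names, the background universe, and the helper ora_p_value() are all hypothetical.

```r
# Toy TERM2GENE-style table (gene set name, gene); all values are made up.
toy_t2g <- data.frame(
  gs_name = c(rep("SET_A", 4), rep("SET_B", 3)),
  gene_symbol = c("G1", "G2", "G3", "G4", "G4", "G5", "G6"),
  stringsAsFactors = FALSE
)
universe <- paste0("G", 1:20)                 # hypothetical background genes
genes_of_interest <- c("G1", "G2", "G3", "G7")

# Hypothetical helper: P(overlap >= observed) under the hypergeometric model
ora_p_value <- function(set_genes, query, universe) {
  set_genes <- intersect(set_genes, universe)
  query <- intersect(query, universe)
  overlap <- length(intersect(set_genes, query))
  phyper(
    overlap - 1,                              # upper tail includes the observed overlap
    length(set_genes),                        # genes in the set
    length(universe) - length(set_genes),     # genes outside the set
    length(query),                            # number of genes drawn
    lower.tail = FALSE
  )
}

# One p-value per gene set (no multiple-testing correction in this sketch)
p_values <- sapply(
  split(toy_t2g$gene_symbol, toy_t2g$gs_name),
  ora_p_value,
  query = genes_of_interest,
  universe = universe
)
p_values
```

For real analyses, the dedicated packages also handle multiple-testing correction, gene set size limits, and background selection, which this sketch leaves out.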

Use the gene sets data frame for fgsea.

msigdbr_list <- split(x = msigdbr_df$gene_symbol, f = msigdbr_df$gs_name)
fgsea(pathways = msigdbr_list, ...)
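
The split() call turns the long data frame into a named list with one character vector of genes per gene set. The toy example below shows the resulting structure. It also drops repeated gene set and symbol pairs first, on the assumption that a symbol can occur more than once within a set (for example, when it maps to several IDs).

```r
# Toy long-format table with one deliberately repeated pair.
toy_pairs_df <- data.frame(
  gs_name = c("SET_A", "SET_A", "SET_A", "SET_B", "SET_B"),
  gene_symbol = c("ABCC4", "ACTN4", "ACTN4", "ACVR1", "ADAM9"),
  stringsAsFactors = FALSE
)

# Drop repeated gene set / gene symbol pairs
toy_pairs_df <- unique(toy_pairs_df[, c("gs_name", "gene_symbol")])

# Named list: one character vector of genes per gene set
toy_list <- split(x = toy_pairs_df$gene_symbol, f = toy_pairs_df$gs_name)

str(toy_list)
lengths(toy_list)
```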

Use the gene sets data frame for GSVA.

msigdbr_list <- split(x = msigdbr_df$gene_symbol, f = msigdbr_df$gs_name)
gsvapar <- gsvaParam(geneSets = msigdbr_list, ...)
gsva(gsvapar)

Earlier versions of GSVA (<1.50) only need the gsva() function.

msigdbr_list <- split(x = msigdbr_df$gene_symbol, f = msigdbr_df$gs_name)
gsva(gset.idx.list = msigdbr_list, ...)

Potential questions and concerns

Which version of MSigDB was used?

The MSigDB version is included in the returned data frame. Check the version with unique(msigdbr_df$db_version).

Why use this package when I can download the gene sets directly from MSigDB?

This package makes it more convenient to work with the MSigDB gene sets in R. You don’t need to import the GMT files and restructure the content to make it compatible with downstream tools. It adds Ensembl gene IDs and converts the gene symbols/IDs if you are working with non-human data.

Can I convert between human and mouse genes simply by adjusting gene capitalization?

That will work for most, but not all, genes. Since 2022, the GSEA/MSigDB team has provided collections that are natively mouse and don’t require orthology conversion.
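
A quick way to see the limitation is to compare a capitalization-based guess against known ortholog symbols. In the sketch below, to_mouse_case() is a hypothetical helper, the first two pairs come from the outputs shown earlier, and the TP53/Trp53 pair is added as a well-known exception.

```r
# Hypothetical helper: capitalize the first letter, lowercase the rest.
to_mouse_case <- function(symbols) {
  paste0(toupper(substr(symbols, 1, 1)), tolower(substring(symbols, 2)))
}

# Human symbols with their actual mouse ortholog symbols
toy_pairs <- data.frame(
  human = c("ABCC4", "ACTN4", "TP53"),
  mouse = c("Abcc4", "Actn4", "Trp53"),
  stringsAsFactors = FALSE
)

# Compare the capitalization-based guess with the real symbol
toy_pairs$guess <- to_mouse_case(toy_pairs$human)
toy_pairs$correct <- toy_pairs$guess == toy_pairs$mouse
toy_pairs
```

The guess for TP53 is Tp53, which does not match the mouse symbol Trp53.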

Can I convert genes to any organism myself?

One popular option is the biomaRt package. You may still end up with dozens of homologs for some genes, so additional cleanup may be helpful. Because biomaRt relies on live BioMart services, it can be disrupted by server outages or maintenance.

Aren’t there already other similar tools?

There are a few resources that provide some of the msigdbr functionality and served as an inspiration for this package. The EGSEAdata Bioconductor package provides data as a list. WEHI MSigDB provides data for human and mouse as RDS files. MSigDF provides data as an R data frame. These are updated at varying frequencies and may not be based on the latest version of MSigDB.

What if I have other questions?

Please post questions or feedback on GitHub Discussions. Report bugs on GitHub Issues.

Details

The Molecular Signatures Database (MSigDB) is a collection of gene sets originally created for use with the Gene Set Enrichment Analysis (GSEA) software. To cite the use of the underlying MSigDB data, reference Subramanian, Tamayo, et al. (2005, PNAS) and one or more of the following as appropriate: Liberzon, et al. (2011, Bioinformatics), Liberzon, et al. (2015, Cell Systems), and Castanza, et al. (2023, Nature Methods). Also cite the source for the gene set.

Gene homologs are mapped using the babelgene package. The information is sourced from the HUGO Gene Nomenclature Committee at the European Bioinformatics Institute, which integrates the orthology assertions predicted by eggNOG, Ensembl Compara, HGNC, HomoloGene, Inparanoid, NCBI Gene Orthology, OMA, OrthoDB, OrthoMCL, Panther, PhylomeDB, TreeFam, and ZFIN. For each human equivalent within each species, only the ortholog supported by the largest number of databases is used.
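
The selection rule in the last sentence can be sketched in base R as follows. This illustrates the idea on invented candidates and is not the babelgene implementation. The column names here are hypothetical, and how ties are resolved in the real mapping is not covered.

```r
# Toy candidate orthologs for two human genes; all values are invented.
toy_candidates <- data.frame(
  human_symbol = c("GENE1", "GENE1", "GENE2", "GENE2", "GENE2"),
  ortholog_symbol = c("Gene1a", "Gene1b", "Gene2a", "Gene2b", "Gene2c"),
  num_sources = c(10, 3, 2, 7, 5),
  stringsAsFactors = FALSE
)

# Sort by human gene, then by number of supporting databases (descending)
ranked <- toy_candidates[
  order(toy_candidates$human_symbol, -toy_candidates$num_sources),
]

# Keep the first (best supported) candidate for each human gene
best <- ranked[!duplicated(ranked$human_symbol), ]
best
```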

For information on how to cite the msigdbr R package, run citation("msigdbr").