Overview

Pathway analysis is a common task in genomics research, and many R-based software tools are available for it. Depending on the tool, it may be necessary to import the pathways, translate genes to the appropriate species, convert between symbols and IDs, and format the resulting object.

The msigdbr R package provides Molecular Signatures Database (MSigDB) gene sets typically used with the Gene Set Enrichment Analysis (GSEA) software:

  • in an R-friendly “tidy” format with one row per gene set and gene pair
  • for multiple frequently studied model organisms, such as mouse, rat, pig, zebrafish, fly, and yeast, in addition to the original human genes
  • as gene symbols as well as NCBI Entrez and Ensembl IDs
  • without accessing external resources requiring an active internet connection

Please be aware that the orthologs were computationally predicted at the gene level. The full pathways may not be well conserved across species.

Installation

The package can be installed from CRAN.

install.packages("msigdbr")

The package includes only a small subset of the full MSigDB database due to CRAN size limitations. Please install the msigdbdf package to access the full MSigDB database:

install.packages("msigdbdf", repos = "https://igordot.r-universe.dev")

Usage

Load the package.

library(msigdbr)

All gene sets in the database can be retrieved by specifying a species of interest.

all_gene_sets <- msigdbr(species = "Mus musculus")
head(all_gene_sets)
#> # A tibble: 6 × 22
#>   gene_symbol ncbi_gene ensembl_gene      db_gene_sy…¹ db_ncbi_gene db_ensembl…²
#>   <chr>           <int> <chr>             <chr>        <chr>        <chr>       
#> 1 Abcc4          239273 ENSMUSG000000328… ABCC4        10257        ENSG0000012…
#> 2 Abraxas2       109359 ENSMUSG000000309… ABRAXAS2     23172        ENSG0000016…
#> 3 Actn4           60595 ENSMUSG000000548… ACTN4        81           ENSG0000013…
#> 4 Acvr1           11477 ENSMUSG000000268… ACVR1        90           ENSG0000011…
#> 5 Adam9           11502 ENSMUSG000000315… ADAM9        8754         ENSG0000016…
#> 6 Adamts5         23794 ENSMUSG000000228… ADAMTS5      11096        ENSG0000015…
#> # ℹ abbreviated names: ¹​db_gene_symbol, ²​db_ensembl_gene
#> # ℹ 16 more variables: source_gene <chr>, gs_id <chr>, gs_name <chr>,
#> #   gs_collection <chr>, gs_subcollection <chr>, gs_collection_name <chr>,
#> #   gs_description <chr>, gs_source_species <chr>, gs_pmid <chr>,
#> #   gs_geoid <chr>, gs_url <chr>, db_version <chr>, db_target_species <chr>,
#> #   ortholog_taxon_id <int>, ortholog_sources <chr>, num_ortholog_sources <dbl>

You can retrieve data just for a specific collection, such as the Hallmark gene sets.

h_gene_sets <- msigdbr(species = "mouse", collection = "H")
head(h_gene_sets)
#> # A tibble: 6 × 22
#>   gene_symbol ncbi_gene ensembl_gene      db_gene_sy…¹ db_ncbi_gene db_ensembl…²
#>   <chr>           <int> <chr>             <chr>        <chr>        <chr>       
#> 1 Abca1           11303 ENSMUSG000000152… ABCA1        19           ENSG0000016…
#> 2 Abcb8           74610 ENSMUSG000000289… ABCB8        11194        ENSG0000019…
#> 3 Acaa2           52538 ENSMUSG000000368… ACAA2        10449        ENSG0000016…
#> 4 Acadl           11363 ENSMUSG000000260… ACADL        33           ENSG0000011…
#> 5 Acadm           11364 ENSMUSG000000629… ACADM        34           ENSG0000011…
#> 6 Acads           11409 ENSMUSG000000295… ACADS        35           ENSG0000012…
#> # ℹ abbreviated names: ¹​db_gene_symbol, ²​db_ensembl_gene
#> # ℹ 16 more variables: source_gene <chr>, gs_id <chr>, gs_name <chr>,
#> #   gs_collection <chr>, gs_subcollection <chr>, gs_collection_name <chr>,
#> #   gs_description <chr>, gs_source_species <chr>, gs_pmid <chr>,
#> #   gs_geoid <chr>, gs_url <chr>, db_version <chr>, db_target_species <chr>,
#> #   ortholog_taxon_id <int>, ortholog_sources <chr>, num_ortholog_sources <dbl>

You can specify a sub-collection, such as C2 (curated) CGP (chemical and genetic perturbations) gene sets.

cgp_gene_sets <- msigdbr(species = "mouse", collection = "C2", subcollection = "CGP")
head(cgp_gene_sets)
#> # A tibble: 6 × 22
#>   gene_symbol ncbi_gene ensembl_gene      db_gene_sy…¹ db_ncbi_gene db_ensembl…²
#>   <chr>           <int> <chr>             <chr>        <chr>        <chr>       
#> 1 Ahnak           66395 ENSMUSG000000698… AHNAK        79026        ENSG0000012…
#> 2 Alcam           11658 ENSMUSG000000226… ALCAM        214          ENSG0000017…
#> 3 Ankrd40         71452 ENSMUSG000000208… ANKRD40      91369        ENSG0000015…
#> 4 Arid1a          93760 ENSMUSG000000078… ARID1A       8289         ENSG0000011…
#> 5 Bckdhb          12040 ENSMUSG000000322… BCKDHB       594          ENSG0000008…
#> 6 AU021092       239691 ENSMUSG000000516… C16orf89     146556       ENSG0000015…
#> # ℹ abbreviated names: ¹​db_gene_symbol, ²​db_ensembl_gene
#> # ℹ 16 more variables: source_gene <chr>, gs_id <chr>, gs_name <chr>,
#> #   gs_collection <chr>, gs_subcollection <chr>, gs_collection_name <chr>,
#> #   gs_description <chr>, gs_source_species <chr>, gs_pmid <chr>,
#> #   gs_geoid <chr>, gs_url <chr>, db_version <chr>, db_target_species <chr>,
#> #   ortholog_taxon_id <int>, ortholog_sources <chr>, num_ortholog_sources <dbl>

The msigdbr() function returns a data frame, so if you require more precise filtering, you can manipulate the output using standard methods. For example, you can subset to a specific collection using dplyr.

all_gene_sets |>
  dplyr::filter(gs_collection == "H") |>
  head()
#> # A tibble: 6 × 22
#>   gene_symbol ncbi_gene ensembl_gene      db_gene_sy…¹ db_ncbi_gene db_ensembl…²
#>   <chr>           <int> <chr>             <chr>        <chr>        <chr>       
#> 1 Abca1           11303 ENSMUSG000000152… ABCA1        19           ENSG0000016…
#> 2 Abcb8           74610 ENSMUSG000000289… ABCB8        11194        ENSG0000019…
#> 3 Acaa2           52538 ENSMUSG000000368… ACAA2        10449        ENSG0000016…
#> 4 Acadl           11363 ENSMUSG000000260… ACADL        33           ENSG0000011…
#> 5 Acadm           11364 ENSMUSG000000629… ACADM        34           ENSG0000011…
#> 6 Acads           11409 ENSMUSG000000295… ACADS        35           ENSG0000012…
#> # ℹ abbreviated names: ¹​db_gene_symbol, ²​db_ensembl_gene
#> # ℹ 16 more variables: source_gene <chr>, gs_id <chr>, gs_name <chr>,
#> #   gs_collection <chr>, gs_subcollection <chr>, gs_collection_name <chr>,
#> #   gs_description <chr>, gs_source_species <chr>, gs_pmid <chr>,
#> #   gs_geoid <chr>, gs_url <chr>, db_version <chr>, db_target_species <chr>,
#> #   ortholog_taxon_id <int>, ortholog_sources <chr>, num_ortholog_sources <dbl>
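
The same kind of filtering can also be done in base R. The following is a minimal sketch, assuming a small toy data frame that only mimics a few msigdbr() columns; the gene set names are hypothetical, and the sub-collection labels are taken from the msigdbr_collections() listing. It keeps every canonical pathway sub-collection within C2.

```r
# Toy stand-in for msigdbr() output (hypothetical gene set names);
# the real data frame has many more columns and rows.
toy_gene_sets <- data.frame(
  gs_name = c("HALLMARK_EXAMPLE", "REACTOME_EXAMPLE", "KEGG_EXAMPLE", "CGP_EXAMPLE"),
  gs_collection = c("H", "C2", "C2", "C2"),
  gs_subcollection = c("", "CP:REACTOME", "CP:KEGG_LEGACY", "CGP"),
  gene_symbol = c("Abca1", "Acadl", "Acadm", "Ahnak"),
  stringsAsFactors = FALSE
)

# Keep rows in C2 whose sub-collection label starts with "CP"
is_cp <- toy_gene_sets$gs_collection == "C2" &
  startsWith(toy_gene_sets$gs_subcollection, "CP")
cp_gene_sets <- toy_gene_sets[is_cp, ]

cp_gene_sets$gs_name
```

With real msigdbr() output, the same logical indexing would apply to the full data frame in place of toy_gene_sets.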

The version of the MSigDB database is stored in the db_version column of the returned data frame.

unique(all_gene_sets$db_version)
#> [1] "2024.1.Hs"

Helper functions

There are helper functions to assist with setting the msigdbr() parameters.

Use msigdbr_species() to check the available species. Both scientific and common names are acceptable for the msigdbr() function.

msigdbr_species()
#> # A tibble: 20 × 2
#>    species_name                    species_common_name                          
#>    <chr>                           <chr>                                        
#>  1 Anolis carolinensis             Carolina anole, green anole                  
#>  2 Bos taurus                      bovine, cattle, cow, dairy cow, domestic cat…
#>  3 Caenorhabditis elegans          NA                                           
#>  4 Canis lupus familiaris          dog, dogs                                    
#>  5 Danio rerio                     leopard danio, zebra danio, zebra fish, zebr…
#>  6 Drosophila melanogaster         fruit fly                                    
#>  7 Equus caballus                  domestic horse, equine, horse                
#>  8 Felis catus                     cat, cats, domestic cat                      
#>  9 Gallus gallus                   bantam, chicken, chickens, Gallus domesticus 
#> 10 Homo sapiens                    human                                        
#> 11 Macaca mulatta                  rhesus macaque, rhesus macaques, Rhesus monk…
#> 12 Monodelphis domestica           gray short-tailed opossum                    
#> 13 Mus musculus                    house mouse, mouse                           
#> 14 Ornithorhynchus anatinus        duck-billed platypus, duckbill platypus, pla…
#> 15 Pan troglodytes                 chimpanzee                                   
#> 16 Rattus norvegicus               brown rat, Norway rat, rat, rats             
#> 17 Saccharomyces cerevisiae        baker's yeast, brewer's yeast, S. cerevisiae 
#> 18 Schizosaccharomyces pombe 972h- NA                                           
#> 19 Sus scrofa                      pig, pigs, swine, wild boar                  
#> 20 Xenopus tropicalis              tropical clawed frog, western clawed frog

Use msigdbr_collections() to check the available collections.

msigdbr_collections()
#> # A tibble: 25 × 4
#>    gs_collection gs_subcollection  gs_collection_name               num_genesets
#>    <chr>         <chr>             <chr>                                   <int>
#>  1 C1            ""                "Positional"                              302
#>  2 C2            "CGP"             "Chemical and Genetic Perturbat…         3494
#>  3 C2            "CP"              "Canonical Pathways"                       19
#>  4 C2            "CP:BIOCARTA"     "BioCarta Pathways"                       292
#>  5 C2            "CP:KEGG_LEGACY"  "KEGG Legacy Pathways"                    186
#>  6 C2            "CP:KEGG_MEDICUS" "KEGG Medicus Pathways"                   658
#>  7 C2            "CP:PID"          "PID Pathways"                            196
#>  8 C2            "CP:REACTOME"     "Reactome Pathways"                      1736
#>  9 C2            "CP:WIKIPATHWAYS" "WikiPathways"                            830
#> 10 C3            "MIR:MIRDB"       "miRDB"                                  2377
#> 11 C3            "MIR:MIR_LEGACY"  "MIR_Legacy"                              221
#> 12 C3            "TFT:GTRD"        "GTRD"                                    505
#> 13 C3            "TFT:TFT_LEGACY"  "TFT_Legacy"                              610
#> 14 C4            "3CA"             "Curated Cancer Cell Atlas gene…          148
#> 15 C4            "CGN"             "Cancer Gene Neighborhoods"               427
#> 16 C4            "CM"              "Cancer Modules"                          431
#> 17 C5            "GO:BP"           "GO Biological Process"                  7608
#> 18 C5            "GO:CC"           "GO Cellular Component"                  1026
#> 19 C5            "GO:MF"           "GO Molecular Function"                  1820
#> 20 C5            "HPO"             "Human Phenotype Ontology"               5653
#> 21 C6            ""                "Oncogenic Signature"                     189
#> 22 C7            "IMMUNESIGDB"     "ImmuneSigDB"                            4872
#> 23 C7            "VAX"             "HIPC Vaccine Response"                   347
#> 24 C8            ""                "Cell Type Signature"                     840
#> 25 H             ""                "Hallmark"                                 50

Pathway enrichment analysis

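The msigdbr output can be used with various pathway analysis packages. In the snippets in this section, msigdbr_df stands for a data frame returned by msigdbr(), and ... stands for the remaining arguments required by each function.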

Use the gene sets data frame for clusterProfiler with genes as NCBI/Entrez IDs.

msigdbr_t2g <- dplyr::distinct(msigdbr_df, gs_name, ncbi_gene)
enricher(gene = gene_ids_vector, TERM2GENE = msigdbr_t2g, ...)

Use the gene sets data frame for clusterProfiler with genes as gene symbols.

msigdbr_t2g <- dplyr::distinct(msigdbr_df, gs_name, gene_symbol)
enricher(gene = gene_symbols_vector, TERM2GENE = msigdbr_t2g, ...)
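
To illustrate what the deduplication step does, here is a minimal base R sketch on a toy data frame. It assumes that the same gene set and gene symbol pair can appear on more than one row when the rows differ only in another column; the set names and Ensembl-style IDs are hypothetical, and unique() is used here as a base R stand-in for dplyr::distinct().

```r
# Toy stand-in for msigdbr() output (hypothetical names and IDs).
# SET_A / Abca1 appears twice because the rows differ in ensembl_gene.
toy_df <- data.frame(
  gs_name = c("SET_A", "SET_A", "SET_A", "SET_B"),
  gene_symbol = c("Abca1", "Abca1", "Acadl", "Abca1"),
  ensembl_gene = c("ENS_1", "ENS_2", "ENS_3", "ENS_1"),
  stringsAsFactors = FALSE
)

# Keep only the two columns of interest, then drop repeated pairs
toy_t2g <- unique(toy_df[, c("gs_name", "gene_symbol")])

toy_t2g
```

The result has one row per set and gene pair, with the gene set column first and the gene column second.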

Use the gene sets data frame for fgsea.

msigdbr_list <- split(x = msigdbr_df$gene_symbol, f = msigdbr_df$gs_name)
fgsea(pathways = msigdbr_list, ...)
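
As a sketch of what split() produces, the toy example below builds a named list of gene sets and checks how many genes in each set are present in a ranking. The data frame, set names, and statistics are hypothetical; the named numeric vector is only an assumption about the general shape of gene-level statistics that a ranking-based tool would take, and fgsea itself is not called.

```r
# Toy stand-in for msigdbr() output (hypothetical set names)
toy_df <- data.frame(
  gs_name = c("SET_A", "SET_A", "SET_B", "SET_B", "SET_B"),
  gene_symbol = c("Abca1", "Acadl", "Acadm", "Acads", "Abca1"),
  stringsAsFactors = FALSE
)

# One list element per gene set, each holding a character vector of genes
toy_list <- split(x = toy_df$gene_symbol, f = toy_df$gs_name)

# Hypothetical gene-level statistics, named by gene symbol
toy_ranks <- c(Abca1 = 2.1, Acadl = -0.4, Acadm = 1.3, Acads = -1.8)

# Number of genes from each set that are present in the ranking
overlap <- lengths(lapply(toy_list, intersect, names(toy_ranks)))

overlap
```

Checking the overlap this way can help catch an identifier mismatch, such as symbols in the gene sets and Entrez IDs in the ranking, before running the analysis.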

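Use the gene sets data frame for GSVA. The call below uses the older gset.idx.list argument; newer GSVA releases may instead expect a parameter object such as gsvaParam(), so check the documentation for your installed version.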

msigdbr_list <- split(x = msigdbr_df$gene_symbol, f = msigdbr_df$gs_name)
gsva(gset.idx.list = msigdbr_list, ...)

Potential questions and concerns

Which version of MSigDB was used?

The MSigDB version is stored in the db_version column of the returned data frame. You can check the version used with unique(msigdbr_df$db_version).

Why use this package when I can download the gene sets directly from MSigDB?

This package makes it more convenient to work with MSigDB gene sets in R. You don’t need to download the GMT files and import them. You don’t need to learn how to restructure the output to make it compatible with downstream tools. You don’t need to convert the genes to your organism if you are working with non-human data.

Can I convert between human and mouse genes just by adjusting gene capitalization?

That will work for most, but not all, genes.
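
The following sketch illustrates this with four human and mouse symbol pairs copied from the example output earlier in this document. The helper capitalize_first() is a hypothetical name introduced here for illustration and is not part of msigdbr.

```r
# Human/mouse symbol pairs taken from the msigdbr() output shown above
orthologs <- data.frame(
  human = c("ABCC4", "ACTN4", "ADAM9", "C16orf89"),
  mouse = c("Abcc4", "Actn4", "Adam9", "AU021092"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: upper-case the first character, lower-case the rest
capitalize_first <- function(x) {
  paste0(toupper(substr(x, 1, 1)), tolower(substring(x, 2)))
}

# Compare the capitalization-based guess with the actual mouse symbol
orthologs$guess <- capitalize_first(orthologs$human)
orthologs$matches <- orthologs$guess == orthologs$mouse

orthologs
```

The first three guesses match the mouse symbols, while C16orf89 does not, because its mouse ortholog is listed as AU021092.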

Can I convert human genes to any organism myself instead of using this package?

Yes. One popular method is to use the biomaRt package. You may still end up with dozens of homologs for some genes, so additional cleanup may be needed.

Aren’t there already other similar tools?

There are a few resources that provide some of the msigdbr functionality and served as an inspiration for this package. WEHI provides MSigDB gene sets in R format for human and mouse. MSigDF and a more recent ToledoEM/msigdf fork provide a tidyverse-friendly data frame. These are updated at varying frequencies and may not be based on the latest version of MSigDB. Since 2022, the GSEA/MSigDB team provides collections that are natively mouse and don’t require orthology conversion.

What if I have other questions?

You can submit feedback and report bugs on GitHub.

Details

The Molecular Signatures Database (MSigDB) is a collection of gene sets originally created for use with the Gene Set Enrichment Analysis (GSEA) software. To cite the underlying MSigDB data, reference Subramanian, Tamayo, et al. (2005, PNAS) and one or more of the following as appropriate: Liberzon, et al. (2011, Bioinformatics), Liberzon, et al. (2015, Cell Systems), and Castanza, et al. (2023, Nature Methods), as well as the source for the gene set.

Gene homologs are provided by the HUGO Gene Nomenclature Committee at the European Bioinformatics Institute, which integrates the orthology assertions predicted for human genes by eggNOG, Ensembl Compara, HGNC, HomoloGene, Inparanoid, NCBI Gene Orthology, OMA, OrthoDB, OrthoMCL, Panther, PhylomeDB, TreeFam, and ZFIN. For each human equivalent within each species, only the ortholog supported by the largest number of databases is used.
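
The selection rule can be sketched in base R as follows. This is an illustration on hypothetical candidate data, not the code used to build the package; the gene names and counts are made up, and keeping all tied candidates is an assumption made here for simplicity.

```r
# Hypothetical candidate orthologs with the number of supporting databases
candidates <- data.frame(
  human_gene = c("GENE1", "GENE1", "GENE2", "GENE3", "GENE3"),
  ortholog = c("Gene1a", "Gene1b", "Gene2", "Gene3a", "Gene3b"),
  num_sources = c(10, 3, 7, 5, 5),
  stringsAsFactors = FALSE
)

# For each row, the highest support among candidates for the same human gene
best_support <- ave(candidates$num_sources, candidates$human_gene, FUN = max)

# Keep the best-supported candidate(s); ties are kept (an assumption)
best <- candidates[candidates$num_sources == best_support, ]

best$ortholog
```

In the returned data frame, the num_ortholog_sources column listed in the example output is the natural place to look when judging how well supported a given ortholog is.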

For information on how to cite the msigdbr R package, run citation("msigdbr").