Overview
Pathway analysis is a common task in genomics research and there are many available R-based software tools. Depending on the tool, it may be necessary to import the pathways, translate genes to the appropriate species, convert between symbols and IDs, and format the resulting object.
The msigdbr R package provides Molecular Signatures Database (MSigDB) gene sets typically used with the Gene Set Enrichment Analysis (GSEA) software:
- in an R-friendly “tidy” format with one gene pair per row
- for multiple frequently studied model organisms, such as mouse, rat, pig, zebrafish, fly, and yeast, in addition to the original human genes
- as gene symbols as well as NCBI Entrez and Ensembl IDs
- without accessing external resources or requiring an active internet connection
Please be aware that the homologs were computationally predicted for distinct genes. The full pathways may not be well conserved across species.
Usage
Load the package with library(msigdbr).
All gene sets in the database can be retrieved by specifying a species of interest.
all_gene_sets <- msigdbr(species = "Mus musculus")
head(all_gene_sets)
#> # A tibble: 6 × 18
#> gs_cat gs_subcat gs_name gene_symbol entrez_gene ensembl_gene human_gene…¹
#> <chr> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 C3 MIR:MIR_Lega… AAACCA… Abcc4 239273 ENSMUSG0000… ABCC4
#> 2 C3 MIR:MIR_Lega… AAACCA… Abraxas2 109359 ENSMUSG0000… ABRAXAS2
#> 3 C3 MIR:MIR_Lega… AAACCA… Actn4 60595 ENSMUSG0000… ACTN4
#> 4 C3 MIR:MIR_Lega… AAACCA… Acvr1 11477 ENSMUSG0000… ACVR1
#> 5 C3 MIR:MIR_Lega… AAACCA… Adam9 11502 ENSMUSG0000… ADAM9
#> 6 C3 MIR:MIR_Lega… AAACCA… Adamts5 23794 ENSMUSG0000… ADAMTS5
#> # … with 11 more variables: human_entrez_gene <int>, human_ensembl_gene <chr>,
#> # gs_id <chr>, gs_pmid <chr>, gs_geoid <chr>, gs_exact_source <chr>,
#> # gs_url <chr>, gs_description <chr>, taxon_id <int>, ortholog_sources <chr>,
#> # num_ortholog_sources <dbl>, and abbreviated variable name
#> # ¹human_gene_symbol
You can retrieve data just for a specific collection/category, such as the hallmark gene sets.
h_gene_sets <- msigdbr(species = "mouse", category = "H")
head(h_gene_sets)
#> # A tibble: 6 × 18
#> gs_cat gs_subcat gs_name gene_symbol entrez_gene ensembl_gene human_gene…¹
#> <chr> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 H "" HALLMARK_A… Abca1 11303 ENSMUSG0000… ABCA1
#> 2 H "" HALLMARK_A… Abcb8 74610 ENSMUSG0000… ABCB8
#> 3 H "" HALLMARK_A… Acaa2 52538 ENSMUSG0000… ACAA2
#> 4 H "" HALLMARK_A… Acadl 11363 ENSMUSG0000… ACADL
#> 5 H "" HALLMARK_A… Acadm 11364 ENSMUSG0000… ACADM
#> 6 H "" HALLMARK_A… Acads 11409 ENSMUSG0000… ACADS
#> # … with 11 more variables: human_entrez_gene <int>, human_ensembl_gene <chr>,
#> # gs_id <chr>, gs_pmid <chr>, gs_geoid <chr>, gs_exact_source <chr>,
#> # gs_url <chr>, gs_description <chr>, taxon_id <int>, ortholog_sources <chr>,
#> # num_ortholog_sources <dbl>, and abbreviated variable name
#> # ¹human_gene_symbol
You can specify a sub-category, such as C2 (curated) CGP (chemical and genetic perturbations) gene sets.
cgp_gene_sets <- msigdbr(species = "mouse", category = "C2", subcategory = "CGP")
head(cgp_gene_sets)
#> # A tibble: 6 × 18
#> gs_cat gs_subcat gs_name gene_symbol entrez_gene ensembl_gene human_gene…¹
#> <chr> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 C2 CGP ABBUD_LIF_… Ahnak 66395 ENSMUSG0000… AHNAK
#> 2 C2 CGP ABBUD_LIF_… Alcam 11658 ENSMUSG0000… ALCAM
#> 3 C2 CGP ABBUD_LIF_… Ankrd40 71452 ENSMUSG0000… ANKRD40
#> 4 C2 CGP ABBUD_LIF_… Arid1a 93760 ENSMUSG0000… ARID1A
#> 5 C2 CGP ABBUD_LIF_… Bckdhb 12040 ENSMUSG0000… BCKDHB
#> 6 C2 CGP ABBUD_LIF_… AU021092 239691 ENSMUSG0000… C16orf89
#> # … with 11 more variables: human_entrez_gene <int>, human_ensembl_gene <chr>,
#> # gs_id <chr>, gs_pmid <chr>, gs_geoid <chr>, gs_exact_source <chr>,
#> # gs_url <chr>, gs_description <chr>, taxon_id <int>, ortholog_sources <chr>,
#> # num_ortholog_sources <dbl>, and abbreviated variable name
#> # ¹human_gene_symbol
If you require more custom filtering, the msigdbr() function output is a data frame that can be manipulated using standard methods. For example, you can subset to a specific collection/category using dplyr::filter().
# Assumes the %>% pipe is available, for example via library(dplyr).
all_gene_sets %>%
  dplyr::filter(gs_cat == "H") %>%
  head()
#> # A tibble: 6 × 18
#> gs_cat gs_subcat gs_name gene_symbol entrez_gene ensembl_gene human_gene…¹
#> <chr> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 H "" HALLMARK_A… Abca1 11303 ENSMUSG0000… ABCA1
#> 2 H "" HALLMARK_A… Abcb8 74610 ENSMUSG0000… ABCB8
#> 3 H "" HALLMARK_A… Acaa2 52538 ENSMUSG0000… ACAA2
#> 4 H "" HALLMARK_A… Acadl 11363 ENSMUSG0000… ACADL
#> 5 H "" HALLMARK_A… Acadm 11364 ENSMUSG0000… ACADM
#> 6 H "" HALLMARK_A… Acads 11409 ENSMUSG0000… ACADS
#> # … with 11 more variables: human_entrez_gene <int>, human_ensembl_gene <chr>,
#> # gs_id <chr>, gs_pmid <chr>, gs_geoid <chr>, gs_exact_source <chr>,
#> # gs_url <chr>, gs_description <chr>, taxon_id <int>, ortholog_sources <chr>,
#> # num_ortholog_sources <dbl>, and abbreviated variable name
#> # ¹human_gene_symbol
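The same idea works without dplyr. As a minimal sketch, assuming a small hand-made data frame (the hypothetical toy_sets, with a few rows copied from the outputs above) stands in for the msigdbr() output, base R subsetting can select one sub-collection or several collections at once:

```r
# Toy stand-in for msigdbr() output; values are copied from the printed examples above.
toy_sets <- data.frame(
  gs_cat = c("H", "H", "C2", "C3"),
  gs_subcat = c("", "", "CGP", "MIR:MIR_Legacy"),
  gene_symbol = c("Abca1", "Abcb8", "Ahnak", "Abcc4"),
  stringsAsFactors = FALSE
)

# Keep only rows from the C2 collection, CGP sub-collection.
cgp_only <- toy_sets[toy_sets$gs_cat == "C2" & toy_sets$gs_subcat == "CGP", ]

# Keep rows that belong to any of several collections.
h_or_c3 <- toy_sets[toy_sets$gs_cat %in% c("H", "C3"), ]
```

With real msigdbr() output, the same expressions apply to the full set of columns.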
Helper functions
There are msigdbr_species() and msigdbr_collections() helper functions to assist with setting the msigdbr() parameters.
You can check the available species with msigdbr_species(). Either scientific or common names are acceptable for the msigdbr() function.
msigdbr_species()
#> # A tibble: 20 × 2
#> species_name species_common_name
#> <chr> <chr>
#> 1 Anolis carolinensis Carolina anole, green anole
#> 2 Bos taurus bovine, cattle, cow, dairy cow, domestic cat…
#> 3 Caenorhabditis elegans NA
#> 4 Canis lupus familiaris dog, dogs
#> 5 Danio rerio leopard danio, zebra danio, zebra fish, zebr…
#> 6 Drosophila melanogaster fruit fly
#> 7 Equus caballus domestic horse, equine, horse
#> 8 Felis catus cat, cats, domestic cat
#> 9 Gallus gallus bantam, chicken, chickens, Gallus domesticus
#> 10 Homo sapiens human
#> 11 Macaca mulatta rhesus macaque, rhesus macaques, Rhesus monk…
#> 12 Monodelphis domestica gray short-tailed opossum
#> 13 Mus musculus house mouse, mouse
#> 14 Ornithorhynchus anatinus duck-billed platypus, duckbill platypus, pla…
#> 15 Pan troglodytes chimpanzee
#> 16 Rattus norvegicus brown rat, Norway rat, rat, rats
#> 17 Saccharomyces cerevisiae baker's yeast, brewer's yeast, S. cerevisiae
#> 18 Schizosaccharomyces pombe 972h- NA
#> 19 Sus scrofa pig, pigs, swine, wild boar
#> 20 Xenopus tropicalis tropical clawed frog, western clawed frog
You can check the available collections with msigdbr_collections().
msigdbr_collections()
#> # A tibble: 23 × 3
#> gs_cat gs_subcat num_genesets
#> <chr> <chr> <int>
#> 1 C1 "" 299
#> 2 C2 "CGP" 3399
#> 3 C2 "CP" 29
#> 4 C2 "CP:BIOCARTA" 292
#> 5 C2 "CP:KEGG" 186
#> 6 C2 "CP:PID" 196
#> 7 C2 "CP:REACTOME" 1635
#> 8 C2 "CP:WIKIPATHWAYS" 712
#> 9 C3 "MIR:MIR_Legacy" 221
#> 10 C3 "MIR:MIRDB" 2377
#> 11 C3 "TFT:GTRD" 517
#> 12 C3 "TFT:TFT_Legacy" 610
#> 13 C4 "CGN" 427
#> 14 C4 "CM" 431
#> 15 C5 "GO:BP" 7763
#> 16 C5 "GO:CC" 1035
#> 17 C5 "GO:MF" 1763
#> 18 C5 "HPO" 5142
#> 19 C6 "" 189
#> 20 C7 "IMMUNESIGDB" 4872
#> 21 C7 "VAX" 347
#> 22 C8 "" 704
#> 23 H "" 50
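The gs_cat and gs_subcat values in this table are the ones used for the category and subcategory arguments in the earlier examples. As a rough sketch, assuming a hand-copied subset of the table (only the C2 and H rows, stored in a hypothetical collections data frame), base R can total the gene sets per collection and list the sub-collections available for one of them:

```r
# Rows copied from the msigdbr_collections() output above (C2 and H only).
collections <- data.frame(
  gs_cat = c("C2", "C2", "C2", "C2", "C2", "C2", "C2", "H"),
  gs_subcat = c("CGP", "CP", "CP:BIOCARTA", "CP:KEGG", "CP:PID",
                "CP:REACTOME", "CP:WIKIPATHWAYS", ""),
  num_genesets = c(3399L, 29L, 292L, 186L, 196L, 1635L, 712L, 50L),
  stringsAsFactors = FALSE
)

# Total number of gene sets per top-level collection.
totals <- aggregate(num_genesets ~ gs_cat, data = collections, FUN = sum)

# Sub-collection labels listed under C2.
c2_subcats <- collections$gs_subcat[collections$gs_cat == "C2"]
```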
Pathway enrichment analysis
The msigdbr output can be used with various pathway analysis packages.
Use the gene sets data frame for clusterProfiler with genes as Entrez Gene IDs.
msigdbr_t2g <- msigdbr_df %>%
dplyr::distinct(gs_name, entrez_gene) %>%
as.data.frame()
enricher(gene = gene_ids_vector, TERM2GENE = msigdbr_t2g, ...)
Use the gene sets data frame for clusterProfiler with genes as gene symbols.
msigdbr_t2g <- msigdbr_df %>%
dplyr::distinct(gs_name, gene_symbol) %>%
as.data.frame()
enricher(gene = gene_symbols_vector, TERM2GENE = msigdbr_t2g, ...)
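The TERM2GENE argument expects a two-column data frame of term and gene, which is why the snippets above reduce the output to distinct pairs. A minimal base R sketch of that step, assuming a toy data frame in which the gene set names, the identifier strings, and the duplicated pair are all made up for illustration:

```r
# Toy stand-in for msigdbr() output. Set names, identifiers, and the duplicated
# (gs_name, entrez_gene) pair are hypothetical.
toy_df <- data.frame(
  gs_name = c("SET_A", "SET_A", "SET_A", "SET_B"),
  entrez_gene = c(11303L, 11303L, 74610L, 52538L),
  ensembl_gene = c("ID_1", "ID_2", "ID_3", "ID_4"),
  stringsAsFactors = FALSE
)

# Base R equivalent of dplyr::distinct(gs_name, entrez_gene):
# keep the two columns of interest and drop repeated pairs.
toy_t2g <- unique(toy_df[, c("gs_name", "entrez_gene")])
```

The result has one row per gene set and gene combination, with the term in the first column and the gene in the second.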
Use the gene sets data frame for fgsea.
msigdbr_list <- split(x = msigdbr_df$gene_symbol, f = msigdbr_df$gs_name)
fgsea(pathways = msigdbr_list, ...)
Use the gene sets data frame for GSVA.
msigdbr_list <- split(x = msigdbr_df$gene_symbol, f = msigdbr_df$gs_name)
gsva(gset.idx.list = msigdbr_list, ...)
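The split() calls above turn the long data frame into a named list with one character vector of genes per gene set. A small sketch of the resulting structure, assuming a toy data frame with made-up gene set names:

```r
# Toy stand-in for msigdbr() output; the gene set names are hypothetical.
toy_df <- data.frame(
  gs_name = c("SET_A", "SET_B", "SET_A"),
  gene_symbol = c("Abca1", "Ahnak", "Abcb8"),
  stringsAsFactors = FALSE
)

# One list element per gene set, named by gs_name.
toy_list <- split(x = toy_df$gene_symbol, f = toy_df$gs_name)

# Number of genes in each set.
set_sizes <- lengths(toy_list)
```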
Potential questions or concerns
Which version of MSigDB was used?
This package was generated with MSigDB v7.5.1. The MSigDB version is used as the base of the msigdbr CRAN package version. You can check the installed version with packageVersion("msigdbr").
Can I download the gene sets directly from MSigDB instead of using this package?
Yes. You can then import the GMT files (with getGmt() from the GSEABase package, for example). The GMTs only include the human genes, even for gene sets generated from mouse experiments. If you are working with non-human data, you then have to convert the MSigDB genes to your organism or your genes to human.
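For orientation, a GMT file is a tab-separated text file with one gene set per line: the set name, a description field, and then the genes. A minimal base R sketch of reading that layout, assuming a temporary toy file rather than a real MSigDB download (the set names and descriptions are placeholders):

```r
# Write a tiny GMT-style file: name, description, then genes, tab-separated.
gmt_path <- tempfile(fileext = ".gmt")
writeLines(
  c("SET_A\tplaceholder description\tABCA1\tABCB8",
    "SET_B\tplaceholder description\tAHNAK"),
  gmt_path
)

# Split each line into its tab-separated fields.
gmt_fields <- strsplit(readLines(gmt_path), "\t", fixed = TRUE)

# Drop the name and description fields to keep only the genes,
# then name each element by its gene set.
gene_sets <- lapply(gmt_fields, function(fields) fields[-(1:2)])
names(gene_sets) <- vapply(gmt_fields, function(fields) fields[1], character(1))
```

A dedicated reader such as getGmt() is likely the more robust choice for real files.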
Can I convert between human and mouse genes just by adjusting gene capitalization?
That will work for most genes, but not all.
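The example outputs above contain both cases. As a quick sketch, assuming three human and mouse symbol pairs copied from those outputs, a naive capitalization rule reproduces two of the mouse symbols and misses the third:

```r
# Human symbols and their mouse counterparts, copied from the outputs above.
pairs <- data.frame(
  human = c("ABCA1", "AHNAK", "C16orf89"),
  mouse = c("Abca1", "Ahnak", "AU021092"),
  stringsAsFactors = FALSE
)

# Naive conversion: uppercase first character, lowercase everything after it.
naive_mouse <- paste0(
  toupper(substr(pairs$human, 1, 1)),
  tolower(substr(pairs$human, 2, nchar(pairs$human)))
)

# TRUE where the naive rule matches the mouse symbol in the data.
pairs$naive_matches <- naive_mouse == pairs$mouse
```

Here C16orf89 maps to AU021092, which no capitalization change can recover.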
Can I convert human genes to any organism myself instead of using this package?
Yes. A popular method is using the biomaRt package. You may still end up with dozens of homologs for some genes, so additional cleanup may be helpful.
Aren’t there already other similar tools?
There are a few resources that provide some of the msigdbr functionality and served as an inspiration for this package. WEHI provides MSigDB gene sets in R format for human and mouse. MSigDF is based on the WEHI resource but converts it to a more tidyverse-friendly data frame. There is a more recent ToledoEM/msigdf fork. These are updated at varying frequencies and may not be based on the latest version of MSigDB. Since 2022, the GSEA/MSigDB team has provided collections that are natively mouse and don’t require orthology conversion.
What if I have other questions?
You can submit feedback and report bugs on GitHub.
Details
The Molecular Signatures Database (MSigDB) is a collection of gene sets originally created for use with the Gene Set Enrichment Analysis (GSEA) software. To cite use of the underlying MSigDB data, reference Subramanian, Tamayo, et al. (2005, PNAS) and one or more of the following as appropriate: Liberzon, et al. (2011, Bioinformatics), Liberzon, et al. (2015, Cell Systems), and also the source for the gene set.
Gene homologs are provided by the HUGO Gene Nomenclature Committee at the European Bioinformatics Institute, which integrates the orthology assertions predicted for human genes by eggNOG, Ensembl Compara, HGNC, HomoloGene, Inparanoid, NCBI Gene Orthology, OMA, OrthoDB, OrthoMCL, Panther, PhylomeDB, TreeFam, and ZFIN. For each human equivalent within each species, only the ortholog supported by the largest number of databases is used.
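The selection rule can be illustrated on toy data. This is only a sketch under stated assumptions, not the package's actual build code: the gene names and support counts are made up, and the column names simply mirror those in the msigdbr() output.

```r
# Hypothetical candidate orthologs: GENE1 has two candidates, GENE2 has one.
candidates <- data.frame(
  human_gene_symbol = c("GENE1", "GENE1", "GENE2"),
  gene_symbol = c("Gene1a", "Gene1b", "Gene2"),
  num_ortholog_sources = c(9, 2, 5),
  stringsAsFactors = FALSE
)

# For each human gene, find the largest number of supporting databases.
best_support <- ave(
  candidates$num_ortholog_sources,
  candidates$human_gene_symbol,
  FUN = max
)

# Keep only the candidates that reach that maximum.
selected <- candidates[candidates$num_ortholog_sources == best_support, ]
```

In this sketch, tied candidates would all be kept. How ties are handled in the package itself is not described here.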
For information on how to cite an R package such as msigdbr, you can execute citation("msigdbr").