MGazine of MGnify data#
What is a mgnipy.MGazine?#
Study and Analysis details include ‘downloads’ fields, which contain information such as types, short descriptions and urls for the datasets output by MGnify pipelines.
mgnipy.MGazine, as well as more analysis-specific classes such as TaxaMGazine and DWCTaxaMGazine, can be used to download the datasets to disk or read them into a notebook.
For downloading, MGazine supports all file types. For streaming (via mixins.StreamMixin), the supported file types are:
TSV/CSV — stream_pandas (pandas) or stream_polars (polars) (handles gzipped TSV/CSV).
TXT — stream_txt (full text or line-chunks).
HTML — stream_html (opens in browser).
FASTA / GFF / BIOM — stream_fasta, stream_gff, stream_biom (scikit-bio generators).
JSONL / NDJSON — stream_jsonl (pandas or polars).
Tree / Newick — stream_tree (scikit-bio).
Other — JSON files under other are streamed via stream_json; binary/unsupported types should be downloaded.
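To make the list above concrete, here is a minimal sketch of how a file name could be matched to one of the listed stream handlers. It is an illustration only: the helper guess_stream_handler, the table HANDLER_BY_SUFFIX and the exact extension spellings are assumptions, and mgnipy's own dispatch may work differently.

```python
from pathlib import PurePosixPath

# Illustrative mapping only: extension -> handler name from the list above.
# The extension spellings are assumptions, not taken from mgnipy.
HANDLER_BY_SUFFIX = {
    ".tsv": "stream_pandas / stream_polars",
    ".csv": "stream_pandas / stream_polars",
    ".txt": "stream_txt",
    ".html": "stream_html",
    ".fasta": "stream_fasta",
    ".gff": "stream_gff",
    ".biom": "stream_biom",
    ".jsonl": "stream_jsonl",
    ".ndjson": "stream_jsonl",
    ".nwk": "stream_tree",
    ".tree": "stream_tree",
    ".json": "stream_json",
}


def guess_stream_handler(filename: str) -> str:
    """Return a handler name for a file name, or 'download' if none matches."""
    suffixes = [s.lower() for s in PurePosixPath(filename).suffixes]
    # Assumption: a trailing ".gz" is ignored so "table.tsv.gz" is treated as TSV.
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    if not suffixes:
        return "download"
    # Unknown or binary types fall back to plain downloading.
    return HANDLER_BY_SUFFIX.get(suffixes[-1], "download")


print(guess_stream_handler("ERP120598_SILVA-SSU_study_summary.tsv"))
```

Falling back to "download" mirrors the advice above that binary or unsupported types should be downloaded rather than streamed.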
Accessing a MGazine from a MGnifier search#
Recalling,
1. Start up a mgnipy.MGnipy client with your desired configuration
2. Search in MGnify resources using a mgnipy.MGnifier glass
3. Receive a mgnipy.MGazine of MGnify datasets
For step 2 specifically, the following mgnifiers can output a mgazine:
proxies.Study
proxies.Analysis
proxies.Studies
proxies.Analyses
In this demonstration we will get the MGazine of a single study, but the same steps apply to a multi-study collection of proxies.Studies.
from mgnipy import MGnipy
# 1. init with default config
MG = MGnipy()
# 2. search up a study/analysis detail or a list of studies/analyses and get their details
study = MG.study("MGYS00010442")
study.get()
The MGazine for a given study or analysis detail can be accessed via its .datasets attribute.
# access the study's mgazine
mz = study.datasets
# check it out
print(mz)
MGazine containing:
- MGnify pipeline versions: ['v6']
- Number of downloads: 8
- Short descriptions: ['DwC-Ready summary of 16S-V3-V4 ASV taxonomies using -PR2 as ref DB',
'DwC-Ready summary of 16S-V3-V4 ASV taxonomies using -SILVA as ref DB',
'DwC-Ready summary of closed-ref taxonomies using PR2 as ref DB',
'DwC-Ready summary of closed-ref taxonomies using SILVA-SSU as ref DB',
'Summary of DADA2-PR2 taxonomies',
'Summary of DADA2-SILVA taxonomies',
'Summary of PR2 taxonomies',
'Summary of SILVA-SSU taxonomies']
As we see above, the str representation of a mgazine gives us a peek at the pipeline versions within, the number of downloads and the short_description categories.
Downloading datasets#
To .download() or explore/read in via .stream() ONE download file/dataset, pass its url or alias.
The file aliases are available as a list via the .aliases attribute and are also shown in the “alias” column of .downloads_df().
The urls are also a column in .downloads_df(), and there are helpers .url_list and .url_dict, the latter providing {alias: url}.
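As a rough illustration of how an {alias: url} mapping relates to a list of urls, the sketch below derives each alias from the file name at the end of the url. That relationship is an assumption, and the urls and the helper alias_from_url are hypothetical and not part of mgnipy.

```python
from posixpath import basename
from urllib.parse import urlparse


def alias_from_url(url: str) -> str:
    """Use the last path segment of a url as a short alias (an assumption)."""
    return basename(urlparse(url).path)


# Hypothetical urls, for illustration only.
urls = [
    "https://example.org/results/ERP000001_summary.tsv",
    "https://example.org/results/ERP000001_taxonomy.tsv.gz",
]

# Mirrors the shape described above: {alias: url}, plus a list of aliases.
url_dict = {alias_from_url(u): u for u in urls}
aliases = list(url_dict)

print(aliases)
```

Looking a dataset up by alias is then a plain dictionary access, e.g. url_dict[aliases[0]].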
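# let's try out one
# note: `ssu` is not created in this section; it is assumed to be a
# MGazine (here a TaxaMGazine) holding the SILVA-SSU summary datasets
one_alias = ssu.aliases[0]
print(one_alias)
# download that single dataset into a "downloads" folder
ssu.download(to_dir="downloads", alias=one_alias)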
ERP120598_SILVA-SSU_study_summary.tsv
There is also the option to fetch every dataset at once with .download_all():
ssu.download_all(to_dir="downloads")
Reading in a dataset .stream()#
.stream() resolves a download alias or URL and returns the appropriate streaming handler for the file type. It supports returning either a full object (when chunksize is None) or an iterator of chunks when chunksize is provided.
df = ssu.stream(alias=one_alias, dataframe_engine="pandas")
df.head()
| | taxonomy | ERR5382938 | ERR5382939 | ERR5382940 | ERR5382941 | ERR5382942 | ERR5382943 | ERR5382944 | ERR5382945 | ERR5382946 | ... | ERR5382991 | ERR5382992 | ERR5382993 | ERR5382994 | ERR5382995 | ERR5382996 | ERR5382997 | ERR5382998 | ERR5382999 | ERR5383000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | sk__Archaea;k__;p__Euryarchaeota;c__Methanobac... | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |
| 1 | sk__Archaea;k__;p__Euryarchaeota;c__Methanobac... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | sk__Bacteria | 26 | 4 | 12 | 22 | 1 | 4 | 3 | 1 | 25 | ... | 14 | 3 | 1 | 2 | 1 | 0 | 2 | 0 | 0 | 0 |
| 3 | sk__Bacteria;k__;p__Acidobacteriota | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | sk__Bacteria;k__;p__Acidobacteriota;c__Holopha... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 64 columns
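The chunksize behaviour described for .stream() (a full object when chunksize is None, otherwise an iterator of chunks) can be sketched with plain Python. The function read_lines below is a hypothetical stand-in that works on a list of text lines; it is not mgnipy's implementation.

```python
from itertools import islice
from typing import Iterator, List, Optional, Union


def read_lines(
    lines: List[str], chunksize: Optional[int] = None
) -> Union[List[str], Iterator[List[str]]]:
    """Return all lines at once, or an iterator over chunks of lines."""
    if chunksize is None:
        # Full object: everything is materialised in memory.
        return list(lines)
    if chunksize < 1:
        raise ValueError("chunksize must be a positive integer")

    def _chunks() -> Iterator[List[str]]:
        it = iter(lines)
        while True:
            chunk = list(islice(it, chunksize))
            if not chunk:
                return
            yield chunk

    # Iterator: chunks are produced lazily, one at a time.
    return _chunks()


everything = read_lines(["a", "b", "c"])
in_chunks = list(read_lines(["a", "b", "c"], chunksize=2))
print(everything, in_chunks)
```

The chunked form is the one to reach for when a dataset is too large to hold in memory at once.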
✨ Bonus section: The TaxaMGazine#
There are analysis type-specific mgazines, such as this TaxaMGazine.
For example, we can combine the taxonomic assignment results into one dataframe, e.g. with .to_pandas(), .to_polars, .X()
ssu.to_pandas()
| | taxonomy | ERR5382938 | ERR5382939 | ERR5382940 | ERR5382941 | ERR5382942 | ERR5382943 | ERR5382944 | ERR5382945 | ERR5382946 | ... | ERR5382991 | ERR5382992 | ERR5382993 | ERR5382994 | ERR5382995 | ERR5382996 | ERR5382997 | ERR5382998 | ERR5382999 | ERR5383000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | sk__Archaea;k__;p__Euryarchaeota;c__Methanobac... | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |
| 1 | sk__Archaea;k__;p__Euryarchaeota;c__Methanobac... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | sk__Bacteria | 26 | 4 | 12 | 22 | 1 | 4 | 3 | 1 | 25 | ... | 14 | 3 | 1 | 2 | 1 | 0 | 2 | 0 | 0 | 0 |
| 3 | sk__Bacteria;k__;p__Acidobacteriota | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | sk__Bacteria;k__;p__Acidobacteriota;c__Holopha... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 670 | sk__Bacteria;k__;p__Verrucomicrobiota;c__Verru... | 0 | 0 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 671 | sk__Bacteria;k__;p__Verrucomicrobiota;c__Verru... | 2 | 0 | 0 | 0 | 31 | 0 | 0 | 0 | 0 | ... | 10 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 672 | sk__Bacteria;k__;p__Verrucomicrobiota;c__Verru... | 32 | 1 | 1 | 1 | 8938 | 4 | 1 | 44 | 1 | ... | 157 | 848 | 0 | 2 | 4 | 0 | 206 | 47 | 0 | 1 |
| 673 | sk__Bacteria;k__;p__Verrucomicrobiota;c__Verru... | 0 | 0 | 0 | 0 | 32 | 0 | 0 | 0 | 0 | ... | 0 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 674 | sk__Eukaryota;k__Viridiplantae;p__Streptophyta... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
675 rows × 64 columns
There is also the option to enrich with additional metadata!
From already retrieved MGnifier results, you can set runs_results, samples_results, studies_results etc., or
use .enrich_runs() etc. or .enrich_biosamples, which will make the GET requests for the additional metadata
ssu.enrich_runs(limit=None)
ssu.to_anndata()
AnnData object with n_obs × n_vars = 675 × 63
obs: 'Superkingdom', 'Kingdom', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species'
var: 'experiment_type', 'instrument_model', 'instrument_platform', 'sample_accession', 'study_accession', 'sample__accession', 'sample__ena_accessions', 'sample__sample_title', 'sample__biome', 'sample__updated_at', 'study__accession', 'study__ena_accessions', 'study__title', 'study__updated_at', 'study__biome.biome_name', 'study__biome.lineage'
ssu.clear_cache()
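The obs columns in the AnnData output above line up with the rank prefixes seen in the taxonomy strings (sk__, k__, p__ and so on). As a minimal sketch of that relationship, the helper below splits one lineage string into named ranks. The function split_lineage and the prefix table are assumptions made for illustration, not mgnipy's own parser.

```python
from typing import Dict, Optional

# Assumed prefix -> rank mapping, inferred from the taxonomy strings
# and the obs column names shown above.
RANK_BY_PREFIX = {
    "sk": "Superkingdom",
    "k": "Kingdom",
    "p": "Phylum",
    "c": "Class",
    "o": "Order",
    "f": "Family",
    "g": "Genus",
    "s": "Species",
}


def split_lineage(lineage: str) -> Dict[str, Optional[str]]:
    """Split 'sk__Bacteria;k__;p__X' into a rank -> name dictionary."""
    ranks: Dict[str, Optional[str]] = {name: None for name in RANK_BY_PREFIX.values()}
    for part in lineage.split(";"):
        prefix, sep, name = part.partition("__")
        # Empty names such as "k__" are left as None.
        if sep and prefix in RANK_BY_PREFIX and name:
            ranks[RANK_BY_PREFIX[prefix]] = name
    return ranks


parsed = split_lineage("sk__Bacteria;k__;p__Acidobacteriota")
print(parsed)
```

Ranks that are missing or empty in the lineage stay as None, so every parsed row has the same eight keys.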