MGazine of MGnify data#

What is a mgnipy.MGazine?#

Study and analysis details include ‘downloads’ fields, which hold information about the datasets output by the MGnify pipelines, such as file types, short descriptions and URLs.

mgnipy.MGazine, along with more analysis-specific classes such as TaxaMGazine and DWCTaxaMGazine, can be used to download these datasets to disk or to read them directly into a notebook.

MGazine can download files of any type. Streaming (via mixins.StreamMixin) is supported for the following file types:

  • TSV/CSV — stream_pandas (pandas) or stream_polars (polars) (handles gzipped TSV/CSV).

  • TXT — stream_txt (full text or line-chunks).

  • HTML — stream_html (opens in browser).

  • FASTA / GFF / BIOM — stream_fasta, stream_gff, stream_biom (scikit-bio generators).

  • JSONL / NDJSON — stream_jsonl (pandas or polars).

  • Tree / Newick — stream_tree (scikit-bio).

  • Other — JSON files under other are streamed via stream_json; binary/unsupported types should be downloaded.
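As a rough illustration of the list above, the sketch below maps a file name to the name of the stream method that would apply. It is not mgnipy's own dispatch logic: `HANDLER_BY_SUFFIX` and `pick_stream_handler` are hypothetical names, and the suffix spellings (for example `.nwk` for Newick) are assumptions.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical lookup table: file suffix -> stream method name from the list above.
# The suffix spellings are assumptions; only the method names come from the docs.
HANDLER_BY_SUFFIX = {
    ".tsv": "stream_pandas",  # stream_polars is the polars alternative
    ".csv": "stream_pandas",
    ".txt": "stream_txt",
    ".html": "stream_html",
    ".fasta": "stream_fasta",
    ".gff": "stream_gff",
    ".biom": "stream_biom",
    ".jsonl": "stream_jsonl",
    ".ndjson": "stream_jsonl",
    ".nwk": "stream_tree",
    ".json": "stream_json",
}


def pick_stream_handler(filename: str) -> Optional[str]:
    """Return a handler name, or None if the file should be downloaded instead."""
    suffixes = [s.lower() for s in PurePosixPath(filename).suffixes]
    if suffixes and suffixes[-1] == ".gz":
        inner = suffixes[:-1]
        # The list above only mentions gzip support for TSV/CSV.
        if not inner or inner[-1] not in {".tsv", ".csv"}:
            return None
        suffixes = inner
    if not suffixes:
        return None
    return HANDLER_BY_SUFFIX.get(suffixes[-1])


print(pick_stream_handler("ERP120598_SILVA-SSU_study_summary.tsv"))
print(pick_stream_handler("summary.tsv.gz"))
print(pick_stream_handler("reads.bam"))
```

A file that maps to `None` here corresponds to the "binary/unsupported" case, which should be downloaded rather than streamed.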


Downloading datasets#

To .download() a single file, or to explore or read it in with .stream(), pass either its URL or its alias.

The file aliases are available as a list through the .aliases attribute, and they also appear in the “alias” column of .downloads_df().

The URLs are also in a column of .downloads_df(), and there are the helpers .url_list and .url_dict, the latter providing {alias: url}.
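The "alias or URL" idea can be sketched with a plain dictionary shaped like {alias: url}. This is only an illustration under stated assumptions: `resolve_download` is a hypothetical helper, not part of mgnipy, and the URL uses a placeholder host rather than a real MGnify address.

```python
from typing import Dict


def resolve_download(target: str, url_by_alias: Dict[str, str]) -> str:
    """Return the URL for `target`, which may be an alias or already a URL."""
    if target in url_by_alias:
        return url_by_alias[target]
    if target in url_by_alias.values():
        return target
    raise KeyError(f"{target!r} is neither a known alias nor a known URL")


# Placeholder mapping: the host below is made up for the example.
url_by_alias = {
    "ERP120598_SILVA-SSU_study_summary.tsv":
        "https://example.org/downloads/ERP120598_SILVA-SSU_study_summary.tsv",
}

alias = "ERP120598_SILVA-SSU_study_summary.tsv"
print(resolve_download(alias, url_by_alias))
print(resolve_download(url_by_alias[alias], url_by_alias))
```

Both calls print the same URL, since either form identifies the same download.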

# let's try one out
one_alias = ssu.aliases[0]
print(one_alias)

# downloading to a downloads folder
ssu.download(to_dir="downloads", alias=one_alias)
ERP120598_SILVA-SSU_study_summary.tsv

There is also the option to download every file with .download_all():

ssu.download_all(to_dir="downloads")

Reading in a dataset .stream()#

.stream() resolves a download alias or URL and returns the appropriate streaming handler for the file type. It returns either the full object (when chunksize is None) or an iterator of chunks (when chunksize is provided).
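The "full object or iterator of chunks" behaviour is a common pattern. Below is a minimal standard-library sketch of it, not mgnipy's implementation: `read_tsv` and `_chunks` are hypothetical names, and the three-line table is toy data.

```python
import csv
import io
from itertools import islice
from typing import Iterator, List, Optional, Union

Row = List[str]


def _chunks(rows: Iterator[Row], size: int) -> Iterator[List[Row]]:
    """Yield successive lists of at most `size` rows."""
    while True:
        chunk = list(islice(rows, size))
        if not chunk:
            return
        yield chunk


def read_tsv(text: str, chunksize: Optional[int] = None) -> Union[List[Row], Iterator[List[Row]]]:
    """Return all rows when chunksize is None, else an iterator of row chunks."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    if chunksize is None:
        return list(reader)
    return _chunks(reader, chunksize)


# Toy data (header row included as the first row).
text = "taxonomy\trun_a\trun_b\nsk__Archaea\t2\t0\nsk__Bacteria\t26\t4\n"

everything = read_tsv(text)
print(len(everything))

for chunk in read_tsv(text, chunksize=2):
    print(len(chunk))
```

With chunks, only one piece of the table is held at a time, which helps when a summary file is too large to load in one go.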

df = ssu.stream(alias=one_alias, dataframe_engine="pandas")
df.head()
taxonomy ERR5382938 ERR5382939 ERR5382940 ERR5382941 ERR5382942 ERR5382943 ERR5382944 ERR5382945 ERR5382946 ... ERR5382991 ERR5382992 ERR5382993 ERR5382994 ERR5382995 ERR5382996 ERR5382997 ERR5382998 ERR5382999 ERR5383000
0 sk__Archaea;k__;p__Euryarchaeota;c__Methanobac... 2 0 0 0 0 0 1 0 1 ... 0 0 0 0 0 0 0 2 0 0
1 sk__Archaea;k__;p__Euryarchaeota;c__Methanobac... 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 sk__Bacteria 26 4 12 22 1 4 3 1 25 ... 14 3 1 2 1 0 2 0 0 0
3 sk__Bacteria;k__;p__Acidobacteriota 3 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 sk__Bacteria;k__;p__Acidobacteriota;c__Holopha... 6 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 64 columns

✨ Bonus section: The TaxaMGazine#

There are also analysis-type-specific MGazines, such as this TaxaMGazine.

For example, we can combine the taxonomic assignment results into one dataframe, e.g. with .to_pandas(), .to_polars() or .X().

ssu.to_pandas()
taxonomy ERR5382938 ERR5382939 ERR5382940 ERR5382941 ERR5382942 ERR5382943 ERR5382944 ERR5382945 ERR5382946 ... ERR5382991 ERR5382992 ERR5382993 ERR5382994 ERR5382995 ERR5382996 ERR5382997 ERR5382998 ERR5382999 ERR5383000
0 sk__Archaea;k__;p__Euryarchaeota;c__Methanobac... 2 0 0 0 0 0 1 0 1 ... 0 0 0 0 0 0 0 2 0 0
1 sk__Archaea;k__;p__Euryarchaeota;c__Methanobac... 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 sk__Bacteria 26 4 12 22 1 4 3 1 25 ... 14 3 1 2 1 0 2 0 0 0
3 sk__Bacteria;k__;p__Acidobacteriota 3 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 sk__Bacteria;k__;p__Acidobacteriota;c__Holopha... 6 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
670 sk__Bacteria;k__;p__Verrucomicrobiota;c__Verru... 0 0 0 0 8 0 0 0 0 ... 5 0 0 0 0 0 0 0 0 0
671 sk__Bacteria;k__;p__Verrucomicrobiota;c__Verru... 2 0 0 0 31 0 0 0 0 ... 10 2 0 0 0 0 0 0 0 0
672 sk__Bacteria;k__;p__Verrucomicrobiota;c__Verru... 32 1 1 1 8938 4 1 44 1 ... 157 848 0 2 4 0 206 47 0 1
673 sk__Bacteria;k__;p__Verrucomicrobiota;c__Verru... 0 0 0 0 32 0 0 0 0 ... 0 3 0 0 0 0 1 0 0 0
674 sk__Eukaryota;k__Viridiplantae;p__Streptophyta... 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0

675 rows × 64 columns

There is also the option to enrich with additional metadata! You can either:

  1. set runs_results, samples_results, studies_results etc. from already retrieved MGnifier results, or

  2. use .enrich_runs(), .enrich_biosamples() etc., which make the GET requests for the additional metadata.

ssu.enrich_runs(limit=None)
ssu.to_anndata()
AnnData object with n_obs × n_vars = 675 × 63
    obs: 'Superkingdom', 'Kingdom', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species'
    var: 'experiment_type', 'instrument_model', 'instrument_platform', 'sample_accession', 'study_accession', 'sample__accession', 'sample__ena_accessions', 'sample__sample_title', 'sample__biome', 'sample__updated_at', 'study__accession', 'study__ena_accessions', 'study__title', 'study__updated_at', 'study__biome.biome_name', 'study__biome.lineage'
ssu.clear_cache()
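The obs columns in the AnnData summary above ('Superkingdom' through 'Species') line up with the rank prefixes in the taxonomy strings (sk__, k__, p__, c__ and so on). The sketch below shows one way such a lineage string could be split into those ranks. It is an assumption about the mapping, not mgnipy's own parser; `RANK_BY_PREFIX` and `split_lineage` are hypothetical names.

```python
from typing import Dict, Optional

# Assumed mapping from lineage prefix to the rank names seen in the obs columns.
RANK_BY_PREFIX = {
    "sk": "Superkingdom",
    "k": "Kingdom",
    "p": "Phylum",
    "c": "Class",
    "o": "Order",
    "f": "Family",
    "g": "Genus",
    "s": "Species",
}


def split_lineage(lineage: str) -> Dict[str, Optional[str]]:
    """Split 'sk__X;k__;p__Y' into a rank -> name dict (None where unassigned)."""
    ranks: Dict[str, Optional[str]] = {rank: None for rank in RANK_BY_PREFIX.values()}
    for part in lineage.split(";"):
        prefix, sep, name = part.partition("__")
        # Skip malformed parts, unknown prefixes and empty names such as 'k__'.
        if sep and prefix in RANK_BY_PREFIX and name:
            ranks[RANK_BY_PREFIX[prefix]] = name
    return ranks


print(split_lineage("sk__Bacteria;k__;p__Acidobacteriota"))
```

Empty ranks such as `k__` are kept as `None`, so every taxon yields the same set of columns.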