MGazine of MGnify data

What is a `mgnipy.MGazine`?

Study and Analysis details include a ‘downloads’ field, which contains information about the datasets output by the MGnify pipelines, such as file types, short descriptions and URLs.

mgnipy.MGazine, as well as more analysis-specific classes such as TaxaMGazine and DWCTaxaMGazine, can be used to download the datasets to disk or read them into a notebook.

MGazine can download files of any type. For streaming (via mixins.StreamMixin), the supported file types are:

TSV/CSV — stream_pandas (pandas) or stream_polars (polars) (handles gzipped TSV/CSV).
TXT — stream_txt (full text or line-chunks).
HTML — stream_html (opens in browser).
FASTA / GFF / BIOM — stream_fasta, stream_gff, stream_biom (scikit-bio generators).
JSONL / NDJSON — stream_jsonl (pandas or polars).
Tree / Newick — stream_tree (scikit-bio).
Other — JSON files under other are streamed via stream_json; binary/unsupported types should be downloaded.
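The list above can be written down as a small lookup table. The sketch below is not mgnipy's own dispatch code: `STREAM_METHODS` and `pick_stream_method` are hypothetical names, the extension keys are assumptions, and only the handler names come from the list. It also assumes that the file type can be read from the file extension and that a trailing `.gz` can be ignored.

```python
from pathlib import PurePosixPath
from typing import Optional

# Handler names come from the list above; the extension keys are assumptions.
STREAM_METHODS = {
    "tsv": "stream_pandas",  # stream_polars is the polars alternative
    "csv": "stream_pandas",
    "txt": "stream_txt",
    "html": "stream_html",
    "fasta": "stream_fasta",
    "gff": "stream_gff",
    "biom": "stream_biom",
    "jsonl": "stream_jsonl",
    "ndjson": "stream_jsonl",
    "tree": "stream_tree",
    "newick": "stream_tree",
    "json": "stream_json",
}


def pick_stream_method(alias: str) -> Optional[str]:
    """Return a handler name for a file alias, or None if it should be downloaded."""
    suffixes = [s.lstrip(".").lower() for s in PurePosixPath(alias).suffixes]
    # Ignore a trailing .gz so that "table.tsv.gz" is treated like "table.tsv".
    if suffixes and suffixes[-1] == "gz":
        suffixes.pop()
    if not suffixes:
        return None
    return STREAM_METHODS.get(suffixes[-1])


print(pick_stream_method("ERP120598_SILVA-SSU_study_summary.tsv"))
```

A `None` result corresponds to the "binary/unsupported types should be downloaded" case.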

Accessing a MGazine from a `MGnifier` search

Recall the workflow:

1. Start up a mgnipy.MGnipy client with your desired configuration
2. Search in MGnify resources using a mgnipy.MGnifier glass
3. Receive a mgnipy.MGazine of MGnify datasets

For step 2 specifically, the following mgnifiers can output a mgazine:

proxies.Study
proxies.Analysis
proxies.Studies
proxies.Analyses

In this demonstration we will get the MGazine of a single study, but the steps are the same for a multi-study collection of proxies.Studies.

from mgnipy import MGnipy

# 1. init with default config
MG = MGnipy()

# 2. search up a study/analysis detail or a list of studies/analyses and get their details
study = MG.study("MGYS00010442")
study.get()

{'accession': 'MGYS00010442',
 'ena_accessions': ['PRJEB37289', 'ERP120598'],
 'title': 'TKI',
 'biome': {'biome_name': 'Digestive system',
  'lineage': 'root:Host-associated:Human:Digestive system'},
 'updated_at': '2026-04-21T08:55:57.196000+00:00',
 'downloads': [{'file_type': 'tsv',
   'download_type': 'Taxonomic analysis',
   'short_description': 'Summary of DADA2-PR2 taxonomies',
   'long_description': 'Summary of DADA2-PR2 taxonomic assignments, across all runs in the study',
   'alias': 'ERP120598_DADA2-PR2_16S-V3-V4_study_summary.tsv',
   'download_group': 'study_summary.v6.amplicon',
   'file_size_bytes': None,
   'index_files': None,
   'url': 'https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_results/ERP120/ERP120598/study-summaries/ERP120598_DADA2-PR2_16S-V3-V4_study_summary.tsv'},
  {'file_type': 'tsv',
   'download_type': 'Taxonomic analysis',
   'short_description': 'Summary of PR2 taxonomies',
   'long_description': 'Summary of PR2 taxonomic assignments, across all runs in the study',
   'alias': 'ERP120598_PR2_study_summary.tsv',
   'download_group': 'study_summary.v6.amplicon',
   'file_size_bytes': None,
   'index_files': None,
   'url': 'https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_results/ERP120/ERP120598/study-summaries/ERP120598_PR2_study_summary.tsv'},
  {'file_type': 'tsv',
   'download_type': 'Taxonomic analysis',
   'short_description': 'Summary of DADA2-SILVA taxonomies',
   'long_description': 'Summary of DADA2-SILVA taxonomic assignments, across all runs in the study',
   'alias': 'ERP120598_DADA2-SILVA_16S-V3-V4_study_summary.tsv',
   'download_group': 'study_summary.v6.amplicon',
   'file_size_bytes': None,
   'index_files': None,
   'url': 'https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_results/ERP120/ERP120598/study-summaries/ERP120598_DADA2-SILVA_16S-V3-V4_study_summary.tsv'},
  {'file_type': 'tsv',
   'download_type': 'Taxonomic analysis',
   'short_description': 'Summary of SILVA-SSU taxonomies',
   'long_description': 'Summary of SILVA-SSU taxonomic assignments, across all runs in the study',
   'alias': 'ERP120598_SILVA-SSU_study_summary.tsv',
   'download_group': 'study_summary.v6.amplicon',
   'file_size_bytes': None,
   'index_files': None,
   'url': 'https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_results/ERP120/ERP120598/study-summaries/ERP120598_SILVA-SSU_study_summary.tsv'},
  {'file_type': 'csv',
   'download_type': 'Taxonomic analysis',
   'short_description': 'DwC-Ready summary of 16S-V3-V4 ASV taxonomies using -PR2 as ref DB',
   'long_description': 'DwC-Ready summary of 16S-V3-V4 ASV taxonomies using -PR2 as ref DB, across all runs in the study',
   'alias': 'ERP120598_DADA2-PR2_16S-V3-V4_dwcready.csv',
   'download_group': 'study_summary.v6.amplicon',
   'file_size_bytes': None,
   'index_files': None,
   'url': 'https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_results/ERP120/ERP120598/study-summaries/ERP120598_DADA2-PR2_16S-V3-V4_dwcready.csv'},
  {'file_type': 'csv',
   'download_type': 'Taxonomic analysis',
   'short_description': 'DwC-Ready summary of 16S-V3-V4 ASV taxonomies using -SILVA as ref DB',
   'long_description': 'DwC-Ready summary of 16S-V3-V4 ASV taxonomies using -SILVA as ref DB, across all runs in the study',
   'alias': 'ERP120598_DADA2-SILVA_16S-V3-V4_dwcready.csv',
   'download_group': 'study_summary.v6.amplicon',
   'file_size_bytes': None,
   'index_files': None,
   'url': 'https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_results/ERP120/ERP120598/study-summaries/ERP120598_DADA2-SILVA_16S-V3-V4_dwcready.csv'},
  {'file_type': 'csv',
   'download_type': 'Taxonomic analysis',
   'short_description': 'DwC-Ready summary of closed-ref taxonomies using SILVA-SSU as ref DB',
   'long_description': 'DwC-Ready summary of closed-reference taxonomies using SILVA-SSU as ref DB, across all runs in the study',
   'alias': 'ERP120598_closedref_SILVA-SSU_dwcready.csv',
   'download_group': 'study_summary.v6.amplicon',
   'file_size_bytes': None,
   'index_files': None,
   'url': 'https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_results/ERP120/ERP120598/study-summaries/ERP120598_closedref_SILVA-SSU_dwcready.csv'},
  {'file_type': 'csv',
   'download_type': 'Taxonomic analysis',
   'short_description': 'DwC-Ready summary of closed-ref taxonomies using PR2 as ref DB',
   'long_description': 'DwC-Ready summary of closed-reference taxonomies using PR2 as ref DB, across all runs in the study',
   'alias': 'ERP120598_closedref_PR2_dwcready.csv',
   'download_group': 'study_summary.v6.amplicon',
   'file_size_bytes': None,
   'index_files': None,
   'url': 'https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_results/ERP120/ERP120598/study-summaries/ERP120598_closedref_PR2_dwcready.csv'}],
 'metadata': {},
 'first_accession': 'ERP120598'}
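Because 'downloads' is a plain list of dictionaries, it can be summarised with the standard library before any MGazine is involved. The sketch below works on a hand-copied subset of the output above (two of the eight entries, keeping only three keys); `summarise_downloads` is a hypothetical helper, not part of mgnipy.

```python
from collections import Counter

BASE = (
    "https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_results/"
    "ERP120/ERP120598/study-summaries/"
)

# Subset of the 'downloads' field shown above, trimmed to three keys.
downloads = [
    {
        "file_type": "tsv",
        "alias": "ERP120598_SILVA-SSU_study_summary.tsv",
        "url": BASE + "ERP120598_SILVA-SSU_study_summary.tsv",
    },
    {
        "file_type": "csv",
        "alias": "ERP120598_closedref_PR2_dwcready.csv",
        "url": BASE + "ERP120598_closedref_PR2_dwcready.csv",
    },
]


def summarise_downloads(records):
    """Count records per file type and build an {alias: url} lookup."""
    counts = Counter(record["file_type"] for record in records)
    alias_to_url = {record["alias"]: record["url"] for record in records}
    return counts, alias_to_url


counts, alias_to_url = summarise_downloads(downloads)
print(dict(counts))
print(alias_to_url["ERP120598_SILVA-SSU_study_summary.tsv"])
```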

The MGazine for a given study or analysis detail can be accessed via its .datasets attribute:

# access the study's mgazine
mz = study.datasets

# check it out
print(mz)

MGazine containing:
- MGnify pipeline versions: ['v6']
- Number of downloads: 8
- Short descriptions: ['DwC-Ready summary of 16S-V3-V4 ASV taxonomies using -PR2 as ref DB',
 'DwC-Ready summary of 16S-V3-V4 ASV taxonomies using -SILVA as ref DB',
 'DwC-Ready summary of closed-ref taxonomies using PR2 as ref DB',
 'DwC-Ready summary of closed-ref taxonomies using SILVA-SSU as ref DB',
 'Summary of DADA2-PR2 taxonomies',
 'Summary of DADA2-SILVA taxonomies',
 'Summary of PR2 taxonomies',
 'Summary of SILVA-SSU taxonomies']

As we see above, the str representation of a mgazine gives us a peek at the pipeline versions within, the number of downloads and the short_description categories.

Navigating and filtering a `MGazine`

A mgazine has built-in filtering by pipeline version and by short_description. Filtering returns another mgazine or, where one is available, a curated mgazine with additional functionality ✨.

To access a specific pipeline version, you can call the version as an attribute:

# above we saw that v6 is the only one so this will return the same basically
v6_data = mz.v6

# if we print again we will see the same info
print(v6_data)

MGazine containing:
- MGnify pipeline versions: ['v6']
- Number of downloads: 8
- Short descriptions: ['DwC-Ready summary of 16S-V3-V4 ASV taxonomies using -PR2 as ref DB',
 'DwC-Ready summary of 16S-V3-V4 ASV taxonomies using -SILVA as ref DB',
 'DwC-Ready summary of closed-ref taxonomies using PR2 as ref DB',
 'DwC-Ready summary of closed-ref taxonomies using SILVA-SSU as ref DB',
 'Summary of DADA2-PR2 taxonomies',
 'Summary of DADA2-SILVA taxonomies',
 'Summary of PR2 taxonomies',
 'Summary of SILVA-SSU taxonomies']

You can filter by short description by passing it in square brackets, as you would an index (i.e. via __getitem__):

# we want the taxonomic assignments
ssu = v6_data["Summary of SILVA-SSU taxonomies"]

# checking out what it is
print(type(ssu))
print(ssu)

# also downloads_df
ssu.downloads_df()

<class 'mgnipy.V2.datasets.taxonomic.TaxaMGazine'>
MGazine Curation TaxaMGazine containing:
- MGnify pipeline versions: ['v6']
- Number of downloads: 1
- Short descriptions: ['Summary of SILVA-SSU taxonomies']

	file_type	download_type	short_description	long_description	alias	download_group	file_size_bytes	index_files	url	accession	pipeline_version
0	tsv	Taxonomic analysis	Summary of SILVA-SSU taxonomies	Summary of SILVA-SSU taxonomic assignments, ac...	ERP120598_SILVA-SSU_study_summary.tsv	study_summary.v6.amplicon	None	None	https://ftp.ebi.ac.uk/pub/databases/metagenomi...	MGYS00010442	v6
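Attribute access for versions and square brackets for short descriptions map naturally onto Python's `__getattr__` and `__getitem__` hooks. The toy class below illustrates that pattern only. It is not how mgnipy implements MGazine: `ToyMGazine` is a hypothetical name, and the records are reduced to the two columns of `.downloads_df()` that the filters use.

```python
class ToyMGazine:
    """Hypothetical stand-in for a filterable collection of download records."""

    def __init__(self, records):
        self.records = list(records)

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails, e.g. for "v6".
        # Reading via __dict__ avoids recursing into __getattr__ itself.
        records = self.__dict__.get("records", [])
        matches = [r for r in records if r["pipeline_version"] == name]
        if not matches:
            raise AttributeError(name)
        return ToyMGazine(matches)

    def __getitem__(self, short_description):
        matches = [
            r for r in self.records
            if r["short_description"] == short_description
        ]
        if not matches:
            raise KeyError(short_description)
        return ToyMGazine(matches)

    def __len__(self):
        return len(self.records)


toy = ToyMGazine([
    {"pipeline_version": "v6", "short_description": "Summary of SILVA-SSU taxonomies"},
    {"pipeline_version": "v6", "short_description": "Summary of PR2 taxonomies"},
])

toy_v6 = toy.v6
toy_ssu = toy_v6["Summary of SILVA-SSU taxonomies"]
print(len(toy_v6), len(toy_ssu))
```

Each filter returns a new object of the same kind, which is what makes chaining such as `toy.v6["..."]` possible.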

Downloading datasets

To .download() a single file/dataset, or to explore/read it in via .stream(), pass its url or alias.

The file aliases are available as a list via the .aliases attribute, and also appear in the “alias” column of .downloads_df().

The urls are likewise a column of .downloads_df(), and there are also the helpers .url_list and .url_dict, the latter providing {alias: url}.

# let's try one out
one_alias = ssu.aliases[0]
print(one_alias)

# downloading to a downloads folder
ssu.download(to_dir="downloads", alias=one_alias)

ERP120598_SILVA-SSU_study_summary.tsv

There is also the option to download everything with .download_all():

ssu.download_all(to_dir="downloads")
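For readers who want to see what a download to a directory involves, here is a minimal standard-library sketch. It is an illustration under assumptions, not mgnipy's implementation: `download_to_dir` is a hypothetical function, and it assumes that the file name defaults to the last path segment of the URL when no alias is given.

```python
import shutil
import urllib.request
from pathlib import Path, PurePosixPath
from typing import Optional
from urllib.parse import urlparse


def download_to_dir(url: str, to_dir: str, alias: Optional[str] = None) -> Path:
    """Copy the resource at `url` into `to_dir` and return the local path."""
    target_dir = Path(to_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # Assumption: fall back to the last URL path segment as the file name.
    name = alias or PurePosixPath(urlparse(url).path).name
    target = target_dir / name

    # Copy in chunks so that large files are never held in memory at once.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


# Example call (needs network access), using a URL from the study above:
# download_to_dir(
#     "https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_results/"
#     "ERP120/ERP120598/study-summaries/ERP120598_SILVA-SSU_study_summary.tsv",
#     to_dir="downloads",
# )
```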

Reading in a dataset `.stream()`

.stream() resolves a download alias or URL and returns the appropriate streaming handler for the file type. It supports returning either a full object (when chunksize is None) or an iterator of chunks when chunksize is provided.

df = ssu.stream(alias=one_alias, dataframe_engine="pandas")
df.head()

	taxonomy	ERR5382938	ERR5382939	ERR5382940	ERR5382941	ERR5382942	ERR5382943	ERR5382944	ERR5382945	ERR5382946	...	ERR5382991	ERR5382992	ERR5382993	ERR5382994	ERR5382995	ERR5382997	ERR5382998
0	sk__Archaea;k__;p__Euryarchaeota;c__Methanobac...	2	0	0	0	0	0	1	0	1	...	0	0	0	0	0	0	2
1	sk__Archaea;k__;p__Euryarchaeota;c__Methanobac...	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0
2	sk__Bacteria	26	4	12	22	1	4	3	1	25	...	14	3	1	2	1	2	0
3	sk__Bacteria;k__;p__Acidobacteriota	3	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0
4	sk__Bacteria;k__;p__Acidobacteriota;c__Holopha...	6	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0

5 rows × 64 columns
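The "full object or iterator of chunks" contract described for `.stream()` can be mimicked with the standard library. The sketch below is only an illustration of that return pattern: `read_tsv` is a hypothetical function, it works on bytes already in memory rather than on a remote file, and the tiny table is made-up sample data in the same shape as the summary above.

```python
import csv
import gzip
import io
from itertools import islice


def read_tsv(raw: bytes, gzipped: bool = False, chunksize=None):
    """Return (header, rows) or, if chunksize is set, (header, iterator of row chunks)."""
    data = gzip.decompress(raw) if gzipped else raw
    reader = csv.reader(io.StringIO(data.decode("utf-8")), delimiter="\t")
    header = next(reader)

    if chunksize is None:
        return header, list(reader)

    # The generator lives in an inner function so that read_tsv itself
    # can still return a plain list in the chunksize=None case.
    def chunks():
        while True:
            chunk = list(islice(reader, chunksize))
            if not chunk:
                return
            yield chunk

    return header, chunks()


# Made-up sample data in the same layout as the study summary.
sample = b"taxonomy\tRUN_A\tRUN_B\nsk__Bacteria\t26\t4\nsk__Archaea\t2\t0\nsk__Eukaryota\t0\t1\n"

header, rows = read_tsv(sample)
header_gz, chunk_iter = read_tsv(gzip.compress(sample), gzipped=True, chunksize=2)
chunk_sizes = [len(chunk) for chunk in chunk_iter]
print(header, len(rows), chunk_sizes)
```

Chunked reading is mainly useful when a table is too large to hold in memory comfortably.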

✨ Bonus section: The `TaxaMGazine`

There are analysis-type-specific mgazines, such as this TaxaMGazine.

For example, we can combine the taxonomic assignment results into one dataframe, e.g. with .to_pandas(), .to_polars() or .X():

ssu.to_pandas()

	taxonomy	ERR5382938	ERR5382939	ERR5382940	ERR5382941	ERR5382942	ERR5382943	ERR5382944	ERR5382945	ERR5382946	...	ERR5382991	ERR5382992	ERR5382993	ERR5382994	ERR5382995	ERR5382996	ERR5382997	ERR5382998	ERR5382999	ERR5383000
0	sk__Archaea;k__;p__Euryarchaeota;c__Methanobac...	2	0	0	0	0	0	1	0	1	...	0	0	0	0	0	0	0	2	0	0
1	sk__Archaea;k__;p__Euryarchaeota;c__Methanobac...	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2	sk__Bacteria	26	4	12	22	1	4	3	1	25	...	14	3	1	2	1	0	2	0	0	0
3	sk__Bacteria;k__;p__Acidobacteriota	3	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
4	sk__Bacteria;k__;p__Acidobacteriota;c__Holopha...	6	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
670	sk__Bacteria;k__;p__Verrucomicrobiota;c__Verru...	0	0	0	0	8	0	0	0	0	...	5	0	0	0	0	0	0	0	0	0
671	sk__Bacteria;k__;p__Verrucomicrobiota;c__Verru...	2	0	0	0	31	0	0	0	0	...	10	2	0	0	0	0	0	0	0	0
672	sk__Bacteria;k__;p__Verrucomicrobiota;c__Verru...	32	1	1	1	8938	4	1	44	1	...	157	848	0	2	4	0	206	47	0	1
673	sk__Bacteria;k__;p__Verrucomicrobiota;c__Verru...	0	0	0	0	32	0	0	0	0	...	0	3	0	0	0	0	1	0	0	0
674	sk__Eukaryota;k__Viridiplantae;p__Streptophyta...	0	0	0	0	0	0	0	0	0	...	0	0	0	0	1	0	0	0	0	0

675 rows × 64 columns

There is also the option to enrich the data with additional metadata!

If you have already retrieved the relevant MGnifier results, you can assign them to runs_results, samples_results, studies_results etc. Otherwise, use .enrich_runs() and its siblings, or .enrich_biosamples, which make the GET requests for the additional metadata.

ssu.enrich_runs(limit=None)

ssu.to_anndata()

AnnData object with n_obs × n_vars = 675 × 63
    obs: 'Superkingdom', 'Kingdom', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species'
    var: 'experiment_type', 'instrument_model', 'instrument_platform', 'sample_accession', 'study_accession', 'sample__accession', 'sample__ena_accessions', 'sample__sample_title', 'sample__biome', 'sample__updated_at', 'study__accession', 'study__ena_accessions', 'study__title', 'study__updated_at', 'study__biome.biome_name', 'study__biome.lineage'
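The obs columns above ('Superkingdom' through 'Species') correspond to the rank prefixes in the taxonomy strings of the summary table. A minimal sketch of that split follows. `split_lineage` is a hypothetical helper, not the TaxaMGazine implementation, and the prefixes `o__`, `f__`, `g__` and `s__` are assumed by analogy with the `sk__`, `k__`, `p__` and `c__` prefixes visible in the output.

```python
# Prefix-to-rank mapping; column names follow the AnnData obs columns above.
# The o, f, g and s prefixes are assumptions (not visible in the truncated output).
RANK_PREFIXES = {
    "sk": "Superkingdom",
    "k": "Kingdom",
    "p": "Phylum",
    "c": "Class",
    "o": "Order",
    "f": "Family",
    "g": "Genus",
    "s": "Species",
}


def split_lineage(lineage: str) -> dict:
    """Split 'sk__Bacteria;k__;p__X' into one value per rank ('' when missing)."""
    ranks = {name: "" for name in RANK_PREFIXES.values()}
    for part in lineage.split(";"):
        prefix, separator, value = part.partition("__")
        if separator and prefix in RANK_PREFIXES:
            ranks[RANK_PREFIXES[prefix]] = value
    return ranks


parsed = split_lineage("sk__Bacteria;k__;p__Acidobacteriota")
print(parsed["Superkingdom"], parsed["Phylum"])
```

Ranks that are empty in the string (such as `k__` here) or absent altogether come back as empty strings, so every lineage yields the same set of columns.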

ssu.clear_cache()
