Finding Parkinson’s disease taxonomic analyses#

In this notebook we aim to reproduce the analyses in the “ABaCo demo: Parkinson’s disease gut microbiome”, where the aim was to “integrate the 9 studies while preserving key distinctions from the two patient states (Parkinson’s v.s. Healthy).”


# uncomment if colab
# !pip install mgnipy
import logging 
logging.basicConfig(level=logging.WARNING)

Searching for studies#

To start we configure our MGnipy client and access the MGnify API Studies resource.

We restrict the query to gut microbiome studies (fecal biome lineage) that mention “parkinson”.

We can preview the resulting query URLs via .explain().

from mgnipy import MGnipy 

mg = MGnipy(cache_dir='downloads')

pd_studies = mg.studies(
    search='parkinson',
    biome_lineage='root:Host-associated:Human:Digestive system:Large intestine:Fecal',
)

pd_studies.explain()
https://www.ebi.ac.uk/metagenomics/api/v2/studies?biome_lineage=root%3AHost-associated%3AHuman%3ADigestive+system%3ALarge+intestine%3AFecal&search=parkinson&page=1
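
The previewed URL is an ordinary query string: the filters passed to `mg.studies(...)` appear percent-encoded after the endpoint, followed by a page number. As a minimal sketch of that encoding using only the standard library (`build_query_url` is a hypothetical helper for illustration, not part of MGnipy, and how MGnipy assembles its URLs internally is an assumption here):

```python
from urllib.parse import urlencode

# Endpoint copied from the .explain() output above
BASE_URL = "https://www.ebi.ac.uk/metagenomics/api/v2/studies"


def build_query_url(base_url: str, page: int = 1, **filters: str) -> str:
    """Hypothetical helper: encode filters as a query string, page last."""
    # Sort filter names so the resulting URL is deterministic
    params = dict(sorted(filters.items()))
    params["page"] = page
    # urlencode percent-encodes ':' as %3A and spaces as '+'
    return f"{base_url}?{urlencode(params)}"


url = build_query_url(
    BASE_URL,
    search="parkinson",
    biome_lineage="root:Host-associated:Human:Digestive system:Large intestine:Fecal",
)
print(url)
```

Comparing the printed string with the `.explain()` output is a quick way to confirm that the filters were interpreted as intended before any request is sent.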

The query URL looks good, so we can proceed with executing the list query (or queries) via .get(). To enrich our list of studies with metadata details, we can do this in bulk using .enrich_details(), or asynchronously via .aenrich_details().

# populate study list
pd_studies.get()
# enrich studies with metadata
await pd_studies.aenrich_details()

# collect the study details into a DataFrame (and save it to file if you prefer)
study_meta = pd_studies.details_df(expand_nested_dicts=True)

# check it out 
study_meta.head()


accession ena_accessions title updated_at downloads first_accession metadata__study_name metadata__center_name metadata__study_title metadata__study_accession metadata__study_description metadata__secondary_study_accession biome__biome_name biome__lineage
0 MGYS00006121 [ERP142200, PRJEB57228] Dietary intervention of people with Parkinson'... 2026-05-28T15:47:01.432000+00:00 [{'file_type': 'tsv', 'download_type': 'Taxono... ERP142200 NaN NaN NaN NaN NaN NaN Fecal root:Host-associated:Human:Digestive system:La...
1 MGYS00006760 [ERP148661, PRJEB63522] EMG produced TPA metagenomics assembly of PRJN... 2026-05-28T15:47:02.672000+00:00 [{'file_type': 'tsv', 'download_type': 'Taxono... ERP148661 NaN NaN NaN NaN NaN NaN Fecal root:Host-associated:Human:Digestive system:La...
2 MGYS00006759 [ERP146353, PRJEB61255] EMG produced TPA metagenomics assembly of PRJN... 2026-05-28T15:47:02.660000+00:00 [{'file_type': 'tsv', 'download_type': 'Taxono... ERP146353 NaN NaN NaN NaN NaN NaN Fecal root:Host-associated:Human:Digestive system:La...
3 MGYS00001650 [ERP004264, PRJEB4927] Alterations of the Fecal Microbiome in Parkins... 2026-05-28T15:47:02.598000+00:00 [{'file_type': 'tsv', 'download_type': 'Taxono... ERP004264 Fecal Microbiome in Parkinson's Disease Institute of Biotechnology;University of Helsi... Alterations of the Fecal Microbiome in Parkins... PRJEB4927 In the course of Parkinson’s disease (PD), the... ERP004264 Fecal root:Host-associated:Human:Digestive system:La...
4 MGYS00005129 [ERP109659, PRJEB27564] Gut microbiota in Parkinson's disease: tempora... 2026-05-06T12:25:31.349000+00:00 [{'file_type': 'tsv', 'download_type': 'Taxono... ERP109659 Parkinson's disease gut microbiota follow-up Institute of Biotechnology;University of Helsi... Gut microbiota in Parkinson's disease: tempora... PRJEB27564 Aiming to explore the temporal stability of gu... ERP109659 Fecal root:Host-associated:Human:Digestive system:La...
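
The column names in the table above (for example `biome__biome_name` and `metadata__study_accession`) suggest that `expand_nested_dicts=True` flattens nested dictionaries by joining keys with a double underscore. A minimal sketch of that idea, assuming this naming rule; `flatten_record` is a hypothetical function and not the MGnipy implementation:

```python
from typing import Any


def flatten_record(record: dict[str, Any], sep: str = "__", prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into a single level, joining keys with `sep`."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the parent key as a prefix
            flat.update(flatten_record(value, sep=sep, prefix=name))
        else:
            flat[name] = value
    return flat


# Values taken from the last row of the table above
study = {
    "accession": "MGYS00005129",
    "biome": {"biome_name": "Fecal"},
    "metadata": {"study_accession": "PRJEB27564"},
}
print(flatten_record(study))
```

Under this rule an empty `metadata` dictionary contributes no keys at all, which would be consistent with the NaN values shown for studies without metadata.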

Using MGazine to explore the study datasets#

We can access the MGazine of datasets via the .datasets attribute. The study details we retrieved above are also passed on to the MGazine.

# access mgazine
mz = pd_studies.datasets

# take a look
print(mz)
MGazine containing:
- MGnify pipeline versions: ['v3', 'v4_1', 'v5', 'v6']
- Number of downloads: 72
- Short descriptions: ['Complete GO annotation',
 'DwC-Ready summary of 16S-V3-V4 ASV taxonomies using -PR2 as ref DB',
 'DwC-Ready summary of 16S-V3-V4 ASV taxonomies using -SILVA as ref DB',
 'DwC-Ready summary of closed-ref taxonomies using ITSoneDB as ref DB',
 'DwC-Ready summary of closed-ref taxonomies using PR2 as ref DB',
 'DwC-Ready summary of closed-ref taxonomies using SILVA-LSU as ref DB',
 'DwC-Ready summary of closed-ref taxonomies using SILVA-SSU as ref DB',
 'GO slim annotation',
 'InterPro matches',
 'Phylum level taxonomies',
 'Phylum level taxonomies LSU',
 'Phylum level taxonomies SSU',
 'Summary of DADA2-PR2 taxonomies',
 'Summary of DADA2-SILVA taxonomies',
 'Summary of ITSoneDB taxonomies',
 'Summary of PR2 taxonomies',
 'Summary of SILVA-LSU taxonomies',
 'Summary of SILVA-SSU taxonomies',
 'Taxonomic assignments',
 'Taxonomic assignments LSU',
 'Taxonomic assignments SSU',
 'Taxonomic diversity metrics']

For the ABaCo demo we use the taxonomic analyses. We restrict them to pipeline v4 onwards, because the pipeline versions differ, specifically in the SILVA databases used for the taxonomic analysis.

# MGazines can be combined with +
mz_taxa = mz['Summary of SILVA-SSU taxonomies'] + mz.v5['Taxonomic assignments SSU']   

# print still works
print(mz_taxa)

# studies details are preserved
import pandas as pd
display(pd.DataFrame(mz_taxa.studies_details))


MGazine Curation TaxaMGazine containing:
- MGnify pipeline versions: ['v6']
- Number of downloads: 3
- Short descriptions: ['Summary of SILVA-SSU taxonomies']
-----------------------
Next steps: Use `.load()` to initialize.

MGazine Curation TaxaMGazine containing:
- MGnify pipeline versions: ['v5']
- Number of downloads: 4
- Short descriptions: ['Taxonomic assignments SSU']
-----------------------
Next steps: Use `.load()` to initialize.

MGazine Curation TaxaMGazine containing:
- MGnify pipeline versions: ['v5', 'v6']
- Number of downloads: 7
- Short descriptions: ['Summary of SILVA-SSU taxonomies', 'Taxonomic assignments SSU']
-----------------------
Next steps: Use `.load()` to initialize.

MGazine Curation TaxaMGazine containing:
- MGnify pipeline versions: ['v5', 'v6']
- Number of downloads: 7
- Short descriptions: ['Summary of SILVA-SSU taxonomies', 'Taxonomic assignments SSU']
accession ena_accessions title biome updated_at downloads metadata first_accession
0 MGYS00006121 [ERP142200, PRJEB57228] Dietary intervention of people with Parkinson'... {'biome_name': 'Fecal', 'lineage': 'root:Host-... 2026-05-28T15:47:01.432000+00:00 [{'file_type': 'tsv', 'download_type': 'Taxono... {} ERP142200
1 MGYS00006760 [ERP148661, PRJEB63522] EMG produced TPA metagenomics assembly of PRJN... {'biome_name': 'Fecal', 'lineage': 'root:Host-... 2026-05-28T15:47:02.672000+00:00 [{'file_type': 'tsv', 'download_type': 'Taxono... {} ERP148661
2 MGYS00006759 [ERP146353, PRJEB61255] EMG produced TPA metagenomics assembly of PRJN... {'biome_name': 'Fecal', 'lineage': 'root:Host-... 2026-05-28T15:47:02.660000+00:00 [{'file_type': 'tsv', 'download_type': 'Taxono... {} ERP146353
3 MGYS00001650 [ERP004264, PRJEB4927] Alterations of the Fecal Microbiome in Parkins... {'biome_name': 'Fecal', 'lineage': 'root:Host-... 2026-05-28T15:47:02.598000+00:00 [{'file_type': 'tsv', 'download_type': 'Taxono... {'study_name': 'Fecal Microbiome in Parkinson'... ERP004264
4 MGYS00005129 [ERP109659, PRJEB27564] Gut microbiota in Parkinson's disease: tempora... {'biome_name': 'Fecal', 'lineage': 'root:Host-... 2026-05-06T12:25:31.349000+00:00 [{'file_type': 'tsv', 'download_type': 'Taxono... {'study_name': 'Parkinson's disease gut microb... ERP109659
5 MGYS00005755 [PRJNA510730, SRP173877] Microbiota composition of Parkinson's disease ... {'biome_name': 'Fecal', 'lineage': 'root:Host-... 2026-05-06T11:35:50.490000+00:00 [{'file_type': 'tsv', 'download_type': 'Taxono... {} SRP173877
6 MGYS00005601 [ERP113090, PRJEB30615] Identification of Intestinal Bacterial Taxa wi... {'biome_name': 'Fecal', 'lineage': 'root:Host-... 2026-05-28T15:47:01.054000+00:00 [{'file_type': 'tsv', 'download_type': 'Taxono... {} ERP113090
7 MGYS00005130 [ERP112853, PRJEB30401] Gut Microbiome Alterations Drive Distinct Meta... {'biome_name': 'Fecal', 'lineage': 'root:Host-... 2026-05-06T10:02:48.163000+00:00 [{'file_type': 'tsv', 'download_type': 'Taxono... {'study_name': 'Gut Microbiome and Parkinson's... ERP112853
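
The selection above filters by short description, optionally narrows to one pipeline version, and merges the results with `+`. A minimal sketch of how such a container could behave, written with toy data; `Download` and `Collection` are hypothetical classes for illustration and do not reflect the actual MGazine internals:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Download:
    study: str        # toy study label
    pipeline: str     # pipeline version, e.g. "v5"
    description: str  # short description of the file


@dataclass
class Collection:
    downloads: list[Download] = field(default_factory=list)

    def __getitem__(self, description: str) -> "Collection":
        # Keep only downloads with a matching short description
        return Collection([d for d in self.downloads if d.description == description])

    def version(self, pipeline: str) -> "Collection":
        # Keep only downloads produced by one pipeline version
        return Collection([d for d in self.downloads if d.pipeline == pipeline])

    def __add__(self, other: "Collection") -> "Collection":
        # Concatenate and drop duplicates while preserving order
        return Collection(list(dict.fromkeys(self.downloads + other.downloads)))

    @property
    def pipelines(self) -> list[str]:
        return sorted({d.pipeline for d in self.downloads})


catalog = Collection([
    Download("study_A", "v6", "Summary of SILVA-SSU taxonomies"),
    Download("study_B", "v5", "Taxonomic assignments SSU"),
    Download("study_B", "v5", "Taxonomic assignments LSU"),
    Download("study_C", "v4_1", "Taxonomic assignments SSU"),
])

taxa = catalog["Summary of SILVA-SSU taxonomies"] + catalog.version("v5")["Taxonomic assignments SSU"]
print(len(taxa.downloads), taxa.pipelines)
```

Returning a new collection from every filter keeps the original catalogue untouched, so different selections can be tried without re-querying.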

(Lazy)Loading into one taxonomic dataset#

# lazy-load the MGnify taxonomic assignment datasets
mz_taxa.load()

# calling to_pandas or to_polars will collect the data and return a dataframe
mz_taxa.to_polars().head()


TaxaMGazine loaded with 7 datasets. 
Cached runs results: 0 of total 1828.
shape: (5, 1_829)
taxonomy                          ERR2730148  ERR2730149  …  ERR3046629  ERR3046639
str                               i64         i64         …  i64         i64
"sk__Archaea"                     null        null        …  0           0
"sk__Archaea;k__;p__Candidatus_…  null        null        …  null        null
"sk__Archaea;k__;p__Candidatus_…  null        null        …  null        null
"sk__Archaea;k__;p__Candidatus_…  0           0           …  null        null
"sk__Archaea;k__;p__Candidatus_…  0           0           …  null        null

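In the preview above, a taxon holds `null` for some runs and a count for others. One plausible reading is that the per-study tables are combined with an outer join on the taxonomy column, so a taxon that is absent from a study's table is left missing rather than set to 0. A minimal sketch of that behaviour with toy data; `combine_tables` is a hypothetical function, and the join strategy is an assumption rather than something the notebook confirms:

```python
from typing import Optional

# taxonomy string -> {run accession -> count}
StudyTable = dict[str, dict[str, int]]


def combine_tables(tables: list[StudyTable]) -> tuple[list[str], dict[str, list[Optional[int]]]]:
    """Outer-join several taxonomy tables; missing cells become None."""
    # Collect run columns in first-seen order
    runs: list[str] = []
    for table in tables:
        for counts in table.values():
            for run in counts:
                if run not in runs:
                    runs.append(run)

    combined: dict[str, list[Optional[int]]] = {}
    for taxon in sorted({taxon for table in tables for taxon in table}):
        row: dict[str, int] = {}
        for table in tables:
            row.update(table.get(taxon, {}))
        # A run whose study never reported this taxon stays None, not 0
        combined[taxon] = [row.get(run) for run in runs]
    return runs, combined


# Toy tables: study 1 never reports Archaea, study 2 reports it with a zero count
study_1 = {"sk__Bacteria": {"run_A1": 5, "run_A2": 0}}
study_2 = {"sk__Bacteria": {"run_B1": 7}, "sk__Archaea": {"run_B1": 0}}

runs, combined = combine_tables([study_1, study_2])
print(runs)
print(combined)
```

If this reading holds, the missing values need an explicit decision (for example filling with 0) before the table is passed to downstream methods that expect complete counts.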
Enriching with metadata#

Taxonomic info#

mz_taxa.taxonomic_metadata()
Superkingdom Kingdom Phylum Class Order Family Genus Species
0 Archaea NA NA NA NA NA NA NA
1 Archaea NA Candidatus_Thermoplasmatota Thermoplasmata NA NA NA NA
2 Archaea NA Candidatus_Thermoplasmatota Thermoplasmata Methanomassiliicoccales NA NA NA
3 Archaea NA Candidatus_Thermoplasmatota Thermoplasmata Methanomassiliicoccales Methanomassiliicoccaceae Methanomassiliicoccus NA
4 Archaea NA Candidatus_Thermoplasmatota Thermoplasmata Methanomassiliicoccales Methanomassiliicoccaceae Methanomassiliicoccus Candidatus_Methanomassiliicoccus_intestinalis
... ... ... ... ... ... ... ... ...
2912 Bacteria NA Proteobacteria Alphaproteobacteria Rhizobiales Rhizobiaceae Rhizobium NA
2913 Bacteria NA Proteobacteria Gammaproteobacteria Vibrionales Vibrionaceae Vibrio NA
2914 Bacteria NA Firmicutes Clostridia Clostridiales Clostridiaceae Clostridium Clostridium_sp._CA6
2915 Bacteria NA NA NA Haloplasmatales NA NA NA
2916 Bacteria NA Firmicutes Clostridia Clostridiales Lachnospiraceae NA Lachnospiraceae_bacterium_613

2917 rows × 8 columns
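
The rank columns above correspond to the prefixes in the semicolon-separated taxonomy strings (`sk__`, `k__`, `p__`, and so on), with empty ranks shown as NA. A minimal sketch of that parsing; `split_lineage` is a hypothetical function, and the prefixes beyond `sk__`, `k__`, and `p__` (which are visible in the truncated strings) are assumed to follow the usual single-letter convention:

```python
# Rank prefix -> column name, in the order used by the table above
RANKS = {
    "sk": "Superkingdom",
    "k": "Kingdom",
    "p": "Phylum",
    "c": "Class",
    "o": "Order",
    "f": "Family",
    "g": "Genus",
    "s": "Species",
}


def split_lineage(lineage: str, missing: str = "NA") -> dict[str, str]:
    """Split a 'sk__X;k__;p__Y' lineage string into one value per rank."""
    row = {column: missing for column in RANKS.values()}
    for part in lineage.split(";"):
        prefix, _, name = part.partition("__")
        # Skip unknown prefixes and ranks with an empty name such as 'k__'
        if prefix in RANKS and name:
            row[RANKS[prefix]] = name
    return row


# Illustrative string assembled to match row 1 of the table above
example = "sk__Archaea;k__;p__Candidatus_Thermoplasmatota;c__Thermoplasmata"
print(split_lineage(example))
```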

If we take a look at the run metadata, it is still empty: only the accession index is present.

mz_taxa.metadata().head()
accession
ERR2730148
ERR2730149
ERR2730150
ERR2730151
ERR2730152
mz_taxa.to_anndata()
AnnData object with n_obs × n_vars = 2917 × 1828
    obs: 'Superkingdom', 'Kingdom', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species'

However, we can add metadata that we collected manually, or taxacurator can help retrieve some of it.

# fetch metadata for some of the runs; this cell can be run multiple times
await mz_taxa.aenrich_runs(limit=200)

# check it out
df_runs = pd.DataFrame(mz_taxa.runs_details)
print(df_runs.shape)
display(df_runs.head())


(200, 8)
experiment_type instrument_model instrument_platform sample study accession sample_accession study_accession
0 Amplicon None None {'accession': 'SAMEA4821217', 'ena_accessions'... {'accession': 'MGYS00005129', 'ena_accessions'... ERR2730153 SAMEA4821217 MGYS00005129
1 Amplicon None None {'accession': 'SAMEA4821224', 'ena_accessions'... {'accession': 'MGYS00005129', 'ena_accessions'... ERR2730160 SAMEA4821224 MGYS00005129
2 Amplicon None None {'accession': 'SAMEA4821214', 'ena_accessions'... {'accession': 'MGYS00005129', 'ena_accessions'... ERR2730150 SAMEA4821214 MGYS00005129
3 Amplicon None None {'accession': 'SAMEA4821236', 'ena_accessions'... {'accession': 'MGYS00005129', 'ena_accessions'... ERR2730172 SAMEA4821236 MGYS00005129
4 Amplicon None None {'accession': 'SAMEA4821226', 'ena_accessions'... {'accession': 'MGYS00005129', 'ena_accessions'... ERR2730162 SAMEA4821226 MGYS00005129
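
The run enrichment above fetched 200 of the 1828 runs in one call. One way to picture re-running such a call is a loop that skips runs already cached and fetches at most `limit` new ones concurrently. The sketch below illustrates that pattern with a stand-in coroutine instead of real network requests; `fetch_run` and `enrich_batch` are hypothetical names, and whether MGnipy resumes from its cache in this way is an assumption:

```python
import asyncio


async def fetch_run(accession: str) -> dict:
    """Stand-in for a network request; returns a fake detail record."""
    await asyncio.sleep(0)
    return {"accession": accession}


async def enrich_batch(accessions: list[str], cache: dict[str, dict], limit: int) -> int:
    """Fetch up to `limit` accessions that are not cached yet; return how many were fetched."""
    pending = [a for a in accessions if a not in cache][:limit]
    # Issue the requests concurrently and wait for all of them
    records = await asyncio.gather(*(fetch_run(a) for a in pending))
    for accession, record in zip(pending, records):
        cache[accession] = record
    return len(pending)


async def demo() -> tuple[list[int], int]:
    accessions = [f"RUN{i:03d}" for i in range(5)]  # toy accessions
    cache: dict[str, dict] = {}
    # Each call resumes where the previous one stopped
    fetched = [await enrich_batch(accessions, cache, limit=2) for _ in range(4)]
    return fetched, len(cache)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Keeping the batch size small limits the load on the remote API, at the cost of needing several calls to cover all runs.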
mz_taxa.enrich_biosamples(limit=10, incl_ena=True)

# check it out
df_biosam = pd.DataFrame(mz_taxa.biosamples_details)
print(df_biosam.shape)
display(df_biosam.head())
(10, 46)
GivenID RunID StudyID SampleID seq_meth decimalLatitude decimalLongitude depth center_name temperature ... organism parkinson pcr primers project name scientific_name sequencing method target gene target subfragment timepoint title
0 ERR2730148 ERR2730148 ERP109659 SAMEA4821212 Illumina MiSeq 60.189122 24.907119 NA UH/IB NA ... human gut metagenome no Forward: CCTACGGGNGGCWGCAG, GTCCTACGGGNGGCWGCA... Parkinson's Disease Microbiome human gut metagenome Illumina MiSeq 16S V3-V4 baseline C0001 baseline
1 ERR2730149 ERR2730149 ERP109659 SAMEA4821213 Illumina MiSeq 60.189122 24.907119 NA UH/IB NA ... human gut metagenome no Forward: CCTACGGGNGGCWGCAG, GTCCTACGGGNGGCWGCA... Parkinson's Disease Microbiome human gut metagenome Illumina MiSeq 16S V3-V4 baseline C0005 baseline
2 ERR2730150 ERR2730150 ERP109659 SAMEA4821214 Illumina MiSeq 60.189122 24.907119 NA UH/IB NA ... human gut metagenome no Forward: CCTACGGGNGGCWGCAG, GTCCTACGGGNGGCWGCA... Parkinson's Disease Microbiome human gut metagenome Illumina MiSeq 16S V3-V4 baseline C0007 baseline
3 ERR2730151 ERR2730151 ERP109659 SAMEA4821215 Illumina MiSeq 60.189122 24.907119 NA UH/IB NA ... human gut metagenome no Forward: CCTACGGGNGGCWGCAG, GTCCTACGGGNGGCWGCA... Parkinson's Disease Microbiome human gut metagenome Illumina MiSeq 16S V3-V4 baseline C0009 baseline
4 ERR2730152 ERR2730152 ERP109659 SAMEA4821216 Illumina MiSeq 60.189122 24.907119 NA UH/IB NA ... human gut metagenome no Forward: CCTACGGGNGGCWGCAG, GTCCTACGGGNGGCWGCA... Parkinson's Disease Microbiome human gut metagenome Illumina MiSeq 16S V3-V4 baseline C0015 baseline

5 rows × 46 columns
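
The biosample table already contains the ingredients for a batch-correction input: a run identifier, a study identifier, and a `parkinson` flag. As a rough sketch of how those fields could be reshaped into one row per sample, assuming the column names `samples`, `study_code`, and `phenotype` from the draft cell in this notebook; `to_sample_row` is hypothetical, and the `"yes"` value and the phenotype labels are assumptions, since only `"no"` appears in the preview above:

```python
from typing import Optional

# Assumed mapping from the 'parkinson' flag to a phenotype label
PHENOTYPE_LABELS = {"yes": "Parkinson's", "no": "Healthy"}


def to_sample_row(biosample: dict) -> dict[str, Optional[str]]:
    """Reduce one biosample record to id, batch, and biological columns."""
    flag = str(biosample.get("parkinson", "")).strip().lower()
    return {
        "samples": biosample["RunID"],
        "study_code": biosample["StudyID"],
        # Unknown or missing flags are left as None for manual curation
        "phenotype": PHENOTYPE_LABELS.get(flag),
    }


biosamples = [
    {"RunID": "ERR2730148", "StudyID": "ERP109659", "parkinson": "no"},  # from the table above
    {"RunID": "RUN_X", "StudyID": "STUDY_X", "parkinson": "yes"},        # hypothetical record
    {"RunID": "RUN_Y", "StudyID": "STUDY_X"},                            # hypothetical, no flag
]

for row in map(to_sample_row, biosamples):
    print(row)
```

Studies are likely to encode disease status under different field names, so a per-study mapping would probably be needed in practice.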

# PICK UP HERE
# from abaco.dataloader import DataPreprocess, one_hot_encoding
# # Load Parkinson's disease dataset
# path_to_dataset = 'data/dataset_parkinson.csv'
# batch_col = "study_code"
# bio_col = "phenotype"
# id_col = "samples"

# # Convert data path into compatible pd.DataFrame
# df_parkinson = DataPreprocess(
#     path_to_dataset,
#     factors = [
#         id_col,
#         batch_col,
#         bio_col
#     ]
# ).dropna()

# # see if there are 3 categorical and n numeric columns (should be an extra column for location)
# print(df_parkinson.info())