Finding Parkinson’s disease taxonomic analyses

Finding Parkinson’s disease taxonomic analyses#

In this notebook we aim to imitate the analyses in “ABaCo demo: Parkinson’s disease gut microbiome” where the aim was to “integrate the 9 studies while preserving key distinctions from the two patient states (Parkinson’s v.s. Healthy).”

# uncomment if running on Colab
# !pip install mgnipy

import logging 
logging.basicConfig(level=logging.WARNING)

Searching for studies#

To start we configure our MGnipy client and access the MGnify API Studies resource.

We filter the query to gut microbiome studies (the Fecal biome lineage) that match the search term “parkinson”.

We can preview the resulting query URLs via .explain().

from mgnipy import MGnipy 

mg = MGnipy(cache_dir='downloads')

pd_studies = mg.studies(
    search='parkinson',
    biome_lineage='root:Host-associated:Human:Digestive system:Large intestine:Fecal',
)

pd_studies.explain()

https://www.ebi.ac.uk/metagenomics/api/v2/studies?biome_lineage=root%3AHost-associated%3AHuman%3ADigestive+system%3ALarge+intestine%3AFecal&search=parkinson&page=1
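
To make the printed URL easier to read, here is a minimal sketch of how such a query string can be assembled with the standard library. It is not how MGnipy builds its requests internally (that is not shown here); it only reuses the base URL and the parameters visible in the output above.

```python
from urllib.parse import urlencode

# base URL and parameters copied from the .explain() output above
BASE_URL = "https://www.ebi.ac.uk/metagenomics/api/v2/studies"

params = {
    "biome_lineage": "root:Host-associated:Human:Digestive system:Large intestine:Fecal",
    "search": "parkinson",
    "page": 1,
}

# urlencode percent-encodes ":" as %3A and turns spaces into "+"
url = f"{BASE_URL}?{urlencode(params)}"
print(url)
```

The encoded lineage (`root%3AHost-associated%3A...`) is therefore just the plain biome lineage string with its colons and spaces escaped.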

The URL looks good, so we can proceed to execute the list query (or queries) via .get(). To enrich the list of studies with their metadata details, we can do this in bulk using .enrich_details() or asynchronously via .aenrich_details().

# populate study list
pd_studies.get()
# enrich studies with metadata
await pd_studies.aenrich_details()

# collect the details into a DataFrame (which can also be saved to file if you prefer)
study_meta = pd_studies.details_df(expand_nested_dicts=True)

# check it out 
study_meta.head()

	accession	ena_accessions	title	updated_at	downloads	first_accession	metadata__study_name	metadata__center_name	metadata__study_title	metadata__study_accession	metadata__study_description	metadata__secondary_study_accession	biome__biome_name	biome__lineage
0	MGYS00006121	[ERP142200, PRJEB57228]	Dietary intervention of people with Parkinson'...	2026-05-28T15:47:01.432000+00:00	[{'file_type': 'tsv', 'download_type': 'Taxono...	ERP142200	NaN	NaN	NaN	NaN	NaN	NaN	Fecal	root:Host-associated:Human:Digestive system:La...
1	MGYS00006760	[ERP148661, PRJEB63522]	EMG produced TPA metagenomics assembly of PRJN...	2026-05-28T15:47:02.672000+00:00	[{'file_type': 'tsv', 'download_type': 'Taxono...	ERP148661	NaN	NaN	NaN	NaN	NaN	NaN	Fecal	root:Host-associated:Human:Digestive system:La...
2	MGYS00006759	[ERP146353, PRJEB61255]	EMG produced TPA metagenomics assembly of PRJN...	2026-05-28T15:47:02.660000+00:00	[{'file_type': 'tsv', 'download_type': 'Taxono...	ERP146353	NaN	NaN	NaN	NaN	NaN	NaN	Fecal	root:Host-associated:Human:Digestive system:La...
3	MGYS00001650	[ERP004264, PRJEB4927]	Alterations of the Fecal Microbiome in Parkins...	2026-05-28T15:47:02.598000+00:00	[{'file_type': 'tsv', 'download_type': 'Taxono...	ERP004264	Fecal Microbiome in Parkinson's Disease	Institute of Biotechnology;University of Helsi...	Alterations of the Fecal Microbiome in Parkins...	PRJEB4927	In the course of Parkinson’s disease (PD), the...	ERP004264	Fecal	root:Host-associated:Human:Digestive system:La...
4	MGYS00005129	[ERP109659, PRJEB27564]	Gut microbiota in Parkinson's disease: tempora...	2026-05-06T12:25:31.349000+00:00	[{'file_type': 'tsv', 'download_type': 'Taxono...	ERP109659	Parkinson's disease gut microbiota follow-up	Institute of Biotechnology;University of Helsi...	Gut microbiota in Parkinson's disease: tempora...	PRJEB27564	Aiming to explore the temporal stability of gu...	ERP109659	Fecal	root:Host-associated:Human:Digestive system:La...
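
The double-underscore column names in this table (biome__biome_name, metadata__study_accession, ...) suggest that nested dictionaries are flattened into one column per leaf key. The sketch below illustrates that idea in plain Python. The `flatten` helper is hypothetical and is not MGnipy's implementation; the example record reuses a few values from the row for MGYS00001650.

```python
def flatten(record, sep="__", prefix=""):
    """Flatten nested dicts: {'biome': {'lineage': 'x'}} -> {'biome__lineage': 'x'}."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # recurse into nested dicts, carrying the parent key as prefix
            flat.update(flatten(value, sep=sep, prefix=name))
        else:
            flat[name] = value  # lists and scalars are kept as-is
    return flat


study = {
    "accession": "MGYS00001650",
    "ena_accessions": ["ERP004264", "PRJEB4927"],
    "biome": {"biome_name": "Fecal"},
    "metadata": {"study_accession": "PRJEB4927"},
}
flat_study = flatten(study)
print(sorted(flat_study))
```

Under this scheme a study whose metadata dict is empty contributes no metadata__ keys at all, which would be consistent with the NaN entries in the first three rows.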

Using `MGazine` to explore the study datasets#

We can access the MGazine of datasets via the .datasets attribute. The study details retrieved above are also passed on to the MGazine.

# access mgazine
mz = pd_studies.datasets

# take a look
print(mz)

MGazine containing:
- MGnify pipeline versions: ['v3', 'v4_1', 'v5', 'v6']
- Number of downloads: 72
- Short descriptions: ['Complete GO annotation',
 'DwC-Ready summary of 16S-V3-V4 ASV taxonomies using -PR2 as ref DB',
 'DwC-Ready summary of 16S-V3-V4 ASV taxonomies using -SILVA as ref DB',
 'DwC-Ready summary of closed-ref taxonomies using ITSoneDB as ref DB',
 'DwC-Ready summary of closed-ref taxonomies using PR2 as ref DB',
 'DwC-Ready summary of closed-ref taxonomies using SILVA-LSU as ref DB',
 'DwC-Ready summary of closed-ref taxonomies using SILVA-SSU as ref DB',
 'GO slim annotation',
 'InterPro matches',
 'Phylum level taxonomies',
 'Phylum level taxonomies LSU',
 'Phylum level taxonomies SSU',
 'Summary of DADA2-PR2 taxonomies',
 'Summary of DADA2-SILVA taxonomies',
 'Summary of ITSoneDB taxonomies',
 'Summary of PR2 taxonomies',
 'Summary of SILVA-LSU taxonomies',
 'Summary of SILVA-SSU taxonomies',
 'Taxonomic assignments',
 'Taxonomic assignments LSU',
 'Taxonomic assignments SSU',
 'Taxonomic diversity metrics']

For the ABaCo demo we use the taxonomic analyses, restricted to pipeline v4 onwards, because the pipeline versions differ, in particular in the SILVA databases used for the taxonomic analysis. The selection below ends up with datasets from v5 and v6.

# MGazines can be combined with +
mz_taxa = mz['Summary of SILVA-SSU taxonomies'] + mz.v5['Taxonomic assignments SSU']

# print still works
print(mz_taxa)

# studies details are preserved
import pandas as pd
display(pd.DataFrame(mz_taxa.studies_details))

MGazine Curation TaxaMGazine containing:
- MGnify pipeline versions: ['v6']
- Number of downloads: 3
- Short descriptions: ['Summary of SILVA-SSU taxonomies']
-----------------------
Next steps: Use `.load()` to initialize.

MGazine Curation TaxaMGazine containing:
- MGnify pipeline versions: ['v5']
- Number of downloads: 4
- Short descriptions: ['Taxonomic assignments SSU']
-----------------------
Next steps: Use `.load()` to initialize.

MGazine Curation TaxaMGazine containing:
- MGnify pipeline versions: ['v5', 'v6']
- Number of downloads: 7
- Short descriptions: ['Summary of SILVA-SSU taxonomies', 'Taxonomic assignments SSU']
-----------------------
Next steps: Use `.load()` to initialize.

MGazine Curation TaxaMGazine containing:
- MGnify pipeline versions: ['v5', 'v6']
- Number of downloads: 7
- Short descriptions: ['Summary of SILVA-SSU taxonomies', 'Taxonomic assignments SSU']

	accession	ena_accessions	title	biome	updated_at	downloads	metadata	first_accession
0	MGYS00006121	[ERP142200, PRJEB57228]	Dietary intervention of people with Parkinson'...	{'biome_name': 'Fecal', 'lineage': 'root:Host-...	2026-05-28T15:47:01.432000+00:00	[{'file_type': 'tsv', 'download_type': 'Taxono...	{}	ERP142200
1	MGYS00006760	[ERP148661, PRJEB63522]	EMG produced TPA metagenomics assembly of PRJN...	{'biome_name': 'Fecal', 'lineage': 'root:Host-...	2026-05-28T15:47:02.672000+00:00	[{'file_type': 'tsv', 'download_type': 'Taxono...	{}	ERP148661
2	MGYS00006759	[ERP146353, PRJEB61255]	EMG produced TPA metagenomics assembly of PRJN...	{'biome_name': 'Fecal', 'lineage': 'root:Host-...	2026-05-28T15:47:02.660000+00:00	[{'file_type': 'tsv', 'download_type': 'Taxono...	{}	ERP146353
3	MGYS00001650	[ERP004264, PRJEB4927]	Alterations of the Fecal Microbiome in Parkins...	{'biome_name': 'Fecal', 'lineage': 'root:Host-...	2026-05-28T15:47:02.598000+00:00	[{'file_type': 'tsv', 'download_type': 'Taxono...	{'study_name': 'Fecal Microbiome in Parkinson'...	ERP004264
4	MGYS00005129	[ERP109659, PRJEB27564]	Gut microbiota in Parkinson's disease: tempora...	{'biome_name': 'Fecal', 'lineage': 'root:Host-...	2026-05-06T12:25:31.349000+00:00	[{'file_type': 'tsv', 'download_type': 'Taxono...	{'study_name': 'Parkinson's disease gut microb...	ERP109659
5	MGYS00005755	[PRJNA510730, SRP173877]	Microbiota composition of Parkinson's disease ...	{'biome_name': 'Fecal', 'lineage': 'root:Host-...	2026-05-06T11:35:50.490000+00:00	[{'file_type': 'tsv', 'download_type': 'Taxono...	{}	SRP173877
6	MGYS00005601	[ERP113090, PRJEB30615]	Identification of Intestinal Bacterial Taxa wi...	{'biome_name': 'Fecal', 'lineage': 'root:Host-...	2026-05-28T15:47:01.054000+00:00	[{'file_type': 'tsv', 'download_type': 'Taxono...	{}	ERP113090
7	MGYS00005130	[ERP112853, PRJEB30401]	Gut Microbiome Alterations Drive Distinct Meta...	{'biome_name': 'Fecal', 'lineage': 'root:Host-...	2026-05-06T10:02:48.163000+00:00	[{'file_type': 'tsv', 'download_type': 'Taxono...	{'study_name': 'Gut Microbiome and Parkinson's...	ERP112853
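
The selection above combines three operations: filtering by short description, filtering by pipeline version, and merging two selections with +. The toy class below sketches how such an interface could behave. `ToyMagazine`, `Download`, and the study names are hypothetical stand-ins, not the MGazine class, and the attribute-style version access of the notebook (mz.v5) is replaced by a plain `by_version` method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Download:
    study: str
    version: str
    description: str


class ToyMagazine:
    """Hypothetical stand-in for a filterable collection of download entries."""

    def __init__(self, downloads):
        self.downloads = list(downloads)

    def by_version(self, version):
        return ToyMagazine(d for d in self.downloads if d.version == version)

    def __getitem__(self, description):
        return ToyMagazine(d for d in self.downloads if d.description == description)

    def __add__(self, other):
        # dict.fromkeys de-duplicates while preserving insertion order
        return ToyMagazine(dict.fromkeys(self.downloads + other.downloads))

    def __len__(self):
        return len(self.downloads)

    @property
    def versions(self):
        return sorted({d.version for d in self.downloads})


catalogue = ToyMagazine([
    Download("study_A", "v6", "Summary of SILVA-SSU taxonomies"),
    Download("study_B", "v5", "Taxonomic assignments SSU"),
    Download("study_B", "v5", "Taxonomic assignments LSU"),
    Download("study_C", "v4_1", "Taxonomic assignments SSU"),
])

taxa = (
    catalogue["Summary of SILVA-SSU taxonomies"]
    + catalogue.by_version("v5")["Taxonomic assignments SSU"]
)
print(len(taxa), taxa.versions)
```

Each filter returns a new collection, so selections can be chained and combined without modifying the original catalogue.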

(Lazy)Loading into one taxonomic dataset#

# lazy-load the MGnify taxonomic assignment datasets
mz_taxa.load()

# calling to_pandas or to_polars will collect the data and return a dataframe
mz_taxa.to_polars().head()

TaxaMGazine loaded with 7 datasets. 
Cached runs results: 0 of total 1828.

shape: (5, 1_829)

taxonomy	ERR2730148	ERR2730149	ERR2730150	ERR2730151	ERR2730152	ERR2730153	ERR2730154	ERR2730155	ERR2730156	ERR2730157	ERR2730158	ERR2730159	ERR2730160	ERR2730161	ERR2730162	ERR2730163	ERR2730164	ERR2730165	ERR2730166	ERR2730167	ERR2730168	ERR2730169	ERR2730170	ERR2730171	ERR2730172	ERR2730173	ERR2730174	ERR2730175	ERR2730176	ERR2730177	ERR2730178	ERR2730179	ERR2730180	ERR2730181	ERR2730182	ERR2730183	…	ERR3046538	ERR3046548	ERR3046558	ERR3046568	ERR3046578	ERR3046588	ERR3046598	ERR3046608	ERR3046618	ERR3046628	ERR3046638	ERR3046389	ERR3046399	ERR3046409	ERR3046419	ERR3046429	ERR3046439	ERR3046449	ERR3046459	ERR3046469	ERR3046479	ERR3046489	ERR3046499	ERR3046509	ERR3046519	ERR3046529	ERR3046539	ERR3046549	ERR3046559	ERR3046569	ERR3046579	ERR3046589	ERR3046599	ERR3046609	ERR3046619	ERR3046629	ERR3046639
str	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	…	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64
"sk__Archaea"	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	…	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
"sk__Archaea;k__;p__Candidatus_…	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	…	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null
"sk__Archaea;k__;p__Candidatus_…	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	…	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null
"sk__Archaea;k__;p__Candidatus_…	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	…	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null
"sk__Archaea;k__;p__Candidatus_…	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	3	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	…	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null	null

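The mix of null and 0 values in the table above is what one would expect if the per-study count tables are joined on taxonomy with an outer join: a taxon that is absent from one study's table has no value for that study's runs. The sketch below shows this with plain dictionaries. It is an assumption about the merge, not a description of what `.load()` does, and the run names and counts are made up for illustration.

```python
def outer_join(tables):
    """Merge per-study count tables on taxonomy.

    Each table is {"runs": [...], "counts": {taxonomy: {run: count}}}.
    Taxa missing from a study get None for all of that study's runs.
    """
    runs = [run for table in tables for run in table["runs"]]
    taxa = sorted({taxon for table in tables for taxon in table["counts"]})
    # start with None everywhere, then fill in the counts that exist
    merged = {taxon: dict.fromkeys(runs) for taxon in taxa}
    for table in tables:
        for taxon, row in table["counts"].items():
            merged[taxon].update(row)
    return merged


study_1 = {
    "runs": ["run_A1", "run_A2"],
    "counts": {"sk__Bacteria": {"run_A1": 12, "run_A2": 7}},
}
study_2 = {
    "runs": ["run_B1"],
    "counts": {
        "sk__Archaea": {"run_B1": 0},
        "sk__Bacteria": {"run_B1": 30},
    },
}

merged = outer_join([study_1, study_2])
print(merged["sk__Archaea"])
```

Here None means "not reported in that study's table", which is different from an observed count of 0. Whether to fill such gaps with zeros is an analysis decision to make before batch correction.
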
Enriching with metadata#

Taxonomic info#

mz_taxa.taxonomic_metadata()

	Superkingdom	Kingdom	Phylum	Class	Order	Family	Genus	Species
0	Archaea	NA	NA	NA	NA	NA	NA	NA
1	Archaea	NA	Candidatus_Thermoplasmatota	Thermoplasmata	NA	NA	NA	NA
2	Archaea	NA	Candidatus_Thermoplasmatota	Thermoplasmata	Methanomassiliicoccales	NA	NA	NA
3	Archaea	NA	Candidatus_Thermoplasmatota	Thermoplasmata	Methanomassiliicoccales	Methanomassiliicoccaceae	Methanomassiliicoccus	NA
4	Archaea	NA	Candidatus_Thermoplasmatota	Thermoplasmata	Methanomassiliicoccales	Methanomassiliicoccaceae	Methanomassiliicoccus	Candidatus_Methanomassiliicoccus_intestinalis
...	...	...	...	...	...	...	...	...
2912	Bacteria	NA	Proteobacteria	Alphaproteobacteria	Rhizobiales	Rhizobiaceae	Rhizobium	NA
2913	Bacteria	NA	Proteobacteria	Gammaproteobacteria	Vibrionales	Vibrionaceae	Vibrio	NA
2914	Bacteria	NA	Firmicutes	Clostridia	Clostridiales	Clostridiaceae	Clostridium	Clostridium_sp._CA6
2915	Bacteria	NA	NA	NA	Haloplasmatales	NA	NA	NA
2916	Bacteria	NA	Firmicutes	Clostridia	Clostridiales	Lachnospiraceae	NA	Lachnospiraceae_bacterium_613

2917 rows × 8 columns
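
The rank columns above can be derived from the semicolon-separated lineage strings in the taxonomy column. A minimal sketch follows, assuming the usual rank prefixes (sk__, k__, p__, c__, o__, f__, g__, s__) and "NA" for empty ranks. The `split_lineage` helper is hypothetical, and the example string is assembled from row 1 of the table rather than copied from the (truncated) data.

```python
# assumed mapping from lineage prefixes to the column names shown above
RANKS = {
    "sk": "Superkingdom", "k": "Kingdom", "p": "Phylum", "c": "Class",
    "o": "Order", "f": "Family", "g": "Genus", "s": "Species",
}


def split_lineage(lineage):
    """Split 'sk__X;k__;p__Y' into one value per rank, 'NA' where empty or missing."""
    row = dict.fromkeys(RANKS.values(), "NA")
    for part in lineage.split(";"):
        prefix, _, name = part.partition("__")
        if prefix in RANKS and name:
            row[RANKS[prefix]] = name
    return row


# illustrative lineage string, reconstructed from row 1 of the table above
example = "sk__Archaea;k__;p__Candidatus_Thermoplasmatota;c__Thermoplasmata"
parsed = split_lineage(example)
print(parsed)
```

Ranks that are present but empty (k__) and ranks that are missing altogether (order and below) both end up as "NA", matching the layout of the table.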

If we look at the run-level metadata at this point, it is still empty:

mz_taxa.metadata().head()


accession
ERR2730148
ERR2730149
ERR2730150
ERR2730151
ERR2730152

mz_taxa.to_anndata()

AnnData object with n_obs × n_vars = 2917 × 1828
    obs: 'Superkingdom', 'Kingdom', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species'

However, we can add metadata that we collected manually, or let taxacurator retrieve some of it for us.

# fetch metadata for some of the runs; this cell can be run multiple times
await mz_taxa.aenrich_runs(limit=200)

# check it out
df_runs = pd.DataFrame(mz_taxa.runs_details)
print(df_runs.shape)
display(df_runs.head())

(200, 8)

	experiment_type	instrument_model	instrument_platform	sample	study	accession	sample_accession	study_accession
0	Amplicon	None	None	{'accession': 'SAMEA4821217', 'ena_accessions'...	{'accession': 'MGYS00005129', 'ena_accessions'...	ERR2730153	SAMEA4821217	MGYS00005129
1	Amplicon	None	None	{'accession': 'SAMEA4821224', 'ena_accessions'...	{'accession': 'MGYS00005129', 'ena_accessions'...	ERR2730160	SAMEA4821224	MGYS00005129
2	Amplicon	None	None	{'accession': 'SAMEA4821214', 'ena_accessions'...	{'accession': 'MGYS00005129', 'ena_accessions'...	ERR2730150	SAMEA4821214	MGYS00005129
3	Amplicon	None	None	{'accession': 'SAMEA4821236', 'ena_accessions'...	{'accession': 'MGYS00005129', 'ena_accessions'...	ERR2730172	SAMEA4821236	MGYS00005129
4	Amplicon	None	None	{'accession': 'SAMEA4821226', 'ena_accessions'...	{'accession': 'MGYS00005129', 'ena_accessions'...	ERR2730162	SAMEA4821226	MGYS00005129
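
Fetching a limited number of runs per call, and being able to re-run the cell, fits an incremental pattern: each call only requests accessions that are not cached yet. The sketch below illustrates that pattern with asyncio. `fetch_run` and `enrich` are hypothetical stand-ins with a fake request, not the `aenrich_runs` implementation, and the `max_concurrent` parameter is an assumption.

```python
import asyncio


async def fetch_run(accession):
    """Hypothetical stand-in for one API request."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {"accession": accession, "experiment_type": "Amplicon"}


async def enrich(accessions, cache, limit, max_concurrent=5):
    """Fetch details for up to `limit` accessions that are not cached yet."""
    todo = [a for a in accessions if a not in cache][:limit]
    semaphore = asyncio.Semaphore(max_concurrent)  # cap simultaneous requests

    async def worker(accession):
        async with semaphore:
            cache[accession] = await fetch_run(accession)

    await asyncio.gather(*(worker(a) for a in todo))
    return len(todo)


async def main():
    accessions = [f"run_{i}" for i in range(5)]
    cache = {}
    first = await enrich(accessions, cache, limit=3)   # fetches 3 new runs
    second = await enrich(accessions, cache, limit=3)  # only 2 remain
    return first, second, len(cache)


result = asyncio.run(main())
print(result)
```

In a notebook, where an event loop is already running, `await main()` replaces `asyncio.run(main())`.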

mz_taxa.enrich_biosamples(limit=10, incl_ena=True)

# check it out
df_biosam = pd.DataFrame(mz_taxa.biosamples_details)
print(df_biosam.shape)
display(df_biosam.head())

(10, 46)

	GivenID	RunID	StudyID	SampleID	seq_meth	decimalLatitude	decimalLongitude	depth	center_name	temperature	...	organism	parkinson	pcr primers	project name	scientific_name	sequencing method	target gene	target subfragment	timepoint	title
0	ERR2730148	ERR2730148	ERP109659	SAMEA4821212	Illumina MiSeq	60.189122	24.907119	NA	UH/IB	NA	...	human gut metagenome	no	Forward: CCTACGGGNGGCWGCAG, GTCCTACGGGNGGCWGCA...	Parkinson's Disease Microbiome	human gut metagenome	Illumina MiSeq	16S	V3-V4	baseline	C0001 baseline
1	ERR2730149	ERR2730149	ERP109659	SAMEA4821213	Illumina MiSeq	60.189122	24.907119	NA	UH/IB	NA	...	human gut metagenome	no	Forward: CCTACGGGNGGCWGCAG, GTCCTACGGGNGGCWGCA...	Parkinson's Disease Microbiome	human gut metagenome	Illumina MiSeq	16S	V3-V4	baseline	C0005 baseline
2	ERR2730150	ERR2730150	ERP109659	SAMEA4821214	Illumina MiSeq	60.189122	24.907119	NA	UH/IB	NA	...	human gut metagenome	no	Forward: CCTACGGGNGGCWGCAG, GTCCTACGGGNGGCWGCA...	Parkinson's Disease Microbiome	human gut metagenome	Illumina MiSeq	16S	V3-V4	baseline	C0007 baseline
3	ERR2730151	ERR2730151	ERP109659	SAMEA4821215	Illumina MiSeq	60.189122	24.907119	NA	UH/IB	NA	...	human gut metagenome	no	Forward: CCTACGGGNGGCWGCAG, GTCCTACGGGNGGCWGCA...	Parkinson's Disease Microbiome	human gut metagenome	Illumina MiSeq	16S	V3-V4	baseline	C0009 baseline
4	ERR2730152	ERR2730152	ERP109659	SAMEA4821216	Illumina MiSeq	60.189122	24.907119	NA	UH/IB	NA	...	human gut metagenome	no	Forward: CCTACGGGNGGCWGCAG, GTCCTACGGGNGGCWGCA...	Parkinson's Disease Microbiome	human gut metagenome	Illumina MiSeq	16S	V3-V4	baseline	C0015 baseline

5 rows × 46 columns
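
The biosample table already carries the three factors the ABaCo demo expects: a sample identifier, a batch (study) and a biological label. The sketch below shows one possible way to reduce the records to those columns. The column names samples, study_code and phenotype come from the draft code further down; the `to_sample_table` helper and the label names "Parkinson's" and "Healthy" are assumptions, and other studies may not provide a parkinson field at all.

```python
def to_sample_table(biosamples):
    """Keep three factors per run: sample id, batch (study) and biological label."""
    labels = {"yes": "Parkinson's", "no": "Healthy"}  # assumed label names
    rows = []
    for record in biosamples:
        rows.append({
            "samples": record["RunID"],
            "study_code": record["StudyID"],
            # records without a usable 'parkinson' value are marked as NA
            "phenotype": labels.get(record.get("parkinson"), "NA"),
        })
    return rows


# values taken from the first two rows of the biosample table above
biosamples = [
    {"RunID": "ERR2730148", "StudyID": "ERP109659", "parkinson": "no"},
    {"RunID": "ERR2730149", "StudyID": "ERP109659", "parkinson": "no"},
]
sample_table = to_sample_table(biosamples)
print(sample_table[0])
```

Marking unlabeled samples explicitly makes it easy to drop them later, much as the draft code below calls .dropna().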

# PICK UP HERE

# from abaco.dataloader import DataPreprocess, one_hot_encoding
# # Load Parkinson's disease dataset
# path_to_dataset = 'data/dataset_parkinson.csv'
# batch_col = "study_code"
# bio_col = "phenotype"
# id_col = "samples"

# # Convert data path into compatible pd.DataFrame
# df_parkinson = DataPreprocess(
#     path_to_dataset,
#     factors = [
#         id_col,
#         batch_col,
#         bio_col
#     ]
# ).dropna()

# # see if there are 3 categorical and n numeric columns (should be an extra column for location)
# print(df_parkinson.info())
