Finding Parkinson’s disease taxonomic analyses#
In this notebook we aim to imitate the analyses in “ABaCo demo: Parkinson’s disease gut microbiome” where the aim was to “integrate the 9 studies while preserving key distinctions from the two patient states (Parkinson’s v.s. Healthy).”
# uncomment if running on Colab
# !pip install mgnipy
import logging
logging.basicConfig(level=logging.WARNING)
Searching for studies#
To start we configure our MGnipy client and access the MGnify API Studies resource.
We will filter our query to studies of the gut microbiome that mention “parkinson”.
We can preview the resulting query URLs via .explain().
from mgnipy import MGnipy
mg = MGnipy(cache_dir='downloads')
pd_studies = mg.studies(
search='parkinson',
biome_lineage='root:Host-associated:Human:Digestive system:Large intestine:Fecal',
)
pd_studies.explain()
https://www.ebi.ac.uk/metagenomics/api/v2/studies?biome_lineage=root%3AHost-associated%3AHuman%3ADigestive+system%3ALarge+intestine%3AFecal&search=parkinson&page=1
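The URL printed above can be reproduced by hand, which is a useful sanity check on what the two filters do. The sketch below rebuilds it with the standard library only. It is not MGnipy's internal code: the helper name `build_query_url` is hypothetical, and only the base URL and parameter names are taken from the printed output.

```python
from urllib.parse import urlencode

# Base endpoint copied from the .explain() output above.
BASE_URL = "https://www.ebi.ac.uk/metagenomics/api/v2/studies"


def build_query_url(page: int = 1, **filters: str) -> str:
    """Hypothetical helper: percent-encode the filters into a query string."""
    # Sort the filter names so the output is deterministic, then append the page.
    params = sorted(filters.items()) + [("page", page)]
    return f"{BASE_URL}?{urlencode(params)}"


url = build_query_url(
    search="parkinson",
    biome_lineage="root:Host-associated:Human:Digestive system:Large intestine:Fecal",
)
print(url)
```

Colons are encoded as `%3A` and spaces as `+`, which is why the biome lineage looks different in the URL than in the Python call.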
Looks good. We can now execute the list query (or queries) via .get(). To enrich the list of studies with metadata details, we can either do this in bulk using .enrich_details() or asynchronously via .aenrich_details().
# populate study list
pd_studies.get()
# enrich studies with metadata
await pd_studies.aenrich_details()
# collect the details into a DataFrame (save it to file from here if you prefer)
study_meta = pd_studies.details_df(expand_nested_dicts=True)
# check it out
study_meta.head()
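What "expanding nested dicts" amounts to can be shown on a toy record. This is a minimal sketch under the assumption that `expand_nested_dicts=True` turns nested dictionaries into flat columns. The `flatten` helper, the `.` separator, and the example record are invented for illustration and are not taken from MGnipy or the MGnify API.

```python
from typing import Any


def flatten(record: dict[str, Any], parent: str = "", sep: str = ".") -> dict[str, Any]:
    """Recursively lift nested dict values into top-level keys like 'biome.lineage'."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, parent=name, sep=sep))
        else:
            flat[name] = value
    return flat


# Illustrative record only; the field names are invented.
study = {
    "accession": "STUDY_1",
    "biome": {"lineage": "root:Host-associated:Human", "name": "Fecal"},
    "metadata": {"centre": {"name": "Example Centre"}},
}
flat_study = flatten(study)
print(flat_study)
```

For a list of such records, `pandas.json_normalize` performs the same flattening and returns a DataFrame directly.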
Using MGazine to explore the study datasets#
We can access the MGazine of datasets via the .datasets attribute. The study details we retrieved above are also passed on to the MGazine.
# access mgazine
mz = pd_studies.datasets
# take a look
print(mz)
MGazine containing:
- MGnify pipeline versions: ['v3', 'v4_1', 'v5', 'v6']
- Number of downloads: 72
- Short descriptions: ['Complete GO annotation',
'DwC-Ready summary of 16S-V3-V4 ASV taxonomies using -PR2 as ref DB',
'DwC-Ready summary of 16S-V3-V4 ASV taxonomies using -SILVA as ref DB',
'DwC-Ready summary of closed-ref taxonomies using ITSoneDB as ref DB',
'DwC-Ready summary of closed-ref taxonomies using PR2 as ref DB',
'DwC-Ready summary of closed-ref taxonomies using SILVA-LSU as ref DB',
'DwC-Ready summary of closed-ref taxonomies using SILVA-SSU as ref DB',
'GO slim annotation',
'InterPro matches',
'Phylum level taxonomies',
'Phylum level taxonomies LSU',
'Phylum level taxonomies SSU',
'Summary of DADA2-PR2 taxonomies',
'Summary of DADA2-SILVA taxonomies',
'Summary of ITSoneDB taxonomies',
'Summary of PR2 taxonomies',
'Summary of SILVA-LSU taxonomies',
'Summary of SILVA-SSU taxonomies',
'Taxonomic assignments',
'Taxonomic assignments LSU',
'Taxonomic assignments SSU',
'Taxonomic diversity metrics']
For the ABaCo demo we will use the taxonomic analyses, and only from v4 onwards, because the pipeline versions differ, specifically in the SILVA databases that were used for the taxonomic analysis.
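The "v4 onwards" rule can be made explicit by parsing the version labels that the MGazine prints. A minimal sketch, assuming labels follow the `v<major>[_<minor>]` pattern seen in the output above; `parse_version` and `keep_from` are hypothetical helpers and not part of MGnipy.

```python
import re


def parse_version(label: str) -> tuple[int, int]:
    """Turn a label such as 'v4_1' into a sortable (major, minor) tuple."""
    match = re.fullmatch(r"v(\d+)(?:_(\d+))?", label)
    if match is None:
        raise ValueError(f"unrecognised pipeline version label: {label!r}")
    major, minor = match.groups()
    return int(major), int(minor or 0)


def keep_from(labels: list[str], minimum: str) -> list[str]:
    """Keep only the labels at or above the minimum version."""
    floor = parse_version(minimum)
    return [label for label in labels if parse_version(label) >= floor]


versions = ["v3", "v4_1", "v5", "v6"]  # as printed by the MGazine above
selected = keep_from(versions, "v4")
print(selected)
```

Comparing parsed tuples rather than raw strings avoids surprises once a two-digit major version appears.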
# MGazines can be combined with +
mz_taxa = mz['Summary of SILVA-SSU taxonomies'] + mz.v5['Taxonomic assignments SSU']
# print still works
print(mz_taxa)
# studies details are preserved
import pandas as pd
display(pd.DataFrame(mz_taxa.studies_details))
(Lazy)Loading into one taxonomic dataset#
# lazy-load the MGnify taxonomic assignment datasets
mz_taxa.load()
# calling to_pandas or to_polars will collect the data and return a dataframe
mz_taxa.to_polars().head()
Enriching with metadata#
taxonomic info#
mz_taxa.taxonomic_metadata()
| | Superkingdom | Kingdom | Phylum | Class | Order | Family | Genus | Species |
|---|---|---|---|---|---|---|---|---|
| 0 | Archaea | NA | NA | NA | NA | NA | NA | NA |
| 1 | Archaea | NA | Candidatus_Thermoplasmatota | Thermoplasmata | NA | NA | NA | NA |
| 2 | Archaea | NA | Candidatus_Thermoplasmatota | Thermoplasmata | Methanomassiliicoccales | NA | NA | NA |
| 3 | Archaea | NA | Candidatus_Thermoplasmatota | Thermoplasmata | Methanomassiliicoccales | Methanomassiliicoccaceae | Methanomassiliicoccus | NA |
| 4 | Archaea | NA | Candidatus_Thermoplasmatota | Thermoplasmata | Methanomassiliicoccales | Methanomassiliicoccaceae | Methanomassiliicoccus | Candidatus_Methanomassiliicoccus_intestinalis |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2912 | Bacteria | NA | Proteobacteria | Alphaproteobacteria | Rhizobiales | Rhizobiaceae | Rhizobium | NA |
| 2913 | Bacteria | NA | Proteobacteria | Gammaproteobacteria | Vibrionales | Vibrionaceae | Vibrio | NA |
| 2914 | Bacteria | NA | Firmicutes | Clostridia | Clostridiales | Clostridiaceae | Clostridium | Clostridium_sp._CA6 |
| 2915 | Bacteria | NA | NA | NA | Haloplasmatales | NA | NA | NA |
| 2916 | Bacteria | NA | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | NA | Lachnospiraceae_bacterium_613 |
2917 rows × 8 columns
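A table like this is typically derived from semicolon-separated lineage strings. The sketch below assumes the rank-prefixed format found in MGnify taxonomy summaries (for example `sk__Archaea;k__;p__...`) and fills missing ranks with `NA`. `split_lineage` is a hypothetical helper and may differ from what `taxonomic_metadata()` does internally.

```python
# Rank prefixes mapped to the column names shown in the table above.
RANKS = {
    "sk": "Superkingdom", "k": "Kingdom", "p": "Phylum", "c": "Class",
    "o": "Order", "f": "Family", "g": "Genus", "s": "Species",
}


def split_lineage(lineage: str) -> dict[str, str]:
    """Map a rank-prefixed lineage string onto the eight rank columns."""
    row = {rank: "NA" for rank in RANKS.values()}
    for part in lineage.split(";"):
        prefix, sep, name = part.partition("__")
        if sep and name and prefix in RANKS:  # skip empty ranks such as 'k__'
            row[RANKS[prefix]] = name
    return row


row = split_lineage(
    "sk__Archaea;k__;p__Candidatus_Thermoplasmatota;c__Thermoplasmata"
)
print(row)
```

The result corresponds to row 1 of the table: the kingdom is empty in the lineage and the ranks below class are absent, so all of them come out as `NA`.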
If we take a look at the sample metadata, it is empty apart from the accession index:
mz_taxa.metadata().head()
| accession |
|---|
| ERR2730148 |
| ERR2730149 |
| ERR2730150 |
| ERR2730151 |
| ERR2730152 |
mz_taxa.to_anndata()
AnnData object with n_obs × n_vars = 2917 × 1828
obs: 'Superkingdom', 'Kingdom', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species'
However, we can add metadata that we collected manually, or let taxacurator help with some of it.
# fetch some run metadata; this cell can be run multiple times
await mz_taxa.aenrich_runs(limit=200)
# check it out
df_runs = pd.DataFrame(mz_taxa.runs_details)
print(df_runs.shape)
display(df_runs.head())
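The combination of a `limit` and a cell that can be re-run suggests an incremental pattern: each call processes at most `limit` items that are still pending. The sketch below illustrates that idea with bounded concurrency. Everything in it is hypothetical (`fetch_run`, `enrich`, the `workers` parameter, the accessions), no network call is made, and it is not a description of how `aenrich_runs` is implemented.

```python
import asyncio


async def fetch_run(accession: str) -> dict[str, str]:
    """Stand-in for a network call; a real client would query the API here."""
    await asyncio.sleep(0)  # yield control, as real I/O would
    return {"accession": accession, "status": "fetched"}


async def enrich(pending: list[str], done: dict[str, dict], limit: int, workers: int = 5) -> None:
    """Fetch details for at most `limit` pending accessions, `workers` at a time."""
    batch = [pending.pop(0) for _ in range(min(limit, len(pending)))]
    gate = asyncio.Semaphore(workers)

    async def guarded(accession: str) -> dict[str, str]:
        async with gate:  # cap the number of simultaneous requests
            return await fetch_run(accession)

    for detail in await asyncio.gather(*(guarded(a) for a in batch)):
        done[detail["accession"]] = detail


pending = [f"RUN{i:03d}" for i in range(5)]  # made-up accessions
done: dict[str, dict] = {}
asyncio.run(enrich(pending, done, limit=2))  # first call handles 2 runs
first_pass = len(done)
asyncio.run(enrich(pending, done, limit=200))  # second call handles the rest
print(first_pass, len(done), pending)
```

Inside a notebook an event loop is already running, so `await enrich(...)` would replace `asyncio.run(enrich(...))`.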
mz_taxa.enrich_biosamples(limit=10, incl_ena=True)
# check it out
df_biosam = pd.DataFrame(mz_taxa.biosamples_details)
print(df_biosam.shape)
display(df_biosam.head())
(10, 46)
| | GivenID | RunID | StudyID | SampleID | seq_meth | decimalLatitude | decimalLongitude | depth | center_name | temperature | ... | organism | parkinson | pcr primers | project name | scientific_name | sequencing method | target gene | target subfragment | timepoint | title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ERR2730148 | ERR2730148 | ERP109659 | SAMEA4821212 | Illumina MiSeq | 60.189122 | 24.907119 | NA | UH/IB | NA | ... | human gut metagenome | no | Forward: CCTACGGGNGGCWGCAG, GTCCTACGGGNGGCWGCA... | Parkinson's Disease Microbiome | human gut metagenome | Illumina MiSeq | 16S | V3-V4 | baseline | C0001 baseline |
| 1 | ERR2730149 | ERR2730149 | ERP109659 | SAMEA4821213 | Illumina MiSeq | 60.189122 | 24.907119 | NA | UH/IB | NA | ... | human gut metagenome | no | Forward: CCTACGGGNGGCWGCAG, GTCCTACGGGNGGCWGCA... | Parkinson's Disease Microbiome | human gut metagenome | Illumina MiSeq | 16S | V3-V4 | baseline | C0005 baseline |
| 2 | ERR2730150 | ERR2730150 | ERP109659 | SAMEA4821214 | Illumina MiSeq | 60.189122 | 24.907119 | NA | UH/IB | NA | ... | human gut metagenome | no | Forward: CCTACGGGNGGCWGCAG, GTCCTACGGGNGGCWGCA... | Parkinson's Disease Microbiome | human gut metagenome | Illumina MiSeq | 16S | V3-V4 | baseline | C0007 baseline |
| 3 | ERR2730151 | ERR2730151 | ERP109659 | SAMEA4821215 | Illumina MiSeq | 60.189122 | 24.907119 | NA | UH/IB | NA | ... | human gut metagenome | no | Forward: CCTACGGGNGGCWGCAG, GTCCTACGGGNGGCWGCA... | Parkinson's Disease Microbiome | human gut metagenome | Illumina MiSeq | 16S | V3-V4 | baseline | C0009 baseline |
| 4 | ERR2730152 | ERR2730152 | ERP109659 | SAMEA4821216 | Illumina MiSeq | 60.189122 | 24.907119 | NA | UH/IB | NA | ... | human gut metagenome | no | Forward: CCTACGGGNGGCWGCAG, GTCCTACGGGNGGCWGCA... | Parkinson's Disease Microbiome | human gut metagenome | Illumina MiSeq | 16S | V3-V4 | baseline | C0015 baseline |
5 rows × 46 columns
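The biosample table already carries the three factors the ABaCo demo asks for: a sample id, a batch (the study), and a biological label (the `parkinson` column). A minimal sketch of that mapping on plain dictionaries follows. The target column names come from the commented-out ABaCo cell further down, while the label strings, the `"yes"` value, and the second record are assumptions made for illustration.

```python
# Two records shaped like rows of the biosample table; the second is invented.
biosamples = [
    {"RunID": "ERR2730148", "StudyID": "ERP109659", "parkinson": "no"},
    {"RunID": "RUN_EXAMPLE", "StudyID": "ERP109659", "parkinson": "yes"},
]

# Assumed label strings; adjust to whatever the downstream tool expects.
PHENOTYPE = {"no": "Healthy", "yes": "Parkinson's"}


def to_factors(record: dict[str, str]) -> dict[str, str]:
    """Rename the id, batch and biological columns for downstream use."""
    return {
        "samples": record["RunID"],
        "study_code": record["StudyID"],
        "phenotype": PHENOTYPE.get(record["parkinson"], "NA"),
    }


factors = [to_factors(r) for r in biosamples]
print(factors)
```

Other studies may encode disease status under a different column name or with different values, so the lookup table would need one entry per encoding actually observed.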
# PICK UP HERE
# from abaco.dataloader import DataPreprocess, one_hot_encoding
# # Load Parkinson's disease dataset
# path_to_dataset = 'data/dataset_parkinson.csv'
# batch_col = "study_code"
# bio_col = "phenotype"
# id_col = "samples"
# # Convert data path into compatible pd.DataFrame
# df_parkinson = DataPreprocess(
# path_to_dataset,
# factors = [
# id_col,
# batch_col,
# bio_col
# ]
# ).dropna()
# # see if there are 3 categorical and n numeric columns (should be an extra column for location)
# print(df_parkinson.info())