mgnipy.V2.datasets.taxonomic module


class mgnipy.V2.datasets.taxonomic.DWCTaxaMGazine(mgazine, config=None, *, long_short_mapping=None, assemblies_details=None, runs_details=None, samples_details=None, studies_details=None, biosamples_details=None, analyses_details=None)[source]#

Bases: _MGazineSetup

async adownload(to_dir, alias=None, *, url=None, filename=None, httpx_aclient=None, overwrite=False, hide_progress=False)#

Asynchronously download a file from an alias or URL.

Parameters:
  • to_dir (DirectoryPath) – Directory where the file will be saved.

  • alias (str or None, optional) – Download alias known to this MGazine instance.

  • url (str or None, optional) – Direct URL to fetch. Either alias or url must be provided.

  • filename (str or None, optional) – Filename to use for the saved file. When omitted the alias is used.

  • httpx_aclient (httpx.AsyncClient, optional) – Optional httpx.AsyncClient to use for the HTTP request.

  • overwrite (bool , optional) – If False and the destination file already exists the download is skipped. When True the existing file will be overwritten.

  • hide_progress (bool , optional) – Disable the progress bar when True.

Raises:

ValueError – If neither alias nor url is provided.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> mg = MGazine(downloads)
>>> await mg.adownload("download_to_here", alias="example.txt")  # doctest: +SKIP

async adownload_all(to_dir, overwrite=False, hide_progress=False)#

Asynchronously download all files known to this MGazine.

Parameters:
  • to_dir (DirectoryPath) – Directory where the files will be saved.

  • overwrite (bool , optional) – Passed to adownload to control overwriting behavior.

  • hide_progress (bool , optional) – Disable progress bars when True.

Notes

This helper creates a single async HTTP client and schedules concurrent adownload calls for all aliases.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
...     {"alias": "example2.fasta.gz", "url": "http://ex/x2", "file_type": "fasta"},
... ]
>>> mg = MGazine(downloads)
>>> await mg.adownload_all("download_to_here")
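
The note above (one shared client, one concurrent task per alias) can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: fake_download and the dict standing in for the HTTP client are hypothetical, and no network access takes place.

```python
import asyncio


async def fake_download(alias: str, shared_client: dict) -> str:
    """Hypothetical stand-in for one adownload call; no network access."""
    shared_client["requests"] += 1  # every task reuses the same client object
    await asyncio.sleep(0)  # yield control, as a real HTTP request would
    return alias


async def download_all(aliases: list[str]) -> list[str]:
    """Create one shared client, then run one task per alias concurrently."""
    shared_client = {"requests": 0}  # placeholder for a single async HTTP client
    tasks = [fake_download(alias, shared_client) for alias in aliases]
    done = await asyncio.gather(*tasks)  # results keep the order of `aliases`
    return list(done)


finished = asyncio.run(download_all(["example.txt", "example2.fasta.gz"]))
print(finished)
```

asyncio.gather returns results in the order the tasks were passed in, so the output can be matched back to the aliases without extra bookkeeping.
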
async aenrich_runs(limit=200, hide_progress=False)#

Asynchronously enriches the run metadata for the runs in the taxonomic dataset by iterating through the run accessions and retrieving their details using the RunDetail proxy. The results are cached using the DiskCheckpointer to avoid redundant API calls in future runs.

Parameters:
  • limit (Optional[int ], default=200) – An optional integer to limit the number of runs to enrich. If not provided, it defaults to 200. This is useful for testing or when dealing with large datasets to avoid long runtimes during development. If set to None, there will be no limit on the number of runs enriched.

  • hide_progress (bool , default=False) – Whether to hide the progress bar during enrichment. Defaults to False.

Returns:

The function does not return anything. It updates the run_results attribute of the TaxaMGazine instance with the enriched run metadata.

Return type:

None

property aliases: list [str ]#

Return a list of all download aliases.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> MGazine(downloads).aliases
['example.txt']
property analyses_details: list [dict [str , Any ]]#
append_analyses_details(value)#
Parameters:

value (dict [str , Any ])

append_assemblies_details(value)#
Parameters:

value (dict [str , Any ])

append_biosamples_details(value)#
Parameters:

value (dict [str , Any ])

append_runs_details(value)#
Parameters:

value (dict [str , Any ])

append_samples_details(value)#
Parameters:

value (dict [str , Any ])

append_studies_details(value)#
Parameters:

value (dict [str , Any ])

property assemblies_details: list [dict [str , Any ]]#
property biosamples_details: list [dict [str , Any ]]#
by_pipeline_version()#

Group downloads by pipeline version based on the ‘pipeline_version’ column in the downloads dataframe.

Returns:

A dictionary where keys are pipeline versions and values are lists of download dictionaries.

Return type:

dict
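
As a rough illustration of the grouping described above, the sketch below groups download dictionaries by a chosen key. The helper name group_by and the sample records are assumptions made for this example; the same idea would apply to a key such as short_description.

```python
from collections import defaultdict

# Sample records invented for this sketch
downloads = [
    {"alias": "a.tsv", "pipeline_version": "v4_1"},
    {"alias": "b.tsv", "pipeline_version": "v5"},
    {"alias": "c.tsv", "pipeline_version": "v4_1"},
]


def group_by(records, key):
    """Map each distinct value of `key` to the list of records that carry it."""
    groups = defaultdict(list)
    for record in records:
        groups[record.get(key)].append(record)
    return dict(groups)


by_version = group_by(downloads, "pipeline_version")
print(sorted(by_version))
```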

by_short_desc()#

Group downloads by short description based on the ‘short_description’ column in the downloads dataframe.

Returns:

A dictionary where keys are short descriptions and values are lists of download dictionaries.

Return type:

dict

clear_cache()#
download(to_dir, alias=None, *, url=None, filename=None, httpx_client=None, overwrite=False, hide_progress=False)#

Download a file from an alias or URL to a local directory.

Parameters:
  • to_dir (DirectoryPath) – Directory where the file will be saved.

  • alias (str or None, optional) – Download alias known to this MGazine instance. When provided the corresponding URL from the instance’s downloads list is used.

  • url (str or None, optional) – Direct URL to fetch. Either alias or url must be provided.

  • filename (str or None, optional) – Filename to use for the saved file. When omitted the alias is used.

  • httpx_client (httpx.Client, optional) – Optional httpx.Client to use for the HTTP request. If not supplied a temporary client from _mgnifier_helper is used.

  • overwrite (bool , optional) – If False and the destination file already exists the download is skipped. When True the existing file will be overwritten.

  • hide_progress (bool , optional) – Disable the progress bar when True.

Raises:

ValueError – If neither alias nor url is provided.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> mg = MGazine(downloads)
>>> mg.download("download_to_here", alias="example.txt")  # doctest: +SKIP
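
The argument rules documented for download (alias or url required, alias as the default filename, skip unless overwrite is True) can be sketched as follows. The helper save and its fetch callable are hypothetical and perform no HTTP request; falling back to the last URL segment for the filename is an assumption of this sketch, not documented behavior.

```python
import tempfile
from pathlib import Path


def save(to_dir, alias=None, *, url=None, filename=None, overwrite=False,
         fetch=lambda url: b"payload"):
    """Hypothetical helper mirroring the documented argument rules."""
    if alias is None and url is None:
        raise ValueError("Either alias or url must be provided.")
    # Filename falls back to the alias; the URL-segment fallback is an assumption
    name = filename or alias or url.rstrip("/").rsplit("/", 1)[-1]
    target = Path(to_dir) / name
    if target.exists() and not overwrite:
        return target, False  # existing file kept, download skipped
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(fetch(url))
    return target, True


with tempfile.TemporaryDirectory() as tmp:
    first = save(tmp, alias="example.txt", url="http://ex/x")
    second = save(tmp, alias="example.txt", url="http://ex/x")
    third = save(tmp, alias="example.txt", url="http://ex/x", overwrite=True)
    wrote = [first[1], second[1], third[1]]

print(wrote)
```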

download_all(to_dir, hide_progress=False, overwrite=False)#

Download all files known to this MGazine instance.

Parameters:
  • to_dir (DirectoryPath) – Directory where the files will be saved.

  • hide_progress (bool , optional) – Disable per-file and overall progress bars when True.

  • overwrite (bool , optional) – Passed to download to control overwriting behavior.

Notes

This helper calls download for each alias present in the instance’s downloads list.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
...     {"alias": "example2.fasta.gz", "url": "http://ex/x2", "file_type": "fasta"},
... ]
>>> mg = MGazine(downloads)
>>> mg.download_all("download_to_here")
downloads_df(**pd_kwargs)#

Return a pandas.DataFrame of all downloads.

The dataframe will contain columns such as alias, url and file_type when those keys exist in the provided download dicts.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> df = MGazine(downloads).downloads_df()
>>> list(df.columns)
['alias', 'url', 'file_type']
Return type:

DataFrame

enrich_biosamples(limit=200, hide_progress=False, incl_ena=True)#

Enriches the biosample metadata for the biosamples in the taxonomic dataset by iterating through the biosample accessions and retrieving their details using the BiosampleDetail proxy. The results are cached using the DiskCheckpointer to avoid redundant API calls in future runs.

Parameters:
  • limit (Optional[int ], default=200) – An optional integer to limit the number of biosamples to enrich. If not provided, it defaults to 200. This is useful for testing or when dealing with large datasets to avoid long runtimes during development. If set to None, there will be no limit on the number of biosamples enriched.

  • hide_progress (bool, default=False) – Whether to hide the progress bar during enrichment. Defaults to False.

  • incl_ena (bool )

Returns:

The function does not return anything. It updates the TaxaMGazine instance in place with the enriched biosample metadata.

Return type:

None

enrich_runs(limit=200, hide_progress=False)#

Enriches the run metadata for the runs in the taxonomic dataset by iterating through the run accessions and retrieving their details using the RunDetail proxy. The results are cached using the DiskCheckpointer to avoid redundant API calls in future runs.

Parameters:
  • limit (Optional[int ], default=200) – An optional integer to limit the number of runs to enrich. If not provided, it defaults to 200. This is useful for testing or when dealing with large datasets to avoid long runtimes during development. If set to None, there will be no limit on the number of runs enriched.

  • hide_progress (bool, default=False) – Whether to hide the progress bar during enrichment. Defaults to False.

Returns:

The function does not return anything. It updates the run_results attribute of the TaxaMGazine instance with the enriched run metadata.

Return type:

None
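
The combination of a limit and a disk cache described above can be sketched roughly as below. This is not the DiskCheckpointer or the RunDetail proxy: the JSON cache file, enrich and fake_fetch are hypothetical stand-ins, and unlike the documented method the sketch returns the cache so it is easy to inspect.

```python
import json
import tempfile
from itertools import islice
from pathlib import Path


def enrich(accessions, fetch_detail, cache_file, limit=200):
    """Fetch details for up to `limit` accessions, reusing results cached on disk."""
    cache_file = Path(cache_file)
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    selected = accessions if limit is None else islice(accessions, limit)
    for accession in selected:
        if accession not in cache:  # only uncached accessions trigger a call
            cache[accession] = fetch_detail(accession)
    cache_file.write_text(json.dumps(cache))
    return cache


calls = []


def fake_fetch(accession):
    """Hypothetical stand-in for a remote detail lookup."""
    calls.append(accession)
    return {"accession": accession}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "runs.json"
    enrich(["ERR1", "ERR2", "ERR3"], fake_fetch, path, limit=2)
    enrich(["ERR1", "ERR2", "ERR3"], fake_fetch, path, limit=None)

print(calls)
```

The second call has no limit but only fetches the accession that the first call left out, which is the saving the cache is meant to provide.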

enrich_samples()#
enrich_studies()#
property lazy_merged: LazyFrame#
list_pipeline_version()#

Return a list of pipeline versions extracted from the download groups.

This looks for patterns like ‘.v4.1’ in the ‘download_group’ field of the downloads and extracts the version number.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt", "download_group": "group.v4.1", "pipeline_version": 'v4_1'},
...     {"alias": "example2.txt", "url": "http://ex/x2", "file_type": "txt", "download_group": "group.v5", "pipeline_version": 'v5'},
... ]
>>> MGazine(downloads).list_pipeline_version()
['v4_1', 'v5']
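
One way to pull a version label out of a download group string is a regular expression, sketched below. The pattern and the dot-to-underscore normalization are assumptions inferred from the example output above, not the library's actual code.

```python
import re

# Assumed pattern: a dot, the letter v, then dotted digits (".v4.1", ".v5")
VERSION = re.compile(r"\.v(\d+(?:\.\d+)*)")


def version_label(download_group):
    """Return a label such as 'v4_1', or None when no version is present."""
    match = VERSION.search(download_group)
    if match is None:
        return None
    return "v" + match.group(1).replace(".", "_")


labels = [version_label(group) for group in ["group.v4.1", "group.v5", "group"]]
print(labels)
```
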
list_short_descriptions()#

Return a list of short descriptions extracted from the download groups.

This looks for patterns like ‘shortdesc’ in the ‘download_group’ field of the downloads and extracts the short description.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt", "download_group": "group.shortdesc1", "pipeline_version": 4.1, "short_description": "shortdesc1"},
...     {"alias": "example2.txt", "url": "http://ex/x2", "file_type": "txt", "download_group": "group.shortdesc2", "pipeline_version": 4.1, "short_description": "shortdesc2"},
... ]
>>> MGazine(downloads).list_short_descriptions()
['shortdesc1', 'shortdesc2']
load()[source]#

Lazily load and merge the datasets listed in url_list. Call this method after instantiation to set up the internal state and load any cached results.

metadata(df_engine='pandas', strict=False, expand_nested_dicts=True, incl_runs_details=True, incl_samples_details=True, incl_studies_details=True, incl_biosamples_details=True, incl_analyses_details=True, incl_assemblies_details=True)#
Parameters:
  • df_engine (Literal ['polars', 'pandas'])

  • strict (bool )

  • expand_nested_dicts (bool )

  • incl_runs_details (bool )

  • incl_samples_details (bool )

  • incl_studies_details (bool )

  • incl_biosamples_details (bool )

  • incl_analyses_details (bool )

  • incl_assemblies_details (bool )

Return type:

pandas.DataFrame | polars.DataFrame

property runs_accessions: list #
property runs_details: list [dict [str , Any ]]#
property runs_to_samples: dict [str , str ]#
property samples_details: list [dict [str , Any ]]#
stream(*, alias=None, url=None, chunksize=None, max_skip=5, **kwargs)#

Streams a single download based on its alias or url.

If chunksize is specified then iterators of dataframes or strings will be returned; otherwise the full data will be returned as a single object.

Supported formats and their handlers#

Parameters:
  • alias (str or None, optional) – The alias of the download to stream.

  • url (HttpUrl or None, optional) – The URL of the download to stream.

  • chunksize (int or None, optional) – The size of the chunks to read from the stream.

  • max_skip (int, optional) – The maximum number of rows to skip before raising an error. Default is 5.

  • **kwargs – Additional keyword arguments to pass to the streamer function.

Returns:

The streamer result for the resolved alias or url.

Return type:

Any

stream_biom(url, **skbio_kwargs)#

Stream a biom file from a URL using scikit-bio’s read function. See the scikit-bio documentation for details.

Parameters:
  • url (str ) – The URL to the biom file to stream.

  • **skbio_kwargs – Additional keyword arguments passed to skbio.io.read(), such as into and verify.

Returns:

A generator yielding scikit-bio Sequence objects parsed from the biom file.

Return type:

Generator

stream_fasta(url, **skbio_kwargs)#

Stream a FASTA file from a URL using scikit-bio’s read function. See the scikit-bio documentation for details.

Parameters:
  • url (str ) – The URL to the FASTA file to stream.

  • **skbio_kwargs – Additional keyword arguments passed to skbio.io.read(), such as into and verify.

Returns:

A generator yielding scikit-bio Sequence objects parsed from the FASTA file.

Return type:

Generator

stream_gff(url, **skbio_kwargs)#

Stream a GFF file from a URL using scikit-bio’s read function. See the scikit-bio documentation for details.

Parameters:
  • url (str ) – The URL to the GFF file to stream.

  • **skbio_kwargs – Additional keyword arguments passed to skbio.io.read(), such as into and verify.

Returns:

A generator yielding scikit-bio Sequence objects parsed from the GFF file.

Return type:

Generator

stream_gzipped(url, chunksize=None, httpx_client=None, decode=False, encoding='utf-8', errors='replace', **httpx_kwargs)#

Stream a gzipped HTTP resource and present a file-like interface.

When chunksize is None the entire compressed payload is fetched and decompressed into memory. When chunksize is provided a streaming file-like object is returned.

Parameters:
  • url (str )

  • chunksize (int | None)

  • httpx_client (Client | None)

  • decode (bool )

  • encoding (str )

  • errors (str )

Return type:

bytes | str | BufferedReader | TextIOWrapper
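
The two modes described above (decompress everything into memory, or hand back a file-like reader) can be illustrated with the standard library. This is a sketch under assumptions: the compressed bytes stand in for an HTTP response body, and no request is made.

```python
import gzip
import io

# Stands in for the gzipped body of an HTTP response
compressed = gzip.compress(b"line one\nline two\n")

# Analogue of chunksize=None: decompress the whole payload into memory at once
whole = gzip.decompress(compressed)

# Analogue of a streaming result: a text reader that decompresses as it is read
with io.TextIOWrapper(
    gzip.GzipFile(fileobj=io.BytesIO(compressed)),
    encoding="utf-8",
    errors="replace",
) as reader:
    lines = [line.rstrip("\n") for line in reader]

print(whole, lines)
```

The file-like form keeps memory use bounded for large files, at the cost of the caller having to iterate and close the reader.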

stream_html(url, **web_kwargs)#

Open an HTML URL in the default web browser.

Parameters:
  • url (str ) – The URL to open in the web browser.

  • **web_kwargs – Additional keyword arguments passed to webbrowser.open(), such as new and autoraise.

Returns:

True if the URL was opened successfully, False otherwise.

Return type:

bool

stream_json(url, chunksize=None, httpx_client=None, **httpx_kwargs)#
Parameters:
  • url (str )

  • chunksize (int | None)

  • httpx_client (Client | None)

Return type:

dict | Generator

stream_jsonl(url, orient=None, chunksize=None, dataframe_engine='pandas', **df_kwargs)#
Parameters:
  • url (str )

  • orient (Literal ['records', 'split', 'index', 'columns', 'values', 'table'] | None)

  • chunksize (int | None)

  • dataframe_engine (Literal ['pandas', 'polars'] | None)

Return type:

dict

stream_pandas(url, sep='\t', chunksize=None, max_skip=5, low_memory=False, **pd_kwargs)#

Read a TSV from a URL or local file with resilient header handling.

The helper will retry with increasing skiprows when pandas raises a ParserError (useful for files with extra header lines). When chunksize is provided an iterator is returned.

Parameters:
  • url (str ) – The URL or local file path to read the TSV from.

  • sep (str ) – The delimiter to use (default is tab).

  • chunksize (int or None) – If an integer is provided, returns an iterator that yields DataFrames of that many rows. If None, returns a single DataFrame.

  • max_skip (int ) – The maximum number of lines to skip when trying to parse the TSV.

  • **pd_kwargs – Additional keyword arguments passed to pd.read_csv.

  • low_memory (bool )

Returns:

A DataFrame containing the TSV data, or an iterator yielding DataFrames if chunksize is specified.

Return type:

pd.DataFrame or TextFileReader

Raises:
  • ValueError – If chunksize is not a positive integer or None.

  • RuntimeError – If the TSV cannot be parsed after skipping up to max_skip lines.

  • Pandas ParserError – If the TSV cannot be parsed due to a format error (after retries).
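
The retry idea (skip one more leading line each time parsing fails, up to max_skip) is sketched below with the standard csv module rather than pandas, so the example runs without third-party packages. The consistency check on row widths is a stand-in for a parser error, and read_tsv_with_retries is a hypothetical name.

```python
import csv


def read_tsv_with_retries(text, sep="\t", max_skip=5):
    """Retry parsing while skipping 0..max_skip leading lines.

    A parse counts as successful when every row has as many fields as the
    first remaining row, which is treated as the header.
    """
    lines = text.splitlines()
    for skip in range(max_skip + 1):
        rows = list(csv.reader(lines[skip:], delimiter=sep))
        if rows and all(len(row) == len(rows[0]) for row in rows):
            return rows[0], rows[1:], skip
    raise RuntimeError(
        f"Could not parse the table after skipping up to {max_skip} lines."
    )


raw = (
    "# generated by a pipeline\n"
    "# contact: someone\n"
    "taxon\tcount\n"
    "Bacteria\t10\n"
    "Archaea\t2\n"
)
header, records, skipped = read_tsv_with_retries(raw)
print(header, records, skipped)
```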

stream_polars(url, sep='\t', chunksize=None, max_skip=5, low_memory=False, **pl_kwargs)#

Read a TSV from a URL or local file into a Polars DataFrame with resilient header handling.

The helper will retry with increasing skip_rows when Polars raises an error (useful for files with extra header lines). When chunksize is provided an iterator is returned.

Parameters:
  • url (str ) – The URL or local file path to read the TSV from.

  • sep (str ) – The delimiter to use (default is tab).

  • chunksize (int or None) – If an integer is provided, returns an iterator that yields DataFrames of that many rows. If None, returns a single DataFrame.

  • max_skip (int ) – The maximum number of lines to skip when trying to parse the TSV.

  • **pl_kwargs – Additional keyword arguments passed to pl.read_csv.

  • low_memory (bool )

Returns:

A Polars DataFrame containing the TSV data, or an iterator yielding DataFrames if chunksize is specified.

Return type:

pl.DataFrame or Iterator[pl.DataFrame]

Raises:
  • ValueError – If chunksize is not a positive integer or None.

  • RuntimeError – If the TSV cannot be parsed after skipping up to max_skip lines.

  • Polars Error – If the TSV cannot be parsed due to a format error (after retries).

stream_tree(url, **skbio_kwargs)#
Parameters:

url (str )

Return type:

Generator

stream_txt(url, chunksize=None, httpx_client=None, **httpx_kwargs)#

Stream a plain-text resource. When chunksize is None the full text is returned as a string. When chunksize is an integer the function yields lists of lines.

Parameters:
  • url (str ) – The URL to stream the text from.

  • chunksize (int or None) – If an integer is provided, yields lists of lines of that size. If None, returns the entire text as a single string.

  • httpx_client (httpx.Client, optional) – An optional httpx.Client to use for the request. If None, a new client will be created for the request.

  • **httpx_kwargs – Additional keyword arguments passed to the httpx.Client.request() method.

Returns:

The full text as a string if chunksize is None, or a generator yielding lists of lines if chunksize is an integer.

Return type:

str or Generator
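
Yielding lists of lines of a fixed size, as described for an integer chunksize, can be sketched with itertools.islice. The function name iter_line_chunks is hypothetical and the text is passed in directly instead of being fetched over HTTP.

```python
from itertools import islice


def iter_line_chunks(text, chunksize):
    """Yield successive lists of at most `chunksize` lines from `text`."""
    lines = iter(text.splitlines())
    while chunk := list(islice(lines, chunksize)):
        yield chunk


chunks = list(iter_line_chunks("a\nb\nc\nd\ne", 2))
print(chunks)
```

The final chunk may be shorter than chunksize when the number of lines is not an exact multiple.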

property studies_details: list [dict [str , Any ]]#
taxonomic_metadata(fill_na='NA', df_engine='pandas', strict=False)#
Return type:

pandas.DataFrame | polars.DataFrame

to_pandas(**pd_kwargs)#
Return type:

DataFrame

to_polars()#
Return type:

DataFrame

property url_dict: dict [str , dict ]#

Return mapping of alias to URL for all downloads.

Returns:

Dictionary mapping alias -> url (or None when no url is available).

Return type:

dict

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> MGazine(downloads).url_dict['example.txt']
'http://ex/x'
property url_list#

Return a list of all download URLs.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> MGazine(downloads).url_list
['http://ex/x']
property urls: list [str | None ]#

Return a list of all download URLs. Same as url_list.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> MGazine(downloads).urls
['http://ex/x']
class mgnipy.V2.datasets.taxonomic.TaxaMGazine(mgazine, config=None, *, long_short_mapping=None, assemblies_details=None, runs_details=None, samples_details=None, studies_details=None, biosamples_details=None, analyses_details=None)[source]#

Bases: _MGazineSetup

Not for DwC use; see DWCTaxaMGazine instead.

X(df_engine='pandas')[source]#
Parameters:

df_engine (Literal ['polars', 'pandas'])

Return type:

pandas.DataFrame | polars.DataFrame

async adownload(to_dir, alias=None, *, url=None, filename=None, httpx_aclient=None, overwrite=False, hide_progress=False)#

Asynchronously download a file from an alias or URL.

Parameters:
  • to_dir (DirectoryPath) – Directory where the file will be saved.

  • alias (str or None, optional) – Download alias known to this MGazine instance.

  • url (str or None, optional) – Direct URL to fetch. Either alias or url must be provided.

  • filename (str or None, optional) – Filename to use for the saved file. When omitted the alias is used.

  • httpx_aclient (httpx.AsyncClient, optional) – Optional httpx.AsyncClient to use for the HTTP request.

  • overwrite (bool , optional) – If False and the destination file already exists the download is skipped. When True the existing file will be overwritten.

  • hide_progress (bool , optional) – Disable the progress bar when True.

Raises:

ValueError – If neither alias nor url is provided.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> mg = MGazine(downloads)
>>> await mg.adownload("download_to_here", alias="example.txt")  # doctest: +SKIP

async adownload_all(to_dir, overwrite=False, hide_progress=False)#

Asynchronously download all files known to this MGazine.

Parameters:
  • to_dir (DirectoryPath) – Directory where the files will be saved.

  • overwrite (bool , optional) – Passed to adownload to control overwriting behavior.

  • hide_progress (bool , optional) – Disable progress bars when True.

Notes

This helper creates a single async HTTP client and schedules concurrent adownload calls for all aliases.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
...     {"alias": "example2.fasta.gz", "url": "http://ex/x2", "file_type": "fasta"},
... ]
>>> mg = MGazine(downloads)
>>> await mg.adownload_all("download_to_here")
async aenrich_runs(limit=200, hide_progress=False)#

Asynchronously enriches the run metadata for the runs in the taxonomic dataset by iterating through the run accessions and retrieving their details using the RunDetail proxy. The results are cached using the DiskCheckpointer to avoid redundant API calls in future runs.

Parameters:
  • limit (Optional[int ], default=200) – An optional integer to limit the number of runs to enrich. If not provided, it defaults to 200. This is useful for testing or when dealing with large datasets to avoid long runtimes during development. If set to None, there will be no limit on the number of runs enriched.

  • hide_progress (bool , default=False) – Whether to hide the progress bar during enrichment. Defaults to False.

Returns:

The function does not return anything. It updates the run_results attribute of the TaxaMGazine instance with the enriched run metadata.

Return type:

None

property aliases: list [str ]#

Return a list of all download aliases.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> MGazine(downloads).aliases
['example.txt']
property analyses_details: list [dict [str , Any ]]#
append_analyses_details(value)#
Parameters:

value (dict [str , Any ])

append_assemblies_details(value)#
Parameters:

value (dict [str , Any ])

append_biosamples_details(value)#
Parameters:

value (dict [str , Any ])

append_runs_details(value)#
Parameters:

value (dict [str , Any ])

append_samples_details(value)#
Parameters:

value (dict [str , Any ])

append_studies_details(value)#
Parameters:

value (dict [str , Any ])

property assemblies_details: list [dict [str , Any ]]#
property biosamples_details: list [dict [str , Any ]]#
by_pipeline_version()#

Group downloads by pipeline version based on the ‘pipeline_version’ column in the downloads dataframe.

Returns:

A dictionary where keys are pipeline versions and values are lists of download dictionaries.

Return type:

dict

by_short_desc()#

Group downloads by short description based on the ‘short_description’ column in the downloads dataframe.

Returns:

A dictionary where keys are short descriptions and values are lists of download dictionaries.

Return type:

dict

clear_cache()#
download(to_dir, alias=None, *, url=None, filename=None, httpx_client=None, overwrite=False, hide_progress=False)#

Download a file from an alias or URL to a local directory.

Parameters:
  • to_dir (DirectoryPath) – Directory where the file will be saved.

  • alias (str or None, optional) – Download alias known to this MGazine instance. When provided the corresponding URL from the instance’s downloads list is used.

  • url (str or None, optional) – Direct URL to fetch. Either alias or url must be provided.

  • filename (str or None, optional) – Filename to use for the saved file. When omitted the alias is used.

  • httpx_client (httpx.Client, optional) – Optional httpx.Client to use for the HTTP request. If not supplied a temporary client from _mgnifier_helper is used.

  • overwrite (bool , optional) – If False and the destination file already exists the download is skipped. When True the existing file will be overwritten.

  • hide_progress (bool , optional) – Disable the progress bar when True.

Raises:

ValueError – If neither alias nor url is provided.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> mg = MGazine(downloads)
>>> mg.download("download_to_here", alias="example.txt")  # doctest: +SKIP

download_all(to_dir, hide_progress=False, overwrite=False)#

Download all files known to this MGazine instance.

Parameters:
  • to_dir (DirectoryPath) – Directory where the files will be saved.

  • hide_progress (bool , optional) – Disable per-file and overall progress bars when True.

  • overwrite (bool , optional) – Passed to download to control overwriting behavior.

Notes

This helper calls download for each alias present in the instance’s downloads list.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
...     {"alias": "example2.fasta.gz", "url": "http://ex/x2", "file_type": "fasta"},
... ]
>>> mg = MGazine(downloads)
>>> mg.download_all("download_to_here")
downloads_df(**pd_kwargs)#

Return a pandas.DataFrame of all downloads.

The dataframe will contain columns such as alias, url and file_type when those keys exist in the provided download dicts.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> df = MGazine(downloads).downloads_df()
>>> list(df.columns)
['alias', 'url', 'file_type']
Return type:

DataFrame

enrich_biosamples(limit=200, hide_progress=False, incl_ena=True)#

Enriches the biosample metadata for the biosamples in the taxonomic dataset by iterating through the biosample accessions and retrieving their details using the BiosampleDetail proxy. The results are cached using the DiskCheckpointer to avoid redundant API calls in future runs.

Parameters:
  • limit (Optional[int ], default=200) – An optional integer to limit the number of biosamples to enrich. If not provided, it defaults to 200. This is useful for testing or when dealing with large datasets to avoid long runtimes during development. If set to None, there will be no limit on the number of biosamples enriched.

  • hide_progress (bool, default=False) – Whether to hide the progress bar during enrichment. Defaults to False.

  • incl_ena (bool )

Returns:

The function does not return anything. It updates the TaxaMGazine instance in place with the enriched biosample metadata.

Return type:

None

enrich_runs(limit=200, hide_progress=False)#

Enriches the run metadata for the runs in the taxonomic dataset by iterating through the run accessions and retrieving their details using the RunDetail proxy. The results are cached using the DiskCheckpointer to avoid redundant API calls in future runs.

Parameters:
  • limit (Optional[int ], default=200) – An optional integer to limit the number of runs to enrich. If not provided, it defaults to 200. This is useful for testing or when dealing with large datasets to avoid long runtimes during development. If set to None, there will be no limit on the number of runs enriched.

  • hide_progress (bool, default=False) – Whether to hide the progress bar during enrichment. Defaults to False.

Returns:

The function does not return anything. It updates the run_results attribute of the TaxaMGazine instance with the enriched run metadata.

Return type:

None

enrich_samples()#
enrich_studies()#
property lazy_merged: LazyFrame#
list_pipeline_version()#

Return a list of pipeline versions extracted from the download groups.

This looks for patterns like ‘.v4.1’ in the ‘download_group’ field of the downloads and extracts the version number.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt", "download_group": "group.v4.1", "pipeline_version": 'v4_1'},
...     {"alias": "example2.txt", "url": "http://ex/x2", "file_type": "txt", "download_group": "group.v5", "pipeline_version": 'v5'},
... ]
>>> MGazine(downloads).list_pipeline_version()
['v4_1', 'v5']
list_short_descriptions()#

Return a list of short descriptions extracted from the download groups.

This looks for patterns like ‘shortdesc’ in the ‘download_group’ field of the downloads and extracts the short description.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt", "download_group": "group.shortdesc1", "pipeline_version": 4.1, "short_description": "shortdesc1"},
...     {"alias": "example2.txt", "url": "http://ex/x2", "file_type": "txt", "download_group": "group.shortdesc2", "pipeline_version": 4.1, "short_description": "shortdesc2"},
... ]
>>> MGazine(downloads).list_short_descriptions()
['shortdesc1', 'shortdesc2']
load()[source]#

Lazily load and merge the datasets listed in url_list. Call this method after instantiation to set up the internal state and load any cached results.

metadata(df_engine='pandas', strict=False, expand_nested_dicts=True, incl_runs_details=True, incl_samples_details=True, incl_studies_details=True, incl_biosamples_details=True, incl_analyses_details=True, incl_assemblies_details=True)#
Parameters:
  • df_engine (Literal ['polars', 'pandas'])

  • strict (bool )

  • expand_nested_dicts (bool )

  • incl_runs_details (bool )

  • incl_samples_details (bool )

  • incl_studies_details (bool )

  • incl_biosamples_details (bool )

  • incl_analyses_details (bool )

  • incl_assemblies_details (bool )

Return type:

pandas.DataFrame | polars.DataFrame

property runs_accessions: list #
property runs_details: list [dict [str , Any ]]#
property runs_to_samples: dict [str , str ]#
property samples_details: list [dict [str , Any ]]#
stream(*, alias=None, url=None, chunksize=None, max_skip=5, **kwargs)#

Streams a single download based on its alias or url.

If chunksize is specified then iterators of dataframes or strings will be returned; otherwise the full data will be returned as a single object.

Supported formats and their handlers#

Parameters:
  • alias (str or None, optional) – The alias of the download to stream.

  • url (HttpUrl or None, optional) – The URL of the download to stream.

  • chunksize (int or None, optional) – The size of the chunks to read from the stream.

  • max_skip (int, optional) – The maximum number of rows to skip before raising an error. Default is 5.

  • **kwargs – Additional keyword arguments to pass to the streamer function.

Returns:

The streamer result for the resolved alias or url.

Return type:

Any

stream_biom(url, **skbio_kwargs)#

Stream a biom file from a URL using scikit-bio’s read function. See the scikit-bio documentation for details.

Parameters:
  • url (str ) – The URL to the biom file to stream.

  • **skbio_kwargs – Additional keyword arguments passed to skbio.io.read(), such as into and verify.

Returns:

A generator yielding scikit-bio Sequence objects parsed from the biom file.

Return type:

Generator

stream_fasta(url, **skbio_kwargs)#

Stream a FASTA file from a URL using scikit-bio’s read function. See the scikit-bio documentation for details.

Parameters:
  • url (str ) – The URL to the FASTA file to stream.

  • **skbio_kwargs – Additional keyword arguments passed to skbio.io.read(), such as into and verify.

Returns:

A generator yielding scikit-bio Sequence objects parsed from the FASTA file.

Return type:

Generator

stream_gff(url, **skbio_kwargs)#

Stream a GFF file from a URL using scikit-bio’s read function. See the scikit-bio documentation for details.

Parameters:
  • url (str ) – The URL to the GFF file to stream.

  • **skbio_kwargs – Additional keyword arguments passed to skbio.io.read(), such as into and verify.

Returns:

A generator yielding the scikit-bio objects parsed from the GFF file.

Return type:

Generator

stream_gzipped(url, chunksize=None, httpx_client=None, decode=False, encoding='utf-8', errors='replace', **httpx_kwargs)#

Stream a gzipped HTTP resource and present a file-like interface.

When chunksize is None the entire compressed payload is fetched and decompressed into memory. When chunksize is provided a streaming file-like object is returned.

Parameters:
  • url (str )

  • chunksize (int | None)

  • httpx_client (Client | None)

  • decode (bool )

  • encoding (str )

  • errors (str )

Return type:

bytes | str | BufferedReader | TextIOWrapper
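
The eager and streaming modes described above can be sketched with the standard library alone. This is a minimal illustration, not the method's implementation: the helper name open_gzipped is hypothetical, and an in-memory payload stands in for the HTTP response body.

```python
import gzip
import io


def open_gzipped(payload, chunksize=None, decode=False,
                 encoding="utf-8", errors="replace"):
    """Hypothetical helper mirroring the documented return types."""
    if chunksize is None:
        # Eager mode: decompress the whole payload into memory at once.
        data = gzip.decompress(payload)
        return data.decode(encoding, errors) if decode else data
    # Streaming mode: wrap the compressed bytes in a lazy file-like reader.
    raw = gzip.GzipFile(fileobj=io.BytesIO(payload))
    reader = io.BufferedReader(raw, buffer_size=chunksize)
    if decode:
        return io.TextIOWrapper(reader, encoding=encoding, errors=errors)
    return reader


payload = gzip.compress(b"id\tcount\nA\t1\nB\t2\n")
whole = open_gzipped(payload, decode=True)                     # str
lines = list(open_gzipped(payload, chunksize=8, decode=True))  # decoded lines
```

In streaming mode the caller only pays for decompression as data is read, which matters for large compressed files.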

stream_html(url, **web_kwargs)#

Open an HTML URL in the default web browser.

Parameters:
  • url (str ) – The URL to open in the web browser.

  • **web_kwargs – Additional keyword arguments passed to webbrowser.open(), such as new and autoraise.

Returns:

True if the URL was opened successfully, False otherwise.

Return type:

bool

stream_json(url, chunksize=None, httpx_client=None, **httpx_kwargs)#
Parameters:
  • url (str )

  • chunksize (int | None)

  • httpx_client (Client | None)

Return type:

dict | Generator

stream_jsonl(url, orient=None, chunksize=None, dataframe_engine='pandas', **df_kwargs)#
Parameters:
  • url (str )

  • orient (Literal ['records', 'split', 'index', 'columns', 'values', 'table'] | None)

  • chunksize (int | None)

  • dataframe_engine (Literal ['pandas', 'polars'] | None)

Return type:

dict

stream_pandas(url, sep='\t', chunksize=None, max_skip=5, low_memory=False, **pd_kwargs)#

Read a TSV from a URL or local file with resilient header handling.

The helper will retry with increasing skiprows when pandas raises a ParserError (useful for files with extra header lines). When chunksize is provided an iterator is returned.

Parameters:
  • url (str ) – The URL or local file path to read the TSV from.

  • sep (str ) – The delimiter to use (default is tab).

  • chunksize (int or None) – If an integer is provided, returns an iterator that yields DataFrames of that many rows. If None, returns a single DataFrame.

  • max_skip (int ) – The maximum number of lines to skip when trying to parse the TSV.

  • **pd_kwargs – Additional keyword arguments passed to pd.read_csv.

  • low_memory (bool )

Returns:

A DataFrame containing the TSV data, or an iterator yielding DataFrames if chunksize is specified.

Return type:

pd.DataFrame or TextFileReader

Raises:
  • ValueError – If chunksize is not a positive integer or None.

  • RuntimeError – If the TSV cannot be parsed after skipping up to max_skip lines.

  • pandas.errors.ParserError – If the TSV cannot be parsed due to a format error (after retries).
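
The retry-with-increasing-skip idea can be sketched with the standard library alone. The code below is not the method's implementation: the csv module and a hypothetical ParseFailure exception stand in for pd.read_csv and pandas' ParserError, and the ragged-row check is an assumed, simplified failure condition.

```python
import csv
import io


class ParseFailure(Exception):
    """Stand-in for pandas.errors.ParserError in this sketch."""


def parse_tsv(text, skiprows=0, sep="\t"):
    # Treat the first row after skipping as the header; reject ragged rows.
    rows = list(csv.reader(io.StringIO(text), delimiter=sep))[skiprows:]
    if not rows:
        raise ParseFailure("no rows left to parse")
    header, body = rows[0], rows[1:]
    for row in body:
        if len(row) != len(header):
            raise ParseFailure(f"expected {len(header)} fields, saw {len(row)}")
    return [dict(zip(header, row)) for row in body]


def read_with_retries(text, max_skip=5, sep="\t"):
    # Retry with skiprows = 0, 1, ..., max_skip until a parse succeeds.
    for skip in range(max_skip + 1):
        try:
            return parse_tsv(text, skiprows=skip, sep=sep)
        except ParseFailure:
            continue
    raise RuntimeError(f"could not parse after skipping up to {max_skip} lines")


text = "# exported by some tool\n# second banner line\nid\tcount\nA\t1\nB\t2\n"
records = read_with_retries(text)
```

Here the first two attempts fail on the banner lines and the third, with two lines skipped, finds the real header.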

stream_polars(url, sep='\t', chunksize=None, max_skip=5, low_memory=False, **pl_kwargs)#

Read a TSV from a URL or local file into a Polars DataFrame with resilient header handling.

The helper will retry with increasing skip_rows when Polars raises an error (useful for files with extra header lines). When chunksize is provided an iterator is returned.

Parameters:
  • url (str ) – The URL or local file path to read the TSV from.

  • sep (str ) – The delimiter to use (default is tab).

  • chunksize (int or None) – If an integer is provided, returns an iterator that yields DataFrames of that many rows. If None, returns a single DataFrame.

  • max_skip (int ) – The maximum number of lines to skip when trying to parse the TSV.

  • **pl_kwargs – Additional keyword arguments passed to pl.read_csv.

  • low_memory (bool )

Returns:

A Polars DataFrame containing the TSV data, or an iterator yielding DataFrames if chunksize is specified.

Return type:

pl.DataFrame or Iterator[pl.DataFrame]

Raises:
  • ValueError – If chunksize is not a positive integer or None.

  • RuntimeError – If the TSV cannot be parsed after skipping up to max_skip lines.

  • Polars Error – If the TSV cannot be parsed due to a format error (after retries).

stream_tree(url, **skbio_kwargs)#
Parameters:

url (str )

Return type:

Generator

stream_txt(url, chunksize=None, httpx_client=None, **httpx_kwargs)#

Stream a plain-text resource. When chunksize is None the full text is returned as a string. When chunksize is an integer the function yields lists of lines.

Parameters:
  • url (str ) – The URL to stream the text from.

  • chunksize (int or None) – If an integer is provided, yields lists of lines of that size. If None, returns the entire text as a single string.

  • httpx_client (httpx.Client, optional) – An optional httpx.Client to use for the request. If None, a new client will be created for the request.

  • **httpx_kwargs – Additional keyword arguments passed to the httpx.Client.request() method.

Returns:

The full text as a string if chunksize is None, or a generator yielding lists of lines if chunksize is an integer.

Return type:

str or Generator
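
The str-or-generator behaviour can be sketched as follows. This is an illustration under stated assumptions, not the method's code: read_text and _iter_line_chunks are hypothetical names, an in-memory io.StringIO stands in for the HTTP response, and whether the real method keeps trailing newlines on each line is not specified here.

```python
import io
from itertools import islice


def _iter_line_chunks(handle, chunksize):
    # Yield successive lists holding at most `chunksize` lines each.
    while True:
        chunk = list(islice(handle, chunksize))
        if not chunk:
            return
        yield chunk


def read_text(handle, chunksize=None):
    """Hypothetical stand-in for the documented str-or-generator behaviour."""
    if chunksize is None:
        # No chunking requested: return the full text as one string.
        return handle.read()
    return _iter_line_chunks(handle, chunksize)


text = "line1\nline2\nline3\n"
full = read_text(io.StringIO(text))
chunks = list(read_text(io.StringIO(text), chunksize=2))
```

The generator lives in a separate helper because a function containing yield always returns a generator, so it could not also return a plain string.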

property studies_details: list [dict [str , Any ]]#
taxonomic_metadata(fill_na='NA', df_engine='pandas')[source]#
Return type:

pd.DataFrame | pl.DataFrame

to_anndata(**anndata_kwargs)[source]#

Converts the taxonomic metadata to an AnnData object. The taxonomic ranks are stored in the obs attribute of the AnnData object.

Parameters:

**anndata_kwargs – Additional keyword arguments to pass to the AnnData constructor.

Returns:

An AnnData object containing the taxonomic metadata in the obs attribute.

Return type:

ad.AnnData

to_pandas(**pd_kwargs)#
Return type:

DataFrame

to_polars()#
Return type:

DataFrame

property url_dict: dict [str , dict ]#

Return mapping of alias to URL for all downloads.

Returns:

Dictionary mapping alias -> url (or None when no url is available).

Return type:

dict

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> MGazine(downloads).url_dict['example.txt']
'http://ex/x'
property url_list#

Return a list of all download URLs.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> MGazine(downloads).url_list
['http://ex/x']
property urls: list [str | None ]#

Return a list of all download URLs. Same as url_list.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> MGazine(downloads).urls
['http://ex/x']
mgnipy.V2.datasets.taxonomic.prep_obs(df, tax_col, long_short_mapping, fill_na='NA')[source]#

Prepares the taxonomy DataFrame by splitting the taxonomy string into separate columns for each taxonomic rank.

Parameters:
  • df (pl.DataFrame) – A Polars DataFrame containing a column, named by tax_col, with taxonomic classifications in a semicolon-separated format.

  • tax_col (Literal["taxonomy", "#SampleID"]) – The name of the column in the DataFrame that contains the taxonomy string to be split.

  • long_short_mapping (Optional[dict [str , str ]]) – A dictionary mapping the long taxonomic rank names (e.g., “Superkingdom”) to their corresponding short prefixes (e.g., “sk”). This is used to clean the taxonomic rank values by stripping the short prefixes.

  • fill_na (Optional[Any], default="NA") – The value to use for filling empty strings or null values in the taxonomic rank columns after stripping the short prefixes. If not provided, it defaults to “NA”.

Returns:

A Polars DataFrame with separate columns for each taxonomic rank based on the taxonomy ranks defined in the constants.

Return type:

pl.DataFrame
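
To make the splitting step concrete, here is a pure-Python sketch that handles a single taxonomy string. It is not the Polars implementation. The function name split_taxonomy, the double-underscore prefix format (for example sk__Bacteria), the positional matching of ranks to the mapping order, and the three-rank mapping are all assumptions made for illustration.

```python
def split_taxonomy(taxonomy, long_short_mapping, fill_na="NA", sep=";"):
    """Hypothetical pure-Python analogue of splitting one taxonomy string."""
    parts = taxonomy.split(sep)
    row = {}
    # Assumes ranks appear in the same order as the mapping's keys.
    for i, (long_name, short) in enumerate(long_short_mapping.items()):
        value = parts[i] if i < len(parts) else ""
        prefix = f"{short}__"  # assumed prefix format, e.g. "sk__Bacteria"
        if value.startswith(prefix):
            value = value[len(prefix):]
        # Empty or missing ranks are replaced by the fill value.
        row[long_name] = value if value else fill_na
    return row


mapping = {"Superkingdom": "sk", "Kingdom": "k", "Phylum": "p"}
row = split_taxonomy("sk__Bacteria;k__;p__Proteobacteria", mapping)
short_row = split_taxonomy("sk__Archaea", mapping)
```

Ranks that are empty after prefix stripping, or absent from the string altogether, end up as the fill_na value.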