mgnipy.V2.datasets package


class mgnipy.V2.datasets.MGazine(downloads, config=None, *, studies_details=None, analyses_details=None, runs_details=None, samples_details=None, assemblies_details=None, biosamples_details=None)[source]#

Bases: StreamMixin

MGazine is a class for managing and downloading datasets from MGnify.

- Accepts a list of download-like dictionaries (for example, the objects returned by the MGnify API for downloads) and provides simple streaming and download helpers.
- Supports grouping datasets by pipeline version and short description, and provides methods for downloading individual files or all files in the MGazine.

Parameters:

downloads (list[dict]) – List of download descriptors with keys such as alias, url and file_type.
config (MGnipyConfig, optional) – Optional configuration to use; when omitted, the global MGnipyConfig is used.
studies_details (Optional[list[dict[str, Any]]])
analyses_details (Optional[list[dict[str, Any]]])
runs_details (Optional[list[dict[str, Any]]])
samples_details (Optional[list[dict[str, Any]]])
assemblies_details (Optional[list[dict[str, Any]]])
biosamples_details (Optional[list[dict[str, Any]]])

Examples

>>> downloads = [
...     {"alias": "a", "url": "/tmp/a.txt", "file_type": "txt"},
... ]
>>> mg = MGazine(downloads)
>>> isinstance(mg, MGazine)
True
>>> mg.url_dict['a']
'/tmp/a.txt'
>>> mg.url_list
['/tmp/a.txt']

async adownload(to_dir, alias=None, *, url=None, filename=None, httpx_aclient=None, overwrite=False, hide_progress=False)[source]#

Asynchronously download a file from an alias or URL.

Parameters:

to_dir (DirectoryPath) – Directory where the file will be saved.
alias (str or None, optional) – Download alias known to this MGazine instance.
url (str or None, optional) – Direct URL to fetch. Either alias or url must be provided.
filename (str or None, optional) – Filename to use for the saved file. When omitted the alias is used.
httpx_aclient (httpx.AsyncClient, optional) – Optional httpx.AsyncClient to use for the HTTP request.
overwrite (bool, optional) – If False and the destination file already exists, the download is skipped. When True, the existing file is overwritten.
hide_progress (bool, optional) – Disable the progress bar when True.

Raises:

ValueError – If neither alias nor url is provided.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> mg = MGazine(downloads)
>>> await mg.adownload("download_to_here", alias="example.txt")  # doctest: +SKIP
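
The alias-or-URL rule described above can be sketched in plain Python. This is a hypothetical helper, not mgnipy's implementation: `resolve_url` and its `url_dict` argument are illustrative names, and giving an explicit `url` precedence over `alias` is an assumption.

```python
from typing import Optional


def resolve_url(url_dict: dict, alias: Optional[str] = None, url: Optional[str] = None) -> str:
    """Return the URL to fetch from either a direct url or an alias lookup."""
    if alias is None and url is None:
        # Mirrors the documented ValueError when neither argument is given.
        raise ValueError("Either alias or url must be provided.")
    if url is not None:
        # Assumption: a direct URL wins when both arguments are supplied.
        return url
    # Unknown aliases raise KeyError here; the real behaviour may differ.
    return url_dict[alias]
```

Calling `resolve_url({"example.txt": "http://ex/x"}, alias="example.txt")` would return the stored URL, while calling it with no alias and no URL raises `ValueError`.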

async adownload_all(to_dir, overwrite=False, hide_progress=False)[source]#

Asynchronously download all files known to this MGazine.

Parameters:

to_dir (DirectoryPath) – Directory where the files will be saved.
overwrite (bool, optional) – Passed to adownload to control overwriting behavior.
hide_progress (bool, optional) – Disable progress bars when True.

Notes

This helper creates a single async HTTP client and schedules concurrent adownload calls for all aliases.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
...     {"alias": "example2.fasta.gz", "url": "http://ex/x2", "file_type": "fasta"},
... ]
>>> mg = MGazine(downloads)
>>> await mg.adownload_all("download_to_here")
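
The pattern described in the Notes (one shared client, many concurrent downloads) might look roughly like the sketch below. It is an illustration only: `fake_download` and `download_all_sketch` are hypothetical names, the shared client is a placeholder object, and no network I/O takes place.

```python
import asyncio


async def fake_download(alias: str, shared_client: object) -> str:
    """Stand-in for one download; yields to the event loop instead of fetching."""
    await asyncio.sleep(0)
    return alias


async def download_all_sketch(aliases: list) -> list:
    # Placeholder for a single async HTTP client shared by every download.
    shared_client = object()
    tasks = [fake_download(alias, shared_client) for alias in aliases]
    # gather schedules the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*tasks)
```

Outside an already running event loop, the coroutine can be driven with `asyncio.run(download_all_sketch(["example.txt", "example2.fasta.gz"]))`.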

property aliases: list [str ]#

Return a list of all download aliases.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> MGazine(downloads).aliases
['example.txt']

property analyses_details: list [dict [str , Any ]] | None #

property assemblies_details: list [dict [str , Any ]] | None #

property biosamples_details: list [dict [str , Any ]] | None #

by_pipeline_version()[source]#

Group downloads by pipeline version based on the ‘pipeline_version’ column in the downloads dataframe.

Returns:

A dictionary where keys are pipeline versions and values are lists of download dictionaries.

Return type:

dict

by_short_desc()[source]#

Group downloads by short description based on the ‘short_description’ column in the downloads dataframe.

Returns:

A dictionary where keys are short descriptions and values are lists of download dictionaries.

Return type:

dict
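
The shape of the value returned by by_pipeline_version() and by_short_desc() can be illustrated with a plain-dict grouping. The documented methods group on a dataframe column; the sketch below uses a hypothetical `group_by` helper over the raw download dictionaries instead, so it shows the result structure rather than the library's code path.

```python
from collections import defaultdict


def group_by(downloads: list, key: str) -> dict:
    """Group download dictionaries by the value stored under key."""
    groups = defaultdict(list)
    for entry in downloads:
        # Entries lacking the key are collected under None (an assumption).
        groups[entry.get(key)].append(entry)
    return dict(groups)
```

For example, `group_by(downloads, "pipeline_version")` maps each version to the list of download dictionaries that carry it.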

download(to_dir, alias=None, *, url=None, filename=None, httpx_client=None, overwrite=False, hide_progress=False)[source]#

Download a file from an alias or URL to a local directory.

Parameters:

to_dir (DirectoryPath) – Directory where the file will be saved.
alias (str or None, optional) – Download alias known to this MGazine instance. When provided the corresponding URL from the instance’s downloads list is used.
url (str or None, optional) – Direct URL to fetch. Either alias or url must be provided.
filename (str or None, optional) – Filename to use for the saved file. When omitted the alias is used.
httpx_client (httpx.Client, optional) – Optional httpx.Client to use for the HTTP request. If not supplied a temporary client from _mgnifier_helper is used.
overwrite (bool, optional) – If False and the destination file already exists, the download is skipped. When True, the existing file is overwritten.
hide_progress (bool, optional) – Disable the progress bar when True.

Raises:

ValueError – If neither alias nor url is provided.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> mg = MGazine(downloads)
>>> mg.download("download_to_here", alias="example.txt")  # doctest: +SKIP
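
The overwrite behaviour documented for download can be sketched without any network access. `save_file` is a hypothetical helper that receives the payload directly; how mgnipy actually writes files (and whether it creates the target directory) is not stated here and is assumed.

```python
from pathlib import Path


def save_file(to_dir, filename: str, payload: bytes, overwrite: bool = False) -> bool:
    """Write payload to to_dir/filename; return False when the write is skipped."""
    dest = Path(to_dir) / filename
    if dest.exists() and not overwrite:
        # Existing file and overwrite=False: skip, as the docs describe.
        return False
    # Assumption: missing directories are created on demand.
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(payload)
    return True
```

A second call with the same filename leaves the file untouched unless `overwrite=True` is passed.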

download_all(to_dir, hide_progress=False, overwrite=False)[source]#

Download all files known to this MGazine instance.

Parameters:

to_dir (DirectoryPath) – Directory where the files will be saved.
hide_progress (bool, optional) – Disable per-file and overall progress bars when True.
overwrite (bool, optional) – Passed to download to control overwriting behavior.

Notes

This helper calls download for each alias present in the instance’s downloads list.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
...     {"alias": "example2.fasta.gz", "url": "http://ex/x2", "file_type": "fasta"},
... ]
>>> mg = MGazine(downloads)
>>> mg.download_all("download_to_here")

downloads_df(**pd_kwargs)[source]#

Return a pandas.DataFrame of all downloads.

The dataframe will contain columns such as alias, url and file_type when those keys exist in the provided download dicts.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> df = MGazine(downloads).downloads_df()
>>> list(df.columns)
['alias', 'url', 'file_type']

Return type:

DataFrame

list_pipeline_version()[source]#

Return a list of pipeline versions extracted from the download groups.

This looks for patterns like ‘.v4.1’ in the ‘download_group’ field of the downloads and extracts the version number.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt", "download_group": "group.v4.1", "pipeline_version": 'v4_1'},
...     {"alias": "example2.txt", "url": "http://ex/x2", "file_type": "txt", "download_group": "group.v5", "pipeline_version": 'v5'},
... ]
>>> MGazine(downloads).list_pipeline_version()
['v4_1', 'v5']
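
Extracting a version tag such as '.v4.1' from a download_group string could be done with a regular expression along these lines. The pattern, the `extract_version` name and the underscore normalisation (chosen to match the 'v4_1' form in the example output) are all assumptions; the library's actual rule may differ.

```python
import re
from typing import Optional

# Matches ".v" followed by dot-separated digits, e.g. ".v4.1" or ".v5".
_VERSION = re.compile(r"\.v(\d+(?:\.\d+)*)")


def extract_version(download_group: str) -> Optional[str]:
    """Return a tag like 'v4_1' found in download_group, or None if absent."""
    match = _VERSION.search(download_group)
    if match is None:
        return None
    # Assumed normalisation: dots inside the version become underscores.
    return "v" + match.group(1).replace(".", "_")
```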

list_short_descriptions()[source]#

Return a list of short descriptions extracted from the download groups.

This looks for patterns like ‘shortdesc’ in the ‘download_group’ field of the downloads and extracts the short description.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt", "download_group": "group.shortdesc1", "pipeline_version": 4.1, "short_description": "shortdesc1"},
...     {"alias": "example2.txt", "url": "http://ex/x2", "file_type": "txt", "download_group": "group.shortdesc2", "pipeline_version": 4.1, "short_description": "shortdesc2"},
... ]
>>> MGazine(downloads).list_short_descriptions()
['shortdesc1', 'shortdesc2']

property runs_details: list [dict [str , Any ]] | None #

property samples_details: list [dict [str , Any ]] | None #

stream(*, alias=None, url=None, chunksize=None, max_skip=5, **kwargs)#

Streams a single download based on its alias or url.

If chunksize is specified then iterators of dataframes or strings will be returned; otherwise the full data will be returned as a single object.

Supported formats and their handlers#

tsv: handled by stream_pandas() (pandas) or stream_polars() (polars). Gzipped TSVs are supported via the gzip/compression options.
csv: handled by stream_pandas() / stream_polars() (sep=",").
txt: handled by stream_txt() (returns full text or yields line chunks).
html: handled by stream_html() (opens URL in browser).
fasta: handled by stream_fasta() (scikit-bio generator).
gff: handled by stream_gff() (scikit-bio generator).
biom: handled by stream_biom() (scikit-bio generator).
gzipped HTTP resources: use stream_gzipped() for a file-like object, or stream_json() for gzipped JSON content.
jsonl / ndjson: handled by stream_jsonl() (pandas or polars modes).
json: handled by stream_json() (returns full JSON or streams via ijson).
tree/newick: handled by stream_tree() (scikit-bio newick reader).
other: if the URL ends with .json it’s streamed via stream_json(); otherwise use the download helper for unsupported binary formats.

Parameters:

alias (str | None) – The alias of the download to stream.
url (HttpUrl | None) – The URL of the download to stream.
chunksize (int | None) – The size of the chunks to read from the stream.
max_skip (int, optional) – The maximum number of rows to skip before raising an error. Default is 5.
**kwargs – Additional keyword arguments to pass to the streamer function.

Returns:

The streamer result for the resolved alias or url.

Return type:

Any
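
The format-to-handler mapping listed above amounts to a dispatch table. The sketch below shows the idea with placeholder handlers that only return labelled strings; `HANDLERS`, `dispatch` and the `ValueError` for unsupported types are hypothetical and do not reproduce how stream() is implemented.

```python
from typing import Optional


def handle_table(url: str) -> str:
    return f"table:{url}"


def handle_text(url: str) -> str:
    return f"text:{url}"


def handle_json(url: str) -> str:
    return f"json:{url}"


# Placeholder registry keyed by file type; the real handlers are the stream_* methods.
HANDLERS = {"tsv": handle_table, "csv": handle_table, "txt": handle_text, "json": handle_json}


def dispatch(url: str, file_type: Optional[str] = None) -> str:
    handler = HANDLERS.get(file_type)
    if handler is None and url.endswith(".json"):
        # Documented fallback: unknown types whose URL ends in .json go to the JSON handler.
        handler = handle_json
    if handler is None:
        # Assumption: unsupported formats are rejected so the caller can download instead.
        raise ValueError(f"Unsupported file type for streaming: {file_type!r}")
    return handler(url)
```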

stream_biom(url, **skbio_kwargs)#

Stream a biom file from a URL using scikit-bio’s read function. Refer there for more info.

Parameters:

url (str) – The URL to the biom file to stream.
**skbio_kwargs – Additional keyword arguments passed to skbio.io.read(), such as into and verify.

Returns:

A generator yielding scikit-bio Sequence objects parsed from the biom file.

Return type:

Generator

stream_fasta(url, **skbio_kwargs)#

Stream a FASTA file from a URL using scikit-bio’s read function. Refer there for more info.

Parameters:

url (str) – The URL to the FASTA file to stream.
**skbio_kwargs – Additional keyword arguments passed to skbio.io.read(), such as into and verify.

Returns:

A generator yielding scikit-bio Sequence objects parsed from the FASTA file.

Return type:

Generator

stream_gff(url, **skbio_kwargs)#

Stream a GFF file from a URL using scikit-bio’s read function. Refer there for more info.

Parameters:

url (str) – The URL to the GFF file to stream.
**skbio_kwargs – Additional keyword arguments passed to skbio.io.read(), such as into and verify.

Returns:

A generator yielding scikit-bio Sequence objects parsed from the GFF file.

Return type:

Generator

stream_gzipped(url, chunksize=None, httpx_client=None, decode=False, encoding='utf-8', errors='replace', **httpx_kwargs)#

Stream a gzipped HTTP resource and present a file-like interface.

When chunksize is None the entire compressed payload is fetched and decompressed into memory. When chunksize is provided a streaming file-like object is returned.

Parameters:

url (str)
chunksize (int | None)
httpx_client (Client | None)
decode (bool)
encoding (str)
errors (str)

Return type:

bytes | str | BufferedReader | TextIOWrapper
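
The two modes described above (decompress everything, or hand back a file-like object) can be sketched with the standard library alone. `open_gzipped` is a hypothetical helper that takes already-fetched bytes instead of a URL, and using chunksize as the buffer size is an assumption about how the parameter is applied.

```python
import gzip
import io


def open_gzipped(payload: bytes, chunksize=None, decode=False,
                 encoding="utf-8", errors="replace"):
    """Return decompressed data, or a file-like reader when chunksize is set."""
    if chunksize is None:
        # Whole payload is decompressed into memory at once.
        data = gzip.decompress(payload)
        return data.decode(encoding, errors) if decode else data
    # Streaming mode: decompression happens lazily as the reader is consumed.
    raw = gzip.GzipFile(fileobj=io.BytesIO(payload))
    reader = io.BufferedReader(raw, buffer_size=chunksize)
    if decode:
        return io.TextIOWrapper(reader, encoding=encoding, errors=errors)
    return reader
```

The four possible results line up with the documented return type: bytes or str in the in-memory case, BufferedReader or TextIOWrapper in the streaming case.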

stream_html(url, **web_kwargs)#

Open an HTML URL in the default web browser.

Parameters:

url (str) – The URL to open in the web browser.
**web_kwargs – Additional keyword arguments passed to webbrowser.open(), such as new and autoraise.

Returns:

True if the URL was opened successfully, False otherwise.

Return type:

bool

stream_json(url, chunksize=None, httpx_client=None, **httpx_kwargs)#

Parameters:

url (str)
chunksize (int | None)
httpx_client (Client | None)

Return type:

dict | Generator

stream_jsonl(url, orient=None, chunksize=None, dataframe_engine='pandas', **df_kwargs)#

Parameters:

url (str)
orient (Literal['records', 'split', 'index', 'columns', 'values', 'table'] | None)
chunksize (int | None)
dataframe_engine (Literal['pandas', 'polars'] | None)

Return type:

dict

stream_pandas(url, sep='\t', chunksize=None, max_skip=5, low_memory=False, **pd_kwargs)#

Read a TSV from a URL or local file with resilient header handling.

The helper will retry with increasing skiprows when pandas raises a ParserError (useful for files with extra header lines). When chunksize is provided an iterator is returned.

Parameters:

url (str) – The URL or local file path to read the TSV from.
sep (str) – The delimiter to use (default is tab).
chunksize (int or None) – If an integer is provided, returns an iterator that yields DataFrames of that many rows. If None, returns a single DataFrame.
max_skip (int) – The maximum number of lines to skip when trying to parse the TSV.
low_memory (bool)
**pd_kwargs – Additional keyword arguments passed to pd.read_csv.

Returns:

A DataFrame containing the TSV data, or an iterator yielding DataFrames if chunksize is specified.

Return type:

pd.DataFrame or TextFileReader

Raises:

ValueError – If chunksize is not a positive integer or None.
RuntimeError – If the TSV cannot be parsed after skipping up to max_skip lines.
Pandas ParserError – If the TSV cannot be parsed due to a format error (after retries).
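
The retry-with-increasing-skip idea can be illustrated without pandas. In the sketch below the standard csv module and a simple row-width check stand in for pd.read_csv and its ParserError, so this is an analogue of the described behaviour rather than the library's logic; `read_tsv_with_retry` is a hypothetical name.

```python
import csv


def read_tsv_with_retry(text: str, sep: str = "\t", max_skip: int = 5):
    """Skip 0..max_skip leading lines until the remaining rows form a consistent table."""
    lines = text.splitlines()
    for skip in range(max_skip + 1):
        rows = list(csv.reader(lines[skip:], delimiter=sep))
        if not rows:
            break
        width = len(rows[0])
        # Stand-in for a parser error: a one-field "header" or ragged rows.
        if width > 1 and all(len(row) == width for row in rows):
            return rows[0], rows[1:]
    raise RuntimeError(f"Could not parse table after skipping up to {max_skip} lines")
```

Given two comment lines followed by a two-column table, the first two attempts are rejected and the third returns the header and data rows; with `max_skip=1` the same input raises `RuntimeError`.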

stream_polars(url, sep='\t', chunksize=None, max_skip=5, low_memory=False, **pl_kwargs)#

Read a TSV from a URL or local file into a Polars DataFrame with resilient header handling.

The helper will retry with increasing skip_rows when Polars raises an error (useful for files with extra header lines). When chunksize is provided an iterator is returned.

Parameters:

url (str) – The URL or local file path to read the TSV from.
sep (str) – The delimiter to use (default is tab).
chunksize (int or None) – If an integer is provided, returns an iterator that yields DataFrames of that many rows. If None, returns a single DataFrame.
max_skip (int) – The maximum number of lines to skip when trying to parse the TSV.
low_memory (bool)
**pl_kwargs – Additional keyword arguments passed to pl.read_csv.

Returns:

A Polars DataFrame containing the TSV data, or an iterator yielding DataFrames if chunksize is specified.

Return type:

pl.DataFrame or Iterator[pl.DataFrame]

Raises:

ValueError – If chunksize is not a positive integer or None.
RuntimeError – If the TSV cannot be parsed after skipping up to max_skip lines.
Polars Error – If the TSV cannot be parsed due to a format error (after retries).

stream_tree(url, **skbio_kwargs)#

Parameters:

url (str)

Return type:

Generator

stream_txt(url, chunksize=None, httpx_client=None, **httpx_kwargs)#

Stream a plain-text resource. When chunksize is None the full text is returned as a string. When chunksize is an integer the function yields lists of lines.

Parameters:

url (str) – The URL to stream the text from.
chunksize (int or None) – If an integer is provided, yields lists of that many lines. If None, returns the entire text as a single string.
httpx_client (httpx.Client, optional) – An optional httpx.Client to use for the request. If None, a new client will be created for the request.
**httpx_kwargs – Additional keyword arguments passed to the httpx.Client.request() method

Returns:

The full text as a string if chunksize is None, or a generator yielding lists of lines if chunksize is an integer.

Return type:

str or Generator
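
Yielding lists of lines in fixed-size chunks, as described for the chunksize mode, can be sketched with itertools. `chunk_lines` is a hypothetical helper operating on any iterable of lines; it does not perform the HTTP request.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunk_lines(lines: Iterable[str], chunksize: int) -> Iterator[List[str]]:
    """Yield successive lists of at most chunksize lines."""
    if chunksize <= 0:
        raise ValueError("chunksize must be a positive integer")
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunksize))
        if not chunk:
            return
        # The final chunk may be shorter than chunksize.
        yield chunk
```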

property studies_details: list [dict [str , Any ]] | None #

property url_dict: dict [str , dict ]#

Return mapping of alias to URL for all downloads.

Returns:

Dictionary mapping alias -> url (or None when no url is available).

Return type:

dict

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> MGazine(downloads).url_dict['example.txt']
'http://ex/x'
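
The alias-to-URL mapping can be pictured as a dictionary comprehension over the download descriptors. `build_url_dict` is a hypothetical stand-in for the property, assuming every descriptor has an alias key and that a missing url maps to None as stated above.

```python
def build_url_dict(downloads: list) -> dict:
    """Map each alias to its URL, using None when a descriptor has no url."""
    return {entry["alias"]: entry.get("url") for entry in downloads}
```

Under the same assumption, the list form returned by url_list corresponds to `list(build_url_dict(downloads).values())`.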

property url_list#

Return a list of all download URLs.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> MGazine(downloads).url_list
['http://ex/x']

property urls: list [str | None ]#

Return a list of all download URLs. Same as url_list.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> MGazine(downloads).urls
['http://ex/x']

Submodules#

mgnipy.V2.datasets.taxonomic module
