mgnipy.V2.datasets package

class mgnipy.V2.datasets.MGazine(downloads, config=None, *, studies_details=None, analyses_details=None, runs_details=None, samples_details=None, assemblies_details=None, biosamples_details=None)

Bases: StreamMixin

MGazine is a class for managing and downloading datasets from MGnify.

  • Accepts a list of download-like dictionaries (for example, the objects returned by the MGnify API for downloads) and provides simple streaming and download helpers.

  • Supports grouping datasets by pipeline version and short description, and provides methods for downloading individual files or all files in the MGazine.

Parameters:
  • downloads (list[dict]) – List of download descriptors with keys such as alias, url and file_type.

  • config (MGnipyConfig, optional) – Optional configuration to use; when omitted the global MGnipyConfig is used.

  • studies_details (Optional[list[dict[str, Any]]])

  • analyses_details (Optional[list[dict[str, Any]]])

  • runs_details (Optional[list[dict[str, Any]]])

  • samples_details (Optional[list[dict[str, Any]]])

  • assemblies_details (Optional[list[dict[str, Any]]])

  • biosamples_details (Optional[list[dict[str, Any]]])

Examples

>>> downloads = [
...     {"alias": "a", "url": "/tmp/a.txt", "file_type": "txt"},
... ]
>>> mg = MGazine(downloads)
>>> isinstance(mg, MGazine)
True
>>> mg.url_dict['a']
'/tmp/a.txt'
>>> mg.url_list
['/tmp/a.txt']

async adownload(to_dir, alias=None, *, url=None, filename=None, httpx_aclient=None, overwrite=False, hide_progress=False)

Asynchronously download a file from an alias or URL.

Parameters:
  • to_dir (DirectoryPath) – Directory where the file will be saved.

  • alias (str or None, optional) – Download alias known to this MGazine instance.

  • url (str or None, optional) – Direct URL to fetch. Either alias or url must be provided.

  • filename (str or None, optional) – Filename to use for the saved file. When omitted the alias is used.

  • httpx_aclient (httpx.AsyncClient, optional) – Optional httpx.AsyncClient to use for the HTTP request.

  • overwrite (bool, optional) – If False and the destination file already exists, the download is skipped. When True, the existing file is overwritten.

  • hide_progress (bool, optional) – Disable the progress bar when True.

Raises:

ValueError – If neither alias nor url is provided.

Examples

>>> downloads = [
...     {
...         "alias": "example.txt",
...         "url": "http://ex/x",
...         "file_type": "txt",
...     }]
>>> mg = MGazine(downloads)
>>> await mg.adownload("download_to_here", alias="example.txt")  # doctest: +SKIP

async adownload_all(to_dir, overwrite=False, hide_progress=False)

Asynchronously download all files known to this MGazine.

Parameters:
  • to_dir (DirectoryPath) – Directory where the files will be saved.

  • overwrite (bool, optional) – Passed to adownload to control overwriting behavior.

  • hide_progress (bool, optional) – Disable progress bars when True.

Notes

This helper creates a single async HTTP client and schedules concurrent adownload calls for all aliases.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
...     {"alias": "example2.fasta.gz", "url": "http://ex/x2", "file_type": "fasta"},
... ]
>>> mg = MGazine(downloads)
>>> await mg.adownload_all("download_to_here")
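
The note above describes one shared client and concurrently scheduled per-alias downloads. The following is a minimal sketch of that scheduling pattern using only asyncio. The names `fake_fetch` and `download_everything` are hypothetical stand-ins, not part of mgnipy, and the network call is replaced by a no-op await.

```python
import asyncio


async def fake_fetch(shared_client: dict, alias: str, url: str) -> str:
    """Hypothetical stand-in for a single download; no network access."""
    shared_client["requests"] += 1  # every task reuses the same client object
    await asyncio.sleep(0)          # yield control, as real I/O would
    return f"{alias} <- {url}"


async def download_everything(url_by_alias: dict[str, str]) -> list[str]:
    """Create one shared client, then run all downloads concurrently."""
    shared_client = {"requests": 0}  # placeholder for an async HTTP client
    tasks = [
        fake_fetch(shared_client, alias, url)
        for alias, url in url_by_alias.items()
    ]
    # gather() returns results in the order the awaitables were passed in
    return await asyncio.gather(*tasks)


results = asyncio.run(
    download_everything(
        {"example.txt": "http://ex/x", "example2.fasta.gz": "http://ex/x2"}
    )
)
```

Sharing one client across tasks avoids opening a new connection pool per file, which is presumably why the helper creates a single client up front.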

property aliases: list[str]

Return a list of all download aliases.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> MGazine(downloads).aliases
['example.txt']

property analyses_details: list[dict[str, Any]] | None
property assemblies_details: list[dict[str, Any]] | None
property biosamples_details: list[dict[str, Any]] | None
by_pipeline_version()

Group downloads by pipeline version based on the ‘pipeline_version’ column in the downloads dataframe.

Returns:

A dictionary where keys are pipeline versions and values are lists of download dictionaries.

Return type:

dict
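
As an illustration of the returned shape, here is a minimal sketch that groups plain download dictionaries by one of their keys. It works on a list of dicts rather than the downloads dataframe, `group_by_key` is a hypothetical helper, and placing entries without the key under None is an assumption of this sketch.

```python
from collections import defaultdict


def group_by_key(downloads: list[dict], key: str) -> dict:
    """Group download dicts by the value stored under `key`."""
    groups = defaultdict(list)
    for item in downloads:
        groups[item.get(key)].append(item)  # a missing key is grouped under None
    return dict(groups)


downloads = [
    {"alias": "example.txt", "url": "http://ex/x", "pipeline_version": "v4_1"},
    {"alias": "example2.txt", "url": "http://ex/x2", "pipeline_version": "v5"},
    {"alias": "example3.txt", "url": "http://ex/x3", "pipeline_version": "v5"},
]
grouped = group_by_key(downloads, "pipeline_version")
```

The same helper called with "short_description" would produce the grouping described for by_short_desc.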

by_short_desc()

Group downloads by short description based on the ‘short_description’ column in the downloads dataframe.

Returns:

A dictionary where keys are short descriptions and values are lists of download dictionaries.

Return type:

dict

download(to_dir, alias=None, *, url=None, filename=None, httpx_client=None, overwrite=False, hide_progress=False)

Download a file from an alias or URL to a local directory.

Parameters:
  • to_dir (DirectoryPath) – Directory where the file will be saved.

  • alias (str or None, optional) – Download alias known to this MGazine instance. When provided the corresponding URL from the instance’s downloads list is used.

  • url (str or None, optional) – Direct URL to fetch. Either alias or url must be provided.

  • filename (str or None, optional) – Filename to use for the saved file. When omitted the alias is used.

  • httpx_client (httpx.Client, optional) – Optional httpx.Client to use for the HTTP request. If not supplied a temporary client from _mgnifier_helper is used.

  • overwrite (bool, optional) – If False and the destination file already exists, the download is skipped. When True, the existing file is overwritten.

  • hide_progress (bool, optional) – Disable the progress bar when True.

Raises:

ValueError – If neither alias nor url is provided.

Examples

>>> downloads = [
...     {
...         "alias": "example.txt",
...         "url": "http://ex/x",
...         "file_type": "txt",
...     }]
>>> mg = MGazine(downloads)
>>> mg.download("download_to_here", alias="example.txt")  # doctest: +SKIP
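
The parameter rules above (alias or url required, filename defaulting to the alias, and overwrite controlling whether an existing file is skipped) can be sketched without any network access. `resolve_target` and `should_skip` are hypothetical helpers, not mgnipy functions, and falling back to the last URL segment when only a url is given is an assumption of this sketch.

```python
import tempfile
from pathlib import Path


def resolve_target(to_dir, url_by_alias, alias=None, url=None, filename=None):
    """Return (url, destination path) following the documented rules."""
    if alias is None and url is None:
        raise ValueError("Either alias or url must be provided.")
    if alias is not None:
        url = url_by_alias[alias]  # a known alias selects its stored URL
    # Documented default: the alias is used as the filename. Using the last
    # URL segment when no alias is given is an assumption of this sketch.
    name = filename or alias or url.rstrip("/").rsplit("/", 1)[-1]
    return url, Path(to_dir) / name


def should_skip(destination: Path, overwrite: bool) -> bool:
    """Skip when the file already exists and overwrite is False."""
    return destination.exists() and not overwrite


with tempfile.TemporaryDirectory() as tmp:
    url, dest = resolve_target(tmp, {"example.txt": "http://ex/x"}, alias="example.txt")
    first = should_skip(dest, overwrite=False)   # nothing on disk yet
    dest.write_text("payload")
    second = should_skip(dest, overwrite=False)  # the file exists now
    third = should_skip(dest, overwrite=True)    # overwrite forces a re-download
```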

download_all(to_dir, hide_progress=False, overwrite=False)

Download all files known to this MGazine instance.

Parameters:
  • to_dir (DirectoryPath) – Directory where the files will be saved.

  • hide_progress (bool, optional) – Disable per-file and overall progress bars when True.

  • overwrite (bool, optional) – Passed to download to control overwriting behavior.

Notes

This helper calls download for each alias present in the instance’s downloads list.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
...     {"alias": "example2.fasta.gz", "url": "http://ex/x2", "file_type": "fasta"},
... ]
>>> mg = MGazine(downloads)
>>> mg.download_all("download_to_here")

downloads_df(**pd_kwargs)

Return a pandas.DataFrame of all downloads.

The dataframe will contain columns such as alias, url and file_type when those keys exist in the provided download dicts.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> df = MGazine(downloads).downloads_df()
>>> list(df.columns)
['alias', 'url', 'file_type']
Return type:

DataFrame

list_pipeline_version()

Return a list of pipeline versions extracted from the download groups.

This looks for patterns like ‘.v4.1’ in the ‘download_group’ field of the downloads and extracts the version number.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt", "download_group": "group.v4.1", "pipeline_version": 'v4_1'},
...     {"alias": "example2.txt", "url": "http://ex/x2", "file_type": "txt", "download_group": "group.v5", "pipeline_version": 'v5'},
... ]
>>> MGazine(downloads).list_pipeline_version()
['v4_1', 'v5']
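
For readers unfamiliar with this kind of extraction, the following sketch pulls a version out of a download_group string with a regular expression. The pattern and the helper name `extract_version` are assumptions for illustration only; the library's actual pattern and output format may differ (the doctest above returns values such as 'v4_1').

```python
import re
from typing import Optional

# Assumed pattern: a literal ".v" followed by dot-separated integers, e.g. ".v4.1"
VERSION_PATTERN = re.compile(r"\.v(\d+(?:\.\d+)*)")


def extract_version(download_group: str) -> Optional[str]:
    """Return the version digits found in `download_group`, or None."""
    match = VERSION_PATTERN.search(download_group)
    return match.group(1) if match else None


versions = [
    extract_version(group)
    for group in ("group.v4.1", "group.v5", "group.shortdesc1")
]
```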

list_short_descriptions()

Return a list of short descriptions extracted from the download groups.

This looks for patterns like ‘shortdesc’ in the ‘download_group’ field of the downloads and extracts the short description.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt", "download_group": "group.shortdesc1", "pipeline_version": 4.1, "short_description": "shortdesc1"},
...     {"alias": "example2.txt", "url": "http://ex/x2", "file_type": "txt", "download_group": "group.shortdesc2", "pipeline_version": 4.1, "short_description": "shortdesc2"},
... ]
>>> MGazine(downloads).list_short_descriptions()
['shortdesc1', 'shortdesc2']

property runs_details: list[dict[str, Any]] | None
property samples_details: list[dict[str, Any]] | None
stream(*, alias=None, url=None, chunksize=None, max_skip=5, **kwargs)

Streams a single download based on its alias or url.

If chunksize is specified then iterators of dataframes or strings will be returned; otherwise the full data will be returned as a single object.

Supported formats and their handlers

Parameters:
  • alias (str | None) – The alias of the download to stream.

  • url (HttpUrl | None) – The URL of the download to stream.

  • chunksize (int | None) – The size of the chunks to read from the stream.

  • max_skip (int, optional) – The maximum number of rows to skip before raising an error. Default is 5.

  • **kwargs – Additional keyword arguments to pass to the streamer function.

Returns:

The streamer result for the resolved alias or url.

Return type:

Any

stream_biom(url, **skbio_kwargs)

Stream a biom file from a URL using scikit-bio’s read function. See the scikit-bio documentation for details.

Parameters:
  • url (str) – The URL to the biom file to stream.

  • **skbio_kwargs – Additional keyword arguments passed to skbio.io.read(), such as into and verify.

Returns:

A generator yielding scikit-bio Sequence objects parsed from the biom file.

Return type:

Generator

stream_fasta(url, **skbio_kwargs)

Stream a FASTA file from a URL using scikit-bio’s read function. See the scikit-bio documentation for details.

Parameters:
  • url (str) – The URL to the FASTA file to stream.

  • **skbio_kwargs – Additional keyword arguments passed to skbio.io.read(), such as into and verify.

Returns:

A generator yielding scikit-bio Sequence objects parsed from the FASTA file.

Return type:

Generator

stream_gff(url, **skbio_kwargs)

Stream a GFF file from a URL using scikit-bio’s read function. See the scikit-bio documentation for details.

Parameters:
  • url (str) – The URL to the GFF file to stream.

  • **skbio_kwargs – Additional keyword arguments passed to skbio.io.read(), such as into and verify.

Returns:

A generator yielding scikit-bio Sequence objects parsed from the GFF file.

Return type:

Generator

stream_gzipped(url, chunksize=None, httpx_client=None, decode=False, encoding='utf-8', errors='replace', **httpx_kwargs)

Stream a gzipped HTTP resource and present a file-like interface.

When chunksize is None, the entire compressed payload is fetched and decompressed into memory. When chunksize is provided, a streaming file-like object is returned.

Parameters:
  • url (str)

  • chunksize (int | None)

  • httpx_client (Client | None)

  • decode (bool)

  • encoding (str)

  • errors (str)

Return type:

bytes | str | BufferedReader | TextIOWrapper
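
The two modes described above (whole payload in memory versus a lazy file-like object) can be illustrated with the standard library alone. This is a sketch, not the method's implementation: the compressed payload is built locally instead of being fetched over HTTP, and mapping decode=True to a text wrapper is an assumption based on the return types listed.

```python
import gzip
import io

# Stands in for the body of an HTTP response
payload = gzip.compress(b"line one\nline two\n")

# chunksize=None style: decompress the whole payload into memory at once
all_bytes = gzip.decompress(payload)

# chunksize style: wrap the compressed bytes in a file-like object and read lazily
reader = gzip.GzipFile(fileobj=io.BytesIO(payload))
first_chunk = reader.read(8)

# Text mode (assumed to correspond to decode=True): add a text wrapper on top,
# using the documented defaults encoding='utf-8' and errors='replace'
text_reader = io.TextIOWrapper(
    gzip.GzipFile(fileobj=io.BytesIO(payload)),
    encoding="utf-8",
    errors="replace",
)
first_line = text_reader.readline()
```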

stream_html(url, **web_kwargs)

Open an HTML URL in the default web browser.

Parameters:
  • url (str) – The URL to open in the web browser.

  • **web_kwargs – Additional keyword arguments passed to webbrowser.open(), such as new and autoraise.

Returns:

True if the URL was opened successfully, False otherwise.

Return type:

bool

stream_json(url, chunksize=None, httpx_client=None, **httpx_kwargs)

Parameters:
  • url (str)

  • chunksize (int | None)

  • httpx_client (Client | None)

Return type:

dict | Generator

stream_jsonl(url, orient=None, chunksize=None, dataframe_engine='pandas', **df_kwargs)

Parameters:
  • url (str)

  • orient (Literal['records', 'split', 'index', 'columns', 'values', 'table'] | None)

  • chunksize (int | None)

  • dataframe_engine (Literal['pandas', 'polars'] | None)

Return type:

dict

stream_pandas(url, sep='\t', chunksize=None, max_skip=5, low_memory=False, **pd_kwargs)

Read a TSV from a URL or local file with resilient header handling.

The helper will retry with increasing skiprows when pandas raises a ParserError (useful for files with extra header lines). When chunksize is provided an iterator is returned.

Parameters:
  • url (str) – The URL or local file path to read the TSV from.

  • sep (str) – The delimiter to use (default is tab).

  • chunksize (int or None) – If an integer is provided, returns an iterator that yields DataFrames of that many rows. If None, returns a single DataFrame.

  • max_skip (int) – The maximum number of lines to skip when trying to parse the TSV.

  • **pd_kwargs – Additional keyword arguments passed to pd.read_csv.

  • low_memory (bool)

Returns:

A DataFrame containing the TSV data, or an iterator yielding DataFrames if chunksize is specified.

Return type:

pd.DataFrame or TextFileReader

Raises:
  • ValueError – If chunksize is not a positive integer or None.

  • RuntimeError – If the TSV cannot be parsed after skipping up to max_skip lines.

  • pandas.errors.ParserError – If the TSV cannot be parsed due to a format error (after retries).
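
The retry behavior described above (skip one more leading line after each parse failure, up to max_skip) can be sketched with the standard library. pandas is deliberately not used here: `parse_strict` is a hypothetical strict parser standing in for pd.read_csv, and it raises ValueError where pandas would raise a ParserError.

```python
import csv


def parse_strict(text: str, sep: str, skiprows: int) -> list[list[str]]:
    """Hypothetical strict parser: fail if any row is wider than the first."""
    lines = text.splitlines()[skiprows:]
    rows = list(csv.reader(lines, delimiter=sep))
    if not rows or any(len(row) > len(rows[0]) for row in rows):
        raise ValueError(f"could not parse with skiprows={skiprows}")
    return rows


def read_with_retries(text: str, sep: str = "\t", max_skip: int = 5) -> list[list[str]]:
    """Retry parsing, skipping one more leading line after each failure."""
    for skiprows in range(max_skip + 1):
        try:
            return parse_strict(text, sep, skiprows)
        except ValueError:
            continue
    raise RuntimeError(f"could not parse after skipping up to {max_skip} lines")


# A TSV with one extra comment line above the real header
raw = "# exported by some tool\nid\tcount\nA\t3\nB\t7\n"
table = read_with_retries(raw)
```

The first attempt fails because the comment line has one field while later rows have two; the second attempt, with one line skipped, succeeds.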

stream_polars(url, sep='\t', chunksize=None, max_skip=5, low_memory=False, **pl_kwargs)

Read a TSV from a URL or local file into a Polars DataFrame with resilient header handling.

The helper will retry with increasing skip_rows when Polars raises an error (useful for files with extra header lines). When chunksize is provided an iterator is returned.

Parameters:
  • url (str) – The URL or local file path to read the TSV from.

  • sep (str) – The delimiter to use (default is tab).

  • chunksize (int or None) – If an integer is provided, returns an iterator that yields DataFrames of that many rows. If None, returns a single DataFrame.

  • max_skip (int) – The maximum number of lines to skip when trying to parse the TSV.

  • **pl_kwargs – Additional keyword arguments passed to pl.read_csv.

  • low_memory (bool)

Returns:

A Polars DataFrame containing the TSV data, or an iterator yielding DataFrames if chunksize is specified.

Return type:

pl.DataFrame or Iterator[pl.DataFrame]

Raises:
  • ValueError – If chunksize is not a positive integer or None.

  • RuntimeError – If the TSV cannot be parsed after skipping up to max_skip lines.

  • Polars Error – If the TSV cannot be parsed due to a format error (after retries).

stream_tree(url, **skbio_kwargs)

Parameters:

url (str)

Return type:

Generator

stream_txt(url, chunksize=None, httpx_client=None, **httpx_kwargs)

Stream a plain-text resource. When chunksize is None, the full text is returned as a string. When chunksize is an integer, the function yields lists of lines.

Parameters:
  • url (str) – The URL to stream the text from.

  • chunksize (int or None) – If an integer is provided, yields lists of that many lines. If None, returns the entire text as a single string.

  • httpx_client (httpx.Client, optional) – An optional httpx.Client to use for the request. If None, a new client will be created for the request.

  • **httpx_kwargs – Additional keyword arguments passed to the httpx.Client.request() method

Returns:

The full text as a string if chunksize is None, or a generator yielding lists of lines if chunksize is an integer.

Return type:

str or Generator
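
The chunked mode described above can be sketched as a small generator. `chunk_lines` is a hypothetical helper that operates on lines already in memory; the real method reads them from an HTTP response.

```python
from itertools import islice
from typing import Iterable, Iterator


def chunk_lines(lines: Iterable[str], chunksize: int) -> Iterator[list[str]]:
    """Yield successive lists of at most `chunksize` lines."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunksize))
        if not chunk:
            return  # input exhausted
        yield chunk


chunks = list(chunk_lines("a\nb\nc\nd\ne".splitlines(), 2))
```

The final chunk may be shorter than chunksize when the number of lines is not an exact multiple.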

property studies_details: list[dict[str, Any]] | None
property url_dict: dict[str, dict]

Return mapping of alias to URL for all downloads.

Returns:

Dictionary mapping alias -> url (or None when no url is available).

Return type:

dict

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> MGazine(downloads).url_dict['example.txt']
'http://ex/x'
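
A minimal sketch of how such a mapping can be derived from the download dictionaries, including the None case mentioned above. The second entry, which has no url key, is a hypothetical example, and the comprehension is an illustration rather than the property's actual implementation.

```python
downloads = [
    {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
    {"alias": "no_url.txt", "file_type": "txt"},  # hypothetical entry without a url
]

# alias -> url, with None where the descriptor carries no url
url_by_alias = {item["alias"]: item.get("url") for item in downloads}
url_list = list(url_by_alias.values())
```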

property url_list

Return a list of all download URLs.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> MGazine(downloads).url_list
['http://ex/x']

property urls: list[str | None]

Return a list of all download URLs. Same as url_list.

Examples

>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> MGazine(downloads).urls
['http://ex/x']

Submodules