mgnipy.V2.datasets package#
- class mgnipy.V2.datasets.MGazine(downloads, config=None, *, studies_details=None, analyses_details=None, runs_details=None, samples_details=None, assemblies_details=None, biosamples_details=None)[source]#
Bases: StreamMixin
MGazine is a class for managing and downloading datasets from MGnify.
- Accepts a list of download-like dictionaries (for example, the objects returned by the MGnify API for downloads) and provides simple streaming and download helpers.
- Supports grouping datasets by pipeline version and short description, and provides methods for downloading individual files or all files in the MGazine.
- Parameters:
Examples
>>> downloads = [
...     {"alias": "a", "url": "/tmp/a.txt", "file_type": "txt"},
... ]
>>> mg = MGazine(downloads)
>>> isinstance(mg, MGazine)
True
>>> mg.url_dict['a']
'/tmp/a.txt'
>>> mg.url_list
['/tmp/a.txt']
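The alias-to-URL mapping shown in the example above can be pictured with plain dictionaries. The following is a minimal sketch, not the library's implementation: `build_url_dict` and `build_url_list` are hypothetical helper names, and it assumes every download dict carries an `alias` key while `url` may be missing.

```python
from typing import Optional


def build_url_dict(downloads: list[dict]) -> dict[str, Optional[str]]:
    """Map each alias to its URL; None when the dict has no 'url' key."""
    return {d["alias"]: d.get("url") for d in downloads}


def build_url_list(downloads: list[dict]) -> list[Optional[str]]:
    """Collect URLs in the same order as the input list."""
    return [d.get("url") for d in downloads]


downloads = [
    {"alias": "a", "url": "/tmp/a.txt", "file_type": "txt"},
    {"alias": "b", "file_type": "tsv"},  # hypothetical entry without a URL
]
url_dict = build_url_dict(downloads)
url_list = build_url_list(downloads)
```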
- async adownload(to_dir, alias=None, *, url=None, filename=None, httpx_aclient=None, overwrite=False, hide_progress=False)[source]#
Asynchronously download a file from an alias or URL.
- Parameters:
to_dir (DirectoryPath) – Directory where the file will be saved.
alias (str or None, optional) – Download alias known to this MGazine instance.
url (str or None, optional) – Direct URL to fetch. Either alias or url must be provided.
filename (str or None, optional) – Filename to use for the saved file. When omitted, the alias is used.
httpx_aclient (httpx.AsyncClient, optional) – Optional httpx.AsyncClient to use for the HTTP request.
overwrite (bool, optional) – If False and the destination file already exists, the download is skipped. When True, the existing file is overwritten.
hide_progress (bool, optional) – Disable the progress bar when True.
- Raises:
ValueError – If neither alias nor url is provided.
Examples
>>> downloads = [
...     {
...         "alias": "example.txt",
...         "url": "http://ex/x",
...         "file_type": "txt",
...     }]
>>> mg = MGazine(downloads)
>>> await mg.adownload("download_to_here", alias="example.txt")  # doctest: +SKIP
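The alias/url contract described above (one of the two is required, and the alias doubles as the default filename) can be sketched as follows. `resolve_target` is a hypothetical helper, not part of mgnipy. Two behaviours are assumptions of this sketch only: an explicit `url` wins when both arguments are given, and the last URL path segment is used as the filename when no alias is available.

```python
from typing import Optional


def resolve_target(
    url_dict: dict[str, str],
    alias: Optional[str] = None,
    *,
    url: Optional[str] = None,
    filename: Optional[str] = None,
) -> tuple[str, str]:
    """Return (url, filename) for a download request."""
    if alias is None and url is None:
        raise ValueError("Either alias or url must be provided.")
    if url is None:
        # Look the alias up in the known downloads.
        url = url_dict[alias]
    if filename is None:
        # Assumption: without an alias, fall back to the last URL path segment.
        filename = alias if alias is not None else url.rstrip("/").rsplit("/", 1)[-1]
    return url, filename


known = {"example.txt": "http://ex/x"}
by_alias = resolve_target(known, alias="example.txt")
by_url = resolve_target(known, url="http://ex/data/file.tsv")
```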
- async adownload_all(to_dir, overwrite=False, hide_progress=False)[source]#
Asynchronously download all files known to this MGazine.
- Parameters:
Notes
This helper creates a single async HTTP client and schedules concurrent adownload calls for all aliases.
Examples
>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
...     {"alias": "example2.fasta.gz", "url": "http://ex/x2", "file_type": "fasta"},
... ]
>>> mg = MGazine(downloads)
>>> await mg.adownload_all("download_to_here")
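The scheduling pattern described in the Notes (one shared client, one concurrent task per alias) can be sketched with `asyncio.gather`. This is an illustration under stated assumptions: `fake_adownload` stands in for the real `adownload` call and only sleeps, and `shared_client` is a placeholder object rather than an `httpx.AsyncClient`.

```python
import asyncio


async def fake_adownload(client: object, alias: str) -> str:
    """Stand-in for a real download: waits briefly, then reports the alias."""
    await asyncio.sleep(0.01)
    return alias


async def download_all_concurrently(aliases: list[str]) -> list[str]:
    shared_client = object()  # placeholder for a single shared HTTP client
    tasks = [fake_adownload(shared_client, alias) for alias in aliases]
    # gather runs the coroutines concurrently and keeps the input order.
    return await asyncio.gather(*tasks)


finished = asyncio.run(
    download_all_concurrently(["example.txt", "example2.fasta.gz"])
)
```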
- property aliases: list [str ]#
Return a list of all download aliases.
Examples
>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> MGazine(downloads).aliases
['example.txt']
- by_pipeline_version()[source]#
Group downloads by pipeline version based on the ‘pipeline_version’ column in the downloads dataframe.
- Returns:
A dictionary where keys are pipeline versions and values are lists of download dictionaries.
- Return type:
- by_short_desc()[source]#
Group downloads by short description based on the ‘short_description’ column in the downloads dataframe.
- Returns:
A dictionary where keys are short descriptions and values are lists of download dictionaries.
- Return type:
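Both grouping methods return a dictionary of lists keyed by a field of the download records. A minimal sketch of that shape, using plain dicts instead of the downloads dataframe, might look like this; `group_by` is a hypothetical helper and the sample records are made up for illustration.

```python
from collections import defaultdict


def group_by(downloads: list[dict], key: str) -> dict:
    """Group download dicts by the value stored under `key`."""
    groups = defaultdict(list)
    for download in downloads:
        groups[download.get(key)].append(download)
    return dict(groups)


downloads = [
    {"alias": "a.txt", "pipeline_version": "v4_1", "short_description": "shortdesc1"},
    {"alias": "b.txt", "pipeline_version": "v4_1", "short_description": "shortdesc2"},
    {"alias": "c.txt", "pipeline_version": "v5", "short_description": "shortdesc1"},
]
by_version = group_by(downloads, "pipeline_version")
by_desc = group_by(downloads, "short_description")
```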
- download(to_dir, alias=None, *, url=None, filename=None, httpx_client=None, overwrite=False, hide_progress=False)[source]#
Download a file from an alias or URL to a local directory.
- Parameters:
to_dir (DirectoryPath) – Directory where the file will be saved.
alias (str or None, optional) – Download alias known to this MGazine instance. When provided, the corresponding URL from the instance's downloads list is used.
url (str or None, optional) – Direct URL to fetch. Either alias or url must be provided.
filename (str or None, optional) – Filename to use for the saved file. When omitted, the alias is used.
httpx_client (httpx.Client, optional) – Optional httpx.Client to use for the HTTP request. If not supplied, a temporary client from _mgnifier_helper is used.
overwrite (bool, optional) – If False and the destination file already exists, the download is skipped. When True, the existing file is overwritten.
hide_progress (bool, optional) – Disable the progress bar when True.
- Raises:
ValueError – If neither alias nor url is provided.
Examples
>>> downloads = [
...     {
...         "alias": "example.txt",
...         "url": "http://ex/x",
...         "file_type": "txt",
...     }]
>>> mg = MGazine(downloads)
>>> mg.download("download_to_here", alias="example.txt")  # doctest: +SKIP
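The overwrite semantics documented above (skip when the file exists, replace it when overwrite is True) can be sketched without any network access. `save_unless_exists` is a hypothetical helper, and the `fetch` callable stands in for the HTTP request. The directory creation step is an assumption of this sketch.

```python
import tempfile
from pathlib import Path
from typing import Callable


def save_unless_exists(
    to_dir: str, filename: str, fetch: Callable[[], bytes], overwrite: bool = False
) -> bool:
    """Write fetch() to to_dir/filename; return False when the write is skipped."""
    dest = Path(to_dir) / filename
    if dest.exists() and not overwrite:
        return False  # existing file kept, nothing fetched
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(fetch())
    return True


with tempfile.TemporaryDirectory() as tmp:
    first = save_unless_exists(tmp, "example.txt", lambda: b"v1")
    second = save_unless_exists(tmp, "example.txt", lambda: b"v2")
    kept = (Path(tmp) / "example.txt").read_bytes()
    third = save_unless_exists(tmp, "example.txt", lambda: b"v2", overwrite=True)
    replaced = (Path(tmp) / "example.txt").read_bytes()
```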
- download_all(to_dir, hide_progress=False, overwrite=False)[source]#
Download all files known to this MGazine instance.
- Parameters:
Notes
This helper calls download for each alias present in the instance’s downloads list.
Examples
>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
...     {"alias": "example2.fasta.gz", "url": "http://ex/x2", "file_type": "fasta"},
... ]
>>> mg = MGazine(downloads)
>>> mg.download_all("download_to_here")
- downloads_df(**pd_kwargs)[source]#
Return a pandas.DataFrame of all downloads.
The dataframe will contain columns such as alias, url and file_type when those keys exist in the provided download dicts.
Examples
>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> df = MGazine(downloads).downloads_df()
>>> list(df.columns)
['alias', 'url', 'file_type']
- Return type:
DataFrame
- list_pipeline_version()[source]#
Return a list of pipeline versions extracted from the download groups.
This looks for patterns like ‘.v4.1’ in the ‘download_group’ field of the downloads and extracts the version number.
Examples
>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt", "download_group": "group.v4.1", "pipeline_version": 'v4_1'},
...     {"alias": "example2.txt", "url": "http://ex/x2", "file_type": "txt", "download_group": "group.v5", "pipeline_version": 'v5'},
... ]
>>> MGazine(downloads).list_pipeline_version()
['v4_1', 'v5']
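The pattern matching described above can be illustrated with a regular expression. This is only a sketch of the idea: the pattern and the `extract_pipeline_version` helper are assumptions, and the library may instead read the `pipeline_version` field directly, as the example records suggest.

```python
import re
from typing import Optional

# Assumed pattern: a literal ".v" followed by dot-separated digits, e.g. ".v4.1".
_VERSION_RE = re.compile(r"\.v(\d+(?:\.\d+)*)")


def extract_pipeline_version(download_group: str) -> Optional[str]:
    """Turn 'group.v4.1' into 'v4_1'; return None when no version is present."""
    match = _VERSION_RE.search(download_group)
    if match is None:
        return None
    return "v" + match.group(1).replace(".", "_")


versions = [
    extract_pipeline_version(group)
    for group in ("group.v4.1", "group.v5", "group")
]
```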
- list_short_descriptions()[source]#
Return a list of short descriptions extracted from the download groups.
This looks for patterns like ‘shortdesc’ in the ‘download_group’ field of the downloads and extracts the short description.
Examples
>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt", "download_group": "group.shortdesc1", "pipeline_version": 4.1, "short_description": "shortdesc1"},
...     {"alias": "example2.txt", "url": "http://ex/x2", "file_type": "txt", "download_group": "group.shortdesc2", "pipeline_version": 4.1, "short_description": "shortdesc2"},
... ]
>>> MGazine(downloads).list_short_descriptions()
['shortdesc1', 'shortdesc2']
- stream(*, alias=None, url=None, chunksize=None, max_skip=5, **kwargs)#
Streams a single download based on its alias or url.
If chunksize is specified, iterators of dataframes or strings are returned; otherwise the full data is returned as a single object.
Supported formats and their handlers#
tsv: handled by stream_pandas() (pandas) or stream_polars() (polars). Gzipped TSVs are supported via the gzip/compression options.
csv: handled by stream_pandas() / stream_polars() (sep=",").
txt: handled by stream_txt() (returns full text or yields line chunks).
html: handled by stream_html() (opens URL in browser).
fasta: handled by stream_fasta() (scikit-bio generator).
gff: handled by stream_gff() (scikit-bio generator).
biom: handled by stream_biom() (scikit-bio generator).
gzipped HTTP resources: use stream_gzipped() for a file-like object, or stream_json() for gzipped JSON content.
jsonl / ndjson: handled by stream_jsonl() (pandas or polars modes).
json: handled by stream_json() (returns full JSON or streams via ijson).
tree/newick: handled by stream_tree() (scikit-bio newick reader).
other: if the URL ends with .json, it is streamed via stream_json(); otherwise use the download helper for unsupported binary formats.
- Parameters:
alias (Optional[str]) – The alias of the download to stream.
url (Optional[HttpUrl]) – The url of the download to stream.
chunksize (Optional[int]) – The size of the chunks to read from the stream.
max_skip (int, optional) – The maximum number of rows to skip before raising an error. Default is 5.
**kwargs – Additional keyword arguments to pass to the streamer function.
- Returns:
The streamer result for the resolved alias or url.
- Return type:
Any
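The format-to-handler routing listed for stream() can be pictured as a lookup table with a URL-suffix fallback. The sketch below returns handler names as strings purely for illustration. `HANDLERS` and `pick_handler` are hypothetical, the choice of the pandas handler for tsv/csv is an assumption, and the real dispatch logic may differ.

```python
from typing import Optional

# Handler names as listed in the stream() documentation, keyed by file type.
HANDLERS = {
    "tsv": "stream_pandas",
    "csv": "stream_pandas",
    "txt": "stream_txt",
    "html": "stream_html",
    "fasta": "stream_fasta",
    "gff": "stream_gff",
    "biom": "stream_biom",
    "jsonl": "stream_jsonl",
    "ndjson": "stream_jsonl",
    "json": "stream_json",
    "tree": "stream_tree",
    "newick": "stream_tree",
}


def pick_handler(file_type: Optional[str], url: str) -> Optional[str]:
    """Return the handler name for a file type, or None when unsupported."""
    handler = HANDLERS.get((file_type or "").lower())
    if handler is not None:
        return handler
    if url.endswith(".json"):
        return "stream_json"  # documented fallback for other types
    return None  # caller should fall back to the download helper


tsv_handler = pick_handler("tsv", "http://ex/x.tsv")
json_fallback = pick_handler("other", "http://ex/x.json")
unsupported = pick_handler("other", "http://ex/x.bin")
```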
- stream_biom(url, **skbio_kwargs)#
Stream a biom file from a URL using scikit-bio’s read function. Refer there for more info.
- Parameters:
url (str ) – The URL to the biom file to stream.
**skbio_kwargs – Additional keyword arguments passed to skbio.io.read(), such as into and verify.
- Returns:
A generator yielding the scikit-bio objects parsed from the biom file (the exact type depends on the into argument).
- Return type:
Generator
- stream_fasta(url, **skbio_kwargs)#
Stream a FASTA file from a URL using scikit-bio’s read function. Refer there for more info.
- Parameters:
url (str ) – The URL to the FASTA file to stream.
**skbio_kwargs – Additional keyword arguments passed to skbio.io.read(), such as into and verify.
- Returns:
A generator yielding scikit-bio Sequence objects parsed from the FASTA file.
- Return type:
Generator
- stream_gff(url, **skbio_kwargs)#
Stream a GFF file from a URL using scikit-bio’s read function. Refer there for more info.
- Parameters:
url (str ) – The URL to the GFF file to stream.
**skbio_kwargs – Additional keyword arguments passed to skbio.io.read(), such as into and verify.
- Returns:
A generator yielding the scikit-bio objects parsed from the GFF file (the exact type depends on the into argument).
- Return type:
Generator
- stream_gzipped(url, chunksize=None, httpx_client=None, decode=False, encoding='utf-8', errors='replace', **httpx_kwargs)#
Stream a gzipped HTTP resource and present a file-like interface.
When chunksize is None, the entire compressed payload is fetched and decompressed into memory. When chunksize is provided, a streaming file-like object is returned.
- Parameters:
- Return type:
bytes | str | BufferedReader | TextIOWrapper
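The two modes and the four return types can be illustrated with the standard library alone. In this sketch an in-memory `io.BytesIO` stands in for the HTTP response body, so no request is made. It shows the general gzip technique and is not a copy of the method's implementation.

```python
import gzip
import io

# Pretend this compressed payload came from an HTTP response.
payload = gzip.compress("line one\nline two\n".encode("utf-8"))

# Analogue of chunksize=None: decompress everything into memory.
full_bytes = gzip.decompress(payload)  # bytes
full_text = full_bytes.decode("utf-8", errors="replace")  # str when decoding

# Analogue of the streaming mode: wrap the compressed stream in file-like objects.
raw = io.BytesIO(payload)
binary_reader = io.BufferedReader(gzip.GzipFile(fileobj=raw))
text_reader = io.TextIOWrapper(binary_reader, encoding="utf-8", errors="replace")

# Lines are decompressed lazily as the reader is iterated.
lines = [line.rstrip("\n") for line in text_reader]
```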
- stream_html(url, **web_kwargs)#
Open an HTML URL in the default web browser.
- stream_json(url, chunksize=None, httpx_client=None, **httpx_kwargs)#
- stream_jsonl(url, orient=None, chunksize=None, dataframe_engine='pandas', **df_kwargs)#
- stream_pandas(url, sep='\t', chunksize=None, max_skip=5, low_memory=False, **pd_kwargs)#
Read a TSV from a URL or local file with resilient header handling.
The helper will retry with increasing skiprows when pandas raises a ParserError (useful for files with extra header lines). When chunksize is provided, an iterator is returned.
- Parameters:
url (str) – The URL or local file path to read the TSV from.
sep (str) – The delimiter to use (default is tab).
chunksize (int or None) – If an integer is provided, returns an iterator that yields DataFrames of that many rows. If None, returns a single DataFrame.
max_skip (int) – The maximum number of lines to skip when trying to parse the TSV.
**pd_kwargs – Additional keyword arguments passed to pd.read_csv.
low_memory (bool)
- Returns:
A DataFrame containing the TSV data, or an iterator yielding DataFrames if chunksize is specified.
- Return type:
pd.DataFrame or TextFileReader
- Raises:
ValueError – If chunksize is not a positive integer or None.
RuntimeError – If the TSV cannot be parsed after skipping up to max_skip lines.
Pandas ParserError – If the TSV cannot be parsed due to a format error (after retries).
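The retry-with-increasing-skip idea can be shown without pandas. In the sketch below, `strict_parse` and `ParseError` are stand-ins for `pd.read_csv` and its `ParserError`. The stand-in is stricter than pandas (it rejects any row whose width differs from the first row), and `read_with_retries` is a hypothetical name.

```python
import csv


class ParseError(ValueError):
    """Stand-in for pandas.errors.ParserError in this sketch."""


def strict_parse(text: str, sep: str = "\t", skiprows: int = 0) -> list[list[str]]:
    """Parse delimited text; fail if any row's width differs from the first row."""
    rows = list(csv.reader(text.splitlines()[skiprows:], delimiter=sep))
    if not rows:
        raise ParseError("no rows left to parse")
    width = len(rows[0])
    for row in rows[1:]:
        if len(row) != width:
            raise ParseError(f"expected {width} fields, saw {len(row)}")
    return rows


def read_with_retries(text: str, sep: str = "\t", max_skip: int = 5) -> list[list[str]]:
    """Retry parsing, skipping one more leading line after each failure."""
    for skip in range(max_skip + 1):
        try:
            return strict_parse(text, sep=sep, skiprows=skip)
        except ParseError:
            continue
    raise RuntimeError(f"could not parse after skipping up to {max_skip} lines")


text = "# generated by pipeline\n# second comment\nid\tcount\nA\t1\nB\t2\n"
rows = read_with_retries(text)
```

Here the first two attempts fail because the comment lines have one field while the real header has two; the third attempt, with two lines skipped, succeeds.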
- stream_polars(url, sep='\t', chunksize=None, max_skip=5, low_memory=False, **pl_kwargs)#
Read a TSV from a URL or local file into a Polars DataFrame with resilient header handling.
The helper will retry with increasing skip_rows when Polars raises an error (useful for files with extra header lines). When chunksize is provided, an iterator is returned.
- Parameters:
url (str) – The URL or local file path to read the TSV from.
sep (str) – The delimiter to use (default is tab).
chunksize (int or None) – If an integer is provided, returns an iterator that yields DataFrames of that many rows. If None, returns a single DataFrame.
max_skip (int) – The maximum number of lines to skip when trying to parse the TSV.
**pl_kwargs – Additional keyword arguments passed to pl.read_csv.
low_memory (bool)
- Returns:
A Polars DataFrame containing the TSV data, or an iterator yielding DataFrames if chunksize is specified.
- Return type:
pl.DataFrame or Iterator[pl.DataFrame]
- Raises:
ValueError – If chunksize is not a positive integer or None.
RuntimeError – If the TSV cannot be parsed after skipping up to max_skip lines.
Polars Error – If the TSV cannot be parsed due to a format error (after retries).
- stream_txt(url, chunksize=None, httpx_client=None, **httpx_kwargs)#
Stream a plain-text resource. When chunksize is None, the full text is returned as a string. When chunksize is an integer, the function yields lists of lines.
- Parameters:
url (str ) – The URL to stream the text from.
chunksize (int or None) – If an integer is provided, yields lists of lines of that size. If None, returns the entire text as a single string.
httpx_client (httpx.Client, optional) – An optional httpx.Client to use for the request. If None, a new client will be created for the request.
**httpx_kwargs – Additional keyword arguments passed to the httpx.Client.request() method
- Returns:
The full text as a string if chunksize is None, or a generator yielding lists of lines if chunksize is an integer.
- Return type:
str or Generator
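The "lists of lines" behaviour for an integer chunksize can be sketched with `itertools.islice`. `chunk_lines` is a hypothetical helper operating on an in-memory string, not on an HTTP stream, and the final chunk may be shorter than chunksize.

```python
from itertools import islice
from typing import Iterable, Iterator


def chunk_lines(lines: Iterable[str], chunksize: int) -> Iterator[list[str]]:
    """Yield consecutive lists of at most `chunksize` lines."""
    if chunksize <= 0:
        raise ValueError("chunksize must be a positive integer")
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunksize))
        if not chunk:
            return  # input exhausted
        yield chunk


text = "a\nb\nc\nd\ne\n"
chunks = list(chunk_lines(text.splitlines(), 2))
```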
- property url_dict: dict [str , dict ]#
Return mapping of alias to URL for all downloads.
- Returns:
Dictionary mapping alias -> url (or None when no url is available).
- Return type:
Examples
>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> MGazine(downloads).url_dict['example.txt']
'http://ex/x'
- property url_list#
Return a list of all download URLs.
Examples
>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> MGazine(downloads).url_list
['http://ex/x']