mgnipy.V2.datasets.taxonomic module#
- class mgnipy.V2.datasets.taxonomic.DWCTaxaMGazine(mgazine, config=None, *, long_short_mapping=None, assemblies_details=None, runs_details=None, samples_details=None, studies_details=None, biosamples_details=None, analyses_details=None)[source]#
Bases: _MGazineSetup
- Parameters:
mgazine (MGazine)
config (Optional[MGnipyConfig])
- async adownload(to_dir, alias=None, *, url=None, filename=None, httpx_aclient=None, overwrite=False, hide_progress=False)#
Asynchronously download a file from an alias or URL.
- Parameters:
to_dir (DirectoryPath) – Directory where the file will be saved.
alias (str or None, optional) – Download alias known to this MGazine instance.
url (str or None, optional) – Direct URL to fetch. Either alias or url must be provided.
filename (str or None, optional) – Filename to use for the saved file. When omitted the alias is used.
httpx_aclient (httpx.AsyncClient, optional) – Optional httpx.AsyncClient to use for the HTTP request.
overwrite (bool, optional) – If False and the destination file already exists the download is skipped. When True the existing file will be overwritten.
hide_progress (bool, optional) – Disable the progress bar when True.
- Raises:
ValueError – If neither alias nor url is provided.
Examples
>>> downloads = [
...     {
...         "alias": "example.txt",
...         "url": "http://ex/x",
...         "file_type": "txt",
...     }]
>>> mg = MGazine(downloads)
>>> await mg.adownload("download_to_here", alias="example.txt")  # doctest: +SKIP
- async adownload_all(to_dir, overwrite=False, hide_progress=False)#
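Asynchronously download all files known to this MGazine.
- Parameters:
to_dir (DirectoryPath) – Directory where the files will be saved.
overwrite (bool, optional) – If False, files that already exist in to_dir are skipped. When True existing files are overwritten.
hide_progress (bool, optional) – Disable the progress bar when True.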
Notes
This helper creates a single async HTTP client and schedules concurrent adownload calls for all aliases.
Examples
>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
...     {"alias": "example2.fasta.gz", "url": "http://ex/x2", "file_type": "fasta"},
... ]
>>> mg = MGazine(downloads)
>>> await mg.adownload_all("download_to_here")
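The note on adownload_all describes one shared async client with concurrently scheduled downloads. A minimal sketch of that pattern using only asyncio follows; FakeClient, fetch_one and fetch_all are hypothetical stand-ins written for illustration, not mgnipy or httpx APIs, and no network access takes place.

```python
import asyncio


class FakeClient:
    """Hypothetical stand-in for a shared async HTTP client (no network)."""

    def __init__(self):
        self.requests = []

    async def get(self, url):
        self.requests.append(url)
        await asyncio.sleep(0)  # yield control, as real I/O would
        return f"payload from {url}"


async def fetch_one(client, alias, url):
    # One download task; returns (alias, body) so results can be keyed by alias.
    body = await client.get(url)
    return alias, body


async def fetch_all(url_by_alias):
    client = FakeClient()  # a single client reused by every task
    tasks = [fetch_one(client, a, u) for a, u in url_by_alias.items()]
    results = await asyncio.gather(*tasks)  # run the downloads concurrently
    return dict(results), client


downloaded, shared_client = asyncio.run(
    fetch_all({"example.txt": "http://ex/x", "example2.fasta.gz": "http://ex/x2"})
)
```

Reusing one client across tasks is what lets connections be pooled; with a real httpx.AsyncClient the client would normally be opened in an async with block.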
- async aenrich_runs(limit=200, hide_progress=False)#
Asynchronously enriches the run metadata for the runs in the taxonomic dataset by iterating through the run accessions and retrieving their details using the RunDetail proxy. The results are cached using the DiskCheckpointer to avoid redundant API calls in future runs.
- Parameters:
limit (Optional[int ], default=200) – An optional integer to limit the number of runs to enrich. If not provided, it defaults to 200. This is useful for testing or when dealing with large datasets to avoid long runtimes during development. If set to None, there will be no limit on the number of runs enriched.
hide_progress (bool , default=False) – Whether to hide the progress bar during enrichment. Defaults to False.
- Returns:
The function does not return anything. It updates the run_results attribute of the TaxaMGazine instance with the enriched run metadata.
- Return type:
None
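How limit and the cache interact can be sketched in a few lines. This is an illustration only: enrich and fake_fetch are hypothetical names, and a plain dict stands in for the DiskCheckpointer, whose real interface is not shown on this page.

```python
from itertools import islice


def enrich(accessions, fetch_detail, cache, limit=200):
    """Fetch details for up to `limit` accessions, skipping cached ones."""
    selected = accessions if limit is None else islice(accessions, limit)
    for acc in selected:
        if acc not in cache:  # cached entries cost no new API call
            cache[acc] = fetch_detail(acc)
    return cache


calls = []


def fake_fetch(acc):
    # Hypothetical lookup that records each simulated API call.
    calls.append(acc)
    return {"accession": acc}


# Pretend ERR1 was already fetched in an earlier session.
cache = {"ERR1": {"accession": "ERR1"}}
enrich(["ERR1", "ERR2", "ERR3"], fake_fetch, cache, limit=2)
```

With limit=2 only ERR1 and ERR2 are considered, and only ERR2 triggers a call because ERR1 is already cached. Passing limit=None would process every accession.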
- property aliases: list [str ]#
Return a list of all download aliases.
Examples
>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> MGazine(downloads).aliases
['example.txt']
- by_pipeline_version()#
Group downloads by pipeline version based on the ‘pipeline_version’ column in the downloads dataframe.
- Returns:
A dictionary where keys are pipeline versions and values are lists of download dictionaries.
- Return type:
dict
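The shape of the returned dictionary can be illustrated with plain Python. The sketch below assumes downloads are dicts carrying a pipeline_version key; group_by is a hypothetical helper and does not reproduce the dataframe-based implementation.

```python
from collections import defaultdict


def group_by(downloads, key):
    """Group download dicts by the value stored under `key`."""
    groups = defaultdict(list)
    for item in downloads:
        groups[item.get(key)].append(item)
    return dict(groups)


downloads = [
    {"alias": "a.tsv", "pipeline_version": "v4_1"},
    {"alias": "b.tsv", "pipeline_version": "v5"},
    {"alias": "c.tsv", "pipeline_version": "v5"},
]
by_version = group_by(downloads, "pipeline_version")
```

The same helper called with "short_description" gives the grouping described for by_short_desc().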
- by_short_desc()#
Group downloads by short description based on the ‘short_description’ column in the downloads dataframe.
- Returns:
A dictionary where keys are short descriptions and values are lists of download dictionaries.
- Return type:
dict
- clear_cache()#
- download(to_dir, alias=None, *, url=None, filename=None, httpx_client=None, overwrite=False, hide_progress=False)#
Download a file from an alias or URL to a local directory.
- Parameters:
to_dir (DirectoryPath) – Directory where the file will be saved.
alias (str or None, optional) – Download alias known to this MGazine instance. When provided the corresponding URL from the instance’s downloads list is used.
url (str or None, optional) – Direct URL to fetch. Either alias or url must be provided.
filename (str or None, optional) – Filename to use for the saved file. When omitted the alias is used.
httpx_client (httpx.Client, optional) – Optional httpx.Client to use for the HTTP request. If not supplied a temporary client from _mgnifier_helper is used.
overwrite (bool, optional) – If False and the destination file already exists the download is skipped. When True the existing file will be overwritten.
hide_progress (bool, optional) – Disable the progress bar when True.
- Raises:
ValueError – If neither alias nor url is provided.
Examples
>>> downloads = [
...     {
...         "alias": "example.txt",
...         "url": "http://ex/x",
...         "file_type": "txt",
...     }]
>>> mg = MGazine(downloads)
>>> mg.download("download_to_here", alias="example.txt")  # doctest: +SKIP
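The overwrite behaviour described for download() can be sketched without any network access. The save function below is a hypothetical illustration of the skip-if-exists rule, not the library's code; the payload stands in for the fetched bytes.

```python
import tempfile
from pathlib import Path


def save(to_dir, filename, payload, overwrite=False):
    """Write payload to to_dir/filename; return False when the write is skipped."""
    target = Path(to_dir) / filename
    if target.exists() and not overwrite:
        return False  # existing file kept, download skipped
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return True


with tempfile.TemporaryDirectory() as tmp:
    first = save(tmp, "example.txt", b"v1")      # file is new, so it is written
    skipped = save(tmp, "example.txt", b"v2")    # file exists, overwrite=False
    kept = (Path(tmp) / "example.txt").read_bytes()
    replaced = save(tmp, "example.txt", b"v2", overwrite=True)
    final = (Path(tmp) / "example.txt").read_bytes()
```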
- download_all(to_dir, hide_progress=False, overwrite=False)#
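Download all files known to this MGazine instance.
- Parameters:
to_dir (DirectoryPath) – Directory where the files will be saved.
hide_progress (bool, optional) – Disable the progress bar when True.
overwrite (bool, optional) – If False, files that already exist in to_dir are skipped. When True existing files are overwritten.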
Notes
This helper calls download for each alias present in the instance’s downloads list.
Examples
>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
...     {"alias": "example2.fasta.gz", "url": "http://ex/x2", "file_type": "fasta"},
... ]
>>> mg = MGazine(downloads)
>>> mg.download_all("download_to_here")
- downloads_df(**pd_kwargs)#
Return a pandas.DataFrame of all downloads.
The dataframe will contain columns such as alias, url and file_type when those keys exist in the provided download dicts.
Examples
>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> df = MGazine(downloads).downloads_df()
>>> list(df.columns)
['alias', 'url', 'file_type']
- Return type:
DataFrame
- enrich_biosamples(limit=200, hide_progress=False, incl_ena=True)#
Enriches the biosample metadata for the biosamples in the taxonomic dataset by iterating through the biosample accessions and retrieving their details using the BiosampleDetail proxy. The results are cached using the DiskCheckpointer to avoid redundant API calls in future runs.
- Parameters:
limit (Optional[int ], default=200) – An optional integer to limit the number of biosamples to enrich. If not provided, it defaults to 200. This is useful for testing or when dealing with large datasets to avoid long runtimes during development. If set to None, there will be no limit on the number of biosamples enriched.
hide_progress (bool )
incl_ena (bool )
- Returns:
The function does not return anything. The enriched biosample metadata is stored on the instance.
- Return type:
None
- enrich_runs(limit=200, hide_progress=False)#
Enriches the run metadata for the runs in the taxonomic dataset by iterating through the run accessions and retrieving their details using the RunDetail proxy. The results are cached using the DiskCheckpointer to avoid redundant API calls in future runs.
- Parameters:
limit (Optional[int ], default=200) – An optional integer to limit the number of runs to enrich. If not provided, it defaults to 200. This is useful for testing or when dealing with large datasets to avoid long runtimes during development. If set to None, there will be no limit on the number of runs enriched.
hide_progress (bool )
- Returns:
The function does not return anything. It updates the run_results attribute of the TaxaMGazine instance with the enriched run metadata.
- Return type:
None
- enrich_samples()#
- enrich_studies()#
- property lazy_merged: LazyFrame#
- list_pipeline_version()#
Return a list of pipeline versions extracted from the download groups.
This looks for patterns like ‘.v4.1’ in the ‘download_group’ field of the downloads and extracts the version number.
Examples
>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt", "download_group": "group.v4.1", "pipeline_version": 'v4_1'},
...     {"alias": "example2.txt", "url": "http://ex/x2", "file_type": "txt", "download_group": "group.v5", "pipeline_version": 'v5'},
... ]
>>> MGazine(downloads).list_pipeline_version()
['v4_1', 'v5']
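Extracting a version such as ‘.v4.1’ from a download group can be sketched with a regular expression. Both the pattern and the dot-to-underscore normalisation are assumptions inferred from the example output above ('v4_1'); pipeline_version is a hypothetical helper, not the library's function.

```python
import re

# Assumed pattern: a dot, then "v", then digits with optional dotted parts.
VERSION_RE = re.compile(r"\.(v\d+(?:\.\d+)*)")


def pipeline_version(download_group):
    """Return a version label like 'v4_1', or None when no version is present."""
    match = VERSION_RE.search(download_group)
    if match is None:
        return None
    return match.group(1).replace(".", "_")


versions = [pipeline_version(g) for g in ("group.v4.1", "group.v5")]
```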
- list_short_descriptions()#
Return a list of short descriptions extracted from the download groups.
This looks for patterns like ‘shortdesc’ in the ‘download_group’ field of the downloads and extracts the short description.
Examples
>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt", "download_group": "group.shortdesc1", "pipeline_version": 4.1, "short_description": "shortdesc1"},
...     {"alias": "example2.txt", "url": "http://ex/x2", "file_type": "txt", "download_group": "group.shortdesc2", "pipeline_version": 4.1, "short_description": "shortdesc2"},
... ]
>>> MGazine(downloads).list_short_descriptions()
['shortdesc1', 'shortdesc2']
- load()[source]#
Lazy loading and merging of the datasets contained in url_list. This method should be called after instantiating to set up the internal state and load any cached results.
- metadata(df_engine='pandas', strict=False, expand_nested_dicts=True, incl_runs_details=True, incl_samples_details=True, incl_studies_details=True, incl_biosamples_details=True, incl_analyses_details=True, incl_assemblies_details=True)#
- Parameters:
- Return type:
pandas.DataFrame | polars.DataFrame, depending on df_engine
- stream(*, alias=None, url=None, chunksize=None, max_skip=5, **kwargs)#
Streams a single download based on its alias or url.
If chunksize is specified then iterators of dataframes or strings will be returned; otherwise the full data will be returned as a single object.
Supported formats and their handlers#
tsv: handled by stream_pandas() (pandas) or stream_polars() (polars). Gzipped TSVs are supported via the gzip/compression options.
csv: handled by stream_pandas() / stream_polars() (sep=",").
txt: handled by stream_txt() (returns full text or yields line chunks).
html: handled by stream_html() (opens URL in browser).
fasta: handled by stream_fasta() (scikit-bio generator).
gff: handled by stream_gff() (scikit-bio generator).
biom: handled by stream_biom() (scikit-bio generator).
gzipped HTTP resources: use stream_gzipped() for a file-like object, or stream_json() for gzipped JSON content.
jsonl / ndjson: handled by stream_jsonl() (pandas or polars modes).
json: handled by stream_json() (returns full JSON or streams via ijson).
tree/newick: handled by stream_tree() (scikit-bio newick reader).
other: if the URL ends with .json it’s streamed via stream_json(); otherwise use the download helper for unsupported binary formats.
- Parameters:
alias (Optional[str]) – The alias of the download to stream.
url (Optional[HttpUrl]) – The url of the download to stream.
chunksize (Optional[int]) – The size of the chunks to read from the stream.
max_skip (int, optional) – The maximum number of rows to skip before raising an error. Default is 5.
**kwargs – Additional keyword arguments to pass to the streamer function.
- Returns:
The streamer result for the resolved alias or url.
- Return type:
Any
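The format-to-handler routing that stream() describes can be pictured as a lookup table with a .json fallback. The sketch below is an assumption about the general shape only: the handler functions, the HANDLERS table and the ValueError for unsupported types are all hypothetical, and the handlers return labels instead of touching the network.

```python
def stream_table(url, chunksize=None):
    # Placeholder for a tabular reader (tsv/csv).
    return ("table", url, chunksize)


def stream_text(url, chunksize=None):
    # Placeholder for a plain-text reader.
    return ("text", url, chunksize)


def stream_json_doc(url, chunksize=None):
    # Placeholder for a JSON reader.
    return ("json", url, chunksize)


HANDLERS = {
    "tsv": stream_table,
    "csv": stream_table,
    "txt": stream_text,
    "json": stream_json_doc,
}


def dispatch(file_type, url, chunksize=None):
    """Pick a handler by file type, falling back to the URL suffix."""
    handler = HANDLERS.get(file_type)
    if handler is None and url.endswith(".json"):
        handler = stream_json_doc
    if handler is None:
        raise ValueError(f"no streamer for {file_type!r}; download the file instead")
    return handler(url, chunksize=chunksize)


result = dispatch("tsv", "http://ex/x.tsv", chunksize=100)
```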
- stream_biom(url, **skbio_kwargs)#
Stream a biom file from a URL using scikit-bio’s read function. See the scikit-bio documentation for details.
- Parameters:
url (str) – The URL to the biom file to stream.
**skbio_kwargs – Additional keyword arguments passed to skbio.io.read(), such as into and verify.
- Returns:
A generator yielding the scikit-bio objects parsed from the biom file.
- Return type:
Generator
- stream_fasta(url, **skbio_kwargs)#
Stream a FASTA file from a URL using scikit-bio’s read function. See the scikit-bio documentation for details.
- Parameters:
url (str ) – The URL to the FASTA file to stream.
**skbio_kwargs – Additional keyword arguments passed to skbio.io.read(), such as into and verify.
- Returns:
A generator yielding scikit-bio Sequence objects parsed from the FASTA file.
- Return type:
Generator
- stream_gff(url, **skbio_kwargs)#
Stream a GFF file from a URL using scikit-bio’s read function. See the scikit-bio documentation for details.
- Parameters:
url (str) – The URL to the GFF file to stream.
**skbio_kwargs – Additional keyword arguments passed to skbio.io.read(), such as into and verify.
- Returns:
A generator yielding the scikit-bio objects parsed from the GFF file.
- Return type:
Generator
- stream_gzipped(url, chunksize=None, httpx_client=None, decode=False, encoding='utf-8', errors='replace', **httpx_kwargs)#
Stream a gzipped HTTP resource and present a file-like interface.
When chunksize is None the entire compressed payload is fetched and decompressed into memory. When chunksize is provided a streaming file-like object is returned.
- Parameters:
- Return type:
bytes | str | BufferedReader | TextIOWrapper
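The four return types listed for stream_gzipped() correspond to familiar standard-library idioms. The sketch below uses an in-memory payload instead of an HTTP response; how the method maps decode, encoding and errors onto these calls is an assumption based on its signature.

```python
import gzip
import io

# In-memory stand-in for a gzipped HTTP response body.
payload = gzip.compress(b"line1\nline2\n")

# Whole-payload mode: decompress everything into memory at once.
full_bytes = gzip.decompress(payload)
full_text = full_bytes.decode("utf-8", errors="replace")

# Streaming mode: a binary file-like object that decompresses on read.
reader = gzip.GzipFile(fileobj=io.BytesIO(payload))
first_chunk = reader.read(5)

# Streaming text mode: wrap the binary reader to get decoded lines.
text_reader = io.TextIOWrapper(
    gzip.GzipFile(fileobj=io.BytesIO(payload)),
    encoding="utf-8",
    errors="replace",
)
first_line = text_reader.readline()
```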
- stream_html(url, **web_kwargs)#
Open an HTML URL in the default web browser.
- stream_json(url, chunksize=None, httpx_client=None, **httpx_kwargs)#
- stream_jsonl(url, orient=None, chunksize=None, dataframe_engine='pandas', **df_kwargs)#
- stream_pandas(url, sep='\t', chunksize=None, max_skip=5, low_memory=False, **pd_kwargs)#
Read a TSV from a URL or local file with resilient header handling.
The helper will retry with increasing skiprows when pandas raises a ParserError (useful for files with extra header lines). When chunksize is provided an iterator is returned.
- Parameters:
url (str) – The URL or local file path to read the TSV from.
sep (str) – The delimiter to use (default is tab).
chunksize (int or None) – If an integer is provided, returns an iterator that yields DataFrames of that many rows. If None, returns a single DataFrame.
max_skip (int) – The maximum number of lines to skip when trying to parse the TSV.
low_memory (bool)
**pd_kwargs – Additional keyword arguments passed to pd.read_csv.
- Returns:
A DataFrame containing the TSV data, or an iterator yielding DataFrames if chunksize is specified.
- Return type:
pd.DataFrame or TextFileReader
- Raises:
ValueError – If chunksize is not a positive integer or None.
RuntimeError – If the TSV cannot be parsed after skipping up to max_skip lines.
pandas.errors.ParserError – If the TSV cannot be parsed due to a format error (after retries).
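The retry-with-increasing-skiprows idea can be shown without pandas. In the sketch below a small strict parser built on the csv module stands in for pd.read_csv: it raises when a data row has more fields than the header, which is the situation an extra leading comment line creates. ParseError, parse_table and read_resilient are hypothetical names used only for this illustration.

```python
import csv


class ParseError(Exception):
    """Raised when the table is empty or a row is wider than the header."""


def parse_table(text, sep="\t", skiprows=0):
    lines = text.splitlines()[skiprows:]
    rows = list(csv.reader(lines, delimiter=sep))
    if not rows:
        raise ParseError("no rows left after skipping")
    header, body = rows[0], rows[1:]
    for row in body:
        if len(row) > len(header):
            raise ParseError(f"expected {len(header)} fields, saw {len(row)}")
    return header, body


def read_resilient(text, sep="\t", max_skip=5):
    """Retry parsing, skipping one more leading line after each failure."""
    for skip in range(max_skip + 1):
        try:
            return parse_table(text, sep=sep, skiprows=skip)
        except ParseError:
            continue
    raise RuntimeError(f"could not parse after skipping up to {max_skip} lines")


raw = "# exported by pipeline\nid\tcount\nA\t1\nB\t2\n"
header, body = read_resilient(raw)
```

The first attempt treats the comment line as a one-column header and fails on the two-column rows; the second attempt, with one line skipped, succeeds.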
- stream_polars(url, sep='\t', chunksize=None, max_skip=5, low_memory=False, **pl_kwargs)#
Read a TSV from a URL or local file into a Polars DataFrame with resilient header handling.
The helper will retry with increasing skip_rows when Polars raises an error (useful for files with extra header lines). When chunksize is provided an iterator is returned.
- Parameters:
url (str) – The URL or local file path to read the TSV from.
sep (str) – The delimiter to use (default is tab).
chunksize (int or None) – If an integer is provided, returns an iterator that yields DataFrames of that many rows. If None, returns a single DataFrame.
max_skip (int) – The maximum number of lines to skip when trying to parse the TSV.
low_memory (bool)
**pl_kwargs – Additional keyword arguments passed to pl.read_csv.
- Returns:
A Polars DataFrame containing the TSV data, or an iterator yielding DataFrames if chunksize is specified.
- Return type:
pl.DataFrame or Iterator[pl.DataFrame]
- Raises:
ValueError – If chunksize is not a positive integer or None.
RuntimeError – If the TSV cannot be parsed after skipping up to max_skip lines.
Polars Error – If the TSV cannot be parsed due to a format error (after retries).
- stream_txt(url, chunksize=None, httpx_client=None, **httpx_kwargs)#
Stream a plain-text resource. When chunksize is None the full text is returned as a string. When chunksize is an integer the function yields lists of lines.
- Parameters:
url (str) – The URL to stream the text from.
chunksize (int or None) – If an integer is provided, yields lists of lines of that size. If None, returns the entire text as a single string.
httpx_client (httpx.Client, optional) – An optional httpx.Client to use for the request. If None, a new client will be created for the request.
**httpx_kwargs – Additional keyword arguments passed to the httpx.Client.request() method
- Returns:
The full text as a string if chunksize is None, or a generator yielding lists of lines if chunksize is an integer.
- Return type:
str or Generator
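The two return modes of stream_txt() can be sketched on an in-memory string. The names read_text and iter_line_chunks are hypothetical; the generator lives in its own function so that the chunksize=None path can return a plain str.

```python
from itertools import islice


def iter_line_chunks(text, chunksize):
    """Yield lists of at most `chunksize` lines until the text is exhausted."""
    lines = iter(text.splitlines())
    while True:
        chunk = list(islice(lines, chunksize))
        if not chunk:
            return
        yield chunk


def read_text(text, chunksize=None):
    # Full text when no chunk size is given, otherwise a generator of line lists.
    if chunksize is None:
        return text
    return iter_line_chunks(text, chunksize)


whole = read_text("a\nb\nc")
chunks = list(read_text("a\nb\nc", chunksize=2))
```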
- taxonomic_metadata(fill_na='NA', df_engine='pandas', strict=False)#
- to_pandas(**pd_kwargs)#
- Return type:
DataFrame
- to_polars()#
- Return type:
DataFrame
- property url_dict: dict [str , dict ]#
Return mapping of alias to URL for all downloads.
- Returns:
Dictionary mapping alias -> url (or None when no url is available).
- Return type:
dict
Examples
>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> MGazine(downloads).url_dict['example.txt']
'http://ex/x'
- property url_list#
Return a list of all download URLs.
Examples
>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> MGazine(downloads).url_list
['http://ex/x']
- class mgnipy.V2.datasets.taxonomic.TaxaMGazine(mgazine, config=None, *, long_short_mapping=None, assemblies_details=None, runs_details=None, samples_details=None, studies_details=None, biosamples_details=None, analyses_details=None)[source]#
Bases: _MGazineSetup
Not intended for DWC data (see DWCTaxaMGazine).
- Parameters:
mgazine (MGazine)
config (Optional[MGnipyConfig])
- X(df_engine='pandas')[source]#
- Parameters:
df_engine (Literal ['polars', 'pandas'])
- Return type:
pandas.DataFrame | polars.DataFrame, depending on df_engine
- async adownload(to_dir, alias=None, *, url=None, filename=None, httpx_aclient=None, overwrite=False, hide_progress=False)#
Asynchronously download a file from an alias or URL.
- Parameters:
to_dir (DirectoryPath) – Directory where the file will be saved.
alias (str or None, optional) – Download alias known to this MGazine instance.
url (str or None, optional) – Direct URL to fetch. Either alias or url must be provided.
filename (str or None, optional) – Filename to use for the saved file. When omitted the alias is used.
httpx_aclient (httpx.AsyncClient, optional) – Optional httpx.AsyncClient to use for the HTTP request.
overwrite (bool, optional) – If False and the destination file already exists the download is skipped. When True the existing file will be overwritten.
hide_progress (bool, optional) – Disable the progress bar when True.
- Raises:
ValueError – If neither alias nor url is provided.
Examples
>>> downloads = [
...     {
...         "alias": "example.txt",
...         "url": "http://ex/x",
...         "file_type": "txt",
...     }]
>>> mg = MGazine(downloads)
>>> await mg.adownload("download_to_here", alias="example.txt")  # doctest: +SKIP
- async adownload_all(to_dir, overwrite=False, hide_progress=False)#
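Asynchronously download all files known to this MGazine.
- Parameters:
to_dir (DirectoryPath) – Directory where the files will be saved.
overwrite (bool, optional) – If False, files that already exist in to_dir are skipped. When True existing files are overwritten.
hide_progress (bool, optional) – Disable the progress bar when True.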
Notes
This helper creates a single async HTTP client and schedules concurrent adownload calls for all aliases.
Examples
>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
...     {"alias": "example2.fasta.gz", "url": "http://ex/x2", "file_type": "fasta"},
... ]
>>> mg = MGazine(downloads)
>>> await mg.adownload_all("download_to_here")
- async aenrich_runs(limit=200, hide_progress=False)#
Asynchronously enriches the run metadata for the runs in the taxonomic dataset by iterating through the run accessions and retrieving their details using the RunDetail proxy. The results are cached using the DiskCheckpointer to avoid redundant API calls in future runs.
- Parameters:
limit (Optional[int ], default=200) – An optional integer to limit the number of runs to enrich. If not provided, it defaults to 200. This is useful for testing or when dealing with large datasets to avoid long runtimes during development. If set to None, there will be no limit on the number of runs enriched.
hide_progress (bool , default=False) – Whether to hide the progress bar during enrichment. Defaults to False.
- Returns:
The function does not return anything. It updates the run_results attribute of the TaxaMGazine instance with the enriched run metadata.
- Return type:
None
- property aliases: list [str ]#
Return a list of all download aliases.
Examples
>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> MGazine(downloads).aliases
['example.txt']
- by_pipeline_version()#
Group downloads by pipeline version based on the ‘pipeline_version’ column in the downloads dataframe.
- Returns:
A dictionary where keys are pipeline versions and values are lists of download dictionaries.
- Return type:
dict
- by_short_desc()#
Group downloads by short description based on the ‘short_description’ column in the downloads dataframe.
- Returns:
A dictionary where keys are short descriptions and values are lists of download dictionaries.
- Return type:
dict
- clear_cache()#
- download(to_dir, alias=None, *, url=None, filename=None, httpx_client=None, overwrite=False, hide_progress=False)#
Download a file from an alias or URL to a local directory.
- Parameters:
to_dir (DirectoryPath) – Directory where the file will be saved.
alias (str or None, optional) – Download alias known to this MGazine instance. When provided the corresponding URL from the instance’s downloads list is used.
url (str or None, optional) – Direct URL to fetch. Either alias or url must be provided.
filename (str or None, optional) – Filename to use for the saved file. When omitted the alias is used.
httpx_client (httpx.Client, optional) – Optional httpx.Client to use for the HTTP request. If not supplied a temporary client from _mgnifier_helper is used.
overwrite (bool, optional) – If False and the destination file already exists the download is skipped. When True the existing file will be overwritten.
hide_progress (bool, optional) – Disable the progress bar when True.
- Raises:
ValueError – If neither alias nor url is provided.
Examples
>>> downloads = [
...     {
...         "alias": "example.txt",
...         "url": "http://ex/x",
...         "file_type": "txt",
...     }]
>>> mg = MGazine(downloads)
>>> mg.download("download_to_here", alias="example.txt")  # doctest: +SKIP
- download_all(to_dir, hide_progress=False, overwrite=False)#
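Download all files known to this MGazine instance.
- Parameters:
to_dir (DirectoryPath) – Directory where the files will be saved.
hide_progress (bool, optional) – Disable the progress bar when True.
overwrite (bool, optional) – If False, files that already exist in to_dir are skipped. When True existing files are overwritten.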
Notes
This helper calls download for each alias present in the instance’s downloads list.
Examples
>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
...     {"alias": "example2.fasta.gz", "url": "http://ex/x2", "file_type": "fasta"},
... ]
>>> mg = MGazine(downloads)
>>> mg.download_all("download_to_here")
- downloads_df(**pd_kwargs)#
Return a pandas.DataFrame of all downloads.
The dataframe will contain columns such as alias, url and file_type when those keys exist in the provided download dicts.
Examples
>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> df = MGazine(downloads).downloads_df()
>>> list(df.columns)
['alias', 'url', 'file_type']
- Return type:
DataFrame
- enrich_biosamples(limit=200, hide_progress=False, incl_ena=True)#
Enriches the biosample metadata for the biosamples in the taxonomic dataset by iterating through the biosample accessions and retrieving their details using the BiosampleDetail proxy. The results are cached using the DiskCheckpointer to avoid redundant API calls in future runs.
- Parameters:
limit (Optional[int ], default=200) – An optional integer to limit the number of biosamples to enrich. If not provided, it defaults to 200. This is useful for testing or when dealing with large datasets to avoid long runtimes during development. If set to None, there will be no limit on the number of biosamples enriched.
hide_progress (bool )
incl_ena (bool )
- Returns:
The function does not return anything. The enriched biosample metadata is stored on the instance.
- Return type:
None
- enrich_runs(limit=200, hide_progress=False)#
Enriches the run metadata for the runs in the taxonomic dataset by iterating through the run accessions and retrieving their details using the RunDetail proxy. The results are cached using the DiskCheckpointer to avoid redundant API calls in future runs.
- Parameters:
limit (Optional[int ], default=200) – An optional integer to limit the number of runs to enrich. If not provided, it defaults to 200. This is useful for testing or when dealing with large datasets to avoid long runtimes during development. If set to None, there will be no limit on the number of runs enriched.
hide_progress (bool )
- Returns:
The function does not return anything. It updates the run_results attribute of the TaxaMGazine instance with the enriched run metadata.
- Return type:
None
- enrich_samples()#
- enrich_studies()#
- property lazy_merged: LazyFrame#
- list_pipeline_version()#
Return a list of pipeline versions extracted from the download groups.
This looks for patterns like ‘.v4.1’ in the ‘download_group’ field of the downloads and extracts the version number.
Examples
>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt", "download_group": "group.v4.1", "pipeline_version": 'v4_1'},
...     {"alias": "example2.txt", "url": "http://ex/x2", "file_type": "txt", "download_group": "group.v5", "pipeline_version": 'v5'},
... ]
>>> MGazine(downloads).list_pipeline_version()
['v4_1', 'v5']
- list_short_descriptions()#
Return a list of short descriptions extracted from the download groups.
This looks for patterns like ‘shortdesc’ in the ‘download_group’ field of the downloads and extracts the short description.
Examples
>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt", "download_group": "group.shortdesc1", "pipeline_version": 4.1, "short_description": "shortdesc1"},
...     {"alias": "example2.txt", "url": "http://ex/x2", "file_type": "txt", "download_group": "group.shortdesc2", "pipeline_version": 4.1, "short_description": "shortdesc2"},
... ]
>>> MGazine(downloads).list_short_descriptions()
['shortdesc1', 'shortdesc2']
- load()[source]#
Lazy loading and merging of the datasets contained in url_list. This method should be called after instantiating to set up the internal state and load any cached results.
- metadata(df_engine='pandas', strict=False, expand_nested_dicts=True, incl_runs_details=True, incl_samples_details=True, incl_studies_details=True, incl_biosamples_details=True, incl_analyses_details=True, incl_assemblies_details=True)#
- Parameters:
- Return type:
pandas.DataFrame | polars.DataFrame, depending on df_engine
- stream(*, alias=None, url=None, chunksize=None, max_skip=5, **kwargs)#
Streams a single download based on its alias or url.
If chunksize is specified then iterators of dataframes or strings will be returned; otherwise the full data will be returned as a single object.
Supported formats and their handlers#
tsv: handled by stream_pandas() (pandas) or stream_polars() (polars). Gzipped TSVs are supported via the gzip/compression options.
csv: handled by stream_pandas() / stream_polars() (sep=",").
txt: handled by stream_txt() (returns full text or yields line chunks).
html: handled by stream_html() (opens URL in browser).
fasta: handled by stream_fasta() (scikit-bio generator).
gff: handled by stream_gff() (scikit-bio generator).
biom: handled by stream_biom() (scikit-bio generator).
gzipped HTTP resources: use stream_gzipped() for a file-like object, or stream_json() for gzipped JSON content.
jsonl / ndjson: handled by stream_jsonl() (pandas or polars modes).
json: handled by stream_json() (returns full JSON or streams via ijson).
tree/newick: handled by stream_tree() (scikit-bio newick reader).
other: if the URL ends with .json it’s streamed via stream_json(); otherwise use the download helper for unsupported binary formats.
- Parameters:
alias (Optional[str]) – The alias of the download to stream.
url (Optional[HttpUrl]) – The url of the download to stream.
chunksize (Optional[int]) – The size of the chunks to read from the stream.
max_skip (int, optional) – The maximum number of rows to skip before raising an error. Default is 5.
**kwargs – Additional keyword arguments to pass to the streamer function.
- Returns:
The streamer result for the resolved alias or url.
- Return type:
Any
- stream_biom(url, **skbio_kwargs)#
Stream a biom file from a URL using scikit-bio’s read function. See the scikit-bio documentation for details.
- Parameters:
url (str) – The URL to the biom file to stream.
**skbio_kwargs – Additional keyword arguments passed to skbio.io.read(), such as into and verify.
- Returns:
A generator yielding the scikit-bio objects parsed from the biom file.
- Return type:
Generator
- stream_fasta(url, **skbio_kwargs)#
Stream a FASTA file from a URL using scikit-bio’s read function. See the scikit-bio documentation for details.
- Parameters:
url (str ) – The URL to the FASTA file to stream.
**skbio_kwargs – Additional keyword arguments passed to skbio.io.read(), such as into and verify.
- Returns:
A generator yielding scikit-bio Sequence objects parsed from the FASTA file.
- Return type:
Generator
- stream_gff(url, **skbio_kwargs)#
Stream a GFF file from a URL using scikit-bio’s read function. See the scikit-bio documentation for details.
- Parameters:
url (str) – The URL to the GFF file to stream.
**skbio_kwargs – Additional keyword arguments passed to skbio.io.read(), such as into and verify.
- Returns:
A generator yielding the scikit-bio objects parsed from the GFF file.
- Return type:
Generator
- stream_gzipped(url, chunksize=None, httpx_client=None, decode=False, encoding='utf-8', errors='replace', **httpx_kwargs)#
Stream a gzipped HTTP resource and present a file-like interface.
When chunksize is None the entire compressed payload is fetched and decompressed into memory. When chunksize is provided a streaming file-like object is returned.
- Parameters:
- Return type:
bytes | str | BufferedReader | TextIOWrapper
- stream_html(url, **web_kwargs)#
Open an HTML URL in the default web browser.
- stream_json(url, chunksize=None, httpx_client=None, **httpx_kwargs)#
- stream_jsonl(url, orient=None, chunksize=None, dataframe_engine='pandas', **df_kwargs)#
- stream_pandas(url, sep='\t', chunksize=None, max_skip=5, low_memory=False, **pd_kwargs)#
Read a TSV from a URL or local file with resilient header handling.
The helper will retry with increasing skiprows when pandas raises a ParserError (useful for files with extra header lines). When chunksize is provided an iterator is returned.
- Parameters:
url (str) – The URL or local file path to read the TSV from.
sep (str) – The delimiter to use (default is tab).
chunksize (int or None) – If an integer is provided, returns an iterator that yields DataFrames of that many rows. If None, returns a single DataFrame.
max_skip (int) – The maximum number of lines to skip when trying to parse the TSV.
low_memory (bool)
**pd_kwargs – Additional keyword arguments passed to pd.read_csv.
- Returns:
A DataFrame containing the TSV data, or an iterator yielding DataFrames if chunksize is specified.
- Return type:
pd.DataFrame or TextFileReader
- Raises:
ValueError – If chunksize is not a positive integer or None.
RuntimeError – If the TSV cannot be parsed after skipping up to max_skip lines.
pandas.errors.ParserError – If the TSV cannot be parsed due to a format error (after retries).
- stream_polars(url, sep='\t', chunksize=None, max_skip=5, low_memory=False, **pl_kwargs)#
Read a TSV from a URL or local file into a Polars DataFrame with resilient header handling.
The helper will retry with increasing skip_rows when Polars raises an error (useful for files with extra header lines). When chunksize is provided an iterator is returned.
- Parameters:
url (str) – The URL or local file path to read the TSV from.
sep (str) – The delimiter to use (default is tab).
chunksize (int or None) – If an integer is provided, returns an iterator that yields DataFrames of that many rows. If None, returns a single DataFrame.
max_skip (int) – The maximum number of lines to skip when trying to parse the TSV.
low_memory (bool)
**pl_kwargs – Additional keyword arguments passed to pl.read_csv.
- Returns:
A Polars DataFrame containing the TSV data, or an iterator yielding DataFrames if chunksize is specified.
- Return type:
pl.DataFrame or Iterator[pl.DataFrame]
- Raises:
ValueError – If chunksize is not a positive integer or None.
RuntimeError – If the TSV cannot be parsed after skipping up to max_skip lines.
Polars Error – If the TSV cannot be parsed due to a format error (after retries).
- stream_txt(url, chunksize=None, httpx_client=None, **httpx_kwargs)#
Stream a plain-text resource. When chunksize is None the full text is returned as a string. When chunksize is an integer the function yields lists of lines.
- Parameters:
url (str) – The URL to stream the text from.
chunksize (int or None) – If an integer is provided, yields lists of lines of that size. If None, returns the entire text as a single string.
httpx_client (httpx.Client, optional) – An optional httpx.Client to use for the request. If None, a new client will be created for the request.
**httpx_kwargs – Additional keyword arguments passed to the httpx.Client.request() method
- Returns:
The full text as a string if chunksize is None, or a generator yielding lists of lines if chunksize is an integer.
- Return type:
str or Generator
- to_anndata(**anndata_kwargs)[source]#
Converts the taxonomic metadata to an AnnData object. The taxonomic ranks are stored in the obs attribute of the AnnData object.
- Parameters:
**anndata_kwargs – Additional keyword arguments to pass to the AnnData constructor.
- Returns:
An AnnData object containing the taxonomic metadata in the obs attribute.
- Return type:
ad.AnnData
- to_pandas(**pd_kwargs)#
- Return type:
DataFrame
- to_polars()#
- Return type:
DataFrame
- property url_dict: dict [str , dict ]#
Return mapping of alias to URL for all downloads.
- Returns:
Dictionary mapping alias -> url (or None when no url is available).
- Return type:
dict
Examples
>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> MGazine(downloads).url_dict['example.txt']
'http://ex/x'
- property url_list#
Return a list of all download URLs.
Examples
>>> downloads = [
...     {"alias": "example.txt", "url": "http://ex/x", "file_type": "txt"},
... ]
>>> MGazine(downloads).url_list
['http://ex/x']
- mgnipy.V2.datasets.taxonomic.prep_obs(df, tax_col, long_short_mapping, fill_na='NA')[source]#
Prepares the taxonomy DataFrame by splitting the taxonomy string into separate columns for each taxonomic rank.
- Parameters:
df (pl.DataFrame) – A Polars DataFrame containing the column named by tax_col (for example ‘taxonomy’), which holds taxonomic classifications in a semicolon-separated format.
tax_col (Literal["taxonomy", "#SampleID"]) – The name of the column in the DataFrame that contains the taxonomy string to be split.
long_short_mapping (Optional[dict [str , str ]]) – A dictionary mapping the long taxonomic rank names (e.g., “Superkingdom”) to their corresponding short prefixes (e.g., “sk”). This is used to clean the taxonomic rank values by stripping the short prefixes.
fill_na (Optional[Any], default="NA") – The value to use for filling empty strings or null values in the taxonomic rank columns after stripping the short prefixes. If not provided, it defaults to “NA”.
- Returns:
A Polars DataFrame with separate columns for each taxonomic rank based on the taxonomy ranks defined in the constants.
- Return type:
pl.DataFrame
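The per-row transformation that prep_obs describes can be sketched in plain Python for a single taxonomy string. The double-underscore separator between prefix and name (as in "sk__Bacteria"), the sample mapping and the split_taxonomy helper are assumptions made for illustration; the actual function operates on a Polars DataFrame and takes its ranks from the package constants.

```python
def split_taxonomy(taxonomy, long_short_mapping, fill_na="NA"):
    """Split 'sk__Bacteria;k__;p__X' into a dict of long rank name -> value."""
    by_prefix = {}
    for part in taxonomy.split(";"):
        # Assumed layout: short prefix, double underscore, then the name.
        prefix, _, name = part.partition("__")
        by_prefix[prefix] = name
    # Missing or empty ranks are replaced by fill_na.
    return {
        long_name: by_prefix.get(short) or fill_na
        for long_name, short in long_short_mapping.items()
    }


mapping = {"Superkingdom": "sk", "Kingdom": "k", "Phylum": "p"}
ranks = split_taxonomy("sk__Bacteria;k__;p__Firmicutes", mapping)
```

Here the empty Kingdom entry becomes "NA", and a string that stops early (for example "sk__Archaea") fills every missing rank the same way.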