mgnipy.V2.mixins module#
- class mgnipy.V2.mixins.BioSamplesMetadataMixin[source]#
Bases: object
Mixin providing properties to access BioSamples metadata for samples in the list.
- async aenrich_biosamples(incl_ena=False, overwrite=False)[source]#
Asynchronously fetch and cache the BioSamples metadata for the sample associated with this SampleDetail instance, based on its accession. The metadata is stored in the cache properties for later retrieval by the abiosamplesdata property.
- Parameters:
- Return type:
None
Notes
This method retrieves the BioSamples metadata for the sample accession using the aget_biosample_metadata_from_acc function, which asynchronously queries the BioSamples API and constructs a DataFrame with the relevant metadata fields.
The resulting DataFrame will have a single row corresponding to the sample accession, and columns for ‘SampleID’, ‘SRA accession’, ‘taxid’, and any characteristics available for the sample, with missing values filled as ‘NA’.
- async aenrich_biosamples_details(incl_ena=False, overwrite=False)[source]#
Asynchronously retrieve the concatenated BioSamples metadata for all samples in the list. Each row corresponds to a sample, and columns include ‘SampleID’, ‘SRA accession’, ‘taxid’, and any characteristics available for the samples.
- Parameters:
- Returns:
This method does not return a value, but it caches the BioSamples metadata for later retrieval.
- Return type:
None
Notes
Relies on the abiosample_metadata property of each SampleDetail instance, which asynchronously retrieves the BioSamples metadata for each sample accession.
The resulting DataFrame is built by concatenating the per-sample DataFrames. If samples have different characteristics, it contains a column for every unique characteristic across the samples, with missing values filled as NaN.
- biosamples(incl_ena=False)[source]#
A DataFrame containing the BioSamples metadata for the sample associated with this SampleDetail instance, based on its accession. The DataFrame includes columns such as ‘SampleID’, ‘SRA accession’, ‘taxid’, and any characteristics available for the sample.
- Parameters:
incl_ena (bool)
- Return type:
DataFrame
- details_biosamples(incl_ena=True)[source]#
A DataFrame containing the concatenated BioSamples metadata for all samples in the list. Each row corresponds to a sample, and columns include ‘SampleID’, ‘SRA accession’, ‘taxid’, and any characteristics available for the samples.
- Parameters:
incl_ena (bool, optional) – Whether to include ENA-specific metadata fields in the resulting DataFrame. Defaults to True.
- Returns:
A DataFrame containing the BioSamples metadata for all samples in the list.
- Return type:
pd.DataFrame
Notes
Relies on the biosample_metadata property of each SampleDetail instance, which retrieves the BioSamples metadata for each sample accession.
The resulting DataFrame is built by concatenating the per-sample DataFrames. If samples have different characteristics, it contains a column for every unique characteristic across the samples, with missing values filled as NaN.
- enrich_biosamples(incl_ena=False, overwrite=False)[source]#
Fetch and cache the BioSamples metadata for the sample associated with this SampleDetail instance, based on its accession. The metadata is stored in the cache properties for later retrieval by the biosamplesdata property.
- Parameters:
- Return type:
None
Notes
This method retrieves the BioSamples metadata for the sample accession using the get_biosample_metadata_from_acc function, which queries the BioSamples API and constructs a DataFrame with the relevant metadata fields.
The resulting DataFrame will have a single row corresponding to the sample accession, and columns for ‘SampleID’, ‘SRA accession’, ‘taxid’, and any characteristics available for the sample, with missing values filled as ‘NA’.
- enrich_biosamples_details(incl_ena=True, overwrite=False)[source]#
Fetch and cache the concatenated BioSamples metadata for all samples in the list. Each row corresponds to a sample, and columns include ‘SampleID’, ‘SRA accession’, ‘taxid’, and any characteristics available for the samples. The metadata is stored in the cache properties for later retrieval by the details_biosamplesdata property.
- Parameters:
- Return type:
None
Notes
Relies on the biosample_metadata property of each SampleDetail instance, which retrieves the BioSamples metadata for each sample accession.
The resulting DataFrame is built by concatenating the per-sample DataFrames. If samples have different characteristics, it contains a column for every unique characteristic across the samples, with missing values filled as NaN.
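The concatenation behaviour described in the Notes above can be illustrated with plain pandas. This is a minimal sketch, not the mixin's implementation: the two per-sample frames, their accessions, and the characteristic names (depth, ph) are made up for illustration.

```python
import pandas as pd

# Hypothetical per-sample frames: one row each, with different characteristics.
sample_a = pd.DataFrame([{"SampleID": "SAMPLE_A", "taxid": "408172", "depth": "10"}])
sample_b = pd.DataFrame([{"SampleID": "SAMPLE_B", "taxid": "410658", "ph": "7.2"}])

# Outer concatenation keeps the union of columns; cells a sample lacks become NaN.
combined = pd.concat([sample_a, sample_b], ignore_index=True, sort=False)

print(combined)
```

Each sample contributes one row, and a characteristic reported by only one sample shows up as NaN for the other.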
- class mgnipy.V2.mixins.BiomesTreeMixin[source]#
Bases: object
- show_tree(method='compact')[source]#
- Parameters:
method (Literal['compact', 'show', 'print', 'horizontal', 'hshow', 'h', 'hprint', 'vertical', 'vshow', 'v', 'vprint'])
- property tree: Tree#
Convert the biomes metadata to a tree structure for visualization or analysis.
- Returns:
A tree representation of the biomes and their relationships.
- Return type:
Tree
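As a rough picture of what converting biome metadata into a tree involves, the sketch below nests colon-separated lineage strings into a dict of dicts. It is an assumption-laden illustration: the Tree type returned by the property is not described here, and build_biome_tree, render, and the example lineages are hypothetical names and data.

```python
def build_biome_tree(lineages):
    """Nest colon-separated lineage strings into a dict-of-dicts tree."""
    tree = {}
    for lineage in lineages:
        node = tree
        for part in lineage.split(":"):
            # Reuse the child if it exists, otherwise create an empty one.
            node = node.setdefault(part, {})
    return tree


def render(tree, indent=0):
    """Return the tree as indented lines, children sorted by name."""
    lines = []
    for name in sorted(tree):
        lines.append("  " * indent + name)
        lines.extend(render(tree[name], indent + 1))
    return lines


# Hypothetical lineages in a colon-separated "root:..." form.
lineages = [
    "root:Environmental:Aquatic:Marine",
    "root:Environmental:Terrestrial:Soil",
    "root:Host-associated:Human",
]
biome_tree = build_biome_tree(lineages)
rendered = "\n".join(render(biome_tree))
print(rendered)
```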
- class mgnipy.V2.mixins.DiskCheckpointer(*, params_getter, resource_str, config, results_store=None, count=None, num_requests=None)[source]#
Bases: object
Checkpoint manager for request-makers.
- Parameters:
- class mgnipy.V2.mixins.ResultsHandler(data=None)[source]#
Bases: object
- to_df(data=None, expand_nested_dicts=False, rename_columns=None, **kwargs)[source]#
Convert the current or provided metadata to a pandas DataFrame.
- Parameters:
data (list of dict, optional) – List of records to convert. If None, uses the data attribute.
expand_nested_dicts (list of str or bool, optional) – List of keys to expand into separate columns, or True to expand defaults.
rename_columns (dict of str to str, optional) – A dictionary mapping old column names to new column names.
**kwargs – Additional keyword arguments passed to pd.DataFrame.
- Returns:
DataFrame containing the metadata, or None when no data is available.
- Return type:
pd.DataFrame or None
Examples
>>> handler = ResultsHandler(data=[{"a": 1, "b": 2}])
>>> df = handler.to_df()
>>> list(df.columns)
['a', 'b']
>>> df.iloc[0]['a']
np.int64(1)
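The effect of expanding nested dictionaries and renaming columns can be approximated with plain pandas. This is a sketch of the general technique only: how to_df performs the expansion internally is not documented here, and the records and column names below are invented.

```python
import pandas as pd

# Hypothetical records with one nested dictionary per record.
records = [
    {"accession": "ACC1", "attributes": {"biome": "Marine", "depth": 10}},
    {"accession": "ACC2", "attributes": {"biome": "Soil", "depth": 2}},
]

# json_normalize flattens nested dicts into "parent.child" columns.
flat = pd.json_normalize(records, sep=".")

# Renaming uses an old-name -> new-name mapping, like rename_columns.
renamed = flat.rename(columns={"attributes.biome": "biome"})

print(list(renamed.columns))
```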
- to_json(data=None, orient='records', lines=True, **json_kwargs)[source]#
Convert the current metadata to a JSON string or save it to a file.
- Parameters:
- Returns:
The JSON string representation of the metadata, or None if no data is available.
- Return type:
str or None
- Raises:
RuntimeError – If no data is available to convert.
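The orient and lines parameters share their names with pandas' DataFrame.to_json. Assuming (this is not confirmed here) that to_json forwards them to pandas, the defaults orient='records', lines=True would produce JSON Lines output, one record per line, as in this sketch:

```python
import json

import pandas as pd

df = pd.DataFrame([{"x": 10, "y": "a"}, {"x": 20, "y": "b"}])

# One JSON object per line (JSON Lines); pandas requires orient="records" for this.
jsonl = df.to_json(orient="records", lines=True)

# With lines=False the same records are emitted as a single JSON array.
as_array = df.to_json(orient="records", lines=False)

parsed_lines = [json.loads(line) for line in jsonl.splitlines()]
print(parsed_lines)
```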
- to_list(data=None)[source]#
Convert the current or provided metadata to a list of dictionaries.
- Parameters:
data (optional) – The paginated data to convert. If None, uses the data attribute.
- Returns:
A list of metadata records as dictionaries, or None if no data is available.
- Return type:
Examples
>>> handler = ResultsHandler(data=[{"x": 10}])
>>> handler.to_list()
[{'x': 10}]
- to_polars(data=None, expand_nested_dicts=False, rename_columns=None, **polars_kwargs)[source]#
Convert the current metadata to a Polars DataFrame.
- Parameters:
- Returns:
A Polars DataFrame containing the metadata.
- Return type:
pl.DataFrame
- Raises:
RuntimeError – If no data is available to convert.
- class mgnipy.V2.mixins.StreamMixin(mgnifier_helper=None)[source]#
Bases: object
Mixin providing streaming helpers for downloads.
# TODO remove below dependencies on mgnifier
This mixin assumes the host class provides the following helpers/properties:
- _mgnifier_helper(url, cache_dir=None) returning an object with .exec.httpx_client and .exec.httpx_aclient attributes
- _get_type_by_alias(alias) to resolve file types
- downloads_df when needed for examples/tests
The implementation mirrors the streaming helpers previously defined on MGazine so they can be reused by other classes.
- stream(*, alias=None, url=None, chunksize=None, max_skip=5, **kwargs)[source]#
Streams a single download based on its alias or url.
If chunksize is specified then iterators of dataframes or strings will be returned; otherwise the full data will be returned as a single object.
Supported formats and their handlers#
- tsv: handled by stream_pandas() (pandas) or stream_polars() (polars). Gzipped TSVs are supported via the gzip/compression options.
- csv: handled by stream_pandas() / stream_polars() (sep=",").
- txt: handled by stream_txt() (returns full text or yields line chunks).
- html: handled by stream_html() (opens URL in browser).
- fasta: handled by stream_fasta() (scikit-bio generator).
- gff: handled by stream_gff() (scikit-bio generator).
- biom: handled by stream_biom() (scikit-bio generator).
- gzipped HTTP resources: use stream_gzipped() for a file-like object, or stream_json() for gzipped JSON content.
- jsonl / ndjson: handled by stream_jsonl() (pandas or polars modes).
- json: handled by stream_json() (returns full JSON or streams via ijson).
- tree/newick: handled by stream_tree() (scikit-bio newick reader).
- other: if the URL ends with .json it’s streamed via stream_json(); otherwise use the download helper for unsupported binary formats.
- Parameters:
alias (Optional[str]) – The alias of the download to stream.
url (Optional[HttpUrl]) – The url of the download to stream.
chunksize (Optional[int]) – The size of the chunks to read from the stream.
max_skip (int, optional) – The maximum number of rows to skip before raising an error. Default is 5.
**kwargs – Additional keyword arguments to pass to the streamer function.
- Returns:
The streamer result for the resolved alias or url.
- Return type:
Any
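The format-to-handler mapping listed for stream() amounts to a dispatch table. The sketch below is a simplified stand-in, not the method's actual logic: it picks a handler name from the URL suffix, whereas stream() can also resolve the type from an alias. HANDLERS, pick_handler, and the example URLs are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical mapping from file suffix to the handler named in the list above.
HANDLERS = {
    "tsv": "stream_pandas",
    "csv": "stream_pandas",
    "txt": "stream_txt",
    "html": "stream_html",
    "fasta": "stream_fasta",
    "gff": "stream_gff",
    "biom": "stream_biom",
    "jsonl": "stream_jsonl",
    "ndjson": "stream_jsonl",
    "json": "stream_json",
    "newick": "stream_tree",
}


def pick_handler(url):
    """Return (handler_name or None, is_gzipped) based on the URL path suffix."""
    path = urlparse(url).path.lower()
    gzipped = path.endswith(".gz")
    if gzipped:
        path = path[:-3]  # drop ".gz" so the inner format decides the handler
    suffix = path.rsplit(".", 1)[-1] if "." in path else ""
    return HANDLERS.get(suffix), gzipped


print(pick_handler("https://example.org/files/otu_table.tsv.gz"))
```

A None handler corresponds to the "other" case, where the download helper would be used instead.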
- stream_biom(url, **skbio_kwargs)[source]#
Stream a biom file from a URL using scikit-bio’s skbio.io.read() function. See the scikit-bio documentation for details.
- Parameters:
url (str) – The URL to the biom file to stream.
**skbio_kwargs – Additional keyword arguments passed to skbio.io.read(), such as into and verify.
- Returns:
A generator yielding scikit-bio Sequence objects parsed from the biom file.
- Return type:
Generator
- stream_fasta(url, **skbio_kwargs)[source]#
Stream a FASTA file from a URL using scikit-bio’s skbio.io.read() function. See the scikit-bio documentation for details.
- Parameters:
url (str) – The URL to the FASTA file to stream.
**skbio_kwargs – Additional keyword arguments passed to skbio.io.read(), such as into and verify.
- Returns:
A generator yielding scikit-bio Sequence objects parsed from the FASTA file.
- Return type:
Generator
- stream_gff(url, **skbio_kwargs)[source]#
Stream a GFF file from a URL using scikit-bio’s skbio.io.read() function. See the scikit-bio documentation for details.
- Parameters:
url (str) – The URL to the GFF file to stream.
**skbio_kwargs – Additional keyword arguments passed to skbio.io.read(), such as into and verify.
- Returns:
A generator yielding scikit-bio Sequence objects parsed from the GFF file.
- Return type:
Generator
- stream_gzipped(url, chunksize=None, httpx_client=None, decode=False, encoding='utf-8', errors='replace', **httpx_kwargs)[source]#
Stream a gzipped HTTP resource and present a file-like interface.
When chunksize is None the entire compressed payload is fetched and decompressed into memory. When chunksize is provided a streaming file-like object is returned.
- Parameters:
- Return type:
bytes | str | BufferedReader | TextIOWrapper
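The two modes and the four return types can be reproduced with the standard library alone. This is a minimal sketch under stated assumptions, not the method's implementation: open_gzipped is a hypothetical helper, and an in-memory io.BytesIO stands in for the HTTP response body.

```python
import gzip
import io


def open_gzipped(raw, chunksize=None, decode=False, encoding="utf-8", errors="replace"):
    """Decompress `raw` (a binary file-like object) fully or as a stream."""
    if chunksize is None:
        # Whole payload in memory: returns bytes, or str when decode=True.
        data = gzip.decompress(raw.read())
        return data.decode(encoding, errors) if decode else data
    # Streaming: decompression happens lazily as the reader is consumed.
    reader = io.BufferedReader(gzip.GzipFile(fileobj=raw), buffer_size=chunksize)
    if decode:
        return io.TextIOWrapper(reader, encoding=encoding, errors=errors)
    return reader


# Stand-in for a gzipped HTTP body.
payload = gzip.compress(b"id\tcount\nA\t1\nB\t2\n")

whole = open_gzipped(io.BytesIO(payload))
text_stream = open_gzipped(io.BytesIO(payload), chunksize=8, decode=True)
lines = [line.rstrip("\n") for line in text_stream]
print(lines)
```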
- stream_pandas(url, sep='\t', chunksize=None, max_skip=5, low_memory=False, **pd_kwargs)[source]#
Read a TSV from a URL or local file with resilient header handling.
The helper will retry with increasing skiprows when pandas raises a ParserError (useful for files with extra header lines). When chunksize is provided an iterator is returned.
- Parameters:
url (str) – The URL or local file path to read the TSV from.
sep (str) – The delimiter to use (default is tab).
chunksize (int or None) – If an integer is provided, returns an iterator that yields DataFrames of that many rows. If None, returns a single DataFrame.
max_skip (int) – The maximum number of lines to skip when trying to parse the TSV.
**pd_kwargs – Additional keyword arguments passed to pd.read_csv.
low_memory (bool)
- Returns:
A DataFrame containing the TSV data, or an iterator yielding DataFrames if chunksize is specified.
- Return type:
pd.DataFrame or TextFileReader
- Raises:
ValueError – If chunksize is not a positive integer or None.
RuntimeError – If the TSV cannot be parsed after skipping up to max_skip lines.
Pandas ParserError – If the TSV cannot be parsed due to a format error (after retries).
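The difference between the two return types can be seen with pd.read_csv directly, which is what the extra keyword arguments are passed to. The sketch below covers only the chunksize behaviour, not the skiprows retry, and uses an in-memory TSV (made-up data) in place of a URL.

```python
import io

import pandas as pd

# Made-up TSV content standing in for a downloaded file.
tsv_text = "id\tcount\nA\t1\nB\t2\nC\t3\nD\t4\nE\t5\n"

# Without chunksize: a single DataFrame holding every row.
full = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# With chunksize: a TextFileReader that yields DataFrames of at most 2 rows.
reader = pd.read_csv(io.StringIO(tsv_text), sep="\t", chunksize=2)
chunk_shapes = [chunk.shape for chunk in reader]

print(full.shape, chunk_shapes)
```

Iterating in chunks keeps memory bounded for large tables, at the cost of having to combine per-chunk results yourself.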
- stream_polars(url, sep='\t', chunksize=None, max_skip=5, low_memory=False, **pl_kwargs)[source]#
Read a TSV from a URL or local file into a Polars DataFrame with resilient header handling.
The helper will retry with increasing skip_rows when Polars raises an error (useful for files with extra header lines). When chunksize is provided an iterator is returned.
- Parameters:
url (str) – The URL or local file path to read the TSV from.
sep (str) – The delimiter to use (default is tab).
chunksize (int or None) – If an integer is provided, returns an iterator that yields DataFrames of that many rows. If None, returns a single DataFrame.
max_skip (int) – The maximum number of lines to skip when trying to parse the TSV.
**pl_kwargs – Additional keyword arguments passed to pl.read_csv.
low_memory (bool)
- Returns:
A Polars DataFrame containing the TSV data, or an iterator yielding DataFrames if chunksize is specified.
- Return type:
pl.DataFrame or Iterator[pl.DataFrame]
- Raises:
ValueError – If chunksize is not a positive integer or None.
RuntimeError – If the TSV cannot be parsed after skipping up to max_skip lines.
Polars Error – If the TSV cannot be parsed due to a format error (after retries).
- stream_txt(url, chunksize=None, httpx_client=None, **httpx_kwargs)[source]#
Stream a plain-text resource. When chunksize is None the full text is returned as a string. When chunksize is an integer the function yields lists of lines.
- Parameters:
url (str) – The URL to stream the text from.
chunksize (int or None) – If an integer is provided, yields lists of lines of that size. If None, returns the entire text as a single string.
httpx_client (httpx.Client, optional) – An optional httpx.Client to use for the request. If None, a new client will be created for the request.
**httpx_kwargs – Additional keyword arguments passed to the httpx.Client.request() method.
- Returns:
The full text as a string if chunksize is None, or a generator yielding lists of lines if chunksize is an integer.
- Return type:
str or Generator
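Yielding lists of lines of a fixed size is a simple batching pattern. The sketch below shows it over an in-memory list of lines. iter_line_chunks is a hypothetical helper rather than the method's code, and in practice the lines could come from a streaming HTTP response, for example via httpx's Response.iter_lines().

```python
from itertools import islice


def iter_line_chunks(lines, chunksize):
    """Yield successive lists of at most `chunksize` lines."""
    if not isinstance(chunksize, int) or chunksize < 1:
        raise ValueError("chunksize must be a positive integer")
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunksize))
        if not chunk:
            return  # source exhausted
        yield chunk


# Made-up text standing in for a downloaded plain-text resource.
text = "line1\nline2\nline3\nline4\nline5"
chunks = list(iter_line_chunks(text.splitlines(), 2))
print(chunks)
```

The last chunk may be shorter than chunksize when the number of lines is not an exact multiple.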