mgnipy.V2.mixins module#

class mgnipy.V2.mixins.BioSamplesMetadataMixin[source]#

Bases: object

Mixin providing properties to access BioSamples metadata for samples in the list.

async aenrich_biosamples(incl_ena=False, overwrite=False)[source]#

Asynchronously fetch and cache the BioSamples metadata for the sample associated with this SampleDetail instance, based on its accession. The metadata is cached for later retrieval by the abiosamplesdata property.

Parameters:
  • incl_ena (bool , optional) – Whether to include ENA-specific metadata fields in the fetched metadata. Defaults to False.

  • overwrite (bool , optional) – Whether to overwrite the cached metadata if it already exists. Defaults to False.

Return type:

None

Notes

  • This method retrieves the BioSamples metadata for the sample accession using the aget_biosample_metadata_from_acc function, which asynchronously queries the BioSamples API and constructs a DataFrame with the relevant metadata fields.

  • The resulting DataFrame will have a single row corresponding to the sample accession, and columns for ‘SampleID’, ‘SRA accession’, ‘taxid’, and any characteristics available for the sample, with missing values filled as ‘NA’.
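The interaction between the cache and the overwrite flag can be pictured with a minimal sketch. The SampleStub class, its _fetch_metadata coroutine, the _cache dict, and the accession are hypothetical stand-ins, not mgnipy's implementation; the real method queries the BioSamples API.

```python
import asyncio


class SampleStub:
    """Hypothetical stand-in for a sample object with a metadata cache."""

    def __init__(self, accession):
        self.accession = accession
        self._cache = {}      # assumed cache, keyed by incl_ena
        self.fetch_count = 0  # counts simulated network calls

    async def _fetch_metadata(self, incl_ena):
        # Placeholder for the real asynchronous BioSamples query.
        await asyncio.sleep(0)
        self.fetch_count += 1
        record = {"SampleID": self.accession, "taxid": "NA"}
        if incl_ena:
            record["ena_field"] = "NA"  # hypothetical ENA-specific column
        return record

    async def aenrich(self, incl_ena=False, overwrite=False):
        # Skip the fetch when a cached entry exists and overwrite is False.
        if incl_ena in self._cache and not overwrite:
            return None
        self._cache[incl_ena] = await self._fetch_metadata(incl_ena)
        return None


async def demo():
    sample = SampleStub("SAMEA0000000")   # made-up accession
    await sample.aenrich()                # first call fetches
    await sample.aenrich()                # second call hits the cache
    await sample.aenrich(overwrite=True)  # forces a refetch
    return sample.fetch_count


fetches = asyncio.run(demo())
```

Under these assumptions three calls trigger two fetches: the second call is served from the cache, and only overwrite=True forces a new request.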

async aenrich_biosamples_details(incl_ena=False, overwrite=False)[source]#

Asynchronously retrieve the concatenated BioSamples metadata for all samples in the list. Each row corresponds to a sample, and columns include ‘SampleID’, ‘SRA accession’, ‘taxid’, and any characteristics available for the samples.

Parameters:
  • incl_ena (bool , optional) – Whether to include ENA-specific metadata fields in the resulting DataFrame. Defaults to False.

  • overwrite (bool , optional) – Whether to overwrite the cached DataFrame if it already exists. Defaults to False.

Returns:

This method does not return a value, but it caches the BioSamples metadata for later retrieval.

Return type:

None

Notes

  • Relies on the abiosample_metadata property of each SampleDetail instance, which asynchronously retrieves the BioSamples metadata for each sample accession.

  • The resulting DataFrame is built by concatenating the per-sample DataFrames. If samples have different characteristics, it has a column for every characteristic that appears in any sample, with missing values filled as NaN.

biosamples(incl_ena=False)[source]#

A DataFrame containing the BioSamples metadata for the sample associated with this SampleDetail instance, based on its accession. The DataFrame includes columns such as ‘SampleID’, ‘SRA accession’, ‘taxid’, and any characteristics available for the sample.

Parameters:

incl_ena (bool )

Return type:

DataFrame

details_biosamples(incl_ena=True)[source]#

A DataFrame containing the concatenated BioSamples metadata for all samples in the list. Each row corresponds to a sample, and columns include ‘SampleID’, ‘SRA accession’, ‘taxid’, and any characteristics available for the samples.

Parameters:

incl_ena (bool , optional) – Whether to include ENA-specific metadata fields in the resulting DataFrame. Defaults to True.

Returns:

A DataFrame containing the BioSamples metadata for all samples in the list.

Return type:

pd.DataFrame

Notes

  • Relies on the biosample_metadata property of each SampleDetail instance, which retrieves the BioSamples metadata for each sample accession.

  • The resulting DataFrame is built by concatenating the per-sample DataFrames. If samples have different characteristics, it has a column for every characteristic that appears in any sample, with missing values filled as NaN.
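The column-union behaviour described in the notes is what pandas does by default when frames are concatenated row-wise. A small illustration follows; the sample IDs, taxids, and characteristic names are made up, and the frames only stand in for the per-sample metadata.

```python
import pandas as pd

# Two single-row frames standing in for per-sample metadata (made-up values).
sample_a = pd.DataFrame([{"SampleID": "S1", "taxid": "410658", "depth": "10"}])
sample_b = pd.DataFrame([{"SampleID": "S2", "taxid": "408172", "host": "human"}])

# Row-wise concatenation takes the union of the columns; cells that a sample
# does not define are filled with NaN.
combined = pd.concat([sample_a, sample_b], ignore_index=True)
```

Here combined has the columns SampleID, taxid, depth, and host; depth is NaN for S2 and host is NaN for S1.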

enrich_biosamples(incl_ena=False, overwrite=False)[source]#

Fetch and cache the BioSamples metadata for the sample associated with this SampleDetail instance, based on its accession. The metadata is cached for later retrieval by the biosamplesdata property.

Parameters:
  • incl_ena (bool , optional) – Whether to include ENA-specific metadata fields in the fetched metadata. Defaults to False.

  • overwrite (bool , optional) – Whether to overwrite the cached metadata if it already exists. Defaults to False.

Return type:

None

Notes

  • This method retrieves the BioSamples metadata for the sample accession using the get_biosample_metadata_from_acc function, which queries the BioSamples API and constructs a DataFrame with the relevant metadata fields.

  • The resulting DataFrame will have a single row corresponding to the sample accession, and columns for ‘SampleID’, ‘SRA accession’, ‘taxid’, and any characteristics available for the sample, with missing values filled as ‘NA’.

enrich_biosamples_details(incl_ena=True, overwrite=False)[source]#

Fetch and cache the concatenated BioSamples metadata for all samples in the list. Each row corresponds to a sample, and columns include ‘SampleID’, ‘SRA accession’, ‘taxid’, and any characteristics available for the samples. The metadata is stored in the cache properties for later retrieval by the details_biosamplesdata property.

Parameters:
  • incl_ena (bool , optional) – Whether to include ENA-specific metadata fields in the resulting DataFrame. Defaults to True.

  • overwrite (bool , optional) – Whether to overwrite the cached DataFrame if it already exists. Defaults to False.

Return type:

None

Notes

  • Relies on the biosample_metadata property of each SampleDetail instance, which retrieves the BioSamples metadata for each sample accession.

  • The resulting DataFrame is built by concatenating the per-sample DataFrames. If samples have different characteristics, it has a column for every characteristic that appears in any sample, with missing values filled as NaN.

class mgnipy.V2.mixins.BiomesTreeMixin[source]#

Bases: object

property lineages: list [str ]#
property results: dict #

Get results and auto-normalize lineage field.

show_tree(method='compact')[source]#
Parameters:

method (Literal ['compact', 'show', 'print', 'horizontal', 'hshow', 'h', 'hprint', 'vertical', 'vshow', 'v', 'vprint'])

property tree: Tree#

Convert the biomes metadata to a tree structure for visualization or analysis.

Returns:

A tree representation of the biomes and their relationships.

Return type:

Tree
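As an illustration of turning lineages into a tree, the sketch below nests colon-separated lineage strings into plain dictionaries. The separator, the example lineages, and the build_tree and render helpers are assumptions made for this sketch; the Tree type returned by the property is not modelled here.

```python
def build_tree(lineages, sep=":"):
    """Nest lineage strings into a dict of dicts (hypothetical helper)."""
    tree = {}
    for lineage in lineages:
        node = tree
        for part in lineage.split(sep):
            # Reuse the child if it already exists, otherwise create it.
            node = node.setdefault(part, {})
    return tree


def render(tree, indent=0):
    """Return one indented line per node, depth first."""
    lines = []
    for name, children in tree.items():
        lines.append("  " * indent + name)
        lines.extend(render(children, indent + 1))
    return lines


lineages = [
    "root:Environmental:Aquatic",
    "root:Environmental:Terrestrial",
    "root:Engineered",
]
tree = build_tree(lineages)
print("\n".join(render(tree)))
```

Shared prefixes collapse into a single branch, so Environmental appears once with Aquatic and Terrestrial beneath it.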

class mgnipy.V2.mixins.DiskCheckpointer(*, params_getter, resource_str, config, results_store=None, count=None, num_requests=None)[source]#

Bases: object

Checkpoint manager for request-makers.

Parameters:
  • params_getter (Callable[[], dict ])

  • resource_str (str )

  • config (MGnipyConfig)

  • results_store (Optional[dict ])

  • count (Optional[Callable[[], int ]])

  • num_requests (Optional[Callable[[], int ]])

async aload_cache()[source]#

Async wrapper for load_cache.

Return type:

int

async awrite_results(request_num, items)[source]#

Async wrapper for write_results.

Parameters:
Return type:

None
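One common way to wrap a blocking method for async callers is to run it in a worker thread. The sketch below shows that pattern with asyncio.to_thread; whether awrite_results is implemented this way is an assumption, and CheckpointerStub is a hypothetical class that keeps pages in memory instead of on disk.

```python
import asyncio


class CheckpointerStub:
    """Hypothetical stand-in with a blocking write and an async wrapper."""

    def __init__(self):
        self.pages = {}

    def write_results(self, request_num, items):
        # Blocking work (disk I/O in a real checkpointer).
        self.pages[request_num] = list(items)

    async def awrite_results(self, request_num, items):
        # Run the blocking call in a worker thread so the event loop stays free.
        await asyncio.to_thread(self.write_results, request_num, items)


async def main():
    cp = CheckpointerStub()
    await asyncio.gather(
        cp.awrite_results(1, [{"id": "a"}]),
        cp.awrite_results(2, [{"id": "b"}]),
    )
    return cp.pages


pages = asyncio.run(main())
```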

clear_cache()[source]#

Remove all cached pages for this query.

Return type:

None

load_cache()[source]#
Return type:

int

load_cache_manifest()[source]#
Return type:

dict

load_cache_results()[source]#

Load cached pages into self._results. Returns count loaded.

Return type:

list [int ]

write_results(request_num, items)[source]#

Atomically write the results to disk.

Parameters:
Return type:

None
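A typical way to make a write atomic is to write to a temporary file in the same directory and then rename it over the target. The sketch below illustrates that general technique; the atomic_write_json helper, the JSON format, and the page_1.json file name are assumptions, not the on-disk layout used by DiskCheckpointer.

```python
import json
import os
import tempfile


def atomic_write_json(path, items):
    """Write items to path so readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(items, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace swaps the file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as cache_dir:
    page_path = os.path.join(cache_dir, "page_1.json")  # hypothetical layout
    atomic_write_json(page_path, [{"accession": "X1"}])
    with open(page_path, encoding="utf-8") as handle:
        loaded = json.load(handle)
    leftovers = [n for n in os.listdir(cache_dir) if n.endswith(".tmp")]
```

If the process dies before os.replace runs, the target keeps its previous content and only a stray temporary file is left behind.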

class mgnipy.V2.mixins.ResultsHandler(data=None)[source]#

Bases: object

Parameters:

data (Optional[chain[dict [str , Any]]])

property data: chain[dict [str , Any ]]#

Results based on the current resource.

to_df(data=None, expand_nested_dicts=False, rename_columns=None, **kwargs)[source]#

Convert the current or provided metadata to a pandas DataFrame.

Parameters:
  • data (list of dict , optional) – List of records to convert. If None, uses the data attribute.

  • expand_nested_dicts (list of str or bool , optional) – List of keys to expand into separate columns, or True to expand defaults.

  • rename_columns (dict of str to str, optional) – A dictionary mapping old column names to new column names.

  • **kwargs – Additional keyword arguments passed to pd.DataFrame.

Returns:

DataFrame containing the metadata or None when no data is available.

Return type:

pd.DataFrame or None
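To make the idea behind expand_nested_dicts concrete, here is a plain-Python sketch that flattens selected nested dictionaries into top-level keys before tabulation. The expand_record helper, the dotted column names, and the example record are assumptions; the column names produced by to_df may differ.

```python
def expand_record(record, keys):
    """Flatten selected nested dicts into prefixed top-level keys."""
    flat = {}
    for name, value in record.items():
        if name in keys and isinstance(value, dict):
            # Promote each nested entry to its own column.
            for sub_name, sub_value in value.items():
                flat[f"{name}.{sub_name}"] = sub_value
        else:
            flat[name] = value
    return flat


records = [
    {"accession": "A1", "biome": {"lineage": "root:Engineered", "name": "Engineered"}}
]
flat_records = [expand_record(r, keys={"biome"}) for r in records]
```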

Examples

>>> handler = ResultsHandler(data=[{"a": 1, "b": 2}])
>>> df = handler.to_df()
>>> list(df.columns)
['a', 'b']
>>> df.iloc[0]['a']
np.int64(1)

to_json(data=None, orient='records', lines=True, **json_kwargs)[source]#

Convert the current metadata to a JSON string or save it to a file.

Parameters:
  • data (dict of int to list of dict , optional) – The paginated data to convert. If None, uses self.qs._results.

  • **json_kwargs – Additional keyword arguments passed to the JSON serialization function.

  • orient (str )

  • lines (bool )

Returns:

The JSON string representation of the metadata, or None if no data is available.

Return type:

str or None

Raises:

RuntimeError – If no data is available to convert.
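The parameter names orient and lines match those of pandas's DataFrame.to_json, where orient='records' combined with lines=True produces line-delimited JSON. Assuming this method follows the same convention, the output shape can be illustrated with the standard library alone; the records are made up.

```python
import json

records = [{"accession": "A1", "n": 1}, {"accession": "A2", "n": 2}]

# One JSON document per line, with no enclosing array.
json_lines = "\n".join(json.dumps(record) for record in records)

# Reading it back is a line-by-line parse.
parsed = [json.loads(line) for line in json_lines.splitlines()]
```

Line-delimited output can be appended to and parsed record by record, which suits paginated results.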

to_list(data=None)[source]#

Convert the current or provided metadata to a list of dictionaries.

Parameters:

data (optional) – The paginated data to convert. If None, uses the data attribute.

Returns:

A list of metadata records as dictionaries, or None if no data is available.

Return type:

list

Examples

>>> handler = ResultsHandler(data=[{"x": 10}])
>>> handler.to_list()
[{'x': 10}]

to_polars(data=None, expand_nested_dicts=False, rename_columns=None, **polars_kwargs)[source]#

Convert the current metadata to a Polars DataFrame.

Parameters:
  • data (dict of int to list of dict , optional) – The paginated data to convert. If None, uses self.qs._results.

  • **polars_kwargs – Additional keyword arguments passed to pl.DataFrame.

  • expand_nested_dicts (list [str ] | bool | None)

  • rename_columns (dict [str , str ] | None)

Returns:

A Polars DataFrame containing the metadata.

Return type:

pl.DataFrame

Raises:

RuntimeError – If no data is available to convert.

class mgnipy.V2.mixins.StreamMixin(mgnifier_helper=None)[source]#

Bases: object

Mixin providing streaming helpers for downloads.

TODO: remove the dependencies on mgnifier listed below.

This mixin assumes the host class provides the following helpers/properties:

  • _mgnifier_helper(url, cache_dir=None), returning an object with .exec.httpx_client and .exec.httpx_aclient attributes

  • _get_type_by_alias(alias) to resolve file types

  • downloads_df when needed for examples/tests

The implementation mirrors the streaming helpers previously defined on MGazine so they can be reused by other classes.

stream(*, alias=None, url=None, chunksize=None, max_skip=5, **kwargs)[source]#

Streams a single download based on its alias or url.

If chunksize is specified then iterators of dataframes or strings will be returned; otherwise the full data will be returned as a single object.
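The dispatch from an alias or URL to a format-specific handler might look roughly like the sketch below. The HANDLERS registry, the ALIASES table, the example.org URLs, and the extension-based lookup are all hypothetical; the mixin resolves aliases through the host class, and its handlers return real data rather than strings.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical registry mapping a file type to a handler function.
HANDLERS: Dict[str, Callable[..., Any]] = {
    "tsv": lambda url, chunksize=None: f"table from {url}",
    "txt": lambda url, chunksize=None: f"text from {url}",
}

# Hypothetical alias table; the real mixin resolves aliases differently.
ALIASES = {"taxonomy": ("https://example.org/taxonomy.tsv", "tsv")}


def stream(*, alias: Optional[str] = None, url: Optional[str] = None,
           chunksize: Optional[int] = None) -> Any:
    """Resolve the download, then hand it to the matching handler."""
    if alias is not None:
        url, file_type = ALIASES[alias]
    elif url is not None:
        # Fall back to the file extension when only a URL is given.
        file_type = url.rsplit(".", 1)[-1]
    else:
        raise ValueError("provide alias or url")
    return HANDLERS[file_type](url, chunksize=chunksize)


by_alias = stream(alias="taxonomy")
by_url = stream(url="https://example.org/readme.txt")
```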

Supported formats and their handlers#

Parameters:
  • alias (str | None) – The alias of the download to stream.

  • url (HttpUrl | None) – The URL of the download to stream.

  • chunksize (int | None) – The size of the chunks to read from the stream.

  • max_skip (int, optional) – The maximum number of rows to skip before raising an error. Defaults to 5.

  • **kwargs – Additional keyword arguments passed to the streamer function.

Returns:

The streamer result for the resolved alias or url.

Return type:

Any

stream_biom(url, **skbio_kwargs)[source]#

Stream a BIOM file from a URL using scikit-bio’s read function. Refer there for more info.

Parameters:
  • url (str ) – The URL to the biom file to stream.

  • **skbio_kwargs – Additional keyword arguments passed to skbio.io.read(), such as into and verify.

Returns:

A generator yielding the scikit-bio objects parsed from the BIOM file.

Return type:

Generator

stream_fasta(url, **skbio_kwargs)[source]#

Stream a FASTA file from a URL using scikit-bio’s read function. Refer there for more info.

Parameters:
  • url (str ) – The URL to the FASTA file to stream.

  • **skbio_kwargs – Additional keyword arguments passed to skbio.io.read(), such as into and verify.

Returns:

A generator yielding scikit-bio Sequence objects parsed from the FASTA file.

Return type:

Generator

stream_gff(url, **skbio_kwargs)[source]#

Stream a GFF file from a URL using scikit-bio’s read function. Refer there for more info.

Parameters:
  • url (str ) – The URL to the GFF file to stream.

  • **skbio_kwargs – Additional keyword arguments passed to skbio.io.read(), such as into and verify.

Returns:

A generator yielding the scikit-bio objects parsed from the GFF file.

Return type:

Generator

stream_gzipped(url, chunksize=None, httpx_client=None, decode=False, encoding='utf-8', errors='replace', **httpx_kwargs)[source]#

Stream a gzipped HTTP resource and present a file-like interface.

When chunksize is None the entire compressed payload is fetched and decompressed into memory. When chunksize is provided a streaming file-like object is returned.

Parameters:
  • url (str )

  • chunksize (int | None)

  • httpx_client (Client | None)

  • decode (bool )

  • encoding (str )

  • errors (str )

Return type:

bytes | str | BufferedReader | TextIOWrapper
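The two modes can be illustrated on an in-memory payload with the standard library. The open_gzipped helper below is a hypothetical sketch that skips the HTTP request entirely; it only shows how an eager decompress differs from a lazily read file-like object, and how decode switches between bytes and text.

```python
import gzip
import io


def open_gzipped(payload, chunksize=None, decode=False,
                 encoding="utf-8", errors="replace"):
    """Mimic the two modes on an in-memory payload (hypothetical helper)."""
    if chunksize is None:
        # Eager mode: decompress everything at once.
        data = gzip.decompress(payload)
        return data.decode(encoding, errors) if decode else data
    # Lazy mode: wrap the compressed bytes in a buffered file-like reader.
    raw = gzip.GzipFile(fileobj=io.BytesIO(payload))
    reader = io.BufferedReader(raw, buffer_size=chunksize)
    if decode:
        return io.TextIOWrapper(reader, encoding=encoding, errors=errors)
    return reader


payload = gzip.compress(b"line one\nline two\n")
whole = open_gzipped(payload, decode=True)
with open_gzipped(payload, chunksize=8) as handle:
    first_chunk = handle.read(8)
```

In the lazy mode the caller decides how much to read at a time, so a large file never has to fit in memory at once.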

stream_html(url, **web_kwargs)[source]#

Open an HTML URL in the default web browser.

Parameters:
  • url (str ) – The URL to open in the web browser.

  • **web_kwargs – Additional keyword arguments passed to webbrowser.open(), such as new and autoraise.

Returns:

True if the URL was opened successfully, False otherwise.

Return type:

bool

stream_json(url, chunksize=None, httpx_client=None, **httpx_kwargs)[source]#
Parameters:
  • url (str )

  • chunksize (int | None)

  • httpx_client (Client | None)

Return type:

dict | Generator

stream_jsonl(url, orient=None, chunksize=None, dataframe_engine='pandas', **df_kwargs)[source]#
Parameters:
  • url (str )

  • orient (Literal ['records', 'split', 'index', 'columns', 'values', 'table'] | None)

  • chunksize (int | None)

  • dataframe_engine (Literal ['pandas', 'polars'] | None)

Return type:

dict

stream_pandas(url, sep='\t', chunksize=None, max_skip=5, low_memory=False, **pd_kwargs)[source]#

Read a TSV from a URL or local file with resilient header handling.

The helper will retry with increasing skiprows when pandas raises a ParserError (useful for files with extra header lines). When chunksize is provided an iterator is returned.

Parameters:
  • url (str ) – The URL or local file path to read the TSV from.

  • sep (str ) – The delimiter to use (default is tab).

  • chunksize (int or None) – If an integer is provided, returns an iterator that yields DataFrames of that many rows. If None, returns a single DataFrame.

  • max_skip (int ) – The maximum number of lines to skip when trying to parse the TSV.

  • **pd_kwargs – Additional keyword arguments passed to pd.read_csv.

  • low_memory (bool )

Returns:

A DataFrame containing the TSV data, or an iterator yielding DataFrames if chunksize is specified.

Return type:

pd.DataFrame or TextFileReader

Raises:
  • ValueError – If chunksize is not a positive integer or None.

  • RuntimeError – If the TSV cannot be parsed after skipping up to max_skip lines.

  • pandas.errors.ParserError – If the TSV cannot be parsed due to a format error (after retries).
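The retry strategy can be sketched without pandas. Below, parse_tsv is a deliberately strict stand-in for pd.read_csv that raises ValueError where pandas would raise a ParserError, and parse_with_retries skips one more leading line after each failure. Both helpers and the sample text are hypothetical.

```python
def parse_tsv(lines, sep="\t"):
    """Strict stand-in parser: every row must match the header width."""
    header = lines[0].split(sep)
    rows = []
    for number, line in enumerate(lines[1:], start=2):
        fields = line.split(sep)
        if len(fields) != len(header):
            raise ValueError(f"line {number}: expected {len(header)} fields")
        rows.append(dict(zip(header, fields)))
    return rows


def parse_with_retries(text, max_skip=5):
    """Retry parsing, skipping one more leading line after each failure."""
    lines = text.splitlines()
    for skip in range(max_skip + 1):
        try:
            return parse_tsv(lines[skip:]), skip
        except (ValueError, IndexError):
            continue
    raise RuntimeError(f"could not parse after skipping {max_skip} lines")


text = "# generated by pipeline\n# version 2\nid\tcount\nA\t3\nB\t7\n"
rows, skipped = parse_with_retries(text)
```

With two extra header lines the third attempt succeeds, so skipped is 2; lowering max_skip below that raises the RuntimeError instead.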

stream_polars(url, sep='\t', chunksize=None, max_skip=5, low_memory=False, **pl_kwargs)[source]#

Read a TSV from a URL or local file into a Polars DataFrame with resilient header handling.

The helper will retry with increasing skip_rows when Polars raises an error (useful for files with extra header lines). When chunksize is provided an iterator is returned.

Parameters:
  • url (str ) – The URL or local file path to read the TSV from.

  • sep (str ) – The delimiter to use (default is tab).

  • chunksize (int or None) – If an integer is provided, returns an iterator that yields DataFrames of that many rows. If None, returns a single DataFrame.

  • max_skip (int ) – The maximum number of lines to skip when trying to parse the TSV.

  • **pl_kwargs – Additional keyword arguments passed to pl.read_csv.

  • low_memory (bool )

Returns:

A Polars DataFrame containing the TSV data, or an iterator yielding DataFrames if chunksize is specified.

Return type:

pl.DataFrame or Iterator[pl.DataFrame]

Raises:
  • ValueError – If chunksize is not a positive integer or None.

  • RuntimeError – If the TSV cannot be parsed after skipping up to max_skip lines.

  • Polars Error – If the TSV cannot be parsed due to a format error (after retries).

stream_tree(url, **skbio_kwargs)[source]#
Parameters:

url (str )

Return type:

Generator

stream_txt(url, chunksize=None, httpx_client=None, **httpx_kwargs)[source]#

Stream a plain-text resource. When chunksize is None the full text is returned as a string. When chunksize is an integer the function yields lists of lines.

Parameters:
  • url (str ) – The URL to stream the text from.

  • chunksize (int or None) – If an integer is provided, yields lists of that many lines. If None, returns the entire text as a single string.

  • httpx_client (httpx.Client, optional) – An optional httpx.Client to use for the request. If None, a new client will be created for the request.

  • **httpx_kwargs – Additional keyword arguments passed to the httpx.Client.request() method

Returns:

The full text as a string if chunksize is None, or a generator yielding lists of lines if chunksize is an integer.

Return type:

str or Generator
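The difference between the two return shapes can be shown on an in-memory text buffer. The read_text and _line_chunks helpers are hypothetical and skip the HTTP request; they only illustrate returning a string versus a generator of line lists.

```python
import io
from itertools import islice


def _line_chunks(handle, chunksize):
    """Yield lists of at most chunksize lines until the handle is exhausted."""
    while True:
        chunk = list(islice(handle, chunksize))
        if not chunk:
            return
        yield chunk


def read_text(handle, chunksize=None):
    """Return the whole text, or a generator of line lists."""
    if chunksize is None:
        return handle.read()
    return _line_chunks(handle, chunksize)


text = "a\nb\nc\nd\ne\n"
full = read_text(io.StringIO(text))
chunks = list(read_text(io.StringIO(text), chunksize=2))
```

The generator lives in its own function so that read_text can return a plain string in the non-chunked case; the last chunk may be shorter than chunksize.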