A MGnifyList of records#

Here we demonstrate basic usage of MGnipy to see which records/items (e.g., biomes) are available in a given resource (e.g., the Biomes endpoint of the MGnify API v2).


🎯 The Goal: Get a list of MGnify Biomes#

The GOLD ecosystem classifications organize environmental samples into a hierarchical taxonomy of biome types, from broad categories like “Engineered” to specific environments like “Plant rhizosphere.”

This demo will show you how to:

  1. Prepare queries β€” Learn different ways to initialize and configure your API requests using MGnipy or direct proxies

  2. Preview before fetching β€” Use filtering and preview methods (preview, dry_run, explain) to confirm your query before retrieving results

  3. Fetch results β€” Execute requests using iterative get(), specific page(), or bulk_fetch() methods (sync or async)

  4. Monitor progress β€” Track your requests and check completion status

By the end, we hope you’ll be comfortable querying MGnify resources, or at least the biomes resource.

# uncomment below if colab
#!pip install mgnipy

We can initialize a query using either mgnipy.MGnipy or mgnipy.V2.proxies.Biomes.

πŸ–οΈ The start: Preparing queries#

Option 1. mgnipy.MGnipy#

The MGnipy client offers a unified interface to access various MGnify API endpoints, including biomes. This approach is convenient if you want to manage multiple types of queries or resources through a single client object.

  • Instantiate MGnipy to configure your API access and manage requests.

  • Use .biomes to create a biome query with your desired parameters.

  • Use list_supported_params() to see all available filters and options.

  • The filter() method allows you to refine your query further.

  • The explain() method previews the constructed API URLs and the first few results.

This approach also comes with an additional helper function, describe_resources(), to list and describe the available resources.

💡 Tip: See the Configuration page for more setup details 🛠.

from mgnipy import MGnipy

# init
mg = MGnipy(
    # configuration
    cache_dir=None,  # set to None to disable caching, or specify a directory for caching
)

# access proxy
biomes = mg.biomes

# checking it out
print(biomes)
MGnifier instance for resource: biomes
I.e., mgnipy.V2.proxies.biomes.Biomes
----------------------------------------
Base URL: https://www.ebi.ac.uk/
Parameters: {}
Example request URL: https://www.ebi.ac.uk/metagenomics/api/v2/biomes?page=1
Endpoint module: mgnipy.emgapi_v2_client.api.miscellaneous.list_mgnify_biomes
Is list endpoint (returns paginated results): True
Cache directory: None

The printout shows that no query parameters have been set yet.

If you would like to know what params are supported for the endpoint there is a helper method you can use: .list_supported_params()

# if not sure which kwargs are supported
print("Supported kwargs for biomes: ", biomes.list_supported_params())
Supported kwargs for biomes:  ['biome_lineage', 'max_depth', 'page', 'page_size']

Similar to describe_resources(), there is also a describe_endpoint() method that gives even more information about the endpoint, based on the openapi.json spec.

biomes.describe_endpoint()
List all biomes

List all biomes in the MGnify database.

Supported parameters:
- biome_lineage: None | str | Unset The lineage to match, including all descendant biomes
- max_depth: int | None | Unset Maximum depth of the biome lineage to include, e.g. `root` is 1 and `root:Host-Associated:Human` is level 3
- page: int | Unset Default: 1.
- page_size: int | None | Unset
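The depth convention in the max_depth description can be checked by hand. The following is a minimal sketch, assuming depth is simply the number of colon-separated levels in biome_lineage; the helper names lineage_depth and within_max_depth are hypothetical and not part of mgnipy, and the actual filtering is done by the API, not client-side.

```python
def lineage_depth(biome_lineage: str) -> int:
    """Count the colon-separated levels in a lineage string (assumed convention)."""
    return len(biome_lineage.split(":"))


def within_max_depth(biome_lineage: str, max_depth: int) -> bool:
    """Hypothetical client-side check mirroring what max_depth is described to do."""
    return lineage_depth(biome_lineage) <= max_depth


print(lineage_depth("root"))  # → 1
print(lineage_depth("root:Host-Associated:Human"))  # → 3

# A six-level lineage taken from the biomes output in this notebook
deep = "root:Engineered:Bioremediation:Terephthalate:Wastewater:Activated sludge"
print(lineage_depth(deep), within_max_depth(deep, 6))  # → 6 True
```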

Let’s add some search params via .filter()

biomes = biomes.filter(
    page_size=5,
    max_depth=6,
)

# check it out again
print(biomes)
MGnifier instance for resource: biomes
I.e., mgnipy.V2.proxies.biomes.Biomes
----------------------------------------
Base URL: https://www.ebi.ac.uk/
Parameters: {'page_size': 5, 'max_depth': 6}
Example request URL: https://www.ebi.ac.uk/metagenomics/api/v2/biomes?max_depth=6&page=1&page_size=5
Endpoint module: mgnipy.emgapi_v2_client.api.miscellaneous.list_mgnify_biomes
Is list endpoint (returns paginated results): True
Cache directory: None

Great, we can see that the query string (i.e., the part after the ?) has been updated with our given parameters.
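To make the link between parameters and query string concrete, here is a minimal standard-library sketch that reproduces the example request URL shown in the printout. It is an illustration only, not mgnipy's implementation: the function name example_url is hypothetical, and the alphabetical ordering of parameters is inferred from the printed URL.

```python
from urllib.parse import urlencode

# Endpoint taken from the "Example request URL" line in the printout
BIOMES_ENDPOINT = "https://www.ebi.ac.uk/metagenomics/api/v2/biomes"


def example_url(params: dict, page: int = 1) -> str:
    """Append the page number and encode the parameters in sorted order."""
    query = {**params, "page": page}
    return f"{BIOMES_ENDPOINT}?{urlencode(sorted(query.items()))}"


print(example_url({"page_size": 5, "max_depth": 6}))
# → https://www.ebi.ac.uk/metagenomics/api/v2/biomes?max_depth=6&page=1&page_size=5
```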

Option 2. Proxies such as mgnipy.V2.proxies.Biomes#

  • Alternatively, you can also instantiate and configure one resource proxy at a time via the available mgnipy.V2.proxies 😊

  • It all works the same, since mgnipy.V2.proxies.Biomes is what is returned via:

    # init client
    mg = MGnipy()
    # get biomes proxy
    biomes_proxy = mg.biomes
    
from mgnipy.V2.proxies import Biomes

biomes = Biomes(
    config={
        "cache_dir": None
    },  # set to None to disable caching, or specify a directory for caching
    page_size=5,
)

# and can filter as well
biomes = biomes.filter(
    max_depth=6,
)
print(biomes)
MGnifier instance for resource: biomes
I.e., mgnipy.V2.proxies.biomes.Biomes
----------------------------------------
Base URL: https://www.ebi.ac.uk/
Parameters: {'page_size': 5, 'max_depth': 6}
Example request URL: https://www.ebi.ac.uk/metagenomics/api/v2/biomes?max_depth=6&page=1&page_size=5
Endpoint module: mgnipy.emgapi_v2_client.api.miscellaneous.list_mgnify_biomes
Is list endpoint (returns paginated results): True
Cache directory: None

👓 Previewing your requests#

There is an optional but recommended step to

  • .preview() the first page of results as a pandas.DataFrame, or

  • .dry_run() to print the number of pages and records to request

  • .explain() to print the planned request urls

before .get()ting all the result pages.

# checking out first 5 request urls to be made
biomes.explain(head=5)
# or
# biomes.dry_run()
# or
biomes.preview()
https://www.ebi.ac.uk/metagenomics/api/v2/biomes?max_depth=6&page=1&page_size=5
https://www.ebi.ac.uk/metagenomics/api/v2/biomes?max_depth=6&page=2&page_size=5
https://www.ebi.ac.uk/metagenomics/api/v2/biomes?max_depth=6&page=3&page_size=5
https://www.ebi.ac.uk/metagenomics/api/v2/biomes?max_depth=6&page=4&page_size=5
https://www.ebi.ac.uk/metagenomics/api/v2/biomes?max_depth=6&page=5&page_size=5

📨 Carry out requests to list endpoints#

If happy with the plan, proceed with the async or sync get requests.

There are multiple options:

  • .get() or .aget(): like next(), each call carries out one page/request and returns the page dict, or None when iteration is complete

  • .page() or .apage(): fetch a specific page by passing page_num

  • .bulk_fetch() or .abulk_fetch(): fetch the pages in bulk, synchronously or asynchronously
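Because .get() is described as returning None once iteration is complete, it lends itself to a simple while loop. The sketch below shows that pattern with a stand-in class so it runs without network access; FakePager is hypothetical and not part of mgnipy, and storing pages under 1-based page numbers is an assumption made to resemble how results are indexed by page in this notebook.

```python
class FakePager:
    """Hypothetical stand-in for a paginated proxy; not mgnipy code."""

    def __init__(self, pages):
        self._pages = list(pages)
        self._next = 0
        self.results = {}  # page number -> list of records (assumed layout)

    def get(self):
        """Return the next page, or None when every page has been fetched."""
        if self._next >= len(self._pages):
            return None
        page = self._pages[self._next]
        self._next += 1
        self.results[self._next] = page  # 1-based page numbers
        return page


pager = FakePager([[{"biome_name": "root"}], [{"biome_name": "Control"}]])

fetched = 0
while (page := pager.get()) is not None:
    fetched += 1

print(fetched)  # → 2
```

With the real proxy, the same loop shape would apply, subject to the protective request limits described in the next section.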

Option 1. .get() iteratively#

For a demo of this we will make the first 5 requests.

NOTE: To protect both the API and your memory, there are limits on the number of requests that can be made in one bulk fetch call or iteration. Requests can, however, be continued using .continue_interator() or .resume(); see the caching notebook for more details.

# getting first 5
for _ in range(5):
    biomes.get()

Each option has an async counterpart:

for _ in range(5):
    await biomes.aget()

and you can take a look at the results as you go 😀 :

# by page, e.g. page 5
biomes.results[5]
[{'biome_name': 'Activated sludge',
  'biome_lineage': 'root:Engineered:Bioremediation:Terephthalate:Wastewater:Activated sludge'},
 {'biome_name': 'Bioreactor',
  'biome_lineage': 'root:Engineered:Bioremediation:Terephthalate:Wastewater:Bioreactor'},
 {'biome_name': 'Tetrachloroethylene and derivatives',
  'biome_lineage': 'root:Engineered:Bioremediation:Tetrachloroethylene and derivatives'},
 {'biome_name': 'Chloroethene',
  'biome_lineage': 'root:Engineered:Bioremediation:Tetrachloroethylene and derivatives:Chloroethene'},
 {'biome_name': 'Bioreactor',
  'biome_lineage': 'root:Engineered:Bioremediation:Tetrachloroethylene and derivatives:Chloroethene:Bioreactor'}]
# or by records, first 2 records
biomes.to_list()[:2]
# or via .records iterator
# list(biomes.records)[:2]
[{'biome_name': 'root', 'biome_lineage': 'root'},
 {'biome_name': 'Control', 'biome_lineage': 'root:Control'}]

Specific to the biomes, results can also be visualized as a tree (“print”, “hshow”, or “vshow”):

biomes.show_tree()
root
├── Control
└── Engineered
    ├── Biogas plant
    │   └── Wet fermentation
    ├── Bioreactor
    │   └── Continuous culture
    │       ├── Marine intertidal flat sediment inoculum
    │       │   └── Wadden Sea-Germany
    │       └── Marine sediment inoculum
    │           └── Wadden Sea-Germany
    ├── Bioremediation
    │   ├── Hydrocarbon
    │   │   └── Benzene
    │   │       └── Bioreactor
    │   ├── Metal
    │   ├── Persistent organic pollutants (POP)
    │   ├── Polycyclic aromatic hydrocarbons
    │   ├── Terephthalate
    │   │   └── Wastewater
    │   │       ├── Activated sludge
    │   │       └── Bioreactor
    │   └── Tetrachloroethylene and derivatives
    │       ├── Chloroethene
    │       │   └── Bioreactor
    │       └── Tetrachloroethylene
    │           └── Bioreactor
    ├── Biotransformation
    │   ├── Microbial enhanced oil recovery
    │   ├── Microbial solubilization of coal
    │   └── Mixed alcohol bioreactor
    ├── Built environment
    ├── Food production
    │   ├── Dairy products
    │   ├── Fermented beverages
    │   ├── Fermented seafood
    │   ├── Fermented vegetables
    │   └── Silage fermentation
    ├── Industrial production
    │   └── Engineered product
    ├── Lab enrichment
    │   ├── Defined media
    │   │   ├── Aerobic media
    │   │   ├── Anaerobic media
    │   │   └── Marine media
    │   │       └── Algoconsortia
    │   └── Undefined media
    ├── Lab Synthesis
    │   └── Genetic cross
    └── Modeled
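The tree above follows directly from the colon-separated biome_lineage strings in the records. As a rough sketch of that idea (not how show_tree() is implemented; build_tree and render are hypothetical helpers), lineages can be nested into a dict of dicts and printed with indentation:

```python
def build_tree(lineages):
    """Nest colon-separated lineage strings into a dict of dicts."""
    tree = {}
    for lineage in lineages:
        node = tree
        for name in lineage.split(":"):
            node = node.setdefault(name, {})
    return tree


def render(tree, indent=0):
    """Return one indented line per node, children sorted by name."""
    lines = []
    for name in sorted(tree):
        lines.append("    " * indent + name)
        lines.extend(render(tree[name], indent + 1))
    return lines


# Lineages copied from the records shown in this notebook
lineages = [
    "root",
    "root:Control",
    "root:Engineered:Bioremediation:Hydrocarbon:Benzene",
]
tree = build_tree(lineages)
print("\n".join(render(tree)))
```

Intermediate levels such as Engineered appear in the nested dict even when only a deeper lineage mentions them, because setdefault creates each missing level on the way down.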

Option 2. get a specific page()#

  • Makes the request and returns the items/records in a list, like above.

  • When page() is called on an already completed request, the API call is not repeated; the output is instead a page from the cache.

biomes.page(3)
[{'biome_name': 'Wadden Sea-Germany',
  'biome_lineage': 'root:Engineered:Bioreactor:Continuous culture:Marine sediment inoculum:Wadden Sea-Germany'},
 {'biome_name': 'Bioremediation',
  'biome_lineage': 'root:Engineered:Bioremediation'},
 {'biome_name': 'Hydrocarbon',
  'biome_lineage': 'root:Engineered:Bioremediation:Hydrocarbon'},
 {'biome_name': 'Benzene',
  'biome_lineage': 'root:Engineered:Bioremediation:Hydrocarbon:Benzene'},
 {'biome_name': 'Bioreactor',
  'biome_lineage': 'root:Engineered:Bioremediation:Hydrocarbon:Benzene:Bioreactor'}]
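The "no repeated API call" behaviour described for page() is a form of memoization. The sketch below illustrates the idea with a stand-in class and a fake fetch function that counts its calls; CachedPager and fake_fetch are hypothetical names and this is not mgnipy's actual caching code.

```python
class CachedPager:
    """Hypothetical stand-in illustrating page-level memoization; not mgnipy code."""

    def __init__(self, fetch_page):
        self._fetch_page = fetch_page  # callable: page number -> list of records
        self.results = {}

    def page(self, page_num):
        """Fetch a page once, then serve later calls from the stored results."""
        if page_num not in self.results:
            self.results[page_num] = self._fetch_page(page_num)
        return self.results[page_num]


calls = []


def fake_fetch(page_num):
    """Pretend to hit the API and record which pages were requested."""
    calls.append(page_num)
    return [{"biome_name": f"biome-{page_num}"}]


pager = CachedPager(fake_fetch)
first = pager.page(3)
second = pager.page(3)  # served from pager.results, no second fetch

print(len(calls))  # → 1
```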

Option 3. bulk_fetch() of all requests (with safety limits)#

bulk_fetch() can handle multiple requests by either

  • passing a list of pages: .bulk_fetch(pages=<list_of_pages>)

  • or omitting pages and calling the method repeatedly, which lets bulk fetch handle the batching while respecting limit=<num_items>

Especially before fetching in bulk we should take a look at the total number of requests/pages.

# let's first checkout num requests
print("Number of requests:", biomes.num_requests)
# or better yet do a dry_run
biomes.dry_run()
Number of requests: 99
Planning the API call with params:
{'page_size': 5, 'max_depth': 6}
Total requests to make: 99
Total records to retrieve: 492
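The numbers in the dry run are consistent with simple ceiling division: 492 records at 5 records per page need 99 requests. A small sketch of that arithmetic (num_pages is a hypothetical helper, not an mgnipy function):

```python
import math


def num_pages(total_records: int, page_size: int) -> int:
    """Pages needed to cover total_records; the last page may be partly filled."""
    return math.ceil(total_records / page_size)


print(num_pages(492, 5))  # → 99
print(num_pages(50, 5))   # → 10
```

The second line suggests that a limit of 50 records at this page size would span about 10 pages, assuming limit counts records rather than pages.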

Now we can get some data sync or async:

# synchronously fetch first 50 items/records
biomes.bulk_fetch(
    limit=50,  # number of items/records to fetch
)
<mgnipy.V2.proxies.biomes.Biomes at 0x7f668e329d60>
# and async
await biomes.abulk_fetch(limit=50)
<mgnipy.V2.proxies.biomes.Biomes at 0x7f668e329d60>

⏳ Checking progress#

As we saw earlier in the notebook we can take a look at results as we go along. For a concise update on progress you can use .progress and .last_successful_page

biomes.progress
Retrieved pages: 30%|██████░░░░░░░░░░░░░░| 30/99
biomes.last_successful_page
30
# no cache for this instance, but we can clear it anyway
biomes.clear_cache()