A MGnify List of records
Here we demonstrate the basic usage of MGnipy to see what records/items (e.g., biomes) are available in a given resource (e.g., the Biomes endpoint of the MGnify API v2).
🎯 The Goal: Get a list of MGnify Biomes
The GOLD ecosystem classifications organize environmental samples into a hierarchical taxonomy of biome types, from broad categories like "Engineered" to specific environments like "Plant rhizosphere."
This demo will show you how to:
Prepare queries: learn different ways to initialize and configure your API requests using MGnipy or direct proxies
Preview before fetching: use filtering and preview methods (preview, dry_run, explain) to confirm your query before retrieving results
Fetch results: execute requests using iterative get(), specific page(), or bulk_fetch() methods (sync or async)
Monitor progress: track your requests and check completion status
By the end, we hope you'll be comfortable querying MGnify resources, or at least the biomes resource.
# uncomment the line below if running in Colab
#!pip install mgnipy
We can start from either mgnipy.MGnipy or mgnipy.V2.proxies.Biomes.
The start: Preparing queries
Option 1. mgnipy.MGnipy
The MGnipy client offers a unified interface to access various MGnify API endpoints, including biomes. This approach is convenient if you want to manage multiple types of queries or resources through a single client object.
Instantiate MGnipy to configure your API access and manage requests.
Use .biomes to create a biome query with your desired parameters.
Use list_supported_params() to see all available filters and options.
The filter() method allows you to refine your query further.
The explain() method previews the constructed API URLs and the first few results.
This approach also provides a helper, describe_resources(), that lists and describes the available resources.
💡 Tip: See the Configuration page for more setup details.
from mgnipy import MGnipy

# init
mg = MGnipy(
    # configuration
    cache_dir=None,  # set to None to disable caching, or specify a directory for caching
)
# access proxy
biomes = mg.biomes
# checking it out
print(biomes)
MGnifier instance for resource: biomes
I.e., mgnipy.V2.proxies.biomes.Biomes
----------------------------------------
Base URL: https://www.ebi.ac.uk/
Parameters: {}
Example request URL: https://www.ebi.ac.uk/metagenomics/api/v2/biomes?page=1
Endpoint module: mgnipy.emgapi_v2_client.api.miscellaneous.list_mgnify_biomes
Is list endpoint (returns paginated results): True
Cache directory: None
The printout shows that no query parameters have been set yet.
If you would like to know which parameters the endpoint supports, there is a helper method you can use: .list_supported_params()
# if not sure which kwargs are supported
print("Supported kwargs for biomes: ", biomes.list_supported_params())
Supported kwargs for biomes: ['biome_lineage', 'max_depth', 'page', 'page_size']
Similar to describe_resources(), there is also a describe_endpoint() method that gives more detail about the endpoint, based on the openapi.json spec:
biomes.describe_endpoint()
List all biomes
List all biomes in the MGnify database.
Supported parameters:
- biome_lineage: None | str | Unset The lineage to match, including all descendant biomes
- max_depth: int | None | Unset Maximum depth of the biome lineage to include, e.g. `root` is 1 and `root:Host-Associated:Human` is level 3
- page: int | Unset Default: 1.
- page_size: int | None | Unset
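To make the max_depth description above concrete, here is a minimal sketch in plain Python (it does not use mgnipy) that computes the depth of a lineage by counting its colon-separated levels. The helper names lineage_depth and within_depth are hypothetical and exist only for illustration; the actual filtering is done by the API.

```python
def lineage_depth(lineage: str) -> int:
    """Number of colon-separated levels in a biome lineage (root counts as 1)."""
    return len(lineage.split(":"))


def within_depth(lineages, max_depth):
    """Keep only lineages whose depth does not exceed max_depth (hypothetical helper)."""
    return [lin for lin in lineages if lineage_depth(lin) <= max_depth]


examples = [
    "root",
    "root:Host-Associated:Human",
    "root:Engineered:Bioremediation:Terephthalate:Wastewater:Activated sludge",
]
depths = {lin: lineage_depth(lin) for lin in examples}
print(depths)
print(within_depth(examples, max_depth=3))
```

With max_depth=3 only the first two example lineages are kept, since the third one has six levels.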
Let's add some search parameters via .filter():
biomes = biomes.filter(
    page_size=5,
    max_depth=6,
)
# check it out again
print(biomes)
MGnifier instance for resource: biomes
I.e., mgnipy.V2.proxies.biomes.Biomes
----------------------------------------
Base URL: https://www.ebi.ac.uk/
Parameters: {'page_size': 5, 'max_depth': 6}
Example request URL: https://www.ebi.ac.uk/metagenomics/api/v2/biomes?max_depth=6&page=1&page_size=5
Endpoint module: mgnipy.emgapi_v2_client.api.miscellaneous.list_mgnify_biomes
Is list endpoint (returns paginated results): True
Cache directory: None
Great, we can see that the query string (i.e., the part after the ?) has been updated with our parameters.
Option 2. Proxies such as mgnipy.V2.proxies.Biomes
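As a rough illustration of how parameters end up in the query string, the sketch below rebuilds the example request URL with the standard library only. It is not mgnipy's implementation; build_url is a hypothetical helper, and the sorting of keys is an assumption made so that the result matches the order seen in the printout.

```python
from urllib.parse import urlencode

# Endpoint taken from the "Example request URL" printed above.
BASE = "https://www.ebi.ac.uk/metagenomics/api/v2/biomes"


def build_url(params: dict, page: int = 1) -> str:
    """Return the request URL for one page (illustrative, not mgnipy's code)."""
    query = {**params, "page": page}
    # Sort keys so the output matches the alphabetical order seen in the printout.
    return f"{BASE}?{urlencode(sorted(query.items()))}"


url = build_url({"page_size": 5, "max_depth": 6})
print(url)
```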
Alternatively, you can instantiate and configure one resource proxy at a time via the available mgnipy.V2.proxies. It all works the same, since mgnipy.V2.proxies.Biomes is what is returned via:
# init client
mg = MGnipy()
# get biomes proxy
biomes_proxy = mg.biomes
from mgnipy.V2.proxies import Biomes

biomes = Biomes(
    config={
        "cache_dir": None
    },  # set to None to disable caching, or specify a directory for caching
    page_size=5,
)
# and can filter as well
biomes = biomes.filter(
    max_depth=6,
)
print(biomes)
MGnifier instance for resource: biomes
I.e., mgnipy.V2.proxies.biomes.Biomes
----------------------------------------
Base URL: https://www.ebi.ac.uk/
Parameters: {'page_size': 5, 'max_depth': 6}
Example request URL: https://www.ebi.ac.uk/metagenomics/api/v2/biomes?max_depth=6&page=1&page_size=5
Endpoint module: mgnipy.emgapi_v2_client.api.miscellaneous.list_mgnify_biomes
Is list endpoint (returns paginated results): True
Cache directory: None
Previewing your requests
Before .get()ting all the result pages, there is an optional but recommended step:
.preview() the first page of results as a pandas.DataFrame, or
.dry_run() to print the number of pages and records to request, or
.explain() to print the planned request URLs.
# checking out first 5 request urls to be made
biomes.explain(head=5)
# or
# biomes.dry_run()
# or
biomes.preview()
https://www.ebi.ac.uk/metagenomics/api/v2/biomes?max_depth=6&page=1&page_size=5
https://www.ebi.ac.uk/metagenomics/api/v2/biomes?max_depth=6&page=2&page_size=5
https://www.ebi.ac.uk/metagenomics/api/v2/biomes?max_depth=6&page=3&page_size=5
https://www.ebi.ac.uk/metagenomics/api/v2/biomes?max_depth=6&page=4&page_size=5
https://www.ebi.ac.uk/metagenomics/api/v2/biomes?max_depth=6&page=5&page_size=5
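The idea of planning requests before sending them can be sketched without any network access. The generator below is an illustration only (planned_urls is a hypothetical name, not part of mgnipy); it lazily produces one URL per page and the first five are taken with itertools.islice. The page count of 99 is the value that dry_run() reports later in this notebook.

```python
from itertools import islice
from urllib.parse import urlencode

BASE = "https://www.ebi.ac.uk/metagenomics/api/v2/biomes"  # from the output above


def planned_urls(params: dict, total_pages: int):
    """Lazily yield one URL per page without sending any request."""
    for page in range(1, total_pages + 1):
        query = sorted({**params, "page": page}.items())
        yield f"{BASE}?{urlencode(query)}"


# Take only the first 5 planned URLs, similar in spirit to head=5.
plan = planned_urls({"page_size": 5, "max_depth": 6}, total_pages=99)
head = list(islice(plan, 5))
for url in head:
    print(url)
```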
Carry out requests to list endpoints
If happy with the plan, proceed with the async or sync get requests.
There are multiple options:
.get() or .aget(), like next(), iteratively carries out one page/request at a time per call, returning the page dict, or None when iteration is complete.
.page() or .apage() fetches a specific page_num.
.bulk_fetch() or .abulk_fetch() fetches the pages in bulk, synchronously or asynchronously.
Option 1. .get() iteratively
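The "one page per call, None when done" pattern described for .get() can be mimicked with a small in-memory class. This is a toy stand-in under stated assumptions: ToyPager and its attributes are hypothetical and say nothing about how mgnipy is implemented internally.

```python
class ToyPager:
    """In-memory stand-in for a paginated resource (hypothetical, not mgnipy)."""

    def __init__(self, records, page_size):
        self.records = list(records)
        self.page_size = page_size
        self.results = {}  # page number -> list of records fetched so far
        self._next_page = 1

    def get(self):
        """Fetch the next page, or return None once all pages are consumed."""
        start = (self._next_page - 1) * self.page_size
        page = self.records[start:start + self.page_size]
        if not page:
            return None
        self.results[self._next_page] = page
        self._next_page += 1
        return page


pager = ToyPager(records=range(12), page_size=5)
pages = []
# Keep calling get() until it signals completion with None.
while (page := pager.get()) is not None:
    pages.append(page)
print(pages)
```

Twelve records with a page size of 5 give two full pages and one partial page; any further call to get() returns None.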
For a demo of this we will make the first 5 requests.
NOTE: There are protective limits (for the API and for user memory) on the number of requests that can be made in one bulk fetch call or iteration. However, the requests can be continued using .continue_interator() or .resume(); see the caching notebook for more details.
# getting first 5
for _ in range(5):
    biomes.get()
Each option also has an async counterpart:
for _ in range(5):
    await biomes.aget()
You can take a look at the results as you go:
# by page, e.g. page 5
biomes.results[5]
[{'biome_name': 'Activated sludge',
'biome_lineage': 'root:Engineered:Bioremediation:Terephthalate:Wastewater:Activated sludge'},
{'biome_name': 'Bioreactor',
'biome_lineage': 'root:Engineered:Bioremediation:Terephthalate:Wastewater:Bioreactor'},
{'biome_name': 'Tetrachloroethylene and derivatives',
'biome_lineage': 'root:Engineered:Bioremediation:Tetrachloroethylene and derivatives'},
{'biome_name': 'Chloroethene',
'biome_lineage': 'root:Engineered:Bioremediation:Tetrachloroethylene and derivatives:Chloroethene'},
{'biome_name': 'Bioreactor',
'biome_lineage': 'root:Engineered:Bioremediation:Tetrachloroethylene and derivatives:Chloroethene:Bioreactor'}]
# or by records, first 2 records
biomes.to_list()[:2]
# or via .records iterator
# list(biomes.records)[:2]
[{'biome_name': 'root', 'biome_lineage': 'root'},
{'biome_name': 'Control', 'biome_lineage': 'root:Control'}]
Specific to biomes, the results can also be visualized as a tree ("print", "hshow" or "vshow"):
biomes.show_tree()
root
├── Control
└── Engineered
    ├── Biogas plant
    │   └── Wet fermentation
    ├── Bioreactor
    │   └── Continuous culture
    │       ├── Marine intertidal flat sediment inoculum
    │       │   └── Wadden Sea-Germany
    │       └── Marine sediment inoculum
    │           └── Wadden Sea-Germany
    ├── Bioremediation
    │   ├── Hydrocarbon
    │   │   └── Benzene
    │   │       └── Bioreactor
    │   ├── Metal
    │   ├── Persistent organic pollutants (POP)
    │   ├── Polycyclic aromatic hydrocarbons
    │   ├── Terephthalate
    │   │   └── Wastewater
    │   │       ├── Activated sludge
    │   │       └── Bioreactor
    │   └── Tetrachloroethylene and derivatives
    │       ├── Chloroethene
    │       │   └── Bioreactor
    │       └── Tetrachloroethylene
    │           └── Bioreactor
    ├── Biotransformation
    │   ├── Microbial enhanced oil recovery
    │   ├── Microbial solubilization of coal
    │   └── Mixed alcohol bioreactor
    ├── Built environment
    ├── Food production
    │   ├── Dairy products
    │   ├── Fermented beverages
    │   ├── Fermented seafood
    │   ├── Fermented vegetables
    │   └── Silage fermentation
    ├── Industrial production
    │   └── Engineered product
    ├── Lab enrichment
    │   ├── Defined media
    │   │   ├── Aerobic media
    │   │   ├── Anaerobic media
    │   │   └── Marine media
    │   │       └── Algoconsortia
    │   └── Undefined media
    ├── Lab Synthesis
    │   └── Genetic cross
    └── Modeled
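For readers curious how a tree view can be derived from flat biome_lineage strings, here is a minimal sketch using nested dictionaries. It is an illustration under stated assumptions, not the code behind show_tree(); build_tree and render are hypothetical helpers, and the lineage strings are a small subset spelled out from the tree above.

```python
def build_tree(lineages):
    """Nest colon-separated lineages into a dict of dicts."""
    tree = {}
    for lineage in lineages:
        node = tree
        for part in lineage.split(":"):
            node = node.setdefault(part, {})
    return tree


def render(tree, prefix=""):
    """Return the tree as a list of text lines with box-drawing connectors."""
    lines = []
    items = list(tree.items())
    for i, (name, children) in enumerate(items):
        last = i == len(items) - 1
        lines.append(f"{prefix}{'└── ' if last else '├── '}{name}")
        lines.extend(render(children, prefix + ("    " if last else "│   ")))
    return lines


lineages = [
    "root",
    "root:Control",
    "root:Engineered",
    "root:Engineered:Biogas plant",
    "root:Engineered:Biogas plant:Wet fermentation",
    "root:Engineered:Bioreactor",
]
tree = build_tree(lineages)
# Print the root on its own line, then its descendants.
lines = ["root"] + render(tree["root"])
print("\n".join(lines))
```

Because each lineage contains the full path from root, the records can arrive in any order and still end up under the right parent.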
Option 2. Get a specific page()
This makes the request and returns the items/records in a list, like above.
When page() is called on an already completed request, the API call is not repeated; the page is returned from the cache instead.
biomes.page(3)
[{'biome_name': 'Wadden Sea-Germany',
'biome_lineage': 'root:Engineered:Bioreactor:Continuous culture:Marine sediment inoculum:Wadden Sea-Germany'},
{'biome_name': 'Bioremediation',
'biome_lineage': 'root:Engineered:Bioremediation'},
{'biome_name': 'Hydrocarbon',
'biome_lineage': 'root:Engineered:Bioremediation:Hydrocarbon'},
{'biome_name': 'Benzene',
'biome_lineage': 'root:Engineered:Bioremediation:Hydrocarbon:Benzene'},
{'biome_name': 'Bioreactor',
'biome_lineage': 'root:Engineered:Bioremediation:Hydrocarbon:Benzene:Bioreactor'}]
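The "do not repeat a completed request" behaviour mentioned for page() can be pictured as a simple lookup before fetching. The sketch below is a toy model with hypothetical names (CachedPages, fake_fetch); how mgnipy stores and reuses pages internally is not shown in this notebook.

```python
class CachedPages:
    """Toy page cache: each page is requested at most once (illustrative only)."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable taking a page number
        self._cache = {}     # page number -> records
        self.calls = 0       # how many real fetches were made

    def page(self, page_num):
        if page_num not in self._cache:
            self.calls += 1
            self._cache[page_num] = self._fetch(page_num)
        return self._cache[page_num]


def fake_fetch(page_num):
    """Stand-in for an HTTP request; returns two dummy records per page."""
    return [{"page": page_num, "item": i} for i in range(2)]


pages = CachedPages(fake_fetch)
first = pages.page(3)
second = pages.page(3)  # served from the cache, no second fetch
print(pages.calls)
```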
Option 3. bulk_fetch() of all requests (with safety limits)
bulk_fetch() can handle multiple requests in one of two ways:
by specifying a list of pages, .bulk_fetch(pages=<list_of_pages>), or
by not specifying pages and calling the method repeatedly, which lets the bulk fetch handle the batching while respecting limit=<num_items>.
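One plausible way to think about batching with a record limit is sketched below. This is an assumption about the general idea, not a description of mgnipy's batching logic; next_batch is a hypothetical helper, and total_pages=99 is the page count that dry_run() reports further down for this query.

```python
import math


def next_batch(last_page: int, page_size: int, limit: int, total_pages: int):
    """Pages to request next: enough to cover `limit` records, rounded up to whole pages."""
    n_pages = math.ceil(limit / page_size)
    first = last_page + 1
    last = min(last_page + n_pages, total_pages)
    return list(range(first, last + 1))


# Two consecutive batches of 50 records, then a batch near the end of the listing.
batch1 = next_batch(last_page=0, page_size=5, limit=50, total_pages=99)
batch2 = next_batch(last_page=batch1[-1], page_size=5, limit=50, total_pages=99)
tail = next_batch(last_page=95, page_size=5, limit=50, total_pages=99)
print(batch1, batch2, tail)
```

Each call picks up after the last page already fetched, and the final batch is clipped so it never runs past the last available page.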
Especially before fetching in bulk we should take a look at the total number of requests/pages.
# let's first checkout num requests
print("Number of requests:", biomes.num_requests)
# or better yet do a dry_run
biomes.dry_run()
Number of requests: 99
Planning the API call with params:
{'page_size': 5, 'max_depth': 6}
Total requests to make: 99
Total records to retrieve: 492
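The two totals in the dry run are related by simple arithmetic, which the following lines check using the numbers printed above.

```python
import math

total_records = 492  # from the dry_run() output above
page_size = 5

total_requests = math.ceil(total_records / page_size)
# Records on the final, possibly partial, page.
last_page_records = total_records - (total_requests - 1) * page_size
print(total_requests, last_page_records)
```

So 98 full pages of 5 records plus one final page of 2 records account for all 492 records in 99 requests.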
Now we can get some data sync or async:
# synchronously fetch first 50 items/records
biomes.bulk_fetch(
    limit=50,  # number of items/records to fetch
)
<mgnipy.V2.proxies.biomes.Biomes at 0x7f668e329d60>
# and async
await biomes.abulk_fetch(limit=50)
<mgnipy.V2.proxies.biomes.Biomes at 0x7f668e329d60>
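To give a feel for what fetching pages asynchronously can look like, here is a self-contained asyncio sketch. It is not mgnipy's implementation: fetch_page only sleeps briefly instead of calling the API, and the function names and the max_concurrency cap are assumptions made for the illustration.

```python
import asyncio


async def fetch_page(page_num: int, sem: asyncio.Semaphore):
    """Stand-in for one HTTP request (hypothetical; sleeps instead of calling the API)."""
    async with sem:
        await asyncio.sleep(0.01)
        return page_num, [f"record-{page_num}-{i}" for i in range(5)]


async def fetch_pages(pages, max_concurrency=3):
    """Fetch several pages concurrently, with a cap on simultaneous requests."""
    sem = asyncio.Semaphore(max_concurrency)
    results = await asyncio.gather(*(fetch_page(p, sem) for p in pages))
    return dict(results)


# 10 pages of 5 records each, i.e. 50 records in total.
results = asyncio.run(fetch_pages(range(1, 11)))
print(len(results), sum(len(v) for v in results.values()))
```

Inside a notebook, where an event loop is already running, you would write `results = await fetch_pages(range(1, 11))` instead of using asyncio.run().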
⏳ Checking progress
As we saw earlier in the notebook, we can take a look at the results as we go along. For a concise update on progress, you can use .progress and .last_successful_page:
biomes.progress
Retrieved pages: 30%|██████░░░░░░░░░░░░░░| 30/99
biomes.last_successful_page
30
# no cache for this instance, but we can clear it anyway
biomes.clear_cache()