Docs: add obstore tutorial #527
Adds a new tutorial walking through reading Planetary Computer data with obstore (auto-refreshing SAS tokens, range reads, async, library composability). Companion notebook lives in PlanetaryComputerExamples at quickstarts/obstore.ipynb and is wired in via external_docs_config. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the Colab badge (off-brand for PC; Hub is the canonical JupyterLab environment) and replaces the TODO placeholders with real URLs: nbgitpuller deep link to PC Hub and a github.com blob link to the companion notebook. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Inlines the Hub and GitHub URLs on the badge line and drops the reference-style defs at the bottom. Also picks up the inline copy edits across the body. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hub link is the canonical way to open the notebook; the GitHub view duplicates what the docs site already renders. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops Lonboard reference (no obstore integration in Lonboard) and notes that zarr-python access goes through the zarr.storage.ObjectStore adapter rather than direct hand-off. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| @@ -0,0 +1,164 @@ | |||
| # Reading Planetary Computer data with obstore | |||
|
| [obstore](https://developmentseed.org/obstore/) is a Python library for reading and writing cloud object stores (Azure Blob, Amazon S3, Google Cloud Storage) directly through their native APIs. Using obstore, SAS tokens refresh automatically, async I/O is built in, and the same store you build for reading bytes can be handed to higher-level libraries like [async-geotiff](https://github.com/developmentseed/async-geotiff), [Lonboard](https://developmentseed.org/lonboard/), and [zarr-python](https://zarr.dev/) without re-authenticating. | |||
> directly through their native APIs
I think this is a bit misleading. I think most users would understand "native" to mean "the raw underlying API specific to each cloud storage provider". That's not what obstore does; if a user wants to use the Azure API directly, they'll use azure.storage directly.
Obstore presents one, unified, abstracted API that is the same across Azure, S3, and GCS. That's the selling point.
| ```python | ||
| import pystac_client | ||
| from obstore.auth.planetary_computer import PlanetaryComputerCredentialProvider | ||
|
| catalog = pystac_client.Client.open( | ||
| "https://planetarycomputer.microsoft.com/api/stac/v1" | ||
| ) | ||
| item = next(catalog.search(collections=["naip"], max_items=1).items()) | ||
| asset = item.assets["image"] | ||
| ``` | ||
|
| 2. Build a credential provider from the asset. | ||
|
| ```python | ||
| provider = PlanetaryComputerCredentialProvider.from_asset(asset) | ||
| ``` |
This makes me realize that from_asset is a bit annoying if you want to work with a collection instead of an item.
I see that the NAIP Collection JSON defines
"msft:storage_account": "naipeuwest"
so we could potentially have a from_collection constructor too.
Or maybe from_asset should really be renamed to from_stac, and support both Item and Collection? Thoughts?
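To make the from_stac idea concrete, here is a rough sketch of how one entry point might resolve the storage account from either kind of input. Everything in it is hypothetical: `storage_account_from_stac` is not an obstore function, the inputs are plain dicts rather than pystac objects, the example href is made up for illustration, and parsing the account from the blob hostname is an assumption about how asset hrefs are laid out.

```python
from urllib.parse import urlparse


def storage_account_from_stac(obj: dict) -> str:
    """Hypothetical helper: find the Azure storage account for a STAC dict.

    Accepts either a Collection-like dict carrying "msft:storage_account"
    or an Asset-like dict carrying an "href" on *.blob.core.windows.net.
    """
    # Collection-style input: the account is declared explicitly.
    account = obj.get("msft:storage_account")
    if account:
        return account

    # Asset-style input: assume the account is the first hostname label.
    href = obj.get("href")
    if href:
        host = urlparse(href).hostname or ""
        if host.endswith(".blob.core.windows.net"):
            return host.split(".")[0]

    raise ValueError("could not determine a storage account from this object")


# Illustrative inputs only; the href path is invented for the example.
collection = {"msft:storage_account": "naipeuwest"}
asset = {"href": "https://naipeuwest.blob.core.windows.net/naip/example.tif"}

from_collection = storage_account_from_stac(collection)
from_asset = storage_account_from_stac(asset)
```

Both calls resolve to the same account, which is the property a combined constructor would rely on.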
|
| 3. Build a store using that provider. The store is your reusable connection to that asset. |
It's important to note that the store doesn't just connect to one asset; it provides the auth to access anything in that bucket ("container" in Azure terminology). The exception, as mentioned below, is that the prefix on the store is currently mounted to this specific file.
| 2. **Read multiple byte ranges in a single request.** Cuts round-trip latency when you need several non-contiguous slices of the same file (e.g. multiple COG tiles). | ||
|
| ```python | ||
| ranges = obstore.get_ranges( |
| async def fetch(start, end): | ||
| return await obstore.get_range_async(async_store, "", start=start, end=end) | ||
|
| results = await asyncio.gather(*[fetch(i * 4096, (i + 1) * 4096) for i in range(8)]) |
This is a bad example, because it's making several independent requests for different parts of a file.
For this use case we should be pointing users towards store.get_ranges_async, because under the hood that will combine adjacent ranges into a single network request.
For example, the code above makes independent requests for 0-4096, 4096-8192, and so on. get_ranges_async would automatically make a single request under the hood for 0-32768 instead of 8 concurrent requests, and that would be a lot faster.
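The coalescing described here can be illustrated in plain Python. This is a sketch of the general idea only, not obstore's implementation: `coalesce_ranges` and its `max_gap` parameter are hypothetical names, and the real merging rules inside obstore may differ.

```python
def coalesce_ranges(ranges, max_gap=0):
    """Merge (start, end) byte ranges that touch or lie within max_gap bytes.

    Hypothetical helper for illustration; ends are exclusive.
    """
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of
            # scheduling another request.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# The eight 4 KiB ranges from the tutorial example.
requested = [(i * 4096, (i + 1) * 4096) for i in range(8)]
merged = coalesce_ranges(requested)
print(f"{len(requested)} requested ranges -> {len(merged)} request(s): {merged}")
```

Because each range starts where the previous one ends, the eight requested ranges collapse into one span covering bytes 0 to 32768, which the caller would then slice back into the eight pieces.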
| from obstore.store import S3Store | ||
|
| s3_store = S3Store(bucket="my-bucket", region="us-west-2") | ||
| buf = obstore.get(s3_store, "path/to/object").bytes() |
Actually this doesn't work... obstore.get won't work against the obspec protocol... The obspec protocol is defined in terms of the methods on the class. That's part of why I want to nudge people to use store.get instead of obstore.get
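The point about protocols being defined by methods can be shown with a small standard-library sketch. The names here are assumptions for illustration: `SupportsGet` is a hypothetical stand-in and not obspec's actual protocol definition, `InMemoryStore` is a toy class, and `get` returns raw bytes to keep the example short.

```python
from typing import Protocol


class SupportsGet(Protocol):
    """Hypothetical protocol: anything with a matching `get` method."""

    def get(self, path: str) -> bytes: ...


class InMemoryStore:
    """Toy store that satisfies SupportsGet purely by having `get`."""

    def __init__(self, objects: dict[str, bytes]) -> None:
        self._objects = dict(objects)

    def get(self, path: str) -> bytes:
        return self._objects[path]


def read_object(store: SupportsGet, path: str) -> bytes:
    # Calling the method keeps this function generic over any store that
    # implements the protocol. A module-level function tied to specific
    # store classes would not accept an arbitrary protocol implementer.
    return store.get(path)


payload = read_object(InMemoryStore({"path/to/object": b"hello"}), "path/to/object")
```

Code written this way works with any object exposing the method, which is the reason to prefer the method form in examples that claim library composability.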
Co-authored-by: Kyle Barron <kylebarron2@gmail.com>
Tutorial for obstore, based on the outline here: https://docs.google.com/document/d/1LIf6SvMHK3Gr8gSG8eqmjwyROeeoQl3Odg1CVLqp1AI/edit?usp=sharing