1 change: 1 addition & 0 deletions docs/index.md
@@ -16,6 +16,7 @@ Explorer <overview/explorer>
Use VS Code <overview/ui-vscode>
Use GitHub Codespaces <overview/ui-codespaces>
Using QGIS <overview/qgis-plugin>
Reading data with obstore <overview/obstore>
Changelog <overview/changelog>
```

164 changes: 164 additions & 0 deletions docs/overview/obstore.md
@@ -0,0 +1,164 @@
# Reading Planetary Computer data with obstore

[obstore](https://developmentseed.org/obstore/) is a Python library for reading and writing cloud object stores (Azure Blob, Amazon S3, Google Cloud Storage) directly through their native APIs. With obstore, Planetary Computer SAS tokens refresh automatically, async I/O is built in, and the same store you build for reading bytes can be handed to higher-level libraries like [async-geotiff](https://github.com/developmentseed/async-geotiff), [Lonboard](https://developmentseed.org/lonboard/), and [zarr-python](https://zarr.dev/) without re-authenticating.

A companion notebook walks through every step end-to-end with live timings. [Open in Planetary Computer Hub](https://pccompute.westeurope.cloudapp.azure.com/compute/hub/user-redirect/git-pull?repo=https://github.com/microsoft/PlanetaryComputerExamples&urlpath=lab/tree/PlanetaryComputerExamples/quickstarts/obstore.ipynb&branch=main)

## Install obstore

obstore works in any Python project. To get started, install obstore alongside `pystac-client` (for searching the Planetary Computer's STAC API) and the HTTP libraries that power its [credential providers](https://developmentseed.org/obstore/latest/authentication/#credential-providers):

```bash
uv add obstore pystac-client requests aiohttp aiohttp_retry
```

`requests` powers the sync credential provider; `aiohttp` and `aiohttp_retry` power the async one. Install both unless you know you only need one path.

## Connect to a Planetary Computer asset

The most common starting point is a STAC asset returned from a search. obstore's `PlanetaryComputerCredentialProvider` reads the asset's blob URL and handles SAS token acquisition and refresh for you.

1. Open the Planetary Computer STAC catalog and pick a scene to work with.

```python
import pystac_client
from obstore.auth.planetary_computer import PlanetaryComputerCredentialProvider

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1"
)
item = next(catalog.search(collections=["naip"], max_items=1).items())
asset = item.assets["image"]
```

2. Build a credential provider from the asset.

```python
provider = PlanetaryComputerCredentialProvider.from_asset(asset)
```
Comment on lines +23 to +38

This makes me realize that `from_asset` is a bit annoying if you want to work with a collection instead of an item.

I see that the NAIP Collection JSON defines

"msft:storage_account": "naipeuwest"

so we could potentially have a `from_collection` constructor too.

Or maybe `from_asset` should really be renamed to `from_stac`, and support both Item and Collection? Thoughts?


3. Build a store using that provider. The store is your reusable connection to that asset.
Comment on line +40

It's important to note that the store doesn't just connect to one asset; it provides the auth to access anything in that bucket ("container" in Azure terminology). The exception, as mentioned below, is that the prefix on the store is currently mounted to this specific file.


```python
from obstore.store import AzureStore

store = AzureStore(credential_provider=provider)
```
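
The refresh behaviour described in this section can be pictured with a small, library-independent sketch. It is not obstore's implementation: `CachedToken`, `RefreshingProvider`, and `fake_fetch` are hypothetical names, and the 5-minute safety margin is an assumed value. It only illustrates the general idea of caching a token with its expiry time and fetching a new one shortly before the old one lapses.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, List, Optional


@dataclass
class CachedToken:
    """Hypothetical token record: the token string plus its expiry time."""
    value: str
    expires_at: datetime


class RefreshingProvider:
    """Hypothetical provider that re-fetches a token shortly before it expires."""

    def __init__(
        self,
        fetch_token: Callable[[], CachedToken],
        margin: timedelta = timedelta(minutes=5),  # assumed safety margin
    ) -> None:
        self._fetch_token = fetch_token
        self._margin = margin
        self._cached: Optional[CachedToken] = None

    def __call__(self, now: Optional[datetime] = None) -> str:
        now = now or datetime.now(timezone.utc)
        # Fetch on first use, or when the cached token is inside the margin.
        if self._cached is None or now >= self._cached.expires_at - self._margin:
            self._cached = self._fetch_token()
        return self._cached.value


# Stand-in for a real token endpoint; it records how often it is called.
calls: List[int] = []


def fake_fetch() -> CachedToken:
    calls.append(1)
    return CachedToken(
        value=f"token-{len(calls)}",
        expires_at=datetime(2024, 1, 1, 1, 0, tzinfo=timezone.utc),
    )


provider_sketch = RefreshingProvider(fake_fetch)
t0 = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
first = provider_sketch(now=t0)                           # no cache yet: fetches
second = provider_sketch(now=t0 + timedelta(minutes=30))  # still valid: reuses cache
third = provider_sketch(now=t0 + timedelta(minutes=58))   # inside margin: fetches again
```

The caller never handles expiry itself; it just calls the provider whenever it needs a token, which is the property the store relies on.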

## Read bytes from the store

Once you have a working store, obstore exposes three read operations that map directly to native Azure Blob API calls.

1. **Read a byte range.** Useful when you only need part of the file, for example the first ~16 KB of a Cloud Optimized GeoTIFF.

```python
import obstore

header = obstore.get_range(store, "", start=0, end=16384)
```

Comment on line +57

I'd prefer pointing users to the method API rather than the functional API, i.e. `store.get_range(..., start=0, end=16384)`.

Also, it is confusing for end users to pass in `""` here. This works because the "prefix" on the store is already pointing to the specific file. It might be better to have the prefix pointing at the top-level bucket directory? Or we have two workflows: "opening the bucket at the root" and "opening with an item prefix" (which uses `from_asset`).

2. **Read multiple byte ranges in a single request.** Cuts round-trip latency when you need several non-contiguous slices of the same file (e.g. multiple COG tiles).

```python
ranges = obstore.get_ranges(
    store, "", starts=[0, 65536], ends=[16384, 81920]
)
```

Comment on line +63

Ditto: use `store.get_ranges`.

3. **Read the entire file.** Avoid this for large rasters; range reads (above) and async reads (below) exist so that you rarely need it.

```python
buf = obstore.get(store, "").bytes()
```
Comment on lines +68 to +72

  1. Ditto on using `store.get` over `obstore.get`.

  2. It's useful to note that the result of `get` is an iterator, so you don't have to collect it into a single buffer if your use case supports iterating over a file's contents.
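
The second point in the comment above can be illustrated without any cloud access. This is a minimal sketch using only the standard library: `fake_chunks` is a hypothetical stand-in for a streamed response, not an obstore call, and the 64 KiB chunk size is an arbitrary choice. It shows how a consumer can process a file chunk by chunk instead of collecting it into one buffer first.

```python
import hashlib
from typing import Iterable, Iterator


def fake_chunks(data: bytes, chunk_size: int) -> Iterator[bytes]:
    """Stand-in for a streamed response: yields the payload piece by piece."""
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]


def sha256_of_stream(chunks: Iterable[bytes]) -> str:
    """Hash a stream chunk by chunk without holding the full payload at once."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


stream_payload = b"planetary-computer" * 10_000
streamed_digest = sha256_of_stream(fake_chunks(stream_payload, chunk_size=64 * 1024))
```

The streamed digest matches the digest of the whole payload, so nothing is lost by iterating; only peak memory use changes.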


## Run reads in parallel

For multi-file workloads like building a mosaic or fetching all bands across all scenes in an AOI, making concurrent requests is faster. obstore exposes async equivalents of every read function (`get_async`, `get_range_async`, etc.) that you can compose with `asyncio.gather`.

Async needs its own credential provider class, `PlanetaryComputerAsyncCredentialProvider`, backed by `aiohttp` instead of `requests`. It has the same `from_asset()` signature.

```python
import asyncio
from obstore.auth.planetary_computer import PlanetaryComputerAsyncCredentialProvider

async_provider = PlanetaryComputerAsyncCredentialProvider.from_asset(asset)
async_store = AzureStore(credential_provider=async_provider)

async def fetch(start, end):
    return await obstore.get_range_async(async_store, "", start=start, end=end)

# Top-level await works in a notebook; in a script, wrap this in asyncio.run().
results = await asyncio.gather(*[fetch(i * 4096, (i + 1) * 4096) for i in range(8)])
```

Comment on lines +87 to +90

This is a bad example, because it's making several independent requests for different parts of a file.

For this use case we should be pointing users towards `store.get_ranges_async`, because under the hood that will combine adjacent ranges into a single network request.

So for example, this example makes independent requests for 0-4096, 4096-8192, etc. But `get_ranges_async` would automatically make just a single request under the hood for 0-32768, instead of 8 concurrent requests, and that would be a lot faster.

This is typically 3–5× faster in practice.
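
The review comment on this example mentions that adjacent ranges can be combined into a single request. A minimal sketch of that idea follows, in plain Python. It is an illustration rather than obstore's actual logic: `coalesce_ranges` and its `max_gap` parameter are hypothetical, and the thresholds a real client uses are not stated in this document.

```python
from typing import List, Tuple


def coalesce_ranges(
    ranges: List[Tuple[int, int]], max_gap: int = 0
) -> List[Tuple[int, int]]:
    """Merge half-open (start, end) ranges that overlap, touch, or lie within max_gap bytes."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of adding a new one.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# The eight 4 KiB reads from the example above touch end to end.
adjacent = [(i * 4096, (i + 1) * 4096) for i in range(8)]
print(coalesce_ranges(adjacent))                      # [(0, 32768)]

# Ranges that are far apart stay separate.
print(coalesce_ranges([(0, 16384), (65536, 81920)]))  # [(0, 16384), (65536, 81920)]
```

With merging like this, eight touching reads collapse into one request, while widely separated reads are still issued individually.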

## List objects across a container

To enumerate objects under a prefix ("show me every NAIP scene in Montana in 2023"), build a fresh provider against the container URL instead.

```python
container_provider = PlanetaryComputerCredentialProvider(
    "https://naipeuwest.blob.core.windows.net/naip/"
)
container_store = AzureStore(
    account_name="naipeuwest",
    container_name="naip",
    credential_provider=container_provider,
)

for batch in obstore.list(container_store, prefix="v002/mt/2023/"):
    for entry in batch:
        print(entry["path"], entry["size"])
```
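
Because listing yields batches of entries rather than one flat list, downstream code usually has a nested loop. As a rough sketch of consuming that shape, the snippet below totals object sizes per path prefix. It runs on made-up data: `fake_batches` and its paths are invented for illustration, `total_size_by_prefix` is a hypothetical helper, and only the `"path"` and `"size"` keys are taken from the example above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def total_size_by_prefix(batches: Iterable[List[dict]], depth: int = 3) -> Dict[str, int]:
    """Sum entry sizes, grouped by the first `depth` path segments."""
    totals: Dict[str, int] = defaultdict(int)
    for batch in batches:
        for entry in batch:
            prefix = "/".join(entry["path"].split("/")[:depth])
            totals[prefix] += entry["size"]
    return dict(totals)


# Invented listing output: two batches, as a paginated listing might return.
fake_batches = [
    [
        {"path": "v002/mt/2023/a.tif", "size": 100},
        {"path": "v002/mt/2023/b.tif", "size": 250},
    ],
    [{"path": "v002/mt/2021/c.tif", "size": 40}],
]
summary = total_size_by_prefix(fake_batches)
print(summary)  # {'v002/mt/2023': 350, 'v002/mt/2021': 40}
```

Processing batch by batch keeps memory use flat even when a prefix contains many objects.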

## Hand the store to other libraries

Any library that accepts an [obspec](https://github.com/developmentseed/obspec)-compatible store reads through your authenticated connection without re-doing auth. Open the same NAIP scene as a Cloud Optimized GeoTIFF using [async-geotiff](https://github.com/developmentseed/async-geotiff):

```python
from async_geotiff import GeoTIFF

geotiff = await GeoTIFF.open("", store=async_store)
print(geotiff.transform, geotiff.crs.name)
```

[zarr-python](https://zarr.dev/) works through a thin adapter (`zarr.storage.ObjectStore` wraps your obstore store). See the [obstore Zarr example](https://developmentseed.org/obstore/latest/examples/zarr/) for a Planetary Computer Daymet walkthrough.

## Migrate from `planetary_computer.sign()` + fsspec

If you're updating an existing project, here's the side-by-side. The old pattern:

```python
import planetary_computer
import fsspec

signed = planetary_computer.sign(asset.href)
with fsspec.open(signed) as f:
    data = f.read()
```

The obstore equivalent:

```python
from obstore.auth.planetary_computer import PlanetaryComputerCredentialProvider
from obstore.store import AzureStore
import obstore

provider = PlanetaryComputerCredentialProvider.from_asset(asset)
store = AzureStore(credential_provider=provider)
data = obstore.get(store, "").bytes()
```

obstore handles re-signing on expiry, talks to Azure's native blob API instead of routing through HTTP via fsspec, and exposes async I/O for parallel reads — all without changing your auth code per request.

## Use the same code against other clouds

obstore implements the [obspec](https://github.com/developmentseed/obspec) protocol, so the same read and write calls work against S3 or GCS. Any library built on obspec inherits this portability automatically.

```python
from obstore.store import S3Store

s3_store = S3Store(bucket="my-bucket", region="us-west-2")
buf = obstore.get(s3_store, "path/to/object").bytes()
```

Comment on line +162

Actually this doesn't work: `obstore.get` won't work against the obspec protocol. The obspec protocol is defined in terms of the methods on the class. That's part of why I want to nudge people to use `store.get` instead of `obstore.get`.
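
The distinction raised in the review comment on this block, a protocol defined by methods on the class, can be shown with Python's structural typing. This is a simplified sketch: `SupportsGet`, `DictStore`, and `read_object` are hypothetical names, and obspec's real protocols have their own names and signatures that are not reproduced here.

```python
from typing import Dict, Protocol


class SupportsGet(Protocol):
    """Hypothetical, simplified protocol: anything with a get(path) method."""

    def get(self, path: str) -> bytes: ...


class DictStore:
    """Toy in-memory store; it neither imports nor subclasses SupportsGet."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self._objects = objects

    def get(self, path: str) -> bytes:
        return self._objects[path]


def read_object(store: SupportsGet, path: str) -> bytes:
    # Calling the method works for every class that has a matching get().
    return store.get(path)


toy_store = DictStore({"path/to/object": b"hello"})
result = read_object(toy_store, "path/to/object")
```

Code written against the method (`store.get(...)`) accepts any conforming store, whereas a module-level function tied to one library's store classes has no such guarantee.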

1 change: 1 addition & 0 deletions etl/config/external_docs_config.yml
@@ -28,3 +28,4 @@
- file_url: quickstarts/reading-tabular-data.ipynb
- file_url: quickstarts/reading-zarr-data.ipynb
- file_url: quickstarts/storage.ipynb
- file_url: quickstarts/obstore.ipynb