world_companion demo + startGeminiLive captureMode flag (multi-item placeLabel, embedding match) by salmanmkc · Pull Request #268 · google/xrblocks

salmanmkc · 2026-05-10T11:00:59Z

been wanting a voice + vision companion in xr for a while. start it, talk to it, it sees what you see and can drop markers on stuff.

demo's in demos/world_companion. it runs on the existing startGeminiLive in GeminiManager, which now takes a captureMode: 'screenshot' | 'camera' flag: screenshot streams the rendered scene (optionally over the camera image), camera streams raw passthrough frames. default stays camera so existing callers keep their old behaviour; the companion passes camera explicitly since it reasons about the real room. tool calls + transcription come back via the manager's events, so nothing new on the world api.
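a rough sketch of how the flag and its default could be resolved. only captureMode, its two values and overlayOnCamera come from this PR; resolveCapture and the option/result interfaces are hypothetical names, and defaulting the overlay to on in screenshot mode is an assumption.

```typescript
// Hypothetical option shape; only captureMode and overlayOnCamera are named in the PR.
type CaptureMode = 'screenshot' | 'camera';

interface LiveCaptureOptions {
  captureMode?: CaptureMode;
  overlayOnCamera?: boolean; // only meaningful in screenshot mode
}

interface ResolvedCapture {
  captureMode: CaptureMode;
  overlayOnCamera: boolean;
}

// Fill in defaults so callers that pass nothing keep streaming camera frames.
function resolveCapture(opts: LiveCaptureOptions = {}): ResolvedCapture {
  const captureMode = opts.captureMode ?? 'camera';
  return {
    captureMode,
    // Assumption: the overlay defaults to on, and is ignored in camera mode.
    overlayOnCamera: captureMode === 'screenshot' && (opts.overlayOnCamera ?? true),
  };
}
```

so a caller that passes no options resolves to camera mode, and screenshot has to be asked for.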

there's a placeLabel tool with three styles (dot, arrow, pulse) so it can pick how to highlight: arrow if you ask it to find something, pulse for tiny stuff, dot otherwise. it uses world.objects.runDetection so markers stick to the real object via depth, not to wherever your head was when the tool fired. it will do this on the desktop simulator though. a lookCloser tool answers "what am i pointing at" off the reticle. small spatial panel for start/stop/clear so it's actually usable once you're in immersive.

you can ask for several things in one go ("label the couch, tv, and coffee table") and the tool takes an items[] array so they all get placed in a single call, each with its own style. labels billboard back at the camera so they stay readable when you walk around.

detector labels and what you say don't always match: television vs tv, pendant light vs floor lamp, picture vs painting. it matches by meaning through gemini's embedContent api (SEMANTIC_SIMILARITY task type + cosine), reducing to the core noun first so modifiers like the "green" in "green chair" don't tank the match, and dedupes markers across rephrasings. the embed cache is per-page so it's basically free after the first call.
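the cosine step can be sketched like this, assuming the embedding vectors have already been fetched. cosine, bestMatch and the Detected shape are hypothetical names for illustration, not the demo's actual code, and the threshold is passed in rather than fixed.

```typescript
// Cosine similarity between two equal-length embedding vectors.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vector length mismatch');
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

interface Detected {
  label: string;
  vector: number[]; // assumed: the embedding already fetched for `label`
}

// Return the detected label closest to the request, or null if nothing
// reaches the threshold.
function bestMatch(want: number[], detected: Detected[], threshold: number): string | null {
  let best: string | null = null;
  let bestScore = threshold;
  for (const d of detected) {
    const score = cosine(want, d.vector);
    if (score >= bestScore) {
      best = d.label;
      bestScore = score;
    }
  }
  return best;
}
```

returning null below the threshold is what lets the tool report not-found instead of labelling the wrong object.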

Try launching the demo and saying, for example, "place an arrow on my water bottle".

I have some thoughts on tracking and updating the state of objects it has already seen, but that's for a later update; for now this seems ok.

Gemini will be able to talk and see screens in Android XR afaik, but this allows interaction with the real world on top of Gemini Live.

I will see if I can get a demo recorded for this. It's open to lots of feedback though, since this is just a very rough version.

Edit: here's a demo! https://youtu.be/-5s_aV6eV_A

I may have to just add key input in again but will double check later when I'm home.

world.streamScene(prompt, opts) opens a Gemini Live session and runs a periodic camera-frame loop into it, with auto-dispatch of agentic tools and auto-playback of model audio via CoreSound. Returns a {stop, isActive} handle. Throws cleanly when AI / Live capability / device camera are missing instead of failing deep in the SDK. world.lookingAt(controllerId?) is sugar over User.getReticleTarget so demos can stay on the world.* namespace. World now takes registry as a Script dependency so the new primitives can resolve AI / XRDeviceCamera / CoreSound / User without callers wiring it through every method. 11 tests covering missing-AI, non-Live AI, missing-camera, the frame loop, text+audio routing, onAudio override, tool dispatch, unknown tool, and onToolCall intercept.

A small single-file demo that wires xb.core.world.streamScene to a Live session with two demo-local tools: placeLabel drops a marker in front of the camera, and lookCloser reports what the user's reticle is aimed at via xb.core.world.lookingAt. Mirrors the world_ask UI pattern (floating bottom panel, transcript, start/stop) so users have a complete reference for the new primitive without leaving the demos directory.

Switch placeLabel from live reticle sampling to world.objects.runDetection so labels anchor to actual detected objects in world space, not wherever the user was looking when the tool fired. Also render a Troika text label above the marker, not just a bare sphere. Add a SpatialPanel with start/stop/clear controls so the demo is usable in immersive mode, not just from the flat web overlay.

placeLabel now takes a style param so the model can pick how to highlight something: dot for casual noting, arrow for 'point this out for me', pulse for small or hard-to-spot things. Arrow gently bobs, pulse expands and fades on a 1.5s loop.
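A minimal sketch of the two animations as pure functions of time. Only the 1.5s pulse loop comes from the commit message; the function names, the maximum scale, and the bob amplitude and period are assumptions for illustration.

```typescript
const PULSE_PERIOD_S = 1.5; // loop length from the commit message

// Pulse: scale grows while opacity fades over each loop, then restarts.
function pulseState(timeS: number, maxScale = 2): {scale: number; opacity: number} {
  const phase = (timeS % PULSE_PERIOD_S) / PULSE_PERIOD_S; // 0 at loop start, approaching 1
  return {scale: 1 + (maxScale - 1) * phase, opacity: 1 - phase};
}

// Arrow: gentle vertical bob around the anchor point (amplitude/period assumed).
function arrowBobOffset(timeS: number, amplitudeM = 0.02, periodS = 2): number {
  return amplitudeM * Math.sin((2 * Math.PI * timeS) / periodS);
}
```

Both would be evaluated once per frame with the elapsed time and applied to the marker's scale, material opacity, or y offset.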

Default enableDepth() leaves updateFullResolutionGeometry off, so the depth mesh snapshot used by object detection is too sparse to raycast against. Markers were landing near the camera instead of on the actual detected object. Copy the depth flags the gemini_xrobject demo uses.

salmanmkc · 2026-05-10T14:07:09Z

Turns out I hit the rate limit of 20 object detections per day when I checked the logs; for some reason I thought it was broken.

ObjectDetector now switches targetDevice to 'quest' when the Oculus browser is detected, instead of always falling back to galaxyxr params. Adds QuestCameraParams.ts with approximate Quest 3 passthrough intrinsics (fx/fy ~800 at 1280x720, ~77° HFOV from the cropped getUserMedia stream) and an offset for the RGB camera relative to the right XR eye. These are estimates - WebXR doesn't expose the real values - and may need per-device tweaks. Also swaps the detection debug image dump from auto-downloading PNGs (unusable on Quest browser) to a console-log preview that shows the image inline, and adds a few extra logs in world_companion to help see what placeLabel is actually receiving from the detector.
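As a sanity check on those numbers, here is a small pinhole-camera sketch. The fx/fy and resolution values are the estimates quoted above; the principal point at the image centre, the Intrinsics shape and the function names are assumptions, and the ray uses a camera space where x is right, y is up and the camera looks down -z.

```typescript
interface Intrinsics {
  fx: number;
  fy: number;
  cx: number;
  cy: number;
  width: number;
  height: number;
}

// fx/fy and resolution are the estimates from the commit message;
// cx/cy at the image centre is an assumption.
const QUEST3_APPROX: Intrinsics = {fx: 800, fy: 800, cx: 640, cy: 360, width: 1280, height: 720};

// Horizontal field of view implied by the focal length, in degrees.
function horizontalFovDeg(k: Intrinsics): number {
  return (2 * Math.atan(k.width / (2 * k.fx)) * 180) / Math.PI;
}

// Unproject a pixel to a unit ray in camera space (x right, y up, -z forward).
function pixelToRay(k: Intrinsics, u: number, v: number): [number, number, number] {
  const x = (u - k.cx) / k.fx;
  const y = -(v - k.cy) / k.fy; // image v grows downward
  const len = Math.hypot(x, y, 1);
  return [x / len, y / len, -1 / len];
}
```

With fx = 800 at 1280 px wide this gives roughly 77.3°, which agrees with the ~77° HFOV estimate.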

Quest 3 passthrough cameras are physically angled downward; labels were landing too high above table-surface objects. Apply a -0.26 rad pitch in the right-camera pose so unprojected detections line up with what the user actually sees.
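To illustrate what the -0.26 rad pitch does to an unprojected ray, a sketch of a rotation about the camera's x axis. The axis convention (x right, y up, -z forward) and the function name are assumptions; the demo applies the pitch through the camera pose rather than per ray.

```typescript
type Vec3 = [number, number, number];

// Rotate a vector about the +x axis by angleRad (right-handed).
function pitch(v: Vec3, angleRad: number): Vec3 {
  const c = Math.cos(angleRad);
  const s = Math.sin(angleRad);
  const [x, y, z] = v;
  return [x, y * c - z * s, y * s + z * c];
}

// A ray straight ahead, tilted by the pitch from the commit message.
const tilted = pitch([0, 0, -1], -0.26);
```

With this convention the forward ray gains a negative y component (about 15° downward), which pulls labels down toward table-surface objects.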

Floating world labels were getting cut up by the passthrough depth mesh - letters disappearing where the mesh triangles passed in front of them. Disable depthTest/depthWrite on the troika text and bump renderOrder so labels always draw on top.

Gemini sometimes calls placeLabel multiple times for what's clearly the same physical thing (e.g. "laptop" then "macbook"), and unprojection drift puts the two markers a few cm apart - so the user sees the label twice. Match by text first, then fall back to a 2m proximity check, and update the existing marker in place instead of stacking a new one.

When the Gemini Live websocket drops (1011 internal error) and reconnects, it replays its tool-call context, which fires placeLabel again with the same items. Cache the last call key for 2s and short-circuit the duplicate so we don't redo detection or stack new markers on top of the existing ones.
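A minimal sketch of that kind of short-circuit, assuming the call key is simply the JSON-serialised arguments. makeDuplicateFilter is a hypothetical name, and the clock is injectable only so the 2s window can be exercised without waiting.

```typescript
// Returns a function that reports whether a call repeats the previous one
// within `windowMs`. The key is the JSON-serialised arguments (an assumption).
function makeDuplicateFilter(windowMs = 2000, now: () => number = Date.now) {
  let lastKey: string | null = null;
  let lastAt = -Infinity;
  return (args: unknown): boolean => {
    const key = JSON.stringify(args);
    const t = now();
    const duplicate = key === lastKey && t - lastAt < windowMs;
    lastKey = key;
    lastAt = t;
    return duplicate;
  };
}
```

The tool handler would check this first and return the previous result when it reports a duplicate, before running detection.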

Was useful while debugging Quest calibration and dedup behaviour but just noise in the console for everyone else. Error paths keep their console.warn.

uiblocks UIText has no .text setter, so the want/got detection feedback stopped updating in XR after the panel moved off SpatialPanel. Use setText.

Pass the capture config (fps, quality, width, height) and the tools through startGeminiLive instead of only exposing them as fields the caller mutates, and honor fps via the screenshot interval. This covers the rest of what the old StreamSceneOptions configured so callers don't need a separate entrypoint.

Hand the fps / downscale and the tools to startGeminiLive directly rather than setting fields before the call.

dli7319 · 2026-06-18T04:06:08Z

The screenshot sends frames of virtual content (with the camera image optionally underneath). I thought your demo required sending camera images.

dli7319 · 2026-06-18T04:08:04Z

Apologies for not being clear with my comments.

My thought was to merge the capabilities of your streamScene into startGeminiLive so an explicit flag decides whether to stream screenshots or stream camera frames. The rest of GeminiManager already handles the callbacks and the microphone so those would be redundant.

dli7319 · 2026-06-18T04:12:28Z

Your demo video is super cool btw! We're looking forward to merging it!
We can probably delete demos/aisimulator once your demo is merged.

startGeminiLive can now stream either the rendered scene (captureMode 'screenshot', via core.screenshotSynthesizer, optionally composited over the camera) or raw passthrough frames (captureMode 'camera', via xrDeviceCamera.getSnapshot). Defaults to 'screenshot' with the camera overlay on. This folds the camera-streaming capability in as a mode instead of a separate entrypoint.

The companion reasons about the real room, so pass captureMode 'camera' to send raw device-camera frames rather than a render of the virtual content.

initializeAudioContext never resumed the context, so a context created outside a user gesture could stay suspended and play model audio silently. Resume it when suspended before scheduling playback.

The companion was pre-judging from its own camera view and refusing ("I'm not seeing a chair") instead of calling placeLabel. Tell it to call placeLabel for every item the user named and base its reply on the tool's placed / not-found result rather than deciding up front.

The detector returns names like "the chair", which read badly as label text and in the status line. Strip a leading the/a/an from each detected label.

Join the want/got label lists with ", " so the XR status reads cleanly.

Request the gemini-embedding-001 vectors with taskType SEMANTIC_SIMILARITY, which separates related furniture words from unrelated ones far better than the default embeddings (true pairs land ~0.89-0.99, unrelated ones stay below ~0.88). Raise the cosine threshold to 0.88 to match and drop the hand-written synonym list, since the embeddings now cover couch/sofa, tv/television, chair/stool, light/ceiling lamp and the like on their own.

The live onclose only flipped isAIRunning, so a server-side close left the mic, audio nodes and screenshot interval running. Run cleanup() there and dispatch a 'close' event so callers can reset their own UI on a remote close.

Add a starting flag so the XR and DOM buttons can't open two sessions during the opening window, and listen for the manager's close event to re-enable the controls (instead of staying stuck on "listening") when the session ends server-side.

…hreshold The detector emits verbose names like "art piece on the left", whose trailing location phrase weakens the embedding match and clutters the labels; strip it (plus the leading article) down to the core noun. Also lower the cosine threshold to 0.87 so an armchair the detector calls "sofa" still matches a request for a chair (chair/sofa ~0.88) while unrelated pairs stay below.

The model often labels with a modified phrase ("another chair", "green chair"), and embedding the whole phrase pulled it away from the detected word (e.g. "another chair" vs "sofa" dropped to ~0.84, below threshold). Reduce both the request and the detected label to their core noun (dropping articles, determiners and adjectives like colours) before the similarity check, and add those determiners to the stopword list. "another chair" now matches "sofa" the same as "chair" does.
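A rough sketch of the core-noun reduction described here. The stopword and preposition lists are illustrative assumptions, not the demo's actual lists, and coreNoun is a hypothetical name.

```typescript
// Illustrative stopwords: articles, determiners and a few colours (assumed list).
const STOPWORDS = new Set([
  'the', 'a', 'an', 'another', 'this', 'that', 'my', 'other',
  'red', 'green', 'blue', 'black', 'white', 'grey', 'brown',
]);

// Reduce a request or detector label to its core noun phrase.
function coreNoun(phrase: string): string {
  // Drop a trailing location phrase such as "on the left" (preposition list assumed).
  const head = phrase.toLowerCase().split(/\s+(?:on|in|at|near|by|next to)\s+/)[0];
  const words = head.split(/[^a-z0-9]+/).filter((w) => w !== '' && !STOPWORDS.has(w));
  // If everything was a stopword, fall back to the trimmed phrase.
  return words.length > 0 ? words.join(' ') : head.trim();
}
```

Both the request and each detected label would go through the same reduction before being embedded, so "another chair" and "the chair" are compared as "chair".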

…by position Two fixes so "label another chair" marks a second chair instead of re-hitting the first: findMatch now prefers detections that aren't already labelled (falling back to any match when the user re-references the only one), and placeMarker dedupes by world position instead of by text. Same physical spot still updates in place, but two different objects of the same kind keep their own labels.
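The position-based dedupe could look roughly like this. The Marker shape, the helper names and the radius parameter are assumptions for illustration; the real placeMarker works on scene objects rather than plain records.

```typescript
type Vec3 = [number, number, number];

interface Marker {
  text: string;
  position: Vec3;
}

function distance(a: Vec3, b: Vec3): number {
  return Math.hypot(a[0] - b[0], a[1] - b[1], a[2] - b[2]);
}

// Update the marker already at this spot, or add a new one.
// `sameSpotRadiusM` is a tunable assumption, not a value from the PR.
function placeMarker(markers: Marker[], text: string, position: Vec3, sameSpotRadiusM: number): Marker {
  const existing = markers.find((m) => distance(m.position, position) <= sameSpotRadiusM);
  if (existing) {
    existing.text = text;
    existing.position = position;
    return existing;
  }
  const marker: Marker = {text, position};
  markers.push(marker);
  return marker;
}
```

Because the lookup ignores text, a rephrased label at the same spot updates in place, while a second chair elsewhere gets its own marker.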

salmanmkc · 2026-06-18T07:33:31Z

> The screenshot sends frames of virtual content (with the camera image optionally underneath). I thought your demo required sending camera images.

ah I see, yea I understand what you mean now

> Apologies for not being clear with my comments.
> My thought was to merge the capabilities of your streamScene into startGeminiLive so an explicit flag decides whether to stream screenshots or stream camera frames. The rest of GeminiManager already handles the callbacks and the microphone so those would be redundant.

thank you, that makes sense. one thing on defaults: i kept captureMode: 'camera' as the default so existing callers of startGeminiLive (7_ai_live etc.) keep streaming the passthrough camera like before. screenshot is opt-in. world_companion passes 'camera' explicitly anyway. lmk if you'd rather screenshot be the default.

I've got it set up like this: startGeminiLive now takes captureMode: 'screenshot' | 'camera'. screenshot streams the rendered scene via core.screenshotSynthesizer (with the camera underneath when overlayOnCamera is set), camera streams raw xrDeviceCamera frames. the world_companion demo passes 'camera'.

> Your demo video is super cool btw! We're looking forward to merging it! We can probably delete demos/aisimulator once your demo is merged.

thanks! and yeah, makes sense to drop demos/aisimulator once this lands.

…niLive existing callers (e.g. templates/7_ai_live) call startGeminiLive with no captureMode and used to stream raw camera frames. screenshot is now opt-in so those callers keep streaming the passthrough camera.

screenshotSynthesizer.getScreenshot returns a PNG data URL, but sendVideoFrame always tagged frames image/jpeg. parse the MIME from the data URL so screenshot mode sends png and camera mode stays jpeg.
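a small sketch of pulling the MIME type out of a data URL. splitDataUrl is a hypothetical helper, and falling back to image/jpeg for input that isn't a data URL is an assumption.

```typescript
// Split a data URL into its MIME type and payload.
// Falls back to `fallbackMime` when the input is not a data URL (assumption).
function splitDataUrl(dataUrl: string, fallbackMime = 'image/jpeg'): {mimeType: string; data: string} {
  const m = /^data:([^;,]+)?(?:;[^,]*)?,([\s\S]*)$/.exec(dataUrl);
  if (!m) return {mimeType: fallbackMime, data: dataUrl};
  return {mimeType: m[1] ?? fallbackMime, data: m[2]};
}
```

so a `data:image/png;base64,...` screenshot is tagged image/png, while a bare base64 camera frame keeps the jpeg tag.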

…opening startLiveAI only resolved on onopen and rejected on onerror; a close before open left startGeminiLive hanging (UI stuck at 'opening session...'). reject in onclose when it never opened.
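the shape of that fix, sketched against a generic callback interface. SessionCallbacks and openSession are hypothetical names; the real code wires these callbacks into the Live SDK connection.

```typescript
interface SessionCallbacks {
  onopen(): void;
  onerror(err: Error): void;
  onclose(): void;
}

// Resolve once the session opens; reject on error or on a close that
// arrives before the session ever opened.
function openSession(connect: (cb: SessionCallbacks) => void): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    let opened = false;
    connect({
      onopen: () => {
        opened = true;
        resolve();
      },
      onerror: (err) => reject(err),
      onclose: () => {
        if (!opened) reject(new Error('session closed before opening'));
      },
    });
  });
}
```

a close after a successful open leaves the already-resolved promise untouched, so normal shutdown is unaffected.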

cleanup cleared the queue but left nextAudioStartTime and queuedSourceNodes intact, so a restarted session's audio was delayed until the new AudioContext clock caught up. stop playback (which resets both) before closing the context.

the placeLabel description suggests items like ["mug","laptop"], but normItem rejected bare strings (JSON.parse throws). treat a bare string as {text}, so it works whether the model follows the schema (objects) or the description.
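a sketch of a forgiving normaliser along those lines. the dot/arrow/pulse styles and the {text} shape come from this PR; the function body, the isStyle guard and defaulting to dot are assumptions, not the demo's actual normItem.

```typescript
type Style = 'dot' | 'arrow' | 'pulse';

interface LabelItem {
  text: string;
  style: Style;
}

function isStyle(x: unknown): x is Style {
  return x === 'dot' || x === 'arrow' || x === 'pulse';
}

// Accept an object, a JSON-encoded object, or a bare string like "mug".
function normItem(raw: unknown): LabelItem | null {
  let value: unknown = raw;
  if (typeof raw === 'string') {
    const trimmed = raw.trim();
    let parsed: unknown;
    try {
      parsed = JSON.parse(trimmed);
    } catch {
      parsed = trimmed; // not JSON: treat it as the label text itself
    }
    value =
      typeof parsed === 'object' && parsed !== null
        ? parsed
        : {text: typeof parsed === 'string' ? parsed : trimmed};
  }
  if (typeof value !== 'object' || value === null) return null;
  const {text, style} = value as {text?: unknown; style?: unknown};
  if (typeof text !== 'string' || text.trim() === '') return null;
  return {text: text.trim(), style: isStyle(style) ? style : 'dot'};
}
```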

the XR stop button is always enabled, so pressing it mid-start ran stop while isAIRunning was still false (stopGeminiLive no-ops) and the start finished anyway. bail out of stop() when starting.

… re-rolls the detector is non-deterministic and sometimes returns a sparse/unrelated set (often just ["coffee table"]). a rapid second placeLabel would re-run it and report not_found for something the previous call had just placed. reuse a detection younger than 3.5s, and prefer the richer of fresh vs cached so a degenerate re-detection can't clobber a good one.
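one way to express the reuse rule, as a sketch. the 3.5s window is from the commit message; treating "richer" as "more labels", and the Detection shape and function names, are assumptions.

```typescript
interface Detection {
  labels: string[];
  atMs: number; // when the detection finished
}

const MAX_AGE_MS = 3500; // reuse window from the commit message

// A cached detection is reusable while it is younger than the window.
function isFresh(cached: Detection | null, nowMs: number): boolean {
  return cached !== null && nowMs - cached.atMs < MAX_AGE_MS;
}

// Keep the cached result when it is still fresh and has more labels than
// the new one, so a sparse re-detection can't clobber a good one.
function pickRicher(cached: Detection | null, fresh: Detection): Detection {
  if (cached !== null && isFresh(cached, fresh.atMs) && cached.labels.length > fresh.labels.length) {
    return cached;
  }
  return fresh;
}
```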

salmanmkc added 11 commits May 9, 2026 13:33

Apply prettier formatting to world.streamScene + tests

7c29eed

Use IconButton + Orbiter for world_companion XR panel

2a5fa3c

Fix duplicate startBtn/stopBtn/clearBtn declarations

9953571

Shrink world_companion XR panel and push it further

a78a2ee

Flatten world_companion XR panel layout

aa63f1e

Head-lock world_companion panel to camera

b90acc9

salmanmkc marked this pull request as draft May 10, 2026 11:39

salmanmkc added 7 commits May 10, 2026 13:35

Attach world_companion panel to user rig instead of bare camera

92e84b3

Let placeLabel mark multiple objects in one call

9ada520

Accept single text/style alongside items[] in placeLabel

ccfd954

Log placeLabel detection diagnostics

e04efb9

Stop placing labels on wrong objects when no name match

825867b

Fall back to in-front-of-user placement when detection misses

9109a67

Prefer any detected object over in-front fallback

d162207

Merge branch 'main' into feat/world-stream-scene

e987544

salmanmkc marked this pull request as ready for review May 11, 2026 06:44

salmanmkc added 6 commits May 11, 2026 07:47

world_companion: drop placeLabel/placeMarker debug logging

ea50d91

salmanmkc force-pushed the feat/world-stream-scene branch from c6512b3 to ea50d91 Compare May 11, 2026 06:47

salmanmkc changed the title ~~Add world.streamScene + world.lookingAt primitives, plus world_companion demo~~ world.streamScene + world.lookingAt + world_companion demo (now with Quest 3 camera support) May 11, 2026

salmanmkc added 3 commits June 18, 2026 10:11

world_companion: use setText for the XR detection-status line

165f234

world_companion: pass camera + tools config into startGeminiLive

997bb5a

salmanmkc added 12 commits June 18, 2026 12:55

world_companion: stream passthrough frames via captureMode camera

7095c26

GeminiManager: resume the audio context before playback

f49d8c7

world_companion: strip leading articles from detected labels

36b023e

world_companion: space the want/got status line after commas

3fdc67a

Merge branch 'main' into feat/world-stream-scene

7a1a503

salmanmkc force-pushed the feat/world-stream-scene branch from 3f305a9 to 7a1a503 Compare June 18, 2026 07:48

salmanmkc added 7 commits June 18, 2026 17:13

GeminiManager: default captureMode to camera to match prior startGemi…

8b23a9a

GeminiManager: send the screenshot's real MIME type, not hardcoded jpeg

f6de2de

GeminiManager: fail the startup promise if the session closes before …

a37e88c

GeminiManager: reset audio scheduling on cleanup

9e18c99

world_companion: ignore stop while a start is in progress

a852bd0

salmanmkc changed the title ~~world.streamScene + world.lookingAt + world_companion demo (multi-item placeLabel, embedding dedupe, Quest 3 camera)~~ world_companion demo + startGeminiLive captureMode flag (multi-item placeLabel, embedding match, Quest 3 camera) Jun 18, 2026

salmanmkc changed the title ~~world_companion demo + startGeminiLive captureMode flag (multi-item placeLabel, embedding match, Quest 3 camera)~~ world_companion demo + startGeminiLive captureMode flag (multi-item placeLabel, embedding match) Jun 18, 2026

salmanmkc wants to merge 84 commits into google:main from salmanmkc:feat/world-stream-scene

3 participants
