world_companion demo + startGeminiLive captureMode flag (multi-item placeLabel, embedding match)#268

Open
salmanmkc wants to merge 84 commits into google:main from salmanmkc:feat/world-stream-scene
Conversation

@salmanmkc

@salmanmkc salmanmkc commented May 10, 2026

Contributor

been wanting a voice + vision companion in xr for a while. start it, talk to it, it sees what you see and can drop markers on stuff.

demo's in demos/world_companion. it runs on the existing startGeminiLive in GeminiManager, which now takes a captureMode: 'screenshot' | 'camera' flag. screenshot streams the rendered scene (optionally over the camera image); camera streams raw passthrough frames. default stays camera so existing callers keep their old behaviour; the companion passes camera explicitly since it reasons about the real room. tool calls + transcription come back via the manager's events, so nothing new on the world api.
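A minimal sketch of how a single flag can choose the frame source, with 'camera' as the default. Only the `captureMode` values come from this PR; `FrameSources` and `pickFrameSource` are hypothetical names used for illustration and are not part of GeminiManager.

```typescript
type CaptureMode = 'screenshot' | 'camera';

// A frame source resolves to an image data URL (assumption for this sketch).
type FrameSource = () => Promise<string>;

interface FrameSources {
  screenshot: FrameSource; // rendered scene, optionally over the camera image
  camera: FrameSource; // raw passthrough frames
}

// Pick which source the streaming loop should poll.
// Omitting captureMode keeps the old behaviour: raw camera frames.
function pickFrameSource(
  sources: FrameSources,
  captureMode: CaptureMode = 'camera',
): FrameSource {
  return captureMode === 'screenshot' ? sources.screenshot : sources.camera;
}
```

Keeping the default inside one small function means existing callers that pass nothing never need to know the flag exists.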

there's a placeLabel tool with three styles (dot, arrow, pulse) so it can pick how to highlight, arrow if you ask it to find something, pulse for tiny stuff, dot otherwise. uses world.objects.runDetection so markers stick to the real object via depth, not to wherever your head was when the tool fired, it will do this on the desktop simulator though. a lookCloser tool answers "what am i pointing at" off the reticle. small spatial panel for start/stop/clear so it's actually usable once you're in immersive.

you can ask for several things in one go ("label the couch, tv, and coffee table") and the tool takes an items[] array so they all get placed in a single call, each with its own style. labels billboard back at the camera so they stay readable when you walk around.

detector labels and what you say don't always match: television vs tv, pendant light vs floor lamp, picture vs painting. so it matches by meaning through gemini's embedContent api (SEMANTIC_SIMILARITY task type + cosine), reducing to the core noun first so modifiers like "green chair" don't tank the match, and dedupes markers across rephrasings. the embed cache is per-page so it's basically free after the first call.
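The cosine step can be sketched as below. This assumes the embedding vectors have already been fetched and cached; `cosine` and `bestMatch` are hypothetical helper names, and the threshold is left as a parameter because the commits in this PR tune it (values around 0.87 to 0.88 are mentioned).

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) throw new Error('vector length mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Return the detected label whose embedding is closest to the request,
// or null when nothing reaches the threshold.
function bestMatch(
  request: readonly number[],
  detected: ReadonlyMap<string, readonly number[]>,
  threshold: number,
): string | null {
  let best: string | null = null;
  let bestScore = threshold;
  for (const [label, vec] of detected) {
    const score = cosine(request, vec);
    if (score >= bestScore) {
      best = label;
      bestScore = score;
    }
  }
  return best;
}
```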

Try launching the demo and saying, for example, "place an arrow on my water bottle".

I have some thoughts on updating the state of objects it has already seen, but I'll leave that for a later update; for now this seems ok.

Afaik Gemini will be able to talk and see screens in Android XR, but this allows interaction with the real world through Gemini Live.

I will see if I can get a demo recorded for this. This is open to lots of feedback though, since this is just a very rough version.

Edit, here's a demo! https://youtu.be/-5s_aV6eV_A

I may have to just add key input in again but will double check later when I'm home

salmanmkc added 11 commits May 9, 2026 13:33
world.streamScene(prompt, opts) opens a Gemini Live session and runs a
periodic camera-frame loop into it, with auto-dispatch of agentic tools and
auto-playback of model audio via CoreSound. Returns a {stop, isActive}
handle. Throws cleanly when AI / Live capability / device camera are
missing instead of failing deep in the SDK.

world.lookingAt(controllerId?) is sugar over User.getReticleTarget so demos
can stay on the world.* namespace.

World now takes registry as a Script dependency so the new primitives can
resolve AI / XRDeviceCamera / CoreSound / User without callers wiring it
through every method.

11 tests covering missing-AI, non-Live AI, missing-camera, the frame loop,
text+audio routing, onAudio override, tool dispatch, unknown tool, and
onToolCall intercept.
A small single-file demo that wires xb.core.world.streamScene to a Live
session with two demo-local tools: placeLabel drops a marker in front of
the camera, and lookCloser reports what the user's reticle is aimed at via
xb.core.world.lookingAt.

Mirrors the world_ask UI pattern (floating bottom panel, transcript,
start/stop) so users have a complete reference for the new primitive
without leaving the demos directory.
Switch placeLabel from live reticle sampling to world.objects.runDetection
so labels anchor to actual detected objects in world space, not wherever
the user was looking when the tool fired. Also render a Troika text label
above the marker, not just a bare sphere.

Add a SpatialPanel with start/stop/clear controls so the demo is usable
in immersive mode, not just from the flat web overlay.
placeLabel now takes a style param so the model can pick how to highlight
something: dot for casual noting, arrow for 'point this out for me',
pulse for small or hard-to-spot things. Arrow gently bobs, pulse
expands and fades on a 1.5s loop.
Default enableDepth() leaves updateFullResolutionGeometry off, so the
depth mesh snapshot used by object detection is too sparse to raycast
against. Markers were landing near the camera instead of on the actual
detected object. Copy the depth flags the gemini_xrobject demo uses.
@salmanmkc salmanmkc marked this pull request as draft May 10, 2026 11:39
@salmanmkc

Contributor Author

Turns out I hit the rate limit of 20 object detections per day when I checked the logs; for some reason I thought it was broken.

@salmanmkc salmanmkc marked this pull request as ready for review May 11, 2026 06:44
salmanmkc added 6 commits May 11, 2026 07:47
ObjectDetector now switches targetDevice to 'quest' when the Oculus
browser is detected, instead of always falling back to galaxyxr params.
Adds QuestCameraParams.ts with approximate Quest 3 passthrough intrinsics
(fx/fy ~800 at 1280x720, ~77° HFOV from the cropped getUserMedia stream)
and an offset for the RGB camera relative to the right XR eye. These are
estimates - WebXR doesn't expose the real values - and may need
per-device tweaks.
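For context, the relationship between the quoted field of view and focal length follows the standard pinhole model. The sketch below is not the contents of QuestCameraParams.ts; `intrinsicsFromHfov` and `pixelToRay` are hypothetical helpers, and they assume square pixels, a principal point at the image centre, and a camera-space ray with +z pointing forward.

```typescript
interface Intrinsics {
  fx: number;
  fy: number;
  cx: number;
  cy: number;
}

// Pinhole model: fx = (width / 2) / tan(hfov / 2).
function intrinsicsFromHfov(
  width: number,
  height: number,
  hfovDeg: number,
): Intrinsics {
  const halfFov = (hfovDeg * Math.PI) / 180 / 2;
  const f = width / 2 / Math.tan(halfFov);
  return {fx: f, fy: f, cx: width / 2, cy: height / 2};
}

// Turn a pixel into a unit ray in camera space (image convention, +z forward).
function pixelToRay(
  k: Intrinsics,
  u: number,
  v: number,
): [number, number, number] {
  const x = (u - k.cx) / k.fx;
  const y = (v - k.cy) / k.fy;
  const len = Math.hypot(x, y, 1);
  return [x / len, y / len, 1 / len];
}
```

With 1280x720 and a 77° horizontal field of view this gives a focal length of roughly 805 px, in line with the ~800 estimate in the commit message.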

Also swaps the detection debug image dump from auto-downloading PNGs
(unusable on Quest browser) to a console-log preview that shows the
image inline, and adds a few extra logs in world_companion to help see
what placeLabel is actually receiving from the detector.
Quest 3 passthrough cameras are physically angled downward; labels were
landing too high above table-surface objects. Apply a -0.26 rad pitch in
the right-camera pose so unprojected detections line up with what the
user actually sees.
Floating world labels were getting cut up by the passthrough depth mesh
- letters disappearing where the mesh triangles passed in front of them.
Disable depthTest/depthWrite on the troika text and bump renderOrder so
labels always draw on top.
Gemini sometimes calls placeLabel multiple times for what's clearly the
same physical thing (e.g. "laptop" then "macbook"), and unprojection
drift puts the two markers a few cm apart - so the user sees the label
twice. Match by text first, then fall back to a 2m proximity check, and
update the existing marker in place instead of stacking a new one.
When the Gemini Live websocket drops (1011 internal error) and
reconnects, it replays its tool-call context, which fires placeLabel
again with the same items. Cache the last call key for 2s and short-
circuit the duplicate so we don't redo detection or stack new markers
on top of the existing ones.
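A rough sketch of the short-circuit described in this commit, assuming the call key is simply the JSON-serialised items. `ReplayGuard` is a hypothetical name, and the injectable clock exists only to make the behaviour easy to test.

```typescript
// Remembers the last tool-call key and reports repeats inside a time window.
class ReplayGuard {
  private lastKey: string | null = null;
  private lastAt = 0;
  private readonly windowMs: number;
  private readonly now: () => number;

  constructor(windowMs: number = 2000, now: () => number = () => Date.now()) {
    this.windowMs = windowMs;
    this.now = now;
  }

  // True when the same items were seen less than windowMs ago.
  shouldSkip(items: unknown): boolean {
    const key = JSON.stringify(items);
    const t = this.now();
    if (this.lastKey === key && t - this.lastAt < this.windowMs) return true;
    this.lastKey = key;
    this.lastAt = t;
    return false;
  }
}
```

A skipped call does not refresh the timestamp here, so a long stream of replays cannot keep the window open indefinitely; that is a design choice of this sketch, not something the commit states.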
Was useful while debugging Quest calibration and dedup behaviour but
just noise in the console for everyone else. Error paths keep their
console.warn.
@salmanmkc salmanmkc force-pushed the feat/world-stream-scene branch from c6512b3 to ea50d91 Compare May 11, 2026 06:47
@salmanmkc salmanmkc changed the title from "Add world.streamScene + world.lookingAt primitives, plus world_companion demo" to "world.streamScene + world.lookingAt + world_companion demo (now with Quest 3 camera support)" May 11, 2026
uiblocks UIText has no .text setter, so the want/got detection feedback
stopped updating in XR after the panel moved off SpatialPanel. Use setText.
Pass the capture config (fps, quality, width, height) and the tools through
startGeminiLive instead of only exposing them as fields the caller mutates,
and honor fps via the screenshot interval. This covers the rest of what the
old StreamSceneOptions configured so callers don't need a separate entrypoint.
Hand the fps / downscale and the tools to startGeminiLive directly rather than
setting fields before the call.
@dli7319

dli7319 commented Jun 18, 2026

Collaborator

The screenshot sends frames of virtual content (with the camera image optionally underneath). I thought your demo required sending camera images.

@dli7319

dli7319 commented Jun 18, 2026

Collaborator

Apologies for not being clear with my comments.

My thought was to merge the capabilities of your streamScene into startGeminiLive so an explicit flag decides whether to stream screenshots or stream camera frames. The rest of GeminiManager already handles the callbacks and the microphone so those would be redundant.

@dli7319

dli7319 commented Jun 18, 2026

Collaborator

Your demo video is super cool btw! We're looking forward to merging it!
We can probably delete demos/aisimulator once your demo is merged.

salmanmkc added 12 commits June 18, 2026 12:55
startGeminiLive can now stream either the rendered scene (captureMode
'screenshot', via core.screenshotSynthesizer, optionally composited over the
camera) or raw passthrough frames (captureMode 'camera', via
xrDeviceCamera.getSnapshot). Defaults to 'screenshot' with the camera overlay
on. This folds the camera-streaming capability in as a mode instead of a
separate entrypoint.
The companion reasons about the real room, so pass captureMode 'camera' to
send raw device-camera frames rather than a render of the virtual content.
initializeAudioContext never resumed the context, so a context created
outside a user gesture could stay suspended and play model audio silently.
Resume it when suspended before scheduling playback.
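The resume check uses the standard Web Audio `state` property and `resume()` method. In the sketch below, `ResumableContext` is a hypothetical structural type standing in for `AudioContext` so the snippet also runs outside a browser, and `ensureRunning` is an illustrative name rather than the manager's actual method.

```typescript
// The subset of AudioContext this sketch relies on.
interface ResumableContext {
  readonly state: string;
  resume(): Promise<void>;
}

// Resume a context that was created outside a user gesture and is still
// suspended, so scheduled playback is actually audible.
async function ensureRunning(ctx: ResumableContext): Promise<void> {
  if (ctx.state === 'suspended') {
    await ctx.resume();
  }
}
```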
The companion was pre-judging from its own camera view and refusing ("I'm
not seeing a chair") instead of calling placeLabel. Tell it to call placeLabel
for every item the user named and base its reply on the tool's placed /
not-found result rather than deciding up front.
The detector returns names like "the chair", which read badly as label text
and in the status line. Strip a leading the/a/an from each detected label.
Join the want/got label lists with ", " so the XR status reads cleanly.
Request the gemini-embedding-001 vectors with taskType SEMANTIC_SIMILARITY,
which separates related furniture words from unrelated ones far better than
the default embeddings (true pairs land ~0.89-0.99, unrelated ones stay below
~0.88). Raise the cosine threshold to 0.88 to match and drop the hand-written
synonym list, since the embeddings now cover couch/sofa, tv/television,
chair/stool, light/ceiling lamp and the like on their own.
The live onclose only flipped isAIRunning, so a server-side close left the
mic, audio nodes and screenshot interval running. Run cleanup() there and
dispatch a 'close' event so callers can reset their own UI on a remote close.
Add a starting flag so the XR and DOM buttons can't open two sessions during
the opening window, and listen for the manager's close event to re-enable the
controls (instead of staying stuck on "listening") when the session ends
server-side.
…hreshold

The detector emits verbose names like "art piece on the left", whose trailing
location phrase weakens the embedding match and clutters the labels; strip it
(plus the leading article) down to the core noun. Also lower the cosine
threshold to 0.87 so an armchair the detector calls "sofa" still matches a
request for a chair (chair/sofa ~0.88) while unrelated pairs stay below.
The model often labels with a modified phrase ("another chair", "green
chair"), and embedding the whole phrase pulled it away from the detected word
(e.g. "another chair" vs "sofa" dropped to ~0.84, below threshold). Reduce
both the request and the detected label to their core noun (dropping articles,
determiners and adjectives like colours) before the similarity check, and add
those determiners to the stopword list. "another chair" now matches "sofa"
the same as "chair" does.
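The reduction to a core noun might look roughly like this. The stopword and location-word lists are illustrative assumptions, not the lists used in the demo, and `coreNoun` is a hypothetical name.

```typescript
// Articles, determiners and a few adjectives (assumed list for illustration).
const STOPWORDS = new Set([
  'the', 'a', 'an', 'another', 'other', 'this', 'that', 'my', 'your',
  'green', 'red', 'blue', 'black', 'white', 'big', 'small',
]);

// A trailing location phrase such as " on the left" or " near the window".
const LOCATION = /\s+(?:on|in|at|near|by|behind|under)\s+.*$/;

// Reduce a request or detector label to the words that carry the meaning.
function coreNoun(label: string): string {
  const head = label.toLowerCase().trim().replace(LOCATION, '');
  const kept = head
    .split(/\s+/)
    .filter((word) => word.length > 0 && !STOPWORDS.has(word));
  // If everything was a stopword, fall back to the trimmed phrase.
  return kept.length > 0 ? kept.join(' ') : head;
}
```

Applying the same function to both the request and the detected label means the embedding comparison sees "chair" on both sides for "another chair" and "green chair".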
…by position

Two fixes so "label another chair" marks a second chair instead of re-hitting
the first: findMatch now prefers detections that aren't already labelled
(falling back to any match when the user re-references the only one), and
placeMarker dedupes by world position instead of by text. Same physical spot
still updates in place, but two different objects of the same kind keep their
own labels.
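The two fixes can be sketched together as follows. `Marker`, `preferUnlabelled` and `placeMarker` here are simplified stand-ins for the demo's code, and the dedupe radius is a parameter because this commit does not state the distance it uses.

```typescript
interface Vec3 {
  x: number;
  y: number;
  z: number;
}

interface Marker {
  text: string;
  position: Vec3;
}

function distance(a: Vec3, b: Vec3): number {
  return Math.hypot(a.x - b.x, a.y - b.y, a.z - b.z);
}

// Prefer a matching detection with no marker nearby; fall back to the first
// match when every candidate is already labelled.
function preferUnlabelled<T extends {position: Vec3}>(
  matches: T[],
  markers: Marker[],
  radius: number,
): T | undefined {
  const unlabelled = matches.find(
    (d) => !markers.some((m) => distance(m.position, d.position) <= radius),
  );
  return unlabelled ?? matches[0];
}

// Dedupe by world position: the same spot updates in place, a new spot adds.
function placeMarker(
  markers: Marker[],
  text: string,
  position: Vec3,
  radius: number,
): Marker {
  const existing = markers.find((m) => distance(m.position, position) <= radius);
  if (existing) {
    existing.text = text;
    existing.position = position;
    return existing;
  }
  const created: Marker = {text, position};
  markers.push(created);
  return created;
}
```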
@salmanmkc

salmanmkc commented Jun 18, 2026

Contributor Author

> The screenshot sends frames of virtual content (with the camera image optionally underneath). I thought your demo required sending camera images.

ah I see, yea I understand what you mean now

> Apologies for not being clear with my comments.
> My thought was to merge the capabilities of your streamScene into startGeminiLive so an explicit flag decides whether to stream screenshots or stream camera frames. The rest of GeminiManager already handles the callbacks and the microphone so those would be redundant.

thank you that makes sense. one thing on defaults: i kept captureMode: 'camera' as the default so existing callers of startGeminiLive (7_ai_live etc.) keep streaming the passthrough camera like before. screenshot is opt-in. world_companion passes 'camera' explicitly anyway. lmk if you'd rather screenshot be the default. I've got it set up like this: startGeminiLive now takes captureMode: 'screenshot' | 'camera'. screenshot streams the rendered scene via core.screenshotSynthesizer (with the camera underneath when overlayOnCamera is set), camera streams raw xrDeviceCamera frames. the world_companion demo passes 'camera'.

> Your demo video is super cool btw! We're looking forward to merging it! We can probably delete demos/aisimulator once your demo is merged.

thanks! and yeah, makes sense to drop demos/aisimulator once this lands.

@salmanmkc salmanmkc force-pushed the feat/world-stream-scene branch from 3f305a9 to 7a1a503 Compare June 18, 2026 07:48
…niLive

existing callers (e.g. templates/7_ai_live) call startGeminiLive with no
captureMode and used to stream raw camera frames. screenshot is now opt-in so
those callers keep streaming the passthrough camera.
screenshotSynthesizer.getScreenshot returns a PNG data URL, but sendVideoFrame
always tagged frames image/jpeg. parse the MIME from the data URL so screenshot
mode sends png and camera mode stays jpeg.
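Reading the MIME type out of a data URL can be done with one regular expression. `splitDataUrl` is a hypothetical helper, and falling back to image/jpeg for unrecognised input is an assumption of this sketch.

```typescript
interface FramePayload {
  mimeType: string;
  data: string; // the part after the comma, usually base64
}

// Split "data:<mime>[;params],<payload>" into its MIME type and payload.
function splitDataUrl(
  dataUrl: string,
  fallbackMime: string = 'image/jpeg',
): FramePayload {
  const match = /^data:([^;,]*)(?:;[^,]*)?,([\s\S]*)$/.exec(dataUrl);
  if (!match) {
    // Not a data URL: treat the whole string as the payload.
    return {mimeType: fallbackMime, data: dataUrl};
  }
  return {mimeType: match[1] || fallbackMime, data: match[2]};
}
```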
…opening

startLiveAI only resolved on onopen and rejected on onerror; a close before
open left startGeminiLive hanging (UI stuck at 'opening session...'). reject in
onclose when it never opened.
cleanup cleared the queue but left nextAudioStartTime and queuedSourceNodes
intact, so a restarted session's audio was delayed until the new AudioContext
clock caught up. stop playback (which resets both) before closing the context.
the placeLabel description suggests items like ["mug","laptop"], but normItem
rejected bare strings (JSON.parse throws). treat a bare string as {text}, so it
works whether the model follows the schema (objects) or the description.
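A sketch of a normaliser that tolerates both shapes. The function name `normItem` and the three styles come from this PR; the body, the `LabelItem` type, and the choice of 'dot' as the default style for items without one are assumptions for illustration.

```typescript
type LabelStyle = 'dot' | 'arrow' | 'pulse';

interface LabelItem {
  text: string;
  style: LabelStyle;
}

// Accept {text, style?} objects, JSON-encoded objects, or bare strings.
function normItem(raw: unknown): LabelItem | null {
  let value: unknown = raw;
  if (typeof value === 'string') {
    const trimmed = value.trim();
    if (trimmed.length === 0) return null;
    try {
      value = JSON.parse(trimmed);
    } catch {
      value = {text: trimmed}; // bare string such as "mug"
    }
    // A string that parses to a non-object (e.g. "42") is still a label.
    if (typeof value !== 'object' || value === null) value = {text: trimmed};
  }
  if (typeof value !== 'object' || value === null) return null;
  const record = value as Record<string, unknown>;
  const text = record['text'];
  if (typeof text !== 'string' || text.trim() === '') return null;
  const rawStyle = record['style'];
  const style: LabelStyle =
    rawStyle === 'arrow' || rawStyle === 'pulse' ? rawStyle : 'dot';
  return {text: text.trim(), style};
}
```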
the XR stop button is always enabled, so pressing it mid-start ran stop while
isAIRunning was still false (stopGeminiLive no-ops) and the start finished
anyway. bail out of stop() when starting.
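The start/stop guard can be sketched as a small state holder. `SessionGate` and its callbacks are hypothetical; the demo keeps this state in its own UI code.

```typescript
// Tracks whether a session is opening or running so the XR and DOM buttons
// cannot race each other.
class SessionGate {
  private starting = false;
  private running = false;
  private readonly open: () => Promise<void>;
  private readonly close: () => void;

  constructor(open: () => Promise<void>, close: () => void) {
    this.open = open;
    this.close = close;
  }

  // Resolves to false when a start is already in flight or a session is live.
  async start(): Promise<boolean> {
    if (this.starting || this.running) return false;
    this.starting = true;
    try {
      await this.open();
      this.running = true;
      return true;
    } finally {
      this.starting = false;
    }
  }

  // Bail out mid-start, since the pending start would finish anyway.
  stop(): boolean {
    if (this.starting || !this.running) return false;
    this.close();
    this.running = false;
    return true;
  }

  // Call from the manager's close event so the controls re-enable.
  handleRemoteClose(): void {
    this.running = false;
  }
}
```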
… re-rolls

the detector is non-deterministic and sometimes returns a sparse/unrelated set
(often just ["coffee table"]). a rapid second placeLabel would re-run it and
report not_found for something the previous call had just placed. reuse a
detection younger than 3.5s, and prefer the richer of fresh vs cached so a
degenerate re-detection can't clobber a good one.
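One way to express the "prefer the richer result" rule is below. It assumes "richer" means "more detected items", which the commit message does not spell out; `TimedDetection` and `pickDetection` are hypothetical names.

```typescript
interface TimedDetection<T> {
  at: number; // timestamp in milliseconds
  items: T[];
}

// Keep a recent cached detection when the fresh one is sparser; otherwise
// take the fresh result.
function pickDetection<T>(
  cached: TimedDetection<T> | null,
  fresh: TimedDetection<T>,
  maxAgeMs: number = 3500,
): TimedDetection<T> {
  if (
    cached !== null &&
    fresh.at - cached.at < maxAgeMs &&
    cached.items.length > fresh.items.length
  ) {
    return cached;
  }
  return fresh;
}
```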
@salmanmkc salmanmkc changed the title from "world.streamScene + world.lookingAt + world_companion demo (multi-item placeLabel, embedding dedupe, Quest 3 camera)" to "world_companion demo + startGeminiLive captureMode flag (multi-item placeLabel, embedding match, Quest 3 camera)" Jun 18, 2026
@salmanmkc salmanmkc changed the title from "world_companion demo + startGeminiLive captureMode flag (multi-item placeLabel, embedding match, Quest 3 camera)" to "world_companion demo + startGeminiLive captureMode flag (multi-item placeLabel, embedding match)" Jun 18, 2026
Labels

demo New demo for XR Blocks demonstrating novel interactivity or perception features.
