world_companion demo + startGeminiLive captureMode flag (multi-item placeLabel, embedding match) #268
salmanmkc wants to merge 84 commits into
world.streamScene(prompt, opts) opens a Gemini Live session and runs a
periodic camera-frame loop into it, with auto-dispatch of agentic tools and
auto-playback of model audio via CoreSound. Returns a {stop, isActive}
handle. Throws cleanly when AI / Live capability / device camera are
missing instead of failing deep in the SDK.
world.lookingAt(controllerId?) is sugar over User.getReticleTarget so demos
can stay on the world.* namespace.
World now takes registry as a Script dependency so the new primitives can
resolve AI / XRDeviceCamera / CoreSound / User without callers wiring it
through every method.
11 tests covering missing-AI, non-Live AI, missing-camera, the frame loop,
text+audio routing, onAudio override, tool dispatch, unknown tool, and
onToolCall intercept.
A small single-file demo that wires xb.core.world.streamScene to a Live session with two demo-local tools: placeLabel drops a marker in front of the camera, and lookCloser reports what the user's reticle is aimed at via xb.core.world.lookingAt. Mirrors the world_ask UI pattern (floating bottom panel, transcript, start/stop) so users have a complete reference for the new primitive without leaving the demos directory.
Switch placeLabel from live reticle sampling to world.objects.runDetection so labels anchor to actual detected objects in world space, not wherever the user was looking when the tool fired. Also render a Troika text label above the marker, not just a bare sphere. Add a SpatialPanel with start/stop/clear controls so the demo is usable in immersive mode, not just from the flat web overlay.
placeLabel now takes a style param so the model can pick how to highlight something: dot for casual noting, arrow for 'point this out for me', pulse for small or hard-to-spot things. Arrow gently bobs, pulse expands and fades on a 1.5s loop.
Default enableDepth() leaves updateFullResolutionGeometry off, so the depth mesh snapshot used by object detection is too sparse to raycast against. Markers were landing near the camera instead of on the actual detected object. Copy the depth flags the gemini_xrobject demo uses.
Turns out I had hit the rate limit of 20 object detections per day, which I saw when I checked the logs. For some reason I thought it was broken.
ObjectDetector now switches targetDevice to 'quest' when the Oculus browser is detected, instead of always falling back to galaxyxr params. Adds QuestCameraParams.ts with approximate Quest 3 passthrough intrinsics (fx/fy ~800 at 1280x720, ~77° HFOV from the cropped getUserMedia stream) and an offset for the RGB camera relative to the right XR eye. These are estimates - WebXR doesn't expose the real values - and may need per-device tweaks. Also swaps the detection debug image dump from auto-downloading PNGs (unusable on Quest browser) to a console-log preview that shows the image inline, and adds a few extra logs in world_companion to help see what placeLabel is actually receiving from the detector.
Quest 3 passthrough cameras are physically angled downward; labels were landing too high above table-surface objects. Apply a -0.26 rad pitch in the right-camera pose so unprojected detections line up with what the user actually sees.
Floating world labels were getting cut up by the passthrough depth mesh - letters disappearing where the mesh triangles passed in front of them. Disable depthTest/depthWrite on the troika text and bump renderOrder so labels always draw on top.
Gemini sometimes calls placeLabel multiple times for what's clearly the same physical thing (e.g. "laptop" then "macbook"), and unprojection drift puts the two markers a few cm apart - so the user sees the label twice. Match by text first, then fall back to a 2m proximity check, and update the existing marker in place instead of stacking a new one.
When the Gemini Live websocket drops (1011 internal error) and reconnects, it replays its tool-call context, which fires placeLabel again with the same items. Cache the last call key for 2s and short-circuit the duplicate so we don't redo detection or stack new markers on top of the existing ones.
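A minimal sketch of that short-circuit, assuming the call key is built from the normalised item texts. DuplicateCallGuard and callKey are hypothetical names for illustration, not the demo's actual helpers, and the 2s window is passed in so it can be tuned.

```typescript
// Hypothetical helper: remembers the last placeLabel call key for a short window.
class DuplicateCallGuard {
  private lastKey: string | null = null;
  private lastTime = Number.NEGATIVE_INFINITY;
  private readonly windowMs: number;

  constructor(windowMs = 2000) {
    this.windowMs = windowMs;
  }

  // Returns true when the same key was already seen inside the window.
  // A duplicate does not refresh the timestamp, so the window is anchored
  // to the first real call.
  isDuplicate(key: string, now: number = Date.now()): boolean {
    if (key === this.lastKey && now - this.lastTime < this.windowMs) {
      return true;
    }
    this.lastKey = key;
    this.lastTime = now;
    return false;
  }
}

// Build an order-independent key from the requested item texts (an assumption
// about what counts as "the same call").
function callKey(items: string[]): string {
  return items.map((s) => s.trim().toLowerCase()).sort().join('|');
}
```

The tool handler would check isDuplicate(callKey(texts)) first and return early on true, before any detection runs.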
Was useful while debugging Quest calibration and dedup behaviour but just noise in the console for everyone else. Error paths keep their console.warn.
Force-pushed from c6512b3 to ea50d91.
uiblocks UIText has no .text setter, so the want/got detection feedback stopped updating in XR after the panel moved off SpatialPanel. Use setText.
Pass the capture config (fps, quality, width, height) and the tools through startGeminiLive instead of only exposing them as fields the caller mutates, and honor fps via the screenshot interval. This covers the rest of what the old StreamSceneOptions configured so callers don't need a separate entrypoint.
Hand the fps / downscale and the tools to startGeminiLive directly rather than setting fields before the call.
The screenshot sends frames of virtual content (with the camera image optionally underneath). I thought your demo required sending camera images.
Apologies for not being clear with my comments. My thought was to merge the capabilities of your
Your demo video is super cool btw! We're looking forward to merging it!
startGeminiLive can now stream either the rendered scene (captureMode 'screenshot', via core.screenshotSynthesizer, optionally composited over the camera) or raw passthrough frames (captureMode 'camera', via xrDeviceCamera.getSnapshot). Defaults to 'screenshot' with the camera overlay on. This folds the camera-streaming capability in as a mode instead of a separate entrypoint.
The companion reasons about the real room, so pass captureMode 'camera' to send raw device-camera frames rather than a render of the virtual content.
initializeAudioContext never resumed the context, so a context created outside a user gesture could stay suspended and play model audio silently. Resume it when suspended before scheduling playback.
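Roughly, the resume step looks like the sketch below. ResumableContext is a hypothetical minimal interface covering the two AudioContext members used here (state and resume()), so the helper can be exercised without a browser. ensureContextRunning is an illustrative name, not the function in GeminiManager.

```typescript
// Minimal slice of the Web Audio AudioContext surface that this sketch needs.
interface ResumableContext {
  state: string;
  resume(): Promise<void>;
}

// Resume a context that was created outside a user gesture and is still
// suspended, so scheduled playback is actually audible.
async function ensureContextRunning(ctx: ResumableContext): Promise<void> {
  if (ctx.state === 'suspended') {
    await ctx.resume();
  }
}
```

A real AudioContext satisfies this interface, so the helper can be awaited right before scheduling the first audio chunk.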
The companion was pre-judging from its own camera view and refusing ("I'm
not seeing a chair") instead of calling placeLabel. Tell it to call placeLabel
for every item the user named and base its reply on the tool's placed /
not-found result rather than deciding up front.
The detector returns names like "the chair", which read badly as label text and in the status line. Strip a leading the/a/an from each detected label.
Join the want/got label lists with ", " so the XR status reads cleanly.
Request the gemini-embedding-001 vectors with taskType SEMANTIC_SIMILARITY, which separates related furniture words from unrelated ones far better than the default embeddings (true pairs land ~0.89-0.99, unrelated ones stay below ~0.88). Raise the cosine threshold to 0.88 to match and drop the hand-written synonym list, since the embeddings now cover couch/sofa, tv/television, chair/stool, light/ceiling lamp and the like on their own.
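For reference, the cosine check itself can be sketched as below. It assumes the embedding vectors have already been fetched (the embedContent request is not shown). bestMatch and LabelledVector are hypothetical names, and the threshold defaults to the 0.88 mentioned above.

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error('vectors must be non-empty and the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

interface LabelledVector {
  label: string;
  vector: number[];
}

// Return the detected label closest to the request, or null when nothing
// reaches the threshold.
function bestMatch(
  request: number[],
  detections: LabelledVector[],
  threshold = 0.88,
): string | null {
  let best: string | null = null;
  let bestScore = threshold;
  for (const d of detections) {
    const score = cosineSimilarity(request, d.vector);
    if (score >= bestScore) {
      best = d.label;
      bestScore = score;
    }
  }
  return best;
}
```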
The live onclose only flipped isAIRunning, so a server-side close left the mic, audio nodes and screenshot interval running. Run cleanup() there and dispatch a 'close' event so callers can reset their own UI on a remote close.
Add a starting flag so the XR and DOM buttons can't open two sessions during the opening window, and listen for the manager's close event to re-enable the controls (instead of staying stuck on "listening") when the session ends server-side.
…hreshold The detector emits verbose names like "art piece on the left", whose trailing location phrase weakens the embedding match and clutters the labels; strip it (plus the leading article) down to the core noun. Also lower the cosine threshold to 0.87 so an armchair the detector calls "sofa" still matches a request for a chair (chair/sofa ~0.88) while unrelated pairs stay below.
The model often labels with a modified phrase ("another chair", "green
chair"), and embedding the whole phrase pulled it away from the detected word
(e.g. "another chair" vs "sofa" dropped to ~0.84, below threshold). Reduce
both the request and the detected label to their core noun (dropping articles,
determiners and adjectives like colours) before the similarity check, and add
those determiners to the stopword list. "another chair" now matches "sofa"
the same as "chair" does.
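A rough sketch of the core-noun reduction, under the assumption that a stopword set plus a trailing-location regex is enough. The word lists here are illustrative, not the demo's actual lists, and coreNoun is a hypothetical name.

```typescript
// Illustrative stopwords: articles, determiners and a few colour adjectives.
const STOPWORDS = new Set([
  'the', 'a', 'an', 'another', 'other', 'this', 'that', 'my', 'second',
  'red', 'green', 'blue', 'black', 'white', 'grey', 'gray', 'brown',
]);

// Trailing location phrase such as " on the left" or " next to the door".
const LOCATION_PHRASE = /\s+(?:next to|on|in|at|near|by|behind|under)\s+.*$/;

// Reduce a request or detector label to its core noun before embedding it.
function coreNoun(label: string): string {
  const trimmed = label.trim().toLowerCase().replace(LOCATION_PHRASE, '');
  const words = trimmed.split(/\s+/).filter((w) => w.length > 0 && !STOPWORDS.has(w));
  // If everything was a stopword, fall back to the trimmed input.
  return words.length > 0 ? words.join(' ') : trimmed;
}
```

Both sides of the comparison go through the same function, so "another chair" and "the chair" embed as the same string.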
…by position Two fixes so "label another chair" marks a second chair instead of re-hitting the first: findMatch now prefers detections that aren't already labelled (falling back to any match when the user re-references the only one), and placeMarker dedupes by world position instead of by text. Same physical spot still updates in place, but two different objects of the same kind keep their own labels.
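The two fixes can be sketched together as below. Vec3, Marker, pickCandidate and the 0.3 m radius are assumptions for illustration only; the commit does not state the radius it uses, and the real placeMarker also handles the marker meshes.

```typescript
interface Vec3 { x: number; y: number; z: number }
interface Marker { text: string; position: Vec3 }

function distance(a: Vec3, b: Vec3): number {
  return Math.hypot(a.x - b.x, a.y - b.y, a.z - b.z);
}

function findMarkerNear(markers: Marker[], p: Vec3, radius: number): Marker | undefined {
  return markers.find((m) => distance(m.position, p) <= radius);
}

// candidates: world positions of detections that already passed the label
// match, best first. Prefer one that has no marker yet; fall back to the best
// match when every candidate is already labelled.
function pickCandidate(candidates: Vec3[], markers: Marker[], radius = 0.3): Vec3 | undefined {
  const unlabelled = candidates.find((c) => findMarkerNear(markers, c, radius) === undefined);
  return unlabelled ?? candidates[0];
}

// Dedupe by world position: the same spot updates in place, a different spot
// gets its own marker even when the text is identical.
function placeMarker(markers: Marker[], text: string, position: Vec3, radius = 0.3): Marker {
  const existing = findMarkerNear(markers, position, radius);
  if (existing) {
    existing.text = text;
    existing.position = position;
    return existing;
  }
  const marker: Marker = { text, position };
  markers.push(marker);
  return marker;
}
```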
ah I see, yeah I understand what you mean now
thank you that makes sense, one thing on defaults: i kept
thanks! and yeah, makes sense to drop demos/aisimulator once this lands.
Force-pushed from 3f305a9 to 7a1a503.
…niLive existing callers (e.g. templates/7_ai_live) call startGeminiLive with no captureMode and used to stream raw camera frames. screenshot is now opt-in so those callers keep streaming the passthrough camera.
screenshotSynthesizer.getScreenshot returns a PNG data URL, but sendVideoFrame always tagged frames image/jpeg. parse the MIME from the data URL so screenshot mode sends png and camera mode stays jpeg.
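One way to read the MIME type off a data URL is sketched below. parseDataUrl is a hypothetical helper name, and the fallback to image/jpeg for input without a data: prefix is an assumption about how raw base64 frames should be treated.

```typescript
interface ParsedFrame {
  mimeType: string;
  data: string;
}

// Split "data:<mime>[;params],<payload>" into MIME type and payload.
// Input without a data: prefix is treated as a raw payload with the fallback type.
function parseDataUrl(dataUrl: string, fallbackMime = 'image/jpeg'): ParsedFrame {
  const match = /^data:([^;,]+)?(?:;[^,]*)?,([\s\S]*)$/.exec(dataUrl);
  if (!match) {
    return { mimeType: fallbackMime, data: dataUrl };
  }
  return { mimeType: match[1] ?? fallbackMime, data: match[2] ?? '' };
}
```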
…opening startLiveAI only resolved on onopen and rejected on onerror; a close before open left startGeminiLive hanging (UI stuck at 'opening session...'). reject in onclose when it never opened.
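The shape of that fix, as a sketch: SessionCallbacks and openSession are hypothetical stand-ins for the Live SDK's callback wiring, not its real types. The point is only that onclose rejects when onopen never fired.

```typescript
// Hypothetical callback set, modelled on onopen / onerror / onclose.
interface SessionCallbacks {
  onopen(): void;
  onerror(error: Error): void;
  onclose(reason: string): void;
}

// Resolve once the session opens; reject on error or on a close that arrives
// before open, so the caller is never left waiting.
function openSession(connect: (callbacks: SessionCallbacks) => void): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    let opened = false;
    connect({
      onopen: () => {
        opened = true;
        resolve();
      },
      onerror: (error) => reject(error),
      onclose: (reason) => {
        if (!opened) {
          reject(new Error(`session closed before open: ${reason}`));
        }
      },
    });
  });
}
```

A close after a successful open leaves the already-resolved promise untouched, so normal teardown still goes through the close event instead.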
cleanup cleared the queue but left nextAudioStartTime and queuedSourceNodes intact, so a restarted session's audio was delayed until the new AudioContext clock caught up. stop playback (which resets both) before closing the context.
the placeLabel description suggests items like ["mug","laptop"], but normItem
rejected bare strings (JSON.parse throws). treat a bare string as {text}, so it
works whether the model follows the schema (objects) or the description.
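A sketch of the tolerant normaliser, assuming items may arrive as objects, JSON-encoded objects, or bare strings. LabelItem and the exact validation are illustrative; only the name normItem and the three styles come from the PR.

```typescript
type LabelStyle = 'dot' | 'arrow' | 'pulse';

interface LabelItem {
  text: string;
  style?: LabelStyle;
}

// Accept {text, style?} objects, JSON strings of such objects, or bare strings.
function normItem(raw: unknown): LabelItem | null {
  if (typeof raw === 'string') {
    const s = raw.trim();
    if (s.length === 0) return null;
    if (s.startsWith('{')) {
      try {
        return normItem(JSON.parse(s));
      } catch {
        // Not valid JSON: fall through and treat it as a bare label.
      }
    }
    return { text: s };
  }
  if (typeof raw === 'object' && raw !== null) {
    const { text, style } = raw as { text?: unknown; style?: unknown };
    if (typeof text !== 'string') return null;
    if (style === 'dot' || style === 'arrow' || style === 'pulse') {
      return { text, style };
    }
    return { text };
  }
  return null;
}
```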
the XR stop button is always enabled, so pressing it mid-start ran stop while isAIRunning was still false (stopGeminiLive no-ops) and the start finished anyway. bail out of stop() when starting.
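The start/stop guards could be sketched as a small state holder like this. SessionControls is a hypothetical class; in the demo the open and close callbacks would wrap startGeminiLive and stopGeminiLive, whose signatures are not assumed here.

```typescript
class SessionControls {
  starting = false;
  running = false;

  // Ignore a second start while one is opening or already running.
  async start(open: () => Promise<void>): Promise<boolean> {
    if (this.starting || this.running) return false;
    this.starting = true;
    try {
      await open();
      this.running = true;
      return true;
    } finally {
      this.starting = false;
    }
  }

  // Bail out while a start is still in flight, or when nothing is running.
  stop(close: () => void): boolean {
    if (this.starting || !this.running) return false;
    close();
    this.running = false;
    return true;
  }

  // Call from the manager's 'close' event so a remote close resets the UI.
  onRemoteClose(): void {
    this.running = false;
  }
}
```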
… re-rolls the detector is non-deterministic and sometimes returns a sparse/unrelated set (often just ["coffee table"]). a rapid second placeLabel would re-run it and report not_found for something the previous call had just placed. reuse a detection younger than 3.5s, and prefer the richer of fresh vs cached so a degenerate re-detection can't clobber a good one.
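A sketch of that cache, assuming "richer" means "more detections" (the commit does not define it) and that a detection is just its list of labels. DetectionCache is a hypothetical name.

```typescript
class DetectionCache {
  private labels: string[] | null = null;
  private storedAt = 0;
  private readonly maxAgeMs: number;

  constructor(maxAgeMs = 3500) {
    this.maxAgeMs = maxAgeMs;
  }

  // Return the cached detection while it is younger than maxAgeMs.
  get(now: number = Date.now()): string[] | null {
    if (this.labels !== null && now - this.storedAt < this.maxAgeMs) {
      return this.labels;
    }
    return null;
  }

  // Offer a fresh detection and get back the set to use. A still-valid cached
  // set with more detections wins, so a sparse re-roll cannot clobber it.
  store(fresh: string[], now: number = Date.now()): string[] {
    const cached = this.get(now);
    if (cached !== null && cached.length > fresh.length) {
      return cached;
    }
    this.labels = fresh;
    this.storedAt = now;
    return fresh;
  }
}
```

When the cached set wins it keeps its original timestamp, so it still expires 3.5s after it was first stored.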
been wanting a voice + vision companion in xr for a while. start it, talk to it, it sees what you see and can drop markers on stuff.
demo's in demos/world_companion. it runs on the existing startGeminiLive in GeminiManager, which now takes a captureMode: 'screenshot' | 'camera' flag. screenshot streams the rendered scene (optionally over the camera image), camera streams raw passthrough frames. default stays camera so existing callers keep their old behaviour; the companion passes camera since it reasons about the real room. tool calls + transcription come back via the manager's events, so nothing new on the world api.
there's a placeLabel tool with three styles (dot, arrow, pulse) so it can pick how to highlight: arrow if you ask it to find something, pulse for tiny stuff, dot otherwise. uses world.objects.runDetection so markers stick to the real object via depth, not to wherever your head was when the tool fired (it will do this on the desktop simulator though). a lookCloser tool answers "what am i pointing at" off the reticle. small spatial panel for start/stop/clear so it's actually usable once you're in immersive.
you can ask for several things in one go ("label the couch, tv, and coffee table") and the tool takes an items[] array so they all get placed in a single call, each with its own style. labels billboard back at the camera so they stay readable when you walk around.
detector labels and what you say don't always match: television vs tv, pendant light vs floor lamp, picture vs painting. it matches by meaning through gemini's embedContent api (SEMANTIC_SIMILARITY task type + cosine), reducing to the core noun first so modifiers like "green chair" don't tank the match, and dedupes markers across rephrasings. the embed cache is per-page so it's basically free after the first call.
Try launching the demo and saying, for example, "place an arrow on my water bottle".
I have some thoughts on updating the state of objects it has already seen, but that can come later; for now this seems ok.
Gemini will be able to talk and see screens in Android XR afaik, but this allows interaction with the real world + gemini live.
I will see if I can get a demo recorded for this. This is open to lots of feedback though, since this is just a very rough version.
Edit, here's a demo! https://youtu.be/-5s_aV6eV_A
I may have to just add key input in again but will double check later when I'm home