SEATauBench evaluates conversational tool-use agents when the user, the agent, the tools, and the domain content do not all speak English. It measures how task success and language fidelity degrade as more of the interaction shifts into a second language (L2), with a focus on Southeast Asian and regional languages.
SEATauBench is built on top of tau2-bench,
branching from upstream main at commit d11a970 (#259, the GA Realtime API
migration). The full upstream simulation framework — domains (airline,
retail, telecom), orchestrator, runner, and tau2 CLI — is preserved
mostly unchanged under src/tau2/.
The SEATauBench layer lives in src/seatau/ and adds, on top of that base:
- A language registry (data/seatau/languages.json) and multilingual domain assets under data/tau2/domains/{domain}/{lang_id}/.
- An offline translation pipeline (src/seatau/translation/) for generating multilingual domain assets.
- An annotation review workflow (src/seatau/annotation/) for reviewing and applying corrections to translated assets.
- Four scenario presets wired into the tau2 runtime via --seatau-scenario.
- Language-correctness metrics (src/seatau/metrics/) computed with fastText language identification.
- An analysis and plotting toolchain (src/seatau/analysis/, src/seatau/plot/) that produces the paper figures.
Each scenario controls how much of the interaction runs in the target language.
Canonical ids (used in code, data/seatau/experiments.csv, and the
data/simulations/ layout) and display names come from
data/seatau/scenarios.yaml:
| Scenario id | Display name | User & agent | Tool docs | Domain assets (policy/db/tasks) |
|---|---|---|---|---|
| english | En Baseline | English | English | English |
| l2_tools | L2 Tools | English | Mixed (en + L2s) | English |
| l2_interaction | L2 Interaction | L2 | English | English |
| l2_domain | L2 Domain | L2 | L2 (translated) | L2 (translated) |
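
The scenario matrix above can be read as a small lookup table. The following is a minimal sketch that transcribes the table into Python; the `ScenarioPreset` class and `resolve` function are hypothetical names used for illustration and are not taken from src/seatau/ or data/seatau/scenarios.yaml.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioPreset:
    """One row of the scenario table (hypothetical structure, for illustration)."""
    display_name: str
    interaction: str    # language of user and agent turns
    tool_docs: str      # language of tool documentation
    domain_assets: str  # language of policy/db/tasks


# "L2" is a placeholder for the target language; "mixed" means en + L2s.
SCENARIOS = {
    "english": ScenarioPreset("En Baseline", "en", "en", "en"),
    "l2_tools": ScenarioPreset("L2 Tools", "en", "mixed", "en"),
    "l2_interaction": ScenarioPreset("L2 Interaction", "L2", "en", "en"),
    "l2_domain": ScenarioPreset("L2 Domain", "L2", "L2", "L2"),
}


def resolve(scenario_id: str, lang_id: str) -> dict:
    """Replace the "L2" placeholder with a concrete language id."""
    preset = SCENARIOS[scenario_id]

    def substitute(value: str) -> str:
        return lang_id if value == "L2" else value

    return {
        "interaction": substitute(preset.interaction),
        "tool_docs": substitute(preset.tool_docs),
        "domain_assets": substitute(preset.domain_assets),
    }
```

For example, `resolve("l2_interaction", "vi")` yields Vietnamese for the interaction and English for tool docs and domain assets, matching the third row of the table.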
Supported domains: airline, retail, telecom. Supported languages: en
(English), th (Thai), vi (Vietnamese), id (Indonesian), zh (Chinese),
tl (Filipino) — see data/seatau/languages.json.
This project uses uv.
git clone git@github.com:SEACrowd/SEATauBench.git seatau
cd seatau
uv sync --extra experiments --extra translation --extra dev
Language-correctness metrics need the fastText language-id model. Put it at the default (gitignored) path:
mkdir -p data/models
curl -L -o data/models/lid.176.bin \
https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
To use a different location, set TAU2_FASTTEXT_LID_MODEL_PATH in .env.
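
The override described above amounts to "use the environment variable if set, otherwise the default path". A minimal sketch of that lookup follows; `resolve_lid_model_path` is a hypothetical helper, and it assumes the variable has already been loaded into the process environment (how SEATauBench itself reads .env is not shown here).

```python
import os
from pathlib import Path

# Default location named in the setup instructions.
DEFAULT_LID_MODEL = Path("data/models/lid.176.bin")


def resolve_lid_model_path(env=None) -> Path:
    """Return the override path if TAU2_FASTTEXT_LID_MODEL_PATH is set, else the default."""
    env = os.environ if env is None else env
    override = env.get("TAU2_FASTTEXT_LID_MODEL_PATH", "").strip()
    return Path(override).expanduser() if override else DEFAULT_LID_MODEL
```

Passing a plain dict as `env` makes the lookup easy to check without touching the real environment.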
Copy the environment template and add your API keys:
cp .env.example .env
# Edit .env: runs default to OpenRouter, so set OPENROUTER_API_KEY at minimum.
- Download the simulation runs and unzip them into data/simulations/. The archive expands into one subdirectory per scenario (english/, l2_tools/, l2_interaction/, l2_domain/), each containing the run folders with results.json.
    mkdir -p data
    curl -L -o data/simulations.zip \
      https://github.com/SEACrowd/SEATauBench/releases/download/v1_simulations/simulations.zip
    echo "608e63873890841ed19c4ee26417cd2b48415a6e8681513c8e558cc455bf1111 data/simulations.zip" \
      | shasum -a 256 -c -
    rm -rf data/simulations
    unzip -q data/simulations.zip -d data "simulations/*"
- Generate the summary metrics across scenarios. This reads every results.json, normalizes agent model names, computes $\rho^3$ and the language-correctness scores, and writes data/seatau/experiments.csv:
    uv run python -m seatau.generate_scenario_summary
- Generate analysis artifacts in data/analyses/:
    uv run python -m seatau.analysis.perf_by_language
    uv run python -m seatau.analysis.en_vs_l2_perf
    uv run python -m seatau.analysis.metric_correlations_by_language
- Generate the figures into figs/:
    uv run plot all                # regenerate every figure
    uv run plot list               # list available figure stems and their modules
    uv run plot perf_by_language   # regenerate a single figure
The key dependency chain is:
data/simulations/ -> data/seatau/experiments.csv -> data/analyses/ ->
figs/.
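
Because each stage is derived from the one before it, a stage is out of date whenever it is missing or older than its upstream stage. The sketch below illustrates that check; `newest_mtime` and `first_stale_stage` are hypothetical helpers that are not part of the repository, and comparing modification times is an assumption about how staleness could be detected.

```python
from pathlib import Path
from typing import Mapping, Optional, Sequence

# Stages in the order given by the dependency chain above.
CHAIN = ["data/simulations", "data/seatau/experiments.csv", "data/analyses", "figs"]


def newest_mtime(path: Path) -> Optional[float]:
    """Latest modification time of a file, or of any file under a directory."""
    if not path.exists():
        return None
    if path.is_file():
        return path.stat().st_mtime
    times = [p.stat().st_mtime for p in path.rglob("*") if p.is_file()]
    return max(times) if times else None


def first_stale_stage(
    mtimes: Mapping[str, Optional[float]], chain: Sequence[str] = CHAIN
) -> Optional[str]:
    """Return the first stage that is missing or older than its upstream stage."""
    upstream = mtimes.get(chain[0])
    if upstream is None:
        return chain[0]
    for stage in chain[1:]:
        current = mtimes.get(stage)
        if current is None or current < upstream:
            return stage
        upstream = current
    return None
```

In practice the mapping could be built with `{stage: newest_mtime(Path(stage)) for stage in CHAIN}`; the returned stage is the first step of the pipeline that would need rerunning.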
Runs go through the tau2 CLI. --seatau-scenario selects the preset and
applies the matching asset mode, language components, and mixed-tool rules.
Results land in data/simulations/.
# En Baseline
uv run tau2 run --domain retail --seatau-scenario english --lang-id en \
--agent-llm openrouter/openai/gpt-5-mini --num-tasks 5
# L2 Tools (mixed-language tool docs)
uv run tau2 run --domain retail --seatau-scenario l2_tools --lang-id vi \
--lang-components tool_mix --tool-mix-config 5lang_uniform_en-th-vi-id-zh \
--agent-llm openrouter/openai/gpt-5-mini --num-tasks 5
# L2 Interaction (user + agent speak L2, assets stay English)
uv run tau2 run --domain retail --seatau-scenario l2_interaction --lang-id vi \
--lang-components user_system agent_system greeting \
--agent-llm openrouter/openai/gpt-5-mini --num-tasks 5
# L2 Domain (everything in L2, using translated assets)
uv run tau2 run --domain retail --seatau-scenario l2_domain --lang-id vi \
--lang-components user_system agent_system greeting tools policy db tasks \
--agent-llm openrouter/openai/gpt-5-mini --num-tasks 5
Models are resolved through LiteLLM, defaulting to OpenRouter. Add
OPENROUTER_API_KEY to .env and pass any supported route to --agent-llm /
--user-llm. The paper reports three agent LLMs:
| Normalized id | Display name |
|---|---|
| gpt-5-mini | GPT 5 Mini |
| kimi-k2.5 | Kimi K2.5 |
| qwen-3-235b-it | Qwen3 235B IT |
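
The mapping from a LiteLLM route to a normalized id and display name can be pictured as two lookups. This is only a sketch: the display names come from the table above, but `ROUTE_ALIASES`, `normalize_agent_model`, and the last-segment fallback are assumptions, and only the one route that appears in the example commands is listed.

```python
# Display names transcribed from the table above.
DISPLAY_NAMES = {
    "gpt-5-mini": "GPT 5 Mini",
    "kimi-k2.5": "Kimi K2.5",
    "qwen-3-235b-it": "Qwen3 235B IT",
}

# Hypothetical alias table. Only this route appears in the example commands;
# routes for the other two models would need to be added explicitly.
ROUTE_ALIASES = {
    "openrouter/openai/gpt-5-mini": "gpt-5-mini",
}


def normalize_agent_model(route: str) -> str:
    """Map a LiteLLM route to a normalized id (assumed fallback: last path segment)."""
    if route in ROUTE_ALIASES:
        return ROUTE_ALIASES[route]
    return route.rsplit("/", 1)[-1].lower()


def display_name(route: str) -> str:
    """Return the display name for a route, or the normalized id if unknown."""
    normalized = normalize_agent_model(route)
    return DISPLAY_NAMES.get(normalized, normalized)
```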
To add a new language:
- Add an entry to data/seatau/languages.json (code, display name, instruction label, greeting).
- Translate the domain assets for that language (see the next section).
- Run the experiments above with the new --lang-id.
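
As an illustration of the first step, the sketch below validates a language id against a registry before running anything. The JSON layout and key names (`code`, `display_name`, `instruction_label`, `greeting`) are assumptions that mirror the four fields listed above; the actual schema of data/seatau/languages.json may differ, and `validate_lang_id` is a hypothetical helper.

```python
import json

# Hypothetical key names mirroring the four fields listed above.
REQUIRED_FIELDS = ("code", "display_name", "instruction_label", "greeting")


def validate_lang_id(lang_id: str, registry: list) -> dict:
    """Return the registry entry for lang_id, or raise if it is missing or incomplete."""
    by_code = {entry["code"]: entry for entry in registry}
    if lang_id not in by_code:
        raise ValueError(f"unknown lang id {lang_id!r}; known ids: {sorted(by_code)}")
    entry = by_code[lang_id]
    missing = [field for field in REQUIRED_FIELDS if not entry.get(field)]
    if missing:
        raise ValueError(f"entry {lang_id!r} is missing fields: {missing}")
    return entry


# Illustrative registry content, not copied from the repository.
SAMPLE_REGISTRY = json.loads("""
[
  {"code": "en", "display_name": "English",
   "instruction_label": "English", "greeting": "Hi! How can I help you today?"},
  {"code": "vi", "display_name": "Vietnamese",
   "instruction_label": "Vietnamese", "greeting": "Xin chào! Tôi có thể giúp gì cho bạn?"}
]
""")
```

Calling `validate_lang_id("vi", SAMPLE_REGISTRY)` returns the Vietnamese entry, while an unregistered id raises `ValueError` with the list of known ids.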
Model defaults, temperatures, NL-assertion and env-interface models, caching,
and voice settings live in src/tau2/config.py. Scenario presets are defined in
data/seatau/scenarios.yaml, and mixed-tool partitions in
src/seatau/l2_tools_mix/.
Translation is an offline preparation step. It builds the multilingual assets
under data/tau2/domains/{domain}/{lang_id}/ that l2_domain runs load at
evaluation time. Detailed component mappings and artifact rules live in
src/seatau/translation/README.md.
- Set up Vertex AI. The pipeline uses the exact LiteLLM route vertex_ai/gemini-3.1-flash-lite-preview. Authenticate with Application Default Credentials and export VERTEXAI_PROJECT and VERTEXAI_LOCATION.
- Register the language in data/seatau/languages.json (the CLI rejects any --lang-id not present there).
- Run the offline translation. Preview first with --dry-run, validate a narrow slice, then translate the full domain:
    uv run python -m seatau.translation.cli \
      --domains telecom --lang-id zh --components all \
      --max-concurrency 8 --batch-size 24
- Rerun translation when source assets change. Repeating the same command overwrites the selected outputs. To limit cost and review time, rerun only the changed component first:
    uv run python -m seatau.translation.cli \
      --domains telecom --lang-id zh --components tools \
      --max-concurrency 4 --batch-size 12
- Validate the generated assets with a small l2_domain run before scaling up to the full benchmark.
- Optionally review translations in Excel and import reviewer corrections back into the translated asset directory:
    uv run python -m seatau.annotation export \
      --domains retail telecom --lang-id vi \
      -o data/seatau/annotations/annotation_vi_r1.xlsx \
      --reviewer alice --round-id r1
    uv run python -m seatau.annotation import \
      --workbook data/seatau/annotations/annotation_vi_r1.xlsx --lang vi
  The full workbook schema is documented in src/seatau/annotation/README.md.
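
The --max-concurrency and --batch-size flags suggest that strings are grouped into batches and that a bounded number of batches is in flight at once. The sketch below shows one common way to combine the two knobs with asyncio; it is an assumption about the general pattern, not the pipeline's actual implementation, and `translate_all` and `fake_translate_batch` are hypothetical names.

```python
import asyncio


def make_batches(items: list, batch_size: int) -> list:
    """Split items into consecutive batches of at most batch_size."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


async def translate_all(items, translate_batch, *, max_concurrency=8, batch_size=24):
    """Translate items in batches, with at most max_concurrency batches in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(batch):
        async with semaphore:
            return await translate_batch(batch)

    # gather() returns results in submission order, so output order matches input.
    results = await asyncio.gather(*(run(b) for b in make_batches(items, batch_size)))
    return [text for batch in results for text in batch]


async def fake_translate_batch(batch):
    """Stand-in for a real model call; tags each string instead of translating it."""
    await asyncio.sleep(0)
    return [f"[zh] {text}" for text in batch]
```

Under this reading, lowering both values (as in the rerun command with 4 and 12) reduces the number of simultaneous requests and the size of each request.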
Use this when you need qualitative error labels beyond pass/fail reward metrics.
The review command reads saved results.json files and writes
results_reviewed.json beside each run. Passing a scenario directory reviews
every nested run under it.
# Review all runs in one scenario
uv run python -m tau2.scripts.review_conversation run \
data/simulations/<scenario>
# Or review one run
uv run python -m tau2.scripts.review_conversation run \
data/simulations/<scenario>/<run-dir>/results.json
The default review covers both the agent and user simulator. Add --mode user
when you only want user-simulator errors. This step calls an LLM judge, so it
requires the same model credentials as evaluation runs.
Run this after trajectory review if you will aggregate error tags. It catches judge typos and rewrites them to the canonical tag vocabulary.
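
To make the idea concrete, the sketch below shows one way a misspelled tag could be mapped back to a fixed vocabulary using fuzzy matching from the standard library. The tag names are invented for illustration, and `normalize_tag` is a hypothetical function; the matching rule used by seatau.utils.error_tags is not shown here and may differ.

```python
import difflib

# Hypothetical vocabulary; the real canonical tags are defined by the project.
CANONICAL_TAGS = ["wrong_tool_call", "policy_violation", "wrong_language", "hallucination"]


def normalize_tag(tag: str, vocabulary=CANONICAL_TAGS, cutoff: float = 0.8):
    """Return the canonical tag closest to `tag`, or None if nothing is close enough."""
    cleaned = tag.strip().lower().replace("-", "_").replace(" ", "_")
    if cleaned in vocabulary:
        return cleaned
    matches = difflib.get_close_matches(cleaned, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Returning `None` for unmatched tags, rather than guessing, leaves those cases for a manual audit, which is in the spirit of running a check before rewriting files in place.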
# Audit first
uv run python -m seatau.utils.error_tags check data/simulations/<scenario>
# Preview, then rewrite reviewed files in place
uv run python -m seatau.utils.error_tags normalize data/simulations/<scenario> --dry-run
uv run python -m seatau.utils.error_tags normalize data/simulations/<scenario>
Standard task success is the product of the requested reward bases per task (DB
state checks, environment assertions, action checks, communication checks, and
optional NL assertions). SEATauBench additionally records language_correctness
in reward_info.info for each simulation: fastText LID over text turns, scored
as the proportion detected in the expected language. It is metadata by default
and only affects reward when LANGUAGE_CORRECTNESS is explicitly included in a
task's reward_basis.
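
The two quantities described above can be sketched in a few lines. This is an illustration under assumptions: the detected language labels are taken as given (in the real metric they come from fastText LID), the component names other than LANGUAGE_CORRECTNESS are illustrative, and both function names are hypothetical.

```python
from math import prod


def language_correctness(detected_langs, expected_lang):
    """Proportion of text turns whose detected language equals the expected one."""
    if not detected_langs:
        return None  # assumed convention when a simulation has no text turns
    hits = sum(1 for lang in detected_langs if lang == expected_lang)
    return hits / len(detected_langs)


def task_reward(components, reward_basis):
    """Product of the reward components named in the task's reward_basis."""
    return prod(components[name] for name in reward_basis)


# Four text turns, three detected as Vietnamese.
lc = language_correctness(["vi", "vi", "en", "vi"], "vi")
components = {"DB": 1.0, "COMMUNICATE": 1.0, "LANGUAGE_CORRECTNESS": lc}

# Default: language correctness is recorded but does not enter the product.
default_reward = task_reward(components, ["DB", "COMMUNICATE"])
# Opt-in: the task lists LANGUAGE_CORRECTNESS in its reward_basis.
strict_reward = task_reward(components, ["DB", "COMMUNICATE", "LANGUAGE_CORRECTNESS"])
```

Here the same simulation scores 1.0 by default and 0.75 once language correctness is part of the reward basis, which is why it matters whether a task opts in.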
| Document | Description |
|---|---|
| SEA-TAU layer | Scenarios, the experiment matrix, and how runs are wired. |
| Translation toolkit | Offline translation pipeline and artifact rules. |
| Annotation review | Excel review/import workflow for translated assets. |
| Mixed-language tools | Tool-partition configs for the l2_tools scenario. |
SEATauBench builds on tau2-bench and the original tau-bench:
@misc{barres2025tau2,
title={$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment},
author={Victor Barres and Honghua Dong and Soham Ray and Xujie Si and Karthik Narasimhan},
year={2025},
eprint={2506.07982},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2506.07982},
}
@misc{yao2024tau,
title={$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains},
author={Shunyu Yao and Noah Shinn and Pedram Razavi and Karthik Narasimhan},
year={2024},
eprint={2406.12045},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2406.12045},
}