SEATauBench: A Multilingual Benchmark for Tool-Agent-User Interaction

SEATauBench overview: an agent serving a user across a shared world state, with agent-side and user-side tools and databases.

SEATauBench evaluates conversational tool-use agents when the user, the agent, the tools, and the domain content do not all speak English. It measures how task success and language fidelity degrade as more of the interaction shifts into a second language (L2), with a focus on Southeast Asian and regional languages.

Where SEATauBench branches off from $\tau^3$-bench

SEATauBench is built on top of tau2-bench, branching from upstream main at commit d11a970 (#259, the GA Realtime API migration). The full upstream simulation framework — domains (airline, retail, telecom), orchestrator, runner, and tau2 CLI — is preserved mostly unchanged under src/tau2/.

The SEATauBench layer lives in src/seatau/ and adds, on top of that base:

A language registry (data/seatau/languages.json) and multilingual domain assets under data/tau2/domains/{domain}/{lang_id}/.
An offline translation pipeline (src/seatau/translation/) for generating multilingual domain assets.
An annotation review workflow (src/seatau/annotation/) for reviewing and applying corrections to translated assets.
Four scenario presets wired into the tau2 runtime via --seatau-scenario.
Language-correctness metrics (src/seatau/metrics/) computed with fastText language identification.
An analysis and plotting toolchain (src/seatau/analysis/, src/seatau/plot/) that produces the paper figures.

The four scenarios

Each scenario controls how much of the interaction runs in the target language. Canonical ids (used in code, data/seatau/experiments.csv, and the data/simulations/ layout) and display names come from data/seatau/scenarios.yaml:

| Scenario id | Display name | User & agent | Tool docs | Domain assets (policy/db/tasks) |
| --- | --- | --- | --- | --- |
| `english` | En Baseline | English | English | English |
| `l2_tools` | L2 Tools | English | Mixed (`en` + L2s) | English |
| `l2_interaction` | L2 Interaction | L2 | English | English |
| `l2_domain` | L2 Domain | L2 | L2 (translated) | L2 (translated) |

Supported domains: airline, retail, telecom. Supported languages: en (English), th (Thai), vi (Vietnamese), id (Indonesian), zh (Chinese), tl (Filipino) — see data/seatau/languages.json.
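
The scenario matrix can also be read as a small lookup table. The sketch below is illustrative only: `ScenarioPreset`, `SCENARIOS`, and `translated_asset_dir` are hypothetical names, not part of `src/seatau/`, and the real presets are loaded from data/seatau/scenarios.yaml.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ScenarioPreset:
    """Hypothetical in-memory view of one row of the scenario table."""
    display_name: str
    interaction: str    # language of user and agent: "en" or "l2"
    tool_docs: str      # "en", "mixed", or "l2"
    domain_assets: str  # policy/db/tasks: "en" or "l2"


# Values copied from the scenario table; the dict itself is an assumption.
SCENARIOS = {
    "english": ScenarioPreset("En Baseline", "en", "en", "en"),
    "l2_tools": ScenarioPreset("L2 Tools", "en", "mixed", "en"),
    "l2_interaction": ScenarioPreset("L2 Interaction", "l2", "en", "en"),
    "l2_domain": ScenarioPreset("L2 Domain", "l2", "l2", "l2"),
}

SUPPORTED_DOMAINS = ("airline", "retail", "telecom")
SUPPORTED_LANGUAGES = ("en", "th", "vi", "id", "zh", "tl")


def translated_asset_dir(domain: str, lang_id: str,
                         root: Path = Path("data/tau2/domains")) -> Path:
    """Build the documented data/tau2/domains/{domain}/{lang_id}/ path."""
    if domain not in SUPPORTED_DOMAINS:
        raise ValueError(f"unsupported domain: {domain}")
    if lang_id not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {lang_id}")
    return root / domain / lang_id


if __name__ == "__main__":
    print(SCENARIOS["l2_domain"])
    print(translated_asset_dir("retail", "vi"))
```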

Getting started

This project uses uv.

```
git clone git@github.com:SEACrowd/SEATauBench.git seatau
cd seatau
uv sync --extra experiments --extra translation --extra dev
```

Language-correctness metrics need the fastText language-id model. Put it at the default (gitignored) path:

```
mkdir -p data/models
curl -L -o data/models/lid.176.bin \
  https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
```

To use a different location, set TAU2_FASTTEXT_LID_MODEL_PATH in .env.
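
As a quick pre-flight check, the lookup order described above (environment override first, default path second) can be sketched as follows. The helper names are hypothetical, and the sketch assumes the variable has already been exported into the process environment; it does not parse .env itself, and it is not how `src/seatau/metrics/` necessarily resolves the path.

```python
import os
from pathlib import Path

# Default (gitignored) location named in this README.
DEFAULT_LID_PATH = Path("data/models/lid.176.bin")


def resolve_lid_model_path(env=None) -> Path:
    """Return the override from TAU2_FASTTEXT_LID_MODEL_PATH, else the default."""
    env = os.environ if env is None else env
    override = env.get("TAU2_FASTTEXT_LID_MODEL_PATH", "").strip()
    return Path(override) if override else DEFAULT_LID_PATH


def lid_model_present(path: Path) -> bool:
    """True when the model file exists and is not empty."""
    return path.is_file() and path.stat().st_size > 0


if __name__ == "__main__":
    path = resolve_lid_model_path()
    status = "found" if lid_model_present(path) else "missing"
    print(f"fastText LID model {status}: {path}")
```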

Copy the environment template and add your API keys:

```
cp .env.example .env
# Edit .env: runs default to OpenRouter, so set OPENROUTER_API_KEY at minimum.
```

Reproduce paper figures

Download the simulation runs and unzip them into data/simulations/. The archive expands into one subdirectory per scenario (english/, l2_tools/, l2_interaction/, l2_domain/), each containing the run folders with results.json.

```
mkdir -p data
curl -L -o data/simulations.zip \
  https://github.com/SEACrowd/SEATauBench/releases/download/v1_simulations/simulations.zip

echo "608e63873890841ed19c4ee26417cd2b48415a6e8681513c8e558cc455bf1111  data/simulations.zip" \
  | shasum -a 256 -c -

rm -rf data/simulations
unzip -q data/simulations.zip -d data "simulations/*"
```
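
On systems without `shasum`, the same integrity check can be done with Python's standard library. This is a minimal sketch: the function names are hypothetical, and the expected digest is the one shown in the command above.

```python
import hashlib
from pathlib import Path

# Digest copied from the shasum command in this README.
EXPECTED_SHA256 = "608e63873890841ed19c4ee26417cd2b48415a6e8681513c8e558cc455bf1111"


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so a large archive never sits in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected: str = EXPECTED_SHA256) -> bool:
    """Compare the computed digest with the published one."""
    return sha256_of(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("data/simulations.zip")
    if archive.is_file():
        print("OK" if verify_archive(archive) else "CHECKSUM MISMATCH")
    else:
        print(f"archive not found: {archive}")
```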

Generate the summary metrics across scenarios. This reads every results.json, normalizes agent model names, computes $\rho^3$ and the language-correctness scores, and writes data/seatau/experiments.csv:
```
uv run python -m seatau.generate_scenario_summary
```

Generate analysis artifacts in data/analyses/:

```
uv run python -m seatau.analysis.perf_by_language
uv run python -m seatau.analysis.en_vs_l2_perf
uv run python -m seatau.analysis.metric_correlations_by_language
```

Generate the figures into figs/:

```
uv run plot all          # regenerate every figure
uv run plot list         # list available figure stems and their modules
uv run plot perf_by_language   # regenerate a single figure
```

The key dependency chain is: data/simulations/ -> data/seatau/experiments.csv -> data/analyses/ -> figs/.
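
Because every later stage depends on data/simulations/, it can help to confirm the archive unpacked as expected before running the summary step. The sketch below only walks the documented layout (one directory per scenario, run folders containing results.json); `find_results` and `missing_scenarios` are hypothetical helpers, not part of the toolchain.

```python
from pathlib import Path

# Scenario directory names as listed for the unpacked archive.
SCENARIO_DIRS = ("english", "l2_tools", "l2_interaction", "l2_domain")


def find_results(root="data/simulations") -> dict:
    """Map each scenario id to the results.json files found beneath it."""
    root = Path(root)
    found = {}
    for scenario in SCENARIO_DIRS:
        scenario_dir = root / scenario
        if scenario_dir.is_dir():
            found[scenario] = sorted(scenario_dir.rglob("results.json"))
        else:
            found[scenario] = []
    return found


def missing_scenarios(found: dict) -> list:
    """Scenarios with no results.json at all, in table order."""
    return [scenario for scenario, files in found.items() if not files]


if __name__ == "__main__":
    results = find_results()
    for scenario, files in results.items():
        print(f"{scenario}: {len(files)} run(s)")
    if missing_scenarios(results):
        print("missing:", ", ".join(missing_scenarios(results)))
```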

Run experiments for the four scenarios

Runs go through the tau2 CLI. --seatau-scenario selects the preset and applies the matching asset mode, language components, and mixed-tool rules. Results land in data/simulations/.

```
# En Baseline
uv run tau2 run --domain retail --seatau-scenario english --lang-id en \
  --agent-llm openrouter/openai/gpt-5-mini --num-tasks 5

# L2 Tools (mixed-language tool docs)
uv run tau2 run --domain retail --seatau-scenario l2_tools --lang-id vi \
  --lang-components tool_mix --tool-mix-config 5lang_uniform_en-th-vi-id-zh \
  --agent-llm openrouter/openai/gpt-5-mini --num-tasks 5

# L2 Interaction (user + agent speak L2, assets stay English)
uv run tau2 run --domain retail --seatau-scenario l2_interaction --lang-id vi \
  --lang-components user_system agent_system greeting \
  --agent-llm openrouter/openai/gpt-5-mini --num-tasks 5

# L2 Domain (everything in L2, using translated assets)
uv run tau2 run --domain retail --seatau-scenario l2_domain --lang-id vi \
  --lang-components user_system agent_system greeting tools policy db tasks \
  --agent-llm openrouter/openai/gpt-5-mini --num-tasks 5
```
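
When sweeping several languages or domains, the four commands above can be generated rather than typed. The following sketch only reassembles the flags and values shown in those examples; `build_run_command` and `LANG_COMPONENTS` are hypothetical, and whether `--lang-components` is strictly required for each preset is an assumption taken from the examples, not from the CLI itself.

```python
import shlex

# --lang-components values exactly as used in the example commands.
LANG_COMPONENTS = {
    "english": [],
    "l2_tools": ["tool_mix"],
    "l2_interaction": ["user_system", "agent_system", "greeting"],
    "l2_domain": ["user_system", "agent_system", "greeting",
                  "tools", "policy", "db", "tasks"],
}


def build_run_command(scenario, lang_id, domain="retail",
                      agent_llm="openrouter/openai/gpt-5-mini",
                      num_tasks=5, tool_mix_config=None):
    """Return the argv list for one `tau2 run` invocation."""
    if scenario not in LANG_COMPONENTS:
        raise ValueError(f"unknown scenario: {scenario}")
    argv = ["uv", "run", "tau2", "run", "--domain", domain,
            "--seatau-scenario", scenario, "--lang-id", lang_id]
    components = LANG_COMPONENTS[scenario]
    if components:
        argv += ["--lang-components", *components]
    if scenario == "l2_tools":
        # The L2 Tools example always passes a mix config.
        if tool_mix_config is None:
            raise ValueError("l2_tools needs a tool_mix_config")
        argv += ["--tool-mix-config", tool_mix_config]
    argv += ["--agent-llm", agent_llm, "--num-tasks", str(num_tasks)]
    return argv


if __name__ == "__main__":
    for lang in ("th", "vi"):
        print(shlex.join(build_run_command("l2_interaction", lang)))
```

Printing the command instead of executing it keeps the sketch side-effect free; pass the list to `subprocess.run` once the output looks right.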

To run current or other models

Models are resolved through LiteLLM, defaulting to OpenRouter. Add OPENROUTER_API_KEY to .env and pass any supported route to --agent-llm / --user-llm. The paper reports results for three agent LLMs:

| Normalized id | Display name |
| --- | --- |
| `gpt-5-mini` | GPT 5 Mini |
| `kimi-k2.5` | Kimi K2.5 |
| `qwen-3-235b-it` | Qwen3 235B IT |

To add more languages

Add an entry to data/seatau/languages.json (code, display name, instruction label, greeting).
Translate the domain assets for that language (see "Run and validate machine translation for another language" below).
Run the experiments above with the new --lang-id.
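
Before translating, it is worth checking that the new registry entry carries all four pieces of information listed in the first step. The sketch below is hypothetical throughout: the key names (`code`, `display_name`, `instruction_label`, `greeting`) and the example Malay entry are assumptions for illustration, so check data/seatau/languages.json for the actual schema.

```python
# Key names are assumed; the README only names the four pieces of information.
REQUIRED_FIELDS = ("code", "display_name", "instruction_label", "greeting")


def validate_language_entry(entry: dict) -> list:
    """Return a list of problems; an empty list means the entry looks complete."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = entry.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    return problems


# Illustrative entry only; not taken from the repository.
EXAMPLE_ENTRY = {
    "code": "ms",
    "display_name": "Malay",
    "instruction_label": "Malay",
    "greeting": "Hai! Apa yang boleh saya bantu hari ini?",
}

if __name__ == "__main__":
    issues = validate_language_entry(EXAMPLE_ENTRY)
    print("entry looks complete" if not issues else "\n".join(issues))
```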

To change configurations

Model defaults, temperatures, NL-assertion and env-interface models, caching, and voice settings live in src/tau2/config.py. Scenario presets are defined in data/seatau/scenarios.yaml, and mixed-tool partitions in src/seatau/l2_tools_mix/.

Run and validate machine translation for another language

Translation is an offline preparation step. It builds the multilingual assets under data/tau2/domains/{domain}/{lang_id}/ that l2_domain runs load at evaluation time. Detailed component mappings and artifact rules live in src/seatau/translation/README.md.

Set up Vertex AI. The pipeline uses the exact LiteLLM route vertex_ai/gemini-3.1-flash-lite-preview. Authenticate with Application Default Credentials and export VERTEXAI_PROJECT and VERTEXAI_LOCATION.
Register the language in data/seatau/languages.json (the CLI rejects any --lang-id not present there).

Run the offline translation. Preview first with --dry-run, validate a narrow slice, then translate the full domain:

```
uv run python -m seatau.translation.cli \
  --domains telecom --lang-id zh --components all \
  --max-concurrency 8 --batch-size 24
```

Rerun translation when source assets change. Repeating the same command overwrites the selected outputs. To limit cost and review time, rerun only the changed component first:
```
uv run python -m seatau.translation.cli \
  --domains telecom --lang-id zh --components tools \
  --max-concurrency 4 --batch-size 12
```
Validate the generated assets with a small l2_domain run before scaling up to the full benchmark.
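
To build intuition for `--batch-size` and `--max-concurrency`, the sketch below shows one common way such limits interact: items are grouped into batches, and a semaphore caps how many batches are in flight at once. This is an assumption about the general pattern, not a description of how `seatau.translation.cli` is implemented; `translate_all` and `fake_translate_batch` are hypothetical, and the fake translator makes no network calls.

```python
import asyncio


def make_batches(items, batch_size):
    """Split items into consecutive chunks of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


async def translate_all(items, translate_batch, batch_size=24, max_concurrency=8):
    """Translate every item, running at most max_concurrency batches at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(batch):
        async with semaphore:  # blocks while max_concurrency batches are active
            return await translate_batch(batch)

    batches = make_batches(items, batch_size)
    # gather preserves input order, so outputs line up with the source strings.
    results = await asyncio.gather(*(run_one(batch) for batch in batches))
    return [text for batch in results for text in batch]


async def fake_translate_batch(batch):
    """Stand-in for a model call: tags each string instead of translating it."""
    await asyncio.sleep(0)
    return [f"[zh] {text}" for text in batch]


if __name__ == "__main__":
    strings = [f"sentence {i}" for i in range(50)]
    translated = asyncio.run(translate_all(strings, fake_translate_batch))
    print(len(translated), translated[0])
```

Under this model, lowering both values (as in the `tools` rerun above) reduces the number of simultaneous requests and the size of each request.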

Optionally review translations in Excel and import reviewer corrections back into the translated asset directory:

```
uv run python -m seatau.annotation export \
  --domains retail telecom --lang-id vi \
  -o data/seatau/annotations/annotation_vi_r1.xlsx \
  --reviewer alice --round-id r1

uv run python -m seatau.annotation import \
  --workbook data/seatau/annotations/annotation_vi_r1.xlsx --lang vi
```

The full workbook schema is documented in src/seatau/annotation/README.md.

Review conversation trajectories

Use this when you need qualitative error labels beyond pass/fail reward metrics. The review command reads saved results.json files and writes results_reviewed.json beside each run. Passing a scenario directory reviews every nested run under it.

```
# Review all runs in one scenario
uv run python -m tau2.scripts.review_conversation run \
  data/simulations/<scenario>

# Or review one run
uv run python -m tau2.scripts.review_conversation run \
  data/simulations/<scenario>/<run-dir>/results.json
```

The default review covers both the agent and user simulator. Add --mode user when you only want user-simulator errors. This step calls an LLM judge, so it requires the same model credentials as evaluation runs.

Audit and normalize error tags

Run this after trajectory review if you will aggregate error tags. It catches judge typos and rewrites them to the canonical tag vocabulary.

```
# Audit first
uv run python -m seatau.utils.error_tags check data/simulations/<scenario>

# Preview, then rewrite reviewed files in place
uv run python -m seatau.utils.error_tags normalize data/simulations/<scenario> --dry-run
uv run python -m seatau.utils.error_tags normalize data/simulations/<scenario>
```
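
The idea behind tag normalization can be illustrated with fuzzy matching against a fixed vocabulary. The sketch below uses `difflib` from the standard library; the tag names in `CANONICAL_TAGS` are invented for the example, and the matching strategy is an assumption, not necessarily what `seatau.utils.error_tags` does.

```python
from difflib import get_close_matches

# Hypothetical vocabulary; the real canonical tags live in the repository.
CANONICAL_TAGS = ("wrong_tool_call", "policy_violation",
                  "language_switch", "hallucinated_info")


def normalize_tag(raw, vocabulary=CANONICAL_TAGS, cutoff=0.8):
    """Map a judge-produced tag to its canonical form, or None if no match."""
    cleaned = raw.strip().lower().replace("-", "_").replace(" ", "_")
    if cleaned in vocabulary:
        return cleaned
    # Fall back to the single closest vocabulary entry above the cutoff.
    matches = get_close_matches(cleaned, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def audit_tags(tags):
    """Report every non-canonical tag and the rewrite that would be applied."""
    return {tag: normalize_tag(tag) for tag in tags if tag not in CANONICAL_TAGS}


if __name__ == "__main__":
    seen = ["policy_violation", "polcy_violation", "Language Switch", "xyz"]
    for tag, fixed in audit_tags(seen).items():
        print(f"{tag!r} -> {fixed!r}")
```

Returning `None` for unmatched tags, rather than guessing, mirrors the audit-then-rewrite split above: unknown tags surface for manual review instead of being silently remapped.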

Evaluation metrics

Standard task success is the product of the requested reward bases per task (DB state checks, environment assertions, action checks, communication checks, and optional NL assertions). SEATauBench additionally records language_correctness in reward_info.info for each simulation: fastText LID over text turns, scored as the proportion detected in the expected language. It is metadata by default and only affects reward when LANGUAGE_CORRECTNESS is explicitly included in a task's reward_basis.
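
The two quantities described above can be sketched in a few lines. This is a simplified illustration, not the code in `src/seatau/metrics/`: the function names are hypothetical, the basis names other than `LANGUAGE_CORRECTNESS` are placeholders, and a toy ASCII check stands in for the fastText language-id model.

```python
from math import prod


def task_reward(component_scores: dict, reward_basis: list) -> float:
    """Multiply only the components named in the task's reward basis."""
    return prod(component_scores[name] for name in reward_basis)


def language_correctness(turn_texts, expected_lang, detect_lang):
    """Proportion of non-empty text turns detected as the expected language."""
    texts = [text for text in turn_texts if text and text.strip()]
    if not texts:
        return None  # nothing to score
    hits = sum(1 for text in texts if detect_lang(text) == expected_lang)
    return hits / len(texts)


def toy_detector(text: str) -> str:
    """Stand-in for fastText LID: pure-ASCII text is 'en', anything else 'vi'."""
    return "en" if text.isascii() else "vi"


if __name__ == "__main__":
    turns = ["xin chào", "hello", "cảm ơn", "tạm biệt"]
    lc = language_correctness(turns, "vi", toy_detector)
    scores = {"DB": 1.0, "COMMUNICATE": 1.0, "LANGUAGE_CORRECTNESS": lc}
    # Metadata only: language correctness is excluded from the basis.
    print(task_reward(scores, ["DB", "COMMUNICATE"]))
    # Opt-in: the same run is penalized once the basis includes it.
    print(task_reward(scores, ["DB", "COMMUNICATE", "LANGUAGE_CORRECTNESS"]))
```

Because the reward is a product, a single failed component (score 0) zeroes the task, and an included language-correctness score below 1 scales the reward down proportionally.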

Module docs

| Document | Description |
| --- | --- |
| SEA-TAU layer | Scenarios, the experiment matrix, and how runs are wired. |
| Translation toolkit | Offline translation pipeline and artifact rules. |
| Annotation review | Excel review/import workflow for translated assets. |
| Mixed-language tools | Tool-partition configs for the `l2_tools` scenario. |

Citation

SEATauBench builds on $\tau$-bench and $\tau^2$-bench. If you use this work, please cite the underlying benchmarks:

```
@misc{barres2025tau2,
      title={$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment},
      author={Victor Barres and Honghua Dong and Soham Ray and Xujie Si and Karthik Narasimhan},
      year={2025},
      eprint={2506.07982},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2506.07982},
}

@misc{yao2024tau,
      title={$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains},
      author={Shunyu Yao and Noah Shinn and Pedram Razavi and Karthik Narasimhan},
      year={2024},
      eprint={2406.12045},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2406.12045},
}
```
