
Add SearchQA split materialization helper #65

Open
summerview1997 wants to merge 4 commits into microsoft:main from summerview1997:codex/searchqa-materialize-splits


@summerview1997


Summary

This PR adds a reproducible helper for materializing runnable SearchQA splits from the released ID-only manifest.

The repository ships data/searchqa_id_split, while configs/searchqa/default.yaml points to data/searchqa_split. Without a materialization step, users can hit a missing split directory error or end up creating local data manually. This helper fills that gap by resolving manifest IDs against the Hugging Face lucadiliello/searchqa dataset and writing full train/val/test examples.

Changes

  • Add scripts/materialize_searchqa.py.
  • Preserve the released manifest order for train/val/test.
  • Validate missing manifest IDs and duplicate source IDs before writing output.
  • Write runnable items.json files containing id, question, context, and answers.
  • Add a generated split_manifest.json with source metadata and counts.
  • Document the SearchQA materialization command in data/README.md.
  • Add a skillopt[searchqa] optional dependency for the Hugging Face datasets package.
  • Add unit tests that avoid network access by testing the materialization core with in-memory rows.
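The validation and ordering behavior listed above can be sketched roughly as follows. This is a minimal illustration and not the code in scripts/materialize_searchqa.py: the function name materialize_splits, the manifest shape (split name mapped to a list of IDs), and the assumption that every source row carries an "id" key are all hypothetical. Because it works on in-memory rows, a function of this shape can be tested without network access.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable, Mapping, Sequence

# Output fields named in the PR description.
FIELDS = ("id", "question", "context", "answers")


def materialize_splits(
    manifest: Mapping[str, Sequence[str]],
    source_rows: Iterable[Mapping[str, object]],
) -> dict[str, list[dict[str, object]]]:
    """Hypothetical core: resolve manifest IDs against in-memory source rows."""
    rows = list(source_rows)

    # Reject ambiguous sources: every source ID must appear exactly once.
    counts = Counter(row["id"] for row in rows)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"duplicate source IDs, e.g. {duplicates[:5]}")

    by_id = {row["id"]: row for row in rows}

    # Fail before producing any output if the manifest names unknown IDs.
    missing = [i for ids in manifest.values() for i in ids if i not in by_id]
    if missing:
        raise ValueError(
            f"{len(missing)} manifest IDs not found, e.g. {missing[:5]}"
        )

    # Iterating the manifest (not the source) preserves the released order.
    return {
        split: [{f: by_id[i][f] for f in FIELDS} for i in ids]
        for split, ids in manifest.items()
    }
```

Running both checks before building any split means a bad manifest or a bad source never yields a partially materialized result.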

Impact

Users can now materialize the SearchQA split expected by the default config with:

python -m pip install 'skillopt[searchqa]'
python scripts/materialize_searchqa.py

The generated output is written to data/searchqa_split, matching configs/searchqa/default.yaml.
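For orientation, writing the output might look roughly like the sketch below. The layout is an assumption: this PR description does not say whether each split gets its own subdirectory, nor which keys split_manifest.json uses, so the per-split folders, the "source" and "counts" keys, and the write_splits name are all hypothetical.

```python
import json
from pathlib import Path


def write_splits(splits, out_dir, source):
    """Hypothetical writer: one items.json per split plus split_manifest.json."""
    out_dir = Path(out_dir)
    counts = {}
    for split, items in splits.items():
        # Assumed layout: <out_dir>/<split>/items.json
        split_dir = out_dir / split
        split_dir.mkdir(parents=True, exist_ok=True)
        with open(split_dir / "items.json", "w", encoding="utf-8") as fh:
            json.dump(items, fh, ensure_ascii=False, indent=2)
        counts[split] = len(items)

    # Assumed manifest shape: source metadata plus per-split example counts.
    manifest = {"source": source, "counts": counts}
    with open(out_dir / "split_manifest.json", "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, ensure_ascii=False, indent=2)
    return manifest
```

Under these assumptions, calling write_splits(splits, "data/searchqa_split", {"dataset": "lucadiliello/searchqa"}) would place the files where configs/searchqa/default.yaml looks for them.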

Validation

  • /home/thomas/SkillOpt/.venv/bin/python -m pytest -q tests/test_materialize_searchqa.py
  • /home/thomas/SkillOpt/.venv/bin/python -m pytest -q
  • /home/thomas/SkillOpt/.venv/bin/python -m ruff check scripts/materialize_searchqa.py tests/test_materialize_searchqa.py pyproject.toml
  • /home/thomas/SkillOpt/.venv/bin/python -m py_compile scripts/materialize_searchqa.py tests/test_materialize_searchqa.py
  • git diff --check
