
commons: anchor performer imagery to verified Wikipedia links + audit/ground-truth tooling #215

Merged: dprodger merged 2 commits into main from commons-imagery-wikipedia-anchor on Jun 9, 2026
Conversation

dprodger (Owner) commented Jun 9, 2026

Why

The Commons imagery enricher fell back to a blind Category:<Name> guess when Wikidata had no Commons-category claim. For common names this matched an unrelated person of the same name: for example, the jazz performer Andrew Williams picked up an archaeologist's catalogued-coin category, which fed medieval coins into the performer's portrait pipeline.

What changed

Correctness

  • resolve_commons_category now resolves only via the performer's validated Wikipedia article (Wikipedia → Wikidata → P373). No name-search guessing; no Wikipedia link → no imagery. No imagery beats wrong imagery.
  • The enrichment sweep is gated to performers that have a Wikipedia URL, so we don't spend worker cycles / vision quota on guaranteed no-ops.
  • Phase-1 progress logging in analyze_and_rank so the long download+gate loop isn't mistaken for a hung worker.
  • Tests updated (fixtures now carry Wikipedia URLs; added an exclusion test). 13 pass locally.
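A minimal sketch of the Wikipedia → Wikidata → P373 chain and the sweep gate described above. It is not the code in this PR: `resolve_commons_category_sketch`, `eligible_for_sweep`, the injected `fetch_json` callable, the `Category:` prefix on the return value, and the dict-shaped performer rows are all assumptions made for illustration. The two requests use the MediaWiki `pageprops` query and the Wikidata `wbgetclaims` action.

```python
from typing import Callable, Iterable, List, Optional, Tuple
from urllib.parse import unquote, urlparse

# Hypothetical fetcher: takes (api_url, params) and returns the parsed JSON body.
FetchJson = Callable[[str, dict], dict]


def title_from_wikipedia_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (host, article title) for a Wikipedia /wiki/<Title> URL, else None."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if not host.endswith(".wikipedia.org") or not parsed.path.startswith("/wiki/"):
        return None
    title = unquote(parsed.path[len("/wiki/"):]).replace("_", " ")
    return (host, title) if title else None


def resolve_commons_category_sketch(
    wikipedia_url: Optional[str], fetch_json: FetchJson
) -> Optional[str]:
    """Resolve a Commons category only through the Wikipedia article; never by name."""
    if not wikipedia_url:
        return None  # no Wikipedia link -> no imagery
    parsed = title_from_wikipedia_url(wikipedia_url)
    if parsed is None:
        return None
    host, title = parsed

    # Step 1: Wikipedia article -> Wikidata item id (pageprops.wikibase_item).
    pages = fetch_json(
        f"https://{host}/w/api.php",
        {"action": "query", "prop": "pageprops", "ppprop": "wikibase_item",
         "titles": title, "redirects": 1, "format": "json"},
    ).get("query", {}).get("pages", {})
    qid = None
    for page in pages.values():
        qid = page.get("pageprops", {}).get("wikibase_item")
        if qid:
            break
    if not qid:
        return None

    # Step 2: Wikidata item -> P373 (Commons category) claim.
    claims = fetch_json(
        "https://www.wikidata.org/w/api.php",
        {"action": "wbgetclaims", "entity": qid, "property": "P373", "format": "json"},
    ).get("claims", {}).get("P373", [])
    for claim in claims:
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
        if value:
            return f"Category:{value}"  # prefix is an assumption about the return shape
    return None  # no P373 claim: deliberately no fallback to a name guess


def eligible_for_sweep(performers: Iterable[dict]) -> List[dict]:
    """Gate the sweep: keep only performers that carry a Wikipedia URL."""
    return [p for p in performers if p.get("wikipedia_url")]
```

Injecting `fetch_json` keeps the sketch testable without network access; in a real worker it would wrap an HTTP client of your choice.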

Tooling (all read-only)

  • audit_commons_imagery.py + build_commons_audit_viewer.py — flag existing Commons imagery that can't be re-derived from a trusted anchor, with an HTML thumbnail viewer for keep/delete triage.
  • build_wikipedia_groundtruth_queue.py + build_wikipedia_groundtruth_viewer.py — for performers with imagery but no Wikipedia link, derive candidate articles (category-derived, name-search fallback) and verify them by hand into a manual-provenance ground-truth JSON under data/ground_truth/.
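The audit idea above (flag imagery that cannot be re-derived from a trusted anchor) can be sketched as a read-only pass over stored rows. This is an illustration under stated assumptions, not the contents of audit_commons_imagery.py: the row keys (`performer_id`, `wikipedia_url`, `commons_category`), the reason strings, and the `rederive` callable are hypothetical.

```python
from typing import Callable, Iterable, List, Optional


def audit_imagery_sketch(
    rows: Iterable[dict],
    rederive: Callable[[str], Optional[str]],
) -> List[dict]:
    """Return rows whose stored Commons category cannot be re-derived.

    Read-only: nothing is modified or deleted; the result is a report
    intended for manual keep/delete triage.
    """
    flagged = []
    for row in rows:
        anchor = row.get("wikipedia_url")
        if not anchor:
            # No trusted anchor at all, so the imagery came from a name guess.
            reason = "no_wikipedia_link"
        else:
            derived = rederive(anchor)
            if derived is None:
                reason = "anchor_has_no_commons_category"
            elif derived != row.get("commons_category"):
                reason = "category_mismatch"
            else:
                continue  # re-derivable from the anchor: trusted
        flagged.append({
            "performer_id": row.get("performer_id"),
            "commons_category": row.get("commons_category"),
            "reason": reason,
        })
    return flagged
```

Rows flagged as `no_wikipedia_link` correspond to the performers that the ground-truth pass targets; the other two reasons point at imagery whose anchor exists but disagrees with what is stored.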

Housekeeping

  • .gitignore rules for transient audit/queue artifacts; verified ground-truth JSON stays trackable.
  • data/ground_truth/README.md documents both JSON schemas.
  • Removed the duplicate un-numbered add_commons_imagery_enrichment.sql (superseded by 020_…, identical content).

Notes / follow-ups

  • Re-ingest of the ground-truth JSON into performers.wikipedia_url is intentionally not built yet.
  • ~1,224 performers have Commons imagery but no Wikipedia link — candidates for the ground-truth pass.

🤖 Generated with Claude Code

dprodger and others added 2 commits June 9, 2026 19:38
…/ground-truth tooling

Resolver no longer guesses a Commons category from a bare artist name — that
matched same-named non-musicians (e.g. an archaeologist's coin-finds category
for jazz "Andrew Williams"). resolve_commons_category now resolves ONLY via the
performer's validated Wikipedia article (Wikipedia -> Wikidata -> P373); no
Wikipedia link means no imagery. The sweep is gated to performers that have a
Wikipedia URL so we don't spend worker cycles / vision quota on guaranteed
no-ops. Phase-1 of analyze_and_rank now logs progress so the long download+gate
loop isn't mistaken for a hung worker.

Tooling (all read-only):
- audit_commons_imagery.py / build_commons_audit_viewer.py: flag existing
  Commons imagery that can't be re-derived from a trusted anchor, with an HTML
  thumbnail viewer for keep/delete triage.
- build_wikipedia_groundtruth_queue.py / build_wikipedia_groundtruth_viewer.py:
  for performers with imagery but no Wikipedia link, derive candidate articles
  (category-derived, name-search fallback) and verify them by hand into a
  manual-provenance ground-truth JSON under data/ground_truth/.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

add_commons_imagery_enrichment.sql was superseded by the numbered
020_commons_imagery_enrichment.sql (identical content) when migrations were
renumbered to run under the test framework. Remove the stale duplicate so it is
not applied twice.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@dprodger dprodger merged commit 74f51e8 into main Jun 9, 2026
3 checks passed

@dprodger dprodger left a comment

reviewed
