commons: anchor performer imagery to verified Wikipedia links + audit/ground-truth tooling #215
Merged
…/ground-truth tooling

Resolver no longer guesses a Commons category from a bare artist name; that matched same-named non-musicians (e.g. an archaeologist's coin-finds category for jazz "Andrew Williams"). resolve_commons_category now resolves ONLY via the performer's validated Wikipedia article (Wikipedia -> Wikidata -> P373); no Wikipedia link means no imagery.

The sweep is gated to performers that have a Wikipedia URL so we don't spend worker cycles / vision quota on guaranteed no-ops. Phase-1 of analyze_and_rank now logs progress so the long download+gate loop isn't mistaken for a hung worker.

Tooling (all read-only):
- audit_commons_imagery.py / build_commons_audit_viewer.py: flag existing Commons imagery that can't be re-derived from a trusted anchor, with an HTML thumbnail viewer for keep/delete triage.
- build_wikipedia_groundtruth_queue.py / build_wikipedia_groundtruth_viewer.py: for performers with imagery but no Wikipedia link, derive candidate articles (category-derived, name-search fallback) and verify them by hand into a manual-provenance ground-truth JSON under data/ground_truth/.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
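As a rough illustration of the Wikipedia -> Wikidata -> P373 chain described in this commit, here is a minimal Python sketch. It is not the PR's implementation: the names title_from_wikipedia_url, p373_from_entity and resolve_commons_category_sketch, and the two injected lookup callables, are assumptions made for illustration. Only the shape of the Wikidata entity JSON (claims -> P373 -> mainsnak -> datavalue -> value) follows the public format.

```python
from typing import Callable, Optional
from urllib.parse import unquote, urlparse


def title_from_wikipedia_url(url: str) -> Optional[str]:
    """Extract an article title from a URL like https://en.wikipedia.org/wiki/Some_Title."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    is_wikipedia = host == "wikipedia.org" or host.endswith(".wikipedia.org")
    if not is_wikipedia or not parsed.path.startswith("/wiki/"):
        return None
    title = unquote(parsed.path[len("/wiki/"):]).replace("_", " ").strip()
    return title or None


def p373_from_entity(entity: dict) -> Optional[str]:
    """Read the Commons category (P373) string from a Wikidata entity JSON dict."""
    for claim in entity.get("claims", {}).get("P373", []):
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
        if isinstance(value, str) and value:
            return value
    return None


def resolve_commons_category_sketch(
    wikipedia_url: Optional[str],
    qid_for_title: Callable[[str], Optional[str]],     # hypothetical: title -> Wikidata QID
    entity_for_qid: Callable[[str], Optional[dict]],   # hypothetical: QID -> entity JSON
) -> Optional[str]:
    """Resolve a Commons category only through a Wikipedia anchor; never from a bare name."""
    if not wikipedia_url:
        return None  # no Wikipedia link means no imagery
    title = title_from_wikipedia_url(wikipedia_url)
    if title is None:
        return None
    qid = qid_for_title(title)
    if qid is None:
        return None
    entity = entity_for_qid(qid)
    if entity is None:
        return None
    # No P373 claim also yields None; there is deliberately no Category:<Name> fallback.
    return p373_from_entity(entity)
```

Passing the lookups in as callables keeps the chain testable without network access; in a real worker they would wrap whatever Wikipedia and Wikidata clients the project already uses.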
add_commons_imagery_enrichment.sql was superseded by the numbered 020_commons_imagery_enrichment.sql (identical content) when migrations were renumbered to run under the test framework. Remove the stale duplicate so it is not applied twice.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Why
The Commons imagery enricher fell back to a blind Category:<Name> guess when Wikidata had no Commons-category claim. For common names this matched an unrelated, same-named person: e.g. jazz Andrew Williams picked up an archaeologist's catalogued-coin category, feeding medieval coins into the performer's portrait pipeline.
What changed
Correctness
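- resolve_commons_category now resolves only via the performer's validated Wikipedia article (Wikipedia → Wikidata → P373). No name-search guessing; no Wikipedia link → no imagery. No imagery beats wrong imagery.
- The sweep is gated to performers that have a Wikipedia URL, so worker cycles and vision quota are not spent on guaranteed no-ops.
- Phase-1 of analyze_and_rank now logs progress so the long download+gate loop isn't mistaken for a hung worker.
Tooling (all read-only)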
- audit_commons_imagery.py + build_commons_audit_viewer.py: flag existing Commons imagery that can't be re-derived from a trusted anchor, with an HTML thumbnail viewer for keep/delete triage.
- build_wikipedia_groundtruth_queue.py + build_wikipedia_groundtruth_viewer.py: for performers with imagery but no Wikipedia link, derive candidate articles (category-derived, name-search fallback) and verify them by hand into a manual-provenance ground-truth JSON under data/ground_truth/.
Housekeeping
- .gitignore rules for transient audit/queue artifacts; verified ground-truth JSON stays trackable.
- data/ground_truth/README.md documents both JSON schemas.
- Removed the stale add_commons_imagery_enrichment.sql (superseded by 020_…, identical content).
Notes / follow-ups
performers.wikipedia_url is intentionally not built yet.
🤖 Generated with Claude Code
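The read-only audit listed under Tooling (flag existing Commons imagery that can't be re-derived from a trusted anchor) can be pictured roughly as follows. This is a sketch under assumptions, not the contents of audit_commons_imagery.py: the ImageryRecord and AuditFinding types, the reason strings, and the rederive callable are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class ImageryRecord:  # hypothetical row shape
    performer_id: int
    wikipedia_url: Optional[str]
    stored_category: str


@dataclass(frozen=True)
class AuditFinding:  # hypothetical finding shape
    performer_id: int
    stored_category: str
    reason: str


def normalize_category(category: str) -> str:
    """Compare categories without the 'Category:' prefix or underscore/space differences."""
    name = category.strip()
    if name.lower().startswith("category:"):
        name = name[len("category:"):]
    return name.replace("_", " ").strip()


def audit_imagery(
    records: Iterable[ImageryRecord],
    rederive: Callable[[str], Optional[str]],  # hypothetical: Wikipedia URL -> Commons category
) -> List[AuditFinding]:
    """Return findings only; nothing is deleted or updated here."""
    findings: List[AuditFinding] = []
    for rec in records:
        if not rec.wikipedia_url:
            # No trusted anchor at all, so the stored category cannot be re-derived.
            findings.append(AuditFinding(rec.performer_id, rec.stored_category, "no_wikipedia_anchor"))
            continue
        derived = rederive(rec.wikipedia_url)
        if derived is None:
            findings.append(AuditFinding(rec.performer_id, rec.stored_category, "anchor_has_no_category"))
        elif normalize_category(derived) != normalize_category(rec.stored_category):
            findings.append(AuditFinding(rec.performer_id, rec.stored_category, "category_mismatch"))
    return findings
```

Returning a list of findings rather than acting on them mirrors the keep/delete triage step: a human decides what to do with each flagged row.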