Consolidate duplicate Person records + fix same-name url_name collisions (#1275, #1206) #1327
Merged
…1275) Phase B core tooling for consolidating duplicate Person records:
- name_utils.build_unique_url_name(): derive a unique url_name, preferring a readable middle-initial differentiator (jasminexzhang) for namesakes, with the legacy numeric suffix (jasminezhang2) as fallback.
- merge_duplicate_people management command: decisions-CSV-driven, dry-run by default (--apply to write). Relocates every relation pointing at Person via a generic Person._meta.get_fields() walk (FKs incl. the advisor/co_advisor/grad_mentor self-refs; sorted-M2M author/recipient sets with order preserved via .set(); plain News.people M2M), backfills only blank scalar fields, leaves the canonical record's image untouched, and is atomic per row and idempotent.
- Unit + DatabaseTestCase regression tests: author-order preservation, dedup, FK/self-ref/Grant/Award/News reassignment, blanks-only backfill, dry-run no-op, idempotent re-run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
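The two merge rules named above (order-preserving reassignment with dedup, and blanks-only backfill) can be sketched in plain Python. This is an illustrative sketch, not the command's code: `replace_in_ordered`, `backfill_blanks`, the dict-shaped records, and the ids are all hypothetical, and the real command works on Django relations.

```python
def replace_in_ordered(ids, duplicate_id, canonical_id):
    """Swap duplicate_id for canonical_id, keeping order and dropping repeats."""
    seen = set()
    result = []
    for pid in ids:
        if pid == duplicate_id:
            pid = canonical_id
        if pid not in seen:  # the canonical person may already be listed
            seen.add(pid)
            result.append(pid)
    return result


def backfill_blanks(canonical, duplicate, fields):
    """Copy a field from the duplicate only where the canonical value is blank."""
    merged = dict(canonical)
    for name in fields:
        is_blank = merged.get(name) in (None, "")
        has_value = duplicate.get(name) not in (None, "")
        if is_blank and has_value:
            merged[name] = duplicate[name]
    return merged


# Hypothetical author list: 20 is the duplicate row, 10 the canonical one.
authors = replace_in_ordered([5, 20, 9, 10], duplicate_id=20, canonical_id=10)
person = backfill_blanks(
    {"bio": "", "email": "kept@example.org"},
    {"bio": "from duplicate", "email": "other@example.org"},
    fields=["bio", "email"],
)
```

Running either helper a second time on its own output changes nothing, which is the property the idempotent re-run test relies on.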
Namesake / collision safety net (Phase A), built on the shared build_unique_url_name helper:
- Person.save(): derive url_name via build_unique_url_name (readable middle-initial differentiator for namesakes, numeric-suffix fallback); drops the inlined special_chars map + numeric loop (now dead code).
- recompute_url_names command: re-derive a unique url_name for every Person in pk order (earliest keeps the bare name), writing via .update() to skip save() side effects. Idempotent, --dry-run; wired into docker-entrypoint.sh (4.8) so historical collisions self-heal on deploy. This is the durable #1206 fix.
- member view: on the now-unreachable MultipleObjectsReturned, log loudly and raise a clean Http404 instead of the silent .first() band-aid (which could surface the wrong namesake and hide a real 500).
- Tests: recompute de-collision (numeric + middle-initial), idempotency, dry-run no-op; save() namesake derivation; member page resolves 200 for each de-collided namesake and returns 404 (not 500) on an unresolved collision.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
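A rough sketch of the de-collision order described above, assuming people are plain dicts and names are already ASCII. The function names, signatures, and the sample middle initial are hypothetical and only mirror the described behavior (bare name first, then middle initial, then numeric suffix); they are not the helper's real API.

```python
import re


def unique_url_name(first, middle, last, taken):
    """Pick the bare name, else a middle-initial variant, else a numeric suffix."""
    def clean(text):
        return re.sub(r"[^a-zA-Z]", "", text).lower()

    base = clean(first) + clean(last)
    if base not in taken:
        return base
    if middle:
        candidate = clean(first) + clean(middle)[:1] + clean(last)
        if candidate not in taken:
            return candidate
    n = 2
    while f"{base}{n}" in taken:
        n += 1
    return f"{base}{n}"


def recompute(people, dry_run=False):
    """Re-derive names in pk order; return the (pk, old, new) changes."""
    taken = set()
    changes = []
    for person in sorted(people, key=lambda p: p["pk"]):
        new = unique_url_name(person["first"], person["middle"], person["last"], taken)
        taken.add(new)
        if new != person["url_name"]:
            changes.append((person["pk"], person["url_name"], new))
            if not dry_run:
                person["url_name"] = new
    return changes


# Hypothetical rows that all collide on the same bare name.
people = [
    {"pk": 3, "first": "Jasmine", "middle": "", "last": "Zhang", "url_name": "jasminezhang"},
    {"pk": 1, "first": "Jasmine", "middle": "", "last": "Zhang", "url_name": "jasminezhang"},
    {"pk": 2, "first": "Jasmine", "middle": "X", "last": "Zhang", "url_name": "jasminezhang"},
]
preview = recompute(people, dry_run=True)   # reports changes, writes nothing
applied = recompute(people)                 # writes the new names
second_run = recompute(people)              # nothing left to change
```

Because the set of taken names is rebuilt from scratch in pk order on every run, a second pass produces the same names and therefore no changes.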
normalize_person_name used a hand-maintained accent map that only covered
grave/circumflex/tilde vowels. Acute accents and the cedilla were neither
folded nor matched, so the trailing [^a-zA-Z] strip *deleted* them — mangling
url_names ('Claudio' with an acute a -> 'cludiosilva', 'Francois' with cedilla
-> 'franoisguimbretiere') and silently hiding accented-name duplicates from the
DuplicatePeopleCheck dashboard.
Replace the map with standard Unicode NFKD folding (_ascii_fold), which handles
the whole Latin range generically and needs no hand-syncing. This corrects 5
mangled url_names in prod data and reveals one previously-hidden true duplicate
(Claudio Silva, ids 669+720) for the merge decisions file. build_unique_url_name's
middle-initial folding uses the same helper.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
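To illustrate the difference, here is a minimal sketch of NFKD folding using only the standard library. `ascii_fold` and `to_url_name` are hypothetical stand-ins for the project's `_ascii_fold` and url_name derivation, whose exact code is not shown here.

```python
import re
import unicodedata


def ascii_fold(text):
    """Decompose accented characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_url_name(first, last):
    """Fold accents first, then strip anything that is not an ASCII letter."""
    folded = ascii_fold(first + last)
    return re.sub(r"[^a-zA-Z]", "", folded).lower()


# Stripping without folding deletes the accented letter outright.
unfolded = re.sub(r"[^a-zA-Z]", "", "Cláudio" + "Silva").lower()

fixed_acute = to_url_name("Cláudio", "Silva")
fixed_cedilla = to_url_name("François", "Guimbretière")
```

Folding before the strip turns each accented letter into its base letter plus a combining mark, so only the mark is discarded.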
#1275) The decisions file is keyed by production ids, but pushing to master deploys to the test server (a different DB where those ids are other people). Refuse to merge two rows whose normalized names differ, unless --allow-name-mismatch is passed (for a deliberate documented cross-name same-person case). Stops a prod-id file from corrupting test data if it ever runs there.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
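A minimal sketch of such a guard, assuming names are compared after accent folding and lowercasing. `normalize_name`, `NameMismatchError`, and `assert_same_person` are hypothetical names for illustration; the command's real checks may differ.

```python
import unicodedata


class NameMismatchError(ValueError):
    """Raised when two rows slated for merging carry different names."""


def normalize_name(name):
    """Fold accents, keep ASCII letters only, and lowercase."""
    decomposed = unicodedata.normalize("NFKD", name)
    letters = [ch for ch in decomposed if ch.isascii() and ch.isalpha()]
    return "".join(letters).lower()


def assert_same_person(canonical_name, duplicate_name, allow_name_mismatch=False):
    """Refuse a merge across different names unless explicitly overridden."""
    if allow_name_mismatch:
        return
    if normalize_name(canonical_name) != normalize_name(duplicate_name):
        raise NameMismatchError(
            f"refusing to merge {duplicate_name!r} into {canonical_name!r}"
        )
```

On a database where the same ids belong to other people, the names will almost always differ, so the merge row is rejected before anything is written.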
…1275) Add the reviewed dedup_decisions.csv (13 merges + 2 jasminezhang keeps; ids only) and a PROD-gated one-shot in docker-entrypoint.sh that runs merge_duplicate_people --apply before recompute_url_names. Gated to DJANGO_ENV=PROD (the decisions file is keyed by production ids) with the command's name-mismatch guard as a backstop.

Validated end-to-end against a copy of the prod dump: 589->576 people, url_name collisions 10->0, duplicate clusters reduced to the two intended namesakes.

Remove this one-shot in a follow-up once the first prod deploy confirms the merge in /logs (the command is idempotent, so a lingering run is a no-op).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This was referenced Jun 18, 2026
Closes #1275. Addresses #1206.
What & why
Over ~15 years the prod DB accumulated duplicate Person rows (one human, several rows from un-deduped imports) and same-name url_name collisions that cause MultipleObjectsReturned → HTTP 500 on /member/<url_name>/ (#1206). This PR adds the tooling to merge true duplicates, de-collides namesakes with stable readable URLs, and applies a reviewed merge to prod via a one-shot.

Sequenced per the issue discussion: dedup/merge first, with url de-collision as the namesake safety net.
Changes
- merge_duplicate_people command: decisions-CSV-driven, dry-run by default (--apply to write). Relocates every relation pointing at Person via a generic Person._meta.get_fields() walk: FKs incl. the advisor/co_advisor/grad_mentor self-refs; sorted-M2M author/recipient sets with order preserved via .set() (incl. Grant.authors/Award.recipients, which the issue's own table missed); plain News.people. Blanks-only scalar backfill, image left on the canonical record, atomic per row, idempotent. Refuses name-mismatched pairs unless --allow-name-mismatch (guards a prod-id file from touching the wrong DB).
- build_unique_url_name helper + Person.save() refactor: readable middle-initial differentiator for namesakes (jasminexzhang), numeric fallback (jasminezhang2).
- normalize_person_name: the old hand-maintained map dropped acute accents/cedilla, mangling url_names (Cláudio → cludiosilva) and hiding accented-name duplicates from the Data Health dashboard. Fixes 5 mangled url_names and reveals one previously-hidden duplicate.
- recompute_url_names command: re-derives a unique url_name for every person (earliest pk keeps the bare name), wired into docker-entrypoint.sh; the durable #1206 fix.
- Member view: Http404 instead of the silent .first() band-aid on the (now-unreachable) collision path.
- PROD-gated one-shot in docker-entrypoint.sh applying the reviewed dedup_decisions.csv (13 merges + 2 jasminezhang keeps).

Validation
…jasminezhang namesakes remain).

Rollout
…ML_WEBSITE_VERSION) → prod deploys; the one-shot runs once. Verify the 13 "merged X into Y" lines in /logs/debug.log.

No UI/template changes (member 404 is behavioral), so no a11y/screenshot impact.
🤖 Generated with Claude Code