
Consolidate duplicate Person records + fix same-name url_name collisions (#1275, #1206) #1327

Merged
jonfroehlich merged 5 commits into master from 1275-consolidate-duplicate-people on Jun 18, 2026


@jonfroehlich


Closes #1275. Addresses #1206.

What & why

Over ~15 years the prod DB accumulated duplicate Person rows (one human, several rows from un-deduped imports) and same-name url_name collisions that cause MultipleObjectsReturned → HTTP 500 on /member/<url_name>/ (#1206). This PR adds the tooling to merge true duplicates, de-collides namesakes with stable readable URLs, and applies a reviewed merge to prod via a one-shot.

Sequenced per the issue discussion: dedup/merge first, with url de-collision as the namesake safety net.

Changes

  • merge_duplicate_people command — decisions-CSV-driven, dry-run by default (--apply to write). Relocates every relation pointing at Person via a generic Person._meta.get_fields() walk: FKs incl. the advisor/co_advisor/grad_mentor self-refs; sorted-M2M author/recipient sets with order preserved via .set() (incl. Grant.authors / Award.recipients, which the issue's own table missed); plain News.people. Blanks-only scalar backfill, image left on the canonical record, atomic per row, idempotent. Refuses name-mismatched pairs unless --allow-name-mismatch (guards a prod-id file from touching the wrong DB).
  • build_unique_url_name helper + Person.save() refactor — readable middle-initial differentiator for namesakes (jasminexzhang), numeric fallback (jasminezhang2).
  • Unicode NFKD accent folding in normalize_person_name: the old hand-maintained map dropped acute accents and the cedilla, mangling url_names (Cláudio → cludiosilva) and hiding accented-name duplicates from the Data Health dashboard. Fixes 5 mangled url_names and reveals one previously hidden duplicate.
  • recompute_url_names command: re-derives a unique url_name for every person (earliest pk keeps the bare name), wired into docker-entrypoint.sh. This is the durable #1206 fix.
  • member view hardening — clean logged Http404 instead of the silent .first() band-aid on the (now-unreachable) collision path.
  • PROD-gated one-shot in docker-entrypoint.sh applying the reviewed dedup_decisions.csv (13 merges + 2 jasminezhang keeps).
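
The namesake rule described above (bare name first, then a middle-initial differentiator, then a numeric suffix) can be sketched in plain Python. This is a minimal illustration, not the helper from this PR: the function name `unique_url_name_sketch`, its signature, the `taken` set, and the choice to start numbering at 2 are all assumptions made for the example.

```python
import re
import unicodedata


def _slug(text: str) -> str:
    """Fold accents (NFKD), drop non-letters, and lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^a-zA-Z]", "", stripped).lower()


def unique_url_name_sketch(first: str, middle: str, last: str, taken: set) -> str:
    """Hypothetical stand-in for build_unique_url_name.

    `taken` holds the url_names already in use by other people.
    """
    base = _slug(first + last)
    if base not in taken:
        return base  # no collision: keep the bare name

    # Readable differentiator: insert the middle initial, if there is one.
    initial = _slug(middle)[:1]
    if initial:
        candidate = _slug(first) + initial + _slug(last)
        if candidate not in taken:
            return candidate

    # Numeric fallback (assumed here to start at 2).
    n = 2
    while f"{base}{n}" in taken:
        n += 1
    return f"{base}{n}"
```

With `taken = {"jasminezhang"}`, a second Jasmine Zhang whose middle name starts with X gets `jasminexzhang`, and one with no middle name gets `jasminezhang2`.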

Validation

  • Full test suite green (241 tests), incl. new regression tests for merge (author-order, FK/self-ref/Grant/Award/News reassignment, blanks-only backfill, dry-run no-op, idempotency, name-mismatch guard), url_name de-collision, accent folding, and member 200/404.
  • End-to-end dry-run + apply against a throwaway copy of the prod dump: 589→576 people, url_name collision groups 10→0, duplicate-people rows 26→2 (only the intended jasminezhang namesakes remain).

Rollout

  1. Merge → the test server auto-deploys. The merge one-shot is skipped on test (PROD gate); the accent fix, view hardening, and recompute go live there for a visual check.
  2. Push a SemVer tag (bump ML_WEBSITE_VERSION) → prod deploys; the one-shot runs once. Verify the 13 "merged X into Y" lines in /logs/debug.log.
  3. Follow-up PR removes the one-shot (merge is idempotent; a lingering run is a no-op).

No UI/template changes (member 404 is behavioral), so no a11y/screenshot impact.

🤖 Generated with Claude Code

jonfroehlich and others added 5 commits June 17, 2026 16:43
…1275)

Phase B core tooling for consolidating duplicate Person records:

- name_utils.build_unique_url_name(): derive a unique url_name preferring a
  readable middle-initial differentiator (jasminexzhang) for namesakes, with
  the legacy numeric suffix (jasminezhang2) as fallback.
- merge_duplicate_people management command: decisions-CSV-driven, dry-run by
  default (--apply to write). Relocates every relation pointing at Person via a
  generic Person._meta.get_fields() walk (FKs incl. the advisor/co_advisor/
  grad_mentor self-refs; sorted-M2M author/recipient sets with order preserved
  via .set(); plain News.people M2M), backfills only-blank scalar fields, leaves
  the canonical record's image untouched, is atomic per row and idempotent.
- Unit + DatabaseTestCase regression tests: author-order preservation, dedup,
  FK/self-ref/Grant/Award/News reassignment, blanks-only backfill, dry-run
  no-op, idempotent re-run.
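
Two of the merge rules above, order-preserving relocation in an ordered author list and blanks-only backfill, can be illustrated without Django. The sketch below works on plain lists and dicts; the function names, the example ids, and the definition of "blank" as `None` or an empty string are assumptions for illustration, and the resulting list is only what one might then hand to a relation's `.set()`.

```python
def relocate_ordered_ids(ordered_ids, duplicate_id, canonical_id):
    """Swap the duplicate for the canonical id, keeping first-seen order.

    If both ids already appear, the later occurrence is dropped so the
    canonical person is listed once.
    """
    result = []
    seen = set()
    for person_id in ordered_ids:
        target = canonical_id if person_id == duplicate_id else person_id
        if target not in seen:
            seen.add(target)
            result.append(target)
    return result


def backfill_blanks(canonical: dict, duplicate: dict) -> dict:
    """Copy a value from the duplicate only where the canonical is blank."""
    merged = dict(canonical)
    for field, value in duplicate.items():
        if merged.get(field) in (None, "") and value not in (None, ""):
            merged[field] = value
    return merged
```

Running `relocate_ordered_ids` a second time on its own output returns the same list, which is the property that makes a repeated merge a no-op for these relations.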

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Namesake / collision safety net (Phase A), built on the shared
build_unique_url_name helper:

- Person.save(): derive url_name via build_unique_url_name (readable
  middle-initial differentiator for namesakes, numeric-suffix fallback);
  drops the inlined special_chars map + numeric loop (now dead code).
- recompute_url_names command: re-derive a unique url_name for every Person in
  pk order (earliest keeps the bare name), writing via .update() to skip save()
  side effects. Idempotent, --dry-run; wired into docker-entrypoint.sh (4.8) so
  historical collisions self-heal on deploy. This is the durable #1206 fix.
- member view: on the now-unreachable MultipleObjectsReturned, log loudly and
  raise a clean Http404 instead of the silent .first() band-aid (which could
  surface the wrong namesake and hide a real 500).
- Tests: recompute de-collision (numeric + middle-initial), idempotency,
  dry-run no-op; save() namesake derivation; member page resolves 200 for each
  de-collided namesake and returns 404 (not 500) on an unresolved collision.
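
The "earliest pk keeps the bare name" rule can be sketched as a pure function over `(pk, base_name)` pairs. This is a simplified illustration that uses only the numeric fallback; the function name and input shape are hypothetical, and the real command additionally prefers a middle-initial differentiator and writes results via `.update()`.

```python
def recompute_url_names_sketch(people):
    """Assign a unique url_name per person, walking in pk order.

    `people` is an iterable of (pk, base_name) tuples, where base_name is
    the already-normalized bare name. Returns {pk: url_name}.
    """
    assigned = {}
    taken = set()
    for pk, base in sorted(people):  # earliest pk is processed first
        candidate = base
        suffix = 2
        while candidate in taken:
            candidate = f"{base}{suffix}"
            suffix += 1
        taken.add(candidate)
        assigned[pk] = candidate
    return assigned
```

Because every name is derived from the bare name and the pk order rather than from the currently stored url_name, running the function again on the same people yields the same mapping.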

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

normalize_person_name used a hand-maintained accent map that only covered
grave/circumflex/tilde vowels. Acute accents and the cedilla were neither
folded nor matched, so the trailing [^a-zA-Z] strip *deleted* them — mangling
url_names ('Claudio' with an acute a -> 'cludiosilva', 'Francois' with cedilla
-> 'franoisguimbretiere') and silently hiding accented-name duplicates from the
DuplicatePeopleCheck dashboard.

Replace the map with standard Unicode NFKD folding (_ascii_fold), which handles
the whole Latin range generically and needs no hand-syncing. This corrects 5
mangled url_names in prod data and reveals one previously-hidden true duplicate
(Claudio Silva, ids 669+720) for the merge decisions file. build_unique_url_name's
middle-initial folding uses the same helper.
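
A minimal sketch of the folding approach, using only the standard library. The names `fold_accents` and `normalize_name_sketch` are stand-ins chosen for this example (the real helpers are `_ascii_fold` and `normalize_person_name`, whose exact signatures are not shown in this PR).

```python
import re
import unicodedata


def fold_accents(text: str) -> str:
    """NFKD-decompose each character, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_name_sketch(full_name: str) -> str:
    """Fold accents first, then strip anything that is not an ASCII letter."""
    return re.sub(r"[^a-zA-Z]", "", fold_accents(full_name)).lower()


def strip_only(full_name: str) -> str:
    """The failure mode described above: stripping without folding first."""
    return re.sub(r"[^a-zA-Z]", "", full_name).lower()
```

For `"Cláudio Silva"`, `strip_only` yields `cludiosilva` while `normalize_name_sketch` yields `claudiosilva`, so the accented and unaccented spellings now normalize to the same key.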

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

#1275)

The decisions file is keyed by production ids, but pushing to master deploys to
the test server (a different DB where those ids are other people). Refuse to
merge two rows whose normalized names differ, unless --allow-name-mismatch is
passed (for a deliberate documented cross-name same-person case). Stops a
prod-id file from corrupting test data if it ever runs there.
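
The guard can be pictured as a small pre-merge check. The sketch below is an assumption about its shape, not the command's code: `NameMismatchError`, `check_merge_pair`, and the `allow_name_mismatch` keyword are hypothetical names that mirror the `--allow-name-mismatch` flag.

```python
import re
import unicodedata


class NameMismatchError(ValueError):
    """Raised when a merge pair does not look like the same person."""


def _normalized(name: str) -> str:
    decomposed = unicodedata.normalize("NFKD", name)
    folded = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^a-zA-Z]", "", folded).lower()


def check_merge_pair(canonical_name, duplicate_name, allow_name_mismatch=False):
    """Refuse the merge when normalized names differ, unless overridden."""
    if allow_name_mismatch:
        return
    if _normalized(canonical_name) != _normalized(duplicate_name):
        raise NameMismatchError(
            f"refusing to merge {duplicate_name!r} into {canonical_name!r}: "
            "names differ after normalization"
        )
```

On a database where the ids in the decisions file belong to unrelated people, the names would almost never match, so each row is rejected before any relation is moved.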

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…1275)

Add the reviewed dedup_decisions.csv (13 merges + 2 jasminezhang keeps; ids
only) and a PROD-gated one-shot in docker-entrypoint.sh that runs
merge_duplicate_people --apply before recompute_url_names. Gated to
DJANGO_ENV=PROD (the decisions file is keyed by production ids) with the
command's name-mismatch guard as a backstop. Validated end-to-end against a
copy of the prod dump: 589->576 people, url_name collisions 10->0, duplicate
clusters reduced to the two intended namesakes.

Remove this one-shot in a follow-up once the first prod deploy confirms the
merge in /logs (the command is idempotent, so a lingering run is a no-op).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jonfroehlich jonfroehlich merged commit 2903cb1 into master Jun 18, 2026
2 checks passed


Development

Successfully merging this pull request may close these issues.

Consolidate duplicate Person records and fix same-name member collisions
