Skip to content

Consolidate duplicate Person records and fix same-name member collisions #1275

@jonfroehlich

Description

@jonfroehlich

Consolidate duplicate Person records and fix same-name member collisions

Related: #1206 (live 500 on same-name members), #582 (2019 cleanup discussion — policy now superseded)

Summary

Over ~15 years we've accumulated many duplicate Person records. "Duplicate" actually covers two different problems that need opposite fixes, and conflating them is why this has never been cleaned up:

  1. True duplicates — one human with multiple Person rows (created by un-deduped imports / hand entry over the years). These should be merged into a single canonical record, reassigning all related objects.
  2. Genuine namesakes — two different people who share a name. These must stay separate, but they currently break the site: two members with the same derived url_name cause Person.MultipleObjectsReturned → HTTP 500 on /member/<url_name>/ (this is We should figure out how to support two people with the same name as members of Makeability Lab #1206; the live example was two jasminezhang rows).

This issue covers both: a safe, reviewable merge workflow for (1), and a de-collision + view-hardening fix for (2).

Background & root cause

  • Person.url_name is auto-derived in Person.save() from first_name + last_name (lowercased, accent-folded, punctuation-stripped), with a numeric-suffix collision loop (jonfroehlich, jonfroehlich2, …).
  • That collision loop is recent and only protects rows created/re-saved after it landed. Historical colliding rows predate it and have never been re-saved, so they still share a bare url_name (and some never-re-saved rows may still hold the model default 'placeholder'). This is why We should figure out how to support two people with the same name as members of Makeability Lab #1206 still fires despite the loop existing in the model today.
  • The member view "fix" noted in our codebase audit — switching get_object_or_404 to filter(...).order_by('-bio_datetime_modified').first()masks the 500 but does not solve namesakes: for two distinct people it silently returns one and makes the other's page permanently unreachable. The correct fix is unique url_names, not picking a winner.

⚠️ Operating model / access constraints (read before designing)

The maintainer can SSH into prod and read files, but cannot run docker/management commands directly on prod and cannot do privileged file ops (much of the tree is apache:makelab-owned; those go through CSE IT). This constrains how anything runs against prod data:

  • Prod DB: PostgreSQL on grabthar.cs.washington.edu, reachable only via the recycle.cs.washington.edu jump host. Credentials live in config.ini on the prod server (not in git; mounted as a volume). A DBeaver/ssh -L tunnel through recycle is the intended direct-access path.
  • Deploys: push to master → auto-deploys test; push a tag (e.g. 2.1.0) → deploys production. Build logs at /logs/buildlog.txt on each server.
  • Migrations are gitignored and regenerated per-environment, so data migrations (RunPython) are unreliable here and must NOT be used. All data operations ship as management commands (the established pattern — see generate_slugs_for_old_news_items, remove_year_from_forum_name, etc.).
  • Therefore: do all analysis and the merge against a local Django pointed at the prod DB through the tunnel (preferred — output stays off the server), with the entrypoint-wired one-shot as a fallback if the tunnel is rejected by pg_hba. Anything written to MEDIA_ROOT on prod (/cse/web/research/makelab/www/media) is web-served — never write CSVs of personal data there (the stale public dumped_data.json is exactly this mistake).

The merge surface (do not hardcode — enumerate)

Merging duplicate B → canonical A must relocate every relation pointing at B. Known relations involving Person:

Relation Type Notes
Publication.authors M2M (sorted, sort_value) Author order matters (citations/BibTeX). Preserve order; dedup if A & B both on a pub.
Talk.authors M2M (sorted) Same care.
Poster.authors M2M (sorted) Same care. (All three are Artifact subclasses; see fix_sortedm2m_columns.py.)
Video.authors M2M Confirm related name / whether sorted.
Position.person FK  
ProjectRole.person FK  
News.author FK (nullable)  
Position.advisor FK → Person Easy to miss — B may be another person's advisor.
Position.co_advisor FK → Person Same.
Position.grad_mentor FK → Person Same.

Implementation requirement: walk relations generically via Person._meta.get_fields() (handle ManyToManyRel, ManyToOneRel, and the self-referential FKs) rather than a hardcoded list, so schema drift can't silently orphan data. For sorted M2Ms, preserve sort_value ordering.

Also on merge:

  • Scalar fields: fill A's blank fields from B (email, websites, bio, etc.); never overwrite a populated field on A.
  • Images: every Person gets a random Star Wars default image in save(), so "has image" is meaningless — detect the default by path and only treat a non-default upload as real. Note Person has a pre_delete signal that deletes its image file; ensure deleting B doesn't remove a file A now relies on (filenames are per-person, but verify).

Proposed approach

Phase A — Fix #1206 (independent, ship first via normal deploy)

A1. Recompute url_name for all rows so the existing collision logic de-collides historical duplicates. New management command (e.g. recompute_url_names) that re-derives and saves; idempotent. Optional enhancement per #1206: prefer a middle-initial differentiator for namesakes (jasminexzhang / jasminelzhang) over a bare 2 suffix, for readable, stable URLs. A2. Harden website/views/member.py to resolve cleanly (404, not 500) if a collision ever recurs, instead of relying on .first(). A3. Regression test: /member/<url_name>/ returns 200 for each of two same-base-name people once de-collided (DB-touching test via DatabaseTestCase).

Phase A stops the active 500s and does not depend on the merge work below.

Phase B — Dedup cleanup (review-gated)

B1. Export command — export_people_for_dedup (read-only, CSV). One row per Person. Columns:

  • Identity: id, first_name, middle_name, last_name, url_name, email, personal_website, github, linkedin
  • Relation counts: pub_count, talk_count, poster_count, video_count, position_count, projectrole_count, news_authored_count
  • Advisor refs (block deletion): advisor_count, co_advisor_count, grad_mentor_count (i.e. Position.objects.filter(advisor=p) etc.)
  • total_refs = sum of all the above (0 ⇒ safe-to-delete shell)
  • Disambiguation aids: has_real_image (non-default, detect by path), earliest_position_date, latest_position_date

Strictly read-only (no .save()). Run locally via the tunnel; write CSV off-server.

B2. Local analysis (human-in-the-loop). Cluster the CSV by accent-folded normalized (first, last) (don't rely on url_name). For each multi-row cluster, classify:

  • Auto / safe: total_refs == 0 shells → delete (nothing to reassign; equivalent to merge). Also high-confidence merges where one row is a shell and another is rich, or emails match exactly.
  • Review: same name, overlapping co-authors/projects/active-dates but no email match → likely same human, confirm by hand.
  • Keep separate: disjoint emails / disjoint pubs / different middle names → namesakes (Phase A gives them distinct URLs).
  • Cross-name same-person (e.g. the documented "Ji Hyuk Bae" = "Sean Bae" high-school→undergrad case): clustering can't find these; the decisions file must accept manually added source→target pairs.

Output: a reviewed decisions file (CSV/JSON) of explicit actions: merge into:<id> / keep / delete.

B3. Merge command — merge_duplicate_people (decisions-file-driven).

  • Reads the committed/reviewed decisions file; --dry-run prints the full plan and changes nothing (default to dry-run).
  • Each merge runs in transaction.atomic().
  • Generic relation walk (above); preserve sorted-M2M order; scalar backfill; image handling.
  • Idempotent and loud-logging (verification is via /logs/ over SSH, not a shell): after a merge the source id is gone, so re-runs are no-ops.
  • Run locally against the prod DB via the tunnel (preferred), or entrypoint-wired one-shot fallback, then removed in a follow-up commit.

Acceptance criteria

  • /member/jasminezhang/ (and any other historical collision) returns 200, with each distinct person reachable at a unique URL.
  • export_people_for_dedup produces the CSV above and performs zero writes.
  • merge_duplicate_people --dry-run reports the plan with no DB changes.
  • A real merge reassigns all relation types in the table above, including the three advisor self-refs, and preserves publication author order.
  • Merging is idempotent (safe to re-run).
  • manage.py test website passes, including new regression tests:
    • Phase A: two same-base-name people both resolve (200).
    • Merge preserves Publication.authors order.
    • Merge reassigns FK relations (Position/ProjectRole/News) and advisor self-refs.
    • --dry-run makes no changes; second run of an applied merge is a no-op.

Use the existing DatabaseTestCase helpers (make_person, make_publication, make_news_item) per CONTRIBUTING.md.

Out of scope / follow-ups

  • Prevention at the creation point (dedup check wherever Person rows are bulk-created/imported) — the real long-term fix so this doesn't re-accrue. Separate issue.
  • Bulk deletion of orphaned person-image files left after merges (existing delete_unused_files / thumbnail_cleanup commands should cover this).

Open questions

  1. Confirm the exact redeploy/restart trigger and one-shot conventions against CLAUDE.md + docker-entrypoint.sh before relying on the entrypoint fallback.
  2. Confirm Video.authors related name and whether it's sorted.
  3. For Phase A1, do we want the middle-initial differentiator now, or just de-collide with the existing numeric-suffix logic and defer readability?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions