Consolidate duplicate Person records and fix same-name member collisions

<html><head></head><body><h1>Consolidate duplicate <code>Person</code> records and fix same-name member collisions</h1>
Related: #1206 (live 500 on same-name members), #582 (2019 cleanup discussion — policy now superseded)
<h2>Summary</h2>
Over ~15 years we've accumulated many duplicate <code>Person</code> records. "Duplicate" actually covers two different problems that need opposite fixes, and conflating them is why this has never been cleaned up:
<ol>
<li>True duplicates — one human with multiple <code>Person</code> rows (created by un-deduped imports / hand entry over the years). These should be merged into a single canonical record, reassigning all related objects.</li>
<li>Genuine namesakes — two different people who share a name. These must stay separate, but they currently break the site: two members with the same derived <code>url_name</code> cause <code>Person.MultipleObjectsReturned</code> → HTTP 500 on <code>/member/&lt;url_name&gt;/</code> (this is #1206; the live example was two <code>jasminezhang</code> rows).</li>
</ol>
This issue covers both: a safe, reviewable merge workflow for (1), and a de-collision + view-hardening fix for (2).
<h2>Background &amp; root cause</h2>
<ul>
<li><code>Person.url_name</code> is auto-derived in <code>Person.save()</code> from <code>first_name + last_name</code> (lowercased, accent-folded, punctuation-stripped), with a numeric-suffix collision loop (<code>jonfroehlich</code>, <code>jonfroehlich2</code>, …).</li>
<li>That collision loop is recent and only protects rows created/re-saved after it landed. Historical colliding rows predate it and have never been re-saved, so they still share a bare <code>url_name</code> (and some never-re-saved rows may still hold the model default <code>'placeholder'</code>). This is why #1206 still fires despite the loop existing in the model today.</li>
<li>The member view "fix" noted in our codebase audit — switching <code>get_object_or_404</code> to <code>filter(...).order_by('-bio_datetime_modified').first()</code> — masks the 500 but does not solve namesakes: for two distinct people it silently returns one and makes the other's page permanently unreachable. The correct fix is unique <code>url_name</code>s, not picking a winner.</li>
</ul>
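The derivation described above can be sketched as follows. This is a minimal illustration of the stated behaviour (lowercase, accent-fold, strip punctuation, then a numeric-suffix loop), not the code in <code>Person.save()</code>; the function names and the <code>taken</code> set, which stands in for a database uniqueness check, are assumptions.

```python
import re
import unicodedata


def base_url_name(first_name: str, last_name: str) -> str:
    """Lowercase, accent-fold, and keep only a-z and 0-9."""
    raw = f"{first_name}{last_name}".lower()
    # NFKD splits an accented letter into base letter + combining mark;
    # encoding to ASCII with errors="ignore" then drops the marks.
    folded = unicodedata.normalize("NFKD", raw).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]", "", folded)


def unique_url_name(first_name: str, last_name: str, taken: set) -> str:
    """Return the bare name if it is free, else the first free numeric suffix (2, 3, ...)."""
    base = base_url_name(first_name, last_name)
    candidate, suffix = base, 2
    while candidate in taken:
        candidate = f"{base}{suffix}"
        suffix += 1
    return candidate
```

The loop only helps names that pass through it. Two stored rows that already share a bare name stay in collision until each is derived again against the set of names in use, which matches the historical-row problem described above.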
<h2>⚠️ Operating model / access constraints (read before designing)</h2>
The maintainer can SSH into prod and read files, but cannot run <code>docker</code>/management commands directly on prod and cannot do privileged file ops (much of the tree is <code>apache:makelab</code>-owned; those go through CSE IT). This constrains how anything runs against prod data:
<ul>
<li>Prod DB: PostgreSQL on <code>grabthar.cs.washington.edu</code>, reachable only via the <code>recycle.cs.washington.edu</code> jump host. Credentials live in <code>config.ini</code> on the prod server (not in git; mounted as a volume). A DBeaver/<code>ssh -L</code> tunnel through <code>recycle</code> is the intended direct-access path.</li>
<li>Deploys: push to <code>master</code> → auto-deploys test; push a tag (e.g. <code>2.1.0</code>) → deploys production. Build logs at <code>/logs/buildlog.txt</code> on each server.</li>
<li>Migrations are gitignored and regenerated per-environment, so data migrations (<code>RunPython</code>) are unreliable here and must NOT be used. All data operations ship as management commands (the established pattern — see <code>generate_slugs_for_old_news_items</code>, <code>remove_year_from_forum_name</code>, etc.).</li>
<li>Therefore: do all analysis and the merge against a local Django pointed at the prod DB through the tunnel (preferred — output stays off the server), with the entrypoint-wired one-shot as a fallback if the tunnel is rejected by <code>pg_hba</code>. Anything written to <code>MEDIA_ROOT</code> on prod (<code>/cse/web/research/makelab/www/media</code>) is web-served — never write CSVs of personal data there (the stale public <code>dumped_data.json</code> is exactly this mistake).</li>
</ul>
<h2>The merge surface (do not hardcode — enumerate)</h2>
Merging duplicate <code>B → canonical A</code> must relocate every relation pointing at <code>B</code>. Known relations involving <code>Person</code>:

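<table>
<thead>
<tr><th>Relation</th><th>Type</th><th>Notes</th></tr>
</thead>
<tbody>
<tr><td><code>Publication.authors</code></td><td>M2M (sorted, <code>sort_value</code>)</td><td>Author order matters (citations/BibTeX). Preserve order; dedup if A &amp; B both on a pub.</td></tr>
<tr><td><code>Talk.authors</code></td><td>M2M (sorted)</td><td>Same care.</td></tr>
<tr><td><code>Poster.authors</code></td><td>M2M (sorted)</td><td>Same care. (All three are <code>Artifact</code> subclasses; see <code>fix_sortedm2m_columns.py</code>.)</td></tr>
<tr><td><code>Video.authors</code></td><td>M2M</td><td>Confirm related name / whether sorted.</td></tr>
<tr><td><code>Position.person</code></td><td>FK</td><td></td></tr>
<tr><td><code>ProjectRole.person</code></td><td>FK</td><td></td></tr>
<tr><td><code>News.author</code></td><td>FK (nullable)</td><td></td></tr>
<tr><td><code>Position.advisor</code></td><td>FK → Person</td><td>Easy to miss: B may be another person's advisor.</td></tr>
<tr><td><code>Position.co_advisor</code></td><td>FK → Person</td><td>Same.</td></tr>
<tr><td><code>Position.grad_mentor</code></td><td>FK → Person</td><td>Same.</td></tr>
</tbody>
</table>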
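The "preserve order; dedup if A &amp; B both on a pub" note for the sorted author lists can be sketched on a plain ordered list of person ids. The function name is hypothetical, and keeping the earlier of the two positions when both A and B appear is an assumption; the requirement above only says to dedup.

```python
def merge_author_order(author_ids: list, source_id: int, target_id: int) -> list:
    """Replace source_id (B) with target_id (A), keeping each author's earliest position."""
    merged, seen = [], set()
    for pid in author_ids:
        pid = target_id if pid == source_id else pid
        if pid not in seen:  # drops the second occurrence when A and B were both listed
            seen.add(pid)
            merged.append(pid)
    return merged
```

For a publication with authors <code>[7, 3, 9]</code>, merging 3 into 1 gives <code>[7, 1, 9]</code>; for <code>[3, 7, 1]</code> it gives <code>[1, 7]</code>. In the real command the resulting order would then be written back through the sorted M2M so that <code>sort_value</code> reflects it.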


Implementation requirement: walk relations generically via <code>Person._meta.get_fields()</code> (handle <code>ManyToManyRel</code>, <code>ManyToOneRel</code>, and the self-referential FKs) rather than a hardcoded list, so schema drift can't silently orphan data. For sorted M2Ms, preserve <code>sort_value</code> ordering.
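A minimal sketch of that generic walk, using only attributes Django exposes on the objects returned by <code>_meta.get_fields()</code>. The function name and the returned tuple shape are assumptions, and the function is written duck-typed so it takes the model class as an argument rather than importing the project.

```python
def incoming_relations(model):
    """List (kind, declaring_model_name, field_name) for each relation pointing at `model`."""
    found = []
    for field in model._meta.get_fields():
        # Reverse relations are auto-created and have no column on this model.
        # (An auto-created primary key is also auto_created, but it is concrete.)
        if not field.auto_created or field.concrete:
            continue
        if field.many_to_many:
            kind = "m2m"  # ManyToManyRel, e.g. Publication.authors
        elif field.one_to_many or field.one_to_one:
            kind = "fk"  # ManyToOneRel / OneToOneRel, e.g. Position.advisor
        else:
            continue
        # field.related_model is the model that declares the relation;
        # field.field is the forward field on that model.
        found.append((kind, field.related_model.__name__, field.field.name))
    return sorted(found)
```

Calling <code>incoming_relations(Person)</code> should list <code>Position.advisor</code>, <code>Position.co_advisor</code>, and <code>Position.grad_mentor</code> alongside <code>Position.person</code>, because each is a separate reverse relation. One caveat worth checking: <code>get_fields()</code> omits hidden reverse relations (those whose <code>related_name</code> ends in <code>+</code>) unless called with <code>include_hidden=True</code>, so confirm whether any relation to <code>Person</code> is declared that way.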
Also on merge:
<ul>
<li>Scalar fields: fill A's blank fields from B (email, websites, bio, etc.); never overwrite a populated field on A.</li>
<li>Images: every Person gets a random Star Wars default image in <code>save()</code>, so "has image" is meaningless — detect the default by path and only treat a non-default upload as real. Note <code>Person</code> has a <code>pre_delete</code> signal that deletes its image file; ensure deleting B doesn't remove a file A now relies on (filenames are per-person, but verify).</li>
</ul>
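The two rules above can be sketched on plain dictionaries. The function names are hypothetical, and <code>default_prefix</code> stands in for wherever the default images actually live, which is not stated here and needs to be confirmed against the model.

```python
def backfill_blank_fields(target: dict, source: dict, fields: list) -> dict:
    """Return the values that would fill target's blank fields from source.

    Populated fields on the target (A) are never overwritten.
    """
    changes = {}
    for name in fields:
        if not target.get(name) and source.get(name):
            changes[name] = source[name]
    return changes


def has_real_image(image_path, default_prefix: str) -> bool:
    """True only for a non-empty path outside the (assumed) default-image directory."""
    return bool(image_path) and not image_path.startswith(default_prefix)
```

Returning the change set rather than mutating the target keeps the same function usable for a dry-run report and for the real write.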
<h2>Proposed approach</h2>
<h3>Phase A — Fix #1206 (independent, ship first via normal deploy)</h3>
<ul>
<li>A1. Recompute <code>url_name</code> for all rows so the existing collision logic de-collides historical duplicates. New management command (e.g. <code>recompute_url_names</code>) that re-derives and saves; idempotent. Optional enhancement per #1206: prefer a middle-initial differentiator for namesakes (<code>jasminexzhang</code> / <code>jasminelzhang</code>) over a bare <code>2</code> suffix, for readable, stable URLs.</li>
<li>A2. Harden <code>website/views/member.py</code> to resolve cleanly (404, not 500) if a collision ever recurs, instead of relying on <code>.first()</code>.</li>
<li>A3. Regression test: <code>/member/&lt;url_name&gt;/</code> returns 200 for each of two same-base-name people once de-collided (DB-touching test via <code>DatabaseTestCase</code>).</li>
</ul>
<blockquote>
Phase A stops the active 500s and does not depend on the merge work below.
</blockquote>
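The optional middle-initial differentiator from A1 could look roughly like the sketch below. It is one possible policy, not a decided one: every member of a colliding group that has a middle name gets the initial form, and anything still colliding falls back to a numeric suffix. The function name and the input shape (already lowercased, ASCII-folded name parts) are assumptions.

```python
from collections import defaultdict


def assign_url_names(people: list) -> dict:
    """Map person id -> unique url_name.

    `people` is a list of (id, first, middle, last) tuples, already normalized.
    """
    groups = defaultdict(list)
    for pid, first, middle, last in sorted(people):  # sort by id so reruns are deterministic
        groups[(first, last)].append((pid, middle))

    assigned, taken = {}, set()
    for (first, last), members in groups.items():
        for pid, middle in members:
            if len(members) > 1 and middle:
                stem = f"{first}{middle[0]}{last}"  # namesake with a middle name
            else:
                stem = f"{first}{last}"
            candidate, suffix = stem, 2
            while candidate in taken:  # numeric fallback, as in the existing loop
                candidate = f"{stem}{suffix}"
                suffix += 1
            taken.add(candidate)
            assigned[pid] = candidate
    return assigned
```

Because the output depends only on the input rows and their ids, running it twice yields the same mapping, which is the idempotence A1 asks for.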
<h3>Phase B — Dedup cleanup (review-gated)</h3>
B1. Export command — <code>export_people_for_dedup</code> (read-only, CSV).
One row per <code>Person</code>. Columns:
<ul>
<li>Identity: <code>id</code>, <code>first_name</code>, <code>middle_name</code>, <code>last_name</code>, <code>url_name</code>, <code>email</code>, <code>personal_website</code>, <code>github</code>, <code>linkedin</code></li>
<li>Relation counts: <code>pub_count</code>, <code>talk_count</code>, <code>poster_count</code>, <code>video_count</code>, <code>position_count</code>, <code>projectrole_count</code>, <code>news_authored_count</code></li>
<li>Advisor refs (block deletion): <code>advisor_count</code>, <code>co_advisor_count</code>, <code>grad_mentor_count</code> (i.e. <code>Position.objects.filter(advisor=p)</code> etc.)</li>
<li><code>total_refs</code> = sum of the relation counts and advisor refs above (<code>0</code> ⇒ safe-to-delete shell)</li>
<li>Disambiguation aids: <code>has_real_image</code> (non-default, detect by path), <code>earliest_position_date</code>, <code>latest_position_date</code></li>
</ul>
Strictly read-only (no <code>.save()</code>). Run locally via the tunnel; write CSV off-server.
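As a sketch of the row shape, the writer below takes already-computed per-person dictionaries and adds <code>total_refs</code>. The column names come from the list above; the function name and the idea of passing rows in as dictionaries are assumptions, and how the counts are queried is left to the real command.

```python
import csv
import io

IDENTITY = ["id", "first_name", "middle_name", "last_name", "url_name",
            "email", "personal_website", "github", "linkedin"]
COUNTS = ["pub_count", "talk_count", "poster_count", "video_count",
          "position_count", "projectrole_count", "news_authored_count",
          "advisor_count", "co_advisor_count", "grad_mentor_count"]
AIDS = ["has_real_image", "earliest_position_date", "latest_position_date"]


def export_rows(rows, stream) -> None:
    """Write one CSV line per person dict; missing values are left blank."""
    writer = csv.DictWriter(stream, fieldnames=IDENTITY + COUNTS + ["total_refs"] + AIDS,
                            extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        out = dict(row)  # copy, so the caller's data is not modified
        out["total_refs"] = sum(int(row.get(col) or 0) for col in COUNTS)
        writer.writerow(out)
```

Writing to a caller-supplied stream makes it easy to send the output to a local file or to <code>io.StringIO()</code> in a test, and keeps the choice of destination (never <code>MEDIA_ROOT</code>) outside the function.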
B2. Local analysis (human-in-the-loop).
Cluster the CSV by accent-folded normalized <code>(first, last)</code> (don't rely on <code>url_name</code>). For each multi-row cluster, classify:
<ul>
<li>Auto / safe: <code>total_refs == 0</code> shells → delete (nothing to reassign; equivalent to merge). Also high-confidence merges where one row is a shell and another is rich, or emails match exactly.</li>
<li>Review: same name, overlapping co-authors/projects/active-dates but no email match → likely same human, confirm by hand.</li>
<li>Keep separate: disjoint emails / disjoint pubs / different middle names → namesakes (Phase A gives them distinct URLs).</li>
<li>Cross-name same-person (e.g. the documented "Ji Hyuk Bae" = "Sean Bae" high-school→undergrad case): clustering can't find these; the decisions file must accept manually added source→target pairs.</li>
</ul>
Output: a reviewed decisions file (CSV/JSON) of explicit actions: <code>merge into:&lt;id&gt;</code> / <code>keep</code> / <code>delete</code>.
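The clustering step can be sketched over the exported rows as below. It only surfaces candidates and two of the signals mentioned above (zero-reference shells and exact email matches); the classification itself stays with the human reviewer. Function names and the report shape are assumptions.

```python
import unicodedata
from collections import defaultdict


def name_key(first: str, last: str) -> tuple:
    """Accent-folded, lowercased, punctuation-free (first, last) key."""
    def fold(text):
        decomposed = unicodedata.normalize("NFKD", text.strip().lower())
        # Combining marks and punctuation are not alphanumeric, so they drop out.
        return "".join(ch for ch in decomposed if ch.isalnum())
    return (fold(first), fold(last))


def cluster_and_flag(rows) -> dict:
    """Group rows by name_key and report only clusters with more than one row."""
    clusters = defaultdict(list)
    for row in rows:
        clusters[name_key(row["first_name"], row["last_name"])].append(row)

    report = {}
    for key, members in clusters.items():
        if len(members) < 2:
            continue
        emails = [m["email"].lower() for m in members if m.get("email")]
        report[key] = {
            "ids": [m["id"] for m in members],
            "shells": [m["id"] for m in members if int(m["total_refs"]) == 0],
            "email_match": len(emails) != len(set(emails)),
        }
    return report
```

Cross-name cases such as the "Ji Hyuk Bae" / "Sean Bae" pair never share a key, so they will not appear in this report and still need to be added to the decisions file by hand.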
B3. Merge command — <code>merge_duplicate_people</code> (decisions-file-driven).
<ul>
<li>Reads the committed/reviewed decisions file; <code>--dry-run</code> prints the full plan and changes nothing (default to dry-run).</li>
<li>Each merge runs in <code>transaction.atomic()</code>.</li>
<li>Generic relation walk (above); preserve sorted-M2M order; scalar backfill; image handling.</li>
<li>Idempotent: after a merge the source id is gone, so re-runs are no-ops. Logs loudly, because verification is via <code>/logs/</code> over SSH, not a shell.</li>
<li>Run locally against the prod DB via the tunnel (preferred) or, as a fallback, as an entrypoint-wired one-shot that is removed in a follow-up commit.</li>
</ul>
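The control flow of the merge command (parse the decisions, default to dry-run, skip sources that are already gone) can be sketched without the database. Here a set of ids stands in for the <code>Person</code> table; the function names, the <code>apply</code> parameter, and the log wording are assumptions. In the real command each applied action would reassign relations and delete the source inside <code>transaction.atomic()</code>.

```python
import re

ACTION_RE = re.compile(r"^(keep|delete|merge into:(\d+))$")


def parse_action(text: str) -> tuple:
    """Return ("keep" | "delete" | "merge", target_id or None); reject anything else."""
    match = ACTION_RE.match(text.strip())
    if not match:
        raise ValueError(f"unrecognised action: {text!r}")
    if match.group(2):
        return ("merge", int(match.group(2)))
    return (match.group(1), None)


def run_decisions(decisions, store: set, apply: bool = False) -> list:
    """Process (source_id, action_text) pairs and return the log lines.

    A dry run simulates on a copy of `store`, so the plan matches what a
    real run would do while the original set is left untouched.
    """
    ids = store if apply else set(store)
    label = "APPLY" if apply else "DRY-RUN"
    log = []
    for source_id, text in decisions:
        action, target_id = parse_action(text)
        if action == "keep":
            continue
        if source_id not in ids:
            log.append(f"skip {source_id}: already gone")  # makes re-runs no-ops
            continue
        if action == "merge" and (target_id == source_id or target_id not in ids):
            log.append(f"skip {source_id}: bad target {target_id}")
            continue
        if action == "merge":
            log.append(f"{label} merge {source_id} -> {target_id}")
        else:
            log.append(f"{label} delete {source_id}")
        ids.discard(source_id)
    return log
```

Simulating removals during the dry run matters for chained decisions: if one row merges into a row that an earlier line already removed, the plan reports the bad target up front instead of only failing during the real run.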
<h2>Acceptance criteria</h2>
<ul>
<li>[ ] <code>/member/jasminezhang/</code> (and any other historical collision) returns 200, with each distinct person reachable at a unique URL.</li>
<li>[ ] <code>export_people_for_dedup</code> produces the CSV above and performs zero writes.</li>
<li>[ ] <code>merge_duplicate_people --dry-run</code> reports the plan with no DB changes.</li>
<li>[ ] A real merge reassigns all relation types in the table above, including the three advisor self-refs, and preserves publication author order.</li>
<li>[ ] Merging is idempotent (safe to re-run).</li>
<li>[ ] <code>manage.py test website</code> passes, including new regression tests:
<ul>
<li>[ ] Phase A: two same-base-name people both resolve (200).</li>
<li>[ ] Merge preserves <code>Publication.authors</code> order.</li>
<li>[ ] Merge reassigns FK relations (Position/ProjectRole/News) and advisor self-refs.</li>
<li>[ ] <code>--dry-run</code> makes no changes; second run of an applied merge is a no-op.</li>
</ul>
</li>
</ul>
Use the existing <code>DatabaseTestCase</code> helpers (<code>make_person</code>, <code>make_publication</code>, <code>make_news_item</code>) per <code>CONTRIBUTING.md</code>.
<h2>Out of scope / follow-ups</h2>
<ul>
<li>Prevention at the creation point (dedup check wherever Person rows are bulk-created/imported) — the real long-term fix so this doesn't re-accrue. Separate issue.</li>
<li>Bulk deletion of orphaned person-image files left after merges (existing <code>delete_unused_files</code> / <code>thumbnail_cleanup</code> commands should cover this).</li>
</ul>
<h2>Open questions</h2>
<ol>
<li>Confirm the exact redeploy/restart trigger and one-shot conventions against <code>CLAUDE.md</code> + <code>docker-entrypoint.sh</code> before relying on the entrypoint fallback.</li>
<li>Confirm <code>Video.authors</code> related name and whether it's sorted.</li>
<li>For Phase A1, do we want the middle-initial differentiator now, or just de-collide with the existing numeric-suffix logic and defer readability?</li>
</ol></body></html>


Consolidate duplicate Person records and fix same-name member collisions #1275
