You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Consolidate duplicate Person records and fix same-name member collisions
Related: #1206 (live 500 on same-name members), #582 (2019 cleanup discussion — policy now superseded)
Summary
Over ~15 years we've accumulated many duplicate Person records. "Duplicate" actually covers two different problems that need opposite fixes, and conflating them is why this has never been cleaned up:
True duplicates — one human with multiple Person rows (created by un-deduped imports / hand entry over the years). These should be merged into a single canonical record, reassigning all related objects.
This issue covers both: a safe, reviewable merge workflow for (1), and a de-collision + view-hardening fix for (2).
Background & root cause
Person.url_name is auto-derived in Person.save() from first_name + last_name (lowercased, accent-folded, punctuation-stripped), with a numeric-suffix collision loop (jonfroehlich, jonfroehlich2, …).
That collision loop is recent and only protects rows created/re-saved after it landed. Historical colliding rows predate it and have never been re-saved, so they still share a bare url_name (and some never-re-saved rows may still hold the model default 'placeholder'). This is why We should figure out how to support two people with the same name as members of Makeability Lab #1206 still fires despite the loop existing in the model today.
The member view "fix" noted in our codebase audit — switching get_object_or_404 to filter(...).order_by('-bio_datetime_modified').first() — masks the 500 but does not solve namesakes: for two distinct people it silently returns one and makes the other's page permanently unreachable. The correct fix is unique url_names, not picking a winner.
⚠️ Operating model / access constraints (read before designing)
The maintainer can SSH into prod and read files, but cannot run docker/management commands directly on prod and cannot do privileged file ops (much of the tree is apache:makelab-owned; those go through CSE IT). This constrains how anything runs against prod data:
Prod DB: PostgreSQL on grabthar.cs.washington.edu, reachable only via the recycle.cs.washington.edu jump host. Credentials live in config.ini on the prod server (not in git; mounted as a volume). A DBeaver/ssh -L tunnel through recycle is the intended direct-access path.
Deploys: push to master → auto-deploys test; push a tag (e.g. 2.1.0) → deploys production. Build logs at /logs/buildlog.txt on each server.
Migrations are gitignored and regenerated per-environment, so data migrations (RunPython) are unreliable here and must NOT be used. All data operations ship as management commands (the established pattern — see generate_slugs_for_old_news_items, remove_year_from_forum_name, etc.).
Therefore: do all analysis and the merge against a local Django pointed at the prod DB through the tunnel (preferred — output stays off the server), with the entrypoint-wired one-shot as a fallback if the tunnel is rejected by pg_hba. Anything written to MEDIA_ROOT on prod (/cse/web/research/makelab/www/media) is web-served — never write CSVs of personal data there (the stale public dumped_data.json is exactly this mistake).
The merge surface (do not hardcode — enumerate)
Merging duplicate B → canonical A must relocate every relation pointing at B. Known relations involving Person:
Relation
Type
Notes
Publication.authors
M2M (sorted, sort_value)
Author order matters (citations/BibTeX). Preserve order; dedup if A & B both on a pub.
Talk.authors
M2M (sorted)
Same care.
Poster.authors
M2M (sorted)
Same care. (All three are Artifact subclasses; see fix_sortedm2m_columns.py.)
Video.authors
M2M
Confirm related name / whether sorted.
Position.person
FK
ProjectRole.person
FK
News.author
FK (nullable)
Position.advisor
FK → Person
Easy to miss — B may be another person's advisor.
Position.co_advisor
FK → Person
Same.
Position.grad_mentor
FK → Person
Same.
Implementation requirement: walk relations generically via Person._meta.get_fields() (handle ManyToManyRel, ManyToOneRel, and the self-referential FKs) rather than a hardcoded list, so schema drift can't silently orphan data. For sorted M2Ms, preserve sort_value ordering.
Also on merge:
Scalar fields: fill A's blank fields from B (email, websites, bio, etc.); never overwrite a populated field on A.
Images: every Person gets a random Star Wars default image in save(), so "has image" is meaningless — detect the default by path and only treat a non-default upload as real. Note Person has a pre_delete signal that deletes its image file; ensure deleting B doesn't remove a file A now relies on (filenames are per-person, but verify).
Proposed approach
Phase A — Fix #1206 (independent, ship first via normal deploy)
A1. Recompute url_name for all rows so the existing collision logic de-collides historical duplicates. New management command (e.g. recompute_url_names) that re-derives and saves; idempotent. Optional enhancement per #1206: prefer a middle-initial differentiator for namesakes (jasminexzhang / jasminelzhang) over a bare 2 suffix, for readable, stable URLs.
A2. Harden website/views/member.py to resolve cleanly (404, not 500) if a collision ever recurs, instead of relying on .first().
A3. Regression test: /member/<url_name>/ returns 200 for each of two same-base-name people once de-collided (DB-touching test via DatabaseTestCase).
Phase A stops the active 500s and does not depend on the merge work below.
Phase B — Dedup cleanup (review-gated)
B1. Export command — export_people_for_dedup (read-only, CSV).
One row per Person. Columns:
total_refs = sum of all the above (0 ⇒ safe-to-delete shell)
Disambiguation aids: has_real_image (non-default, detect by path), earliest_position_date, latest_position_date
Strictly read-only (no .save()). Run locally via the tunnel; write CSV off-server.
B2. Local analysis (human-in-the-loop).
Cluster the CSV by accent-folded normalized (first, last) (don't rely on url_name). For each multi-row cluster, classify:
Auto / safe:total_refs == 0 shells → delete (nothing to reassign; equivalent to merge). Also high-confidence merges where one row is a shell and another is rich, or emails match exactly.
Review: same name, overlapping co-authors/projects/active-dates but no email match → likely same human, confirm by hand.
Keep separate: disjoint emails / disjoint pubs / different middle names → namesakes (Phase A gives them distinct URLs).
Cross-name same-person (e.g. the documented "Ji Hyuk Bae" = "Sean Bae" high-school→undergrad case): clustering can't find these; the decisions file must accept manually added source→target pairs.
Output: a reviewed decisions file (CSV/JSON) of explicit actions: merge into:<id> / keep / delete.
Reads the committed/reviewed decisions file; --dry-run prints the full plan and changes nothing (default to dry-run).
Each merge runs in transaction.atomic().
Generic relation walk (above); preserve sorted-M2M order; scalar backfill; image handling.
Idempotent and loud-logging (verification is via /logs/ over SSH, not a shell): after a merge the source id is gone, so re-runs are no-ops.
Run locally against the prod DB via the tunnel (preferred), or entrypoint-wired one-shot fallback, then removed in a follow-up commit.
Acceptance criteria
/member/jasminezhang/ (and any other historical collision) returns 200, with each distinct person reachable at a unique URL.
export_people_for_dedup produces the CSV above and performs zero writes.
merge_duplicate_people --dry-run reports the plan with no DB changes.
A real merge reassigns all relation types in the table above, including the three advisor self-refs, and preserves publication author order.
Merging is idempotent (safe to re-run).
manage.py test website passes, including new regression tests:
Phase A: two same-base-name people both resolve (200).
Merge preserves Publication.authors order.
Merge reassigns FK relations (Position/ProjectRole/News) and advisor self-refs.
--dry-run makes no changes; second run of an applied merge is a no-op.
Use the existing DatabaseTestCase helpers (make_person, make_publication, make_news_item) per CONTRIBUTING.md.
Out of scope / follow-ups
Prevention at the creation point (dedup check wherever Person rows are bulk-created/imported) — the real long-term fix so this doesn't re-accrue. Separate issue.
Bulk deletion of orphaned person-image files left after merges (existing delete_unused_files / thumbnail_cleanup commands should cover this).
Open questions
Confirm the exact redeploy/restart trigger and one-shot conventions against CLAUDE.md + docker-entrypoint.sh before relying on the entrypoint fallback.
Confirm Video.authors related name and whether it's sorted.
For Phase A1, do we want the middle-initial differentiator now, or just de-collide with the existing numeric-suffix logic and defer readability?
Consolidate duplicate
Personrecords and fix same-name member collisionsRelated: #1206 (live 500 on same-name members), #582 (2019 cleanup discussion — policy now superseded)
Summary
Over ~15 years we've accumulated many duplicate
Personrecords. "Duplicate" actually covers two different problems that need opposite fixes, and conflating them is why this has never been cleaned up:Personrows (created by un-deduped imports / hand entry over the years). These should be merged into a single canonical record, reassigning all related objects.url_namecausePerson.MultipleObjectsReturned→ HTTP 500 on/member/<url_name>/(this is We should figure out how to support two people with the same name as members of Makeability Lab #1206; the live example was twojasminezhangrows).This issue covers both: a safe, reviewable merge workflow for (1), and a de-collision + view-hardening fix for (2).
Background & root cause
Person.url_nameis auto-derived inPerson.save()fromfirst_name + last_name(lowercased, accent-folded, punctuation-stripped), with a numeric-suffix collision loop (jonfroehlich,jonfroehlich2, …).url_name(and some never-re-saved rows may still hold the model default'placeholder'). This is why We should figure out how to support two people with the same name as members of Makeability Lab #1206 still fires despite the loop existing in the model today.get_object_or_404tofilter(...).order_by('-bio_datetime_modified').first()— masks the 500 but does not solve namesakes: for two distinct people it silently returns one and makes the other's page permanently unreachable. The correct fix is uniqueurl_names, not picking a winner.The maintainer can SSH into prod and read files, but cannot run
docker/management commands directly on prod and cannot do privileged file ops (much of the tree isapache:makelab-owned; those go through CSE IT). This constrains how anything runs against prod data:grabthar.cs.washington.edu, reachable only via therecycle.cs.washington.edujump host. Credentials live inconfig.inion the prod server (not in git; mounted as a volume). A DBeaver/ssh -Ltunnel throughrecycleis the intended direct-access path.master→ auto-deploys test; push a tag (e.g.2.1.0) → deploys production. Build logs at/logs/buildlog.txton each server.RunPython) are unreliable here and must NOT be used. All data operations ship as management commands (the established pattern — seegenerate_slugs_for_old_news_items,remove_year_from_forum_name, etc.).pg_hba. Anything written toMEDIA_ROOTon prod (/cse/web/research/makelab/www/media) is web-served — never write CSVs of personal data there (the stale publicdumped_data.jsonis exactly this mistake).The merge surface (do not hardcode — enumerate)
Merging duplicate
B → canonical Amust relocate every relation pointing atB. Known relations involvingPerson:Implementation requirement: walk relations generically via
Person._meta.get_fields()(handleManyToManyRel,ManyToOneRel, and the self-referential FKs) rather than a hardcoded list, so schema drift can't silently orphan data. For sorted M2Ms, preservesort_valueordering.Also on merge:
save(), so "has image" is meaningless — detect the default by path and only treat a non-default upload as real. NotePersonhas apre_deletesignal that deletes its image file; ensure deleting B doesn't remove a file A now relies on (filenames are per-person, but verify).Proposed approach
Phase A — Fix #1206 (independent, ship first via normal deploy)
A1. Recompute
url_namefor all rows so the existing collision logic de-collides historical duplicates. New management command (e.g.recompute_url_names) that re-derives and saves; idempotent. Optional enhancement per #1206: prefer a middle-initial differentiator for namesakes (jasminexzhang/jasminelzhang) over a bare2suffix, for readable, stable URLs. A2. Hardenwebsite/views/member.pyto resolve cleanly (404, not 500) if a collision ever recurs, instead of relying on.first(). A3. Regression test:/member/<url_name>/returns 200 for each of two same-base-name people once de-collided (DB-touching test viaDatabaseTestCase).Phase B — Dedup cleanup (review-gated)
B1. Export command —
export_people_for_dedup(read-only, CSV). One row perPerson. Columns:id,first_name,middle_name,last_name,url_name,email,personal_website,github,linkedinpub_count,talk_count,poster_count,video_count,position_count,projectrole_count,news_authored_countadvisor_count,co_advisor_count,grad_mentor_count(i.e.Position.objects.filter(advisor=p)etc.)total_refs= sum of all the above (0⇒ safe-to-delete shell)has_real_image(non-default, detect by path),earliest_position_date,latest_position_dateStrictly read-only (no
.save()). Run locally via the tunnel; write CSV off-server.B2. Local analysis (human-in-the-loop). Cluster the CSV by accent-folded normalized
(first, last)(don't rely onurl_name). For each multi-row cluster, classify:total_refs == 0shells → delete (nothing to reassign; equivalent to merge). Also high-confidence merges where one row is a shell and another is rich, or emails match exactly.Output: a reviewed decisions file (CSV/JSON) of explicit actions:
merge into:<id>/keep/delete.B3. Merge command —
merge_duplicate_people(decisions-file-driven).--dry-runprints the full plan and changes nothing (default to dry-run).transaction.atomic()./logs/over SSH, not a shell): after a merge the source id is gone, so re-runs are no-ops.Acceptance criteria
/member/jasminezhang/(and any other historical collision) returns 200, with each distinct person reachable at a unique URL.export_people_for_dedupproduces the CSV above and performs zero writes.merge_duplicate_people --dry-runreports the plan with no DB changes.manage.py test websitepasses, including new regression tests:Publication.authorsorder.--dry-runmakes no changes; second run of an applied merge is a no-op.Use the existing
DatabaseTestCasehelpers (make_person,make_publication,make_news_item) perCONTRIBUTING.md.Out of scope / follow-ups
delete_unused_files/thumbnail_cleanupcommands should cover this).Open questions
CLAUDE.md+docker-entrypoint.shbefore relying on the entrypoint fallback.Video.authorsrelated name and whether it's sorted.