Context
Spinoff from #1268, which originally bundled the REST API with test-data sync. The actual driver was getting realistic content into localhost so that UI/template changes can be reviewed against prod-like data. This issue tracks that work as a standalone effort with no API dependency.
Background
Current test infrastructure (landed in #1267) uses DatabaseTestCase + small make_person / make_publication / make_news_item helpers in website/tests.py. These work for unit/integration tests but:
- Don't compose well for the M2M / SortedManyToManyField / ProjectRole-through graph.
- Don't include real media files, so image_cropping / easy_thumbnails / PDF preview code paths only ever see empty file fields.
- Don't produce a coherent dev environment for visual review.
Recommendation — split into two layers
Layer 1: factory_boy + Faker (Django standard)
- Add factory_boy and Faker to requirements.txt. De-facto Django standard for ~12 years; preferred over model_bakery here because explicit factories are easier to debug with our complex relationship graph.
- Create website/tests/factories.py with PersonFactory, PublicationFactory, ProjectFactory, ProjectRoleFactory, TalkFactory, PosterFactory, VideoFactory, NewsItemFactory, AwardFactory. Pattern:
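    import factory
    from factory.django import DjangoModelFactory

    # Publication, PersonFactory, and SEED_PDF_PATH are assumed to be defined or imported elsewhere in factories.py.
    class PublicationFactory(DjangoModelFactory):
        class Meta:
            model = Publication

        title = factory.Faker("sentence", nb_words=8)
        date = factory.Faker("date_between", start_date="-3y", end_date="today")
        pdf_file = factory.django.FileField(from_path=SEED_PDF_PATH)

        @factory.post_generation
        def authors(self, create, extracted, **kwargs):
            # Only touch the M2M once the instance has been saved.
            if create:
                self.authors.set(extracted or PersonFactory.create_batch(3))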
- Refactor existing make_person / make_publication / make_news_item helpers in website/tests.py to thin wrappers that delegate to the new factories (keeps existing tests green).
- Add website/tests/seed_media/ with ~5 hand-curated representative files (one real PDF, a few JPGs at different aspect ratios, a project logo). Factory FileFields point at these so every test exercises real thumbnail / PDF / image-cropping code paths.
Layer 2: prod media subset + seed_dev_data command
- Add python manage.py seed_dev_data that uses the factories to build a coherent graph (~5 projects, ~15 people, ~20 pubs with cross-links, ~5 talks/posters, news items, awards). Idempotent. Run after docker-compose up for a realistic-shaped site with zero prod dependency.
- For prod-realistic media (when you actually want to see what real content looks like): add scripts/pull_prod_subset.sh that rsyncs a curated subset from recycle.
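One possible way to get the idempotency asked for above is to derive the whole graph from a fixed random seed and stable keys, so that a rerun describes the same rows instead of new ones. The sketch below is stdlib-only and covers just the planning step; `SeedPlan`, `build_plan`, and the slug format are hypothetical names, and whether the real models have a suitable unique key is an assumption. A real command would still need to look rows up by such a key before creating them (for example with Django's `get_or_create`, or factory_boy's `django_get_or_create` Meta option).

```python
import random
from dataclasses import dataclass, field


@dataclass
class SeedPlan:
    people: list = field(default_factory=list)
    projects: list = field(default_factory=list)
    # Maps a publication slug to (author slugs, project slug).
    publications: dict = field(default_factory=dict)


def build_plan(seed=0, n_people=15, n_projects=5, n_pubs=20):
    """Build the same cross-linked plan on every call with the same seed."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    plan = SeedPlan(
        people=[f"person-{i:02d}" for i in range(n_people)],
        projects=[f"project-{i:02d}" for i in range(n_projects)],
    )
    for i in range(n_pubs):
        # Each publication gets 2 to 4 distinct authors and one project.
        authors = tuple(rng.sample(plan.people, k=rng.randint(2, 4)))
        plan.publications[f"pub-{i:02d}"] = (authors, rng.choice(plan.projects))
    return plan
```

Because the plan is plain data, the command can be checked for repeatability (two calls with the same seed compare equal) without touching the database.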
Prod media size analysis (2026-06-14)
Total media tree at /cse/web/research/makelab/www/media on recycle: 56 GB / 13,130 files.
| Folder | Size |
|---|---|
| talks/ | 50 GB |
| publications/ | 1.7 GB |
| banner/ | 1.3 GB |
| person/ | 837 MB |
| posters/ | 719 MB |
| projects/ | 611 MB |
| news/ | 738 MB |
| uploads/ | 323 MB |
| others | ~120 MB |
Whole-tree rsync is impractical (talks alone is 50 GB, mostly slide PDFs/videos). But person/ + projects/ + publications/ + posters/ ≈ 3.9 GB is the visual core of the site and is realistic to mirror.
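The ≈ 3.9 GB figure can be checked directly from the table. This is a small sketch of that arithmetic, assuming the sizes are decimal units (1 GB = 1000 MB) and treating the "~120 MB" row as exactly 120 MB.

```python
# Folder sizes copied from the table, in MB (assumption: 1 GB = 1000 MB).
SIZES_MB = {
    "talks": 50_000,
    "publications": 1_700,
    "banner": 1_300,
    "person": 837,
    "posters": 719,
    "projects": 611,
    "news": 738,
    "uploads": 323,
    "others": 120,  # listed as "~120 MB"
}

VISUAL_CORE = ("person", "projects", "publications", "posters")


def total_gb(folders):
    """Sum the listed folders and convert MB to GB."""
    return sum(SIZES_MB[name] for name in folders) / 1000


subset_gb = total_gb(VISUAL_CORE)
whole_tree_gb = total_gb(SIZES_MB)
print(f"visual core: {subset_gb:.1f} GB of {whole_tree_gb:.0f} GB")
# prints "visual core: 3.9 GB of 56 GB"
```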
Optional refinement: pair the rsync with a dumpdata JSON of the matching DB rows so the local DB references real filenames. Gitignore both outputs.
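The curated pull could be prototyped as below before it is written as a shell script. This is only a sketch: it builds the rsync command lines and prints them without running anything. The remote path is the one from the analysis above; treating "recycle" as an SSH host alias, the local `media` destination, the flag choices, and the name `build_rsync_commands` are all assumptions.

```python
import shlex

# Assumptions: "recycle" resolves as an SSH host, and ./media already exists locally.
REMOTE_HOST = "recycle"
REMOTE_MEDIA = "/cse/web/research/makelab/www/media"
SUBSET = ("person", "projects", "publications", "posters")


def build_rsync_commands(dest="media", dry_run=True):
    """Return one rsync argv list per curated folder, without running anything."""
    commands = []
    for folder in SUBSET:
        argv = ["rsync", "--archive", "--compress", "--partial"]
        if dry_run:
            argv.append("--dry-run")
        # Trailing slashes copy the folder's contents into the matching local folder.
        argv += [f"{REMOTE_HOST}:{REMOTE_MEDIA}/{folder}/", f"{dest}/{folder}/"]
        commands.append(argv)
    return commands


for argv in build_rsync_commands():
    print(shlex.join(argv))
```

Defaulting to `--dry-run` keeps a first run from copying several GB by accident; each argv list could later be passed to `subprocess.run(argv, check=True)` with `dry_run=False`.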
Suggested first PR slice
- Add factory_boy + Faker to requirements.txt.
- Create website/tests/factories.py with core factories.
- Add website/tests/seed_media/ with curated files.
- Migrate existing make_* helpers to delegate to factories.
Subsequent PRs: seed_dev_data management command, then pull_prod_subset.sh.
Out of scope (parked in #1268)
REST API for external consumers. Conversation concluded the only known consumer is Jon's academic page ("Recent Pubs" list), which can be served by a single JsonResponse view rather than a full DRF API.