Realistic test data: factory_boy fixtures + curated prod media subset

## Context

Spinoff from #1268, which originally bundled a REST API with test-data sync. The actual driver was getting realistic content onto localhost so that UI/template changes can be reviewed against prod-like data. This issue tracks that work as a standalone effort with no API dependency.

## Background

Current test infrastructure (landed in #1267) uses `DatabaseTestCase` + small `make_person` / `make_publication` / `make_news_item` helpers in `website/tests.py`. These work for unit/integration tests but:

- Don't compose well for the M2M / `SortedManyToManyField` / `ProjectRole`-through graph.
- Don't include real media files, so `image_cropping` / `easy_thumbnails` / PDF preview code paths only ever see empty file fields.
- Don't produce a coherent dev environment for visual review.

## Recommendation — split into two layers

### Layer 1: factory_boy + Faker (Django standard)

- Add `factory_boy` and `Faker` to `requirements.txt`. De-facto Django standard for ~12 years; preferred over `model_bakery` here because explicit factories are easier to debug with our complex relationship graph.
- Create `website/tests/factories.py` with `PersonFactory`, `PublicationFactory`, `ProjectFactory`, `ProjectRoleFactory`, `TalkFactory`, `PosterFactory`, `VideoFactory`, `NewsItemFactory`, `AwardFactory`. Pattern:

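  ```python
  import factory
  from factory.django import DjangoModelFactory

  from website.models import Publication  # import path assumed

  class PublicationFactory(DjangoModelFactory):
      class Meta:
          model = Publication

      title = factory.Faker("sentence", nb_words=8)
      date = factory.Faker("date_between", start_date="-3y", end_date="today")
      # SEED_PDF_PATH: a curated PDF under website/tests/seed_media/
      pdf_file = factory.django.FileField(from_path=SEED_PDF_PATH)

      @factory.post_generation
      def authors(self, create, extracted, **kwargs):
          # M2M rows need a saved instance, so do nothing under build()
          if not create:
              return
          # Use authors passed as PublicationFactory(authors=[...]), else three new people
          self.authors.set(extracted or PersonFactory.create_batch(3))
  ```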

- Refactor existing `make_person` / `make_publication` / `make_news_item` helpers in `website/tests.py` to thin wrappers that delegate to the new factories (keeps existing tests green).
- Add `website/tests/seed_media/` with ~5 hand-curated representative files (one real PDF, a few JPGs at different aspect ratios, a project logo). Factory `FileField`s point at these so every test exercises real thumbnail / PDF / image-cropping code paths.
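
The factory pattern above references `SEED_PDF_PATH` without defining it. One way to resolve seed files is sketched below. The helper name `seed_path`, the `SEED_MEDIA_DIR` constant, and the relative default root are assumptions for illustration; inside `factories.py` the root would more likely be anchored to `Path(__file__).resolve().parent`.

```python
from pathlib import Path

# Assumed location, following the directory proposed in this issue.
SEED_MEDIA_DIR = Path("website") / "tests" / "seed_media"


def seed_path(filename: str, root: Path = SEED_MEDIA_DIR) -> str:
    """Return the path to a curated seed file, failing loudly if it is missing.

    Raising here surfaces a missing or renamed seed file at import time
    instead of as an obscure FileField error deep inside a test.
    """
    path = root / filename
    if not path.is_file():
        raise FileNotFoundError(f"Missing seed media file: {path}")
    return str(path)
```

A factory module could then set something like `SEED_PDF_PATH = seed_path("sample_paper.pdf")`, where the file name is hypothetical.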

### Layer 2: prod media subset + `seed_dev_data` command

- Add `python manage.py seed_dev_data` that uses the factories to build a coherent graph (~5 projects, ~15 people, ~20 pubs with cross-links, ~5 talks/posters, news items, awards). Idempotent. Run after `docker-compose up` for a realistic-shaped site with zero prod dependency.
- For prod-realistic media (when you actually want to _see_ what real content looks like): add `scripts/pull_prod_subset.sh` that rsyncs a curated subset from recycle.
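
For the "coherent graph" part of `seed_dev_data`, one option is to plan the cross-links deterministically before any factory is called, so every run wires up the same graph. The sketch below covers only that planning step in plain Python. The function name, the seed value, and the 1 to 4 authors per publication are assumptions; the counts come from the bullet above. Avoiding duplicate rows on re-run is a separate concern that the real command would still need to handle, for example with get-or-create lookups.

```python
import random

# Target sizes taken from this issue; everything else is illustrative.
SEED_COUNTS = {"projects": 5, "people": 15, "publications": 20}


def plan_seed_graph(seed=1272, counts=SEED_COUNTS):
    """Return a deterministic plan of cross-links between seeded objects.

    Objects are referred to by index; a management command would map each
    index to an instance created by the corresponding factory.
    """
    rng = random.Random(seed)  # local RNG, independent of global random state
    people = list(range(counts["people"]))
    projects = list(range(counts["projects"]))
    plan = []
    for pub in range(counts["publications"]):
        plan.append({
            "publication": pub,
            # Ordered, duplicate-free author list of 1 to 4 people
            "authors": rng.sample(people, k=rng.randint(1, 4)),
            "project": rng.choice(projects),
        })
    return plan
```

Because the plan depends only on `seed` and `counts`, two developers running the command get the same publication-to-author wiring, which keeps screenshots comparable during visual review.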

### Prod media size analysis (2026-06-14)

Total media tree at `/cse/web/research/makelab/www/media` on recycle: **56 GB / 13,130 files**.

| Folder | Size |
|---|---|
| `talks/` | 50 GB |
| `publications/` | 1.7 GB |
| `banner/` | 1.3 GB |
| `person/` | 837 MB |
| `news/` | 738 MB |
| `posters/` | 719 MB |
| `projects/` | 611 MB |
| `uploads/` | 323 MB |
| others | ~120 MB |

Whole-tree rsync is impractical (talks alone is 50 GB, mostly slide PDFs/videos). But `person/` + `projects/` + `publications/` + `posters/` ≈ 3.9 GB is the visual core of the site and is realistic to mirror.
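
The ≈ 3.9 GB figure can be checked directly from the table, treating 1.7 GB as 1,700 MB (decimal units are an assumption about how the sizes were reported):

```python
# Sizes in MB for the four folders proposed for mirroring.
SIZES_MB = {"person": 837, "projects": 611, "publications": 1700, "posters": 719}

total_mb = sum(SIZES_MB.values())
total_gb = total_mb / 1000

print(f"{total_mb} MB is about {total_gb:.1f} GB")  # 3867 MB is about 3.9 GB
```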

Optional refinement: pair the rsync with a `dumpdata` JSON of the matching DB rows so the local DB references real filenames. Gitignore both outputs.
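
A rough sketch of the subset pull, written in Python to match the factory example even though the proposed script is shell. It only builds the `rsync` argument lists and does not run them. The function name, the use of `recycle` as an SSH host alias, the local destination, and the dry-run default are all assumptions.

```python
REMOTE_ROOT = "/cse/web/research/makelab/www/media"
DEFAULT_FOLDERS = ("person", "projects", "publications", "posters")


def build_rsync_commands(host, dest, folders=DEFAULT_FOLDERS,
                         remote_root=REMOTE_ROOT, dry_run=True):
    """Return one rsync argument list per media folder.

    -a preserves the tree, -z compresses in transit, and --partial lets an
    interrupted transfer resume. Dry-run is the default so nothing is
    copied until the caller opts in.
    """
    commands = []
    for folder in folders:
        cmd = ["rsync", "-az", "--partial"]
        if dry_run:
            cmd.append("--dry-run")
        # Trailing slashes copy the folder's contents into the same-named local folder.
        cmd += [f"{host}:{remote_root}/{folder}/", f"{dest.rstrip('/')}/{folder}/"]
        commands.append(cmd)
    return commands
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the dry-run output looks right. Keeping one command per folder makes it easy to drop or add a folder without touching the rest.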

## Suggested first PR slice

1. Add `factory_boy` + `Faker` to `requirements.txt`.
2. Create `website/tests/factories.py` with core factories.
3. Add `website/tests/seed_media/` with curated files.
4. Migrate existing `make_*` helpers to delegate to factories.

Subsequent PRs: `seed_dev_data` management command, then `pull_prod_subset.sh`.

## Out of scope (parked in #1268)

REST API for external consumers. The conversation there concluded that the only known consumer is Jon's academic page ("Recent Pubs" list), which can be served by a single `JsonResponse` view rather than a full DRF API.
