Production (makeabilitylab.cs.washington.edu) is stuck on release 2.5.0 and will not advance, despite multiple tag pushes (2.8.2, 2.8.2.1). The deploy pipeline fires and rebuilds/restarts the container, but it keeps running old code because the git working tree on the prod host (grabthar) is not being updated to the new commit before the Docker build runs. Separately, we can no longer access the build logs (buildlog.txt) to debug this directly.
This was discovered while shipping the sitemap work (#1252, PRs #1307 / #1308 / #1310), which is fully validated on the test server but cannot reach production.
Symptoms
Prod admin shows 2.5.0, even though master is at 2.8.2 and tags 2.8.2/2.8.2.1 were pushed.
GET /sitemap.xml on prod → 404 (the route exists only in the new code).
GET /robots.txt on prod → still the old body (no Sitemap: line), even though the committed ./robots.txt changed several pushes ago.
After each tag push, prod briefly 502s and restarts, then comes back still on 2.5.0.
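The two HTTP symptoms above can be re-checked after each tag push with a small script. This is a minimal sketch, not part of the existing tooling: the `https://` scheme, the function names, and the "looks updated" rule (sitemap returns 200 and robots.txt contains a `Sitemap:` line) are assumptions for illustration; only the host and the two paths come from this issue.

```python
import urllib.error
import urllib.request
from typing import Tuple

# Host taken from this issue; the https scheme is an assumption.
PROD = "https://makeabilitylab.cs.washington.edu"


def robots_advertises_sitemap(body: str) -> bool:
    """True if any robots.txt line is a Sitemap: directive (case-insensitive)."""
    return any(
        line.strip().lower().startswith("sitemap:") for line in body.splitlines()
    )


def prod_looks_updated(sitemap_status: int, robots_body: str) -> bool:
    """Both symptoms cleared: /sitemap.xml resolves and robots.txt is the new body."""
    return sitemap_status == 200 and robots_advertises_sitemap(robots_body)


def fetch(url: str, timeout: float = 10.0) -> Tuple[int, str]:
    """Return (status, body). HTTP error statuses are returned, not raised."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        return err.code, ""


def main() -> None:
    """Hit prod once and print whether the new release appears to be live."""
    status, _ = fetch(PROD + "/sitemap.xml")
    _, robots = fetch(PROD + "/robots.txt")
    print("sitemap status:", status)
    print("prod looks updated:", prod_looks_updated(status, robots))
```

Calling `main()` a minute or two after a deploy (once the brief 502 window has passed) gives a quick yes/no without opening the admin page.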
Diagnosis: the prod working tree is not advancing
The deploy is running (the container rebuilds and restarts), but it builds stale source:
debug.log shows runserver restarting at the deploy times (e.g. Watching for file changes with StatReloader at 15:34 and 15:58), so docker compose up is recreating the container.
The most recent build did a fresh, uncached COPY . /code/ (BuildKit, image sha256:c4af85f2…), and prod still served the old release. So the build copied old source from grabthar's disk.
The static ./robots.txt is served by Apache straight from the checkout (no Docker involved). The committed file changed in PR #1308 ("Serve sitemap via static robots.txt; drop dead Django robots view (#1252)"), yet prod's /robots.txt is unchanged. A plain file in the repo changed and never appeared on the prod disk → the working tree itself is not updating.
Ruled out
❌ Our code / the tag / the webhook firing: webhook deliveries succeed, and the code is verified on the test server (-test).
❌ Docker layer caching — the latest build re-ran COPY . /code/ uncached and still produced old behavior.
❌ A build failure — builds complete successfully (Successfully built … / exporting to image … done).
❌ Disk / dependency errors — no errors in the build output.
Most likely cause
The webhook's git fetch / git checkout step for the makeabilitylab.cs.washington.edu checkout on grabthar is not advancing the working tree — e.g. a detached HEAD, local modifications blocking a pull/checkout, or it's pinned to an old ref. Target commit: 0bc9dca (tag 2.8.2.1).
This requires CSE IT (Jason Howe) — we have no SSH/admin access to grabthar (see docs/DEPLOYMENT.md → server access model).
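For whoever does have shell access to grabthar, the three candidate causes above map onto three read-only git queries. The following is a hedged sketch of that check, not the deploy script itself: the helper names are hypothetical, the checkout path must be supplied by the operator (it is not stated in this issue), and only the target commit 0bc9dca comes from this issue.

```python
import subprocess
from typing import List

TARGET = "0bc9dca"  # abbreviated commit for tag 2.8.2.1, from this issue


def diagnose(head_sha: str, on_branch: bool, porcelain: str,
             target: str = TARGET) -> List[str]:
    """Turn raw git state into a list of findings; an empty list means healthy."""
    findings = []
    if not head_sha.startswith(target):
        findings.append("HEAD is %s, expected %s" % (head_sha[:7], target))
    if not on_branch:
        # Normal if the deploy checks out tags, but `git pull` refuses to
        # run on a detached HEAD, so it matters if the script pulls.
        findings.append("detached HEAD")
    dirty = [line for line in porcelain.splitlines() if line.strip()]
    if dirty:
        findings.append("%d modified or untracked path(s)" % len(dirty))
    return findings


def git(repo: str, *args: str) -> subprocess.CompletedProcess:
    """Run a git command inside `repo` and capture its output as text."""
    return subprocess.run(["git", "-C", repo, *args],
                          capture_output=True, text=True)


def inspect(repo: str) -> List[str]:
    """Collect HEAD, branch state, and local changes for the given checkout."""
    head = git(repo, "rev-parse", "HEAD").stdout.strip()
    # `symbolic-ref -q HEAD` exits non-zero when HEAD is detached.
    on_branch = git(repo, "symbolic-ref", "-q", "HEAD").returncode == 0
    porcelain = git(repo, "status", "--porcelain").stdout
    return diagnose(head, on_branch, porcelain)
```

Running `inspect("/path/to/prod/checkout")` (placeholder path) before and after a webhook delivery would show whether the fetch/checkout step moves HEAD at all.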
We can no longer access the build logs
Debugging the above is hampered because the build log is effectively inaccessible to maintainers:
The documented web path /logs/buildlog.txt returns 403 Forbidden (and /logs/ is Shibboleth-gated / forbidden) on prod.
buildlog.txt is not on the shared CSE filesystem: it is not under /cse/web/research/makelab/www/ on recycle (a find over the makelab tree turns up nothing). Only debug.log (+ rotated debug.log.1) is mounted out to the shared filesystem. (docs/DEPLOYMENT.md was corrected for this in #1312, "docs: correct log locations in DEPLOYMENT.md (buildlog.txt not on shared FS)".)
The only reliable copy of the build output today is the deploy notification email sent on each push.
So when a prod deploy misbehaves, there's no self-serve way to read the build/deploy log — we have to rely on forwarded emails. We should ask CSE IT to either fix access to /logs/buildlog.txt or mount the build log to the shared filesystem (like debug.log).
Next steps
CSE IT (Jason Howe): fix the git fetch/checkout step on grabthar so the prod working tree advances to the tagged commit (0bc9dca / 2.8.2.1).
CSE IT: restore maintainer access to the build log — either un-forbid /logs/buildlog.txt or mount it to /cse/web/research/makelab/www/ alongside debug.log.
Once prod advances: re-run the sitemap crawl validation against production (all <loc> → 200, https, correct domain), then submit the sitemap in Google Search Console.
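The crawl validation in the last step (every <loc> returns 200, uses https, and points at the correct domain) could look roughly like the sketch below. It is an illustration under assumptions, not the validation already run on the test server: the function names are hypothetical, and it assumes the sitemap uses the standard sitemaps.org namespace.

```python
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urlparse

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"  # standard sitemap namespace
EXPECTED_HOST = "makeabilitylab.cs.washington.edu"    # prod host from this issue


def extract_locs(xml_text) -> List[str]:
    """Return every <loc> URL in a sitemap document (str or bytes)."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(NS + "loc") if el.text]


def url_problems(url: str, expected_host: str = EXPECTED_HOST) -> List[str]:
    """Static checks that need no network: scheme and domain."""
    parts = urlparse(url)
    problems = []
    if parts.scheme != "https":
        problems.append("not https")
    if parts.hostname != expected_host:
        problems.append("wrong domain")
    return problems


def http_status(url: str, timeout: float = 10.0) -> int:
    """GET the URL and return its status; HTTP error codes are returned, not raised."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def validate(sitemap_url: str) -> List[str]:
    """Fetch the sitemap and report each URL that fails any of the three checks."""
    with urllib.request.urlopen(sitemap_url, timeout=10.0) as resp:
        locs = extract_locs(resp.read())
    report = []
    for loc in locs:
        problems = url_problems(loc)
        status = http_status(loc)
        if status != 200:
            problems.append("status %d" % status)
        if problems:
            report.append("%s: %s" % (loc, ", ".join(problems)))
    return report
```

An empty list from `validate("https://makeabilitylab.cs.washington.edu/sitemap.xml")` would mean the sitemap is ready to submit in Google Search Console.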
References
Tags 2.8.2, 2.8.2.1 (commit 0bc9dca)
docs/DEPLOYMENT.md