KBNLresearch/SaveToWaybackMachine

SaveToWaybackMachine

Scripts and data for archiving KB-managed websites to the Internet Archive's Wayback Machine.

Maintained by KB, national library of the Netherlands

Website

View the live site →

This repository has a companion GitHub Pages website with screenshot galleries, interactive navigation, and comprehensive documentation.

Purpose

Some websites managed by the KB have been discontinued. To preserve their content for Wikipedia sourcing and cultural heritage purposes, the KB actively archives websites to the Wayback Machine at web.archive.org.

Archived sites

Site | Archive date | # URLs | Link to dataset (.tsv, .txt, .xlsx)
Medieval Manuscripts in Dutch Collections (catalog records) | Apr 2026 | 11,738 | Excel file (sheets catalog-pages and catalog-pages-full-metadata)
Medieval Manuscripts in Dutch Collections (static pages, PDFs, assets) | Dec 2025 | 466 | Excel file (sheet non-catalog-pages)
Medieval Illuminated Manuscripts (manuscripts.kb.nl) | Dec 2025 | 7,460 | Excel file
kb.nl (new) | Mar 2022 | 1,915 | Excel file and CSV
Literatuurgeschiedenis.org | Mar 2022 | 465 | Excel file and CSV
kb.nl (old) | Dec 2021 | 5,720 | Excel file and CSV
Literatuurplein.nl | Dec 2019 | 69,599 | See this Data overview
Gidsvoornederland.nl | Nov 2018 | 1,300 | TXT
Literaireprijzen.nl | Oct 2018 | 452 | TXT
Lezenvoordelijst.nl | Aug 2018 | 12,456 | TXT
Leesplein.nl | Jun 2018 | 23,785 | TXT

Stories

Read the stories behind some of these archiving projects — narratives of how (parts of) KB websites were rescued from the digital memory hole, and the role AI assistants played along the way.

How this site was built

This project was transformed in December 2025 through an intensive AI-human collaboration:

  • 10+ hours of development across Dec 2-3, 2025
  • 33+ commits reorganizing and enhancing the repository
  • Built with the Claude Opus 4.5 AI assistant via the Claude Code CLI

Key achievements

  1. Repository reorganization - Clean hierarchical folder structure
  2. Screenshot galleries - 36 Wayback Machine screenshots captured via Python/Playwright
  3. GitHub Pages website - Responsive site with navigation, lightbox, and breadcrumbs
  4. AI vision recognition - Used multimodal AI to extract meaningful captions from screenshots
  5. EU compliance - GDPR, WCAG 2.1 Level AA, comprehensive accessibility features

Read the full story →

Scripts

wbm-archiver

Location: scripts/wbm-archiver/

Python script with three modes:

  1. Save pages to the Wayback Machine
  2. Retrieve the latest archived version
  3. Retrieve the oldest archived version

Requirements: Python 3.x, waybackpy
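
The three modes can be illustrated with a minimal sketch. This is not the repository's script, which relies on waybackpy. As an assumption for illustration, it builds the public Wayback Machine request URLs directly with the standard library: the save endpoint for mode 1 and the CDX search endpoint for modes 2 and 3. The function names request_url and snapshot_url are hypothetical and do not appear in the repository.

```python
from typing import List, Optional
from urllib.parse import urlencode

# Public Wayback Machine endpoints (used here instead of waybackpy)
SAVE_ENDPOINT = "https://web.archive.org/save/"
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def request_url(mode: str, page_url: str) -> str:
    """Return the URL to request for one of the modes: save, newest, oldest.

    Hypothetical helper for illustration only.
    """
    if mode == "save":
        # Requesting this URL asks the Wayback Machine to capture page_url
        return SAVE_ENDPOINT + page_url
    if mode in ("newest", "oldest"):
        # CDX rows for a single URL are ordered oldest first;
        # a negative limit counts from the end of the result set
        limit = 1 if mode == "oldest" else -1
        query = urlencode({"url": page_url, "output": "json", "limit": limit})
        return f"{CDX_ENDPOINT}?{query}"
    raise ValueError(f"unknown mode: {mode}")


def snapshot_url(cdx_rows: List[List[str]]) -> Optional[str]:
    """Convert a CDX JSON response (header row plus data rows) to a snapshot URL.

    Returns None when the response contains no captures.
    """
    if len(cdx_rows) < 2:
        return None
    record = dict(zip(cdx_rows[0], cdx_rows[-1]))
    return f"https://web.archive.org/web/{record['timestamp']}/{record['original']}"
```

To perform the actual requests, the returned URLs could be fetched with urllib.request.urlopen or a similar HTTP client. When archiving thousands of URLs, pausing between requests is advisable, since the save endpoint limits request rates.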

Alternative method

Archive pages without Python: archive.org/services/wayback-gsheets/

Compliance

The companion website meets European standards:

  • GDPR/AVG - No cookies, no tracking, no personal data
  • WCAG 2.1 Level AA - Full accessibility compliance
  • Responsive design - Desktop, tablet, mobile support
  • SEO optimized - Schema.org, Open Graph, Twitter Cards

View compliance documentation →

License

The source code and text content of this project are dedicated to the public domain under CC0 1.0.

Note: This license does not apply to:

  • Wayback Machine screenshots (third-party copyrights)
  • KB logo (CC BY-SA 3.0)
  • Social media brand icons (respective trademarks)

See Image credits & copyrights for details.
