Automated (unattended) installation from a YAML answer file#180
Open
ggiesen wants to merge 105 commits into
The project is licensed GPL-2+ but the license text was declared only in debian/copyright, so GitHub's license detection reports "no license". Add the canonical GPL-2 text as a top-level COPYING file. Also fix Upstream-Name in debian/copyright: the upstream project is live-installer, not livemaker.
Add a unit test suite for the pure-logic helpers in partitioning.py: get_device_naming_scheme_prefix, to_human_readable, is_efi_supported, and the Partition class size/classification math. The conftest stubs gi/parted/dialogs in sys.modules and redirects the import-time read of /usr/share/live-installer/disk-partitions.html to the copy in the repo, so the tests run on any machine with only Python and pytest - no GTK, pyparted or installed package required. Run with: python3 -m venv .venv && .venv/bin/pip install pytest && .venv/bin/pytest
Scenario-driven integration harness that boots the pinned LMDE ISO in QEMU, serves an answer file to the guest over HTTP, waits for the installer's completion marker on the serial console, then reboots into the installed disk and runs SSH assertions. Emits JUnit XML. Supports BIOS, UEFI (OVMF) and UEFI+SecureBoot firmware plus an emulated TPM2 via swtpm, on both EL and Debian/Ubuntu firmware layouts. KVM is used when available, with TCG fallback. The full install phase needs the headless driver (not yet written), so the baseline scenario currently runs in --smoke mode only: boot the ISO, confirm the VM stays up, tear down. The bundled answer file documents the target v1 schema for the bios-simple case.
Strictly-validated YAML answer file (pydantic v2) implementing the automated-install design rules:
- Disk targeting by stable match expressions only; raw /dev paths are rejected with an explanatory error
- crypt(5) password hashes only; plaintext is rejected with no override
- Explicit per-failure-mode abort/continue policy, defaulting to abort
- Versioned format (version: 1 required)
- Strict parsing throughout: unknown keys are errors, YAML type coercion surprises fail validation instead of being guessed at
Custom partition layouts are intentionally unsupported in v1 (presets: simple, lvm, lvm-on-luks); the error message directs users to the GUI. The integration-test answer fixture is parsed in the unit suite so the two cannot drift apart.
Replace the engine's 87 bare os.system() calls and 10 subprocess.getoutput() calls with a CommandRunner instance owned by InstallerEngine (constructor-injectable, defaults to the real thing). do_run_in_chroot and exec_cmd delegate to the runner; the rsync progress stream uses runner.popen(). Behaviour-preserving by design: commands still tolerate failure unless check=True is requested, but every command and every non-zero exit is now logged in one place instead of failing silently. This makes the engine drivable and assertable in tests without touching the host system; component tests using a recording runner are included. partitioning.py's shell-outs are unchanged (separate step).
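The constructor injection described in this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `CommandRunner`, `CommandError` and the `check=True` behaviour are named in the commit messages, while the exact signatures, `RecordingRunner`, `Engine` and `format_root` are hypothetical.

```python
import subprocess


class CommandError(RuntimeError):
    """Raised when a command run with check=True exits non-zero."""


class CommandRunner:
    """Runs shell commands and logs every command and every non-zero exit."""

    def __init__(self, log=print):
        self.log = log

    def run(self, cmd, check=False):
        self.log(f"EXEC {cmd}")
        rc = subprocess.run(cmd, shell=True).returncode
        if rc != 0:
            self.log(f"EXIT {rc}: {cmd}")
            if check:
                raise CommandError(f"{cmd!r} exited with {rc}")
        return rc


class RecordingRunner(CommandRunner):
    """Test double (hypothetical): records commands instead of executing them."""

    def __init__(self, exit_codes=None):
        super().__init__(log=lambda _msg: None)
        self.calls = []
        self.exit_codes = exit_codes or {}

    def run(self, cmd, check=False):
        self.calls.append(cmd)
        rc = self.exit_codes.get(cmd, 0)
        if rc != 0 and check:
            raise CommandError(f"{cmd!r} exited with {rc}")
        return rc


class Engine:
    """Stand-in for InstallerEngine: the runner is constructor-injectable."""

    def __init__(self, runner=None):
        self.runner = runner or CommandRunner()

    def format_root(self, device):
        # Hypothetical step; the real engine's commands are not shown here.
        self.runner.run(f"mkfs.ext4 {device}", check=True)
```

With a recording runner injected, a test can assert on the exact command sequence without touching the host system.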
auto_installer.py drives InstallerEngine without a GUI, as a sibling of the GTK InstallerWindow:
- Acquires the answer file from a local path or URL (plain HTTP refused unless --insecure: answer files carry password hashes), validates it against the v1 schema, and maps it onto the engine's Setup object
- Resolves the target disk by stable attributes via diskmatch.py (by-id/by-path globs, model, size-min, first-non-removable); no match or an ambiguous match aborts with the list of available disks
- Registers console/logfile/serial implementations of the engine's progress and error hooks, then runs the same start_installation / finish_installation sequence the GUI does
- Afterwards re-enters the target to create additional users, apply package add/remove and run post-install steps, honouring the answer file's per-failure-mode abort/continue policy
- Prints "Automated installation complete" / "... FAILED" as the final serial markers; --dry-run validates the config and disk match only
Entry points: live-installer --automated=<source>, or live-installer.auto=<source> on the kernel command line (detected in main.py before any GUI setup). The GUI path is unchanged.
Engine change: Setup.password_is_crypted switches setup_user to chpasswd -e, since unattended installs only ever carry crypt(5) hashes.
The integration harness now watches for the driver's markers and fails fast when the failure marker appears.
To exercise working-tree installer code against a stock ISO, the harness builds a dev ISO: a small overlay squashfs containing usr/ is added to the ISO's /live directory (live-boot union-mounts every *.squashfs found there, so our files shadow the originals; no root-required remaster of the main squashfs).
- harness/isotools.py: overlay build (mksquashfs -all-root), boot-record preserving ISO rewrite (xorriso -boot_image any replay), and vmlinuz/initrd extraction for direct-kernel boot
- harness/make_test_iso.py: CLI to (re)build fixtures/lmde-7-dev.iso
- run_scenario.py full mode: direct-kernel boots the dev ISO with console=ttyS0 + live-installer.auto=<http url>, generates a per-run SSH keypair and injects the public key into the served answer file, then verifies over SSH with that key
New systemd unit live-installer-auto.service (shipped enabled, gated by ExecCondition on live-installer.auto= in /proc/cmdline) launches the installer in the live session. ConditionKernelCommandLine= can't prefix-match key=value arguments, hence the ExecCondition.
Schema: users[].ssh_authorized_keys installs OpenSSH public keys for any user via the headless driver. Useful for fleet provisioning, and what the harness's verify phase logs in with.
First end-to-end run surfaced that the LMDE 7 live ISO does not ship PyYAML (or pydantic) — the headless driver crashed at import. Declare both in debian/control Depends, and have the integration harness stand in for the .deb dependencies by bundling the wheels for the target live system's Python into the overlay squashfs's dist-packages. Also add an ExecStopPost safety net to live-installer-auto.service: if the installer process dies without printing its own failure marker (e.g. an import-time crash), the unit emits the marker to /dev/console so unattended callers fail fast instead of waiting out their timeout. A condition skip counts as success, so normal boots stay silent.
The HTTPS-only rule for answer files had an argv escape hatch (--insecure) but no kernel-cmdline equivalent — and cmdline boots have no argv, which made plain-HTTP delivery impossible for PXE-style deployments on trusted networks (and for the integration harness, whose answer file travels over QEMU's host-only user network). Boot with live-installer.auto-insecure alongside live-installer.auto= to accept an http:// source. HTTPS remains the default requirement.
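A minimal sketch of how the two kernel parameters might be read from the command line and combined with the HTTPS-only rule. The parameter names come from the commit messages; `parse_auto_cmdline` and `check_source` are hypothetical helpers, not the functions in main.py.

```python
def parse_auto_cmdline(cmdline):
    """Return (source, insecure) from a kernel command line string."""
    source = None
    insecure = False
    for token in cmdline.split():
        if token.startswith("live-installer.auto="):
            source = token.split("=", 1)[1]
        elif token == "live-installer.auto-insecure":
            insecure = True
    return source, insecure


def check_source(source, insecure):
    """Refuse plain HTTP unless the insecure flag was given."""
    if source.startswith("http://") and not insecure:
        raise ValueError(
            "refusing plain-HTTP answer file (it carries password hashes); "
            "use https:// or boot with live-installer.auto-insecure"
        )
    return source
```

In real use the string would come from /proc/cmdline; here it is passed in so the logic stays testable.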
full_disk_format() read setup.gptonefi through the module-global 'installer', which is only assigned when the GUI calls build_partitions(). On the headless path that global never exists and the automated partition step crashed with NameError. Take the setup object as a parameter; the engine passes self.setup.
The engine fires the progress hook once per copied file during the rsync phase (~400k calls for a stock install). The headless driver logged every call to console, journal and serial, making console I/O the bottleneck of the whole installation. Only log when the rendered progress line actually changes (~100 lines per phase).
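The change-only logging can be sketched like this. `ProgressLogger` and its `update` signature are hypothetical; the real hook's arguments and rendered format are not shown in the commit message.

```python
class ProgressLogger:
    """Emit a progress line only when its rendered text changes."""

    def __init__(self, emit=print):
        self.emit = emit
        self._last = None

    def update(self, current, total, message):
        pct = current * 100 // total if total else 0
        line = f"[{pct:3d}%] {message}"
        if line != self._last:
            self._last = line
            self.emit(line)
```

Because the rendered line only carries a whole-number percentage, a per-file callback collapses to roughly one line per percent instead of one per file.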
finish_installation() unmounts the target filesystem at its end, so the headless driver's attempt to re-enter the chroot afterwards found an empty /target (every bind mount failed rc=32, chroot rc=127). Add an optional before_unmount_hook parameter to finish_installation(), invoked after the system is fully configured (post clean_apt) but while the chroot is still mounted and has working DNS. The driver applies extra users, SSH keys, package changes and post-install scripts there, and no longer carries its own mount/unmount plumbing.
On EFI installs the engine dpkg-installs the bootloader stack (shim-signed, grub-efi-*) from the ISO pool without its dependency closure (shim-signed-common, shim-helpers-amd64-signed), leaving dpkg in a state apt-get refuses to build on — the driver's package step failed with unmet dependencies on UEFI while passing on BIOS. Run 'apt-get install -f -y' after apt-get update so the half-installed bootloader packages are completed from the network before the answer file's package additions are attempted.
Each parted invocation triggers a partition-table rescan during which udev removes and recreates the partition device nodes. On EFI installs (three partitions, plus a trailing 'set 1 boot on') mkfs.ext4 raced a vanishing /dev/vda3: the existence check passed, mke2fs then found no node, and because the failure was tolerated the root mount silently failed and rsync copied the entire system into the live session's tmpfs until ENOSPC. Run 'udevadm settle' before the node-existence check, and raise on a non-zero mkfs exit instead of carrying on with an unformatted target.
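A rough sketch of the settle-then-check-then-raise ordering. `format_partition` and its injectable `run`/`exists` parameters are hypothetical, and the `-F` flag is an assumption about how mkfs is invoked; only the ordering and the fail-on-non-zero behaviour reflect the commit message.

```python
import os
import subprocess


def format_partition(device, run=subprocess.run, exists=os.path.exists):
    """Wait for udev, confirm the node exists, then format or raise."""
    # Let udev finish recreating partition nodes after parted's rescan.
    run(["udevadm", "settle"], check=True)
    if not exists(device):
        raise FileNotFoundError(f"{device} missing after udev settled")
    # check=True raises CalledProcessError on a non-zero mkfs exit,
    # so the install never continues with an unformatted target.
    run(["mkfs.ext4", "-F", device], check=True)
```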
Two scenarios assert that bad input fails fast and cleanly (the failure marker appears and nothing is half-installed):
- fail-malformed-yaml: syntactically broken answer file
- fail-no-disk-match: valid file whose disk matcher matches nothing (on_no_match: abort)
The harness gains expect.outcome: failure. The failure marker is the expected result, the answer file is served verbatim (it may be deliberately malformed, so no SSH-key staging), and there is no boot/verify phase.
Also adds the uefi-lvm scenario (LVM layout preset, root and swap on the lvmmint VG).
The LUKS keyfile may now be a URL as well as a local path — per-machine
keyfiles served by a provisioning server alongside the answer file.
URLs follow the same HTTPS-only rule (and --insecure escape hatches) as
the answer file; the shared fetch error message is generalized since it
now covers key material too.
Harness: the HTTP server now starts before answer-file staging so its
base URL can be substituted for {server} placeholders, auxiliary files
next to the answer file are served too, and scenarios can set
skip_boot_phase (the LUKS scenario's installed system prompts for the
passphrase at the initramfs, which needs interactive serial support —
its install phase still exercises luksFormat/LVM/crypttab/grub-efi).
The LUKS scenario showed the passphrase verbatim on the serial console and in the journal: the engine pipes it to cryptsetup via echo, and the runner logs every command it executes. Serial output is routinely captured (BMC/SOL logging, test harnesses), so this is a real key-material leak, not a cosmetic issue. CommandRunner.run() gains a secrets parameter: the given strings (and their shell-quoted forms) are replaced with [REDACTED] in the EXEC log lines and in CommandError messages, while the executed command is untouched. The engine's luksFormat/luksOpen call sites use it.
Redacting the passphrase from logs closed the serial/journal leak but the engine still built 'echo -n <pass> | cryptsetup ...', so the passphrase was briefly visible in the process table (ps) on the live system. cryptsetup reads the key from stdin with --key-file -; pass it through the new CommandRunner.run(stdin=) channel instead. The passphrase is now never a command argument: not in ps, not in logs. The secrets= redaction stays as a belt-and-braces guard for any future case where a secret must appear in a command.
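The two mechanisms from this commit and the previous one (log redaction and passing the key over stdin) can be sketched together. The `secrets=` and `stdin=` parameter names follow the commit messages; the function bodies, the `log` parameter and the demo command are assumptions, not the PR's implementation.

```python
import shlex
import subprocess
import sys


def redact(text, secrets):
    """Replace each secret (and its shell-quoted form) with [REDACTED]."""
    for secret in secrets:
        if not secret:
            continue
        # Quoted form first: it contains the raw secret as a substring.
        for form in (shlex.quote(secret), secret):
            text = text.replace(form, "[REDACTED]")
    return text


def run(argv, stdin=None, secrets=(), log=print):
    """Run argv, feeding `stdin` on standard input; log a redacted command."""
    log("EXEC " + redact(shlex.join(argv), secrets))
    return subprocess.run(
        argv, input=stdin, text=True, capture_output=True, check=True
    )


# Demo: the secret travels over stdin only, never as a command argument.
demo_log = []
demo_result = run(
    [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"],
    stdin="s3cret",
    secrets=["s3cret"],
    log=demo_log.append,
)
```

For cryptsetup, the equivalent call would pass `--key-file -` in `argv` and the passphrase as `stdin`, so the key never reaches the process table.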
Upstream has no CI; this adds two workflows it can adopt incrementally.
- unit-tests.yml: runs pytest tests/unit on Python 3.11-3.13 for every push and PR. Fast, no special hardware (the conftest stubs gi/parted).
- integration-tests.yml: boots real VMs and performs full unattended installs. KVM-dependent and slow (~10-15 min/scenario), so it runs nightly and on workflow_dispatch, not per-push. A fast failure-mode gate runs first; the four install scenarios then run as a matrix.
- actions/vm-setup: composite action shared by both integration jobs. It enables KVM on the runner, installs QEMU/OVMF/swtpm + ISO tooling, fetches the pinned ISO (cached), and builds the dev ISO.
requirements-test.txt pins the unit suite's deps (pytest + the runtime deps declared in debian/control) for a bare-virtualenv install.
GitHub is deprecating Node 20 actions (forced to Node 24 on 2026-06-16). Bump to the current majors: checkout v4->v6, setup-python v5->v6, cache v4->v5, upload-artifact v4->v7.
Three additions:
- kernel.cmdline_extra (schema + driver): append kernel parameters to the installed system's GRUB_CMDLINE_LINUX_DEFAULT and regenerate grub in the post-install hook. A real fleet need (serial console, driver blacklists) and what puts the LUKS prompt on serial.
- Interactive serial in the harness: serial is now a bidirectional unix socket teed to the log file, with VM.send_serial() to type into the guest, the same channel an admin drives over IPMI Serial-over-LAN. The uefi-lvm-luks scenario now does a full boot/verify: it waits for the initramfs unlock prompt on serial, types the passphrase, and verifies the unlocked system over SSH (no more skip_boot_phase).
- bios-multi-disk scenario: two disks where the by-id target is the second and smaller one, proving the installer selects by stable attribute rather than enumeration order, and leaves the decoy disk untouched. Needed multi-disk support in the VM harness (per-disk serials via virtio-blk-pci -> /dev/disk/by-id/virtio-<serial>).
For upstream maintainability, the automated-install feature now ships with the documentation and CI a maintainer needs to support it without automation/VM expertise:
- docs/automated-install.md: answer-file reference, triggering, delivery mechanisms, worked examples, failure handling, security notes.
- tests/TESTING.md: architecture of the unattended path, the two test layers, how to run/extend the suite, how to debug a scenario from its serial log, and the known traps (udev races, the unmount-hook ordering, dpkg repair, secret handling).
- README.md: the repo had none; orients a reader and links both guides.
Integration CI hardening: per-job timeout-minutes (a hung VM can no longer consume the 6h ceiling) and concurrency cancellation of superseded runs.
The engine reads the squashfs/kernel and grub-title script from different paths on Mint (Ubuntu/casper) vs LMDE (Debian/live-boot), but every integration scenario exercises only the LMDE branch (LMDE is the edition shipping live-installer today; Mint adopts it in Mint 23). Pin both path sets with unit tests so a wrong Mint path can't regress silently before Mint 23 ships, and document the coverage boundary and what Mint-side integration would require (a casper injection variant in the harness, ideally validated against the Mint 23 beta) in TESTING.md.
The multi-disk install correctly targeted the by-id disk (vdb), but phase 2 booted the first-enumerated disk (vda, the empty decoy) and SSH never came up. Add VM.start(boot_serial=...) to pin bootindex=0 on the disk the OS was installed to, and set boot_disk_serial in the scenario. This is the harness standing in for the firmware boot-order an admin would configure on real multi-disk hardware.
Driving the LUKS unlock over serial revealed that an appended console=ttyS0 is not enough: LMDE's default 'quiet splash' makes plymouth grab the passphrase prompt graphically, so it never reaches ttyS0. Surfacing it on serial needs full serial-console provisioning (drop quiet/splash + a GRUB serial terminal), which is more than the cmdline_extra append can do. Keep the install-phase verification (which exercises luksFormat via stdin, LVM-on-LUKS, crypttab, grub-efi and cmdline_extra) and record the serial-unlock work as a documented follow-up. The harness's send_serial capability remains for when the system is provisioned for serial.
The fixtures directory is empty in git (its only contents — the ISO and sha256sum.txt — are gitignored), so it doesn't exist on a fresh GHA checkout. The ISO-download step's working-directory then failed with "No such file or directory" before running, and the whole integration run aborted in setup. Track the directory with a .gitkeep and mkdir -p defensively in the step.
Three issues surfaced by the first real GitHub Actions run:
- The serial unix socket lived in the (deep) work directory. AF_UNIX paths are capped at ~108 bytes; a CI runner's long workdir prefix blew past it so QEMU silently failed to create the socket and every VM start aborted. Put the socket in a short tempdir instead (the log stays in the workdir). This is why it worked locally but not on CI.
- KVM enablement raced: the udev rule sets /dev/kvm to 0666 asynchronously, but `test -w` ran before it applied. Apply the mode synchronously with a direct chmod (keep the udev rule for persistence). Hosted runners DO have KVM; the earlier "not available" was this race.
- The serial reader raised a misleading "could not connect to socket" even when QEMU had died at launch. Detect process exit and surface QEMU's stderr so the next such failure is diagnosable.
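The socket-path fix can be sketched as below. `make_serial_socket_path` is a hypothetical helper; the 107-byte limit assumes Linux, where `sun_path` is 108 bytes including the terminating NUL, and `/tmp` is assumed to exist.

```python
import os
import tempfile

# Linux sun_path is 108 bytes including the trailing NUL.
MAX_SUN_PATH = 107


def make_serial_socket_path(name="serial.sock"):
    """Return a socket path in a short private tempdir, not the deep workdir."""
    directory = tempfile.mkdtemp(prefix="li-", dir="/tmp")
    path = os.path.join(directory, name)
    if len(os.fsencode(path)) > MAX_SUN_PATH:
        raise ValueError(f"socket path too long for AF_UNIX: {path}")
    return path
```

Checking the length up front turns a silent QEMU failure into an immediate, explicit error.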
Item 3 (repo/package format, cloud-init A+C). Adopt cloud-init's apt:
shapes in the answer file and execute via distro-native tooling behind a
swappable backend, rather than a neutral repo schema or a cloud-init dep.
- schema.py: replace repositories: [{source, key_url}] with apt:
{sources: {<name>: {source, key|keyid|keyserver|key_url, filename}}},
mirroring cloud-init. Validation: deb/deb-src source line, at most one
signing key per source, https-only key_url, hex keyid, safe names.
Breaking answer-file change (repositories -> apt).
- pkgbackend.py (new): PackageBackend interface + AptBackend; get_backend()
picks per distro family. Applies sources (inline key / keyserver / https
key fetch), runs update (with the EFI apt-get install -f fixup), installs
the agnostic packages list, removes package_remove. All via CommandRunner
so it stays unit-testable and library-free.
- auto_installer.py: _apply_packages now delegates to the backend; keep
packages:/package_remove: top-level and agnostic.
- 22 new unit tests (schema apt + backend command sequences). Docs updated
(### apt). Existing integration scenarios already cover the install path
via packages: openssh-server.
…xmint#1-linuxmint#3)
Address the three actionable correctness items from the 2026-06-13 review:
linuxmint#1 [High] /boot/efi without the esp flag is now rejected. The reverse check was missing: a vfat /boot/efi partition with empty flags passed validation but its GPT entry lacks the ESP type GUID, so UEFI won't boot it, leaving a silently-broken install. schema.py CustomPartition now requires esp on a /boot/efi partition, with an error explaining why.
linuxmint#2 [Med] esp-on-BIOS / bios_grub-on-UEFI now rejected engine-side. The schema can't know firmware mode (runtime), so installer.py validates it in _check_layout_matches_firmware before any partition is created (no destructive action on mismatch). Added a fail-bios-grub-on-uefi regression scenario to the failure-modes CI job.
linuxmint#3 [Med] Duplicate LV names within a VG now rejected at validation time (Storage._consistency) instead of failing confusingly at lvcreate.
10 new tests (schema + engine + parsed regression fixture); 240 unit tests pass.
Extend the netplan-v2 network: section with wifis:, rendered to
NetworkManager wifi keyfiles by netconfig.py.
- schema.py: WifiConfig(_IpConfig) reuses all the ethernet IP logic
(static/DHCP, dual-stack, gateways, DNS, routes, match) and adds an
access-points map of SSID -> {password?, hidden?}. Validates SSID
(1..32 bytes), PSK (8..63 chars or 64-hex), at-least-one access point,
and id uniqueness across ethernets/wifis/vlans. vlans may link to a wifi.
- netconfig.py: one NM wifi keyfile per access point (type=wifi, [wifi]
ssid/mode/hidden + mac bind, [wifi-security] wpa-psk; open network omits
security). Filenames suffixed only when a device has >1 AP.
- 15 new unit tests (schema + renderer + a driver test asserting the
PSK-bearing keyfile is 0600). 255 unit tests pass.
Testing boundary: QEMU does not emulate 802.11, so wifi has no end-to-end
integration scenario (documented) — it is covered by schema + renderer
unit tests. WPA-Enterprise (EAP), bonds, and bridges remain unmodelled.
Docs updated: ### network (wifi subsection), comparison table, limitations.
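The SSID and PSK length rules listed above can be sketched in plain Python. These helper names are hypothetical, and the real schema.py uses pydantic validators whose exact form is not shown here; only the limits (SSID 1..32 bytes, PSK 8..63 characters or 64 hex digits) come from the commit message.

```python
import re

_HEX64 = re.compile(r"[0-9A-Fa-f]{64}")


def validate_ssid(ssid):
    """An SSID is 1..32 bytes; the limit counts bytes, not characters."""
    size = len(ssid.encode("utf-8"))
    if not 1 <= size <= 32:
        raise ValueError(f"SSID must be 1..32 bytes, got {size}")
    return ssid


def validate_psk(psk):
    """Accept an 8..63 character passphrase or a 64-digit hex key."""
    if _HEX64.fullmatch(psk) or 8 <= len(psk) <= 63:
        return psk
    raise ValueError("PSK must be 8..63 characters or 64 hex digits")
```

Counting bytes matters for non-ASCII SSIDs: 17 two-byte characters already exceed the limit.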
… wifi filter, doc nits)
linuxmint#4 [Low] discovery now genuinely excludes wifi from by-mac candidates: the ARPHRD_ETHER (type 1) check does NOT exclude wifi (wifi is type 1 too), so also skip interfaces with a sysfs wireless/ subdir. Comment fixed to describe what the check actually does.
linuxmint#5 [Med] discovery drops sentinel/placeholder DMI values (all-zero/all-FF/QEMU-default product_uuid; 'Default string', 'To Be Filled By O.E.M.', etc. product_serial) so a by-uuid/ or by-serial/ config can't match every defective unit in a fleet; the skip is logged.
linuxmint#6 [Low,docs] steer network examples to routes: [{to: default, via}] as the canonical default-gateway form; gateway4/6 documented as legacy.
linuxmint#7 [Low,docs] document that ambiguous NM matches are undefined: bind by MAC or a unique interface name; no autoconnect-priority is emitted.
linuxmint#8 [Low] document the TFTP client's RFC1350 size assumption (small files; use HTTP for large) and warn past 256 KiB. (Full RFC2347 negotiation remains queued.)
linuxmint#9 [Low,docs] document on_no_match's future intended values.
6 new tests (260 unit total).
…ck (item B) The built-in TFTP client now requests RFC 2347 options (RFC 2348 blksize 1428, RFC 2349 tsize) and handles the OACK: parse the negotiated blksize, ACK block 0, then receive at that size. Two fallbacks keep it working against any server: - option-unaware server replies with DATA block 1 -> use the 512-byte RFC 1350 default already in place (no special handling needed); - strict server replies ERROR code 8 (option negotiation) -> retry the request once with no options. The transfer's terminal-block test uses the negotiated blksize, not a hardcoded 512. Tests cover both modes: the existing roundtrip cases now exercise the no-OACK fallback (the default test server ignores options); new cases add blksize=1024 OACK negotiation, ERROR-8 -> bare-retry fallback, and OACK parsing. 263 unit tests pass. Closes review linuxmint#8 / backlog item B. Docs updated (TFTP transport row).
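For orientation, the wire format involved looks roughly like this. The packet layouts follow RFC 1350 and RFC 2347 (RRQ is opcode 1, OACK is opcode 6, options are NUL-terminated name/value pairs); the function names are hypothetical and this is not the PR's client.

```python
import struct

OP_RRQ = 1
OP_OACK = 6


def build_rrq(filename, options=None):
    """Build a read request, optionally carrying RFC 2347 options."""
    packet = struct.pack("!H", OP_RRQ)
    packet += filename.encode("ascii") + b"\0" + b"octet\0"
    for name, value in (options or {}).items():
        packet += name.encode("ascii") + b"\0"
        packet += str(value).encode("ascii") + b"\0"
    return packet


def parse_oack(packet):
    """Return the options a server acknowledged, as {name: value} strings."""
    (opcode,) = struct.unpack("!H", packet[:2])
    if opcode != OP_OACK:
        raise ValueError(f"not an OACK (opcode {opcode})")
    fields = packet[2:].split(b"\0")
    if fields and fields[-1] == b"":
        fields.pop()  # drop the empty piece after the final NUL
    if len(fields) % 2:
        raise ValueError("malformed OACK: odd number of fields")
    return {
        fields[i].decode("ascii").lower(): fields[i + 1].decode("ascii")
        for i in range(0, len(fields), 2)
    }
```

A client would request `{"blksize": 1428, "tsize": 0}`, then size its receive buffer and terminal-block test from the `blksize` value returned by `parse_oack`.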
… knob
Add end-to-end NFS coverage against a live server, and a way to force the
protocol version.
- auto_installer: nfs:// URLs accept an optional ?vers= (3/4/4.0/4.1/4.2),
passed through as a vers= mount option; omitted, mount.nfs negotiates.
Useful for version-restricted filers, and lets the harness pin each
version deterministically. Validated; bad versions rejected.
- tests/nfs/: a real-server harness — setup-nfs-server.sh provisions
nfs-kernel-server exporting a small answer file read-only over NFSv3 and
NFSv4.2, reachable via IPv4, IPv6, and a hostname; test_nfs_real.py fetches
it through the actual nfs:// transport across the full {v3,v4.2} x
{127.0.0.1,[::1],li-nfs-host} matrix plus a negotiated-default case.
Gated on LI_NFS_TEST=1 (needs root for mount), so the unit suite stays
hermetic.
- .github/workflows/nfs-tests.yml runs it per push: GHA ubuntu runners have
passwordless sudo and a kernel nfsd, so a real server works in CI.
- Schema versioning policy (review Q1) documented: option (c) — grow v1
additively, fork to v2 only for breaking changes while still parsing v1.
- Docs: NFS vers= and dual-stack/hostname support.
267 unit tests pass; 7 NFS tests run in the new CI job.
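A small sketch of the `?vers=` handling, using only the standard library. The accepted version list is taken from the commit message; `parse_nfs_url` and its return shape are hypothetical, not the code in auto_installer.

```python
from urllib.parse import parse_qs, urlsplit

ALLOWED_NFS_VERSIONS = ("3", "4", "4.0", "4.1", "4.2")


def parse_nfs_url(url):
    """Split an nfs:// URL into (host, export path, mount options)."""
    parts = urlsplit(url)
    if parts.scheme != "nfs" or not parts.hostname or not parts.path:
        raise ValueError(f"not a usable nfs:// URL: {url!r}")
    options = []
    values = parse_qs(parts.query).get("vers", [])
    if values:
        vers = values[-1]
        if vers not in ALLOWED_NFS_VERSIONS:
            raise ValueError(f"unsupported NFS version: {vers!r}")
        options.append(f"vers={vers}")
    # With no vers= given, options stays empty and mount.nfs negotiates.
    return parts.hostname, parts.path, options
```

`urlsplit` strips the brackets from an IPv6 literal, so `[::1]` comes back as `::1`.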
The nfs-kernel-server package does not pre-create /etc/exports.d on the runner, so the export tee failed. mkdir -p it first.
Cross-checked every doc against the code and fixed the mismatches a
multi-doc audit surfaced.
Code:
- main.py: the `live-installer` automated wrapper only forwarded
--automated=/--insecure/--dry-run, so the documented `--list-disks` and
`--check`/`--config` invocations silently did nothing. Forward all other
flags through to auto_installer unchanged.
Docs (automated-install.md): drop the ${serial} keyfile example (no such
templating exists, and it contradicted the no-templating rule) and clarify
only passphrase_source: keyfile is implemented; note the install-vs-boot
passphrase distinction; add the LV-name-uniqueness, esp<->/boot/efi, and
engine-side firmware-flag rules; add the vlan-self-link and id-reuse network
rules; correct the apt key path (.asc vs .gpg, name vs filename:); hostname
defaults to 'mint', not DHCP.
Docs (comparison.md): add the missing TFTP delivery row (v3/v4.2 NFS detail
too). getting-started.md: layout list includes custom; fix the stale
"four answer files / one per layout" count. README.md: intro lists the
real capabilities; module list adds discovery/netconfig/pkgbackend.
PR linuxmint#180 body: ${serial} -> concrete keyfile URL; size -> size-min;
NFS/TFTP transport detail. 267 unit tests pass.
True first-boot rekey (the mechanism confirmed with the user): format LUKS with a random throwaway key so the install is fully unattended, embed that key in the initramfs to auto-unlock the first boot, then a one-shot service prompts the operator for the real passphrase, swaps it in, removes the throwaway key + keyfile, rebuilds the initramfs, and reboots; every subsequent boot prompts normally.
- build_setup: prompt-on-first-boot (the schema default) now generates a random newline-free key (secrets.token_urlsafe) instead of aborting; sets setup.luks_rekey_on_first_boot. tpm2 still aborts.
- installer.py: write_crypttab names the embedded keyfile instead of the prompting placeholder in the rekey case so the first boot auto-unlocks; print_setup no longer logs the LUKS passphrase (it was teed to the log + serial, a pre-existing leak the random key would have worsened).
- auto_installer: _setup_luks_first_boot_rekey writes the keyfile byte-exact (no trailing newline, 0600), configures the cryptsetup-initramfs hook to embed it + UMASK=0077, installs/enables the li-luks-rekey one-shot. Runs before the initramfs rebuild. The rekey script adds the new key with printf %s (no newline) so it matches the boot-time prompt.
Unit-tested (key byte-exactness, crypttab, hook, service, redaction); 272 unit tests pass. End-to-end two-boot unlock is the next step.
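The key generation and byte-exact keyfile write can be sketched as follows. `secrets.token_urlsafe`, the absence of a trailing newline and the 0600 mode come from the commit message; the helper names, the key length and the demo path are assumptions.

```python
import os
import secrets
import tempfile


def generate_throwaway_key(nbytes=32):
    """Random URL-safe key; its alphabet contains no newline."""
    return secrets.token_urlsafe(nbytes)


def write_keyfile(path, key):
    """Write the key byte-for-byte, with no trailing newline, mode 0600."""
    # O_EXCL refuses to reuse an existing file; the mode is set at creation.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(key.encode("ascii"))


# Demo: write a fresh key into a private temporary directory.
demo_key = generate_throwaway_key()
demo_path = os.path.join(tempfile.mkdtemp(), "luks.key")
write_keyfile(demo_path, demo_key)
```

Byte-exactness matters because a keyfile with a stray newline would be a different key from the one cryptsetup was formatted with.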
a092622 to
a4399df
Compare
…ekey) uefi-luks-prompt: unattended LVM-on-LUKS install with no secret in the answer file. Phase 2 drives the full rekey dance over serial via a new first_boot_rekey harness branch: first boot auto-unlocks (throwaway key in the initramfs), the one-shot prompts to set + confirm the passphrase, the service rekeys and reboots, the second boot prompts at the initramfs, and the unlocked system is verified over SSH. Also asserts the throwaway keyfile, the rekey service, and the keyfile crypttab entry are all gone afterwards.
The first attempt failed in CI: the one-shot ran at ~7s (before systemd-user-sessions), and systemd-ask-password --no-tty returned non-zero that early, which the set -e script treated as fatal -> service failed before prompting. The install + first-boot auto-unlock worked; only the prompt mechanism was wrong. Fixes: run the unit late (After=systemd-user-sessions.service) but before any getty claims the console (Before=getty.target serial-getty@ttyS0.service), and own the tty (StandardInput=tty-force, TTYPath=/dev/console). Prompt with a plain read on the console instead of systemd-ask-password, and drop set -e so a transient blkid cannot kill the rekey.
Queue item 1. Catches up to kickstart/autoinstall on the multi-layout +
extra-locale case (e.g. an en_CA + fr_CA desktop).
- schema: keyboard.additional_layouts (list of {layout, variant}) + keyboard.
toggle (XKB switch option, only valid with extra layouts); top-level
additional_locales (each validated like locale). XKB layout/variant/toggle
validated.
- build_setup: comma-joins layouts/variants (primary first) into the
keyboard_layout/keyboard_variant Setup fields, passes the toggle as
keyboard_options, and strips codesets from additional_locales.
- installer.py: setup_locale also runs locale-gen for the supplementary locales (LANG
stays primary); setup_keyboard writes XKBOPTIONS from keyboard_options
(honouring the toggle) and along the way fixes a pre-existing bug where the
XKBOPTIONS line was written without quotes or a trailing newline.
- Integration: bios-simple now installs us + ca/fr with a toggle and an
fr_CA locale, and verifies /etc/default/keyboard and locale -a on the booted
system. 283 unit tests pass.
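The comma-joining and the quoted XKBOPTIONS line can be illustrated with a short sketch. The function names and dictionary shapes are hypothetical, and the padding of a missing variant with an empty string is an assumption based on XKB's positional layout/variant lists.

```python
def join_layouts(primary, additional):
    """Comma-join layouts and variants, primary first.

    Variants are positional, so a layout without one contributes an
    empty string to keep the two lists aligned.
    """
    entries = [primary] + list(additional)
    layouts = ",".join(entry["layout"] for entry in entries)
    variants = ",".join(entry.get("variant") or "" for entry in entries)
    return layouts, variants


def render_xkboptions(options):
    """Render the XKBOPTIONS line quoted and newline-terminated."""
    return f'XKBOPTIONS="{options}"\n'
```

For a us primary layout plus ca with the fr variant, this yields layouts `us,ca` and variants `,fr`.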
Queue item 2. autoinstall has it; corporate desktops need it. - schema: top-level proxy: (http(s) URL, validated — scheme, host, and no quotes/whitespace so it cannot break out of the apt.conf string or shell env quoting). - auto_installer: _apply_proxy writes /etc/apt/apt.conf.d/00proxy (Acquire::http(s)::Proxy) and appends http_proxy/https_proxy (+upper) to /etc/environment. Runs BEFORE _apply_packages so the install chroot apt-get already uses it; persists for the installed system. Wired into needs_post. Unit-tested (URL validation + the exact apt.conf/environment content). Not integration-tested: an unreachable proxy would break apt during install, and standing up a real proxy in QEMU is disproportionate — the config generation is fully unit-covered. 291 unit tests pass.
Queue item 3. Adds CA certs to the system trust store (/etc/ssl/certs) for
corporate MITM-proxy CAs / internal PKI roots.
- schema: ca_certs: {remove_defaults, trusted: [<inline PEM>]} mirroring
cloud-init. Validation rejects a private key in trusted (a trust store
holds public certs only) and non-PEM junk; rejects remove_defaults with no
trusted certs (would leave an empty store / break TLS).
- catrust.py (new): CaTrustBackend + DebianCaTrustBackend mirroring
pkgbackend.py. Writes each cert to /usr/local/share/ca-certificates/
li-ca-N.crt (0644, .crt ext required) and runs update-ca-certificates;
remove_defaults disables the bundled certs and --fresh-rebuilds. Seam for
update-ca-trust (RHEL) later.
- auto_installer: _apply_ca_certs runs BEFORE _apply_packages so a private
mirror CA is trusted when apt fetches over HTTPS. Wired into needs_post.
- Deliberately the SYSTEM store only; separate from future per-connection
802.1X/EAP trust.
- Integration: bios-simple installs a harness-generated self-signed CA and
verifies it reached /etc/ssl/certs. 303 unit tests pass.
The previous commit left bios-simple.yaml with a bare {ca_cert} placeholder
that fails schema validation, breaking test_scenario_fixture_parses (a
pytest|tail masked the failure at commit time). Use a structurally valid PEM
placeholder (BEGIN/END CERTIFICATE with an LI_HARNESS_CA_CERT_PLACEHOLDER
body) so the static fixture validates; the harness swaps in a real generated
cert by matching the marker. 303 unit tests pass.
Queue item 4. Adds autoinstall's driver-selection primitive.
- schema: drivers: {install: bool} (autoinstall shape).
- auto_installer: _apply_drivers runs ubuntu-drivers install after the
package phase (apt/repos ready). Guarded by command -v ubuntu-drivers, so
it is a logged no-op on LMDE/Debian where ubuntu-drivers-common is absent;
on Mint it installs recommended proprietary/DKMS drivers. Wired into
needs_post.
The MOK wall is universal: under SecureBoot a freshly built DKMS module
still needs interactive MOK enrollment at next boot — documented, not
something this (or any) installer can automate.
Unit-tested (available + unavailable paths). No integration scenario:
QEMU has no proprietary hardware, so ubuntu-drivers install is a no-op
there. 306 unit tests pass.
Queue item 5 (Feature B). Per-connection EAP for wired 802.1X and wifi WPA-Enterprise, via netplan v2 auth:, rendered to the NM keyfile [802-1x]. - schema: Auth model on EthernetConfig (wired) and AccessPoint (wifi). method tls|peap|ttls; identity/anonymous-identity; ca/client cert + key by ABSOLUTE PATH (inline deferred so a private key need not transit the answer file); phase2-auth; password; allow-unvalidated. Validation: REFUSES EAP without a ca-certificate (rogue-AP credential-theft footgun) unless allow-unvalidated; tls needs client-cert+key; peap/ttls need identity+password+phase2-auth; an AP cannot set both a WPA-PSK password and auth. - netconfig: shared _8021x_lines renders [802-1x] for both ethernet and wifi; wifi EAP uses key-mgmt=wpa-eap. EAP secrets land in the 0600 keyfile. - Separate from the system ca_certs trust store by design (can share a cert file by reference). Unit-tested (rendering + validation incl. the no-CA refusal). Not integration-tested: no RADIUS server / 802.11 in QEMU — bare-metal concern, documented. Deferred: inline certs/keys, a first-boot EAP connectivity check. 318 unit tests pass.
Queue item 6. Configures Timeshift at install time on a btrfs root.
- schema: snapshots: {enabled, backend: timeshift-btrfs, schedule:
{boot,daily,weekly,monthly}, initial_snapshot}. THE KEY GUARD: a top-level
model_validator cross-validates backend vs storage — timeshift-btrfs
requires the btrfs @/@home root (decided in storage.layout), so a config
that could not make snapshots fails at validation, not silently after
install. backend: auto / rsync / snapper rejected in v1.
- snapshotbackend.py (new): SnapshotBackend seam + TimeshiftBtrfsBackend
mirroring pkgbackend.py. Ensures timeshift is installed, renders
/etc/timeshift/timeshift.json (btrfs_mode + schedule_/count_ from the neutral
schedule), and installs a self-removing first-boot one-shot that registers
the schedule (timeshift --check) and takes the optional initial snapshot.
- auto_installer: _apply_snapshots after the package phase; wired into needs_post.
- Integration: custom-btrfs now enables snapshots and verifies timeshift.json
btrfs_mode + that timeshift --list parses our config (validates the json
format against the shipped Timeshift) + the first-boot one-shot self-removed.
New fail-snapshots-on-ext4 failure scenario asserts the cross-validation
aborts a btrfs-backend-on-ext4 config at schema time.
329 unit tests pass. Note: timeshift.json keys can vary by Timeshift version;
the integration test validates the format against the real shipped tool.
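A rough sketch of the cross-validation idea, assuming the schema can reduce storage.layout to a root filesystem name. The function check_snapshots and its root_filesystem parameter are hypothetical; the real check is a pydantic model_validator.

```python
from typing import List

# Only backend accepted in v1, per the commit message above.
SUPPORTED_BACKENDS = {"timeshift-btrfs"}


def check_snapshots(enabled: bool, backend: str, root_filesystem: str) -> List[str]:
    """Return validation errors for a snapshots section (empty list = valid)."""
    errors = []
    if backend not in SUPPORTED_BACKENDS:
        errors.append(f"snapshot backend {backend!r} is not supported in v1")
    elif enabled and root_filesystem != "btrfs":
        # Fail at validation time rather than silently after install.
        errors.append(
            f"backend {backend!r} requires a btrfs root, got {root_filesystem!r}"
        )
    return errors
```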
The timeshift-config-btrfs-mode verify command embedded "btrfs_mode": "true" unquoted, whose colon-space YAML read as a mapping separator -> ScannerError before the VM even booted. Quote the command and drop the literal colon.
Queue item 7, full multi-level + LVM-on-RAID. Stage 1: schema + validation.
- storage.disks: a list of disk matches (multi-disk); storage.target made optional (exactly one of target/disks, RAID requires disks).
- storage.raid: RaidArray list (name md0.., level 0/1/5/10, metadata, then mount|lvm_pv|subvolumes like a partition/LV).
- CustomPartition.raid: marks a partition a member of a named array (the member is replicated on each disk, so device count = number of disks).
- Storage._consistency: RAID referential integrity (members<->arrays), per-level minimum disk counts (0/1:2, 5:3, 10:4), LVM PVs may now come from RAID arrays, mounts include array+subvolume mounts.
Engine (mdadm create, mdadm.conf, RAID-aware initramfs, grub to every member) and integration scenarios follow in stage 2. 338 unit tests pass.
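The consistency rules listed for RAID can be sketched as follows. The minimum disk counts come from the commit message; the function name check_raid and its argument shapes are assumptions for illustration, not the Storage._consistency implementation.

```python
from typing import Dict, List, Set

# Per-level minimum disk counts stated in the commit message.
MIN_DISKS = {0: 2, 1: 2, 5: 3, 10: 4}


def check_raid(arrays: Dict[str, int], member_refs: Set[str], disk_count: int) -> List[str]:
    """arrays maps array name -> level; member_refs are the array names
    referenced by partitions. Returns errors (empty list = consistent)."""
    errors = []
    for name in sorted(member_refs - set(arrays)):
        errors.append(f"partition references undefined array {name}")
    for name, level in sorted(arrays.items()):
        if name not in member_refs:
            errors.append(f"array {name} has no member partitions")
        if level not in MIN_DISKS:
            errors.append(f"array {name}: unsupported level {level}")
        elif disk_count < MIN_DISKS[level]:
            errors.append(
                f"array {name}: RAID{level} needs at least "
                f"{MIN_DISKS[level]} disks, got {disk_count}"
            )
    return errors
```

For example, a RAID1 /boot plus a RAID5 PV over three disks passes, while the same layout over two disks is rejected for the RAID5 array only.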
…nstead
timeshift --list requires root, but the verify runs as the unprivileged ssh user, so it returned non-zero on permissions (not a config error). The config-written + first-boot-one-shot-ran (as root) checks already cover our code; swap the root-only check for command -v timeshift.
- build_setup: resolve storage.disks to N distinct disks (setup.disks), reject overlapping matches; pass custom_raid + the partition raid field; a disks= override for tests.
- _create_custom_partitions: multi-disk. Partition every disk identically, set the raid flag on member partitions, mdadm --create each array over its per-disk members, then mkfs/pvcreate/btrfs on the md device (LVM PVs may now be md devices). Non-RAID partitions are used from the first disk (ESP/boot copies on the others exist for bootloader redundancy).
- finish_installation: install mdadm, write mdadm.conf (--detail --scan), add the md/raid initramfs modules, and grub-install to EVERY member disk so the machine boots if one disk fails.
Unit-tested (mdadm/pv/vg/mkfs command sequence over 3 disks, raid flags, per-disk labels; existing single-disk custom path unchanged). 340 unit tests pass. Integration (multi-disk RAID install + boot) follows in stage 3.
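To make "mdadm --create each array over its per-disk members" concrete, here is a small sketch that only builds the argument list and never touches a disk. The helper names partition_path and mdadm_create_argv, the default metadata version, and the use of --run are assumptions, not the engine's actual command line.

```python
from typing import List


def partition_path(disk: str, number: int) -> str:
    """Device node for partition `number` on `disk`.

    Disks whose name ends in a digit (nvme0n1, mmcblk0) take a 'p' separator.
    """
    separator = "p" if disk[-1].isdigit() else ""
    return f"{disk}{separator}{number}"


def mdadm_create_argv(
    name: str, level: int, disks: List[str], partition_number: int, metadata: str = "1.2"
) -> List[str]:
    """Build the mdadm command for one array whose members are the same
    partition number on every disk (identical partitioning assumed)."""
    members = [partition_path(disk, partition_number) for disk in disks]
    return [
        "mdadm", "--create", f"/dev/{name}",
        "--run",  # assumed here to avoid an interactive confirmation
        f"--level={level}",
        f"--metadata={metadata}",
        f"--raid-devices={len(members)}",
        *members,
    ]
```

Because every disk is partitioned identically, the member list is just the same partition number repeated across the disks, which is why the device count equals the number of disks.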
BIOS RAID1 install across two disks (/boot + /), booted from the array and verified over SSH: root on an /dev/md device, two active raid1 arrays both healthy ([UU]), mdadm.conf present. Also: the engine now ensures mdadm is in the live environment before building arrays (not on every live image).
raid1-bios proved the boot-from-RAID path green on the first try. Add the full multi-level case (the user preview): BIOS RAID1 /boot + RAID5 across 3 disks used as an LVM PV, with root on the LV. Verifies root on /dev/mapper/vg0-root, a raid1 + raid5 both active, the RAID5 array with all three members [UUU], and the LVM PV sitting on an /dev/md device.
automated-install.md: a Software RAID subsection (disks list, raid arrays, levels 0/1/5/10, LVM-on-RAID, grub-to-every-member, validation rules). comparison.md: Software RAID and custom-partitioning rows now check; roadmap item 4 done. Limitations updated.
RAID5+LVM-on-RAID boots correctly (root on /dev/mapper/vg0-root, RAID5 [UUU]); the only failing check was pvs, which needs root while the verify runs as the unprivileged ssh user. Swap for an lsblk dependency-stack check that proves LVM-on-RAID without root.
Closes the lifecycle gap opposite late_commands.
- schema: top-level early_commands: [..]; on_failure.early_command_failure (default abort).
- auto_installer: _run_early_commands runs them in the LIVE environment (not a chroot, since /target does not exist yet) BEFORE partitioning, as the first step of run(). An on-media file is run with sh. Failure -> early_command_failure policy (fail before any disk is touched).
- Deliberately CANNOT change the layout (config is data; per-machine layouts use generated answer files / auto: discovery); documented. For prepping the install environment: tear down a stale array, mount an out-of-band source, place a cert/keyfile for ca_certs/EAP/LUKS to reference.
7 new tests (347 unit total). Docs + comparison updated (%pre row now check).
Rounds out the package story (apt + flatpak — both package systems Mint uses).
- schema: flatpak: {remotes: [{name, url}], install: [app-ids]}. Validation:
https remote URLs, valid app ids, unique remote names, install needs >=1
remote.
- auto_installer: _apply_flatpak ensures flatpak (apt), adds each remote
(remote-add --if-not-exists), and installs each app in the chroot with
flatpak install --system -y --noninteractive (root, so no polkit/session
needed; the long install timeout absorbs the download). After the package
phase. Wired into needs_post.
- Integration: a flatpak scenario installs Flathub + two small popular apps
(Flatseal, Calculator — shared GNOME runtime) and verifies them present on
the booted system (a REAL flatpak pull, not mocked).
Snap is deliberately unsupported (Mint/LMDE disable snapd) but NOT foreclosed:
a future snap: section would be additive and separate, like flatpak:/apt:.
9 new tests (356 unit total). Docs + comparison updated.
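A sketch of the flatpak validation and the commands it leads to, under stated assumptions: the function name flatpak_commands and the simplified app-id pattern are illustrative, and the real step runs these inside the chroot through commandrunner.

```python
import re
from typing import Dict, List
from urllib.parse import urlparse

# Simplified reverse-DNS check (at least three dot-separated elements);
# an assumption for illustration, not flatpak's full naming rules.
APP_ID_RE = re.compile(r"^[A-Za-z_][\w-]*(\.[A-Za-z_][\w-]*){2,}$", re.ASCII)


def flatpak_commands(remotes: List[Dict[str, str]], install: List[str]) -> List[List[str]]:
    """Validate the section, then return the flatpak commands in run order."""
    names = [remote["name"] for remote in remotes]
    if len(set(names)) != len(names):
        raise ValueError("remote names must be unique")
    for remote in remotes:
        if urlparse(remote["url"]).scheme != "https":
            raise ValueError(f"remote {remote['name']}: URL must be https")
    if install and not remotes:
        raise ValueError("install needs at least one remote")
    for app_id in install:
        if not APP_ID_RE.match(app_id):
            raise ValueError(f"invalid app id: {app_id}")
    commands = [
        ["flatpak", "remote-add", "--system", "--if-not-exists", r["name"], r["url"]]
        for r in remotes
    ]
    commands += [
        ["flatpak", "install", "--system", "-y", "--noninteractive", app_id]
        for app_id in install
    ]
    return commands
```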
…ak refs
Targeted documentation corrections after the feature queue landed:
- automated-install.md: fix the Limitations and keyboard TOC/section anchors; note the keyfile uses the same insecure-transport rules as the answer file; correct that swap is a filesystem, not a flag; clarify gateway4/6 produce an equivalent default route; add early_command_failure to the failure list and note the disk-target mismatch always aborts ahead of partition_mismatch; add TFTP to the cleartext-transport list; state the logging defaults; fix a double 'and' in Limitations.
- getting-started.md: list the RAID, LUKS-prompt, and flatpak example files.
- README.md: add catrust.py and snapshotbackend.py to the module map and the new features (RAID, EAP, first-boot LUKS rekey, Timeshift, etc.) to the intro.
- serial-console-and-luks.md: document the prompt-on-first-boot serial rekey path and the _setup_luks_first_boot_rekey step.
Addresses #145.
That question has been open since 2022. Rather than answer it on paper, I built a working version and tested it end to end, so there is something concrete to react to.
This is an MVP for unattended installation: mergeable as is, and built to be extended. I am putting it up so you can review it and decide on the direction. It works end to end today (see below), and everything here is open to change.
I want to be clear about one thing up front. The goal is for this to work on Mint, but Mint does not ship live-installer yet (it arrives with Mint 23), so I have only been able to test on LMDE. I cannot test the Mint side until the Mint 23 beta is available. The code has a Mint vs LMDE branch, and I have pinned that branch with unit tests so it will not silently break, but the Mint path has not run end to end. There is more on this further down.
What works today, as automated installs in a VM driven only by an answer file, all on LMDE:
- … @/@home), and software RAID (md, levels 0/1/5/10, including LVM-on-RAID and a RAID /boot, with GRUB installed to every member disk), validated and verified on the booted system
- … prompt-on-first-boot: the installer formats with a random throwaway key embedded in the initramfs so the first boot is unattended, then a one-shot service prompts on the console/serial for the real passphrase, rekeys, and reboots)
- … apt: shape), an http(s) proxy, a custom CA trust store (ca_certs:), proprietary/DKMS drivers (drivers:), flatpak remotes and apps, Timeshift snapshots on btrfs, multi-layout keyboards + supplementary locales, pre-install early_commands (%pre-style), post-install commands, kernel command line, serial console
- … auto: identity-based discovery so one boot entry serves a whole fleet (it finds its own file by MAC / SMBIOS serial / UUID)
It is covered by 356 unit tests and an integration suite that performs real installs in QEMU/KVM (including custom LVM/btrfs layouts, software RAID 1 and RAID 5 + LVM, a first-boot LUKS passphrase prompt over serial, flatpak app installs, static dual-stack IP + VLAN networking, and a no-media PXE netboot install), plus a real-NFS suite that fetches an answer file from a live NFS server over NFSv3 and NFSv4.2 across IPv4, IPv6, and a hostname. All run green in GitHub Actions on standard hosted runners, with no special hardware.
Bugs this turned up
Building and testing this found three pre-existing issues in the installer that affect more than automated installs. I am offering these as fixes regardless of what happens with the feature.
1. A udev race in full_disk_format() (partitioning.py). Each parted call triggers a partition-table rescan, during which udev briefly removes and recreates the partition device nodes. On a multi-partition EFI install, mkfs can race a node that has not reappeared yet. Because the failure was tolerated, the file copy then landed in tmpfs until it filled. This is the sort of sporadic "install failed" or "won't boot" that is very hard to diagnose from a user bug report. The fix is udevadm settle before the node check, and treating a failed mkfs as fatal. The integration harness reproduces it reliably.
2. The LUKS passphrase was logged in cleartext. The engine pipes it with echo <pass> | cryptsetup, which puts it in the process table (ps) and, once commands are logged, on the serial console and in the journal. Serial output is routinely captured off real hardware over BMC and serial-over-LAN. The fix is to feed it to cryptsetup on stdin (--key-file -), and to redact secrets from logs as a backstop.
3. Encrypted installs do not boot in the live environment without help. The engine runs update-initramfs, but in the live session that is a live-tools-diverted no-op, so the crypttab never reaches the initramfs and the encrypted root cannot be unlocked. It drops to a rescue shell. Plain and LVM installs are not affected because they boot from the stock initramfs, which is why this stayed hidden. The fix is in the automated path for now, and I am happy to discuss whether the engine should handle it for the GUI as well.
I am not certain that (2) and (3) reproduce on a normally booted GUI install the same way they do in my harness. They may be partly artifacts of how the test boots. (1) is firmware level and looks real on hardware. The reproductions are in the test suite.
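For the first two fixes, a minimal sketch of the techniques involved, not the code in this PR. The helper names (settle_udev, luks_format_argv, luks_format, redact), the 30-second timeout, and the [REDACTED] marker are assumptions; the real implementation goes through commandrunner.

```python
import subprocess
from typing import Iterable, List


def settle_udev(timeout_seconds: int = 30) -> None:
    """Wait for udev to finish recreating device nodes after a parted call."""
    subprocess.run(["udevadm", "settle", f"--timeout={timeout_seconds}"], check=True)


def luks_format_argv(device: str) -> List[str]:
    """cryptsetup command line that reads the key from stdin, so the
    passphrase never appears in the process table."""
    return ["cryptsetup", "--batch-mode", "luksFormat", "--key-file", "-", device]


def luks_format(device: str, passphrase: str) -> None:
    # No trailing newline: with --key-file - the input is taken as-is.
    subprocess.run(luks_format_argv(device), input=passphrase.encode(), check=True)


def redact(text: str, secrets: Iterable[str]) -> str:
    """Backstop for logs: replace every known secret before a line is written.
    Longest secrets first, so a secret containing another is fully masked."""
    for secret in sorted((s for s in secrets if s), key=len, reverse=True):
        text = text.replace(secret, "[REDACTED]")
    return text
```

With check=True a failed command raises instead of being tolerated, which is the same "treat a failed mkfs as fatal" idea applied to every step.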
What the answer file looks like
I went with YAML rather than ini. It expresses the lists this needs (users, packages, commands) without ad-hoc conventions. Where it genuinely overlaps with cloud-init, identity, users, packages, apt repositories, CA certs, proxy, and network config, it reuses cloud-init's own key names and structure, so anyone who has used cloud-init or Ubuntu autoinstall should find it familiar. Where the semantics differ I use the name that matches the behaviour rather than overloading a cloud-init key: post-install commands run in the target at install time, so they are late_commands (Ubuntu autoinstall's term, and kickstart's %post), not cloud-init's first-boot runcmd. It is not a drop-in for either format, but the shared parts are deliberately the same shape. PyYAML is already in the live ISO dependency chain.
The storage, users, locale, and keyboard sections map almost one to one onto the existing Setup class in installer.py, which is what makes this tractable.
Four safety rules
These come from a decade of preseed and kickstart pain.
- … /dev/sdX. Disk targeting only via stable match expressions. Enumeration order is not stable across NVMe, SATA, and USB, and "wiped the wrong disk" is the bug class that kills trust. No match, or more than one match, aborts with the list of disks. It never guesses.
- … early_commands run shell, but cannot rewrite the partition layout; that stays declarative.)
The whole file is validated strictly. Unknown keys, wrong types, and malformed YAML are hard errors before any disk is touched. A --check mode validates an answer file without touching disks, so it can run in CI, and --list-disks prints a machine's stable disk attributes so an operator can build a match expression.
How it is invoked
- … /cdrom/auto-install.yaml. No infrastructure, covers USB and remastered ISO.
- … live-installer.auto=<source>, where source is a path, an http(s)://, nfs://, or tftp:// URL, or auto:<base-url> for fleet-wide identity-based discovery (it finds its own file by MAC / SMBIOS serial / UUID). Cleartext transports that carry secrets are refused unless live-installer.auto-insecure is also passed.
- … docs/pxe-netboot.md.
With no answer file, the GUI runs exactly as today.
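How a live-installer.auto= value might be classified, as a sketch only. The function name classify_source and its return shape are hypothetical, and it simplifies the rule: it refuses every cleartext scheme unless the insecure flag is set, whereas the real check concerns cleartext transports that carry secrets.

```python
from typing import Tuple
from urllib.parse import urlparse

URL_SCHEMES = {"http", "https", "nfs", "tftp"}
CLEARTEXT_SCHEMES = {"http", "nfs", "tftp"}


def classify_source(source: str, allow_insecure: bool = False) -> Tuple[str, str]:
    """Return (kind, target) where kind is 'path', 'url' or 'discovery'."""
    discovery = source.startswith("auto:")
    target = source[len("auto:"):] if discovery else source
    scheme = urlparse(target).scheme
    if scheme == "":
        kind = "path"  # plain filesystem path such as /cdrom/auto-install.yaml
    elif scheme in URL_SCHEMES:
        if scheme in CLEARTEXT_SCHEMES and not allow_insecure:
            # Simplification: the real rule depends on whether secrets are carried.
            raise PermissionError(
                f"{scheme}:// is cleartext; pass live-installer.auto-insecure to allow it"
            )
        kind = "url"
    else:
        raise ValueError(f"unsupported answer-file source: {source}")
    return ("discovery" if discovery else kind, target)
```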
How it fits the existing code
This is what made me think it was worth doing. The engine is already shaped for it.
InstallerEngine has no GTK imports and talks to the UI through two callbacks, set_progress_hook and set_error_hook. The headless driver, auto_installer.py, is a sibling of InstallerWindow that registers console and log implementations of those same hooks. It does not change the engine's control flow. The automated partition path (setup.automated = True) already runs from flat Setup fields and never touches the GTK partition widgets.
The new modules are self-contained: auto_installer.py (driver), schema.py (validation), diskmatch.py (disk selection), discovery.py (answer-file discovery), netconfig.py (renders the network: section to NetworkManager keyfiles), pkgbackend.py (a swappable package backend so the repo/install step is not apt-hardcoded), catrust.py (renders ca_certs: into the system trust store), snapshotbackend.py (renders snapshots: to Timeshift), and commandrunner.py (one place that all shell-outs go through, for logging and secret handling). The swappable backends (package, CA trust, snapshot) all sit behind the commandrunner chroot boundary, so adding dnf/zypper, or a Snapper/rsync snapshot backend, is a drop-in rather than a rewrite. The changes to shared engine code are small and additive, and the CD/GUI path is unaffected: routing the engine's shell-outs through commandrunner (behaviour-preserving); one optional before_unmount_hook parameter on finish_installation() so the driver can do post-install work while the chroot is still mounted; and, for netboot, reading the kernel/initrd/package manifests from the rootfs when they are absent from the medium (a no-op on CD/USB, where they sit next to the squashfs).
Tests and CI
There is no CI in the repo today. This ships with three GitHub Actions workflows you can adopt incrementally.
Unit tests run pytest on Python 3.11 to 3.13, on every push and PR. They need no special hardware, since GTK and parted are stubbed. There are 356 of them.
Integration tests perform real installs in QEMU/KVM, nightly and on demand. The whole matrix (BIOS and UEFI across simple, LVM, and LUKS; multi-disk by-id; custom layouts with LVM and with btrfs subvolumes; software RAID 1 and RAID 5 + LVM-on-RAID; a prompt-on-first-boot LUKS passphrase typed over the serial console; flatpak remote + app installs; Timeshift-on-btrfs; static dual-stack IP + 802.1Q VLAN networking; no-media PXE/iPXE netboot installs on both BIOS and UEFI plus an IPv6 data-path install; and the malformed-input, no-disk-match, bios_grub-on-UEFI, and snapshots-on-ext4 failure cases) runs green on standard GitHub hosted runners. They have KVM, so no self-hosted lab is needed; the PXE scenarios use QEMU's built-in TFTP and iPXE, so they need no real DHCP/TFTP infrastructure. Serial logs are uploaded as artifacts so a failed run is debuggable without reproducing it locally.
A third workflow stands up a real NFS server on the runner (GitHub runners have passwordless sudo and a kernel nfsd) and fetches an answer file through the installer's actual nfs:// transport over the full matrix of protocol version (v3, v4.2) × address form (IPv4, IPv6, hostname), so NFS delivery is exercised end to end rather than mocked.
What it does not do yet
This is a base for the common desktop and workstation case, not a kickstart-class enterprise provisioner. Not implemented:
- … prompt-on-first-boot.
The biggest caveat is Mint itself, so to repeat it plainly: everything above has been tested on LMDE only. Mint does not ship live-installer until Mint 23, so I cannot run the Mint path end to end until the Mint 23 beta. The integration harness also assumes a Debian-live boot environment, and Mint uses casper, so the harness itself will need a Mint variant before it can cover Mint. The Mint vs LMDE code branch is pinned by unit tests so it will not silently regress, but please treat Mint support as designed-for and not yet verified.
Questions
- Is live-installer.auto= an acceptable kernel argument, and /cdrom/auto-install.yaml a reasonable well-known path? I handle both /cdrom and LMDE's /run/live/medium.