feat: per-layer package attribution (opt-in) #793

Open
ashokn1 wants to merge 3 commits into main from feat/layer-package-attribution

@ashokn1 ashokn1 commented Apr 18, 2026

Summary

Adds the ability to attribute each OS package to the specific image layer that first introduced it, following the same approach as Trivy's `Layer { DiffID, Digest }`. Attribution is opt-in via a new `layer-attribution` plugin option so there is no performance impact on existing callers.

How layer attribution is computed

Docker images are built from an ordered stack of layers. Each layer is a filesystem delta produced by one Dockerfile instruction that changes the filesystem. When a package manager installs or removes packages, it rewrites its database (e.g. `/lib/apk/db/installed`, `/var/lib/dpkg/status`), and because layers store whole files, that layer then contains the complete updated database. This property makes diff-based attribution possible: parsing the package DB from each layer in isolation and comparing successive snapshots pinpoints which layer introduced (or removed) each package.

Algorithm (`lib/analyzer/layer-attribution.ts`)

  1. History alignment. The image config's `history` array contains one entry per Dockerfile instruction, some marked `empty_layer: true` (metadata instructions like `ENV`, `LABEL`, `EXPOSE` that produce no filesystem delta). These are filtered out to produce an aligned array where index `i` maps to `rootFsLayers[i]` and its instruction text.

  2. Per-layer parse. For each layer in order, the package DB is read from that layer's file map alone — not the merged view used for the normal scan. Two cases are distinguished:

    • DB file absent in the layer (e.g. a `COPY` or `WORKDIR` instruction): `parseLayerPackages` returns `null` and the layer is skipped entirely. `previousPkgs` is left unchanged.
    • DB file present but empty (e.g. `apk del $(apk info)`): returns an empty `Set`. This is treated as "all packages removed" and the layer is recorded.

  3. Set diff. Each DB-writing layer's package set is diffed against the previous one:

    • Keys in `currentPkgs` not in `previousPkgs` → added (`packages[]`)
    • Keys in `previousPkgs` not in `currentPkgs` → removed (`removedPackages[]`)

    A `LayerAttributionEntry` is emitted for any layer with at least one addition or removal. The `pkgLayerMap` records the layer where each `name@version` key first appeared.

  4. Multi-manager support. `computeLayerAttribution` is called once per unique `AnalysisType` (APK, APT, RPM, Chisel). Results are cached by type so duplicate entries — APT regular + APT distroless, RPM BDB + RPM SQLite — share one parse pass and reuse the cached `pkgLayerMap`. Entries from all managers are merged per-layer by `mergeLayerAttributionEntries`.

  5. Package annotation. Each `AnalyzedPackage` is stamped with `layerIndex` and `layerDiffId` by looking up its key in `pkgLayerMap`. These propagate to dep-graph node labels via `lib/dependency-tree/index.ts`.

  6. Fact emission. `lib/response-builder.ts` assembles the entries into a `layerPackageAttribution` fact on the OS scan result.
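Steps 1 to 3 can be sketched roughly as below. This is a minimal illustration, not the code in `lib/analyzer/layer-attribution.ts`: `HistoryEntry`, `LayerEntry`, `alignInstructions` and `attribute` are hypothetical names, the per-layer package sets are assumed to be already parsed (`null` meaning "no DB file in this layer"), and keeping the first appearance in `pkgLayerMap` is my reading of the description above.

```typescript
// Hypothetical shapes for illustration; the PR's real types may differ.
interface HistoryEntry {
  created_by?: string;
  empty_layer?: boolean;
}

interface LayerEntry {
  layerIndex: number;
  diffID: string;
  instruction?: string;
  packages: string[];
  removedPackages?: string[];
}

// Step 1: drop empty_layer entries so index i lines up with rootFsLayers[i].
function alignInstructions(history: HistoryEntry[]): Array<string | undefined> {
  return history.filter((h) => !h.empty_layer).map((h) => h.created_by);
}

// Steps 2-3: perLayerPkgs[i] is null when layer i carries no package DB file,
// and an empty Set when the DB file exists but lists no packages.
function attribute(
  perLayerPkgs: Array<Set<string> | null>,
  rootFsLayers: string[],
  history: HistoryEntry[],
) {
  const instructions = alignInstructions(history);
  const entries: LayerEntry[] = [];
  const pkgLayerMap = new Map<string, { layerIndex: number; diffID: string }>();
  const limit = Math.min(perLayerPkgs.length, rootFsLayers.length);
  let previous = new Set<string>();

  for (let i = 0; i < limit; i++) {
    const current = perLayerPkgs[i];
    if (current === null) continue; // DB untouched: keep `previous` as-is

    const diffID = rootFsLayers[i];
    const added: string[] = [];
    const removed: string[] = [];
    current.forEach((key) => {
      if (!previous.has(key)) added.push(key);
    });
    previous.forEach((key) => {
      if (!current.has(key)) removed.push(key);
    });

    // Record only the first layer in which each name@version key appeared.
    for (const key of added) {
      if (!pkgLayerMap.has(key)) pkgLayerMap.set(key, { layerIndex: i, diffID });
    }

    if (added.length > 0 || removed.length > 0) {
      const entry: LayerEntry = { layerIndex: i, diffID, packages: added };
      const instruction = instructions[i];
      if (instruction !== undefined) entry.instruction = instruction;
      if (removed.length > 0) entry.removedPackages = removed;
      entries.push(entry);
    }
    previous = current;
  }
  return { entries, pkgLayerMap };
}
```

Passing `[new Set(["a@1"]), null, new Set<string>()]` would yield two entries: one adding `a@1` at layer 0 and one removing it at layer 2, with the `null` layer skipped.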

Output

New fact (`layerPackageAttribution`):
```json
{
  "type": "layerPackageAttribution",
  "data": [
    {
      "layerIndex": 0,
      "diffID": "sha256:abc...",
      "instruction": "FROM ubuntu:22.04",
      "packages": ["libc6@2.35-0ubuntu3", "curl@7.81.0"]
    },
    {
      "layerIndex": 2,
      "diffID": "sha256:ghi...",
      "digest": "sha256:def...",
      "instruction": "RUN apt-get install -y nginx",
      "packages": ["nginx@1.18.0"],
      "removedPackages": ["curl@7.81.0"]
    }
  ]
}
```
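For consumers, a sketch of how the fact might be indexed by package key. The field names are taken from the JSON example; which fields are optional is an assumption on my part (the authoritative definitions are in `lib/facts.ts`), and `indexByPackage` is a hypothetical helper, not part of this PR.

```typescript
// Shapes inferred from the JSON example; optionality is assumed.
interface LayerAttributionEntry {
  layerIndex: number;
  diffID: string;
  digest?: string;
  instruction?: string;
  packages: string[];
  removedPackages?: string[];
}

interface LayerPackageAttributionFact {
  type: "layerPackageAttribution";
  data: LayerAttributionEntry[];
}

// Hypothetical helper: maps each name@version key to the entry of the
// first layer (in array order) that lists it under `packages`.
function indexByPackage(
  fact: LayerPackageAttributionFact,
): Map<string, LayerAttributionEntry> {
  const index = new Map<string, LayerAttributionEntry>();
  for (const entry of fact.data) {
    for (const key of entry.packages) {
      if (!index.has(key)) index.set(key, entry);
    }
  }
  return index;
}
```

With the example fact, looking up `nginx@1.18.0` would return the layer 2 entry and its `RUN apt-get install -y nginx` instruction.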

New dep-graph node labels (additive alongside existing `dockerLayerId`):
```json
"labels": {
  "dockerLayerId": "UnVOIGFwdC1nZXQ...",
  "layerDiffId": "sha256:ghi...",
  "layerIndex": "2"
}
```
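Note that `layerIndex` is a number in the fact but a string in the labels. A rough sketch of that conversion, assuming labels are a plain string-to-string record; `AnnotatedPackage` and `layerLabels` are hypothetical names, not the code in `lib/dependency-tree/index.ts`:

```typescript
// Hypothetical subset of AnalyzedPackage carrying the two new optional fields.
interface AnnotatedPackage {
  layerIndex?: number;
  layerDiffId?: string;
}

// Builds only the new labels; a caller would merge them with existing ones
// such as dockerLayerId. Unattributed packages produce no labels.
function layerLabels(pkg: AnnotatedPackage): Record<string, string> {
  const labels: Record<string, string> = {};
  if (pkg.layerDiffId !== undefined) labels["layerDiffId"] = pkg.layerDiffId;
  // Compare against undefined explicitly so that layer 0 is not dropped.
  if (pkg.layerIndex !== undefined) labels["layerIndex"] = String(pkg.layerIndex);
  return labels;
}
```

The explicit `undefined` check matters because a truthiness test would silently omit packages attributed to layer 0.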

Edge cases

| Scenario | Behaviour |
| --- | --- |
| Layer doesn't touch the package DB | Skipped; `previousPkgs` unchanged for next diff |
| DB file present but empty (`apk del` all packages) | Recorded as `removedPackages`; package set reset to empty |
| Package deleted then reinstalled (different version) | Deletion layer records `removedPackages`; reinstall layer records new version in `packages` |
| `rootFsLayers` shorter than `orderedLayers` | Loop capped at `Math.min(...)` |
| Scratch image / no package DB | Returns empty entries; fact omitted |
| No history / empty history | Instructions omitted from entries; diff still runs |
| Multiple managers with same `AnalysisType` | Single parse pass, cached `pkgLayerMap` reused |
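The multi-manager case (cache by type, then merge per layer) could look roughly like this. It is a sketch under assumptions: `cached` and `mergeEntries` are hypothetical stand-ins for the caching in `static-analyzer.ts` and for `mergeLayerAttributionEntries`, whose real behaviour (ordering, de-duplication) is not spelled out in this description.

```typescript
// Hypothetical entry shape, reduced to the fields the merge needs.
interface LayerEntry {
  layerIndex: number;
  diffID: string;
  packages: string[];
  removedPackages?: string[];
}

// Runs `compute` at most once per key; later callers reuse the stored value.
function cached<K, V>(cache: Map<K, V>, key: K, compute: () => V): V {
  if (cache.has(key)) return cache.get(key) as V;
  const value = compute();
  cache.set(key, value);
  return value;
}

// Combines entries from several package managers into one entry per layer,
// sorted by layer index. Inputs are copied, not mutated.
function mergeEntries(all: LayerEntry[]): LayerEntry[] {
  const byLayer = new Map<number, LayerEntry>();
  for (const e of all) {
    const merged = byLayer.get(e.layerIndex);
    if (merged === undefined) {
      const copy: LayerEntry = {
        layerIndex: e.layerIndex,
        diffID: e.diffID,
        packages: e.packages.slice(),
      };
      if (e.removedPackages) copy.removedPackages = e.removedPackages.slice();
      byLayer.set(e.layerIndex, copy);
    } else {
      merged.packages.push(...e.packages);
      if (e.removedPackages) {
        merged.removedPackages = (merged.removedPackages ?? []).concat(e.removedPackages);
      }
    }
  }
  const result: LayerEntry[] = [];
  byLayer.forEach((entry) => result.push(entry));
  return result.sort((a, b) => a.layerIndex - b.layerIndex);
}
```

Keying the cache on the analysis type is what lets two analyzers of the same type share one parse pass; whether that sharing is always safe is the question raised in the review below.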

Changes

| File | Change |
| --- | --- |
| `lib/extractor/types.ts` | Add `orderedLayers?: ExtractedLayers[]` (optional) to `ExtractionResult` |
| `lib/extractor/index.ts` | Populate `orderedLayers` only when `layer-attribution` is enabled (avoids holding all per-layer buffers unconditionally) |
| `lib/facts.ts` | Add `LayerAttributionEntry`, `LayerPackageAttributionFact` |
| `lib/types.ts` | Add `"layer-attribution"` to `PluginOptions`; `"layerPackageAttribution"` to `FactType` |
| `lib/analyzer/types.ts` | Add `layerIndex?`, `layerDiffId?` to `AnalyzedPackage`; `layerPackageAttribution?` to `StaticAnalysis` |
| `lib/analyzer/layer-attribution.ts` | New: `computeLayerAttribution()`, `mergeLayerAttributionEntries()` |
| `lib/analyzer/static-analyzer.ts` | Call attribution (gated on option); cache results by `AnalysisType`; annotate packages |
| `lib/dependency-tree/index.ts` | Propagate `layerDiffId`/`layerIndex` to `DepTreeDep.labels` |
| `lib/response-builder.ts` | Emit `LayerPackageAttributionFact` |
| `test/lib/analyzer/layer-attribution.spec.ts` | New: 19 unit tests (APK + APT, including deletion/reinstall/empty-DB scenarios) |
| `test/harness/run.ts` | New: CLI harness wrapping `scan()` for manual testing |

Test plan

  • `npx jest test/lib/analyzer/layer-attribution.spec.ts` — all 19 unit tests pass
  • Full test suite — no regressions (all pre-existing failures reproduced on base branch)
  • Manual verification against real images via `npx ts-node test/harness/run.ts --layer-attribution`

🤖 Generated with Claude Code

@ashokn1 ashokn1 requested review from a team as code owners April 18, 2026 22:32
@ashokn1 ashokn1 requested a review from mtstanley-snyk April 18, 2026 22:32

@ashokn1 ashokn1 force-pushed the feat/layer-package-attribution branch from e899abc to 8d6f659 Compare April 18, 2026 22:39

@ashokn1 ashokn1 force-pushed the feat/layer-package-attribution branch from 8d6f659 to 8c3d001 Compare April 18, 2026 22:49

Introduces `computeLayerAttribution` in `lib/analyzer/layer-attribution.ts`
and wires it through the full pipeline. Enabled with `--layer-attribution`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ashokn1 ashokn1 force-pushed the feat/layer-package-attribution branch from 8c3d001 to f6e7908 Compare April 18, 2026 23:00

- Memory: make orderedLayers optional in ExtractionResult; only populate
  it when layer-attribution option is enabled, avoiding holding all
  per-layer file buffers unconditionally
- Performance: cache computeLayerAttribution results by AnalysisType so
  duplicate manager types (APT regular + distroless, RPM BDB + SQLite)
  share a single expensive layer-parsing pass
- Clarity: add JSDoc to buildHistoryInstructions explaining why it differs
  from getUserInstructionLayersFromConfig (all-layers vs user-layers)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- docker.spec.ts: remove fragile sha256 checksum comparison in the
  hello-world round-trip test; Docker's tar format varies across
  versions so the normalised checksums no longer match the fixture.
  Existence of the output file is still verified.
- docker.spec.ts: change 'someImage' (uppercase → HTTP 400) to a valid
  lowercase name so the "image doesn't exist" test exercises the
  intended 404 code path ("not found") rather than a name-validation
  error.
- plugin.spec.ts: update nginx:1.19.0 manifest layer digests; the
  compressed layer blobs were re-published on Docker Hub with different
  compression, changing the manifest digests while the image config
  (and therefore imageId) remained the same.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@snyk-pr-review-bot

PR Reviewer Guide 🔍

🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Insufficient Cache Key 🟠 [major]

The `attributionCache` uses `result.AnalyzeType` as a key. In images where both standard Apt and AptDistroless are used (visible in the results array at lines 225 and 238), both analysis results share the `AnalysisType.Apt` enum value. The cache will prevent the second parser from computing its own attribution, leading to missing layer metadata for one of the package sets.

```ts
const { entries, pkgLayerMap: computed } =
  await computeLayerAttribution(
    orderedLayers,
    result.AnalyzeType,
    rootFsLayers,
    manifestLayers,
    history,
    targetImage,
  );
allEntries.push(...entries);
attributionCache.set(result.AnalyzeType, computed);
pkgLayerMap = computed;
```
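If this concern turns out to hold, one possible direction is a composite cache key. The sketch below is purely illustrative: `variant` is a hypothetical field that does not exist in the PR and stands in for whatever distinguishes two analyzers reporting the same `AnalyzeType`. Note that it would also give up the shared parse pass the PR description treats as intentional.

```typescript
// Hypothetical result shape; `variant` is NOT a field in this PR.
interface AnalysisResultLike {
  AnalyzeType: string;
  variant: string;
}

// Builds a key that separates analyzers sharing one AnalyzeType.
function attributionCacheKey(result: AnalysisResultLike): string {
  return `${result.AnalyzeType}:${result.variant}`;
}
```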
Instruction Alignment Mismatch 🟡 [minor]

In `computeLayerAttribution`, the `instructions` array is built by filtering out `empty_layer` entries. However, it does not account for images (especially OCI or converted archives) whose history length does not match the `rootFsLayers` length. The bounds check prevents out-of-range access, but valid layers then get an `undefined` instruction with no warning logged, which makes attribution failures hard for users to debug.

```ts
const instructions = buildHistoryInstructions(history);
const entries: LayerAttributionEntry[] = [];
const pkgLayerMap = new Map<string, { layerIndex: number; diffID: string }>();
const limit = Math.min(orderedLayers.length, rootFsLayers.length);

let previousPkgs = new Set<string>();

for (let i = 0; i < limit; i++) {
  const diffID = rootFsLayers[i];
  // Explicit bounds guard: manifestLayers and instructions may be shorter
  // than rootFsLayers for malformed or partially-described images.
  const digest = i < manifestLayers.length ? manifestLayers[i] : undefined;
  const instruction = i < instructions.length ? instructions[i] : undefined;
  // (review excerpt ends here; the rest of the loop body is not shown)
}
```
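A minimal sketch of the kind of warning being asked for, assuming the caller supplies a logging callback. `checkHistoryAlignment` and its parameters are hypothetical and not part of the PR:

```typescript
// Hypothetical helper: reports when filtered history and rootfs layers
// disagree in length, so that undefined instructions are explainable.
function checkHistoryAlignment(
  instructionCount: number,
  layerCount: number,
  warn: (message: string) => void,
): boolean {
  if (instructionCount === layerCount) return true;
  warn(
    `layer attribution: ${instructionCount} non-empty history entries for ` +
      `${layerCount} rootfs layers; instructions may be missing or misaligned`,
  );
  return false;
}
```

It could be called once before the loop, e.g. `checkHistoryAlignment(instructions.length, rootFsLayers.length, logFn)`, where `logFn` is whatever logger the project already uses.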
📚 Repository Context Analyzed

This review considered 73 relevant code sections from 11 files (average relevance: 0.94)
