week_3: Module C (The Librarian) — C.1 candidate retriever (in-memory + pgvector) + pipeline switch by PRAteek-singHWY · Pull Request #937 · OWASP/OpenCRE

PRAteek-singHWY · 2026-06-18T13:45:54Z

week_3: Module C (The Librarian) — C.1 semantic candidate retriever + pipeline switch

Stacked on #925 (Week 2). This branch is based on gsocmodule_C_week_2; only the top commit is new. Until #922 and #925 merge, the diff shows the W1+W2 commits too — I will rebase onto main as they land, shrinking it to the Week 3 files only.

Overview

Week 1 built the judge (golden dataset + scorer + harness). Week 2 built the parts that must never guess — the C.0 input boundary and the deterministic explicit-CRE fast path. This is the first part that does guess.

The problem: the explicit fast path only fires when a CRE id (ddd-ddd) is literally written in the chunk. Most chunks just describe a security concept in plain English, so for those C must find which of OpenCRE's CRE nodes the text is about. That is a search problem — and the first place in the pipeline that can be wrong.

This PR's role: build the search step (C.1). Embed the section text, cosine-match it against the CRE-node vector hub, and return the top-K (default 20) rank-ordered candidates — the shortlist the W4 cross-encoder reranks and later weeks turn into a confident yes/no. Because Week 1 already built the judge, this first ML component gets a measured score from day one (recall@k, see Results).

Scope: 1 new module + benchmark + 1 new test, plus config/CLI/harness wiring. No frontend, no migration, no behaviour change to OpenCRE proper. (The pgvector schema migration lands in W8 — see "not here".)

What changed

Area	Files	Description
C.1 retriever	`candidate_retriever.py` (new)	Two interchangeable backends behind one `retrieve()` seam, both emitting the same RFC `RetrievalAudit` (`candidates[]` rank-ordered with `score_vector`; `reranked[]` empty until W4). `CandidateRetriever` — in-memory sklearn cosine over an injected `embed_fn` + immutable `CandidatePool`. `PgVectorRetriever` — pushes the cosine into Postgres via the `<=>` operator over an `embedding_vec` column (scored `1 - distance`), never loading the hub into RAM. `build_retriever(backend, ...)` is the factory. DI keeps the module import-light and hermetically testable (same pattern as W2's resolver). Reuses OpenCRE's embedding stack (`PromptHandler.get_text_embeddings`, `db.get_embeddings_by_doc_type`, `cosine_similarity`).
Backend select	`candidate_retriever.py`, `config_loader.py`	`RetrieverBackend` enum + `CRE_LIBRARIAN_RETRIEVER_BACKEND`. The RFC is silent on retrieval tech (it mandates only the audit trail), so the choice is ours. `in_memory` is the only backend that runs on SQLite, so CI/dev/harness use it; `pgvector` is the proposal's primary for prod Postgres.
Dim gate	`candidate_retriever.py`	`CandidatePool.from_mapping` rejects an empty or ragged hub; `retrieve` asserts query-vector width == stored CRE-vector width. A silent dim mismatch makes every cosine score meaningless → fails loudly (`DimensionMismatchError`). Postgres enforces this structurally via the fixed-width `vector(dim)` column.
Config	`config_loader.py`, `.env.example`	`CRE_LIBRARIAN_RETRIEVER_BACKEND` (validated), plus a documented Librarian env block.
CLI switch	`cre.py`, `cre_main.py`	`--run_librarian` / `--librarian_dry_run` (+ `--librarian_source`): read knowledge-queue sections → explicit resolver (C.0.5) → `build_retriever`-selected backend (C.1) → log the shortlist per section. Writes land W8, so this is dry-run only; `--run_librarian` without writes warns and behaves identically.
Benchmark	`benchmark_retriever.py` (new)	Times in-memory vs pgvector over the real CRE hub and checks top-1 agreement. in-memory always runs; the pgvector leg auto-skips (loudly, never silently passes) unless the DB is Postgres with the `embedding_vec` column present.
Eval harness	`evaluate_librarian.py`	`--use_live_embeddings` measures retrieval recall@k over the positive slice against the real DB + embedding model. Offline (CI default) it stays honest: the retriever is wired, but recall@k needs live embeddings — there is no offline number because the pool must be the real CRE vectors, and seeding it from golden text is exactly the leakage the hub firewall strips.
Tests	`candidate_retriever_test.py` (new)	16 hermetic tests: cosine rank-ordering, top-K truncation + cap, the dim gate (ragged/zero-width/query mismatch), empty-pool rejection, `score_vector`/`cre_name` population, W3 audit shape; pgvector parsing + SQL-binding via a fake recording connection (no DB needed); and the `build_retriever` factory routing + arg-mismatch guards.

How the pieces connect

flowchart TB
    sec["Section (from C.0, W2)"]
    res["explicit_link_resolver (C.0.5, W2)<br/>regex ddd-ddd, no ML"]

    subgraph C1["C.1 — semantic retriever (this PR)"]
        embed["embed_fn(text)<br/>(PromptHandler.get_text_embeddings)"]
        factory["build_retriever(backend)"]
        inmem["CandidateRetriever<br/>sklearn cosine + top-K<br/>(SQLite / CI / harness)"]
        pg["PgVectorRetriever<br/>embedding_vec &lt;=&gt; query<br/>(Postgres prod)"]
        audit["RetrievalAudit<br/>candidates[] (score_vector)<br/>reranked[] = [] (W4)"]
    end

    rev["unknown / conflicting -> review (W2)"]
    rerank["CrossEncoderRanker (C.2, W4)<br/>reranks shortlist -> top 5"]

    sec --> res
    res -- "resolved" --> done["deterministic link"]
    res -- "no_reference" --> embed
    res -- "unknown / conflicting" --> rev
    embed --> factory
    factory --> inmem
    factory --> pg
    inmem --> audit
    pg --> audit
    audit --> rerank

Results

# offline (CI default)
explicit slice (C.0.5 resolver): 5/5 — gate 100%: PASS
semantic retriever (C.1): wired; recall@k needs --use_live_embeddings
correct overall (semantic path still stubbed): 10/319

# live (local, against a populated cache DB + embedding model)
retrieval recall@20 (live, 292 positive rows): any-hit __/292, all-hit __/292   # measured before review

82 librarian tests passing (66 from W1+W2 + 16 new)
both backends return the top-K shortlist rank-ordered by cosine, score_vector set, reranked[] empty (W4 fills it)

Live recall@20 is measured against a populated cache DB before this PR leaves draft — I'll paste the real numbers into the block above. Recall@k is the right W3 metric (not the Jaccard link rule, which grades the final auto-link and needs the W4 reranker): it asks the one question the search step controls — does the expected CRE id reach the shortlist the reranker will see? A miss here is unrecoverable downstream.

What is intentionally not here

The cross-encoder rerank (C.2, W4), decision/threshold routing (C.3–C.4, W5), SafetyGuard (W5), and graph writes (W8) are later weeks. reranked[] is therefore empty, and the pipeline is dry-run only.

The pgvector schema migration is intentionally a W8 deliverable. PgVectorRetriever + the build_retriever factory + the benchmark ship here (the W3 code deliverable), fully unit-tested via a fake connection. But the embedding_vec column it queries is created by an Alembic migration that the deliverables plan schedules for W8 (live integration) — where it can be validated against real Postgres (flask db upgrade/downgrade, make alembic-guardrail, backfill, ivfflat tuning). Until then pgvector is selectable by config but only runnable once W8 lands the column; in_memory is the default and the only backend used on SQLite/CI. Shipping the retriever code ahead of the schema keeps this PR free of an unvalidated shared-table migration.

How to verify locally

# 82 librarian tests (both backends + factory, all hermetic)
python3 -m unittest discover -s application/tests/librarian -p '*_test.py' -t .

# offline harness: C.0 pass rate + explicit gate + retriever wired
python3 scripts/evaluate_librarian.py --dataset application/tests/librarian/fixtures/golden_dataset.json

# live recall@20 (needs a populated cache DB + an embedding-capable LLM)
python3 scripts/evaluate_librarian.py \
    --dataset application/tests/librarian/fixtures/golden_dataset.json \
    --use_live_embeddings --cache_file standards_cache.sqlite

# run the pipeline switch end-to-end (dry run, logs the shortlist per section)
python3 cre.py --librarian_dry_run --cache_file standards_cache.sqlite

# backend benchmark (pgvector leg skips unless Postgres + embedding_vec present)
python3 scripts/benchmark_retriever.py --cache_file standards_cache.sqlite

dataset Contracts + regression ruler before any pipeline code, per the OIE RFC's 'test before the code' directive. RFC OWASP#734 envelopes (KnowledgeItem in, LinkProposal/ReviewItem out) as Pydantic v2, drift-guarded against the vendored owasp-graph schemas; TRACT hub-firewall + multi-link scoring; 319-row golden dataset derived from standards_cache.sqlite with --check drift detection. One prod edit: pydantic>=2,<3 pin.

…nts, fail-fast build, edge cases

…inkResolver The C.0 deterministic input boundary, per the proposal's W2/W3 'data preparation layer' (named validator, not normalizer: the RFC assigns text normalization to Module A; C validates and adapts, never transforms text). - section_validator.py: typed-error validation of both upstream shapes (B's knowledge_queue row + RFC KnowledgeItem envelope) into an internal Section; synthesizes RFC identity fields (chunk_id/artifact_id/source/ locator) from B's reduced row; strips volatile audit metadata. - explicit_link_resolver.py: deterministic ddd-ddd fast path, no ML. Fail-safe: only a single known reference auto-links; unknown or conflicting references route to review. - evaluate_librarian.py: harness now runs every golden row through C.0, prints per-slice validation pass rate, and gates the explicit slice at 100% resolver correctness (exit 1 on regression). Gate: PASS 5/5. - Table-driven tests for every rejection class and resolver outcome. mypy --strict clean, black clean, 66 tests green.

coderabbitai · 2026-06-18T13:46:04Z

Warning

Review limit reached

@PRAteek-singHWY, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 45 minutes and 18 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: a04e724b-125d-471f-9d93-38b81d645ff0

📥 Commits

Reviewing files that changed from the base of the PR and between 3c447bf and 695a016.

📒 Files selected for processing (8)

application/tests/librarian/config_loader_test.py
application/tests/librarian/dataset_test.py
application/tests/librarian/section_validator_test.py
application/utils/librarian/knowledge_source.py
application/utils/librarian/schemas.py
application/utils/librarian/section_validator.py
requirements.txt
scripts/build_golden_dataset.py

Walkthrough

This PR adds Module C (“The Librarian”) with RFC-aligned schemas and Pydantic models, environment-based configuration, C.0 section validation, explicit CRE resolution, semantic retrieval, evaluation utilities, golden dataset tooling, and CLI integration.

Changes

Module C — The Librarian

Layer / File(s)	Summary
RFC schemas and models `application/utils/librarian/__init__.py`, `application/utils/librarian/_rfc_schemas/*.json`, `application/utils/librarian/schemas.py`, `requirements.txt`, `application/tests/librarian/schemas_test.py`	Vendored Draft 2020-12 JSON Schemas and mirrored Pydantic v2 models for the RFC envelopes, internal queue rows, and golden dataset rows; `pydantic` is constrained to `>=2,<3`, and canonical-schema round-trip tests cover the contract models.
Librarian config and env defaults `.env.example`, `application/utils/librarian/config_loader.py`, `application/tests/librarian/config_loader_test.py`	Frozen `LibrarianConfig`, `load_config()` for `CRE_LIBRARIAN_*` variables, env defaults, and tests for defaults, overrides, immutability, and invalid values.
C.0 section validation and explicit CRE fast path `application/utils/librarian/section_validator.py`, `application/utils/librarian/knowledge_source.py`, `application/tests/librarian/fixtures/sample_knowledge_queue.jsonl`, `application/tests/librarian/section_validator_test.py`, `application/utils/librarian/explicit_link_resolver.py`, `application/tests/librarian/explicit_link_resolver_test.py`	Section validation for queue rows and knowledge items, fixture-backed knowledge sourcing, and deterministic explicit CRE extraction/resolution with tests for validation and resolution outcomes.
C.1 semantic candidate retrieval `application/utils/librarian/candidate_retriever.py`, `application/tests/librarian/candidate_retriever_test.py`, `scripts/benchmark_retriever.py`	In-memory cosine and pgvector-backed retrieval, candidate-pool validation, retriever factory wiring, backend-specific tests, and a benchmark script that times both backends and compares top-1 agreement.
Hub firewall and scoring `application/utils/librarian/hub_firewall.py`, `application/utils/librarian/scoring.py`, `application/tests/librarian/hub_firewall_test.py`, `application/tests/librarian/scoring_test.py`	Hub leakage detection/filtering and Jaccard-based scoring, with tests covering normalization and score boundaries.
Golden dataset tooling and schema checks `scripts/build_golden_dataset.py`, `application/tests/librarian/fixtures/golden_dataset.schema.json`, `application/tests/librarian/dataset_test.py`	Deterministic golden dataset generation, dataset JSON Schema, shape/determinism tests, and committed dataset drift checks.
CLI wiring and evaluation scripts `cre.py`, `application/cmd/cre_main.py`, `scripts/evaluate_librarian.py`	CLI flags for Module C, `run_librarian()` wiring, and the end-to-end evaluation harness that validates sections, gates explicit resolution, and reports recall.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested reviewers

Pa04rth

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 13.55% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly describes the Week 3 Librarian candidate retriever and pipeline switch changes.
Description check	✅ Passed	The description is directly aligned with the retriever, CLI wiring, benchmark, and testing changes in this PR.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

…ipeline switch The semantic search step. For sections with no explicit CRE id, embed the text and cosine-rank the CRE vector hub, returning the top-K (default 20) candidates as an RFC RetrievalAudit (reranked[] empty until W4). Two interchangeable backends behind one retrieve() seam, selected by CRE_LIBRARIAN_RETRIEVER_BACKEND: - CandidateRetriever: in-memory sklearn cosine (SQLite/CI/harness) - PgVectorRetriever: Postgres-side <=> cosine over embedding_vec build_retriever() is the factory. Reuses OpenCRE's embedding stack (PromptHandler.get_text_embeddings, db.get_embeddings_by_doc_type). Dim gate fails loudly on empty/ragged hubs and query/hub width mismatch. CLI switch (--run_librarian / --librarian_dry_run) runs the pipeline dry-run; harness --use_live_embeddings measures retrieval recall@k. The pgvector schema migration is a W8 deliverable (validated against real Postgres there); W3 ships only the code path, unit-tested via a fake connection. 82 librarian tests pass.

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

application/utils/librarian/knowledge_source.py (1)

29-35: ⚠️ Potential issue | 🟠 Major

Wrap model_validate_json in error handling to prevent a single malformed line from aborting the entire run.

At line 34, an uncaught ValidationError from model_validate_json will propagate out of the generator and terminate the iteration in cre_main, causing the entire run to fail. The codebase already handles similar validation errors elsewhere (e.g., section_validator.py catches ValidationError and continues); apply the same pattern here by catching the exception and logging a warning before continuing to the next line.

Suggested fix

+import logging
 import json
 from abc import ABC, abstractmethod
 from typing import Iterator
 
+from pydantic import ValidationError
+
 from application.utils.librarian.schemas import KnowledgeQueueItem
 
+logger = logging.getLogger(__name__)
+
@@
     def items(self) -> Iterator[KnowledgeQueueItem]:
         with open(self._path, encoding="utf-8") as fh:
-            for line in fh:
+            for line_no, line in enumerate(fh, start=1):
                 line = line.strip()
                 if line:
-                    yield KnowledgeQueueItem.model_validate_json(line)
+                    try:
+                        yield KnowledgeQueueItem.model_validate_json(line)
+                    except ValidationError as exc:
+                        logger.warning(
+                            "Skipping malformed knowledge_queue row at line %d: %s",
+                            line_no,
+                            exc,
+                        )
+                        continue

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@application/utils/librarian/knowledge_source.py` around lines 29 - 35, The
items method in the KnowledgeSource class does not handle ValidationError
exceptions that can be raised by model_validate_json when processing malformed
lines. Wrap the KnowledgeQueueItem.model_validate_json call in a try-except
block to catch ValidationError, log a warning message when validation fails for
a particular line, and use continue to skip that line and proceed to the next
one instead of allowing the exception to propagate and terminate the entire
iteration. This pattern is already implemented elsewhere in the codebase such as
in section_validator.py and should be applied consistently here.

🧹 Nitpick comments (2)

application/utils/librarian/__init__.py (1)

12-12: ⚡ Quick win

Update this scope note to reflect current module reality.

Line 12 says there is “No linking logic yet,” but this PR stack now includes explicit resolution and semantic retrieval in Module C. Keeping this sentence stale can mislead future maintainers.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@application/utils/librarian/__init__.py` at line 12, Update the scope
documentation at line 12 in the __init__.py file to reflect the current state of
the librarian module. Remove or revise the statement that says "No linking logic
yet" since the PR stack now includes explicit resolution and semantic retrieval
functionality in Module C. Update the scope note to accurately describe what the
module currently contains and supports.

application/tests/librarian/config_loader_test.py (1)

12-19: ⚡ Quick win

Add retriever_backend assertions to complete the config contract checks.

The default and override tests validate most fields but skip retriever_backend, which is part of LibrarianConfig and a key routing control.

Proposed test additions

 class TestConfigLoaderDefaults(unittest.TestCase):
@@
     def test_defaults_when_env_unset(self):
         with mock.patch.dict(os.environ, {}, clear=True):
             cfg = load_config()
+        self.assertEqual(cfg.retriever_backend, "in_memory")
         self.assertEqual(cfg.crossencoder_model, "cross-encoder/ms-marco-MiniLM-L-6-v2")
@@
 class TestConfigLoaderOverrides(unittest.TestCase):
     OVERRIDES = {
+        "CRE_LIBRARIAN_RETRIEVER_BACKEND": "pgvector",
         "CRE_LIBRARIAN_CROSSENCODER_MODEL": "cross-encoder/other",
@@
     def test_env_overrides_apply(self):
         with mock.patch.dict(os.environ, self.OVERRIDES, clear=True):
             cfg = load_config()
+        self.assertEqual(cfg.retriever_backend, "pgvector")
         self.assertEqual(cfg.crossencoder_model, "cross-encoder/other")

Also applies to: 27-35, 40-47

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@application/tests/librarian/config_loader_test.py` around lines 12 - 19, The
test assertions for the configuration loader are incomplete and missing
validation for the `retriever_backend` field, which is a key part of the
LibrarianConfig contract. Add self.assertEqual assertions to verify the
`retriever_backend` field matches the expected value in both the default
configuration test and the override configuration tests (including the sections
at lines 27-35 and 40-47), following the same pattern used for other fields like
crossencoder_model, top_k_retrieval, batch_size, and conformal_alpha.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@application/tests/librarian/config_loader_test.py`:
- Around line 20-23: The test_config_is_frozen method has two issues: it doesn't
isolate os.environ when calling load_config(), which can cause spurious failures
if local CRE_LIBRARIAN_* environment variables are set (other tests in the same
file correctly use mock.patch.dict(os.environ, {}, clear=True)), and it uses
assertRaises(Exception) which is too broad and can mask unrelated failures. Wrap
the load_config() call with mock.patch.dict(os.environ, {}, clear=True) to
isolate the environment, and change assertRaises(Exception) to
assertRaises(FrozenInstanceError) from the dataclasses module to assert the
specific exception that should be raised when attempting to modify a frozen
dataclass instance.

In `@application/tests/librarian/dataset_test.py`:
- Around line 110-114: Add a timeout parameter to the subprocess.run() call that
executes the determinism check with _BUILD_SCRIPT and "--check" arguments. This
will prevent the test from hanging indefinitely if the subprocess stalls due to
database locks or I/O issues. Include an appropriate timeout value (in seconds)
in the subprocess.run() function to ensure the test fails fast rather than
blocking the entire test suite.

In `@application/utils/librarian/schemas.py`:
- Line 67: The Pydantic model fields are currently using plain Optional[str]
types which do not enforce RFC format validation, allowing invalid payloads
through and breaking contract parity. Fix this by updating six fields across the
models: SourceRef.url and SourceRef.committed_at, Locator.url,
KnowledgeItem.filtered_at, LinkProposal.classified_at, and
ReviewItem.created_at. Replace all URI fields (url and similar) with AnyUrl type
imported from pydantic module, and replace all date-time fields (committed_at,
filtered_at, classified_at, created_at) with datetime type imported from the
standard library. Pydantic v2 will natively validate RFC 3339 format for
datetime fields and enforce valid URI schemes for AnyUrl fields, maintaining RFC
schema contract requirements.

In `@application/utils/librarian/section_validator.py`:
- Around line 163-166: Replace the assert statement that checks if item.content
is not None with an explicit if check followed by raising
MalformedKnowledgeItemError. The assert statement on line 164 gets stripped
during optimized Python execution (python -O), causing subsequent calls to
item.content.text and item.content.language in _require_text() and
_require_language() to fail with AttributeError instead of the expected typed
exception. Use an if statement that validates item.content is not None, and if
it fails, raise MalformedKnowledgeItemError with an appropriate message about
missing content to ensure consistent validation behavior and error reporting at
this boundary regardless of Python optimization flags.

In `@scripts/build_golden_dataset.py`:
- Line 261: The loop variable `node_id` in the for loop that iterates over
`rows` is never used in the loop body, which triggers a B007 lint error. Rename
this unused loop variable to an underscore `_` to indicate that it is
intentionally unused and to comply with coding guidelines, ensuring that the
loop still correctly unpacks all the values from each row tuple.

---

Outside diff comments:
In `@application/utils/librarian/knowledge_source.py`:
- Around line 29-35: The items method in the KnowledgeSource class does not
handle ValidationError exceptions that can be raised by model_validate_json when
processing malformed lines. Wrap the KnowledgeQueueItem.model_validate_json call
in a try-except block to catch ValidationError, log a warning message when
validation fails for a particular line, and use continue to skip that line and
proceed to the next one instead of allowing the exception to propagate and
terminate the entire iteration. This pattern is already implemented elsewhere in
the codebase such as in section_validator.py and should be applied consistently
here.

---

Nitpick comments:
In `@application/tests/librarian/config_loader_test.py`:
- Around line 12-19: The test assertions for the configuration loader are
incomplete and missing validation for the `retriever_backend` field, which is a
key part of the LibrarianConfig contract. Add self.assertEqual assertions to
verify the `retriever_backend` field matches the expected value in both the
default configuration test and the override configuration tests (including the
sections at lines 27-35 and 40-47), following the same pattern used for other
fields like crossencoder_model, top_k_retrieval, batch_size, and
conformal_alpha.

In `@application/utils/librarian/__init__.py`:
- Line 12: Update the scope documentation at line 12 in the __init__.py file to
reflect the current state of the librarian module. Remove or revise the
statement that says "No linking logic yet" since the PR stack now includes
explicit resolution and semantic retrieval functionality in Module C. Update the
scope note to accurately describe what the module currently contains and
supports.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: a0a9bece-b59f-4c00-a238-d4dd2b76ad18

📥 Commits

Reviewing files that changed from the base of the PR and between e853cd3 and 73f6a52.

📒 Files selected for processing (34)

.env.example
application/cmd/cre_main.py
application/tests/librarian/__init__.py
application/tests/librarian/candidate_retriever_test.py
application/tests/librarian/config_loader_test.py
application/tests/librarian/dataset_test.py
application/tests/librarian/explicit_link_resolver_test.py
application/tests/librarian/fixtures/golden_dataset.json
application/tests/librarian/fixtures/golden_dataset.schema.json
application/tests/librarian/fixtures/sample_knowledge_queue.jsonl
application/tests/librarian/hub_firewall_test.py
application/tests/librarian/schemas_test.py
application/tests/librarian/scoring_test.py
application/tests/librarian/section_validator_test.py
application/utils/librarian/__init__.py
application/utils/librarian/_rfc_schemas/knowledge-item.json
application/utils/librarian/_rfc_schemas/link-proposal.json
application/utils/librarian/_rfc_schemas/locator.json
application/utils/librarian/_rfc_schemas/proposed-link.json
application/utils/librarian/_rfc_schemas/review-item.json
application/utils/librarian/_rfc_schemas/source-ref.json
application/utils/librarian/candidate_retriever.py
application/utils/librarian/config_loader.py
application/utils/librarian/explicit_link_resolver.py
application/utils/librarian/hub_firewall.py
application/utils/librarian/knowledge_source.py
application/utils/librarian/schemas.py
application/utils/librarian/scoring.py
application/utils/librarian/section_validator.py
cre.py
requirements.txt
scripts/benchmark_retriever.py
scripts/build_golden_dataset.py
scripts/evaluate_librarian.py

Resolve the CodeRabbit review comments on the Module C PR: - schemas.py: enforce RFC format parity — AnyUrl for url fields, datetime for committed_at/filtered_at/classified_at/created_at - section_validator.py: replace assert with a typed MalformedKnowledgeItemError guard (survives python -O) - knowledge_source.py: skip+log malformed JSONL rows instead of aborting the whole iteration on ValidationError - config_loader_test.py: isolate os.environ and assert the specific FrozenInstanceError - dataset_test.py: add a timeout to the determinism subprocess call - build_golden_dataset.py: rename unused loop var node_id -> _node_id - section_validator_test.py: assert committed_at as the typed datetime Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Resolve the CodeRabbit review comments on the Module C PR: - schemas.py: enforce RFC format parity — AnyUrl for url fields, datetime for committed_at/filtered_at/classified_at/created_at - section_validator.py: replace assert with a typed MalformedKnowledgeItemError guard (survives python -O) - knowledge_source.py: skip+log malformed JSONL rows instead of aborting the whole iteration on ValidationError - config_loader_test.py: isolate os.environ and assert the specific FrozenInstanceError - dataset_test.py: add a timeout to the determinism subprocess call - build_golden_dataset.py: rename unused loop var node_id -> _node_id - section_validator_test.py: assert committed_at as the typed datetime

…ASP#925) GHSA pydantic ReDoS affects >=2.0.0,<2.4.0; first patched in 2.4.0.

PRAteek-singHWY added 3 commits June 9, 2026 17:17

week_1: address review — config validation, schema non-empty constrai…

e097016

…nts, fail-fast build, edge cases

PRAteek-singHWY marked this pull request as ready for review June 18, 2026 13:51

PRAteek-singHWY force-pushed the gsocmodule_C_week_3 branch from 73f6a52 to 78b1cb1 Compare June 18, 2026 14:01

coderabbitai Bot reviewed Jun 18, 2026

View reviewed changes

Merge branch 'main' into gsocmodule_C_week_3

2453895

PRAteek-singHWY added 2 commits June 27, 2026 01:29

week_2: pin pydantic>=2.4.0 to avoid v2 ReDoS advisory (CodeRabbit OW…

695a016

…ASP#925) GHSA pydantic ReDoS affects >=2.0.0,<2.4.0; first patched in 2.4.0.

PRAteek-singHWY force-pushed the gsocmodule_C_week_3 branch from 3c447bf to 695a016 Compare June 26, 2026 19:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

week_3: Module C (The Librarian) — C.1 candidate retriever (in-memory + pgvector) + pipeline switch#937

week_3: Module C (The Librarian) — C.1 candidate retriever (in-memory + pgvector) + pipeline switch#937
PRAteek-singHWY wants to merge 7 commits into
OWASP:mainfrom
PRAteek-singHWY:gsocmodule_C_week_3

PRAteek-singHWY commented Jun 18, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 18, 2026 •

edited

Loading

Review limit reached

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

PRAteek-singHWY commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

week_3: Module C (The Librarian) — C.1 semantic candidate retriever + pipeline switch

Overview

What changed

How the pieces connect

Results

What is intentionally not here

How to verify locally

Uh oh!

coderabbitai Bot commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

PRAteek-singHWY commented Jun 18, 2026 •

edited

Loading

coderabbitai Bot commented Jun 18, 2026 •

edited

Loading