Skip to content

Mssql#110

Open
myyong wants to merge 34 commits into
mainfrom
mssql
Open

Mssql#110
myyong wants to merge 34 commits into
mainfrom
mssql

Conversation

@myyong
Copy link
Copy Markdown
Collaborator

@myyong myyong commented May 26, 2026

Description for the MSSQL work in issue 109

myyong and others added 30 commits April 10, 2026 15:35
…routing

Addresses issues #93, #94, and #95.

Issue #93 – Driver dependency
- Move psycopg2-binary to an optional 'postgres' extra
- Add optional pyodbc and aioodbc as a new 'mssql' extra
- Remove unconditional top-level `import psycopg2`; replace the
  `psycopg2.errors.UndefinedObject` check with `_is_undefined_object_error`,
  which lazily imports psycopg2 when available and falls back to checking
  the SQLSTATE pgcode (42704) otherwise

Issue #94 – Hardcoded async DSN rewriting
- Add `make_async_dsn()` helper that parses the DSN dialect with
  SQLAlchemy's `make_url` and rewrites the driver via `_ASYNC_DRIVER_MAP`
  (postgresql → asyncpg, mssql → aioodbc); raises ValueError for unknown
  dialects instead of silently producing a broken DSN
- Replace the hardcoded `db_dsn.replace("postgresql://", ...)` in
  `create_db_engine` with a call to `make_async_dsn`

Issue #95 – PostgreSQL search_path schema selection
- Replace `SET search_path TO <schema>` (PostgreSQL-only) with
  `engine.execution_options(schema_translate_map={None: schema_name})`,
  which is dialect-neutral and works for PostgreSQL, MS-SQL, and DuckDB
- The `set_db_settings` connect-event hook is now only wired up for
  DuckDB's `file_search_path`; update its docstring accordingly
- Add `schema_name` parameter to `get_metadata` and pass it to
  `MetaData.reflect(engine, schema=schema_name)` so reflection targets
  the right schema without relying on search_path
- Update `make_tables_file` and `remove_db_tables` to forward schema_name
  to `get_metadata`

Also adds:
- examples/omop-mssql/ with README documenting all eight MS-SQL challenges
  (challenges 1–3 link to GitHub issues #93#95)
- tests/test_utils_mssql.py with 19 unit tests covering all new helpers
  and the schema_translate_map behaviour

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add sqlalchemy.dialects.mssql import to serialize_metadata.py
- Map PostgreSQL-only type strings to cross-dialect equivalents:
    TSVECTOR  → sqltypes.Text        (no MS-SQL equivalent)
    BYTEA     → sqltypes.LargeBinary (renders as VARBINARY(MAX) on MS-SQL)
    CIDR      → sqltypes.String(43)  (stored as plain string)
    REAL      → sqltypes.REAL        (generic, replaces postgresql.REAL)
- Add _mssql_varbinary_parser to handle VARBINARY, VARBINARY(n),
  VARBINARY(max), and VARBINARY(MAX)
- Register parsers for MS-SQL-native types: UNIQUEIDENTIFIER, DATETIMEOFFSET(n),
  DATETIME2(n), BINARY(n), MONEY, SMALLMONEY, IMAGE, TINYINT, SMALLDATETIME,
  NTEXT, SQL_VARIANT, ROWVERSION
- Place DATETIME2/DATETIMEOFFSET before DATETIME in parsy.alt() to avoid
  prefix-collision: ordered-choice parsers must try longer names first
- Rename time_type parameter pg_type → tz_type (no behavioural change)
- Add mssql.UNIQUEIDENTIFIER → UUID generator mapping in make.py
- Add tests/test_serialize_metadata_mssql.py: 50 tests covering all new
  MS-SQL types, PostgreSQL degradation mappings, and regression coverage
  for all pre-existing type strings
- Update examples/omop-mssql/README.md challenge 4 with issue link #96

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add SERIAL, BIGSERIAL and SMALLSERIAL to the type parser so orm.yaml
files using PostgreSQL pseudo-types can be loaded when targeting MS-SQL.
Add a @compiles(CreateColumn, "mssql") hook that strips the IDENTITY
modifier from DDL — MS-SQL rejects explicit INSERTs into IDENTITY
columns without SET IDENTITY_INSERT ON, which datafaker never issues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three new tests for _is_undefined_object_error:
- pyodbc-style error (SQLSTATE in args[0], no pgcode) → False, documenting
  that the current pgcode-based fallback does not cover MS-SQL/pyodbc errors
- pgcode=None → False
- SQLAlchemy ProgrammingError wrapper → only e.orig matches, not the wrapper
  itself, confirming the call-site convention in remove_vocab_foreign_key_constraints

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
set_db_settings is DuckDB-only; fixing the autocommit toggle in
isolation would leave the SET syntax still broken for MS-SQL. Both
should be addressed together if the function is ever extended.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add Setup section to README covering poetry install --extras mssql,
  homebrew ODBC driver installation, odbcinst.ini registration, and
  .env configuration with the correct driver 18 DSN format
- Add .env.example with placeholder DSN for safe version control
  (.env itself is gitignored)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All commands now use relative paths (./orm.yaml etc.) and a cd note
makes clear they must be run from examples/omop-mssql/ so the .env
file is picked up correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ODBC Driver 18 defaults to Encrypt=yes with certificate validation.
SQL Server's self-signed certificate fails that check, causing a login
timeout. TrustServerCertificate=yes bypasses validation for local/dev
containers where a proper cert is not in place.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…loses #101)

Strip leading schema qualifier from FK targets before constructing ForeignKey
objects so that 3-part orm.yaml references such as mimic100.concept.concept_id
resolve correctly against a MetaData whose tables carry no schema prefix.

Add @compiles(CreateTable, "mssql") hook to remove ON DELETE CASCADE from
MS-SQL DDL, preventing error 1785 when multiple FK columns on one table all
point at the same vocabulary table (common in OMOP-style schemas).

Both fixes have tests. Also adds pytest to dev dependencies and the
omop-mssql example config.yaml.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove the remove_mssql_identity DDL hook that stripped IDENTITY from
CREATE TABLE, and stop supplying explicit PK values in generated df.py
files. SQLAlchemy now omits single-column integer PKs from INSERT
statements and SQL Server generates them via IDENTITY(1,1).

Also fix column_value() to use NEWID() instead of random() on MS-SQL,
which has no random() function.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Consolidate the OMOP example under examples/omop-mssql. The mimic_omop
orm.yaml becomes orm_full.yaml; a trimmed orm.yaml and a
copy_vocabulary.sql helper script are added alongside it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace PostgreSQL-only raw SQL in configure-generators with SQLAlchemy
expression API so the same code works on all supported dialects:

- generators/mimesis.py: use sqlalchemy.extract() + cast() instead of
  EXTRACT(YEAR FROM ...) string; compiled SQL strings stored on
  MimesisDateTimeGenerator now use DATEPART(year,...) on MS-SQL and
  EXTRACT(YEAR FROM ...) on PostgreSQL/DuckDB.
- generators/base.py: use func.stdev on MS-SQL and func.stddev elsewhere
  instead of the hard-coded STDDEV() string that MS-SQL does not recognise.
- make.py: remove dead _YEAR_SUMMARY_QUERY constant and GeneratorInfo.summary_query
  field (stored but never read anywhere in the codebase).
- tests/test_generators_dialect.py: new tests asserting that the compiled
  SQL contains the correct function name per dialect.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…loses #106)

Replace raw RANDOM()/LIMIT SQL in choice.py with SQLAlchemy expression API:

- New _choice_stmt() helper builds SELECT expressions using func.random()/
  func.newid() and .limit()/.subquery(); SQLAlchemy compiles these to
  RANDOM()/LIMIT on PostgreSQL/DuckDB and NEWID()/TOP on MS-SQL automatically.
- ChoiceGenerator.__init__() gains an optional dialect= parameter; the stored
  _query string (written to src-stats.yaml for later make-stats execution) is
  now compiled against the source engine's dialect at configure-generators time.
- ChoiceGeneratorFactory.get_generators() builds both live queries as SQLAlchemy
  select() expressions and passes engine.dialect to all ChoiceGenerator
  constructors.
- The suppress-only subquery no longer contains ORDER BY without TOP, which
  MS-SQL rejects in derived table expressions.
- tests/test_generators_dialect.py: 7 new tests covering stored query SQL for
  all four sample/suppress combinations and both dialect-correct live queries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…orFactory

ChoiceGeneratorFactory.get_generators() was using table(table_name) to build
the FROM clause, which creates a lightweight TableClause with no schema
information. On MS-SQL databases with a non-default schema (e.g. dbo.person
vs myschema.person) this produced 'Invalid object name' errors.

Fix: use column.table directly as the select_from() target. This is the
actual SQLAlchemy Table object reflected from the source DB, which carries
schema metadata and compiles to a fully-qualified name (e.g.
[myschema].[person] on MS-SQL, myschema.person on PostgreSQL).

Add test_schema_qualified_table_appears_in_from to verify the schema name
appears in the compiled SQL for both MS-SQL and PostgreSQL dialects.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Buckets.make_buckets() and Buckets.__init__() used table(table_name)
(a schema-less TableClause) and raw text() SQL, so queries against
schema-qualified databases produced unqualified table names and were
rejected with "Invalid object name" on MS-SQL.

Both methods now accept an optional src_table parameter (a SQLAlchemy
Table object with schema metadata). When provided, select_from() uses
it directly, preserving the schema qualifier in the FROM clause.
Callers in continuous.py and mimesis.py pass column.table.

Buckets.__init__() is also converted from raw text() to a SQLAlchemy
expression using func.floor() and group_by(floor_expr), which avoids
the GROUP BY alias pattern that MS-SQL rejects.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Six locations in generators and the interactive shell still emitted
RANDOM() and LIMIT which MS-SQL does not support.

Group A — interactive shell (base.py, table.py, generators.py):
  Replace sqlalchemy.text() f-strings with SQLAlchemy expression API.
  func.newid() vs func.random() chosen per dialect.name == "mssql";
  .limit(n) compiles to TOP n / LIMIT n automatically.

Group B — CovariateQuery (continuous.py):
  Add dialect_name parameter to __init__(); _inner_query() now emits
  SELECT TOP n … ORDER BY NEWID() on MS-SQL and SELECT * … ORDER BY
  RANDOM() LIMIT n elsewhere. dialect_name passed from get_generators().

Group C — MissingnessType (missingness.py):
  sampled_query() accepts dialect_name; returns the MS-SQL TOP/NEWID
  form when dialect_name == "mssql". do_sampled() passes
  self.sync_engine.dialect.name.

Tests added in test_generators_dialect.py (CovariateQuery, Missingness)
and new test_interactive_dialect.py (peek, _get_column_data,
print_column_data) covering both MS-SQL and PostgreSQL.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace sa_table(self.table_name()) — which creates a schema-less TableClause —
with self.table_metadata() in do_peek(), print_column_data(), print_row_data(),
and _get_column_data(). This fixes "Invalid object name 'person'" (42S02) on
MS-SQL when tables live in a non-default schema (e.g. dbo.person).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
myyong and others added 4 commits May 22, 2026 13:13
schema_translate_map (set by create_db_engine) only rewrites SQLAlchemy Core
constructs — not text() raw SQL. Three sites still built text() with bare table
names, causing 42S02 'Invalid object name' on MS-SQL schemas:

- do_counts (base.py): converted to select_from(self.table_metadata()) so the
  translate map applies, same pattern as do_peek / _get_column_data.
- _get_generators_from_buckets (continuous.py): now receives the Table object
  (not just a name string) and uses select_from(src_table); also replaces raw
  LN/STDDEV SQL with func.log/func.stddev_samp for dialect correctness.
- MultivariateNormalGeneratorFactory.get_generators (continuous.py): qualifies
  the table name via schema_qualified_name() before passing to CovariateQuery
  so the stored SQL string contains the correct schema prefix at create-data
  time; the unqualified name is kept for SRC_STATS config keys.

Adds schema_qualified_name() utility to utils.py that reads the engine's
schema_translate_map execution option.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-generators

_get_aggregate_query builds SQL stored in config.yaml's src-stats section and
executed by make-stats via text() — not affected by schema_translate_map. Two
call sites now pass schema_qualified_name(entry.name, engine) so the stored SQL
contains the schema prefix (e.g. mimic100.person).

SELECT_AGGREGATE_RE regex updated to match schema.table forms, and the auto__
name check now strips the schema prefix before comparing with the config key so
PredefinedGenerator still recognises its own clauses when re-reading a config
that was written with schema-qualified table names.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…oses #108)

ChoiceGenerator compiled self._query at construction time using table(table_name),
which treated a schema-qualified string like "mimic100.person" as a single identifier
and produced [mimic100.person] on MS-SQL. Using column.table dropped the schema because
dict_to_metadata creates Table objects without schema set.

Fix: pass the schema-qualified name (from schema_qualified_name()) as table_sql and use
text(table_sql) in _choice_stmt, which compiles verbatim without quoting. The SRC_STATS
config key and self.table_name remain unqualified; only the stored SQL string is qualified.

Also refactors _get_aggregate_query to qualify the table name internally rather than
requiring callers to pre-qualify it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant