airflow.providers.common.ai.utils.sql_validation¶

SQL safety validation for LLM-generated queries.

Uses an allowlist approach: only explicitly permitted statement types pass. This is safer than a denylist because new/unexpected statement types (INSERT, UPDATE, MERGE, TRUNCATE, COPY, etc.) are blocked by default.

Attributes¶

`DEFAULT_ALLOWED_TYPES`
`READ_ONLY_METADATA_TYPES`

Exceptions¶

SQLSafetyError

Generated SQL failed safety validation.

Classes¶

TableScan

Result of collect_table_references().

Functions¶

`resolve_sqlglot_dialect`(dialect_name)	Normalize a SQLAlchemy dialect name to a sqlglot dialect.
`parse_sql`(sql, *[, dialect, allow_multiple_statements])	Parse SQL into statements, enforcing the empty- and multi-statement guards only.
`collect_table_references`(statements)	Walk parsed statements and report every real table they reach, scope-correctly.
`validate_sql`(sql, *[, allowed_types, dialect, ...])	Parse SQL and verify all statements are in the allowed types list.

Module Contents¶

airflow.providers.common.ai.utils.sql_validation.resolve_sqlglot_dialect(dialect_name)[source]¶

Normalize a SQLAlchemy dialect name to a sqlglot dialect.

Returns None (dialect-agnostic parsing) for empty, non-string, or unknown inputs, so a bad dialect value never breaks SQL validation.

Parameters:: dialect_name (str | None) – A SQLAlchemy dialect_name (e.g. "postgresql").
Returns:: The matching sqlglot dialect (e.g. "postgres"), or None.
Return type:: str | None

airflow.providers.common.ai.utils.sql_validation.DEFAULT_ALLOWED_TYPES: tuple[type[sqlglot.exp.Expr], Ellipsis][source]¶

airflow.providers.common.ai.utils.sql_validation.READ_ONLY_METADATA_TYPES: tuple[type[sqlglot.exp.Expr], Ellipsis][source]¶

exception airflow.providers.common.ai.utils.sql_validation.SQLSafetyError[source]¶

Bases: Exception

Generated SQL failed safety validation.

airflow.providers.common.ai.utils.sql_validation.parse_sql(sql, *, dialect=None, allow_multiple_statements=False)[source]¶

Parse SQL into statements, enforcing the empty- and multi-statement guards only.

Shared by validate_sql() (which then applies statement-type checks) and by callers that need the parsed AST for their own analysis – e.g. table-reference extraction for allowed_tables enforcement – without the read-only allow-list.

Parameters:

sql (str) – SQL string to parse.
dialect (str | None) – SQL dialect for parsing (postgres, mysql, etc.).
allow_multiple_statements (bool) – Whether to allow multiple semicolon-separated statements. Default False – multi-statement input can hide a dangerous operation after a benign one.

Returns:

List of parsed sqlglot Expression objects (never empty).

Raises:

SQLSafetyError – If the SQL is empty, cannot be parsed, or contains multiple statements when not permitted.

Return type:

list[sqlglot.exp.Expr]

class airflow.providers.common.ai.utils.sql_validation.TableScan[source]¶

Bases: NamedTuple

Result of collect_table_references().

tables: list[tuple[str, str, str]][source]¶

unverifiable_sources: list[str][source]¶

airflow.providers.common.ai.utils.sql_validation.collect_table_references(statements)[source]¶

Walk parsed statements and report every real table they reach, scope-correctly.

This is the AST half of allowed_tables enforcement: it returns the concrete base tables a query reaches (including those nested in subqueries, CTEs, JOINs, set operations, DESCRIBE, and DML) as (catalog, schema, table) so the caller can check each against its allow-list, plus a list of constructs that cannot be checked and must therefore be rejected while an allow-list is active.

Handled carefully (each was a confirmed bypass before it was closed):

CTE references are excluded by lexical scope, not by name. A table is treated as a CTE only when a WITH enclosing that reference defines the name (see _is_in_scope_cte()); a same-named CTE in a sibling/inner query no longer hides a real top-level table. A DML target is always a real table (you cannot write to a CTE, so a same-named CTE does not shadow it), but DML sources follow normal CTE scoping – a CTE used as an INSERT/UPDATE source is not flagged.
Catalog-qualified references are reported with their catalog, so the caller rejects otherdb.public.orders instead of matching it to public.orders.
Unverifiable constructs are listed, not silently dropped: nameless table-valued functions (dblink), TABLE('name') row sources (exp.TableFromRows), SHOW, dynamic SQL (EXEC/Command), the TABLE <name> shorthand (which sqlglot parses incorrectly, leaking the TABLE keyword as a column), a quoted identifier (case-sensitive on the engine but matched case-insensitively here, so "Orders" could otherwise reach a table distinct from the allow-listed orders), and any inline comment – comments are where parser-vs-engine differentials hide (MySQL executable /*! ... */, -- not followed by whitespace, #).

Parameters:: statements (list[sqlglot.exp.Expr]) – Parsed sqlglot statements (from parse_sql()).
Returns:: A TableScan of real table references and unverifiable constructs.
Return type:: TableScan

airflow.providers.common.ai.utils.sql_validation.validate_sql(sql, *, allowed_types=None, dialect=None, allow_multiple_statements=False, allow_read_only_metadata=False)[source]¶

Parse SQL and verify all statements are in the allowed types list.

By default, only a single SELECT-family statement is allowed. Multi-statement SQL (separated by semicolons) is rejected unless allow_multiple_statements=True, because multi-statement inputs can hide dangerous operations after a benign SELECT.

Returns parsed statements on success, raises SQLSafetyError on violation.

Parameters:

sql (str) – SQL string to validate.
allowed_types (tuple[type[sqlglot.exp.Expr], Ellipsis] | None) – Tuple of sqlglot expression types to permit. Defaults to (Select, Union, Intersect, Except). When supplied, the caller takes full control of the allow-list and allow_read_only_metadata is ignored.
dialect (str | None) – SQL dialect for parsing (postgres, mysql, etc.).
allow_multiple_statements (bool) – Whether to allow multiple semicolon-separated statements. Default False.
allow_read_only_metadata (bool) – Also permit read-only metadata statements (DESCRIBE/SHOW) on top of the default read-only allow-list. Ignored when allowed_types is supplied. Note SHOW only parses to a metadata statement when a dialect that supports it is given. Default False.

Returns:

List of parsed sqlglot Expression objects.

Raises:

SQLSafetyError – If the SQL is empty, contains disallowed statement types, or has multiple statements when not permitted.

Return type:

list[sqlglot.exp.Expr]