AI-Powered Metadata Enrichment

Strake provides a built-in capability to automatically enrich your data source schema metadata with natural-language descriptions generated by large language models (LLMs).

When querying data through AI agents, having high-quality, semantic metadata (descriptions of what columns and tables represent) is crucial. Strake enables you to automatically generate these descriptions when discovering and adding tables to your registry.


How It Works

AI Metadata Enrichment operates in a three-stage lifecycle during data source discovery:

graph TD
    A[1. Introspection] -->|Fetch Raw Schema| B[2. AI Prompting]
    B -->|Generate Descriptions| C[3. Merging]
    C -->|Save to sources.yaml| D[(Metadata Registry)]

1. Introspection

Strake first queries the upstream database or API to inspect the technical schema. It retrieves physical metadata including:

* Table name
* Column names
* Data types (e.g., VARCHAR, INT, TIMESTAMP)
* Key constraints (e.g., Primary Keys, Foreign Keys)

2. Prompting

Strake constructs a semantically rich prompt containing the physical schema, constraints, and context. It sends this to your configured AI provider (e.g., Gemini or OpenAI), instructing it to generate concise, human-readable natural language descriptions of what each table and column represents.
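The prompting stage can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not Strake's actual implementation: the `Column` and `Table` classes, the `build_prompt` function, and the prompt wording are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    """Physical metadata for one column (hypothetical structure)."""
    name: str
    data_type: str
    constraints: list[str] = field(default_factory=list)


@dataclass
class Table:
    """Physical metadata for one table (hypothetical structure)."""
    name: str
    columns: list[Column]


def build_prompt(table: Table) -> str:
    """Render a table's physical schema into a description-request prompt."""
    lines = [
        "Describe the table and each column below in one concise sentence each.",
        f"Table: {table.name}",
        "Columns:",
    ]
    for col in table.columns:
        # Append key constraints in brackets only when the column has any.
        suffix = f" [{', '.join(col.constraints)}]" if col.constraints else ""
        lines.append(f"- {col.name} ({col.data_type}){suffix}")
    return "\n".join(lines)


users = Table(
    name="public.users",
    columns=[
        Column("id", "INT", ["PRIMARY KEY"]),
        Column("email", "VARCHAR"),
        Column("created_at", "TIMESTAMP"),
    ],
)
prompt = build_prompt(users)
print(prompt)
```

The resulting string is what would be sent to the configured provider. Including data types and key constraints gives the model more signal than column names alone.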

3. Merging

The generated descriptions are returned and merged back into your local sources.yaml file. By default, Strake acts defensively to protect your manual annotations:

* Merge Mode (Default): Existing manual descriptions are fully preserved. The AI only fills in blank/missing descriptions.
* Overwrite Mode: The AI will regenerate and replace all descriptions, which is useful when a schema changes significantly.
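The difference between the two modes can be sketched in a few lines of Python. The `merge_descriptions` function and the flat name-to-description dictionaries are assumptions made for illustration; Strake's internal representation may differ.

```python
def merge_descriptions(
    existing: dict[str, str],
    generated: dict[str, str],
    overwrite: bool = False,
) -> dict[str, str]:
    """Combine existing and AI-generated descriptions (hypothetical helper).

    Merge mode (overwrite=False): keep non-blank existing text, fill the gaps.
    Overwrite mode (overwrite=True): generated text replaces existing text.
    """
    merged = dict(existing)  # copy so the caller's data is not mutated
    for name, text in generated.items():
        current = merged.get(name, "")
        if overwrite or not current.strip():
            merged[name] = text
    return merged


existing = {"id": "Internal user ID (do not expose).", "email": ""}
generated = {
    "id": "Unique identifier for the user.",
    "email": "The user's email address.",
    "created_at": "When the account was created.",
}

merged_default = merge_descriptions(existing, generated)
merged_overwrite = merge_descriptions(existing, generated, overwrite=True)
print(merged_default["id"])    # manual text kept in merge mode
print(merged_overwrite["id"])  # manual text replaced in overwrite mode
```

In this sketch a description counts as missing when it is absent or contains only whitespace, which is one reasonable reading of "blank/missing".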


Configuration

You can configure the AI provider and model using either your strake.yaml file or environment variables.

1. File Configuration (strake.yaml)

Add the ai: configuration block to your primary strake.yaml:

# AI description generation configuration
ai:
  # The AI provider to use ('gemini' or 'openai')
  provider: gemini

  # The specific LLM model to target
  model: gemini-3.5-flash

  # Sampling temperature for generating variety (0.0 to 1.0)
  temperature: 0.7

  # (Optional) Custom API gateway endpoint URL
  url: "https://generativelanguage.googleapis.com"

2. Environment Variable Configuration

Environment variables take precedence over values in strake.yaml and are ideal for credentials and CI/CD pipelines:

| Environment Variable | Description |
| --- | --- |
| STRAKE_AI_PROVIDER | AI provider for metadata enrichment (gemini, openai). |
| STRAKE_AI_MODEL | Overrides the AI model used for descriptions. |
| GOOGLE_API_KEY | API key required for the Gemini provider. |
| OPENAI_API_KEY | API key required for the OpenAI provider. |
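The precedence between file and environment settings can be sketched as follows. The `resolve_ai_settings` function, the `api_key_var` field, and the `example-model` value are hypothetical; only the variable names and the rule that environment variables win are taken from this page.

```python
import os

# Which credential variable each provider needs, per the variables listed above.
API_KEY_VARS = {"gemini": "GOOGLE_API_KEY", "openai": "OPENAI_API_KEY"}


def resolve_ai_settings(file_config: dict, env=None) -> dict:
    """Overlay environment overrides on the ai: block (hypothetical helper)."""
    env = os.environ if env is None else env
    settings = dict(file_config)  # start from the strake.yaml values
    if env.get("STRAKE_AI_PROVIDER"):
        settings["provider"] = env["STRAKE_AI_PROVIDER"]
    if env.get("STRAKE_AI_MODEL"):
        settings["model"] = env["STRAKE_AI_MODEL"]
    # Record which credential variable the chosen provider expects.
    settings["api_key_var"] = API_KEY_VARS.get(settings.get("provider"))
    return settings


file_config = {"provider": "gemini", "model": "gemini-3.5-flash", "temperature": 0.7}
# "example-model" is a placeholder, not a real model name.
ci_env = {"STRAKE_AI_PROVIDER": "openai", "STRAKE_AI_MODEL": "example-model"}

resolved = resolve_ai_settings(file_config, ci_env)
print(resolved)
```

Settings without an environment counterpart, such as `temperature`, pass through unchanged from the file in this sketch.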

CLI Usage

To enrich metadata, use the --ai-descriptions flag with the strake-cli add command.

Introspect & Enrich a Specific Table

To introspect a table from a source and generate AI descriptions:

strake-cli add pg_production public.users --ai-descriptions

Bulk Add with AI Descriptions

To add all tables in a specific schema with AI descriptions:

strake-cli add pg_production --pattern "public.*" --ai-descriptions

Re-generating/Overwriting Descriptions

To refresh descriptions for a table that has already been registered, add the --overwrite flag alongside --ai-descriptions. This replaces all existing descriptions, including any manual edits:

strake-cli add pg_production public.users --ai-descriptions --overwrite

Dry Run (Preview Changes)

To preview the generated descriptions and see how the sources.yaml file would be updated without saving the changes, add the --dry-run flag:

strake-cli add pg_production public.users --ai-descriptions --dry-run
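Conceptually, a dry run compares the current file with the proposed one and reports the difference without writing anything. The sketch below illustrates that idea with Python's standard `difflib`. The `preview_changes` function and the sources.yaml layout shown are assumptions for illustration, and the actual output format of strake-cli may differ.

```python
import difflib


def preview_changes(current_text: str, proposed_text: str, path: str = "sources.yaml") -> str:
    """Return a unified diff of the proposed update without writing any file."""
    diff = difflib.unified_diff(
        current_text.splitlines(keepends=True),
        proposed_text.splitlines(keepends=True),
        fromfile=f"{path} (current)",
        tofile=f"{path} (proposed)",
    )
    return "".join(diff)


# Hypothetical sources.yaml fragments; the real file layout is not shown here.
current = 'tables:\n  - name: public.users\n    description: ""\n'
proposed = 'tables:\n  - name: public.users\n    description: "Registered user accounts."\n'

preview = preview_changes(current, proposed)
print(preview)
```

Lines prefixed with `-` would be removed and lines prefixed with `+` would be added if the change were saved.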