Connectors Overview

Strake is a high-performance, federation-first query engine. By translating query plans and pushing predicate filters directly to where your data lives, Strake allows you to query diverse databases, files, object stores, and APIs as a single logical data warehouse.

1. Modular Connector Directory

Strake supports native, asynchronous federation across relational databases, files and object storage, lakehouses, APIs, and microservices.

Select a connector below for its configuration parameters, authentication schemes, connection-string syntax, and driver setup:

Relational Databases (`type: sql`)

PostgreSQL: Active wire integration with deep predicate/limit pushdown.
MySQL & MariaDB: Optimized SQL dialect translation and connection pooling.
SQLite: Ultra-lightweight local caching and zero-config testing.
DuckDB: Persistent local file or in-memory analytics.
ClickHouse: OLAP-optimized translations for fast table queries.
Oracle: Thin connection syntax and detailed Oracle Instant Client driver requirements.

Storage, API & Microservices

Files & Cloud Object Storage (`type: file`): OpenDAL backend supporting Parquet, CSV, JSON, Avro, and Excel across S3, GCS, and Azure buckets.
Apache Iceberg (`type: iceberg_rest`): Experimental REST catalog queries and time-travel snapshots.
Arrow Flight SQL (`type: flight_sql`): Low-overhead vectorized warehouse querying (Snowflake, Dremio).
REST APIs & SaaS (`type: rest`): Declarative endpoint mapping, auth headers, pagination, and URL pushdowns.
gRPC Services (`type: grpc`): Microservice calls using Protobuf reflection or FileDescriptorSet binaries.
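
The `type` values listed above lend themselves to a quick sanity check before a config is handed to Strake. The following is a minimal sketch, not part of Strake itself: `check_sources` is a hypothetical helper, and the rule that every `sql` source carries a `dialect` is inferred from the example config on this page rather than from a documented schema.

```python
# Connector types as listed in the directory above.
KNOWN_TYPES = {"sql", "file", "iceberg_rest", "flight_sql", "rest", "grpc"}


def check_sources(sources):
    """Return a list of human-readable problems found in a list of source dicts.

    Hypothetical helper: it only checks names, types, and (for sql sources)
    the presence of a dialect. It does not open any connection.
    """
    problems = []
    seen_names = set()
    for index, source in enumerate(sources):
        name = source.get("name")
        label = name or f"<source #{index}>"
        if not name:
            problems.append(f"{label}: missing 'name'")
        elif name in seen_names:
            problems.append(f"{label}: duplicate source name")
        else:
            seen_names.add(name)

        source_type = source.get("type")
        if source_type not in KNOWN_TYPES:
            problems.append(f"{label}: unknown type {source_type!r}")
        elif source_type == "sql" and not source.get("config", {}).get("dialect"):
            # Assumption: sql sources always declare a dialect, as in the examples.
            problems.append(f"{label}: sql source without a 'dialect'")
    return problems


example = [
    {"name": "internal_pg", "type": "sql", "config": {"dialect": "postgres"}},
    {"name": "internal_pg", "type": "sqll", "config": {}},
]
for problem in check_sources(example):
    print(problem)
```

A check like this catches typos in `type` and duplicated source names early, before any connection is attempted.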

2. Unified `sources.yaml`

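The sources.yaml below registers one example of each connector category (three SQL dialects, file storage, Iceberg, Flight SQL, REST, and gRPC) inside a single Strake project directory. MySQL/MariaDB, DuckDB, and ClickHouse sources are not shown here; see their connector pages.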

# sources.yaml
sources:
  # ----------------------------------------------------
  # 1. PostgreSQL (Relational Database)
  # ----------------------------------------------------
  - name: internal_pg
    type: sql
    config:
      dialect: postgres
      connection: "postgres://db_user:secure_password@localhost:5432/production_db?sslmode=prefer"
      pool_size: 15
      tables:
        - name: users
          schema: public

  # ----------------------------------------------------
  # 2. Oracle Database (Requires Instant Client Driver)
  # ----------------------------------------------------
  - name: legacy_oracle
    type: sql
    config:
      dialect: oracle
      # Strict URL format containing the oracle:// scheme
      connection: "oracle://system:OraclePassword123@oracle-free:1521/FREEPDB1"
      pool_size: 10
      tables:
        - name: orders
          schema: sales

  # ----------------------------------------------------
  # 3. SQLite (Local Database)
  # ----------------------------------------------------
  - name: cache_sqlite
    type: sql
    config:
      dialect: sqlite
      connection: "sqlite:///workspaces/rust-postgres/data/app_cache.db"

  # ----------------------------------------------------
  # 4. Amazon S3 Parquet (Files & Cloud Storage)
  # ----------------------------------------------------
  - name: telemetry_s3
    type: file
    source_type: parquet
    predicate_cache: true
    config:
      path: "s3://my-company-analytics-bucket/logs/"
      options:
        # Placeholder keys from the AWS documentation; replace with your own credentials
        aws_access_key_id: "AKIAIOSFODNN7EXAMPLE"
        aws_secret_access_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
        region: "us-west-2"
      tables:
        - name: clickstream
          schema: public
          path: "s3://my-company-analytics-bucket/logs/clickstream.parquet"

  # ----------------------------------------------------
  # 5. Apache Iceberg (Experimental)
  # ----------------------------------------------------
  - name: analytics_iceberg
    type: iceberg_rest
    config:
      catalog_uri: "http://polaris:8181/api/catalog"
      warehouse: "s3://my-iceberg-lakehouse/"
      region: "us-east-1"
      oauth_client_id: "strake-client"
      oauth_client_secret: "super_secret_oauth_token"
      oauth_token_url: "http://polaris:8181/api/catalog/v1/oauth/tokens"

  # ----------------------------------------------------
  # 6. Arrow Flight SQL (Warehouse Federation)
  # ----------------------------------------------------
  - name: warehouse_flight
    type: flight_sql
    config:
      url: "grpc://localhost:32010"

  # ----------------------------------------------------
  # 7. Generic REST API (SaaS Integration)
  # ----------------------------------------------------
  - name: stripe_api
    type: rest
    config:
      base_url: "https://api.stripe.com/v1/charges"
      method: "GET"
      headers:
        Accept: "application/json"
      auth:
        type: bearer
        token: "sk_test_51Nz..."
      pagination:
        type: token
        token_path: "next_page_token"
        param_name: "starting_after"
      pushdown:
        - column: "customer_id"
          operator: "="
          param: "customer"

  # ----------------------------------------------------
  # 8. gRPC Service (Microservice Federation)
  # ----------------------------------------------------
  - name: user_service_grpc
    type: grpc
    config:
      url: "http://user-service.internal:50051"
      service: "my.company.UserService"
      method: "GetActiveUsers"
      descriptor_set: "/workspaces/rust-postgres/data/user_service_desc.bin"
      request_body: '{"status": "ACTIVE"}'
      columns:
        - name: "user_id"
          type: "Int64"
        - name: "email"
          type: "Utf8"
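
The example above embeds passwords and tokens directly in the file. Whether Strake offers its own secret interpolation is not covered on this page, so the sketch below takes a tool-agnostic route: it keeps a template with `${VAR}` placeholders and renders the final YAML from environment variables using Python's standard `string.Template`. The `render_sources` helper, the template text, and the `PG_PASSWORD` variable name are all illustrative assumptions.

```python
import os
from string import Template

# Hypothetical template: one source from the YAML above, with the password
# replaced by a ${VAR} placeholder that string.Template understands.
TEMPLATE = """\
sources:
  - name: internal_pg
    type: sql
    config:
      dialect: postgres
      connection: "postgres://db_user:${PG_PASSWORD}@localhost:5432/production_db?sslmode=prefer"
"""


def render_sources(template_text, env=None):
    """Fill ${VAR} placeholders from a mapping (the process environment by default).

    Template.substitute raises KeyError when a referenced variable is missing,
    so a half-rendered config is never produced.
    """
    values = os.environ if env is None else env
    return Template(template_text).substitute(values)


if __name__ == "__main__":
    # An explicit mapping is passed here so the demo does not depend on the shell.
    rendered = render_sources(TEMPLATE, env={"PG_PASSWORD": "example-only"})
    print(rendered)
```

Two caveats apply to this approach: a literal `$` in the template must be written as `$$`, and a password containing characters such as `@` or `/` needs URL-encoding before it is placed in a connection URL.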

3. Python Script (`main.py`)

Save the following script as main.py next to your sources.yaml. It initializes the Strake engine in embedded mode, prints the federated catalogs, and runs a join across the PostgreSQL, Oracle, and S3 sources:

#!/usr/bin/env python3
import os
import sys
import strake
import pandas as pd  # noqa: F401 (not referenced directly; to_pandas() needs pandas installed)

def main():
    config_path = "sources.yaml"

    if not os.path.exists(config_path):
        print(f"Error: Configuration file '{config_path}' not found.", file=sys.stderr)
        print("Please save the sources.yaml configuration in this directory.", file=sys.stderr)
        sys.exit(1)

    print("====================================================")
    print("🚀 Initializing Strake Embedded Federation Engine...")
    print("====================================================")

    # 1. Connect (Embedded Library Mode)
    try:
        conn = strake.StrakeConnection(config_path)
        print("✓ Successfully initialized embedded Strake connection.")
    except Exception as e:
        print(f"✗ Failed to connect: {e}", file=sys.stderr)
        sys.exit(1)

    # 2. Introspect Registered Tables
    print("\n----------------------------------------------------")
    print("🔍 Introspecting Available Federated Tables:")
    print("----------------------------------------------------")
    try:
        # describe() returns a text grid of all catalogs, which is printed here
        schema_description = conn.describe()
        print(schema_description)
    except Exception as e:
        print(f"Warning: Could not fetch automatic description: {e}", file=sys.stderr)

    # 3. Perform a Cross-Source Federated Join
    print("\n----------------------------------------------------")
    print("⚡ Running Cross-Source Federated JOIN...")
    print("----------------------------------------------------")

    # Joining:
    # - Postgres users (internal_pg.public.users)
    # - Oracle orders (legacy_oracle.sales.orders)
    # - S3 Clickstream parquet files (telemetry_s3.public.clickstream)
    query = """
        SELECT 
            u.user_id,
            u.email,
            o.order_id,
            o.amount,
            c.page_path,
            c.session_duration
        FROM strake.internal_pg.public.users u
        INNER JOIN strake.legacy_oracle.sales.orders o 
            ON u.user_id = o.user_id
        LEFT JOIN strake.telemetry_s3.public.clickstream c 
            ON u.email = c.user_email
        WHERE o.amount > 100.0
        ORDER BY o.amount DESC
        LIMIT 10
    """

    print(f"Executing Query:\n{query}\n")

    try:
        # conn.sql returns a Strake Table
        result_table = conn.sql(query)

        # 4. Convert Arrow Results directly to a Pandas DataFrame
        df = result_table.to_pandas()

        print("====================================================")
        if df.empty:
            print("Query completed successfully, but returned 0 rows.")
        else:
            print(f"✓ Query returned {len(df)} rows:")
            print(df.to_string(index=False))
        print("====================================================")

    except Exception as e:
        print(f"✗ Query execution failed: {e}", file=sys.stderr)
        sys.exit(1)

if __name__ == "__main__":
    main()
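
The query in the script addresses each table as `strake.<source>.<schema>.<table>`. As a rough illustration of how those names relate to sources.yaml, the sketch below derives them from an already-parsed config. `qualified_table_names` is a hypothetical helper, and the leading `strake` catalog is taken from the query above on the assumption that it applies to every source; sources that declare no `tables` list are skipped.

```python
def qualified_table_names(config, catalog="strake"):
    """List catalog.source.schema.table names for sources that declare tables.

    `config` is the parsed sources.yaml as a plain dict, for example the
    result of yaml.safe_load() from PyYAML.
    """
    names = []
    for source in config.get("sources", []):
        for table in source.get("config", {}).get("tables", []):
            # Skip entries that lack either part of the name instead of guessing.
            if "schema" not in table or "name" not in table:
                continue
            names.append(
                f"{catalog}.{source['name']}.{table['schema']}.{table['name']}"
            )
    return names


# A trimmed-down copy of the config from section 2, written as a dict.
example_config = {
    "sources": [
        {
            "name": "internal_pg",
            "type": "sql",
            "config": {"tables": [{"name": "users", "schema": "public"}]},
        },
        {
            "name": "legacy_oracle",
            "type": "sql",
            "config": {"tables": [{"name": "orders", "schema": "sales"}]},
        },
        {"name": "cache_sqlite", "type": "sql", "config": {}},
    ]
}

for name in qualified_table_names(example_config):
    print(name)
```

Comparing this list with the output of `conn.describe()` is one way to spot a table that was registered under a different schema than the query expects.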
