Skip to content

Strake Documentation

Examples

strake-data/strake

Examples¶

Practical, real-world examples demonstrating how to write federated SQL queries and orchestrate diverse data sources.

1. Cross-Source Join: Postgres + S3 Parquet¶

Join hot transactional user records stored in a local PostgreSQL database with historical clickstream logs stored in AWS S3 as Parquet files.

Scenario¶

Your enterprise analytics team has: * A users table inside PostgreSQL containing active accounts and user profile details. * A clickstream dataset inside AWS S3 stored as monthly Parquet objects containing log activity.

Configuration (`sources.yaml`)¶

sources:
  - name: internal_pg
    type: sql
    config:
      dialect: postgres
      connection: "postgres://db_user:secure_password@localhost:5432/production_db?sslmode=prefer"
      tables:
        - name: users
          schema: public

  - name: telemetry_s3
    type: file
    source_type: parquet
    config:
      path: "s3://my-company-analytics-bucket/logs/"
      options:
        aws_access_key_id: "AKIAIOSFODNN7EXAMPLE"
        aws_secret_access_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
        region: "us-west-2"
      tables:
        - name: clickstream
          schema: public
          path: "s3://my-company-analytics-bucket/logs/clickstream.parquet"

Python Execution Script¶

Run a single federated SQL statement that joins both sources in-memory and extracts the active users with the most event triggers:

import strake

# 1. Connect using the local configuration
conn = strake.StrakeConnection("sources.yaml")

# 2. Run the federated cross-source join query
query = """
    SELECT 
        u.email, 
        count(c.event_id) as event_count
    FROM strake.internal_pg.public.users u
    INNER JOIN strake.telemetry_s3.public.clickstream c 
        ON u.email = c.user_email
    WHERE c.event_date > '2023-01-01'
    GROUP BY u.email
    ORDER BY event_count DESC
    LIMIT 10
"""

print("Executing cross-source query...")
result = conn.sql(query)

# 3. Convert results to a pandas DataFrame and display
df = result.to_pandas()
print(df.to_string(index=False))