Files & Cloud Storage Connector

Strake reads tabular and semi-structured files directly from local storage or cloud object stores through the Apache OpenDAL storage abstraction layer.


1. Supported File Formats

  • Apache Parquet (source_type: parquet): Vectorized reading with predicate caching and row-group pruning.
  • CSV (source_type: csv): Delimited files with tunable delimiters (delimiter) and header support (has_header).
  • JSON (source_type: json): NDJSON/semi-structured parsing.
  • Apache Avro (source_type: avro): Binary row-oriented files with an embedded schema.
  • Microsoft Excel (source_type: excel): XLSX parser (Enterprise Edition only).
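
As an illustration, a CSV source on local disk might be registered as sketched below. The layout mirrors the S3 example in section 3. The source name, the paths, and the placement of delimiter and has_header under options are assumptions, not confirmed Strake behavior, so verify them against your installation.

```yaml
sources:
  - name: orders_csv              # hypothetical source name
    type: file
    source_type: csv
    config:
      path: "/data/exports/"      # hypothetical local directory
      options:
        delimiter: ";"            # assumed to live under options
        has_header: true          # first row holds column names
      tables:
        - name: orders
          schema: public
          path: "/data/exports/orders.csv"
```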

2. Cloud Storage Schemes & Credentials

OpenDAL selects the storage backend from the URL scheme of the path argument and maps the credential options in the config block onto that backend:

AWS S3 (s3://)

Connects to Amazon S3 or S3-compatible endpoints (such as MinIO, Cloudflare R2, or Backblaze B2).
  • Required parameters: aws_access_key_id, aws_secret_access_key, region.
  • Optional parameter: endpoint (for S3-compatible services other than AWS).
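
For an S3-compatible service, the config block of a source might look like the sketch below. The bucket name, the placeholder credentials, and the MinIO endpoint URL are hypothetical; only the parameter names come from this page.

```yaml
config:
  path: "s3://analytics/logs/"                # hypothetical bucket and prefix
  options:
    aws_access_key_id: "<access-key>"         # placeholder, not a real key
    aws_secret_access_key: "<secret-key>"     # placeholder, not a real key
    region: "us-east-1"
    endpoint: "http://minio.internal:9000"    # hypothetical MinIO endpoint
```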

Google Cloud Storage (gs:// or gcs://)

Connects to Google Cloud Storage buckets.
  • Required parameters: google_application_credentials (path to the service account JSON file), region.
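
A config block for a GCS-backed source could be sketched as follows. The bucket name, the key file location, and the region value are hypothetical, and the gs://bucket/prefix path form is an assumption.

```yaml
config:
  path: "gs://my-analytics-bucket/events/"    # hypothetical bucket and prefix
  options:
    # Hypothetical location of the service account key file
    google_application_credentials: "/etc/strake/gcp-service-account.json"
    region: "us-central1"
```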

Azure Blob Storage (az:// or azblob://)

Connects to Azure Blob Storage containers.
  • Required parameters: account_name, account_key.
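
A config block for an Azure-backed source could be sketched as follows. The account name, the container name, and the az://container/prefix path form are assumptions made for illustration.

```yaml
config:
  path: "az://telemetry/logs/"                # hypothetical container and prefix
  options:
    account_name: "mystorageaccount"          # hypothetical storage account
    account_key: "<account-key>"              # placeholder, not a real key
```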

SFTP (sftp://)

Connects to SFTP servers.
  • Required parameters: user, plus either password or key_path.
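
A config block for an SFTP-backed source using key-based authentication could be sketched as follows. The host name, the user, and the key location are hypothetical, and carrying the host in the sftp:// URL is an assumption.

```yaml
config:
  path: "sftp://files.example.com/exports/"   # hypothetical host and directory
  options:
    user: "strake"                            # hypothetical account
    key_path: "/home/strake/.ssh/id_ed25519"  # private key, used instead of password
```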


3. Configuration Snippet

Add the following block to your sources.yaml to register a Parquet source backed by AWS S3 with predicate caching enabled:

sources:
  - name: telemetry_s3
    type: file
    source_type: parquet
    predicate_cache: true
    config:
      path: "s3://my-company-analytics-bucket/logs/"
      options:
        aws_access_key_id: "AKIAIOSFODNN7EXAMPLE"
        aws_secret_access_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
        region: "us-west-2"
      tables:
        - name: clickstream
          schema: public
          path: "s3://my-company-analytics-bucket/logs/clickstream.parquet"