This documentation describes the integration of MindsDB with Amazon S3, an object storage service that offers industry-leading scalability, data availability, security, and performance.

This data source integration is thread-safe, utilizing a connection pool where each thread is assigned its own connection. When handling requests in parallel, threads retrieve connections from the pool as needed.

Prerequisites

Before proceeding, ensure that MindsDB is installed locally via Docker or Docker Desktop.

Connection

Establish a connection to your Amazon S3 bucket from MindsDB by executing the following SQL command:

CREATE DATABASE s3_datasource
WITH
    engine = 's3',
    parameters = {
      "aws_access_key_id": "AQAXEQK89OX07YS34OP",
      "aws_secret_access_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
      "bucket": "my-bucket"
    };

Note that sample parameter values are provided here for reference, and you should replace them with your connection parameters.

Required connection parameters include the following:

  • aws_access_key_id: The AWS access key that identifies the user or IAM role.
  • aws_secret_access_key: The AWS secret access key that identifies the user or IAM role.

Optional connection parameters include the following:

  • aws_session_token: The AWS session token that identifies the user or IAM role. This becomes necessary when using temporary security credentials.
  • bucket: The name of the Amazon S3 bucket. If not provided, all available buckets can be queried, however, this can affect performance, especially when listing all of the available objects.

Usage

Retrieve data from a specified object (file) in a S3 bucket by providing the integration name and the object key:

SELECT *
FROM s3_datasource.`my-file.csv`;
LIMIT 10;

If a bucket name is provided in the CREATE DATABASE command, querying will be limited to that bucket and the bucket name can be ommitted from the object key as shown in the example above. However, if the bucket name is not provided, the object key must include the bucket name, such as s3_datasource.my-bucket/my-folder/my-file.csv`.

Wrap the object key in backticks (`) to avoid any issues parsing the SQL statements provided. This is especially important when the object key contains spaces, special characters or prefixes, such as my-folder/my-file.csv.

At the moment, the supported file formats are CSV, TSV, JSON, and Parquet.

The above examples utilize s3_datasource as the datasource name, which is defined in the CREATE DATABASE command.

The special files table can be used to list all objects available in the specified bucket or all buckets if the bucket name is not provided:

SELECT *
FROM s3_datasource.files LIMIT 10

The content of files can also be retrieved by explicitly requesting the content column. This column is empty by default to avoid unnecessary data transfer:

SELECT path, content
FROM s3_datasource.files LIMIT 10

This table will return all objects regardless of the file format, however, only the supported file formats mentioned above can be queried.

Troubleshooting Guide

Database Connection Error

  • Symptoms: Failure to connect MindsDB with the Amazon S3 bucket.
  • Checklist:
    1. Make sure the Amazon S3 bucket exists.
    2. Confirm that provided AWS credentials are correct. Try making a direct connection to the S3 bucket using the AWS CLI.
    3. Ensure a stable network between MindsDB and AWS.

SQL statement cannot be parsed by mindsdb_sql

  • Symptoms: SQL queries failing or not recognizing object names containing spaces, special characters or prefixes.
  • Checklist:
    1. Ensure object names with spaces, special characters or prefixes are enclosed in backticks.
    2. Examples:
      • Incorrect: SELECT * FROM integration.travel/travel_data.csv
      • Incorrect: SELECT * FROM integration.‘travel/travel_data.csv’
      • Correct: SELECT * FROM integration.`travel/travel_data.csv`