📃 CSV File Upload

Overview

For users looking to quickly prototype or train models without setting up a direct database connection, PigeonsAI allows the creation of training datasets directly from CSV files. This method is straightforward and does not require complex configurations, making it ideal for initial testing and smaller datasets.

Requirements

To create a training dataset from a CSV file, ensure that your file meets the following criteria:

  • The CSV file should be structured with clearly defined columns.

  • The file must be accessible from the location where PigeonsAI is running.

  • The maximum size for the CSV file is 3 GB, which accommodates substantial data but ensures processing efficiency.
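The criteria above can be checked locally before any upload is attempted. The following is a minimal sketch, not part of the PigeonsAI client: `check_csv` is a hypothetical helper, the size limit is passed in as a parameter rather than hard-coded, and the check assumes that "clearly defined columns" means a header row with non-empty column names.

```python
import csv
import os


def check_csv(path, max_bytes):
    """Return the header row if the file passes basic checks, else raise ValueError.

    Hypothetical helper for illustration; it is not a PigeonsAI function.
    """
    # The file must be reachable from the machine running the client.
    if not os.path.isfile(path):
        raise ValueError(f"CSV file not found: {path}")

    # Compare the on-disk size against the caller-supplied limit.
    size = os.path.getsize(path)
    if size > max_bytes:
        raise ValueError(f"CSV file is {size} bytes; limit is {max_bytes} bytes")

    # Assumption: the first row is a header naming every column.
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), None)
    if not header or any(not name.strip() for name in header):
        raise ValueError("CSV file needs a header row with non-empty column names")
    return header
```

Calling `check_csv(path, max_bytes)` with the limit that applies to your account returns the list of column names, which is also handy when writing the column mapping for the training dataset.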

Creating a Training Dataset

Here's how you can create a training dataset using a CSV file in PigeonsAI:

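# `client` is assumed to be an already-initialized PigeonsAI client.
res = client.data_connector.create_train_set(
    type='file',  # build the train set from a local file
    train_set_name='demo-user-item-interaction',
    file_path='/home/bs/aaProjects/demo/order_data.csv',
    # Map each expected field to the matching column name in the CSV.
    columns_map={
        'user_id': 'UserId',
        'product_id': 'ProductId',
        'rating': 'Rating',
        'timestamp': 'Timestamp',
        'text_cols': ['ProductType', 'ProductName', 'Color']
    }
)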

Example Output

 Train set creation successful: 201 Created
 Train set URI: uri:train-dataset:biraj_pigeonsai.com:f28e919b-0c95-476f-8600-1557ce1cdc5f

The training set URI printed in the output above is what you pass in later when training a model.
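Before calling `create_train_set`, it can help to confirm that every column named in `columns_map` actually exists in the CSV header. The sketch below is illustrative only: `missing_columns` is a hypothetical helper, not a PigeonsAI function, and it assumes that the values of `columns_map` are CSV column names, given either as a single string or as a list (as with `text_cols`).

```python
def missing_columns(header, columns_map):
    """Return column names referenced in columns_map that are absent from header.

    Hypothetical helper for illustration; it is not a PigeonsAI function.
    """
    wanted = []
    for value in columns_map.values():
        # A value is either one column name or a list of names.
        wanted.extend(value if isinstance(value, (list, tuple)) else [value])
    present = set(header)
    return [name for name in wanted if name not in present]


# Example header that lacks the 'Color' column.
header = ["UserId", "ProductId", "Rating", "Timestamp", "ProductType", "ProductName"]
columns_map = {
    "user_id": "UserId",
    "product_id": "ProductId",
    "rating": "Rating",
    "timestamp": "Timestamp",
    "text_cols": ["ProductType", "ProductName", "Color"],
}

print(missing_columns(header, columns_map))  # ['Color']
```

An empty list means every mapped column was found in the header.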

Advantages of Using CSV Files for Prototyping

  • Quick Setup: No need for database credentials or configurations. Simply provide the path to the CSV file.

  • Flexibility: Easily test different datasets by switching out CSV files as needed.

  • Simplicity: Ideal for users who may not have the technical expertise to manage database connections.

Limitations

While using CSV files is convenient, it's important to consider that:

  • Only datasets up to 50 MB in size can be processed, so this method may not suit very large datasets.

  • Data must be manually updated in the CSV file, unlike database connections that can pull updated data automatically.
