Training

Overview

Before training a recommender model using the transformer architecture in PigeonsAI, it is essential to create a training dataset with specific required columns. These columns ensure that the model can learn effectively from user interactions and other contextual data.

Necessary Train Columns

To train a recommender model, the training dataset must include the following columns:

  • user_id (str): A unique identifier for each user.

  • product_id (str): A unique identifier for each item or product.

  • rating (float): Used alongside the threshold parameter in the train method to determine negative samples. If left empty, every interaction is treated as a positive sample.

  • timestamp (datetime): The time at which the interaction occurred.

  • text_cols (list of str): Additional textual information related to the product, such as descriptions or categories, which can be used to enrich the model's understanding.
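Before connecting a table, it can help to sanity-check that your data has the required columns. The sketch below builds a small illustrative dataset with pandas; the DataFrame itself is hypothetical, since PigeonsAI reads the data from your connected table and columns_map translates your column names to the required ones.

```python
import pandas as pd

# Illustrative training dataset with the required columns.
interactions = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "product_id": ["p10", "p11", "p10"],
    "rating": [5.0, 2.0, 4.0],
    "timestamp": pd.to_datetime(["2024-01-05", "2024-01-07", "2024-01-06"]),
    # text_cols candidates: extra textual columns such as type and brand
    "ProductType": ["shampoo", "lotion", "shampoo"],
    "Brand": ["Acme", "Glow", "Acme"],
})

# Verify the required columns are present before creating the train set.
required = {"user_id", "product_id", "rating", "timestamp"}
missing = required - set(interactions.columns)
assert not missing, f"missing required columns: {missing}"
```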

Creating the Training Dataset

Here's how you can define and create a training dataset suitable for the recommender transformer model:

res = client.data_connector.create_train_set(
    type='connection',
    train_set_name='demo-amazon-beauty-data',
    data_connection_uri='uri:data-connector:biraj_pigeonsai.com:ef57acca-5a2d-4855-a1a5-7dd3f46a02b6',
    table_name='amazon_beauty_data_full',
    columns_map={
        'user_id': 'UserId',
        'product_id': 'ProductId',
        'rating': 'Rating',
        'timestamp': 'Timestamp',
        'text_cols': ['ProductType', 'Brand']
    }
)

Function Overview

The train function in the PigeonsAI recommender module uses transformer architecture to train models that provide personalized recommendations. This function includes several parameters that can be tuned to optimize model performance based on the specific characteristics of your dataset.

Parameters

client.recommender.transformer.train(**training_params)
  • custom_model_name (str): The name for the model. No two models in your account can share the same name.

  • data_source_pri (str):

    • The URI returned by the create_train_set method. Ensure you use the correct URI.

  • threshold (float, optional, default: None) (Important):

    • Used for negative sampling: any interaction rated below the threshold is treated as a negative interaction during training and inference. If left empty, set to None, or set to 0, all interactions are treated as positive.

  • batch_size (int, optional, default: 128):

    • The number of training examples used in one iteration. Common values are 32, 64, 128, and 256. Larger batch sizes can speed up training but require more memory; smaller batches often generalize better.

  • epoch (int, optional, default: 30) (Important):

    • The number of complete passes through the training dataset. More epochs can improve learning at the risk of overfitting; 30-50 is a good range.

  • learning_rate (float, optional, default: 0.0001):

    • Determines the step size at each iteration while moving toward a minimum of the loss function. A smaller learning rate typically requires more training epochs; 0.0001 works well for most datasets.

  • num_layers (int, optional, default: 1):

    • The number of layers in the transformer model. More layers can model complex patterns but increase computational complexity. (No need to change in most scenarios)

  • num_heads (int, optional, default: 4):

    • The number of attention heads in each transformer layer. More heads provide more capacity to the model to learn from data. (No need to change in most scenarios)

  • optimizer_algorithm (str, optional, default: adamw):

    • (No need to change in most scenarios)

Choosing Batch Size

The choice of batch size depends on the size and characteristics of your dataset:

  • 32-64: Suitable for small datasets (up to ~2M rows) or when models benefit from finer-grained updates.

  • 128: A good middle ground for moderate-sized datasets (2M-10M rows).

  • 256: Best for large datasets (10M+ rows) where training needs to be accelerated.
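The guidance above can be captured in a small helper that picks a batch size from the row count. This is a hypothetical convenience function, not part of the PigeonsAI SDK.

```python
def suggest_batch_size(num_rows: int) -> int:
    """Suggest a batch size from the dataset row count (illustrative)."""
    if num_rows < 2_000_000:
        return 64    # small dataset: finer-grained updates
    if num_rows < 10_000_000:
        return 128   # moderate dataset: good middle ground
    return 256       # large dataset: faster throughput


print(suggest_batch_size(500_000))     # → 64
print(suggest_batch_size(25_000_000))  # → 256
```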

Explanation of Rating & Threshold

The threshold parameter tells the model which interactions are considered positive and which negative. To do this, it uses the rating column from the train set.

Movie Rating Example

In a movie recommendation system:

  • Ratings of 1 and 2 stars: Considered negative, showing disinterest.

  • Ratings of 3 stars and above: Treated as positive, indicating liking or acceptance.

In this case, the threshold should be set to 3, indicating to the model that any interaction with a rating below 3 should be treated as a negative sample.
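The rule above ("rating below threshold is negative; no threshold means everything is positive") can be sketched as follows. The helper function is illustrative, not part of the SDK.

```python
def label_interaction(rating: float, threshold=None) -> str:
    """Label a single interaction under the threshold rule (illustrative)."""
    if threshold in (None, 0):
        return "positive"  # no threshold: every interaction is positive
    return "negative" if rating < threshold else "positive"


# Movie ratings with threshold=3: 1-2 stars negative, 3+ stars positive.
labels = {r: label_interaction(r, threshold=3) for r in [1, 2, 3, 4, 5]}
print(labels)
# → {1: 'negative', 2: 'negative', 3: 'positive', 4: 'positive', 5: 'positive'}
```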

Dating App Example

In a dating app:

  • Left swipe (0): Negative interaction, showing no interest.

  • Right swipe (0.5) and Up swipe (1): Positive interactions, indicating varying degrees of interest.

A suitable threshold might be set at 0.5, using left swipes as negative samples, which helps the model avoid recommending unappealing profiles.
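When feedback is implicit, as with swipes, it first has to be encoded as a numeric rating so the threshold rule applies. The mapping below is an illustrative assumption, not a PigeonsAI API.

```python
# Encode swipe events as numeric ratings (illustrative mapping).
SWIPE_RATING = {"left": 0.0, "right": 0.5, "up": 1.0}

events = ["left", "right", "up", "left"]
ratings = [SWIPE_RATING[e] for e in events]
print(ratings)  # → [0.0, 0.5, 1.0, 0.0]

# With threshold=0.5, left swipes (0.0) fall below it and become negatives.
negatives = [r for r in ratings if r < 0.5]
print(negatives)  # → [0.0, 0.0]
```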

E-Commerce Example

In scenarios like e-commerce where every interaction (e.g., a purchase) is positive:

  • Threshold: Should be set at 0, None, or left empty.

This setting indicates that there are no negative interactions.

Example Usage

# Define training parameters
training_params = {
    'custom_model_name': 'your-model-name',
    'data_source_pri': 'your-data-source-identifier',
    'threshold': 2,
    'batch_size': 128,
    'epoch': 30,
    'learning_rate': 0.0001,
    'num_layers': 1,
    'num_heads': 4,
    'optimizer_algorithm': 'adamw'
}

# Train the model
res = client.recommender.transformer.train(**training_params)
print(res)

Example output

Initializing delete-test-8 training 
 Training job creation successful.
 Unique Identifier: delete-test-8-onfvsokb
 Endpoint: https://delete-test-8-onfvsokb.apps.pigeonsai.cloud
 Message: Model is training. It could take a while.
