Modeling Process

Modeling is a multi-stage methodology for creating trained and tested Machine Learning and AI models.

The Modeling Process is essentially a scientific experiment which includes:

  • Development of a Hypothesis - e.g., data collected about a specific previous consumer behavior can be used to predict future behavior

  • Design of the Experiment - e.g., model/algorithm selection

  • Execution of the Experiment - e.g., model training and testing

  • Evaluation and Explanation of Results - e.g., whether the hypothesis holds and what accuracy the model achieves

Process Phases

Phases in the modeling process, which can be highly recursive/iterative, generally include:

  1. Type Identification

  2. Platform Selection

  3. Data Collection

  4. Model/Algorithm Selection

  5. Model Hyperparameter Settings

  6. Model Training

  7. Model Testing

  8. Model Evaluation

  9. Model Deployment

Type Identification

The type of ML/AI needed can have a significant influence on the details of the modeling process phases. Type identification can be driven by:

Major category types include:

Traditional Machine Learning vs. AI Modeling

Generative AI Models such as Large Language Models differ from more traditional ML Models such as Decision Trees in a number of respects.

Model Architecture

Training Data

  • AI Models - are trained on very large volumes of data

  • ML Models - are trained on much smaller volumes of data

Training Compute Resources

  • AI Models - use high levels of computing resources during training

  • ML Models - use relatively lower levels of computing resources during training

Model Fine Tuning

  • AI Models - can be fine-tuned using small datasets for focused applications

  • ML Models - are typically not fine-tuned with additional data after model training

Transfer Learning

  • AI Models - support transfer learning, allowing a pretrained model to be applied to a wide variety of applications

  • ML Models - are typically trained and used for a single, specific application

Platform Selection

ML and AI Platforms are generally of two types:

Data Collection

Data Collection is the process of finding, organizing, cleaning, and storing data in a form that can be fed into model training and prediction processing.

Data Collection can involve:

Model/Algorithm Selection

Model/Algorithm Options

Model algorithms to select from include:

Selection Methodologies

Methods of selecting an algorithm include:

Model Hyperparameter Settings

Hyperparameters control aspects of model instantiation and training. Depending on the model algorithm being used, they can include factors such as the following (a configuration sketch follows the list):

  • activation_function: which Activation Function is used in Activation Nodes

  • batch_size: the number of training samples processed in each iteration before the weights are updated; batch size interacts with the learning rate

  • hidden_network_layers: the number of nodes in each hidden network layer; hidden layers are those between the input and output layers

  • learning_rate: the step size (or step-size schedule) used when adjusting weights during Weight Optimization

  • maximum_number_of_iterations: the maximum number of iterations in which data is processed through the neural network

  • number_of_data_features: the number of data features used for model training and inference processing

  • number_of_informative_data_features: the number of data features correlated to the training outputs; this simulates real world model training where the correlation of data features may not be known

  • number_of_model_classes: the number of output classes the neural network is being trained to predict

  • number_of_training_and_test_samples: the number of data samples processed through model training

  • print_training_progress: whether to print the loss after each training iteration; loss is a measure of the difference between calculated outputs and expected outputs

  • tolerance_for_optimization: a numeric threshold used to end the training iteration cycles once the improvement in loss falls below it

  • weight_optimization_algorithm: the algorithm used for Weight Optimization, such as Stochastic Gradient Descent
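
As a minimal sketch, assuming scikit-learn (whose make_classification and MLPClassifier arguments happen to correspond roughly to the factors above; the specific values are illustrative, not prescribed), the hyperparameters might be set like this:

```python
# A hedged sketch assuming scikit-learn; parameter names are that library's,
# mapped onto the factors listed above.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# number_of_training_and_test_samples, number_of_data_features,
# number_of_informative_data_features, number_of_model_classes
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_classes=3, random_state=42)

model = MLPClassifier(
    hidden_layer_sizes=(32, 16),  # hidden_network_layers: two hidden layers of 32 and 16 nodes
    activation="relu",            # activation_function
    solver="sgd",                 # weight_optimization_algorithm: Stochastic Gradient Descent
    learning_rate_init=0.01,      # learning_rate: initial step size for weight updates
    batch_size=64,                # batch_size
    max_iter=300,                 # maximum_number_of_iterations
    tol=1e-4,                     # tolerance_for_optimization: stop when loss improvement falls below this
    verbose=True,                 # print_training_progress: print the loss each iteration
    random_state=42,
)
```

The training and testing phases described next would then fit this model to the training data and evaluate it on held-out data.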

Model Training

Data is iteratively processed through the model to adjust the weights and biases applied to data array links, producing increasingly accurate output results. The diagram below illustrates an Artificial Neural Network; the concepts also apply to other model algorithms. A code sketch follows the list.

  • Data Inputs - data is fed into the training process

  • Iteration - data is iteratively passed through the neural network

  • Forward Propagation - data is passed from node to node

  • Outputs - output results are fed into loss calculations

  • Loss Calculation - the difference between output results and desired results is calculated

  • Weight Optimization - the amount of change to data flow weights is calculated

  • Backpropagation - modifies the weights and biases applied to data array links
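
As a minimal sketch of these steps, assuming PyTorch and a toy dataset (both illustrative assumptions), each training iteration performs forward propagation, loss calculation, backpropagation, and weight optimization:

```python
import torch
from torch import nn

# Toy data (hypothetical shapes for illustration): 100 samples, 4 features, 3 classes
X = torch.randn(100, 4)                  # Data Inputs
y = torch.randint(0, 3, (100,))

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(50):                  # Iteration
    logits = model(X)                    # Forward Propagation -> Outputs
    loss = loss_fn(logits, y)            # Loss Calculation: difference between outputs and desired results
    optimizer.zero_grad()
    loss.backward()                      # Backpropagation: gradients for each weight and bias
    optimizer.step()                     # Weight Optimization: adjust weights using the gradients
```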

Typically, the training process is performed iteratively while monitoring for factors such as the best accuracy results, as illustrated below:

[Figure: Bias-Variance Tradeoff]

Model Testing

Data is passed forward through the neural network to produce a result and an associated confidence level that the result is correct. The diagram below illustrates an Artificial Neural Network; the concepts also apply to other model algorithms. A brief code sketch follows the list.

  • Data Inputs - data is fed into the testing process

  • Forward Propagation - data is passed from node to node

  • Outputs - output results are fed into confidence level calculations

  • Confidence Level - a number from 0 to 1 indicating the probability that the output result is correct
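
As a minimal sketch, again assuming scikit-learn (an illustrative setup rather than a prescribed one), a trained classifier's predict_proba output provides the per-class confidence levels described above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Hypothetical data and model, trained as in the previous sections
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0).fit(X_train, y_train)

predictions = model.predict(X_test)         # Outputs: predicted class per test sample
confidences = model.predict_proba(X_test)   # Confidence Level: class probabilities in [0, 1]
print(confidences[:5].max(axis=1))          # highest class probability for the first five samples
```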

Model Evaluation

Model Evaluation involves applying Probability and Statistics using measurements such as:

Depending on the results of model evaluation, previous modeling steps may need to be adjusted and repeated.
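
As one illustration (a hedged sketch; the specific measurements and the use of scikit-learn are assumptions rather than requirements of the process), accuracy, precision, recall, and a confusion matrix can be computed from test-set predictions:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

# Hypothetical true labels and model predictions for illustration
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 0]
y_pred = [0, 1, 1, 1, 0, 2, 2, 1, 2, 0]

print("accuracy :", accuracy_score(y_true, y_pred))                    # fraction of correct predictions
print("precision:", precision_score(y_true, y_pred, average="macro"))  # per-class precision, averaged
print("recall   :", recall_score(y_true, y_pred, average="macro"))     # per-class recall, averaged
print(confusion_matrix(y_true, y_pred))                                # rows: true class, columns: predicted class
```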

To reduce overfitting, consider using:

Model Reinforcement Learning with Human Feedback (RLHF)

RLHF is a type of machine learning that combines reinforcement learning and human feedback to train AI models.

Key Benefits

  • Improved alignment: RLHF helps align the agent's objectives with human values and preferences.

  • Flexibility: RLHF can be applied to various domains, including those with complex or nuanced objectives.

  • Efficient learning: Human feedback accelerates learning, reducing the need for large amounts of data or trial and error.

Challenges and Limitations

  • Scalability: Obtaining high-quality human feedback can be time-consuming and expensive.

  • Bias and variability: Human feedback may be subjective, inconsistent, or biased.

  • Evaluation metrics: Assessing the effectiveness of RLHF can be challenging due to the complexity of human feedback.

By combining reinforcement learning with human feedback, RLHF enables AI agents to learn complex behaviors and make decisions that align with human values and preferences. Steps 2-4 below are repeated, with the agent refining its policy through continuous human feedback and reward signals.

Step 1: Environment and Agent
The AI agent interacts with an environment, such as a game, simulation, or text-based interface.

Step 2: Human Feedback
Humans provide feedback on the agent's actions, such as:

  • Rewards (e.g., +1 for good action, -1 for bad action)

  • Preferences (e.g., "I like this action better than that one")

  • Corrections (e.g., "No, do this instead")

Step 3: Reward Signal
The human feedback is converted into a reward signal (often by training a separate reward model), which guides the agent's learning process. The policy is then commonly optimized against this reward signal using Proximal Policy Optimization (PPO); a code sketch of its clipped objective follows the list below.

  • PPO was introduced by OpenAI in 2017, designed to optimize the policy of an agent in a stable and efficient manner. PPO is a type of policy gradient method, which means it focuses on optimizing the policy directly rather than relying on a value function.

  • The key innovation of PPO lies in its use of a clipped surrogate objective function. This function constrains the policy updates by clipping the probability ratio between the new and old policies within a specified range. By doing so, PPO prevents large, destabilizing updates to the policy, ensuring that changes remain within a "trust region" that maintains training stability.

  • This approach allows PPO to achieve a balance between exploration and exploitation, making it more sample efficient and stable compared to previous methods like Trust Region Policy Optimization (TRPO).

  • PPO's simplicity, combined with its effectiveness, has made it a popular choice for various applications, including robotics, game playing, and other high-dimensional tasks.
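
As a minimal sketch of the clipped surrogate objective described above (assuming PyTorch; the variable names and values are illustrative):

```python
import torch

def ppo_clipped_surrogate(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective from PPO (a quantity to maximize)."""
    ratio = torch.exp(new_log_probs - old_log_probs)              # probability ratio between new and old policies
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return torch.min(unclipped, clipped).mean()                   # pessimistic bound keeps updates in the "trust region"

# Illustrative usage with random values
objective = ppo_clipped_surrogate(torch.randn(8), torch.randn(8), torch.randn(8))
loss = -objective  # gradient ascent on the objective == gradient descent on its negative
```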

Step 4: Policy Update
The agent updates its policy (behavior) based on the reward signal, using reinforcement learning algorithms (e.g., Q-learning, policy gradients).

Q-learning is a reinforcement learning algorithm that enables an agent to learn optimal action-selection policies in an environment. Here's how it works:

1. Q-Table Initialization

The algorithm starts by creating a Q-table, which is a matrix where rows represent states and columns represent actions. All Q-values are initially set to zero or random small values.

2. Exploration and Exploitation

The agent interacts with the environment, balancing between exploring new actions and exploiting known good actions, often using an epsilon-greedy strategy.

3. Action Selection

In each state, the agent selects an action, either randomly (exploration) or based on the highest Q-value for that state (exploitation).

4. Reward Observation

After taking an action, the agent observes the reward received and the new state it has transitioned to.

5. Q-Value Update

The Q-value for the state-action pair is updated using the Q-learning formula:

Q(s,a) = Q(s,a) + α * [R + γ * max(Q(s',a')) - Q(s,a)]

Where:

- Q(s,a) is the current Q-value

- α is the learning rate

- R is the reward received

- γ is the discount factor

- max(Q(s',a')) is the maximum Q-value for the next state

6. Iteration

Steps 3-5 are repeated for many episodes, allowing the agent to learn from various experiences.

7. Convergence

Over time, the Q-values converge to optimal values, representing the expected cumulative reward for each action in each state.

8. Policy Extraction

Once training is complete, the optimal policy can be extracted by selecting the action with the highest Q-value for each state.

Q-learning is model-free (doesn't require knowledge of the environment's dynamics) and off-policy (can learn from actions not in the current policy). It effectively learns to make optimal decisions by iteratively improving its estimates of action values based on the rewards received and the structure of the environment.
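
As a minimal sketch of steps 1-8, tabular Q-learning can be run on a toy, hypothetical "corridor" environment (the environment, rewards, and constants are illustrative assumptions):

```python
import numpy as np

# Toy corridor: states 0..4 laid out in a line; reaching state 4 ends the episode
# with reward +1, every other move costs a small penalty. Actions: 0 = left, 1 = right.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))                    # 1. Q-Table Initialization
alpha, gamma, epsilon = 0.1, 0.9, 0.1                  # learning rate, discount factor, exploration rate

rng = np.random.default_rng(0)
for episode in range(500):                             # 6. Iteration over many episodes
    s = 0
    for step in range(100):                            # cap episode length for safety
        # 2-3. Exploration vs. exploitation (epsilon-greedy action selection)
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 1.0 if s_next == n_states - 1 else -0.01   # 4. Reward Observation
        # 5. Q-Value Update: Q(s,a) += alpha * [R + gamma * max(Q(s',a')) - Q(s,a)]
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
        if s == n_states - 1:                          # terminal state reached
            break

# 7-8. After convergence, extract the policy: best action in each state
print(np.argmax(Q, axis=1))                            # expected: 1 ("right") for states 0..3
```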

Model Deployment

Model software deployment typically involves packaging the trained model and making it available to applications for inference.
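
As one illustration, a minimal sketch assuming joblib serialization (a common but hypothetical choice here) packages a trained scikit-learn model so that a serving application can load it:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Hypothetical trained model (stand-in for the model produced by the phases above)
X, y = make_classification(n_samples=200, n_features=10, n_informative=5, random_state=0)
model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0).fit(X, y)

# Package the trained model as a file that a serving application can load
joblib.dump(model, "consumer_behavior_model.joblib")   # file name is illustrative

# In the serving application: load the model and run inference on incoming data
loaded_model = joblib.load("consumer_behavior_model.joblib")
print(loaded_model.predict(X[:5]))
```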

References