Machine Learning
| Question | Answer |
|---|---|
| k-Nearest Neighbors (kNN) | is a supervised learning algorithm that predicts the label (or value) of a new point from its k nearest training points, for classification or regression problems (see the kNN sketch after this table) |
| Linear Regression | is a supervised learning method that can help predict a quantitative response by approximating a linear relationship (y=b0 +b1x). |
| Loss Function | a function that measures our model error |
| Mean Square Error (MSE) | a metric that measures the average squared error of our predictions (see the error-metric sketch after this table) |
| Max Absolute Error | a metric that measures the largest absolute residual |
| Mean Absolute Error | a metric that takes the average of the absolute errors |
| Multi-Linear Regression | a model of the form y = b0 + b1x1 + b2x2 +...+bpxp + epsilon |
| Polynomial Regression | a learning algorithm that captures non-linear relationships between our predictors and response, of the form y = b0 + b1x + b2x^2 + ... + bdx^d + epsilon |
| Model Selection | the application of a principled method to determine the complexity of the model, e.g., choosing a subset of predictors, choosing the degree of the polynomial model etc. |
| Underfitting | when a model is too simple to capture data complexities and fails to fit both the training and testing data |
| Reasons for Underfitting | simple model, small training data, excessive regularization, features are not scaled, and features are not good predictors of our target variable |
| Techniques to Reduce Underfitting | increase model complexity, increase number of features, remove noise, increase epoch or training time |
| Overfitting | when a model does not make accurate predictions on testing data but performs well on the training data |
| Reasons for Overfitting | high variance and low bias, model is too complex, size of the training data |
| Techniques to Reduce Overfitting | increase training data, reduce model complexity, early stopping, ridge regression, lasso regression, dropout |
| Bias | our model's overall accuracy: how close our model's predictions are to the actual values |
| Variance | our model's precision: if we run our model many times, how often do we get the same prediction |
| Low Bias Low Variance Symptom | Training and testing error are both low, BEST CASE |
| Low Bias Low Variance Cause | Model has the right balance between bias and variance. It is able to capture the true relationship between the target and predicted variable and is stable to changes in the training data |
| Low Bias Low Variance Solution | NOT NEEDED |
| Low Bias High Variance Symptom | Training error is low and testing error is high |
| Low Bias High Variance Cause | - Model is overfitting to training data, it is learning both the signal and the noise in the training data and does not generalize well to unknown data - Complex models are usually unstable and can change a lot when any data changes occur |
| Low Bias High Variance Solution | Build a simpler model - Hyperparameter tuning, Regularization, Dimensionality Reduction, Bagging. Get more samples in the training data. |
| High Bias Low Variance Symptom | Training Error is very high and test error is almost the same as the training error |
| High Bias Low Variance Cause | - Model is underfitting and it is too simple to capture the true relationship between target and predictor variables. This becomes a source of high bias - Simple models tend to be more stable, with low variance to changes in the training data |
| High Bias Low Variance Solution | Build a more complex model - Add more features, build bigger networks with more hidden nodes and layers (deep learning), deeper trees (random forest), more trees (Gradient Boosting Machines) |
| High Bias High Variance Symptom | Training error is high and test error is even higher than the training error |
| High Bias High Variance Cause | - Model is underfitting and is too simple to capture the true relationship between target and predictor variables. - If the model has limited samples, this will contribute to high variance and lead to an unstable model |
| High Bias High Variance Solution | - Build a more complex model. Add more features, build bigger networks with more hidden nodes and layers (deep learning), deeper trees (random forest), more trees. - Get more samples in the training data |
| Cross Validation | when we standardize and split our data into 3 subsets: train, validation, and test |
| K-Fold Cross Validation | we first divide our dataset into k equally sized subsets. Then, we repeat the train-test method k times such that each time one of the k subsets is used as a test set and the rest k-1 subsets are used together as a training set (see the k-fold sketch after this table). |
| Leave One Out Cross Validation | we train our machine-learning model n times, where n is equal to our dataset’s size. Each time, only one sample is used as the test set while the rest are used to train our model. |
| Lasso Regression (L1) | shrinks the coefficients and helps to reduce model complexity and multicollinearity. Some of the coefficients may shrink all the way to zero, thus eliminating the least important features |
| Ridge Regression (L2) | the coefficients are shrunk towards a central point (the mean) by introducing a penalization factor that takes the square of our coefficients |
| Principal Component Analysis (PCA) | is a linear dimensionality reduction technique that can be utilized for extracting information from a high-dimensional space by projecting it into a lower-dimensional sub-space. It retains the data in the direction of maximum variance. |
| Logistic Regression | a supervised machine learning algorithm that accomplishes binary classification tasks by predicting the probability of an outcome, event, or observation |
| Types of Logistic Regression | 1) Binary Logistic Regression 2) Multinomial Logistic Regression 3) Ordinal Logistic Regression |
| Decision Trees | a non-parametric supervised learning algorithm, which is utilized for both classification and regression tasks. |
| Decision Tree Attribute Identification Methods | 1) Information Gain (Entropy) 2) Gini Index |
| Information Gain (Entropy) | is applied to quantify which feature provides the most information about the classification, based on the notion of uncertainty, disorder, or impurity (purer toward the bottom of the tree) |
| Gini Index | measures the probability of a randomly chosen instance being misclassified; the lower the metric, the lower the likelihood of misclassification (see the entropy/Gini sketch after this table) |
| Ensemble Learning Method | is a widely-used and preferred machine learning technique in which multiple individual models, often called base models, are combined to produce an effective optimal prediction model |
| Bagging | an ensemble learning technique that generates different data subsets at random and with replacement, trains base models in parallel, and aggregates their predictions (average or majority vote) for our final model |
| Random Forest | is an application of bagging (bootstrapped decision trees) |
| Boosting | an ensemble learning technique that trains base models in sequential order, each one focusing on the errors of the previous, and aggregates their predictions for our final model |
| What are the different types of Boosting? | - Ada Boost - Gradient Boosting - XGBoost |
| Ada Boost | a boosting method that gives more weight to incorrectly classified items. This approach does not work well when there is a correlation among features or high data dimensionality. |
| Gradient Boosting | a boosting method that optimizes the loss function by generating base learners sequentially so that the present base learner is always more effective than the previous one |
| XGBoost | a boosting method that uses multiple cores on the CPU so that learning can occur in parallel during training. It is a boosting algorithm that can handle extensive datasets, making it attractive for big data applications |
| Natural Language Processing (NLP) | a branch of artificial intelligence that allows computers to interpret, analyze and manipulate human language |
| NLP Techniques | - Tokenization - Stop-words - Stemming/Lemmatization - Preprocessing steps - Bag of words model |
| Text Processing | - Tokenize - Remove Stop Words - Clean special characters in text - Stemming/Lemmatization |
| Tokenization | dividing strings into lists of substrings; it consists of dividing a piece of text into smaller pieces. We can divide a paragraph into sentences, a sentence into words, or a word into characters. |
| Stemming | the process of removing prefixes and suffixes from words so that they are reduced to simpler forms |
| Lemmatization | the process of reducing words to their root form (lemma) |
| Bag of Words Model | This model allows us to extract features from the text by converting the text into a matrix of occurrence of words. The main issue is that different sentences can yield similar vectors |
| Term frequency–inverse document frequency | is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus (see the TF-IDF sketch after this table) |
| Term frequency | summarizes how often a word appears within a document |
| Inverse document frequency | downscales words that appear a lot across documents in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus. |
| Confusion Matrix | a performance evaluation tool in machine learning, representing the accuracy of a classification model. It displays the number of true positives, true negatives, false positives, and false negatives. |
| True Positive (TP) | the actual value was positive, and the model predicted a positive value |
| True Negative (TN) | the actual value was negative, and the model predicted a negative value. |
| False Positive (FP) | the actual value was negative, but the model predicted a positive value. (Type I Error) |
| False Negative (FN) | the actual value was positive, but the model predicted a negative value.(Type II Error) |
| Accuracy | evaluates the overall effectiveness of a classifier... accuracy = (TP + TN) / (TP + FP + TN + FN) |
| Recall | evaluates the proportion of actual positives correctly labeled... recall = TP / (TP + FN) |
| Precision | a metric that tells us about the quality of positive predictions... precision = TP / (TP + FP) |
| F1 Score | the harmonic mean of Precision and Recall... F1 = (2 x Precision x Recall) / (Precision + Recall) (see the metrics sketch after this table) |
| ROC Curve | is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters: true positive rate (recall, TPR) and false positive rate (FPR)... FPR = FP / (FP + TN) |
| AUC | measures the entire two-dimensional area underneath the ROC curve |
| Precision | the number of true positives divided by the total number of positive predictions made by the model |
| Accuracy | the number of correct predictions divided by the total number of predictions made by the model. |
| Recall | the number of true positives divided by the actual positive values |
| Supervised Learning | a category of machine learning that uses labeled datasets to train algorithms to predict outcomes and recognize patterns. |
| Unsupervised Learning | a type of machine learning that learns patterns from data without human supervision or labels |
| K-means Clustering | a centroid-based algorithm or a distance-based algorithm, where we calculate the distances to assign a point to a cluster |
| Hierarchical Clustering | an algorithm where clusters are visually represented in a hierarchical tree called a dendrogram. There are two types: - Agglomerative - Divisive |
| Agglomerative Clustering | a bottom-up clustering algorithm. It starts clustering by treating the individual data points as a single cluster then it is merged continuously based on similarity until it forms one big cluster containing all objects. |
| Types of Linkage Methods (hierarchical clustering) | - Complete Linkage - Single Linkage - Average Linkage - Centroid Linkage - Wards Method |
| Complete Linkage | the maximum of all pairwise distance between elements in each pair of clusters is used to measure the distance between two clusters |
| Single Linkage | the minimum of all pairwise distance between elements in each pair of clusters is used to measure the distance between two clusters |
| Average Linkage | the average of all pairwise distances between elements in each pair of clusters is used to measure the distance between two clusters. |
| Centroid Linkage | Before merging, the distance between the two clusters’ centroids are considered. |
| Ward's Method | It uses squared error to compute the similarity of the two clusters for merging. |
| Divisive Clustering | an algorithm that starts by considering all the data points into a big single cluster and later on splitting them into smaller heterogeneous clusters continuously until all data points are in their own cluster. |
| Density-Based Clustering Algorithm (DBSCAN) | an unsupervised learning method that identifies distinctive clusters in the data, based on the idea that a cluster in data space is a contiguous region of high point density, separated from other such clusters by contiguous regions of low point density |
| DBSCAN Hyper-parameters | epsilon: the radius of a neighborhood around an observation. MinPts: the minimum number of points within an epsilon radius of an observation for it to be considered a “core” point |
| Types of Points in DBSCAN | - Core points - observations with at least MinPts total observations within an epsilon radius - Border points - observations that are not core points, but are within epsilon of a core point - Noise points - everything else |
| Clustering Algorithms evaluation | Silhouette Plots, Elbow Method, Average Silhouette Method, Gap Statistic |
| Silhouette Plots | displays a measure of how close each point in one cluster is to points in the neighboring clusters. Observations with s ≈ 1 are well-clustered, s ≈ 0 lie between two clusters, s < 0 are probably in the wrong cluster. |
| Elbow Method | is a technique used in clustering analysis to determine the optimal number of clusters. It involves plotting the within-cluster sum of squares (WCSS) for different cluster numbers and identifying the “elbow” point where WCSS starts to level off. |
| Average Silhouette Method | it determines how well each object lies within its cluster. The method computes the average silhouette of observations for different values of k. The optimal number of clusters k is the one that maximizes the metric over a range of possible values for k |
| Gap Statistic | For a particular choice of K clusters, compare the total within cluster variation to the expected within-cluster variation under the assumption that the data have no obvious clustering. |
| Perceptron | a single-layer linear neural network, or a machine learning algorithm used for supervised learning of various binary classifiers |
| Perceptron Key Components | - Inputs - Weight - Summation & Bias - Activation Function |
| Multi-Layer Perceptron (MLP) | a feedforward Neural Network that learns the relationship between linear and non-linear data. |
| Backpropagation | is the learning mechanism that allows the Multilayer Perceptron to iteratively adjust the weights in the network, with the goal of minimizing the cost function. |
| NN Design Choices | - Activation Function - Loss Function - Output Units - Architecture |
| Activation Functions | - Sigmoid - Softmax - Tanh - ReLU - Leaky ReLU - Softplus - Swish |
| Sigmoid | this curve is an S-shaped curve. It should only be used on the output layer. Used for binary and multi-label classification. Suffers from the vanishing gradient problem |
| Softmax | used for multinomial logistic regression, hence used in the output layer of a multi-class classification neural network. Uses cross entropy loss. Its normalization reduces the influence of outliers in the data |
| Tanh | Tanh is a smoother, zero-centered function having a range between -1 to 1. Suffers from vanishing gradient |
| ReLU | an activation function that will output the input as it is when the value is positive; else, it will output 0. They are easily optimized but suffer from dying neurons |
| Leaky ReLU | An improvement over ReLU by introducing a small negative slope to solve the problem of dying ReLU. helps speed up training |
| Softplus | a smoother version of the rectifying non-linearity activation function and can be used to constrain a machine's output always to be positive |
| Swish | a gated version of the sigmoid activation function. |
| Gradient Descent | an optimization algorithm that measures the change in the error with respect to each weight (the gradient) and updates the weights to reduce the error |
| Types of Gradient Descent | - Batch Gradient Descent - Stochastic Gradient Descent (SGD) - Mini-Batch Gradient Descent |
| Batch Gradient Descent | Calculates the error for each example within the training dataset, but only after all training examples have been evaluated does the model get updated. |
| Stochastic Gradient (SGD) | updates the parameters for each training example one by one. |
| Mini-Batch Gradient Descent | a combination of the concepts of SGD and batch gradient descent. It simply splits the training dataset into small batches and performs an update for each of those batches (see the mini-batch sketch after this table). |
| Gradient Descent Optimizer Types | - Gradient Based (Momentum) - Learning Rate Based (AdaGrad) |
| Momentum | invented for reducing high variance in SGD and softens the convergence. It accelerates the convergence towards the relevant direction and reduces the fluctuation to the irrelevant direction |
| Momentum Advantages | - Reduces the oscillation and high variance of the parameters - converges faster than gradient descent |
| Nesterov Accelerated Gradient | "look before you leap": we look a step ahead, meaning we calculate the gradient at a partially updated value instead of calculating it at our current position |
| Adaptive Learning Rate | techniques used in optimizing deep learning models by automatically adjusting the learning rates during the training process |
| Adagrad | an algorithm for gradient-based optimization. Performs larger updates (e.g. high learning rates) for those parameters that are related to infrequent features and smaller updates (i.e. low learning rates) for frequent ones. |
| Adadelta | is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, it restricts the window of accumulated past gradients to some fixed size. |
| RMSProp | divides the learning rate by an exponentially decaying average of squared gradients, which solves Adagrad's shrinking learning rate problem |
| Adam | computes adaptive learning rates for each parameter by keeping an exponentially decaying average of past gradients (RMSProp + Momentum) |
| Nadam | combines Adam and Nesterov Accelerated Gradient |
| Regularization | a modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error |
| Parameter Initialization | - Zero Initialization - Random Initialization - Xavier Initialization - He Initialization |
| Zero Initialization | Initialize all the weights and biases to zero. This is not generally used in deep learning as it leads to symmetry in the gradients, resulting in all the neurons learning the same feature. |
| Random Initialization | Initialize the weights and biases randomly from a uniform or normal distribution. This is the most common technique used in deep learning. |
| Xavier Initialization | Initialize the weights with a normal distribution with mean 0 and standard deviation sqrt(1/n), where n is the number of neurons in the previous layer. This is used for the sigmoid activation function. |
| He Initialization | Initialize the weights with a normal distribution with mean 0 and standard deviation sqrt(2/n), where n is the number of neurons in the previous layer. This is used for the ReLU activation function. |
| Orthogonal Initialization | Initialize the weights with an orthogonal matrix, which preserves the gradient norm during backpropagation. |
| Uniform Initialization | Initialize the weights with a uniform distribution. This is less commonly used than random initialization. |
| Constant Initialization | Initialize the weights and biases with a constant value. This is rarely used in deep learning |
| Norm Penalties | a NN regularization technique that includes penalties by adding a parameter norm penalty |
| L1 Parameter Regularization (Lasso) | adds “Absolute value of magnitude” of coefficient, as penalty term to the loss function. |
| L2 Parameter Regularization (Ridge Regression) | adds “squared magnitude of the coefficient” as penalty term to the loss function. |
| Types of NN Regularization | - Norm Penalties (L1 & L2) - Parameter Initialization - Early Stopping - Data Augmentation - Dropout |
| Early Stopping | an optimization technique used to reduce overfitting without compromising on model accuracy. The main idea behind this technique is to stop training before a model starts to overfit: training is terminated once validation set performance stops improving |
| Data Augmentation | a technique of artificially increasing the training set by creating modified copies of a dataset using existing data. It includes making minor changes to the dataset or using deep learning to generate new data points |
| Audio Data Augmentation | - Noise injection - Shifting: - Changing the speed - Changing the pitch |
| Text Data Augmentation | - Word or sentence shuffling - word replacement - syntax-tree manipulation - random word insertion - random word deletion |
| Image Data Augmentation | - geometric transformation - color space transformation - kernel filters - random erasing |
| Advance Data Augmentation Techniques | - Generative adversarial networks (GANs) - Neural Style Transfer |
| Data Augmentation Advantages | - Prevents overfitting - Helps with small datasets - Improves the model accuracy |
| Data Augmentation Disadvantages | - Does not address biases - Quality assurance of data is expensive - Can be challenging to implement on all problems |
| Dropout | refers to randomly dropping out nodes (input and hidden layers) in a neural network during training, which ensures that there is no node co-dependence |
| Convolutional Neural Networks | a type of NN that is used for image data by doing Representation Learning or Feature Learning. Input nodes are used for every pixel in our image: one per pixel for black & white images, or three per pixel for the RGB values. |
| Representation Learning | a technique that allows a system to automatically find relevant features for a given task. Weights are assigned based on the importance of the pixel. |
| Types of Layers in CNN | - Convolutional Layers - Pooling Layers - Fully Connected Layers |
| Convolutional Layers | - Apply filters to extract features - Filters are composed of small kernels, learned - One bias per filter - Apply activation function on every value of feature map |
| Convolutional Layers Parameters | - Number of filters - Size of kernels(W and H only, D is defined by input cube) - Activation Function - Stride - Padding |
| Convolutional Layer I/O | - Input: previous set of feature maps: 3D cuboid - Output: 3D cuboid, one 2D map per filter |
| Pooling Layer | - Reduce dimensionality - Extract maximum or average of a region - Sliding window approach |
| Pooling Layer Parameters | - Stride - Size of window |
| Pooling Layers I/O | - Input: previous set of feature maps, 3D cuboid - Output: 3D cuboid, one 2D map per filter, reduced spatial dimension |
| Fully Connected Layers | - Aggregate information from final feature maps - Generate Final classification, regression, segmentation, etc |
| Fully Connected Layers Parameters | - Number of nodes - Activation function: usually changes depending on the role of the layer. If aggregating info, use ReLU. If producing the final classification, use Softmax. If regression, use linear |
| Fully Connected Layers I/O | - Input: Flattened previous set of feature maps - Output: Probabilities for each class or simply prediction for regression y_hat |
| Padding | the addition of extra pixels around the borders of the input images or feature map |
| Full Padding | Introduces zeros such that all pixels are visited the same number of times by the filter. Increases size of output |
| Same Padding | Ensures that the output has the same size as the input. |
| Stride | the number of pixels by which we move the filter across the input image. |
| Pooling | a new layer added after the convolutional layer. Specifically, it is added after a nonlinearity (e.g. ReLU) has been applied to the feature maps*. (*Maxpooling can be applied before ReLU) |
| CNN Dropout | layer is similar to multiplying Bernoulli noise into the feature maps of the network. |
| Layers Receptive Field | the region in the input space that a particular CNN feature (or activation) is looking at. Large receptive fields are necessary for high-level recognition tasks, but with diminishing returns |
| Tensorboard | suite of visualization tools to understand, debug, and optimize TensorFlow programs for ML experimentation |
| Occlusion methods | method that attributes importance to a region of an image for classification |
| Saliency Maps | a technique that highlights the pixels that were relevant for a certain image classification in a NN by calculating the gradient |
| DeconvNet | a technique that highlights the pixels that were relevant for a certain image classification in a NN by calculating the gradient, but reverses the ReLU layers |
| Guided Backpropagation Algorithm | a combination of Gradient Based Backpropagation and DeconvNet |
| Class Activation Map (CAM) | another explanation method for interpreting convolutional neural networks, in which the fully connected layers at the very end of the model are replaced by a Global Average Pooling (GAP) layer and a class activation mapping layer |
| Gradient-weighted Class Activation Mapping (Grad-CAM) | uses the gradients of any target concept flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept |
| Unigrams | assume each word is independent of all others. They count how often each word occurs. |
| Bigrams | Bigrams look at pairs of consecutive words (conditional probability) |
| Term Frequency – Inverse Document Frequency (TF-IDF ) | a measure, used in the fields of information retrieval (IR) and machine learning, that can quantify the importance or relevance of string representations in a document amongst a collection of documents |
| Term Frequency | measures the count of a word in a given document in proportion to the total number of words in a document |
| Document Frequency | measures the importance of the word t across all documents in the corpus |
| Recurrent Neural Network (RNN) | is a deep learning model that is trained to process and convert a sequential data input into a specific sequential data output. they have a recurrent workflow: the hidden layer can remember and use previous inputs for future predictions |
| Long Short Term Memory (LSTM) | an RNN variant that enables the model to expand its memory capacity to accommodate a longer timeline by using 3 gates (input, output, forget) and 2 hidden states |
| Gated Recurrent Units | an RNN variant that enables selective memory retention. The model adds update and forget gates to its hidden layer, which can store or remove information in memory. |
| Types of RNN Language Models | - Bidirectional RNN - Character CNN |
| Bidirectional RNN | is 2 RNNs stacked on top of each other, one processing the sequence forward and one backward. The output is then composed based on the hidden states of both RNNs. The idea is that the output may not only depend on previous elements in the sequence but also on future elements |
| Character CNN | a CNN application on sentence tensors created by one hot encoding our words by character |
| Seq2Seq | a language model that consist of an encoder and decoder section where the encoder compresses our input into a context vector that our decoder uses to generate our output sequence one token at a time |
| Attention | a technique that allows the decoder to "look back” at the complete input and extracts significant information via a similarity score that is useful in decoding |
| Transformers | an architecture for transforming one sequence into another one with the help of two parts (Encoder and Decoder), but it does not implement any RNN |
| Bidirectional Encoder Representations from Transformers (BERT) | is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers |
| Generative Adversarial Network (GANs) | is a machine learning (ML) model in which two neural networks compete with each other by using deep learning methods to become more accurate in their predictions |
| Naive Bayes | is a popular supervised machine learning algorithm used for classification tasks. Based on the assumption that the features of the input data are conditionally independent given the class, allowing the algorithm to make predictions quickly and accurately. |
| Support Vector Machines | a supervised machine learning algorithm where we try to find the hyperplane that best separates the two classes |
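
The sketches below are minimal, illustrative companions to a few of the cards above; all data in them is made up toy data. First, the kNN idea, assuming a NumPy-only implementation (`knn_predict` and the toy arrays are hypothetical names, not part of the flashcards):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                     # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                    # majority vote among the k labels

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0
```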
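
The three regression error metrics (MSE, Mean Absolute Error, Max Absolute Error) computed directly from their definitions; the `y_true` / `y_pred` values are arbitrary toy numbers:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

residuals = y_true - y_pred
mse    = np.mean(residuals ** 2)       # Mean Square Error: average squared error
mae    = np.mean(np.abs(residuals))    # Mean Absolute Error: average absolute error
max_ae = np.max(np.abs(residuals))     # Max Absolute Error: largest absolute residual

print(mse, mae, max_ae)  # 0.375 0.5 1.0
```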
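
A k-fold cross-validation sketch, assuming a NumPy-only index split; `fit_and_score` in the usage comment is a hypothetical placeholder for whatever model training/evaluation step you use:

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs: each fold serves once as the test set."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)   # k roughly equal, shuffled folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Usage (hypothetical model step), averaging the score over the k held-out folds:
# scores = [fit_and_score(X[tr], y[tr], X[te], y[te])
#           for tr, te in k_fold_indices(len(X), k=5)]
```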
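
The two decision-tree split criteria (entropy and Gini index) computed from the class labels at a node, assuming base-2 entropy and a toy label array:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))      # 0 for a pure node, higher = more disorder

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)         # 0 for a pure node

node = np.array([0, 0, 1, 1, 1, 1])
print(entropy(node), gini(node))        # ~0.918, ~0.444
```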
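
A plain TF-IDF sketch following the definitions in the table: term frequency within a document times log inverse document frequency across the corpus. Real libraries add smoothing and normalization variants, and the three-document corpus here is made up:

```python
import math

corpus = [["the", "cat", "sat"],
          ["the", "dog", "sat"],
          ["the", "cat", "ran"]]

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)              # term frequency within the document
    df = sum(1 for d in corpus if term in d)     # document frequency across the corpus
    idf = math.log(len(corpus) / df)             # inverse document frequency
    return tf * idf

print(tf_idf("cat", corpus[0]))  # > 0: informative word
print(tf_idf("the", corpus[0]))  # = 0: appears in every document, so downscaled to nothing
```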
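
The confusion-matrix metrics (accuracy, precision, recall, F1) computed from TP/TN/FP/FN counts for a toy binary classifier:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))   # Type I error
fn = np.sum((y_true == 1) & (y_pred == 0))   # Type II error

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # 0.75 0.75 0.75 0.75
```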
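
A mini-batch gradient descent sketch for linear regression: the training set is shuffled and split into small batches, and the weights are updated once per batch, as described above. The synthetic data, learning rate, and batch size are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)     # noisy linear target

w, lr, batch_size = np.zeros(3), 0.1, 16
for epoch in range(50):
    order = rng.permutation(len(X))                   # shuffle each epoch
    for start in range(0, len(X), batch_size):
        b = order[start:start + batch_size]
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)   # MSE gradient on the batch
        w -= lr * grad                                    # one update per mini-batch

print(w)  # should land close to [2.0, -1.0, 0.5]
```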