Machine Learning
| Question | Answer |
|---|---|
| k-Nearest Neighbors (kNN) | is a supervised learning algorithm that predicts the label (or value) of a new point from its k nearest training points, for classification or regression problems (see the kNN sketch after this table) |
| Linear Regression | is a supervised learning method that can help predict a quantitative response by approximating a linear relationship (y=b0 +b1x). |
| Loss Function | a function that measures our model error |
| Mean Square Error (MSE) | a metric that measures the average squared error of our predictions (see the error-metric sketch after this table) |
| Max Absolute Error | a metric that measures the largest absolute residual |
| Mean Absolute Error | a metric that takes the average of the absolute errors |
| Multi-Linear Regression | a model of the form y = b0 + b1x1 + b2x2 +...+bpxp + epsilon |
| Polynomial Regression | a learning algorithm that captures non-linear relationships between our predictors and response, of the form y = b0 + b1x + b2x^2 + ... + bdx^d + epsilon |
| Model Selection | the application of a principled method to determine the complexity of the model, e.g., choosing a subset of predictors, choosing the degree of the polynomial model etc. |
| Underfitting | when a model is too simple to capture data complexities and fails to fit both the training and testing data |
| Reasons for Underfitting | simple model, small training data, excessive regularization, features are not scaled, and features are not good predictors of our target variable |
| Techniques to Reduce Underfitting | increase model complexity, increase number of features, remove noise, increase epoch or training time |
| Overfitting | when a model does not make accurate predictions on testing data but performs well on the training data |
| Reasons for Overfitting | high variance and low bias, model is too complex, size of the training data |
| Techniques to Reduce Overfitting | increase training data, reduce model complexity, early stopping, ridge regression, lasso regression, dropout |
| Bias | our model's overall accuracy: how close our model's predictions are to the actual values |
| Variance | our model's precision: if we run our model many times, how often do we get the same prediction |
| Low Bias Low Variance Symptom | Training and testing error are both low, BEST CASE |
| Low Bias Low Variance Cause | Model has the right balance between bias and variance. It is able to capture the true relationship between the target and predicted variable and is stable to changes in the training data |
| Low Bias Low Variance Solution | NOT NEEDED |
| Low Bias High Variance Symptom | Training error is low and testing error is high |
| Low Bias High Variance Cause | - Model is overfitting to training data, it is learning both the signal and the noise in the training data and does not generalize well to unknown data - Complex models are usually unstable and can change a lot when any data changes occur |
| Low Bias High Variance Solution | Build a simpler model - Hyperparameter tuning, Regularization, Dimensionality Reduction, Bagging. Get more samples in the training data. |
| High Bias Low Variance Symptom | Training Error is very high and test error is almost the same as the training error |
| High Bias Low Variance Cause | - Model is underfitting and it is too simple to capture the true relationship between target and predictor variables. This becomes a source of high bias - Simple models tend to be more stable, with low variance to changes in the training data |
| High Bias Low Variance Solution | Build a more complex model - Add more features, build bigger networks with more hidden nodes and layers (deep learning), deeper trees (random forest), more trees (Gradient Boosting Machines) |
| High Bias High Variance Symptom | Training error is high and test error is even higher than the training error |
| High Bias High Variance Cause | - Model is underfitting and is too simple to capture the true relationship between target and predictor variables. - If the model has limited samples, this will contribute to high variance and lead to an unstable model |
| High Bias High Variance Solution | - Build a more complex model. Add more features, build bigger networks with more hidden nodes and layers (deep learning), deeper trees (random forest), more trees. - Get more samples in the training data |
| Cross Validation | when we standardize and split our data into 3 subsets: train, validation, and test |
| K-Fold Cross Validation | we first divide our dataset into k equally sized subsets. Then, we repeat the train-test method k times such that each time one of the k subsets is used as a test set and the rest k-1 subsets are used together as a training set (see the k-fold sketch after this table). |
| Leave One Out Cross Validation | we train our machine-learning model n times, where n is equal to our dataset’s size. Each time, only one sample is used as the test set while the rest are used to train our model. |
| Lasso Regression (L1) | shrinks the coefficients and helps to reduce model complexity and multicollinearity. Some of the coefficients may shrink all the way to zero, thus eliminating the least important features |
| Ridge Regression (L2) | the coefficients are shrunk towards a central point (the mean) by introducing a penalization factor that takes the square of our coefficients |
| Principal Component Analysis (PCA) | is a linear dimensionality reduction technique that can be utilized for extracting information from a high-dimensional space by projecting it into a lower-dimensional sub-space. It retains the data in the direction of maximum variance. |
| Logistic Regression | a supervised machine learning algorithm that accomplishes binary classification tasks by predicting the probability of an outcome, event, or observation |
| Types of Logistic Regression | 1) Binary Logistic Regression 2) Multinomial Logistic Regression 3) Ordinal Logistic Regression |
| Decision Trees | a non-parametric supervised learning algorithm, which is utilized for both classification and regression tasks. |
| Decision Tree Attribute Identification Methods | 1) Information Gain (Entropy) 2) Gini Index |
| Information Gain (Entropy) | is applied to quantify which feature provides the most information about the classification, based on the notion of uncertainty, disorder, or impurity (purer toward the bottom of the tree) |
| Gini Index | measures the probability of a randomly chosen instance being misclassified; the lower the metric, the lower the likelihood of misclassification (see the entropy/Gini sketch after this table) |
| Ensemble Learning Method | is a widely-used and preferred machine learning technique in which multiple individual models, often called base models, are combined to produce an effective optimal prediction model |
| Bagging | an ensemble learning technique that generates different data subsets at random and with replacement, trains base models in parallel, and aggregates their predictions (average or majority vote) for our final model |
| Random Forest | is an application of bagging (bootstrapped decision trees) |
| Boosting | an ensemble learning technique that trains base models in sequential order, each one focusing on the errors of the previous, and aggregates their predictions for our final model |
| What are the different types of Boosting? | - Ada Boost - Gradient Boosting - XGBoost |
| Ada Boost | a boosting method that gives more weight to incorrectly classified items. This approach does not work well when there is a correlation among features or high data dimensionality. |
| Gradient Boosting | a boosting method that optimizes the loss function by generating base learners sequentially so that the present base learner is always more effective than the previous one |
| XGBoost | a boosting method that uses multiple cores on the CPU so that learning can occur in parallel during training. It is a boosting algorithm that can handle extensive datasets, making it attractive for big data applications |
| Natural Language Processing (NLP) | a branch of artificial intelligence that allows computers to interpret, analyze and manipulate human language |
| NLP Techniques | - Tokenization - Stop-words - Stemming/Lemmatization - Preprocessing steps - Bag of words model |
| Text Processing | - Tokenize - Remove Stop Words - Clean special characters in text - Stemming/Lemmatization |
| Tokenization | dividing strings into lists of substrings; it consists of dividing a piece of text into smaller pieces. We can divide a paragraph into sentences, a sentence into words, or a word into characters. |
| Stemming | the process of removing prefixes and suffixes from words so that they are reduced to simpler forms |
| Lemmatization | the process of reducing words to their root form (lemma) |
| Bag of Words Model | This model allows us to extract features from the text by converting the text into a matrix of occurrence of words. The main issue is that different sentences can yield similar vectors |
| Term frequency–inverse document frequency | is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus (see the TF-IDF sketch after this table) |
| Term frequency | summarizes how often a word appears within a document |
| Inverse document frequency | downscales words that appear a lot across documents in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus. |
| Confusion Matrix | a performance evaluation tool in machine learning, representing the accuracy of a classification model. It displays the number of true positives, true negatives, false positives, and false negatives. |
| True Positive (TP) | the actual value was positive, and the model predicted a positive value |
| True Negative (TN) | the actual value was negative, and the model predicted a negative value. |
| False Positive (FP) | the actual value was negative, but the model predicted a positive value. (Type I Error) |
| False Negative (FN) | the actual value was positive, but the model predicted a negative value.(Type II Error) |
| Accuracy | evaluates the overall effectiveness of a classifier... accuracy = (TP + TN) / (TP + FP + TN + FN) |
| Recall | evaluates the proportion of actual positives correctly labeled... recall = TP / (TP + FN) |
| Precision | a metric that tells us about the quality of positive predictions... precision = TP / (TP + FP) |
| F1 Score | the harmonic mean of Precision and Recall... F1 = (2 x Precision x Recall) / (Precision + Recall) (see the metrics sketch after this table) |
| ROC Curve | is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters: true positive rate (recall, TPR) and false positive rate (FPR)... FPR = FP / (FP + TN) |
| AUC | measures the entire two-dimensional area underneath the ROC curve |
| Precision | the number of true positives divided by the total number of positive predictions made by the model |
| Accuracy | the number of correct predictions divided by the total number of predictions made by the model. |
| Recall | the number of true positives divided by the actual positive values |
| Supervised Learning | a category of machine learning that uses labeled datasets to train algorithms to predict outcomes and recognize patterns. |
| Unsupervised Learning | a type of machine learning that learns patterns from data without human supervision or labels |
| K-means Clustering | a centroid-based algorithm or a distance-based algorithm, where we calculate the distances to assign a point to a cluster |
| Hierarchical Clustering | an algorithm where clusters are visually represented in a hierarchical tree called a dendrogram. There are two types: - Agglomerative - Divisive |
| Agglomerative Clustering | a bottom-up clustering algorithm. It starts clustering by treating the individual data points as a single cluster then it is merged continuously based on similarity until it forms one big cluster containing all objects. |
| Types of Linkage Methods (hierarchical clustering) | - Complete Linkage - Single Linkage - Average Linkage - Centroid Linkage - Wards Method |
| Complete Linkage | the maximum of all pairwise distance between elements in each pair of clusters is used to measure the distance between two clusters |
| Single Linkage | the minimum of all pairwise distance between elements in each pair of clusters is used to measure the distance between two clusters |
| Average Linkage | the average of all pairwise distances between elements in each pair of clusters is used to measure the distance between two clusters. |
| Centroid Linkage | Before merging, the distance between the two clusters’ centroids are considered. |
| Ward's Method | It uses squared error to compute the similarity of the two clusters for merging. |
| Divisive Clustering | an algorithm that starts by considering all the data points into a big single cluster and later on splitting them into smaller heterogeneous clusters continuously until all data points are in their own cluster. |
| Density-Based Clustering Algorithm (DBSCAN) | an unsupervised learning method that identifies distinctive clusters in the data, based on the idea that a cluster in data space is a contiguous region of high point density, separated from other such clusters by contiguous regions of low point density |
| DBSCAN Hyper-parameters | epsilon: the radius of a neighborhood around an observation. MinPts: the minimum number of points within an epsilon radius of an observation for it to be considered a “core” point |
| Types of Points in DBSCAN | - Core points - observations with at least MinPts total observations within an epsilon radius - Border points - observations that are not core points, but are within epsilon of a core point - Noise points - everything else |
| Clustering Algorithms evaluation | Silhouette Plots, Elbow Method, Average Silhouette Method, Gap Statistic |
| Silhouette Plots | displays a measure of how close each point in one cluster is to points in the neighboring clusters. Observations with s ≈ 1 are well-clustered, s ≈ 0 lie between two clusters, s < 0 are probably in the wrong cluster. |
| Elbow Method | is a technique used in clustering analysis to determine the optimal number of clusters. It involves plotting the within-cluster sum of squares (WCSS) for different cluster numbers and identifying the “elbow” point where WCSS starts to level off. |
| Average Silhouette Method | it determines how well each object lies within its cluster. The method computes the average silhouette of observations for different values of k. The optimal number of clusters k is the one that maximizes the metric over a range of possible values for k |
| Gap Statistic | For a particular choice of K clusters, compare the total within cluster variation to the expected within-cluster variation under the assumption that the data have no obvious clustering. |
| Perceptron | a single-layer linear neural network, or a machine learning algorithm used for supervised learning of various binary classifiers |
| Perceptron Key Components | - Inputs - Weight - Summation & Bias - Activation Function |
| Multi-Layer Perceptron (MLP) | a feedforward Neural Network that learns the relationship between linear and non-linear data. |
| Backpropagation | is the learning mechanism that allows the Multilayer Perceptron to iteratively adjust the weights in the network, with the goal of minimizing the cost function. |
| NN Design Choices | - Activation Function - Loss Function - Output Units - Architecture |
| Activation Functions | - Sigmoid - Softmax - Tanh - ReLU - Leaky ReLU - Softplus - Swish |
| Sigmoid | this curve is an S-shaped curve. It should only be used on the output layer. Used for binary and multi-label classification. Suffers from the vanishing gradient problem |
| Softmax | used for multinomial logistic regression, hence used in the output layer of a multi-class classification neural network. Uses cross entropy loss. Its normalization reduces the influence of outliers in the data |
| Tanh | Tanh is a smoother, zero-centered function having a range between -1 to 1. Suffers from vanishing gradient |
| ReLU | an activation function that will output the input as it is when the value is positive; else, it will output 0. They are easily optimized but suffer from dying neurons |
| Leaky ReLU | An improvement over ReLU by introducing a small negative slope to solve the problem of dying ReLU. helps speed up training |
| Softplus | a smoother version of the rectifying non-linearity activation function and can be used to constrain a machine's output always to be positive |
| Swish | a gated version of the sigmoid activation function. |
| Gradient Descent | an optimization algorithm that measures the change in the error with respect to each weight (the gradient) and updates the weights to reduce the error |
| Types of Gradient Descent | - Batch Gradient Descent - Stochastic Gradient Descent (SGD) - Mini-Batch Gradient Descent |
| Batch Gradient Descent | Calculates the error for each example within the training dataset, but only after all training examples have been evaluated does the model get updated. |
| Stochastic Gradient (SGD) | updates the parameters for each training example one by one. |
| Mini-Batch Gradient Descent | a combination of the concepts of SGD and batch gradient descent. It simply splits the training dataset into small batches and performs an update for each of those batches (see the mini-batch sketch after this table). |
| Gradient Descent Optimizer Types | - Gradient Based (Momentum) - Learning Rate Based (AdaGrad) |
| Momentum | invented for reducing high variance in SGD and softens the convergence. It accelerates the convergence towards the relevant direction and reduces the fluctuation to the irrelevant direction |
| Momentum Advantages | - Reduces the oscillation and high variance of the parameters - converges faster than gradient descent |
| Nesterov Accelerated Gradient | "look before you leap": we look a step ahead, meaning we calculate the gradient at a partially updated value instead of calculating it at our current position |
| Adaptive Learning Rate | techniques used in optimizing deep learning models by automatically adjusting the learning rates during the training process |
| Adagrad | an algorithm for gradient-based optimization. Performs larger updates (e.g. high learning rates) for those parameters that are related to infrequent features and smaller updates (i.e. low learning rates) for frequent ones. |
| Adadelta | is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, it restricts the window of accumulated past gradients to some fixed size. |
| RMSProp | divides the learning rate by an exponentially decaying average of squared gradients, which solves Adagrad's shrinking learning rate problem |
| Adam | computes adaptive learning rates for each parameter by keeping an exponentially decaying average of past gradients (RMSProp + Momentum) |
| Nadam | combines Adam and Nesterov Accelerated Gradient |
| Regularization | a modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error |
| Parameter Initialization | - Zero Initialization - Random Initialization - Xavier Initialization - He Initialization |
| Zero Initialization | Initialize all the weights and biases to zero. This is not generally used in deep learning as it leads to symmetry in the gradients, resulting in all the neurons learning the same feature. |
| Random Initialization | Initialize the weights and biases randomly from a uniform or normal distribution. This is the most common technique used in deep learning. |
| Xavier Initialization | Initialize the weights with a normal distribution with mean 0 and standard deviation sqrt(1/n), where n is the number of neurons in the previous layer. This is used for the sigmoid activation function. |
| He Initialization | Initialize the weights with a normal distribution with mean 0 and standard deviation sqrt(2/n), where n is the number of neurons in the previous layer. This is used for the ReLU activation function. |
| Orthogonal Initialization | Initialize the weights with an orthogonal matrix, which preserves the gradient norm during backpropagation. |
| Uniform Initialization | Initialize the weights with a uniform distribution. This is less commonly used than random initialization. |
| Constant Initialization | Initialize the weights and biases with a constant value. This is rarely used in deep learning |
| Norm Penalties | a NN regularization technique that includes penalties by adding a parameter norm penalty |
| L1 Parameter Regularization (Lasso) | adds “Absolute value of magnitude” of coefficient, as penalty term to the loss function. |
| L2 Parameter Regularization (Ridge Regression) | adds “squared magnitude of the coefficient” as penalty term to the loss function. |
| Types of NN Regularization | - Norm Penalties (L1 & L2) - Parameter Initialization - Early Stopping - Data Augmentation - Dropout |
| Early Stopping | an optimization technique used to reduce overfitting without compromising on model accuracy. The main idea behind this technique is to stop training before a model starts to overfit: training is terminated once validation set performance stops improving |
| Data Augmentation | a technique of artificially increasing the training set by creating modified copies of a dataset using existing data. It includes making minor changes to the dataset or using deep learning to generate new data points |
| Audio Data Augmentation | - Noise injection - Shifting: - Changing the speed - Changing the pitch |
| Text Data Augmentation | - Word or sentence shuffling - word replacement - syntax-tree manipulation - random word insertion - random word deletion |
| Image Data Augmentation | - geometric transformation - color space transformation - kernel filters - random erasing |
| Advance Data Augmentation Techniques | - Generative adversarial networks (GANs) - Neural Style Transfer |
| Data Augmentation Advantages | - Prevents overfitting - Helps with small datasets - Improves the model accuracy |
| Data Augmentation Disadvantages | - Does not address biases - Quality assurance of data is expensive - Can be challenging to implement on all problems |
| Dropout | refers to randomly dropping out nodes (input and hidden layers) in a neural network during training, which ensures that there is no node co-dependence |
| Convolutional Neural Networks | a type of NN that is used for image data by doing Representation Learning or Feature Learning. Input nodes are used for every pixel in our image: one per pixel for black & white images, or three per pixel for the RGB values. |
| Representation Learning | a technique that allows a system to automatically find relevant features for a given task. Weights are assigned based on the importance of the pixel. |
| Types of Layers in CNN | - Convolutional Layers - Pooling Layers - Fully Connected Layers |
| Convolutional Layers | - Apply filters to extract features - Filters are composed of small kernels, learned - One bias per filter - Apply activation function on every value of feature map |
| Convolutional Layers Parameters | - Number of filters - Size of kernels(W and H only, D is defined by input cube) - Activation Function - Stride - Padding |
| Convolutional Layer I/O | - Input: previous set of feature maps: 3D cuboid - Output: 3D cuboid, one 2D map per filter |
| Pooling Layer | - Reduce dimensionality - Extract maximum or average of a region - Sliding window approach |
| Pooling Layer Parameters | - Stride - Size of window |
| Pooling Layers I/O | - Input: previous set of feature maps, 3D cuboid - Output: 3D cuboid, one 2D map per filter, reduced spatial dimension |
| Fully Connected Layers | - Aggregate information from final feature maps - Generate Final classification, regression, segmentation, etc |
| Fully Connected Layers Parameters | - Number of nodes - Activation function: usually changes depending on the role of the layer. If aggregating info, use ReLU. If producing the final classification, use Softmax. If regression, use linear |
| Fully Connected Layers I/O | - Input: Flattened previous set of feature maps - Output: Probabilities for each class or simply prediction for regression y_hat |
| Padding | the addition of extra pixels around the borders of the input images or feature map |
| Full Padding | Introduces zeros such that all pixels are visited the same number of times by the filter. Increases size of output |
| Same Padding | Ensures that the output has the same size as the input. |
| Stride | the number of pixels by which we move the filter across the input image. |
| Pooling | a new layer added after the convolutional layer. Specifically, it is added after a nonlinearity (e.g. ReLU) has been applied to the feature maps*. (*Maxpooling can be applied before ReLU) |
| CNN Dropout | layer is similar to multiplying Bernoulli noise into the feature maps of the network. |
| Layers Receptive Field | the region in the input space that a particular CNN feature (or activation) is looking at. Large receptive fields are necessary for high-level recognition tasks, but with diminishing returns |
| Tensorboard | suite of visualization tools to understand, debug, and optimize TensorFlow programs for ML experimentation |
| Occlusion methods | method that attributes importance to a region of an image for classification |
| Saliency Maps | a technique that highlights the pixels that were relevant for a certain image classification in a NN by calculating the gradient |
| DeconvNet | a technique that highlights the pixels that were relevant for a certain image classification in a NN by calculating the gradient, but reverses the ReLU layers |
| Guided Backpropagation Algorithm | a combination of Gradient Based Backpropagation and DeconvNet |
| Class Activation Map (CAM) | another explanation method for interpreting convolutional neural networks, in which the fully connected layers at the very end of the model are replaced by a Global Average Pooling (GAP) layer and a class activation mapping layer |
| Gradient-weighted Class Activation Mapping (Grad-CAM) | uses the gradients of any target concept flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept |
| Unigrams | assume each word is independent of all others. They count how often each word occurs. |
| Bigrams | Bigrams look at pairs of consecutive words (conditional probability) |
| Term Frequency – Inverse Document Frequency (TF-IDF ) | a measure, used in the fields of information retrieval (IR) and machine learning, that can quantify the importance or relevance of string representations in a document amongst a collection of documents |
| Term Frequency | measures the count of a word in a given document in proportion to the total number of words in a document |
| Document Frequency | measures the importance of the word t across all documents in the corpus |
| Recurrent Neural Network (RNN) | is a deep learning model that is trained to process and convert a sequential data input into a specific sequential data output. they have a recurrent workflow: the hidden layer can remember and use previous inputs for future predictions |
| Long Short Term Memory (LSTM) | an RNN variant that enables the model to expand its memory capacity to accommodate a longer timeline by using 3 gates (input, output, forget) and 2 hidden states |
| Gated Recurrent Units | an RNN variant that enables selective memory retention. The model adds update and forget gates to its hidden layer, which can store or remove information in memory. |
| Types of RNN Language Models | - Bidirectional RNN - Character CNN |
| Bidirectional RNN | is 2 RNNs stacked on top of each other, one processing the sequence forward and one backward. The output is then composed based on the hidden states of both RNNs. The idea is that the output may not only depend on previous elements in the sequence but also on future elements |
| Character CNN | a CNN application on sentence tensors created by one hot encoding our words by character |
| Seq2Seq | a language model that consist of an encoder and decoder section where the encoder compresses our input into a context vector that our decoder uses to generate our output sequence one token at a time |
| Attention | a technique that allows the decoder to "look back” at the complete input and extracts significant information via a similarity score that is useful in decoding |
| Transformers | an architecture for transforming one sequence into another one with the help of two parts (Encoder and Decoder), but it does not implement any RNN |
| Bidirectional Encoder Representations from Transformers (BERT) | is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers |
| Generative Adversarial Network (GANs) | is a machine learning (ML) model in which two neural networks compete with each other by using deep learning methods to become more accurate in their predictions |
| Naive Bayes | is a popular supervised machine learning algorithm used for classification tasks. Based on the assumption that the features of the input data are conditionally independent given the class, allowing the algorithm to make predictions quickly and accurately. |
| Support Vector Machines | a supervised machine learning algorithm where we try to find the hyperplane that best separates the two classes |
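
The sketches below are minimal, illustrative companions to a few of the cards above; all data in them is made up toy data. First, the kNN idea, assuming a NumPy-only implementation (`knn_predict` and the toy arrays are hypothetical names, not part of the flashcards):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                     # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                    # majority vote among the k labels

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0
```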
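
The three regression error metrics (MSE, Mean Absolute Error, Max Absolute Error) computed directly from their definitions; the `y_true` / `y_pred` values are arbitrary toy numbers:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

residuals = y_true - y_pred
mse    = np.mean(residuals ** 2)       # Mean Square Error: average squared error
mae    = np.mean(np.abs(residuals))    # Mean Absolute Error: average absolute error
max_ae = np.max(np.abs(residuals))     # Max Absolute Error: largest absolute residual

print(mse, mae, max_ae)  # 0.375 0.5 1.0
```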
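
A k-fold cross-validation sketch, assuming a NumPy-only index split; `fit_and_score` in the usage comment is a hypothetical placeholder for whatever model training/evaluation step you use:

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs: each fold serves once as the test set."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)   # k roughly equal, shuffled folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Usage (hypothetical model step), averaging the score over the k held-out folds:
# scores = [fit_and_score(X[tr], y[tr], X[te], y[te])
#           for tr, te in k_fold_indices(len(X), k=5)]
```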
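
The two decision-tree split criteria (entropy and Gini index) computed from the class labels at a node, assuming base-2 entropy and a toy label array:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))      # 0 for a pure node, higher = more disorder

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)         # 0 for a pure node

node = np.array([0, 0, 1, 1, 1, 1])
print(entropy(node), gini(node))        # ~0.918, ~0.444
```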
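
A plain TF-IDF sketch following the definitions in the table: term frequency within a document times log inverse document frequency across the corpus. Real libraries add smoothing and normalization variants, and the three-document corpus here is made up:

```python
import math

corpus = [["the", "cat", "sat"],
          ["the", "dog", "sat"],
          ["the", "cat", "ran"]]

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)              # term frequency within the document
    df = sum(1 for d in corpus if term in d)     # document frequency across the corpus
    idf = math.log(len(corpus) / df)             # inverse document frequency
    return tf * idf

print(tf_idf("cat", corpus[0]))  # > 0: informative word
print(tf_idf("the", corpus[0]))  # = 0: appears in every document, so downscaled to nothing
```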
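
The confusion-matrix metrics (accuracy, precision, recall, F1) computed from TP/TN/FP/FN counts for a toy binary classifier:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))   # Type I error
fn = np.sum((y_true == 1) & (y_pred == 0))   # Type II error

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # 0.75 0.75 0.75 0.75
```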
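
A mini-batch gradient descent sketch for linear regression: the training set is shuffled and split into small batches, and the weights are updated once per batch, as described above. The synthetic data, learning rate, and batch size are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)     # noisy linear target

w, lr, batch_size = np.zeros(3), 0.1, 16
for epoch in range(50):
    order = rng.permutation(len(X))                   # shuffle each epoch
    for start in range(0, len(X), batch_size):
        b = order[start:start + batch_size]
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)   # MSE gradient on the batch
        w -= lr * grad                                    # one update per mini-batch

print(w)  # should land close to [2.0, -1.0, 0.5]
```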