Machine Learning

k-Nearest Neighbors (kNN) is a supervised learning algorithm that combines the labels of the k nearest points to make a prediction, for either classification or regression problems
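A minimal Python sketch of kNN classification, assuming scikit-learn is installed; the data and choice of k are purely illustrative:

from sklearn.neighbors import KNeighborsClassifier

X_train = [[0, 0], [1, 1], [5, 5], [6, 6]]   # toy features
y_train = [0, 0, 1, 1]                       # toy labels
knn = KNeighborsClassifier(n_neighbors=3)    # k = 3 nearest points
knn.fit(X_train, y_train)
print(knn.predict([[5.5, 5.5]]))             # majority of the 3 neighbors -> class 1
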
Linear Regression is a supervised learning method that can help predict a quantitative response by approximating a linear relationship (y = b0 + b1x).
Loss Function a function that measures our model's error
Mean Square Error (MSE) a metric that measures the averaged squared error of our predictions
Max Absolute Error a metric that measures the largest residual
Mean Absolute Error a metric that takes the average absolute error
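A quick NumPy sketch of the three metrics above; the helper names are my own, not from any library:

import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)    # averaged squared error

def max_abs_error(y_true, y_pred):
    return np.max(np.abs(y_true - y_pred))    # largest residual

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))   # average absolute error

y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.5, 4.0])
print(mse(y_true, y_pred), max_abs_error(y_true, y_pred), mae(y_true, y_pred))
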
Multi-Linear Regression a model of the form y = b0 + b1x1 + b2x2 + ... + bpxp + epsilon
Polynomial Regression a learning algorithm that captures non-linear relationships between our predictors and response, of the form y = b0 + b1x_i + b2x_i^2 + ... + bdx_i^d + epsilon
Model Selection the application of a principled method to determine the complexity of the model, e.g., choosing a subset of predictors, choosing the degree of the polynomial model etc.
Underfitting when a model is too simple to capture data complexities and fails to fit both the training and testing data
Reasons for Underfitting simple model, small training data, excessive regularization, features are not scaled, and features are not good predictors of our target variable
Techniques to Reduce Underfitting increase model complexity, increase number of features, remove noise, increase epochs or training time
Overfitting when a model does not make accurate predictions on testing data but performs well on the training data
Reasons for Overfitting high variance and low bias, the model is too complex, the training data is too small
Techniques to Reduce Overfitting increase training data, reduce model complexity, early stopping, ridge regression, lasso regression, dropout
Bias our model's overall accuracy; how close our model's predictions are to the actual values
Variance our model's precision; if we run our model many times, how often do we get the same prediction
Low Bias Low Variance Symptom Training and testing error are both low, BEST CASE
Low Bias Low Variance Cause Model has the right balance between bias and variance. It is able to capture the true relationship between the target and predictor variables and is stable to changes in the training data
Low Bias Low Variance Solution NOT NEEDED
Low Bias High Variance Symptom Training error is low and testing error is high
Low Bias High Variance Cause - Model is overfitting to training data, it is learning both the signal and the noise in the training data and does not generalize well to unknown data - Complex models are usually unstable and can change a lot when any data changes occur
Low Bias High Variance Solution Build a simpler model - Hyperparameter tuning, Regularization, Dimensionality Reduction, Bagging. Get more samples in the training data.
High Bias Low Variance Symptom Training Error is very high and test error is almost the same as the training error
High Bias Low Variance Cause - Model is underfitting and it is too simple to capture the true relationship between target and predictor variables. This becomes a source of high bias - Simple models tend to be more stable, with low variance to changes in the training data
High Bias Low Variance Solution Build a more complex model - Add more features, build bigger networks with more hidden nodes and layers (deep learning), deeper trees (random forest), more trees (Gradient Boosting Machines)
High Bias High Variance Symptom Training error is high and test error is even higher than the training error
High Bias High Variance Cause - Model is underfitting and is too simple to capture the true relationship between target and predictor variables. - A limited number of training samples contributes to high variance and leads to an unstable model
High Bias High Variance Solution - Build a more complex model. Add more features, build bigger networks with more hidden nodes and layers (deep learning), deeper trees (random forest), more trees. - Get more samples in the training data
Cross Validation when we standardize our data and split it into 3 subsets: train, validation, and test
K-Fold Cross Validation we first divide our dataset into k equally sized subsets. Then, we repeat the train-test method k times such that each time one of the k subsets is used as a test set and the rest k-1 subsets are used together as a training set.
Leave One out Cross Validation we train our machine-learning model n times, where n is our dataset's size. Each time, only one sample is used as a test set while the rest are used to train our model.
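A sketch of both cross-validation schemes using scikit-learn; the linear data is illustrative, and LOOCV uses an error-based score because R^2 is undefined on a single-sample test set:

import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression

X = np.arange(20, dtype=float).reshape(-1, 1)  # toy predictor
y = 2.0 * X.ravel() + 1.0                      # toy linear target

model = LinearRegression()
kfold = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
loo = cross_val_score(model, X, y, cv=LeaveOneOut(), scoring="neg_mean_squared_error")
print(kfold.mean(), loo.mean())
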
Lasso Regression (L1) shrinks the coefficients and helps to reduce the model complexity and multi-collinearity. Some coefficients may shrink to exactly zero, thus eliminating the least important features
Ridge Regression (L2) shrinks the coefficients towards zero by introducing a penalization factor that takes the square of our coefficients
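A small scikit-learn sketch contrasting the two penalties; the data and alpha values are illustrative:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)  # only feature 0 matters

print(Lasso(alpha=0.1).fit(X, y).coef_)  # L1: irrelevant coefficients driven to exactly 0
print(Ridge(alpha=1.0).fit(X, y).coef_)  # L2: coefficients shrunk, but not zeroed
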
Principal Component Analysis (PCA) is a linear dimensionality reduction technique that can be utilized for extracting information from a high-dimensional space by projecting it into a lower-dimensional sub-space. It retains the data in the direction of maximum variance.
Logistic Regression a supervised machine learning algorithm that accomplishes binary classification tasks by predicting the probability of an outcome, event, or observation
Types of Logistic Regression 1) Binary Logistic Regression 2) Multinomial Logistic Regression 3) Ordinal Logistic Regression
Decision Trees a non-parametric supervised learning algorithm, which is utilized for both classification and regression tasks.
Decision Tree Attribute Identification Methods 1) Information Gain (Entropy) 2) Gini Index
Information Gain (Entropy) is applied to quantify which feature provides maximal information about the classification based on the notion of uncertainty, disorder or impurity (purer at the bottom)
Gini Index measures the probability for a random instance being misclassified when chosen randomly, the lower the metric the lower the likelihood of misclassification
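A NumPy sketch of both impurity measures over a vector of class proportions; the helper names are my own:

import numpy as np

def entropy(p):
    p = p[p > 0]                   # skip zero proportions to avoid log(0)
    return -np.sum(p * np.log2(p))

def gini(p):
    return 1.0 - np.sum(p ** 2)

p = np.array([0.5, 0.5])           # maximally impure two-class node
print(entropy(p), gini(p))         # -> 1.0 0.5
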
Ensemble Learning Method is a widely-used and preferred machine learning technique in which multiple individual models, often called base models, are combined to produce an effective optimal prediction model
Bagging an ensemble learning technique that generates different data subsets at random and with replacement, trains base models in parallel, and aggregates their outputs (average for regression, majority vote for classification) into the final model
Random Forest an ensemble method that applies bagging (bootstrapped data subsets) to decision trees
Boosting an ensemble learning technique that trains base models in sequential order, each one correcting the errors of the previous, and aggregates their outputs into the final model
What are the different types of Boosting? - Ada Boost - Gradient Boosting - XGBoost
Ada Boost a boosting method that gives more weight to incorrectly classified items. This approach does not work well when there is a correlation among features or high data dimensionality.
Gradient Boosting a boosting method that optimizes the loss function by generating base learners sequentially so that the present base learner is always more effective than the previous one
XGBoost a boosting method that uses multiple cores on the CPU so that learning can occur in parallel during training. It is a boosting algorithm that can handle extensive datasets, making it attractive for big data applications
Natural Language Processing (NLP) a branch of artificial intelligence that allows computers to interpret, analyze and manipulate human language
NLP Techniques - Tokenization - Stop-words - Stemming/Lemmatization - Preprocessing steps - Bag of words model
Text Processing - Tokenize - Remove Stop Words - Clean special characters in text - Stemming/Lemmatization
Tokenization dividing strings into lists of substrings; it consists of dividing a piece of text into smaller pieces. We can divide a paragraph into sentences, a sentence into words, or a word into characters.
Stemming the process of removing prefixes and suffixes from words so that they are reduced to simpler forms
Lemmatization the process of reducing words to their root form
Bag of Words Model This model allows us to extract features from the text by converting the text into a matrix of occurrence of words. The main issue is that different sentences can yield similar vectors
Term frequency–inverse document frequency is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus
Term frequency summarizes how often a word appears within a document
Inverse document frequency downscales words that appear a lot across documents in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.
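A short scikit-learn sketch of both representations on a toy corpus; note how TF-IDF downweights the word shared by every document:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat", "the dog sat", "the dog barked"]
print(CountVectorizer().fit_transform(corpus).toarray())  # bag-of-words occurrence matrix
print(TfidfVectorizer().fit_transform(corpus).toarray())  # "the" receives the lowest weights
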
Confusion Matrix a performance evaluation tool in machine learning, representing the accuracy of a classification model. It displays the number of true positives, true negatives, false positives, and false negatives.
True Positive (TP) the actual value was positive, and the model predicted a positive value
True Negative (TN) the actual value was negative, and the model predicted a negative value.
False Positive (FP) the actual value was negative, but the model predicted a positive value. (Type I Error)
False Negative (FN) the actual value was positive, but the model predicted a negative value.(Type II Error)
Accuracy evaluates the overall effectiveness of a classifier... accuracy = (TP + TN) / (TP + FP + TN + FN)
Recall evaluates the proportion of actual positives correctly labeled... recall = TP / (TP + FN)
Precision a metric that tells us about the quality of positive predictions...precision = TP / (TP + FP)
F1 Score a harmonic mean of Precision and Recall... F1 = (2 x Precision x Recall)/(Precision + Recall)
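A tiny sketch computing the four metrics above directly from confusion-matrix counts; the counts are made up:

TP, TN, FP, FN = 40, 45, 5, 10

accuracy  = (TP + TN) / (TP + FP + TN + FN)
recall    = TP / (TP + FN)
precision = TP / (TP + FP)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, recall, precision, f1)  # 0.85, 0.8, ~0.89, ~0.84
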
ROC Curve is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters: the true positive rate (recall) against the false positive rate... FPR = FP / (FP + TN)
AUC (Area Under the Curve) measures the entire two-dimensional area underneath the ROC curve
Precision the number of true positives divided by the total number of positive predictions made by the model
Accuracy the number of correct predictions divided by the total number of predictions made by the model.
Recall the number of true positives divided by the actual positive values
Supervised Learning a category of machine learning that uses labeled datasets to train algorithms to predict outcomes and recognize patterns.
Unsupervised Learning a type of machine learning that learns from data without human supervision or labeled data
K-means Clustering a centroid-based algorithm or a distance-based algorithm, where we calculate the distances to assign a point to a cluster
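A minimal k-means sketch with scikit-learn on two obvious toy blobs:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment per point
print(km.cluster_centers_)  # learned centroids
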
Hierarchical Clustering an algorithm where clusters are visually represented in a hierarchical tree called a dendrogram. There are two types: - Agglomerative - Divisive
Agglomerative Clustering a bottom-up clustering algorithm. It starts by treating each individual data point as its own cluster, then continuously merges clusters based on similarity until one big cluster containing all objects is formed.
Types of Linkage Methods (hierarchical clustering) - Complete Linkage - Single Linkage - Average Linkage - Centroid Linkage - Ward's Method
Complete Linkage the maximum of all pairwise distances between elements in each pair of clusters is used to measure the distance between two clusters
Single Linkage the minimum of all pairwise distances between elements in each pair of clusters is used to measure the distance between two clusters
Average Linkage the average of all pairwise distances between elements in each pair of clusters is used to measure the distance between two clusters.
Centroid Linkage Before merging, the distance between the two clusters' centroids is considered.
Ward's Method It uses squared error to compute the similarity of the two clusters for merging.
Divisive Clustering an algorithm that starts by considering all the data points as one big single cluster and then continuously splits it into smaller heterogeneous clusters until every data point is in its own cluster.
Density-Based Clustering Algorithm (DBSCAN) an unsupervised learning method that identifies distinctive clusters in the data, based on the idea that a cluster in data space is a contiguous region of high point density, separated from other such clusters by contiguous regions of low point density
DBSCAN Hyper-parameters epsilon: the radius of a neighborhood around an observation; MinPts: the minimum number of points within an epsilon radius of an observation for it to be considered a “core” point
Types of Points in DBSCAN - Core points - observations with MinPts total observations within an epsilon radius - Border points - observations that are not core points, but are within epsilon of a core point - Noise points - everything else
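A DBSCAN sketch with scikit-learn, where the two hyper-parameters are named eps and min_samples; the points are illustrative:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0, 0], [0.2, 0.1], [0.1, 0.2], [5, 5], [5.1, 5.2], [9, 9]])
db = DBSCAN(eps=0.5, min_samples=2).fit(X)
print(db.labels_)  # noise points such as [9, 9] are labeled -1
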
Clustering Algorithm Evaluation Silhouette Plots, Elbow Method, Average Silhouette Method, Gap Statistic
Silhouette Plots displays a measure of how close each point in one cluster is to points in the neighboring clusters. Observations with s ≈ 1 are well-clustered, s ≈ 0 lie between two clusters, s < 0 are probably in the wrong cluster.
Elbow Method is a technique used in clustering analysis to determine the optimal number of clusters. It involves plotting the within-cluster sum of squares (WCSS) for different cluster numbers and identifying the “elbow” point where WCSS starts to level off.
Average Silhouette Method it determines how well each object lies within its cluster. The method computes the average silhouette of observations for different values of k. The optimal number of clusters k is the one that maximizes the metric over a range of possible values for k
Gap Statistic For a particular choice of K clusters, compare the total within cluster variation to the expected within-cluster variation under the assumption that the data have no obvious clustering.
Perceptron a single-layer linear neural network, or a machine learning algorithm used for supervised learning of various binary classifiers
Perceptron Key Components - Inputs - Weight - Summation & Bias - Activation Function
Multi-Layer Perceptron (MLP) a feedforward Neural Network that learns the relationship between linear and non-linear data.
Backpropagation is the learning mechanism that allows the Multilayer Perceptron to iteratively adjust the weights in the network, with the goal of minimizing the cost function.
NN Design Choices - Activation Function - Loss Function - Output Units - Architecture
Activation Functions - Sigmoid - Softmax - Tanh - ReLU - Leaky ReLU - Softplus - Swish
Sigmoid an S-shaped curve. It should only be used on the output layer. Used for binary and multi-label classification. Suffers from the vanishing gradient problem
Softmax Used for multinomial logistic regression, hence used in the output layer of a multi-class classification neural network. Uses cross-entropy loss; its normalization reduces the influence of outliers in the data
Tanh Tanh is a smoother, zero-centered function having a range between -1 to 1. Suffers from vanishing gradient
ReLU an activation function that will output the input as it is when the value is positive; else, it will output 0. They are easily optimized but suffer from dying neurons
Leaky ReLU An improvement over ReLU by introducing a small negative slope to solve the problem of dying ReLU. helps speed up training
Softplus a smoother version of the rectifying non-linearity activation function and can be used to constrain a machine's output always to be positive
Swish a gated version of the sigmoid activation function.
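A NumPy sketch of the activations listed above, to make the shapes of the curves concrete:

import numpy as np

def sigmoid(x):  return 1.0 / (1.0 + np.exp(-x))
def softmax(x):  e = np.exp(x - np.max(x)); return e / e.sum()  # shift for numerical stability
def relu(x):     return np.maximum(0.0, x)
def leaky_relu(x, slope=0.01): return np.where(x > 0, x, slope * x)
def softplus(x): return np.log1p(np.exp(x))  # smooth, always-positive ReLU
def swish(x):    return x * sigmoid(x)       # gated sigmoid

x = np.array([-2.0, 0.0, 2.0])
print(relu(x), leaky_relu(x), swish(x))      # tanh is simply np.tanh(x)
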
Gradient Descent measures the change in all weights with regard to the change in error
Types of Gradient Descent - Batch Gradient Descent - Stochastic Gradient Descent (SGD) - Mini-Batch Gradient Descent
Batch Gradient Descent Calculates the error for each example within the training dataset, but only after all training examples have been evaluated does the model get updated.
Stochastic Gradient (SGD) updates the parameters for each training example one by one.
Mini-Batch Gradient Descent a combination of the concepts of SGD and batch gradient descent. It simply splits the training dataset into small batches and performs an update for each of those batches.
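A NumPy sketch of mini-batch gradient descent on a 1-D linear model; setting batch to the dataset size gives batch GD, and batch = 1 gives SGD:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=200)
y = 4.0 * X + 2.0 + rng.normal(scale=0.1, size=200)  # true w = 4, b = 2

w, b, lr, batch = 0.0, 0.0, 0.1, 32
for epoch in range(100):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch):    # one parameter update per mini-batch
        sl = idx[start:start + batch]
        err = w * X[sl] + b - y[sl]
        w -= lr * np.mean(err * X[sl])       # gradient of MSE w.r.t. w (up to a factor of 2)
        b -= lr * np.mean(err)               # gradient of MSE w.r.t. b
print(w, b)  # should approach 4.0 and 2.0
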
Types of Gradient Descent Optimizers - Gradient-based (Momentum) - Learning-rate-based (AdaGrad)
Momentum invented for reducing high variance in SGD and softens the convergence. It accelerates the convergence towards the relevant direction and reduces the fluctuation to the irrelevant direction
Momentum Advantages - Reduces the oscillation and high variance of the parameters - converges faster than gradient descent
Nesterov Accelerated Gradient “look before you leap”: we look a step ahead, meaning we calculate the gradient at a partially updated value instead of calculating it at our current position
Adaptive Learning Rate techniques used in optimizing deep learning models by automatically adjusting the learning rates during the training process
Adagrad an algorithm for gradient-based optimization. Performs larger updates (e.g. high learning rates) for those parameters that are related to infrequent features and smaller updates (i.e. low learning rates) for frequent ones.
Adadelta is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, it restricts the window of accumulated past gradients to some fixed size.
RMSProp divides the learning rate by an exponentially decaying average of squared gradients, which solves Adagrad's shrinking learning rate.
Adam computes adaptive learning rates for each parameter by keeping exponentially decaying averages of past gradients and past squared gradients (RMSProp + Momentum)
Nadam combines Adam and Nesterov Accelerated Gradient
Regularization is a modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error
Parameter Initialization - Zero Initialization - Random Initialization - Xavier Initialization - He Initialization
Zero Initialization Initialize all the weights and biases to zero. This is not generally used in deep learning as it leads to symmetry in the gradients, resulting in all the neurons learning the same feature.
Random Initialization Initialize the weights and biases randomly from a uniform or normal distribution. This is the most common technique used in deep learning.
Xavier Initialization Initialize the weights with a normal distribution with mean 0 and standard deviation sqrt(1/n), where n is the number of neurons in the previous layer. This is used for the sigmoid activation function.
He Initialization Initialize the weights with a normal distribution with mean 0 and standard deviation sqrt(2/n), where n is the number of neurons in the previous layer. This is used for the ReLU activation function.
Orthogonal Initialization Initialize the weights with an orthogonal matrix, which preserves the gradient norm during backpropagation.
Uniform Initialization Initialize the weights with a uniform distribution. This is less commonly used than random initialization.
Constant Initialization Initialize the weights and biases with a constant value. This is rarely used in deep learning
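A NumPy sketch of the common initializations above for one weight matrix; n is the fan-in (neurons in the previous layer):

import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 256, 128

w_zero   = np.zeros((n_in, n_out))                                   # zero init: causes symmetry
w_random = rng.normal(0.0, 0.01, size=(n_in, n_out))                 # plain random init
w_xavier = rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))  # Xavier: for sigmoid/tanh
w_he     = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))  # He: for ReLU
print(w_xavier.std(), w_he.std())
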
Norm Penalties a NN regularization technique that adds a parameter norm penalty to the loss function
L1 Parameter Regularization (Lasso) adds “Absolute value of magnitude” of coefficient, as penalty term to the loss function.
L2 Parameter Regularization (Ridge Regression) adds “squared magnitude of the coefficient” as penalty term to the loss function.
Types of NN Regularization - Norm Penalties (L1 & L2) - Parameter Initialization - Early Stopping - Data Augmentation - Dropout
Early Stopping an optimization technique used to reduce overfitting without compromising on model accuracy. The main idea behind this technique is to stop training before a model starts to overfit: training terminates once performance on the validation set stops improving
Data Augmentation a technique of artificially increasing the training set by creating modified copies of a dataset using existing data. It includes making minor changes to the dataset or using deep learning to generate new data points
Audio Data Augmentation - Noise injection - Shifting - Changing the speed - Changing the pitch
Text Data Augmentation - Word or sentence shuffling - word replacement - syntax-tree manipulation - random word insertion - random word deletion
Image Data Augmentation - geometric transformation - color space transformation - kernel filters - random erasing
Advanced Data Augmentation Techniques - Generative adversarial networks (GANs) - Neural Style Transfer
Data Augmentation Advantages - Prevents overfitting - Helps with small datasets - Improves the model accuracy
Data Augmentation Disadvantages - Does not address biases - Quality assurance of data is expensive - Can be challenging to implement on all problems
Dropout refers to dropping out nodes (in the input and hidden layers) of a neural network; it ensures that there is no node codependence
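A sketch of (inverted) dropout on a layer's activations in NumPy; the 1/(1 - p_drop) rescaling keeps the expected activation unchanged:

import numpy as np

def dropout(a, p_drop=0.5):
    rng = np.random.default_rng()
    mask = rng.random(a.shape) >= p_drop  # drop each node with probability p_drop
    return a * mask / (1.0 - p_drop)      # rescale the survivors

a = np.ones((2, 4))
print(dropout(a))  # roughly half the entries are 0, the rest are 2.0
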
Convolutional Neural Networks a type of NN that is used for image data by doing Representation Learning or Feature Learning. Input nodes are used for every pixel in the image: one per pixel for black & white images, or three per pixel for the RGB values.
Representation Learning a technique that allows a system to automatically find relevant features for a given task. Weights are assigned based on the importance of the pixel.
Types of Layers in CNN - Convolutional Layers - Pooling Layers - Fully Connected Layers
Convolutional Layers - Apply filters to extract features - Filters are composed of small kernels, learned - One bias per filter - Apply activation function on every value of feature map
Convolutional Layers Parameters - Number of filters - Size of kernels(W and H only, D is defined by input cube) - Activation Function - Stride - Padding
Convolutional Layer I/O - Input: previous set of feature maps: 3D cuboid - Output: 3D cuboid, one 2D map per filter
Pooling Layer - Reduce dimensionality - Extract maximum or average of a region - Sliding window approach
Pooling Layer Parameters - Stride - Size of window
Pooling Layers I/O - Input: previous set of feature maps, 3D cuboid - Output: 3D cuboid, one 2D map per filter, reduced spatial dimension
Fully Connected Layers - Aggregate information from final feature maps - Generate Final classification, regression, segmentation, etc
Fully Connected Layers Parameters - Number of nodes - Activation function: usually changes depending on the role of the layer. If aggregating information, use ReLU; if producing the final classification, use Softmax; if regression, use linear
Fully Connected Layers I/O - Input: Flattened previous set of feature maps - Output: Probabilities for each class or simply prediction for regression y_hat
Padding the addition of extra pixels around the borders of the input images or feature map
Full Padding Introduces zeros such that all pixels are visited the same number of times by the filter. Increases size of output
Same Padding Ensures that the output has the same size as the input.
Stride the number of pixels by which we move the filter across the input image.
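The padding and stride definitions above imply the usual output-size arithmetic, out = (W - K + 2P) / S + 1; a tiny sketch:

def conv_output_size(W, K, P, S):
    # input width W, kernel size K, padding P, stride S
    return (W - K + 2 * P) // S + 1

print(conv_output_size(32, 3, 1, 1))  # same padding: 32
print(conv_output_size(32, 3, 0, 1))  # no padding: 30
print(conv_output_size(32, 3, 1, 2))  # stride 2: 16
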
Pooling a new layer added after the convolutional layer. Specifically, it is added after a nonlinearity (e.g. ReLU) has been applied to the feature maps*. (*Maxpooling can be applied before ReLU)
CNN Dropout layer is similar to multiplying Bernoulli noise into the feature maps of the network.
Layer's Receptive Field the region in the input space that a particular CNN feature (or activation) is looking at. Large receptive fields are necessary for high-level recognition tasks, but with diminishing returns
Tensorboard suite of visualization tools to understand, debug, and optimize TensorFlow programs for ML experimentation
Occlusion Methods methods that attribute importance to a region of an image for classification by masking regions and observing the change in the model's output
Saliency Maps a technique that highlights the pixels that were relevant for a certain image classification in a NN by calculating the gradient
DeconvNet a technique that highlights the pixels that were relevant for a certain image classification in a NN by calculating the gradient, but reverses the ReLU layers
Guided Backpropagation Algorithm a combination of Gradient Based Backpropagation and DeconvNet
Class Activation Map (CAM) is another explanation method for interpreting convolutional neural networks, where the fully connected layers at the very end of the model are replaced by a Global Average Pooling (GAP) layer and a class activation mapping layer
Gradient-weighted Class Activation Mapping (Grad-CAM) uses the gradients of any target concept flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting that concept
Unigrams Assume each word is independent of all others. They count how often each word occurs.
Bigrams Bigrams look at pairs of consecutive words (conditional probability)
Term Frequency – Inverse Document Frequency (TF-IDF) a measure, used in the fields of information retrieval (IR) and machine learning, that can quantify the importance or relevance of string representations in a document amongst a collection of documents
Term Frequency measures the count of a word in a given document in proportion to the total number of words in a document
Document Frequency measures the importance of the word t across all documents in the corpus
Recurrent Neural Network (RNN) is a deep learning model that is trained to process and convert a sequential data input into a specific sequential data output. RNNs have a recurrent workflow: the hidden layer can remember and use previous inputs for future predictions
Long Short Term Memory (LSTM) an RNN variant that enables the model to expand its memory capacity to accommodate a longer timeline by using 3 gates (input, output, forget) and 2 states (a hidden state and a cell state)
Gated Recurrent Units an RNN variant that enables selective memory retention. The model adds update and reset gates to its hidden layer, which can store or remove information in memory.
Types of RNN Language Models - Bidirectional RNN - Character CNN
Bidirectional RNN is 2 RNNs stacked on top of each other, one processing the sequence forward and one backward. The output is then composed based on the hidden states of both RNNs. The idea is that the output may not only depend on previous elements in the sequence but also on future elements
Character CNN a CNN applied to sentence tensors created by one-hot encoding our words by character
Seq2Seq a language model that consists of an encoder and a decoder, where the encoder compresses our input into a context vector that the decoder uses to generate our output sequence one token at a time
Attention a technique that allows the decoder to "look back” at the complete input and extract significant information via a similarity score that is useful in decoding
Transformers an architecture for transforming one sequence into another one with the help of two parts (Encoder and Decoder), but it does not implement any RNN
Bidirectional Encoder Representations from Transformers (BERT) is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers
Generative Adversarial Network (GANs) is a machine learning (ML) model in which two neural networks compete with each other by using deep learning methods to become more accurate in their predictions
Naive Bayes is a popular supervised machine learning algorithm used for classification tasks. It is based on the assumption that the features of the input data are conditionally independent given the class, allowing the algorithm to make predictions quickly and accurately.
Support Vector Machines a supervised machine learning algorithm where we try to find the hyperplane that best separates the two classes
Created by: ivonnem25