Question 1

What is supervised learning?

Accepted Answer

A prevalent learning paradigm that learns with a "teacher": every training example comes with an expected output (a label, class, etc.). It solves classification and regression problems.

Question 2

Give an example of a classification problem and an example of a regression problem.

Accepted Answer

Classification: Spam detection (output is yes or no)
Regression: Stock price prediction

Question 3

How do you formulate a supervised learning algorithm?

Accepted Answer

Given some input x, predict an appropriate output y; the goal is a function f such that f(x) = y. The learning process takes examples of input-output pairs (training data) and finds a good function for predicting the outputs of new inputs.

Question 4

What is overfitting vs underfitting?

Accepted Answer

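Overfitting: When the model is more complex than required, so it fits noise in the training data and generalises poorly to new data.
Underfitting: When the model is simpler than required, so it cannot capture the underlying pattern even on the training data.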

Question 5

What is univariate linear regression?

Accepted Answer

A machine learning algorithm for regression problems. It is univariate because it takes in one input attribute.
Equation example: y = f(x; w0, w1) = w1x + w0
Where y is the dependent variable, w0 and w1 are free parameters, and x is the independent variable.

Question 6

What is a loss/cost function?

Accepted Answer

Used when drawing a line of best fit, it is a criterion indicating how good or bad that line is. It averages the losses on the training examples, where each loss measures the distance between the output of the model and the observed target.
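
A minimal sketch of such a cost, assuming the squared error is used as the per-example loss (the deck names MSE for regression in a later card). The function name `mse_cost` and the toy data are illustrative only.

```python
def mse_cost(w0, w1, xs, ys):
    """Mean squared error of the line y = w1*x + w0 on the training examples."""
    losses = [((w1 * x + w0) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(losses) / len(losses)

xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]                 # generated by y = 2x + 1

perfect = mse_cost(1.0, 2.0, xs, ys)  # the generating line has zero cost
worse = mse_cost(0.0, 0.0, xs, ys)    # a poorly fitting line has a larger cost
```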

Question 7

What is gradient descent and how does it work?

Accepted Answer

A strategy to minimise cost functions in ML algorithms.
E.g., to minimise g(w0, w1):
Start at a random point.
Repeat until no change:
Update w0, w1 by taking a small step in the direction of steepest descent of the cost (w := w - a ∇g(w), where a is the learning rate).
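
The loop above can be sketched as follows. The cost g, its gradient `grad_g`, the stopping tolerance and the fixed (rather than random) starting point are all assumptions made so the example is reproducible.

```python
def gradient_descent(grad, w, a=0.1, tol=1e-9, max_iters=10_000):
    """Repeat w := w - a * grad(w) until the update becomes negligible."""
    for _ in range(max_iters):
        step = [a * g for g in grad(w)]
        w = [wi - si for wi, si in zip(w, step)]
        if max(abs(s) for s in step) < tol:
            break
    return w

# Hypothetical cost g(w0, w1) = (w0 - 3)^2 + (w1 + 1)^2, minimised at (3, -1).
def grad_g(w):
    return [2 * (w[0] - 3), 2 * (w[1] + 1)]

w_min = gradient_descent(grad_g, [0.0, 0.0])
```

If the learning rate a is too large the steps can overshoot and diverge; if it is too small, convergence is slow.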

Question 8

What does the vector of partial derivatives represent?

Accepted Answer

The gradient vector. The partial derivative with respect to one variable is the ordinary derivative of the function obtained by treating the other variables as constants. The negative of the gradient evaluated at a point gives the direction of steepest descent at that point.

Question 9

What are the steps in the algorithm for univariate linear regression using GD?

Accepted Answer

Input: learning rate a > 0, training set {(x(n), y(n)): n = 1, 2 … N}
Initialise w0 = 0, w1 = 0
Repeat:
For n = 1, 2 … N (both updates use the values of w0, w1 from before the step):
w0 := w0 - a((w1x(n) + w0) - y(n))
w1 := w1 - a((w1x(n) + w0) - y(n))x(n)
Until the change in cost remains below a small threshold. Return w0, w1.
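
A sketch of these steps, assuming the cost is half the mean squared error (its per-example gradient matches the two updates) and using illustrative defaults for the learning rate and threshold. The names `cost` and `fit_line` are hypothetical.

```python
def cost(w0, w1, xs, ys):
    """Half the mean squared error (assumed cost for the stopping test)."""
    return sum(((w1 * x + w0) - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

def fit_line(xs, ys, a=0.05, tol=1e-12, max_epochs=10_000):
    w0, w1 = 0.0, 0.0
    prev = cost(w0, w1, xs, ys)
    for _ in range(max_epochs):
        for x, y in zip(xs, ys):
            err = (w1 * x + w0) - y   # computed once, so both updates use the old w0, w1
            w0 -= a * err
            w1 -= a * err * x
        cur = cost(w0, w1, xs, ys)
        if abs(prev - cur) < tol:     # stop once the cost has stopped changing
            break
        prev = cur
    return w0, w1

# Toy data generated by y = 2x + 1, so we expect w0 close to 1 and w1 close to 2.
w0, w1 = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```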

Question 10

What is logistic regression?

Accepted Answer

A linear and parametric model, similar to linear regression, except it’s for classification problems.

Question 11

What is the difference between parametric and non-parametric models?

Accepted Answer

Parametric models summarise data with a finite set of parameters and make assumptions about data distributions.
Non-parametric models cannot be characterised by a bounded set of parameters, and no assumptions are made about data distributions.

Question 12

What is the sigmoid function and what does it mean when used in a model?

Accepted Answer

y = 1 / (1 + e ^ -x). It is a smoothed version of a step function. In a model, x is replaced with a linear combination of the attributes (w0 + w1x1 + … + wdxd), and the resulting value is the probability that the label is 1.
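
A small sketch of the sigmoid applied to a linear model. The function names and the example weights are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, x):
    """P(label = 1) for weights w = [w0, w1, ..., wd] and attributes x = [x1, ..., xd]."""
    z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return sigmoid(z)

w = [-1.0, 1.0, 1.0]                        # hypothetical weights w0, w1, w2
on_boundary = predict_proba(w, [0.5, 0.5])  # w0 + w1*x1 + w2*x2 = 0, so the output is 0.5
positive_side = predict_proba(w, [2.0, 2.0])
```

Inputs where the linear combination is zero get probability exactly 0.5; inputs on either side lean towards label 1 or label 0.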

Question 13

If a model has two attributes x1, x2 with three parameters w0, w1, w2, and uses the sigmoid function, what is the decision boundary?

Accepted Answer

w0 + w1x1 + w2x2 = 0. It is the set of inputs for which the sigmoid outputs 0.5.

Question 14

What is another name for logistic loss?

Accepted Answer

Cross-entropy loss.

Question 15

List some features of the kNN algorithm.

Accepted Answer

- Non-parametric
- Instance-based (prediction based on comparison of new point with other data points, not a model)
- Lazy algorithm (no explicit training step, defers computation until prediction)
- Used for classification and regression.

Question 16

What are the steps involved in the kNN algorithm?

Accepted Answer

Input: neighbour size k, distance metric D, training set {(x(n), y(n)): n = 1 .. N}, a new data point x(j).
For n = 1 .. N: calculate D(x(j), x(n)). Then select the k training samples closest to x(j).
Return y(j): plurality vote of their labels (classification), or average of their y values (regression).
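
These steps can be sketched as below, assuming Euclidean distance as the default metric D. The function names and the `task` switch are illustrative; ties in the vote are broken arbitrarily here.

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train_x, train_y, x_new, k, dist=euclidean, task="classification"):
    # Sort the training examples by their distance to the new point.
    by_distance = sorted(zip(train_x, train_y), key=lambda pair: dist(x_new, pair[0]))
    nearest = [y for _, y in by_distance[:k]]
    if task == "classification":
        return Counter(nearest).most_common(1)[0][0]   # plurality vote
    return sum(nearest) / k                             # average for regression

train_x = [[0, 0], [0, 1], [1, 0], [5, 5], [6, 5]]
train_y = ["a", "a", "a", "b", "b"]
label = knn_predict(train_x, train_y, [0.2, 0.2], k=3)
```

Note that all the work happens at prediction time, which is why kNN is called a lazy algorithm.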

Question 17

What is normalisation and why is it used?

Accepted Answer

It is the process of linearly scaling the range of each attribute to lie in, e.g., [0, 1], using the following equation:
x(n)_j_new = (x(n)_j - min x_j) / (max x_j - min x_j)
Used so that attributes with larger ranges don’t dominate the results (e.g. distance computations in kNN).
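
A minimal sketch of the equation applied to one attribute (one column of the data). Mapping a constant attribute to 0 is an assumption made here to avoid dividing by zero.

```python
def min_max_normalise(column):
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]   # assumption: a constant attribute maps to 0
    return [(v - lo) / (hi - lo) for v in column]

scaled = min_max_normalise([10, 20, 30])
```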

Question 18

What is standardisation?

Accepted Answer

The process of linearly scaling each dimension to have mean 0 and variance 1: x(n)_j_new = (x(n)_j - mean_j) / std_j, where std_j is the standard deviation (square root of the variance) of attribute j.
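
A short sketch for one attribute. Unit variance is obtained by dividing by the standard deviation, which is what this code does; it assumes the population standard deviation and a non-constant attribute.

```python
import statistics

def standardise(column):
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)    # population standard deviation (assumed non-zero)
    return [(v - mean) / std for v in column]

# This column has mean 5 and standard deviation 2.
z = standardise([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```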

Question 19

What are the pros/cons of the kNN algorithm?

Accepted Answer

Pros: Easy to implement and interpret, accurate.
Cons: Large memory space, sensitive to noise, performance degrades as data dimension increases.

Question 20

What are the steps involved in holdout validation?

Accepted Answer

1. Randomly choose 30% of the data to form a validation set; the rest is the training set. Train each candidate model on the training set.
2. Estimate test performance on the validation set.
3. Choose the model with the lowest validation error, retrain it to obtain the final predictor, then estimate its performance on the test set.

Question 21

When estimating test performance on a validation set, what do we compute in regression algorithms versus classification algorithms?

Accepted Answer

Regression: Compute cost function (MSE) on examples of validation set.
Classification: Compute 0-1 error metric, i.e. number of wrong predictions / number of predictions.

Question 22

What are the steps involved in k-Fold Cross-Validation?

Accepted Answer

1. Split the training set randomly into k equal sets.
2. Use k-1 of those for training, and the remaining one for validation.
3. Repeat k times, each time holding out a different one of the k sets for validation.
4. Average the performances on the k validation sets.
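
The four steps can be sketched on example indices as follows. The `evaluate` callback (train on one index list, score on the other) is a hypothetical placeholder for whatever model and metric are being validated.

```python
import random

def k_fold_splits(n_examples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)            # step 1: random split
    folds = [indices[i::k] for i in range(k)]       # k near-equal sets
    for i in range(k):                              # step 3: rotate the held-out set
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation                     # step 2: k-1 sets train, 1 validates

def cross_validate(n_examples, k, evaluate):
    """Step 4: average evaluate(train, validation) over the k folds."""
    scores = [evaluate(tr, va) for tr, va in k_fold_splits(n_examples, k)]
    return sum(scores) / k

splits = list(k_fold_splits(10, 5))
```

Setting k equal to the number of examples gives leave-one-out validation.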

Question 23

What are the steps involved in leave-one-out validation?

Accepted Answer

Leave out a single example for validation, and train on rest of the annotated data.
For a total of N examples, repeat N times, each time leaving out a single example.
Take average of validation errors as measured on left-out points.

Question 24

What are the advantages and disadvantages of holdout validation, 3-fold and 10-fold validation, and leave-one-out validation?

Accepted Answer

Adv: Holdout - Computationally cheap. 3-fold - Slightly more reliable. 10-fold - Wastes only 10%, reliable. L-O-O - Doesn’t waste data.
Disadv: Holdout - Unreliable if sample size not large. 3/10 fold - wastes data. L-O-O - Computationally expensive.

Question 25

What is unsupervised learning?

Accepted Answer

Uses unlabelled data sets of feature vectors, and finds sub-groups (clusters) among feature vectors with similar traits. Also finds patterns within the feature vectors to identify a lower-dimensional representation (dimensionality reduction).

Question 26

What are the benefits and challenges of unsupervised learning?

Accepted Answer

Unlabelled data is cheap to collect and is abundant, compressed representation saves on storage and computation, reduces noise, used in exploratory data analysis. However, it has no simple goal and validation is subjective.

Question 27

What is the goal and respective observations in clustering?

Accepted Answer

The goal is to find natural groupings among objects. The clusters/groups can be observed as objects within a cluster having high similarity (high intra-cluster similarity), and objects across clusters having low similarity (low inter-cluster similarity)

Question 28

How do you identify similar feature vectors?

Accepted Answer

Use proximity indices to quantify strength of relationship between any two feature vectors. Continuous valued attributes use their values, but nominal feature attributes (e.g large, medium, small) must be mapped to consistent discrete values.

Question 29

What is the inertia of a cluster C defined as?

Accepted Answer

Sum of squared Euclidean distances between each example in a cluster and its centroid. The centroid is the average of all examples in a cluster. Inertia measures how compact the cluster is.

Question 30

What is the Within Cluster Sum of Squares (WCSS) defined as?

Accepted Answer

The inertia summed over all clusters in the clustering (for a single cluster, it is that cluster’s inertia).

Question 31

What is the K-means algorithm?

Accepted Answer

An iterative, greedy descent algorithm that finds a locally optimal (possibly sub-optimal) solution. It alternates between the assignment step, where each example goes to the cluster with the closest centroid, and the refitting step, where it updates the cluster centroids. WCSS never increases from one step to the next.

Question 32

What are the steps in the K-means algorithm?

Accepted Answer

Input: K clusters and N examples (x1, …, xN)
1. Select K of the N examples at random as initial centroids (c1, …, cK)
2. Repeat until the centroids don’t change: assignment step (assign each example to the centroid at minimum Euclidean distance), then refitting step (compute new centroids as cluster means; they don’t need to be examples)
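
A compact sketch of these steps. The iteration cap and the choice to keep the old centroid when a cluster ends up empty are assumptions added for robustness, not part of the card.

```python
import random

def sq_dist(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def mean_point(points):
    return [sum(coords) / len(points) for coords in zip(*points)]

def k_means(xs, k, seed=0, max_iters=100):
    centroids = random.Random(seed).sample(xs, k)       # step 1: K examples at random
    for _ in range(max_iters):
        # Assignment step: index of the closest centroid for each example.
        labels = [min(range(k), key=lambda c: sq_dist(x, centroids[c])) for x in xs]
        # Refitting step: each centroid becomes the mean of its assigned examples.
        new_centroids = []
        for c in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == c]
            new_centroids.append(mean_point(members) if members else centroids[c])
        if new_centroids == centroids:                   # step 2: stop when unchanged
            break
        centroids = new_centroids
    return centroids, labels

xs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centroids, labels = k_means(xs, k=2)
```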

Question 33

What is the space and time complexity of K-means?

Accepted Answer

Modest space complexity as only data observations and centroids are stored. It is of order O((N + K)m), where m is number of feature attributes.
Time complexity is of order O(IKNm), where I is number of iterations required for convergence. Linear in N.

Question 34

What is the K-means++ algorithm?

Accepted Answer

Choose the first centroid at random.
Repeat until K centroids are found: For each point x, compute its distance d(x) from its nearest centroid. Choose a new data point x at random, with probability proportional to d(x)^2, as the next centroid. The obtained centroids are used to initialise K-means.
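
A sketch of this seeding procedure. It assumes the data contains at least K distinct points (otherwise all weights become zero and `random.choices` raises an error); the function names are illustrative.

```python
import random

def sq_dist(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def k_means_pp_init(xs, k, seed=0):
    rng = random.Random(seed)
    centroids = [rng.choice(xs)]                    # first centroid uniformly at random
    while len(centroids) < k:
        # d(x)^2: squared distance from each point to its nearest chosen centroid.
        d2 = [min(sq_dist(x, c) for c in centroids) for x in xs]
        # Sample the next centroid with probability proportional to d(x)^2.
        centroids.append(rng.choices(xs, weights=d2, k=1)[0])
    return centroids

seeds = k_means_pp_init([[0.0], [0.0], [5.0]], k=2)
```

Points that are already centroids have d(x) = 0 and so cannot be picked again, which spreads the initial centroids out.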

Question 35

What is the Elbow method?

Accepted Answer

A data-based approach that estimates the number of natural clusters in a data set. Run K-means for increasing values of K, and evaluate the WCSS of each obtained clustering. Plot WCSS as a function of K. The optimal K lies at the elbow of the plot.

Question 36

What are some limitations of the K-means algorithm?

Accepted Answer

It has problems when outliers are present, as they influence clusters and increase WCSS. It has problems when clusters have differing sizes or densities, and when clusters are of non-globular shapes. It may converge to a local minimum, so multiple restarts are needed.

Question 37

What is hierarchical clustering?

Accepted Answer

A clustering algorithm that only requires the user to specify a measure of similarity between a pair of clusters. It creates a hierarchical decomposition of the set of examples, producing a dendrogram.

Question 38

List the features of a dendrogram.

Accepted Answer

It is a rooted binary tree, where nodes represent clusters.

Question 39

Describe the differences between agglomerative clustering and divisive clustering.

Accepted Answer

Agglomerative: Bottom-up approach, starting at the bottom with each cluster containing a single observation. Recursively merges the pair of clusters with the smallest inter-cluster dissimilarity into a single cluster.
Divisive: Top-down approach, splitting clusters.

Question 40

What is a Single Linkage distance vs a Complete Linkage distance vs Group Average?

Accepted Answer

SL: Shortest distance from any member of the cluster to any member of the other cluster.
CL: Largest distance from any member of the cluster to any member of the other cluster.
GA: Average distance between members of the two clusters.
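
The three linkage distances can be sketched as below, assuming Euclidean distance between points. The function name is illustrative.

```python
import math
from itertools import product

def linkage_distances(cluster_a, cluster_b):
    """Single, complete and group-average linkage between two clusters of points."""
    pairwise = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]
    return {
        "single": min(pairwise),                    # shortest pairwise distance
        "complete": max(pairwise),                  # largest pairwise distance
        "average": sum(pairwise) / len(pairwise),   # mean pairwise distance
    }

d = linkage_distances([[0.0], [1.0]], [[3.0], [5.0]])
```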

Question 41

What is the space and time complexity of hierarchical clustering?

Accepted Answer

Storage: O(N^2). Storing distance matrix requires storage of (N^2) / 2 entries.
Time: O(N^3) in many cases. N iterations, each with N^2 size distance matrix needing to be updated and searched.

Question 42

What is the goal of cluster validation?

Accepted Answer

Evaluate in a quantitative and objective manner the cluster structure found by an algorithm according to a validation criterion. This is an index used to measure the adequacy of found cluster structures.

Question 43

List three examples of unsupervised validation criteria for partitional and hierarchical clustering algorithms.

Accepted Answer

Partitional: Variability and Separation-Based, Silhouette Coefficient.
Hierarchical: Cophenetic Correlation

Question 44

Explain what Variability and Separation Criteria entails.

Accepted Answer

It quantifies the inter-cluster (separation) and intra-cluster (variability) dissimilarity. These criteria can then be used to define an overall validation criterion for a clustering structure C.

Question 45

What is the equation for centroid-based variability of a cluster C and an equation for centroid-based separation between clusters C1 and C2?

Accepted Answer

Variability = Sum of distances between each point in the cluster and its centroid.
Separation = Distance between the centroid of cluster C1 and the centroid of cluster C2.

Question 46

What is the Between Cluster Sum of Squares (BCSS)?

Accepted Answer

Sum, over clusters, of the squared Euclidean distance between the cluster centroid and the centroid of the whole data set, usually weighted by the number of examples in the cluster (separation).

Question 47

For a clustering structure C, what relation holds between WCSS and BCSS and what does this entail?

Accepted Answer

WCSS(C) + BCSS(C) = constant.
This means minimising WCSS ensures maximising BCSS.
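
A numerical check of this relation on toy data. It assumes each cluster's term in BCSS is weighted by the cluster size, which is the form for which the sum equals the total sum of squares of the data; the helper names are illustrative.

```python
def sq_dist(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def mean_point(points):
    return [sum(coords) / len(points) for coords in zip(*points)]

def wcss(clusters):
    return sum(sq_dist(x, mean_point(c)) for c in clusters for x in c)

def bcss(clusters):
    overall = mean_point([x for c in clusters for x in c])
    # Assumption: each centroid's squared distance is weighted by its cluster size.
    return sum(len(c) * sq_dist(mean_point(c), overall) for c in clusters)

# Two different clusterings of the same four points.
good = [[[0.0], [2.0]], [[10.0], [12.0]]]
poor = [[[0.0]], [[2.0], [10.0], [12.0]]]
total_good = wcss(good) + bcss(good)
total_poor = wcss(poor) + bcss(poor)
```

Both totals agree, while the better clustering has the smaller WCSS and the larger BCSS.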

Question 48

What is the Silhouette Coefficient (SC) used for?

Accepted Answer

Evaluating how well an individual example fits its cluster, as well as the quality of an individual cluster and of a whole clustering structure C of K clusters. It combines the ideas of variability and separation.

Question 49

How do you compute the Silhouette Coefficient?

Accepted Answer

Let the ith example e(i) belong to cluster C.
Calculate a(i) = average distance of the ith example to all other examples in its cluster.
Calculate b(i) = minimum (over the other clusters) of the average distance of the ith example to the examples in that cluster.
SC(i) = (b(i) - a(i)) / max{b(i), a(i)}
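
A sketch of this computation for one example, assuming Euclidean distance, at least two clusters, and that the example's own cluster has at least one other member. The function name and argument layout are illustrative.

```python
import math

def silhouette(i, points, labels):
    """Silhouette coefficient of example i; labels[j] is the cluster of points[j]."""
    def avg_dist_to(cluster):
        members = [p for j, p in enumerate(points) if labels[j] == cluster and j != i]
        return sum(math.dist(points[i], p) for p in members) / len(members)

    a = avg_dist_to(labels[i])                                        # own cluster
    b = min(avg_dist_to(c) for c in set(labels) if c != labels[i])    # nearest other cluster
    return (b - a) / max(a, b)

points = [[0.0], [1.0], [10.0], [11.0]]
well_placed = silhouette(0, points, [0, 0, 1, 1])
misplaced = silhouette(1, points, [0, 1, 1, 0])
```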

Question 50

List some properties of the Silhouette Coefficient.

Accepted Answer

Varies between -1 and 1. Near -1, the example is a better fit to a neighbouring cluster. Near 0, the example is on the border between two clusters. Near 1, the example is well matched to its cluster.
SC of a cluster = average of the SC of the examples in that cluster
SC of a clustering = average of the SC of all examples

Question 51

How can Silhouette Coefficients be used to estimate the number of clusters?

Accepted Answer

By plotting the average SC of the clustering as a function of the number of clusters; the peak in the plot gives an estimate of the number of clusters.

Question 52

What are the two classes of supervised validation criteria for partitional clustering algorithms?

Accepted Answer

Classification-oriented: uses measures from classification to quantify the extent to which a cluster contains objects of a single class.

Similarity-oriented: quantifies the extent to which two objects in the same class are in the same cluster and vice versa.

Question 53

What is the probability that an example of cluster i belongs to class j?

Accepted Answer

p(i, j) = number of examples of class j in cluster i / |Ci|.

Question 54

What is the precision and recall of a cluster i with respect to class j?

Accepted Answer

Precision(i,j) = p(i, j): Measures extent to which a cluster contains objects of a single class.

Recall(i, j) = number of objects of class j in cluster i / number of objects in class j: Determines fraction of class j contained in cluster i.

Question 55

What is the F-measure of cluster i with respect to class j?

Accepted Answer

F(i, j) = (2 * precision(i,j) * recall(i, j)) / (precision(i, j) + recall(i, j))
Measures the extent to which a cluster contains only objects of a particular class and all objects of that class. Combines precision and recall.
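
Precision, recall and the F-measure of a cluster can be sketched together as below. The function name and the parallel-list input format are assumptions; returning F = 0 when the cluster contains no objects of the class is a convention chosen here to avoid dividing by zero.

```python
def cluster_class_scores(clusters, classes, i, j):
    """Precision, recall and F-measure of cluster i with respect to class j.

    clusters[n] and classes[n] are the cluster and the true class of example n.
    """
    n_ij = sum(1 for c, k in zip(clusters, classes) if c == i and k == j)
    precision = n_ij / clusters.count(i)     # p(i, j)
    recall = n_ij / classes.count(j)
    f = 2 * precision * recall / (precision + recall) if n_ij else 0.0
    return precision, recall, f

clusters = [0, 0, 0, 1, 1]
classes = ["a", "a", "b", "b", "b"]
precision, recall, f = cluster_class_scores(clusters, classes, 0, "a")
```

Here cluster 0 holds all of class "a" (recall 1) but also one "b" (precision 2/3), and the F-measure balances the two.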

Question 56

What is entropy, and what does low entropy mean?

Accepted Answer

The degree to which each cluster consists of examples of a single class.
Low entropy means clusters consist mostly of examples of same class.

Question 57

What is purity?

Accepted Answer

Another measure of the extent to which a cluster consists of examples of a single class. Purity(Ci) = max p(i,j). Purity is ideally high (close to 1)
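
A sketch computing purity alongside entropy for one cluster. The entropy formula used, -sum over j of p(i, j) log2 p(i, j), is the standard definition and is assumed here since the deck does not state it; the function name is illustrative.

```python
import math
from collections import Counter

def purity_and_entropy(class_labels_in_cluster):
    counts = Counter(class_labels_in_cluster)
    size = len(class_labels_in_cluster)
    probs = [n / size for n in counts.values()]     # p(i, j) for each class j present
    purity = max(probs)
    entropy = -sum(p * math.log2(p) for p in probs)
    return purity, entropy

pure = purity_and_entropy(["a", "a", "a", "a"])
mixed = purity_and_entropy(["a", "a", "b", "b"])
```

A cluster of a single class has purity 1 and entropy 0; an even two-class mix has purity 0.5 and entropy 1.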

Question 58

What is benchmarking?

Accepted Answer

An approach to determine the quality of an algorithm, by measuring the time taken to run and the memory consumption on a given computer with specific input data.

Question 59

What is asymptotic analysis?

Accepted Answer

An approach to determine quality of an algorithm. It is a mathematical abstraction over exact number of operations and content of input, independent of any particular implementation.

Question 60

What is a problem-solving agent?

Accepted Answer

An agent is an ‘entity’ that perceives and acts in an environment. A problem-solving agent uses atomic representations (each state is treated as indivisible), requiring a precise definition of a problem and its goal/solution.

Question 61

What assumptions do we make about the environment when formulating a search problem?

Accepted Answer

Observable - agent able to know current state
Discrete - only finitely many actions at any state
Known - possible to determine which states are reached by which action
Deterministic - each action has exactly one outcome

Question 62

When formulating a search problem, what five components define the said search problem?

Accepted Answer

Initial state - state where agent starts
Action set - set A of actions that can be executed in any state si
Transition model - mapping from a state and an action to the resulting state
Goal test - determine if a state is a goal state
Path cost function - assigns a cost to paths

Question 63

What is the solution and cost of a search problem?

Accepted Answer

Solution: sequence of actions from initial to goal state
Cost: sum of cost of actions from initial to goal state

Question 64

When forming a search tree, which components of the tree are used for the initial state, actions and state space?

Accepted Answer

Initial state: Root
Actions: Branches
State space: Nodes

AI1

Artificial Intelligence 1