click below
click below
Normal Size Small Size show me how
AI1
Artificial Intelligence 1
| Question | Answer |
|---|---|
| What is supervised learning? | A prevalent learning algorithm, using a teacher, which has an expected output, label, class etc. It solves classification and regression problems. |
| Give an example of a classification problem and an example of a regression problem. | Classification: Spam detection (output is yes or no) Regression: Stock price prediction |
| How do you formulate a supervised learning algorithm? | Given some input x, predict an appropriate output y, with the goal being a function f such that f(x) = y. It follows a learning process: examples of input-output pairs (training data), and finds a good function to predict outputs of new inputs. |
| What is overfitting vs underfitting? | Overfitting: When the model is more complex than required. Underfitting: When the model is simpler than required. |
| What is univariate linear regression? | A machine learning algorithm for regression problems. It is univariate because it takes in one input attribute. Equation example: y = f(x; w0, w1) = w1x + w0 Where y = dependent variable, w0, w1 are free parameters, and x is independent variable. |
| What is a loss/cost function? | Used when drawing a line of best fit, it is a criteria suggesting how good/bad that line is. It averages losses on training examples by find the distance between the output of the model and the observed target. |
| What is gradient descent and how does it work? | A strategy to minimise cost functions in ML algorithms. E.g, minimise g(w0, w1): Start at a random point. Repeat until no change: Update w0, w1 by taking a small step in direction of steepest descent of cost (w := w - a ▽g(w), a = learning rate) |
| What does the vector of partial derivatives represent? | The gradient vector. Partial derivative with respect to one variable is the ordinary derivative of the function by treating others as constants. the negative of the gradient evaluated gives direction of steepest descent. |
| What are the steps in the algorithm for univariate linear regression using GD? | Input: a>0, training set {x(n), y(n): n = 1, 2 … N} Initialise w0 = 0, w1 = 0 Repeat for n = 1, 2 …. N: w0 := w0 - a((w1x(n) + w0) - y(n)) w1 := w1 - a((w1x(n) + w0) - y(n))x(n) Until change in cost remains below small threshold. Return w0, w1. |
| What is logistic regression? | A linear and parametric model, similar to linear regression, except it’s for classification problems. |
| What is the difference between parametric and non-parametric models? | Parametric models summarise data it’s a finite set of parameters and make assumptions on data distributions. Non-parametric models cannot be characterised by bounded set of parameters, and no assumptions are made no data distributions. |
| What is the sigmoid function and it’s meaning used in a model? | y = 1 / (1 + e ^ -x). It is a smoothed version of a step function. In a model, x is replaced with attributes (w0 + w1x1 + … + wdxd), with the value being the probability that the label is 1. |
| If an algorithm has three attributes w0, w1, w2, and uses the sigmoid function, what is the decision boundary? | w0 + w1x1 + w2x2 = 0. It is the set o inputs for which sigmoid outputs 0.5. |
| What is the name of logistic loss? | Cross-entropy loss. |
| List some features of the kNN algorithm. | - Non parametric - Instance-based (prediction based on comparison of new point with other data points, not a model) - Lazy algorithm (no explicit training step, defers computation until prediction) - Used for classification and regression. |
| What are the steps involved in the kNN algorithm? | Input neighbour size k, distance metric D, training set, a new data x(j). For n = 1.. N: Calculate D(x(j), x(n)), select k training samples closest to x(j) Return y(j): plurality vote of labels (classification), or average of y values (regression) |
| What is normalisation and why is it used? | It is the process of linearly scaling range of each attribute to be in e.g [0, 1], using the following equation: x(n)_(j_new) = (x(n)_j - min x_j) / (max x_j - min x_j) Used so larger ranges don’t affect results. |
| What is standardisation? | The process of linearly scaling each dimension to have 0 mean and variance 1. x(n)j_new = (x(n)_j - mean_j) / variance_j, |
| What are the pros/cons of the kNN algorithm? | Pros: Easy to implement and interpret, accurate. Cons: Large memory space, sensitive to noise, performance degrades as data dimension increases. |
| What are the steps involved in holdout validation? | 1. Randomly choose 30% of data to form a validation set, rest is training set, and train it. 2. Estimate test performance on validation set. 3. Choose model with lowest validation error, and re train for predictor, then estimate on test set. |
| When estimating test performance on a validation set, what do we compute in regression algorithms versus classification algorithms? | Regression: Compute cost function (MSE) on examples of validation set. Classification: Compute 0-1 error metric, i.e number of wrong predictions / number of predictions. |
| What are the steps involved in k-Fold Cross-Validation? | 1. Split training set randomly into k equal sets. 2. Use k-1 of those for training, and remaining for validation. 3. Permute the k sets and repeat k times 4. Average performances on k validation sets. |
| What are the steps involved in leave-one-out validation? | Leave out a single example for validation, and train on rest of the annotated data. For a total of N examples, repeat N times, each time leaving out a single example. Take average of validation errors as measured on left-out points. |
| What are the advantages and disadvantages of holdout validation, 3-fold and 10-fold validation, and leave-one-out validation? | Adv: Holdout - Computationally cheap. 3-fold - Slightly more reliable. 10-fold - Wastes only 10%, reliable. L-O-O - Doesn’t waste data. Disadv: Holdout - Unreliable if sample size not large. 3/10 fold - wastes data. L-O-O - Computationally expensive. |
| What is unsupervised learning? | Uses unlabelled data sets of feature vectors, and finds sub-groups (clusters) among feature vectors with similar traits. Also finds patterns within feature vector to indemnify a lower dimensional representation (dimensionality reduction) |
| What are the benefits and challenges of unsupervised learning? | Unlabelled data is cheap to collect and is abundant, compressed representation saves on storage and computation, reduces noise, used in exploratory data analysis. However, it has no simple goal and validation is subjective. |
| What is the goal and respective observations in clustering? | The goal is to find natural groupings among objects. The clusters/groups can be observed as objects within a cluster having high similarity (high intra-cluster similarity), and objects across clusters having low similarity (low inter-cluster similarity) |
| How do you identify similar feature vectors? | Use proximity indices to quantify strength of relationship between any two feature vectors. Continuous valued attributes use their values, but nominal feature attributes (e.g large, medium, small) must be mapped to consistent discrete values. |
| What is the inertia of a cluster C defined as? | Sum of Euclidean distances between each example in a cluster and its centroid. The centroid is the average of all examples in a cluster. It determines how compact the cluster is. |
| What is the Within Cluster Sum of Squares (WCSS) defined as? | The inertia of a cluster. |
| What is the K-means algorithm? | An iterative, greedy descent algorithm aiming to find a sub-optimal solution. It switches between the assignment step, where each example goes to a cluster with closest centroid, and refitting step, where it updates cluster centroids. WCSS will decrease |
| What are the steps in the K-means algorithm? | Input: K clusters and N examples (x1, … N) 1. Select K of N examples at random for centroids (c1… cK) 2. Repeat until centroids don’t change: Assignment step (minimum Euc distance), refitting step (compute new centroids, doesn’t need to be examples) |
| What is the space and time complexity of K-means? | Modest space complexity as only data observations and centroids are stored. It is of order O((N + K)m), where m is number of feature attributes. Time complexity is of order O(IKNm), where I is number of iterations required for convergence. Linear in N. |
| What is the K-means ++ Algorithm? | Choose first centroid at random. Repeat until K centroids are found: For each point, compute distance from its nearest centroid. Choose a new data point x randomly with propositional probability to d(x)^2 as next centroid. Obtained centroids used for K-M |
| What is the Elbow method? | A data based approach that estimates number of natural clusters in data set. Run K-means for increasing values of K, and evaluate WCSS of obtained clustering. Plot WCSS as a function. Optimal K lies at elbow of the plot. |
| What are some limitations of the K-means algorithm? | It has problems when outliers are present, as they influence clusters and increase WCSS. It has problems with clusters have differing size, densities and when clusters are of non-globular shapes. It may converge to local minima, implying multiple restarts |
| What is hierarchical clustering? | A clustering algorithm that only requires the user to specify a measure of similarity between a pair of clusters. It creates a hierarchical decomposition of the set of examples, producing a dendrogram. |
| List the features of a dendrogram. | It is a rooted binary tree, where nodes represent clusters. |
| Describe the differences between agglomerative clustering and divisive clustering. | Agglomerative: Bottom-up approach, starting at bottom of each cluster containing a single observation. Recursively merges pair of clusters with smallest inter-cluster dissimilarity into a single cluster. Divisive: Top-down approach, splitting clusters. |
| What is a Single Linkage distance vs a Complete Linkage distance vs Group Average? | SL: Shortest distance from any member of the cluster to any member of the other cluster. CL: Largest distance from any member of the cluster to any member of the other cluster. GA: Average distance between members of the two clusters. |
| What is the space and time complexity of hierarchical clustering? | Storage: O(N^2). Storing distance matrix requires storage of (N^2) / 2 entries. Time: O(N^3) in many cases. N iterations, each with N^2 size distance matrix needing to be updated and searched. |
| What is the goal of cluster validation? | Evaluate in a quantitative and objective manner the cluster structure found by an algorithm according to a validation criterion. This is an index used to measure the adequacy of found cluster structures. |
| List three examples of unsupervised validation criteria for partitional and hierarchical clustering algorithms. | Partitional: Variability and Separation-Based, Silhouette Coefficient. Hierarchical: Cophenetic Correlation |
| Explain what Variability and Separation Criteria entails. | It quantifiers the inter-cluster (separation) and intra-cluster (variability) dissimilarity. The criteria then can be used to define an overall validation criterion for a clustering structure C. |
| What is the equation for centroid-based variability of a cluster C and an equation for centroid-based separation between clusters C1 and C2? | Variability = Sum of distances between each point in the cluster and its centroid. Separation = Distance between cluster C1 and cluster C2. |
| What is the Between Cluster Sum of Squares (BCSS)? | Sum of squared Euclidean distances between each cluster centroid and a central centroid (separation). |
| For a clustering structure C, what relation holds between WCSS and BCSS and what does this entail? | WCSS(C) + BCSS(C) = constant. This means minimising WCSS ensures maximising BCSS. |
| What is the Silhouette Coefficient (SC) used for? | Evaluating an individual example, for a clustering structure C, as well as clustering structure of K clusters. It combines the ideas of variability and separation. |
| How do you compute the Silhouette Coefficient? | Let e(i) belong to C Calculate a(i) = average distance of ith example to all other examples in its cluster. Calculate b(i) = minimum (over clusters) of average distances of ith example to examples in another cluster SC = (b(i) - a(i)) / max{b(i), a(i)} |
| List some properties of the Silhouette Coefficient. | Varies between -1 and 1. If -1, data is better fit to a neighbouring cluster. If 0, data is on border between two clusters. If 1, data is well matched to cluster. SC of a cluster = average of SC in cluster SC of a clustering = average of SC of all |
| How can Sihouette Coefficients be used to estimate number of clusters? | By plotting average SC of clustering as a function of number of clusters, peak in the plot gives estimate to number of clusters. |
| What are the two classes of supervised validation criteria for partitional clustering algorithms? | Classification-oriented, measuring from classification and quantifies extent to which a cluster contains objects of a single class. Similarity-oriented, quantifying extent to which two objects in the same class are in the same cluster and vice versa. |
| What is the probability that an example of cluster i belongs to class j? | p(i, j) = number of examples of class j in cluster i / |Ci |. |
| What is the precision and recall of a cluster i with respect to class j? | Precision(i,j) = p(i, j): Measures extent to which a cluster contains objects of a single class. Recall(i, j) = number of objects of class j in cluster i / number of objects in class j: Determines fraction of class j contained in cluster i. |
| What is the F-measure of cluster i with respect to class j? | F(i, j) = (2 * precision(i,j) * recall(i, j)) / (precision(i, j) + recall(i, j)) Measures the extent to which a cluster contains only objects of a particular class and all objects of that class. Combines precision and recall. |
| What is entropy, and what does low entropy mean? | The degree to which each cluster consists of examples of a single class. Low entropy means clusters consist mostly of examples of same class. |
| What is purity? | Another measure of the extent to which a cluster consists of examples of a single class. Purity(Ci) = max p(i,j). Purity is ideally high (close to 1) |
| What is benchmarking? | An approach to determine quality of an algorithm, by measuring time taking to run and memory consumption on a given computer with specific input data. |
| What is asymptotic analysis? | An approach to determine quality of an algorithm. It is a mathematical abstraction over exact number of operations and content of input, independent of any particular implementation. |
| What is a problem-solving agent? | An agent is an ‘entity’ that perceives and acts in an environment. A problem-solving agent uses atomic representations (each state is perceived indivisible), requiring a precise definition of a problem and it’s goal/solution. |
| What assumptions do we make about the environment when formulating a search problem? | Observable - agent able to know current state Discrete - only finitely many actions at any state Known - possible to determine which states are reached by which action Deterministic - each action has exactly one outcome |
| When formulating a search problem, what five components define the said search problem? | Initial state - state where agent starts Action set - set A describing actions executed in any state si Transition model - mapping between states and actions Goal test - determine if a state is a goal state Path cost function -assigns a cost to paths |
| What is the solution and cost of a search problem? | Solution: sequence of actions from initial to goal state Cost: sum of cost of actions from initial to goal state |
| When forming a search tree, what components of the tree is used for the initial state, actions and state space? | Initial state: Root Actions: Branches State space: Nodes |
| What is the frontier / open list? | Set of all leaf nodes available for expansion at any given time. |
| What is an uninformed search? | A term used to define the set of strategies having no addition information about the state space beyond that provided in problem formulation. |
| What are the 3 steps of the BFS algorithm when formulating a search tree? | 1. Expand the shallowest node in frontier. 2. Do not add children in the frontier if the node is already in frontier or in list of visited nodes (avoids loopy paths, i.e paths repeating themselves) 3. Stops when a goal node is added to the frontier. |
| List the criteria we use to evaluate the performance of an algorithm. | Completeness: Is the algorithm guaranteed to find a solution provided one exists? Optimality: Is the strategy capable of finding optimal solution? Time complexity: How long does it take to find a solution? Space complexity: How much memory is needed? |
| What three quantities are used when using an implicit representation of the graph? | Branching factor, maximum number of successors of any node: b Depth of shallowest goal node (no. steps from root): d Maximum length of any path in state space: m |
| What are the 3 steps of the DFS algorithm when formulating a search tree? | 1. Expand the deepest node in the frontier. 2. Do not add children in frontier if the node is already in frontier or in list of visited nodes 3. Stop when a goal node is visited. |
| What is the depth-limited search? | A DFS where a depth limit l is set, as DFS is not complete in infinite state spaces. Fully explored subtrees can be removed from memory. |
| What is an informed search? | A search strategy using problem-specific knowledge beyond the definition of the problem itself. Solutions found more efficiently compared to uninformed searches. |
| What is best-first search? | A search strategy that d determines which node to expand based on an evaluation function f(n). It acts as a cost estimate: node with lowest cost is one expanded next. |
| What is a heuristic function h(n)? | The estimated cost of the cheapest path from node n to a goal node. If n is a goal node, h(n) = 0. |
| What is the name of the search when f(n) = h(n)? | Greedy best-first search. |
| What are three 3 steps of the A* algorithm? | Expand node with smallest f(n) = g(n) + h(n), g(n) = cost to reach node If the state of a child is in frontier + not visited: If frontier has a larger g(n), replace child into frontier and remove node with larger g(n) Stop when goal node visited |
| What makes a heuristic consistent? | If the estimate is always no greater than the estimated distance from any neighbouring node n’ to the goal, plus cost of etching that neighbour. h(n) <= cost(n, n’) + h(n’) |
| Summarise the performance of A*. | If heuristic h(n) is consistent, A* is complete and optimal. Time complexity: O(b ^ ed), e =relative error of heuristic Space complexity: O(b^ d), since we keep in memory all expanded nodes and nodes in frontier |
| What is the canonical representation of an optimisation problem? | Min/max f(x), st. gi(x) <= 0, i = 1, …., m, hj(x) = 0, j = 1, …, n, |
| What is the search space of a problem defined as? | The space of all possible x values, where f(x) is the objective function and x is the design variable. |
| What are local search algorithms / optimisation algorithms? | Algorithms that operate by searching from an initial state to neighbouring states. They do not keep track of paths nor visited states, so they use little memory and can find reasonable solutions. |
| What is hill climbing? | The aim of finding a global maximum on a graph, i.e maximising an objective function whose value is defined as the elevation on the y-axis. |
| What are the steps in hill climbing? | 1. Generate initial candidate solution at random. 2. Generate neighbour solutions and move to the one with highest value 3. If no neighbour has a higher value than current solution, terminate 4. Otherwise, repeat with a new best neighbour solution. |
| Evaluate the performance of hill climbing. | It is not complete as it depends on problem formulation and design. It is not optimal as it can get stuck in local maximum/minimum. Time complexity: O(mnp), m = max no. iterations, n = max no. neighbours taking O(p) to generate Space complexity: O(nq) |
| What is stochastic hill climbing? | -Chooses at random uphill moves -Probability of picking a specific move depends on steepness -Converges more slowly than steepest ascent -Can find higher quality solutions |
| What is first-choice hill climbing? | -Implements stochastic hill climbing by randomly generating successors until a better successor than the current state is generated -Performs well when a state has many successors |
| What is random-restart hill climbing? | -Generates a series of hill-climbing searches from random initial states -Stops when a goal is found -Complete with probability 1 as it will eventually generate a goal state as initial state |
| What are the steps in simulated annealing? | -Generate initial candidate solution at random -Generate neighbour solutions, pick one at random -If the move improves current solution, its accepted -Otherwise, accepts a worse move with probability less than 1 -Terminate when solution is unchanged |
| In simulated annealing, how should we set the probability of accepting bad moves? | -High probability in beginning -Lower as time passes (exponentially) |
| Evaluate the performance of simulated annealing. | It is not complete as it depends on problem formulation and design. It is not optimal as it depends on termination criteria and schedule Time complexity: depends on schedule and termination criterion Space complexity: depends on design variable rep |
| What are the pros and cons of simulated annealing? | Pros: Usually find a good solution in reasonable time, avoids getting stuck in poor local maxima and plateau Cons: Not guaranteed to be complete or optimal, time/space complexity is problem and representation dependent |
| Using the death penalty approach, how can we adjust the canonical representation of the optimisation problem to avoid infeasible solutions? | Modify the objective function to be min f(x) + Q(x), where the penalty Q(x) is defined as Q(x) { 0, if x is feasible { C, otherwise C is a large positive constant, ensuring infeasible solutions are worse |
| Using the levels of infeasibility approach, how can we adjust the canonical representation of the optimisation problem to avoid infeasible solutions? | Modify the objective function to be min f(x) + Q(x), Only add penalties corresponding to violated constraints, with squared values to make distinction larger. |