Question 1

The role of machine learning within the broader AI hierarchy

Accepted Answer

Machine learning is a subset of AI that learns patterns from data

Question 2

Example of unsupervised learning

Accepted Answer

Grouping customers into segments based on purchasing behavior (no predicting)

Question 3

R-Squared (coefficient of determination) measures:

Accepted Answer

the proportion of variance in Y explained by the model
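
As a rough illustration of that definition, R-squared can be computed by hand as 1 - SS_res / SS_tot. The data values below are made up for the example, and the helper name `r_squared` is hypothetical.

```python
def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    # Variance left unexplained by the model (residual sum of squares)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Total variance of Y around its mean (total sum of squares)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# Made-up actual and predicted values
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]
r2 = r_squared(y_true, y_pred)
print(round(r2, 3))  # 0.975
```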

Question 4

Multicollinearity in regression refers to

Accepted Answer

High correlation among the predictor variables themselves

Question 5

When applying a preprocessing pipeline to scoring (deployment) data, you should use

Accepted Answer

scaler.transform(X_score) to apply the scale learned from training data
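
A minimal sketch of why `transform` (and not a fresh fit) is used on scoring data. `SimpleScaler` is a hypothetical stand-in written for this example, not scikit-learn's class; it only mimics the fit-then-transform idea on a single numeric column, and the numbers are made up.

```python
from statistics import mean, pstdev


class SimpleScaler:
    """Hypothetical one-column scaler that mimics fit/transform."""

    def fit(self, values):
        # Learn the mean and (population) standard deviation from training data
        self.mean_ = mean(values)
        self.std_ = pstdev(values)
        return self

    def transform(self, values):
        # Apply the statistics learned in fit(); nothing is re-learned here
        return [(v - self.mean_) / self.std_ for v in values]


X_train = [10.0, 20.0, 30.0]  # made-up training column
X_score = [20.0, 40.0]        # made-up scoring (deployment) column

scaler = SimpleScaler().fit(X_train)
X_score_scaled = scaler.transform(X_score)
print(X_score_scaled)  # 20.0 maps to 0.0 because the training mean is 20
```

Refitting on `X_score` would instead center the scoring data on its own mean (30), so the model would see values on a different scale than the one it was trained on.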

Question 6

The primary reason for splitting data into training and test sets is to

Accepted Answer

Evaluate how well the model generalizes to data it has not seen before

Question 7

What does Lasso regularization do to coefficients of predictors that contribute little to the model

Accepted Answer

It shrinks them toward zero, potentially to exactly zero
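
One way to see the "exactly zero" behavior: in the special case of orthonormal predictors, the Lasso solution is the soft-thresholded least-squares coefficient (up to how the penalty is scaled). The sketch below shows only that special case; the coefficients and the penalty value are made up, and `soft_threshold` is a name chosen for this example.

```python
def soft_threshold(coef, alpha):
    """Shrink coef toward zero by alpha; small coefficients become exactly 0."""
    if coef > alpha:
        return coef - alpha
    if coef < -alpha:
        return coef + alpha
    return 0.0


# Made-up least-squares coefficients and penalty strength
ols_coefs = [2.5, -0.3, 0.05, -1.2]
alpha = 0.5
lasso_coefs = [soft_threshold(c, alpha) for c in ols_coefs]
print(lasso_coefs)  # the two weak predictors (-0.3 and 0.05) drop to exactly 0.0
```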

Question 8

In healthcare classification, which metric is most critical to minimize when missing a positive case (e.g., a patient at risk) carries severe consequences

Accepted Answer

False Negative Rate
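
A small sketch of how the False Negative Rate falls out of the confusion-matrix counts, FNR = FN / (FN + TP). The labels below are made up (1 = positive case, 0 = negative case), and `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred):
    """Tally TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:  # actual == 1 but predicted == 0: a missed positive case
            fn += 1
    return tp, tn, fp, fn


# Made-up labels: 4 true positives cases, 6 true negative cases
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
false_negative_rate = fn / (fn + tp)
print(tp, tn, fp, fn, false_negative_rate)  # 2 5 1 2 0.5
```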

Question 9

In the lab's clustering dataset (Weight, Cholesterol, Gender), which feature dominated cluster assignments when the data was NOT standardized

Accepted Answer

Cholesterol, because it had the highest variance

Question 10

The Elbow method for selecting k plots which quantity on the y-axis?

Accepted Answer

Within-Cluster Sum of Squares (WCSS / inertia)
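
To make the y-axis quantity concrete, here is a sketch that computes WCSS for one fixed set of cluster assignments. The points, labels, and centroids are made up; in an Elbow plot this number would be recomputed for each candidate k.

```python
def wcss(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        total += sum((a - b) ** 2 for a, b in zip(point, centroids[label]))
    return total


points = [(1.0, 1.0), (1.0, 2.0), (5.0, 5.0), (6.0, 5.0)]  # made-up data

# k = 1: every point belongs to one cluster centered on the overall mean
wcss_k1 = wcss(points, [0, 0, 0, 0], [(3.25, 3.25)])

# k = 2: two tight clusters, each with its own centroid
wcss_k2 = wcss(points, [0, 0, 1, 1], [(1.0, 1.5), (5.5, 5.0)])

print(wcss_k1, wcss_k2)  # 33.5 1.0
```

The sharp drop from k = 1 to k = 2 is the kind of bend the Elbow method looks for.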

Question 11

In Lab 1 (Boston Housing), how many PCA components were needed to retain at least 90% of the variance in the 13-feature dataset?

Accepted Answer

7

Question 12

The explained variance ratio of a principal component represents

Accepted Answer

the proportion of total variance in the original data captured by that component
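
A sketch of how the ratio and the "retain at least 90%" rule relate. The per-component variances below are hypothetical and are not the lab's Boston Housing results.

```python
from itertools import accumulate

# Hypothetical variance captured by each principal component, largest first
component_variances = [4.0, 2.5, 1.5, 1.2, 0.5, 0.3]

total_variance = sum(component_variances)
explained_variance_ratio = [v / total_variance for v in component_variances]

# Running total of the ratios: variance retained by the first n components
cumulative = list(accumulate(explained_variance_ratio))

# Smallest number of components whose cumulative ratio reaches 90%
n_components_90 = next(i + 1 for i, c in enumerate(cumulative) if c >= 0.90)
print(n_components_90)  # 4
```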

Question 13

In TF-IDF vectorization, each column in the resulting feature matrix represents

Accepted Answer

A unique word (term) from the vocabulary, weighted by frequency and rarity
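
A minimal sketch of a TF-IDF matrix with one column per vocabulary term. It uses the textbook weighting tf * log(N / df); library implementations often use smoothed or normalized variants, so the numbers will not match them exactly. The three documents are made up.

```python
import math

docs = ["the cat sat", "the dog sat", "the cat ran home"]  # made-up corpus
tokenized = [d.split() for d in docs]

# One column per unique term, in alphabetical order
vocab = sorted({word for doc in tokenized for word in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = {word: sum(word in doc for doc in tokenized) for word in vocab}


def tfidf_row(doc):
    """Term frequency times inverse document frequency, one value per term."""
    return [
        (doc.count(word) / len(doc)) * math.log(n_docs / df[word])
        for word in vocab
    ]


matrix = [tfidf_row(doc) for doc in tokenized]
print(vocab)
print(len(matrix), len(matrix[0]))  # 3 documents, 6 term columns
```

"the" appears in every document, so its weight is 0 in every row, while "dog" appears in only one document and gets the largest weight in that row.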

Question 14

Cosine similarity between two vectors measures

Accepted Answer

the angle between the two vectors, with 1.0 meaning identical direction
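
A short sketch of the calculation, cos(theta) = (a . b) / (|a| |b|). The vectors are made up, and `cosine_similarity` here is a hand-written helper, not a library call.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction, different magnitude: similarity is (approximately) 1.0
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

# Perpendicular vectors: similarity is 0.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

print(round(same_direction, 6), orthogonal)
```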

Question 15

The all-MiniLM-L6-v2 embedding model used in Lab 3 produces vectors with how many dimensions

Accepted Answer

384

Question 16

Why does a plain LLM fail when asked about a private company's Q3 revenue

Accepted Answer

The document is not in the model's training data, so the model either refuses or fabricates a confident-sounding answer

Question 17

Place the six stages of a RAG pipeline in the correct order

Accepted Answer

Load --> Chunk --> Embed --> Store --> Retrieve --> Generate
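
The six stages can be sketched end to end with toy stand-ins. Everything here is an assumption made for illustration: the document text is made up, the "embedding" is a plain word-count vector (a real embedding model captures meaning, which word counts do not), the store is a Python list, and the Generate stage only builds the prompt that a real pipeline would send to an LLM.

```python
import math
import re
from collections import Counter


def load():
    # Stage 1, Load: a hard-coded, made-up document stands in for a real file
    return ("Acme's Q3 revenue was 12 million dollars. "
            "The company opened two new offices in Q3. "
            "Employee headcount grew to 250 people.")


def chunk(text):
    # Stage 2, Chunk: split into sentence-sized pieces
    return [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]


def embed(text):
    # Stage 3, Embed: toy word-count vector, NOT a real embedding model
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, store, top_k=1):
    # Stage 5, Retrieve: rank stored chunks by similarity to the query
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


def generate(query, context):
    # Stage 6, Generate: a real pipeline would send this prompt to an LLM
    return "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query


chunks = chunk(load())
store = [(c, embed(c)) for c in chunks]  # Stage 4, Store: (text, vector) pairs
query = "What was the Q3 revenue?"
top_chunks = retrieve(query, store)
prompt = generate(query, top_chunks)
print(prompt)
```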

Question 18

The embedding model (gemini-embedding-001) used in this lab returns

Accepted Answer

A 3,072-number vector that captures the meaning of the input text

Question 19

Why does cosine similarity on embeddings reliably retrieve the right document even when the query phrase does not appear verbatim in the corpus

Accepted Answer

Because embeddings encode meaning, so synonyms and paraphrases land near each other in vector space

Question 20

The two main types of supervised learning problems covered in Lab 1 are ____ (predicting a number) and _____ (predicting a category)

Accepted Answer

regression, classification

Question 21

In k-Means clustering, the algorithm repeats two steps until convergence: assigning each point to the nearest ___, and then recalculating the ___ of each cluster

Accepted Answer

centroid, centroid
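
The two repeated steps can be sketched in plain Python. This is a minimal illustration with made-up points and fixed starting centroids (so the result is deterministic), not a full k-Means implementation.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid updates until nothing changes."""
    for _ in range(max_iter):
        # Step 1: assign each point to the nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Step 2: recalculate each centroid as the mean of its members
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster as-is
        if new_centroids == centroids:  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centroids = kmeans(points, [(1.0, 1.0), (8.0, 8.0)])
print(labels)     # [0, 0, 0, 1, 1, 1]
print(centroids)
```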

Question 22

In the AI hierarchy, ____ AI extends generative AI by taking actions and autonomously completing multi-step goals

Accepted Answer

Agentic

Question 23

Deep learning is preferred over traditional ML when working with ________ data such as images, audio, and raw text

Accepted Answer

unstructured

Question 24

ColumnTransformer allows you to apply ______ to numeric features and ____ to categorical features within a single preprocessing step

Accepted Answer

transformers/scalers, encoders

Question 25

A false negative in the heart attack classifier means a patient who ___ have a second heart attack was predicted ____

Accepted Answer

did, healthy

Question 26

Logistic Regression outputs a value between 0 and 1 that represents the _____ of the positive class, which can be thresholded at 0.5 to produce a class label

Accepted Answer

probability
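
A sketch of the probability-then-threshold step. The input `z` stands for the model's linear score (intercept plus weighted features); the scores used below are made up, and treating a probability of exactly 0.5 as positive is a choice made for this example.

```python
import math


def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))


def predict_label(z, threshold=0.5):
    """Return 1 (positive class) when the probability reaches the threshold."""
    return int(sigmoid(z) >= threshold)


for score in (2.0, 0.0, -1.0):  # made-up linear scores
    print(score, round(sigmoid(score), 3), predict_label(score))
```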

Question 27

The Euclidean distance from (5,5) to (0,0), computed as the square root of (5^2 + 5^2), equals approximately ___

Accepted Answer

7.07
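
The arithmetic can be checked in a couple of lines, once by hand and once with the standard library's `math.dist`.

```python
import math

# By hand: square root of (5^2 + 5^2) = square root of 50
by_hand = math.sqrt(5 ** 2 + 5 ** 2)

# Same calculation with the standard-library helper
distance = math.dist((5, 5), (0, 0))

print(round(by_hand, 2), round(distance, 2))  # 7.07 7.07
```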

Question 28

To profile the clusters and understand what each group represents, we calculate the ____ of each feature within each cluster label

Accepted Answer

mean/centroid
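
A sketch of cluster profiling as a per-cluster mean of each feature. The rows and cluster labels are made up for illustration; the feature names loosely follow the lab's Weight and Cholesterol columns.

```python
from collections import defaultdict

# Made-up rows, each already tagged with a cluster label
rows = [
    {"cluster": 0, "weight": 150, "cholesterol": 180},
    {"cluster": 0, "weight": 160, "cholesterol": 200},
    {"cluster": 1, "weight": 200, "cholesterol": 240},
    {"cluster": 1, "weight": 220, "cholesterol": 260},
]
features = ["weight", "cholesterol"]

# Group the rows by cluster label
grouped = defaultdict(list)
for row in rows:
    grouped[row["cluster"]].append(row)

# Mean of each feature within each cluster
profile = {
    label: {f: sum(r[f] for r in members) / len(members) for f in features}
    for label, members in grouped.items()
}
print(profile)
```

Reading the profile row by row (for example, cluster 1 has the higher average weight and cholesterol) is what turns cluster numbers into describable groups.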

Question 29

PCA components are ordered so that the first component explains the _____ amount of variance, the second explains the next most, and so on

Accepted Answer

largest

Question 30

TF-IDF stands for Term Frequency - ___ Document Frequency

Accepted Answer

Inverse

Question 31

Unlike TF-IDF vectors, embedding vectors are ____ -- every dimension carries meaning and essentially no entries are zero

Accepted Answer

dense

Question 32

Cosine similarity measures the ____ between two vectors rather than their absolute distance, making it well-suited for comparing text embeddings of different magnitudes

Accepted Answer

angle

Question 33

The six stages of every RAG pipeline are: Load, ____, Embed, Store, Retrieve, and Generate

Accepted Answer

Chunk

Question 34

The embedding model gemini-embedding-001 produces vectors with _______ dimensions, so embedding 100 chunks yields a matrix of shape (100,____)

Accepted Answer

3072, 3072

Question 35

ML model workflow

Accepted Answer

Define the X and y variables, split off validation data, initialize the model, fit(), predict(), compare actual y with predicted y, deploy the model

Question 36

PCA is used for

Accepted Answer

numerical columns

Question 37

Embedding is used for

Accepted Answer

text data

Question 38

RAG stands for

Accepted Answer

Retrieval Augmented Generation

Question 39

RAGs are used for

Accepted Answer

Answering questions over private data the LLM was not trained on; retrieval uses embeddings

Question 40

LLM vs RAG

Accepted Answer

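A plain LLM answers from memory of its training data (mostly public internet text); RAG first retrieves from a provided knowledge base, like a smart search system, and then has the LLM answer from what was retrieved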

Question 41

What are the 3 main types of ML problems?

Accepted Answer

Regression (numerical output), Classification (categorical output), Clustering (no labeled output).

Question 42

What is regression?

Accepted Answer

A supervised learning method that predicts continuous numerical values.

Question 43

What is classification?

Accepted Answer

A supervised learning method that predicts categories or labels.

Question 44

What is clustering?

Accepted Answer

An unsupervised learning method that groups similar data points without labels.

Question 45

What is the relationship between AI, ML, and DL?

Accepted Answer

AI ⟶ ML ⟶ DL (DL is a subset of ML, ML is a subset of AI).

Question 46

What is ML best at?

Accepted Answer

Structured/tabular data.

Question 47

What is Deep Learning best at?

Accepted Answer

Unstructured data like images, text, and video.

Question 48

What is the order of AI evolution?

Accepted Answer

ML → DL → Generative AI → RAG → Agentic AI

Question 49

What is the goal of regression?

Accepted Answer

Minimize the sum of squared errors between actual and predicted values.

Question 50

Why do we split validation data?

Accepted Answer

To test model performance on unseen data and detect overfitting

Question 51

Why do we standardize numerical columns?

Accepted Answer

To put all features on the same scale so no variable dominates.

Question 52

What is multicollinearity?

Accepted Answer

When features are highly correlated with each other.

Question 53

How do you handle multicollinearity?

Accepted Answer

Remove redundant variables, use Lasso regression, or apply feature selection.

Question 54

What is Lasso regression?

Accepted Answer

A regression method that penalizes large coefficients and performs feature selection

Question 55

What is f_regression used for?

Accepted Answer

Feature selection by measuring the relationship between each feature and the target variable.

Question 56

What is ColumnTransformer?

Accepted Answer

A tool to apply different preprocessing steps to different column types.

Question 57

Why use ColumnTransformer?

Accepted Answer

To scale numerical data and encode categorical data correctly in one pipeline.

Question 58

What is a confusion matrix?

Accepted Answer

A table showing TP, TN, FP, FN results of a classification model.

Question 59

ML workflow steps?

Accepted Answer

1) Initialize model → 2) Fit → 3) Predict

Question 60

fit() vs transform()?

Accepted Answer

fit() learns parameters from the data; transform() applies the learned transformation.

Question 61

What is fit_transform()?

Accepted Answer

Fits to the data and then immediately transforms that same data.

Question 62

What is a pipeline in ML?

Accepted Answer

A sequence of preprocessing + modeling steps combined into one workflow.

Question 63

Name common classification models.

Accepted Answer

Decision Tree, KNN, Logistic Regression, Random Forest.

Question 64

What is the goal of clustering?

Accepted Answer

To group similar data points for profiling or pattern discovery.

MIS 665 Final