MIS 665 Final

Question Answer
The role of machine learning within the broader AI hierarchy Machine learning is a subset of AI that learns patterns from data
Example of unsupervised learning Grouping customers into segments based on purchasing behavior (no predicting)
R-Squared (coefficient of determination) measures: the proportion of variance in Y explained by the model
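The R-squared card above can be checked by hand. Below is a minimal sketch in plain Python, assuming the usual definition R^2 = 1 - SS_res / SS_tot. The function name `r_squared` and the sample numbers are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """Proportion of variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: what the model failed to explain.
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    # Total sum of squares: variation of y around its own mean.
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot


y_actual = [3.0, 5.0, 7.0, 9.0]          # illustrative values only
perfect = r_squared(y_actual, y_actual)   # every point predicted exactly
baseline = r_squared(y_actual, [6.0] * 4) # always predicting the mean of y
```

A model that predicts every point exactly scores 1, while a model that always predicts the mean of Y explains none of the variance and scores 0.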
Multicollinearity in regression refers to High correlation among the predictor variables themselves
When applying a preprocessing pipeline to scoring (deployment) data, you should use scaler.transform(X_score) to apply the scale learned from training data
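The idea behind calling only `transform` on scoring data can be sketched without any library. `TinyScaler` below is a hypothetical one-column stand-in written for this example, not the scaler used in the course labs; it assumes standardization with the population standard deviation.

```python
from statistics import mean, pstdev


class TinyScaler:
    """Hypothetical stand-in for a standard scaler (single numeric column)."""

    def fit(self, values):
        # Learn the scale from the data passed in (the training data).
        self.mean_ = mean(values)
        self.std_ = pstdev(values)
        return self

    def transform(self, values):
        # Apply the already-learned scale; nothing is re-learned here.
        return [(v - self.mean_) / self.std_ for v in values]


X_train = [10.0, 20.0, 30.0]  # illustrative training values
X_score = [20.0, 40.0]        # illustrative scoring (deployment) values

scaler = TinyScaler().fit(X_train)        # fit on training data only
train_scaled = scaler.transform(X_train)
score_scaled = scaler.transform(X_score)  # reuse the training mean and std
```

Refitting on `X_score` would give the scoring data its own mean and spread, so the same raw value would map to a different scaled value than the model saw during training.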
The primary reason for splitting data into training and test sets is to Evaluate how well the model generalizes to data it has not seen before
What does Lasso regularization do to coefficients of predictors that contribute little to the model It shrinks them toward zero, potentially to exactly zero
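Why Lasso can reach exactly zero is easiest to see in a special case. Assuming orthonormal predictors and the objective (1/2)·SSE + penalty·sum(|coefficients|), the Lasso solution is the least-squares coefficient passed through a soft-threshold. The sketch below shows only that special case, with made-up coefficients; it is not a general Lasso solver.

```python
def soft_threshold(coef, penalty):
    """Shrink coef toward zero by penalty; small coefficients become 0.0."""
    if coef > penalty:
        return coef - penalty
    if coef < -penalty:
        return coef + penalty
    return 0.0  # weak predictors are dropped entirely


# Illustrative least-squares coefficients (invented for this example).
ols_coefs = [3.0, -0.4, 0.1, -2.5]
lasso_coefs = [soft_threshold(c, penalty=0.5) for c in ols_coefs]
```

The two strong coefficients are shrunk by the penalty, while the two weak ones land on exactly zero, which is how Lasso performs feature selection.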
In healthcare classification, which metric is most critical to minimize when missing a positive case (e.g., a patient at risk) carries severe consequences False Negative Rate
In the lab's clustering dataset (Weight, Cholesterol, Gender), which feature dominated cluster assignments when the data was NOT standardized Cholesterol, because it had the highest variance
The Elbow method for selecting k plots which quantity on the y-axis? Within-Cluster Sum of Squares (WCSS / inertia)
In Lab 1 (Boston Housing), how many PCA components were needed to retain at least 90% of the variance in the 13-feature dataset? 7
The explained variance ratio of a principal component represents the proportion of total variance in the original data captured by that component
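Choosing how many components to keep comes down to a running total of explained variance ratios. A minimal sketch follows; the helper name `components_needed` is hypothetical, and the ratios are invented for illustration (they are not the Boston Housing lab values).

```python
from itertools import accumulate


def components_needed(ratios, target=0.90):
    """Smallest number of leading components whose ratios sum to >= target."""
    for count, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= target:
            return count
    return len(ratios)  # target not reachable: keep every component


# Made-up explained variance ratios, already sorted largest first.
ratios = [0.50, 0.25, 0.125, 0.0625, 0.0625]
k = components_needed(ratios, target=0.90)
```

With these invented ratios the running totals are 0.50, 0.75, 0.875, and 0.9375, so four components are the fewest that retain at least 90% of the variance.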
In TF-IDF vectorization, each column in the resulting feature matrix represents A unique word (term) from the vocabulary, weighted by frequency and rarity
Cosine similarity between two vectors measures the angle between the two vectors, with 1.0 meaning identical direction
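Cosine similarity is the dot product divided by the product of the two vector lengths. A small sketch in plain Python, with illustrative vectors:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (both non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction, different magnitude: the angle between them is zero.
same_direction = cosine_similarity([3.0, 4.0], [6.0, 8.0])
# Perpendicular vectors: nothing in common.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
```

Doubling a vector leaves its direction unchanged, so `same_direction` is 1.0 even though the second vector is twice as long; the perpendicular pair scores 0.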
The all-MiniLM-L6-v2 embedding model used in Lab 3 produces vectors with how many dimensions 384
Why does a plain LLM fail when asked about a private company's Q3 revenue The document is not in the model's training data, so the model either refuses or fabricates a confident-sounding answer
Place the six stages of a RAG pipeline in the correct order Load --> Chunk --> Embed --> Store --> Retrieve --> Generate
The embedding model (gemini-embedding-001) used in this lab returns A 3,072-number vector that captures the meaning of the input text
Why does cosine similarity on embeddings reliably retrieve the right document even when the query phrase does not appear verbatim in the corpus Because embeddings encode meaning, so synonyms and paraphrases land near each other in vector space
The two main types of supervised learning problems covered in Lab 1 are ____ (predicting a number) and _____ (predicting a category) regression, classification
In k-Means clustering, the algorithm repeats two steps until convergence: assigning each point to the nearest ___, and then recalculating the ___ of each cluster centroid, centroid
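The two repeating steps can be sketched in plain Python. This is a teaching sketch under stated assumptions: the starting centroids are picked by hand (real implementations usually initialize them randomly), the function name `kmeans` is hypothetical, and the points are made up. It also returns the within-cluster sum of squares that the Elbow method plots.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until centroids stop moving."""
    for _ in range(max_iter):
        # Step 1: assign each point to the nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 2: recalculate each centroid as the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    # Within-cluster sum of squares (WCSS / inertia).
    wcss = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, wcss


points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]  # illustrative data
labels, centroids, wcss = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

Running this for several values of k and plotting `wcss` against k gives the curve used by the Elbow method.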
In the AI hierarchy, ____ AI extends generative AI by taking actions and autonomously completing multi-step goals Agentic
Deep learning is preferred over traditional ML when working with ________ data such as images, audio, and raw text unstructured
ColumnTransformer allows you to apply ______ to numeric features and ____ to categorical features within a single preprocessing step transformers/scalers, encoders
A false negative in the heart attack classifier means a patient who ___ have a second heart attack was predicted ____ did, healthy
Logistic Regression outputs a value between 0 and 1 that represents the _____ of the positive class, which can be thresholded at 0.5 to produce a class label probability
The Euclidean distance from (5,5) to (0,0), computed as the square root of (5^2 + 5^2), equals approximately ___ 7.07
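The arithmetic on this card can be verified with the standard library. `math.dist` requires Python 3.8 or later.

```python
import math

# Straight-line distance between the two points on the card.
distance = math.dist((5, 5), (0, 0))

# The same calculation written out: sqrt(5^2 + 5^2) = sqrt(50).
by_hand = math.sqrt(5**2 + 5**2)

rounded = round(distance, 2)
```

Both routes give the square root of 50, which rounds to 7.07.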
To profile the clusters and understand what each group represents, we calculate the ____ of each feature within each cluster label mean/centroid
PCA components are ordered so that the first component explains the _____ amount of variance, the second explains the next most, and so on largest
TF-IDF stands for Term Frequency - ___ Document Frequency Inverse
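A from-scratch sketch of TF-IDF can make the "frequency and rarity" weighting concrete. It assumes the plain textbook formula tf × log(N / df); libraries typically use smoothed and normalized variants, so their numbers will differ. The function name `tfidf_matrix` and the three tiny documents are made up.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocabulary, rows) with one row per document, one column per term."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for words in tokenized for word in words})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(word for words in tokenized for word in set(words))
    rows = []
    for words in tokenized:
        counts = Counter(words)
        rows.append([
            (counts[term] / len(words)) * math.log(n_docs / doc_freq[term])
            for term in vocab
        ])
    return vocab, rows


docs = ["the cat sat", "the dog sat", "the dog barked"]  # illustrative corpus
vocab, matrix = tfidf_matrix(docs)

# Count zero entries to see how sparse the matrix is.
zero_count = sum(value == 0.0 for row in matrix for value in row)
```

The word "the" appears in every document, so its inverse document frequency is log(1) = 0 and its whole column is zero; rarer words such as "cat" get the highest weights. Even in this tiny corpus most entries are zero, and the effect grows with vocabulary size.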
Unlike TF-IDF, embedding vectors are ____ -- every dimension carries meaning and few, if any, entries are zero dense
Cosine similarity measures the ____ between two vectors rather than their absolute distance, making it well-suited for comparing embeddings of texts with different lengths angle
The six stages of every RAG pipeline are: Load, ____, Embed, Store, Retrieve, and Generate Chunk
The embedding model gemini-embedding-001 produces vectors with _______ dimensions, so embedding 100 chunks yields a matrix of shape (100,____) 3072, 3072
ML model workflow Define the X and y variables, split off validation data, initialize the model, fit(), predict(), compare actual y with predicted y, deploy the model
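The workflow steps can be walked through end to end with a toy model. `SimpleLinearModel` is a hypothetical one-feature least-squares class written for this sketch, and the data is invented and noise-free; a real project would use a library model and a shuffled split.

```python
from statistics import mean


class SimpleLinearModel:
    """Hypothetical one-feature least-squares model with fit/predict."""

    def fit(self, x, y):
        x_bar, y_bar = mean(x), mean(y)
        covariance = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        variance = sum((xi - x_bar) ** 2 for xi in x)
        self.slope_ = covariance / variance
        self.intercept_ = y_bar - self.slope_ * x_bar
        return self

    def predict(self, x):
        return [self.intercept_ + self.slope_ * xi for xi in x]


# 1) Define X and y (made-up data following y = 2x + 1).
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2 * xi + 1 for xi in X]

# 2) Split off validation data (last two rows here, for simplicity).
X_train, X_val = X[:6], X[6:]
y_train, y_val = y[:6], y[6:]

# 3) Initialize, 4) fit, 5) predict.
model = SimpleLinearModel()
model.fit(X_train, y_train)
y_pred = model.predict(X_val)

# 6) Compare actual y with predicted y on the held-out rows.
abs_errors = [abs(actual - pred) for actual, pred in zip(y_val, y_pred)]
```

Because the invented data has no noise, the validation errors are essentially zero; with real data this comparison is what tells you whether the model is ready to deploy.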
PCA is used for numerical columns
Embedding is used for text data
RAG stands for Retrieval Augmented Generation
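RAG is used for private data and relies on embeddings
LLM vs RAG A plain LLM answers from memory of its training data (largely public internet text); RAG first retrieves from a provided knowledge base, like a smart search system, and then generates an answer from what it found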
What are the 3 main types of ML problems? Regression (numerical output), Classification (categorical output), Clustering (no labeled output).
What is regression? A supervised learning method that predicts continuous numerical values.
What is classification? A supervised learning method that predicts categories or labels.
What is clustering? An unsupervised learning method that groups similar data points without labels.
What is the relationship between AI, ML, and DL? AI ⟶ ML ⟶ DL (DL is a subset of ML, ML is a subset of AI).
What is ML best at? Structured/tabular data.
What is Deep Learning best at? Unstructured data like images, text, and video.
What is the order of AI evolution? ML → DL → Generative AI → RAG → Agentic AI
What is the goal of regression? Minimize sum of squared errors between actual and predicted values.
Why do we split off validation data? To test model performance on unseen data and detect overfitting
Why do we standardize numerical columns? To put all features on the same scale so no variable dominates.
What is multicollinearity? When features are highly correlated with each other.
How do you handle multicollinearity? Remove variables, use Lasso regression, or feature selection.
What is Lasso regression? A regression method that penalizes large coefficients and performs feature selection
What is f_regression used for? Feature selection by measuring relationship between features and target variable.
What is ColumnTransformer? A tool to apply different preprocessing steps to different column types.
Why use ColumnTransformer? To scale numerical data and encode categorical data correctly in one pipeline.
What is a confusion matrix? A table showing TP, TN, FP, FN results of a classification model.
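The four cells of a confusion matrix, and the false negative rate built from them, can be counted directly. A minimal sketch with invented labels (1 = positive case); the helper name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary classifier."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, really positive
            else:
                fp += 1  # predicted positive, really negative
        else:
            if actual == positive:
                fn += 1  # missed positive case
            else:
                tn += 1  # predicted negative, really negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Made-up labels for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
# False negative rate: share of real positives the model missed.
false_negative_rate = counts["FN"] / (counts["FN"] + counts["TP"])
```

Here two of the four real positives were missed, so the false negative rate is 0.5. In a setting where a missed positive is costly, this is the number to drive down.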
ML workflow steps? 1) Initialize model → 2) Fit → 3) Predict
fit() vs transform()? fit learns patterns; transform applies learned transformation.
What is fit_transform()? Fits data and then immediately transforms it.
What is a pipeline in ML? A sequence of preprocessing + modeling steps combined into one workflow.
Name common classification models. Decision Tree, KNN, Logistic Regression, Random Forest.
What is the goal of clustering? To group similar data points for profiling or pattern discovery.
What is Euclidean distance? The straight-line distance between two points.
What is the elbow method? A technique to choose the optimal number of clusters (K).
What is PCA (Principal Component Analysis)? A method that reduces features while keeping most variance
Why standardize before PCA? Because PCA is sensitive to scale.
Why is TF-IDF a sparse matrix? Because most entries are zero: any single document contains only a small fraction of the words in the vocabulary.
What is chunking in RAG? Splitting documents into smaller sections (like paragraphs).
What does retrieval do in RAG? Finds the most relevant chunks (top-k) based on similarity.
What does the LLM do in RAG? Generates a final response using retrieved information.
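The six stages (Load, Chunk, Embed, Store, Retrieve, Generate) can be traced in a toy pipeline. Everything here is a stand-in: the document text is invented, `embed` uses simple word counts rather than an embedding model such as gemini-embedding-001, and `generate` only builds the prompt that a real pipeline would send to an LLM.

```python
import math
from collections import Counter


def load():
    # Load: an invented in-memory document (a real pipeline reads files).
    return ("Example Corp Q3 revenue was 12 million dollars.\n\n"
            "Example Corp opened a new office in Denver.\n\n"
            "The cafeteria serves lunch at noon.")


def chunk(text):
    # Chunk: split into paragraph-sized pieces.
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def embed(text):
    # Embed: toy word-count vector; NOT a real embedding model.
    return Counter(text.lower().replace(".", "").replace("?", "").split())


def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, store, top_k=1):
    # Retrieve: rank stored chunks by similarity to the query, keep top-k.
    query_vec = embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:top_k]]


def generate(query, context):
    # Generate: stub that builds the prompt a real LLM would receive.
    return f"Context: {' '.join(context)}\nQuestion: {query}"


chunks = chunk(load())
store = [(c, embed(c)) for c in chunks]  # Store: keep text with its vector
question = "What was the Q3 revenue?"
top = retrieve(question, store, top_k=1)
prompt = generate(question, top)
```

Because the toy `embed` only counts shared words, it cannot match synonyms or paraphrases; a real embedding model is what lets retrieval work when the query wording does not appear in the corpus.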
**Phase 1** Data Capture
**Phase 2** Data Preparation & Transformation
**Phase 3** Descriptive Analytics / Predictive Analytics
Does ML use unstructured or structured data? Structured
Does DL use unstructured or structured data? Unstructured
Created by: lexi.welte