MIS 665 Final2SS

Terms from Summary Sheets

Term	Definition
With similar R^2, do _more_ or _fewer_ features "win"?	Fewer
What do we fit data on?	Train data (NEVER test)
What do we check before modeling to catch multicollinearity?	Correlation heatmap
Is a lower or higher MSE better?	Lower
Higher R^2 means	more variance explained
StandardScaler changes ___, NOT ___	coefficient scale, model accuracy
Parsimony	fewer features with similar R^2 is preferred
Lasso (alpha=1)	automatic feature selection via L1 penalty
DATA LEAKAGE	never fit_transform on full X --> Split first!
KNN requires ___	StandardScaler (uses distance)
Logistic Regression gives ____, not just class labels	probabilities
SelectKBest: fit on training data only	leakage rule
ROC AUC: 0.5 is ___ guess, 1.0 is ___ guess	random, perfect
Use _____ for reliable accuracy	cross_val_score(cv=10)
What avoids the dummy variable trap in OneHotEncoder?	drop='first'
ALWAYS do what before clustering?	standardize
______ features dominate distance	high-variance
Is clustering supervised or unsupervised?	unsupervised (NO Y)
Silhouette score of ___+ is strong, ____ is reasonable	0.71+, 0.51-0.70
K-Means++ init reduces	sensitivity to random start
Profile clusters on _________ data for meaning	original (unscaled)
PCA is ____-based, MUST scale first	variance
Fit PCA on ___ data only, transform both separately	training
n_components=0.90	auto-selects fewest PCs for 90% variance
PCA replaces feature names	use loadings to interpret
Embeddings understand ___	synonyms
Pipeline:	represent text > reduce dims > build classifier
A silhouette score of 0.50 generally indicates	moderately well-separated clusters
In healthcare, which metric is most important?	Recall (the share of actual positives that were caught)
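
Several of the cards above (split first, fit on training data only, scale before PCA, n_components=0.90, cross_val_score(cv=10), probabilities from logistic regression) can be tied together in one small scikit-learn sketch. The synthetic data from make_classification, the step names, and the specific settings (test_size, random_state, max_iter) are assumptions made for illustration and are not taken from the summary sheets.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; any numeric X and binary y would do).
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=5, random_state=0
)

# Split FIRST so the test rows never influence any fitted step.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Putting the steps in a Pipeline means the scaler and PCA are re-fit
# on the training portion of each cross-validation fold only.
pipe = Pipeline([
    ("scale", StandardScaler()),           # PCA is variance-based: scale first
    ("pca", PCA(n_components=0.90)),       # fewest PCs reaching 90% variance
    ("clf", LogisticRegression(max_iter=1000)),
])

# 10-fold cross-validated accuracy, computed on the training data only.
cv_scores = cross_val_score(pipe, X_train, y_train, cv=10)

# Fit on the training data, then only transform/predict on the test data.
pipe.fit(X_train, y_train)
n_pcs = pipe.named_steps["pca"].n_components_
test_proba = pipe.predict_proba(X_test)[:, 1]   # probabilities, not just labels
test_accuracy = pipe.score(X_test, y_test)

print("mean CV accuracy:", cv_scores.mean())
print("PCs kept:", n_pcs)
print("test accuracy:", test_accuracy)
```

Calling fit_transform on the full X before splitting would let test-set statistics leak into the scaler and PCA; wrapping the steps in a Pipeline is one common way to avoid that.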

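The clustering cards (standardize first, k-means++ init, silhouette score, profile clusters on the original unscaled data) can be sketched the same way. The make_blobs data, the choice of three clusters, and the artificially inflated second feature are assumptions for illustration only. There is no train/test split here because the sketch clusters a single unlabeled data set.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption); the true labels are discarded
# because clustering is unsupervised.
X_raw, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)
X_raw = X_raw.copy()
X_raw[:, 1] *= 100  # inflate one feature so it would dominate raw distances

# Standardize before clustering so no high-variance feature dominates distance.
X_scaled = StandardScaler().fit_transform(X_raw)

# k-means++ initialization reduces sensitivity to the random start.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X_scaled)

# Silhouette is computed on the same scaled data the model clustered.
sil = silhouette_score(X_scaled, labels)

# Profile clusters on the ORIGINAL (unscaled) values so the means are readable.
profile = np.array([X_raw[labels == k].mean(axis=0) for k in range(3)])

print("silhouette:", sil)
print("cluster means in original units:")
print(profile)
```
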
Created by: lexi.welte