MIDA

Question / Answer
What does "machine learning" mean in formal terms? It means constructing algorithms that improve automatically through experience. Formally, given a hypothesis set H, a learning algorithm selects a hypothesis h ∈ H that approximates the target function f, using data D.
What is supervised learning? A paradigm where the algorithm learns from a dataset of input-output pairs (x, y). The goal is to approximate the function mapping inputs to outputs, so it can predict the label y for an unseen input x.
What are two main categories of supervised learning problems? Classification: Predicting a discrete label (e.g., spam vs. not spam). Regression: Predicting a continuous value (e.g., house prices).
State Hoeffding’s inequality. For independent random variables X_1, …, X_N bounded in [0, 1], the probability that their empirical mean μ̂ deviates from the true mean μ by more than ε is bounded by P(|μ̂ − μ| > ε) ≤ 2 exp(−2Nε²).
Why is Hoeffding’s inequality useful in machine learning? It quantifies how likely it is that the training error (empirical error) differs from the true error (generalization error). With more samples N, the probability of large deviations decreases exponentially.
How does Hoeffding’s inequality relate to overfitting? It guarantees that with enough data, the empirical error approximates the true error well. If N is small, the inequality allows for larger deviations, increasing the risk of overfitting.
In the PAC (Probably Approximately Correct) framework, how does Hoeffding’s inequality apply? It provides the probabilistic bound (“probably”) that the hypothesis chosen has error close to the target function’s error (“approximately correct”).
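As a quick numerical sketch, the Hoeffding bound 2 exp(−2Nε²) can be evaluated directly to see how fast it tightens with the sample size (the ε and N values below are arbitrary illustrations):

```python
import math

def hoeffding_bound(n, eps):
    """Upper bound on P(|empirical mean - true mean| > eps)
    for n independent samples bounded in [0, 1]."""
    return 2 * math.exp(-2 * n * eps ** 2)

# The bound decays exponentially in N; for small N it can
# exceed 1, in which case it is vacuous.
for n in (100, 1000, 10000):
    print(n, hoeffding_bound(n, eps=0.05))
```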
What is the hypothesis function in linear regression? h_w(x) = w_0 + w_1 x_1 + w_2 x_2 + ⋯ + w_d x_d = wᵀx (with the convention x_0 = 1).
What is the objective of linear regression? To minimize the difference between predictions h_w(x) and true values y, usually by minimizing the Mean Squared Error (MSE).
Write the MSE cost function. E(w) = (1/N) Σ_{i=1}^N (h_w(x^(i)) − y^(i))².
How are the optimal weights w found in closed form? By solving the normal equation: w* = (XᵀX)⁻¹Xᵀy, where X is the design matrix of inputs and y the vector of outputs.
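The normal equation can be applied in a few lines of NumPy; the data below is synthetic and purely illustrative, generated from known weights so the recovered solution can be checked:

```python
import numpy as np

# Hypothetical data: y = 3 + 2*x1 - x2 plus a little noise.
rng = np.random.default_rng(0)
X_raw = rng.uniform(-1, 1, size=(50, 2))
y = 3.0 + 2.0 * X_raw[:, 0] - 1.0 * X_raw[:, 1] + 0.01 * rng.standard_normal(50)

# Design matrix with a leading column of ones for the bias w0.
X = np.column_stack([np.ones(len(X_raw)), X_raw])

# Normal equation: w* = (X^T X)^{-1} X^T y.
# np.linalg.solve is preferred over explicitly inverting X^T X.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)  # close to [3, 2, -1]
```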
What is a key limitation of linear regression? It assumes a linear relationship between inputs and outputs. It can underfit when the true relationship is nonlinear.
How does regularization modify linear regression? It adds a penalty term to the cost function (e.g., Ridge regression adds λ‖w‖²) to prevent overfitting and control the complexity of the model.
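Ridge regression also has a closed form, obtained by adding λI inside the normal equation. A minimal sketch, on hypothetical noise-free data with known weights (note it penalizes the bias too, which many libraries avoid):

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y.
    Sketch only: the bias is penalized along with the other weights."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Hypothetical data with true weights [2, -1, 0.5].
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 3))
y = X @ np.array([2.0, -1.0, 0.5])

w_small = ridge_weights(X, y, lam=0.01)
w_large = ridge_weights(X, y, lam=100.0)
# Heavier regularization shrinks the weight vector toward zero.
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```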
Why can’t we use linear regression to model a binary outcome? Linear regression can predict values <0 or >1, which aren’t valid probabilities. Also, the error term is non-normal and heteroscedastic, violating linear regression assumptions. Logistic regression models the log-odds and bounds predictions to (0, 1).
Explain the connection between odds, probability, and log-odds. Probability: p = P(Y=1); Odds: odds = p/(1-p); Log-odds (logit): logit(p) = ln(p/(1-p)). Logistic regression models log-odds linearly as a function of predictors.
How do you interpret a coefficient β_i in logistic regression? β_i = change in log-odds per one-unit increase in x_i. Odds ratio: e^β_i = multiplicative change in odds for one-unit increase. Example: β_i=0.7 ⇒ e^0.7≈2 → odds double per unit increase.
What is the maximum likelihood estimation (MLE) in logistic regression? MLE finds coefficients β that maximize the probability of observing the data: L(β) = ∏ ŷ_i^{y_i} (1-ŷ_i)^{1-y_i}. Equivalent to minimizing log loss. No closed-form solution; solved via iterative optimization (e.g., Newton-Raphson, gradient descent).
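Minimizing the log loss by gradient descent can be sketched from scratch; the data below is hypothetical, generated with true log-odds 1 + 2x so the fitted slope has a known sign:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Minimize the log loss by batch gradient descent. The gradient
    of the average negative log-likelihood is X^T (sigmoid(Xb) - y) / n.
    Didactic sketch; production solvers use Newton-type methods."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        beta -= lr * X.T @ (sigmoid(X @ beta) - y) / len(y)
    return beta

# Hypothetical data with true log-odds 1 + 2x.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = (rng.uniform(size=200) < sigmoid(1.0 + 2.0 * x)).astype(float)
X = np.column_stack([np.ones(200), x])

beta_hat = fit_logistic(X, y)
print(beta_hat)  # estimates of the true coefficients (1, 2)
```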
How does logistic regression handle non-linear relationships? Logistic regression assumes linearity between predictors and log-odds. Non-linear effects can be modeled by adding polynomial terms (x²), interaction terms (x₁*x₂), or splines.
What problems arise with perfect separation? Perfect separation occurs when a predictor perfectly predicts the outcome. MLE doesn’t converge (coefficients → ∞). Solution: use regularization (L1/L2) or Bayesian logistic regression.
Explain the difference between logit, probit, and complementary log-log models. Logit: ln(p/(1-p)), logistic error distribution. Probit: Φ⁻¹(p), normal error distribution. Complementary log-log: ln(-ln(1-p)), asymmetric, often for rare events. All link functions map probabilities to real line.
How do you evaluate model fit in logistic regression? Deviance (compare to null model), Pseudo-R² (McFadden’s), ROC-AUC (discrimination), Calibration (how predicted probabilities match observed outcomes).
How do regularization techniques affect logistic regression? L1 (Lasso): sparsity/feature selection. L2 (Ridge): coefficient shrinkage, reduces overfitting. Elastic Net: combination of L1 + L2. Useful with correlated predictors or small data.
What is the decision boundary in logistic regression? The hyperplane where predicted probability = threshold (usually 0.5): β₀ + β₁x₁ + ... + β_n x_n = 0. Linear in feature space; can be made non-linear with feature transformations.
What is the bias–variance tradeoff in model selection? It refers to the balance between model complexity and generalization. High bias leads to underfitting; high variance leads to overfitting. The goal is to minimize total expected prediction error.
How does cross-validation help in model selection? Cross-validation partitions data into training and validation subsets multiple times to estimate generalization error, helping choose models that perform well on unseen data.
Compare k-fold cross-validation and leave-one-out cross-validation (LOOCV). k-fold splits data into k groups, balancing bias and variance in error estimates; LOOCV uses n folds (one per observation), reducing bias but increasing variance and computation.
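The fold bookkeeping behind k-fold cross-validation can be sketched in plain Python (real implementations shuffle the indices first; this version keeps them contiguous for clarity):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds; the first
    n % k folds get one extra element so all n points are used."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = k_fold_indices(10, 3)
print(folds)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Each fold serves once as the validation set; the rest train.
for val in folds:
    train = [j for f in folds if f is not val for j in f]
```

With k = n this degenerates to LOOCV: every fold holds a single observation.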
What is the purpose of regularization in model selection? Regularization penalizes model complexity (e.g., large coefficients) to prevent overfitting, guiding selection toward simpler, more generalizable models.
Define the Akaike Information Criterion (AIC) and its role in model selection. AIC = 2k - 2ln(L), where k is the number of parameters and L the likelihood. It balances fit quality and complexity — lower AIC indicates a preferred model.
How does the Bayesian Information Criterion (BIC) differ from AIC? BIC = k ln(n) - 2ln(L); it penalizes complexity more heavily, especially for large samples, often favoring simpler models.
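Both criteria are one-line formulas; the log-likelihood values below are hypothetical, chosen so the larger model fits slightly better and the penalties decide:

```python
import math

def aic(k, log_l):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * log_l

def bic(k, n, log_l):
    """Bayesian Information Criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2 * log_l

# Hypothetical fits: 5 parameters buy only one unit of log-likelihood.
print(aic(k=3, log_l=-100.0), aic(k=5, log_l=-99.0))
print(bic(k=3, n=500, log_l=-100.0), bic(k=5, n=500, log_l=-99.0))
# Once n > e^2 ≈ 7.4, BIC's per-parameter penalty ln(n) exceeds
# AIC's constant 2, so BIC favors the simpler model more strongly.
```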
What is the concept of model evidence in Bayesian model selection? Model evidence integrates over parameter uncertainties, quantifying how well the model explains the data while accounting for model complexity.
Explain the concept of overfitting in the context of model selection. Overfitting occurs when a model fits training data noise, leading to poor generalization. It often results from selecting a model solely based on training performance.
Why can test set performance not be used for iterative model selection? Repeatedly using the test set biases model choice toward that specific test data, invalidating it as an unbiased measure of generalization.
What is nested cross-validation and why is it used? Nested cross-validation has an inner loop for model selection and an outer loop for performance estimation, preventing information leakage and providing an unbiased assessment of model performance.
When is a model identifiable? A model is identifiable if distinct parameter values produce distinct model outputs; non-identifiability can make estimation ambiguous.
Describe how the Gauss-Newton method is used in nonlinear least squares. It approximates the Hessian by JᵀJ (where J is the Jacobian) to iteratively update parameter estimates without computing second derivatives.
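In the single-parameter case the update a ← a − (JᵀJ)⁻¹Jᵀr reduces to scalar arithmetic. A sketch on a hypothetical exponential model y = exp(a·x) with noise-free data:

```python
import math

# Hypothetical noise-free data from y = exp(0.7 * x).
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [math.exp(0.7 * x) for x in xs]

a = 0.1  # initial guess; Gauss-Newton is sensitive to this
for _ in range(20):
    r = [math.exp(a * x) - y for x, y in zip(xs, ys)]  # residuals
    J = [x * math.exp(a * x) for x in xs]              # d r_i / d a
    # Update: a -= (J^T J)^{-1} J^T r  (scalars in the 1-D case)
    a -= sum(j * ri for j, ri in zip(J, r)) / sum(j * j for j in J)
print(a)  # converges to 0.7
```

Because the residuals vanish at the solution, JᵀJ matches the true Hessian there and convergence is locally quadratic.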
What is the role of Jacobian and Hessian matrices in nonlinear regression? The Jacobian represents first derivatives of residuals with respect to parameters; the Hessian contains second derivatives, informing curvature and convergence behavior.
Why are transformations (like log or Box–Cox) sometimes applied in nonlinear modeling? To linearize relationships, stabilize variance, or make model assumptions (e.g., normality) more appropriate.
What are the main challenges in estimating parameters for nonlinear models? Non-convex optimization, sensitivity to initial values, potential for local minima, and complex error surfaces.
What defines a nonparametric classifier? A classifier that does not assume a specific parametric form for the data distribution, relying instead on data-driven structure.
Explain how k-nearest neighbors (k-NN) classification works. It assigns a class based on the majority label among the k closest points in the feature space, using a distance metric like Euclidean distance.
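The voting procedure fits in a few lines of plain Python; the tiny two-cluster dataset below is hypothetical, and the scan is the naive O(n)-per-query version:

```python
from collections import Counter

def knn_predict(train, query, k):
    """Classify `query` by majority vote among the k nearest training
    points under squared Euclidean distance. `train` is a list of
    ((features...), label) pairs. Didactic sketch, O(n) per query."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in train
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Two hypothetical clusters: "A" near the origin, "B" near (5, 5).
train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(knn_predict(train, (1, 1), k=3))  # A
print(knn_predict(train, (5, 5), k=3))  # B
```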
What is the curse of dimensionality in nonparametric methods? As dimensionality increases, data becomes sparse, and distances lose discriminatory power, degrading performance of methods like k-NN.
How does kernel density estimation (KDE) relate to nonparametric classification? KDE estimates class-conditional densities, which can be used in Bayes’ rule for classification without assuming a parametric distribution.
What is the bandwidth parameter in KDE and why is it crucial? It controls the smoothness of the estimated density; too small leads to overfitting, too large to oversmoothing.
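The bandwidth effect is easy to see with a hand-rolled Gaussian KDE, f̂(x) = (1/(n·h)) Σ K((x − x_i)/h); the five data points are arbitrary illustrations:

```python
import math

def gaussian_kde(data, x, h):
    """Kernel density estimate at x with a Gaussian kernel and
    bandwidth h: (1 / (n*h)) * sum_i K((x - x_i) / h)."""
    n = len(data)
    return sum(
        math.exp(-0.5 * ((x - xi) / h) ** 2) / math.sqrt(2 * math.pi)
        for xi in data
    ) / (n * h)

data = [-1.0, -0.5, 0.0, 0.5, 1.0]
# A tiny bandwidth yields a spiky estimate concentrated at the
# samples; a large one oversmooths toward a flat density.
print(gaussian_kde(data, 0.0, h=0.05))
print(gaussian_kde(data, 0.0, h=2.0))
```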
How does decision tree classification qualify as nonparametric? It partitions the feature space recursively without assuming any parametric model of the data distribution.
Compare bias and variance trade-offs in nonparametric vs. parametric classifiers. Nonparametric methods have low bias but high variance; parametric methods have higher bias but lower variance.
What role does distance metric choice play in nonparametric classification? The distance metric determines neighborhood relationships and strongly influences classification performance, especially in heterogeneous feature spaces.
What differentiates convex and non-convex optimization problems? Convex problems have a single global minimum; non-convex problems can have multiple local minima and saddle points.
What is the difference between gradient descent and Newton’s method? Gradient descent uses first-order derivatives to move toward minima; Newton’s method uses second-order curvature information for faster (but costlier) convergence.
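The contrast is sharpest on a one-dimensional quadratic, f(x) = (x − 3)², where Newton's step x − f′(x)/f″(x) lands on the minimum immediately (the step size 0.1 and iteration count are arbitrary choices for the sketch):

```python
def f_prime(x):       # derivative of f(x) = (x - 3)^2
    return 2 * (x - 3)

def f_second(x):      # constant second derivative
    return 2.0

# Gradient descent: many small first-order steps toward the minimum.
x_gd = 0.0
for _ in range(50):
    x_gd -= 0.1 * f_prime(x_gd)

# Newton's method: divides by the curvature; for a quadratic it
# reaches the minimum in a single step.
x_newton = 0.0 - f_prime(0.0) / f_second(0.0)
print(x_gd, x_newton)  # x_gd ≈ 3, x_newton = 3.0
```

On non-quadratic objectives Newton's method still converges faster near a minimum, but each step requires the (expensive) second derivatives the card mentions.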
Define a saddle point in the context of optimization. A point where the gradient is zero but the Hessian has both positive and negative eigenvalues — neither a local minimum nor maximum.
Why is line search used in iterative optimization algorithms? To determine an optimal step size that sufficiently decreases the objective function while maintaining convergence stability.
Explain the role of Lagrange multipliers in constrained nonlinear optimization. They transform constrained problems into unconstrained ones by incorporating constraints into the objective using multipliers.
What is the difference between global and local optimization algorithms? Local methods (e.g., gradient descent) find nearby minima; global methods (e.g., simulated annealing, genetic algorithms) explore to avoid local minima traps.
What is the Karush–Kuhn–Tucker (KKT) condition? A generalization of Lagrange conditions providing necessary (and sometimes sufficient) optimality conditions for constrained nonlinear problems.
Describe the purpose of quasi-Newton methods. They approximate the Hessian matrix using gradient information to achieve faster convergence than gradient descent without full second derivatives.
What is the difference between deterministic and stochastic optimization methods? Deterministic methods follow a fixed update rule; stochastic methods incorporate randomness (e.g., stochastic gradient descent) to escape local minima.
What is a trust-region method in nonlinear optimization? It optimizes a local model of the objective function within a region around the current point, adjusting region size based on model accuracy.
Created by: Filotì