MIDA
| Question | Answer |
|---|---|
| What does "machine learning" mean in formal terms? | It means constructing algorithms that improve automatically through experience. Formally, given a hypothesis set H, a learning algorithm selects a hypothesis h ∈ H that approximates the target function f, using data D. |
| What is supervised learning? | A paradigm where the algorithm learns from a dataset of input-output pairs (x, y). The goal is to approximate the function mapping inputs to outputs, so it can predict the label y for an unseen input x. |
| What are two main categories of supervised learning problems? | Classification: Predicting a discrete label (e.g., spam vs. not spam). Regression: Predicting a continuous value (e.g., house prices). |
| State Hoeffding’s inequality. | For independent random variables X₁, …, X_N bounded in [0, 1], the probability that their empirical mean μ̂ deviates from the true mean μ by more than ε is bounded by P(\|μ̂ − μ\| > ε) ≤ 2 exp(−2ε²N). |
| Why is Hoeffding’s inequality useful in machine learning? | It quantifies how likely it is that the training error (empirical error) differs from the true error (generalization error). With more samples N, the probability of large deviations decreases exponentially. |
| How does Hoeffding’s inequality relate to overfitting? | It guarantees that with enough data, the empirical error approximates the true error well. If N is small, the inequality allows for larger deviations, increasing the risk of overfitting. |
| In the PAC (Probably Approximately Correct) framework, how does Hoeffding’s inequality apply? | It provides the probabilistic bound (“probably”) that the hypothesis chosen has error close to the target function’s error (“approximately correct”). |
| What is the hypothesis function in linear regression? | h_w(x) = w₀ + w₁x₁ + w₂x₂ + ⋯ + w_d x_d = wᵀx. |
| What is the objective of linear regression? | To minimize the difference between predictions h_w(x) and true values y, usually by minimizing the Mean Squared Error (MSE). |
| Write the MSE cost function. | E(w) = (1/N) Σ_{i=1}^{N} (h_w(x^(i)) − y^(i))². |
| How are the optimal weights w found in closed form? | By solving the normal equation: w* = (XᵀX)⁻¹Xᵀy, where X is the design matrix of inputs and y the vector of outputs. |
| What is a key limitation of linear regression? | It assumes a linear relationship between inputs and outputs. It can underfit when the true relationship is nonlinear. |
| How does regularization modify linear regression? | It adds a penalty term to the cost function (e.g., Ridge regression adds λ‖w‖²) to prevent overfitting and control the complexity of the model. |
| Why can’t we use linear regression to model a binary outcome? | Linear regression can predict values <0 or >1, which aren’t valid probabilities. Also, with a binary outcome the error term is non-normal and heteroscedastic, violating linear regression assumptions. Logistic regression models the log-odds and bounds predictions to (0, 1) via the sigmoid. |
| Explain the connection between odds, probability, and log-odds. | Probability: p = P(Y=1); Odds: odds = p/(1-p); Log-odds (logit): logit(p) = ln(p/(1-p)). Logistic regression models log-odds linearly as a function of predictors. |
| How do you interpret a coefficient β_i in logistic regression? | β_i = change in log-odds per one-unit increase in x_i. Odds ratio: e^β_i = multiplicative change in odds for one-unit increase. Example: β_i=0.7 ⇒ e^0.7≈2 → odds double per unit increase. |
| What is the maximum likelihood estimation (MLE) in logistic regression? | MLE finds coefficients β that maximize the probability of observing the data: L(β) = ∏ ŷ_i^{y_i} (1−ŷ_i)^{1−y_i}. Equivalent to minimizing log loss. No closed-form solution; solved via iterative optimization (e.g., Newton-Raphson, gradient descent). |
| How does logistic regression handle non-linear relationships? | Logistic regression assumes linearity between predictors and log-odds. Non-linear effects can be modeled by adding polynomial terms (x²), interaction terms (x₁*x₂), or splines. |
| What problems arise with perfect separation? | Perfect separation occurs when a predictor perfectly predicts the outcome. MLE doesn’t converge (coefficients → ∞). Solution: use regularization (L1/L2) or Bayesian logistic regression. |
| Explain the difference between logit, probit, and complementary log-log models. | Logit: ln(p/(1-p)), logistic error distribution. Probit: Φ⁻¹(p), normal error distribution. Complementary log-log: ln(-ln(1-p)), asymmetric, often for rare events. All link functions map probabilities to the real line. |
| How do you evaluate model fit in logistic regression? | Deviance (compare to null model), Pseudo-R² (McFadden’s), ROC-AUC (discrimination), Calibration (how predicted probabilities match observed outcomes). |
| How do regularization techniques affect logistic regression? | L1 (Lasso): sparsity/feature selection. L2 (Ridge): coefficient shrinkage, reduces overfitting. Elastic Net: combination of L1 + L2. Useful with correlated predictors or small data. |
| What is the decision boundary in logistic regression? | The hyperplane where predicted probability = threshold (usually 0.5): β₀ + β₁x₁ + ... + β_n x_n = 0. Linear in feature space; can be made non-linear with feature transformations. |
| What is the bias–variance tradeoff in model selection? | It refers to the balance between model complexity and generalization. High bias leads to underfitting; high variance leads to overfitting. The goal is to minimize total expected prediction error. |
| How does cross-validation help in model selection? | Cross-validation partitions data into training and validation subsets multiple times to estimate generalization error, helping choose models that perform well on unseen data. |
| Compare k-fold cross-validation and leave-one-out cross-validation (LOOCV). | k-fold splits data into k groups, balancing bias and variance in error estimates; LOOCV uses n folds (one per observation), reducing bias but increasing variance and computation. |
| What is the purpose of regularization in model selection? | Regularization penalizes model complexity (e.g., large coefficients) to prevent overfitting, guiding selection toward simpler, more generalizable models. |
| Define the Akaike Information Criterion (AIC) and its role in model selection. | AIC = 2k - 2ln(L), where k is the number of parameters and L the likelihood. It balances fit quality and complexity — lower AIC indicates a preferred model. |
| How does the Bayesian Information Criterion (BIC) differ from AIC? | BIC = k ln(n) - 2ln(L); it penalizes complexity more heavily, especially for large samples, often favoring simpler models. |
| What is the concept of model evidence in Bayesian model selection? | Model evidence integrates over parameter uncertainties, quantifying how well the model explains the data while accounting for model complexity. |
| Explain the concept of overfitting in the context of model selection. | Overfitting occurs when a model fits training data noise, leading to poor generalization. It often results from selecting a model solely based on training performance. |
| Why can test set performance not be used for iterative model selection? | Repeatedly using the test set biases model choice toward that specific test data, invalidating it as an unbiased measure of generalization. |
| What is nested cross-validation and why is it used? | Nested cross-validation has an inner loop for model selection and an outer loop for performance estimation, preventing information leakage and providing an unbiased assessment of model performance. |
| When is a model identifiable? | A model is identifiable if distinct parameter values produce distinct model outputs; non-identifiability can make estimation ambiguous. |
| Describe how the Gauss-Newton method is used in nonlinear least squares. | It approximates the Hessian by JᵀJ (where J is the Jacobian) to iteratively update parameter estimates without computing second derivatives. |
| What is the role of Jacobian and Hessian matrices in nonlinear regression? | The Jacobian represents first derivatives of residuals with respect to parameters; the Hessian contains second derivatives, informing curvature and convergence behavior. |
| Why are transformations (like log or Box–Cox) sometimes applied in nonlinear modeling? | To linearize relationships, stabilize variance, or make model assumptions (e.g., normality) more appropriate. |
| What are the main challenges in estimating parameters for nonlinear models? | Non-convex optimization, sensitivity to initial values, potential for local minima, and complex error surfaces. |
| What defines a nonparametric classifier? | A classifier that does not assume a specific parametric form for the data distribution, relying instead on data-driven structure. |
| Explain how k-nearest neighbors (k-NN) classification works. | It assigns a class based on the majority label among the k closest points in the feature space, using a distance metric like Euclidean distance. |
| What is the curse of dimensionality in nonparametric methods? | As dimensionality increases, data becomes sparse, and distances lose discriminatory power, degrading performance of methods like k-NN. |
| How does kernel density estimation (KDE) relate to nonparametric classification? | KDE estimates class-conditional densities, which can be used in Bayes’ rule for classification without assuming a parametric distribution. |
| What is the bandwidth parameter in KDE and why is it crucial? | It controls the smoothness of the estimated density; too small leads to overfitting, too large to oversmoothing. |
| How does decision tree classification qualify as nonparametric? | It partitions the feature space recursively without assuming any parametric model of the data distribution. |
| Compare bias and variance trade-offs in nonparametric vs. parametric classifiers. | Nonparametric methods have low bias but high variance; parametric methods have higher bias but lower variance. |
| What role does distance metric choice play in nonparametric classification? | The distance metric determines neighborhood relationships and strongly influences classification performance, especially in heterogeneous feature spaces. |
| What differentiates convex and non-convex optimization problems? | In convex problems every local minimum is a global minimum; non-convex problems can have multiple local minima and saddle points. |
| What is the difference between gradient descent and Newton’s method? | Gradient descent uses first-order derivatives to move toward minima; Newton’s method uses second-order curvature information for faster (but costlier) convergence. |
| Define a saddle point in the context of optimization. | A point where the gradient is zero but the Hessian has both positive and negative eigenvalues — neither a local minimum nor maximum. |
| Why is line search used in iterative optimization algorithms? | To determine an optimal step size that sufficiently decreases the objective function while maintaining convergence stability. |
| Explain the role of Lagrange multipliers in constrained nonlinear optimization. | They transform constrained problems into unconstrained ones by incorporating constraints into the objective using multipliers. |
| What is the difference between global and local optimization algorithms? | Local methods (e.g., gradient descent) find nearby minima; global methods (e.g., simulated annealing, genetic algorithms) explore to avoid local minima traps. |
| What is the Karush–Kuhn–Tucker (KKT) condition? | A generalization of Lagrange conditions providing necessary (and sometimes sufficient) optimality conditions for constrained nonlinear problems. |
| Describe the purpose of quasi-Newton methods. | They approximate the Hessian matrix using gradient information to achieve faster convergence than gradient descent without full second derivatives. |
| What is the difference between deterministic and stochastic optimization methods? | Deterministic methods follow a fixed update rule; stochastic methods incorporate randomness (e.g., stochastic gradient descent) to escape local minima. |
| What is a trust-region method in nonlinear optimization? | It optimizes a local model of the objective function within a region around the current point, adjusting region size based on model accuracy. |
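
The Hoeffding cards above can be checked numerically. This is a minimal sketch; the helper name `hoeffding_bound` is illustrative, not from the deck.

```python
import math

def hoeffding_bound(epsilon, n):
    """Upper bound on P(|empirical mean - true mean| > epsilon)
    for n independent samples bounded in [0, 1]."""
    return 2 * math.exp(-2 * epsilon**2 * n)

# The bound decays exponentially in n: more data -> tighter guarantee.
for n in (100, 1000, 10000):
    print(n, hoeffding_bound(0.05, n))
```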
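
The normal-equation card (w* = (XᵀX)⁻¹Xᵀy) can be sketched in NumPy. The function name and toy data are assumptions for illustration; `np.linalg.solve` is used instead of an explicit inverse, the usual choice for numerical stability.

```python
import numpy as np

def fit_linear(X, y):
    """Solve the normal equation (X^T X) w = X^T y for w."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data: y = 1 + 2*x, with a bias column in the design matrix.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
w = fit_linear(X, y)
print(w)  # close to [1, 2]
```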
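
Since logistic regression has no closed form, the MLE card can be illustrated with plain gradient ascent on the log-likelihood; names and hyperparameters here are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Maximize the Bernoulli log-likelihood by gradient ascent
    (equivalent to minimizing log loss)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)                 # predicted probabilities
        w += lr * X.T @ (y - p) / len(y)   # gradient of the log-likelihood
    return w

# Toy 1-D problem with a bias column: class 1 for larger x.
# Note: this data is perfectly separable, so with unlimited iterations the
# weights would diverge (see the perfect-separation card); a finite budget
# keeps them bounded.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
w = fit_logistic(X, y)
print(sigmoid(X @ w))  # probabilities increase with x
```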
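
The k-fold cross-validation card can be sketched without any ML library; the fold-splitting helper below is an illustrative assumption, showing only the partitioning logic.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle n sample indices and split them into k roughly equal folds.
    Each fold serves once as validation, the rest as training."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

folds = kfold_indices(10, 5)
for i, val in enumerate(folds):
    train = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # fit on `train`, evaluate on `val`; average the k validation errors
    print(i, sorted(val.tolist()))
```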
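
The AIC and BIC formulas from the model-selection cards are direct to compute; the log-likelihood values below are made up purely to show how BIC penalizes extra parameters more heavily than AIC for large n.

```python
import math

def aic(k, log_l):
    """AIC = 2k - 2 ln L; lower is better."""
    return 2 * k - 2 * log_l

def bic(k, n, log_l):
    """BIC = k ln(n) - 2 ln L; heavier complexity penalty for large n."""
    return k * math.log(n) - 2 * log_l

# Hypothetical fits: model B fits slightly better but uses 5 more parameters.
n = 1000
print(aic(3, -520.0), aic(8, -516.0))          # AIC gap is small
print(bic(3, n, -520.0), bic(8, n, -516.0))    # BIC gap is much larger
```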
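
Finally, the k-NN card translates almost line by line into code; this sketch (Euclidean distance, majority vote, illustrative toy clusters) is one minimal way to do it.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest
    training points under Euclidean distance."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Two toy clusters around (0, 0) and (1, 1).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.15, 0.15])))  # 0
print(knn_predict(X_train, y_train, np.array([0.95, 1.0])))   # 1
```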