MIDA
| Question | Answer |
|---|---|
| What does "machine learning" mean in formal terms? | It means constructing algorithms that improve automatically through experience. Formally, given a hypothesis set H, a learning algorithm selects a hypothesis h ∈ H that approximates the target function f, using data D. |
| What is supervised learning? | A paradigm where the algorithm learns from a dataset of input-output pairs (x, y). The goal is to approximate the function mapping inputs to outputs, so it can predict the label y for an unseen input x. |
| What are two main categories of supervised learning problems? | Classification: Predicting a discrete label (e.g., spam vs. not spam). Regression: Predicting a continuous value (e.g., house prices). |
| State Hoeffding’s inequality. | For independent random variables X₁, …, X_N bounded in [0, 1], the probability that their empirical mean μ̂ deviates from the true mean μ by more than ε is bounded by P(\|μ̂ − μ\| > ε) ≤ 2 exp(−2ε²N). |
| Why is Hoeffding’s inequality useful in machine learning? | It quantifies how likely it is that the training error (empirical error) differs from the true error (generalization error). With more samples N, the probability of large deviations decreases exponentially. |
| How does Hoeffding’s inequality relate to overfitting? | It guarantees that with enough data, the empirical error approximates the true error well. If N is small, the inequality allows for larger deviations, increasing the risk of overfitting. |
| In the PAC (Probably Approximately Correct) framework, how does Hoeffding’s inequality apply? | It provides the probabilistic bound (“probably”) that the hypothesis chosen has error close to the target function’s error (“approximately correct”). |
| What is the hypothesis function in linear regression? | h_w(x) = w₀ + w₁x₁ + w₂x₂ + ⋯ + w_d x_d = wᵀx. |
| What is the objective of linear regression? | To minimize the difference between predictions h_w(x) and true values y, usually by minimizing the Mean Squared Error (MSE). |
| Write the MSE cost function. | E(w) = (1/N) Σ_{i=1}^{N} (h_w(x^(i)) − y^(i))². |
| How are the optimal weights w found in closed form? | By solving the normal equation: w* = (XᵀX)⁻¹Xᵀy, where X is the design matrix of inputs and y the vector of outputs. |
| What is a key limitation of linear regression? | It assumes a linear relationship between inputs and outputs. It can underfit when the true relationship is nonlinear. |
| How does regularization modify linear regression? | It adds a penalty term to the cost function (e.g., Ridge regression adds λ‖w‖²) to prevent overfitting and control the complexity of the model. |
| Why can’t we use linear regression to model a binary outcome? | Linear regression can predict values <0 or >1, which aren’t valid probabilities. Also, with a binary outcome the error term is non-normal and heteroscedastic, violating linear regression assumptions. Logistic regression models the log-odds and bounds predictions to (0, 1) via the sigmoid. |
| Explain the connection between odds, probability, and log-odds. | Probability: p = P(Y=1); Odds: odds = p/(1-p); Log-odds (logit): logit(p) = ln(p/(1-p)). Logistic regression models log-odds linearly as a function of predictors. |
| How do you interpret a coefficient β_i in logistic regression? | β_i = change in log-odds per one-unit increase in x_i. Odds ratio: e^β_i = multiplicative change in odds for one-unit increase. Example: β_i=0.7 ⇒ e^0.7≈2 → odds double per unit increase. |
| What is the maximum likelihood estimation (MLE) in logistic regression? | MLE finds coefficients β that maximize the probability of observing the data: L(β) = ∏ ŷ_i^{y_i} (1−ŷ_i)^{1−y_i}. Equivalent to minimizing log loss. No closed-form solution; solved via iterative optimization (e.g., Newton-Raphson, gradient descent). |
| How does logistic regression handle non-linear relationships? | Logistic regression assumes linearity between predictors and log-odds. Non-linear effects can be modeled by adding polynomial terms (x²), interaction terms (x₁*x₂), or splines. |
| What problems arise with perfect separation? | Perfect separation occurs when a predictor perfectly predicts the outcome. MLE doesn’t converge (coefficients → ∞). Solution: use regularization (L1/L2) or Bayesian logistic regression. |
| Explain the difference between logit, probit, and complementary log-log models. | Logit: ln(p/(1-p)), logistic error distribution. Probit: Φ⁻¹(p), normal error distribution. Complementary log-log: ln(-ln(1-p)), asymmetric, often for rare events. All link functions map probabilities to the real line. |
| How do you evaluate model fit in logistic regression? | Deviance (compare to null model), Pseudo-R² (McFadden’s), ROC-AUC (discrimination), Calibration (how predicted probabilities match observed outcomes). |
| How do regularization techniques affect logistic regression? | L1 (Lasso): sparsity/feature selection. L2 (Ridge): coefficient shrinkage, reduces overfitting. Elastic Net: combination of L1 + L2. Useful with correlated predictors or small data. |
| What is the decision boundary in logistic regression? | The hyperplane where predicted probability = threshold (usually 0.5): β₀ + β₁x₁ + ... + β_n x_n = 0. Linear in feature space; can be made non-linear with feature transformations. |
| What is the bias–variance tradeoff in model selection? | It refers to the balance between model complexity and generalization. High bias leads to underfitting; high variance leads to overfitting. The goal is to minimize total expected prediction error. |
| How does cross-validation help in model selection? | Cross-validation partitions data into training and validation subsets multiple times to estimate generalization error, helping choose models that perform well on unseen data. |
| Compare k-fold cross-validation and leave-one-out cross-validation (LOOCV). | k-fold splits data into k groups, balancing bias and variance in error estimates; LOOCV uses n folds (one per observation), reducing bias but increasing variance and computation. |
| What is the purpose of regularization in model selection? | Regularization penalizes model complexity (e.g., large coefficients) to prevent overfitting, guiding selection toward simpler, more generalizable models. |
| Define the Akaike Information Criterion (AIC) and its role in model selection. | AIC = 2k - 2ln(L), where k is the number of parameters and L the likelihood. It balances fit quality and complexity — lower AIC indicates a preferred model. |
| How does the Bayesian Information Criterion (BIC) differ from AIC? | BIC = k ln(n) - 2ln(L); it penalizes complexity more heavily, especially for large samples, often favoring simpler models. |
| What is the concept of model evidence in Bayesian model selection? | Model evidence integrates over parameter uncertainties, quantifying how well the model explains the data while accounting for model complexity. |
| Explain the concept of overfitting in the context of model selection. | Overfitting occurs when a model fits training data noise, leading to poor generalization. It often results from selecting a model solely based on training performance. |
| Why can test set performance not be used for iterative model selection? | Repeatedly using the test set biases model choice toward that specific test data, invalidating it as an unbiased measure of generalization. |
| What is nested cross-validation and why is it used? | Nested cross-validation has an inner loop for model selection and an outer loop for performance estimation, preventing information leakage and providing an unbiased assessment of model performance. |
| When is a model identifiable? | A model is identifiable if distinct parameter values produce distinct model outputs; non-identifiability can make estimation ambiguous. |
| Describe how the Gauss-Newton method is used in nonlinear least squares. | It approximates the Hessian by JᵀJ (where J is the Jacobian) to iteratively update parameter estimates without computing second derivatives. |
| What is the role of Jacobian and Hessian matrices in nonlinear regression? | The Jacobian represents first derivatives of residuals with respect to parameters; the Hessian contains second derivatives, informing curvature and convergence behavior. |
| Why are transformations (like log or Box–Cox) sometimes applied in nonlinear modeling? | To linearize relationships, stabilize variance, or make model assumptions (e.g., normality) more appropriate. |
| What are the main challenges in estimating parameters for nonlinear models? | Non-convex optimization, sensitivity to initial values, potential for local minima, and complex error surfaces. |
| What defines a nonparametric classifier? | A classifier that does not assume a specific parametric form for the data distribution, relying instead on data-driven structure. |
| Explain how k-nearest neighbors (k-NN) classification works. | It assigns a class based on the majority label among the k closest points in the feature space, using a distance metric like Euclidean distance. |
| What is the curse of dimensionality in nonparametric methods? | As dimensionality increases, data becomes sparse, and distances lose discriminatory power, degrading performance of methods like k-NN. |
| How does kernel density estimation (KDE) relate to nonparametric classification? | KDE estimates class-conditional densities, which can be used in Bayes’ rule for classification without assuming a parametric distribution. |
| What is the bandwidth parameter in KDE and why is it crucial? | It controls the smoothness of the estimated density; too small leads to overfitting, too large to oversmoothing. |
| How does decision tree classification qualify as nonparametric? | It partitions the feature space recursively without assuming any parametric model of the data distribution. |
| Compare bias and variance trade-offs in nonparametric vs. parametric classifiers. | Nonparametric methods have low bias but high variance; parametric methods have higher bias but lower variance. |
| What role does distance metric choice play in nonparametric classification? | The distance metric determines neighborhood relationships and strongly influences classification performance, especially in heterogeneous feature spaces. |
| What differentiates convex and non-convex optimization problems? | In convex problems every local minimum is a global minimum; non-convex problems can have multiple local minima and saddle points. |
| What is the difference between gradient descent and Newton’s method? | Gradient descent uses first-order derivatives to move toward minima; Newton’s method uses second-order curvature information for faster (but costlier) convergence. |
| Define a saddle point in the context of optimization. | A point where the gradient is zero but the Hessian has both positive and negative eigenvalues — neither a local minimum nor maximum. |
| Why is line search used in iterative optimization algorithms? | To determine an optimal step size that sufficiently decreases the objective function while maintaining convergence stability. |
| Explain the role of Lagrange multipliers in constrained nonlinear optimization. | They transform constrained problems into unconstrained ones by incorporating constraints into the objective using multipliers. |
| What is the difference between global and local optimization algorithms? | Local methods (e.g., gradient descent) find nearby minima; global methods (e.g., simulated annealing, genetic algorithms) explore to avoid local minima traps. |
| What is the Karush–Kuhn–Tucker (KKT) condition? | A generalization of Lagrange conditions providing necessary (and sometimes sufficient) optimality conditions for constrained nonlinear problems. |
| Describe the purpose of quasi-Newton methods. | They approximate the Hessian matrix using gradient information to achieve faster convergence than gradient descent without full second derivatives. |
| What is the difference between deterministic and stochastic optimization methods? | Deterministic methods follow a fixed update rule; stochastic methods incorporate randomness (e.g., stochastic gradient descent) to escape local minima. |
| What is a trust-region method in nonlinear optimization? | It optimizes a local model of the objective function within a region around the current point, adjusting region size based on model accuracy. |
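
The Hoeffding cards above can be checked numerically. This is a minimal sketch; the helper name `hoeffding_bound` is illustrative, not from the deck.

```python
import math

def hoeffding_bound(epsilon, n):
    """Upper bound on P(|empirical mean - true mean| > epsilon)
    for n independent samples bounded in [0, 1]."""
    return 2 * math.exp(-2 * epsilon**2 * n)

# The bound decays exponentially in n: more data -> tighter guarantee.
for n in (100, 1000, 10000):
    print(n, hoeffding_bound(0.05, n))
```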
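
The normal-equation card (w* = (XᵀX)⁻¹Xᵀy) can be sketched in NumPy. The function name and toy data are assumptions for illustration; `np.linalg.solve` is used instead of an explicit inverse, the usual choice for numerical stability.

```python
import numpy as np

def fit_linear(X, y):
    """Solve the normal equation (X^T X) w = X^T y for w."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data: y = 1 + 2*x, with a bias column in the design matrix.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
w = fit_linear(X, y)
print(w)  # close to [1, 2]
```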
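
Since logistic regression has no closed form, the MLE card can be illustrated with plain gradient ascent on the log-likelihood; names and hyperparameters here are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Maximize the Bernoulli log-likelihood by gradient ascent
    (equivalent to minimizing log loss)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)                 # predicted probabilities
        w += lr * X.T @ (y - p) / len(y)   # gradient of the log-likelihood
    return w

# Toy 1-D problem with a bias column: class 1 for larger x.
# Note: this data is perfectly separable, so with unlimited iterations the
# weights would diverge (see the perfect-separation card); a finite budget
# keeps them bounded.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
w = fit_logistic(X, y)
print(sigmoid(X @ w))  # probabilities increase with x
```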
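
The k-fold cross-validation card can be sketched without any ML library; the fold-splitting helper below is an illustrative assumption, showing only the partitioning logic.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle n sample indices and split them into k roughly equal folds.
    Each fold serves once as validation, the rest as training."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

folds = kfold_indices(10, 5)
for i, val in enumerate(folds):
    train = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # fit on `train`, evaluate on `val`; average the k validation errors
    print(i, sorted(val.tolist()))
```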
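
The AIC and BIC formulas from the model-selection cards are direct to compute; the log-likelihood values below are made up purely to show how BIC penalizes extra parameters more heavily than AIC for large n.

```python
import math

def aic(k, log_l):
    """AIC = 2k - 2 ln L; lower is better."""
    return 2 * k - 2 * log_l

def bic(k, n, log_l):
    """BIC = k ln(n) - 2 ln L; heavier complexity penalty for large n."""
    return k * math.log(n) - 2 * log_l

# Hypothetical fits: model B fits slightly better but uses 5 more parameters.
n = 1000
print(aic(3, -520.0), aic(8, -516.0))          # AIC gap is small
print(bic(3, n, -520.0), bic(8, n, -516.0))    # BIC gap is much larger
```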
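
Finally, the k-NN card translates almost line by line into code; this sketch (Euclidean distance, majority vote, illustrative toy clusters) is one minimal way to do it.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest
    training points under Euclidean distance."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Two toy clusters around (0, 0) and (1, 1).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.15, 0.15])))  # 0
print(knn_predict(X_train, y_train, np.array([0.95, 1.0])))   # 1
```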