Deep Learning final
| Question | Answer |
|---|---|
| CNN biological inspiration | D. Hubel and T. Wiesel |
| CNN backstory | Neocognitron, LeNet-5 |
| CNN output size | ((W - F + 2P) / S) + 1, where W = input size, F = filter size, P = padding, S = stride |
| receptive field | region of the input feature map whose values contribute to the response of that unit |
| receptive field size | 1 + L(F - 1) for L stacked conv layers with filter size F and stride 1 |
| early CNNs | nearly all learnable parameters are located in the FC layers |
| GoogLeNet / Inception | Uses a "stem" network to aggressively downsample the image early on. |
| ResNet | uses skip connections |
| ResNeXt | splits feature maps into groups, runs a separate convolution on each group, then concats the results |
| DenseNet | concats all previous layers' feature maps to the next layer |
| small batches | less memory, more gradient noise (regularization) |
| large batches | lower gradient variance, better hardware utilization |
| learning rate decay | needed so the network can settle into a minimum instead of overshooting it with large steps |
| momentum | adds a fraction of the previous update to the current one |
| adagrad | adapts learning rate per parameter. flaw: accumulates all past squared gradients, so the effective LR eventually shrinks toward zero |
| RMSprop | fixes adagrad by using exponential moving average of past squared gradients |
| Adam | combines RMSProp and Momentum |
| pre-processing | zero-centering and normalizing variance. important so that the inputs to a layer aren't all positive or all negative, which would force its weight gradients to share one sign. |
| weight initialization | don't initialize to zero (causes symmetric gradients) |
| batch norm | shifts and rescales activations using the mean and variance of current mini-batch |
| L1/L2 (weight decay) | adds a penalty on large weights to the loss function (L1: sum of absolute values, L2: sum of squares) |
| early stopping | stop training when validation error starts to rise |
| dropout | randomly turns off neurons during training with probability p |
| label smoothing | replaces hard labels with soft targets |
| distillation | train a large teacher network, then train a smaller student network to match the softmax probabilities generated by the teacher |
| IoU (intersection over union) | area of overlap / area of union. measures how well a predicted box aligns with the ground truth. >0.5 is decent |
| NMS (non-maximum suppression) | detectors output multiple overlapping boxes for one object. NMS selects the highest scoring box and deletes all other boxes that overlap it significantly |
| evaluation (mAP) | calculate precision vs. recall curve. area under curve = average precision (AP). mean AP averages across all classes |
| R-CNN | 1. generate regions using external tool (selective search) 2. warp regions to fixed size. 3. pass each region through a CNN. 4. Classify with SVM. Flaw: extremely slow |
| Fast R-CNN | 1. Pass the whole image through CNN to get a feature map. 2. Project region proposals onto the feature map. 3. RoI Pooling to extract fixed-size features for each region. 4. FC layers for class + box offsets. Flaw: still relies on selective search |
| Faster R-CNN | replaces external proposal generator with region proposal network. fully end-to-end. |
| Mask R-CNN | adds a third branch to Faster R-CNN to predict a pixel-level binary mask for the object. uses RoIAlign (bilinear interpolation) instead of RoI Pooling |
| YOLO | divide image into a coarse grid. each grid cell directly predicts class probabilities and bounding boxes. |
| SSD | predicts boxes from multiple different convolutional feature maps at different resolutions |
| RetinaNet & Focal loss | down-weights the loss for easy samples so the network focuses on hard examples |
| vanilla RNN | h_t = tanh(W_hh * h_{t-1} + W_xh * x_t). hidden state acts as memory |
| vanishing/exploding gradients | because the exact same weight matrix W is multiplied repeatedly across time steps, gradients shrink to 0 or grow to inf |
| bi-directional RNN | process sequence both ways, concatenating hidden states. good when you need future context |
| LSTM | adds a separate cell state c_t. uses sigmoid gates (forget, input, output) to decide what to erase from, write to, and read out of memory |
| GRU (gated recurrent unit) | simplified LSTM. merges cell state and hidden state. merges forget and input gates into single update gate |
| beam search | keep top k best overall sequences at every step |
| seq2seq | standard encoder-decoder bottlenecks the entire input sequence into a single hidden vector |
| Attention(Q,K,V) | softmax(QK^T/sqrt(D)) V |
| query | what the current token is looking for |
| key | what the other tokens contain |
| value | actual information the token will pass along if selected |
| why divide by sqrt(D) | keeps the variance of the dot products consistent as D grows, so the softmax doesn't saturate and cause vanishing gradients |
| encoder self-attention | Q,K,V all come from previous encoder layer. every token looks at every other token |
| decoder self-attention | Q,K,V come from previous decoder layer. masked so token can only look at past tokens |
| encoder-decoder (cross) attention | Q comes from decoder. K and V come from encoder's final output |
| transformer memory | O(N^2) w.r.t. sequence length |
| positional encoding | adds sine/cosine waves of different frequencies to the input embeddings |
| RoPE (Rotary Positional Embedding) | instead of adding position to embeddings, it rotates Q and K vectors in space. |
| ViT (vision transformer) | split image into a grid of non-overlapping patches, flatten each patch, apply linear projection, add positional embeddings, feed them into Transformer encoder |
| Swin (shifted window transformer) | solves quadratic cost of ViT. computes self-attention only within local windows. Merges patches in deeper layers |
| Markov decision process | states (s), actions (a), transition model P(s' \| s, a), and reward function r(s). Next state s' depends only on the current state s and action a |
| Bellman equations | V(s) = r(s) + discount * max_a E[V(s')]; Q(s, a) = r(s) + discount * E[max_a' Q(s', a')] |
| deep Q-learning (DQN) | uses neural net to approximate Q values. experience replay buffer and frozen target network to prevent instability |
| double DQN | Standard DQN uses the max operator to both select and evaluate actions, causing overestimation. Double DQN uses the online network to select the action, and the target network to evaluate its value. |
| dueling DQN | splits network head into two paths: one predicts the general state values V(s) other predicts advantage of each action A(s, a). |
| REINFORCE | Learns the policy pi(a \| s) directly. Multiplies the gradient of the log-probability of an action by the actual return received. |
| Actor-Critic | Combines policy gradients (actor) with a learned value function (critic) that scores the actor's actions in place of the raw return. |
| PPO (proximal policy optimization) | uses clipped surrogate objective that prevents new policy from changing too much from the old policy. allows several mini-batch updates on the same batch of collected data. |
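
The "CNN output size" and "receptive field size" cards are plain arithmetic, so a quick check in code may help. This is a minimal sketch: the function names `conv_output_size` and `receptive_field` are made up for illustration, and the receptive-field formula is assumed to apply to stride-1 layers that all share one filter size.

```python
def conv_output_size(w: int, f: int, p: int, s: int) -> int:
    """Spatial output size of a conv layer: ((W - F + 2P) / S) + 1."""
    span = w - f + 2 * p
    if span < 0:
        raise ValueError("filter is larger than the padded input")
    if span % s != 0:
        raise ValueError("filter does not tile the input evenly at this stride")
    return span // s + 1


def receptive_field(num_layers: int, f: int) -> int:
    """Receptive field of L stacked stride-1 conv layers: 1 + L(F - 1)."""
    return 1 + num_layers * (f - 1)


if __name__ == "__main__":
    print(conv_output_size(32, 5, 2, 1))   # "same" padding keeps the size
    print(conv_output_size(227, 11, 0, 4)) # large filter, stride 4
    print(receptive_field(3, 3))           # three stacked 3x3 convs
```

Three stacked 3x3 layers see the same 7x7 input region as a single 7x7 layer, which is one reason deep stacks of small filters are popular.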
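
The "Adam" card says it combines momentum and RMSProp; a single-parameter sketch makes the two pieces visible. The function name `adam_step` and the default hyperparameters are assumptions chosen for illustration, not taken from the cards.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter. t is the 1-based step count."""
    # Momentum part: exponential moving average of gradients.
    m = beta1 * m + (1 - beta1) * grad
    # RMSProp part: exponential moving average of squared gradients.
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction, because m and v start at zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


if __name__ == "__main__":
    # Minimize f(x) = x^2, whose gradient is 2x.
    x, m, v = 5.0, 0.0, 0.0
    for t in range(1, 501):
        x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.1)
    print(x)
```

On the first step the bias-corrected ratio is roughly grad / |grad|, so the parameter moves by about `lr` regardless of the gradient's scale.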
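
The "IoU" and "NMS" cards describe a short algorithm that can be sketched in a few lines. Assumptions here: boxes are `(x1, y1, x2, y2)` tuples, the names `iou` and `nms` are illustrative, and the 0.5 overlap threshold is a common default rather than a fixed rule.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of the boxes that survive suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest-scoring remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))
```

In the example, the second box overlaps the first with IoU of about 0.68, so it is suppressed, while the distant third box is kept.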
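
The "Attention(Q,K,V)" and "decoder self-attention" cards can be checked with a small pure-Python sketch of softmax(QK^T/sqrt(D)) V. The helper names and the `causal` flag are assumptions for illustration; a real implementation would use batched matrix operations.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, causal=False):
    """Scaled dot-product attention on lists of row vectors."""
    d = len(K[0])
    out = []
    for i, q in enumerate(Q):
        # Similarity of this query to every key, scaled by sqrt(D).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        if causal:
            # Decoder-style mask: position i may not look at positions j > i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out


if __name__ == "__main__":
    Q = K = [[1.0, 0.0], [0.0, 1.0]]
    V = [[1.0, 2.0], [3.0, 4.0]]
    print(attention(Q, K, V))
    print(attention(Q, K, V, causal=True))
```

With the causal mask, the first token can only attend to itself, so its output equals its own value vector.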
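
The "Bellman equations" card for V(s) can be turned into value iteration on a tiny tabular MDP. This is a sketch under assumptions: the name `value_iteration`, the `P[s][a] = [(prob, next_state), ...]` layout, and the two-state example are all hypothetical and only meant to illustrate the update.

```python
def value_iteration(P, r, discount=0.9, tol=1e-8):
    """Iterate V(s) = r(s) + discount * max_a E[V(s')] until it stops changing.

    P[s][a] is a list of (probability, next_state) pairs; r[s] is the reward.
    """
    V = [0.0] * len(r)
    while True:
        new_V = [
            r[s] + discount * max(
                sum(p * V[s2] for p, s2 in P[s][a]) for a in range(len(P[s]))
            )
            for s in range(len(r))
        ]
        if max(abs(a - b) for a, b in zip(new_V, V)) < tol:
            return new_V
        V = new_V


if __name__ == "__main__":
    # Two states; action 0 stays put, action 1 moves to the other state.
    P = [
        [[(1.0, 0)], [(1.0, 1)]],
        [[(1.0, 1)], [(1.0, 0)]],
    ]
    r = [0.0, 1.0]  # only state 1 pays a reward
    print(value_iteration(P, r))
```

In the example, staying in state 1 forever is optimal, so V(1) solves V = 1 + 0.9 V, and V(0) is one discounted step away from it.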