383 Final
Term | Definition
---|---
What would it mean for a machine to pass the Turing Test? | That a human cannot distinguish between that machine and another human over a conversation
What is the primary advantage of the rational agent approach for the purpose of science research? | It provides a framework for creating machines whose "intelligence" is measurable and comparable |
Following Moravec's Paradox, which of these tasks would require the most computational power? a. A program that solves tic-tac-toe b. A machine that can pick up objects using eating utensils c. A program that can solve math equations | A machine that can pick up objects using eating utensils
What is Moravec’s Paradox? | Moravec’s Paradox highlights that AI systems easily outperform humans on abstract, logic-based tasks (like chess or arithmetic) yet struggle with “simple” sensorimotor skills (like recognizing objects) that evolution has optimized over millions of years.
What is the difference between Tree-Search Algorithm and Graph-Search Algorithm? | Graph Search algorithms keep track of explored nodes to prevent cycles |
T or F: A* search is always optimal | False |
In what year was the field of AI officially founded? | 1956 |
Which of the following best describes how leading textbooks define the study of AI? | The study of Intelligent Agents |
What are the inputs to a search problem? a. Initial State b. Goal State c. Action cost function d. Transition Model e. All of the above | initial state, goal state, action cost function, transition model (all of the above) |
T or F: Adversarial Search is exactly the same as performing regular search in a multi-agent environment? | False |
What is the name of the computer that famously defeated world champion Garry Kasparov in chess in 1997? | Deep Blue
What are the space and time complexities of BFS? | Space: exponential, O(b^d); Time: exponential, O(b^d), where b is the branching factor and d is the depth of the shallowest solution
What is the main way that graph search algorithms differ? | By how they select the next node for exploration |
Which element of search algorithms is unique to informed search strategies? | Heuristic function |
What is the output of a general search algorithm? | A sequence of actions to reach the goal state
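A minimal sketch of that input/output contract in Python, assuming hypothetical `is_goal` and `successors` callbacks (breadth-first graph search, with an explored set to prevent cycles):

```python
from collections import deque

def bfs_graph_search(initial_state, is_goal, successors):
    """Breadth-first graph search returning a sequence of actions.

    `successors(state)` is assumed to yield (action, next_state) pairs.
    """
    frontier = deque([(initial_state, [])])   # (state, actions taken so far)
    explored = {initial_state}                # explored set prevents cycles
    while frontier:
        state, actions = frontier.popleft()
        if is_goal(state):
            return actions                    # output: sequence of actions
        for action, next_state in successors(state):
            if next_state not in explored:
                explored.add(next_state)
                frontier.append((next_state, actions + [action]))
    return None                               # no solution exists
```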
What is the key difference between a goal-based agent and a utility-based agent? | A goal-based agent selects actions to achieve a specific goal, while a utility-based agent selects actions based on how desirable they are
In what way does online search demonstrate an example of machine learning? | The agent learns the transition model of a problem after completing online search |
Which of the following is an example of a pruning algorithm? | Alpha-beta |
In belief-state search problem formulation, how is the goal test changed? | the goal test checks whether all states in the belief state are goal states
Which of the following tasks would always involve non-deterministic actions? | off-road driving |
Which of the following best describes the Physical Symbol System Hypothesis? | Any system capable of general intelligence must operate on symbols and symbolic manipulation
Which best describes the argument in Rodney Brooks's influential 1990 paper "Elephants Don't Play Chess"? | Elephants don't play chess, but that does not mean they are unintelligent or lack behavior worth studying
In the field of AI, what are the two necessary components of an agent? | Sensors and Actuators |
Is the heuristic function h(n) = 0 admissible, given non-negative edge weights? | True (it never overestimates the true cost to the goal)
What are the four environment assumptions needed for proper execution of "pure" search algorithms? | Deterministic, Observable, Discrete, Known
What conditions must be met in order for a reflex agent to be rational? | A reflex agent is rational if it selects actions that maximize its performance measure based on its current percept and knowledge, given its environment and actions. ex. a thermostat controlling a heating system is rational if it turns on when the temperature is below the set point
what is the critical assumption of the minimax algorithm? If that assumption were not true, would the algorithm still yield the optimal action to take? | - the critical assumption is that your opponent (MIN) will take the optimal move at every step to minimize MAX's score - if this is not true, the algorithm may not yield optimal actions
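A depth-limited minimax sketch with alpha-beta pruning in Python, assuming hypothetical `successors`, `evaluate`, and `is_terminal` callbacks; the cutoffs encode exactly the assumption above that each side plays optimally:

```python
import math

def minimax(state, depth, alpha, beta, maximizing,
            successors, evaluate, is_terminal):
    """Minimax value of `state` with alpha-beta pruning (sketch)."""
    if depth == 0 or is_terminal(state):
        return evaluate(state)
    if maximizing:
        value = -math.inf
        for child in successors(state):
            value = max(value, minimax(child, depth - 1, alpha, beta, False,
                                       successors, evaluate, is_terminal))
            alpha = max(alpha, value)
            if beta <= alpha:   # MIN would never let play reach this branch
                break           # prune the remaining children
        return value
    value = math.inf
    for child in successors(state):
        value = min(value, minimax(child, depth - 1, alpha, beta, True,
                                   successors, evaluate, is_terminal))
        beta = min(beta, value)
        if beta <= alpha:       # MAX would never let play reach this branch
            break
    return value
```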
what is the backpropagation algorithm used for? | to find incremental adjustments to make to all the weights in a neural network
in general, how do we optimize an ML model? | - Minimize a loss function with respect to the parameters of the network - Minimize a cost function with respect to the parameters of the network |
Gradient Descent is a general algorithm to do what? | Find the minimum of a function
Why is a step function not an ideal activation function for a neural network? | It's not differentiable
T or F: Linear regression can be applied only on 2 dimensional data | false |
which best describes a loss function? | a measure of the imperfection of our prediction |
in the context of Machine learning, which best describes a Validation Set? | A partition of data used to tune hyper-parameters of a model before testing
what does alpha describe in the gradient descent update equation? | the learning rate |
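A minimal sketch of the update w ← w − α·∇L(w); the quadratic example function and starting point are made up for illustration:

```python
def gradient_descent(grad, w0, alpha=0.1, steps=100):
    """Repeatedly step against the gradient; alpha is the learning rate."""
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

# Minimize f(w) = (w - 3)^2, whose gradient is 2*(w - 3); minimum is at w = 3.
print(gradient_descent(lambda w: 2 * (w - 3), w0=0.0))  # ~3.0
```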
in the context of machine learning, which statement best describes the purpose of a hypothesis function h(x)? | h(x) is an estimate of the underlying true function f(x) which relates features x to labels y
which best describes the intuition for k-nearest neighbors classification? | the set of k datapoints closest to a novel datapoint vote for its classification
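A small k-NN sketch, assuming Euclidean distance and a made-up toy dataset:

```python
import math
from collections import Counter

def knn_classify(query, data, k=3):
    """The k labeled points closest to `query` vote on its class."""
    by_distance = sorted(data, key=lambda pair: math.dist(query, pair[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

points = [((0, 0), "A"), ((0, 1), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_classify((1, 1), points, k=3))  # "A": two of the 3 nearest are A
```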
which of the following best describes the process of selecting a model for a machine learning problem? | model selection involves experimentation influenced by the problem, data, and evaluation
what is the backpropagation algorithm used to compute? | the influence that changing any weight of a neural network has on the prediction error
what is a valid interpretation of a trained perceptron's output? | a binary linear classifier |
which of the following best describes the intuition behind linear regression? | a hyperplane fit through a set of datapoints by minimizing the residuals between estimated and actual output
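A short illustration of that intuition: fitting a line by ordinary least squares (minimizing squared residuals) with NumPy on made-up data:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 7.1])           # roughly y = 2x + 1 plus noise

X = np.column_stack([x, np.ones_like(x)])    # add a bias column
w, b = np.linalg.lstsq(X, y, rcond=None)[0]  # least-squares fit
print(w, b)                                  # close to 2 and 1
```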
which of the following is not a category of ML? a. unsupervised learning b. reinforcement learning c. supervised learning d. harmonic learning | harmonic learning |
T or F: A neural network of sufficient size can in theory approximate any continuous function | true |
which of the following is a commonly used type of artificial neural network? | multi-layer perceptron |
in context of machine learning, which best describes the concept of Ockham's Razor? | given equal performance the least complex model is often preferred |
which describes a likely observation you could make on an overfit model? | it performs poorly on the validation data but well on the training data |
which of the following best describes the intuition behind an SVM? | a hyperplane divides data in such a way that maximizes the margin between the categories
what is the main reason why a step function is not an ideal activation function for a neural network? | it's not differentiable and thus incompatible with gradient-based optimization techniques
T or F: When approaching a ML problem, there's only one correct model to use to create accurate predictions | false |
for a perceptron, which of the following best describes a hard threshold activation? | a linear sum of weighted inputs is taken. if that sum exceeds a set value then the perceptron activates sending a fixed signal to its downstream connections |
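A minimal hard-threshold perceptron sketch; the weights and bias implementing logical AND are chosen by hand for illustration:

```python
def perceptron_output(weights, bias, inputs, threshold=0.0):
    """Fire (output 1) iff the weighted sum of inputs exceeds the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > threshold else 0

# Hand-set weights computing logical AND of two binary inputs.
print(perceptron_output([1.0, 1.0], -1.5, [1, 1]))  # 1
print(perceptron_output([1.0, 1.0], -1.5, [1, 0]))  # 0
```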
Which best describes the "Kernel Trick"? | A trick to find a decision boundary in a different coordinate space |
T or F: Multilayer neural networks can in theory approximate any continuous function | true
T or F: a ML model that is overfit to a dataset will generalize well | false
physical symbol system hypothesis: | the idea that manipulating symbols according to rules is sufficient for general intelligence. - top-down - rule-based |
Which hypothesis underlies “Good Old-Fashioned AI”? | Physical symbol system hypothesis
In connectionist AI systems, “learning” primarily means: | Adjusting neural-network parameters from data |
Connectionist AI: | - bottom-up, inductive approach. - Systems learn rules and patterns directly from data (observations) rather than being explicitly programmed. |
Which approach is top-down and rule-based? A. Symbolic AI B. Connectionist AI C. Reinforcement learning D. Evolutionary algorithms | A. Symbolic AI |
Natural Language Processing (NLP) is primarily concerned with: A. Visual scene understanding B. Physical robot control C. Understanding and generating human language D. Optimizing search algorithms | C. Understanding and generating human language |
Computer Vision: | Enabling agents to "see" and interpret visual information. |
The chain rule in language modeling expresses | joint probability as a product of conditional probabilities |
A bigram model relies on which assumption? | P(wᵢ|w₁…wᵢ₋₁) ≈ P(wᵢ|wᵢ₋₁) - a bigram is two units ex. 'th', or 'the cat' - Edge only from Wi−1 to Wi. |
Which n-gram model ignores word order entirely? | Unigram |
Corpus | Corpus: A collection of text or speech data used for analysis or training |
Why is the full joint distribution P(w₁…wₘ) infeasible to compute directly? | Table size grows as |V|ᵐ (explodes) |
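A toy sketch of the bigram workaround: maximum-likelihood estimates of P(wᵢ|wᵢ₋₁) from pair counts, on a made-up corpus:

```python
from collections import Counter

def bigram_probs(corpus):
    """Estimate P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})."""
    tokens = corpus.split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {(a, b): c / unigrams[a] for (a, b), c in bigrams.items()}

probs = bigram_probs("the cat sat on the mat the cat ran")
print(probs[("the", "cat")])  # 2/3: "the" occurs 3 times, "the cat" twice
```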
Language identification via n-grams works by: A. Counting parts of speech B. Comparing sequence probability under each language model C. Parsing with a CFG D. Measuring sentence length | Comparing sequence probability under each language model |
In LLMs, the “context window” is: A. Number of GPUs used B. Maximum length of prefix tokens the model can attend to C. Batch size during training D. Size of the model vocabulary | Maximum length of prefix tokens the model can attend to |
Which stage of LLM training uses human-labelled prompt/output pairs to teach formatting? A. Pre-training B. Tokenization C. Instruction tuning D. Inference | Instruction tuning |
RLHF stands for: A. Reinforcement Learning from Human Feedback B. Recurrent Language Hierarchical Framework C. Randomized Learning Hyperparameter Fitting D. Rule-based Language Heuristic Fusion | Reinforcement Learning from Human Feedback
“Chain-of-thought” prompting aims to: A. Force shorter outputs B. Make the model list bullet points C. Have the model articulate its step-by-step reasoning D. Restrict vocabulary usage | Have the model articulate its step-by-step reasoning
Information Extraction | - Extract specific pieces of structured information from unstructured or semi-structured text - An example is extracting product names and their prices from websites |
Regex | A sequence of characters defining a search pattern (ex. format of a price or phone number) |
Risk of LLMs being prompted to perform information extraction: | LLMs may "hallucinate" or invent information that isn't actually present in the text, making them less reliable than regex for tasks requiring high accuracy
Information retrieval | find documents relevant to a user's query from a large collection (corpus) - ex. web search engines (google, bing, etc.)
components of information retrieval | - document collection: The large set of documents to search within - query: The user's expression of their information need - retrieval system: The algorithm/system that processes the query and returns a ranked subset of documents deemed relevant
In TF-IDF scoring, a term gets high weight if it is: A. Frequent in all documents B. Rare in the corpus but frequent in the current document C. Absent from the current document D. Only appears in stop-word list | Rare in the corpus but frequent in the current document: a term scores high if it is frequent in the doc but rare overall. Sum the scores of the query terms
TF-IDF | - TF: How often a term appears in the document. A high score suggests relevance - IDF: How rare a term is across the entire corpus. Rarer terms are considered more informative IDF(t) = log(N/df(t)), [N = total docs, df(t) = # docs containing term t]
PageRank ranks web pages based on: A. Term frequency B. Link structure (“importance” via incoming links) C. Document length D. Keyword density | Scores pages based on link structure ("popularity contest") |
In a TF-IDF scheme, what role does the IDF component play? | It downweights terms that appear in many documents, so common words carry less influence. (ex. "the", "and", etc) |
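A minimal TF-IDF sketch matching the formulas above; the three-document corpus is made up for illustration:

```python
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]

def tf_idf(term, doc, docs):
    """TF (frequency in this doc) times IDF = log(N / df)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)   # docs containing the term
    return tf * math.log(len(docs) / df)

print(tf_idf("cat", docs[0], docs))  # "cat" appears in 2 of 3 docs: lower IDF
print(tf_idf("mat", docs[0], docs))  # "mat" is corpus-rare: higher weight
```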
NLP Task: Syntactic Parsing | Goal: Analyze the grammatical structure of a sentence according to a formal grammar - relies on formal grammars, often CFGs, which define rules for how words (terminals) group into constituents (non-terminals) like noun phrases and verb phrases. |
Context-Free Grammar | A type of grammar where rules apply regardless of surrounding context In a plain CFG, if you have two ways to expand NP—say 1. NP → Det Noun (ex. "the cat") 2. NP → Name (ex. Sam) you don’t say which one is preferred; both parses are just “allowed.”
A Probabilistic CFG (PCFG) extends a CFG by | Assigning probabilities to each production rule - instead of treating every grammar rule as equally “possible,” a PCFG lets you say “Rule X is twice as likely as Rule Y.” |
Probabilistic CFG (PCFG) | Assigns probabilities to each grammar rule based on its observed frequency in a corpus
NP → Det Noun (tree) | NP ├── Det → “the” └── Noun → “dog”
Calculate the PCFG probability of "the cat": Rule probabilities: NP → Det Noun 0.6, Noun → "dog" 0.5, Det → “the” 1.0, Noun → “cat” 0.5 | Probability: 0.6 × 1.0 × 0.5 = 0.3 NP ├─ Det → “the” └─ Noun → “cat”
In a parse tree, “terminals” are: A. Non-terminal symbols like NP or VP B. Actual words of the sentence C. Probability values D. Grammar rules | - Actual words of a sentence (leaves of the parse tree) - the set of terminals is the lexicon/vocabulary |
Word embeddings differ from one-hot vectors because they are: A. Sparse and high-dimensional B. Dense and low-dimensional, learned to capture similarity C. Randomly assigned D. Always binary | - word embed: Dense and low-dimensional, learned to capture similarity - one-hot: very sparse and high dimensional, treats words as independent symbols |
Word-count (bag-of-words) example | Vocabulary: "cat" "dog" "mouse" Example sentence: “cat dog cat”, Indexing: cat→0, dog→1, mouse→2 Word-count vector: [2, 1, 0] What you see: "cat" appears twice, "dog" once, "mouse" zero times Drawback: word order is lost ("dog cat cat" gives the same vector)
One-Hot Encoding example | "cat" → [1,0,0] "dog" → [0,1,0] "mouse" → [0,0,1] every word is turned into a vector of length |V| = 3, with a single 1 at its index |
why are word embeddings better than one-hot encoding? | One-hot vectors are sparse and don't capture meaning. Word embeddings are better: each word is mapped to a dense, relatively low-dimensional vector whose values are learned during training. These embeddings often capture semantic relationships between words.
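A side-by-side sketch; the embedding values are invented purely to show that similarity becomes measurable (real embeddings are learned from data):

```python
import numpy as np

vocab = ["cat", "dog", "mouse"]

# One-hot: sparse, |V|-dimensional, every pair of words equally dissimilar.
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
print(one_hot["cat"] @ one_hot["dog"])   # 0.0: no notion of similarity

# Toy "learned" embeddings: dense, low-dimensional (values made up here).
emb = {"cat": np.array([0.9, 0.1]),
       "dog": np.array([0.8, 0.2]),
       "mouse": np.array([0.2, 0.9])}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(emb["cat"], emb["dog"]))    # high: related words end up close
print(cosine(emb["cat"], emb["mouse"]))  # lower
```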
Softmax activation is used to: A. Normalize outputs into a probability distribution (sum=1) B. Compute the maximum activation only C. Introduce non-linearity by thresholding at 0 D. Pool features spatially | Normalize outputs into a probability distribution (sum=1) |
An RNN’s hidden state ht is updated by combining: A. Previous output only B. Previous hidden state ht₋₁ and current input embedding C. Unrelated random noise D. Future target tokens | Previous hidden state ht₋₁ and current input embedding |
RNNs often struggle to learn long-range dependencies due to: A. Exploding gradients only B. Vanishing gradients only C. Both vanishing and exploding gradients D. Lack of embeddings | Both vanishing and exploding gradients |
Greedy decoding always picks: A. A random next word B. The highest-probability next word C. The least probable next word D. A word based on TF-IDF | - The highest-probability next word - often leads to repetitive and deterministic output
How do RNNs build on fixed-window models? | - RNNs were developed to handle sequential data more effectively - added: a hidden state (hₜ) that is updated at each time step t based on the current input (Eₜ) and the previous hidden state (hₜ₋₁) - allows the network to maintain a summary of the sequence seen so far
Search algorithm components | states, actions, initial state, goal state, transition model, edge costs
temperature | the parameter controlling randomness - higher temp = more randomness (flattens distribution) - lower temp = more greedy (sharpens distribution) |
nucleus sampling (top-p) | - A common advanced sampling method that considers only the most probable words whose cumulative probability exceeds a threshold p - keeps the smallest set of tokens whose cumulative probability ≥ p - balances temperature by sampling over a core set (the nucleus) of the most probable next words
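A sketch combining temperature scaling with nucleus sampling over toy logits; the nucleus is the smallest prefix of probability-sorted tokens whose cumulative probability reaches p:

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_p=0.9, seed=None):
    """Softmax with temperature, then sample from the top-p nucleus."""
    rng = np.random.default_rng(seed)
    probs = np.exp(logits / temperature)     # higher temp flattens,
    probs /= probs.sum()                     # lower temp sharpens
    order = np.argsort(probs)[::-1]          # most probable tokens first
    cumulative = np.cumsum(probs[order])
    nucleus = order[:np.searchsorted(cumulative, top_p) + 1]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize
    return rng.choice(nucleus, p=nucleus_probs)

logits = np.array([2.0, 1.0, 0.5, -1.0])     # made-up scores for 4 tokens
print(sample_next(logits, temperature=0.7, top_p=0.9))
```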
Rule-based chatbots rely on: A. Large neural networks B. Predefined patterns and templates (often regex) C. Probabilistic context-free grammars D. Reinforcement learning | Predefined patterns and templates (often regex) |
Conversational Agents | Agents interacting within a conversational environment using natural language |
two categories of conversational agents | -chatbots: designed for open-ended convo -task-oriented: designed to help user accomplish specific goals (e.g. Siri/Alexa, automated phone system) |
Corpus-based chatbot | retrieves responses from a large database of existing conversations (e.g. movie scripts, twitter data) |
Rule-based chatbot strength and weakness | - precise for inputs they recognize - fail on anything unexpected and require significant manual effort
Corpus-based chatbots strength and weakness | -handle more variety -lack control and can inherit biases or undesirable content from the training data |
Automatic Speech Recognition (ASR) | -Convert user's spoken audio into text ("utterance") -modern ASR uses deep learning (e.g. transformers) |
Natural Language Understanding (NLU) | Analyze the utterance text to determine the user's goal. steps: 1. Domain classification 2. Intent Determination 3. Slot filling |
NLU: Domain Classification | Identify the general topic (e.g. Weather, Music) |
NLU: Intent Determination | Identify the specific action requested within the domain (e.g. GetWeather, PlayMusic, etc.)
NLU: Slot filling | Extract specific parameters (slots) needed to fulfill the user's intent (e.g. Location="Boston", Date="Tomorrow" for GetWeather)
Computer vision | - extract high-level knowledge and understanding from visual data - focuses on enabling machines to interpret and understand information from images and videos |
computer vision relation to agent paradigm | computer vision provides the "perception" component, allowing agents to sense and interpret their visual environment to inform state representation and action selection
Digital pixel colors: Grayscale and Color | Grayscale: one value per pixel --> a single intensity value Color (RGB): typically 3 values per pixel
Haar Cascade classifier | - an algorithm for object detection, particularly face detection - uses rectangular features called Haar-like features which capture basic patterns of intensity differences in faces (e.g. the eye region is typically darker than the upper cheeks)
Convolutional Neural Networks (CNNs) | A CNN scans an image with small, learnable filters to pick out features like edges or textures, then uses pooling to shrink and summarize those feature maps before feeding them into a final classifier. |
What key operations are part of CNNs? | - convolution - pooling
convolution | - Apply small, learnable filters across the input image - Each filter slides over the image, computing dot products to detect patterns - Outputs a feature map showing where each pattern appears - Uses parameter sharing (same filter everywhere) to keep the model compact
Pooling | – Downsample each feature map by summarizing small regions (e.g., taking the maximum in each 2×2 block). – Reduces spatial dimensions and computation. – Introduces slight invariance to shifts or distortions in the input. |
A convolutional layer differs from a fully-connected layer by: A. Using hand-designed filters only B. Applying the same small filter (kernel) across the spatial dimensions (parameter sharing) C. Operating on one pixel at a time | Applying the same small filter (kernel) across the spatial dimensions (parameter sharing) |
Pooling layers (e.g., max-pooling) serve to: A. Increase feature map size B. Reduce spatial dimensions and add invariance to small shifts C. Normalize pixel intensities D. Learn filter weights | Reduce spatial dimensions and add invariance to small shifts |
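A from-scratch NumPy sketch of both operations, using a hand-picked 2×2 filter (real CNN libraries implement these far more efficiently, with learned filters):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution: slide the same kernel over every position."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Downsample by taking the max of each size x size block."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(16.0).reshape(4, 4)
edge_filter = np.array([[1.0, -1.0], [1.0, -1.0]])  # responds to vertical edges
print(max_pool(conv2d(image, edge_filter)))
```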
AlexNet (2012): | A deep CNN architecture (8 layers: 5 convolutional, 3 fully connected) developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. |
Why is AlexNet so important? | -Dramatically outperformed all previous approaches (which used traditional computer vision methods) -ignited the modern deep learning era |
Moravec’s paradox states that tasks easy for humans (walking, grasping) are often harder for AI than: A. Simple lookup tables B. High-level abstract reasoning (e.g., chess) C. Calculating arithmetic D. Sorting numbers | High-level abstract reasoning (e.g., chess) |
The “localization” problem in robotics is about determining: | The robot’s current state (position/orientation/joint angles) |
A Kalman filter is used to: A. Generate trajectories B. Estimate the true state by filtering noisy sensor measurements C. Plan collision-free paths D. Optimize control gains offline | Estimate the true state by filtering noisy sensor measurements |
In an MDP for reinforcement learning, the agent aims to learn a policy that maximizes: A. Immediate reward only B. Cumulative (discounted) future rewards C. Total number of states visited D. Size of the action space | Cumulative (discounted) future rewards
PID Controller | tracks the error between the desired and actual states and adjusts its output using proportional, integral, and derivative terms to counter disturbances and accurately reach and maintain the target state. |
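A toy PID loop in Python; the gains and the one-line "plant" model are made up for illustration:

```python
def pid_step(error, prev_error, integral, kp, ki, kd, dt):
    """One PID update: kp*error + ki*integral(error) + kd*d(error)/dt."""
    integral += error * dt
    derivative = (error - prev_error) / dt
    return kp * error + ki * integral + kd * derivative, integral

state, integral, prev_error, dt = 0.0, 0.0, 0.0, 0.1
for _ in range(50):                       # drive the state toward 10.0
    error = 10.0 - state
    control, integral = pid_step(error, prev_error, integral,
                                 kp=0.5, ki=0.1, kd=0.05, dt=dt)
    state += control * dt                 # toy plant: velocity = control
    prev_error = error
print(round(state, 2))                    # approaches the setpoint of 10.0
```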
Kalman filters algorithm | used to estimate the true underlying state by statistically averaging noisy measurements over time, producing a smoother, more reliable signal. |
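A scalar Kalman-filter sketch with assumed process and measurement noise variances, smoothing noisy readings of a constant true value:

```python
import numpy as np

def kalman_1d(measurements, q=1e-4, r=0.5):
    """Estimate a (nearly) constant hidden value from noisy measurements.

    q = process-noise variance, r = measurement-noise variance (assumed known).
    """
    x, p = 0.0, 1.0                 # state estimate and its variance
    for z in measurements:
        p += q                      # predict: uncertainty grows slightly
        k = p / (p + r)             # Kalman gain: how much to trust z
        x += k * (z - x)            # update estimate toward the measurement
        p *= 1 - k                  # uncertainty shrinks after the update
    return x

rng = np.random.default_rng(0)
noisy = 5.0 + rng.normal(0, 0.7, size=100)  # true value 5.0 plus sensor noise
print(kalman_1d(noisy))                     # close to 5.0
```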
Control Theory: | A field dedicated to designing control systems that handle uncertainty and ensure desired behavior despite noise and disturbances. - ex. PID controllers, Reinforcement Learning (RL)
Reinforcement Learning (RL) | Agent learns through trial-and-error interaction with an environment, receiving feedback in the form of rewards or punishments, without explicit data/label pairs like supervised learning. |
Markov Decision Process (MDP) | The standard mathematical framework for RL problems -defined by: States (S), Action (A), Transition Probabilities, Reward, Discount Factor |
MDP Policy (π(s)→a) | A function mapping states to actions. RL aims to find the optimal policy π∗ that maximizes expected discounted future rewards.
Reinforcement Learning (RL) vs Supervised Learning (SL) | In SL, a model learns to map inputs to known outputs by minimizing prediction error on labeled training data. In RL, an agent learns through trial and error by interacting with an environment and optimizing its actions to maximize cumulative reward.
Deep Reinforcement Learning (Deep RL) | combines reinforcement learning with deep neural networks, using networks to approximate policies or value functions so agents can handle very large or continuous state/action spaces. |
How to train Deep RL | agent interacts with the environment using its current neural network policy. The collected experiences (state, action, reward, next state) are used to update the network parameters (θ) via gradient-based optimization, aiming to improve expected rewards. |
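A tabular Q-learning sketch, the simpler cousin of the deep-RL loop above (the Q-table here is what a neural network would approximate); the corridor environment is made up for illustration:

```python
import random

# Toy 1-D corridor: states 0..4, reward only for reaching the goal state 4.
n_states, actions = 5, [-1, +1]             # move left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.1       # step size, discount, exploration

for _ in range(500):                        # episodes of trial and error
    s = 0
    while s != 4:
        a = (random.choice(actions) if random.random() < epsilon
             else max(actions, key=lambda a: Q[(s, a)]))
        s2 = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s2 == 4 else 0.0
        best_next = max(Q[(s2, a2)] for a2 in actions)
        # Move Q(s,a) toward the target r + gamma * max_a' Q(s',a').
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

print(max(actions, key=lambda a: Q[(0, a)]))  # learned action at state 0: +1
```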
Adversarial Search | deals with multi-agent environments where other agents are actively trying to prevent the agent from reaching its goal -ex. board games |
A* search time and space complexity | Both exponential |
uniform cost search time and space complexity | both exponential
Uniform Cost Search | search algorithm that finds the path with the lowest cumulative cost
DFS time and space complexity | Time: exponential, O(b^m); Space: linear, O(bm), where m is the maximum depth
BFS vs DFS | BFS: explores level by level DFS: explores as deep as possible branch by branch |
utility-based agents | Extend goal-based agents by adding a utility function to measure how well a state achieves the goal. |
rational agent | defined as one that selects actions expected to maximize its performance measure, given its perceptions and built-in knowledge. |
the perceptron model | mathematical model of a biological neuron, a foundational element of artificial neural networks. - developed by Frank Rosenblatt in 1958
Artificial Neural Network (Multilayer Perceptrons) | -The output of one layer of Perceptrons serves as the input to the next layer. -This interconnected structure allows ANNs to represent much more complex functions than a single Perceptron. |