Tut 11 - AoL
| Term | Definition |
|---|---|
| Learning | Minimising Prediction Error |
| Rats, monkeys, machines | Learning happens when reality does not equal expectation |
| Discrepancy | Prediction error drives updates in knowledge and behaviour |
| Rescorla-Wagner model | Classical conditioning & prediction error |
| RW model - classical conditioning and prediction error | Goal: learn associations between stimuli. Mechanism: update associative strength using prediction error (PE). |
| RW equation | $\Delta V = \alpha \beta (\lambda - \Sigma V)$: $\Delta V$ = change in associative strength; $\alpha$ = salience of the CS (how noticeable it is); $\beta$ = learning rate related to the US; $\lambda$ = maximum associative strength (e.g. 1 if the US is present); $\Sigma V$ = current total associative strength (the expectation). See the RW sketch after this table. |
| Prediction error | Outcomes: positive PE -> associative strength increases; negative PE -> associative strength decreases; zero PE -> no learning |
| PE explains | Acquisition, extinction, blocking, conditioned inhibition |
| PE - acquisition | Learning is rapid when PE is large |
| PE - extinction | CS without US -> Negative PE -> Decline in strength |
| PE - blocking | Prior CS already predicts US -> New CS gains nothing |
| PE - conditioned inhibition | A CS that predicts the absence of the US acquires a negative associative value |
| Dopamine as reward prediction error | VTA dopamine neurons encode PE; dopamine activity matches the RW model |
| Pattern | Unexpected reward -> dopamine burst; predicted reward -> dopamine fires at the cue, not the reward; omitted reward -> dopamine dip; surprise in timing -> dopamine response shifts |
| Implications | Dopamine firing = biological prediction error; supports both Pavlovian and instrumental learning |
| Addiction | Hijacking of the dopamine PE signal -> overlearning of drug cues |
| Temporal difference (TD) learning | From biology to AI |
| TD equation | $V(s_t) \leftarrow V(s_t) + \alpha [r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]$. Key points: sequential learning - updates at every time step (not just at trial end); discount factor $\gamma$ controls how much future rewards matter; can explain the dopamine shift to predictive cues over time. See the TD sketch after this table. |
| TD vs RW | Both use PE for learning; TD handles multi-step learning and timing; RW updates once per trial, TD updates at every time step |
| Q-learning - decision making in agents | Extends TD by adding actions: learns how valuable each action is in a given state (see the Q-learning equation below) |
| Q-learning equation - key concepts | $Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$. $Q(s,a)$ = value of taking action $a$ in state $s$; $\max_{a'} Q(s',a')$ = best expected future reward from the next state; learns optimal actions in complex, uncertain environments |
| Practical application | Grid world - the agent learns to reach the goal while avoiding punishment; its policy improves by adjusting Q-values based on prediction errors. See the grid-world sketch after this table. |
| Hull's goal gradient and links to AI | Hull: rats speed up as they near the reward; motivation increases with proximity to the goal |
| Connection to TD/Q-learning | States/actions near the reward have higher value; agents and animals both act more decisively near the goal; applies to consumer behaviour too (e.g. loyalty cards) |
| Rescorla-Wagner | Focus: stimulus-outcome associations; learns from trial-end PE; updates once per trial |
| TD learning | Focus: state values; learns from step-by-step PE; updates at each time step |
| Q-learning | Focus: state-action values; learns from step-by-step PE plus future reward; updates at each time step and action |
| All models | Use prediction error as a learning signal; adjust internal expectations/values to improve future outcomes; connect psychology (RW), neuroscience (dopamine) and AI (TD, Q-learning) |
| Learning | Reducing surprise |
| Dopamine | The brain's prediction error system |
| TD/Q-learning | AI's version of this biological strategy |
| Psychology -> neuroscience -> AI | A shared learning architecture |
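
A minimal Python sketch of the Rescorla-Wagner update from the "RW equation" card, showing acquisition, blocking and extinction. The function name `rw_update`, the values of alpha and beta, and the trial counts are illustrative assumptions, not taken from the tutorial.

```python
# Rescorla-Wagner update: dV = alpha * beta * (lambda - sum_V).
# alpha, beta and the trial structure below are illustrative, not from the tutorial.

def rw_update(V, cs_present, lam, alpha=0.3, beta=1.0):
    """One trial of Rescorla-Wagner learning.

    V          -- dict mapping CS name -> current associative strength
    cs_present -- list of CS names presented on this trial
    lam        -- max associative strength supported by the US (1 if present, 0 if absent)
    """
    total = sum(V[cs] for cs in cs_present)   # current expectation (sum of V)
    pe = lam - total                          # prediction error
    for cs in cs_present:
        V[cs] += alpha * beta * pe            # all present CSs share the same PE
    return pe

V = {"light": 0.0, "tone": 0.0}

# Acquisition: light + US. PE is large early, so learning is rapid; PE shrinks as V(light) approaches lambda.
for _ in range(10):
    rw_update(V, ["light"], lam=1.0)

# Blocking: light+tone compound + US. The light already predicts the US, so PE is near 0 and the tone gains almost nothing.
for _ in range(10):
    rw_update(V, ["light", "tone"], lam=1.0)

# Extinction: light alone, no US. Negative PE, so V(light) declines.
for _ in range(10):
    rw_update(V, ["light"], lam=0.0)

print(V)  # tone stays near 0 (blocking); light has fallen from ~1 (extinction)
```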
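A minimal TD(0) sketch on a short chain of states, showing how step-by-step prediction errors propagate value back to earlier, predictive states - the analogue of dopamine responses shifting from the reward to the cue. The chain length, alpha and gamma are illustrative assumptions.

```python
# TD(0) on a chain of states ending in reward, assuming a fixed forward-moving policy.
n_states = 5                   # states 0..4; reward delivered on leaving state 4
alpha, gamma = 0.1, 0.9
V = [0.0] * n_states

for episode in range(200):
    for s in range(n_states):
        r = 1.0 if s == n_states - 1 else 0.0           # reward only at the final step
        v_next = 0.0 if s == n_states - 1 else V[s + 1]
        td_error = r + gamma * v_next - V[s]            # step-by-step prediction error
        V[s] += alpha * td_error                        # update at every time step

# Value propagates backwards to earlier, predictive states.
print([round(v, 2) for v in V])   # roughly [0.66, 0.73, 0.81, 0.9, 1.0]
```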
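A minimal Q-learning sketch for the "Practical application" card, using a one-dimensional corridor as a stand-in for the grid world: the goal sits at one end, punishment at the other, and the policy improves as Q-values are adjusted by prediction errors. The layout, rewards, epsilon-greedy exploration and hyperparameters are illustrative assumptions, not the tutorial's actual grid world.

```python
# Q-learning with an epsilon-greedy agent on a small corridor world.
import random

n_states = 6                        # states 0..5; 0 = punishment, 5 = goal
actions = [-1, +1]                  # move left or right
alpha, gamma, eps = 0.5, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}

def step(s, a):
    """Take action a in state s; return (next state, reward, done)."""
    s2 = max(0, min(n_states - 1, s + a))
    if s2 == n_states - 1:
        return s2, 1.0, True        # reached the goal
    if s2 == 0:
        return s2, -1.0, True       # fell into the punishment state
    return s2, 0.0, False

for episode in range(500):
    s, done = 2, False              # start nearer the punishing end
    while not done:
        # epsilon-greedy action selection
        a = random.choice(actions) if random.random() < eps else max(actions, key=lambda x: Q[(s, x)])
        s2, r, done = step(s, a)
        best_next = 0.0 if done else max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])   # PE-driven update
        s = s2

# The learned policy moves right (towards the goal) in every interior state.
print({s: max(actions, key=lambda x: Q[(s, x)]) for s in range(1, n_states - 1)})
```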