Lecture Plan

| Week | Topic | Contents | Download |
|---|---|---|---|
| 01 | Introduction to Reinforcement Learning | Introduction, Basic Ideas, RL Fields | |
| 02 | What does RL Learn? | Key Concepts, Trajectory, Discount Factor, Value Functions | |
| 03 | Taxonomy & MDP Basics | RL Taxonomy, MDP Basic Theory | |
| 04 | Markov Decision Process in RL | Concept of MDP, Relationship between MDP and RL | |
| 05 | Dynamic Programming in RL | Recap DP, Bellman Equation, Value Propagation, Policy and Value Iteration | |
| 06 | Monte Carlo Methods in RL | DP Limitations, Sample Return (G_t), First-visit, Every-visit, MC Controls, Limitations of MC | |
| 07 | Temporal Difference | TD prediction and controls, SARSA, Q-learning, Practical Advantage of TD | |
| 08 | MAB, Exploration vs. Exploitation | Multi-Armed Bandit Problem, Exploration, Exploitation | |
| 09 | Multi-Armed Bandit Practice | Multi-Armed Bandit Implementation, Analytics | ◦ Guide Page ◦ Code Exercise: Python script, notebook |
| 10 | Q-Learning | Bellman and Q-learning Theory | |
| 11 | DQN | Deep Q-learning & Practice | ◦ DQN Theory (pdf) ◦ DQN Practice (pdf) ◦ Codes (.zip) |
| 12 | Coding Practice | Practice for Code Repair using RL & LLM | ◦ Guide Page ◦ Codes (.zip) |
| 13 | Double DQN | Double Q-learning & Practice | ◦ Theory ◦ Practice Guide ◦ Codes (.zip) |
| 14 | Dueling DQN | Dueling Deep Q-learning Theory | ◦ Theory (.pdf) |
| 15 | Policy Gradient | Policy Optimization, Derivation, Tricks, Intro to Advanced Policy Gradient Methods | ◦ Theory (.pdf) |
| 16 | Play with SB3 | Introduction to SB3, Gymnasium Integration, Practical RL Experiments | ◦ Theory (.pdf) ◦ Codes (.zip) |
| 17 | Actor-Critic Algorithm | Actor-Critic basics: critic-based bootstrapping, baseline/advantage with TD error (A2C), and A2C/A3C with practical stability ideas | |
| 18 | DDPG, TD3, SAC | Continuous control problems, deterministic policy gradient, actor-critic structure, TD3 stabilization tricks, and SAC entropy-regularized policy optimization | |
| 19 | TRPO, PPO | Trust Region, KL-Divergence, Computational Issues, Policy Ratio, Clipping | |
| 20 | IL, RLHF | Imitation Learning, Reinforcement Learning from Human Feedback | |