Reinforcement Learning: Learning Journey
Introduction
This section documents my journey of learning Reinforcement Learning (RL) using Stanford CS234 and additional research resources.
My goal is not only to understand RL theory but also to implement algorithms from scratch, apply them to robotics, and eventually build intelligent decision-making systems for real-world applications.
Why Reinforcement Learning?
Reinforcement Learning studies how an agent can learn through interaction with an environment.
Unlike supervised learning, the agent is not given correct answers. Instead, it learns by taking actions, observing consequences, and receiving rewards.
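As a rough illustration of this interaction loop, here is a minimal sketch. The `Corridor` environment, the `run_episode` helper, and the random policy are all hypothetical names invented for this example; they are not taken from CS234 or any library. The agent never sees a "correct" action. It only observes the next state and a reward.

```python
import random


class Corridor:
    """Hypothetical 1-D environment: cells 0..length-1, goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clamp the position.
        self.position = min(max(self.position + action, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0  # the only feedback the agent receives
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the list of (state, action, reward) steps."""
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(state)                       # agent acts
        next_state, reward, done = env.step(action)  # environment responds
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory


rng = random.Random(0)  # fixed seed so runs are repeatable


def random_policy(state):
    # Assumed placeholder policy: ignores the state and picks a direction at random.
    return rng.choice([-1, 1])


trajectory = run_episode(Corridor(), random_policy)
total_reward = sum(reward for _, _, reward in trajectory)
print(f"steps={len(trajectory)} total_reward={total_reward}")
```

A learning algorithm would replace `random_policy` with something that uses the recorded rewards to improve its choices over time.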
RL combines:
- Learning
- Decision Making
- Optimization
- Planning
- Exploration
Applications include:
- Robotics
- Game Playing
- Autonomous Systems
- Recommendation Systems
- RLHF for Large Language Models
Learning Resources
Primary Resource
Stanford CS234 Reinforcement Learning
Playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rN4wG6Nk6sNpTEbuOSosZdX
Theory Roadmap
Phase 1: Foundations of Reinforcement Learning
- Artificial Intelligence
- Machine Learning
- Reinforcement Learning
- Sequential Decision Making
- Agent-Environment Interaction
- States, Actions, Rewards
- Markov Property
- Markov Processes
- MDPs
- Returns
- Value Functions
- Bellman Equations
- Policy Evaluation
- Policy Improvement
- Policy Iteration
- Value Iteration
- Dynamic Programming
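To make the Bellman Equations and Value Iteration items concrete, here is a small sketch of value iteration. It assumes a hypothetical deterministic corridor MDP made up for this example: five cells, a terminal goal in the last cell, reward 1 for entering the goal, and discount factor 0.9. The function name and parameters are illustrative only.

```python
def value_iteration(n_states=5, gamma=0.9, tol=1e-8):
    """Value iteration on a hypothetical deterministic corridor MDP.

    States are cells 0..n_states-1 and the last cell is terminal. Actions
    move one cell left or right (clamped at the left wall), and entering
    the terminal cell yields reward 1.
    """
    goal = n_states - 1
    values = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(goal):  # the terminal state keeps value 0
            candidates = []
            for action in (-1, 1):
                nxt = min(max(s + action, 0), goal)
                reward = 1.0 if nxt == goal else 0.0
                candidates.append(reward + gamma * values[nxt])
            best = max(candidates)  # Bellman optimality backup
            delta = max(delta, abs(best - values[s]))
            values[s] = best  # in-place update
        if delta < tol:  # stop once a full sweep changes nothing noticeable
            return values


values = value_iteration()
print([round(v, 3) for v in values])  # -> [0.729, 0.81, 0.9, 1.0, 0.0]
```

Each cell's value is the discounted reward for walking straight to the goal, so the values shrink by a factor of 0.9 per step of distance.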
Phase 2: Learning From Experience
- Monte Carlo Methods
- Temporal Difference Learning
- Bootstrapping
- Generalized Policy Iteration
- Exploration vs Exploitation
- Epsilon Greedy
- SARSA
- Q-Learning
- Function Approximation
- Deep Q Networks
Phase 3: Policy Search
- Policy Gradient
- REINFORCE
- Baselines
- Advantage Functions
- Actor-Critic
- PPO
- GAE
Phase 4: Offline RL & Imitation Learning
- Offline RL
- Behavior Cloning
- DAGGER
- IRL
- GAIL
Phase 5: RLHF & Alignment
- Human Preferences
- Reward Models
- RLHF
- PPO in RLHF
- DPO
- Alignment
Phase 6: Exploration Theory
- Multi-Armed Bandits
- Regret
- UCB
- Thompson Sampling
- PAC Learning
Phase 7: Planning & Games
- Tree Search
- MCTS
- AlphaGo
- AlphaZero
- Self-Play
Phase 8: Advanced RL
- Multi-Agent RL
- Credit Assignment
- Uncertainty
- Reward Engineering
- AI Safety
Phase 9: Research & Mastery
- Reading Papers
- Reproducing Papers
- Building RL Systems
- Benchmarking
- Open Source Contributions
Long-Term Projects
Statistics Management System
A decision-support system that combines Business Rules, NLP, and Reinforcement Learning to help organizations make better decisions.
Dynamic Expert System
A continuously learning expert system capable of adapting to new information and changing environments.
Spiritual AI Assistant
A personal research project focused on combining structured knowledge, learning systems, and AI assistance to support spiritual learning and guidance.
Current Progress
Theory
Completed:
- Phase 1
- Phase 2
- Phase 3
Practical
Starting Implementation Journey
Current Focus:
- RL Engineering Foundations
- LineWorld
- Multi-Armed Bandits
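As a possible starting point for the Multi-Armed Bandits item, here is a minimal epsilon-greedy sketch. Everything in it is an assumption made for illustration: the function name `run_epsilon_greedy`, the Gaussian rewards with unit variance, the arm means, and the default `epsilon` of 0.1.

```python
import random


def run_epsilon_greedy(true_means, steps=5000, epsilon=0.1, seed=0):
    """Estimate arm values on a Gaussian bandit with epsilon-greedy selection."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms       # how often each arm was pulled
    estimates = [0.0] * n_arms  # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            # exploit: pick the arm with the highest current estimate
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = rng.gauss(true_means[arm], 1.0)  # assumed unit-variance noise
        counts[arm] += 1
        # Incremental update of the sample mean for the chosen arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


estimates, counts = run_epsilon_greedy([0.1, 0.5, 0.9])
print("estimates:", [round(e, 2) for e in estimates])
print("pull counts:", counts)
```

With a small `epsilon`, most pulls should go to whichever arm currently looks best, while the occasional random pull keeps the other estimates from going stale.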
This is a living document that will be updated continuously as the journey progresses.