🚀 Reinforcement Learning: A Learning Journey

Introduction

This section documents my journey of learning Reinforcement Learning (RL) using Stanford CS234 and additional research resources.

My goal is not only to understand RL theory but also to implement algorithms from scratch, apply them to robotics, and eventually build intelligent decision-making systems for real-world applications.


Why Reinforcement Learning?

Reinforcement Learning studies how an agent can learn through interaction with an environment.

Unlike supervised learning, the agent is not given correct answers. Instead, it learns by taking actions, observing consequences, and receiving rewards.
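
As a minimal sketch of this act, observe, learn loop: the environment `NoisyTwoActionEnv`, its payoff probabilities, and the function `run_interaction` are hypothetical names chosen only for illustration, not something taken from CS234. The agent never sees which action is "correct"; it only keeps a running average of the rewards it receives.

```python
import random


class NoisyTwoActionEnv:
    """Hypothetical one-step environment: action 1 pays off more often than action 0."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.success_prob = {0: 0.2, 1: 0.8}  # assumed values, hidden from the agent

    def step(self, action):
        # The environment returns only a reward, never the "correct" action.
        return 1.0 if self.rng.random() < self.success_prob[action] else 0.0


def run_interaction(num_steps=2000, epsilon=0.1, seed=0):
    env = NoisyTwoActionEnv(seed)
    rng = random.Random(seed + 1)
    value_estimate = [0.0, 0.0]  # running average reward per action
    counts = [0, 0]
    for _ in range(num_steps):
        # Act: mostly pick the action that currently looks best, sometimes explore.
        if rng.random() < epsilon:
            action = rng.randrange(2)
        else:
            action = max(range(2), key=lambda a: value_estimate[a])
        # Observe the consequence and receive a reward.
        reward = env.step(action)
        # Learn: incremental average of the rewards seen for this action.
        counts[action] += 1
        value_estimate[action] += (reward - value_estimate[action]) / counts[action]
    return value_estimate, counts
```

Calling `run_interaction()` returns the learned reward estimates and how often each action was tried; with the assumed probabilities, the agent should end up preferring action 1 purely from reward feedback.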

RL combines:

  • Learning
  • Decision Making
  • Optimization
  • Planning
  • Exploration

Applications include:

  • Robotics
  • Game Playing
  • Autonomous Systems
  • Recommendation Systems
  • RLHF for Large Language Models

Learning Resources

Primary Resource

Stanford CS234 Reinforcement Learning
Playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rN4wG6Nk6sNpTEbuOSosZdX


Theory Roadmap

Phase 1: Foundations of Reinforcement Learning

  • Artificial Intelligence
  • Machine Learning
  • Reinforcement Learning
  • Sequential Decision Making
  • Agent–Environment Interaction
  • States, Actions, Rewards
  • Markov Property
  • Markov Processes
  • MDPs
  • Returns
  • Value Functions
  • Bellman Equations
  • Policy Evaluation
  • Policy Improvement
  • Policy Iteration
  • Value Iteration
  • Dynamic Programming
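
Several of the Phase 1 topics (MDPs, value functions, Bellman equations, value iteration) fit into one small example. The following is a sketch under assumptions: the three-state `chain` MDP, the `transitions` dictionary layout, and the function name `value_iteration` are hypothetical and exist only to make the Bellman optimality backup concrete.

```python
def value_iteration(transitions, gamma=0.9, tol=1e-8):
    """transitions[s][a] is a list of (probability, next_state, reward) tuples."""
    values = {s: 0.0 for s in transitions}

    def backup(state, action):
        # Expected one-step reward plus discounted value of the next state.
        return sum(p * (r + gamma * values[s2]) for p, s2, r in transitions[state][action])

    while True:
        delta = 0.0
        for s, actions in transitions.items():
            if not actions:  # terminal state: value stays 0
                continue
            # Bellman optimality backup: take the best action's expected return.
            best = max(backup(s, a) for a in actions)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break

    # Greedy policy with respect to the converged values.
    policy = {s: max(actions, key=lambda a: backup(s, a))
              for s, actions in transitions.items() if actions}
    return values, policy


# Hypothetical deterministic chain: s0 -> s1 -> goal, reward 1 on reaching the goal.
chain = {
    "s0": {"stay": [(1.0, "s0", 0.0)], "right": [(1.0, "s1", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 0.0)], "right": [(1.0, "goal", 1.0)]},
    "goal": {},
}
values, policy = value_iteration(chain)
```

With a discount of 0.9, the value of `s1` is the immediate reward 1 and the value of `s0` is that reward discounted by one step, so the greedy policy moves right in both states.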

Phase 2: Learning From Experience

  • Monte Carlo Methods
  • Temporal Difference Learning
  • Bootstrapping
  • Generalized Policy Iteration
  • Exploration vs Exploitation
  • Epsilon-Greedy
  • SARSA
  • Q-Learning
  • Function Approximation
  • Deep Q Networks
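
To tie together temporal difference learning, bootstrapping, epsilon-greedy exploration, and Q-learning, here is a tabular sketch. The corridor environment, the reward of 1 at the right end, the step cap, and the function name `q_learning_chain` are all assumptions made for illustration.

```python
import random


def q_learning_chain(num_states=5, episodes=500, alpha=0.1, gamma=0.9,
                     epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical corridor; actions: 0 = left, 1 = right."""
    rng = random.Random(seed)
    goal = num_states - 1
    q = [[0.0, 0.0] for _ in range(num_states)]
    for _episode in range(episodes):
        state = 0
        for _step in range(100):  # assumed step cap per episode
            # Epsilon-greedy behavior policy, with ties broken at random.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            # Moving left from state 0 keeps the agent in place.
            next_state = max(0, state - 1) if action == 0 else state + 1
            done = next_state == goal
            reward = 1.0 if done else 0.0
            # Off-policy TD target: bootstrap from the best action in the next state.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q
```

After training, `q_learning_chain()` should give a higher value to "right" than to "left" in every non-goal state. Replacing `max(q[next_state])` with the value of the action actually taken next would turn this update into SARSA.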

Phase 3: Policy Gradient Methods

  • Policy Gradient
  • REINFORCE
  • Baselines
  • Advantage Functions
  • Actor-Critic
  • PPO
  • GAE

Phase 4: Offline RL & Imitation Learning

  • Offline RL
  • Behavior Cloning
  • DAgger
  • IRL
  • GAIL

Phase 5: RLHF & Alignment

  • Human Preferences
  • Reward Models
  • RLHF
  • PPO in RLHF
  • DPO
  • Alignment

Phase 6: Exploration Theory

  • Multi-Armed Bandits
  • Regret
  • UCB
  • Thompson Sampling
  • PAC Learning
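
As one concrete instance of the bandit topics above, the following sketch runs UCB1 on Bernoulli arms and tracks cumulative pseudo-regret. The arm probabilities, the number of rounds, and the function name `ucb1` are illustrative assumptions.

```python
import math
import random


def ucb1(arm_probs, num_rounds=2000, seed=0):
    """UCB1 on Bernoulli arms; returns pull counts and cumulative pseudo-regret."""
    rng = random.Random(seed)
    num_arms = len(arm_probs)
    counts = [0] * num_arms
    means = [0.0] * num_arms
    best_mean = max(arm_probs)
    regret = 0.0
    for t in range(1, num_rounds + 1):
        if t <= num_arms:
            arm = t - 1  # pull each arm once to initialize the estimates
        else:
            # Optimism under uncertainty: empirical mean plus a confidence bonus.
            arm = max(
                range(num_arms),
                key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
        # Pseudo-regret: expected reward lost by not pulling the best arm.
        regret += best_mean - arm_probs[arm]
    return counts, regret
```

For example, `ucb1([0.2, 0.5, 0.8])` should concentrate most pulls on the last arm while still sampling the others occasionally, which is why its regret grows much more slowly than that of uniform random play.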

Phase 7: Planning & Games

  • Tree Search
  • MCTS
  • AlphaGo
  • AlphaZero
  • Self-Play

Phase 8: Advanced RL

  • Multi-Agent RL
  • Credit Assignment
  • Uncertainty
  • Reward Engineering
  • AI Safety

Phase 9: Research & Mastery

  • Reading Papers
  • Reproducing Papers
  • Building RL Systems
  • Benchmarking
  • Open Source Contributions

Long-Term Projects

Statistics Management System

A decision-support system that combines Business Rules, NLP, and Reinforcement Learning to help organizations make better decisions.


Dynamic Expert System

A continuously learning expert system capable of adapting to new information and changing environments.


Spiritual AI Assistant

A personal research project focused on combining structured knowledge, learning systems, and AI assistance to support spiritual learning and guidance.


Current Progress

Theory

✅ Completed

  • Phase 1
  • Phase 2
  • Phase 3

Practical

🔄 Starting Implementation Journey

Current Focus:

  • RL Engineering Foundations
  • LineWorld
  • Multi-Armed Bandits

This is a living document that will be updated continuously as the journey progresses. 📚