PPO

Proximal Policy Optimization with clipped surrogate objective.

Placeholder content for PPO.

TRPO: Trust Region Policy Optimization with KL divergence constraint.

RL for Sequence Generation and RLHF: Applying policy gradient to sequence models, reward modeling, and alignment.