
  • The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization

    Abstract

    This work is the first to openly reproduce the Reinforcement Learning from Human Feedback (RLHF) scaling behaviors reported in OpenAI’s seminal TL;DR summarization work (Stiennon et al., 2020). We create an RLHF pipeline from scratch, enumerate over 20 key implementation details, and share key insights gained during the reproduction. Our RLHF-trained Pythia models demonstrate significant gains in response quality that scale with model size, with our 2.8B and 6.9B models outperforming OpenAI’s released 1.3B checkpoint. We publicly release the trained model checkpoints and code to facilitate further research and accelerate progress in the field (https://github.com/vwxyzjn/summarize_from_feedback_details).
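    The pipeline summarized above follows the RLHF-with-PPO recipe of Stiennon et al. (2020): a learned reward model scores each sampled summary, and a per-token KL penalty against the SFT (reference) policy keeps the PPO policy from drifting. The sketch below illustrates only that reward shaping; the tensor names and the `kl_coef` value are illustrative assumptions, not the released code.

    ```python
    import torch

    def shaped_rewards(policy_logprobs: torch.Tensor,
                       ref_logprobs: torch.Tensor,
                       rm_score: torch.Tensor,
                       kl_coef: float = 0.05) -> torch.Tensor:
        """Per-token PPO rewards for RLHF: a KL penalty against the frozen SFT
        policy at every generated token, plus the scalar reward-model score
        added at the final token of the response."""
        kl = policy_logprobs - ref_logprobs   # per-token KL estimate, shape (batch, len)
        rewards = -kl_coef * kl               # penalize drifting from the SFT policy
        rewards[:, -1] += rm_score            # reward-model score lands on the last token
        return rewards

    # Toy usage with dummy tensors.
    b, t = 2, 8
    print(shaped_rewards(torch.randn(b, t), torch.randn(b, t), torch.randn(b)).shape)
    ```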

    1 Introduction

    There has been tremendous development in pre-trained large language models (LLMs) over the years (Radford et al., 2018, 2019; Brown et al., 2020; Rae et al., 2021). Given the previous tokens, these LLMs are trained to predict the next token accurately, and they can be prompted to solve a wide range of natural language processing (NLP) tasks. However, the next-token-prediction objective differs from the fundamental objective of aligning model outputs with human preferences.
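    As a concrete illustration of the next-token-prediction objective mentioned above (a toy sketch with random logits standing in for a real LLM, not the paper's training code), the loss is the cross-entropy between the model's distribution over the vocabulary at each position and the token that actually follows:

    ```python
    import torch
    import torch.nn.functional as F

    # Toy setup: a batch of token ids and random "model" logits over a small vocabulary.
    vocab_size, batch, seq_len = 100, 2, 6
    tokens = torch.randint(vocab_size, (batch, seq_len))
    logits = torch.randn(batch, seq_len, vocab_size)  # stand-in for LLM outputs

    # Position t predicts token t+1, so shift predictions and targets by one.
    pred = logits[:, :-1, :].reshape(-1, vocab_size)
    target = tokens[:, 1:].reshape(-1)
    loss = F.cross_entropy(pred, target)
    print(loss.item())
    ```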

    Abstract: Aligning foundation models is essential for their safe and trustworthy deployment. However, traditional fine-tuning methods are computationally intensive and require updating billions of model parameters. A promising alternative, alignment via decoding, adjusts the response distribution directly without model updates to maximize a target reward $r$, thus providing a lightweight and adaptable framework for alignment. However, principled decoding methods rely on oracle access to an optimal Q-function ($Q^*$), which is often unavailable in practice. Hence, prior SoTA methods either approximate this $Q^*$ using $Q^{\pi_{\text{sft}}}$ (derived from the reference $\texttt{SFT}$ model) or rely on short-term rewards, resulting in sub-optimal decoding performance. In this work, we propose $\texttt{Transfer Q}^*$, which implicitly estimates the optimal value function for a target reward $r$ through a baseline model $\rho_{\texttt{BL}}$ aligned with a baseline reward $r_{\texttt{BL}}$ (which can be different from the target reward $r$). Theoretical analyses of $\texttt{Transfer Q}^*$ provide a rigorous characterization of its optimality, deriving an upper bound on the sub-optimality gap and identifying a hyperparameter to control the deviation from the pre-trained reference $\texttt{SFT}$ model.
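    A heavily simplified sketch of the decoding scheme described above may help: each candidate next token is scored by its log-probability under the reference $\texttt{SFT}$ model plus an implicit value estimate obtained by rolling out the baseline-aligned model $\rho_{\texttt{BL}}$ and scoring the completed response with the target reward $r$, with a coefficient `alpha` controlling the deviation from the reference model. All function names below are placeholders for real model and reward calls, not the paper's implementation.

    ```python
    import math
    from typing import Callable, List

    def transfer_q_decode_step(prefix: List[str],
                               candidates: List[str],
                               sft_logprob: Callable[[List[str], str], float],
                               rollout_baseline: Callable[[List[str]], List[str]],
                               target_reward: Callable[[List[str]], float],
                               alpha: float = 1.0) -> str:
        """Pick the next token by combining the SFT log-probability with an
        implicit Q estimate: complete the response with the baseline-aligned
        model from each candidate and score it under the target reward."""
        best_token, best_score = None, -math.inf
        for tok in candidates:
            rollout = rollout_baseline(prefix + [tok])   # rho_BL completes the response
            q_estimate = target_reward(rollout)          # implicit Q(prefix, tok) under r
            score = sft_logprob(prefix, tok) + alpha * q_estimate
            if score > best_score:
                best_token, best_score = tok, score
        return best_token

    # Toy usage with stub models and a keyword-based stand-in for the reward.
    pick = transfer_q_decode_step(
        prefix=["The", "movie", "was"],
        candidates=["great", "terrible"],
        sft_logprob=lambda p, t: {"great": -1.0, "terrible": -1.2}[t],
        rollout_baseline=lambda p: p + ["overall", "."],
        target_reward=lambda resp: 1.0 if "great" in resp else 0.0,
    )
    print(pick)  # -> "great"
    ```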

  • Towards Geo-Culturally Grounded LLM Generations

    Piyawat Lertvittayakumjorn⋆†, David Kinney⋆†‡,
    Vinodkumar Prabhakaran, Donald Martin, Jr., Sunipa Dev

    Google  Washington University in St. Louis
    {piyawat,vinodkpg,dxm,sunipadev}@google.com, kinney@wustl.edu

    Abstract

    Generative large language models (LLMs) have been demonstrated to have gaps in diverse, cultural knowledge across the globe. We investigate the effect of retrieval augmented generation and search-grounding techniques on the ability of LLMs to display familiarity with a diverse range of national cultures. Specifically, we compare the performance of standard LLMs, LLMs augmented with retrievals from a customized knowledge base (i.e., KB grounding), and LLMs augmented with retrievals from a web search (i.e., search grounding) on a series of cultural familiarity benchmarks. We find that search grounding significantly improves LLM performance on multiple-choice benchmarks that test propositional knowledge (e.g., the norms, artifacts, and institutions of national cultures), while KB grounding’s effectiveness is limited by inadequate knowledge base coverage and a suboptimal retriever. However, search grounding also increases the risk of stereotypical judgments by language models, while failing to
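    The setups compared above differ only in what, if anything, is retrieved before the model answers. Below is a minimal sketch of the grounded variants, where KB grounding and search grounding simply plug in different retrieval backends; the function names and prompt template are illustrative assumptions, not the paper's evaluation code.

    ```python
    from typing import Callable, List

    def grounded_answer(question: str,
                        retrieve: Callable[[str, int], List[str]],
                        generate: Callable[[str], str],
                        k: int = 3) -> str:
        """Retrieval-augmented prompting: fetch top-k passages (from a knowledge
        base or a web-search backend, depending on `retrieve`) and prepend them
        to the question before calling the LLM."""
        passages = retrieve(question, k)
        context = "\n".join(f"- {p}" for p in passages)
        prompt = (
            "Use the following retrieved context to answer.\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        )
        return generate(prompt)

    # Toy usage with stub retriever and generator.
    stub_retrieve = lambda q, k: ["Songkran is the Thai New Year festival."][:k]
    stub_generate = lambda prompt: "Songkran."
    print(grounded_answer("What is the Thai New Year called?", stub_retrieve, stub_generate))
    ```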