Reinforced Self-Training (ReST) for Language Modeling

C. Gulcehre, T. Paine, S. Srinivasan, K. Konyushkova, L. Weerts, A. Sharma, A. Siddhant, A. Ahern, M. Wang, C. Gu, W. Macherey, A. Doucet, O. Firat, and N. de Freitas.
(2023). arXiv:2308.08998. Comment: 23 pages, 16 figures.

Abstract

Reinforcement learning from human feedback (RLHF) can improve the quality of large language model (LLM) outputs by aligning them with human preferences. We propose a simple algorithm for aligning LLMs with human preferences inspired by growing batch reinforcement learning (RL), which we call Reinforced Self-Training (ReST). Given an initial LLM policy, ReST produces a dataset by generating samples from the policy, which are then used to improve the LLM policy using offline RL algorithms. ReST is more efficient than typical online RLHF methods because the training dataset is produced offline, which allows data reuse. While ReST is a general approach applicable to all generative learning settings, we focus on its application to machine translation. Our results show that ReST can substantially improve translation quality, as measured by automated metrics and human evaluation on machine translation benchmarks, in a compute- and sample-efficient manner.
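
The loop described in the abstract (sample a dataset offline from the current policy, then improve the policy on that dataset) can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the authors' implementation: the policy is a plain dictionary of output probabilities, `reward_fn` is a hypothetical scoring function, and refitting on samples above a reward threshold stands in for an offline RL algorithm. The names `sample_dataset`, `improve`, and `rest_sketch` are invented for this example.

```python
import random
from collections import Counter


def sample_dataset(policy, prompts, n_per_prompt, rng):
    """Generate (prompt, output) samples offline from the current policy."""
    data = []
    for prompt in prompts:
        outputs, weights = zip(*policy[prompt].items())
        for _ in range(n_per_prompt):
            data.append((prompt, rng.choices(outputs, weights=weights)[0]))
    return data


def improve(policy, dataset, reward_fn, threshold):
    """Toy stand-in for an offline RL step (assumption of this sketch):
    refit each prompt's distribution on samples with reward >= threshold."""
    kept = Counter((p, y) for p, y in dataset if reward_fn(p, y) >= threshold)
    new_policy = {}
    for prompt, dist in policy.items():
        counts = {y: kept[(prompt, y)] for y in dist}
        total = sum(counts.values())
        if total:
            new_policy[prompt] = {y: c / total for y, c in counts.items()}
        else:
            # Nothing passed the filter: keep the previous distribution.
            new_policy[prompt] = dict(dist)
    return new_policy


def rest_sketch(policy, prompts, reward_fn, n_rounds=2,
                n_per_prompt=200, thresholds=(0.5, 0.9), seed=0):
    """Alternate dataset generation and policy improvement."""
    rng = random.Random(seed)
    for _ in range(n_rounds):
        dataset = sample_dataset(policy, prompts, n_per_prompt, rng)
        # The same offline dataset is reused for several improvement steps.
        for threshold in thresholds:
            policy = improve(policy, dataset, reward_fn, threshold)
    return policy
```

In this sketch the dataset is generated once per round and reused across all thresholds, which mirrors the data reuse that the abstract credits for the method's efficiency. The threshold values and round counts are arbitrary choices.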

BibTeX key: gulcehre2023reinforced
entry type: misc
year: 2023
url: http://arxiv.org/abs/2308.08998
note: arXiv:2308.08998. Comment: 23 pages, 16 figures
