Making language models bigger does not inherently make them better at
following a user's intent. For example, large language models can generate
outputs that are untruthful, toxic, or simply not helpful to the user. In other
words, these models are not aligned with their users. In this paper, we show an
avenue for aligning language models with user intent on a wide range of tasks
by fine-tuning with human feedback. Starting with a set of labeler-written
prompts and prompts submitted through the OpenAI API, we collect a dataset of
labeler demonstrations of the desired model behavior, which we use to fine-tune
GPT-3 using supervised learning. We then collect a dataset of rankings of model
outputs, which we use to further fine-tune this supervised model using
reinforcement learning from human feedback. We call the resulting models
InstructGPT. In human evaluations on our prompt distribution, outputs from the
1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3,
despite having 100x fewer parameters. Moreover, InstructGPT models show
improvements in truthfulness and reductions in toxic output generation while
having minimal performance regressions on public NLP datasets. Even though
InstructGPT still makes simple mistakes, our results show that fine-tuning with
human feedback is a promising direction for aligning language models with human
intent.
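
The abstract describes a three-stage pipeline: supervised fine-tuning on labeler demonstrations, a reward model trained on rankings of model outputs, and reinforcement learning against that reward model. The sketch below is a minimal toy illustration of those three stages, not the paper's implementation: the tiny linear "policy" and "reward model", the feature dimensions, the loss weights, the random toy data, and the single REINFORCE-style update with a KL penalty (standing in for the paper's full RL procedure) are all placeholder assumptions for illustration.

# Toy three-stage sketch (assumptions: tiny linear models, random toy data,
# a REINFORCE-style step with KL penalty in place of the paper's RL procedure).
import copy
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, feat_dim = 8, 16

# Stage 1: supervised fine-tuning on labeler demonstrations.
# A tiny linear layer stands in for the language model: prompt features -> token logits.
policy = torch.nn.Linear(feat_dim, vocab_size)
sft_optim = torch.optim.Adam(policy.parameters(), lr=1e-2)
prompt_feats = torch.randn(32, feat_dim)            # toy prompt representations
demo_tokens = torch.randint(0, vocab_size, (32,))   # toy demonstrated targets
for _ in range(50):
    sft_loss = F.cross_entropy(policy(prompt_feats), demo_tokens)
    sft_optim.zero_grad()
    sft_loss.backward()
    sft_optim.step()

# Stage 2: reward model fit to pairwise preferences derived from rankings.
# Loss: -log sigmoid(r_preferred - r_dispreferred), so preferred outputs score higher.
reward_model = torch.nn.Linear(feat_dim, 1)
rm_optim = torch.optim.Adam(reward_model.parameters(), lr=1e-2)
preferred = torch.randn(64, feat_dim)               # features of preferred outputs
dispreferred = torch.randn(64, feat_dim)            # features of dispreferred outputs
for _ in range(50):
    margin = reward_model(preferred) - reward_model(dispreferred)
    rm_loss = -F.logsigmoid(margin).mean()
    rm_optim.zero_grad()
    rm_loss.backward()
    rm_optim.step()

# Stage 3: RL fine-tuning of the SFT policy against the reward model.
# One REINFORCE-style step with a KL penalty toward the frozen SFT policy; the toy
# reward scores only the prompt features, unlike a real (prompt, response) reward.
sft_reference = copy.deepcopy(policy)
rl_optim = torch.optim.Adam(policy.parameters(), lr=1e-3)
dist = torch.distributions.Categorical(logits=policy(prompt_feats))
ref_dist = torch.distributions.Categorical(logits=sft_reference(prompt_feats))
actions = dist.sample()                             # sampled toy "responses"
rewards = reward_model(prompt_feats).squeeze(-1).detach()
kl_penalty = torch.distributions.kl_divergence(dist, ref_dist).mean()
rl_loss = -(dist.log_prob(actions) * rewards).mean() + 0.1 * kl_penalty
rl_optim.zero_grad()
rl_loss.backward()
rl_optim.step()
print(f"SFT {sft_loss.item():.3f} | RM {rm_loss.item():.3f} | RL {rl_loss.item():.3f}")

The KL penalty in the last stage keeps the fine-tuned policy close to the supervised model, which mirrors the abstract's point about achieving alignment gains with minimal performance regressions.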
Description
Training language models to follow instructions with human feedback
%0 Generic
%1 ouyang2022training
%A Ouyang, Long
%A Wu, Jeff
%A Jiang, Xu
%A Almeida, Diogo
%A Wainwright, Carroll L.
%A Mishkin, Pamela
%A Zhang, Chong
%A Agarwal, Sandhini
%A Slama, Katarina
%A Ray, Alex
%A Schulman, John
%A Hilton, Jacob
%A Kelton, Fraser
%A Miller, Luke
%A Simens, Maddie
%A Askell, Amanda
%A Welinder, Peter
%A Christiano, Paul
%A Leike, Jan
%A Lowe, Ryan
%D 2022
%K toread
%T Training language models to follow instructions with human feedback
%U http://arxiv.org/abs/2203.02155
@misc{ouyang2022training,
added-at = {2023-03-11T17:39:46.000+0100},
author = {Ouyang, Long and Wu, Jeff and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll L. and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and Schulman, John and Hilton, Jacob and Kelton, Fraser and Miller, Luke and Simens, Maddie and Askell, Amanda and Welinder, Peter and Christiano, Paul and Leike, Jan and Lowe, Ryan},
biburl = {https://www.bibsonomy.org/bibtex/2e33c0df9fc8e486b5f372aeac3fcf7b8/hotho},
description = {Training language models to follow instructions with human feedback},
interhash = {f7d728e71594758edda91ecd8c84ee49},
intrahash = {e33c0df9fc8e486b5f372aeac3fcf7b8},
keywords = {toread},
note = {cite arxiv:2203.02155},
timestamp = {2023-03-11T17:39:46.000+0100},
title = {Training language models to follow instructions with human feedback},
url = {http://arxiv.org/abs/2203.02155},
year = 2022
}