Author of the publication


Foundational Challenges in Assuring Alignment and Safety of Large Language Models.

U. Anwar, A. Saparov, J. Rando, D. Paleka, M. Turpin, P. Hase, E. Lubana, E. Jenner, S. Casper, O. Sourbut, B. Edelman, Z. Zhang, M. Günther, A. Korinek, J. Hernández-Orallo, L. Hammond, E. Bigelow, A. Pan, L. Langosco, T. Korbak, H. Zhang, R. Zhong, S. hÉigeartaigh, G. Recchia, G. Corsi, A. Chan, M. Anderljung, L. Edwards, Y. Bengio, D. Chen, S. Albanie, T. Maharaj, J. Foerster, F. Tramèr, H. He, A. Kasirzadeh, Y. Choi, and D. Krueger. CoRR, (2024)

Please choose a person to relate this publication to

To distinguish between persons with the same name, the academic degree and the title of an important publication are displayed. You can also use the button next to the name to display some publications already assigned to the person.

Wilhelm Casper

Rudolf Casper

Johannes Casper

Birge Casper

Patrick Casper

Other publications of authors with the same name

Frivolous Units: Wider Networks Are Not Really That Wide. S. Casper, X. Boix, V. D'Amario, L. Guo, M. Schrimpf, K. Vinken, and G. Kreiman. AAAI, page 6921-6929. AAAI Press, (2021)

Benchmarking Interpretability Tools for Deep Neural Networks. S. Casper, Y. Li, J. Li, T. Bu, K. Zhang, and D. Hadfield-Menell. CoRR, (2023)

Robust Feature-Level Adversaries are Interpretability Tools. S. Casper, M. Nadeau, D. Hadfield-Menell, and G. Kreiman. NeurIPS, (2022)

Multilevel Interpretability Of Artificial Neural Networks: Leveraging Framework And Methods From Neuroscience. Z. He, J. Achterberg, K. Collins, K. Nejad, D. Akarca, Y. Yang, W. Gurnee, I. Sucholutsky, Y. Tang, R. Ianov and 6 other author(s). CoRR, (2024)

Open Problems in Technical AI Governance. A. Reuel, B. Bucknall, S. Casper, T. Fist, L. Soder, O. Aarne, L. Hammond, L. Ibrahim, A. Chan, P. Wills and 21 other author(s). CoRR, (2024)

The AI Risk Repository: A Comprehensive Meta-Review, Database, and Taxonomy of Risks From Artificial Intelligence. P. Slattery, A. Saeri, E. Grundy, J. Graham, M. Noetel, R. Uuk, J. Dao, S. Pour, S. Casper, and N. Thompson. CoRR, (2024)

White-Box Adversarial Policies in Deep Reinforcement Learning. S. Casper, D. Hadfield-Menell, and G. Kreiman. CoRR, (2022)

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. S. Casper, X. Davies, C. Shi, T. Gilbert, J. Scheurer, J. Rando, R. Freedman, T. Korbak, D. Lindner, P. Freire and 22 other author(s). Trans. Mach. Learn. Res., (2023)

Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs. A. Sheshadri, A. Ewart, P. Guo, A. Lynch, C. Wu, V. Hebbar, H. Sleight, A. Stickland, E. Perez, D. Hadfield-Menell and 1 other author(s). CoRR, (2024)

Rethinking Machine Unlearning for Large Language Models. S. Liu, Y. Yao, J. Jia, S. Casper, N. Baracaldo, P. Hase, X. Xu, Y. Yao, H. Li, K. Varshney and 3 other author(s). CoRR, (2024)

BibSonomy

Disambiguation of "Casper, Stephen"
