
DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

Yanrong Ji, Zhihan Zhou, Han Liu, and Ramana V. Davuluri. Bioinformatics, 37 (15): 2112-2120 (February 2021)
DOI: 10.1093/bioinformatics/btab083

Abstract

Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. The gene regulatory code is highly complex due to polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios. To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory element prediction and demonstrated its ease of use, accuracy and efficiency. We show that a single pre-trained transformer model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after straightforward fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationships within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that DNABERT pre-trained on the human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned to many other sequence analysis tasks.

The source code and the pre-trained and fine-tuned models for DNABERT are available at GitHub (https://github.com/jerryji1993/DNABERT). Supplementary data are available at Bioinformatics online.
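To make the fine-tuning step described in the abstract concrete, below is a minimal sketch of adapting a pre-trained DNABERT-style checkpoint to a binary classification task (e.g. promoter vs. non-promoter) with the Hugging Face transformers API. The checkpoint path, the k-mer size, and the toy labelled sequences are illustrative assumptions, not details taken from the paper; in practice the pre-trained weights and task data would come from the linked GitHub repository.

```python
# Minimal sketch (assumptions noted above): fine-tune a DNABERT-style
# checkpoint for binary sequence classification via Hugging Face transformers.
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

CHECKPOINT = "path/to/dnabert-checkpoint"  # placeholder for a pre-trained model from the DNABERT repo
K = 6                                      # assumed k-mer size used for tokenization

def to_kmers(seq: str, k: int = K) -> str:
    """Turn a raw DNA string into space-separated overlapping k-mer tokens."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

class DnaDataset(Dataset):
    """Tiny in-memory dataset standing in for small task-specific labeled data."""
    def __init__(self, sequences, labels, tokenizer, max_len=512):
        self.enc = tokenizer([to_kmers(s) for s in sequences],
                             truncation=True, padding="max_length",
                             max_length=max_len, return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.enc.items()}
        item["labels"] = self.labels[idx]
        return item

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)

# Toy examples: label 1 = promoter-like window, 0 = background (illustrative only).
train_seqs = ["ACGTAGCTAGCTAGGCTA" * 4, "TTTTAAAACCCCGGGGAC" * 4]
train_labels = [1, 0]
train_ds = DnaDataset(train_seqs, train_labels, tokenizer)

args = TrainingArguments(output_dir="dnabert-finetuned",
                         num_train_epochs=3,
                         per_device_train_batch_size=2,
                         learning_rate=2e-5)

Trainer(model=model, args=args, train_dataset=train_ds).train()
```

The same classification head and training loop can be reused across the tasks mentioned in the abstract (promoters, splice sites, transcription factor binding sites) by swapping in the corresponding labeled windows; only the small task-specific dataset changes, while the pre-trained encoder weights are shared.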
