Abstract
In this paper, short-length document classification of the arXiv dataset was performed using RoBERTa (Robustly Optimized BERT Pre-training Approach). Classification was performed on the title and abstract of each paper combined, as together they summarize the whole paper. The maximum sequence length that RoBERTa can process is 512 tokens, while abstract lengths vary from 150 to 250 words, so the combined input fits within this limit. The experiments showed that RoBERTa outperformed BERT on two datasets, the AAPD (Arxiv Academic Paper Dataset) and the Reuters dataset, compared with the results reported by Adhikari et al. The work extensively explored the AAPD dataset for abstract-based document classification, and the model was fine-tuned on it. The hyperparameters tuned were the maximum sequence length, batch size, learning rate, and the Adam optimizer settings. The model was trained and tested at different minimum paper frequencies per category, which resulted in different numbers of paper categories. The accuracy and F1-score were 0.68 and 0.69 for the 68 paper categories, 0.68 and 0.69 for the 51 paper categories, and both 0.79 for the 32 paper categories. With a larger number of papers in each category, the accuracy and F1-score of the model increased, at the cost of increased training time.
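As a rough illustration of the input preparation described above (combining the title and abstract, then truncating to RoBERTa's 512-token limit), the following sketch uses a simple whitespace tokenizer as a stand-in for RoBERTa's byte-pair encoder; the function name and tokenization are illustrative assumptions, not the paper's implementation.

```python
def build_input(title: str, abstract: str, max_seq_len: int = 512) -> list:
    """Combine title and abstract into one token sequence, truncated to
    the model's maximum sequence length. Whitespace tokens stand in for
    RoBERTa's BPE subword tokens in this sketch."""
    tokens = (title + " " + abstract).split()
    # Reserve two positions for RoBERTa's <s> and </s> special tokens.
    return tokens[: max_seq_len - 2]

# An abstract of 150-250 words plus a short title fits comfortably
# within the 512-token limit, so truncation is rarely triggered.
example = build_input("A Title", "word " * 200)
print(len(example))  # 202
```

In practice the subword tokenizer produces more tokens than whitespace splitting, but typical title-plus-abstract inputs still fall well below the 512-token ceiling.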