Article in conference proceedings

Fine Tuning of RoBERTa for Document Classification of Arxiv Dataset

, , and .
Mobile Computing and Sustainable Informatics: Proceedings of the 4th International Conference on Mobile Computing and Sustainable Informatics (ICMCSI 2023), Volume 166 of Lecture Notes on Data Engineering and Communications Technologies, Springer Singapore (May 2023)
DOI: https://doi.org/10.1007/978-981-99-0835-6_18

Abstract

In this paper, short-length document classification on the arXiv dataset was performed using RoBERTa (Robustly Optimized BERT Pre-training Approach). The classification was based on the abstract and the title of each paper combined, as together they summarize the whole paper. The maximum sequence length that RoBERTa can process is 512 tokens, while the abstracts range from roughly 150 to 250 words. The experiments showed that RoBERTa outperformed BERT on two datasets, the AAPD (Arxiv Academic Paper Dataset) and the Reuters dataset, compared with the results reported by Adhikari et al. The work extensively explored the AAPD dataset for abstract-based document classification, and the model was fine-tuned on it. The hyperparameters tuned were the maximum sequence length, batch size, Adam optimizer settings, and learning rate. The model was trained and tested with different minimum paper frequencies per category, which resulted in different numbers of paper categories. The accuracy and F1-score were 0.68 and 0.69 for the 68 paper categories, 0.68 and 0.69 for the 51 paper categories, and both 0.79 for the 32 paper categories. Using a larger number of papers in each category increased the accuracy and F1-score of the model, at the cost of increased training time.
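
The abstract describes fine-tuning RoBERTa on the combined title and abstract text, with the maximum sequence length, batch size, optimizer, and learning rate as tuned hyperparameters. Below is a minimal sketch of such a setup using the Hugging Face transformers library; the hyperparameter values, the 32-label count, the PaperDataset helper, and the in-memory example data are illustrative assumptions, not the configuration reported in the paper.

# Minimal sketch of fine-tuning RoBERTa for paper-category classification.
# Values and data below are illustrative assumptions, not the authors' setup.
import torch
from torch.utils.data import Dataset
from transformers import (
    RobertaTokenizerFast,
    RobertaForSequenceClassification,
    Trainer,
    TrainingArguments,
)

class PaperDataset(Dataset):
    # Wraps (title, abstract, label) triples; title and abstract are
    # concatenated so the input summarizes the whole paper.
    def __init__(self, titles, abstracts, labels, tokenizer, max_len=256):
        texts = [t + " " + a for t, a in zip(titles, abstracts)]
        self.enc = tokenizer(
            texts, truncation=True, padding="max_length", max_length=max_len
        )
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=32  # e.g. the 32-category setting
)

# Hypothetical in-memory data; in practice this would be an AAPD split.
train_ds = PaperDataset(
    ["An example title"], ["An example abstract ..."], [0], tokenizer
)

args = TrainingArguments(
    output_dir="roberta-aapd",
    per_device_train_batch_size=16,  # batch size is a tuned hyperparameter
    learning_rate=2e-5,              # learning rate is a tuned hyperparameter
    num_train_epochs=3,
)

# Trainer uses the AdamW optimizer by default.
Trainer(model=model, args=args, train_dataset=train_ds).train()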

Users

  • @amanshakya
