Abstract
In this paper, short-length document classification of the arXiv dataset was performed using RoBERTa (Robustly Optimized BERT Pre-training Approach). Classification was performed on the title and abstract of each paper combined, as together they summarize the whole paper. The maximum sequence length that RoBERTa can process is 512 tokens, while abstract lengths vary from 150 to 250 words, so the combined input fits within this limit. The experiments showed that RoBERTa outperformed BERT on two datasets, the AAPD (Arxiv Academic Paper Dataset) and the Reuters dataset, compared with the results reported by Adhikari et al. The work extensively explored the AAPD dataset for abstract-based document classification, and the model was fine-tuned on it. The hyperparameters tuned were the maximum sequence length, batch size, learning rate, and the Adam optimizer settings. The model was trained and tested at different minimum paper frequencies per category, which resulted in different numbers of paper categories. The accuracy and F1-score were 0.68 and 0.69 for the 68 paper categories, 0.68 and 0.69 for the 51 paper categories, and both 0.79 for the 32 paper categories. With a larger number of papers in each category, the accuracy and F1-score of the model increased, at the cost of increased training time.
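As a rough illustration of the input preparation described above (combining the title and abstract, then truncating to RoBERTa's 512-token limit), the following sketch uses a simple whitespace tokenizer as a stand-in for RoBERTa's byte-pair encoder; the function name and tokenization are illustrative assumptions, not the paper's implementation.

```python
def build_input(title: str, abstract: str, max_seq_len: int = 512) -> list:
    """Combine title and abstract into one token sequence, truncated to
    the model's maximum sequence length. Whitespace tokens stand in for
    RoBERTa's BPE subword tokens in this sketch."""
    tokens = (title + " " + abstract).split()
    # Reserve two positions for RoBERTa's <s> and </s> special tokens.
    return tokens[: max_seq_len - 2]

# An abstract of 150-250 words plus a short title fits comfortably
# within the 512-token limit, so truncation is rarely triggered.
example = build_input("A Title", "word " * 200)
print(len(example))  # 202
```

In practice the subword tokenizer produces more tokens than whitespace splitting, but typical title-plus-abstract inputs still fall well below the 512-token ceiling.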