ICITR - 2021
Permanent URI for this collection: http://192.248.9.226/handle/123/19432
Browsing ICITR - 2021 by Author "Bernard, N"
Now showing 1 - 1 of 1
- Item: Conference Full-text
  How to pretrain an efficient cross-disciplinary language model: the ScilitBERT use case (Faculty of Information Technology, University of Moratuwa, 2021-12)
  Authors: la Broise, JBD; Bernard, N; Dubuc, JP; Perlato, A; Latard, B; Ganegoda, GU; Mahadewa, KT
  Abstract: Transformer-based models are widely used in various text-processing tasks, such as classification and named entity recognition. Representing scientific texts is a complicated task, and general English BERT models are suboptimal for it. We observe a lack of models for the representation of multidisciplinary academic texts and, on a broader scale, a lack of specialized models pretrained on specific domains for which general English BERT models are suboptimal. This paper introduces ScilitBERT, a BERT model pretrained on an inclusive cross-disciplinary academic corpus. ScilitBERT is half as deep as RoBERTa and has a much lower pretraining computation cost. ScilitBERT obtains at least 96% of RoBERTa's accuracy on two academic-domain downstream tasks. The presented cross-disciplinary academic model has been publicly released at https://github.com/JeanBaptiste-dlb/ScilitBERT. The results show that for domains that use a technolect and have a sizeable amount of raw text data, the pretraining of dedicated models should be considered and favored.
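  As a rough illustration of how a released BERT-style checkpoint such as ScilitBERT might be applied to a downstream classification task, the following sketch assumes the weights have been downloaded from the linked repository into a local directory in a Hugging Face transformers-compatible format; the directory path and the number of labels are placeholders, not details taken from the paper.

  # Minimal sketch: using a domain-specific BERT checkpoint for sequence classification.
  # Assumes the ScilitBERT weights were obtained from the linked GitHub repository and
  # saved to ./scilitbert (hypothetical path) in a transformers-compatible format.
  import torch
  from transformers import AutoTokenizer, AutoModelForSequenceClassification

  MODEL_DIR = "./scilitbert"  # placeholder local path, not an official model identifier

  tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
  model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR, num_labels=2)

  inputs = tokenizer(
      "Transformer-based language models encode scientific abstracts.",
      return_tensors="pt",
      truncation=True,
  )
  with torch.no_grad():
      logits = model(**inputs).logits
  print(logits.argmax(dim=-1))  # predicted class index for the example sentence

  In practice one would fine-tune such a checkpoint on labeled in-domain data before inference; the snippet only shows the loading and forward-pass pattern.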