How to pretrain an efficient cross-disciplinary language model: The ScilitBERT use case

dc.contributor.author: la Broise, JBD
dc.contributor.author: Bernard, N
dc.contributor.author: Dubuc, JP
dc.contributor.author: Perlato, A
dc.contributor.author: Latard, B
dc.contributor.editor: Ganegoda, GU
dc.contributor.editor: Mahadewa, KT
dc.date.accessioned: 2022-11-09T08:25:37Z
dc.date.available: 2022-11-09T08:25:37Z
dc.date.issued: 2021-12
dc.description.abstract: Transformer-based models are widely used in various text processing tasks, such as classification and named entity recognition. The representation of scientific texts is a complicated task, and the use of general-English BERT models for it is suboptimal. We observe a lack of models for the representation of multidisciplinary academic texts and, on a broader scale, a lack of specialized models pretrained on specific domains for which general-English BERT models are suboptimal. This paper introduces ScilitBERT, a BERT model pretrained on an inclusive cross-disciplinary academic corpus. ScilitBERT is half as deep as RoBERTa and has a much lower pretraining computation cost. ScilitBERT obtains at least 96% of RoBERTa's accuracy on two academic-domain downstream tasks. The presented cross-disciplinary academic model has been publicly released at https://github.com/JeanBaptiste-dlb/ScilitBERT. The results show that for domains that use a technolect and have a sizeable amount of raw text data, the pretraining of dedicated models should be considered and favored.
dc.identifier.citation: J.-B. de la Broise, N. Bernard, J.-P. Dubuc, A. Perlato and B. Latard, "How to pretrain an efficient cross-disciplinary language model: The ScilitBERT use case," 2021 6th International Conference on Information Technology Research (ICITR), 2021, pp. 1-6, doi: 10.1109/ICITR54349.2021.9657164.
dc.identifier.conference: 6th International Conference on Information Technology Research 2021
dc.identifier.department: Information Technology Research Unit, Faculty of Information Technology, University of Moratuwa.
dc.identifier.doi: 10.1109/ICITR54349.2021.9657164
dc.identifier.faculty: IT
dc.identifier.place: Moratuwa, Sri Lanka
dc.identifier.proceeding: Proceedings of the 6th International Conference on Information Technology Research 2021
dc.identifier.uri: http://dl.lib.uom.lk/handle/123/19439
dc.identifier.year: 2021
dc.language.iso: en
dc.publisher: Faculty of Information Technology, University of Moratuwa.
dc.relation.uri: https://ieeexplore.ieee.org/document/9657164/
dc.subject: Language models
dc.subject: Clustering
dc.subject: Classification
dc.subject: Association rules
dc.subject: Benchmarking
dc.subject: Text analysis
dc.title: How to pretrain an efficient cross-disciplinary language model: The ScilitBERT use case
dc.type: Conference-Full-text
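
The abstract above describes pretraining a shallower BERT-style encoder (half the depth of RoBERTa) on a cross-disciplinary academic corpus. The sketch below is not the authors' released training code; it is a minimal illustration, using the Hugging Face transformers and datasets libraries, of how a 6-layer masked-language-model pretraining run of that kind could be set up. The corpus file name, the reuse of a general-English tokenizer, and all hyperparameters are assumptions made purely for illustration; ScilitBERT's actual corpus, tokenizer, and training configuration are documented in the paper and repository linked above.

    # Minimal sketch only: illustrates pretraining a BERT-style encoder that is
    # half as deep as RoBERTa-base (6 layers instead of 12) on a domain corpus
    # with a standard masked-language-modelling objective. Not the authors' code.
    from datasets import load_dataset
    from transformers import (
        BertConfig,
        BertForMaskedLM,
        BertTokenizerFast,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    # Hypothetical academic corpus: one document (e.g. an abstract) per line.
    corpus = load_dataset("text", data_files={"train": "academic_corpus.txt"})

    # A dedicated domain tokenizer would normally be trained on the corpus itself;
    # the general-English vocabulary is reused here only to keep the sketch short.
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

    # Half the depth of RoBERTa-base, as described in the abstract.
    config = BertConfig(
        vocab_size=tokenizer.vocab_size,
        num_hidden_layers=6,
        hidden_size=768,
        num_attention_heads=12,
    )
    model = BertForMaskedLM(config)

    # Standard MLM objective: 15% of input tokens are masked for prediction.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

    args = TrainingArguments(
        output_dir="scilitbert-pretraining",   # assumed output path
        per_device_train_batch_size=16,        # assumed batch size
        num_train_epochs=1,                    # assumed epoch count
        save_steps=10_000,
    )

    Trainer(
        model=model,
        args=args,
        data_collator=collator,
        train_dataset=tokenized["train"],
    ).train()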
