I need some help with continuing the pre-training of BERT. I have a very specific vocabulary and lots of domain-specific abbreviations at hand, and I want to do an STS (semantic textual similarity) task. To be precise: I have domain-specific sentences and want to pair them up by their semantic similarity. But since the language used here is very uncommon, I need to train BERT on it first.
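
To make the goal concrete, here is roughly what I want to end up with, sketched with the sentence-transformers library (the model name and the sentences are placeholders, not my real data):

```python
# Hypothetical sketch of the end goal: scoring a pair of domain
# sentences by semantic similarity. The model name and sentences
# are placeholders for illustration only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("bert-base-nli-mean-tokens")
embeddings = model.encode(
    ["first domain sentence", "second domain sentence"],
    convert_to_tensor=True,
)
# Cosine similarity in [-1, 1]; higher means more similar.
score = util.cos_sim(embeddings[0], embeddings[1])
print(float(score))
```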

  • How does one continue the pre-training? (I read Google's GitHub release about it, but I don't really understand it.) Are there any examples? See the sketch after this list.
  • What structure does my training data need to have so that BERT can understand it?
  • Maybe training BERT from scratch would be even better. I guess it's the same process as continuing the pre-training; only the starting checkpoint would be different. Is that correct?
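
For reference, here is my current understanding of what continued pre-training could look like. This is a minimal sketch using the Hugging Face transformers library rather than the create_pretraining_data.py / run_pretraining.py scripts from Google's repo, and it assumes the training data is a plain-text file with one sentence (or short document) per line; the file path and output directory are placeholders:

```python
# Minimal sketch of continued MLM pre-training with Hugging Face
# transformers (an alternative to Google's run_pretraining.py).
# Assumes domain_corpus.txt exists with one sentence per line;
# the path and output_dir below are placeholders.
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# Start from the released checkpoint, i.e. "continue" pre-training.
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Plain-text corpus, one sentence per line, truncated to block_size tokens.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="domain_corpus.txt",
    block_size=128,
)

# Dynamically masks 15% of tokens, as in BERT's original MLM objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-domain-continued", num_train_epochs=3),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("bert-domain-continued")
```

If that picture is correct, then training from scratch would only swap the from_pretrained calls for a freshly initialized BertForMaskedLM(BertConfig()) and a tokenizer trained on my own corpus, with the rest of the loop unchanged.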

Any other tips would also be very welcome.

