0

I have transaction data and I would like to extract the merchant from the transaction description. I am new to this, but I just came across Named Entity Recognition and spaCy. I have hundreds of thousands of different merchants.

Some questions that I have:

  • How much labelling do I need to do given the number of merchants I need to extract?

  • How many different instances of the same merchant do I need to label to get decent results?

nbro
Unicorn07

3 Answers

0

There is no specific number of labels that is "enough". For simple cases you can start with a few hundred examples, but normally you'll want several thousand.

Since you have a large number of classes, your problem might be a harder one. On the other hand, it could be easy if most of your text looks like "This merchant is called XXX".
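To make "labelling an example" concrete, here is a minimal sketch of one way to bootstrap character-offset labels from a list of merchants you already know. The descriptions, merchant names, the single MERCHANT label and the helper label_description are all hypothetical. The (text, {"entities": [(start, end, label)]}) layout is the offset format spaCy's training utilities accept, but nothing here calls spaCy itself.

```python
import re

# Hypothetical merchant names and transaction descriptions, for illustration only.
KNOWN_MERCHANTS = ["ACME COFFEE", "GLOBEX MARKET"]
DESCRIPTIONS = [
    "POS PURCHASE ACME COFFEE 0423 LONDON",
    "CARD 1234 GLOBEX MARKET ONLINE",
    "ATM WITHDRAWAL 50.00",
]


def label_description(text, merchants, label="MERCHANT"):
    """Return (text, {"entities": [(start, end, label), ...]}) from exact matches."""
    spans = []
    for name in merchants:
        # re.escape keeps characters such as "." or "*" in a name literal.
        for match in re.finditer(re.escape(name), text):
            spans.append((match.start(), match.end(), label))
    return text, {"entities": sorted(spans)}


TRAIN_DATA = [label_description(d, KNOWN_MERCHANTS) for d in DESCRIPTIONS]

# Print each labelled span so the offsets can be checked by eye.
for text, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        print(label, repr(text[start:end]))
```

Exact matching only finds merchants you already know, so labels produced this way are a starting point to review and correct by hand, not a replacement for manual annotation. Descriptions without a merchant (like the ATM withdrawal) are kept with an empty entity list so the model also sees negative examples.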

polm23
0

In my experience with NER in spaCy, and contrary to this Stack Overflow solution, you need several thousand samples per entity type before the model predicts entities reliably, as @polm23 rightly mentioned. Otherwise spaCy will just recognise them using its default entity types (mainly WORK_OF_ART).

0

It depends on your workflow and the language of the text. The official guide is here: https://github.com/explosion/assets/blob/main/Prodigy/Prodigy_NER_flowchart_v2_0_0_light.pdf . It helps you judge how much data is enough at each step; you will see a few numbers there, such as 4000, 25%, 2000, etc.
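Independently of the specific thresholds in that guide, a common way to check whether you have labelled enough is a learning curve: train on growing portions of your data and see whether the score on a fixed held-out set is still improving. The sketch below only builds the nested subsets; the function name, the fractions and the dummy data are assumptions for illustration, and the actual training and scoring step is left to whatever tool you use.

```python
import random


def learning_curve_subsets(examples, fractions=(0.25, 0.5, 0.75, 1.0), seed=0):
    """Shuffle once, then slice nested subsets so each larger one contains the smaller."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return {f: shuffled[: int(len(shuffled) * f)] for f in fractions}


# Placeholder data: 2000 dummy items standing in for labelled examples.
examples = [f"example {i}" for i in range(2000)]
subsets = learning_curve_subsets(examples)

for fraction, subset in subsets.items():
    print(f"{fraction:.0%} of the data: {len(subset)} examples")
```

Train one model per subset and score each on the same held-out set. If the score still rises noticeably between the last two subsets, more labels of the same kind are likely to help; if it has flattened, extra labelling will probably add little.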