
Text classification of equal-length texts works without padding, but in practice texts almost never have the same length.

For example, spam filtering of comments on a blog article:

thanks for sharing    [3 tokens] --> 0 (Not spam)
this article is great [4 tokens] --> 0 (Not spam)
here's <URL>          [2 tokens] --> 1 (Spam)

Should I pad the texts on the right:

thanks for     sharing --
this   article is      great
here's <URL>  --      --

Or, pad on the left:

--   thanks  for    sharing
this article is     great
--   --      here's <URL>
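The two schemes can be sketched in a few lines of plain Python (the `--` pad symbol and the helper names are just for illustration; libraries typically use a dedicated pad token or index):

```python
def pad_right(tokens, length, pad="--"):
    """Append pad tokens until the sequence reaches `length`."""
    return tokens + [pad] * (length - len(tokens))

def pad_left(tokens, length, pad="--"):
    """Prepend pad tokens until the sequence reaches `length`."""
    return [pad] * (length - len(tokens)) + tokens

texts = [
    ["thanks", "for", "sharing"],
    ["this", "article", "is", "great"],
    ["here's", "<URL>"],
]
max_len = max(len(t) for t in texts)  # pad everything to the longest text

right_padded = [pad_right(t, max_len) for t in texts]
left_padded = [pad_left(t, max_len) for t in texts]
```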

What are the pros and cons of either pad left or right?

Dan D

1 Answer


For any model that does not take a sequential (time-series) approach, as an RNN does, the padding side shouldn't make a difference.

I prefer padding on the right, simply because there may also be texts you need to truncate. Right-padding is then more intuitive: you either cut off a text that is too long or pad a text that is too short.

Either way, once a model has been trained with one scheme, it shouldn't make a difference, as long as the test inputs are padded the same way they were during training.
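The pad-or-truncate logic described above can be sketched as a single helper applied identically at training and test time (the function name and pad symbol are illustrative, not from any particular library):

```python
def pad_or_truncate(tokens, length, pad="--"):
    """Force every sequence to exactly `length` tokens:
    cut off long texts on the right, pad short ones on the right.
    Using the same function for training and test data keeps the
    padding scheme consistent between the two."""
    if len(tokens) >= length:
        return tokens[:length]                      # too long: truncate
    return tokens + [pad] * (length - len(tokens))  # too short: pad
```

For example, with a fixed length of 3, `["this", "article", "is", "great"]` is truncated to `["this", "article", "is"]`, while `["here's", "<URL>"]` is padded to `["here's", "<URL>", "--"]`.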

N. Kiefer