I know word2vec is not enough to achieve high-quality text prediction by itself. But has any scientific work tried to do it anyway, just to establish the baseline that more sophisticated ideas have to beat? It doesn't have to be word2vec specifically; any similarly simple model would be fine as well. I'm only interested in the achieved perplexity of next-token prediction; accuracy and other performance metrics don't interest me. If you're wondering what the point is: text compression. The original word2vec paper unfortunately doesn't concern itself with perplexity. Yes, I know I can do it myself, and I eventually will, but I still want to read what others have written about this topic.
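To make the compression angle concrete, the conversion I have in mind is just this (the numbers are made up for illustration):

```python
# Cross entropy of a next-token model is, up to the log base, the number of
# bits an entropy coder driven by that model would need per token.
import math

perplexity = 60.0                        # hypothetical next-token perplexity
bits_per_token = math.log2(perplexity)   # cross entropy in bits
chars_per_token = 4.5                    # hypothetical average token length
print(bits_per_token / chars_per_token)  # approximate bits per character
```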
1 Answer
My own result for the enwik8 dataset (https://huggingface.co/datasets/LTCB/enwik8) with a vocabulary size of 1024 is a perplexity of about 58.6, corresponding to a cross entropy of about 4.07 nats. This assumes that the dimension of the vector space used for word2vec is essentially unlimited. I haven't tried other datasets or vocabulary sizes yet.
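In case it helps, here is a rough sketch of the kind of measurement I mean (not my actual code): the embeddings of the previous few tokens are averaged, a single softmax over the vocabulary predicts the next token, and perplexity is exp of the mean negative log-likelihood in nats. The dimensions, context size and toy data below are placeholders; for a real number you would train on tokenised enwik8 and evaluate on a held-out split.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 1024     # vocabulary size, as in the number quoted above
DIM = 64         # embedding dimension (placeholder; I assumed it effectively unlimited)
CONTEXT = 4      # number of previous tokens averaged as context
LR = 0.1
EPOCHS = 3

# Toy random token stream standing in for a tokenised corpus such as enwik8.
# On random data the perplexity stays near VOCAB; it only becomes meaningful
# on real text, evaluated on tokens not seen during training.
tokens = rng.integers(0, VOCAB, size=5_000)

E = rng.normal(0, 0.1, (VOCAB, DIM))   # input (word2vec-like) embeddings
W = rng.normal(0, 0.1, (VOCAB, DIM))   # output softmax embeddings
b = np.zeros(VOCAB)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(tokens):
    """Mean negative log-likelihood (nats/token) of next-token prediction."""
    nll, n = 0.0, 0
    for t in range(CONTEXT, len(tokens)):
        ctx = E[tokens[t - CONTEXT:t]].mean(axis=0)
        p = softmax(W @ ctx + b)
        nll -= np.log(p[tokens[t]] + 1e-12)
        n += 1
    return nll / n

# Plain SGD on the full softmax; fine for a toy vocabulary, too slow for real corpora.
for epoch in range(EPOCHS):
    for t in range(CONTEXT, len(tokens)):
        idx = tokens[t - CONTEXT:t]
        ctx = E[idx].mean(axis=0)
        p = softmax(W @ ctx + b)
        grad = p.copy()
        grad[tokens[t]] -= 1.0                 # dL/dlogits for cross-entropy loss
        dctx = W.T @ grad                      # gradient w.r.t. the averaged context
        W -= LR * np.outer(grad, ctx)
        b -= LR * grad
        np.subtract.at(E, idx, LR * dctx / CONTEXT)

ce = cross_entropy(tokens)                     # nats per token
print(f"cross entropy: {ce:.3f} nats, perplexity: {np.exp(ce):.1f}")
```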