6

I am trying to perform a binary classification of tweets using machine learning.

The usual way of doing this seems to be putting a hand-classified tweet's words into a big vector, then use that vector as input to an algorithm, which then predicts new tweets based on this data.

Is there a standard method, or algorithm, that can include other input, such as the location of a tweet, in this process?

I could just add tweet location at the end of the vector, I suppose, but that would give it a very small weighting.

Any pointers are much appreciated.

nbro
  • 42,615
  • 12
  • 119
  • 217
schoon
  • 237
  • 2
  • 7

1 Answers1

3

Data pre-processing and feature extraction are by far the most important part of any machine learning algorithm. It's even more important that the model you choose to do the classification.

Unfortunately, pre-processing and feature extraction are completely different for each type of data. You need to play around with the data yourself to find out what works best with the nature of your data. With experience you start noticing some patterns with different data types. For example, as you are doing, building a word vector is an effective means of feature extraction with text based data.

"I could just add tweet location at the end of the vector I suppose, but that would give it a very small weighting."

This is entirely untrue for any machine learning algorithm I would choose. Your model should not associate weights to the inputs based on their array location. They should be associated based on the explained variance (information gain) it provides.


After you do your pre-processing and feature extraction you can then further refine your feature set with some common methods which can be found in libraries such as: principle component analysis (PCA) and linear discriminant analysis (LDA).

JahKnows
  • 470
  • 2
  • 8