Sorry if this is too basic a question; I'm just a beginner.

I have a data set with information about companies. There are 2 kinds of features: financial (revenue and so on) and general info (like the number of employees and the date of registration).

I have to predict the probability of default, but the data has gaps: about half of the companies have no financial data at all. The general features, however, are 100% filled.

What is the best practice for such a situation?

It would be great if you could share some links to read.

nbro
Denis Ka

1 Answer

You should look into "missing values". This is an entire research field in itself.

First, you need to identify the type of missing values:

  1. They can be missing purely at random.
  2. Whether a value is missing is itself a useful signal, so "missing" should be treated as a category of its own.

(Those two are the best case scenarios.)

  3. Whether they are missing or not depends on the underlying (unknown) value. For example, a thermometer might fail occasionally if the temperatures get too high. In your case, certain types of companies might be more likely to not share their information.
  4. Information might be missing specifically to mislead you, the data analyst. This is the worst possible scenario, and there is not much you can do.
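
A quick way to get a first impression of which case you are in is to compare the target between rows with and without the data. Below is a minimal sketch using pandas; the toy data and the column names (`employees`, `revenue`, `default`) are made up for illustration and are not from your data set. A large gap between the two default rates would suggest that missingness is informative, although this check alone cannot tell you why the values are missing.

```python
import pandas as pd

# Hypothetical toy data: `revenue` has gaps, `employees` is complete,
# `default` is the target (1 = the company defaulted).
df = pd.DataFrame({
    "employees": [5, 12, 40, 3, 150, 8, 60, 2],
    "revenue":   [1.2, None, 9.5, None, 30.0, None, 11.0, None],
    "default":   [0, 1, 0, 1, 0, 0, 0, 1],
})

# Flag the rows where the financial data is absent (1 = missing).
df["revenue_missing"] = df["revenue"].isna().astype(int)

# Default rate for companies with (0) and without (1) financial data.
default_rate = df.groupby("revenue_missing")["default"].mean()
print(default_rate)
```

The `revenue_missing` column can also be kept as a model input, which is one way of treating missingness as a feature of its own.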

So, what do you do about it? A few typical options:

  1. Throw out all the rows with missing data: we do not have enough information about these companies.
  2. Throw out all the columns with missing data: this field is not reliably measurable and we shouldn't use it.
  3. Try to guess (impute) the missing values. This works best when the amount of missing data is small. You can train a predictive model on the non-missing data, fill in the median for that type of row, or copy the value from the "closest" matching row. This can be dangerous: the imputed values are guesses, but the model will treat them as real observations.
  4. Some algorithms are OK with missing data. Check the documentation for your models and algorithms to see how they deal with missing values.
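
The first three options can be sketched in a few lines of pandas. This is only an illustration on made-up data with hypothetical column names, and the imputation uses a single global median rather than a median per type of company, which is a simplification.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "employees": [5, 12, 40, 3, 150, 8],
    "revenue":   [1.0, None, 9.0, None, 30.0, 5.0],
    "default":   [0, 1, 0, 1, 0, 0],
})

# Option 1: drop every row that contains a missing value.
rows_dropped = df.dropna(axis=0)

# Option 2: drop every column that contains a missing value.
cols_dropped = df.dropna(axis=1)

# Option 3: fill the gaps with the median of the observed values,
# and keep an indicator column so a model can still see which
# values were imputed.
imputed = df.copy()
imputed["revenue_missing"] = imputed["revenue"].isna().astype(int)
imputed["revenue"] = imputed["revenue"].fillna(imputed["revenue"].median())

print(rows_dropped.shape, cols_dropped.shape)
print(imputed)
```

For option 4, one example (not mentioned above, so treat it as a pointer rather than a recommendation) is scikit-learn's `HistGradientBoostingClassifier`, which accepts `NaN` values in its inputs directly.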