I was wondering if anyone can suggest a good framework for reasoning with incomplete information.

I have found the Large Knowledge Collider, but it appears to have been dead for some time. Can you suggest any other maintained project worth checking?

Since many comments are gravitating in a different direction, let me add one approach that I found to be a potentially good answer to my question: Rough Set Based Decision Trees.

I would hope there are more approaches than this one. Could you please help me identify them?

sophros

2 Answers

I am very thankful to the people who responded with hints and suggestions. However, what seems most applicable to my case is Gen, a general-purpose probabilistic programming system with programmable inference from MIT, described in the paper "Gen: A General-Purpose Probabilistic Programming System with Programmable Inference" by M. F. Cusumano-Towner et al.

In case you are looking for something along these lines, it looks like a very good starting point for an application in probabilistic programming.

sophros

Before we start, I recommend looking at similar questions on the network, e.g. https://stackoverflow.com/questions/39386936/machine-learning-with-incomplete-data and https://stats.stackexchange.com/questions/103500/machine-learning-algorithms-to-handle-missing-data

Row Deletion

If more than 70% of a row's values are missing, you can delete the row instead of trying to fill it in. This method is advisable only when the data set has enough samples to spare. Its major disadvantage is that it reduces the power of the model, because it reduces the sample size.
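As a rough illustration of the rule above, here is a minimal sketch in plain Python. It assumes missing values are represented as `None`; the function name `drop_sparse_rows` and the `max_missing` parameter are made up for this example, and the 70% threshold is simply the figure quoted above, not a universal constant.

```python
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]


def drop_sparse_rows(rows: Sequence[Row], max_missing: float = 0.7) -> List[Row]:
    """Keep rows whose fraction of missing (None) values does not exceed max_missing."""
    kept = []
    for row in rows:
        if not row:
            continue  # an empty row carries no information
        missing_fraction = sum(value is None for value in row) / len(row)
        if missing_fraction <= max_missing:
            kept.append(row)
    return kept


rows = [
    [1.0, 2.0, 3.0, 4.0],      # nothing missing
    [None, None, None, 4.0],   # 75% missing, above the threshold
    [1.0, None, None, 4.0],    # 50% missing, kept
]
cleaned = drop_sparse_rows(rows)
print(len(rows), "rows before,", len(cleaned), "rows after")
```

With a data-frame library you would normally use its built-in row filtering instead; the loop is spelled out here only to make the threshold logic explicit.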

Replacing With Mean/Median/Mode

We can calculate the mean, median or mode of the feature and replace the missing values with it. Another approach is to approximate each missing value using the deviation of neighbouring values.

This is only an approximation and it distorts the feature's distribution (filling with a single constant shrinks its variance), but it avoids discarding data and often yields better results than removing rows and columns.
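A minimal sketch of constant imputation, using only Python's standard `statistics` module. The helper `impute_column` and the sample `ages` column are hypothetical, and missing values are again assumed to be `None`.

```python
import statistics
from typing import Callable, List, Optional, Sequence


def impute_column(
    values: Sequence[Optional[float]],
    strategy: Callable[[List[float]], float] = statistics.mean,
) -> List[float]:
    """Replace None entries with a statistic computed from the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = strategy(observed)
    return [fill if v is None else v for v in values]


ages = [20, None, 30, 70, None, 30]
mean_filled = impute_column(ages, statistics.mean)      # fill value is the mean
median_filled = impute_column(ages, statistics.median)  # more robust to the outlier 70
mode_filled = impute_column(ages, statistics.mode)      # also usable for categorical data
print(mean_filled)
```

Which statistic to use is a judgment call: the median is less sensitive to outliers than the mean, and the mode is the only one of the three that makes sense for non-numeric categories.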

KNN or Random Forest imputation

In this approach, the missing values of an attribute are imputed from the k samples that are most similar to the sample with the missing value, judged on the attributes that both have observed. The similarity of two samples is determined using a distance function.

The advantage of this approach is that k-nearest neighbours can predict both qualitative attributes (by a majority vote among the neighbours) and quantitative attributes (by averaging them). Additionally, you do not need to create a separate prediction model for each attribute with missing data in the data set.
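The idea can be sketched in a few lines of plain Python. This is an illustrative toy, not a reference implementation: it assumes numeric attributes with `None` for missing values, uses a Euclidean distance restricted to the columns two rows share, and averages the neighbours' values. The names `distance` and `knn_impute` are invented for the example.

```python
import math
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]


def distance(a: Row, b: Row) -> float:
    """Root-mean-square difference over the columns observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf  # no common columns, so the rows cannot be compared
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(rows: Sequence[Row], k: int = 2) -> List[List[Optional[float]]]:
    """Fill each missing cell with the mean of that column over the k nearest rows."""
    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for col, value in enumerate(row):
            if value is not None:
                continue
            # Donors are other rows that have this column observed and are comparable.
            donors = [
                (distance(row, other), other[col])
                for j, other in enumerate(rows)
                if j != i and other[col] is not None
            ]
            donors = [pair for pair in donors if pair[0] != math.inf]
            donors.sort(key=lambda pair: pair[0])
            nearest = donors[:k]
            if nearest:  # leave the cell empty if no donor exists
                imputed[i][col] = sum(v for _, v in nearest) / len(nearest)
    return imputed


rows = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [9.0, 9.0, 50.0],
    [1.0, 2.0, None],  # closest to the first two rows
]
filled = knn_impute(rows, k=2)
print(filled[3])
```

For a qualitative attribute, the averaging step would be replaced by a majority vote among the neighbours. In real data the columns should also be scaled first, otherwise the attribute with the largest numeric range dominates the distance.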

Predicting the Missing Values

Prediction is one of the more sophisticated methods for handling missing data. Using the features which do not have missing values, we can predict the null values with the help of a machine learning algorithm.

In this case, we divide the data set into two: one set with no missing values and another set with missing values. The first becomes the training data set of the model, while the second, with missing values, is the test data set, i.e. the set on which predictions are made.

We then create a model that treats the attribute with missing values as the target variable, train it on the other attributes of the training data set, and use its predictions to populate the missing values of the test data set (Sayali S 2016).
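The split-train-predict procedure might look like the following sketch. It assumes the simplest possible case, one fully observed numeric feature `x` and one partially missing numeric feature `y`, and it swaps in ordinary least-squares linear regression as the "machine learning algorithm"; any other model could take its place. The function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x has no variation, cannot fit a line")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute_by_regression(
    records: Sequence[Tuple[float, Optional[float]]],
) -> List[Tuple[float, float]]:
    """Fit on the complete records, then predict y where it is missing."""
    # Training set: records with y observed. Everything else gets a prediction.
    train = [(x, y) for x, y in records if y is not None]
    slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
    return [
        (x, y if y is not None else slope * x + intercept)
        for x, y in records
    ]


records = [(1.0, 3.0), (2.0, 5.0), (3.0, None), (4.0, 9.0)]
completed = impute_by_regression(records)
print(completed[2])
```

Note that imputed values produced this way sit exactly on the fitted model, so they look less noisy than real observations would.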

Caret or randomForestSRC packages in R

The R package randomForestSRC can handle missing data for a wide class of analyses, including regression, classification, unsupervised and multivariate settings (Ankur C 2014). Additionally, the caret R package can be used to predict missing data.

Reference: https://analyticsindiamag.com/5-ways-handle-missing-values-machine-learning-datasets/

Seth Simba