5

Suppose we have a labeled data set with columns $A$, $B$, and $C$ and a binary outcome variable $X$. Suppose we have rows as follows:

 row  A B C X
  1   1 2 3 1
  2   4 2 3 0
  3   6 5 1 1
  4   1 2 3 0

Should we throw away either row 1 or row 4 because they have identical values of $A$, $B$, and $C$ but different values of the outcome variable $X$? Or keep both of them?

naive
guest_guy

3 Answers

4

The problem you are describing looks like a modified XOR problem. You can't throw away the rows with a label of 1, because then the model won't be able to learn this class.

CaucM
1

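This is perfectly acceptable in a stochastic environment. Generally your objective is to minimize the negative log-likelihood $-\log p(Y|X)$, or equivalently $-\sum_i \log p(y_i|x_i)$, where $x_i$ is the feature vector of row $i$ and $y_i$ is its label. Up to a constant factor, this is the same as minimizing $-\mathbb{E}\log p(y_i|x_i)$. For the two conflicting rows, which share the same features $x_0$, you are minimizing: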

$$ \begin{align*} L &= -\log p(1|x_0) - \log p(0|x_0) \\ &= -\log [p(1|x_0) \cdot p(0|x_0)] \\ &= -\log [p(1|x_0) \cdot (1 - p(1|x_0))] \end{align*} $$
or, since $\log$ is monotonically increasing, equivalently minimizing
$$ \hat L = -p(1|x_0) \cdot (1 - p(1|x_0)) $$
Setting the derivative with respect to $p(1|x_0)$ to zero gives $2\,p(1|x_0) - 1 = 0$, so the optimal result we want the system to learn is
$$ p(1|x_0) = 0.5 $$

Note that if you had more evidence, the result would just be that you want it to learn that the label is $1$ with probability $\mathbb{E}[y_i \mid x_i = x]$, i.e. the fraction of rows with features $x$ whose label is $1$.
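As a quick numerical check, here is a minimal sketch that brute-forces the probability minimizing the negative log-likelihood for the rows sharing the features `(1, 2, 3)`. The helper names (`neg_log_likelihood`, `best_probability`) and the grid resolution are my own choices for illustration, not part of any library.

```python
import math
from collections import defaultdict

# Rows from the question: features (A, B, C) and the binary outcome.
rows = [
    ((1, 2, 3), 1),
    ((4, 2, 3), 0),
    ((6, 5, 1), 1),
    ((1, 2, 3), 0),
]


def neg_log_likelihood(p, labels):
    """Summed negative log-likelihood of binary labels when P(label = 1) = p."""
    return -sum(math.log(p if y == 1 else 1.0 - p) for y in labels)


def best_probability(labels, steps=1000):
    """Grid search over p in (0, 1); returns the grid point with the lowest loss."""
    grid = [i / steps for i in range(1, steps)]
    return min(grid, key=lambda p: neg_log_likelihood(p, labels))


# Group the labels by identical feature vector.
labels_by_features = defaultdict(list)
for features, label in rows:
    labels_by_features[features].append(label)

conflicting = labels_by_features[(1, 2, 3)]  # labels of rows 1 and 4
empirical_mean = sum(conflicting) / len(conflicting)

print(best_probability(conflicting), empirical_mean)
```

For the two conflicting rows the grid search lands on the same value as the empirical mean of the labels. With more rows sharing the same features, the minimizer should track the fraction of those rows labeled 1, up to the grid resolution.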

mshlis
0

I might consider two models (one trained without row 1, one trained without row 4) and a third that keeps both, and see which generalises better to a test set.
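A rough sketch of that comparison, under loud assumptions: the "model" is just a smoothed frequency table per feature vector (a stand-in for whatever classifier you actually use), and the held-out rows in `test` are invented purely for illustration, not taken from the question. The names `fit` and `log_loss` are hypothetical helpers defined here.

```python
import math
from collections import defaultdict

# Training rows from the question, in order (row 1 .. row 4).
train = [
    ((1, 2, 3), 1),
    ((4, 2, 3), 0),
    ((6, 5, 1), 1),
    ((1, 2, 3), 0),
]

# ASSUMPTION: hypothetical held-out rows, made up for this sketch only.
test = [
    ((1, 2, 3), 1),
    ((1, 2, 3), 0),
    ((4, 2, 3), 0),
    ((6, 5, 1), 1),
]


def fit(rows, smoothing=1.0):
    """Estimate P(label = 1 | features) from counts, with additive smoothing."""
    counts = defaultdict(lambda: [0, 0])  # [count of label 0, count of label 1]
    for features, label in rows:
        counts[features][label] += 1
    return {
        f: (c[1] + smoothing) / (c[0] + c[1] + 2 * smoothing)
        for f, c in counts.items()
    }


def log_loss(model, rows, default=0.5):
    """Mean negative log-likelihood; unseen feature vectors get `default`."""
    total = 0.0
    for features, label in rows:
        p = model.get(features, default)
        total -= math.log(p if label == 1 else 1.0 - p)
    return total / len(rows)


# The three training variants from the answer.
variants = {
    "drop row 1": [r for i, r in enumerate(train, start=1) if i != 1],
    "drop row 4": [r for i, r in enumerate(train, start=1) if i != 4],
    "keep both": train,
}

scores = {name: log_loss(fit(rows), test) for name, rows in variants.items()}
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```

On this made-up test set, where the conflict between labels persists, keeping both rows gives the lowest log loss. That outcome is baked into the invented data, so the ranking on a real test set could differ.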

joek47