What should we do when we have equal observations with different labels?

Question

Suppose we have a labeled data set with columns $A$, $B$, and $C$ and a binary outcome variable $X$. Suppose we have rows as follows:

 row  A B C X
  1   1 2 3 1
  2   4 2 3 0
  3   6 5 1 1
  4   1 2 3 0

Should we throw away either row 1 or row 4 because they have different values of the outcome variable X? Or keep both of them?

score 4 · Answer 1 · answered Aug 24 '19 at 00:13

The problem you describe looks like a modified XOR problem. You can't throw away the rows labeled 1, because then the model won't be able to learn that class.

CaucM


score 1 · Answer 2 · answered Aug 24 '19 at 15:37

This is perfectly acceptable in a stochastic setting. Typically the loss you minimize is the negative log-likelihood $-\log p(Y|X)$, or equivalently $-\sum_i \log p(y_i|x_i)$. Up to a constant factor of $1/n$, this is the empirical estimate of $-\mathbb{E}[\log p(y|x)]$. For the two conflicting rows, which share the same features $x_0$, you are minimizing:

$$ \begin{aligned} L &= -\log p(1|x_0) - \log p(0|x_0) \\ &= -\log [p(1|x_0) \cdot p(0|x_0)] \\ &= -\log [p(1|x_0) \cdot (1 - p(1|x_0))] \end{aligned} $$
or, since $\log$ is monotonically increasing, equivalently minimizing
$$ \hat L = -p(1|x_0) \cdot (1 - p(1|x_0)) $$
Setting the derivative with respect to $p(1|x_0)$ to zero gives $1 - 2\,p(1|x_0) = 0$, so the optimal result we want the system to learn is
$$ p(1|x_0) = 0.5 $$

Note that if you had more observations at the same $x$, the result generalizes: you want the model to learn that the label is $1$ with probability $\mathbb{E}[y \mid x]$, i.e. the fraction of those observations labeled $1$.
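As a quick numerical check of the derivation above, here is a minimal sketch that minimizes the negative log-likelihood over a grid of candidate probabilities. The function names `nll` and `best_probability`, the grid resolution, and the second label list `[1, 1, 1, 0]` are illustrative assumptions, not part of the original answer.

```python
import math

def nll(p, labels):
    """Negative log-likelihood of binary labels that all share the same x,
    when the model predicts P(y = 1 | x) = p."""
    return -sum(math.log(p) if y == 1 else math.log(1.0 - p) for y in labels)

def best_probability(labels, steps=9999):
    """Return the grid point in (0, 1) that minimizes the NLL.

    A plain grid search is used here only for illustration; it is an
    assumption of this sketch, not how a real model is trained.
    """
    grid = [k / (steps + 1) for k in range(1, steps + 1)]
    return min(grid, key=lambda p: nll(p, labels))

# Two conflicting rows, as in the question: one label 1 and one label 0.
p_two = best_probability([1, 0])

# Hypothetical case with more evidence: three rows labeled 1, one labeled 0.
p_more = best_probability([1, 1, 1, 0])

print(p_two, p_more)
```

The first minimizer should land at roughly 0.5 and the second at roughly 0.75, which is the fraction of rows labeled 1 in each case.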

score 0 · Answer 3 · answered Sep 04 '19 at 01:25

I might consider two models (one that throws away row 1 and one that throws away row 4), plus a third that keeps both, and see which generalises better to a test set.
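A minimal sketch of how the three candidate training sets could be built from the table in the question. The dictionary layout and the names `rows`, `conflicting`, `without`, and `variants` are assumptions made for illustration. Fitting and scoring each variant on a held-out test set is left out, since the question does not provide one.

```python
from collections import defaultdict

# Rows from the question: row id -> ((A, B, C), X)
rows = {
    1: ((1, 2, 3), 1),
    2: ((4, 2, 3), 0),
    3: ((6, 5, 1), 1),
    4: ((1, 2, 3), 0),
}

# Collect the set of labels seen for each distinct feature vector.
labels_by_features = defaultdict(set)
for features, label in rows.values():
    labels_by_features[features].add(label)

# A row is "conflicting" if its feature vector appears with more than one label.
conflicting = sorted(
    row_id
    for row_id, (features, _) in rows.items()
    if len(labels_by_features[features]) > 1
)

def without(row_id):
    """Return a copy of the data set with one row removed."""
    return {rid: row for rid, row in rows.items() if rid != row_id}

# The three training sets to compare.
variants = {
    "drop_row_1": without(1),
    "drop_row_4": without(4),
    "keep_both": dict(rows),
}

print(conflicting)
for name, data in variants.items():
    print(name, sorted(data))
```

Each entry in `variants` would then be used to train the same model, with the comparison made on data that none of the variants saw during training.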

joek47

