5

I was trying to solve an XOR problem, and the dataset seems like the one in the image.

dataset

I plotted the tree and got this result:

enter image description here

As I understand, the tree should have depth 2 and four leaves. The first comparison is annoying, because it is close to the right x border (0.887). I've tried other parameterizations, but the same result persists.

I used the code below:

from sklearn.tree import DecisionTreeClassifier    

clf = DecisionTreeClassifier(criterion='gini')
clf = clf.fit(X, y)

fn=['V1','V2']

fig, axes = plt.subplots(nrows = 1,ncols = 1,figsize = (3,3), dpi=300)

tree.plot_tree(clf, feature_names = fn, class_names=['1', '2'], filled = True);

I would be grateful if anyone can help me to clarify this issue.

2 Answers2

5

I can reproduce this problem for an even more easily separable dataset:

enter image description here

The ideal tree for it should be as follows:

enter image description here

However, when I run DecisionTreeClassifier with the maximal depth = 2 in scikit-learn many times, it splits the dataset randomly and never gets it right.

This is an example of 4 different runs:

enter image description here

The problem is that scikit-learn has only two measures of the quality of a split: gini, and entropy. Both of them estimate mutual information between the target and only one predictor.

However, in XOR problem, mutual information of each predictor with the target is zero. You can read more about it here: link from which you can see that this problem exists not only for XOR but for any task where interaction between features is important.

In order to solve it, the tree should be built based neither on the Gini impurity, nor on the information gain but on measures that estimate how the target depends on multiple features, e.g. multivariate mutual information, distance correlation, etc which might solve simple problems like XOR but might fail in the case of real tasks. It is easy to find simple cases when they fail (just try them for a regression for simple non-linear functions of a few variables). There is no such a measure that would estimate the dependence of a target on multiple interacting predictors very well and would work for all problems.

EDIT to answer Asher's comment: I did several runs for max_depth=3. It is better than for max_depth=2 but still misses the correct classification from time to time. Taking max_depth=4 almost always gets XOR correctly with the occasional misses. Below are pictures of some runs for max_depth=3 and max_depth=4.

enter image description here

enter image description here

However, the trees for max_depth=3 and max_depth=4 become ugly. They are ugly not only because they are bigger than the ideal tree shown above but they totally obscure the XOR function. For example, can you decipher an XOR from this tree?

enter image description here

It is probably possible with some pruning technique but still, an extra work.

2

The algorithm fails because it is greedy. This means that it takes the first split decision immediately, without taking into account what will happen in next steps.

An alternative would be given by the Viterbi algorithm, that would select the best sequence of splits by backtracking over the best final cumulated information gain.

In the example given, two different split sequences would achieve maximum separation: first split along $V_1$ or first split along $V_2$.

For other configurations of the data points, the order may be relevant, so all alternatives should be evaluated.

Some resources about non-greedy algorithms for decision trees follow.

https://proceedings.neurips.cc/paper/2015/file/1579779b98ce9edb98dd85606f2c119d-Paper.pdf

https://medium.com/mlearning-ai/optimal-decision-trees-dbd16dfca427

https://faculty.ucmerced.edu/mcarreira-perpinan/papers/ijcnn21a-slides.pdf