Background
I've been interested in and reading about neural networks for several years, but I haven't gotten around to testing them out until recently.
Both for fun and to increase my understanding, I tried to write a class library from scratch in .NET. As tests, I've tried an identity mapping (generating output identical to the input), the MNIST dataset, and a few binary functions (two-input OR, AND, and XOR, with two outputs: one for true, one for false).
Everything seemed fine when I used a sigmoid as the activation function, but, after reading about ReLUs, I decided to switch over for speed.
Problem
My current problem is that, since switching to ReLUs, I have been unable to train a network of any complexity (tested from as few as 2 internal nodes up to a mesh of 100x100 nodes) to function correctly as an XOR gate.
I see two possibilities here:
- My implementation is faulty. (This one is frustrating, as I've re-written the code multiple times in various ways, and I still get the same result). 
- Aside from being faster or slower to train, there are some problems that are impossible to solve given a specific activation function. (Fascinating idea, but I've no idea if it's true or not). 
My inclination is to think that the first possibility is correct. However, given the amount of time I've invested, it would be nice if I could rule out the second definitively before I spend even more time going over my implementation.
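The second possibility can at least be probed without any training code. Below is a minimal sketch in Python (not the .NET library described here) of a network with two ReLU hidden nodes and a linear output. The weights are hand-picked for illustration, not learned, and the name `xor_relu` is hypothetical.

```python
def relu(z):
    # Rectified linear unit: passes positive values, clamps the rest to 0.
    return max(0.0, z)

def xor_relu(x1, x2):
    # Hand-picked weights and biases (an assumption for illustration only).
    h1 = relu(x1 + x2)        # 0, 1 or 2: how many inputs are on
    h2 = relu(x1 + x2 - 1.0)  # non-zero only when both inputs are on
    return h1 - 2.0 * h2      # linear output node

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_relu(x1, x2))
```

These weights give 0, 1, 1, 0 for the four input patterns, and since no output is negative, wrapping the output node in a ReLU would not change them. This only shows that suitable weights exist for a two-node ReLU hidden layer. It says nothing about whether a given training procedure will find them.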
More details
For the XOR network, I have tried both two inputs (0 for false, 1 for true) and four inputs (one pair per "bit" of input, with one input of the pair signaling true and the other false). I have also tried a single output (with a 1, or really anything >0.9, corresponding to true and a 0, or anything <0.1, corresponding to false), as well as two outputs (one signaling true and the other false).
In each training epoch, I run the network against a set of 4 input/expected-output pairs: $\{ (00, 0), (01, 1), (10, 1), (11, 0) \}$.
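To make the encodings concrete, here is a rough Python sketch. The function names are hypothetical, and the ordering within each pair (true line first, false line second) is an assumption, since it is not specified above.

```python
def encode_two(a, b):
    # Two-input encoding: 1 for true, 0 for false.
    return [1.0 if a else 0.0, 1.0 if b else 0.0]

def encode_four(a, b):
    # Four-input encoding: one (true line, false line) pair per input bit.
    # The order inside each pair is an assumption.
    values = []
    for bit in (a, b):
        values += [1.0, 0.0] if bit else [0.0, 1.0]
    return values

def decode_single(y, high=0.9, low=0.1):
    # Single-output reading: >0.9 is true, <0.1 is false, otherwise undecided.
    if y > high:
        return True
    if y < low:
        return False
    return None

def target_pair(value):
    # Two-output target: (true line, false line); the order is an assumption.
    return [1.0, 0.0] if value else [0.0, 1.0]
```

For example, `encode_four(True, False)` gives `[1.0, 0.0, 0.0, 1.0]`, and `decode_single(0.5)` gives `None` because the value falls between the two thresholds.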
I find that the first three converge towards the correct answer, but the final input (11) converges towards 1, even though I train it with an expected value of 0.
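For reference, the per-epoch procedure can be sketched roughly as below. This is Python rather than the .NET code in question, and nearly everything in it is an assumption: one hidden ReLU layer, a linear output node, squared error, an update after every pattern, the learning rate, the weight initialisation, and the names `make_net`, `forward`, and `train_epoch`.

```python
import random

# The four XOR patterns: ((x1, x2), expected output).
DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def relu(z):
    return z if z > 0 else 0.0

def make_net(hidden, rng):
    # Hypothetical layout: each hidden node is [w1, w2, bias];
    # "out" holds one output weight per hidden node.
    return {
        "hid": [[rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(0, 0.5)]
                for _ in range(hidden)],
        "out": [rng.uniform(-1, 1) for _ in range(hidden)],
        "bias": 0.0,
    }

def forward(net, x):
    # Returns the hidden activations and the (linear) output.
    h = [relu(w1 * x[0] + w2 * x[1] + b) for w1, w2, b in net["hid"]]
    y = sum(v * hj for v, hj in zip(net["out"], h)) + net["bias"]
    return h, y

def train_epoch(net, lr=0.05):
    # One pass over the four patterns with a gradient step after each one.
    # Returns the summed squared-error loss seen during the pass.
    total = 0.0
    for x, target in DATA:
        h, y = forward(net, x)
        err = y - target
        total += 0.5 * err * err
        for j, hj in enumerate(h):
            # ReLU derivative: 1 where the node is active, 0 where it is not.
            grad_h = err * net["out"][j] if hj > 0 else 0.0
            net["out"][j] -= lr * err * hj
            net["hid"][j][0] -= lr * grad_h * x[0]
            net["hid"][j][1] -= lr * grad_h * x[1]
            net["hid"][j][2] -= lr * grad_h
        net["bias"] -= lr * err
    return total

if __name__ == "__main__":
    net = make_net(hidden=4, rng=random.Random(0))
    for _ in range(5000):
        train_epoch(net)
    for x, target in DATA:
        print(x, target, forward(net, x)[1])
```

Note that a hidden node whose pre-activation is non-positive for a pattern receives no gradient from that pattern, which is the main behavioural difference from a sigmoid in this loop.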