I have a dataset of about 8k points and I am trying to use active learning with a random forest regressor. I have split the dataset into train and test, with train containing around 20 points. The test set serves as the unlabelled pool (although I do have its labels).
My workflow is the following:
- Select a budget `c`.
- Train the RF on `train`.
- Select the `sample` from the `test` for which the predictions have the greatest variance.
- Train the RF on `train+sample`
and the process continues until the budget is exhausted. After each retrain, I compute the accuracy on the test set using the coefficient of determination (R²).
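For concreteness, here is a minimal sketch of the loop above. It assumes scikit-learn's `RandomForestRegressor`, reads "greatest variance" as the variance of the individual trees' predictions, queries one point per round, and scores R² on the pool points not yet moved into train. The function names `tree_variance` and `active_learning_loop` are made up for illustration, and the inputs are assumed to be NumPy arrays.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score


def tree_variance(rf, X):
    """Variance of the individual trees' predictions for each row of X."""
    per_tree = np.stack([tree.predict(X) for tree in rf.estimators_])
    return per_tree.var(axis=0)


def active_learning_loop(X_train, y_train, X_pool, y_pool, budget, seed=0):
    """Query one pool point per round, chosen by greatest tree variance.

    Assumption: `budget` is the number of single-point queries allowed.
    """
    X_train, y_train = X_train.copy(), y_train.copy()
    remaining = np.arange(len(X_pool))  # pool indices not yet queried
    scores = []
    for _ in range(budget):
        rf = RandomForestRegressor(n_estimators=100, random_state=seed)
        rf.fit(X_train, y_train)
        # Assumption: R^2 is computed on the pool points still unqueried.
        scores.append(r2_score(y_pool[remaining], rf.predict(X_pool[remaining])))
        # Pick the pool point whose per-tree predictions disagree the most.
        pick = remaining[np.argmax(tree_variance(rf, X_pool[remaining]))]
        X_train = np.vstack([X_train, X_pool[pick:pick + 1]])
        y_train = np.append(y_train, y_pool[pick])
        remaining = remaining[remaining != pick]
    return scores, X_train, y_train
```

Calling `active_learning_loop(X_train, y_train, X_test, y_test, budget=c)` returns one R² score per round, so the curve can be compared with one obtained by choosing `pick` uniformly at random from `remaining`.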
Is the above workflow valid? What I have observed is that accuracy does not improve compared to random sampling. Are there other query strategies that work with random forests for regression?
I could have used Gaussian processes, but in my experience they need a lot of tuning, and training time becomes very large for large training sets. That is why I chose a random forest.