Finding the right questions to increase accuracy in classification

Question

Let's say I have a list of 100k medical cases from my hospital. Each row is a patient with symptoms (such as fever, funny smell, pain, etc.), and my labels are medical conditions such as head trauma, cancer, etc.

A patient comes in and says "I have a fever", and I need to predict his medical condition from his symptoms. From my data set I know that fever and vomiting together go with condition X, so I would like to ask him if he is vomiting to increase the certainty of my classification.

What is the best algorithmic approach to find the right question (generating the question from my historical data set)? I thought about trying active learning on the features, but I am not sure that it is the right direction.

score 2 · Accepted Answer · answered Jul 12 '18 at 19:40

The problem you're trying to address can, in some sense, be viewed as a Feature Selection problem. If you search the literature using only those words, though, you're not going to find what you're looking for. In general, "Feature Selection" simply refers to the problem where you already have a large number of features and are deciding which ones to keep and which ones to throw away (for example, because they're not informative or because you don't have the processing power to train with all of them).

I'd recommend looking around for a combination of "Feature Selection" and "Cost-Sensitive". This is because, in your case, there are costs associated with selecting features; values may be costly to obtain for some features. Searching for this combination leads to publications which look to be interesting for you, such as:

I cannot personally vouch for any of those techniques since I've never used them, but those papers certainly look relevant for your problem.
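
To make the cost-sensitive idea a bit more concrete, here is a minimal sketch that is not taken from any particular paper. It greedily picks the not-yet-asked symptom whose answer is expected to reduce the entropy of the condition distribution the most per unit of cost. The function name `next_question`, the toy cases, and the cost values are all assumptions made up for illustration; symptoms are assumed to be yes/no, and probabilities are estimated by simple counting over the historical cases that match the answers given so far.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a Counter of class labels."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)


def next_question(cases, known, costs):
    """Pick the unasked symptom with the best expected information gain per cost.

    cases: list of (symptoms, condition), where symptoms maps name -> bool
    known: answers collected so far, name -> bool
    costs: hypothetical cost of asking about / measuring each symptom (> 0)
    Returns (symptom, gain_per_cost), or None if there is nothing left to ask.
    """
    # Keep only historical cases consistent with what the patient has told us.
    matching = [
        (s, y) for s, y in cases
        if all(s.get(k, False) == v for k, v in known.items())
    ]
    if not matching:
        return None

    base = entropy(Counter(y for _, y in matching))
    best = None
    for symptom, cost in costs.items():
        if symptom in known:
            continue
        # Expected entropy after hearing the answer, weighted by how often
        # each answer occurs among the matching cases.
        expected = 0.0
        for answer in (True, False):
            subset = [y for s, y in matching if s.get(symptom, False) == answer]
            if subset:
                expected += len(subset) / len(matching) * entropy(Counter(subset))
        score = (base - expected) / cost
        if best is None or score > best[1]:
            best = (symptom, score)
    return best


# Toy data, invented for illustration only.
cases = [
    ({"fever": True, "vomiting": True, "headache": False}, "X"),
    ({"fever": True, "vomiting": True, "headache": True}, "X"),
    ({"fever": True, "vomiting": False, "headache": True}, "Y"),
    ({"fever": True, "vomiting": False, "headache": False}, "Y"),
    ({"fever": False, "vomiting": False, "headache": True}, "Z"),
]
costs = {"fever": 1.0, "vomiting": 1.0, "headache": 1.0}

print(next_question(cases, {"fever": True}, costs))
```

With this toy data, a patient who reports fever would be asked about vomiting next, because that answer separates conditions X and Y while the headache answer does not. Counting over matching rows becomes unreliable once few rows match, so a real system would likely need a smoothed probabilistic model instead.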

When you're looking around for more literature, terms like "cost", "cost-based", maybe "budgeted" are crucial to include. If you don't include those, you're just going to get papers on problems like:

Feature Selection: given a set of features/columns, which ones am I going to use across all samples/instances/rows?
Feature Extraction: given data (typically without clear human-defined features, like images, sound, etc.), how am I going to extract relevant features from this?
Active Learning: given a bunch of samples without labels but feature values already assigned, which one would I like an oracle/human expert/etc. to have a look at so that they can tell me what the true label is?

None of those problems really appears to be relevant in your case. Active Learning may be somewhat interesting in that it is about figuring out which rows would be valuable to learn from, whereas your problem is about which columns would be valuable to learn from. There does seem to be a connection there: Active Learning techniques might, to some extent, be able to inspire techniques for your problem. But that is all they are likely to do; they probably won't be directly applicable without additional work.

score 1 · Answer 2 · answered Jul 12 '18 at 18:39

Feature Extraction

Patterson and Gibson's Deep Learning: A Practitioner's Approach (O'Reilly, 2017) states, "Convolutional Neural Networks (CNNs) ... consistently top image classification competitions," which is consistent with our experience in the lab. If your data is multi-dimensional, in that pain is on a scale from one to ten, fever is in degrees, and smell can be a result of blood components that can be quantified in lab reports, you can have a hypercube that can be treated just as the frames in a movie can. Movie learning is in ℝ⁴, with the third dimension being the frame index and the fourth being the sample index. With subjective pain, digital thermometer temperature, and three blood component concentrations, you have {P, T, C₁, C₂, C₃} and learning in ℝ⁶ for your CNN design.

Selecting Input Channels

Asking 100 questions and taking 10 blood panels is probably prohibitive. So you will need to stuff all the data from limited questioning and panels into a hypercube and find an approach that can similarly extract features from sparse input data. The weights leading from the input to the feature layers will then identify the questions from which the most important features can be extracted. Searching scholarly articles for "feature extraction sparse data" turns up a large number of options.
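
One rough way to read "the weights from the input will identify the questions" is to rank inputs by the magnitude of their learned weights. The sketch below is only an illustration under strong assumptions: it trains a single logistic unit as a stand-in for a network's first layer (not the CNN described above), the names `train_logistic` and `rank_inputs` and the toy data are invented, and all inputs are assumed to be on a comparable 0/1 scale. Weight magnitude is a crude proxy for importance at best.

```python
import math


def train_logistic(rows, labels, epochs=500, lr=0.5):
    """Fit one logistic unit with plain batch gradient descent."""
    n = len(rows[0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n
        grad_b = 0.0
        for x, y in zip(rows, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log loss with respect to z
            for i in range(n):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(rows)
    return w, b


def rank_inputs(names, w):
    """Order input names by absolute weight, largest first."""
    ranked = sorted(zip(names, w), key=lambda pair: abs(pair[1]), reverse=True)
    return [name for name, _ in ranked]


# Toy data, invented for illustration: the label (1 = condition X)
# happens to coincide with the "vomiting" column.
names = ["fever", "vomiting", "headache"]
rows = [
    [1, 1, 0], [1, 1, 1], [1, 0, 1], [1, 0, 0],
    [0, 1, 1], [0, 0, 0], [0, 1, 0], [0, 0, 1],
]
labels = [1, 1, 0, 0, 1, 0, 1, 0]

w, b = train_logistic(rows, labels)
print(rank_inputs(names, w))
```

In this toy setup "vomiting" ends up ranked first, since it is the only input that tracks the label. With real, correlated, differently scaled inputs, raw weight magnitudes can be misleading, so the ranking should be treated as a starting point for which questions to keep, not a conclusion.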

B. Zheng, S. W. Yoon, and S. S. Lam, "Breast cancer diagnosis based on feature extraction using a hybrid of K-means and support vector machine algorithms," Expert Systems with Applications, 2014 (Elsevier), may be particularly interesting, given the common domain.

Outcomes Analysis

The above is a limited approach because the loop is not closed. Only if the outcomes of treatment are used to produce labels, or a real-time reinforcement signal (over the course of months or years), will the system produce a meaningful optimization. Unsupervised learning for this particular problem is not likely to produce any significant improvement in treatment efficacy.
