
This is my first post so please forgive me for any mistakes.

I am working on an object detection algorithm that can detect abnormalities in an x-ray. As a prototype, I will be using YOLOv3 (more about YOLO here: https://pjreddie.com/darknet/yolo/). However, one radiologist mentioned that in order to produce a good result you need to take into account the demographics of the patient. To do that, my neural network must take into account both text and an image. Some suggestions have been made by other people for this question. For example, someone recommended combining the result of a convolutional neural network with the result of a separate text neural network. Here is an image for clarification:

Image Credits: This image (https://cdn-images-1.medium.com/max/1600/1*oiLg3C3-7Ocklg9_xubRRw.jpeg) is from Christopher Bonnett's article (https://blog.insightdatascience.com/classifying-e-commerce-products-based-on-images-and-text-14b3f98f899e).

For more details, please refer to the above-mentioned article. It explains how e-commerce products can be classified into various category hierarchies using both image and text data.
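
To make sure I understand the article, here is roughly how I picture that architecture (a rough sketch I wrote for clarification; the layer sizes, input shapes and number of classes are made up, not taken from the article):

```python
# Rough sketch of the two-branch idea: an image branch and a text branch are
# processed separately and their features are concatenated before the classifier.
# All shapes and sizes below are placeholders.
from tensorflow.keras import layers, models

image_in = layers.Input(shape=(224, 224, 3), name="image")
x = layers.Conv2D(32, 3, activation="relu")(image_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)

text_in = layers.Input(shape=(16,), name="text_features")  # e.g. embedded product text
t = layers.Dense(32, activation="relu")(text_in)

merged = layers.concatenate([x, t])
out = layers.Dense(10, activation="softmax")(merged)  # 10 categories, just as an example

model = models.Model(inputs=[image_in, text_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```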

However, when a convolutional neural network is mentioned, it usually means it is used for classification rather than detection (see https://www.quora.com/What-is-the-difference-between-detection-and-classification-in-computer-vision for a comparison between detection and classification).

In my case, since I am using YOLOv3, how would this work? Would I be using the YOLOv3 output vector, which has the format class, center_x, center_y, width and height?
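
For reference, this is the per-detection format I mean (the numbers below are made up; coordinates are normalized to the image size, as in Darknet label files):

```python
# One detection in the format class, center_x, center_y, width, height (values invented):
detection = {"class": 2, "center_x": 0.41, "center_y": 0.63, "width": 0.12, "height": 0.08}
# Darknet-style label files use the same layout per line:
# <class> <center_x> <center_y> <width> <height>
```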

My main question is: what would the overall structure of my neural network look like if I have both image and text as input while using YOLOv3? Thank you for taking the time to read this.

2 Answers


First of all, you don't need a "text" input as in Christopher Bonnett's blog. Your case is easier: demographics are tabular data, which can be expressed as a vector of numeric values. This data should be processed, that is, pushed through one or two fully connected layers. The trick is where, i.e. to which part of YOLO, to concatenate the result of processing this vector. Because it's vector data, it should be concatenated to a fully connected layer. Where exactly should be found by experiment, but as a starting point it could be concatenated to the second-to-last (before output) fully connected layer (I think for YOLO it's a 4096-unit layer).
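
To show what I mean, here is a minimal sketch in Keras. The layer names and sizes are illustrative only (a real YOLOv3 backbone is fully convolutional and much deeper), so treat it as the shape of the idea rather than actual YOLO code:

```python
from tensorflow.keras import layers, models

image_in = layers.Input(shape=(416, 416, 3), name="image")
demo_in = layers.Input(shape=(8,), name="demographics")  # e.g. age, sex, ... encoded as numbers

# Stand-in for the YOLO backbone up to the point you tap into
x = layers.Conv2D(64, 3, strides=2, activation="relu")(image_in)
x = layers.Conv2D(128, 3, strides=2, activation="relu")(x)
backbone_features = layers.GlobalAveragePooling2D()(x)

# One or two fully connected layers over the demographic vector
d = layers.Dense(32, activation="relu")(demo_in)
d = layers.Dense(32, activation="relu")(d)

# Concatenate into a fully connected layer just before the detection output
merged = layers.concatenate([backbone_features, d])
fc = layers.Dense(4096, activation="relu")(merged)
detections = layers.Dense(7 * 7 * 30)(fc)  # placeholder detection head (YOLOv1-style output size)

model = models.Model(inputs=[image_in, demo_in], outputs=detections)
```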

Overall I'd say this is not a trivial task. It requires some experience with deep learning, a good understanding of the YOLO design and algorithm, and a lot of experimentation, both with the architecture and the hyperparameters. It's probably worth a solid paper. Good luck.

mirror2image

you need to take into account the demographics of the patient

How, exactly?

  • Is it a difference of, say, a threshold? In that case you can do this serially (as @mirror2image mentions): process the image and then conclude by comparing the size of what you saw to, say, an age-dependent threshold (see the sketch after this list).
  • Or does the whole processing have to be different? In the extreme case, you would not wait until the very end before asking whether the patient is a man if you are looking for prostate cancer.
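
To make the first option concrete, here is a toy sketch of the serial approach. The threshold rule and every number in it are invented; choosing the real rule is exactly the medical decision I am talking about:

```python
def size_threshold_mm(age_years):
    # Hypothetical rule: accept smaller findings in younger patients
    return 5.0 if age_years < 40 else 8.0

def filter_detections(detections, age_years, pixel_spacing_mm, input_size=416):
    """detections: list of (class_id, cx, cy, w, h, confidence), coords normalized to [0, 1]."""
    kept = []
    for class_id, cx, cy, w, h, conf in detections:
        width_mm = w * input_size * pixel_spacing_mm  # convert normalized width to millimetres
        if width_mm >= size_threshold_mm(age_years):
            kept.append((class_id, cx, cy, w, h, conf))
    return kept
```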

To design the model, you need enough medical understanding to make such choices. The model can handle the parameters, but you have to choose the architecture.