
Suppose there is a problem of object detection in CCTV footage. Look at these tensor dimensions below:

Input: Image -> (Height, Width, RGB)

Output: Mask of Image -> (Height, Width, Num of Object Class)

Object classes are, for example, person, vehicle, and others (three classes).

But there is an additional feature, for example the camera placement point of view; we call this the view class. Suppose there are three views.

But keep in mind, the main goal is detecting the object class, not classifying the view class. The view class is only intended to help the model learn better (my hypothesis). So the model input is still just the RGB image.

The curse of dimensionality says that the more features we feed to the model (at the input layer), the worse the model performance may become. This looks counter-intuitive.

In other words, if we feed the model the view class at the input layer, the model might find it harder to learn instead of easier.

Still, my hypothesis insists that the object class depends on the view class.

So here, instead of risking the curse of dimensionality like this:

Input: Image+View -> (View, Height, Width, RGB)

Output: Mask of Image -> (Height, Width, Object)

It must be like this:

Input: Image -> (Height, Width, RGB)

Intermediary Sub Output: View -> (Height, Width, View)

Output: Mask of Image -> (Height, Width, Object)

So here, the definition of features is not constrained to the input only.

I suggest that the view is actually an output instead of an input. Since my assumption was "object output performance depends on view output performance", the view prediction must sit in a hidden or intermediary layer.
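To make the idea concrete, here is a minimal sketch of such an architecture in PyTorch. The class name, layer sizes, and the way the view logits are fused back into the object head are all my own illustrative assumptions, not a tuned design; the point is only that the view is a per-pixel intermediary output, not an input:

```python
import torch
import torch.nn as nn

class ViewConditionedSegmenter(nn.Module):
    """Hypothetical sketch: RGB image -> intermediary view map -> object mask."""

    def __init__(self, num_views=3, num_objects=3):
        super().__init__()
        # Shared encoder: RGB image -> feature map (spatial size preserved).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        )
        # Intermediary sub-output: per-pixel view logits -> (Height, Width, View).
        self.view_head = nn.Conv2d(16, num_views, 1)
        # Object head consumes both the features and the view prediction.
        self.object_head = nn.Conv2d(16 + num_views, num_objects, 1)

    def forward(self, x):                        # x: (N, 3, H, W)
        feats = self.encoder(x)
        view_logits = self.view_head(feats)      # (N, View, H, W)
        fused = torch.cat([feats, view_logits], dim=1)
        object_logits = self.object_head(fused)  # (N, Object, H, W)
        return view_logits, object_logits

model = ViewConditionedSegmenter()
view, obj = model(torch.randn(1, 3, 64, 64))
print(view.shape, obj.shape)  # both torch.Size([1, 3, 64, 64])
```

During training, the total loss would be the object mask loss plus a weighted auxiliary loss on the view logits, so the view supervision shapes the shared encoder without being an input feature.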

Here is the question: shall I preserve the tensor dimension like this:

(View, Height, Width, RGB)

Or just use a direct dense MLP head where the number of dense units is the number of views, so the dimension looks like this:

(View)

Or do I not need the view feature at all? Just train directly in the regular, traditional input-output way, between the RGB image and the mask image (the object detection output).
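For comparison, the second option can be sketched as a global-pooled dense head that collapses the view prediction to a single vector of size (View). Again, all names and sizes here are assumptions of mine; note that setting the view-loss weight to zero would recover the third option (the plain baseline without any view supervision):

```python
import torch
import torch.nn as nn

class DenseViewHead(nn.Module):
    """Hypothetical sketch: one image-level view vector of shape (View,)."""

    def __init__(self, num_views=3, num_objects=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        )
        # Dense view head: number of units == number of views.
        self.view_fc = nn.Linear(16, num_views)
        self.object_head = nn.Conv2d(16, num_objects, 1)

    def forward(self, x):                       # x: (N, 3, H, W)
        feats = self.encoder(x)                 # (N, 16, H, W)
        pooled = feats.mean(dim=(2, 3))         # global average pool -> (N, 16)
        view_logits = self.view_fc(pooled)      # (N, View)
        object_logits = self.object_head(feats) # (N, Object, H, W)
        return view_logits, object_logits

model = DenseViewHead()
view, obj = model(torch.randn(1, 3, 32, 32))
print(view.shape, obj.shape)
```

Here the view acts purely as an auxiliary training signal on the shared encoder, since a single vector per image is cheaper to supervise than a per-pixel view map.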
