
I went through a research paper ("Voxel-Based 3D Object Reconstruction from Single 2D Image Using Variational Autoencoders") and tried to implement the approach following this diagram:

[Image of the reference network: https://ibb.co/4JgbQ9s]

Here is my implementation for the same:

```
from tensorflow.keras.layers import (Input, Conv2D, MaxPooling2D, Dense,
                                     Reshape, UpSampling3D, Conv3DTranspose)
from tensorflow.keras.models import Model

image = Input(shape=(None, None, 3))

# Encoder
l1 = Conv2D(64, (3, 3), strides=(2), padding='same', activation='leaky_relu')(image)
l2 = MaxPooling2D(padding='same')(l1)
l3 = Conv2D(32, (5, 5), strides=(2), padding='same', activation='leaky_relu')(l2)
l4 = MaxPooling2D(padding='same')(l3)
l5 = Conv2D(16, (7, 7), strides=(2), padding='same', activation='leaky_relu')(l4)
l6 = MaxPooling2D(padding='same')(l5)
l7 = Conv2D(8, (5, 5), strides=(2), padding='same', activation='leaky_relu')(l6)
l8 = MaxPooling2D(padding='same')(l7)
l9 = Conv2D(4, (3, 3), strides=(2), padding='same', activation='leaky_relu')(l8)
l10 = MaxPooling2D(padding='same')(l9)
l11 = Conv2D(2, (4, 4), strides=(2), padding='same', activation='leaky_relu')(l10)
l12 = MaxPooling2D(padding='same')(l11)
l13 = Conv2D(1, (2, 2), strides=(2), padding='same', activation='leaky_relu')(l12)

# latent variable z
l14 = Reshape((60, 512))(l13)
l15 = Dense((60 * 512), activation='leaky_relu')(l14)
l16 = Dense((128 * 4 * 4 * 4), activation='leaky_relu')(l15)
l17 = Reshape((60, 4, 4, 4, 128))(l16)

# Decoder
l18 = UpSampling3D()(l17)
l19 = Conv3DTranspose(60, (8, 8, 8), strides=(64), padding='same', activation='leaky_relu')(l17)
l20 = UpSampling3D()(l19)
l21 = Conv3DTranspose(60, (16, 16, 16), strides=(32), padding='same', activation='leaky_relu')(l20)
l22 = UpSampling3D()(l21)
l23 = Conv3DTranspose(60, (32, 32, 32), strides=(32), padding='same', activation='leaky_relu')(l22)
l24 = UpSampling3D()(l23)
l25 = Conv3DTranspose(60, (64, 64, 64), strides=(24), padding='same', activation='leaky_relu')(l24)
l26 = UpSampling3D()(l25)
l27 = Conv3DTranspose(60, (64, 64, 64), strides=(1), padding='same', activation='leaky_relu')(l26)

model3D = Model(image, l27)
```

This gives me the following error at l19:

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_33/351640059.py in <module>
     24 #Decoder
     25 l18 = UpSampling3D()(l17)
---> 26 l19 = Conv3DTranspose(60, (8, 8, 8), strides = (64), padding='same', activation = 'leaky_relu') (l17)
     27 l20 = UpSampling3D()(l19)
     28 l21 = Conv3DTranspose(60, (16,16,16), strides =(32), padding='same', activation = 'leaky_relu')(l20)

/opt/conda/lib/python3.7/site-packages/keras/engine/base_layer.py in __call__(self, *args, **kwargs)
    975     if _in_functional_construction_mode(self, inputs, args, kwargs, input_list):
    976       return self._functional_construction_call(inputs, args, kwargs,
--> 977                                                 input_list)
    978
    979     # Maintains info about the Layer.call stack.

/opt/conda/lib/python3.7/site-packages/keras/engine/base_layer.py in _functional_construction_call(self, inputs, args, kwargs, input_list)
   1113     # Check input assumptions set after layer building, e.g. input shape.
   1114     outputs = self._keras_tensor_symbolic_call(
-> 1115         inputs, input_masks, args, kwargs)
   1116
   1117     if outputs is None:

/opt/conda/lib/python3.7/site-packages/keras/engine/base_layer.py in _keras_tensor_symbolic_call(self, inputs, input_masks, args, kwargs)
    846       return tf.nest.map_structure(keras_tensor.KerasTensor, output_signature)
    847     else:
--> 848       return self._infer_output_signature(inputs, args, kwargs, input_masks)
    849
    850   def _infer_output_signature(self, inputs, args, kwargs, input_masks):

/opt/conda/lib/python3.7/site-packages/keras/engine/base_layer.py in _infer_output_signature(self, inputs, args, kwargs, input_masks)
    884     #   overridden).
    885     # TODO(kaftan): do we maybe_build here, or have we already done it?
--> 886     self._maybe_build(inputs)
    887     inputs = self._maybe_cast_inputs(inputs)
    888     outputs = call_fn(inputs, *args, **kwargs)

/opt/conda/lib/python3.7/site-packages/keras/engine/base_layer.py in _maybe_build(self, inputs)
   2657     # operations.
   2658     with tf_utils.maybe_init_scope(self):
-> 2659       self.build(input_shapes)  # pylint:disable=not-callable
   2660     # We must set also ensure that the layer is marked as built, and the build
   2661     # shape is stored since user defined build functions may not be calling

/opt/conda/lib/python3.7/site-packages/keras/layers/convolutional.py in build(self, input_shape)
   1546     if len(input_shape) != 5:
   1547       raise ValueError('Inputs should have rank 5, received input shape:',
-> 1548                        str(input_shape))
   1549     channel_axis = self._get_channel_axis()
   1550     if input_shape.dims[channel_axis].value is None:

ValueError: ('Inputs should have rank 5, received input shape:', '(None, 60, 4, 4, 4, 128)')
```

Any help and guidance is appreciated.

arizona_3

1 Answer


You are missing a Reshape step between layers 9 and 10. In addition, I suggest adding an activation function to your dense layers.
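
To illustrate the rank requirement (a minimal sketch of my own, not from the paper): Conv3DTranspose expects a rank-5 input of shape (batch, depth, height, width, channels), so any extra leading axis triggers exactly the error above.

```
from tensorflow.keras.layers import Input, Conv3DTranspose

# Rank-5 input (batch, depth, height, width, channels): accepted
ok = Input(shape=(4, 4, 4, 128))                          # (None, 4, 4, 4, 128)
y = Conv3DTranspose(60, (3, 3, 3), padding='same')(ok)    # builds fine

# Rank-6 input, as produced by Reshape((60, 4, 4, 4, 128)) in the question: rejected
bad = Input(shape=(60, 4, 4, 4, 128))                     # (None, 60, 4, 4, 4, 128)
# Conv3DTranspose(60, (3, 3, 3), padding='same')(bad)     # ValueError: Inputs should have rank 5
```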

Edit 1:

Actually, in Keras we don't usually care about the tensor's first dimension, since it is the batch size: docs. I assume you set the batch_size at your input layer, so that the first value of the tensors' shape isn't None?
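
For example (my own minimal sketch, with an arbitrary 128x128 input size): the batch dimension is reported as None unless batch_size is passed to Input.

```
from tensorflow.keras.layers import Input

x = Input(shape=(128, 128, 3))                 # reported shape: (None, 128, 128, 3)
y = Input(shape=(128, 128, 3), batch_size=60)  # reported shape: (60, 128, 128, 3)
print(x.shape, y.shape)
```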

Check the dimensions of your l7; according to the attached image, it should be (60, 512). l8 should be Dense(512, activation = 'leaky_relu')(l7) and l9 would be Dense(128*4*4*4, activation = 'leaky_relu')(l8), which can be reshaped to (60, 4, 4, 4, 128) by calling Reshape((4, 4, 4, 128))(l9).
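
Put together, a minimal sketch of that latent block (assuming, as above, that 60 is treated as the batch size and the flattened encoder output has 512 features; the activation string mirrors the question's usage and the names are illustrative, not from the paper):

```
from tensorflow.keras.layers import Input, Dense, Reshape, Conv3DTranspose

# Hypothetical flattened encoder output: 60 samples x 512 features
encoded = Input(shape=(512,), batch_size=60)              # (60, 512)

h = Dense(512, activation='leaky_relu')(encoded)          # (60, 512)
h = Dense(128 * 4 * 4 * 4, activation='leaky_relu')(h)    # (60, 8192)
h = Reshape((4, 4, 4, 128))(h)                            # (60, 4, 4, 4, 128), rank 5

# A rank-5 tensor is now accepted by the first decoder layer
out = Conv3DTranspose(60, (8, 8, 8), strides=2, padding='same')(h)
```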

Now that I look at your implementation more carefully, I notice several issues:

  • It has six Conv2D layers but the reference architecture has seven
  • The number of filters and the kernel sizes don't match, except for the l1 layer. For example, l2 should have 64 filters and a kernel size of 5. The number of filters should also increase as you go deeper; the last layer has 512 of them, but yours has only one.
  • The reference architecture seems to use pooling layers; the resolution is halved at each step (a generic sketch of this pattern follows this list).
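
As a generic illustration of that pattern (my own sketch with made-up filter counts and input size, not the paper's exact architecture): the filter count grows with depth while each pooling stage halves the spatial resolution.

```
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D

x = Input(shape=(128, 128, 3))
h = x
# Illustrative filter counts only; each Conv2D + MaxPooling2D stage halves the resolution.
for filters in (64, 128, 256, 512):
    h = Conv2D(filters, (3, 3), padding='same', activation='relu')(h)
    h = MaxPooling2D(padding='same')(h)
# h now has shape (None, 8, 8, 512)
```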

I suggest you try some simpler tutorials first, to get familiar with tensor sizes and how they relate to different network layers and their parameters.

NikoNyrh