I am working on a project involving texture compression and reconstruction, specifically targeting mipmap upscaling for game development assets. I'm encountering challenges in achieving the desired prediction accuracy and am seeking advice on neural network architectures and training strategies, with a strict constraint: only MLPs are permitted (no CNNs or traditional autoencoders).

The core task involves reconstructing higher-resolution mipmap levels, such as 256x256, from their lower-resolution counterparts, such as 128x128. This should be achieved using MLPs (although I am open to suggestions within that constraint).
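To make the per-pixel setup concrete, here is a minimal sketch of such an MLP in plain NumPy (the layer widths, He-style initialisation, and ReLU activations are my assumptions, not the asker's actual architecture):

```python
import numpy as np

def init_mlp(sizes, rng):
    """Initialise weights/biases for a fully connected net, e.g. sizes=[24, 64, 64, 1]."""
    weights = [rng.normal(0.0, np.sqrt(2.0 / a), size=(a, b))
               for a, b in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(b) for b in sizes[1:]]
    return weights, biases

def mlp_forward(x, weights, biases):
    """ReLU MLP forward pass; the final layer is linear (regression output)."""
    h = x
    for i, (w, b) in enumerate(zip(weights, biases)):
        h = h @ w + b
        if i < len(weights) - 1:       # no activation on the output layer
            h = np.maximum(h, 0.0)
    return h
```

Each forward pass maps a batch of 24-feature rows to one predicted pixel value per row; in a real setup this would of course be trained (e.g. with Adam in a framework of choice) rather than used with random weights.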

For each target pixel in the higher-resolution mipmap, the network receives 24 input features sampled in a non-grid, circular pattern around the corresponding location in the lower-resolution input. The network's output is the predicted pixel value (currently grayscale, although it will eventually be RGB). The training data originates from approximately 100 diverse videogame-oriented texture images (metals, stone, vegetation, etc.): textures are selected at random and random texel groups are sampled from them, yielding hundreds of millions of sample pairs (24 features + 1 label), stored in CSV format for training (currently on Google Colab).
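For illustration, one way to produce 24 features from a circular pattern is three concentric rings of eight samples each; the exact ring radii, sample counts, and the nearest-neighbour lookup below are assumptions for the sketch, not the asker's actual pattern:

```python
import numpy as np

def circular_offsets(n_rings=3, samples_per_ring=8, base_radius=1.0):
    """Generate (dy, dx) offsets on concentric rings: 3 * 8 = 24 samples."""
    offsets = []
    for ring in range(1, n_rings + 1):
        radius = base_radius * ring
        for k in range(samples_per_ring):
            theta = 2.0 * np.pi * k / samples_per_ring
            offsets.append((radius * np.sin(theta), radius * np.cos(theta)))
    return np.array(offsets)                     # shape (24, 2)

def sample_features(low_res, cy, cx, offsets):
    """Gather one feature vector around (cy, cx), clamping at image borders."""
    h, w = low_res.shape
    ys = np.clip(np.round(cy + offsets[:, 0]).astype(int), 0, h - 1)
    xs = np.clip(np.round(cx + offsets[:, 1]).astype(int), 0, w - 1)
    return low_res[ys, xs]                       # shape (24,)
```

Bilinear interpolation at the fractional offset positions would likely serve better than the nearest-neighbour rounding used here, since the circular pattern does not align with the texel grid.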

The system architecture relies on a hierarchy of independently trained models. Firstly, separate models are trained for distinct material categories (e.g., metal, stone, vegetation, etc.). Secondly, unique models handle each specific mipmap level transition (e.g., 128x128 -> 256x256 is treated separately from 64x64 -> 128x128). Finally, within each mipmap transition that doubles resolution (e.g., from N×N to 2N×2N), separate models are employed to predict the newly inserted pixel values based on their spatial context relative to the original grid. One model type predicts pixels inserted horizontally between original columns, another predicts pixels inserted vertically between original rows, and a third predicts the pixels inserted diagonally, filling the remaining grid points.
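The three-model split per doubling transition can be sketched as follows; the predictor call signature `predict(low_res, y, x)` is a placeholder for however the trained MLPs are actually invoked:

```python
import numpy as np

def upscale_2x(low_res, predict_h, predict_v, predict_d):
    """Assemble a 2Nx2N image from an NxN input using three predictors.

    predict_h: pixels inserted horizontally, between original columns
    predict_v: pixels inserted vertically, between original rows
    predict_d: pixels inserted diagonally, filling the remaining grid points
    """
    n = low_res.shape[0]
    out = np.zeros((2 * n, 2 * n), dtype=np.float32)
    out[0::2, 0::2] = low_res                        # original texels kept as-is
    for y in range(n):
        for x in range(n):
            out[2 * y, 2 * x + 1] = predict_h(low_res, y, x)
            out[2 * y + 1, 2 * x] = predict_v(low_res, y, x)
            out[2 * y + 1, 2 * x + 1] = predict_d(low_res, y, x)
    return out
```

In practice the per-pixel loop would be replaced by batched inference over all feature vectors of one position class at once, which also matches how the training pairs are generated.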

Instead of storing traditional image deltas (which span [-255, 255] for 8-bit pixels), I store the prediction residual "residual = actual_pixel_value - predicted_pixel_value". The motivation is that a well-trained MLP should produce a predicted_pixel_value much closer to the actual_pixel_value than a simple predictor could, so these residuals should cluster even more tightly around zero than standard deltas. This significantly reduces the numerical range and, crucially, the entropy of the values needing storage, making the near-zero residuals far more compressible with general-purpose algorithms (like DEFLATE/ZIP) than standard deltas.
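The compression benefit is easy to demonstrate on synthetic data (the Gaussian widths below are illustrative stand-ins, not measurements from real textures):

```python
import zlib
import numpy as np

# Synthetic illustration: a good predictor's residuals cluster tightly
# around zero, while plain image deltas spread much wider.
rng = np.random.default_rng(0)
n = 100_000
residuals = rng.normal(0.0, 2.0, size=n).round().astype(np.int16)
deltas = rng.normal(0.0, 30.0, size=n).round().astype(np.int16)

def compressed_size(arr):
    """Size in bytes after DEFLATE (the codec inside zlib/ZIP)."""
    return len(zlib.compress(arr.tobytes(), level=9))

print("residuals:", compressed_size(residuals), "bytes")
print("deltas:   ", compressed_size(deltas), "bytes")
```

The lower-entropy residual stream compresses substantially better; an entropy coder tuned to the residual distribution (e.g. range coding) would widen the gap further than general-purpose DEFLATE.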

Any pointers to relevant research, practical techniques, or architectural suggestions tailored to these constraints and goals would be highly valuable. I have reviewed alternative MLP variants such as S2-MLP, CycleMLP, and ResMLP, but my attempts to adapt them have not yet produced the desired level of performance on this specific fixed-size, spatially-sampled regression problem.
