Problem Context: I'm building a Morse code audio decoder using a CNN-BiLSTM with CTC loss. My current 4-layer model reaches a Levenshtein distance of ≈0.6, but my attempt to improve it by adding a 5th CNN layer caused training to collapse (loss stuck at ~4). I'm looking for practical suggestions to reduce the error rate further.
Current Architecture (Working 4-Layer):
import torch.nn as nn

# Inside the model's __init__:
self.cnn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.MaxPool2d(2),
    # ... 3 more blocks ...
)
self.rnn = nn.LSTM(128 * 8, 256, bidirectional=True)  # 128 * 8 = 1024 flattened CNN features per time step
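For reference, the loss side is standard PyTorch CTC; a minimal self-contained sketch of the call is below (dummy tensors stand in for the real model output and labels, and the class count is a placeholder, not my exact setup):

import torch
import torch.nn as nn

T, N, C = 120, 4, 40   # frames, batch size, classes (blank = index 0)
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)      # stand-in for model output
targets = torch.randint(1, C, (N, 20), dtype=torch.long)  # dummy label sequences
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 20, dtype=torch.long)

criterion = nn.CTCLoss(blank=0, zero_infinity=True)
loss = criterion(log_probs, targets, input_lengths, target_lengths)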
Specific Questions:
Architecture Improvements:
Would replacing the later CNN layers with dilated convolutions better preserve temporal resolution? (Sketch of what I mean after this group.)
Is there an optimal ratio between CNN depth and LSTM complexity for this task?
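To make the dilated-convolution question concrete, this is the kind of later block I have in mind: dilation grows the receptive field along time while pooling only the frequency axis, so the time resolution the CTC alignment needs is kept. The channel sizes and the (freq, time) layout are assumptions, not my actual code:

import torch.nn as nn

# Assumes input layout (batch, channels, freq, time); numbers are placeholders.
dilated_block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=(1, 2), dilation=(1, 2)),  # dilate along time
    nn.BatchNorm2d(128),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=(2, 1)),  # halve frequency, keep every time step
)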
Noise Robustness:
Which techniques work best for Morse decoding in noisy conditions?
Additive noise layers during training? (Rough sketch after this group.)
Learnable spectral suppression?
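By "additive noise layers" I mean something along these lines: a module placed in front of the CNN that adds Gaussian noise at a randomly sampled SNR, active only in training mode. The SNR range here is a guess:

import torch
import torch.nn as nn

class AdditiveNoise(nn.Module):
    """Add Gaussian noise to the input at a random SNR (training mode only)."""
    def __init__(self, snr_db_range=(5.0, 20.0)):
        super().__init__()
        self.snr_db_range = snr_db_range

    def forward(self, x):
        if not self.training:
            return x
        snr_db = torch.empty(1, device=x.device).uniform_(*self.snr_db_range)
        signal_power = x.pow(2).mean()
        noise_power = signal_power / (10.0 ** (snr_db / 10.0))
        return x + torch.randn_like(x) * noise_power.sqrt()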
Sequence Modeling:
Would adding attention between the CNN and LSTM help? (Sketch of what I'm picturing after this group.)
Should I try hybrid CTC/attention loss?
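For the attention question, what I'm picturing is plain self-attention over the CNN's output sequence before it enters the BiLSTM, with a residual connection. The 1024 matches my flattened CNN features (128*8); the head count and the sequence/batch sizes are arbitrary:

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=1024, num_heads=8)  # expects (time, batch, features)
rnn = nn.LSTM(1024, 256, bidirectional=True)

feats = torch.randn(200, 4, 1024)        # (time, batch, features) from the CNN
attended, _ = attn(feats, feats, feats)  # query = key = value (self-attention)
out, _ = rnn(feats + attended)           # residual connection around the attention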
Training Tricks:
What curriculum learning strategies work for Morse?
Are there better ways to initialize the LSTM for CTC? (Sketch of what I mean by initialization after this group.)
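On initialization, the only concrete trick I'm aware of is the usual forget-gate bias one, sketched below; I don't know whether anything more CTC-specific exists. PyTorch packs LSTM biases in [input, forget, cell, output] order, so the forget gate is the second quarter of each bias vector:

import torch.nn as nn

rnn = nn.LSTM(128 * 8, 256, bidirectional=True)

# Set the forget-gate bias chunk to 1 so the LSTM starts out retaining state.
hidden = rnn.hidden_size
for name, param in rnn.named_parameters():
    if "bias" in name:  # bias_ih_l0, bias_hh_l0, and their reverse-direction variants
        param.data[hidden:2 * hidden].fill_(1.0)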
Alternative Approaches:
Would a U-Net style encoder-decoder outperform CNN-BiLSTM?
Are there lightweight transformer variants suitable for this task? (Rough sketch of the kind of thing I mean below.)
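To pin down what I mean by a "lightweight transformer": something like a small TransformerEncoder dropped in where the BiLSTM sits now, after projecting the flattened CNN features down. The layer count and widths below are placeholders:

import torch
import torch.nn as nn

proj = nn.Linear(128 * 8, 256)                        # project flattened CNN features
layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, dim_feedforward=512)
encoder = nn.TransformerEncoder(layer, num_layers=4)  # expects (time, batch, d_model)

feats = torch.randn(200, 4, 128 * 8)                  # (time, batch, features) from the CNN
out = encoder(proj(feats))                            # (time, batch, 256) -> CTC head; positional encoding omitted for brevity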