I am trying to build a multimodal architecture for speech emotion recognition (SER). For that, I need to extract audio features from the Emotion2Vec model. After reading the paper and going through the GitHub codebase, I could not work out exactly how to do this. I would appreciate some help, as I'm quite new to this.
Here is the link to the paper: 1
Here is the codebase link: 2