In this tutorial, the authors build a speech recognition model that classifies a one-second audio clip as one of ten predefined words. Suppose we modify the problem as follows: given an Arabic dataset, we want to build a dialect recognition model that classifies a two-second audio clip as one of $n$ local dialects, using ten predefined sentences. That is, for each of these ten sentences there are $x$ different phrases and idioms that carry the same meaning$^*$. How can I adapt the tutorial's approach to solve this modified problem?
$*$ The $x$ different phrases and idioms for each sentence are not predefined.
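
To make the modified setup concrete, here is a minimal sketch of the two changes I think are needed on the data side: clips are fixed to two seconds instead of one, and the class label is the dialect rather than the spoken word. The 16 kHz sample rate, the helper names `fix_length` and `build_label_index`, and the dialect names in the usage lines are my own assumptions for illustration; they do not come from the tutorial or from my dataset.

```python
from typing import Dict, List, Sequence

SAMPLE_RATE = 16_000  # assumption: 16 kHz mono audio (not stated above)
CLIP_SECONDS = 2      # the modified problem uses two-second clips
CLIP_SAMPLES = SAMPLE_RATE * CLIP_SECONDS


def fix_length(samples: Sequence[float], target: int = CLIP_SAMPLES) -> List[float]:
    """Trim or zero-pad a waveform so every clip has exactly `target` samples."""
    trimmed = list(samples[:target])
    return trimmed + [0.0] * (target - len(trimmed))


def build_label_index(dialects: Sequence[str]) -> Dict[str, int]:
    """Map each dialect name to an integer class id.

    The label is the dialect only; which of the ten sentences (or which
    phrasing of it) was spoken is deliberately not part of the label.
    """
    return {name: i for i, name in enumerate(sorted(set(dialects)))}


# Hypothetical usage: dialect names are placeholders, not real dataset labels.
label_index = build_label_index(["dialect_b", "dialect_a", "dialect_b"])
short_clip = fix_length([0.1] * 10_000)   # shorter than 2 s, gets zero-padded
long_clip = fix_length([0.1] * 40_000)    # longer than 2 s, gets trimmed
```

With this framing, the number of output classes in the model would be $n$ (the size of `label_index`) rather than ten, and the input to the feature extraction step would be twice as long as in the tutorial.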
 
    