Automatically generate subtitles/close caption from a video using speech-to-text?

Question

I have a video that I want to create subtitles for. Is there a program that can perform rudimentary speech-to-text in order to

Set the correct start/stop of each individual subtitle
Create rudimentary text subtitles (using some sort of speech-to-text)

I know about gnome-subtitles. However, it requires extensive effort to create those subtitles manually: you have to select the start and stop of each sentence yourself.

YouTube has the above features (creates rudimentary text subtitles at the correct timings, using speech-to-text). However, I would rather not upload the videos to YouTube just to get my subtitles. Is it possible to do the subtitles efficiently on Ubuntu?

Update: I plan to use .srt subtitles only, and do not need to hard-code them into the videos. My biggest requirement is that the program automatically finds the start/stop of each sentence, so that I only have to write the text.

Update #2: There is Speech-to-Text software for Linux, with the CMU Sphinx package. It is possible to use CMU Sphinx with a subtitle program according to this post. In addition, one subtitle tool is aware of this CMU Sphinx feature (web based tool), however there is no reference in the latest source code that they added CMU Sphinx. The quest continues to find a program that uses CMU Sphinx for rudimentary speech to text (which would set the correct timings as well), as YouTube already does.

Pablo Bianchi · Answer 1 · 2023-09-28T04:06:44.217

You have several alternatives:

YouTube

If you are willing to temporarily upload the video to YouTube (selecting the video language is mandatory) to get its subtitles (closed captions, lyrics), you can then extract/download them with youtube-dl or yt-dlp:

# Comments cannot follow a line-continuation backslash, so the options are collected in a bash array.
opts=(
  --write-auto-sub             # Write automatically generated subtitle file (YouTube only)
  --write-sub                  # Write subtitle file
  --sub-lang en,de,es          # Languages of the subtitles to download (optional), separated by commas; use --list-subs for available language tags
  --convert-subs srt           # Convert the subtitles to another format (currently supported: srt|ass|vtt|lrc)
  -o "~/%(uploader)s/%(playlist)s/%(playlist_index)s - %(title)s.%(ext)s"  # Output template
  --skip-download              # Do not download the video
  --ignore-errors              # Continue on download errors, for example to skip unavailable videos in a playlist
)
yt-dlp "${opts[@]}" vidURLorID

In one line and simplified:

yt-dlp --write-auto-sub --write-sub --sub-lang en --convert-subs srt --skip-download vidURLorID
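The same options can also be set from Python through yt-dlp's embedding interface. This is only a sketch, assuming the yt_dlp package is installed. The helper names build_subtitle_options and fetch_subtitles are made up for illustration, and the option keys are my understanding of what the command-line flags map to, so check them against the yt-dlp documentation for your version.

```python
def build_subtitle_options(languages=("en",), fmt="srt"):
    """Return a yt-dlp options dict mirroring the one-line command above."""
    return {
        "writesubtitles": True,             # --write-sub
        "writeautomaticsub": True,          # --write-auto-sub
        "subtitleslangs": list(languages),  # --sub-lang
        "skip_download": True,              # --skip-download
        "postprocessors": [
            # --convert-subs srt; "before_dl" so it still runs when the
            # video download itself is skipped (assumption, see lead-in)
            {"key": "FFmpegSubtitlesConvertor", "format": fmt, "when": "before_dl"},
        ],
    }


def fetch_subtitles(url, languages=("en",)):
    """Download only the subtitles for `url` (needs network access and yt_dlp)."""
    import yt_dlp  # imported lazily so the options helper works without it

    with yt_dlp.YoutubeDL(build_subtitle_options(languages)) as ydl:
        return ydl.download([url])
```

Calling fetch_subtitles("vidURLorID") with a real URL or ID should then leave the subtitle file in the current directory.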

If the conversion didn't work, convert it with FFmpeg:

ffmpeg -i myTitle.en.vtt output.srt

To convert from srt to txt:

sed -r -e 's/^\xef\xbb\xbf//' -e 's/\r//' -e 's/^[0-9]*$//' -e '/^[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}$/d' -e 's/^\s*$//' -e '/^$/d;s/<[^>]*>//g' output.srt | uniq > output.txt
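If the sed one-liner is hard to read or adapt, the same clean-up can be sketched in Python. This is an illustrative equivalent, not part of any of the tools above, and the function name srt_to_text is made up. Like the sed version, it drops the byte-order mark, cue numbers, timing lines, blank lines and markup tags, then removes consecutive duplicate lines.

```python
import re

# A timing line such as "00:00:05,080 --> 00:00:09,120"
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}\s*$")
TAG = re.compile(r"<[^>]*>")


def srt_to_text(srt):
    """Return only the spoken text of an SRT document, one line per subtitle line."""
    lines = []
    for raw in srt.lstrip("\ufeff").splitlines():
        line = TAG.sub("", raw).strip()
        if not line or line.isdigit() or TIMING.match(line):
            continue  # skip blanks, cue numbers and timing lines
        if lines and lines[-1] == line:
            continue  # like uniq: drop consecutive duplicates
        lines.append(line)
    return "\n".join(lines)
```

As with the sed command, a subtitle line that consists only of digits is treated as a cue number and dropped.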

Whisper (OpenAI)

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.

You can get some tools based on Whisper from this awesome list.

Live Captions

Live Captions is an application that provides live captioning for the Linux desktop.

Only the English language is supported currently. Other languages may produce gibberish or a bad phonetic translation.

On Flathub.

Kdenlive

Kdenlive has an Automatic Subtitling/Speech to text feature (optionally using Whisper).

score 4 · Answer 2 · edited Nov 02 '20 at 17:14

UPDATE:

autosub is no longer maintained. Another fork with a GUI, called pyTranscriber, can be used instead.

You can use this command-line utility:

Autosub is a utility for automatic speech recognition and subtitle generation. It takes a video or an audio file as input, performs voice activity detection to find speech regions, makes parallel requests to Google Web Speech API to generate transcriptions for those regions, (optionally) translates them to a different language, and finally saves the resulting subtitles to disk.

https://github.com/agermanidis/autosub/
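To illustrate the voice activity detection step that finds the start/stop of each speech region, here is a minimal energy-threshold sketch. It is not autosub's actual code. The function names, the 20 ms frame length, the threshold of 500 and the 0.3 s merge gap are all assumptions chosen for illustration, and real recordings with background music need something more robust.

```python
import math


def frame_rms(samples, frame_len):
    """Split samples into whole frames and return the RMS energy of each frame."""
    rms = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        rms.append(math.sqrt(sum(s * s for s in frame) / frame_len))
    return rms


def speech_regions(samples, rate, frame_ms=20, threshold=500.0, min_gap_s=0.3):
    """Return (start, end) times in seconds where frame energy reaches threshold.

    Regions separated by less than min_gap_s of silence are merged.
    """
    frame_len = int(rate * frame_ms / 1000)
    frame_s = frame_len / rate
    regions = []
    for idx, energy in enumerate(frame_rms(samples, frame_len)):
        if energy < threshold:
            continue  # silent frame
        start, end = idx * frame_s, (idx + 1) * frame_s
        if regions and start - regions[-1][1] < min_gap_s:
            regions[-1][1] = end  # extend the previous region
        else:
            regions.append([start, end])
    return [(round(s, 3), round(e, 3)) for s, e in regions]
```

The samples would typically be 16-bit mono PCM values read from a WAV file. Each returned pair could become the timing of one empty subtitle cue to be filled in by hand or by a speech recognizer.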

Python3 users, do this:

pip install git+https://github.com/BingLingGroup/autosub.git@alpha

Make sure you have ffmpeg installed.

score 3 · Answer 3 · answered Sep 24 '12 at 18:32

I personally like GNOME Subtitles. It is available in the repositories:

sudo apt-get install gnome-subtitles

answered Sep 24 '12 at 18:32 by Marlinc

score 3 · Answer 4 · edited Jun 12 '20 at 14:37

I used Aegisub on Windows some years ago, and was really happy with it. Apparently it is available for Linux. It is pretty self-explanatory.

Aegisub only creates the subtitle file, e.g. an .srt file. To combine the video and the subtitles into a hard-coded result, you still need a second program.
On Windows I used VirtualDub, but it is not available for Linux. You can use VLC to do this on Linux:

Create your subs in Aegisub, saving it as usual as a .ass file.

Use VLC to add that subtitle track to your video. Subtitle -> Add subtitle file...

Configure the subtitle display style and settings so they display to your liking. Tools -> Preferences -> Subtitles/OSD

You can now watch the video to make sure the subs are displaying as you intended. For example I can check certain subs that I've specified in Aegisub to be displayed at the top of the screen rather than the bottom.

The output will be identical to how it looks now, so make sure all is good.

Go to Media -> Convert/Save... (Ctrl + R).

Under File Selection, add your video file. Tick "Use a subtitle file" and browse to your .ass sub file.

Click the down arrow on the Convert/Save button and click Convert...(Alt + O).

Under Settings, ensure the Convert option is ticked. Tick the Display the output option. Subs aren't added for some reason unless you tick this.

Edit the profile so the video and audio settings are what you want. Under the subtitle tab, tick the Subtitles box, and use DVB subtitle codec. Make sure you tick 'Overlay subtitles on the video'. Press save.

Enter a destination folder and filename in the Destination box.

Press start.

Wait for it to be done, and that's it. The caveat with this method is that the encoding will happen in real-time with the video, so if you have a 2 hour video, it will take 2 hours. This is due to ticking the 'Display the output' box. But for some reason it only works when you tick this.
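As a possible alternative to the VLC steps, FFmpeg's subtitles filter can render a subtitle file onto the video in one command. This is a sketch that assumes an FFmpeg build with libass support. The file names are placeholders and the helper names are made up for illustration.

```python
import subprocess


def burn_in_command(video, subs, output):
    """Build an FFmpeg command that renders `subs` onto `video`."""
    return [
        "ffmpeg",
        "-i", video,
        "-vf", f"subtitles={subs}",  # needs an FFmpeg build with libass
        "-c:a", "copy",              # keep the audio stream untouched
        output,
    ]


def burn_in(video, subs, output):
    """Run the command; raises CalledProcessError if FFmpeg fails."""
    subprocess.run(burn_in_command(video, subs, output), check=True)
```

For example, burn_in("input.mp4", "subs.ass", "output.mp4") re-encodes the video with the subtitles drawn on top. Subtitle paths containing characters such as colons or quotes need extra escaping inside the filter argument, which this sketch does not handle.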

There are also other subtitle-editors.

Update:
I don't remember Aegisub having a function to automatically set the beginning and end of a spoken sentence in the subtitle file, and I don't see such a function mentioned anywhere on the site. It is, however, pretty easy to set those times manually with key combinations.

Is there even any program which has such a function (in any OS)?

score 3 · Accepted Answer · answered May 18 '11 at 16:56

I did not find a way to get the subtitle program to automatically add rudimentary subtitles, by analysing the voices in the video.

Therefore, the alternative that I use is

Upload the video to YouTube (for example, privately) and use the built-in facility to automatically create rudimentary subtitles.

Then,

Add the video to http://www.universalsubtitles.org/ and manually create the timeframes for each sentence, if the automated way in YouTube did not work or sentences are missing.
Use GNOME Subtitles (found in the Software Center) in order to clean up the subtitles and fix any timings.

score 1 · Answer 6 · answered Mar 21 '21 at 12:21

Inside the Kdenlive video editor, go to Project > Subtitles > "Speech recognition" in the top bar. You must first download the language model from https://alphacephei.com/vosk/models; in Kdenlive, go to Settings > Configure Kdenlive > "Speech To Text" to set it up.

answered Mar 21 '21 at 12:21 by MoonDragon

score 0 · Answer 7 · answered Mar 14 '24 at 18:14

Speech Note is worth mentioning here, as it is free and can transcribe a vast number of languages. It comes as a Flatpak, so it is trivial to install. At the moment, you have to use a tool like ffmpeg to extract the audio and feed it in, but you can be the first to contribute this to the completely open-source code on GitHub... ;)

TL;DR: soon this will hopefully be an option, but we are not completely there yet.
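The audio extraction step mentioned above could look like the following sketch, assuming ffmpeg is installed. The 16 kHz mono WAV output is an assumption (a common input format for speech recognizers), not something Speech Note is documented here to require, and the helper names are made up for illustration.

```python
import subprocess


def extract_audio_command(video, audio, rate=16000):
    """Build an FFmpeg command that writes the audio track of `video` to `audio`."""
    return [
        "ffmpeg",
        "-i", video,
        "-vn",             # drop the video stream
        "-ac", "1",        # downmix to mono
        "-ar", str(rate),  # resample to `rate` Hz
        audio,
    ]


def extract_audio(video, audio, rate=16000):
    """Run the command; raises CalledProcessError if FFmpeg fails."""
    subprocess.run(extract_audio_command(video, audio, rate), check=True)
```

For example, extract_audio("video.webm", "audio.wav") produces a WAV file that can then be opened in the transcription tool.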

Ciro Santilli OurBigBook.com · Answer 8 · 2024-11-03T09:19:04.520

OpenAI Whisper (fully offline and MIT licensed)

This software was previously mentioned at: https://askubuntu.com/a/1378514/52975 but I wanted to provide a minimal runnable example.

Tested on Ubuntu 24.04:

sudo apt install ffmpeg
pipx install openai-whisper==20231117

Sample usage with this video: https://commons.wikimedia.org/wiki/File:Goldstone_Apple_Valley_Radio_Telescope_(GAVRT)_Solar_Patrol_(SVS14530).webm

wget -O gavrt.webm https://upload.wikimedia.org/wikipedia/commons/4/45/Goldstone_Apple_Valley_Radio_Telescope_%28GAVRT%29_Solar_Patrol_%28SVS14530%29.webm
time whisper gavrt.webm

The video is 1:16 long and features a lady speaking in perfect American English about a technical subject. There is light background music throughout.

The terminal now contains:

/home/ciro/.local/pipx/venvs/openai-whisper/lib/python3.12/site-packages/whisper/transcribe.py:115: UserWarning: FP16 is not supported on CPU; using FP32 instead                             
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")                                                                                                                           
Detecting language using up to the first 30 seconds. Use `--language` to specify the language                                                                                                 
Detected language: English
[... transcript ...]
real    0m32.569s
user    3m40.130s
sys     0m8.885s

and the current working directory now contains, among other files, gavrt.srt:

1
00:00:00,000 --> 00:00:05,080
The Gavart Solar Patrol program is a heliophysics program aimed at citizen
2
00:00:05,080 --> 00:00:09,120
scientists and K through 12 students both locally, nationally and throughout the
3
00:00:09,120 --> 00:00:14,480
world. The goal of Gavart Solar Patrol is to monitor active regions on the Sun in
4
00:00:14,480 --> 00:00:18,120
order to understand how they're connected to explosive events that we
5
00:00:18,120 --> 00:00:23,100
categorize under space weather. Participants can remote in and actually
6
00:00:23,100 --> 00:00:26,480
control the telescope themselves. So a common observing mode with the Gavart
7
00:00:26,480 --> 00:00:30,400
Solar Patrol is that we'll have classrooms actually operate the
8
00:00:30,400 --> 00:00:34,760
telescope themselves, collect some data and then generate maps of what the Sun
9
00:00:34,760 --> 00:00:39,560
looks like at radio frequencies. They're gaining a really unique experience that I
10
00:00:39,560 --> 00:00:43,960
think is really special to the Gavart program and that's the ability to walk
11
00:00:43,960 --> 00:00:48,280
through the scientific process from the very beginning, from the steps of
12
00:00:48,280 --> 00:00:52,120
collecting the data themselves, all the way to reducing that data and
13
00:00:52,120 --> 00:00:56,880
interpreting scientific results from their studies. I get excited anytime I
14
00:00:56,880 --> 00:01:00,040
get to operate a radio telescope and so I really enjoy it when other people get
15
00:01:00,040 --> 00:01:04,800
to have that same opportunity and that same learning process.

Amazing! The transcription was perfect or almost perfect! And the installation/usage seamless!
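Whisper can also be driven from Python instead of the command line. The sketch below assumes the openai-whisper package installed above. The "base" model name is an arbitrary choice for illustration, and the helper names are made up. The transcription result contains a list of segments with start and end times in seconds, which is enough to write an SRT file by hand.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(segments):
    """Turn dicts with 'start', 'end' and 'text' keys into SRT text."""
    cues = []
    for n, seg in enumerate(segments, start=1):
        timing = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        cues.append(f"{n}\n{timing}\n{seg['text'].strip()}\n")
    return "\n".join(cues)  # cues are separated by a blank line


def transcribe_to_srt(path, model_name="base"):
    """Transcribe `path` with Whisper and return the subtitles as SRT text."""
    import whisper  # imported lazily; the helpers above work without it

    model = whisper.load_model(model_name)
    result = model.transcribe(path)
    return segments_to_srt(result["segments"])
```

Calling transcribe_to_srt("gavrt.webm") should give subtitles comparable to the file above, although the exact text and timings depend on the model chosen.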

Benchmarked on a Lenovo ThinkPad P14s AMD laptop.

vosk-transcriber

This is a convenient CLI for Vosk. Tested on Ubuntu 24.04, you can install it together with the English model as:

pipx install vosk==0.3.45
mkdir -p ~/var/lib/vosk
cd ~/var/lib/vosk
wget https://alphacephei.com/vosk/models/vosk-model-en-us-0.22.zip
unzip vosk-model-en-us-0.22.zip
cd -

and then use as:

time vosk-transcriber -m ~/var/lib/vosk/vosk-model-en-us-0.22 -i gavrt.webm -o gavrt.srt -t srt

it took:

real    0m26.538s                                                                              
user    0m22.677s                                                                              
sys     0m5.617s

and gavrt.srt contains:

1
00:00:00,690 --> 00:00:03,030
the gabbert solar patrol program is a
2
00:00:03,030 --> 00:00:05,760
helium physics program aimed at citizen scientists
3
00:00:05,760 --> 00:00:07,620
and keep you told students both locally
4
00:00:07,620 --> 00:00:10,260
nationally and throughout the world the goal
5
00:00:10,260 --> 00:00:12,960
of gabbert solar control is to monitor
6
00:00:12,990 --> 00:00:14,850
active regions on the sun in order
7
00:00:14,850 --> 00:00:17,610
to understand how they're connected to explosive
8
00:00:17,610 --> 00:00:20,160
events that we categorize under space weather
9
00:00:20,610 --> 00:00:23,400
participants can remote in and actually control
10
00:00:23,400 --> 00:00:25,680
the telescope themselves so a common observing
11
00:00:25,680 --> 00:00:27,870
mode with the gabbert solo patrol is
12
00:00:27,870 --> 00:00:30,420
that we'll have classrooms actually operate the
13
00:00:30,420 --> 00:00:33,330
telescope themselves collect some data and then
14
00:00:33,360 --> 00:00:35,010
generate maps of what the sun looks
15
00:00:35,010 --> 00:00:37,920
like at radio frequencies they're gaining a
16
00:00:37,920 --> 00:00:39,900
really unique experience that i think is
17
00:00:39,990 --> 00:00:40,320
is real
18
00:00:40,320 --> 00:00:42,390
early special to the gabbert program and
19
00:00:42,390 --> 00:00:44,370
that's the ability to walk through the
20
00:00:44,370 --> 00:00:47,430
scientific process from the very beginning from
21
00:00:47,430 --> 00:00:49,830
the steps of collecting the data themselves
22
00:00:50,160 --> 00:00:51,990
all the way to producing that data
23
00:00:52,080 --> 00:00:55,380
and interpreting scientific results from their studies
24
00:00:55,770 --> 00:00:57,120
i get excited anytime i get to
25
00:00:57,120 --> 00:00:58,890
operate a radio telescope and i really
26
00:00:58,890 --> 00:01:00,240
enjoy it when other people get to
27
00:01:00,240 --> 00:01:00,450
have
28
00:01:00,570 --> 00:01:03,060
seeing opportunity and that same learning process
29
00:01:04,080 --> 00:01:08,010
and

So it is clearly worse than Whisper.
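To put a number on such a comparison, one common measure is the word error rate: the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below is a plain implementation for illustration. The function name is made up, and it does no punctuation stripping, so text should be normalized first for a fair comparison.

```python
def word_error_rate(reference, hypothesis):
    """Return word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Levenshtein distance over words, keeping only one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # word missing from hypothesis
                           cur[j - 1] + 1,       # extra word in hypothesis
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)
```

For instance, word_error_rate("the goal of solar patrol", "the goal of solar control") counts one substitution in five words.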

Benchmarked on a Lenovo ThinkPad P14s AMD laptop.

score 0 · Answer 9 · answered Sep 24 '12 at 18:48

OK, I found a tool that looks nice and is similar to Subtitle Workshop: Subtitle Editor (apt-get install subtitleeditor).

Compared with GNOME Subtitles, Subtitle Editor looks like the more advanced tool.

answered Sep 24 '12 at 18:48 by idgar

score 0 · Answer 10 · edited Dec 01 '21 at 01:34

For KDE, a good subtitle editor is subtitlecomposer. Install it with the command:

sudo apt-get install subtitlecomposer

edited Dec 01 '21 at 01:34 by karel · answered Sep 24 '12 at 19:17 by Anwar
