Video Summarization Using Subtitles: A Simplified Approach

Introduction

I’ve recently been working a lot with video data. In fact, I shared some learnings on a few deep learning-based video summarization techniques in a recent article published on Heartbeat.

While searching for research papers on video summarization, I came across one that creates these summaries in the simplest way possible: by using the subtitles of a given video. You can check out the paper here.

First, the scripts are extracted from the subtitle files of the videos. Then, the extracted scripts are partitioned into segments. Finally, the partitioned script segments are converted into a TF-IDF vector-based representation, which acts as the semantic index.

This process is depicted in the flow diagram below:

Diving Deeper

Video indexing is the process of giving viewers a way to access and navigate content easily, similar to book indexing. Semantic video indexing goes a step further and builds that index from the meaning of the content, which is what makes meaningful video summaries possible.

The methods for video summarization that we saw in my previous article are based on low-level visual or motion information, such as color or motion activity. However, when humans deal with multimedia data, they prefer to describe, query, and browse the content in terms of keywords rather than low-level features. Thus, extracting semantic information from digital multimedia is a very important, although challenging, task.

The most popular method for extracting semantic information is to combine human annotation with machine learning. But such methods are semi-automatic and complex because the initial training set needs to be labeled by humans, and the learned classifiers may also need to be tuned for different videos.

Instead of discussing complex video processing algorithms in this article, we’ll look at a new approach to building the summaries for video content by analyzing the subtitle file.

Step 1: Conversion of .srt file to .txt file:

def srt_to_txt(srt_file):
    """ Extract text from subtitles file
    Args:
        srt_file: Parsed subtitle file (e.g. the object returned by pysrt.open)
    Returns:
        str: extracted text from subtitles file
    """
    text = ''
    for index, item in enumerate(srt_file):
        # skip non-speech cues such as "[MUSIC]"
        if item.text.startswith("["):
            continue
        text += "(%d) " % index
        # drop line breaks and sentence punctuation, then end with one period
        text += item.text.replace("\n", "").strip("...").replace(
                                     ".", "").replace("?", "").replace("!", "")
        text += ". "
    return text

The subtitle file is parsed into script elements, where each script element has the following three attributes: ‘Start Time’, ‘End Time’, and ‘Text’. We use the information in the script elements to partition them in the next step.
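
To make the idea of a script element concrete, here's a minimal sketch that parses raw SRT text into elements with the three attributes described above. It isn't the code from the repository (which relies on pysrt); the names ScriptElement and parse_srt are my own illustrative choices, and the parser assumes well-formed SRT timing lines with three-digit milliseconds.

```python
import re
from dataclasses import dataclass


@dataclass
class ScriptElement:
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # subtitle text with line breaks collapsed into spaces


# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,500"
_TIMING = re.compile(
    r"(\d+):(\d+):(\d+)[,.](\d+)\s*-->\s*(\d+):(\d+):(\d+)[,.](\d+)")


def _to_seconds(hours, minutes, seconds, millis):
    # Assumes a three-digit millisecond field, as in standard SRT files
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000.0


def parse_srt(srt_text):
    """Parse raw SRT text into a list of ScriptElement objects."""
    elements = []
    # Subtitle blocks are separated by one or more blank lines
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            match = _TIMING.search(line)
            if match:
                groups = match.groups()
                # Everything after the timing line is the subtitle text
                text = " ".join(part.strip() for part in lines[i + 1:])
                elements.append(ScriptElement(
                    _to_seconds(*groups[:4]), _to_seconds(*groups[4:]), text))
                break
    return elements
```

Calling parse_srt on the contents of a subtitle file returns one ScriptElement per subtitle block, in file order, which is the shape of data the partitioning step works on.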

Step 2: Script Partition:

import chardet
import pysrt


def find_summary_regions(srt_filename, duration=30, language="english"):
    """ Find important sections
    Args:
        srt_filename(str): Name of the SRT FILE
        duration(int): Target duration of the summary (in seconds)
        language(str): Language of subtitles (default to English)
    Returns:
        list: segment of subtitles as "summary"
    """
    # detect the file encoding first so the subtitles are decoded correctly
    with open(srt_filename, "rb") as raw_file:
        enc = chardet.detect(raw_file.read())['encoding']
    srt_file = pysrt.open(srt_filename, encoding=enc)

    # generate average subtitle duration
    subtitle_duration = time_regions(
        map(srt_segment_to_range, srt_file)) / len(srt_file)
    # compute number of sentences in the summary file
    n_sentences = int(duration / subtitle_duration)
    summary = summarize(srt_file, n_sentences, language)
    total_time = time_regions(summary)
    too_short = total_time < duration
    if too_short:
        while total_time < duration:
            n_sentences += 1
            summary = summarize(srt_file, n_sentences, language)
            total_time = time_regions(summary)
    else:
        while total_time > duration:
            n_sentences -= 1
            summary = summarize(srt_file, n_sentences, language)
            total_time = time_regions(summary)
    return summary
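
The function above calls two helpers, srt_segment_to_range and time_regions, that aren't shown in this article. Here's a minimal sketch of what they could look like. It's my assumption, based on how they're used: each subtitle item becomes a (start, end) pair in seconds, and time_regions adds up the lengths of those pairs. The sketch also assumes the time objects expose hours, minutes, seconds, and milliseconds attributes, as pysrt's time objects do.

```python
from itertools import starmap


def srt_segment_to_range(item):
    """Convert a subtitle item into a (start, end) pair in seconds.

    Assumes item.start and item.end expose hours, minutes, seconds and
    milliseconds attributes.
    """
    def to_seconds(t):
        return t.hours * 3600 + t.minutes * 60 + t.seconds + t.milliseconds / 1000.0

    return to_seconds(item.start), to_seconds(item.end)


def time_regions(regions):
    """Return the total duration, in seconds, of an iterable of (start, end) pairs."""
    return sum(starmap(lambda start, end: end - start, regions))
```

With helpers like these, dividing time_regions(...) by the number of subtitles gives the average time a single subtitle stays on screen, which is how the first estimate of n_sentences is obtained.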

The script element gap is the time between the end of one script element and the start of the next. In a video, when there’s a dialogue or a long narration that extends over several frames, this gap is very small. At the same time, script elements that make up an extended narration also tend to have a high “semantic correlation” among themselves. Hence, the script element gap is a useful parameter for grouping semantically relevant script elements together, thereby creating a partition of the scripts.
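
As a rough illustration of gap-based partitioning, here's a minimal sketch. The function name partition_by_gap and the 2-second threshold are my own assumptions, not values taken from the paper or the repository. Each script element is represented as a (start, end, text) tuple with times in seconds.

```python
def partition_by_gap(elements, max_gap=2.0):
    """Group (start, end, text) script elements into segments.

    A new segment starts whenever the gap between the end of the previous
    element and the start of the current one exceeds max_gap seconds.
    max_gap=2.0 is an illustrative value only.
    """
    segments = []
    for element in elements:
        start = element[0]
        # segments[-1][-1][1] is the end time of the last element seen so far
        if segments and start - segments[-1][-1][1] <= max_gap:
            segments[-1].append(element)
        else:
            segments.append([element])
    return segments
```

Elements separated by short pauses end up in the same segment, while a long pause starts a new one. A smaller threshold produces more, shorter segments.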

Step 3: Script Vector Representation

After partitioning the scripts into segments, we build an index for each script segment. We adopt the term frequency-inverse document frequency (TF-IDF) vector space model, which is widely used for information retrieval, as the semantic index for the segments.

The first step involves the removal of stop words, e.g. “about”, “I”, etc. The Porter stemming algorithm is then used to obtain the stem of each word. For example, the Porter stem of the word “families” is “famili”, which is also the stem of “family”, so both forms map to the same dictionary entry. The stems are collected into a dictionary, which is then used to construct the script vector for each segment.
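
Here's a minimal sketch of how a script vector could be built for each segment. It uses only the standard library, a tiny illustrative stop-word list, and the plain tf × log(N/df) weighting, which is one common TF-IDF variant and not necessarily the exact one used in the paper. Stemming is left as a pluggable stem function that defaults to doing nothing; a Porter stemmer could be passed in its place. All names here are my own.

```python
import math
import re
from collections import Counter

# A tiny illustrative stop-word list; real lists are much longer
STOP_WORDS = {"a", "about", "an", "and", "i", "is", "of", "the", "to"}


def tokenize(text, stem=lambda word: word):
    """Lowercase the text, split it into words, drop stop words, then stem.

    stem defaults to the identity function (no stemming).
    """
    words = re.findall(r"[a-z']+", text.lower())
    return [stem(word) for word in words if word not in STOP_WORDS]


def tfidf_vectors(segments, stem=lambda word: word):
    """Return (dictionary, vectors) for a list of segment strings."""
    docs = [Counter(tokenize(segment, stem)) for segment in segments]
    # The dictionary is the sorted set of all terms across all segments
    dictionary = sorted(set().union(*docs))
    n_docs = len(docs)
    # Document frequency: the number of segments that contain each term
    df = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values()) or 1  # avoid dividing by zero for empty segments
        vectors.append([
            (doc[term] / total) * math.log(n_docs / df[term])
            for term in dictionary
        ])
    return dictionary, vectors
```

Each vector has one entry per dictionary term. Terms that appear in every segment get a weight of zero, while terms concentrated in a single segment get the highest weights, which is what makes them useful as keywords for that segment.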

The Final Step: Creating the Video Summary

import os


def get_summary(filename="1.mp4", subtitles="1.srt"):
    """ Create a summary video from a video file and its subtitles
    Args:
        filename(str): Name of the Video file (defaults to "1.mp4")
        subtitles(str): Name of the subtitle file (defaults to "1.srt")
    Returns:
        bool: True once the summary video has been written
    """
    # pick the subtitle regions that make up a 60-second summary
    regions = find_summary_regions(subtitles, 60, "english")
    summary = create_summary(filename, regions)
    # write the result next to the input, e.g. "1.mp4" -> "1_1.mp4"
    base, _ = os.path.splitext(filename)
    output = "{0}_1.mp4".format(base)
    summary.to_videofile(
        output,
        codec="libx264",
        temp_audiofile="temp.m4a", remove_temp=True, audio_codec="aac")
    return True

After the script extraction and script partition processes, the video summary is created, taking into account the most important keywords in the subtitle file.
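
To give a feel for how keyword importance can drive the final selection, here's a hypothetical sketch. It is not the repository's summarize function. It assumes each candidate region already has a score (for instance, the sum of its TF-IDF weights) and greedily keeps the highest-scoring regions that still fit within the target duration. The name select_regions is my own.

```python
def select_regions(regions, scores, duration):
    """Pick the highest-scoring (start, end) regions within a time budget.

    regions and scores are parallel lists; duration is the target summary
    length in seconds. The selected regions are returned in chronological
    order so the summary plays in the same order as the original video.
    """
    # Visit region indices from highest to lowest score
    ranked = sorted(range(len(regions)), key=lambda i: scores[i], reverse=True)
    chosen, total = [], 0.0
    for i in ranked:
        start, end = regions[i]
        # Keep the region only if it still fits in the remaining budget
        if total + (end - start) <= duration:
            chosen.append(regions[i])
            total += end - start
    return sorted(chosen)
```

The greedy choice is simple rather than optimal: a long, high-scoring region can be skipped if it doesn't fit, while shorter, lower-scoring ones are kept.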

Let’s Implement the Code

First, let’s create a virtual environment where all the dependencies will be installed. Open the terminal and type in:

To activate the environment, type:

Here is the link to my GitHub repository, where I have posted the code for creating a video summary:

You can clone the repository and unzip it in the folder where you created the virtual environment. I created my virtual environment in the documents folder and named my unzipped folder vidsummary1.

For the code to work, you’ll need to have the following packages installed:

If you don’t have these packages installed, then you can do so by running this command:

Usage

To generate a summary of a video file sample.mp4 with the subtitle file subtitle.srt:

To summarize a YouTube video from its URL:

If you want to retain the subtitles of the downloaded YouTube video, you can use this code snippet to do so:

Experimental Result

Our test data consists of a single episode from the popular TV sitcom Young Sheldon (season 3 episode 20). The following video is generated after performing summarization on it:

Conclusion

In this article, we covered the techniques and processes to create a video summary using a video’s subtitles:

Script extraction
Script partition
Script vector representation

We’ve also worked through a simple Python implementation on our local system, for a video with subtitles that we already have, and also by downloading a YouTube video through its URL.