Extract the content from a file between two match patterns (Extract only HTML from a file)

Question

I have a file that contains several kinds of text, and my goal is to extract only the HTML part and save it to a new file. I think this is possible with grep or awk. My file also contains lines like this:

Sender name `<test@email.com>`

I tried cat file1.html | grep -E "<[^>]*>", but it also outputs lines such as Sender name, etc. I want to extract only the content that starts at the <html> tag. So output like this is not useful to me:

References: <test@test.com>
From: test user <test@test.com>
Message-ID: <test@test.com>
In-Reply-To: <test@test.com>

pa4080 · Accepted Answer · 2021-01-22T11:07:32.747

We can achieve this with sed, the stream editor for filtering and transforming text. The short answer is given in point 5 below, but I've decided to write a detailed explanation.

0. First let's create a simple file to test our commands:

$ printf '\nTop text\nSender <example@email.com>\n\n<html>\n\tThe inner text 1\n</html>\n\nMiddle text\n\n<HTML>\n\tThe inner text 2\n</HTML>\n\nBottom text\n' | tee example.file
Top text
Sender <example@email.com>
<html>
        The inner text 1
</html>
Middle text
<HTML>
        The inner text 2
</HTML>
Bottom text

1. We can crop everything between the tags <html> and </html>, including them, in this way:

$ sed -n -e '/<html>/,/<\/html>/p' example.file

<html>
        The inner text 1
</html>

The option -e script (--expression=script) adds a script to the commands to be executed. In this case the script is '/<html>/,/<\/html>/p'. Since we have only one script, we can omit this option.
The option -n (--quiet, --silent) suppresses automatic printing of the pattern space, so along with this option we must use an additional command to tell sed what to print.
This additional command is the print command p, added to the end of the script. If sed is started without the -n option, the p command prints each line in the range a second time.
Finally, the two comma-separated patterns, /<html>/,/<\/html>/, specify a range of lines. Note that we are using \ to escape the special character /, which plays the role of delimiter here.
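
To see what -n and p do to the output, here is a small sketch. The temporary file sample.txt and the variable names are made up for illustration. The last command assumes the \%regexp% address form, which lets us pick a delimiter other than / so the slash in </html> needs no escaping.

```shell
# Hypothetical sample file in a temporary directory
tmp=$(mktemp -d)
printf 'Top\n<html>\n\tinner\n</html>\nBottom\n' > "$tmp/sample.txt"

# With -n: only the lines in the range are printed
with_n=$(sed -n '/<html>/,/<\/html>/p' "$tmp/sample.txt")

# Without -n: every line is auto-printed, and p prints the range lines again
without_n=$(sed '/<html>/,/<\/html>/p' "$tmp/sample.txt")

# Same range with % as the regexp delimiter, so / is an ordinary character
custom=$(sed -n '\%<html>%,\%</html>%p' "$tmp/sample.txt")

printf '%s\n' "$with_n"
rm -rf "$tmp"
```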

2. If we want to crop everything between the tags <html> and </html>, without printing them, we should add some additional commands:

$ sed -n '/<html>/,/<\/html>/{ /html>/d; p }' example.file
    The inner text 1

The curly braces, { and }, are used to group the commands.
The command d deletes each line that matches the expression html>, which here means the lines holding the opening and the closing tag.
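
Note that /html>/d removes every line in the range that contains html>, not only the two tag lines. If that is a concern, one possible refinement (a sketch, not part of the commands above) is to delete only the lines that consist of the tag alone. The sample content and the names loose and strict below are hypothetical:

```shell
# Hypothetical sample: one inner line happens to contain the text "html>"
tmp=$(mktemp -d)
printf 'Top\n<html>\n\tinner 1\n\tsee <xhtml> note\n</html>\nBottom\n' > "$tmp/sample.txt"

# Loose pattern: also drops the inner line that mentions <xhtml>
loose=$(sed -n '/<html>/,/<\/html>/{ /html>/d; p; }' "$tmp/sample.txt")

# Anchored pattern: drops only lines that are exactly <html> or </html>
# \/\{0,1\} means "an optional slash" in basic regular expressions
strict=$(sed -n '/<html>/,/<\/html>/{ /^<\/\{0,1\}html>$/d; p; }' "$tmp/sample.txt")

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
rm -rf "$tmp"
```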

3. But our example.file also has upper-case <HTML> tags, so we should make the pattern matching case-insensitive. We can do that by adding the flag I to the regular expressions:

$ sed -n '/<html>/I,/<\/html>/I{ /html>/Id; p }' example.file
    The inner text 1
    The inner text 2

The I modifier to regular-expression matching is a GNU extension which causes the REGEXP to be matched in a case-insensitive manner.
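
Because I is a GNU extension, a sed without it needs another way to ignore case. One common workaround, sketched here under the assumption that only the tag name varies in case, is to spell out both cases with bracket expressions. The file and variable names are made up for illustration:

```shell
# Hypothetical sample with lower-case and upper-case tags
tmp=$(mktemp -d)
printf '<html>\n\tinner 1\n</html>\nMiddle\n<HTML>\n\tinner 2\n</HTML>\n' > "$tmp/sample.txt"

# [Hh][Tt][Mm][Ll] matches html in any mix of upper and lower case
portable=$(sed -n '/<[Hh][Tt][Mm][Ll]>/,/<\/[Hh][Tt][Mm][Ll]>/p' "$tmp/sample.txt")

printf '%s\n' "$portable"
rm -rf "$tmp"
```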

4. If we want to remove all HTML tags between the <html> tags, we can add a command that finds and 'deletes' the strings that begin with < and end with >:

$ sed -n '/<html>/I,/<\/html>/I{ /html>/Id; s/<[^>]*>//g; p }' example.file

The command s substitutes each string that matches the expression <[^>]*> with an empty string. Its general form is s/regexp/replacement/.
The flag g applies the replacement to all matches of the regexp on a line, not just the first.

In this case we probably want to omit the delete command, since the substitution already strips the <html> and </html> tags themselves:

$ sed -n '/<html>/I,/<\/html>/I{ s/<[^>]*>//g; p }' example.file
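
The effect of the g flag is easiest to see on a single line. A minimal sketch, with a made-up input line and variable names:

```shell
# Hypothetical line of HTML
line='<p>Hello <b>world</b></p>'

# Without g: only the first tag on the line is removed
first_only=$(printf '%s\n' "$line" | sed 's/<[^>]*>//')

# With g: every tag on the line is removed
all_tags=$(printf '%s\n' "$line" | sed 's/<[^>]*>//g')

printf '%s\n%s\n' "$first_only" "$all_tags"
```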

5. To apply the changes to the file in place and create a backup copy, we can use the option -i. Alternatively, we can create a new file from sed's output by redirecting (>) it:

$ sed -n -i.bak '/<html>/I,/<\/html>/I p' example.file

$ sed -n '/<html>/I,/<\/html>/I p' example.file > new.file
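
Since the question also mentions awk, the same range extraction could be sketched with a flag variable. This assumes an awk that provides tolower(); the sample mail content, the file name mail.txt and the variable name inside are hypothetical. The pattern <html[ >] is an assumption that also accepts an opening tag with attributes:

```shell
# Hypothetical mail-like sample: headers first, then an HTML part
tmp=$(mktemp -d)
printf 'From: test user <test@test.com>\n\n<HTML lang="en">\n<body>Hi</body>\n</HTML>\nfooter\n' > "$tmp/mail.txt"

# Set the flag at the opening tag, print while it is set,
# and clear it after the line with the closing tag has been printed
html_part=$(awk '
    tolower($0) ~ /<html[ >]/ { inside = 1 }
    inside                    { print }
    tolower($0) ~ /<\/html>/  { inside = 0 }
' "$tmp/mail.txt")

printf '%s\n' "$html_part"
rm -rf "$tmp"
```

The header line with <test@test.com> is skipped because the flag is still unset when awk reads it.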
