This answer is based on a literal reading of the question. Anyone who comes across this when searching for how to view an HTML file in a convenient, human-readable way in a terminal should instead see How can I preview HTML documents from the command line? That is not what the methods detailed in this answer do.
Sometimes a < or > character appears in an HTML file even when it is not meant to designate the beginning or end of a tag. If you have to deal with that sort of thing -- or, more generally, if you need your solution to be robust and work with arbitrary HTML documents -- then you should use a utility that actually parses the HTML.
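For comparison, here is a minimal sketch of what the parser-based approach looks like, using only the standard library's html.parser module (the class name TextExtractor is my own invention, not part of the library):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text content of a document, ignoring tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called with runs of text between tags; character references
        # like &lt; are already decoded (convert_charrefs is the default).
        self.parts.append(data)

    def text(self):
        return ''.join(self.parts)

extractor = TextExtractor()
extractor.feed('<p>Tags like <em>these</em> are dropped.</p>')
print(extractor.text())  # -> Tags like these are dropped.
```

Unlike the text-processing tricks below, this tolerates things like a literal &lt; in the text, because the parser applies HTML's actual tokenization rules rather than a rough approximation.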
However, if you're just generating output for your own convenience and will notice if something goes wrong (and nothing terrible would happen if you didn't notice), then you can do what you're asking with any of several text processing techniques.
The most common ways to process text using Unix system utilities treat input as a sequence of lines. Since line breaks don't have special significance in HTML, I have avoided this approach, and the methods given in this answer will work even for tags that are split across lines. However, I emphasize that these are still approximate solutions.
Extracting Text Between <html> Tags
This Python 3 one-liner (run it from your shell) prints all the text in index.html that appears after the first occurrence of <html> but before the first occurrence of </html>:
python3 -c 'import pathlib; s=pathlib.Path("index.html").read_text(); e="<html>"; print(s[s.find(e)+len(e):s.find("</html>")])'
If you like, you can ungolf and enhance that into a reusable script:
#!/usr/bin/env python3
from sys import argv
from pathlib import Path
start = '<html>'
end = '</html>'
for path in argv[1:]:
    text = Path(path).read_text()
    print(text[text.find(start) + len(start) : text.find(end)])
If you saved the script as print-inside-html, you'd mark it executable like this:
chmod +x print-inside-html
And you'd run it on index.html like this:
./print-inside-html index.html
You can run it on multiple files at once, if you like:
./print-inside-html index.html foo.html coolstuff/index.html
However, you may notice that any leading and trailing whitespace between the start and end tags gets printed too. If you don't want that, you can use the strip method to remove it. Here's a modified one-liner:
python3 -c 'import pathlib; s=pathlib.Path("index.html").read_text(); e="<html>"; print(s[s.find(e)+len(e):s.find("</html>")].strip())'
And, ungolfed:
#!/usr/bin/env python3
from sys import argv
from pathlib import Path
start = '<html>'
end = '</html>'
for path in argv[1:]:
    text = Path(path).read_text()
    print(text[text.find(start) + len(start) : text.find(end)].strip())
However, neither of the above ways accommodates case-variant tag names (e.g., HTML instead of html) or whitespace inside the tag after the name. This further-modified one-liner uses regular expressions to accommodate both:
python3 -c 'import re,pathlib; s=pathlib.Path("index.html").read_text(); print(s[re.search(r"(?i)<html\s*>",s).end():re.search(r"(?i)</html\s*>",s).start()].strip())'
Ungolfed:
#!/usr/bin/env python3
import re
from sys import argv
from pathlib import Path
start = re.compile(r'(?i)<html\s*>')
end = re.compile(r'(?i)</html\s*>')
for path in argv[1:]:
    text = Path(path).read_text()
    print(text[start.search(text).end() : end.search(text).start()].strip())
(?i) makes the regular expressions case-insensitive and \s* consumes any whitespace between the tag name and the closing >. See this guide and this question for information about the features used in that code.
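To see what those two features buy you, here's a quick check (with made-up sample strings) that the start pattern matches tag-name case variants and trailing whitespace, but not unrelated tags:

```python
import re

# Same pattern as in the script above.
start = re.compile(r'(?i)<html\s*>')

# (?i) handles case variants; \s* handles whitespace before >.
for sample in ('<html>', '<HTML>', '<Html  >'):
    print(bool(start.search(sample)))   # True for all three

print(bool(start.search('<body>')))     # False: not an html tag
```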
Removing Text That Looks Like Tags
If you're willing to treat anything that starts with a < or </, followed by a non-whitespace character (that is also not /, <, or >), followed by any number of characters besides >, followed by >, as a tag, then this prints index.html with tags removed:
python3 -c 'import re,pathlib; print(re.sub(r"</?[^\s/<>][^>]*>", "", pathlib.Path("index.html").read_text()))'
This is not parsing the HTML code as such, and the actual rules for what constitutes a tag are more subtle. Obviously this will not work in any application that requires HTML always be parsed correctly. For example, do not use this in a web browser or code sanitizer! (Really, don't use it in any application program or general-purpose utility.)
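To make the limitation concrete, here's a contrived input (my own example, not from the question) where a > inside an attribute value makes the pattern end the "tag" too early and mangle the output:

```python
import re

# Same pattern as in the one-liner above.
pattern = re.compile(r'</?[^\s/<>][^>]*>')

# The > inside the title attribute ends the match prematurely, so
# part of the attribute survives and part of the text is eaten.
tricky = '<a title="a > b">link</a>'
print(pattern.sub('', tricky))   # -> ' b">link', not 'link'
```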
That's a somewhat more manageable one-liner (than the ones above for extracting text between <html> and </html> tags). But in case you want it as a well-formatted script:
#!/usr/bin/env python3
import re
from sys import argv
from pathlib import Path
pattern = re.compile(r'</?[^\s/<>][^>]*>')
for path in argv[1:]:
    text = Path(path).read_text()
    print(pattern.sub('', text))
If you put that in a file called remove-tagish-stuff then these commands mark it executable and run it on one file, then on a couple more files at once:
chmod +x remove-tagish-stuff
./remove-tagish-stuff index.html
./remove-tagish-stuff foo.html bar/baz.html
This doesn't modify the files; like the other code above, it simply outputs their contents with some parts removed.
When you run this on most HTML, including the sample HTML shown in your question, you'll see many blank lines. You'll probably want this, since most documents would be pretty unreadable with everything crunched together. However, if you want to turn repeated blank lines into just one and remove whitespace at the very beginning and end, then you could use this instead:
python3 -c 'import re,pathlib; s=re.sub(r"</?[^\s/<>][^>]*>","",pathlib.Path("index.html").read_text()); print(re.sub("\n{3,}","\n\n",s).strip())'
And here's that one, ungolfed into a script where you pass filenames as command-line arguments (as with the previous scripts):
#!/usr/bin/env python3
import re
from sys import argv
from pathlib import Path
tag = re.compile(r'</?[^\s/<>][^>]*>')
excess = re.compile('\n{3,}')
for path in argv[1:]:
    text = Path(path).read_text()
    detagged = tag.sub('', text)
    print(excess.sub('\n\n', detagged).strip())
If you are going to use any of these, I recommend using the simplest ones that do what you want. By the same token, it's possible to further "improve" and complicate the code to cover more cases -- < and > occurring in tag attributes, for example -- but I've avoided that here. If you need to do anything like accurately parsing the structure of an arbitrary HTML document, then you should not use regular expressions.
Why am I showing this at all, given that commands and scripts like those shown above should only ever be used in situations that aren't at all serious? It's for the same basic reason that I might try using grep to find a word in a folder of web pages. It's brittle and far from foolproof (grep -FR tallest . wouldn't match She's the tall<em>est</em>!), but it can sometimes be handy so long as one remembers it's limited.