Help with sed script to remove Wikipedia citation numbers

Question

I am just beginning to learn sed and awk. I have to submit a homework assignment tomorrow, which is a copy-paste from Wikipedia. Just the opportunity to practice some sed scripting!

So I have the document in HTML format. Now I need to replace [<number>] with nothing. How would I do this?

This is what I tried, but I think it does not even match the pattern I want:

cat content.xml | sed 's/\[\d+\]/ /g' > content2.xml

As a next stage, I want to replace hyperlinks like the following, but even the simple pattern above is not being matched:

<a href="https://en.wikipedia.org/wiki/Immune_system">immune system</a>

and then remove the citations:

<a name="cite_ref-Gleeson2007_27-0"/><a href="https://en.wikipedia.org/wiki/Physical_exercise#cite_note-Gleeson2007-27">[27]</a>

score 1 · Accepted Answer · edited May 23 '17 at 12:39

You went in the wrong direction; you should learn XML/XSLT (XSL Transformations, the XML stylesheet language) instead :). It can be used with either ODT or XHTML. For ODT, a macro may be better, but I am not familiar with that approach.

Take a look at the accepted answer to: RegEx match open tags except XHTML self-contained tags

The solution in the answer to "How to replace all images in LibreOffice with their description" should work for you too, with a little modification.
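
If you still want to stay with sed for the simple case, the attempt in the question most likely fails for two reasons: sed does not treat \d as a digit class, and in its default basic-regex mode a bare + is a literal plus sign. Below is a minimal sketch, not a robust XML solution. It assumes a sed that accepts -E (GNU and BSD sed both do) and that each citation sits on a single line. The function name strip_citations is hypothetical, and the patterns are modelled only on the two snippets shown in the question.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): reads text on stdin and
# writes it to stdout with Wikipedia-style citation markers removed.
#
# -E selects extended regular expressions, so + means "one or more".
# [0-9] is used instead of \d, which sed does not treat as a digit class.
#
# Expression 1 drops the empty anchor:   <a name="cite_ref-..."/>
# Expression 2 drops the citation link:  <a href="...#cite_note-...">[27]</a>
# Expression 3 drops any bare marker that is left, such as [27]
strip_citations() {
    sed -E \
        -e 's/<a name="cite_ref-[^"]*"\/>//g' \
        -e 's/<a href="[^"]*#cite_note-[^"]*">\[[0-9]+\]<\/a>//g' \
        -e 's/\[[0-9]+\]//g'
}

# Example usage with the file names from the question (left commented out
# so that sourcing this file does not touch any files):
# strip_citations < content.xml > content2.xml
```

Note that the replacement is empty rather than a space, so nothing is left where the marker used to be. This kind of regex is fragile against markup that spans lines or uses different attribute order, which is why an XML-aware tool remains the safer route.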
