All tags in XML or HTML files
Hi there
A nifty way of finding which tags are used in a file (especially useful for XML files, where the tags can be anything) is to use "grep" to extract the tags, sort them (with "sort", what else?) and remove duplicates with "uniq":
Code:
pabloa:~$ grep -ohe "<[^/][^> ]*[ |>]" *.xml |sort|uniq
<city>
<country>
<description
<language>
<metadata
<title>
<topic>
<value>
<?xml
<year>
It's so useful that I'm going to make an alias for it. Here we are using a few very nice features of the "grep" command:
- -o: output only the matching bit instead of the whole line
- -h: don't output the file name where the pattern was found
- -e: the argument that follows is the pattern to search for (this is why it has to be the last flag of the three when they are bundled as "-ohe": whatever comes right after "-e" is taken as the pattern)
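To make the flag ordering concrete, here is a small sketch that runs the same search with the flags arranged three ways. The sample file and the variable names are made up for the example, and it assumes a grep that supports -o and -h (GNU and BSD grep both do):

```shell
# Hypothetical sample data, only here so the commands have something to match.
tmpdir=$(mktemp -d)
printf '%s\n' '<city>Oslo</city>' '<description lang="en">x</description>' > "$tmpdir/sample.xml"

pattern='<[^/][^> ]*[ |>]'

# Bundled: -e must come last, because the next argument is read as its pattern.
bundled=$(grep -ohe "$pattern" "$tmpdir"/*.xml | sort | uniq)

# -e given on its own: now -o and -h can appear in any order before it.
separate=$(grep -h -o -e "$pattern" "$tmpdir"/*.xml | sort | uniq)

# With a single pattern, -e can be left out: the first non-option argument is the pattern.
implicit=$(grep -oh "$pattern" "$tmpdir"/*.xml | sort | uniq)

printf '%s\n' "$bundled"
```

All three variables should end up holding the same two tags. Writing "-eoh" instead would make grep treat "oh" as the pattern and the real pattern as a file name.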
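The pattern used ("<[^/][^> ]*[ |>]") can be explained in words like this: "a '<', followed by one character other than '/' (so we avoid closing tags), followed by any run of characters that are neither a space nor a '>', up to (and including) a space or a '>'". Note that inside square brackets the '|' is a literal character, not an "or", so "[ |>]" also accepts a '|'; "[ >]" would be enough. This is also why tags with attributes, such as "<description", appear in the output without the closing '>': the match ends at the space before the first attribute.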
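As one possible improvement, here is a sketch of the one-liner wrapped in a shell function. A function is used rather than an alias because the file names have to go before the pipe. The name "xmltags" is made up, and the pattern is a variant of the one above, not the original: it leaves the trailing delimiter out of the match, so a tag is listed once whether or not it carries attributes, and it also skips comments and the "<?xml" declaration.

```shell
# xmltags: hypothetical helper, not part of the original post.
# Prints the distinct opening-tag names found in the files given as arguments.
xmltags() {
    # '<'            start of a tag
    # [^/!?]         first character is not '/', '!' or '?', which skips
    #                closing tags, comments/DOCTYPE and processing instructions
    # [^>/[:space:]]* the rest of the name, up to (not including) the next
    #                '>', '/' or whitespace
    # sort -u sorts and removes duplicates in one step (same as sort | uniq).
    grep -oh '<[^/!?][^>/[:space:]]*' "$@" | sort -u
}

# Usage: xmltags *.xml
```

Dropping "[^/!?]" back to "[^/]" brings the "<?xml" line back into the output if you want it there.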
Improve at your leisure and enjoy at your pleasure!
Cheers.
P.
Re: All tags in XML or HTML files
Pabloa,
Very neat script indeed! Thank you for sharing! XML massaging is unavoidable these days when doing software localization, in order to make CAT tools properly digest the wide variety of structures that XML files present. I will try to share my scripts too.
Best wishes,
James