Simple regular expressions
Hi there
People for whom words are daily bread and butter would find a close acquaintance with regular expressions most useful. (Wow, that was a literary sentence!)
It seems that "grep" got its name from the ed editor command g/re/p: "globally search for a regular expression and print the matching lines". In general, what we pass as the first parameter to this command is a regular expression. All the examples in this thread are literal strings. The first slogan for today is:
- "literal strings are the simplest regular expression, and they match themselves".
That is, the word "line" will match an "l" followed immediately by an "i", etc.
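To see the slogan in action outside grep, here is a minimal sketch using Python's re module (my choice for illustration, not something grep needs). The sample strings are made up. Like grep, re.search looks for the pattern anywhere in the string, and the match is case-sensitive.

```python
import re

# Hypothetical sample strings, invented for this illustration.
samples = ["first line", "airline pilot", "linear", "LINE", "li ne"]

# A literal pattern matches itself: "l", then "i", then "n", then "e",
# with nothing in between, anywhere in the string.
literal_hits = [s for s in samples if re.search(r"line", s)]

print(literal_hits)
```

"LINE" is skipped because the letters differ in case, and "li ne" is skipped because a space breaks the sequence.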
Normal letters, digits and spaces behave very well. But there are a number of characters with special meaning. I'll introduce the two most used, and will leave the rest for other posts.
- The dot (full stop) character ".": matches any single character (a letter, a digit, a space, a punctuation mark...)
- The asterisk (star) character "*": matches zero or more occurrences of the regular expression immediately preceding it.
Some examples:
- The expression "l.ne" matches lane, lene, line, lone, lune, and also lbne, lcne, l4ne, etc.
- The expression "line*" matches lin, line, linee, lineee, and so on: zero or more letters "e" following "lin".
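The two examples above can be checked with a small sketch in Python's re module (an assumption on my part; the same patterns work with grep). The word lists are invented. Note that I use re.fullmatch so the whole word has to fit the pattern, whereas grep is happy to find the pattern anywhere inside a line.

```python
import re

# "." stands for exactly one character of any kind, not only a letter.
dot = re.compile(r"l.ne")
dot_hits = [w for w in ["lane", "line", "l4ne", "lne", "loone"] if dot.fullmatch(w)]
print(dot_hits)

# "*" applies to the item just before it, here the letter "e":
# zero or more "e" after "lin".
star = re.compile(r"line*")
star_hits = [w for w in ["lin", "line", "lineee", "li", "lines"] if star.fullmatch(w)]
print(star_hits)
```

"lne" and "loone" fail because the dot needs exactly one character, and "lines" fails only because fullmatch insists on the whole word; grep would still find "line" inside it.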
These are toy examples, of course. But there are a few useful things that can be done with these simple rules. For example, grep -o '<.*>' index.html will display the HTML tags (as long as there is only one per line, more on this in future posts). ".*" can be read as "zero or more instances of any character".
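Here is a rough sketch of why the "only one per line" caveat matters, again in Python's re module rather than grep itself. The three lines are made-up stand-ins for the contents of index.html. The ".*" part grabs as much as it can, so it runs from the first "<" to the last ">" on the line.

```python
import re

tag = re.compile(r"<.*>")

# Hypothetical lines, standing in for a real index.html.
lines = [
    "<title>",
    "some text <br> more text",
    "<b>bold</b> and <i>italic</i>",
]

# Collect what the pattern matches on each line, roughly like grep -o.
found = [m for line in lines for m in tag.findall(line)]

for f in found:
    print(f)
```

The first two lines give a clean tag each, but on the third line the match swallows everything between the first "<" and the last ">", text included.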
Enjoy!
Cheers.
P.
Re: Simple regular expressions
Hello
Taking it one step at a time: when we want to match any one of a set of characters, we list them inside square brackets. So the notation [abc] means any one of the letters "a", "b" or "c". Within the brackets there are two useful notations: a range of characters is written with the first and last character of the range separated by a dash ("-"), and the "hat" character ("^"), placed right after the opening bracket, negates the content of the square brackets. Some examples:
- [0-9]: any digit
- [a-z]: any lowercase letter
- [A-Z]: any uppercase letter
- [a-zA-Z0-9]: any alphanumeric character
- [^0-9]: any character which is not a digit
- [^&]: any character that is not an ampersand
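A quick sketch of the bracket notations above, using Python's re module (my assumption; grep accepts the same brackets). All the sample strings are invented. Each bracket expression matches exactly one character, so findall returns the matching characters one by one.

```python
import re

digits = re.findall(r"[0-9]", "Room 42B")         # every digit
lower = re.findall(r"[a-z]", "Room 42B")          # every lowercase letter
upper = re.findall(r"[A-Z]", "Room 42B")          # every uppercase letter
alnum = re.findall(r"[a-zA-Z0-9]", "a-1 &")       # letters and digits only
no_amp = re.findall(r"[^&]", "a&b")               # everything except "&"

# Negation is handy for cleaning up: drop every character that is not a digit.
only_digits = re.sub(r"[^0-9]", "", "tel: 555-0199")

print(digits, lower, upper, alnum, no_amp, only_digits)
```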
Now it's easier to understand a regular expression used in a previous post that finds HTML entities. These entities always start with an ampersand and end with a semicolon. So this regular expression finds all occurrences of them: '&[^;]*;'. In words, it looks for a single ampersand ("&") followed by any number of characters other than a semicolon, followed by a semicolon.
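As a small sketch of the entity pattern at work, here it is in Python's re module (again my choice, the pattern itself is the one above). The sample line is made up. For comparison I also try '&.*;', which shows why the negated bracket is needed when there are several entities on one line.

```python
import re

# Hypothetical line of HTML text with four entities.
line = "Fish &amp; chips &lt;b&gt; for &euro;5"

# "&", then any number of non-semicolon characters, then ";".
entities = re.findall(r"&[^;]*;", line)
print(entities)

# With ".*" instead, the match runs on to the last ";" in the line.
greedy = re.findall(r"&.*;", line)
print(greedy)
```

One caveat: the pattern does not know what a real entity looks like, so a stray ampersand followed later by a semicolon, as in "A & B; C", would also be reported (as "& B;").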
Cheers.
P.
Re: Simple regular expressions
It's great that regular expressions work much the same way in most programming languages and search engines.