
He sed, she sed

Title:
Date: 2021-03-18
Tags:  

Here I'm going to take a stab at learning sed. The man page isn't super helpful so I'm going to throw things at the wall and see if anything sticks.

Sed works by running an expression against each line of its input, which can be a file or whatever is piped in on standard input.
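
As a minimal sketch of that line-by-line behaviour, here is a substitution run over three made-up lines. The input text and the function name per_line_demo are hypothetical, chosen only for illustration.

```shell
# Hypothetical three-line input; sed applies the expression to each line in turn.
per_line_demo() {
  printf 'one cat\ntwo dogs\nthree cats\n' | sed -e 's/cat/bird/'
}

per_line_demo
```

The line without a match ("two dogs") still comes out the other side, just unchanged.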

The goal is to get the story links from Hacker News.

https://news.ycombinator.com/news

Express yourself

$ curl -s https://news.ycombinator.com/news | sed -e 's/class="storylink"/&/'

The -e flag is short for --expression and introduces one complete sed expression. We can run more expressions against each line by repeating the -e option.

The expression itself is a substitution: if the line contains class="storylink", replace that text with &. In a replacement, & is a special character that stands for the matched text itself, so this command changes nothing. If we leave the replacement empty, the matched text is deleted instead.
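
To see & and a repeated -e doing something visible, here is a small sketch against a single hard-coded line. The sample markup and the names sample, amp_demo and chain_demo are assumptions for illustration, not the real page.

```shell
# Hypothetical sample line standing in for one line of the page.
sample='<a href="https://example.com" class="storylink">Example</a>'

# & re-inserts the matched text, so here the match gets wrapped in brackets.
amp_demo() {
  printf '%s\n' "$sample" | sed -e 's/class="storylink"/[&]/'
}

# An empty replacement deletes the match; the second -e then runs on the same line.
chain_demo() {
  printf '%s\n' "$sample" | sed -e 's/ class="storylink"//' -e 's/Example/Sample/'
}

amp_demo
chain_demo
```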

Quiet down

By default sed prints every line it processes, whether or not anything matched. To get rid of the lines we don't care about we can use -n, which is the --quiet/--silent option, and then add the p flag to the end of the substitution to print only the lines where a replacement was made.

$ curl -s https://news.ycombinator.com/news | sed --quiet -e 's/class="storylink"/&/p'

Now we should get only the lines that contain class="storylink". Each match is replaced with itself, so the printed lines come out unchanged.
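
A quick way to check the filtering without hitting the network is to feed sed a few hand-written lines. The three lines below are hypothetical markup, and quiet_demo is just a name picked for this sketch.

```shell
# Three hypothetical lines; only the middle one contains class="storylink".
quiet_demo() {
  printf '%s\n' \
    '<td class="title">' \
    '<a href="https://example.com" class="storylink">Example</a>' \
    '</td>' |
    sed -n -e 's/class="storylink"/&/p'
}

quiet_demo
```

Only the middle line is printed, because -n suppresses the default output and p fires only where the substitution succeeded.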

You've captured my heart

$ curl --silent https://news.ycombinator.com/news | sed --quiet --regexp-extended -e 's/.*href="(.*)" class="storylink".*/\1/p'

The first thing to note is that we keep the quiet option in sed, since we want to control what gets printed.

The next flag we set is the regexp extended flag. This lets us use parentheses for capture groups without having to escape them.

We have the expression flag to tell sed we want to run the following expression for each line.

Now our expression has become a bit more complex. We no longer match just a fragment of text; we match the entire line around the href and class attributes. This is why there is a .* on either side of the text we are matching: because the whole line matches, the whole line gets replaced.

Then within the pattern we use parentheses to mark what we want to capture. The first capture is then available in the replacement as \1, so we replace the entire matching line with the capture group. Finally we use the p flag to print the line.
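
Here is the same capture run against one hard-coded line, as a sketch. The markup is a hypothetical stand-in for a story row, and -E is used as the short spelling of --regexp-extended (GNU sed accepts it alongside -r).

```shell
# Hypothetical story line; the parentheses capture the URL and \1 puts it back.
capture_demo() {
  printf '%s\n' '<td><a href="https://example.com/story" class="storylink">Example story</a></td>' |
    sed -n -E -e 's/.*href="(.*)" class="storylink".*/\1/p'
}

capture_demo
```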

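The only issue with sed is that its regular expressions have no non-greedy quantifier, so .* always grabs as much as it can instead of stopping at the first closing quote. In this case, if we remove the storylink class check and match on the href alone, the capture runs on to the last quote on the line and we get all sorts of strange data.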
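
A common workaround, which is not part of the approach above, is to swap .* for a negated character class such as [^"]* so the capture cannot run past the closing quote. The sketch below compares the two on a hypothetical line; line, greedy_demo and bounded_demo are illustrative names.

```shell
# Hypothetical line with a second quoted attribute after the href.
line='<a href="https://example.com/a" class="storylink">A</a>'

# Greedy: (.*) runs on to the last double quote on the line.
greedy_demo() {
  printf '%s\n' "$line" | sed -n -E -e 's/.*href="(.*)".*/\1/p'
}

# Bounded: [^"]* cannot cross a double quote, so it stops at the end of the URL.
bounded_demo() {
  printf '%s\n' "$line" | sed -n -E -e 's/.*href="([^"]*)".*/\1/p'
}

greedy_demo
bounded_demo
```

The greedy version drags the class attribute into the capture, while the bounded version returns just the URL.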

You've captured my hearts

$ curl --silent https://news.ycombinator.com/news | sed -rn -e 's/.*href="(.*)" class="storylink">(.*)<\/a><s.*/\2\n\1\n/p'

Here we are using two capture groups to get the link (\1) and the title (\2) of the story, and then formatting the output through the replacement, which prints the title followed by the link.
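
As a sketch of the two-group version on a single line: the markup below is hypothetical, and the example assumes GNU sed, which treats \n in the replacement as a newline (other sed implementations may not).

```shell
# Hypothetical story line followed by a <span>, which the trailing "<s" in the pattern matches.
two_groups_demo() {
  printf '%s\n' '<a href="https://example.com/story" class="storylink">Example story</a><span class="sitebit comhead"> (example.com)</span>' |
    sed -n -E -e 's/.*href="(.*)" class="storylink">(.*)<\/a><s.*/\2\n\1\n/p'
}

two_groups_demo
```

This prints the title on one line, the link on the next, and then a blank line that separates one story from the next.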