Architecture of a Video Scraper

2022-10-26 1

The problem:

My mom has various shows she likes to watch, but she usually misses them if she works late or goes out. She can watch them online, but the sites are ad-filled, and even when you pay there are so many ads that she has trouble clicking enough small x’s to actually watch her show. I’m pretty salty about that because I paid for a subscription and then, in the middle of it, they added ads anyway.

Anyway, I decided to scrape the videos and just set up a small web server on my home server. That way I could bookmark the page for her and all it will be is a list of links and videos she can watch. No extra bullshit.

The history:

I originally wrote the script as a bash script that ran on a cron. It was very simple: it would take in a list of show names, go to various streaming sites and download the video if it was available. This worked but was quite finicky. Making changes and fixing things was a pain in the ass, but the initial prototype was quick and worked well enough.

Eventually I got tired of having to deal with video sites changing names, changing urls, changing hosts, and various other things and so I wanted to create a more robust solution. The ideal solution would be to be able to have a program go and check if there were new episodes and pull them over. I shouldn’t need to worry about which html tag the source is on, or what link it should go to.

I rewrote the script in Python and it was a bit better: more maintainable, but still a mess of hardcoded links and show names. It had the same problem as the bash script. If things went well, the script worked perfectly, but if anything changed on the streaming sites, it would fail to download the episode and I would get a call from my mom asking me to take a look!

The solution:

This time I rewrote the script in JavaScript as a Node script, with the idea of splitting everything up into small pieces. When I write code, I usually like writing in one big file. It helps me keep everything in my head. This time, however, I decided to write small functions that each do just one thing, a bit like unix utilities.

This is also when I came up with the idea of using a plain js file as the configuration file. I would create a js file that contains an array of shows and urls to check and this will be used further down the line when downloading.

The urls change daily, so I made the urls parameter a function that takes in a date. This way I can execute the function to get the correct urls for that day.

    { 
        title: "Show",
        name: (episodeDate) => `/home/media/tv/show/${episodeDate}.mp4`, 
        urls: (episodeDate) => [
            `https://www.example.com/series/show-${episodeDate}/`,
        ], 
        pageUrl: "https://example.com/show.html",
    },

This sets up the title, the name that I want to save the file with, the urls to check and a general page url that I can use further down the line to get more urls automatically. The pageUrl can probably replace the urls function, but I like being able to manually add a url to the urls function and have it just work. Once that episode is downloaded I delete the url again, since it was a hardcoded url to a specific file I knew existed.
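
To make the shape of the configuration a bit more concrete, here is a minimal sketch of a config array and a helper that resolves one entry for a given date. The `formatDate` and `resolveShow` helpers and the `YYYY-MM-DD` date format are assumptions for illustration; the post does not show how the date is actually built.

```javascript
// Hypothetical config array; in practice this would live in its own file
// and be exported with module.exports.
const shows = [
    {
        title: "Show",
        name: (episodeDate) => `/home/media/tv/show/${episodeDate}.mp4`,
        urls: (episodeDate) => [
            `https://www.example.com/series/show-${episodeDate}/`,
        ],
        pageUrl: "https://example.com/show.html",
    },
];

// Assumed date format: YYYY-MM-DD in local time.
function formatDate(date) {
    const pad = (n) => String(n).padStart(2, "0");
    return `${date.getFullYear()}-${pad(date.getMonth() + 1)}-${pad(date.getDate())}`;
}

// Turn the functions in a config entry into plain values for one date.
function resolveShow(show, episodeDate) {
    return {
        title: show.title,
        file: show.name(episodeDate),
        urls: show.urls(episodeDate),
        pageUrl: show.pageUrl,
    };
}

// Months are zero-based in the Date constructor, so 9 is October.
const resolved = resolveShow(shows[0], formatDate(new Date(2022, 9, 26)));
console.log(resolved.file);
```

Because the entries are plain functions, any script that loads the config can ask for the file name or urls of any date without knowing how they are put together.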

Tip 1: Definitely use JavaScript as some sort of configuration program driver. I really like how easy it was to just have it do all sorts of things inside the configuration.

Now that I have a list of shows, I can loop through them and download them.

This is the main index file.

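    for (let episode of episodes) {
        // name is a function of the episode date, so call it to get the path.
        if (fs.existsSync(episode.name(defaultDate))) {
            console.log(`Episode exists - ${episode.title}`);
            continue;
        }

        // Urls scraped from the show's general page, if one is configured.
        let u1 = [];
        if (episode.pageUrl && episode.pageUrl !== "") {
            u1 = await getUrls(episode.pageUrl);
        }

        // Scraped urls first, then the manually configured ones for this date.
        let urls = u1.concat(episode.urls(defaultDate));

        for (let url of urls) {
            console.log(`Downloading: ${episode.title}`);
            // download is async, so wait for its result before checking status.
            const result = await download(episode, url, defaultDate);
            if (result && result.status === true) {
                break;
            }
        }
    }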

This combines the urls that I have set and also parses out urls from the general page. This way I can get both urls I know about and urls that might be new. This then calls download which is a simple function that downloads a given url to a given name.
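
`getUrls` is not shown in this post. As a rough idea of what it has to do, here is a dependency-free sketch that pulls candidate episode links out of a page's HTML. The `extractLinks` name, the regex-based parsing (used here instead of cheerio) and the `keyword` filter are all assumptions, not the real implementation.

```javascript
// Hypothetical helper: extract absolute links from raw HTML.
// A regex is used only to keep the sketch dependency-free;
// a real parser such as cheerio is more robust.
function extractLinks(html, pageUrl, keyword) {
    const links = new Set();
    const hrefPattern = /<a\b[^>]*?\bhref\s*=\s*["']([^"']+)["']/gi;
    for (const match of html.matchAll(hrefPattern)) {
        let absolute;
        try {
            // Resolve relative links against the page they were found on.
            absolute = new URL(match[1], pageUrl).href;
        } catch (err) {
            continue; // skip hrefs that are not valid urls
        }
        // Keep only links that look like episode pages, without duplicates.
        if (!keyword || absolute.includes(keyword)) {
            links.add(absolute);
        }
    }
    return [...links];
}

const sampleHtml = `
    <a href="/series/show-2022-10-26/">Latest</a>
    <a class="x" href="https://example.com/series/show-2022-10-25/">Previous</a>
    <a href="/about.html">About</a>
    <a href="/series/show-2022-10-26/">Latest again</a>
`;
const found = extractLinks(sampleHtml, "https://example.com/show.html", "/series/");
console.log(found);
```

A `getUrls(pageUrl)` built on something like this would fetch the page first and pass the body in, so the main loop only ever sees a flat list of urls.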

    const axios = require("axios");
    const cheerio = require("cheerio");
    const { exec } = require("child_process");

    // Elements that can carry a video source, and the hosts youtube-dl should handle.
    const elementList = ["iframe", "video", "video source"];
    const domains = ["dailymotion", "youtube"];
    const downloadOptions = `--ignore-config -f "best[ext=mp4]/best"`;

    async function download(episode, url, episodeDate) {
        try {
            const response = await axios.get(url);
            const $ = cheerio.load(response.data);

            // Collect every src attribute, then keep only the known video hosts.
            const sources = $(elementList.join(",")).map((i, el) => $(el).attr("src")).toArray();
            const validSources = sources.filter(source => domains.some(domain => source.includes(domain)));

            for (let video of validSources) {
                const command = `/usr/local/bin/youtube-dl ${downloadOptions} -o "${episode.name(episodeDate)}" "${video}"`;
                // exec returns immediately and youtube-dl keeps running in the background,
                // so failures are reported through the callback, not through try/catch.
                exec(command, (err) => {
                    if (err) console.log(`YoutubeDL threw an error for: ${episode.title} ${err}`);
                });
                return { status: true };
            }
        } catch (err) {
            console.log(`Could not fetch ${url} for: ${episode.title} ${err}`);
        }
        // No usable video source on this page (or the page failed to load).
        return { status: false };
    }

    module.exports = download;

This function takes in a url, gets the page, finds the iframes or video tags and then hands the first matching source to youtube-dl to download. This works perfectly.
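
One thing worth considering with the `exec` call is that the output path and video url are pasted into a shell command string. A possible alternative, sketched below under the assumption that the same youtube-dl options are wanted, is to pass the arguments as an array to `execFile`, which does not go through a shell. `buildArgs` and `runYoutubeDl` are hypothetical names, not part of the original script.

```javascript
const { execFile } = require("child_process");
const { promisify } = require("util");

const execFileAsync = promisify(execFile);

// Same options as the command string in the post, one array entry per argument.
// No shell is involved, so the format selector needs no extra quoting.
function buildArgs(outputPath, videoUrl) {
    return ["--ignore-config", "-f", "best[ext=mp4]/best", "-o", outputPath, videoUrl];
}

// Resolves to { status: true } once youtube-dl exits cleanly,
// and to { status: false } if it exits with an error or cannot be started.
async function runYoutubeDl(outputPath, videoUrl, binary = "/usr/local/bin/youtube-dl") {
    try {
        await execFileAsync(binary, buildArgs(outputPath, videoUrl));
        return { status: true };
    } catch (err) {
        console.log(`youtube-dl failed for ${videoUrl}: ${err.message}`);
        return { status: false };
    }
}
```

Awaiting this inside the loop would make the downloads run one at a time. Collecting the promises and waiting on them with `Promise.all` would keep them parallel while still reporting which ones actually finished.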

I also wrote a script that uses the configuration file to check how far along the downloads are. All the downloading happens in parallel, so it’s nice to be able to see quickly how things are progressing. By having everything in a configuration file, I was able to write two utilities without having to duplicate things.
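
The progress script itself is not shown here. A minimal sketch of the idea, assuming youtube-dl's default behaviour of writing to a `.part` file next to the final name while a download is in flight, could look like this. `downloadStatus` and `printStatus` are hypothetical names.

```javascript
const fs = require("fs");

// Hypothetical status check for one config entry and date.
// Assumes youtube-dl's default of writing "<name>.part" until the download finishes.
function downloadStatus(show, episodeDate) {
    const file = show.name(episodeDate);
    if (fs.existsSync(file)) {
        return { title: show.title, state: "done", bytes: fs.statSync(file).size };
    }
    const partial = `${file}.part`;
    if (fs.existsSync(partial)) {
        return { title: show.title, state: "downloading", bytes: fs.statSync(partial).size };
    }
    return { title: show.title, state: "missing", bytes: 0 };
}

// Print one line per show, reusing the same config array as the downloader.
function printStatus(shows, episodeDate) {
    for (const show of shows) {
        const { title, state, bytes } = downloadStatus(show, episodeDate);
        console.log(`${title}: ${state} (${(bytes / 1024 / 1024).toFixed(1)} MB)`);
    }
}
```

Since the file name comes from the same `name` function the downloader uses, the two scripts cannot drift apart on where a given episode is supposed to end up.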

I’m pretty happy with the scripts now and I have far fewer problems to deal with. I have this script on a cron that runs every hour.

I also wrote a bash script that would find the latest episodes and create a little web page for my mom to watch the videos through. I should change that at some point but it works. This was a script that I wrote at the very beginning when I was using my bash script solution.
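
If that page generator ever moves from bash to Node to match the rest of the project, it could be as small as the sketch below. Everything here is an assumption: the `renderIndex` name, the markup, and the idea of passing in a list of titles and video paths.

```javascript
// Escape text so titles and paths cannot break the generated markup.
function escapeHtml(text) {
    return String(text)
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;");
}

// Hypothetical page generator: one link per downloaded episode, nothing else.
function renderIndex(episodes) {
    const items = episodes
        .map((ep) => `    <li><a href="${escapeHtml(ep.href)}">${escapeHtml(ep.title)}</a></li>`)
        .join("\n");
    return `<!DOCTYPE html>\n<html>\n<body>\n<ul>\n${items}\n</ul>\n</body>\n</html>\n`;
}

const page = renderIndex([
    { title: "Show - 2022-10-26", href: "tv/show/2022-10-26.mp4" },
    { title: "Tom & Jerry", href: "tv/tom/2022-10-26.mp4" },
]);
console.log(page);
```

The list of episodes could come straight from the configuration file, which would make this a third utility driven by the same config.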

One goal for the future is to see the progress of the downloads remotely so I can see how things are moving when I’m not home. I’d also like a way to give my mom a button so she can trigger the cron manually instead of waiting an hour.

Overall I’m pretty happy with how this project was split out, and I really like the idea of using JavaScript as a configuration language. That could be fun for other projects, though I haven’t had much reason to use it elsewhere yet.