I've been curious about how smaller-scale sites manage to crawl and index content from the internet effectively. For instance, FilmRot lets you search video transcripts from YouTube incredibly quickly, and the creator mentions it costs only $600 a month to run. This seems surprisingly low for such a large operation. I wonder if they're using web scraping methods and maybe even headless browsers like Chrome to bypass YouTube's restrictions or stay under API limits. That must incur a significant cost in computing resources, plus there's storage for all the transcripts and the indexing required for fast searches.
Another example is services that alert users about specific keywords on Reddit; these would need to scan the entire platform. How do they manage to do this efficiently without massive hosting resources? I've received varying answers from GPT, so I'd love to hear some real experiences and references on this topic!
2 Answers
It's pretty fascinating how much can be done with text indexing and searching. The tooling has come a long way, and plain text is tiny: a typical novel is roughly 0.5 MB as raw text, so a single GB of RAM can hold on the order of a couple thousand books' worth of it.
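To make that concrete, here's a minimal sketch of transcript indexing (an illustration, not what FilmRot actually runs) using SQLite's FTS5 full-text index through Python's standard-library sqlite3 module; the table layout and video IDs are made up:

```python
import sqlite3

# In-memory database; swap ":memory:" for a file path to persist it.
conn = sqlite3.connect(":memory:")

# FTS5 builds an inverted index over the columns automatically.
# (Assumes your SQLite build ships with FTS5 enabled, which most do.)
conn.execute("CREATE VIRTUAL TABLE transcripts USING fts5(video_id, body)")

conn.executemany(
    "INSERT INTO transcripts VALUES (?, ?)",
    [
        ("vid001", "today we review the worst movie ever made"),
        ("vid002", "a deep dive into practical film restoration"),
    ],
)

# MATCH queries hit the index rather than scanning every row,
# which is why lookups stay fast even at millions of transcripts.
for (video_id,) in conn.execute(
    "SELECT video_id FROM transcripts WHERE transcripts MATCH ?", ("film",)
):
    print(video_id)  # -> vid002
```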
Regarding YouTube, you typically don't need a browser at all. You can send the same HTTP requests a browser would, just without loading any of the images or video. And there's a deep body of work behind indexing and fast lookup; most developers lean on existing search libraries that do it far more efficiently than hand-rolled code would.
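As a rough sketch of what that looks like with Python's requests library (the video ID and User-Agent below are placeholders, and YouTube's page internals are undocumented and change without notice):

```python
import requests

session = requests.Session()
# A real scraper would set a realistic User-Agent; this one is illustrative.
session.headers["User-Agent"] = "Mozilla/5.0 (compatible; example-crawler)"

# One request fetches the watch page's HTML, with none of the follow-up
# requests a browser would make for images, video streams, or JS.
resp = session.get(
    "https://www.youtube.com/watch",
    params={"v": "dQw4w9WgXcQ"},
    timeout=10,
)
resp.raise_for_status()

# The HTML embeds player metadata (caption track URLs included) as a JSON
# blob; scrapers typically parse that rather than rendering the page.
print(len(resp.text), "bytes of HTML, no browser involved")
```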
For sites that don't rely on JavaScript, the crawling process is straightforward. You simply follow links from the pages you visit and gather new links recursively.
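A bare-bones version of that loop, assuming requests plus BeautifulSoup (a real crawler would also respect robots.txt, add politeness delays, and retry failures):

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def crawl(start_url, max_pages=50):
    """Breadth-first crawl that stays on the starting domain."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])

    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # skip pages that fail

        yield url, resp.text  # hand the HTML off for parsing/indexing

        # Gather links and enqueue any unseen same-domain URLs.
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)

for url, html in crawl("https://example.com"):
    print(url, len(html))
```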
For sites that do rely on JS, there are stripped-down headless Chromium setups that use minimal resources while still letting you render pages and pull out what you need. That trade-off is how smaller operations keep costs down while still collecting data effectively.
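For example, with Playwright driving headless Chromium you can abort requests for images, media, fonts, and stylesheets, so the page's JS still runs but bandwidth and memory stay low (example.com stands in for a real target):

```python
from playwright.sync_api import sync_playwright  # pip install playwright

# Resource types we refuse to download; the DOM and scripts still load.
BLOCKED = {"image", "media", "font", "stylesheet"}

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Intercept every request and abort the heavy ones.
    page.route(
        "**/*",
        lambda route: route.abort()
        if route.request.resource_type in BLOCKED
        else route.continue_(),
    )

    page.goto("https://example.com", wait_until="domcontentloaded")
    print(page.content()[:200])  # rendered HTML, post-JS

    browser.close()
```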
I get that full-text search tools like Elasticsearch are great for indexing text. But consider the fetching side: a single YouTube watch-page request can be around 1 MB, plus another ~5 KB for the transcript, so scraping a million videos works out to roughly 1 TB of bandwidth.
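Spelling out that back-of-envelope math (both figures are rough assumptions):

```python
videos = 1_000_000
page_bytes = 1e6        # ~1 MB per full watch-page fetch
transcript_bytes = 5e3  # ~5 KB per transcript

print(f"bandwidth: ~{videos * page_bytes / 1e12:.1f} TB")                # ~1.0 TB
print(f"transcript storage: ~{videos * transcript_bytes / 1e9:.1f} GB")  # ~5.0 GB
```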
And with protections like Cloudflare getting in the way of plain HTTP requests, how are these sites managing? Do they scrape once and leave it at that, never updating?