Mapping the internet... again

By PurplePumkiin, April 5, 2026

I am a massive data hoarder. My NAS holds years of the most random stuff, from virtual machines I made when I was 15 that serve literally no purpose to multi-hundred-gigabyte water simulations I made in Blender (they're still there). I collect, categorize, and sort stupid amounts of data. I don't know why I do it, but I always have. My main PC has 20TB of storage alone, plus an additional 32TB on my NAS, and another 12TB on the server that runs this very website. But there's always been one project I keep coming back to.

Indexing the internet. It sounds stupid; that data is already out there. You can just download it. Terabytes of scientific articles, stored and sorted by someone else. Maps of the internet already exist, and you can easily find them with one Google search. For instance, Ruslan Enikeev made a beautiful map showing the geographic origin and relative popularity of thousands of websites. But here I am, making my own version. Why? Because I can.

Map Showing Geo-Internet Hotspots (Martin Sanchez, https://unsplash.com/photos/red-heart-shaped-illustration-on-black-surface--VSicyd4c4A)

Now, in typical fashion, this is a personal project. I don't expect many to follow in my footsteps, and I expect even fewer to understand why I'm doing this. My own boyfriend thinks I'm strange for this. But for those few who do understand, or are at least morbidly curious, all of my code is open source and freely available on my GitHub. At the time of writing, it is basic and primitive. It follows robots.txt and is decently configurable for speed, with much more coming soon. But first, let's go over the basics and how it works.

Let's start at square one: the design considerations. I want this project to be highly portable. Not only is this going to generate a stupid amount of data, it's also going to lag any machine it runs on if you let it. The server that runs this website is also running this indexer, and because that server uses hard drives, we have to worry about IO pressure. Unfortunately, there is no way around this for me. However, if I get a new server with SSDs, this project will get migrated to it.

Point number one is that database migrations suck. I don't want to be locked into one vendor for a high-speed DB, and I don't want an ORM to manage it. So I used my one true love, SQLite. SQLite is already a decently performant database. It can manage millions and millions of rows, and with the right indexes, reads are stupid fast. But we run into a big issue: how do you take multiple threads that all want to dump potentially hundreds of URLs every few seconds and avoid conflicts? Well, SQLite has a great feature, write-ahead logging (WAL). In WAL mode, each write is appended to a separate log file and merged into the main database later, so readers don't block the writer and the writer doesn't block readers. SQLite still only allows one writer at a time, but appends are quick, so each thread can dump its data into the DB without the others stalling for long.
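For anyone who hasn't turned WAL on before, here is a rough sketch of what that setup can look like in Go. It is not lifted from the indexer. `journal_mode`, `synchronous`, and `busy_timeout` are real SQLite pragmas, but the values, the `execer` interface, and the `recorder` stand-in are assumptions made so the example runs without a SQLite driver. A real `*sql.DB` has the same `Exec` method and could be passed in instead.

```go
package main

import (
	"database/sql"
	"fmt"
)

// execer is the single method this sketch needs. *sql.DB satisfies it.
type execer interface {
	Exec(query string, args ...any) (sql.Result, error)
}

// walPragmas lists real SQLite pragmas; the chosen values are assumptions.
var walPragmas = []string{
	"PRAGMA journal_mode=WAL;",   // append changes to a separate -wal file
	"PRAGMA synchronous=NORMAL;", // fewer fsyncs, commonly paired with WAL
	"PRAGMA busy_timeout=5000;",  // wait up to 5s for a lock before erroring
}

// applyPragmas runs each pragma in order and stops at the first failure.
func applyPragmas(db execer, pragmas []string) error {
	for _, p := range pragmas {
		if _, err := db.Exec(p); err != nil {
			return fmt.Errorf("%s: %w", p, err)
		}
	}
	return nil
}

// recorder is a hypothetical stand-in for a database handle.
// It only remembers the statements it was asked to run.
type recorder struct{ seen []string }

func (r *recorder) Exec(query string, args ...any) (sql.Result, error) {
	r.seen = append(r.seen, query)
	return nil, nil
}

func main() {
	r := &recorder{}
	if err := applyPragmas(r, walPragmas); err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, q := range r.seen {
		fmt.Println("would run:", q)
	}
}
```

One caveat: `journal_mode=WAL` is stored in the database file, but `synchronous` and `busy_timeout` apply per connection, so with a pooled `*sql.DB` they usually need to be set through the driver's connection options rather than a one-off `Exec`.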

The second consideration is orchestration. Initially I was going to take the easy road: spawn a bunch of goroutines and just request each domain one after another. But that would get you banned. Most websites have a robots.txt. This is a file that directs indexers and web scrapers, telling them what is and isn't allowed. This file also sometimes stipulates how fast you can scrape.

Snippet from Wikipedia's robots file, stipulating crawl-delay
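For sites that do set a `Crawl-delay` line, reading it could look something like the sketch below. This is an illustration, not the indexer's actual parser: the function name `crawlDelay` is made up, it ignores `User-agent` groups entirely, and it assumes the crawler never goes faster than a 2 second floor, whatever the file says.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
	"time"
)

// defaultDelay is the assumed floor: 2 seconds, or 30 requests per minute.
const defaultDelay = 2 * time.Second

// crawlDelay returns the first valid Crawl-delay found in a robots.txt body.
// It never returns less than defaultDelay and ignores User-agent groups.
func crawlDelay(robots string) time.Duration {
	sc := bufio.NewScanner(strings.NewReader(robots))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i] // drop comments
		}
		key, val, ok := strings.Cut(line, ":")
		if !ok || !strings.EqualFold(strings.TrimSpace(key), "crawl-delay") {
			continue
		}
		secs, err := strconv.ParseFloat(strings.TrimSpace(val), 64)
		if err != nil || secs < 0 {
			continue // skip values that are not usable numbers
		}
		d := time.Duration(secs * float64(time.Second))
		if d > defaultDelay {
			return d
		}
		return defaultDelay
	}
	return defaultDelay
}

func main() {
	withDelay := "User-agent: *\nCrawl-delay: 5\nDisallow: /private/\n"
	commentOnly := "# Please keep a crawl-delay that is reasonable\nUser-agent: *\n"
	fmt.Println(crawlDelay(withDelay))
	fmt.Println(crawlDelay(commentOnly))
}
```

A file that only talks about crawl delay in its comments falls through to the default, which is the polite thing to do when a site asks nicely but gives no number.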

So in the case of Wikipedia, they wish that you keep to a 2-5 second delay. They don't actually specify how fast you can go, they just ask you to keep it reasonable, but still, all the same. I keep a max speed of 30 requests per minute, or a 2 second delay between requests.

If I had the program just request the next link in the queue, making sure I don't hammer specific domains, or even keeping everything organized and targeted, would be practically impossible. As such, I have an in-memory worker scheduler. Each thread, or worker, is assigned an ID, and domains are assigned to workers using these IDs.

Let us take two domains, example.com and pumkiin.tech. These two domains each have their own robots.txt requirements. When the script starts, it spawns the number of workers dictated by the .env file or environment variables. These workers then hit a function that requests a domain. To make sure that two different workers don't simultaneously request the same domain, there is an in-memory mutex in place: only one worker can request a domain at a time. This memory layer also keeps track of who owns what domain, locking other workers out of it.

In the case of the two domains mentioned earlier, the DB stores them in a domains table. The function is called, and it spits out the first domain that has URLs in the urls table that still need to be crawled. If there are more workers than domains, the extra workers just sleep for a few seconds, hoping that something else has been found. They will continue to sleep until work is available for them.
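To make the claiming step a bit more concrete, here is a stripped-down sketch of the idea. The names (`Scheduler`, `Claim`, `Release`) are invented for this example, and plain maps and a slice stand in for the domains and urls tables, so treat it as the shape of the thing rather than the real code.

```go
package main

import (
	"fmt"
	"sync"
)

// Scheduler tracks, in memory, which worker owns which domain.
type Scheduler struct {
	mu      sync.Mutex
	domains []string       // stand-in for the domains table, in DB order
	pending map[string]int // stand-in for the urls table: domain -> URLs left
	owner   map[string]int // domain -> ID of the worker holding it
}

// Claim gives the worker the first unowned domain that still has URLs to crawl.
// ok is false when nothing is available, and the worker should sleep and retry.
func (s *Scheduler) Claim(workerID int) (domain string, ok bool) {
	s.mu.Lock() // only one worker can be inside this function at a time
	defer s.mu.Unlock()
	for _, d := range s.domains {
		if _, taken := s.owner[d]; !taken && s.pending[d] > 0 {
			s.owner[d] = workerID
			return d, true
		}
	}
	return "", false
}

// Release gives a domain back, but only if the caller actually owns it.
func (s *Scheduler) Release(workerID int, domain string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if id, held := s.owner[domain]; held && id == workerID {
		delete(s.owner, domain)
	}
}

func main() {
	s := &Scheduler{
		domains: []string{"example.com", "pumkiin.tech"},
		pending: map[string]int{"example.com": 3, "pumkiin.tech": 1},
		owner:   map[string]int{},
	}
	var wg sync.WaitGroup
	for id := 1; id <= 3; id++ { // three workers, two domains
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			if d, ok := s.Claim(id); ok {
				fmt.Printf("worker %d owns %s\n", id, d)
			} else {
				fmt.Printf("worker %d found nothing and would sleep\n", id)
			}
		}(id)
	}
	wg.Wait()
}
```

With three workers and two domains, whichever worker loses the race gets nothing back and would go to sleep. Which worker that is changes from run to run.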

Now that a worker has a domain it owns, it passes its worker ID and the domain into a second function. This function actually gets a URL for the worker to scrape. It first checks whether the worker owns the specified domain and, if so, fetches it a job to do. When there are no more URLs to scrape, the worker requests a new unowned domain.
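The ownership check can be sketched like this. Again, `Queue`, `Next`, and `ErrNotOwner` are hypothetical names, the maps stand in for the database, and releasing the domain once it runs dry is an assumption about how a worker becomes free to claim something new.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrNotOwner is returned when a worker asks for work on a domain it does not hold.
var ErrNotOwner = errors.New("worker does not own this domain")

// Queue hands out URLs per domain, but only to the worker that owns the domain.
type Queue struct {
	mu    sync.Mutex
	owner map[string]int      // domain -> worker ID
	urls  map[string][]string // domain -> URLs still waiting to be crawled
}

// Next returns the next URL for the domain if workerID owns it.
// done is true when the domain has no URLs left; ownership is dropped at that point.
func (q *Queue) Next(workerID int, domain string) (url string, done bool, err error) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if id, ok := q.owner[domain]; !ok || id != workerID {
		return "", false, ErrNotOwner
	}
	pending := q.urls[domain]
	if len(pending) == 0 {
		delete(q.owner, domain) // assumption: an empty domain is released
		return "", true, nil
	}
	q.urls[domain] = pending[1:]
	return pending[0], false, nil
}

func main() {
	q := &Queue{
		owner: map[string]int{"example.com": 1},
		urls: map[string][]string{
			"example.com": {"https://example.com/", "https://example.com/about"},
		},
	}
	if _, _, err := q.Next(2, "example.com"); err != nil {
		fmt.Println("worker 2:", err)
	}
	for {
		u, done, err := q.Next(1, "example.com")
		if err != nil || done {
			break // time to go claim another domain
		}
		fmt.Println("worker 1 crawls", u)
	}
}
```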

From there, it's just basic boilerplate. I used Go for this project. It's fast, and I'm trying to learn new things. If I wasn't trying to learn, I'd use the one true king of programming languages, PHP 8.4. But in general, it performs a GET request for the URL passed to it, takes the bytes of the body, and uses goquery to turn them into a document. If you specify that fetched data should be saved, it saves it to disk, sorted by domain, with the file name being a hash of the contents. If you don't, it throws the data out. URLs found in the body are saved to the urls table, with relative links first resolved into absolute URLs. There is of course more to it than that, but it's just general HTTP handling.
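The two less obvious bits here, resolving relative links and naming files by content hash, can be sketched with just the standard library. This skips goquery entirely and starts from a list of hrefs that have already been pulled out of the page. The function names, the choice of SHA-256, and the decision to drop fragments and non-HTTP schemes are all assumptions for the example, not necessarily what the indexer does.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/url"
	"path/filepath"
)

// resolveLinks turns hrefs found on a page into absolute http(s) URLs.
// Malformed hrefs and other schemes (mailto:, javascript:, ...) are skipped.
func resolveLinks(pageURL string, hrefs []string) ([]string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, h := range hrefs {
		ref, err := url.Parse(h)
		if err != nil {
			continue
		}
		abs := base.ResolveReference(ref) // relative links are resolved against the page
		if abs.Scheme != "http" && abs.Scheme != "https" {
			continue
		}
		abs.Fragment = "" // assumption: #anchors point at the same document
		out = append(out, abs.String())
	}
	return out, nil
}

// savePath builds <root>/<domain>/<hex SHA-256 of body>.
// Identical bodies map to the same file name.
func savePath(root, domain string, body []byte) string {
	sum := sha256.Sum256(body)
	return filepath.Join(root, domain, hex.EncodeToString(sum[:]))
}

func main() {
	hrefs := []string{"/about", "next", "mailto:hi@example.com", "https://pumkiin.tech/#top"}
	links, err := resolveLinks("https://example.com/blog/post", hrefs)
	if err != nil {
		fmt.Println("bad page URL:", err)
		return
	}
	for _, l := range links {
		fmt.Println(l)
	}
	fmt.Println(savePath("data", "example.com", []byte("<html>hello</html>")))
}
```

Naming files by hash means the same page fetched twice lands on the same path instead of piling up duplicates, which matters a lot on a box that is already fighting IO pressure.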

As always, this isn't the best code. But it works, and it's far faster than anything I've ever made before. I've made three other indexers in the past. My first one was made with bash using curl and its recursive functionality. And the second one was made with Python. Both were slow. Curl is a great tool, but when handling millions of URLs, it starts to crack. And the Python indexer was unfortunately a disaster from the start. For some reason I decided to vibe code the whole thing. It barely worked to begin with, and I wouldn't recommend that experience to anyone. I'll update this blog soon to show the data I'm collecting, but first I have to push an update to the indexer for more important features. Life is too short to not do dumb things, and I plan on doing my fair share, so stay tuned and I'll come up with something even dumber in the future.
