What is IPFS, anyway? How should I think about it? I claim, in this essay, that it’s a mediocre cache for content, one that mostly just slows down content delivery. It does have a degree of censorship resistance, though. This may sound like a negative, even harsh evaluation. Yet with the addition of things like Filecoin, it could become (already is?) useful for general storage. Provided, that is, you are willing to pay for it.
The Achilles’ heel for IPFS is the same as for the older-generation file-sharing tools (SleepyCat, Gnutella, BitTorrent): content is not available unless there is at least one seeder. That’s one issue. Another is that content is effectively unavailable if you have no good way of searching for it.
This is perhaps painfully obvious, but I want to dwell on it, because it’s like the air we breathe: invisible if you don’t focus on it, vital when you’re short of breath.
Here’s the mission. Say I have a website. Perhaps it’s unreliable, due to frequent network outages. Perhaps it’s unreliable because I forget to pay my internet bill. Maybe because I was tragically killed in a car accident or something. Perhaps that website hosts obscure info that’s of no particular interest to anyone, except for occasional, fleeting visits from curious passers-by. Perhaps some foreign country censors that website (despite the fact that it’s obscure and boring). Can I use IPFS to improve the reliability and availability of that website, and, if so, how?
Let’s say I’m talking about this website, linas.org. There are two steps: first, I have to seed IPFS with my content; then I have to publish an IPNS record, so that it can be found. This second step is effectively the creation of a URL, in the original sense: a resource locator that provides a well-known name, so that, if you know the name, you can find the content associated with it. In my case, it would be something like /ipns/www.linas.org/ and, as long as you know that, the automated machinery will find the CID for it. The CID can then be used to query the IPFS DHT to find a copy of the content associated with that CID.
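To make the two steps concrete, here is a minimal sketch in Python, assuming a local Kubo (go-ipfs) daemon with its RPC API on the default port 5001. A real site would want a recursive directory add; the single file here is just illustrative.

```python
# Sketch: seed content into IPFS, then publish an IPNS record for it.
# Assumes a local Kubo daemon with its RPC API at 127.0.0.1:5001.
import requests

API = "http://127.0.0.1:5001/api/v0"

# Step 1: seed. Kubo hashes the content and returns its CID.
with open("index.html", "rb") as f:
    added = requests.post(f"{API}/add", files={"file": f}).json()
cid = added["Hash"]
print("seeded as CID:", cid)

# Step 2: publish an IPNS record pointing at that CID, signed with the
# node's key. This is the well-known-name step; a DNSLink TXT record in
# ordinary DNS can then alias www.linas.org to the resulting /ipns/ name.
published = requests.post(f"{API}/name/publish",
                          params={"arg": f"/ipfs/{cid}"}).json()
print("published as: /ipns/" + published["Name"])
```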
There are three distinct issues with this naive scheme. First, you have to somehow know that the droid you are looking for has the name www.linas.org. The conventional way to find this out is to use a search engine (Google, DuckDuckGo) and ask it about stuff that is contained here. This is non-trivial in itself. But let’s ignore that, for now, and get back to it later.
Now, if you find this essay on DuckDuckGo, and click on the link, the first thing that happens is that your browser sends the (ASCII, UTF-8) string linas.org to DNS for IP address lookup. DNS is fast; it happens in an eyeblink. With the IP address in hand, a TCP/IP socket can be opened and an HTTPS request put on it. This is also incredibly fast: your internet provider and the internet backbone have incredibly fast, efficient routers that take TCP/IP packets, look at the headers, and forward them to the appropriate destinations on millisecond timescales. My webserver can package the desired web page back up into a collection of TCP/IP packets, in milliseconds, delivered to your browser, which renders the content. This all works so well because there have been three-plus decades of effort to tune it, and billions, maybe even a trillion, dollars of investment in the technology to make it go fast and smoothly.
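If you want to put a number on that eyeblink, here’s a quick-and-dirty timing of the resolver step (your OS and resolver caches will skew the result, of course):

```python
# Time a DNS lookup, the first step of any ordinary web fetch.
import socket
import time

t0 = time.monotonic()
addrs = socket.getaddrinfo("linas.org", 443, proto=socket.IPPROTO_TCP)
t1 = time.monotonic()
print(f"resolved to {addrs[0][4][0]} in {(t1 - t0) * 1000:.1f} ms")
```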
So maybe the fundamental problem is that IPFS doesn’t have a few decades and a trillion dollars of investment behind it. First of all, IPNS is slow. It can take more than a minute, sometimes five, to convert an IPNS name to a CID, even if the target is on the machine right next to you. I don’t understand the technical reasons for this. Fetching the content is faster, but can still take seconds (even if it’s on a machine next to you). This has something to do with going out on the DHT, which requires querying multiple other machines on the internet before the hash (the CID) can be localized to the machine in the room with you. These are both fairly significant flaws that prevent the instant and overwhelming adoption of IPFS as an alternative content-delivery mechanism.
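For comparison, the same kind of timing against IPNS, again assuming a local Kubo daemon. The name below is the DNSLink-style one from this essay; substitute an /ipns/&lt;peer-id&gt; name to exercise the pure-DHT path complained about above. Expect seconds to minutes, not milliseconds:

```python
# Time an IPNS resolution via a local Kubo daemon's RPC API.
import requests
import time

API = "http://127.0.0.1:5001/api/v0"
name = "/ipns/www.linas.org"  # or "/ipns/<peer-id>" for a DHT lookup

t0 = time.monotonic()
resolved = requests.post(f"{API}/name/resolve", params={"arg": name}).json()
t1 = time.monotonic()
print(f"resolved to {resolved['Path']} in {t1 - t0:.1f} s")
```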
What does IPFS offer in exchange for this poor performance profile? Well, *if* the content is popular, say, music or movies, then there will be lots of cached copies, and content delivery is sped up by having different hosts deliver the assorted shards making up the whole. The original seeder will typically not participate in this process. But if the content is unpopular, then only the seeder can provide a copy, and if the seeder is offline, that’s it: the content is not available. IPFS does not function as an archive.
So what other archival problems are there? Well, for starters, if I publish /ipns/www.linas.org/ and someone else does too, and their CID differs from mine, then who wins? For DNS, we have domain-name registrars and IANA to manage and control coherency at the top levels, and we’ve got tech like TTLs and expiry dates inside the DNS servers themselves to guard against stale or invalid lookups. This is, again, a multi-billion-dollar business that greases these wheels. At the time of writing, I am not aware of anything similar for IPNS.
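You can watch those TTLs at work with the dnspython package (pip install dnspython); every answer a DNS server hands back carries one:

```python
# Inspect the TTL on a DNS answer: the expiry machinery that guards
# against stale lookups.
import dns.resolver  # from the dnspython package

answer = dns.resolver.resolve("linas.org", "A")
print("addresses:", [r.address for r in answer])
print("seconds until this answer goes stale:", answer.rrset.ttl)
```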
Remarkably, Namecoin does not solve this problem. It was originally presented as an alternative to the hassle of working with domain-name registrars. It almost instantly devolved into yet another speculative crypto bubble. Fer chrissake, there weren’t even any instructions for how to hook your newly acquired domain names into your DNS infrastructure. When I asked, I was told “it’s obvious, you’re stupid”. Well, I sysadmin several DNS servers. It’s not obvious, and I’m not stupid. I want to call this “capitalism” but it is baser than that: it’s idiotic behavior fueled by greed and driven by tech supplied by crypto bros. Sigh. I’m off-topic.
The naming issue is part of a more general archival issue of provenance, cataloging, and metadata. Provenance: where did it come from? How do you know it’s authentic? The git version-control system is an example of “blockchain done right”: when you’ve got something in git, you know that it’s correct, authentic, uncorrupted, unadulterated. (Unless someone fed you a bogus git tree, but there are informal ways of verifying the authenticity of the tree as a whole.)
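That informal verification works because git names every object by a hash of its contents, so any tampering changes the name. The blob hash is simple enough to reproduce from scratch:

```python
# Reproduce git's content addressing: an object's name is the hash of a
# small header plus its content (SHA-1, in classic repositories).
import hashlib

def git_blob_hash(content: bytes) -> str:
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

# Matches `echo 'hello world' | git hash-object --stdin`:
print(git_blob_hash(b"hello world\n"))
# -> 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```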
There are no archival tools built on git. The Internet Archive Wayback Machine does not offer git snapshots of archived websites. Without git, website diffs are hard. Without git, there is no particular guarantee that Wayback archives haven’t been corrupted or altered. Without git, there is no particular way for me to mirror the content on the Wayback Machine.
I mean, suppose I wanted to keep a spare copy of some particular archive. I have the disk space: if I think some particular website is particularly awesome, how can I replicate it? The internet archives don’t offer this. Neither does IPFS. There are some efforts, e.g. Wikipedia-on-IPFS, which allows me to publish (seed?) portions of Wikipedia. But I kind of want to do that for arbitrary websites. Why? ’Cause I’m a packrat, I guess.
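The closest IPFS gets is pinning: hand your node a root CID, and it fetches and keeps every block beneath it, for as long as your node stays up. A sketch, again assuming a local Kubo daemon; the CID is hypothetical:

```python
# Pin someone else's content so your node holds a full spare copy.
import requests

API = "http://127.0.0.1:5001/api/v0"
cid = "QmExampleSiteRootCID"  # hypothetical root CID of a site snapshot

pinned = requests.post(f"{API}/pin/add", params={"arg": cid}).json()
print("now pinned:", pinned["Pins"])
```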
There are some other archival-on-IPFS efforts. One is ReplayWeb.page, which captures entire websites and packages them into blobs: WARC, WACZ, and HAR format files. It provides considerable infrastructure for working with archives and sharing them, in general. It’s not really about IPFS, though. Issues like search and discovery are unresolved.
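At least those blob formats are easy to poke at; here’s a sketch using the warcio package (pip install warcio), with an illustrative file name:

```python
# List the captured URLs inside a WARC file.
from warcio.archiveiterator import ArchiveIterator

with open("capture.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            print(record.rec_headers.get_header("WARC-Target-URI"))
```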
Another is GhostArchive. They’re a bit mysterious. Like, who are they? They don’t tell you where the archives are kept (they’re somehow IPFS-related?) and there’s no way for me to re-mirror their archives. They state that they are adding a MementoWeb API.
So, what’s the MementoWeb API? It’s a set of URIs that include timestamps, letting you access earlier, time-stamped copies of URLs. This assumes that there is an archive that holds those copies: the API is just a way of offering access to archived data.
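The Wayback Machine’s availability endpoint is a convenient stand-in example of the idea (this is Wayback’s own API, not GhostArchive’s): give it a URL and a timestamp, and it returns the closest archived snapshot:

```python
# Ask the Wayback Machine for the snapshot nearest a given date.
import requests

resp = requests.get("https://archive.org/wayback/available",
                    params={"url": "linas.org", "timestamp": "20150101"})
snap = resp.json()["archived_snapshots"].get("closest")
if snap:
    print(snap["timestamp"], snap["url"])
```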
Timestamps aren’t the only metadata, although they’re maybe the principal kind. To get at the rest, we need search…
I wonder if there’s a way to do interplanetary git. What would that be like? Who would use it? Who would want it?
Whatever, I’ve careened off-course. It’s like I’ve got some vague desires having to do with curating collections, sharing files, preserving data and pondering future technologies. This essay attempted to be a coherent walk through them, but degenerated into a web of confusion by the end. I’m gonna go out and ride my bike now. What are you going to do?