srfi-194, Zipf, git, blockchain, AtomSpace

So I recently implemented the Zipfian random distribution for the scheme request-for-implementation 194. The code is available on git.  Discussions turned to the merits of using git and github, and so I had the opportunity to opine about the meaning of life in a long email.  Without further ado, the meaning of life:

Both bitcoin and git showed up at the same time, and both were built on the same idea of append-only logs. git explicitly allowed multiple forks; bitcoin explicitly forbade them. Both made design mistakes, as they were early adopters. Git wasn’t sufficiently file-system-like-ish, (which is plan9’s strong point). A shame, since file-backup, restore, corruption-protection from viruses, accidental deletion, etc. are a “thing”. For example, keeping the unix  /etc in git is a life-saver for system admins.  No matter; seems that guix and nix might have a superior solution, anyway, from what I can tell.

Bitcoin made two mistakes: not being file-system-ish enough, and not defining a sufficiently generic compute platform (solved by ethereum. Basically, bitcoin knows how to add and subtract; that’s all that’s needed for accounting. Ethereum knows how to multiply and divide and branch and tail-recurse). The not-being-a-filesystem choice is forced by anti-forking properties of bitcoin/ethereum, since the blockchain grows rapidly and so all commits must be tiny. By contrast, git allows not only large commits, but any kind of code, c++, scheme, whatever. But never defines an execution context for that code. In git, it’s up to the user to download, compile, install. It’s not automated like in ethereum.

Git fails to provide a fully-automated network filesystem. You have to manually git-push, git-pull to get network updates. There’s no peer-discovery protocol (which is what started this email chain), and resolving file conflicts is problematic during `git merge`. Also, git fails to provide a directory search mechanism: if you don’t know where the content is, you can’t search for it (as also complained about in this email chain).  Github sort-of-ish solves this, but its proprietary. Compare this to IPFS, which is full-network-automated, and does provide content search. Unfortunately, IPFS doesn’t have the append-only log structure of git. It also uses DHT’s, which are problematic, as they completely fail to deal with locality. Enter scuttlebutt .. which provides a “true” decentralized append-only log. The scuttlebutt core is fully generic (which is why git-on-scuttlebutt is possible). However, 90% of scuttlebutt development focus is on social media, not filesystems. Think of it as a kind-of git-for-social-media posts. Or maybe a web-browser-displaying-git-content-in-a-pretty-way. (BTW, the scuttlebutt people are really nice people. You should go hang out with them.)

The lack of proper indexes in git is severe, as is the lack of content-based search. Once you get into search, you wander down the rabbit-hole of query languages, pattern matching and pattern-mining. So scheme (and most functional programming languages) have pattern-matchers. For example, case-lambda but also define-syntax and syntax-case, or racket’s racket/match … but in a certain sense, these are pathetically weak and limited in ability and performance. Compare to, for example SQL — it blows the doors off syntax-case in usability and power. Never mind pattern-mining. And then we have the issue of term-rewriting. So, for example, schemes’ hygienic macros do term re-writing, but they do it only once, when you start your scheme code up for the first time. There is no ability to perform runtime expansion of hygienic macros.  Macro languages are not run-time – again, compare/contrast to SQL select/update.

Well, but SQL is obviously deficient — its record-based, and if you compare it to syntax-case, define-syntax, you will note that trees aka s-expressions are what we really wanted. Pulling on that thread gives you graph query languages, e.g. GraphQL for javascript… which is nice cause json is kind-of-ish like typed s-expressions. Yes, scheme is explicitly untyped, but don’t knock types, they’re really nice. The racket people are onto something, something that ain’t javascript or CamL or haskell.

So, although graph query languages are vast improvements over plain-jane pattern matchers, one can do better still… which brings me to what I work on… the AtomSpace and Atomese (sorry, I hate camel-case, but that’s history now.) It’s a graph database — you can store arbitrary typed s-expressions. It’s a pattern matcher, but far more sophisticated than GraphQL. It’s a programming language, but is more like assembly or byte-code, or an intermediate-language: low-level, not for humans, but for other algos. It could’ve/should’ve been more javascript-like, but that’s a historical mistake. Maybe still fixable. It’s vaguely prolog-like, and so you could say minikanren-like, but it has a stronger runtime, and generalizes truth values beyond true/false, so e.g. for Bayesian probability or neural-nets. It’s .. well, it’s a graph database, it’s weakly distributed; there’s some ongoing work, there, but the as mentioned, DHT’s are terrible in ensuring data locality. So if I have to e.g. (case-lambda (foo)(..) (bar baz) (...)) I would rather that (foo) and (bar baz) be on the same network node, rather than opposite sides of the planet. But Kademlia puts them on opposite sides of the planet (because their hashes are completely different), and I haven’t been able to crack that nut.

What does this have to do with zipf? Well, all patterns have a grammar. This is Chomsky’s theorem from the 1960’s or something. If you know the grammar, you can parse text to see if it validly obeys the grammar. Alternately, you can generate syntactically-valid random text from the grammar.  But what if you don’t know the grammar? Well, that’s what machine-learning and deep-learning is supposed to be for, except that machine-learning could only learn very simple grammars (e.g. decision-tree-forests) until it hit the wall. So deep-learning overcame that wall, but it bypasses syntax by claiming its not important, or that it is a black box, or whatever the pundits claim these days.  So I’m trying to do pattern mining correctly.

But to do that, I need to validate that the syntax that is learned matches the syntax used to generate the sample corpus. So I have to generate a random syntax, generate a random text corpus from it, pattern match it, deduce the syntax, and check precision/recall accuracy. Most networks in nature (natural language, genomics, proteomics) seem to be Zipf-distributed. And so I need a Zipf random generator.  So here we are…

FWIW genomes seem to be zipfian with exponent of 1/2 .. I have been unable to find any explanation why it’s 1/2 and not 1. Its not just genomes, its also text.  (Although I was banned from Wikipedia for pointing that out. Hash-tag time: #defund-wikipedia-admins) Anyone have a clue, here? Anyone help me out? I mean, with the exponent=1/2 part?

Or help me out with Atomese, or with the syntax-learning project? Or anything at all? A lifeline? Donate some bitcoin? 1MjcVE8A4sKDqbbxSf8rej9uVPZfuJQHnz

Leave a Reply

Your email address will not be published. Required fields are marked *