A messy and incomplete list of open-source (and some notable closed-source) Artificial General Intelligence projects, together with lists of components and tools that can be used within existing AGI projects or in new ones. These components cover everything from NLP and language generation to data clustering and machine-learning algorithms, large data stores, knowledge bases, reasoning engines, program-learning systems, and the like.
A good overview is given by Pei Wang's Artificial General Intelligence: A Gentle Introduction. See also the Wikipedia articles on Artificial Consciousness and Strong AI.
See also a large list of free/open-source "narrow AI" software, at the GNU/Linux AI & Alife HOWTO.
Suggested Education for Future AGI Researchers.
Novamente's general cognition/reasoning system. Includes NLP subsystem, reasoning, learning, 3d virtual avatar interfaces, robotics interfaces. Open-source, GPL license.
Nominally associated with Artificial General Intelligence Research Institute and SIAI.
NARS, the Non-Axiomatic Reasoning System, aims to explain a large variety of cognitive phenomena with a unified theory, and, in particular, reasoning, learning, and planning. The site hosts a number of white papers. NARS was an inspiration for OpenCog (which claims to overcome certain limitations in NARS). OpenNARS is Pei Wang's implementation, released under GPLv2.
An intelligent agent, communicating by email, built for the US Navy. Based on Baars's Global Workspace Theory. Answers only one question: "What do I do next?". See the Tutorial.
General framework for running cognitive experiments(?). Java source code available under unspecified license.
Aims to couple common-sense knowledge-base systems to natural language text processing. Open source project.
Seems primarily aimed at robots.
Cognitive architecture research platform, aimed at simulating and understanding human cognition.
Polyscheme is a cognitive framework intended to achieve human-level artificial intelligence and to explain the power of human intelligence. Variety of research papers published, no source code available.
Commercialized "Hierarchical Temporal Memory".
SNePS is a knowledge representation, reasoning, and acting (KRRA) system. See also the Wikipedia page, and a paper by Shapiro, part of the SNePS group.
Primarily an implementation of Markov Logic Networks (MLNs). MLNs are remarkable because they unify, in a single conceptual framework, both statistical and logical (reasoning, first-order logic) approaches to AI. This seems to endow the theory with a particularly strong set of powers, in particular the ability to learn, without supervision, some of the harder NLP tasks, such as dependency grammars, automatic thesaurus/synonym-set learning, entity extraction, reasoning, textual entailment, etc.
Primarily an implementation of Markov Logic Networks, for Statistical Relational Learning, including dependency parsing, semantic role labelling, etc. Perhaps more NLP-focused than Alchemy.
YAGO is a huge semantic knowledge base, consisting primarily of information about entities. Contains 2M entities, and 20M facts about them. The YAGO-NAGA project also includes SOFIE, a system for automatically extending an ontology via NLP and reasoning.
FreeHAL is a self-learning chatbot that stores its knowledge in semantic nets. TODO -- figure this one out; hard to tell if this is "real" or a hack.
Nutcracker performs textual entailment using a first-order-logic (FOL) theorem prover and an FOL model builder. Built on top of Boxer, which takes the output of a combinatory categorial grammar (CCG) parser, and converts this to first-order logic based on Hans Kamp's "Discourse Representation Theory".
Written in Prolog. Non-free license; bars commercial use.
The MultiNet paradigm - Knowledge Representation with Multilayered Extended Semantic Networks by Hermann Helbig. Wires up NLP processing to a hard-wired upper ontology, and adds reasoning. No source code available.
Developed by Vulcan Inc. in association with SRI International, Cyc Corp. and the UTexas/Austin CS/AI labs, aims to provide reasoning and question-answering over large data sets. All knowledge entry is done manually, by experts. Some research results are available publicly.
Developed by Hakia Labs, proprietary, commercial software for taking NLP input and generating ontological frames/expressions from it. See also ontologicalsemantics.com.
Powers Hakia search.
A giant list can be found at Peter Clark's Some Ongoing KBS/Ontology Projects and Groups. Problems with ontologies are reviewed in Ontology Development Pitfalls.
Big ones include
Common-sense knowledgebase. Large. GPL license. Users can edit data online, at http://torg.media.mit.edu:3000/
A collection of English-language sentences, rather than a strict upper ontology. This is actually quite convenient if you have a good NLP input system, as it helps avoid the strictures of pre-designed ontologies, and instead gets you to deal with the structure of your NLP-to-KR layer. From MIT. Large: 700K sentences.
YAGO, described above, also belongs here: a huge semantic knowledge base of some 2M entities and 20M facts about them, with the SOFIE system for automatically extending the ontology via NLP and reasoning.
Semantic network.
See also: Wordnet::Similarity, a Perl module implementing various word-similarity measures from WordNet data; i.e., thesaurus-like.
Licensing is unclear.
SUMO WP article. Includes the open-source Sigma knowledge engineering environment, which includes a theorem prover. Sigma uses KIF.
"The largest formal public ontology in existence", availble under GPL. (although OpenCyc is arguably bigger, and is free.) Has mappings to WordNet.
Large KB under the Artistic License. Source for the engine is not available. The KB seems messy and capricious, and the upper ontology is not clear. See, however, the remarks above.
Common sense KB, available in CycL. GPL'ed
A knowledge representation system. Conceptual Graph Interchange Format is an ISO standard. See also "Common Logic Interchange Format (CLIF)", which is more lisp-like.
Seems well-engineered. Actual KB is slim. Source not available. Might be a dead project??
Provides a firm theoretical foundation for representing ontologies; no actual data. OWL version of GFO under a modified BSD license. Examples include the periodic table of elements, amino acids. See also WP article.
Below is a list of reasoning and/or inference engines only, without accompanying ontologies/datasets.
Implements a probabilistic analog of first-order logic, ideal for uncertain inference. Beta available now; in the process of being ported to OpenCog. First-order logic statements are expressed in terms of hypergraphs. The nodes and edges of the hypergraphs can hold various different "truth value" structures. A set of basic types defines how truth values are to be combined, resulting in the primitives needed for uncertain reasoning. These are described in Ben Goertzel's book of the same name. A specific claim is that the rules are explicitly founded on probability theory.
Truth values are probability distributions, usually represented as compound objects: e.g. having not only a probability, but also upper and lower bounds on the uncertainty of the probability estimate.
The actual implementation works primarily by applying typed pattern matching to hypergraphs, to implement a backward chainer. That is, PLN defines a typed first-order logic; it does not (yet?) define a typed functional programming language (although it comes close to doing so). Inference control is through various algorithms, including "economic attention allocation" and Hebbian activation nets.
GNU Affero GPLv3 license.
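To make the "truth value combination" idea concrete, below is a minimal Python sketch of PLN-style deduction, using the independence-assumption formula for combining strengths. The TruthValue class, the confidence-product rule, and the example numbers are illustrative assumptions, not OpenCog's actual API.

    # Illustrative sketch only: a PLN-style "simple truth value" and
    # the independence-based deduction strength formula.  Names and
    # the confidence rule are assumptions, not OpenCog's actual API.

    from dataclasses import dataclass

    @dataclass
    class TruthValue:
        strength: float     # estimated probability of the relation
        confidence: float   # how much evidence backs that estimate (0..1)

    def deduce(ab: TruthValue, bc: TruthValue,
               p_b: float, p_c: float) -> TruthValue:
        """Deduction A->B, B->C |- A->C under independence assumptions:
        s_AC = s_AB*s_BC + (1 - s_AB)*(p_C - p_B*s_BC)/(1 - p_B)
        """
        s = ab.strength * bc.strength + \
            (1.0 - ab.strength) * (p_c - p_b * bc.strength) / (1.0 - p_b)
        # Confidence can only shrink as inferences chain; taking the
        # product of the premise confidences is one simple choice.
        c = ab.confidence * bc.confidence
        return TruthValue(min(max(s, 0.0), 1.0), c)

    if __name__ == "__main__":
        ab = TruthValue(0.9, 0.8)    # A -> B
        bc = TruthValue(0.95, 0.9)   # B -> C
        print(deduce(ab, bc, p_b=0.1, p_c=0.12))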
Similar to PLN in various ways, but uses a different set of formulas for inference. Truth values are represented with a pair of real numbers: strength and confidence.
Open source, written in Lisp.
An extension of Markov networks to first-order logic. Ungrounded first-order logic expressions are hooked together into a graph. Each expression may have a variety of different groundings. The "most likely grounding" is obtained by applying maximum-entropy principles, a.k.a. Boltzmann statistics, computed from a partition function that describes the network. One important stumbling block is that computing the partition function can be intractable. Thus, sometimes a data representation is used such that certain probabilities are solvable in closed form, and the hard (combinatorial) problems are pushed off to clustering algorithms. (See e.g. Hoifung Poon).
MLNs stick to a very simple "truth value": a real number, ranging from 0.0 to 1.0, indicating the probability of an expression being true. Normally, no attempt is made to bound the uncertainty of this truth value, except possibly by analogy to physics (e.g. second-order derivatives expressing permeability, susceptibility, etc., or strong order when far from the Curie temperature). That is, maximum-entropy principles are used to maximize the number of "true" formulas that fit the maximum amount of the (contradictory) input data; however, it is unclear how confident one should be of any given deduction.
Several implementations, including "Alchemy" listed below.
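To make the partition-function machinery concrete, here is a toy Python sketch of a Markov Logic Network over a two-person domain, computed by brute-force enumeration. The predicate, weight, and domain are invented for illustration; real implementations such as Alchemy use approximate inference precisely because Z is generally intractable.

    # Toy Markov Logic Network: P(world) ∝ exp(sum_i w_i * n_i(world)),
    # where n_i counts the true groundings of weighted formula i.
    # Brute-force enumeration over a 2-person domain; real systems
    # avoid this, since the partition function Z is intractable.

    from itertools import product
    from math import exp

    PEOPLE = ["anna", "bob"]

    def n_smoking_causes_cancer(world):
        # Weighted formula 1:  Smokes(x) => Cancer(x)
        return sum(1 for x in PEOPLE
                   if (not world[("S", x)]) or world[("C", x)])

    def worlds():
        atoms = [("S", x) for x in PEOPLE] + [("C", x) for x in PEOPLE]
        for bits in product([False, True], repeat=len(atoms)):
            yield dict(zip(atoms, bits))

    W1 = 1.5  # illustrative weight for the formula above

    def unnormalized(world):
        return exp(W1 * n_smoking_causes_cancer(world))

    Z = sum(unnormalized(w) for w in worlds())  # the partition function

    # Marginal probability that Anna has cancer, given no evidence:
    p = sum(unnormalized(w) for w in worlds() if w[("C", "anna")]) / Z
    print("P(Cancer(anna)) =", p)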
Similar to MLN, but avoids making certain assumptions about Bayesian priors. Rarely applied to logic/reasoning directly. Uses a single real number to represent the probability.
Similar to MLN, but abandons maximum entropy for clustering/classification based on mutual information. That is, datasets are searched empirically for small patterns that have a high value of mutual information. These are then clustered together as appropriate, and the search is repeated on patterns built from the clusters.
Prolog engine, open source. Supports tabling/memoing and well-founded negation. One of the fastest inference engines out there, per the OpenRuleBench results presented at WWW Madrid 2009. Personally, I suspect that this is because of the developers' strong grounding in inference and language-design theory.
Prolog engine. For performance, adds "demand-driven indexing". Like the previous entry, one of the fastest inference engines out there per the same OpenRuleBench results, and presumably for the same reason.
Inference engine, bottom-up. Implements the datalog query system. Has "Magic Set" optimization. Implemented in Java. Immature? LGPL license.
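As an illustration of bottom-up Datalog evaluation, here is a naive fixpoint computation of transitive closure in Python. This is a generic sketch, not this project's code, and it omits the Magic Set optimization mentioned above (which would restrict derivation to facts relevant to a particular query).

    # Naive bottom-up Datalog evaluation for transitive closure:
    #   path(X, Y) :- edge(X, Y).
    #   path(X, Z) :- path(X, Y), edge(Y, Z).
    # Rules are re-applied until no new facts appear (a fixpoint).

    edges = {("a", "b"), ("b", "c"), ("c", "d")}

    def transitive_closure(edge_facts):
        path = set(edge_facts)            # rule 1: every edge is a path
        while True:
            new = {(x, z) for (x, y) in path
                          for (y2, z) in edge_facts if y == y2} - path
            if not new:                   # fixpoint reached
                return path
            path |= new

    print(sorted(transitive_closure(edges)))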
PowerLoom uses a fully expressive, logic-based representation language (a variant of KIF). It uses a natural deduction inference engine that combines forward and backward chaining to derive what logically follows from the facts and rules asserted in the knowledge base. Has interfaces to common-lisp, C++ and Java. GPL license.
Among the first expert system/rule engines ever. Originally from NASA, now public domain. C language. Designed for embedding expert systems into devices, etc. See also the Wikipedia page. Extensive number of features.
Inference engine, specifically tailored to work well with Python.
Primarily an inference engine coupled to an ontology. GPL license.
Drools is a business rule management system (BRMS) and an enhanced rules-engine implementation, ReteOO, based on Charles Forgy's Rete algorithm, tailored for the Java language. Despite using Rete, this is possibly the slowest inference engine out there, as well as the least stable (per the WWW Madrid 2009 OpenRuleBench results).
Function symbols. Meant for event processing, not data processing ...
Use Boolean SAT solvers for traditional propositional logic; use SMT solvers when the formulas include arithmetic expressions.
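To illustrate the propositional (SAT) side of this split, below is a minimal DPLL-style solver in Python, without the unit propagation, clause learning, and branching heuristics real solvers rely on; SMT solvers layer theory reasoning (e.g. arithmetic) on top of such a core. The CNF encoding and example formula are invented for illustration.

    # Minimal DPLL SAT solver over CNF.  Clauses are lists of integer
    # literals (positive = variable, negative = its negation).

    def dpll(clauses, assignment=()):
        clauses = [c for c in clauses
                   if not any(l in assignment for l in c)]   # drop satisfied
        clauses = [[l for l in c if -l not in assignment]
                   for c in clauses]                         # prune false lits
        if any(len(c) == 0 for c in clauses):
            return None                                      # conflict
        if not clauses:
            return assignment                                # all satisfied
        lit = clauses[0][0]                                  # branch
        return (dpll(clauses, assignment + (lit,)) or
                dpll(clauses, assignment + (-lit,)))

    # (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
    cnf = [[1, 2], [-1, 3], [-2, -3]]
    print(dpll(cnf))   # e.g. (1, 3, -2): a satisfying assignment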
Java, on sourceforge. Recommended for small-to-medium systems. A frame-slot type system.
Theorem prover.
Theorem prover. Usually used for formal verification. BSD license.
Theorem prover.
With integrated theorem prover. CMU Lisp. GPL license.
Fast! Small executables! ML/OCaml-like type system. Supports several programming styles: functional (both lazy and eager evaluation), imperative (including safety via theorem proving), and concurrent (multi-core GC). Weakness: very, very new; current version is 0.1.6.
Purely functional programming, good concurrency support, good FFI, good compiler. Lazy evaluation. Weakness: difficult for the programmer to predict time/space performance.
Concurrent, functional, fault-tolerant programming. See also wikipedia article.
Fast! Unifies functional, imperative, and object-oriented programming styles. Provides a strong type and type-inference system derived from ML. Weaknesses: no multi-core/concurrent support; the type system can be subtle; poor FFI and module system. See also the Wikipedia page.
Object-oriented, functional programming. Focus on scalability. Targets the JVM. Good Java integration. Weakness: no tail recursion in the JVM!! which means mutually recursive procedures are icky/slow.
Modern Lisp dialect, targeted at JVM. Good Java integration. Weakness: no tail recursion in JVM!!
A wiki containing an extensive listing of software and other resources is at ACLWeb, and in particular at the Tools and Software page. A small list is at the NLP Resources wiki page at agiri.org. A general overview of the state of the art is at the AAAI Natural Language page.
A particularly important theory is Dick Hudson's Word Grammar.
Other NLP resources include:
See also http://www.singinst.org/research/researchareas
CRF++ is an implementation of Conditional Random Fields. Has pre-existing modules for text chunking, named entity recognition, information extraction. Open source, written in C++.
Includes a shallow parser, a sentence splitter, entity detection, sense annotation (using wordnet senses), etc. Strong Spanish/Latin language support.
OpenNLP is more of a directory of other NLP projects. Includes some good maximum-entropy implementations.
Has a book, multiple articles. Integration into WordNet. Written in python. Not clear whether it has an actual parser. Seems to do some sort of entity extraction, esp. for biomedical terms.
The IMS Open Corpus Workbench (CWB) is a collection of tools for managing and querying large text corpora (100 M words and more) with linguistic annotations. Its central component is the flexible and efficient query processor CQP.
Java, GPL'ed. Big. Also in use for Dialogue processing and Natural Language Generation.
From Carnegie-Mellon. A parser for the English language, based on "link grammar", a novel theory of natural language syntax. Written in C, with a BSD license. English dictionary includes 90K words. Actively maintained.
Built on top of the Carnegie Mellon link parser. Extracts dependency relations from the link data. Creates FrameNet-like semantic frames from the dependency graphs. Includes the ability to handle a multi-sentence corpus, perform entity detection, and perform anaphora (pronoun) resolution via the Hobbs algorithm. Apache v2 license. Written in Java. Actively developed/maintained.
Now includes not one but two(!) natural language generation facilities: NLGen/SegSim and NLGen2.
Rule-driven dependency parser. English, Spanish, Galician, French, and Portuguese. Parser-compiler in Ruby; parser is in Perl. GPL license.
Dependency parser, generating output similar to RelEx. Statistical parser: trains on treebank data, and has been applied to half a dozen different languages. Slow: RelEx+LinkGrammar is 3x to 4x faster. Java, GPL v2 license.
Trainable, fast, accurate dependency parser. Has four different training methods. Uses a fast shift-reduce algorithm for single-pass parsing. Reads CoNLL. C++. unclear license?
Maltparser is a system for data-driven dependency parsing, which will learn a parsing model from treebank data, and can then be used to parse new data using the induced model. Java, BSD license. old URL.
Trainable, fast dependency parser. Uses minimum spanning tree methods. Reads CoNLL. Java, CPL license. download
Incremental Sigmoid Belief Network Dependency Parser. Trainable. GPL license.
Dependency output. Linguist-written rules. GPL license.
Idea from Luc Steels. There is a Lisp implementation at http://www.emergent-languages.org/ and a Java implementation at TexAI.
There is a large list of NL generators located at the ACLWeb Natural Language Generation Portal.
NER is commonly done in one of several ways (a toy gazetteer-lookup baseline, the simplest of these, is sketched after the entries below):
A powerful system for extracting entities and entity relations from free text. See the YAGO-NAGA listing above.
Java, GPL'ed. Big. GATE is supplied with an Information Extraction system called ANNIE, which seems to be focused on "entity extraction".
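As promised above, a toy gazetteer-lookup NER baseline in Python. The entity dictionary and example sentence are invented for illustration; serious systems use statistical taggers (e.g. the CRF-based tools listed earlier) that learn contextual features instead.

    # Toy gazetteer-based NER: the crudest approach to the task.
    # The entity dictionary is invented for this example.

    GAZETTEER = {
        "pittsburgh": "LOCATION",
        "carnegie mellon": "ORGANIZATION",
        "hans kamp": "PERSON",
    }

    def tag_entities(text):
        found = []
        lowered = text.lower()
        for name, label in GAZETTEER.items():
            start = lowered.find(name)
            if start != -1:
                found.append((text[start:start + len(name)], label))
        return found

    print(tag_entities("Hans Kamp lectured at Carnegie Mellon in Pittsburgh."))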
From the website: "Meta-optimizing semantic evolutionary search (MOSES) is a new approach to program evolution, based on representation-building and probabilistic modeling. MOSES has been successfully applied to solve hard problems in domains such as computational biology, sentiment evaluation, and agent control. Results tend to be more accurate, and require less objective function evaluations, in comparison to other program evolution systems. Best of all, the result of running MOSES is not a large nested structure or numerical vector, but a compact and comprehensible program written in a simple Lisp-like mini-language." For details, see Moshe Looks' PhD thesis.
Apache License.
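For readers unfamiliar with program evolution, here is a bare-bones genetic-programming loop in Python that evolves a boolean expression toward the XOR truth table. This is emphatically not MOSES (no representation-building, no probabilistic modeling, not even crossover); it only illustrates what "evolving a program" means at its simplest, and all names and parameters are invented for illustration.

    # Bare-bones program evolution: random boolean expression trees
    # over inputs a, b, selected toward the XOR truth table.

    import random

    OPS = ["and", "or", "not", "a", "b"]

    def random_tree(depth=3):
        op = random.choice(OPS if depth > 0 else ["a", "b"])
        if op == "not":
            return ("not", random_tree(depth - 1))
        if op in ("and", "or"):
            return (op, random_tree(depth - 1), random_tree(depth - 1))
        return op

    def evaluate(tree, a, b):
        if tree == "a": return a
        if tree == "b": return b
        if tree[0] == "not": return not evaluate(tree[1], a, b)
        l, r = evaluate(tree[1], a, b), evaluate(tree[2], a, b)
        return (l and r) if tree[0] == "and" else (l or r)

    def fitness(tree):
        cases = [(a, b, a != b) for a in (0, 1) for b in (0, 1)]  # XOR
        return sum(evaluate(tree, bool(a), bool(b)) == out
                   for a, b, out in cases)

    population = [random_tree() for _ in range(200)]
    for gen in range(50):
        population.sort(key=fitness, reverse=True)
        if fitness(population[0]) == 4:
            break
        # keep the best half, refill with fresh random programs
        # (no crossover/mutation, to keep the sketch tiny)
        population = population[:100] + [random_tree() for _ in range(100)]

    print(population[0], "fitness:", fitness(population[0]))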
Performs clustering using genetic-programming techniques (i.e. attempts to find small algorithmic expressions that will cluster the data). Omniclust is an n-ary agglomerative search algorithm. For details, see "Clustering gene expression data via mining ensembles of classification rules evolved using MOSES", Looks M, Goertzel B, de Souza Coelho L, Mudado M, Pennachin C, Genetic and Evolutionary Computation Conference (GECCO 2007): 407-414. Java codebase.
Java. Has been used to build a POS tagger, end of sentence detector, tokenizer, name finder. LGPL/Apache license.
LGPL license
MIT license
Portable toolkit for building and manipulating hidden Markov models. C source code; the non-free license prohibits redistribution.
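To show the kind of computation an HMM toolkit performs, here is a minimal forward-algorithm sketch in Python, computing the probability of an observation sequence by summing over all hidden state paths. The two-state model and its probabilities are invented for illustration.

    # Forward algorithm for a toy 2-state HMM: the probability of an
    # observation sequence, marginalizing over hidden state paths.

    STATES = ["rainy", "sunny"]
    START  = {"rainy": 0.6, "sunny": 0.4}
    TRANS  = {"rainy": {"rainy": 0.7, "sunny": 0.3},
              "sunny": {"rainy": 0.4, "sunny": 0.6}}
    EMIT   = {"rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
              "sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

    def forward(observations):
        # alpha[s] = P(observations so far, current state = s)
        alpha = {s: START[s] * EMIT[s][observations[0]] for s in STATES}
        for obs in observations[1:]:
            alpha = {s: EMIT[s][obs] * sum(alpha[p] * TRANS[p][s]
                                           for p in STATES)
                     for s in STATES}
        return sum(alpha.values())

    print(forward(["walk", "shop", "clean"]))  # P(observation sequence)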
A particularly interesting subset concerns Compositional data, which is data located on a simplex and/or a projective space.
Caution: All of the systems listed below fail horribly when applied to real-world data sets of any reasonable size -- e.g. datasets with 100K entries. This is typically because they try to compute similarity measures between all 100K x 100K = 10 billion pairs of elements, which is intractable on contemporary single-CPU systems. You can win big by avoiding these systems, and exploiting any sort of pre-existing organization in your data set. Only after breaking your problem down to itty-bitty-sized chunks should you consider any of the below.
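One concrete way to exploit that pre-existing organization: bucket items by a cheap key first, and only compute pairwise similarities within buckets. A Python sketch follows; the keying function is an invented stand-in for whatever structure your data already has.

    # Avoiding the all-pairs trap: bucket by a cheap key, then compare
    # only within buckets.  For K buckets of n/K items each, this cuts
    # the pair count from ~n^2/2 to ~n^2/(2K).

    from collections import defaultdict
    from itertools import combinations

    def bucketed_pairs(items, key):
        buckets = defaultdict(list)
        for item in items:
            buckets[key(item)].append(item)
        for bucket in buckets.values():
            yield from combinations(bucket, 2)   # intra-bucket pairs only

    words = ["cluster", "clutter", "cliques", "banana", "bandana"]
    # e.g. key on the first two characters:
    for a, b in bucketed_pairs(words, key=lambda w: w[:2]):
        print(a, b)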
From their website: "The VLFeat open source library implements popular computer vision algorithms including SIFT, MSER, k-means, hierarchical k-means, agglomerative information bottleneck, and quick shift. It is written in C for efficiency and compatibility, with interfaces in MATLAB for ease of use, and detailed documentation throughout. It supports Windows, Mac OS X, and Linux."
Appears to be aimed at image processing. GPL license.
Assumes data is located on a simplex, and uses that fact in its algorithms. Includes an algorithm for PCA analysis, a partition-based clustering algorithm, and agglomerative hierarchical clustering using the Aitchison distance. Command-line interface. Written in C (no library interfaces currently defined). Focused on genetic/bio data. GPL license.
Mfuzz clustering. Aimed at genetic expression time-series data, claimed to be robust against noise. Uses R language. GPLv2 license.
R-based data mining. GPL.
Data mining, clustering. Java. GPL. From personal experience -- fails totally on any but the very smallest data sets. Dying/dead mailing list.
See also:
Fast, decision-tree-based implementation of k-nearest-neighbor classification. Implements a half-dozen algorithms. GPL'ed. (Might not scale well for large problems?) Used in the MaltParser NLP parser, and thus has been applied to NLP tasks.
A library implementing Support Vector Machines, one of many ways of building a linear classifier.
Per website: "STXXL implements containers and algorithms that can process huge volumes of data that only fit on disks."
Clustering, runs in memory, thus much faster than Hadoop. Scala interfaces.
Implementation of MapReduce ideas in C++.
Implementation of MapReduce ideas in Java. Part of the Apache project.
Database for storing hypergraphs. Pretty cool. Java-based. Strange BSD-like license that requires source code! Compatibility of the license with the GPL is unclear.
The Shard overview describes an alternative to centralized, normalized databases.