AGI -- Artificial General Intelligence

A messy and incomplete list of open source (and some notable closed-source) Artificial General Intelligence projects, as well as lists of various components and tools that can be used within existing, or in new AGI projects. These components cover everything from NLP and language generation to data clustering and machine-learning algorithms, large data stores, knowledgebases, reasoning engines, program-learning systems, and the like.

A good overview is given by Pei Wang's Artificial General Intelligence: A Gentle Introduction. See also the Wikipedia articles on Artificial Consciousness and Strong AI.

See also a large list of free/open-source "narrow AI" software, at the GNU/Linux AI & Alife HOWTO.

Suggested Education for Future AGI Researchers.




AGI: The Whole Enchilada, with all the trimmings

A list of projects that attempt to put together the right ingredients to cook up AGI in full generality. This includes solving the problems of reasoning, learning, planning, acting, and internally modeling the external world (including complex external objects, such as people and their emotional state). These commonly include work on language comprehension and speech, and subsystems for sensory input and motor control. Some of these are open source, some are not. Some are academic projects, some are commercial attempts.
OpenCog

Novamente's general cognition/reasoning system. Includes an NLP subsystem, reasoning, learning, 3D virtual avatar interfaces, and robotics interfaces. Open source, GPL license.

Nominally associated with Artificial General Intelligence Research Institute and SIAI.

Demo: AI Virtual Pet Answering Simple Questions

Pei Wang's NARS project

NARS, the Non-Axiomatic Reasoning System, aims to explain a large variety of cognitive phenomena with a unified theory, and, in particular, reasoning, learning, and planning. The site holds a number of white papers. NARS was an inspiration for OpenCog (which claims to overcome certain limitations of NARS). OpenNARS is Pei Wang's implementation, released under GPLv2.

Stan Franklin's LIDA

An intelligent agent, communicating by email. Built for the US Navy. Based on Baars' Global Workspace Theory. Answers only one question: "What do I do next?". See Tutorial

Ron Sun's CLARION

General framework for running cognitive experiments(?). Java source code available under unspecified license.

TexAI

Aims to couple common-sense knowledge-base systems to natural language text processing. Open source project.

John Weng's SAIL architecture

Seems primarily aimed at robots.

ACT-R

Cognitive architecture research platform, aimed at simulating and understanding human cognition.

Nick Cassimatis's PolyScheme

Polyscheme is a cognitive framework intended to achieve human-level artificial intelligence and to explain the power of human intelligence. Variety of research papers published, no source code available.

Jeff Hawkins Numenta

Commercialized "Hierarchical Temporal Memory"

SNePS

SNePS is a knowledge representation, reasoning, and acting (KRRA) system. See also the Wikipedia page, and a paper by Shapiro, part of the SNePS group.

General Intelligence Research Group
Yan King Yin (YKY)'s project, an attempt to build a semi-open, semi-closed source AGI project.



Semantic systems

Systems which handle the natural-language and semantic aspects of AGI, without striving for full-fledged "consciousness", 3D embodiment, natural-language output (speech), planning of actions and activities, or awareness and modelling of complex external systems (i.e. awareness of external human actors), self-awareness, etc. Note that some of the systems listed above also fail to handle e.g. embodiment, but nonetheless seem to have a more expansive vision and set of goals. By contrast, the systems below really try to stick to the "straight and narrow", without making overly broad claims.
Alchemy

Primarily an implementation of Markov Logic Networks (MLN). MLN are remarkable because they unify, in a single conceptual framework, both statistical and logical (reasoning, first-order-logic) approaches to AI. This seems to endow the theory with a particularly strong set of powers, and in particular, the ability to learn, without supervision, some of the harder NLP tasks, such as dependency grammars, automatic thesaurus/synonym-set learning, entity extraction, reasoning, textual entailment, etc.

the Beast

Primarily an implementation of Markov Logic Networks, for Statistical Relational Learning, including dependency parsing, semantic role labelling, etc. Perhaps more NLP-focused than Alchemy.

YAGO

YAGO is a huge semantic knowledge base, consisting primarily of information about entities. Contains 2M entities, and 20M facts about them. The YAGO-NAGA project also includes SOFIE, a system for automatically extending an ontology via NLP and reasoning.

FreeHAL

FreeHAL appears to be a chatbot of some sort. Hard to tell whether this is a "real" system or a hack.

Nutcracker and Boxer

Nutcracker performs textual entailment using a first-order-logic (FOL) theorem prover, and an FOL model builder. Built on top of Boxer, which takes the output of a combinatory categorial grammar (CCG) parser, and converts this to first-order logic based on Hans Kamp's "Discourse Representation Theory".

Written in Prolog. Non-free license; bars commercial use.

MultiNet

The MultiNet paradigm - Knowledge Representation with Multilayered Extended Semantic Networks by Hermann Helbig. Wires up NLP processing to hard-wired upper ontology, and adds reasoning. No source code available.

Project Halo

Developed by Vulcan Inc. in association with SRI International, Cyc Corp. and the UTexas/Austin CS/AI labs, aims to provide reasoning and question-answering over large data sets. All knowledge entry is done manually, by experts. Some research results are available publicly.

OntoSem - Ontological Semantics

Developed by Hakia Labs, proprietary, commercial software for taking NLP input and generating ontological frames/expressions from it. See also ontologicalsemantics.com.

Powers Hakia search.




Ontologies, Knowledge Bases and Reasoning Engines

Some ontologies are understood as "stand-alone", while other ontologies are incomplete when considered without a reasoning engine. An example of the latter is the OpenCyc ontology, which, when examined as a dataset, superficially appears to be incomplete, messy and capricious. However, when coupled with its reasoning system, it becomes complete. This is because many important "facts" are only one, two or three deductive steps away from the core dataset. Thus, the ontology provides an "armature", while the reasoning system provides the "clay" to fill in the gaps.
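The "armature and clay" point can be made concrete with a toy forward chainer: facts absent from the core dataset, but one or two deductive steps away, are recovered by running a rule to fixpoint. The facts and the single transitivity rule below are hypothetical examples, not Cyc data.

```python
# Toy illustration: many "facts" are absent from the core dataset but lie
# only a deductive step or two away.  A forward chainer fills the gaps.

def closure(facts, max_steps=5):
    """Repeatedly apply an 'isa' transitivity rule until fixpoint."""
    facts = set(facts)
    for _ in range(max_steps):
        derived = {(a, "isa", c)
                   for (a, r1, b) in facts if r1 == "isa"
                   for (b2, r2, c) in facts if r2 == "isa" and b2 == b}
        if derived <= facts:        # nothing new: fixpoint reached
            break
        facts |= derived
    return facts

core = {("Fido", "isa", "dog"), ("dog", "isa", "mammal"),
        ("mammal", "isa", "animal")}
full = closure(core)
# ("Fido", "isa", "animal") was never asserted, but is two steps away.
print(("Fido", "isa", "animal") in full)  # True
```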

A giant list can be found at Peter Clark's Some Ongoing KBS/Ontology Projects and Groups. Problems with ontologies are reviewed in Ontology Development Pitfalls.

Big ones include

ConceptNet3

Common-sense knowledgebase. Large. GPL license. Users can edit data online, at http://torg.media.mit.edu:3000/

Open Mind Common Sense

A collection of English-language sentences, rather than a strict upper ontology. This is actually quite convenient if you have a good NLP input system, as it helps avoid the strictures of pre-designed ontologies, and instead gets you to deal with the structure of your NLP-to-KR layer. From MIT. Large: 700K sentences.

YAGO

YAGO is a huge semantic knowledge base, consisting primarily of information about entities. Contains 2M entities, and 20M facts about them. The YAGO-NAGA project also includes SOFIE, a system for automatically extending an ontology via NLP and reasoning.

WordNet

Semantic network.

See also: Wordnet::Similarity, a Perl module implementing various word-similarity measures from WordNet data; i.e. thesaurus-like.

Historical Thesaurus of English

Licensing is unclear.

SUMO - Suggested Upper Merged Ontology

SUMO WP article. Includes an open source Sigma knowledge engineering environment, includes a theorem prover. Sigma uses KIF.

"The largest formal public ontology in existence", available under GPL (although OpenCyc is arguably bigger, and is also free). Has mappings to WordNet.

OpenCyc

Large KB under the Artistic License. Source for the engine is not available. The KB seems messy and capricious, and the upper ontology is not clear. See, however, the remarks above.

ThoughtTreasure

Common sense KB, available in CycL. GPL'ed

Conceptual Nets

A knowledge representation system. Conceptual Graph Interchange Format is an ISO standard. See also "Common Logic Interchange Format (CLIF)", which is more lisp-like.

Ontolingua

Seems well-engineered. Actual KB is slim. Source not available. Might be a dead project??

GFO - General Formal Ontology

Provides a firm theoretical foundation for representing ontologies; no actual data. OWL version of GFO under a modified BSD license. Examples include the periodic table of elements, amino acids. See also WP article.

DOLCE - Descriptive Ontology for Linguistic and Cognitive Engineering
SENSUS - An extended, re-organized version of WordNet. Does not appear to be publicly available or maintained any more?
PSL - Process Specification Language
BFO - Basic Formal Ontology
SOAR expert system
DAML+OIL
Obsoleted by OWL
KIF - Knowledge Interchange Format
Obsoleted by SOU-KIF (used in SUMO)

Unstructured data

Datasets that do not rely on an ontology, or do so only weakly.
OpenCalais
Named Entity Recognition (NER). Commercial service, free for low volumes.
Triplestore
triplestore.com
DBpedia
Zemanta
Mizar2KIF
Mizar is a markup language for expressing mathematical statements in a machine-readable format. There are thousands of theorems written in Mizar. Unfortunately, Mizar is hard to comprehend, and the theorem prover is proprietary. The Mizar2KIF project aims to create a tool to export KIF from Mizar input.



Reasoning engines/Inference engines

There are two primary ways in which reasoning is approached these days: through crisp logic (using Boolean true/false truth values), and through various approaches to fuzzy or probabilistic logic.

Below is a list of reasoning and/or inference engines only, without accompanying ontologies/datasets.

Probabilistic Reasoning engines/Inference engines

There appear to be five primary approaches: PLN, NARS, MLN, CRF, HMI. One of the primary difficulties in probabilistic reasoning is inference control: since nothing is ever strictly true or false, there is a huge combinatorial explosion during reasoning, and effective strategies must be found to control this. Another important problem is loss of precision: after a few inference steps, the uncertainties can compound in such a way that all confidence in the resulting truth value is lost. Thus, inference control sometimes focuses on maximizing the certainty or confidence of a deduction, rather than maximizing its truth.
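The loss-of-precision problem can be illustrated with a toy calculation. This assumes a simple multiplicative combination of per-step confidences, purely for illustration; PLN, NARS and the rest each use their own, more elaborate truth-value formulas.

```python
# Illustration of precision loss in a chain of uncertain deductions.
# Assumption: confidences combine multiplicatively (a simplification;
# real systems use richer truth-value arithmetic).

def chain_confidence(step_confidence, n_steps):
    """Confidence remaining after n inference steps."""
    return step_confidence ** n_steps

c = 0.9  # each single inference step is 90% confident
for n in (1, 3, 5, 10):
    print(n, round(chain_confidence(c, n), 3))
# After ~10 steps at 0.9 per step, confidence falls below 0.35, which is
# why inference control tends to prefer short, high-confidence chains.
```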
Ben Goertzel's PLN Probabilistic Logic Networks

Implements a probabilistic analog of first-order logic. Ideal for uncertain inference. Beta available now. In the process of being ported to OpenCog. First-order logic statements are expressed in terms of hypergraphs. The nodes and edges of the hypergraphs can hold various different "truth value" structures. A set of basic types defines how truth values are to be combined, resulting in the primitives needed for uncertain reasoning. These are described in Ben Goertzel's book of the same name. A specific claim is that the rules are explicitly founded on probability theory.

Truth values are probability distributions, usually represented as compound objects, e.g. having not only a probability, but also upper and lower bounds on the uncertainty of the probability estimate.

The actual implementation works primarily by applying typed pattern matching to hypergraphs, to implement a backward chainer. That is, PLN defines a typed first-order logic; it does not (yet?) define a typed functional programming language (although it comes close to doing so). Inference control is through various algorithms, including "economic attention allocation" and Hebbian activation nets.

GNU Affero GPLv3 license.

Pei Wang's NARS Non-Axiomatic Reasoning System

Similar to PLN in various ways, but uses a different set of formulas for inference. Truth values are represented with a pair of real numbers: strength and confidence.

Open source, written in Lisp.

Pedro Domingos' MLN Markov Logic Networks

An extension of Markov networks to first-order logic. Ungrounded first-order logic expressions are hooked together into a graph. Each expression may have a variety of different groundings. The "most likely grounding" is obtained by applying maximum-entropy principles, a.k.a. Boltzmann statistics, computed from a partition function that describes the network. One important stumbling block is that computing the partition function can be intractable. Thus, sometimes a data representation is used such that certain probabilities are solvable in closed form, and the hard (combinatorial) problems are pushed off to clustering algorithms. (See e.g. Hoifung Poon.)

MLNs stick to a very simple "truth value" -- a real number, ranging from 0.0 to 1.0 -- indicating the probability of an expression being true. Normally, no attempt is made to bound the uncertainty of this truth value, except possibly by analogy to physics (e.g. second-order derivatives expressing permeability, susceptibility, etc., or strong order when far from the Curie temperature). That is, maximum-entropy principles are used to maximize the number of "true" formulas that fit the maximum amount of the (contradictory) input data. However, it is unclear how confident one should be of a given deduction.

Several implementations, including Alchemy, listed above.
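The core MLN idea fits in a few lines: the probability of a possible world is proportional to the exponentiated, weighted count of satisfied formula groundings, normalized by the partition function. The formula, weight, and atoms below are made-up toy choices.

```python
import itertools, math

# Minimal Markov Logic Network sketch: one weighted formula
#   Smokes(A) => Cancer(A)   with weight w
# over two boolean ground atoms.  P(world) = exp(w * n(world)) / Z,
# where n(world) counts satisfied groundings of the formula.

w = 1.5

def n_satisfied(smokes, cancer):
    # The implication fails only when smokes is true and cancer is false.
    return 0 if (smokes and not cancer) else 1

worlds = list(itertools.product([False, True], repeat=2))
weights = {wd: math.exp(w * n_satisfied(*wd)) for wd in worlds}
Z = sum(weights.values())                  # the partition function
prob = {wd: weights[wd] / Z for wd in worlds}

# The one world violating the formula is exp(w) times less likely.
print(prob[(True, False)] < prob[(True, True)])  # True
```

Even here the partition function Z requires summing over every world; with thousands of ground atoms that sum has exponentially many terms, which is exactly the intractability mentioned above.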

CRF Conditional Random Fields

Similar to MLN, but avoids making certain assumptions about Bayesian priors. Rarely applied to logic/reasoning directly. Uses a single real number to represent the probability.

HMI Hierarchical Mutual Information

Similar to MLN, but abandons maximum entropy in favor of clustering/classification based on mutual information. That is, datasets are searched empirically for small patterns that have a high value of mutual information. These are then clustered together as appropriate, and the search is repeated on patterns built from the clusters.

Inductive Logic Programming

Logic programming is the act of specifying programs as logical statements/assertions. Examples of logic programming languages include Prolog and Datalog. Inductive logic programming is the act of automatically learning new logic programming rules.

ILP

Crisp-logic reasoning engines/Inference engines

Reasoning engines that employ crisp logic -- i.e. Boolean true/false truth values only. In general, crisp-logic reasoning is a lot simpler than uncertain reasoning, since the combinatorial explosion is far smaller, and loss of precision is not a concern.

XSB

Prolog engine, open source. Supports tabling/memoing, well-founded negation. This is one of the fastest inference engines out there, per results of the Madrid 2009 Semantic Web OpenRuleBench results. Personally, I suspect that this is because of a strong grounding in inference and language design theory on the part of the developers.

Yap

Prolog engine. For performance, adds "demand-driven indexing". This is one of the fastest inference engines out there, per results of the Madrid 2009 Semantic Web OpenRuleBench results. Personally, I suspect that this is because of a strong grounding in inference and language design theory on the part of the developers.

IRIS

Inference engine, bottom-up. Implements the datalog query system. Has "Magic Set" optimization. Implemented in Java. Immature? LGPL license.

PowerLoom

PowerLoom uses a fully expressive, logic-based representation language (a variant of KIF). It uses a natural deduction inference engine that combines forward and backward chaining to derive what logically follows from the facts and rules asserted in the knowledge base. Has interfaces to common-lisp, C++ and Java. GPL license.

CLIPS - A Tool for Building Expert Systems

Among the first expert system/rule engines ever. Originally from NASA, now public domain. C language. Designed for embedding expert systems into devices, etc. See also the Wikipedia page. Extensive number of features.

PyKE

Inference engine, specifically tailored to work well with Python. Features:

The Scone Knowledge-Base Project

Sigma Knowledge Engineering Environment

Primarily an inference engine coupled to an ontology. GPL license.

DROOLS

Drools is a business rule management system (BRMS) and an enhanced rule-engine implementation, ReteOO, based on Charles Forgy's Rete algorithm, tailored for the Java language. Despite using Rete, this is possibly one of the slowest inference engines out there, as well as the least stable (per the WWW Madrid 2009 Semantic Web OpenRuleBench results).

Prova

Supports function symbols. Meant for event processing rather than data processing.

Boolean SAT, SMT Propositional logic solvers

Use Boolean SAT for traditional propositional-logic solvers; use SMT for solvers that also handle arithmetic expressions.
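A toy illustration of the propositional-satisfiability problem these solvers address. Real SAT solvers use DPLL/CDCL search rather than brute force; the CNF instance below is a made-up example.

```python
import itertools

# A SAT instance in CNF: each clause is a list of literals, where a
# positive int denotes a variable and a negative int its negation.
# Brute force is fine for toy instances only.

def brute_force_sat(clauses, n_vars):
    for bits in itertools.product([False, True], repeat=n_vars):
        assign = {i + 1: bits[i] for i in range(n_vars)}
        if all(any(assign[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return assign                  # a satisfying assignment
    return None                            # unsatisfiable

# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
cnf = [[1, 2], [-1, 3], [-2, -3]]
model = brute_force_sat(cnf, 3)
print(model is not None)  # True: the instance is satisfiable
```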

Algernon - Rule-Based Programming

Java, on sourceforge. Recommended for small-to-medium systems. A frame-slot type system.

Theorem provers

Theorem provers are primarily meant for formal verification of systems, commonly of hardware designs, but also of mathematical statements, etc.
The E Equational Theorem Prover

Theorem prover.

HOL Higher Order Logic

Theorem prover. Usually used for formal verification. BSD license.

Prover9/Mace4
Prover9 is a theorem prover for first-order and equational logic. Mace4 searches for finite models and counterexamples.
SPASS Automated Theorem Prover for First-Order Logic

Theorem prover.

PVS Specification and Verification System

With integrated theorem prover. CMU Lisp. GPL license.




Expressivity

It's painfully clear that creating AI requires a subtle interplay between querying, pattern matching, and imperative, algorithmic processing. For example, rule engines require one to write rules whose first part, the predicate, is meant to be a pattern match against the output of other rules. More generally, much of the effort in the "semantic web", SPARQL, etc. is about creating queries (as in SQL) -- but a pattern match returns a large glob of data. Once one has this data, one then has to apply some algorithm to it. And then... lather, rinse, repeat. One is thus faced with an infrastructure problem: what infrastructure is best for doing all of the above? One of the most interesting new approaches is Barry Jay's "Pattern Calculus" and the Bondi programming language, which promises to provide a foundation on which all of the above can be built, correctly, this time. Other "hot" programming languages that attempt to solve many of the irritating, horrid problems experienced in older, more popular languages:
ATS

Fast! Small executables! ML/OCaml-like type system. Supports several programming styles: functional programming (both lazy and eager evaluation), imperative programming (including safety via theorem proving), and concurrent programming (multi-core GC). Weakness: very, very new; current version is 0.1.6.

Haskell

Purely functional programming, good concurrency support, good FFI, good compiler. Lazy evaluation. Weakness: difficult for the programmer to predict time/space performance.

Erlang

Concurrent, functional, fault-tolerant programming. See also wikipedia article.

OCaml

Fast! Unifies functional, imperative, and object-oriented programming styles. Provides a strong type and type-inference system derived from ML. Weaknesses: no multi-core/concurrent support; the type system can be subtle; poor FFI and module system. See also the Wikipedia page.

Scala

Object-oriented, functional programming. Focus on scalability. Targets the JVM. Good Java integration. Weakness: no tail recursion in the JVM!! which means mutually recursive procedures are icky/slow.

Clojure

Modern Lisp dialect, targeted at JVM. Good Java integration. Weakness: no tail recursion in JVM!!




NLP - Natural Language Processing

A wiki containing an extensive listing of software and other things is at ACLWeb, and in particular, at the Tools and Software page. A small list is at the NLP Resources wiki page at agiri.org. A general overview of the state of the art is at the AAAI Natural Language page.

A particularly important theory is Dick Hudson's Word Grammar.

Other NLP resources include:

Morphology software
A wiki of morphology s/w
VerbOcean
A set of semantic-like verb frames.
FrameNet
A set of semantic-like frames. Free for personal use, but has commercial license.
WordNet
Dictionary of synonyms, antonyms, etc.

See also http://www.singinst.org/research/researchareas

General NLP Tool Sets

CRF++

CRF++ is an implementation of Conditional Random Fields. Has pre-existing modules for text chunking, named entity recognition, information extraction. Open source, written in C++.

Freeling

Includes a shallow parser, a sentence splitter, entity detection, sense annotation (using wordnet senses), etc. Strong Spanish/Latin language support.

OpenNLP

OpenNLP is more of a directory of other NLP projects. Includes some good maximum-entropy implementations.

NLTK -- Natural Language Toolkit

Has a book and multiple articles. Integration with WordNet. Written in Python. Not clear whether it has an actual parser. Seems to do some sort of entity extraction, esp. for biomedical terms.

CWB

The IMS Open Corpus Workbench (CWB) is a collection of tools for managing and querying large text corpora (100 M words and more) with linguistic annotations. Its central component is the flexible and efficient query processor CQP.

GATE - General Architecture for Text Engineering

Java, GPL'ed. Big. Also in use for Dialogue processing and Natural Language Generation.

Lingpipe

NLP Parsers

Another kind of useful linguistic resource is the NLP parser. Some free NLP parsers are:
Link Grammar Parser

From Carnegie-Mellon. A parser for the English language, based on "link grammar", a novel theory of natural language syntax. Written in C, with a BSD license. English dictionary includes 90K words. Actively maintained.

RelEx Dependency Grammar and Semantic Relationship Extractor

Built on top of the Carnegie Mellon link parser. Extracts dependency relations from link data. Creates FrameNet-like semantic frames from the dependency graphs. Includes the ability to handle multi-sentence corpora, entity detection, and anaphora (pronoun) resolution via the Hobbs algorithm. Apache v2 license. Written in Java. Actively developed/maintained.

Now includes not one but two natural-language generation facilities: NLGen/SegSim and NLGen2.

DepPattern

Rule-driven dependency parser. English, Spanish, Galician, French, and Portuguese. Parser-compiler in Ruby; parser is in Perl. GPL license.

Stanford Parser

Dependency parser, generating output similar to RelEx. Statistical parser. Trains on treebank data; has been applied to half a dozen different languages. Slow; RelEx+link-grammar is 3x to 4x faster. Java, GPL v2 license.

DeSR

Trainable, fast, accurate dependency parser. Has four different training methods. Uses a fast shift-reduce algorithm for single-pass parsing. Reads CoNLL. C++. Unclear license.

Maltparser

Maltparser is a system for data-driven dependency parsing, which will learn a parsing model from treebank data, and can then be used to parse new data using the induced model. Java, BSD license. old URL.

MSTParser

Trainable, fast dependency parser. Uses minimum spanning tree methods. Reads CoNLL. Java, CPL license. download

ISBN Dependency Parser

Incremental Sigmoid Belief Network Dependency Parser. Trainable. GPL license.

Constraint Grammar

Dependency output. Linguist-written rules. GPL license.

Fluid Construction Grammar

Idea from Luc Steels. There is a Lisp implementation at http://www.emergent-languages.org/ and a Java implementation in TexAI.

NLP text generators

Automated NL translation systems typically have generators; however, these are statistical, and cannot be directly controlled. It's nice to have a rule-based NL generator that can "learn" new forms of expression.

There is a large list of NL generators located at the ACLWeb Natural Language Generation Portal.

NLGen, SegSim, NLGen2
Text generation modules compatible with link-grammar/RelEx. See the link-grammar/RelEx references for more details.
Penman sentence generation system
GATE
Described above, has a generator system.

Coreference Resolution

Includes the problem of Anaphora resolution. Best-known is the Hobbs algorithm for anaphora resolution. RelEx implements this algorithm.
BART
BART, short for "Beautiful Anaphora Resolution Toolkit", uses machine-learning and maximum-entropy statistical techniques to learn entities and identify them. Java, Apache license.

Word Sense Disambiguation

Word sense disambiguation attempts to determine which of multiple possible semantic senses is used in a sentence. A good set of references and code are on Rada Mihalcea's senseval.org page. Code is under the GPL license. See also:

Named Entity Recognition, Entity Extraction

Other NL tasks include named-entity recognition (NER), or entity extraction. Entity extraction refers to the recognition of names, dates, and places in a body of text. Related is the recognition of technical terms.

NER is commonly done in one of several ways:

A gazetteer list or a shallow parser can be created in several ways. A large list of open-source tools is given in the Wikipedia NER article.
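The gazetteer approach is the simplest of these: scan text against a fixed list of known names. A minimal sketch, where the gazetteer entries are hypothetical examples:

```python
# Gazetteer-based NER sketch: tag substrings that match a hand-built
# list of known entities.  The entries below are hypothetical examples.

GAZETTEER = {
    "Pittsburgh": "PLACE",
    "Carnegie Mellon": "ORG",
    "Pei Wang": "PERSON",
}

def tag_entities(text):
    """Return (entity, type, offset) for each gazetteer hit."""
    hits = []
    for name, etype in GAZETTEER.items():
        start = text.find(name)
        if start >= 0:
            hits.append((name, etype, start))
    return sorted(hits, key=lambda h: h[2])

print(tag_entities("Pei Wang lectured at Carnegie Mellon."))
# [('Pei Wang', 'PERSON', 0), ('Carnegie Mellon', 'ORG', 21)]
```

Real systems combine such lists with statistical sequence models (CRFs, maximum entropy) to catch names that no list contains.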
SOFIE A Self-Organizing Framework for Information Extraction

A powerful system for extracting entities and entity relations from free text. See the YAGO-NAGA listing above.

GATE - General Architecture for Text Engineering

Java, GPL'ed. Big. GATE is supplied with an Information Extraction system called ANNIE, which seems to be focused on "entity extraction".

CRF++
CRF++ is an open-source tool for conditional random fields. It does named-entity recognition among other things.
OpenCalais
Online, commercial service. Free for limited volumes.

Other

Other tools of interest.
SiteScraper
Scraping language content out of web forums.



Program learning

The idea behind "program learning" is to take some dataset, and to describe it in a more compact form as an algorithm. Conceptually, it requires deducing an algorithm, given a sample of the input and expected output. For example, program learning *might* be able to "compress" large (hidden) Markov models and/or Bayesian nets into smaller, faster, more manageable algorithms. Intuitively, this would seem to be a critical feature for AGI -- the ability to take some learned, ad-hoc data (Bayes nets, Markov chains) and convert it into small, effective procedures. One of the most popular program-learning algorithms is genetic programming.
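The idea can be illustrated with a toy program-learner that searches a tiny space of linear expressions for one reproducing the sample data -- "compressing" the table into an algorithm. Real systems such as MOSES search vastly richer program spaces with evolutionary methods; the samples and search space here are made-up examples.

```python
import itertools

# Program learning in miniature: given input/output samples, search a
# small space of candidate programs for one that reproduces the data.

samples = [(0, 1), (1, 3), (2, 5), (3, 7)]   # hidden rule: y = 2*x + 1

def candidates():
    # Candidate space: all programs of the form  x -> a*x + b.
    for a, b in itertools.product(range(-3, 4), repeat=2):
        yield (a, b), (lambda x, a=a, b=b: a * x + b)

def learn(samples):
    for params, prog in candidates():
        if all(prog(x) == y for x, y in samples):
            return params                    # found a fitting program
    return None

print(learn(samples))  # (2, 1)  -- i.e. the program y = 2*x + 1
```

Four data points have been replaced by a two-parameter procedure; that compression is the whole point.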
MOSES Meta-Optimizing Semantic Evolutionary Search

From the website: "Meta-optimizing semantic evolutionary search (MOSES) is a new approach to program evolution, based on representation-building and probabilistic modeling. MOSES has been successfully applied to solve hard problems in domains such as computational biology, sentiment evaluation, and agent control. Results tend to be more accurate, and require less objective function evaluations, in comparison to other program evolution systems. Best of all, the result of running MOSES is not a large nested structure or numerical vector, but a compact and comprehensible program written in a simple Lisp-like mini-language." For details, see Moshe Looks' PhD thesis.

Apache License.

OpenBioMind

Performs clustering using genetic programming techniques. (i.e. attempts to find small algorithmic expressions that will cluster the data). Omniclust is an n-ary agglomerative search algorithm. For details, see, Clustering gene expression data via mining ensembles of classification rules evolved using moses. Looks M, Goertzel B, de Souza Coelho L, Mudado M, Pennachin C. Genetic and Evolutionary Computation Conference. (GECCO 2007): 407-414. Java codebase.




Machine Learning

Misc. machine learning. See also:
HBC: Hierarchical Bayes Compiler
HBC is a toolkit for implementing hierarchical Bayesian models. Model is described using a special markup language, and then code is generated: C, Java, matlab. (Tool itself is written in Haskell.)
Maxent

Java. Has been used to build a POS tagger, end of sentence detector, tokenizer, name finder. LGPL/Apache license.

Maximum Entropy Modeling Toolkit for Python and C++

LGPL license

Pebl Python Environment For Bayesian Learning

MIT license

HTK Hidden Markov Model Toolkit

Portable toolkit for building and manipulating hidden Markov models. C source code, non-free-license prohibits redistribution.




Data Clustering, simple Classifiers

Linear classifiers, data dimension reduction, data clustering, PCA (principal component analysis), etc. An overview is given by The Impoverished Social Scientist's Guide to Free Statistical Software and Resources.

A particularly interesting subset concerns Compositional data, which is data located on a simplex and/or a projective space.

MCL Markov Clustering
MCL -- "a clustering algorithm for graphs" -- appears to be an excellent clustering algorithm: it does not require supervision (i.e. does not require the number of clusters to be specified a priori) and seems to be very scalable, with performance O(N k^2). The scalability is particularly important, in light of the note below. MCL is covered under the GPLv3. (I have no personal experience with this yet, but expect to "real soon now".)

Caution: All of the systems listed below fail horribly when applied to real-world data sets of any reasonable size -- e.g. datasets with 100K entries. This is typically because they try to compute similarity measures between all 100K x 100K = 10 billion pairs of elements, which is intractable on contemporary single-CPU systems. You can win big by avoiding these systems, and exploiting any sort of pre-existing organization in your data set. Only after breaking your problem down to itty-bitty-sized chunks should you consider any of the below.
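The blowup, and the payoff from exploiting pre-existing structure, can be made concrete. The bucketing key below is a made-up example; any cheap grouping that keeps similar items together works the same way.

```python
from collections import defaultdict

# Why naive all-pairs clustering fails at scale: N items imply roughly
# N^2 similarity computations.  Bucketing by a cheap key first shrinks
# the number of candidate pairs dramatically.

def all_pairs(n):
    return n * (n - 1) // 2

def bucketed_pairs(items, key):
    buckets = defaultdict(list)
    for it in items:
        buckets[key(it)].append(it)
    # Only compare items within the same bucket.
    return sum(all_pairs(len(b)) for b in buckets.values())

N = 100_000
print(all_pairs(N))   # 4,999,950,000 unordered pairs -- hopeless

# Toy dataset: bucket words by first letter before comparing.
words = ["apple", "apricot", "banana", "berry", "cherry"]
print(bucketed_pairs(words, key=lambda w: w[0]))  # 2  (vs 10 all-pairs)
```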

VLFeat

From their website: "The VLFeat open source library implements popular computer vision algorithms including SIFT, MSER, k-means, hierarchical k-means, agglomerative information bottleneck, and quick shift. It is written in C for efficiency and compatibility, with interfaces in MATLAB for ease of use, and detailed documentation throughout. It supports Windows, Mac OS X, and Linux."

Appears to be aimed at image processing. GPL license.

SimCluster

Assumes data is located on a simplex, and uses that fact in its algorithms. Includes an algorithm for PCA analysis, another using a partition clustering algorithm, and an agglomerative hierarchical clustering using the Aitchison distance. Command-line interface. Written in C. (No library interfaces currently defined.) Focused on genetic/bio data. GPL license.

Mfuzz

Mfuzz clustering. Aimed at genetic expression time-series data, claimed to be robust against noise. Uses R language. GPLv2 license.

Rattle

R-based data mining. GPL.

Weka Machine Learning

Data mining, clustering. Java. GPL. From personal experience -- fails totally on any but the very smallest data sets. Dying/dead mailing list.

Learning classifiers

Classifiers that need a distinct "training" or "learning" phase before they can be used on general-purpose data. This includes SVM Support vector Machines, and classifier neural nets.

See also:

TiMBL

Fast, decision-tree-based implementation of k-nearest-neighbor classification. Implements a half-dozen algorithms. GPL'ed. (Might not scale well for large problems?) Used in the MaltParser NLP parser, and thus has been applied to NLP tasks.

libSVM

Library that implements Support Vector Machines, one of many ways of building a linear classifier.
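For contrast with SVMs, the simplest linear classifier of all is the perceptron: where an SVM finds the maximum-margin separating hyperplane, the perceptron merely finds *a* separating hyperplane. A minimal sketch on a made-up, linearly separable toy set:

```python
# Perceptron: the minimal trainable linear classifier.  Repeatedly
# nudge the weight vector toward any misclassified point.

def train_perceptron(data, epochs=20, lr=1.0):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), label in data:          # label is +1 or -1
            if label * (w[0] * x1 + w[1] * x2 + b) <= 0:  # misclassified
                w[0] += lr * label * x1
                w[1] += lr * label * x2
                b += lr * label
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1

# Toy data: points above the line x2 = x1 are +1, below are -1.
data = [((0, 1), 1), ((1, 2), 1), ((1, 0), -1), ((2, 1), -1)]
w, b = train_perceptron(data)
print([predict(w, b, x) for x, _ in data])  # [1, 1, -1, -1]
```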




Databases, distributed processing

Datamining, statistical learning and common-sense knowledge bases need infrastructure for storing all that data in a persistent, searchable, structured manner, ideally so that hundreds or thousands of clients can get at it. A popular paradigm at this time is MapReduce, or "distributed processing using key-value generation and reduction primitives".
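The MapReduce contract itself fits in a few lines: a map function emits key-value pairs, the framework groups them by key, and a reduce function folds each group. A single-machine word-count sketch (Hadoop and friends distribute exactly this over many machines):

```python
from collections import defaultdict
from itertools import chain

# MapReduce in miniature: map() emits (key, value) pairs, the framework
# groups them by key, and reduce() folds each group to a result.

def map_fn(line):
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    return key, sum(values)

def map_reduce(lines):
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(map(map_fn, lines)):
        grouped[key].append(value)                 # the "shuffle" phase
    return dict(reduce_fn(k, vs) for k, vs in grouped.items())

counts = map_reduce(["the quick fox", "the lazy dog", "the fox"])
print(counts["the"], counts["fox"])  # 3 2
```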
STXXL: Standard Template Library for Extra Large Data Sets

Per website: "STXXL implements containers and algorithms that can process huge volumes of data that only fit on disks."

Spark

Clustering; runs in memory, and is thus much faster than Hadoop. Scala interfaces.

HyperTable

Implementation of MapReduce ideas in C++.

Hadoop

Implementation of MapReduce ideas in Java. Part of the Apache project.

Hypergraph DB

Database for storing hypergraphs. Pretty cool. Java-based. Strange BSD-like license that also requires source code; compatibility of the license with the GPL is unclear.

Shard databases

Shard overview describes an alternative to centralized, normalized databases.




Algorithms

Judy trees
Judy arrays provide a very fast array/tree structure, primarily because they were designed to avoid cache misses. This is an important low-level technology. C library, LGPL license.



Narrow AI

Misc entries
scikit-learn
Easy-to-use, general-purpose machine learning in Python. Scikit-learn integrates machine-learning algorithms into the tightly-knit scientific Python world, building upon numpy, scipy, and matplotlib.
RapidMiner (YALE)
Java data mining.
OntoWiki and Powl
Semantic web development. Screenshots show business-type apps: addressbook, calendar, etc. Powl seems to be a class and GUI designer. GPL license.
OWL API
Java interface for the W3C Web Ontology Language OWL. LGPL license.
Siafu: an Open Source Context Simulator
Simulate individual agents
Jamocha - one engine for all your rules.
Rule engine



Test Datasets

Datasets that can be used to evaluate narrow AI algorithms.
TechTC - Technion Repository of Text Categorization Datasets
Text categorization datasets. The overall idea is to read a block of text, and decide if it belongs to category A or to B. For algorithms such as SVM, it is common to simply do a word count, and classify based on that. Thus, the predigested datasets are of the form +1 or -1 (to indicate A or B), and a list of word counts (e.g. word 6578 occurs 2 times; word 6579 occurs 0 times ... etc.)
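A Python sketch of parsing one such predigested example, assuming the common SVMlight-style "label index:count" layout that sparse word-count datasets usually use (the actual TechTC file format may differ):

```python
def parse_line(line):
    """Parse one sparse example: a +1/-1 label (category A or B)
    followed by 'word_index:count' fields; omitted words count 0."""
    fields = line.split()
    label = int(fields[0])
    counts = {}
    for field in fields[1:]:
        index, count = field.split(":")
        counts[int(index)] = int(count)
    return label, counts

label, counts = parse_line("+1 6578:2 6580:1")
# label == 1, counts == {6578: 2, 6580: 1}
```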



Embodiment, Avatars, Robotics

The Hanson Robotics heads are quite -- interesting.
Player/Stage/Gazebo
Robot control and sensor processing. GPL.
MOAST
The Mobility Open Architecture Simulation and Tools (MOAST) framework aids in the development of autonomous robots. It includes an architecture, control modules, interface specs, and data sets and is fully integrated with the USARSim simulation system.
OpenJaus
Robotics messaging. Military standard.
MicroPsi
Study of emotional agents. Simple virtual robotic agents roam a 3D world and interact in various psychologically motivated (needs & wants) ways. Humboldt University of Berlin. Java/Eclipse infrastructure.
AGIsim
GPL. AGISim is a framework for the creation of virtual worlds for artificial intelligence research, allowing AI and human controlled agents to interact in realtime within sensory-rich contexts. AGISim is built on the Crystal Space 3D game engine. Some parts of AGISim are closely related to OpenCog. Possibly/probably no longer maintained; I think the development team moved to OpenCog.



Chatbots

Most of these use very weak/narrow AI techniques. A large list can be found at Chatbots.org.
A.L.I.C.E.
Chatterbot, AIML. AIML is a stimulus-response system: a bunch of English sentence patterns are hard-coded, and a bunch of replies to these are hard-coded as well.
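A toy Python sketch of that stimulus-response idea -- hard-coded patterns with a "*" wildcard mapped to hard-coded replies. The patterns and replies here are made up for illustration; this is not actual AIML syntax:

```python
import re

# Hard-coded stimulus patterns ("*" is a wildcard) and reply templates.
RULES = [
    ("HELLO *", "Hi there!"),
    ("MY NAME IS *", "Nice to meet you, {0}."),
    ("*", "Tell me more."),        # catch-all, like AIML's default category
]

def respond(sentence):
    """Return the reply for the first pattern matching the input."""
    text = sentence.upper().strip(".!?")
    for pattern, reply in RULES:
        regex = "^" + re.escape(pattern).replace(r"\*", "(.*)") + "$"
        m = re.match(regex, text)
        if m:
            return reply.format(*(g.strip().title() for g in m.groups()))
    return ""
```

Real AIML adds recursion, topic tracking, and variable memory on top of this basic match-and-substitute loop, but the core mechanism is the same.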

Big computers

National Science Foundation: Google+IBM: Cluster Exploratory -- grants for large cluster science.

Journals, Societies

Journal of Cognitive Science
Issues from 1980-2004 are online, free.
Symposium on Advances in Cognitive Architectures (2003)
Speakers, Abstracts, and Slides.
Cognitive Science Society
Promotes scientific interchange among the fields of Cognitive Science, Artificial Intelligence, Linguistics, Anthropology, Psychology, Neuroscience, Philosophy, and Education.
AGIRI
Artificial General Intelligence Research Institute. Publisher of the Journal of Artificial General Intelligence.
SIAI
The Singularity Institute for Artificial Intelligence. Runs the Singularity Summit, a seminar program aimed at explicating AGI concepts to business executives.
Lifeboat Foundation
Countering existential risks, including atomic war, meteors, bioterrorism, grey goo, and singularity/AGI issues.

Misc links

Beautiful Soup
Library for screen-scraping content from HTML pages. Python, Python Software License.
Scrapy
Application framework for writing spiders that screen-scrape content from HTML pages. Python, BSD License.
http://www.isi.edu/~hobbs/ see especially "magnum opus"
Semantics of Business Vocabulary and Business Rules
An attempt to mix natural language and first-order logic to describe business relationships.

This page is maintained by Linas Vepstas and was last updated in a substantial way in 2009.