What is algorithmic probability?
In algorithmic information theory, algorithmic probability, also known as Solomonoff probability, is a mathematical method of assigning a prior probability to a given observation. It was invented by Ray Solomonoff in the 1960s and is used in inductive inference theory and in the analysis of algorithms. In his general theory of inductive inference, Solomonoff uses the prior obtained by this formula in Bayes' rule for prediction.
In the mathematical formalism used, the observations have the form of finite binary strings, and the universal prior is a probability distribution over the set of finite binary strings. The prior is universal in the Turing-computability sense, i.e. no string has zero probability. It is not computable, but it can be approximated.
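In the standard formulation (stated here for reference; the symbols m, U, and p are not defined in the text above), the universal prior of a finite string x is the probability that a universal prefix Turing machine U outputs x when its program tape is filled with fair coin flips:

```latex
m(x) \;=\; \sum_{p \,:\, U(p) = x} 2^{-|p|}
```

where |p| is the length of program p in bits. Shorter programs, i.e. simpler explanations, receive exponentially greater weight, which makes Occam's razor quantitative; and since every string has at least one program that prints it, no string has zero probability.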
Algorithmic probability deals with the following questions:
- Given a body of data about some phenomenon that we want to understand, how can we select the most probable hypothesis of how it was caused from among all possible hypotheses?
- How can we evaluate the different hypotheses?
- How can we predict future data and how can we measure the likelihood of that prediction being the right one?
What inspired Solomonoff’s algorithmic probability?
Four principal inspirations for Solomonoff's algorithmic probability were:
- Occam's razor,
- Epicurus' principle of multiple explanations,
- modern computing theory, and
- Bayes' rule for prediction.
What are the different types of algorithmic probability techniques?
1. An Algorithmic Probability Loss Function
The main task of a loss function is to measure the discrepancy between a value predicted by the model and the actual value as specified by the training data set. In most currently used machine learning paradigms this discrepancy is measured in terms of the differences between numerical values or, in the case of cross-entropy loss, between predicted probabilities. Algorithmic information theory offers another option: measuring this discrepancy in terms of the algorithmic distance, or information deficit, between the predicted output of the model and the real value.
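As an illustrative sketch only (the function names and the use of `zlib` are my assumptions, not from the source), such a loss can be approximated by letting a real compressor stand in for the uncomputable complexity function K:

```python
import zlib

def c(s: bytes) -> int:
    """Compressed length in bytes: a crude, computable proxy for K(s)."""
    return len(zlib.compress(s, 9))

def algorithmic_loss(y_true: bytes, y_pred: bytes) -> int:
    """Approximate the information deficit K(y_true | y_pred): how many
    extra compressed bytes y_true costs once y_pred is already known."""
    return max(c(y_pred + y_true) - c(y_pred), 0)

# A prediction identical to the target costs almost nothing extra;
# an unrelated prediction leaves a large deficit.
loss_same = algorithmic_loss(b"0101" * 100, b"0101" * 100)
loss_diff = algorithmic_loss(b"0101" * 100, bytes(range(256)))
```

The `max(..., 0)` clamp only guards against small compressor artifacts; the true conditional complexity is never negative.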
2. Categorical Algorithmic Probability Classification
One of the main fields of application for automated learning is the classification of objects. These classification tasks are often divided into supervised and unsupervised problems.
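As a hedged sketch of how such a classifier might look (the function names, exemplar strings, and the use of `zlib` are my assumptions), one can label an object by its normalized compression distance (NCD) to one exemplar per class:

```python
import zlib

def c(s: bytes) -> int:
    """Compressed length: a computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(s, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, approximating the normalized
    information distance that the uncomputable K would define."""
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)

def classify(x: bytes, exemplars: dict) -> str:
    """Supervised classification: pick the label whose exemplar is
    algorithmically nearest to x under NCD."""
    return min(exemplars, key=lambda label: ncd(x, exemplars[label]))

exemplars = {"repetitive": b"ab" * 200, "counting": bytes(range(200))}
label = classify(b"ab" * 150, exemplars)
```

Here the repetitive query string compresses well together with the repetitive exemplar, so its NCD to that class is small and the classifier labels it "repetitive".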
3. Approximating the Algorithmic Similarity Function
While theoretically sound, the proposed algorithmic loss and classification cost functions rely on the uncomputable Kolmogorov complexity function K. However, recent research and increasing computing power have made available a number of techniques for computably approximating the non-conditional version of K.
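One simple computable approximation (my sketch; work in this area also uses more refined tools such as the Coding Theorem Method, but any lossless compressor already yields an upper bound) is to take the best of several standard compressors:

```python
import bz2
import lzma
import os
import zlib

def K_upper_bound(x: bytes) -> int:
    """Upper bound (in bytes, up to additive constants) on the
    non-conditional K(x): the shortest of several real, computable codes.
    A computable compressor can only over-estimate the true K."""
    return min(len(zlib.compress(x, 9)),
               len(bz2.compress(x, 9)),
               len(lzma.compress(x)))

# A highly regular string admits a much shorter description
# than the same number of random bytes.
k_regular = K_upper_bound(b"01" * 500)
k_random = K_upper_bound(os.urandom(1000))
```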
What are some algorithmic probability algorithms?
1. Prediction, Probability, and Induction
A strong motivation for revising classical concepts of probability has come from the analysis of human problem-solving. When working on a difficult problem, a person is in a maze in which they have to make choices of possible courses of action. If the problem is a familiar one, the choices will be easy. If it is not familiar, there can be much uncertainty in each choice, but choices must somehow be made. One basis for choices might be the probability of each choice leading to a quick solution; this probability is based on experience with this problem and with problems like it.
The usual method of calculating probability is by taking the ratio of the number of favorable choices to the total number of choices in the past.
There is also induction: prediction is usually done by finding inductive models, that is, deterministic or probabilistic rules for prediction. We are given a batch of data, typically a series of zeros and ones, and we are asked to predict any one of the data points as a function of the data points that precede it.
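A minimal inductive model of this kind (my illustration, not an algorithm given in the source) predicts each bit from the frequency of ones among the preceding bits, smoothed by Laplace's rule of succession:

```python
def predict_next(bits: str) -> float:
    """Probability that the next symbol is '1' under Laplace's rule:
    (ones + 1) / (total + 2). With no data this is the uniform 1/2."""
    return (bits.count("1") + 1) / (len(bits) + 2)

p_empty = predict_next("")      # 1/2: no evidence yet
p_ones = predict_next("1111")   # 5/6: past ones raise the estimate
```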
2. Compression and Algorithmic Probability
An important application of symbol prediction is text compression. If an induction algorithm assigns a probability S to a text, there is a coding method, arithmetic coding, that can compress the text into about -log2 S bits and later re-create the entire text without error.
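The link is quantitative: an arithmetic coder needs about -log2 S bits for a text to which the model assigns probability S. A small sketch (the function name and example probabilities are mine):

```python
import math

def ideal_code_length(symbol_probs: list) -> float:
    """Bits an arithmetic coder needs (to within about 2 bits) for a text,
    given the model's conditional probability for each symbol in turn:
    -log2 of the product, i.e. the sum of -log2 of each factor."""
    return sum(-math.log2(p) for p in symbol_probs)

# An indifferent model (p = 1/2 per bit) yields no compression: 8 bits for 8 symbols.
bits_uniform = ideal_code_length([0.5] * 8)
# A confident model (p = 0.99 per symbol) codes the same text far below 8 bits.
bits_confident = ideal_code_length([0.99] * 8)
```

Better probability estimates therefore translate directly into shorter codes, which is why compression performance is a practical test of an induction algorithm.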
3. Incomputability
It should be noted that, in general, it is impossible to find the truly best models with any certainty: there is an infinity of models to be tested and some take an unacceptably long time to evaluate. At any particular time in the search we will know the best ones so far, but we can never be sure that spending a little more time will not give much better models. While it is clear that we can always make approximations to algorithmic probability by using a limited number of models, we can never know how close these approximations are to the true algorithmic probability: algorithmic probability is, formally, incomputable.
4. Subjectivity
The subjectivity of probability resides in a priori information: the information available to the statistician before he sees the data to be extrapolated. This is independent of what kind of statistical techniques we use.
In making predictions, there are several commonly used techniques for inserting a priori information. First, by restricting or expanding the set of induction models to be considered; this is certainly the most common way. Second, by selecting prediction functions with adjustable parameters and assuming a density distribution over those parameters based on past experience with such parameters. Third, we note that much of the information in our sciences is expressed as definitions, that is, additions to our language. Algorithmic probability, or approximations of it, avails itself of this information by using these definitions to help assign code lengths, and hence a priori probabilities, to models. Computer languages are usually used to describe models, and it is relatively easy to make arbitrary definitions part of the language.
5. Diversity and Understanding
Apart from the accuracy of its probability estimates, algorithmic probability has another important value for AI: its multiplicity of models gives us many different ways to understand our data. A very conventional scientist understands his science using a single “current paradigm”, the way of understanding that is most in vogue at the present time. A more creative scientist understands his science in very many ways, and can more easily create new theories, new ways of understanding, when the “current paradigm” no longer fits the current data.
What are the applications of algorithmic probability?
Algorithmic probability has a number of important theoretical applications; some of them are listed below.
Solomonoff (1964) developed a quantitative formal theory of induction based on universal Turing machines (to compute, quantify and assign codes to all quantities of interest) and algorithmic complexity (to define what simplicity/complexity means). This is remarkable as it means that Solomonoff's inductive inference system will learn to correctly predict any computable sequence with only the absolute minimum amount of data. It would thus, in some sense, be the perfect universal prediction algorithm, if only it were computable.
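In symbols (the standard statement of Solomonoff's predictor, assumed here rather than taken from the text): the probability that the next bit of an observed sequence x_{1:t} is 1 is a ratio of universal prior probabilities,

```latex
M(x_{t+1} = 1 \mid x_{1:t}) \;=\; \frac{M(x_{1:t} 1)}{M(x_{1:t})},
\qquad
M(x) \;=\; \sum_{p \,:\, U(p) = x \ast} 2^{-|p|},
```

where the sum runs over all programs p whose output on the universal machine U begins with x.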
AIXI and a Universal Definition of Intelligence
It can also be shown that the universal prior M leads to excellent predictions and decisions in more general stochastic environments. Essentially, AIXI is a generalisation of Solomonoff induction to the reinforcement learning setting, that is, where the agent's actions can influence the state of the environment. AIXI can be proven to be, in some sense, an optimal reinforcement learning agent embedded in an arbitrary unknown environment. Like Solomonoff induction, the AIXI agent is not computable and thus can only be approximated in practice.
Instead of defining an optimal general agent, it is possible to turn things around and define a universal performance measure for agents acting in computable environments, known as universal intelligence.
Expected Time/Space Complexity of Algorithms under the Universal Distribution
For every algorithm whatsoever, the m-average-case complexity is always of the same order of magnitude as the worst-case complexity for computational resources such as time and storage space. This result of Li and Vitanyi means that, if the inputs are distributed according to the universal distribution, rather than according to the uniform distribution as is customary, then the expected case is as bad as the worst case. For example, the sorting procedure Quicksort has worst-case running time of order n², but expected running time of order n log n on a list of n keys drawn at random from uniformly distributed permutations of n keys. But if the permutations are distributed according to the universal distribution, that is, according to Occam's dictum that the simple permutations have the highest probabilities, as is often the case in practice, then the expected running time rises to the worst-case order of n². Experiments appear to confirm this theoretical prediction.
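This is easy to reproduce with the textbook first-element-pivot Quicksort (my illustration, not Li and Vitanyi's experiment): a random permutation needs on the order of n log n comparisons, while the algorithmically simplest permutation of all, the already-sorted list, forces the full worst case of n(n-1)/2.

```python
import random

def quicksort_comparisons(a: list) -> int:
    """Comparisons made by textbook Quicksort with a first-element pivot,
    counting one comparison per element partitioned against the pivot."""
    if len(a) <= 1:
        return 0
    pivot, rest = a[0], a[1:]
    left = [x for x in rest if x < pivot]
    right = [x for x in rest if x >= pivot]
    return len(rest) + quicksort_comparisons(left) + quicksort_comparisons(right)

n = 500
random.seed(0)
c_random = quicksort_comparisons(random.sample(range(n), n))  # order n log n
c_sorted = quicksort_comparisons(list(range(n)))              # n(n-1)/2 exactly
```

On the sorted input every partition is maximally unbalanced, so the comparison count is exactly n(n-1)/2, far above the random-permutation count.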
PAC Learning Using the Universal Distribution in the Learning Phase
Using m as the sampling distribution in the learning phase in PAC learning properly increases the learning power for the family of discrete concept classes under computable distributions, and similarly for PAC learning of continuous concepts over computable measures by using M. This model of Li and Vitanyi is akin to using a teacher in the learning phase who gives the simple examples first.
A formally related quantity is the probability that the universal machine U halts when provided with fair coin flips on the input tape (i.e., that a random computer program will eventually halt). This halting probability, Ω = ∑_x m(x), is also known as Chaitin's constant or "the number of wisdom", and has numerous remarkable mathematical properties. For example, it can be used to quantify Gödel's Incompleteness Theorem.