BAYESIAN INFERENCE AND MAXIMUM ENTROPY METHODS IN SCIENCE AND ENGINEERING: 27th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering
954 (2007); http://dx.doi.org/10.1063/1.2821289
Measure is a valuation on elements of a lattice: it obeys the sum rule. Probability is a bi‐valuation on elements of a lattice within a context: it obeys the sum and product rules. Maximum entropy is the variational principle which assigns measures and probabilities. These results derive from simple and general symmetries. In inference, it is the propositions being investigated that form the appropriate lattice on which measures and probabilities are defined, and there is no rational alternative to the Bayesian approach of using standard probability calculus. This philosophy permeates science and beyond.
954 (2007); http://dx.doi.org/10.1063/1.2821253
What is information? Is it physical? We argue that in a Bayesian theory the notion of information must be defined in terms of its effects on the beliefs of rational agents. Information is whatever constrains rational beliefs and therefore it is the force that induces us to change our minds. This problem of updating from a prior to a posterior probability distribution is tackled through an eliminative induction process that singles out the logarithmic relative entropy as the unique tool for inference. The resulting method of Maximum relative Entropy (ME), which is designed for updating from arbitrary priors given information in the form of arbitrary constraints, includes as special cases both MaxEnt (which allows arbitrary constraints) and Bayes' rule (which allows arbitrary priors). Thus, ME unifies the two themes of these workshops—the Maximum Entropy and the Bayesian methods—into a single general inference scheme that allows us to handle problems that lie beyond the reach of either of the two methods separately. I conclude with a couple of simple illustrative examples.
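The abstract's claim that Bayes' rule is a special case of ME can be illustrated numerically: constrain the updated joint q(theta, x) to put all of its x-mass on the observed datum, and among such distributions the one minimizing the relative entropy to the joint prior is exactly the Bayes posterior. A minimal sketch; the three-point hypothesis grid, prior, and coin-flip likelihood below are made-up illustration numbers, not from the paper.

```python
import math

# Constrained ME update vs. Bayes' rule. The candidate posteriors q(theta, x)
# must satisfy q(x = x0) = 1, so each is determined by a distribution r(theta).
# Illustration numbers only: 3 coin-bias hypotheses, an arbitrary prior, one
# observed head.

thetas = [0.2, 0.5, 0.8]            # assumed coin-bias grid
prior = [0.3, 0.4, 0.3]             # assumed prior over the grid
x0 = 1                              # observed datum: one flip came up heads
lik = [t if x0 == 1 else 1 - t for t in thetas]   # p(x0 | theta)

joint_at_x0 = [pr * l for pr, l in zip(prior, lik)]
bayes = [j / sum(joint_at_x0) for j in joint_at_x0]  # Bayes posterior

def rel_entropy(r):
    """D(q || joint prior) for the constrained q(theta, x) = r(theta) * delta(x, x0)."""
    return sum(ri * math.log(ri / j) for ri, j in zip(r, joint_at_x0) if ri > 0)

# The Bayes posterior should have lower relative entropy than any other
# constrained candidate (a finite spot-check, not a proof).
alternatives = [prior, [1/3, 1/3, 1/3], [0.1, 0.2, 0.7]]
best = rel_entropy(bayes)
print(all(best <= rel_entropy(r) + 1e-12 for r in alternatives))
```

By Gibbs' inequality the check holds for every normalized candidate, not just the three listed, which is the sense in which conditioning on data is the entropy-minimizing update.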
954 (2007); http://dx.doi.org/10.1063/1.2821268
In this tutorial, I will discuss the concepts behind generalizing ordering to measuring and apply these ideas to the derivation of probability theory. The fundamental concept is that anything that can be ordered can be measured. Since we are in the business of making statements about the world around us, we focus on ordering logical statements according to implication. This results in a Boolean lattice, which is related to the fact that the corresponding logical operations form a Boolean algebra.
The concept of logical implication can be generalized to degrees of implication by generalizing the zeta function of the lattice. The rules of probability theory arise naturally as a set of constraint equations. Through this construction we are able to neatly connect the concepts of order, structure, algebra, and calculus. The meaning of probability is inherited from the meaning of the ordering relation, implication, rather than being imposed in an ad hoc manner at the start.
954 (2007); http://dx.doi.org/10.1063/1.2821288
Probability calculus is understood, and uniquely defined as the only rational tool for consistent inference. Yet two problems remain. One is a matter of principle: how do we assign the prior distribution that expresses the question we wish to ask? The other is a matter of practice: how do we navigate the parameter space in order to compute the posterior inference? Probability distributions have a natural geometry, which can be used to help in both these. But, like any other professional tool, geometry should be used with intelligence and care.
954 (2007); http://dx.doi.org/10.1063/1.2821300
All priors are not created equal. There are right and there are wrong priors. That is the main conclusion of this contribution. I use a cooked‐up example designed to create drama, and a typical textbook example, to show the pervasiveness of wrong priors in standard statistical practice.
954 (2007); http://dx.doi.org/10.1063/1.2821301
This paper reviews ideas and results from unsupervised learning theory that have given the best explanation yet of how neural firing rates self‐organise to code natural images in area V1 of visual cortex. It then discusses the generalisation of these ideas to self‐organising spike‐coding networks. A mismatch between the resulting spike‐learning algorithm and the known physiological processes of synaptic plasticity is then used as a motivation to introduce the rather obvious idea that neurons are not sending their information to other neurons, but to synapses—more microscopic structures. This prompts a survey of other inter‐level communications in the brain and inside cells. It is proposed on the basis of this that information flows all the way up and down the reductionist hierarchy—an idea that transforms many of our ideas about machine learning and neuroscience. What it transforms them into is not yet clear, but the remainder of the paper discusses this.
954 (2007); http://dx.doi.org/10.1063/1.2821302
We use the method of Maximum (relative) Entropy to process information in the form of observed data and moment constraints. The generic “canonical” form of the posterior distribution for the problem of simultaneous updating with data and moments is obtained. We discuss the general problem of non‐commuting constraints, when they should be processed sequentially and when simultaneously. As an illustration, the multinomial example of die tosses is solved in detail for two superficially similar but actually very different problems.
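The die example the abstract refers to can be sketched in its simplest form, the classic Brandeis dice problem: find the maximum-entropy distribution over faces 1 to 6 subject to a moment (mean) constraint. This is a generic illustration of a moment constraint, not the paper's own two-problem comparison; the target mean of 4.5 is an assumption for the sketch.

```python
import math

# MaxEnt die with a mean constraint (Brandeis dice). The solution has the
# canonical form p_i proportional to exp(lambda * i); lambda is found by
# bisection so that the mean matches the constraint.

FACES = [1, 2, 3, 4, 5, 6]

def canonical(lam):
    """Canonical distribution p_i ∝ exp(lam * i) over the six faces."""
    w = [math.exp(lam * i) for i in FACES]
    z = sum(w)
    return [wi / z for wi in w]

def mean(p):
    return sum(i * pi for i, pi in zip(FACES, p))

def maxent_die(target_mean, lo=-10.0, hi=10.0, iters=200):
    """Bisect on lambda until the canonical mean matches target_mean."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(canonical(mid)) < target_mean:
            lo = mid
        else:
            hi = mid
    return canonical(0.5 * (lo + hi))

p = maxent_die(4.5)
print([round(pi, 3) for pi in p])
```

For a target mean above 3.5 the resulting probabilities increase monotonically toward face 6, the familiar exponential tilt of the uniform distribution.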
954 (2007); http://dx.doi.org/10.1063/1.2821303
In a widely‐cited paper, Glymour (Theory and Evidence, Princeton, N. J.: Princeton University Press, 1980, pp. 63–93) claims to show that Bayesians cannot learn from old data. His argument contains an elementary error. I explain exactly where Glymour went wrong, and how the problem should be handled correctly. When the problem is fixed, it is seen that Bayesians, just like logicians, can indeed learn from old data.
954 (2007); http://dx.doi.org/10.1063/1.2821304
It has recently been shown that the marginalization paradox (MP) can be resolved by interpreting improper inferences as probability limits. The key to the resolution is that probability limits need not satisfy the formal Bayes' law, which is used in the MP to deduce an inconsistency. In this paper, I explore the differences between probability limits and the more familiar pointwise limits, which do imply the formal Bayes' law, and show how these differences underlie some key differences in the interpretation of the MP.
954 (2007); http://dx.doi.org/10.1063/1.2821250
How to assign numerical values for probabilities that do not seem artificial or arbitrary is a central question in Bayesian statistics. The case of assigning a probability to the truth of a proposition or event for which there is no evidence other than that the event is contingent is contrasted with the assignment of probability in the case where there is definite evidence that the event can happen in a finite set of ways. The truth of a proposition of this kind is frequently assigned a probability via arguments of ignorance, symmetry, randomness, the Principle of Indifference, the Principal Principle, non‐informativeness, or by other methods. These concepts are all shown to be flawed or misleading. The statistical syllogism introduced by Williams in 1947 is shown to fix the problems that the other arguments have. An example in the context of model selection is given.
954 (2007); http://dx.doi.org/10.1063/1.2821251
Stephen Wolfram popularized elementary one‐dimensional cellular automata in his book, A New Kind of Science. Among many remarkable things, he proved that one of these cellular automata was a Universal Turing Machine. Such cellular automata can be interpreted in a different way by viewing them within the context of the formal manipulation rules from probability theory. Bayes's Theorem is the most famous of such formal rules.
As a prelude, we recapitulate Jaynes's presentation of how probability theory generalizes classical logic using modus ponens as the canonical example. We emphasize the important conceptual standing of Boolean Algebra for the formal rules of probability manipulation and give an alternative demonstration augmenting and complementing Jaynes's derivation. We show the complementary roles played in arguments of this kind by Bayes's Theorem and joint probability tables.
A good explanation for all of this is afforded by the expansion of any particular logic function via the disjunctive normal form (DNF). The DNF expansion is a useful heuristic emphasized in this exposition because such expansions point out where relevant 0s should be placed in the joint probability tables for logic functions involving any number of variables.
It then becomes a straightforward exercise to rely on Boolean Algebra, Bayes's Theorem, and joint probability tables in extrapolating to Wolfram's cellular automata. Cellular automata are seen as purely deductive systems, just like classical logic, which probability theory is then able to generalize. Thus, any uncertainties which we might like to introduce into the discussion about cellular automata are handled with ease via the familiar inferential path. Most importantly, the difficult problem of predicting what cellular automata will do in the far future is treated like any inferential prediction problem.
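The deductive core of the argument above, a cellular automaton as a rule table over three-cell neighborhoods, can be sketched in a few lines. Rule 110 is used here as the concrete example (it is the rule shown universal in A New Kind of Science); the 8-bit rule number encodes the output for each of the eight neighborhoods, exactly like the truth table of a logic function in disjunctive normal form, and the probabilistic generalization would place the corresponding 0s and 1s in a joint probability table.

```python
# One elementary CA as a deductive lookup table. Bit i of the rule number
# is the output for the neighborhood whose 3-bit value is i.

def ca_step(cells, rule=110):
    """One synchronous update of a 1-D elementary CA with fixed-0 boundaries."""
    table = {i: (rule >> i) & 1 for i in range(8)}  # neighborhood -> output
    out = []
    for i in range(len(cells)):
        left = cells[i - 1] if i > 0 else 0
        mid = cells[i]
        right = cells[i + 1] if i < len(cells) - 1 else 0
        out.append(table[(left << 2) | (mid << 1) | right])
    return out

row = [0, 0, 0, 0, 1, 0, 0, 0]   # a single live cell
print(ca_step(row))               # the next generation under rule 110
```

Iterating `ca_step` from a single live cell reproduces the familiar rule-110 triangle; replacing the deterministic table entries with conditional probabilities is the inferential generalization the abstract describes.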
954 (2007); http://dx.doi.org/10.1063/1.2821252
To Jaynes, in his original paper, maxent is ‘a method of reasoning which ensures that no unconscious arbitrary assumptions have been introduced’, while fifty years later, the MAXENT conference home page suggests that the method ‘is not yet fully available to the statistics community at large’. In fact, it is possible to see that generalized maxent problems, often in disguise, do play a significant role in machine learning and statistics. Deviations from the classic form of the problem are typically used to incorporate some form of prior knowledge. Sometimes that knowledge would be difficult or impossible to represent with only linear constraints or an initial guess for the density.
To clarify these connections, a good place to start is the classic maxent problem. This can then be generalized until the problem encompasses a large class of problems studied by the machine learning community. Relaxed constraints, generalizations of Shannon‐Boltzmann‐Gibbs (SBG) entropy and a few tools from convex analysis make the task relatively straightforward. In the examples discussed, the original maxent problem remains embedded as a special case. Providing a trail back to the original maxent problem will highlight the potential for cross‐fertilization between the two fields.
954 (2007); http://dx.doi.org/10.1063/1.2821254
The Hammersley‐Clifford (H‐C) theorem relates the factorization properties of a probability distribution to the clique structure of an undirected graph. If a density factorizes according to the clique structure of an undirected graph, the theorem guarantees that the distribution satisfies the Markov property and vice versa. We show how to generalize the H‐C theorem to different notions of decomposability and the corresponding generalized‐Markov property. Finally we discuss how our technique might be used to arrive at other generalizations of the H‐C theorem, inducing a graph semantics adapted to the modeling problem.
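The factorization-to-Markov direction of the theorem is easy to verify numerically on the simplest undirected graph, the chain X1 - X2 - X3, whose cliques are {X1, X2} and {X2, X3}. The potential values below are arbitrary positive numbers chosen for illustration, not from the paper.

```python
import itertools

# A density that factorizes over the cliques of the chain X1 - X2 - X3
# must satisfy the Markov property X1 independent of X3 given X2.
# Arbitrary positive clique potentials over binary variables:
psi12 = [[2.0, 1.0], [0.5, 3.0]]
psi23 = [[1.5, 0.2], [1.0, 4.0]]

states = list(itertools.product([0, 1], repeat=3))
unnorm = {s: psi12[s[0]][s[1]] * psi23[s[1]][s[2]] for s in states}
Z = sum(unnorm.values())
p = {s: v / Z for s, v in unnorm.items()}

def marg(keep):
    """Marginal of p over the variable indices in `keep`."""
    out = {}
    for s, v in p.items():
        key = tuple(s[i] for i in keep)
        out[key] = out.get(key, 0.0) + v
    return out

p2 = marg([1])       # p(x2)
p12 = marg([0, 1])   # p(x1, x2)
p23 = marg([1, 2])   # p(x2, x3)

# p(x1,x2,x3) * p(x2) == p(x1,x2) * p(x2,x3) for every state is
# equivalent to the conditional independence X1 _|_ X3 | X2.
ok = all(abs(p[s] * p2[(s[1],)] - p12[(s[0], s[1])] * p23[(s[1], s[2])]) < 1e-12
         for s in states)
print(ok)
```

The check passes for any choice of positive potentials, which is exactly what the theorem guarantees in this direction; the converse (Markov implies factorization) is the harder half.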
954 (2007); http://dx.doi.org/10.1063/1.2821255
The combinatorial basis of entropy, given by Boltzmann, can be written H = N^{-1} ln W, where H is the dimensionless entropy, N is the number of entities and W is the number of ways in which a given realization of a system can occur (its statistical weight). This can be broadened to give generalized combinatorial (or probabilistic) definitions of entropy and cross‐entropy: H = κφ(W) + C and D = −κφ(P) + C, where P is the probability of a given realization, φ is a convenient transformation function, κ is a scaling parameter and C is an arbitrary constant. If W or P satisfy the multinomial weight or distribution, then using φ = ln, H and D asymptotically converge to the Shannon and Kullback‐Leibler functions. In general, however, W or P need not be multinomial, nor may they approach an asymptotic limit. In such cases, the entropy or cross‐entropy function can be defined so that its extremization (“MaxEnt” or “MinXEnt”), subject to the constraints, gives the “most probable” (“MaxProb”) realization of the system. This gives a probabilistic basis for MaxEnt and MinXEnt, independent of any information‐theoretic justification.
This work examines the origins of the governing distribution P. These include: (a) frequentist‐like models; (b) symmetry models; (c) prior MinXEnt models; (d) Kapur‐Kesavan inverse models; and (e) game theoretic models. The combinatorial definition and MaxProb are consistent with these different approaches, and the notion of probabilistic inference, yet offer greater utility than traditional MaxEnt/MinXEnt based on the Shannon and Kullback‐Leibler functions.
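The asymptotic convergence claimed above is easy to check numerically for the multinomial case with φ = ln: the per-entity entropy H = N^{-1} ln W approaches the Shannon entropy −Σ p_i ln p_i as N grows. A minimal sketch; the probability vector is a made-up example.

```python
import math

# (1/N) ln W for the multinomial weight W = N! / (n_1! ... n_k!), with
# occupation numbers n_i = N * p_i. math.lgamma gives ln(x!) as
# lgamma(x + 1) and avoids overflow for large N.

def combinatorial_entropy(counts):
    """(1/N) ln W for a multinomial weight with the given occupation counts."""
    n = sum(counts)
    ln_w = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return ln_w / n

def shannon(p):
    """Shannon entropy -sum p_i ln p_i (the asymptotic limit)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p = [0.5, 0.25, 0.25]   # example probabilities
for n in (100, 10000, 1000000):
    counts = [int(n * pi) for pi in p]
    print(n, combinatorial_entropy(counts), shannon(p))
```

The finite-N value always lies below the Shannon limit and converges at rate O(ln N / N), the Stirling correction, which is the asymptotic behaviour the abstract invokes.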
954 (2007); http://dx.doi.org/10.1063/1.2821257
The fundamentals of the Maximum Entropy principle as a rule for assigning and updating probabilities are revisited. The Shannon‐Jaynes relative entropy is vindicated as the optimal criterion for use with an updating rule. A constructive rule is justified which assigns the probabilities least sensitive to coarse‐graining. The implications of these developments for interpreting physics laws as rules of inference upon incomplete information are briefly discussed.
954 (2007); http://dx.doi.org/10.1063/1.2821258
In this paper, we explore the possibility that the concept of information may enable a derivation of the quantum formalism from a set of physically comprehensible postulates. Taking the probabilistic nature of measurements as a given, we introduce the concept of information via a novel invariance principle, the Principle of Information Gain. Using this principle, we then show that it is possible to deduce the abstract quantum formalism for finite‐dimensional quantum systems from a set of postulates, of which one is a novel physical assumption, and the remainder are based on experimental facts characteristic of quantum phenomena or are drawn from classical physics. The concept of information plays a key role in the derivation, and gives rise to some of the central structural features of the quantum formalism.
954 (2007); http://dx.doi.org/10.1063/1.2821259
Newtonian dynamics is derived from prior information codified into an appropriate statistical model. The basic assumption is that there is an irreducible uncertainty in the location of particles, so that the state of a particle is defined by a probability distribution. The corresponding configuration space is a statistical manifold, the geometry of which is defined by the information metric. The trajectory follows from a principle of inference, the method of Maximum Entropy. No additional “physical” postulates, such as an equation of motion or an action principle, nor the concepts of momentum and phase space, nor even the notion of time, need to be introduced. The resulting entropic dynamics reproduces the Newtonian dynamics of any number of particles interacting among themselves and with external fields. Both the mass of the particles and their interactions are explained as a consequence of the underlying statistical manifold.
954 (2007); http://dx.doi.org/10.1063/1.2821260
A novel information‐geometric approach to chaotic dynamics on curved statistical manifolds based on Entropic Dynamics (ED) is suggested. Furthermore, an information‐geometric analogue of the Zurek‐Paz quantum chaos criterion is proposed. It is shown that the hyperbolicity of a non‐maximally symmetric ‐dimensional statistical manifold underlying an ED Gaussian model describing an arbitrary system of non‐interacting degrees of freedom leads to linear information‐geometric entropy growth and to exponential divergence of the Jacobi vector field intensity, which are quantum and classical features of chaos, respectively.
954 (2007); http://dx.doi.org/10.1063/1.2821261
Intended as an introduction to the author's research questions, this paper is a further exploration of “probability as a physical motive”, an attempt to entertain an alternative to causal, deterministic explanation in science. According to this approach, explanation need not be an account of what forces dynamics; explanation may be found in the correlations of dynamics to possibilities.
Uniform distribution of mass (near‐zero Weyl tensor of space‐time curvature) has been suggested by Penrose as that initial condition which accounts for the second law of thermodynamics, as the physical expression of the “MaxEnt” principle. A distribution of mass with respect to gravity is taken as a certain space‐time topography, and inquiry is made into how there might be more ways for space‐time topography to be irregular than for it to be flat. The attempt to understand the counter‐intuitive circumstance of uniform distribution representing dis‐equilibrium, in the case of gravity, leads to discussion of the Machian question of how a configuration may affect, or even effect, the very space in which it is supposed to reside. This leads to speculation on the idea that even state space might depend on state.