Preface 序

The following material is addressed to readers who are already familiar with applied mathematics, at the advanced undergraduate level or preferably higher, and with some field, such as physics, chemistry, biology, geology, medicine, economics, sociology, engineering, operations research, etc., where inference is needed. 1 A previous acquaintance with probability and statistics is not necessary; indeed, a certain amount of innocence in this area may be desirable, because there will be less to unlearn.

本书面向已经熟悉应用数学(高年级本科生或更高水平)的读者,以及在物理,化学,生物,地质,医学,经济,社会学,工程,运筹学等需要进行推断的学科领域工作的人(注1).不需要预先了解概率和统计的相关知识;实际上,在这方面保持一定程度的"无知"反而更好,因为这样需要"忘掉"的东西会更少.

We are concerned with probability theory and all of its conventional mathematics, but now viewed in a wider context than that of the standard textbooks. Every chapter after the first has ‘new’ (i.e. not previously published) results that we think will be found interesting and useful. Many of our applications lie outside the scope of conventional probability theory as currently taught. But we think that the results will speak for themselves, and that something like the theory expounded here will become the conventional probability theory of the future.

本书关注概率论及其全部传统数学,但从比标准教科书更宽广的背景来看待它.除第一章外,每一章都包含一些我们认为有趣且有用的"新"(即以前没有发表过的)结果.许多应用超出了目前所教授的传统概率论的范围.但我们相信这些结果会不言自明,而且类似本书所阐述的理论将成为未来的传统概率论.

1 By ‘inference’ we mean simply: deductive reasoning whenever enough information is at hand to permit it; inductive or plausible reasoning when – as is almost invariably the case in real problems – the necessary information is not available. But if a problem can be solved by deductive reasoning, probability theory is not needed for it; thus our topic is the optimal processing of incomplete information.

注1: 这里的"推断"简单地说是指:当有足够的信息时,应用演绎推理;反之,当必要的信息不可得时(现实问题中几乎总是如此),使用归纳推理或合情推理.但是,如果一个问题能够用演绎推理解决,概率论就没有用武之地了.也就是说,我们的主题是对不完全信息的最优处理.

History 历史

The present form of this work is the result of an evolutionary growth over many years. My interest in probability theory was stimulated first by reading the work of Harold Jeffreys (1939) and realizing that his viewpoint makes all the problems of theoretical physics appear in a very different light. But then, in quick succession, discovery of the work of R. T. Cox (1946), Shannon (1948) and Pólya (1954) opened up new worlds of thought, whose exploration has occupied my mind for some 40 years. In this much larger and permanent world of rational thinking in general, the current problems of theoretical physics appeared as only details of temporary interest.

本书现在的形式是多年以来不断演变的结果.我对概率论的兴趣最初是由阅读Harold Jeffreys (1939)的著作激发的,我认识到他的观点使理论物理的所有问题都呈现出非常不同的面貌.但紧接着,我相继发现了R. T. Cox (1946), Shannon (1948) 和 Pólya (1954)的工作,它们为我打开了新的思想世界,对这些世界的探索占据了我差不多40年的心思.在这个更广阔且永恒的一般理性思维的世界中,理论物理学当前的那些问题反而显得只是些具有暂时意义的细节.

The actual writing started as notes for a series of lectures given at Stanford University in 1956, expounding the then new and exciting work of George Pólya on ‘Mathematics and Plausible Reasoning’. He dissected our intuitive ‘common sense’ into a set of elementary qualitative desiderata and showed that mathematicians had been using them all along to guide the early stages of discovery, which necessarily precede the finding of a rigorous proof. The results were much like those of James Bernoulli’s Art of Conjecture (1713), developed analytically by Laplace in the late 18th century; but Pólya thought the resemblance to be only qualitative.

本书的写作始于1956年在斯坦福大学所做的一系列讲座的讲稿,这些讲座阐述了George Pólya当时新颖且令人兴奋的著作<数学与合情推理>.他将我们直觉上的"常识"剖析为一组基本的定性准则,并表明数学家们一直在用它们来指导发现的早期阶段,即在找到严格证明之前必经的阶段.其结果与James Bernoulli的<推测的艺术>(1713)非常相似,后者在18世纪末由Laplace以解析方式加以发展;但Pólya认为两者的相似仅仅是定性的.

However, Pólya demonstrated this qualitative agreement in such complete, exhaustive detail as to suggest that there must be more to it. Fortunately, the consistency theorems of R. T. Cox were enough to clinch matters; when one added Pólya’s qualitative conditions to them the result was a proof that, if degrees of plausibility are represented by real numbers, then there is a uniquely determined set of quantitative rules for conducting inference. That is, any other rules whose results conflict with them will necessarily violate an elementary – and nearly inescapable – desideratum of rationality or consistency.

然而,Pólya以如此完整详尽的细节展示了这种定性上的一致,以至于暗示其中必有更深层的东西.幸运的是,R. T. Cox的一致性定理足以解决问题:把Pólya的定性条件加入其中,所得到的结果就是一个证明——如果合情程度用实数来表示,那么存在唯一确定的一组定量规则来进行推断.也就是说,任何结果与之冲突的其他规则,都必然违背某个基本的,几乎无法回避的合理性或一致性准则.

But the final result was just the standard rules of probability theory, given already by Daniel Bernoulli and Laplace; so why all the fuss? The important new feature was that these rules were now seen as uniquely valid principles of logic in general, making no reference to ‘chance’ or ‘random variables’; so their range of application is vastly greater than had been supposed in the conventional probability theory that was developed in the early 20th century. As a result, the imaginary distinction between ‘probability theory’ and ‘statistical inference’ disappears, and the field achieves not only logical unity and simplicity, but far greater technical power and flexibility in applications.

但最终结果不过是概率论的标准规则,Daniel Bernoulli和Laplace早已给出,那为什么还这么大惊小怪呢?重要的新特点在于,这些规则现在被视为普遍有效的广义逻辑原则,完全不涉及"机会"或"随机变量"的概念;因此其应用范围比20世纪早期发展起来的传统概率论所设想的要宽广得多.其结果是,"概率论"与"统计推断"之间臆想出来的区别消失了,这个领域不仅获得了逻辑上的统一和简洁,还获得了应用中远为强大和灵活的技术能力.

In the writer’s lectures, the emphasis was therefore on the quantitative formulation of Pólya’s viewpoint, so it could be used for general problems of scientific inference, almost all of which arise out of incomplete information rather than ‘randomness’. Some personal reminiscences about George Pólya and this start of the work are in Chapter 5.

因此,在作者的讲演中,重点放在Pólya观点的定量表述上,使其可用于科学推断的一般问题,而这些问题几乎都源于信息不完备,而不是"随机性".关于George Pólya以及这项工作起步阶段的一些个人回忆见第5章.

Once the development of applications started, the work of Harold Jeffreys, who had seen so much of it intuitively and seemed to anticipate every problem I would encounter, became again the central focus of attention. My debt to him is only partially indicated by the dedication of this book to his memory. Further comments about his work and its influence on mine are scattered about in several chapters.

一旦应用的开发开始,Harold Jeffreys的工作再次成为我关注的重心.他凭直觉洞察了其中的许多内容,似乎预见到了我将遇到的每一个问题.将本书题献给他以作纪念,也只能部分地表达我欠他的情分.关于他的工作及其对我的影响的进一步评论,散见于本书的多个章节.

In the years 1957–1970 the lectures were repeated, with steadily increasing content, at many other universities and research laboratories. 2 In this growth it became clear gradually that the outstanding difficulties of conventional ‘statistical inference’ are easily understood and overcome. But the rules which now took their place were quite subtle conceptually, and it required some deep thinking to see how to apply them correctly. Past difficulties, which had led to rejection of Laplace’s work, were seen finally as only misapplications, arising usually from failure to define the problem unambiguously or to appreciate the cogency of seemingly trivial side information, and easy to correct once this is recognized. The various relations between our ‘extended logic’ approach and the usual ‘random variable’ one appear in almost every chapter, in many different forms.

在1957-1970年间,这些讲座在许多其他高校和研究实验室(注2)反复进行,内容也稳步增加.在此过程中逐渐清楚的是,传统"统计推断"中的那些突出难题其实很容易理解和克服.但是,如今取代它们的规则在概念上相当微妙,需要深入思考才能正确应用.过去曾导致Laplace的工作被否定的那些困难,最终被看作只是误用:其原因通常是没能无歧义地定义问题,或者没能认识到看似琐碎的辅助信息的说服力;一旦认识到这一点,就很容易纠正.我们的"扩展逻辑"方法与通常的"随机变量"方法之间的各种关系,将以多种不同形式出现在几乎每一章中.

Eventually, the material grew to far more than could be presented in a short series of lectures, and the work evolved out of the pedagogical phase; with the clearing up of old difficulties accomplished, we found ourselves in possession of a powerful tool for dealing with new problems. Since about 1970 the accretion has continued at the same pace, but fed instead by the research activity of the writer and his colleagues. We hope that the final result has retained enough of its hybrid origins to be usable either as a textbook or as a reference work; indeed, several generations of students have carried away earlier versions of our notes, and in turn taught it to their students.

最终,内容远远超出了一个简短系列讲座所能呈现的范围,这项工作也脱离了教学阶段;随着旧有难题的澄清,我们发现自己掌握了一个处理新问题的强有力工具.大约从1970年以来,内容仍以同样的速度增长,但素材改为来自作者及其同事的研究工作.我们希望最终的成果保留了足够多的混合来源的特征,既可用作教科书,也可用作参考书;事实上,好几代学生带走了我们讲义的早期版本,又转而把它教给了他们的学生.

In view of the above, we repeat the sentence that Charles Darwin wrote in the Introduction to his Origin of Species: ‘I hope that I may be excused for entering on these personal details, as I give them to show that I have not been hasty in coming to a decision.’ But it might be thought that work done 30 years ago would be obsolete today. Fortunately, the work of Jeffreys, Pólya and Cox was of a fundamental, timeless character whose truth does not change and whose importance grows with time. Their perception about the nature of inference, which was merely curious 30 years ago, is very important in a half-dozen different areas of science today; and it will be crucially important in all areas 100 years hence.

鉴于以上所述,我想重复Charles Darwin在<物种起源>导言中写下的一句话:"我希望能原谅我讲这些个人细节,我给出它们是为了表明我并非草率地作出决定."但有人可能会认为,30年前做的工作在今天已经过时了.庆幸的是,Jeffreys, Pólya和Cox的工作具有基础性的,永恒的品格,其真理性不会改变,其重要性则与日俱增.他们对推断本质的认识,在30年前还只是被当作一种奇谈,如今已在五六个不同的科学领域中十分重要;而在100年后,它将在所有领域中至关重要.

2 Some of the material in the early chapters was issued in 1958 by the Socony-Mobil Oil Company as Number 4 in their series ‘Colloquium Lectures in Pure and Applied Science’.

注2: 前几章的一些材料于1958年由Socony-Mobil石油公司作为其"纯粹与应用科学讨论会讲座"系列的第4号发表.

Foundations

From many years of experience with its applications in hundreds of real problems, our views on the foundations of probability theory have evolved into something quite complex, which cannot be described in any such simplistic terms as ‘pro-this’ or ‘anti-that’. For example, our system of probability could hardly be more different from that of Kolmogorov, in style, philosophy, and purpose. What we consider to be fully half of probability theory as it is needed in current applications – the principles for assigning probabilities by logical analysis of incomplete information – is not present at all in the Kolmogorov system.

Yet, when all is said and done, we find ourselves, to our own surprise, in agreement with Kolmogorov and in disagreement with his critics, on nearly all technical issues. As noted in Appendix A, each of his axioms turns out to be, for all practical purposes, derivable from the Pólya–Cox desiderata of rationality and consistency. In short, we regard our system of probability as not contradicting Kolmogorov’s; but rather seeking a deeper logical foundation that permits its extension in the directions that are needed for modern applications. In this endeavor, many problems have been solved, and those still unsolved appear where we should naturally expect them: in breaking into new ground.

As another example, it appears at first glance to everyone that we are in very close agreement with the de Finetti system of probability. Indeed, the writer believed this for some time. Yet when all is said and done we find, to our own surprise, that little more than a loose philosophical agreement remains; on many technical issues we disagree strongly with de Finetti. It appears to us that his way of treating infinite sets has opened up a Pandora’s box of useless and unnecessary paradoxes; nonconglomerability and finite additivity are examples discussed in Chapter 15.

Infinite-set paradoxing has become a morbid infection that is today spreading in a way that threatens the very life of probability theory, and it requires immediate surgical removal. In our system, after this surgery, such paradoxes are avoided automatically; they cannot arise from correct application of our basic rules, because those rules admit only finite sets and infinite sets that arise as well-defined and well-behaved limits of finite sets. The paradoxing was caused by (1) jumping directly into an infinite set without specifying any limiting process to define its properties; and then (2) asking questions whose answers depend on how the limit was approached.

For example, the question: ‘What is the probability that an integer is even?’ can have any answer we please in (0, 1), depending on what limiting process is used to define the ‘set of all integers’ (just as a conditionally convergent series can be made to converge to any number we please, depending on the order in which we arrange the terms).
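To make the dependence on the limiting process concrete, here is a minimal sketch (not from the book; the two enumerations are illustrative choices) that counts even integers under two different orderings of the positive integers:

```python
# Minimal sketch (not from the book): the 'probability that an integer is even'
# depends on the limiting process used to build up the set of all integers.
# Both enumerations below eventually list every positive integer exactly once.

from itertools import islice

def natural_order():
    """1, 2, 3, 4, ...  The fraction of evens tends to 1/2."""
    n = 1
    while True:
        yield n
        n += 1

def two_evens_then_one_odd():
    """2, 4, 1, 6, 8, 3, ...  The fraction of evens tends to 2/3."""
    even, odd = 2, 1
    while True:
        yield even
        even += 2
        yield even
        even += 2
        yield odd
        odd += 2

def fraction_even(enumeration, n_terms=300_000):
    """Frequency of even numbers among the first n_terms of an enumeration."""
    first = list(islice(enumeration, n_terms))
    return sum(1 for x in first if x % 2 == 0) / n_terms

print("natural order:     ", fraction_even(natural_order()))            # ~0.500
print("two evens per odd: ", fraction_even(two_evens_then_one_odd()))   # ~0.667
```

Both sequences exhaust the same set of integers, yet the limiting frequencies differ, just as rearranging a conditionally convergent series changes its sum.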

In our view, an infinite set cannot be said to possess any ‘existence’ and mathematical properties at all – at least, in probability theory – until we have specified the limiting process that is to generate it from a finite set. In other words, we sail under the banner of Gauss, Kronecker, and Poincaré rather than Cantor, Hilbert, and Bourbaki. We hope that readers who are shocked by this will study the indictment of Bourbakism by the mathematician Morris Kline (1980), and then bear with us long enough to see the advantages of our approach. Examples appear in almost every chapter.

Comparisons

For many years, there has been controversy over ‘frequentist’ versus ‘Bayesian’ methods of inference, in which the writer has been an outspoken partisan on the Bayesian side. The record of this up to 1981 is given in an earlier book (Jaynes, 1983). In these old works there was a strong tendency, on both sides, to argue on the level of philosophy or ideology. We can now hold ourselves somewhat aloof from this, because, thanks to recent work, there is no longer any need to appeal to such arguments. We are now in possession of proven theorems and masses of worked-out numerical examples. As a result, the superiority of Bayesian methods is now a thoroughly demonstrated fact in a hundred different areas. One can argue with a philosophy; it is not so easy to argue with a computer printout, which says to us: ‘Independently of all your philosophy, here are the facts of actual performance.’ We point this out in some detail whenever there is a substantial difference in the final results. Thus we continue to argue vigorously for the Bayesian methods; but we ask the reader to note that our arguments now proceed by citing facts rather than proclaiming a philosophical or ideological position.

However, neither the Bayesian nor the frequentist approach is universally applicable, so in the present, more general, work we take a broader view of things. Our theme is simply: probability theory as extended logic. The ‘new’ perception amounts to the recognition that the mathematical rules of probability theory are not merely rules for calculating frequencies of ‘random variables’; they are also the unique consistent rules for conducting inference (i.e. plausible reasoning) of any kind, and we shall apply them in full generality to that end.

It is true that all ‘Bayesian’ calculations are included automatically as particular cases of our rules; but so are all ‘frequentist’ calculations. Nevertheless, our basic rules are broader than either of these, and in many applications our calculations do not fit into either category.

To explain the situation as we see it presently: The traditional ‘frequentist’ methods which use only sampling distributions are usable and useful in many particularly simple, idealized problems; however, they represent the most proscribed special cases of probability theory, because they presuppose conditions (independent repetitions of a ‘random experiment’ but no relevant prior information) that are hardly ever met in real problems. This approach is quite inadequate for the current needs of science.

In addition, frequentist methods provide no technical means to eliminate nuisance parameters or to take prior information into account, no way even to use all the information in the data when sufficient or ancillary statistics do not exist. Lacking the necessary theoretical principles, they force one to ‘choose a statistic’ from intuition rather than from probability theory, and then to invent ad hoc devices (such as unbiased estimators, confidence intervals, tail-area significance tests) not contained in the rules of probability theory. Each of these is usable within the small domain for which it was invented but, as Cox’s theorems guarantee, such arbitrary devices always generate inconsistencies or absurd results when applied to extreme cases; we shall see dozens of examples.

All of these defects are corrected by use of Bayesian methods, which are adequate for what we might call ‘well-developed’ problems of inference. As Harold Jeffreys demonstrated, they have a superb analytical apparatus, able to deal effortlessly with the technical problems on which frequentist methods fail. They determine the optimal estimators and algorithms automatically, while taking into account prior information and making proper allowance for nuisance parameters, and, being exact, they do not break down – but continue to yield reasonable results – in extreme cases. Therefore they enable us to solve problems of far greater complexity than can be discussed at all in frequentist terms. One of our main purposes is to show how all this capability was contained already in the simple product and sum rules of probability theory interpreted as extended logic, with no need for – indeed, no room for – any ad hoc devices.

Before Bayesian methods can be used, a problem must be developed beyond the ‘exploratory phase’ to the point where it has enough structure to determine all the needed apparatus (a model, sample space, hypothesis space, prior probabilities, sampling distribution). Almost all scientific problems pass through an initial exploratory phase in which we have need for inference, but the frequentist assumptions are invalid and the Bayesian apparatus is not yet available. Indeed, some of them never evolve out of the exploratory phase. Problems at this level call for more primitive means of assigning probabilities directly out of our incomplete information.

For this purpose, the Principle of maximum entropy has at present the clearest theoretical justification and is the most highly developed computationally, with an analytical apparatus as powerful and versatile as the Bayesian one. To apply it we must define a sample space, but do not need any model or sampling distribution. In effect, entropy maximization creates a model for us out of our data, which proves to be optimal by so many different criteria 3 that it is hard to imagine circumstances where one would not want to use it in a problem where we have a sample space but no model. Bayesian and maximum entropy methods differ in another respect. Both procedures yield the optimal inferences from the information that went into them, but we may choose a model for Bayesian analysis; this amounts to expressing some prior knowledge – or some working hypothesis – about the phenomenon being observed. Usually, such hypotheses extend beyond what is directly observable in the data, and in that sense we might say that Bayesian methods are – or at least may be – speculative. If the extra hypotheses are true, then we expect that the Bayesian results will improve on maximum entropy; if they are false, the Bayesian inferences will likely be worse.

On the other hand, maximum entropy is a nonspeculative procedure, in the sense that it invokes no hypotheses beyond the sample space and the evidence that is in the available data. Thus it predicts only observable facts (functions of future or past observations) rather than values of parameters which may exist only in our imagination. It is just for that reason that maximum entropy is the appropriate (safest) tool when we have very little knowledge beyond the raw data; it protects us against drawing conclusions not warranted by the data. But when the information is extremely vague, it may be difficult to define any appropriate sample space, and one may wonder whether still more primitive principles than maximum entropy can be found. There is room for much new creative thought here.
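As a concrete illustration of how entropy maximization turns a sample space plus a testable constraint into a model, here is a minimal sketch (assuming numpy and scipy are available; the six-sided die and the mean constraint of 4.5 are a standard textbook illustration, not data from this book):

```python
# Minimal sketch: maximum entropy for a die whose average outcome is
# constrained to 4.5.  Sample space = {1,...,6}; no sampling distribution or
# model is assumed beyond that single constraint.

import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)
target_mean = 4.5                      # the single testable constraint

def mean_at(lam):
    """Mean of the exponential-family distribution p_k proportional to exp(lam*k)."""
    w = np.exp(lam * faces)
    return np.sum(faces * w) / np.sum(w)

# The maximum-entropy solution has the form p_k ~ exp(lam * k); choose lam so
# that the constraint is satisfied.
lam = brentq(lambda l: mean_at(l) - target_mean, -5.0, 5.0)
p = np.exp(lam * faces)
p /= p.sum()

print("probabilities:", np.round(p, 4))
print("mean:", np.sum(faces * p))            # -> 4.5
print("entropy:", -np.sum(p * np.log(p)))    # largest among all p with this mean
```

Every other distribution over {1, ..., 6} with mean 4.5 has lower entropy; in that sense the procedure assumes nothing beyond the sample space and the constraint.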

For the present, there are many important and highly nontrivial applications where Maximum Entropy is the only tool we need. Part 2 of this work considers them in detail; usually, they require more technical knowledge of the subject-matter area than do the more general applications studied in Part 1. All of presently known statistical mechanics, for example, is included in this, as are the highly successful Maximum Entropy spectrum analysis and image reconstruction algorithms in current use. However, we think that in the future the latter two applications will evolve into the Bayesian phase, as we become more aware of the appropriate models and hypothesis spaces which enable us to incorporate more prior information.

We are conscious of having so many theoretical points to explain that we fail to present as many practical worked-out numerical examples as we should. Fortunately, three recent books largely make up this deficiency, and should be considered as adjuncts to the present work: Bayesian Spectrum Analysis and Parameter Estimation (Bretthorst, 1988), Maximum Entropy in Action (Buck and Macaulay, 1991), and Data Analysis – A Bayesian Tutorial (Sivia, 1996), are written from a viewpoint essentially identical to ours and present a wealth of real problems carried through to numerical solutions. Of course, these works do not contain nearly as much theoretical explanation as does the present one. Also, the Proceedings volumes of the various annual MAXENT workshops since 1981 consider a great variety of useful applications.

3 These concern efficient information handling; for example, (1) the model created is the simplest one that captures all the information in the constraints (Chapter 11); (2) it is the unique model for which the constraints would have been sufficient statistics (Chapter 8); (3) if viewed as constructing a sampling distribution for subsequent Bayesian inference from new data D, the only property of the measurement errors in D that are used in that subsequent inference are the ones about which that sampling distribution contained some definite prior information (Chapter 7). Thus the formalism automatically takes into account all the information we have, but avoids assuming information that we do not have. This contrasts sharply with orthodox methods, where one does not think in terms of information at all, and in general violates both of these desiderata.

Mental activity

As one would expect already from Pólya’s examples, probability theory as extended logic reproduces many aspects of human mental activity, sometimes in surprising and even disturbing detail. In Chapter 5 we find our equations exhibiting the phenomenon of a person who tells the truth and is not believed, even though the disbelievers are reasoning consistently. The theory explains why and under what circumstances this will happen.

The equations also reproduce a more complicated phenomenon, divergence of opinions. One might expect that open discussion of public issues would tend to bring about a general consensus. On the contrary, we observe repeatedly that when some controversial issue has been discussed vigorously for a few years, society becomes polarized into two opposite extreme camps; it is almost impossible to find anyone who retains a moderate view. Probability theory as logic shows how two persons, given the same information, may have their opinions driven in opposite directions by it, and what must be done to avoid this.

In such respects, it is clear that probability theory is telling us something about the way our own minds operate when we form intuitive judgments, of which we may not have been consciously aware. Some may feel uncomfortable at these revelations; others may see in them useful tools for psychological, sociological, or legal research.

What is ‘safe’? 什么是"安全"?

We are not concerned here only with abstract issues of mathematics and logic. One of the main practical messages of this work is the great effect of prior information on the conclusions that one should draw from a given data set. Currently much-discussed issues, such as environmental hazards or the toxicity of a food additive, cannot be judged rationally if one looks only at the current data and ignores the prior information that scientists have about the phenomenon. This can lead one to overestimate or underestimate the danger.

我们在这里关心的不仅仅是抽象的数学和逻辑问题.本书主要的实际启示之一是,先验信息对于从给定数据集应当得出的结论有重大影响.当今被广泛讨论的一些问题,比如环境危害或食品添加剂的毒性,如果只看当前数据而忽略科学家关于该现象已有的先验信息,是无法理性判断的;这会导致高估或低估危险.

A common error, when judging the effects of radioactivity or the toxicity of some substance, is to assume a linear response model without threshold (i.e. without a dose rate below which there is no ill effect). Presumably there is no threshold effect for cumulative poisons like heavy metal ions (mercury, lead), which are eliminated only very slowly, if at all. But for virtually every organic substance (such as saccharin or cyclamates), the existence of a finite metabolic rate means that there must exist a finite threshold dose rate, below which the substance is decomposed, eliminated, or chemically altered so rapidly that it causes no ill effects. If this were not true, the human race could never have survived to the present time, in view of all the things we have been eating.

在判断放射性的影响或某种物质的毒性时,一个常见的错误是假设一个没有阈值的线性反应模型(即不存在一个低于它就没有不良影响的剂量率).对于像重金属离子(汞,铅)这类累积性毒物,由于它们即使能被排出也极其缓慢,大概确实不存在阈值效应.但对于几乎所有的有机物质(如糖精或甜蜜素),有限代谢率的存在意味着必然存在一个有限的阈值剂量率:在低于它时,该物质会被足够迅速地分解,排出或发生化学转化,以致不产生不良影响.如果不是这样,考虑到我们一直在吃的各种东西,人类根本不可能存活至今.

Indeed, every mouthful of food you and I have ever taken contained many billions of kinds of complex molecules whose structure and physiological effects have never been determined – and many millions of which would be toxic or fatal in large doses. We cannot doubt that we are daily ingesting thousands of substances that are far more dangerous than saccharin – but in amounts that are safe, because they are far below the various thresholds of toxicity. At present, there are hardly any substances, except some common drugs, for which we actually know the threshold.

实际上,你我吃下的每一口食物中都含有数十亿种复杂分子,它们的结构和生理效应从未被确定,其中数百万种在大剂量下会有毒甚至致命.毫无疑问,我们每天都在摄入上千种远比糖精危险的物质,但摄入量是安全的,因为远低于各自的毒性阈值.目前,除了一些常用药物之外,我们真正知道其阈值的物质几乎没有.

Therefore, the goal of inference in this field should be to estimate not only the slope of the response curve, but, far more importantly, to decide whether there is evidence for a threshold; and, if there is, to estimate its magnitude (the ‘maximum safe dose’). For example, to tell us that a sugar substitute can produce a barely detectable incidence of cancer in doses 1000 times greater than would ever be encountered in practice, is hardly an argument against using the substitute; indeed, the fact that it is necessary to go to kilodoses in order to detect any ill effects at all, is rather conclusive evidence, not of the danger, but of the safety, of a tested substance. A similar overdose of sugar would be far more dangerous, leading not to barely detectable harmful effects, but to sure, immediate death by diabetic coma; yet nobody has proposed to ban the use of sugar in food.

因此,这个领域中推断的目标不应仅仅是估计反应曲线的斜率,更重要的是判断是否有证据表明存在阈值;如果有,则估计其大小("最大安全剂量").例如,告诉我们某种代糖在比实际可能遇到的剂量高1000倍时会产生勉强可检测到的癌症发生率,几乎不能构成反对使用这种代糖的理由;事实上,必须用到千倍剂量才能检测到任何不良影响,这恰恰是相当确凿的证据,证明的不是受试物质的危险,而是它的安全.同样过量的糖要危险得多,导致的不是勉强可检测的危害,而是因糖尿病昏迷而确定无疑的,立即的死亡;但没有人因此提议禁止在食物中使用糖.

Kilodose effects are irrelevant because we do not take kilodoses; in the case of a sugar substitute the important question is: What are the threshold doses for toxicity of a sugar substitute and for sugar, compared with the normal doses? If that of a sugar substitute is higher, then the rational conclusion would be that the substitute is actually safer than sugar, as a food ingredient. To analyze one’s data in terms of a model which does not allow even the possibility of a threshold effect is to prejudge the issue in a way that can lead to false conclusions, however good the data. If we hope to detect any phenomenon, we must use a model that at least allows the possibility that it may exist.

千倍剂量下的效应无关紧要,因为我们不会摄入千倍剂量;在代糖的例子中,重要的问题是:与正常剂量相比,代糖和糖的毒性阈值剂量各是多少?如果代糖的阈值更高,那么理性的结论是:作为食物成分,代糖实际上比糖更安全.用一个根本不允许阈值效应存在的模型来分析数据,等于以一种可能导致错误结论的方式对问题作了预判,无论数据有多好.如果我们希望检测某种现象,就必须使用至少允许它可能存在的模型.
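To make the point about model structure concrete, the following minimal sketch (not from the book; the doses, effects, and the piecewise-linear threshold form are hypothetical illustrations) contrasts a linear no-threshold model with one that at least allows a threshold:

```python
# Minimal sketch: a model that forbids a threshold can never detect one,
# however good the data.  All numbers below are hypothetical.

import numpy as np

def linear_no_threshold(dose, slope):
    """Ill effect assumed proportional to dose all the way down to zero."""
    return slope * dose

def with_threshold(dose, slope, threshold):
    """No ill effect below the threshold; linear response above it."""
    return slope * np.clip(dose - threshold, 0.0, None)

# Hypothetical observations: no detectable effect below a dose of about 1.0.
dose = np.array([0.1, 0.3, 0.5, 1.0, 2.0, 4.0, 8.0])
effect = np.array([0.0, 0.0, 0.0, 0.0, 1.1, 3.0, 6.9])

def sse(model, *params):
    """Sum of squared residuals for a given model and parameter values."""
    return np.sum((effect - model(dose, *params)) ** 2)

# Crude grid searches, good enough for the illustration.
best_lnt = min((sse(linear_no_threshold, s), s) for s in np.linspace(0.1, 2, 200))
best_thr = min((sse(with_threshold, s, t), s, t)
               for s in np.linspace(0.1, 2, 100)
               for t in np.linspace(0.0, 4, 100))

print("no-threshold model: best SSE =", round(best_lnt[0], 3))
print("threshold model:    best SSE =", round(best_thr[0], 3),
      " estimated threshold ~", round(best_thr[2], 2))
# Only the second model is even capable of estimating a 'maximum safe dose'.
```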

We emphasize this in the Preface because false conclusions of just this kind are now not only causing major economic waste, but also creating unnecessary dangers to public health and safety. Society has only finite resources to deal with such problems, so any effort expended on imaginary dangers means that real dangers are going unattended. Even worse, the error is incorrectible by the currently most used data analysis procedures; a false premise built into a model which is never questioned cannot be removed by any amount of new data. Use of models which correctly represent the prior information that scientists have about the mechanism at work can prevent such folly in the future.

我们在序言中强调这一点,是因为正是这类错误结论如今不仅造成重大的经济浪费,还给公众健康与安全带来不必要的危险.社会用于应对这些问题的资源是有限的,所以任何花在想象出来的危险上的努力,都意味着真正的危险无人理会.更糟的是,目前最常用的数据分析程序无法纠正这种错误:一个被植入模型且从不受质疑的错误前提,无论再多的新数据也无法消除.今后,只有使用能正确表达科学家们关于作用机制的先验信息的模型,才能避免这类蠢事.

Such considerations are not the only reasons why prior information is essential in inference; the progress of science itself is at stake. To see this, note a corollary to the preceding paragraph: that new data that we insist on analyzing in terms of old ideas (that is, old models which are not questioned) cannot lead us out of the old ideas. However many data we record and analyze, we may just keep repeating the same old errors, missing the same crucially important things that the experiment was competent to find. That is what ignoring prior information can do to us; no amount of analyzing coin tossing data by a stochastic model could have led us to the discovery of Newtonian mechanics, which alone determines those data.

这类考虑并不是先验信息在推断中必不可少的唯一原因;科学本身的进步也系于此.注意上一段的一个推论:如果我们坚持用旧观念(即从不受质疑的旧模型)去分析新数据,这些数据就不可能把我们带出旧观念.无论记录和分析多少数据,我们都可能只是在重复同样的老错误,错过那些实验本来有能力发现的至关重要的东西.这就是忽略先验信息可能给我们造成的后果:用随机模型去分析抛硬币的数据,无论分析多少,都不可能让我们发现牛顿力学,而唯有牛顿力学才真正决定了这些数据.

Old data, when seen in the light of new ideas, can give us an entirely new insight into a phenomenon; we have an impressive recent example of this in the Bayesian spectrum analysis of nuclear magnetic resonance data, which enables us to make accurate quantitative determinations of phenomena which were not accessible to observation at all with the previously used data analysis by Fourier transforms. When a data set is mutilated (or, to use the common euphemism, ‘filtered’) by processing according to false assumptions, important information in it may be destroyed irreversibly. As some have recognized, this is happening constantly from orthodox methods of detrending or seasonal adjustment in econometrics. However, old data sets, if preserved unmutilated by old assumptions, may have a new lease on life when our prior information advances.

在新思想的光照下,旧数据可以给我们带来对现象的全新洞察;最近就有一个令人印象深刻的例子:核磁共振数据的贝叶斯谱分析使我们能够对一些现象作出精确的定量测定,而用以前的傅立叶变换数据分析方法根本观察不到这些现象.当数据集按照错误的假设被肢解(或者用常见的委婉说法,被"过滤")之后,其中的重要信息可能被不可逆地破坏.正如一些人已经认识到的,计量经济学中正统的去趋势和季节调整方法就在不断造成这种情况.然而,只要旧数据集未被旧假设肢解而得以保存,当我们的先验信息进步时,它们就可能重获新生.

Style of presentation 表达风格

In Part 1, expounding principles and elementary applications, most chapters start with several pages of verbal discussion of the nature of the problem. Here we try to explain the constructive ways of looking at it, and the logical pitfalls responsible for past errors. Only then do we turn to the mathematics, solving a few of the problems of the genre to the point where the reader may carry it on by straightforward mathematical generalization. In Part 2, expounding more advanced applications, we can concentrate from the start on the mathematics.

在阐述原理和基础应用的第一部分中,大多数章节都以几页关于问题本质的文字讨论开始.我们试图在这里解释看待问题的建设性方式,以及导致过去错误的逻辑陷阱.然后才转向数学,把这类问题中的几个解到这样的程度:读者可以通过直接的数学推广自己继续下去.在阐述更高级应用的第二部分中,我们可以从一开始就专注于数学.

The writer has learned from much experience that this primary emphasis on the logic of the problem, rather than the mathematics, is necessary in the early stages. For modern students, the mathematics is the easy part; once a problem has been reduced to a definite mathematical exercise, most students can solve it effortlessly and extend it endlessly, without further help from any book or teacher. It is in the conceptual matters (how to make the initial connection between the real-world problem and the abstract mathematics) that they are perplexed and unsure how to proceed.

作者从大量经验中得知,在早期阶段,把重点首先放在问题的逻辑而非数学上是必要的.对现在的学生来说,数学是容易的部分:一旦问题被归结为明确的数学练习,大多数学生都能毫不费力地解决它并无止境地加以推广,无需任何书本或老师的进一步帮助.让他们感到困惑,不知如何下手的,是概念性的问题:如何在现实世界的问题与抽象数学之间建立最初的联系.

Recent history demonstrates that anyone foolhardy enough to describe his own work as ‘rigorous’ is headed for a fall. Therefore, we shall claim only that we do not knowingly give erroneous arguments. We are conscious also of writing for a large and varied audience, for most of whom clarity of meaning is more important than ‘rigor’ in the narrow mathematical sense.

近来的历史表明,任何鲁莽到把自己的工作说成"严格"的人都注定要栽跟头.因此,我们只声称:我们不会明知故犯地给出错误的论证.我们也意识到本书面向众多不同背景的读者,对其中大多数人来说,意义的清晰比狭隘数学意义上的"严格"更重要.

There are two more, even stronger, reasons for placing our primary emphasis on logic and clarity. Firstly, no argument is stronger than the premises that go into it, and, as Harold Jeffreys noted, those who lay the greatest stress on mathematical rigor are just the ones who, lacking a sure sense of the real world, tie their arguments to unrealistic premises and thus destroy their relevance. Jeffreys likened this to trying to strengthen a building by anchoring steel beams into plaster. An argument which makes it clear intuitively why a result is correct is actually more trustworthy, and more likely of a permanent place in science, than is one that makes a great overt show of mathematical rigor unaccompanied by understanding.

把重点首先放在逻辑和清晰上,还有两个甚至更有力的理由.首先,没有任何论证比其前提更有力;正如Harold Jeffreys指出的,那些最强调数学严格性的人,恰恰是缺乏对真实世界的切实感觉,把论证系于不现实的前提之上,从而使其失去现实意义的人.Jeffreys把这比作把钢梁锚固在灰泥里来加固建筑.一个能让人直观地明白结果为什么正确的论证,实际上比一个大张旗鼓地炫耀数学严格性却缺乏理解的论证更可信,也更可能在科学中占据长久的位置.

Secondly, we have to recognize that there are no really trustworthy standards of rigor in a mathematics that has embraced the theory of infinite sets. Morris Kline (1980, p. 351) came close to the Jeffreys simile: ‘Should one design a bridge using theory involving infinite sets or the axiom of choice? Might not the bridge collapse?’ The only real rigor we have today is in the operations of elementary arithmetic on finite sets of finite integers, and our own bridge will be safest from collapse if we keep this in mind.

其次,我们必须认识到,在一个已经拥抱了无穷集合论的数学中,并不存在真正可信赖的严格性标准.Morris Kline(1980,第351页)的说法与Jeffreys的比喻很接近:"人们应该用涉及无穷集合或选择公理的理论来设计桥梁吗?桥会不会因此坍塌?"我们今天拥有的唯一真正的严格性,存在于对有限整数的有限集合所作的初等算术运算之中;只要记住这一点,我们自己的桥就最不容易坍塌.

Of course, it is essential that we follow this ‘finite sets’ policy whenever it matters for our results; but we do not propose to become fanatical about it. In particular, the arts of computation and approximation are on a different level than that of basic principle; and so once a result is derived from strict application of the rules, we allow ourselves to use any convenient analytical methods for evaluation or approximation (such as replacing a sum by an integral) without feeling obliged to show how to generate an uncountable set as the limit of a finite one.

当然,凡是对结果有影响的地方,遵循这一“有限集合”策略是必不可少的;但我们并不打算为此变得狂热。特别地,计算和近似的技艺与基本原理处于不同的层次;因此,一旦某个结果是通过严格应用规则得到的,我们就允许自己使用任何方便的解析方法来求值或作近似(比如把求和换成积分),而不必非得说明如何把一个不可数集合作为有限集合的极限生成出来。

We impose on ourselves a far stricter adherence to the mathematical rules of probability theory than was ever exhibited in the ‘orthodox’ statistical literature, in which authors repeatedly invoke the aforementioned intuitive ad hoc devices to do, arbitrarily and imperfectly, what the rules of probability theory would have done for them uniquely and optimally. It is just this strict adherence that enables us to avoid the artificial paradoxes and contradictions of orthodox statistics, as described in Chapters 15 and 17.

我们对概率论数学规则的遵守,要比“正统”统计文献中曾表现出的严格得多;在那些文献中,作者们反复借助前面提到的直觉性的特设手段,随意而不完美地去做概率论规则本可以唯一且最优地为他们完成的事情。正是这种严格遵守,使我们避免了正统统计中那些人为的悖论和矛盾,正如第15章和第17章所描述的那样。

Equally important, this policy often simplifies the computations in two ways: (i) the problem of determining the sampling distribution of a ‘statistic’ is eliminated, and the evidence of the data is displayed fully in the likelihood function, which can be written down immediately; and (ii) one can eliminate nuisance parameters at the beginning of a calculation, thus reducing the dimensionality of a search algorithm. If there are several parameters in a problem, this can mean orders of magnitude reduction in computation over what would be needed with a least squares or maximum likelihood algorithm. The Bayesian computer programs of Bretthorst (1988) demonstrate these advantages impressively, leading in some cases to major improvements in the ability to extract information from data, over previously used methods. But this has barely scratched the surface of what can be done with sophisticated Bayesian models. We expect a great proliferation of this field in the near future.

同样重要的是,这一策略通常以两种方式简化计算:(i)确定某个“统计量”的抽样分布的问题不复存在,数据所提供的证据完全展示在似然函数中,而似然函数可以立即写出;(ii)可以在计算一开始就消去冗余参数,从而降低搜索算法的维数。如果问题中有多个参数,这意味着与最小二乘或最大似然算法所需的计算量相比,计算量可以减少若干个数量级。Bretthorst(1988)的贝叶斯计算机程序令人印象深刻地展示了这些优点,在某些情况下,从数据中提取信息的能力比以前使用的方法有了重大提高。但这仅仅触及了复杂贝叶斯模型所能做到的事情的皮毛;我们预计这一领域在不久的将来会大为兴盛。
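As an illustration of point (ii), here is a minimal sketch (not from the book; the data, the grids, and the flat and 1/sigma priors are illustrative assumptions) of integrating out a nuisance parameter at the start of a calculation, leaving a one-dimensional posterior for the parameter of interest:

```python
# Minimal sketch: eliminate a nuisance parameter (the noise level sigma) by
# marginalization, so only a 1-D posterior for mu remains instead of a 2-D search.

import numpy as np

data = np.array([4.8, 5.3, 5.1, 4.6, 5.4, 5.0])   # hypothetical measurements

mu_grid = np.linspace(3.0, 7.0, 400)       # parameter of interest
sigma_grid = np.linspace(0.05, 3.0, 400)   # nuisance parameter (noise level)

def log_likelihood(mu, sigma):
    """Gaussian log-likelihood of the data for given mu and sigma."""
    return -len(data) * np.log(sigma) - np.sum((data - mu) ** 2) / (2 * sigma ** 2)

# Joint log-posterior on a grid: flat prior in mu, Jeffreys prior 1/sigma in sigma.
log_post = np.array([[log_likelihood(m, s) - np.log(s) for s in sigma_grid]
                     for m in mu_grid])
post = np.exp(log_post - log_post.max())

# Marginalize (sum) over the nuisance parameter, then normalize.
post_mu = post.sum(axis=1)
post_mu /= post_mu.sum()

print("posterior mode of mu ~", round(float(mu_grid[np.argmax(post_mu)]), 3))
# The estimate is close to the sample mean, but no search over sigma was needed.
```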

A scientist who has learned how to use probability theory directly as extended logic has a great advantage in power and versatility over one who has learned only a collection of unrelated ad hoc devices. As the complexity of our problems increases, so does this relative advantage. Therefore we think that, in the future, workers in all the quantitative sciences will be obliged, as a matter of practical necessity, to use probability theory in the manner expounded here. This trend is already well under way in several fields, ranging from econometrics to astronomy to magnetic resonance spectroscopy; but, to make progress in a new area, it is necessary to develop a healthy disrespect for tradition and authority, which have retarded progress throughout the 20th century.

一个学会了如何把概率论直接当作扩展逻辑来使用的科学家,与只学了一堆互不相关的特设手段的人相比,在能力和通用性上具有巨大的优势。随着问题复杂性的增加,这种相对优势也会增大。因此我们认为,将来所有定量科学领域的工作者,出于实际的需要,都不得不按照这里阐述的方式使用概率论。从计量经济学到天文学再到磁共振波谱学,这一趋势已在若干领域顺利展开;但要在一个新领域取得进展,就必须对传统和权威培养一种健康的不敬,因为正是它们在整个20世纪一直阻碍着进步。

Finally, some readers should be warned not to look for hidden subtleties of meaning which are not present. We shall, of course, explain and use all the standard technical jargon of probability and statistics – because that is our topic. But, although our concern with the nature of logical inference leads us to discuss many of the same issues, our language differs greatly from the stilted jargon of logicians and philosophers. There are no linguistic tricks, and there is no ‘meta-language’ gobbledygook; only plain English. We think that this will convey our message clearly enough to anyone who seriously wants to understand it. In any event, we feel sure that no further clarity would be achieved by taking the first few steps down that infinite regress that starts with: ‘What do you mean by “exists”?’

最后,应当提醒某些读者:不要去寻找并不存在的微妙的隐含意义。当然,我们会解释并使用概率与统计的全部标准术语——因为那正是我们的主题。但是,尽管我们对逻辑推断本质的关注使我们讨论了许多相同的问题,我们的语言与逻辑学家和哲学家那种生硬的行话却大不相同。这里没有语言花招,也没有“元语言”式的故弄玄虚,只有平实的英语。我们认为,这足以把我们要说的东西清楚地传达给任何真正想理解它的人。无论如何,我们确信,沿着那条以“你说的‘存在’是什么意思?”开头的无穷倒退走上头几步,并不会带来更多的清晰。

Acknowledgments

In addition to the inspiration received from the writings of Jeffreys, Cox, Pólya, and Shannon, I have profited by interaction with some 300 former students, who have diligently caught my errors and forced me to think more carefully about many issues. Also, over the years, my thinking has been influenced by discussions with many colleagues; to list a few (in the reverse alphabetical order preferred by some): Arnold Zellner, Eugene Wigner, George Uhlenbeck, John Tukey, William Sudderth, Stephen Stigler, Ray Smith, John Skilling, Jimmie Savage, Carlos Rodriguez, Lincoln Moses, Elliott Montroll, Paul Meier, Dennis Lindley, David Lane, Mark Kac, Harold Jeffreys, Bruce Hill, Mike Hardy, Stephen Gull, Tom Grandy, Jack Good, Seymour Geisser, Anthony Garrett, Fritz Fröhner, Willy Feller, Anthony Edwards, Morrie de Groot, Phil Dawid, Jerome Cornfield, John Parker Burg, David Blackwell, and George Barnard. While I have not agreed with all of the great variety of things they told me, it has all been taken into account in one way or another in the following pages. Even when we ended in disagreement on some issue, I believe that our frank private discussions have enabled me to avoid misrepresenting their positions, while clarifying my own thinking; I thank them for their patience.

E.T. Jaynes
July, 1996
