Programming Language Syntax

Unlike natural languages such as English or Chinese,computer languages must be precise. Both their form (syntax) and meaning (semantics) must be specified without ambiguity, so that both programmers and computers can tell what a program is supposed to do. To provide the needed degree of precision, language designers and implementors use formal syntactic and semantic notation. To facilitate the discussion of language features in later chapters, we will cover this notation first: syntax in the current chapter and semantics in Chapter 4.

与英语或中文等自然语言不同,计算机语言必须要精确。它们的形式(语法)和含义(语义)必须要明确定义,以便程序员和计算机都能理解程序的预期行为。为了提供所需的精确度,语言设计者和实现者使用正式的语法和语义符号。为了便于后续章节中讨论语言特性,我们将首先介绍这些符号:本章介绍语法,第四章介绍语义。

Specifying Syntax: Regular Expressions and Context-Free Grammars

编程语言的语法之所以使用BNF(巴科斯-诺尔范式)而不是正则表达式,主要是因为编程语言的语法复杂性通常超出了正则表达式的表达能力。以下是几个关键点来解释这一选择:

  1. 表达能力:

    • 正则表达式:它们是用来描述正规语言的,只能表达有限的语法模式,通常用于简单的字符串匹配。正则表达式不能很好地处理递归模式,这是编程语言常见的结构,如嵌套的括号、if-else结构和函数调用等。
    • BNF范式:它是一种用于表示上下文无关文法(CFG)的符号,能够描述更复杂的语言结构,包括递归,这使得BNF非常适合表示编程语言的语法。
  2. 递归结构:

    • 编程语言中经常会有递归结构,比如在语言中可以有无限嵌套的函数调用或者循环结构。BNF范式自然而然地支持递归定义,而正则表达式则不支持。
  3. 上下文无关文法:

    • BNF是上下文无关文法的一种表示方法,它能够描述大多数编程语言的语法。上下文无关文法比正则表达式的表达能力更强,因为它允许使用非终结符和规则来定义语言的语法结构。
  4. 解析算法:

    • BNF范式定义的语法可以直接被转换成解析器,如LL或LR解析器,这些解析器可以构建出源代码的抽象语法树(AST)。正则表达式不适合这种用途,因为它们缺乏构建复杂结构的能力。
  5. 可读性和维护性:

    • BNF范式提供了一种清晰的方式来描述语法规则,这让语言设计者更容易定义、阅读和维护语言的语法。相比之下,复杂的正则表达式可以难以理解和维护。

总之,正则表达式在处理简单的文本模式时非常有用,但对于定义编程语言这样结构复杂、层次多、需要递归定义的语法来说,BNF范式提供了更强大、更灵活的语法描述能力。

Scanning

By grouping input characters into tokens, the scanner dramatically reduces the number of individual items that must be inspected by the more computationally intensive parser. In addition, the scanner typically removes comments (so the parser doesn’t have to worry about them appearing throughout the context-free grammar); saves the text of “interesting” tokens like identifiers, strings, and numeric literals; and tags tokens with line and column numbers, to make it easier to generate high-quality error messages in subsequent phases.

通过将输入字符分组成标记,扫描器大大减少了需要由计算密集型的解析器检查的单个项的数量。此外,扫描器通常会移除注释(这样解析器就不必担心它们出现在上下文无关语法中);保存“有趣”的标记的文本,比如标识符、字符串和数字字面量;并给标记添加行号和列号,以便在后续阶段生成高质量的错误消息变得更加容易。

Generating a Finite Automaton

While a finite automaton can in principle be written by hand, it is more common to build one automatically from a set of regular expressions, using a scanner generator tool.

有限自动机原则上可以手动编写,但更常见的做法是利用扫描器生成工具从一组正则表达式自动构建。

As it turns out, however, there is no obvious one-step algorithm to convert a set of regular expressions into an equivalent deterministic finite automaton (DFA). The typical scanner generator implements the conversion as a series of three separate steps.

然而,事实证明,并没有明显的一步算法可以将一组正则表达式转换为等价的确定有限自动机(DFA)。典型的扫描器生成器将转换实现为三个单独步骤的系列。

  • The first step converts the regular expressions into a nondeterministic finite automaton (NFA).
  • the second step of a scanner generatortranslates the NFA into an equivalentDFA
  • The third step is a space optimization that generates a final DFA with the minimum possible number of states.

Scanner Code

We can implement a scanner that explicitly captures the “circles-and-arrows” structure of a DFA in either of two main ways. One embeds the automaton in the control flow of the program using gotos or nested case (switch) statements; the other, described in the following subsection, uses a table and a driver.

我们可以通过两种主要方式之一,明确地捕获DFA的“圆圈和箭头”结构来实现扫描器。一种方式是利用goto或嵌套的case(switch)语句将自动机嵌入程序的控制流中;另一种方式在下一小节中描述,它使用表和驱动程序。

Parsing

The parser is the heart of a typical compiler. It calls the scanner to obtain the tokens of the input program, assembles the tokens together into a syntax tree, and passes the tree (perhaps one subroutine at a time) to the later phases of the compiler, which perform semantic analysis and code generation and improvement. In effect, the parser is “in charge” of the entire compilation process; this style of compilation is sometimes referred to as syntax-directed translation.

解析器是典型编译器的核心。它调用扫描器以获取输入程序的标记,将这些标记组装成语法树,然后将语法树(可能一次一个子程序)传递给编译器的后续阶段,这些阶段执行语义分析、代码生成和优化。实际上,解析器“主导”整个编译过程;这种编译风格有时被称为语法定向翻译。

解析算法:

  • LL(自顶向下)
  • LR(自底向上)
  • 递归下降
  • 表驱动自顶向下解析

Syntax Errors

In general, the term syntax error recovery is applied to any technique that allows the compiler, in the face of a syntax error, to continue looking for other errors later in the program. High-quality syntax error recovery is essential in any production-quality compiler. The better the recovery technique, the more likely the compiler will be to recognize additional errors (especially nearby errors) correctly, and the less likely it will be to become confused and announce spurious cascading errors later in the program.

一般来说,“语法错误恢复”这个术语适用于任何允许编译器在面对语法错误时继续寻找程序中其他错误的技术。在任何生产级编译器中,高质量的语法错误恢复都是至关重要的。恢复技术越好,编译器识别额外错误(尤其是附近的错误)的准确性就越高,它混淆并在程序后期宣布虚假的级联错误的可能性就越小。

Theoretical Foundations

Our understanding of the relative roles and computational power of scanners, parsers, regular expressions, and context-free grammars is based on the formalisms of automata theory. In automata theory, a formal language is a set of strings of symbols drawn from a finite alphabet. A formal language can be specified either by a set of rules (such as regular expressions or a context-free grammar) that generates the language, or by a formal machine that accepts (recognizes) the language. A formal machine takes strings of symbols as input and outputs either “yes” or “no.” A machine is said to accept a language if it says “yes” to all and only those strings that are in the language. Alternatively, a language can be defined as the set of strings for which a particular machine says “yes.”

我们对扫描仪、解析器、正则表达式和上下文无关文法的相对作用和计算能力的理解是基于自动机理论的形式主义。在自动机理论中,形式语言是从有限字母表中提取的一组符号串。形式语言可以通过一组规则(如正则表达式或上下文无关文法)来指定,这些规则生成语言,也可以通过一个接受(识别)语言的形式机来指定。形式机将符号串作为输入,并输出“是”或“否”。如果机器对于所有属于语言的字符串都回答“是”,则称该机器接受该语言。或者,语言可以被定义为一组字符串,对于这些字符串,特定的机器回答“是”。

Summary and Concluding Remarks

  • 正则表达式
  • 上下文无关文法
  • 实际编译器中的扫描和解析算法
  • 语法错误恢复(painc)
  • 自顶向下解析
  • 自底向上解析

An Overview of Compilation

Compilers are among the most well-studied classes of computer programs. We will consider them repeatedly throughout the rest of the book, and in Chapters 2, 4, 14, and 16 in particular. The remainder of this section provides an introductory overview.

编译器是最受人研究的计算机程序类别之一。我们将在本书的其余部分多次考虑它们,特别是在第2、4、14和16章。本节的其余部分提供了一个简要概述。

In a typical compiler, compilation proceeds through a series of well-defined phases, shown in Figure 1.3. Each phase discovers information of use to later phases, or transforms the program into a form that is more useful to the subsequent phase.

在典型的编译器中,编译过程经过一系列明确定义的阶段,如图1.3所示。每个阶段都会发现对后续阶段有用的信息,或将程序转换为对后续阶段更有用的形式。

The first few phases (up through semantic analysis) serve to figure out the meaning of the source program. They are sometimes called the front end of the compiler. The last few phases serve to construct an equivalent target program. They are sometimes called the back end of the compiler. Many compiler phases can be created automatically from a formal description of the source and/or target languages.

前几个阶段(直到语义分析)用于弄清源程序的含义。它们有时被称为编译器的前端。最后几个阶段用于构建一个等价的目标程序。它们有时被称为编译器的后端。许多编译器阶段可以根据源语言和/或目标语言的形式化描述自动生成。

One will sometimes hear compilation described as a series of passes. A pass is a phase or set of phases that is serialized with respect to the rest of compilation: it does not start until previous phases have completed, and it finishes before any subsequent phases start. If desired, a pass may be written as a separate program, reading its input from a file and writing its output to a file. Compilers are commonly divided into passes so that the front end may be shared by compilers for more than one machine (target language), and so that the back end may be shared by compilers for more than one source language. In some implementations the front end and the back end may be separated by a“middle end”that is responsible for language- and machine-independent code improvement. Prior to the dramatic increases in memory sizes of the mid to late 1980s, compilers were also sometimes divided into passes to minimize memory usage: as each pass completed, the next could reuse its code space.

有时会听到将编译描述为一系列传递。传递是一个相对于编译的其余部分进行串行化的阶段或一组阶段:它在前一阶段完成之前不会开始,并且在任何后续阶段开始之前就会结束。如果需要,可以将传递编写为一个独立的程序,从文件读取输入并将输出写入文件。编译器通常被划分为传递,以便前端可以被多台机器(目标语言)的编译器共享,后端可以被多种源语言的编译器共享。在一些实现中,前端和后端可能被一个负责语言和机器无关代码改进的“中端”分开。在20世纪80年代中期到晚期内存大小大幅增加之前,有时也将编译器划分为传递以最小化内存使用:当每个传递完成时,下一个传递可以重用其代码空间。

Lexical and Syntax Analysis

Scanning is also known as lexical analysis. The principal purpose of the scanner is to simplify the task of the parser, by reducing the size of the input (there are many more characters than tokens) and by removing extraneous characters like white space. The scanner also typically removes comments and tags tokens with line and column numbers, to make it easier to generate good diagnostics in later phases. One could design a parser to take characters instead of tokens as inputdispensing with the scanner—but the result would be awkward and slow.

扫描也被称为词法分析。扫描器的主要目的是简化解析器的任务,通过减少输入的大小(字符比标记要多得多),并删除多余的字符,如空白。扫描器通常还会移除注释,并为标记添加行号和列号,以便在后续阶段更容易生成良好的诊断信息。理论上可以设计一个解析器,接受字符而不是标记作为输入,从而省去扫描器,但结果会很笨拙且效率低下。

Parsing organizes tokens into a parse tree that represents higher-level constructs (statements, expressions, subroutines, and so on) in terms of their constituents. Each construct is a node in the tree; its constituents are its children. The root ofthe tree is simply “program”; the leaves, from left to right, are the tokens received from the scanner. Taken as a whole, the tree shows how the tokens fit together to make a valid program. The structure relies on a set of potentially recursive rules known as a context-free grammar. Each rule has an arrow sign (−→) with the construct name on the left and a possible expansion on the right. In C, for example, a while loop consists of the keyword while followed by a parenthesized Boolean expression and a statement:

解析将标记组织成一个解析树,该树以其组成部分表示高级构造(语句、表达式、子程序等)。每个构造都是树中的一个节点;它的组成部分是它的子节点。树的根节点简单地是“程序”;从左到右的叶子节点是来自扫描器的标记。整体上看,树显示了标记如何组合在一起形成一个有效的程序。该结构依赖于一组潜在递归的规则,称为无上下文语法。每个规则都有一个箭头符号(→),左边是构造名称,右边是可能的扩展。例如,在C语言中,while循环由关键字while后跟括号中的布尔表达式和一个语句组成:

1
iteration-statement −→ while ( expression ) statement

The statement, in turn, is often a list enclosed in braces:

1
2
statement −→ compound-statement
compound-statement −→ { block-item-list opt }

where

1
block-item-list opt −→ block-item-list

or

1
block-item-list opt −→ ϵ

and

1
2
3
4
block-item-list −→ block-item
block-item-list −→ block-item-list block-item
block-item −→ declaration
block-item −→ statement

Here ϵ represents the empty string; it indicates that block-item-list opt can simply be deleted. Many more grammar rules are needed, of course, to explain the full structure of a program.

这里的 ϵ 代表空字符串;它表示 block-item-list opt 可以被简单地删除。当然,需要更多的语法规则来解释程序的完整结构。

A context-free grammar is said to define the syntax of the language; parsing is therefore known as syntax analysis. There are many possible grammars for C (an infinite number, in fact); the fragment shown above is taken from the sample grammar contained in the official language definition [Int99]. A full parse tree for our GCD program (based on a full grammar not shown here) appears in Figure 1.4. While the size of the tree may seem daunting, its details aren’t particularly important at this point in the text. What is important is that (1) each individual branching point represents the application of a single grammar rule, and (2) the resulting complexity is more a reflection of the grammar than it is of the input program. Much of it stems from (a) the use of such artificial “constructs” as block item-list and block item-list opt to generate lists of arbitrary length, and (b) the use of the equally artificial assignment-expression, additive expression, multiplicative-expression, and so on, to capture precedence and associativity in arithmetic expressions. We shall see in the following subsection that much of this complexity can be discarded once parsing is complete.

一个无上下文语法被认为定义了语言的语法;因此,解析也被称为语法分析。对于C语言有许多可能的语法(实际上是无限多个);上面显示的片段取自官方语言定义中包含的样例语法[Int99]。基于未在此处显示的完整语法的我们的GCD程序的完整解析树显示在图1.4中。虽然树的大小可能看起来令人生畏,但在本文的这一阶段,其细节并不特别重要。重要的是:(1)每个单独的分支点代表一个单一的语法规则的应用,以及(2)产生的复杂性更多地反映了语法而不是输入程序本身。其中许多复杂性源自于(a)使用诸如 block item-list 和 block item-list opt 之类的人为“构造”来生成任意长度的列表,以及(b)使用同样人为的 assignment-expression、additive-expression、multiplicative-expression 等来捕获算术表达式中的优先级和结合性。在接下来的小节中,我们将看到一旦解析完成,许多这些复杂性都可以被丢弃。

In the process of scanning and parsing, the compiler checks to see that all of the program’s tokens are well formed, and that the sequence of tokens conforms to the syntax defined by the context-free grammar. Any malformed tokens (e.g., 123abc or $@foo in C) should cause the scanner to produce an error message. Any syntactically invalid token sequence (e.g., A = X Y Z in C) should lead to an error message from the parser.

在扫描和解析的过程中,编译器检查程序的所有标记是否格式良好,并且标记序列是否符合上下文无关语法定义的语法。任何格式不良的标记(例如,在C语言中的123abc或$@foo)都应该导致扫描器产生错误消息。任何语法上无效的标记序列(例如,在C语言中的A = X Y Z)应该导致解析器产生错误消息。

Semantic Analysis and Intermediate Code Generation

Semantic analysis is the discovery of meaning in a program. The semantic analysis phase of compilation recognizes when multiple occurrences of the same identifier are meant to refer to the same program entity,and ensures that the uses are consistent. In most languages the semantic analyzer tracks the types of both identifiers and expressions, both to verify consistent usage and to guide the generation of code in later phases.

语义分析是对程序中含义的发现。编译的语义分析阶段识别了相同标识符的多次出现意味着引用同一程序实体,并确保这些用法是一致的。在大多数语言中,语义分析器跟踪标识符和表达式的类型,既用于验证一致的使用,也用于指导后续阶段代码的生成。

To assist in its work,the semantic analyzer typically builds and maintains a symbol table data structure that maps each identifier to the information known about it. Among other things, this information includes the identifier’s type, internal structure (if any), and scope (the portion of the program in which it is valid).

为了辅助其工作,语义分析器通常构建并维护一个符号表数据结构,将每个标识符映射到已知的信息。这些信息包括标识符的类型、内部结构(如果有的话)以及作用域(其有效的程序部分)。

Using the symbol table, the semantic analyzer enforces a large variety of rules that are not captured by the hierarchical structure of the context-free grammar and the parse tree. In C, for example, it checks to make sure that

利用符号表,语义分析器强制执行许多上下文无关语法和解析树所不能捕获的规则。例如,在C语言中,它会检查确保

  • Every identifier is declared before it is used.
  • No identifier is used in an inappropriate context (calling an integer as a subroutine, adding a string to an integer, referencing a field of the wrong type of struct, etc.).
  • Subroutine calls provide the correct number and types of arguments.
  • Labels on the arms of a switch statement are distinct constants.
  • Any function with a non-void return type returns a value explicitly.

In many compilers, the work of the semantic analyzer takes the form of semantic action routines,invoked by the parser when it realizes that it has reached a particular point within a grammar rule.

在许多编译器中,语义分析器的工作以语义动作例程的形式进行,当解析器意识到已经到达语法规则中的特定点时,会调用这些例程。

Of course, not all semantic rules can be checked at compile time. Those that can are referred to as the static semantics of the language. Those that must be checked at run time are referred to as the dynamic semantics of the language. C has very little in the way of dynamic checks (its designers opted for performance over safety). Examples of rules that other languages enforce at run time include the following.

当然,并非所有的语义规则都能在编译时进行检查。那些可以在编译时检查的规则被称为语言的静态语义。那些必须在运行时检查的规则被称为语言的动态语义。C语言在动态检查方面几乎没有什么(其设计者选择了性能而不是安全性)。其他语言在运行时执行的规则的例子包括以下内容。

  • Variables are never used in an expression unless they have been given a value.
  • Pointers are never dereferenced unless they refer to a valid object.
  • Array subscript expressions lie within the bounds of the array.
  • Arithmetic operations do not overflow.

When it cannot enforce rules statically, a compiler will often produce code to perform appropriate checks at run time, aborting the program or generating an exception if one of the checks then fails. (Exceptions will be discussed in Section 8.5.) Some rules, unfortunately, may be unacceptably expensive or impossible to enforce, and the language implementation may simply fail to check them. In Ada, a program that breaks such a rule is said to be erroneous; in C its behavior is said to be undefined.

当编译器无法静态地强制执行规则时,通常会生成代码,在运行时执行适当的检查,如果其中一个检查失败,则中止程序或生成异常(异常将在第8.5节中讨论)。遗憾的是,一些规则可能成本过高或无法实施,语言实现可能会简单地不对其进行检查。在Ada中,违反此类规则的程序被称为错误的;在C中,其行为被称为未定义。

A parse tree is sometimes known as a concrete syntax tree, because it demonstrates, completely and concretely, how a particular sequence of tokens can be derived under the rules of the context-free grammar. Once we know that a token sequence is valid, however, much of the information in the parse tree is irrelevant to further phases of compilation. In the process of checking static semantic rules, the semantic analyzer typically transforms the parse tree into an abstract syntax tree (otherwise known as an AST,or simply a syntax tree) by removing most of the “artificial” nodes in the tree’s interior. The semantic analyzer also annotates the remaining nodes with useful information,such as pointers from identifiers to their symbol table entries. The annotations attached to a particular node are known as its attributes. A syntax tree for our GCD program is shown in Figure 1.5.

解析树有时被称为具体语法树,因为它完全而具体地展示了在上下文无关文法规则下特定标记序列的推导过程。然而,一旦我们知道一个标记序列是有效的,解析树中的大部分信息对于编译的后续阶段来说就变得无关紧要了。在检查静态语义规则的过程中,语义分析器通常会通过移除树中大部分“人为”节点,将解析树转换成抽象语法树(又称AST,或者简称语法树)。语义分析器还会为剩余的节点添加有用的信息,比如从标识符到它们的符号表条目的指针。附加到特定节点的注释被称为它的属性。我们的最大公约数程序的语法树如图1.5所示。

In many compilers,the annotated syntax tree constitutes the intermediate form that is passed from the front end to the back end. In other compilers, semantic analysis ends with a traversal of the tree that generates some other intermediate form. One common such form consists of a control flow graph whose nodes resemble fragments of assembly language for a simple idealized machine. We will consider this option further in Chapter 14, where a control flow graph for our GCD program appears in Figure 14.3. In a suite of related compilers, the front ends for several languages and the back ends for several machines would share a common intermediate form.

在许多编译器中,带注释的语法树构成了从前端传递到后端的中间形式。在其他编译器中,语义分析以遍历生成某些其他中间形式的树而结束。一种常见的这种形式包括一个控制流图,其节点类似于简化理想机器的汇编语言片段。我们将在第14章进一步探讨这个选项,在那里我们的GCD程序的控制流图出现在图14.3中。在一套相关的编译器中,多种语言的前端和多种机器的后端将共享一个通用的中间形式。

Target Code Generation

The code generation phase of a compiler translates the intermediate form into the target language. Given the information contained in the syntax tree, generating correct code is usually not a difficult task (generating good code is harder, as we shall see in Section 1.6.4). To generate assembly or machine language, the code generator traverses the symbol table to assign locations to variables, and then traverses the intermediate representation of the program, generating loads and stores for variable references,interspersed with appropriate arithmetic operations, tests, and branches. Naive code for our GCD example appears in Figure 1.6, in x86 assembly language. It was generated automatically by a simple pedagogical compiler.

编译器的代码生成阶段将中间形式转换为目标语言。根据语法树中包含的信息,生成正确的代码通常并不是一项困难的任务(生成优秀的代码更难,正如我们将在1.6.4节中看到的那样)。为了生成汇编或机器语言,代码生成器遍历符号表以为变量分配位置,然后遍历程序的中间表示,生成变量引用的加载和存储,并穿插适当的算术操作、测试和跳转。我们的GCD示例的朴素代码出现在图1.6中,使用x86汇编语言编写。这是由一个简单的教学编译器自动生成的。

The assembly language mnemonics may appear a bit cryptic,but the comments on each line (not generated by the compiler!) should make the correspondence between Figures 1.5 and 1.6 generally apparent. A few hints: esp, ebp, eax, ebx, and edi are registers (special storage locations, limited in number, that can be accessed very quickly). -8(%ebp) refers to the memory location 8 bytes before the location whose address is in register ebp; in this program, ebp serves as a base from which we can find variables i and j. The argument to a subroutine call instruction is passed by pushing it onto a stack, for which esp is the top-of-stack pointer. The return value comes back in register eax. Arithmetic operations overwrite their second argument with the result of the operation.

这些汇编语言助记符可能看起来有点晦涩,但每行的注释(并非由编译器生成!)应该能够使图1.5和图1.6之间的对应关系基本上显而易见。一些提示:esp、ebp、eax、ebx和edi是寄存器(特殊的存储位置,数量有限,可以被非常快速地访问)。-8(%ebp)指的是在寄存器ebp中存储地址的位置的前8个字节的内存位置;在这个程序中,ebp用作基址,我们可以从中找到变量i和j。对于子程序调用指令,参数是通过将其推送到堆栈中传递的,esp是堆栈顶部指针。返回值存放在寄存器eax中。算术操作会用操作的结果覆盖其第二个参数。

Often a code generator will save the symbol table for later use by a symbolic debugger, by including it in a nonexecutable part of the target code.

通常,代码生成器会将符号表保存下来,以便将来被符号调试器使用,方法是将其包含在目标代码的不可执行部分中。

Code Improvement

Code improvement is often referred to as optimization, though it seldom makes anything optimal in any absolute sense. It is an optional phase of compilation whose goal is to transform a program into a new version that computes the same result more efficiently—more quickly or using less memory, or both.

代码改进通常被称为优化,尽管在任何绝对意义上它很少使任何东西都变得最佳。它是编译的一个可选阶段,其目标是将程序转换为一个计算相同结果的更高效版本——更快或者使用更少的内存,或者两者兼而有之。

Some improvements are machine independent. These can be performed as transformations onthe intermediate form. Other improvements require an understanding of the target machine (or of whatever will execute the program in the target language). These must be performed as transformations on the target program. Thus code improvement often appears as two additional phases of compilation, one immediately after semantic analysis and intermediate code generation, the other immediately after target code generation.

有些改进是与机器无关的。这些可以作为对中间形式的转换来执行。其他改进需要对目标机器(或者在目标语言中执行程序的任何内容)的理解。这些必须作为对目标程序的转换来执行。因此,代码改进通常出现为编译的两个额外阶段,一个紧跟语义分析和中间代码生成之后,另一个紧跟目标代码生成之后。

Applying a good code improver to the code in Figure 1.6 produces the code shown in Example 1.2 (page 5). Comparing the two programs, we can see that the improved version is quite a lot shorter. Conspicuously absent are most of the loads and stores. The machine-independent code improver is able to verify that i and j can be kept in registers throughout the execution of the main loop. (This would not have been the case if,for example,the loop contained a call to a subroutine that might reuse those registers, or that might try to modify i or j.) The machine specific code improver is then able to assign i and j to actual registers of the target machine. For modern microprocessor architectures, particularly those with so-called superscalar implementations (ones in which separate functional units can execute instructions simultaneously), compilers can usually generate better code than can human assembly language programmers.

将图1.6中的代码应用于良好的代码改进器会产生示例1.2(第5页)中所示的代码。比较这两个程序,我们可以看到改进后的版本要短得多。明显缺失的是大部分的加载和存储操作。机器无关的代码改进器能够验证在主循环的执行过程中,i和j可以一直保持在寄存器中。(如果,例如,循环中包含对子程序的调用,该子程序可能会重用这些寄存器,或者可能试图修改i或j,那么情况就不同了。)然后,特定于机器的代码改进器能够将i和j分配给目标机器的实际寄存器。对于现代微处理器架构,特别是那些具有所谓的超标量实现(可以同时执行指令的独立功能单元),编译器通常能够生成比人类汇编语言程序员更好的代码。

Summary and Concluding Remarks

In this chapter we introduced the study of programming language design and implementation. We considered why there are so many languages, what makes them successful or unsuccessful, how they may be categorized for study, and what benefits the reader is likely to gain from that study. We noted that language design and language implementation are intimately related to one another. Obviously an implementation must conform to the rules of the language. At the same time, a language designer must consider how easy or difficult it will be to implement various features, and what sort of performance is likely to result for programs that use those features.

在本章中,我们介绍了编程语言设计和实现的研究。我们考虑了为什么会有这么多种语言,是什么让它们成功或失败,它们可以如何分类进行研究,以及读者可能从这些研究中获得什么好处。我们指出语言设计和语言实现之间密切相关。显然,实现必须符合语言的规则。同时,语言设计者必须考虑实现各种特性的难易程度,以及使用这些特性的程序可能会产生怎样的性能。

Language implementations are commonly differentiated into those based on interpretation and those based on compilation. We noted, however, that the difference between these approaches is fuzzy,and that most implementations include a bit of each. As a general rule, we say that a language is compiled if execution is preceded by a translation step that (1) fully analyzes both the structure (syntax) and meaning (semantics) of the program, and (2) produces an equivalent program in a significantly different form. The bulk of the implementation material in this book pertains to compilation.

编程语言的实现通常分为基于解释和基于编译的两种类型。然而,我们指出这两种方法之间的区别是模糊的,并且大多数实现都包含了两者的一些特点。一般而言,我们说一种语言是编译型的,如果在执行之前需要经过一个翻译步骤,这个步骤会(1)完全分析程序的结构(语法)和含义(语义),以及(2)生成一个在形式上明显不同的等效程序。本书中大部分的实现内容涉及编译。

Compilers are generally structured as a series of phases. The first few phases scanning, parsing, and semantic analysis—serve to analyze the source program. Collectively these phases are known as the compiler’s front end. The final few phases—intermediate code generation, code improvement, and target code generation—are known as the back end. They serve to build a target program preferably a fast one—whose semantics match those of the source.

编译器通常被构建为一系列阶段。最初的几个阶段——扫描、解析和语义分析——用于分析源程序。这些阶段统称为编译器的前端。最后的几个阶段——中间代码生成、代码优化和目标代码生成——被称为后端。它们用于构建目标程序——最好是一个语义与源程序相匹配且快速的程序。

Chapters 3,6,7,8,and 9 form the core of the rest of this book. They cover fundamental issues of language design, both from the point of view of the programmer and from the point of view of the language implementor. To support the discussion of implementations, Chapters 2 and 4 describe compiler front ends in more detail than has been possible in this introduction. Chapter 5 provides an overview of assembly-level architecture. Chapters 14 through 16 discuss compiler back ends, including assemblers and linkers, run-time systems, and code improvement techniques. Additional language paradigms are covered in Chapters 10 through 13. Appendix A lists the principal programming languages mentioned in the text, together with a genealogical chart and bibliographic references. Appendix B contains a list of “Design & Implementation” sidebars; Appendix C contains a list of numbered examples.

第3、6、7、8和9章构成了本书的核心部分。它们涵盖了语言设计的基本问题,既从程序员的角度,也从语言实现者的角度进行了讨论。为了支持对实现的讨论,第2章和第4章比本介绍部分更详细地描述了编译器的前端。第5章概述了汇编级架构。第14至16章讨论了编译器的后端,包括汇编器和链接器、运行时系统以及代码优化技术。第10至13章涵盖了额外的语言范式。附录A列出了文本中提到的主要编程语言,以及它们的谱系图和参考文献。附录B包含了“设计与实现”方块的列表;附录C包含了编号示例的列表。

The Art of Language Design

Today there are thousands of high-level programming languages, and new ones continue to emerge. Human beings use assembly language only for specialpurpose applications. In a typical undergraduate class, it is not uncommon to find users of scores of different languages. Why are there so many? There are several possible answers:

今天有成千上万种高级编程语言,而且新的语言不断涌现。人类只在特定应用中使用汇编语言。在典型的本科课程中,发现使用数十种不同语言的情况并不罕见。为什么会有这么多种语言?这有几个可能的答案:

Evolution.

Computer science is a young discipline; we’re constantly finding better ways to do things. The late 1960s and early 1970s saw a revolution in “structured programming,” in which the goto-based control flow of languages like Fortran,Cobol,and Basic3 gave way to while loops, case (switch) statements, and similar higher level constructs. In the late 1980s the nested block structure of languages like Algol,Pascal,and Ada began to give way to the object-oriented structure of Smalltalk, C++, Eiffel, and the like.

计算机科学是一个年轻的学科;我们不断发现更好的做事方式。20世纪60年代末和70年代初,出现了“结构化编程”的革命,其中像Fortran、Cobol和Basic3这样基于goto的控制流让位给了while循环、case(switch)语句和类似的更高级构造。到了20世纪80年代末,像Algol、Pascal和Ada这样的语言的嵌套块结构开始让位于Smalltalk、C++、Eiffel等面向对象的结构。

Special Purposes.

Many languages were designed for a specific problem domain. The various Lisp dialects are good for manipulating symbolic data and complex data structures. Icon and Awk are good for manipulating character strings. C is good for low-level systems programming. Prolog is good for reasoning about logical relationships among data. Each of these languages can be used successfully for a wider range of tasks, but the emphasis is clearly on the specialty.

许多编程语言是为特定的问题领域而设计的。各种Lisp方言适用于操纵符号数据和复杂数据结构。Icon和Awk适用于操纵字符串。C适用于低级系统编程。Prolog适用于推理数据之间的逻辑关系。这些语言都可以成功地用于更广泛的任务,但重点显然在于专业领域。

Personal Preference.

Different people like different things. Much of the parochialism of programming is simply a matter of taste. Some people love the terseness of C; some hate it. Some people find it natural to think recursively; others prefer iteration. Some people like to work with pointers; others prefer the implicit dereferencing of Lisp, Clu, Java, and ML. The strength and variety of personal preference make it unlikely that anyone will ever develop a universally acceptable programming language.

不同的人喜欢不同的东西。编程中的很多偏见只是品味上的问题。有些人喜欢C的简洁性;有些人则讨厌。有些人觉得递归思考很自然;而其他人则更喜欢迭代。有些人喜欢使用指针;其他人则更喜欢Lisp、Clu、Java和ML中隐式解引用的方式。个人偏好的强大和多样性使得几乎不可能有人能够开发出一个被普遍接受的编程语言。

Of course, some languages are more successful than others. Of the many that have been designed, only a few dozen are widely used. What makes a language successful? Again there are several answers:

当然,一些语言比其他语言更成功。在众多被设计出来的语言中,只有几十种被广泛使用。是什么让一种语言成功?这个问题有几个答案:

Expressive Power.

One commonly hears arguments that one language is more “powerful” than another, though in a formal mathematical sense they are all Turing complete—each can be used, if awkwardly, to implement arbitrary algorithms. Still, language features clearly have a huge impact on the programmer’s ability to write clear, concise, and maintainable code, especially for very large systems. There is no comparison, for example, between early versions of Basic on the one hand, and Common Lisp or Ada on the other. The factors that contribute to expressive power—abstraction facilities in particular—are a major focus of this book.

人们经常听到有关一种语言比另一种语言更“强大”的论点,尽管从正式的数学意义上讲,它们都是图灵完备的——每种语言都可以被使用(虽然有时会很笨拙)来实现任意算法。然而,语言特性显然对程序员编写清晰、简洁和易维护的代码能力产生巨大影响,特别是对于非常庞大的系统。例如,早期的Basic与Common Lisp或Ada之间是无法比较的。构成表达能力的因素——尤其是抽象设施——是本书的主要关注点。

Ease of Use for the Novice.

While it is easy to pick on Basic, one cannot deny its success. Part of that success is due to its very low “learning curve.” Logo is popular among elementary-level educators for a similar reason: even a 5-year-old can learn it. Pascal was taught for many years in introductory programming language courses because, at least in comparison to other “serious” languages, it is compact and easy to learn. In recent years Java has come to play a similar role. Though substantially more complex than Pascal, it is much simpler than, say, C++.

虽然批评Basic很容易,但人们无法否认它的成功。它的成功部分归功于其非常低的“学习曲线”。类似的原因,Logo在小学教育者中很受欢迎:甚至5岁的孩子都可以学会。Pascal在很多年里被用来教授入门级编程语言课程,因为至少与其他“严肃”的语言相比,它更为简洁且易学。近年来,Java也开始扮演类似的角色。虽然比Pascal复杂得多,但它比如C++等语言简单得多。

Ease of Implementation.

In addition to its low learning curve, Basic is successful because it could be implemented easily on tiny machines, with limited resources. Forth has a small but dedicated following for similar reasons. Arguably the single most important factor in the success of Pascal was that its designer, Niklaus Wirth, developed a simple, portable implementation of the language, and shipped it free to universities all over the world (see Example 1.15).4 The Java designers took similar steps to make their language available for free to almost anyone who wants it.

除了其低学习曲线之外,Basic之所以成功,还因为它可以轻松在资源有限的小型设备上实现。Forth也因类似的原因拥有一小部分忠实的追随者。可以说,Pascal成功的最重要因素之一是其设计者尼古拉斯·沃斯(Niklaus Wirth)开发了一种简单、可移植的语言实现,并免费提供给全球各地的大学(见示例1.15)。Java的设计者们也采取了类似的措施,让他们的语言几乎可以免费提供给任何想要使用它的人。

Standardization.

Almost every widely used language has an official international standard or (in the case of several scripting languages) a single canonical implementation; and in the latter case the canonical implementation is almost invariably written in a language that has a standard. Standardization—of both the language and a broad set of libraries—is the only truly effective way to ensure the portability of code across platforms. The relatively impoverished standard for Pascal, which is missing several features considered essential by many programmers (separate compilation, strings, static initialization, random-access I/O), is at least partially responsible for the language’s drop from favor in the 1980s. Many of these features were implemented in different ways by different vendors.

几乎每种广泛使用的编程语言都有一个官方的国际标准,或者(在一些脚本语言的情况下)有一个单一的规范实现;而在后一种情况下,规范实现几乎总是用一种有标准的语言编写。标准化——无论是语言本身还是一套广泛的库——是确保代码在各种平台上可移植的唯一真正有效的方法。Pascal的标准相对较为简陋,缺少许多被许多程序员认为是必不可少的功能(如独立编译、字符串、静态初始化、随机访问I/O),这在一定程度上导致了该语言在20世纪80年代失宠。许多这些功能是由不同的供应商以不同的方式实现的。

Open Source.

Most programming languages today have at least one open-source compiler or interpreter, but some languages—C in particular—are much more closely associated than others with freely distributed, peer-reviewed, community-supported computing. C was originally developed in the early 1970s by Dennis Ritchie and Ken Thompson at Bell Labs,5 in conjunction with the design of the original Unix operating system. Over the years Unix evolved into the world’s most portable operating system—the OS of choice for academic computer science—and C was closely associated with it. With the standardization of C, the language has become available on an enormous variety of additional platforms. Linux, the leading open-source operating system, is written in C. As of October 2008, C and its descendants account for 66% of the projects hosted at the sourceforge.net repository.

如今,大多数编程语言至少都有一个开源编译器或解释器,但有些语言——尤其是C语言——与自由分发、同行评审、社区支持的计算密切相关。C语言最初是在20世纪70年代初由贝尔实验室的丹尼斯·里奇(Dennis Ritchie)和肯·汤普逊(Ken Thompson)开发的,同时也是与最初的Unix操作系统的设计相关联的。多年来,Unix发展成为世界上最具可移植性的操作系统——成为学术计算机科学的首选操作系统,并与C语言密切相关。随着C语言的标准化,该语言已经可以在大量额外的平台上使用。领先的开源操作系统Linux就是用C语言编写的。截至2008年10月,在sourceforge.net代码库托管的项目中,C语言及其衍生版本占了66%。

Excellent Compilers.

Fortran owes much of its success to extremely good compilers. In part this is a matter of historical accident. Fortran has been around longer than anything else, and companies have invested huge amounts of time and money in making compilers that generate very fast code. It is also a matter of language design, however: Fortran dialects prior to Fortran 90 lack recursion and pointers, features that greatly complicate the task of generating fast code (at least for programs that can be written in a reasonable fashion without them!). In a similar vein, some languages (e.g., Common Lisp) are successful in part because they have compilers and supporting tools that do an unusually good job of helping the programmer manage very large projects.

Fortran的成功很大程度上归功于非常优秀的编译器。部分原因是历史的偶然性。Fortran的历史比其他任何语言都要悠久,许多公司投入了大量时间和资金来开发能够生成非常高效代码的编译器。然而,这也与语言设计有关:Fortran 90之前的方言缺乏递归和指针,这些特性会极大地复杂化生成高效代码的任务(至少对于可以在不使用它们的情况下以合理方式编写的程序来说是如此!)。类似地,一些语言(例如Common Lisp)之所以成功,部分原因在于它们具有编译器和支持工具,这些工具在帮助程序员管理非常大型项目方面做得异常出色。

Economics, Patronage, and Inertia.

Finally, there are factors other than technical merit that greatly influence success. The backing of a powerful sponsor is one. PL/I, at least to first approximation, owes its life to IBM. Cobol and, more recently, Ada owe their life to the U.S. Department of Defense: Ada contains a wealth of excellent features and ideas, but the sheer complexity of implementation would likely have killed it if not for the DoD backing. Similarly, C#, despite its technical merits, would probably not have received the attention it has without the backing of Microsoft. At the other end of the life cycle, some languages remain widely used long after “better” alternatives are available because of a huge base of installed software and programmer expertise, which would cost too much to replace.

最后,除了技术优势之外,还有其他因素极大地影响着成功。强大赞助商的支持是其中之一。例如,PL/I至少在最初阶段,归功于IBM。Cobol和最近的Ada都归功于美国国防部:Ada包含了大量出色的特性和想法,但如果没有美国国防部的支持,其实现的复杂性很可能会使其难以生存。同样,尽管C#在技术上有其优点,但如果没有微软的支持,它可能不会得到如今的关注。另一方面,一些语言在更好的替代方案出现后仍然被广泛使用,这是因为已安装的软件和程序员的专业知识庞大,替换它们的成本太高。

Clearly no single factor determines whether a language is “good.” As we study programming languages, we shall need to consider issues from several points of view. In particular, we shall need to consider the viewpoints of both the programmer and the language implementor. Sometimes these points of view will be in harmony, as in the desire for execution speed. Often, however, there will be conflicts and tradeoffs, as the conceptual appeal of a feature is balanced against the cost of its implementation. The tradeoff becomes particularly thorny when the implementation imposes costs not only on programs that use the feature, but also on programs that do not.

显然,没有单一因素能决定一种语言是否“优秀”。在研究编程语言时,我们需要从多个角度考虑问题。特别是,我们需要考虑程序员和语言实现者的观点。有时这些观点会保持一致,比如对执行速度的渴望。然而,通常会存在冲突和权衡,比如某项特性的概念吸引力与其实现成本之间的平衡。当实现不仅给使用该特性的程序带来成本,而且也给不使用该特性的程序带来成本时,这种权衡变得特别棘手。

In the early days of computing the implementor’s viewpoint was predominant. Programming languages evolved as a means of telling a computer what to do. For programmers, however, a language is more aptly defined as a means of expressing algorithms. Just as natural languages constrain exposition and discourse, so programming languages constrain what can and cannot easily be expressed, and have both profound and subtle influence over what the programmer can think. Donald Knuth has suggested that programming be regarded as the art of telling another human being what one wants the computer to do [Knu84].6 This definition perhaps strikes the best sort of compromise. It acknowledges that both conceptual clarity and implementation efficiency are fundamental concerns. This book attempts to capture this spirit of compromise,by simultaneously considering the conceptual and implementation aspects of each of the topics it covers.

在计算机的早期发展阶段,实现者的观点占据主导地位。编程语言演变为告诉计算机该做什么的手段。然而,对于程序员来说,语言更恰当地被定义为表达算法的手段。就像自然语言限制了表述和交流一样,编程语言也限制了什么可以和不能轻松地表达,并对程序员的思维产生深远而微妙的影响。Donald Knuth曾建议将编程视为告诉另一个人想让计算机做什么的艺术。这个定义或许达到了最好的折衷。它承认了概念上的清晰和实现效率都是基本关注点。本书试图捕捉这种折衷的精神,同时考虑所涵盖主题的概念和实现方面。

0%