NCTU CIS: Compilers (Fall 2008)

¡@

Course Information

Instructor

Wuu Yang, <wuuyang@cis.nctu.edu.tw>, EC332B, 03-5131247, 03-5712121 ext. 56614.

Teaching Assistants

Phone: 56649.

Lectures

Wednesdays 6:30-9:30.

Office Hours

4EF, EC332B.

Course URL

http://www.cis.nctu.edu.tw/~wuuyang/Lecture/lecture.compiler.html

Course Discussion Board

http://plaslab24.cis.nctu.edu.tw/phpBB2/viewforum.php?f=3

Prerequisites

programming in a conventional language (such as C, C++, or Pascal) and data structure

Course Description

This course is an introduction to development tools for embedded systems.  Our focus will be on the compiler construction, from the most basic to some introduction to optimization, object code format, and some topics related to embedded systems.

In the introduction to the design and implementation of compilers for an imperative language, we will study various components in a compiler, including scanners, parsers, symbol tables, semantic analyzers, and code generators, and the overall organization of a compiler. In the scanners, we will study regular expressions and the transformation of a regular expression into a deterministic finite state machine. In the parsers, we will study LL and LR parsing methods and the construction of various parsing tables. There are many tools that can generate scanners and parsers from formal specifications. We will also study these tools and use them to generate the scanner and parser for our project.

We will study and implement the code generation components of a compiler. The symbol tables are a data structure shared by most components in the compiler. We will study how to process variables, types, expressions, records, arrays, pointers, and simple control structures. Procedures and packages will also be studied. Finally, we will study techniques for code generation and local optimization. If time permits, we will study global optimization techniques in compilers.

This course will include an integrated programming project. We will give you a detailed specification of a simplified language that is very similar to Ada. It will be our target language. We will also supply the documents of a scanner generator and of a parser generator. A target language and its simulator will also be provided. You will implement a scanner, a parser, and other components used to generate code for a hypothetical machine. To help you make progress, we will set three check points during the semester and a final demonstration at the end of the semester. You will need to finish a scanner, a parser, and a primitive code generator at the checkpoints, respectively. At the end of the semester, you will need to show (off) a complete compiler that generate code for the hypothetical machine. We will provide a simulator for the machine to run your object code. In the project, you will need to implement a minimum set of features, which will be announced later in the class. If you implement extra features, you will earn extra points.

Textbook

C.N. Fischer and R.L. LeBlanc, Jr., Crafting a Compiler with C, Benjamin/Cummings, 1991. (table of contents)

(reference only) A.V. Aho, R. Sethi, and J.D. Ullman, Compilers, Principles, Techniques, and Tools, Addison-Wesley, 1985. (table of contents)

(reference only) Andrew W. Appel, Modern Compiler Implementation In Java, Cambridge Univ. Press, 1997.

(reference only) K. Dowd, High-performance Computing (chapters 2 and 4), O'Relly, 1992 ???. RISC vs CISC, what an optimizing compiler does

(reference only) David H. D. Warren, Logic Programming and Compiler Writing, Software Practice and Experience, Vol. 10, pp. 97-125, 1980.

(reference only) J. R. Levine, Linkers and Loaders, Morgan Kaufmann Publishers, San Francisco, 2000.

¡@


Grading

There will be two exams (25% each) and a course project (50%).

Semester Project

The compiler design course is closely associated with a project. Three checkpoints are set up to help you make progress. At the final demonstration, you are required to show off a complete compiler. We will provide you with a grading sheet that details the grading policy. We urge to you to make continual progress in your project since this is one of the best ways to learn the course thoroughly.

Class material is available on www. See below. There is a README file explaining how to use the class material.

First Checkpoint: The scanner. Due date: October 9, 2008, Thursday, beginning of class (4th week)

At the first checkpoint, you will need to finish the scanner, using the scangen tool. You need to demonstrate your scanner with all the reserved words and symbols plus 10 correct names and comments. In addition, you need to use 10 incorrect symbols to demonstrate that your scanner can handle the incorrect names properly (e.g., issue some error messages). The output of your scanner should list each token in the input, together with the line number and the position of the first character of the token. In case the token is an identifier or a number, the string representing the token must also be printed. You do NOT need to print a copy of the output nor the program in order to save some paper. We will use on-line demonstration in the first checkpoint.

If the input is "354+126*(920-27/190)", the output from the scanner should be the sequence of tokens: "34", "+", "126", "*", "(", "920", "-", "27", "/", "190" and ")".

There are several test cases with which you may try your scanner. These are the files .../tests/stests[1-7].Adacs and .../scanner.test.adacs. You may also implement a pragma (pragma ListOn and pragma ListOff) to turn on or off the listing of the source program.

You may use scangen to implement the scanner. There is a file for token definitions. If you choose to use lex or flex, you will have to modify the token definitions. A flex manual is also included in this web page.

The following issues should be addressed:

  1. How does the scanner handle the string 10..20 ?
  2. How does the scanner handle the string "abc""def" ?
  3. How does the scanner handle run-away strings (or cross-line strings) and run-away comments? Note that in Ada/CS, there is no run-away comments since every comment is terminated by end-of-line.
  4. How does the scanner handle the string 10.3E+5, 10.3E+BC, 10.3E+2.34, and 10.3EBC ? An acceptable solution to 10.3EBC could be two tokens: a real number 10.3 and an identifier EBC. But you may make your own decision.
  5. How does the scanner handle a very long identifier? How long can an identifier be? What if the maximum is exceeded?
  6. How does the scanner handle a very long string constant? How long can a string be? What if the maximum is exceeded?
  7. How does the scanner handle reserved words? How do you use a table for this purpose?
  8. How does the scanner handle illegal characters, such as !, [ and ], which do not appear in Ada/CS?
  9. How does the scanner handle the string '''' ?
  10. How does the scanner handle the ' sign as in a character constant 'a' and in an attribute such as d'First?
  11. How does the scanner handle end-of-file?
  12. How does the scanner recover from errors? What characters are deleted when errors occur?
  13. Can the scanner handle nested comments? This is not related to Ada/CS due the format of a comment. But in general, nested comments could be an issue in a scanner.
Second Checkpoint: The parser. Due date:  November 6, 2008, Thursday, beginning of class (8th week)

NOTE: It seems inhumane to ask the students to build parse trees for such a large language as Ada/CS. Therefore, no parse trees need be built in this checkpoint.

At the second checkpoint, you will need to finish the parser. You need to demonstrate your scanner with the test programs we will provide. As in the first checkpoint, you do NOT need to print a copy of the output nor the program. We will use on-line demonstration.

You may use llgen to implement the parser. There is a file for syntax definitions. If you choose to use yacc or bison, you will have to modify the syntax definitions. A bison manual is also included in this web page.

Note in the input syntax specification (i.e. the lladacs.bnf file), every terminal specification is followed by two numbers, such as

* 2 4

Here * is a token and 2 and 4 are used for error correction. The explanation is in the fmq manual (i.e. the fmq.doc.Z file). Read it if you are interested. Otherwise, simply ignore the numbers.

Note that the tables for the scanner and the parser are generated by independent tools (scangen and llgen, respectively). The numbers for the tokens must be consistent. In the scanner, token numbers are the major numbers in the adacs.scan file. In the parser, token numbers are assigned by llgen based on the order tokens appear in the *terminals section in the lladacs.bnf file. You have to make sure that tokens are represented by the same numbers in the tables generated by the two tools.

For this checkpoint, your parser should answer accept for correct test file and reject for incorrect test files. For incorrect input, the positions of the errors should be indicated. The parser should try to parse to the end of the input and catch as many errors as possible.

Ignore this paragraph The product of the parser is the parse tree of the input program. The parse tree is constructed with pointers and dynamically allocated nodes. The parse tree should be dumped to the output file in an understandable format. Finally, you need to add a compiler directive "pragma BuildParseTree", which instructs the compiler to build the parse tree, and another directive "pragma DumpParseTree", which instructs the compiler to dump the parse tree built so far to a designated output file. In the absence of the pragmas, the compiler should not build or dump the parse tree.

You need to address the following issues:

  1. Does the parser issue good error messages, such as line numbers and character positions of errors and explanations of errors?
  2. How does the parser recover from errors?
  3. Are there any optimizations to the parse table, size reduction or speed improvement?
Third Checkpoint: The symbol table. Due date: December 4, 2008, Thursday, beginning of class (12th week)

At the third checkpoint, you will need to finish the symbol table and related routines. You need to show that your symbol table can correctly handle block entry and exit, procedure names, and parameters. As usual, do not print a copy of the program or the output. On-line demonstration is enough.

You will need about 5 days to implement a primitive working symbol table.

You may use C++ STL 'map' .  The class is essentially a red-black tree.

You need to address the following issues:

  1. Explain how the symbol table is organized, a hashed table, binary search list, or an unordered list, etc. Do you use one big table or one small table per scope?
  2. What does the compiler do when it encounters an undeclared variable, an undeclared type name, an undeclared function or procedure, an undeclared enumeration literal, an undeclared constant, etc.? The compiler should try to avoid spurious error messages.
  3. Do you use action-controlled semantic stack or parser-controlled semantic stack?
  4. You need to add action routines to print a message whenever a new scope is created or closed.
  5. You need to add action routines to print a message whenever a global simple variable (that is, variables of type integer or real) is declared.
  6. You need to add action routines to print a message whenever a local simple variable (that is, variables of type integer or real) is declared in a function.
  7. You need to add action routines to print a message whenever a loop index variable is implicitly declared.
  8.  Do you check duplicate declaration of the same name in the same scope?
  9. Do you check type compatibility?
  10. Do you use overload resolution during code generation?
  11. Do you build type descriptors for records and arrays?
Final Checkpoint: A complete compiler. Due date: January 7, 2008, Thursday, beginning of class (17th week)

At the last checkpoint, you will have the chance to show off your complete compiler. We will schedule an interview with each students and test the features that you have implemented. You need to bring your grade sheet to the interview. Please do not print your program or output. We will give you a grade for the project during the interview. In the project, you will need to implement a minimum set of features, which will be announced later in the class. If you implement extra features, you will earn extra points.

Here is a list of semantic checks:

  1. All identifiers are declared.
  2. No identifiers are multiply-defined.
  3. Types in expressions are correct with respect to the type rules.
  4. Imported names are handled properly. Name conflicts are resolved correctly.
  5. Inheritance relation among classes are properly formed in OO languages.
  6. Methods in a class are not multiply defined.
  7. All declared methods are implemented.
  8. Abstract methods are not implemented in abstract classes.
  9. Public/protected/private rules are observed.

The target machine may be the pseudo machine defined in the appendix of the textbook. There is a simulator called testit in the distribution directory. You may also generate code for the Java virtual machine, a manual of which is included in the next section, and use any web browser to run it.


The grading standard for the compiler project:

See the project grading sheet on the web.

Class material


Local On-Line Information

Information on the Net


Wuu Yang, <wuuyang@cis.nctu.edu.tw>, EC332B, 03-5712121 ext. 56614.