EDA045F Program Analysis: Homework 2

Please try to track the time that you need for the homework assignment and report it, so that we can calibrate the effort for future classes.

Dataflow Analysis 2

This is a group project and intended to be worked on in a team of two. If you have no partner, you can also submit alone.

Background

In this excerise, you will again work with the Soot, though this time we will take advantage of Soot's built-in analysis frameworks.

Resources

Soot 3.1.0: [jar]
Soot 3.1.0 packaged documentation: [jar]
JUnit 4: [jar] (used by some of the benchmarks)
Stub program 2: [java] (v2: alters the cg pack settings)
Recommended starting point for implementing your own analysis.

Test suite:

Binary	Source	Description	Dependencies
javacc-5.0.jar	javacc-5.0-src.tar.gz	The JavaCC compiler generator (small)	junit4
bcel-6.2.jar	bcel-6.2-src.tar.gz	The BCEL Bytecode Engineering library
xalan-2.7.2.jar	xalan-2.7.2-src.tar.gz	The Xalan XSLT processor	bcel, junit4

Use these files to test your analysis. Note that there is no `ground truth' available; you will have to check the source code to convince yourself that your analysis produces sensible output.

List of main entry points:
- javacc: jjdoc:jjtree:javacc:org.javacc.jjtree.Main:org.javacc.utils.JavaFileGenerator:org.javacc.jjdoc.JJDocMain:org.javacc.parser.Main
- bcel: org.apache.bcel.util.Class2HTML:org.apache.bcel.util.JavaWrapper:org.apache.bcel.util.BCELifier:org.apache.bcel.verifier.TransitiveHull:org.apache.bcel.verifier.Verifier:org.apache.bcel.verifier.NativeVerifier:org.apache.bcel.verifier.GraphicalVerifier:org.apache.bcel.verifier.VerifyDialog:org.apache.bcel.verifier.exc.AssertionViolatedException
- xalan: java_cup.Main:org.apache.xml.dtm.ref.IncrementalSAXSource_Xerces:org.apache.xml.dtm.ref.DTMSafeStringPool:org.apache.xml.dtm.ref.DTMStringPool:org.apache.xml.serializer.Version:org.apache.xml.resolver.apps.resolver:org.apache.xml.resolver.apps.xparse:org.apache.xml.resolver.apps.xread:org.apache.xml.resolver.Version:org.apache.xml.resolver.tests.BasicResolverTests:org.apache.xmlcommons.Version:org.apache.xerces.impl.xpath.regex.REUtil:org.apache.xerces.impl.xpath.XPath:org.apache.xerces.impl.Version:org.apache.xerces.impl.Constants:org.apache.xalan.xslt.EnvironmentCheck:org.apache.xalan.xslt.Process:org.apache.xalan.xsltc.util.JavaCupRedirect:org.apache.xalan.xsltc.cmdline.Compile:org.apache.xalan.xsltc.cmdline.Transform:org.apache.xalan.xsltc.ProcessorVersion:org.apache.xalan.lib.sql.ObjectArray:org.apache.xalan.processor.XSLProcessorVersion:org.apache.xalan.Version
Soot documentation: [HTML] (local copy)
Soot survivor's guide: [pdf] (local copy, this is for an older version of Soot but still provides a helpful overview. Sections 1–4 are recommended reading for this exercise.)
Fundamental Soot objects: [HTML]
Eric Bodden on implementing intraprocedural analyses in Soot: [HTML]
IFDS example program: [HTML] (local)
Java language specification: [HTML] [PDF]

Using Soot

In this exercise, you will use Soot as a library, rather than merely extending its frontend (as in the first exercise). That means that you need to programmatically enable / disable options; the stub source code points you to all the relevant options.

Debugging Soot

Below are some hacks for making it easier to debug Soot code that I have found to be useful. Feel free to suggest your own, and I will add them:

If you test with a program that starts with 10000 empty lines (many editors have a meta-command for repeating a command multiple times, such as C-u 1 0 0 0 0 in EMACS), you can use the test stmt.getJavaSourceStartLineNumber() > 10000 to tell from any given Stmt whether it is part of your test program. This makes it easy to detect library code.
There is a Soot plugin for older versions of Eclipse, though it does not appear to be maintained.

Submission

Deadline: 18th of October (end-of-day)
Submission mechanism: Moodle

1. Finding Inefficient Code (2.5 points)

The Java Collections framework defines lists, sets, and maps, and a general supertype Collection for sets and lists. This allows programmers to use lists in code like the following:

    LinkedList<Integer> l = new LinkedList<>();
    ...
    f(l);

    void f(Collection<Integer> set) {
        for (int i ... ) {
            if (set.contains(i)) {
                ...
            }
        }
    }

Such code would usually be much more efficient if we were using a set type instead of LinkedList. In this exercise, you will write an analysis to help programmers find such `inappropriate' uses of lists.

Note that Java internally distinguishes between multiple kinds of method invocations that are all reflected in Jimple. Some of them (invokespecial and invokestatic) tell us precisely what code to call, while others (invokedynamic and invokevirtual) only give us a supertype of the callee and the name of the invoked method. Resolving these calls requires the use of call graphs.

Type	Description	Example
`SpecialInvokeExpr`	Used for calling constructors, private methods, and `super` methods (inherited constructors and methods even if they are being overridden).	`new T();`
`StaticInvokeExpr`	Used for calling `static` methods (i.e., methods without a `this`):	`System.exit(0);`
`VirtualInvokeExpr`	Used for calling a `regular' public or protected method when the class or subclass type is known	`HashSet<Integer> hs = ...; hs.size();`
`InterfaceInvokeExpr`	Used for calling an interface method when the Java compiler only knows the interface type.	`Set<Integer> s = ...; s.size();`
`DynamicInvokeExpr`	Mainly intended for dynamically typed languages that run on the JVM, this instruction determines its invocation target by calling a helper method. Oracle's Java 8 compiler (though not JastAdd) also uses the instruction for constructing Lambda abstractions.	`Function f = i -> i - 1;`

The types InterfaceInvokeExpr, SpecialInvokeExpr, VirtualInvokeExpr have the common supertype InstanceInvokeExpr, which may be convenient for this task.

Task:
1. Write a program analysis that finds:
  - Instances of java.util.LinkedList or java.util.ArrayList that flow to
  - calls of contains() or remove()
  - using interprocedural analysis.
2. Your own main() method must allow for two parameters: (1) the class path, and (2) the set of main entry points.
3. Your solution should print the file names and line numbers to stdout (e.g., using System.out.println()) in the following format:
  IFDSLIST <methodname> <linenr-or-nothing>
  where the methodname is returned by Body.getMethod().getSignature() and the line number is returend by Unit.getJavaSourceStartLineNumber() (e.g., DEP <MyClass: int[] mymethod(int)> -1). You may omit the line number if it is not available, or otherwise print -1).
4. Test your program with the supplied benchmarks to ensure that it can analyse them.
5. Describe in a paragraph your main design choices. Note that both May and Must analyses are acceptable solutions, as long as you explain your reasoning here.
Note that contains() etc. may be invoked both via invokevirtual invokeinterface.
You can use IFDSTabulationProblem or IDETabulationProblem as basis for your implementation.
I recommend using the Stub2 program linked in the resources above as a starting point.
Submit your source code, plus any tests that you added. You do not have to submit anything that you used from the list of resources above.

2. Worst Case Execution Time (2.5 points)

Worst-case execution time analysis (WCET) is an important analysis in real-time and control systems engineering. This analysis determines the maximum amount of time that the execution of a certain subroutine will take, considering all possible control paths. (In practice, this also has stringent requirements on the hardware and the operating system).

Consider the following language:

    id  : identifiers
    nat : natural numbers

    val ::= nat | id | "true" | "false" ;

    expr ::= <value> + <value>
           | <value> - <value>
           | <value> == <value>
           | <value> < <value>
           | <value>
	   ;

    stmt ::= id "=" <expr> ";"
           | "skip" ";"
           | "twice" <stmt> ";"
           | "{"  <stmt>* "}"
           | "if" <expr> "then" <stmt> "else" <stmt>
           | "while" <expr> "do" <stmt> "done"
           | "return" <expr> ";"

Assume that the semantics of all expressions are analogous to Java, with support for boolean and integer types, and the skip instruction acting as a `no-op'. The semantics of twice s; are to execute the statement s twice in a row.

There are several approaches for computing worst-case execution time, but they all hinge on knowing how often each individual statement may be executed.

Task: Design a dataflow analysis suitable for detecting the maximum number of times that each individual statement (stmt) in a sequence of statements is executed.
1. Describe a suitable intermediate language based on control-flow graphs. Specifically, describe how you represent the stmt productions.
2. Design an analysis that can determine how often each statement is executed, ignoring while loops. Your analysis must be fully precise.
3. Extend your analysis to support while loops, and to be precise about the number of executions in the presence of `for-loop'-like loops:
  - x = 0; while x < 10 do ... ; x = x + 1; ... done
  (where x is an arbitrary variable that is not updated anywhere else in the program.) The analysis may be conservative otherwise. You do not have to be precise in nested loops.
4. Describe your lattice(s) and your merge operation(s).
5. Describe the transfer functions for your nodes. If you choose a path-sensitive analysis (cf. the slides to lecture 4, where we skimmed this briefly at the end), your transfer functions can transfer different information over true and false branches of conditionals.
6. Describe the height of your lattice.
7. Describe the properties of your analysis (forward? backward? path-sensitive? flow-sensitive?)
8. Describe the properties of your framework (monotonic? distributive?)
9. Describe how your analysis handles nested loops. What happens if the loops use the same loop variable?
10. Give one program that executes in a finite number of steps that your analysis cannot analyse precisely and explain the limitation (unless your previous point already covers this question).
Since the source language has no call operation, your analysis can be wholly intra-procedural.
Hints:
- The twice construct is best represented as a control-flow construct.
- You will have to assign properties not just to variables, but also to statements.
- Your variables may need more than one property assigned to them.
Alternatively, you can implement an equivalent analysis in Soot, using BranchedFlowAnalysis. Note that this will likely be more challenging.
Write down your solution as a text or pdf file and submit answers to each of the points listed above.