Lund University →
Faculty of Engineering →
Department of Computer Science
EDA045F Program Analysis: Homework 3
Please try to track the time that you need for the homework
assignment and report it, so that we can calibrate the effort for future classes.
Doop, Datalog, and Points-To Analysis
This is a group project and intended to be worked on in a team of two. If you have no partner, you can also submit alone.
Background
In this exercise, you will work with the Soufflé language, which is a dialect
of Datalog. You can refer to the slides of lecture 6 for more information about Datalog. In addition,
you will use the Doop framework for pointer analysis.
Resources
- Soufflé: [source]
- javacc fact database: [archive] A database of facts for JavaCC (note: about 1 GB when decompressed!)
- small fact database [v2]: [archive] A database of facts for a smaller program (note: about 850 MB when decompressed!)
- small program [v2]: [archive] Source code for the small program
- Mini-Doop analysis in Datalog: [Soufflé-style Datalog]
- Doop tutorial: [pdf] (optional, this is for a slightly older version of Doop)
- Doop git repository: [bitbucket]
- Soufflé major mode for syntax highlighting in EMACS: [EMACS LISP file] (optional)
Building and Using Soufflé with our dataset
You can build Soufflé by running:
./configure && make -j4
(where the parameter to `-j' is intended to speed up compiling Soufflé itself through parallelisation; you can tune it for your system. This does not affect the execution speed for Datalog code.)
You can now run Soufflé as ./src/souffle, which provides suitable help. Alternatively, you can install Soufflé into some directory dir by running
./configure --prefix=dir && make -j4 && make install
If you have extracted our dataset into the directory javaccfacts, you can now run a Soufflé datalog program p.dl on our dataset via
souffle -F javaccfacts p.dl
By default, Soufflé reads from and writes to the current working directory, so you can omit the -F javaccfacts parameter if you are calling it from within the facts directory.
You can use the -j T
parameter to run Soufflé in multi-threaded mode with T threads.
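As a quick sanity check that your Soufflé build works, you can run a minimal program such as the following (a sketch; the relation and file names are made up for illustration and are unrelated to the dataset):

```
// path.dl -- minimal Soufflé example: transitive closure over an edge relation.
.decl Edge(a: symbol, b: symbol)
Edge("a", "b").
Edge("b", "c").

.decl Path(a: symbol, b: symbol)
Path(a, b) :- Edge(a, b).
Path(a, c) :- Path(a, b), Edge(b, c).

.output Path(IO="stdout")
```

Running souffle path.dl should print the two edges plus the derived pair ("a", "c").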
Building and Using the Doop Fact Generator
To build the Doop Fact Generator, git clone https://bitbucket.org/yanniss/doop and then run ./doop from within the directory you just created. Doop will automatically recompile itself (and download dependencies via the gradle build tool) as needed.
While the Doop main program can automatically invoke Soufflé for analysis, we will only be using it to generate facts and then run Soufflé `by hand' (and only in the last task; facts for all other
tasks can be downloaded above).
For reference, I used the following invocation to obtain the fact files that you found in javaccfacts.tar.gz:
./doop -Xstop-at-facts output/javaccfacts -l '/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/rt.jar 2.exercise/junit4.jar' -i 2.exercise/javacc-5.0.jar --main 'jjdoc jjtree javacc org.javacc.jjtree.Main org.javacc.utils.JavaFileGenerator org.javacc.jjdoc.JJDocMain org.javacc.parser.Main'
Submission
- Deadline: 14th of November (end-of-day)
- Submission mechanism: Moodle
1. The Type Hierarchy (1 point)
In this exercise, we examine the type hierarchy of javacc. The data has already been extracted by the Doop framework, so you only have to download and extract the javaccfacts.tar.gz archive and prepare Soufflé in order to get started.
Recall that Soufflé has a notion of input relations that load data from an external file.
The following setup allows you to load the type relations extracted from Java, assuming that you have previously declared DirectSuperclass and DirectSuperinterface as binary relations:
.input DirectSuperclass(IO="file", filename="DirectSuperclass.facts", delimiter="\t")
.input DirectSuperinterface(IO="file", filename="DirectSuperinterface.facts", delimiter="\t")
These direct supertype relationships describe each class' superclass, and each class' declared superinterfaces.
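These `.input' directives presuppose matching `.decl' declarations. A plausible setup is the following sketch (the column names are my own; Soufflé only requires the arity and types to match the fact files):

```
.decl DirectSuperclass(sub: symbol, super: symbol)
.decl DirectSuperinterface(sub: symbol, iface: symbol)
.input DirectSuperclass(IO="file", filename="DirectSuperclass.facts", delimiter="\t")
.input DirectSuperinterface(IO="file", filename="DirectSuperinterface.facts", delimiter="\t")
```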
- Task:
- Write a Datalog program that collects the names of all types and writes them to Types.facts.
- Write a Datalog program that computes all subtype/supertype relationships and writes them to AllSupertypes.facts. Entries should have the form subtype supertype. Recall that
each type is its own subtype, and that the supertype relationship is transitive: if A is a subtype of B and B is a subtype of C, then A is also a subtype of C.
- Write a Datalog program that computes the number of subtypes for each type, excluding the type itself. The program should write the output to SubtypeCount.facts as a binary relation, with the right entry being the count.
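For the last task, Soufflé's count aggregate is the natural tool. The following sketch assumes Types and AllSupertypes as defined in the previous two tasks; treat it as one possible shape, not a complete solution:

```
// Sketch: number of strict subtypes per type, via Soufflé's count aggregate.
.decl SubtypeCount(type: symbol, n: number)
SubtypeCount(t, n) :-
    Types(t),
    n = count : { AllSupertypes(sub, t), sub != t }.
.output SubtypeCount(IO="file", filename="SubtypeCount.facts", delimiter="\t")
```

Note that count over an empty body yields 0, so types without subtypes still appear in the output.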
2. Callgraph analysis (1 point)
We continue our exploration of Soufflé, but this time we use the small datasets. We perform a points-to analysis that uses a version of the Doop points-to framework.
For this exercise, we build on the definitions from minidoop.dl.
We will be utilising Doop's relations for our analysis. Note that
Doop does not currently support invokedynamic, which in
turn is not needed for representing our data set. Furthermore, its
representation merges invokevirtual
and invokeinterface, meaning that our program analysis only
needs to consider three different forms of invocations:
| Type | Description | Example |
| --- | --- | --- |
| SpecialMethodInvocation | Used for calling constructors, private methods, and super methods (inherited constructors and methods, even if they are being overridden). | new T(); |
| StaticMethodInvocation | Used for calling static methods (i.e., methods without a this). | System.exit(0); |
| VirtualMethodInvocation | Used for calling a `regular' public or protected method when the class or subclass type is known, or for calling an interface method. | HashSet<Integer> hs = ...; hs.size(); |
- Task:
- Run minidoop.dl (via Soufflé) on the small dataset and note the following performance numbers:
- Threads: Number of threads that you configured Soufflé to use (keep this constant for all exercises!)
- Runtime: Number of seconds that Soufflé needed to complete all processing. If you use the time command on Unix to measure this, use that command's real output (`wallclock time').
- Callgraph size: Number of entries (lines) in the generated callgraph file (CallGraphEdge.csv by default). On Unixish systems, you can use wc -l to compute this number.
- Points-to: Number of entries (lines) in the generated `points-to' file for parameters and local variables (VarPointsTo.csv).
- Copy minidoop.dl into minidoop-cha.dl. Modify minidoop-cha.dl as follows:
- Remove the three rules at the end of the file that compute the CallGraph relation.
- Implement your own CallGraph rules that implement Class Hierarchy Analysis (you may want to look at the original rules for inspiration):
- Whenever a potential call is non-virtual, assume that the call is taken (part of the call graph).
- For any virtual call, assume that the MethodInvocation may invoke any method of matching name and signature from a suitable subtype. The basic.SupertypeOf relation can help you check for this property.
- Note that you must handle all three types of calls: Special, Static, and Virtual.
- Also note: InvokeSpecial behaves `a bit' like a virtual method call in that it has a receiver object (`?base', in Doop terminology), but it does not use dynamic dispatch, implying that there is always exactly one Method that is invoked.
- You can also have a look at the bonus slides for callgraph construction.
- Run your analysis and note down the same four performance numbers as above (warning: this may take a while).
- Copy minidoop-cha.dl into minidoop-rta.dl. Modify minidoop-rta.dl as follows:
- Alter your CallGraph rules to implement Rapid Type Analysis:
- The rules for handling calls are as for CHA, except as follows:
- You only analyse reachable code. The definition of the Reachable predicate already automatically exploits the call graph to determine which methods are reachable.
- When analysing virtual calls: Instead of analysing all possible subtypes, you only analyse subtypes that have been instantiated at least once in the program (it doesn't matter if the instantiation is in an unrelated part of the code).
- Run your analysis and note down the same four performance numbers as above. A good way to test your results is to check the call graph output and compare it against CHA; the call graph outputs are quite human-readable, as they use Soot-derived strings.
- If you want to test your code against your own custom code, you can do Task 4 first.
Make sure to submit your measurement results (and the Datalog files!).
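To illustrate the intended shape of the CHA and RTA rules, here is a sketch. The relation names and arities below are illustrative guesses, not minidoop.dl's actual schema; only basic.SupertypeOf and Reachable are taken from the assignment text, so consult minidoop.dl for the real relations:

```
// CHA (sketch, hypothetical relation names):
// Non-virtual calls have exactly one target.
CallGraph(?invo, ?callee) :-
    StaticMethodInvocation(?invo, ?callee).
CallGraph(?invo, ?callee) :-
    SpecialMethodInvocation(?invo, ?callee).
// A virtual call may reach any matching method in any subtype of the
// receiver variable's declared type.
CallGraph(?invo, ?callee) :-
    VirtualMethodInvocation(?invo, ?signature, ?base),
    Var_Type(?base, ?declared),
    basic.SupertypeOf(?declared, ?sub),
    MethodImplemented(?signature, ?sub, ?callee).

// RTA (sketch): as CHA, but only in reachable methods and only for
// subtypes that are instantiated somewhere in the program.
CallGraph(?invo, ?callee) :-
    Reachable(?caller),
    Instruction_Method(?invo, ?caller),
    VirtualMethodInvocation(?invo, ?signature, ?base),
    Var_Type(?base, ?declared),
    basic.SupertypeOf(?declared, ?sub),
    InstantiatedType(?sub),
    MethodImplemented(?signature, ?sub, ?callee).
```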
3. Object-Sensitive Points-to analysis (2 points)
We again use the small dataset.
Object-sensitive points-to analysis is a form of points-to analysis that distinguishes all derived information (call graph, reachability, ...) by the object that was the most recent receiver of a method call. For example, consider the following code:
class A {
    Object x;
    Object y;
    void f(Object xarg) {
        A aux = new A();
        aux.g(xarg);
        y = aux.y; // COPY
    }
    void g(Object yarg) {
        x = yarg;
    }
    static void entrypoint() {
        A a1 = new A();
        A a2 = new A();
        A a1alias = a1;
        a1.f("foo");
        a2.f("bar");
        a1alias.f("foo");
    }
}
Here, a1.f() and a2.f() would be analysed in
different contexts, but a1.f() and a1alias.f()
would be analysed in the same context. Since the calls
to g() all go to the object allocated into
variable aux, any contextual knowledge will get lost
(assuming that heap objects are identified by allocation site, as
in minidoop.dl). Thus, the analysis would determine
(precisely) that a1.x can be "foo", a2.x
can be "bar", and aux.y can be "foo"
or "bar". This in turn means that the analysis would be
imprecise in the line marked // COPY and conclude that
both a1.y and a2.y could be "foo"
or "bar".
- Task:
- Copy minidoop.dl into minidoop-os1.dl. Modify minidoop-os1.dl as follows:
- Alter all relevant rules and definitions to make the analysis object-sensitive.
- You will not need to modify anything above line 404 in the file.
- Since the main entry points are not called from an object, you need to use a `fake' context to process these initial calls. You can choose any context that doesn't naturally occur in a program as a call site, such as the string "."
- Some useful questions to ask yourself:
- How does a relation change when we parameterise it by a context?
- What is the type of our contexts?
- When does the context change?
- Ensure that you generate output that is analogous to the output of minidoop.dl, i.e., your output relations must not contain the context objects (it's fine to change the name of the output files, as long as you describe your changes). Feel free to generate helper relations and rules as needed.
- Measure and report on the same four performance numbers as in Task 2.
- To ascertain the quality of your output, compare it against the output of the original minidoop.dl (focussing on the application code, for which you have source code).
- Write a short (1-2 paragraph) description explaining your changes to the code.
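To make the shape of the change concrete, here is a sketch of how a points-to relation and a call rule change when parameterised by a receiver-object context. All relation names below are illustrative, not minidoop.dl's actual schema:

```
// Context-insensitive:  VarPointsTo(?var, ?heap)
// Object-sensitive:     VarPointsTo(?ctx, ?var, ?heap)
// where ?ctx is the heap object (allocation site) of the current receiver.

// Allocation: the new object is recorded under the allocating method's context.
VarPointsTo(?ctx, ?var, ?heap) :-
    Reachable(?ctx, ?method),
    AssignHeapAllocation(?heap, ?var, ?method).

// Virtual call: the callee runs in the context of the receiver object,
// i.e. the context changes from ?callerCtx to ?heap.
Reachable(?heap, ?callee) :-
    Reachable(?callerCtx, ?caller),
    VirtualCallTarget(?caller, ?base, ?callee),
    VarPointsTo(?callerCtx, ?base, ?heap).
```

The same transformation applies to the other derived relations (call graph, field points-to, ...); the `fake' context mentioned above seeds Reachable for the main entry points.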
4. Doop fact extraction (1 point)
Pick one additional Java program with suitable main entry points; you can refer to the examples from the previous exercise, use a small program of your own, or choose an existing executable Java program.
Use Doop to obtain the facts for this program, and repeat the previous task.
- Task: Repeat your measurements from tasks 2 and 3 (you may skip Class Hierarchy Analysis), and report your result for the new dataset. Include either the program in question or provide a reference. If you hit Moodle upload limits, you can e-mail the program to me separately from your homework submission.