Analyzer

The TJSA tool is an implementation of the string type system for Java programs to ensure adherence to a given security guideline.

We apply the theoretic concepts of the FJEUS core language, and extend the standard Java language with

  • string classes refined with monoid elements, used to describe the values and kinds of string objects
  • output effects, used to describe the string output of a program

These extensions do not form a new language; instead, they are encoded as standard Java 6 annotations. These annotations are defined by the string analyzer framework. They can be used by the Java programmer to refine a String class with monoid elements, or to annotate a method with its output effect (see below), in order to support the string analysis. Since the analyzed programs are valid Java programs, they can be compiled by an ordinary Java compiler.

The TJSA tool performs two steps:

  1. For unannotated string classes and method signatures, it infers the annotations, if possible.
  2. It type checks the program (including the inferred annotations) according to the rules of our string type system, which is a superset of the standard Java class type system.

If a program type checks, this means (by correctness) that the program follows the security policy. If the program does not type check, this means either that the program violates the security policy, or that the program is secure, but the type system is too weak to prove it (since the type system is not complete).

The policy is expressed in form of a monoid whose elements are equivalence classes of strings. The concrete monoid from section 2.1 of the paper and its concatenation operation are currently hard-coded as a module into the TJSA tool, i.e. the analyzer. The following descriptions are with respect to the formalizations of the use cases; please refer to the paper for a more thorough explanation.

 

Refined string types

Refined string types can be declared using the @ST annotation: For example,

@ST("Input,Lit") String name;

declares a variable name of class String that is further refined with the set of monoid elements {[Input],[Lit]}. In other words, name can either contain untrusted (user-generated) strings, or trusted string literals. Generally, the @ST annotation takes a string containing a comma-separated list of monoid element names, or "*" to denote "any string".

@ST annotations can be used wherever something of class "String" is declared, like local variable declarations, field declarations, or formal method arguments. For example, the secure string API provides a method annotated as follows:

@ST("Input") String getUserInput();

which means that getUserInput returns strings that are refined with the singleton set {[Input]}. In the following, we write such a refined type as String{[Input]}.

The following statement type-checks:

name = getUserInput();

because String{[Input]} is a subtype of String{[Input],[Lit]}. (In the type system, String<U> is a subtype of String<V> whenever U is a subset of V, as described in the paper.)

The type system assigns the type String{[Lit]} to most string literals, with two exceptions as follows:

  • "<script>" --> type String{[Script]}
  • "</script>" --> type String{[/Script]}
  • "any other literal" --> type String{[Lit]}

Therefore, the following statement also type-checks:

name = hasUserInput ? getUserInput() : "no name given";

because the conditional expression is either of type String{[Input]} or of type String{[Lit]}, which means it can be typed with String{[Input],[Lit]}, which is exactly the type of name. The following code, however, does not type-check:

@ST("C1") String sanitizedString = name;

When strings are concatenated using the "+" operator, the analyzer uses the concat function of the monoid to compute the refined string type of the entire expression. Therefore, the following assignment is ok:

@ST("Input") String prefixedName = "Name: " + getUserInput();

because {[Lit]} + {[Input]} = {[Input]}.

 

Type inference

If you leave away the @ST annotation, then the annotation is inferred automatically. For example,

String name;
name = hasUserInput ? getUserInput() : "no name given";

Given the above type of getUserInput, the following would be inferred:

@ST("Input,Lit") String name;
name = hasUserInput ? getUserInput() : "no name given";

In general, we infer the smallest string annotation such that the program is typable. More precisely, the inference algorithm creates type variables ?1, ?2, ?3 etc for each unannotated string class that occurs in the code, and derives constraints for these variables, such as "'Input' and 'Lit' must be in ?1". It then solves the constraints by calculating the smallest solution, if it exists.
 
Since the automatically numbered type variables that occur in the TJSA tool output may be hard to relate to the string types in the code, the programmer may find it more convenient in the testing stage to assign names to the type variables as follows:
 
@ST("?type-of-name") String name;
 
Note: The type inference only works where the type can actually be inferred, i.e. for
  • local string variables
  • string fields that are declared in a class
  • string types in a method signature when the method body is given 

In all other cases (most notably, string types occurring in an interface definition or in the Java class library), "no annotation" is equivalent to @ST("*"), which is equivalent to "any string". For example, the secure string API has a sanitization method that takes any string, and returns P1-strings:

@ST("C1") String escapeToHTML(String stringToSanitize);

Because the API is given as an interface and escapeToHTML does not have an implementation there, stringToSanitize gets the type String<*>.

This interpretation also allows the secure usage of string manipulation methods of the standard Java class library. For example, the substring method of the standard String class has the signature:

String substring(...);

This is treated by the analysis as

@ST("*") String substring(...);

Since all returned string values from a class library method are conservatively treated as "any string", the class library cannot be used to "bypass" the string types, e.g. by using substring to extract a malicious string out of a sanitized string. At the same time, since all formal string parameters of library methods are treated as "any string", one can always pass refined strings to methods of the Java class library, because String<*> is always a supertype of String<U> for all U.

 

Output effects

In the secure string API, there is a special method called "output" that has an output effect, namely the type of the string that is passed as an argument. For example, with the variable "name" from above, the statement

output(name);

has the effect {[Input],[Lit]}. Effects of subsequent output statements are concatenated, thus

output("<b>Your name is: ");
output(escapeToHTML(name));
output("</b>");

also has the overall output effect {[C1]}, because the first output has effect {[Lit]}, the second output has effect {C1} (name is C1-sanitized), the third output again has effect {[Lit]}, and the monoid defines the concatenation {[Lit]} + {[C1]} + {[Lit]} = {[C1]}.

Using the @SE annotation at method declarations, one can declare an upper bound of the output effect that a method has. For example, by writing

@SE("Lit,C1,Script,/Script") void main() { ... }

the programmer can enforce that the method main has output effect [Lit] or [C1] or [Script] or [/Script], but no other effect. 

The @SE annotation can be used directly in front of a method declaration. The analyzer checks whether the possible effects induced by the method body are a subset of the sets given by @SE.

In the particular example above, we have actually annotated the main method with the monoid elements that are considered "allowed" in the paper, because they represent the output traces that lead the finite state machine to an accepting state.

 

Effect inference

Similarly to refined string types, if you leave away the @SE annotation at a method, the TJSA tool tries to infer it. For example, if you write

void main() {

String name;
name = hasUserInput ? getUserInput() : "no name given";
output("<b>Your name is: ");
output(escapeToHTML(name));
output("</b>");

 }

then TJSA automatically infers @SE("C1") for the effect of main (and, as described above, @ST("Input,Lit") for the type of name). As with type inference, effect variables are automatically numbered ?1, ?2, ?3, etc, which can be prevented by using named effect variables like @SE("?effect-of-main").

Note: The effect inference only works for methods that have an actual implementation. If you leave away the @SE annotation at methods that do not have a body, this is equivalent to @SE("Lit"), which means the output effect of the method is equivalent to the output of a literal, or to an empty string, or to no output. Again, this is to make it compatible with the Java class library, which we assume does not contain any calls to our special method "output".

 

Secure string API and built-in types

The "secure string API", i.e. the annotations, the functionality to read user input, sanitize strings, and output strings, is not built into the analyzer, but specified as an ordinary Java interface. For the TJSA tool, the API forms part of the program to be analyzed. The API currently looks like this:

@interface ST { String value(); }
@interface SE { String value(); }

interface ExternalMethods {
    @ST("Input") String getUserInput();
    @ST("C1") String escapeToHTML(String str);
    @ST("C2") String escapeToJS(String str);

    void output(String str);
}

Note: The annotation definitions are ignored by the analyzer, because it "knows" them anyway. They are however needed to make the annotated program a valid Java program. The type of "output" is built into the analyzer; see below.

The following types and properties are hard-wired into the analyzer:

  • The type of the "+" operator (string concatenation) is hard-wired as String<U> + String<V> : String<UV>.
  • Any method with the following signature:
      void output(String s);
    gets the type
      @SE(U) output(@ST(U) String s);
    A long-term goal is to enable this polymorphic type in the secure string API, or at least to provide a way to specify which method is the "output" method.
  • The annotation names "@ST" and "@SE".
  • The entire security policy, i.e. the monoid elements and the concatentation of monoid elements. The long-term goal is to pass this information in a security policy file to the TJSA tool.
  • The types of string literals. This shall also be defined in the security policy file.

 

Context-sensitive analysis

Consider the following program (we use named type variables for illustration):

@ST("?ret") String identity(@ST("?param") String s) {
   return s;
}
...

@ST("Input") String input = getUserInput();
@ST("C1") String sanitized = escapeToHTML(input);
...
@ST("?x") String x = identity(input);
@ST("?y") String y = identity(sanitized);
...

Since 'identity' is first called with a "Input" and then with a "C1" string, the best string type that can be inferred for "?param" is "Input,C1", hence the best type for "?ret" is also "Input,C1", which is then the solution for both "?x" and "?y".

Using context-sensitive analysis, we can choose a call context that allows us to distinguish both calls to 'identity', and analyse the method separately for each call. The method then gets two different types:

@ST("Input") String identity(@ST("Input") String s);
@ST("C1") String identity(@ST("C1") String s);

This allows the analyzer to assign "Input" to "?x" and "C1" to "?y", i.e. to improve the precision of the analysis.

The implementation is parametric in the kind of context abstraction. The tool provides a number of specific context implementations:

  • 10-CFA: the context is the call stack, or more precisely, a stack of pairs, each consisting of the name of the called method and the line numbers of the invocation point.
  • Context-insensitive: one and the same context is used for all method calls.

The abstraction can be selected with a command-line parameter.

 

 

Details of the analysis phases

The TJSA tool performs the following analysis phases. It outputs the result of the last phase, unless the verbosity switch is used, in which case it outputs the result of all phases.

  • First, the input file(s) are parsed simultaneously, resulting in one large abstract syntax tree.
  • Then, class and field information is collected in a symbol table, together with their declared type. For unannotated strings, type variables are created and placed in the symbol table.
  • Then, each method body is type checked once according to the rules of the string type system. The AST is directly annotated with type information. Whenever an unannotated local variable is encountered, a type variable is created and placed in the typed AST. Whenever side conditions of the typing rules cannot immediately be checked because they involve type variables, a respective high-level constraint is generated and associated with the current method. Constraints may look like this:
    - {Input} is a subtype of ?1
    - {Lit} concatenated with {C1} concatenated with {Lit} is a subtype of ?2
    - the type ?1 joined with the type ?2 is a subtype of ?3
  • Also, whenever a method is called, method call information is collected, therefore creating an abstract call graph.
  • Using the abstract call graph, the program is executed abstractly, where the execution is steered by the contexts. Whenever a method is "executed" in this way, a new method signature indexed with the current context is generated. The high-level constraints of the method are parametrized with the current context, resulting in low-level constraints. (In a context-insensitive setting, high-level constraints are identical to low-level constraints, and indexed method signatures are identical to the declared signatures in the program.)
  • The low-level constraints are translated into Datalog rules, which in turn are externally solved using Succinct Solver.
  • Just to be sure that the solution is right, the analyzer then checks each method again: for each of its context-indexed signatures, it verifies the typed AST, filling in the type variable solutions instantiated with the current context.