Program Analysis

Motivation

  • Fundamental knowledge of language, compiler, and runtime implementation helps make better engineering decisions everywhere.
  • Compiler optimization trains bit-fiddling as well as algorithmic thinking.

Approaches to Program Analysis

  • Automated analysis of program behaviour, to find errors, optimize performance, or find security vulnerabilities.
  • Program analysis cannot ensure termination, soundness, and completeness all at once.
  • Motivating example: the Heartbleed vulnerability.
  • Four main approaches to program analysis:
    • Data Flow Analysis.
    • Constraint Based Analysis.
    • Abstract Interpretation.
    • Type and Effect Systems.

Introduction

  • Where is program analysis used

    • Compilers
    • Bug finding tools
    • Performance profilers
    • Code completion
    • Automated testing
    • Code Summarization/Documentation
  • Static Analysis

    • Lint, Coverity - Suspicious error patterns
    • Microsoft SLAM - checks API usage rules
    • FB Infer - Memory leak detection
    • ESC/Java - Verifying invariants
  • Dynamic Analysis

    • Purify - Array bound checking
    • Valgrind - Memory leak detection
    • Eraser - Datarace detection
    • Daikon - Finding likely invariants
  • Static Analysis:

    • Try to discover information about a program without running it.
    • Cost is proportional to the program's size.
    • May be less precise: it can report spurious errors (false positives)
  • Dynamic Analysis:

    • Runs the program and collects information about events that take place at runtime.
    • Cost is proportional to the program's running time.
    • Unsound: it only examines a finite number of runs, so it may miss some errors (false negatives)
  • Dynamic analyses include

    • Profiling - execute and log events that happen at runtime, e.g. gprof
    • Test generation - produce tests that cover most of the program code, e.g. KLEE
    • Emulation - execute the program in a virtual machine that collects and analyzes data, e.g. Valgrind
    • Instrumentation - augment the program with a meta-program that monitors its behaviour, e.g. AddressSanitizer
  • Static Analyses

    • Dataflow analyses
      • propagate information based on the dependencies between program elements, which are given by the syntax of the program.
    • Constraint-based analyses
      • we derive constraints from the program. The relations between these constraints are not determined explicitly by the program syntax
    • Type Analyses
      • propagate information as type annotations, allowing us to prove properties about the program such as progress and preservation.
  • Parts of a Static Analysis

    • Program representation
      • CFG, AST or bytecode
    • Abstract domain
      • how to approximate program values
    • Transfer functions
      • assignments, conditionals, merge points
    • Fixed-point computation algorithm
      • invoke transfer functions
      • iterative algorithms
  • Soundness and Completeness

    • a sound analysis reports all real errors (no false negatives); a complete analysis reports only real errors (no false positives)
  • Precision and Recall

    • precision = true positives / all rejected (reported) programs; recall = true positives / all bad programs
  • Undecidability of program properties

    • program analysis cannot ensure termination + soundness + completeness simultaneously (cf. Rice's theorem).
    • which of the three is relaxed determines whether a good program analysis can be designed.
  • Operational Semantics

    • what the program does, execution on an idealized machine.
  • Denotational Semantics

    • What the program means
    • Mathematical object in some suitable semantic domain, function from states to states
  • Axiomatic Semantics

    • What properties the program satisfies.
    • Collection of logical properties asserted about a program, e.g. pre- and post-conditions.
  • Program Invariants

    • Loop invariants
      • a property of a loop that holds before and after each iteration
      • captures the essence of the loop's correctness, and by extension of the algorithms that employ loops (see the sketch below)
    • Class invariants
      • a property that holds for all objects of a class
      • established upon construction of the object and maintained between calls to public methods
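    • A minimal sketch in Python (the running_sum example is hypothetical) of checking a loop invariant with assertions:

          def running_sum(xs):
              # Invariant: total == sum(xs[:i]) relates total to the prefix seen so far.
              total, i = 0, 0
              assert total == sum(xs[:i])      # holds before the first iteration
              while i < len(xs):
                  total += xs[i]
                  i += 1
                  assert total == sum(xs[:i])  # re-established after each iteration
              return total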

Software specifications

  • Specifications must be explicit.

    • Testing checks whether program implementation agrees with program spec.
    • Testing is a form of consistency checking between implementation and specification.
  • Specs evolve over time.

  • Classification of Testing Approaches

    • Manual vs Automated
      • manual: finds bugs quickly; no need to write tests; no need to maintain tests when the software changes
      • automated: efficient test suite; potentially better coverage
    • Black-box vs White-box
      • black-box: no need to modify, analyze, or study the code; works on code in any format
      • white-box: efficient test suite; potentially better coverage
  • Classification of specifications

    • Safety properties

      • program will never reach a bad state
      • forms:
        • Assertions,
        • Types
          • adding the specification to the implementation allows the computer to check the implementation against the specification.
          • *Checker framework
        • Type-state properties
          • refine types with finite state information.
          • makes it possible to specify which operations are valid in each state and how operations affect the state.
          • also called temporal safety properties.
          • e.g. acquiring locks, reading a file only after opening it, dropping root privileges.
    • Liveness properties

      • program will eventually reach a good state
      • program termination, starvation freedom
    • Pre and Post conditions

      • A pre-condition is a predicate
        • assumed to hold before a function executes.
      • A post-condition is a predicate
        • expected to hold after a function executes whenever the precondition also holds.
      • Most useful if they are executable (see the sketch below)
      • Need not be precise.
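      • One way to make pre- and post-conditions executable is plain assertions; a minimal Python sketch (integer_sqrt is a hypothetical example):

            import math

            def integer_sqrt(n):
                assert n >= 0, "pre: n must be non-negative"       # pre-condition
                r = math.isqrt(n)
                assert r * r <= n < (r + 1) * (r + 1), "post"      # post-condition
                return r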
  • Houdini Algorithm

    • An annotation assistant for the ESC/Java static modular verifier.
    • Generates many candidate invariants and employs ESC/Java to verify or refute each
    • Different techniques for guessing invariants
      • Static analysis - mined from source code based on heuristics
      • Dynamic analysis(Daikon) - facts observed while running the program.
    • Workflow
      • Source -> Invariant Guesser -> Annotated source -> Invariant Checker -> Invariant warnings -> Annotation Remover -> (annotation refutation loop, sketched below)
    • Pros
      • infers both loop invariants and method contracts
      • infers the strongest invariant in the candidate set
      • easy to implement
    • Cons
      • only infers conjunctive invariants.
      • getting a high-quality invariant requires guessing aggressively many candidates, which makes the refutation process more expensive
      • no guarantee that the inferred invariants are useful for verifying the desired property.
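    • The refutation loop can be sketched as a fixed-point computation in Python; check(program, invariants) is a stand-in for the ESC/Java call and is an assumption here:

          def houdini(program, candidates, check):
              # check(program, invariants) returns the subset of invariants it refutes
              invariants = set(candidates)
              while True:
                  refuted = check(program, invariants)
                  if not refuted:
                      return invariants   # remaining candidates hold together
                  invariants -= refuted   # annotation refutation loop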
  • How good is your Test Suite

    • how do we know that our test suite is good??

      • too few tests: may miss bugs
      • too many tests: costly to run, bloat and redundancy, harder to maintain
    • Approaches

      • Code coverage metrics

        • metric to quantify the extent to which a program's code is tested by a given test suite.
        • given as percentage of some aspect of the program executed in the tests.
        • 100% coverage is rare in practice
          • exception: safety-critical code
        • Types
          • Function coverage
          • Statement coverage
          • Branch coverage
          • Line coverage
          • Condition coverage
          • Basic block coverage
          • Path coverage
      • Mutation Analysis

        • founded on the competent programmer assumption: the program is close to correct to begin with
        • Key idea
          • Test variations(mutants) of the program
        • If the test suite is good, it should report failing tests on the mutants (see the toy sketch below).
        • Find a set of test cases that distinguishes the original program from its mutants.
        • What if a mutant's results are the same as the original's? (an equivalent mutant, which no test can kill)
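        • A toy Python sketch of the idea (all names illustrative): mutate one operator, rerun the suite, and check whether some test kills the mutant:

              def original(a, b):
                  return max(a, b)

              def mutant(a, b):          # mutation: max replaced by min
                  return min(a, b)

              def suite_passes(f):
                  return f(1, 2) == 2 and f(5, 3) == 5

              print(suite_passes(original))  # True
              print(suite_passes(mutant))    # False -> the mutant is killed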

Data Flow Analysis

  • Propagate analysis information along the edges of a control flow graph.

  • Control Flow graph(CFG)

    • directed graph that summarises flow of control in all possible runs of the program.
    • Nodes represent statements in the program.
  • Abstract vs concrete states

  • Abstract Domain

    • ordered in a lattice.
  • Goal

    • Compute analysis state at each program point
  • For each statement, define how it affects the analysis state.

  • Available expression analysis

    • for each program point, compute which expressions must have already been computed and not later modified.
      • useful to avoid re-computing an expression
      • used as part of compiler optimizations
  • Transfer Functions

    • How a given statement affects the analysis state
    • analysis state = available expressions
  • Two functions

    • gen: available expressions generated by a statement
      • gen: Stmt -> P(Expr)
      • a statement generates an available expression e if:
        • it evaluates e
        • it does not later write any variable used in e
      • otherwise, the function returns the empty set.
    • kill: available expressions killed by a statement
      • a statement kills an available expression e if:
        • it modifies any of the variables used in e
      • otherwise, it returns the empty set.
  • Always start by building the control flow graph of the program before doing dataflow analysis.

  • Propagating Available Expressions

    • Forward Analysis
      • propagate available expressions in the direction of control flow
      • for each statement s, outgoing available expressions are: incoming avail. exprs. minus kill(s) plus gen(s)
      • when control flow splits, propagate available expressions both ways
      • when control flows merge, intersect the incoming available expressions.
  • Data Flow equations (see the solver sketch below)

    • AE_entry(s) = ∩ { AE_exit(p) : p a predecessor of s }, and ∅ at the entry node
    • AE_exit(s) = (AE_entry(s) \ kill(s)) ∪ gen(s)
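
  • A compact Python sketch of forward available-expressions propagation; the CFG is given as predecessor lists, and the gen/kill maps are assumed inputs:

        def available_expressions(nodes, preds, gen, kill, all_exprs):
            # Forward must-analysis: the meet operator is set intersection.
            # Entry nodes (no predecessors) get the empty set; all others
            # start optimistically from the full domain.
            ae_exit = {s: set(all_exprs) for s in nodes}
            changed = True
            while changed:                       # iterate to a fixed point
                changed = False
                for s in nodes:
                    if preds[s]:
                        entry = set(all_exprs)
                        for p in preds[s]:
                            entry &= ae_exit[p]  # meet: intersection
                    else:
                        entry = set()            # boundary condition
                    new_exit = (entry - kill[s]) | gen[s]
                    if new_exit != ae_exit[s]:
                        ae_exit[s], changed = new_exit, True
            return ae_exit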

Define a Data Flow Analysis

  • Defined by 6 properties
    • Domain
      • the analysis associates some information with every program point
        • information means elements of a set
      • Domain of the analysis: all possible elements the set may have.
        • for available expressions analysis: the domain is the set of non-trivial expressions
    • Direction
      • the analysis propagates information along the control flow graph
        • forward analysis: normal flow of control
        • backward analysis: invert all edges (reason about executions in reverse)
    • Transfer Function
      • defines how a statement affects the propagated information.
      • df_exit(s) = f_s(df_entry(s)) for some transfer function f_s
    • Meet Operator
      • what if two statements s1,s2 flow to a statement s?
        • forward analysis: execution branches merge
        • backward analysis: branching point
      • the meet operator defines how to combine the incoming information
        • union
        • intersection
    • Boundary Condition
      • what information to start with at the first CFG node?
        • forward analysis: first node is the entry node.
        • backward analysis: first node is exit node
      • common choices
        • empty set
        • entire domain (all elements)
    • Initial values
      • what is the information to start with at intermediate nodes?
      • common choices
        • empty set
        • entire domain (all elements)

Reaching Definitions Analysis

  • For each program point compute which assignments may have been made and may not have been overwritten.
  • Define the 6 properties of this analysis
    • domain - definitions in the code
    • direction - forward
    • meet operator - union
    • transfer function; gen(s), kill(s)
    • boundary condition - entry node starts with all variables undefined
  • always start with CFG.

Very Busy expression analysis

  • For each program point, find expressions that must be very busy: on all future paths, the expression will be used before any of the variables in it are redefined.
  • useful for program optimizations, e.g. hoisting
  • hoisting an expression: pre-compute it before entering a block, for later use.
  • define all 6 properties of the analysis
    • domain: all non-trivial expressions in code
    • direction: backward
    • meet: intersection
    • transfer function: backward - returns the expressions that are very busy at the entry of the statement.
    • boundary conditions: final node starts with no very busy expressions
    • initial values: no very busy expressions

Live Variable Analysis

  • For each statement, find the variables that may be live at the exit from the statement
  • "live" - the variable is used before being redefined.
  • useful for identifying dead code
    • bug detections: dead assignments are typically unexpected
    • optimizations: remove dead code.
  • define all 6 properties of the analysis
    • domain: all variables occurring in the code
    • direction: backward
    • meet operator: union
    • transfer function: backward - returns the set of variables live at the entry of the statement: LV_entry(s) = (LV_exit(s) \ def(s)) ∪ use(s)
    • boundary conditions: final node starts with no live variables.
    • initial values: all nodes start with no live variables

Solving Data flow equations

  • Transfer functions yield data flow equations for each statement.
  • Round-robin, iterative algorithm
  • Worklist Algorithm (see the generic solver sketch below)
    • will it always terminate?
    • impose constraints to ensure termination
      • the domain of the analysis is finite (or at least has finite height)
      • transfer function and meet operator are monotone with respect to the partial order
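  • A generic worklist solver sketched in Python; direction and meet are parameters, matching the six-property recipe above, and termination relies on a finite domain plus monotone transfer functions:

        from collections import deque

        def solve_forward(nodes, succs, preds, transfer, meet, boundary, init):
            # For a backward analysis, swap succs/preds and use the exit
            # node's information as the boundary condition.
            out = {s: set(init) for s in nodes}              # initial values
            work = deque(nodes)
            while work:
                s = work.popleft()
                ins = [out[p] for p in preds[s]]
                in_set = meet(ins) if ins else set(boundary) # boundary at entry
                new_out = transfer(s, in_set)
                if new_out != out[s]:                        # state changed:
                    out[s] = new_out
                    for t in succs[s]:                       # re-examine successors
                        if t not in work:
                            work.append(t)
            return out

        # meet for a must-analysis: lambda sets: set.intersection(*sets)
        # meet for a may-analysis:  lambda sets: set.union(*sets)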

Intra vs Inter-procedural

  • Intra
    • reason about a function in isolation
  • Inter-
    • reason about multiple functions
    • calls and returns
    • one CFG per function
    • connect call sites to the entry nodes of callees
    • connect exit nodes back to the call sites.
    • analysis considers only possible inter-proc flows
    • arguments passed into call
    • return values propagated to caller
    • local variables are not propagated through the callee; the analysis continues with their state from just before the call.

Sensitivities

  • What the analysis looks at
    • Flow-sensitive:
      • takes into account the order of statements.
    • Path-sensitive:
      • takes into account the predicates at conditional branches.
    • Context-sensitive (inter-procedural):
      • takes into account the specific call site that leads into another function.

Call Graph Analysis

  • Abstraction of all method calls in a program

    • Nodes: Methods
    • Edges: Calls
    • Flow-insensitive: No execution order
  • Here: Static call graph

    • abstraction of all calls that may execute
    • gives an overapproximation of the calls
  • Prune graph

    • focus on feasible behavior
    • minimize
      • reachable methods
      • call edges
      • potentially polymorphic call sites

Call Graph Algorithms

  • Class Hierarchy Analysis(CHA)

    • the simplest analysis
    • for a polymorphic call site m() on declared type T:
      • add a call edge to T.m and to S.m for every subclass S of T that implements m (see the sketch below).
    • pros
      • very simple
      • correct: contains edges for all calls that the program may execute
      • few requirements: needs only hierarchy, not other analysis information
    • cons
      • very imprecise: most edges will never be executed
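    • A Python sketch of the CHA rule over a toy hierarchy (class and method names are illustrative):

          def cha_targets(declared_type, method, subclasses, defines):
              # Call edge to T.m and to S.m for every subclass S of T defining m.
              targets, work = set(), [declared_type]
              while work:
                  t = work.pop()
                  if method in defines.get(t, ()):
                      targets.add((t, method))
                  work.extend(subclasses.get(t, ()))
              return targets

          subclasses = {"Collection": ["List", "Set"], "List": ["ArrayList"],
                        "Set": [], "ArrayList": []}
          defines = {"Collection": ["add"], "ArrayList": ["add"], "Set": ["add"]}
          print(cha_targets("Collection", "add", subclasses, defines))
          # RTA would additionally keep only the classes the program instantiates.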
  • Rapid Type Analysis

    • Like CHA, but take into account only those types that the program actually instantiates.
    • Pros
      • still pretty fast
      • correct
      • much more precise than CHA: prunes nodes and edges
    • Cons
      • does not reason about assignments
  • Variable Type Analysis(VTA)

    • reason about assignments

    • infer what types the objects involved in a call may have

    • Prune calls that are infeasible based on the inferred types

    • Type propagation steps

      • Form an initial conservative call graph
        • e.g. CHA or RTA
      • Build the type-propagation graph
      • Collapse strongly connected components
      • Propagate types in one iteration
    • How to represent fields in this analysis??

    • Pros

      • More precise than RTA
        • considers only those types that may actually reach the call site
      • Still relatively fast
    • Cons

      • Requires initial call graph
        • actually a refinement algorithm
      • Some imprecision remains
        • because of field-based analysis
  • Declared-Type Analysis(DTA)

    • a lighter-weight sibling of VTA
    • also reasons about assignments and how they propagate types
    • tracks types not per variable but per declared type.
    • steps like above
    • Pros
      • faster than VTA
      • more precise than RTA
    • Cons
      • less precise than VTA
  • Spark Framework

    • RTA, DTA and VTA: instances of one single unifying framework

    • General Recipe

      • First build a pointer-assignment graph (PAG)
      • Propagate information through the graph
    • Combine call graph construction with points-to analysis

      • Reason about objects a variable may refer to
    • Pointer-Assignment Graph

      • Nodes
        • Allocation
          • one for each new A()
          • represents the set of objects created here
          • has an associated type
        • Variable
          • one for each local variable, parameter, static field, and thrown exception
          • represents a memory location holding pointers to objects
          • may be typed (depends on the setting)
        • Field reference
          • one for each p.f
          • represents a pointer dereference
          • has a variable node as its base
          • also models the contents of arrays
      • Edges
        • Allocation
          • represents allocation of an object assigned to a variable
          • p = new HashMap()
        • Assignment
          • represents assignment among variables and fields
          • similar edges exist for field stores and field loads (below)
        • Field store
        • Field load
    • Points-to sets

      • for each variable: the set of objects the variable may refer to
    • Subset-based analysis

      • allocation and assignment edges induce subset constraints
        • an assignment p = new A() only adds to pts(p); p may also point to objects from other allocation sites
        • flow-insensitive
    • Introduce helper node: concrete fields

    • represents all objects pointed to by field f of all objects created at allocation site

    • Iterative propagation algorithm (see the sketch below)

      • initialize according to allocation edges
      • repeat until no changes
        • propagate sets along assignment edges a -> b
        • for each load edge a.f -> b:
          • for each c ∈ pts(a), propagate pts(c.f) to pts(b)
        • for each store edge a -> b.f:
          • for each c ∈ pts(b), propagate pts(a) to pts(c.f)
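    • A Python sketch of the subset-based propagation (Spark-style); pts maps variables to allocation sites, and fpts plays the role of the concrete-field helper nodes:

          from collections import defaultdict

          def propagate(alloc_edges, assign_edges, load_edges, store_edges):
              # alloc_edges:  (site, var)    var = new ... at allocation site
              # assign_edges: (a, b)         b = a
              # load_edges:   ((a, f), b)    b = a.f
              # store_edges:  (a, (b, f))    b.f = a
              pts = defaultdict(set)    # variable -> allocation sites
              fpts = defaultdict(set)   # (site, field) -> allocation sites
              for site, var in alloc_edges:
                  pts[var].add(site)    # initialize from allocation edges
              changed = True
              while changed:            # repeat until no changes
                  changed = False
                  def extend(target, new):
                      nonlocal changed
                      if not new <= target:
                          target |= new
                          changed = True
                  for a, b in assign_edges:             # pts(a) ⊆ pts(b)
                      extend(pts[b], pts[a])
                  for (a, f), b in load_edges:          # b = a.f
                      for c in list(pts[a]):
                          extend(pts[b], fpts[(c, f)])
                  for a, (b, f) in store_edges:         # b.f = a
                      for c in list(pts[b]):
                          extend(fpts[(c, f)], pts[a])
              return pts, fpts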
    • Spark framework supports many variants

      • just one allocation site per type
      • Fields simply represented by their signature
      • Equality instead of subsets for assignments
    • Pros

      • Generic algorithm where precision and efficiency can be tuned
      • Jointly computing call graph and points-to sets increases precision
    • Cons

      • Still flow-insensitive
      • Can be quite expensive to compute

Random Testing and Fuzzing

  • Manual Testing

    • important but limited by human time
  • Automated Testing

    • Test execution
      • Regularly execute regression test suite
    • Test creation
      • Automatic test generation
    • Automated testing of programs is hard.
      • the number of paths through a program grows exponentially with the number of branch points.
      • exhaustively testing entire systems is probably impossible.

Test creation

  • Blackbox

    • No analysis of program
  • Greybox

    • Lightweight analysis of program
    • e.g. coverage achieved by inputs
  • Whitebox

    • More heavyweight analysis of program
    • e.g. conditions that trigger specific paths
  • All of them use feedback from test executions

  • Tests can target a single function, a class and its methods, a library, or an entire tool.

Randoop

  • Based on the paper Feedback-Directed Random Test Generation (2007)

  • Idea:

    • Guide randomized creation of new test inputs by feedback about execution of previous inputs
      • avoid redundant inputs
      • avoid illegal inputs
    • Test input here means sequence of method calls.
  • Build test inputs incrementally.

    • new test inputs extend previous ones
  • As soon as test input is created, execute it.

  • Use execution results to guide generation

    • Away from redundant or illegal method sequences
    • Towards sequences that create new object states
  • Randoop: Input/Output

    • implementation of feedback-directed random test generation

    • Input

      • classes under test
      • time limit
      • set of contracts
        • method contracts, e.g. the method throws no exception
        • object invariants, e.g. o.equals(o) returns true
    • Output

      • test cases with assertions
    • Algorithm sketch (see the code sketch below)

      • Initialize the pool of components (seed sequences)
      • Pick a method to call
      • Check whether the arguments needed to call the method can be built
        • reuse sequences from the component pool where possible; otherwise start a new sequence
      • Execute and classify the resulting sequence
        • contract violation -> output as a failing test
        • illegal input -> discard
        • redundant (an equivalent sequence already exists) -> discard
        • otherwise -> add to the component pool for later extension
      • Repeat until the time limit is reached
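  • A highly simplified Python sketch of the feedback-directed loop; run, IllegalInput, and the contract interface are stand-ins for Randoop's machinery:

        import random, time

        class IllegalInput(Exception):
            pass

        def feedback_directed(methods, contracts, run, seconds=5):
            # run(seq) executes a call sequence and returns a hashable object
            # state, raising IllegalInput when arguments are invalid (stand-in).
            components = [[]]              # pool of legal, non-redundant sequences
            violating, seen = [], set()
            deadline = time.time() + seconds
            while time.time() < deadline:
                seq = random.choice(components) + [random.choice(methods)]
                try:
                    state = run(seq)       # execute as soon as it is created
                except IllegalInput:
                    continue               # feedback: discard illegal sequences
                if state in seen:
                    continue               # feedback: discard redundant sequences
                seen.add(state)
                if any(not ok(state) for ok in contracts):
                    violating.append(seq)  # contract violation -> failing test
                else:
                    components.append(seq) # new object state -> extend later
            return components, violating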
  • Redundant Sequences

    • during generation, maintain a set of all objects created.
    • Sequence is redundant if all objects created during its execution are in the above set (using equals() to compare)
    • Could also use more sophisticated state equivalence methods
      • heap canonicalization (expensive)
  • Test Oracles

    • oracle for contract-violating test cases
      • assertTrue(u.equals(u))
    • oracle for normal-behavior test cases
      • assertEquals(2, l.size())

Greybox Fuzzing

  • Guide input generation toward a goal

    • Guidance based on lightweight program analysis
  • Three main steps

    • Randomly generate inputs
    • Get feedback from test executions: What code is covered
    • Mutate inputs that have covered new code.
  • American Fuzzy Lop

    • Simple yet effective fuzzing tool
      • targets C/C++ programs
      • inputs are files read by the program
    • Widely used in industry
  • Algorithm

    • Maintain a queue of test inputs
    • Seed the queue with user-provided inputs
    • Choose the next input from the queue
    • Mutate the input; if the result is interesting (covers new behaviour), enqueue it, otherwise discard it
  • Basic block (in a CFG): a sequence of operations/statements that are always executed together, with no branch in between.

  • Measuring coverage

    • Different coverage metrics
      • line/statement/branch/path coverage
    • Here: Branch coverage
      • branches between basic blocks
      • rationale: reaching a code location not enough to trigger a bug, but state also matters
    • Compromise between
      • Effort spent on measuring coverage.
      • Guidance it provides to the fuzzer.
    • Efficient implementations
      • Instrumentation added at branching points
        • cur_location = <compile-time random id for this block>;
        • shared_mem[cur_location ^ prev_location]++;
        • prev_location = cur_location >> 1;
    • Detecting New behaviors
      • inputs that trigger a new edge in the CFG are considered new behaviour
      • alternative: consider new paths
        • more expensive to track
        • path explosion
    • Edge Hit Counts
      • a refinement of the previous definition of new behaviour
      • for each edge, count how often it is taken
        • approximate counts based on buckets of increasing size
    • Evolving the input queue
      • Maintain a queue of inputs
        • initially: seed inputs provided by the user
        • after a run, keep an input if it covered new edges
        • add new inputs by mutating existing inputs
      • Queue sizes of 1k to 10k inputs
    • Mutation Operators
      • Create new inputs from existing inputs (see the sketch below)
      • Random transformations of bytes in an existing input
        • Bit flips
        • Addition and subtraction of small values
        • Insertion of bytes
        • Splicing (recombining two existing inputs)
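      • A Python sketch of byte-level mutators on an input buffer (splicing is crudely approximated here; real AFL recombines two inputs):

            import random

            def mutate(data: bytes) -> bytes:
                buf = bytearray(data) or bytearray(b"\x00")
                i = random.randrange(len(buf))
                op = random.choice(["bitflip", "arith", "insert", "splice"])
                if op == "bitflip":
                    buf[i] ^= 1 << random.randrange(8)       # flip one bit
                elif op == "arith":
                    buf[i] = (buf[i] + random.choice([-4, -1, 1, 4])) % 256
                elif op == "insert":
                    buf.insert(i, random.randrange(256))     # insert random byte
                else:
                    buf = buf[:i] + buf[:i][::-1] + buf[i:]  # crude splice stand-in
                return bytes(buf)

            print(mutate(b"hello fuzzing"))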
    • More tricks for Fast Fuzzing
      • Time and memory limits
        • discard input when execution is too expensive
      • Pruning the queue
        • periodically select subset of inputs that still cover every edge seen so far
      • Prioritize how many mutants to generate from an input in the queue
        • focus on unusual paths or try to reach specific locations

Symbolic and Concolic Testing

  • White box testing technique

  • Reference papers

    • DART: directed automated random testing
    • KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs
    • Automated Whitebox Fuzz Testing
  • Symbolic Execution

    • Reason about behaviour of program by executing it with symbolic values.
    • Originally proposed by James King and Lori Clarke
    • Became practical due to advances in constraint solving(SMT solvers)
    • Execution tree
      • Binary Tree
      • Nodes: Conditional statements
      • Edges: execution of sequences of non-conditional statements
      • Each path in the tree represents an equivalence class of inputs
    • Symbolic values and Symbolic state
      • unknown values (e.g. user inputs) are kept symbolic
      • Symbolic state maps variables to symbolic values
    • Path conditions
      • quantifier-free formula over the symbolic inputs that encodes all branch decisions taken so far.
    • Satisfiability of Formulas
      • Determine whether a path is feasible: check if its path condition is satisfiable.
      • Done by powerful SMT/SAT solvers.
      • Z3, Yices, STP
      • For a satisfiable formula, solvers also provide a concrete solution (see the example below).
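      • A tiny example with Z3's Python bindings (pip install z3-solver): encode a path condition and ask for a concrete input:

            from z3 import Ints, Solver, sat

            x, y = Ints("x y")
            s = Solver()
            s.add(x > 5, y == 2 * x, y < 20)  # path condition of one branch sequence
            if s.check() == sat:
                print(s.model())              # e.g. x = 6, y = 12 drives this path
            else:
                print("path is infeasible")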
    • Basic applications
      • Detect infeasible paths
      • Generate test inputs
      • Find bugs and vulnerabilities
    • Advanced Applications
      • Generate program invariants
      • Prove that two pieces of code are equivalent
      • Debugging
      • Automated program repair
    • Challenges of S.E
      • Loops and recursion:
        • Infinite execution tree
        • Dealing with large execution trees
          • Heuristically select which branch to explore next
            • Select at random
            • Select based on coverage.
            • Prioritize based on distance to interesting program locations
            • Interleaving symbolic execution with random testing.
      • Path explosion
        • Apply same techniques as used above for loops and recursion to deal with large execution trees.
      • Environment modelling:
        • Deal with native/system/library calls
        • Program behaviour may depend on parts of the system not analyzed by symbolic execution.
        • Solution implemented by KLEE.
          • If all arguments are concrete, forward the call to the OS.
          • Otherwise, provide models that can handle symbolic files, with the goal of exploring all possible legal interactions with the environment.
      • Solver limitations:
        • Complex path conditions
      • Heap modelling:
        • Data structures and pointers
  • Concolic Testing

    • mix concrete and symbolic execution = concolic
    • Perform concrete and symbolic execution side-by-side
    • Gather path constraints while program executes.
    • After one execution, negate one branch decision and re-execute with a new input that triggers another path.
    • Path conditions determine the next execution path: negate the last decision taken on a given path.
    • Algorithm (see the code sketch at the end of this section)
      • repeat until all paths are covered
      • Execute program with concrete input i and collect symbolic constraints at branch points: C
      • Negate one constraint to force taking an alternative branch b'
      • Call constraint solver to find solution for C': New concrete input i'
      • Execute with i' to take branch b'
      • Check at runtime that b' is indeed taken; otherwise it is a divergent execution.
    • When symbolic reasoning is impossible or impractical, fall back to concrete values
      • Native/system/API calls
      • Operations not handled by the solver
    • Large-scale application of concolic testing
      • SAGE: a concolic testing tool developed at Microsoft.
      • Tests robustness against unexpected inputs read from files
        • audio files
        • office documents
      • Start with known input files and treat the bytes read from files as symbolic input
      • Use concolic execution to compute variants of these files.
      • See the paper by Bounimova et al. (Microsoft)
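    • A skeleton of the concolic loop in Python; run_collect (runs the instrumented program and returns the ordered branch constraints as Z3 formulas) and model_to_input are hypothetical stand-ins:

          from z3 import Solver, Not, sat

          def concolic(run_collect, model_to_input, initial_input, max_iters=100):
              worklist, seen = [initial_input], set()
              while worklist and max_iters > 0:
                  max_iters -= 1
                  inp = worklist.pop()
                  constraints = run_collect(inp)   # concrete run, symbolic trace
                  for i in range(len(constraints)):
                      # keep the prefix, negate decision i to force the other branch
                      s = Solver()
                      s.add(*constraints[:i], Not(constraints[i]))
                      if s.check() == sat:
                          new_inp = model_to_input(s.model())
                          if new_inp not in seen:
                              seen.add(new_inp)
                              worklist.append(new_inp)
              return seen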

Information Flow Analysis

  • Reference papers
    • A Lattice Model of Secure Information Flow, Denning, Comm. ACM, 1976
    • Dytan: A Generic Dynamic Taint Analysis Framework, Clause et al., 2007
  • Secure the data manipulated by a computing system
  • Enforce a security policy
    • Confidentiality
      • secret data does not leak to non-secret places
    • Integrity
      • High-integrity data is not influenced by low-integrity data
  • Goal
    • Check whether information from one place propagates to another place
      • place here means code location or variable
    • Complements techniques that impose limits on releasing information
      • Access control lists
      • Cryptography
  • How to analyse the flow of information
    • Assign to each value some meta information that tracks the secrecy of the value.
    • Propagate meta information on program operations.
  • Non-interference
    • Property that information flow analysis aims to ensure.
    • Confidential data does not interfere with public data.
      • A variation of the confidential input does not cause a variation of the public output.
      • An attacker cannot observe any difference between two executions that differ only in their confidential input.
  • Information Level Policy
    • Lattice of security labels
      • How to represent different levels of secrecy??
        • Set of security labels
        • Form a universally bounded lattice.
      • Universally bounded lattice
        • A tuple (S, →, L, H, ⊕, ⊗): a set S of security classes, a partial order → on S, a lowest class L, a highest class H, a least upper bound ⊕ (join, combining two security classes), and a greatest lower bound ⊗ (meet, their intersection)
      • Information flow policy
        • Policy specifies secrecy of values and which flows are allowed
          • Lattice of security classes
          • Sources of secret information
          • Untrusted sinks
        • No flow from source to sink.
      • Declassification
        • Forbidding every flow from high to low is impractical.
        • e.g. comparing hash values of secrets
        • mechanism to remove or lower the security class of a value.
    • Analyzing information flows
      • given an information flow policy, the analysis checks for policy violations
        • detect vulnerable code, e.g. SQL injections
        • detect malicious code, e.g. privacy violations
        • check that the program behaves as expected, e.g. secret data is never written to the console
      • Explicit vs Implicit Flows
        • Explicit: caused by data flow dependence
        • Implicit: caused by control flow dependence.
      • Static and Dynamic Analysis
        • Static information flow analysis
          • Overapproximate all possible data and control flow dependencies
          • Result: whether information may flow from secret source to untrusted sink
        • Dynamic information flow analysis
          • Associate security labels(taint markings) with memory locations
          • Propagate labels at runtime.
          • Tainted sources and sinks
            • Sources
              • Variables
              • Return values of a particular function
              • Data from a particular I/O stream
            • Sinks
              • Variables
              • Parameters given to a particular function
              • Instructions of a particular type(jump instructions)
          • Taint propagation (see the sketch below)
            • Explicit flows
              • For every operation that produces a new value, propagate the labels of its inputs to the label of its output.
            • Implicit flows
              • Maintain security stack S: Labels of all values that influence the current flow of control
              • When x influences a branch decision at location loc, push label(x) on S
              • When control flow reaches immediate post-dominator of loc, pop label(x) from S
              • When an operation is executed while S is non-empty, consider all labels on S as input to the operation.
            • Hidden Implicit Flows
              • Implicit flows may happen even though a branch is not executed.
              • Approach explained so far will miss such hidden flows.
              • To reveal hidden flows: for every conditional with branches b1 and b2,
                • conservatively overapproximate which values may be defined in b1,
                • and add spurious definitions into b2.
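            • A minimal Python sketch of label propagation for one operation; labels and the security stack are simple sets:

                  def propagate_taint(result, inputs, labels, security_stack):
                      # labels: variable -> set of security labels
                      # security_stack: labels of branch conditions currently
                      # influencing control flow (implicit flows)
                      out = set()
                      for v in inputs:
                          out |= labels.get(v, set())   # explicit: data dependence
                      for l in security_stack:
                          out |= l                      # implicit: control dependence
                      labels[result] = out

                  labels = {"password": {"secret"}}
                  propagate_taint("hash", ["password"], labels, [])
                  print(labels["hash"])                 # {'secret'}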
        • Dytan
          • dynamic information flow analysis for x86 binaries.
          • see the Clause et al. paper referenced above.
        • There exist channels that information flow analysis misses, e.g. power consumption and timing (side channels)

Program Slicing

  • Reference papers

    • Program Slicing, Weiser, 1984
    • Thin Slicing, Sridharan et al., 2007
    • Dynamic Program Slicing, Agrawal and Horgan, 1990
    • A Survey of Program Slicing Techniques, Tip, 1995
  • Extract an executable subset of a program that potentially affects the values at a particular program location

    • Slicing criterion = program location + variable
    • An observer focusing on the slicing criterion cannot distinguish a run of the program from a run of the slice.
  • Applications

    • Debugging: focus on parts of program relevant for a bug
    • Program understanding: Which statements influence this statement??
    • Change impact analysis: Which parts of a program are affected by a change? What should be retested?
    • Parallelization: Determine parts of a program that can be computed independently of each other.
  • Forward vs backward

    • statements that are influenced by the slicing criterion vs statements that influence the slicing criterion
  • Static vs dynamic

    • statically computing a minimum slice is undecidable
    • dynamically computed slice focuses on particular execution/input.
  • Static program slicing

    • Weiser, 1984
    • Graph reachability problem based on program dependence graph.
    • PDG (program dependence graph)
      • Directed graph representing the data and control dependences between statements.
        • Nodes
          • Statements
          • Predicate expressions
        • Edges
          • Data flow dependences: one edge for each definition-use pair
          • Control flow dependences.
      • Variable Definition and Use
        • A definition of a variable v is a basic block that assigns to v; v can be a local variable, a global variable, a parameter, or a property.
        • A use of a variable v is a basic block that reads the value of v; e.g. conditions, computations, output.
      • Definition-clear paths
        • A definition-clear path for a variable v is a path n1, ..., nk in the CFG such that n1 is a definition of v and nk is a use of v,
        • and no node between n1 and nk is a definition of v.
      • Definition-use pairs
        • A DU-pair for a variable v is a pair of nodes (d, u) such that there is a definition-clear path d, ..., u in the CFG.
      • Control Flow dependencies
        • Post-dominator
          • n2 strictly post-dominates node n1 if every path n1,...exit in the CFG contains n2.
        • Control dependence
          • n2 is control-dependent on n1 (with n1 ≠ n2) if
            • there exists a CFG path P from n1 to n2 such that n2 post-dominates every node on P after n1,
            • and n2 does not post-dominate n1.
        • Computing slices (see the reachability sketch below)
          • the backward slice for a node n contains all statements from which n is reachable in the PDG.
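    • A Python sketch of backward slicing as graph reachability; pdg_deps[n] lists the nodes that n depends on (incoming dependence edges):

          def backward_slice(pdg_deps, criterion):
              slice_, work = set(), [criterion]
              while work:
                  n = work.pop()
                  if n in slice_:
                      continue
                  slice_.add(n)                     # n influences the criterion
                  work.extend(pdg_deps.get(n, ()))  # follow dependences backward
              return slice_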
  • Thin slicing

    • Static slices are often very large.
      • Worst case: Entire program
      • Too large for common debugging and program understanding tasks
    • Main reason: a classic slice aims to be an executable program, which is not needed for many tasks
    • Idea: heuristically focus on the statements needed for common debugging tasks
      • Thin slice
    • Let user expand the thin slice on demand
      • Thin slices include producer statements but exclude explainer statements
        • explainers answer: why do two heap accesses read/write the same object? why can this producer execute?
      • Most explainers are not useful for common tasks.
      • Expose explainers on demand via incremental expansion.
    • Definition
      • A statement directly uses a memory location if it uses it for some computation other than a pointer dereference.
      • Dependence graph G for thin slicing: Data dependences computed based on direct uses only.
      • Thin slice: statements reachable from the criterion.
  • Dynamic Slicing

    • The statements of an execution that had to execute to give a variable a particular value.
      • for one execution, i.e. a particular input
      • the slice for one input may differ from the slice for another input
    • Useful in debugging: Get a reduced program that leads to the unexpected value.
    • Approaches
      • Simple Approach
        • given an execution history (the sequence of PDG nodes that were executed), slice for statement n and variable v:
          • keep only the PDG nodes that occur in the history
          • use the static slicing approach (graph reachability) on the reduced PDG.
        • Limitations
          • multiple occurrences of a single statement are represented as a single PDG node
          • different occurrences of a statement may have different dependences
            • all occurrences get conflated
          • slices may be larger than necessary.
      • Dynamic dependence graph
        • Nodes: occurrences of nodes of the static PDG
        • Edges: dynamic data and control flow dependences
        • Slice for statement n and variables V that are defined or used at n:
          • compute the set S_dyn of nodes that can reach any of the nodes representing occurrences of n.
          • slice = statements with at least one node in S_dyn
        • May yield a program that, if executed with another input, does not give the same value for the slicing criterion as the original program.
        • Instead: focuses on isolating the statements that affect a particular value; useful for debugging and program understanding.
      • See Tip's survey on slicing (referenced above).

Path Profiling

  • Reference paper

    • Efficient Path Profiling, Ball and Larus, 1996
    • Whole Program Paths, Larus, 1999
    • HOLMES: Effective Statistical Debugging via Efficient Path Profiling, Chilimbi et al., 2009
  • Goal:

    • Count how often a path through a function is executed.
    • Applications
      • Profile-directed compiler optimizations
      • Performance tuning: Which paths are worth tuning??
      • Test coverage: Which paths are not yet tested??
      • Statistical debugging: Paths correlated with failure are more likely to contain the bug.
      • Energy analysis: Warn developers about paths and statements associated with high power consumption.
    • Challenges
      • Runtime overhead: limit slowdown of program
      • Accuracy: Ideally, precise profiles (no heuristics, no approximations)
      • Infinitely many paths
    • Start with CFG
  • Edge Profiling

    • Instrument each branching point
    • Count how often each CFG edge is executed.
    • Estimate most frequent path: Always follow most frequent edge.
    • Simple and efficient, but too local: it can fail to uniquely identify the most frequent path.
  • Ball-Larus Algorithm for DAGs

    • Step 1: assign a number to each path
      • Goal: the sum along a path yields a unique number for the path.
      • How: associate with each node a value NumPaths(n) = number of paths from n to the exit
      • Compute NumPaths
        • visit nodes in reverse topological order
        • if n is a leaf node: NumPaths(n) = 1; otherwise NumPaths(n) = sum of NumPaths over the targets of the outgoing edges
      • this is enough to achieve the precision goal (see the code sketch after this algorithm)
    • Step 2: compute the path number by incrementing a counter at branching points.
      • Goal: minimize additions along edges.
      • Spanning Tree
        • Given graph G
        • Spanning tree T, undirected subgraph of G that is a tree and that contains all nodes of G.
      • Chord edges: edges in G but not in T
      • Choose spanning tree with maximum edge cost
        • cost of individual edges is assumed to be known
      • Compute increments at the chords of the spanning tree, based on existing event counting algorithm.
      • Instrument subset of all edges.
      • Assumes edge execution frequencies are known or estimated
    • Properties of path encoding
      • Precise: a single unique encoding for each path.
      • Minimal: Instruments subset of edges with minimal cost
    • Assumptions
      • Control flow graph is a directed acyclic graph.
      • n paths (numbered 0 to n-1)
      • Graph has unique entry and exit nodes
      • Artificial back edge from exit to entry.
    • Instrumentation
      • Basic Idea
        • Initialize sum at entry: r=0
        • Increment at edges: r+=
        • At exit, increment counter for path: count[r]++
      • Optimization
        • Initialize with incremented value, if first chord on path: r=..
        • Increment sum and counter for path, if last chord on path: count[r+..]++
    • Regenerating the Path
      • Knowing the sum r, how do we determine the path?
        • use the edge values from step 1 ("non-minimal increments")
        • start at the entry with R = r
        • at branches, take the edge with the largest value v that is smaller than or equal to R, and set R -= v
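    • A Python sketch of step 1 on a DAG: compute NumPaths bottom-up, then give edge n -> s_i the sum of NumPaths over the earlier successors, so that path sums are unique (these are the "non-minimal increments"; step 2 would push them onto chord edges):

          def ball_larus(succs, topo_order):
              # succs[n]: ordered successors; topo_order: topologically sorted nodes
              num_paths, val = {}, {}
              for n in reversed(topo_order):
                  if not succs[n]:
                      num_paths[n] = 1              # exit node
                      continue
                  total = 0
                  for s in succs[n]:
                      val[(n, s)] = total           # paths via earlier siblings
                      total += num_paths[s]
                  num_paths[n] = total
              return num_paths, val

          # Diamond CFG: the two entry-exit paths get numbers 0 and 1.
          succs = {"entry": ["a", "b"], "a": ["exit"], "b": ["exit"], "exit": []}
          print(ball_larus(succs, ["entry", "a", "b", "exit"]))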
  • Generalization of Ball-Larus to Cyclic CFGs

    • For each backedge n -> m, add dummy edges.
    • Remove backedges and add DAG-based increments
    • In addition, add instrumentation to each backedge
    • Leads to four kinds of paths
      • From entry to exit
      • From entry to backedge
      • From end of backedge to beginning of possibly another backedge
      • From end of backedge to exit
    • Full path information can be constructed from these four kinds.

Concurrency

  • Reference papers

    • Eraser: A Dynamic Data Race Detector for Multithreaded Programs, Savage et al., 1997
    • Fully Automatic and Precise Detection of Thread Safety Violations, Pradel and Gross, 2012
    • Finding and Reproducing Heisenbugs in Concurrent Programs, Musuvathi et al., 2008
  • Concurrency is needed to take full advantage of the hardware.

  • Many real-world problems are inherently concurrent.

    • Servers must handle multiple concurrent requests
    • Computations done on huge data are often embarrassingly parallel
  • Concurrency styles

    • Message-passing
      • No shared memory
      • Actor concurrency model
    • Thread-based, shared memory
      • All threads access the same shared memory
      • Synchronization via locks and latches
  • Sequential consistency

    • Programs execute under sequential consistency
    • Program order is preserved: Each thread's instructions execute in the specified order
    • Shared memory behaves like a global array: reads and writes take effect immediately
    • We assume sequential consistency for the rest of the text.
    • Many real-world platforms have more complex semantics("memory models")
  • Data Races

    • Two threads access the same shared memory location without synchronization, and at least one of the accesses is a write.
    • Avoided by using locks that ensure accesses to shared memory do not interfere.
  • Dynamic Data Race detections

    • Look for unprotected accesses to shared memory.
    • All accesses to a shared memory location v should happen while holding the same lock L (a consistent locking discipline)
    • Dynamic analysis that monitors all lock acquisitions, lock releases, and accesses to shared memory locations.
    • Uses the Lockset Algorithm (see the sketch below)
      • Let locksHeld(t) be the set of locks held by thread t.
      • For each shared memory location v, initialize C(v) to the set of all locks.
      • On each access to v by thread t:
        • set C(v) := C(v) ∩ locksHeld(t)
        • if C(v) = ∅, issue a warning.
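    • A Python sketch of the basic lockset algorithm (before Eraser's state machine refinement):

          def lockset(events, all_locks):
              # events: ('acquire', t, lock), ('release', t, lock),
              #         or ('access', t, location) tuples observed at runtime
              held, candidates, warnings = {}, {}, []
              for kind, t, x in events:
                  locks = held.setdefault(t, set())
                  if kind == "acquire":
                      locks.add(x)
                  elif kind == "release":
                      locks.discard(x)
                  else:                                    # access to location x
                      c = candidates.setdefault(x, set(all_locks))
                      c &= locks                           # C(v) ∩ locksHeld(t)
                      if not c:
                          warnings.append((t, x))          # no consistent lock
              return warnings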
    • The simple lockset algorithm produces false positives
      • variables initialized without locks held
      • read-shared data read without locks held
      • read-write locking mechanisms (producer-consumer style)
    • Refinement of the lockset algorithm
      • keep a state for each shared memory location
      • issue warnings only in the shared-modified state
      • states: Virgin -> Exclusive -> Shared -> Shared-Modified
    • Assumes consistent locking discipline.
    • May miss data races because it does not consider all possible interleavings.
    • May report false positives
  • Thread Safety

    • Popular way to encapsulate the challenges of concurrent programming: Thread-safe classes.
    • Class ensures correct synchronization
    • Clients can use instances as if they were alone
    • The rest of the program can treat the implementation of a thread-safe class as a black box.
    • Correctness of program relies on thread safety of specific classes.
    • What if the class is actually not thread-safe??
    • ConTeGe - Concurrent Test Generator
      • creates multi-threaded unit tests
      • detects thread safety violations by comparing concurrent behavior against linearizations.
      • Algorithm
        • Create prefix
          • Instantiate the class under test (CUT)
          • Call methods(randomly)
        • Create suffixes for prefix
          • Call methods on shared CUT instance
        • Prefix + two suffixes = test
    • Thread safety oracle
      • Focus on exceptions and deadlocks
      • Compare concurrent executions to linearization
      • Linearizations
        • Put all calls into one thread
        • Preserve order of calls within a thread.
      • Properties of the oracle
        • Sound but incomplete
          • All reported violations are real
          • Cannot guarantee thread safety
        • Independent of bug type
          • Data races
          • Atomicity violations
          • Deadlocks
  • Interleavings

    • Scheduling Non-determinism
      • A single program executed with a single input may have many different interleavings
      • The scheduler decides interleavings non-deterministically
      • Some interleavings may expose bugs while others execute correctly (Heisenbugs)
      • How to explore different interleavings? How to detect buggy interleavings?
    • CHESS in a nutshell
      • A user mode scheduler that controls all scheduling non-determinism
      • Guarantees:
        • Every program run takes a new thread interleaving.
        • Can reproduce the interleaving for every run
      • Systematic but non-exhaustive exploration of the set of possible interleavings.
      • Tree of interleavings
        • Search space of possible interleavings: represent as a tree
        • Node = points of scheduling decision
        • Edge = decisions taken
        • Each path = one possible schedule.
      • State space explosion
        • Number of interleavings: O(n^(n·k)), exponential in both n and k
        • Exploring all interleavings does not scale to large programs
      • Preemption Bounding
        • Limit exploration to schedules with a small number c of preemptions
          • Preemption = context switches forced by the scheduler
        • Number of schedules: O((n²·k)^c · n!)
          • exponential in c and n, but not in k
        • Based on an empirical observation: most concurrency bugs can be triggered with few (< 2) preemptions.
      • Implementation via binary instrumentation
      • Other ways to control scheduling
        • Randomly delay concurrency-related operations
        • Heuristics: known bug patterns or programmer annotations
        • Active testing: find potential bugs, then bias the scheduler towards confirming them.