Author: Balazs Benics
Recently we uncovered a serious bug in the `GenericTaintChecker`.
It was already flawed before D116025, but that was the patch that turned
this silent bug into a crash.

It happens if the `GenericTaintChecker` has a rule for a function, which
also has a definition.

    char *fgets(char *s, int n, FILE *fp) {
      nested_call();   // no parameters!
      return (char *)0;
    }

    // Within some function:
    fgets(..., tainted_fd);

When the engine inlines the definition and finds a function call within
that, the `PostCall` event for the call will get triggered sooner than the
`PostCall` for the original function.
This mismatch violates the assumption of the `GenericTaintChecker` which
wants to propagate taint information from the `PreCall` event to the
`PostCall` event, where it can actually bind taint to the return value
**of the same call**.

Let's get back to the example and go through step-by-step.
The `GenericTaintChecker` will see the `PreCall<fgets(..., tainted_fd)>`
event, so it would 'remember' that it needs to taint the return value and
the buffer, from the `PostCall` handler, where it has access to the return
value symbol.
However, the engine will inline fgets and the `nested_call()` gets
evaluated subsequently, which produces an unimportant
`PreCall<nested_call()>`, then a `PostCall<nested_call()>` event, which is
observed by the `GenericTaintChecker`, which will unconditionally mark
tainted the 'remembered' arg indexes, trying to access a non-existing
argument, resulting in a crash.
If it doesn't crash, it will behave completely unintuitively, by marking
completely unrelated memory regions tainted, which is even worse.

The resulting assertion is something like this:

    Expr.h: const Expr *CallExpr::getArg(unsigned int) const: Assertion `Arg < getNumArgs() && "Arg access out of range!"' failed.

The gist of the backtrace:

    CallExpr::getArg(unsigned int) const
    SimpleFunctionCall::getArgExpr(unsigned int)
    CallEvent::getArgSVal(unsigned int) const
    GenericTaintChecker::checkPostCall(const CallEvent &, CheckerContext&) const

Prior to D116025, there was a check for the argument count before it
applied taint; however, it still suffered from the same underlying
issue/bug regarding propagation.

This patch does not intend to fix the bug, but rather to start a discussion
on how to fix it.

---

Let me elaborate on how I see this problem.

This pre-call, post-call juggling is just a workaround.
The engine should by itself propagate taint where necessary right where it
invalidates regions.
For the tracked values, which potentially escape, we need to erase the
information we know about them; and this is exactly what is done by
invalidation.
However, in the case of taint, we basically want to approximate from the
opposite side of the spectrum.
We want to preserve taint in most cases, rather than cleansing it.

Now, we basically sanitize all escaping tainted regions implicitly, since
invalidation binds a fresh conjured symbol for the given region, and that
has not been associated with taint.

IMO this is a bad default behavior; we should be more aggressive about
preserving taint, if not further spreading taint to the reachable regions.

We have a couple of options for dealing with it (let's call it //tainting
policy//):
  1) Taint only the parameters which were tainted prior to the call.
  2) Taint the return value of the call, since it likely depends on the
     tainted input - if any arguments were tainted.
  3) Taint all escaped regions - (maybe transitively using the cluster
     algorithm) - if any arguments were tainted.
  4) Not taint anything - this is what we do right now :D

The `ExprEngine` should not deal with taint on its own. It should be done
by a checker, such as the `GenericTaintChecker`.
However, the `Pre`-`PostCall` checker callbacks are not designed for this.
`RegionChanges` would be a much better fit for modeling taint propagation.
What we would need in the `RegionChanges` callback is the `State` prior to
invalidation, the `State` after the invalidation, and a `CheckerContext` in
which the checker can create transitions, where it would place `NoteTags`
for the modeled taint propagations and report errors if a taint sink rule
gets violated.
In this callback, we could query the prior `State` to see whether the given
value was tainted, then act and taint if necessary according to the
checker's tainting policy.

By using `RegionChanges` for this, we would 'fix' the mentioned propagation
bug 'by-design'.

Reviewed By: Szelethus

Differential Revision: https://reviews.llvm.org/D118987

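The event-ordering problem described in the commit message can be reproduced in a small standalone toy model. The sketch below does not use any Clang API: `ToyCall`, `SingleSlotTracker`, `StackTracker` and `replayInlinedFgets` are hypothetical names invented for illustration. It replays the event order from the `fgets` example against a tracker that keeps a single 'remembered' argument index, and against one that pairs each pre-call with its own post-call. The paired variant is only there to make the mismatch visible; it is not the `RegionChanges`-based direction the message argues for.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Toy stand-in for a call event; not a Clang type.
struct ToyCall {
  std::string Name;
  std::size_t NumArgs;
};

// Flawed scheme: one 'remembered' slot, consumed by whichever
// post-call event happens to arrive next.
struct SingleSlotTracker {
  std::optional<std::size_t> PendingArg;
  std::vector<std::string> Log;

  void preCall(const ToyCall &, std::optional<std::size_t> TaintArg) {
    if (TaintArg)
      PendingArg = TaintArg;
  }

  void postCall(const ToyCall &Call) {
    if (!PendingArg)
      return;
    if (*PendingArg >= Call.NumArgs)
      Log.push_back("out-of-range arg on " + Call.Name); // the crash case
    else
      Log.push_back("tainted arg of " + Call.Name);
    PendingArg.reset();
  }
};

// Paired scheme: every pre-call pushes an entry (possibly empty) and the
// matching post-call pops it, so nested calls cannot steal the entry.
struct StackTracker {
  std::vector<std::optional<std::size_t>> Pending;
  std::vector<std::string> Log;

  void preCall(const ToyCall &, std::optional<std::size_t> TaintArg) {
    Pending.push_back(TaintArg);
  }

  void postCall(const ToyCall &Call) {
    assert(!Pending.empty() && "post-call without a matching pre-call");
    std::optional<std::size_t> Arg = Pending.back();
    Pending.pop_back();
    if (Arg) {
      assert(*Arg < Call.NumArgs && "index was recorded for this very call");
      Log.push_back("tainted arg of " + Call.Name);
    }
  }
};

// Replays the event order of the example: fgets is inlined, so the events
// of nested_call() fire between the pre- and post-call of fgets.
template <typename Tracker> std::vector<std::string> replayInlinedFgets() {
  Tracker T;
  ToyCall Fgets{"fgets", 3};
  ToyCall Nested{"nested_call", 0};
  T.preCall(Fgets, std::size_t{0}); // remember: taint the buffer (arg 0)
  T.preCall(Nested, std::nullopt);  // nothing to propagate for this call
  T.postCall(Nested);
  T.postCall(Fgets);
  return T.Log;
}
```

With the single-slot tracker the remembered index is consumed by the post-call of `nested_call`, which has no arguments; with the paired tracker it reaches the post-call of `fgets`.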
//===----------------------------------------------------------------------===//
// Clang Static Analyzer
//===----------------------------------------------------------------------===//

= Library Structure =

The analyzer library has two layers: a (low-level) static analysis
engine (GRExprEngine.cpp and friends), and some static checkers
(*Checker.cpp). The latter are built on top of the former via the
Checker and CheckerVisitor interfaces (Checker.h and
CheckerVisitor.h). The Checker interface is designed to be minimal
and simple for checker writers, and attempts to isolate them from much
of the gore of the internal analysis engine.

= How It Works =

The analyzer is inspired by several foundational research papers ([1],
[2]). (FIXME: kremenek to add more links)

In a nutshell, the analyzer is basically a source code simulator that
traces out possible paths of execution. The state of the program
(values of variables and expressions) is encapsulated by the state
(ProgramState). A location in the program is called a program point
(ProgramPoint), and the combination of state and program point is a
node in an exploded graph (ExplodedGraph). The term "exploded" comes
from exploding the control-flow edges in the control-flow graph (CFG).

Conceptually the analyzer does a reachability analysis through the
ExplodedGraph. We start at a root node, which has the entry program
point and initial state, and then simulate transitions by analyzing
individual expressions. The analysis of an expression can cause the
state to change, resulting in a new node in the ExplodedGraph with an
updated program point and an updated state. A bug is found by hitting
a node that satisfies some "bug condition" (basically a violation of a
checking invariant).

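As a rough illustration of the reachability idea above, the following standalone sketch walks a hand-written toy "exploded graph". It is not analyzer code: `ToyNode`, `successors`, `isBug` and `findBug` are hypothetical names, the program being simulated is hard-coded in the comments, and the whole program state is reduced to the value of a single variable.

```cpp
#include <deque>
#include <optional>
#include <set>
#include <utility>
#include <vector>

// A node pairs a program point with a state; here the state is just the
// current value of x.
using ToyNode = std::pair<int, int>;

// Toy program being simulated (assumed for this sketch):
//   0: x = 2
//   1: if (<unknown>) x = x - 2;   // both outcomes are explored
//   2: y = 10 / x;                 // bug condition: x == 0
//   3: exit
std::vector<ToyNode> successors(const ToyNode &N) {
  auto [Point, X] = N;
  switch (Point) {
  case 0:
    return {ToyNode{1, 2}};
  case 1:
    return {ToyNode{2, X - 2}, ToyNode{2, X}};
  case 2:
    return {ToyNode{3, X}};
  default:
    return {};
  }
}

// The "bug condition": reaching the division with x == 0.
bool isBug(const ToyNode &N) { return N.first == 2 && N.second == 0; }

// Breadth-first reachability from the root node. Returns the first node
// that satisfies the bug condition, if any is reachable.
std::optional<ToyNode> findBug(ToyNode Root) {
  std::set<ToyNode> Seen{Root};
  std::deque<ToyNode> Worklist{Root};
  while (!Worklist.empty()) {
    ToyNode N = Worklist.front();
    Worklist.pop_front();
    if (isBug(N))
      return N;
    for (const ToyNode &S : successors(N))
      if (Seen.insert(S).second) // skip nodes that were already generated
        Worklist.push_back(S);
  }
  return std::nullopt;
}
```

Calling `findBug({0, 0})` starts at the entry point and reaches the division with `x == 0` along the path where the unknown branch is taken.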
The analyzer traces out multiple paths by reasoning about branches and
then bifurcating the state: on the true branch the conditions of the
branch are assumed to be true and on the false branch the conditions
of the branch are assumed to be false. Such "assumptions" create
constraints on the values of the program, and those constraints are
recorded in the ProgramState object (and are manipulated by the
ConstraintManager). If assuming the conditions of a branch would
cause the constraints to be unsatisfiable, the branch is considered
infeasible and that path is not taken. This is how we get
path-sensitivity. We reduce exponential blow-up by caching nodes. If
a new node with the same state and program point as an existing node
would get generated, the path "caches out" and we simply reuse the
existing node. Thus the ExplodedGraph is not a DAG; it can contain
cycles as paths loop back onto each other and cache out.

ProgramState and ExplodedNodes are basically immutable once created.
A ProgramState cannot be modified after it has been created; to
represent a different state, you create a new ProgramState. This
immutability is key since the ExplodedGraph represents the behavior of
the analyzed program from the entry point. To represent these
efficiently, we use functional data structures (e.g., ImmutableMaps)
which share data between instances.

Finally, individual Checkers work by also manipulating the analysis
state. The analyzer engine talks to them via a visitor interface.
For example, the PreVisitCallExpr() method is called by GRExprEngine
to tell the Checker that we are about to analyze a CallExpr, and the
checker is asked to check for any preconditions that might not be
satisfied. The checker can do nothing, or it can generate a new
ProgramState and ExplodedNode which contains updated checker state. If
it finds a bug, it can tell the BugReporter object about the bug,
providing it an ExplodedNode which is the last node in the path that
triggered the problem.

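The branch bifurcation and infeasible-path pruning described earlier in this section can be sketched in a few lines. This is a toy model under strong assumptions, not the ConstraintManager interface: the state tracks a single integer `x` as a closed interval, the only supported condition is `x > K`, and `ToyState`, `assumeGreater` and `bifurcate` are made-up names. Each assumption returns a fresh state and leaves its input untouched, mirroring the immutability of states.

```cpp
#include <algorithm>
#include <optional>
#include <utility>

// Toy state: the only constraint tracked is x in [Lo, Hi].
struct ToyState {
  int Lo;
  int Hi;
};

// Assume the branch condition `x > K` to be true (Taken) or false.
// Returns a new, narrowed state, or std::nullopt if the assumption makes
// the constraints unsatisfiable. The input state is never modified.
// Assumes K is small enough that K + 1 does not overflow.
std::optional<ToyState> assumeGreater(const ToyState &S, int K, bool Taken) {
  ToyState Next = S;
  if (Taken)
    Next.Lo = std::max(S.Lo, K + 1); // x > K
  else
    Next.Hi = std::min(S.Hi, K);     // x <= K
  if (Next.Lo > Next.Hi)
    return std::nullopt;             // infeasible: path is not taken
  return Next;
}

// Split a state at `if (x > K)`: first is the true branch, second the
// false branch. An empty optional marks an infeasible branch.
std::pair<std::optional<ToyState>, std::optional<ToyState>>
bifurcate(const ToyState &S, int K) {
  return {assumeGreater(S, K, true), assumeGreater(S, K, false)};
}
```

Starting from `x` in `[0, 10]`, splitting on `x > 5` yields two feasible states. Splitting the true-branch state again on `x > 3` leaves only the true branch feasible, since `x <= 3` contradicts the constraint already recorded.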
= Notes about C++ =

Since constructors now appear in the CFG before the variable being
constructed, we create a temporary object as the destination region
that is constructed into. See ExprEngine::VisitCXXConstructExpr().

In ExprEngine::processCallExit(), we always bind the object region to the
evaluated CXXConstructExpr. Then in VisitDeclStmt(), we compute the
corresponding lazy compound value if the variable is not a reference, and
bind the variable region to the lazy compound value. If the variable
is a reference, just use the object region as the initializer value.

Before entering a C++ method (or ctor/dtor), the 'this' region is bound
to the object region. In ctors, we synthesize 'this' region with
CXXRecordDecl*, which means we do not use type qualifiers. In methods, we
synthesize 'this' region with CXXMethodDecl*, which has getThisType()
taking type qualifiers into account. It does not matter that we use a
qualified 'this' region in one method and an unqualified 'this' region in
another method, because we only need to ensure the 'this' region is
consistent when we synthesize it and create it directly from CXXThisExpr
in a single method call.

= Working on the Analyzer =

If you are interested in bringing up support for C++ expressions, the
best place to look is the visitation logic in GRExprEngine, which
handles the simulation of individual expressions. There are plenty of
examples there of how other expressions are handled.

If you are interested in writing checkers, look at the Checker and
CheckerVisitor interfaces (Checker.h and CheckerVisitor.h). Also look
at the files named *Checker.cpp for examples on how you can implement
these interfaces.

= Debugging the Analyzer =

There are some useful command-line options for debugging. For example:

    $ clang -cc1 -help | grep analyze
     -analyze-function <value>
     -analyzer-display-progress
     -analyzer-viz-egraph-graphviz
     ...

The first restricts the analysis to a specific function.
The second prints to the console what function is being analyzed. The
third generates a graphviz dot file of the ExplodedGraph. This is
extremely useful when debugging the analyzer and viewing the
simulation results.

Of course, viewing the CFG (Control-Flow Graph) is also useful:

    $ clang -cc1 -help | grep cfg
     -cfg-add-implicit-dtors Add C++ implicit destructors to CFGs for all analyses
     -cfg-add-initializers   Add C++ initializers to CFGs for all analyses
     -cfg-dump               Display Control-Flow Graphs
     -cfg-view               View Control-Flow Graphs using GraphViz
     -unoptimized-cfg        Generate unoptimized CFGs for all analyses

-cfg-dump dumps a textual representation of the CFG to the console,
and -cfg-view creates a GraphViz representation.

= References =

[1] Precise interprocedural dataflow analysis via graph reachability,
    T Reps, S Horwitz, and M Sagiv, POPL '95,
    http://portal.acm.org/citation.cfm?id=199462

[2] A memory model for static analysis of C programs, Z Xu, T
    Kremenek, and J Zhang, http://lcs.ios.ac.cn/~xzx/memmodel.pdf