AQuery: Soft Type Checking

During the last couple of weeks, I’ve been working with Dr. Dennis Shasha on a rewrite of the current AQuery translator. The original system is written in C and was in need of a cleanup. For the task, we decided to use Scala for a variety of reasons, including (but not limited to):

  • efficient pattern matching
  • Java interoperability
  • expressive type system

In this blog post, I want to provide a discussion of the type checking portion of the translator.

AQuery’s main backend target is kdb+, an extremely efficient column-oriented database with an elegant functional programming language (q). The language is dynamically typed, meaning that types are resolved at runtime. With this comes the usual set of advantages (e.g. modularity, speed of development) and disadvantages (e.g. error detection requires running the program).

The AQuery translator provides a “soft” type checking layer that makes sure some egregious mistakes are ruled out. Note that in contrast to traditional soft type checking, we cannot actually remove any dynamic type checks, as we don’t have access to the back end execution. However, similarly to the traditional approach, we can rule out provably wrong cases.

For example, in the query

SELECT c1 * c2 FROM t WHERE avgs(c1)

we don’t need to know what type c1 is to be able to notify the user that there is an error here. We know avgs returns a vector of floats, while a WHERE clause requires boolean values to filter out undesired elements.

If we run this through AQuery with the -tc flag, indicating we want to type check, we get the following:

josecambronero aquery$ echo "SELECT c1 * c2 FROM t WHERE avgs(c1)" |
 java -jar target/scala-2.11/aquery-assembly-0.1.jar -tc
Type Error: expected boolean found numeric at [1,29]

In this post, I’d like to discuss how we went about implementing this feature, along with some of the advantages and disadvantages with our approach.

Type representation

We represent types using Scala’s case objects, rather than enumerations, as we never need to iterate over all types at once, which is the main convenience an enumeration would provide.
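A minimal sketch of what such a representation might look like (the names here are illustrative, not necessarily the actual AQuery definitions):

```scala
// Hypothetical sketch: type tags as case objects under a sealed trait.
// Sealing the trait lets the compiler warn on non-exhaustive matches.
sealed trait TypeTag
case object TNumeric extends TypeTag
case object TBoolean extends TypeTag
case object TString  extends TypeTag
case object TUnknown extends TypeTag
```

Case objects give us singleton values we can pattern match on directly, without the extra machinery (e.g. a `values` iterator) that `Enumeration` provides.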

Location reporting

All nodes in our AST extend the scala.util.parsing.input.Positional trait, which allows us to report source code line and column. Scala’s parser combinator library makes this easy: we wrap the appropriate parsers with a call to positioned and combine them into the complete parser. I’ll elide discussion of this here, as that is a whole other topic.

Function type representation

Let’s focus on functions, as these provide the most interesting portion of our analysis.

We maintain a simple summary for each function. We won’t go into details here, but for purposes of type checking, we mainly care about two members of this representation:

  def f: String
  def signature: PartialFunction[Seq[TypeTag], TypeTag]

Signature represents a partial function from a sequence of type tags to a type tag. Partial functions, in contrast to total functions, are defined only over a portion of their domain. In this case, signature directly mimics the definition of a function signature: it returns a type tag (the return type) for a sequence of arguments that have the appropriate type tags. For sequences of type tags that do not fit our definition, applying the lifted signature (signature.lift) returns None, while for those that do fit, it returns Some(return_type).

First, some background definitions and a quick overview of Scala syntax.

In our type checker, we say type x is consistent with a set of types Y, iff x is unknown, Y contains unknown, or Y contains x.
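This notion is easy to express as a method on the type tags themselves. A rough sketch, assuming type tags are case objects under a sealed trait (names are illustrative):

```scala
sealed trait TypeTag {
  // x.consistent(ys): true iff x is unknown, ys contains unknown, or ys contains x
  def consistent(ys: Set[TypeTag]): Boolean =
    this == TUnknown || ys.contains(TUnknown) || ys.contains(this)
}
case object TNumeric extends TypeTag
case object TBoolean extends TypeTag
case object TUnknown extends TypeTag
```

For example, `TNumeric.consistent(Set(TBoolean))` is false, while `TUnknown.consistent(Set(TBoolean))` is true.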

In Scala, a partial function can be defined with a case expression (below using some rough EBNF):

{ (case <Pattern> <if-condition>? => expression)+ }

If an expression satisfies the pattern, and an optional if-guard, the RHS of `=>` evaluates. Note that when we match on a pattern we can also extract values by associating variables with the subexpressions in the pattern. This proves to be very useful.
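Here is a tiny, self-contained example of this syntax (unrelated to AQuery’s actual code), including the `lift` method that turns a partial function into a total one returning an Option:

```scala
// A partial function defined only for non-empty lists whose head is positive.
val headIfPositive: PartialFunction[List[Int], Int] = {
  case x :: _ if x > 0 => x
}

// isDefinedAt checks whether a value matches some case (and its guard)
headIfPositive.isDefinedAt(List(3, 4)) // true
headIfPositive.isDefinedAt(List(-1))   // false: guard fails

// lift turns the partial function into a total one returning Option
headIfPositive.lift(List(3, 4)) // Some(3)
headIfPositive.lift(Nil)        // None: no case matches
```

Applying the partial function directly to an undefined input (e.g. `headIfPositive(Nil)`) would throw a MatchError, which is why `lift` is the safer way to query it.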

Let’s consider two specific examples: max and fill. We define their type signatures as

// we can use v @ to refer to the entire seq as v in rest of code
{ case v @ x :: xs if v.forall(_.consistent(numAndBool)) =>
     if (v.forall(_.consistent(bool))) TBoolean else TNumeric 
} 

{ case a :: v :: Nil if a.consistent(Set(v)) => v }

respectively.

For max, our signature succinctly encodes the following:

  • `max` can be called with 1 or more arguments `(x :: xs)`, which allows us to consider `max` as a variadic function, so `max(some_column)` is valid, and so is `max(1, 2, 3, 4, …)`.
  • `max` can be called as long as the arguments are either numeric or boolean (allowing boolean arithmetic)
  • the return type of `max` is boolean if all arguments are boolean, otherwise it is numeric

For fill, our signature encodes:

  • `fill` requires two arguments
  • the type of the first argument must be consistent with the second argument
  • the return type corresponds to the type of the second argument
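To make this concrete, here is a self-contained sketch that defines minimal type tags and the two signatures above, then applies them via `lift` (names like `maxSig` and `fillSig` are mine, for illustration):

```scala
sealed trait TypeTag {
  // consistent as defined earlier: unknown is consistent with anything
  def consistent(ys: Set[TypeTag]): Boolean =
    this == TUnknown || ys.contains(TUnknown) || ys.contains(this)
}
case object TNumeric extends TypeTag
case object TBoolean extends TypeTag
case object TUnknown extends TypeTag

val numAndBool: Set[TypeTag] = Set(TNumeric, TBoolean)
val bool: Set[TypeTag] = Set(TBoolean)

// max: variadic, numeric/boolean args; boolean iff all args boolean
val maxSig: PartialFunction[Seq[TypeTag], TypeTag] = {
  case v @ x :: xs if v.forall(_.consistent(numAndBool)) =>
    if (v.forall(_.consistent(bool))) TBoolean else TNumeric
}

// fill: two args, first consistent with second; returns second's type
val fillSig: PartialFunction[Seq[TypeTag], TypeTag] = {
  case a :: v :: Nil if a.consistent(Set(v)) => v
}
```

With these definitions, `maxSig.lift(List(TBoolean, TBoolean))` yields `Some(TBoolean)`, `maxSig.lift(List(TNumeric, TBoolean))` yields `Some(TNumeric)`, and `fillSig.lift(List(TUnknown, TNumeric))` yields `Some(TNumeric)`.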

Why do we represent a function signature using a partial function from Seq[TypeTag] to TypeTag rather than just as a tuple of formal argument types and return type, or a Set of such tuples in cases of overloading?

Here are some advantages that we can see just from the two examples provided above:

  • overloading is simple: We can compose partial functions using `orElse`, allowing us to easily express overloading. We can also represent variadic functions easily, which would likely require special handling if we used something like a set of tuples.
  • variable binding from pattern matching: We can easily express polymorphic functions by associating variables in the pattern matching with specific types and using these in returns/checks. Note that we cannot bind the same variable to multiple locations, so we cannot really pretend these variable bindings are type variables (a la SML), but they’re close enough to yield simple definitions when combined with if guards.
  • pattern matching makes readable type definitions: As seen before, pattern matching, along with the ability to execute arbitrary Scala code in the resulting expression for each case, allow us to provide succinct definitions for our type restrictions.
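For instance, an overloaded function with a unary and a binary form might be composed like this (again an illustrative sketch, not AQuery’s actual definitions):

```scala
sealed trait TypeTag
case object TNumeric extends TypeTag
case object TString  extends TypeTag

// unary overload: numeric => numeric
val unary: PartialFunction[Seq[TypeTag], TypeTag] = {
  case TNumeric :: Nil => TNumeric
}
// binary overload: (string, numeric) => string
val binary: PartialFunction[Seq[TypeTag], TypeTag] = {
  case TString :: TNumeric :: Nil => TString
}
// orElse tries the first definition and falls back to the second
val overloaded: PartialFunction[Seq[TypeTag], TypeTag] = unary orElse binary
```

Any argument sequence that matches neither case is simply undefined for `overloaded`, so `overloaded.lift(List(TString))` is `None`.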

Let’s walk through the example of typing a simple expression

max(c1 > 2)

There is some boilerplate code below, which parses the expression and type checks each subexpression with an environment containing only the builtin functions. The line of interest is the last one.

scala> import edu.nyu.aquery.parse.AqueryParser._
import edu.nyu.aquery.parse.AqueryParser._

scala> import edu.nyu.aquery.analysis.TypeChecker
import edu.nyu.aquery.analysis.TypeChecker

scala> val checker = TypeChecker(Nil)
checker: edu.nyu.aquery.analysis.TypeChecker = edu.nyu.aquery.analysis.TypeChecker@5b1357ec

scala> val e = parse(expr, "max(c1 > 2)").get
e: edu.nyu.aquery.ast.Expr = FunCall(MAX,List(BinExpr(Gt,Id(c1),IntLit(2))))

scala> val typed = e.allSubExpressions.map(e => e.dotify(1)._1 -> checker.checkExpr(e)._1)

scala> typed.foreach(println)
(c1,unknown)
(2,numeric)
(c1>2,boolean)
(MAX(c1>2),boolean)

Here we can see exactly what is going on. We don’t know the type of c1, but we do know that c1 > 2 yields a boolean, and given the definition of max above, we know that max of a boolean yields a boolean. More complicated examples follow the same reasoning, but the likelihood of ending up with TUnknown is of course much larger.

But not all is rosy. There are some disadvantages to using an “opaque” structure such as partial functions, rather than something we can explicitly inspect, like tuples and sets of them. Note that to some extent these disadvantages are inherent in all “shallow” embedded interpreters, a category our type checker arguably falls into. I believe the main disadvantage is:

  • vague error messages (we don’t know exactly why a match failed!)

Consider,

josecambronero aquery$ echo "SELECT c1 * c2 FROM t WHERE max(\"a\", 1)=2" |
 java -jar target/scala-2.11/aquery-assembly-0.1.jar -tc
Call Error: bad call to MAX at [1,29]

AQuery knows that the call to max in the where clause is wrong, as the first argument is neither a boolean nor a numeric, but it cannot identify what exactly went wrong. This comes directly from not knowing why a partial function failed to match.

The next example returns the same error message despite a different kind of error. In this case, max is called with no arguments, which of course is not well-typed.

josecambronero aquery$ echo "SELECT c1 * c2 FROM t WHERE max()=2" |
 java -jar target/scala-2.11/aquery-assembly-0.1.jar -tc
Call Error: bad call to MAX at [1,29]

So in essence we have traded abstraction in the source code, which is incredibly helpful for development purposes, for abstraction at reporting time, which yields ambiguous messages.

In the aggregate, I believe the advantages strongly outweigh the disadvantages, given that our location-based error message should provide enough of a hint for most cases to be fixed.

Note that type checking in the translator needs to be actively turned on (using -tc). This allows users to run code regardless of the output of our type checking process. The analysis performed can certainly be refined further, and I think it might be a good place for newcomers to contribute.
