First year takeaways

I’ve pretty much neglected this blog since last summer, but with the current school year over (yay!) I thought it would be a good time to do some retrospective blogging.

I had the pleasure of collaborating with various great people on research projects this semester.

Jack Feser, Micah Smith, and I collaborated (under the supervision of Sam Madden) on a database for dynamic missing-value imputation and wrote up a paper, which, after a round of review-based revisions, turned out quite nicely, I think (we still have to hear back on this latest round of reviews).

I also worked with Phillip Stanley-Marbell and Martin Rinard (my advisor) on two interesting projects related to color perception. The first constitutes the first large-scale color perception study across different countries, based on iOS data, while the second used the same dataset to perform customized image compression. We wrote up the second of these and submitted it to a small conference, and we are in the process of polishing off the first before submitting it to a conference venue.

I think I managed to get a lot out of these three research experiences. Working with great people pushed me to try to improve my own skills and bring more to the table. I wanted to sum up my thoughts and personal lessons in a list of observations. As a disclaimer, I’m certainly not trying to give advice to anyone else on this. I have no real qualifications to do so, and a year of a PhD program is hardly a long time as far as experience goes, but I thought I’d like to keep this for posterity (and maybe it will help some other first-year in the future). So here are some takeaways from all of this (in bullet form, and in no particular order):

  • Revisiting ideas is extremely worthwhile. Our first stab at the image compression project flat out failed. I had the wrong approach to it. The second time around, rather than stabbing in the dark, we brainstormed about the core issue with what I was doing and our goals, and tried to find a solution that bridged the two. The second approach that came out of this worked.
  • Many systems make tradeoff decisions for the user. Exposing these tradeoffs and letting the user choose for themselves feels like a cleaner approach. Indeed, these tradeoffs themselves can be the most interesting part of your system. Our database work showed that formulating these tradeoffs in a rigorous way can be a novel contribution itself.
  • Writing up the paper as you work through the problem can help direct your efforts. It is basically a concrete way of tackling the question: “what do I need to show to convince a colleague of the novelty/effectiveness/impact of my idea/technique/system?”. This isn’t my own observation, but rather the product of great advice from Phillip and a Simon Peyton-Jones video he recommended. I highly recommend this and plan to keep doing it going forward.
  • Related to the point above: spend time writing up your problem statement before diving in. This helps formalize the problem, which in turn can expose good approaches to tackling it, or point you in the right direction. This is an approach Martin emphasized from day one, and I think it has a lot of real benefits.
  • Have two projects going simultaneously. When I had a single project, getting stuck or not feeling excited about the problem was a big drag on my productivity. When I started working on two projects at a time, I could just switch to the other problem whenever I got tired of one. This made work a lot more enjoyable and productive. Some people may be able to manage more than two projects, but for me three started to get too hectic.
  • In terms of school work/classes, I had to make a big effort to switch the way I approached courses. In undergrad, and even in my masters, courses were a big focus, so it always made sense to spend additional time polishing homeworks/projects or studying for a test. Although there were diminishing returns on that time, it felt worthwhile. This time around, I’ve tried to switch my approach. I put in enough time to make sure I understand the core idea or takeaway, but beyond that, chasing a couple of extra points isn’t really worth it, and that time is better spent on research.
  • Volunteer to organize at least one event for your department or lab. Advisors seem understanding of the fact that you are getting used to new work, a new environment, and potentially a new field or research area. This means research output is not necessarily the right metric for the first year. It also means that you can volunteer for events that you might not want to (or be able to) once you are deep in your research work. I had the pleasure of collaborating with a bunch of great people this semester on organizational events. In doing so, I met other students (junior and senior), faculty members, and community members that I wouldn’t otherwise have had the chance to meet as easily. Candace Ross and I helped put together the first diversity panel at the EECS admitted students visit weekend, and along with Ivan Kuraj, I helped plan the 2017 programming languages offsite.
  • Recognizing that different people have different research styles and tastes is incredibly useful. For example, I’ve learned that, at least at this point in my grad career, I like starting out with a concrete problem  (e.g. missing values in a database are a pain to work with) and then letting the problem steer me to the appropriate solution. Other people may like the idea of developing a core technique, and then finding a problem that it solves. While the latter sounds cool to me, I have a hard time conceptualizing solutions without concrete and motivating examples.
  • I’m making a conscious effort not to stress or place emphasis on producing “worthwhile” results. By that I mean: results aren’t just the final product of your research. Getting the database set up for your project, or finally setting up your VM for experiments, can also be “results”. This may seem like I’m simply redefining the meaning of results to suit me and feel positive. But it’s actually amazing how much of a psychological impact this redefinition had on me. During my first couple of months, I struggled with the fact that I wasn’t “producing results”. Eventually Martin pointed out that all of these small things were results, and that productivity takes many forms. This insight was momentous in changing my whole mindset for the Spring semester.
  • A related point: don’t think about research as producing results for your advisor; produce them for yourself. This isn’t to say that you should ignore specific requests from your advisor. Clearly, they are (likely/hopefully) providing you with funding and the opportunity to work with them, so you do have a responsibility to deliver on that. However, the results you produce, and the work you do, should be for yourself. There is a large opportunity cost in attending graduate school, and you should make sure you do not short-change yourself by pursuing topics that you are not interested in. Martin has made this point to our group repeatedly, and it is one of the points I have appreciated the most. I realize this is an enormous privilege to have, and I am working to deliver on that opportunity.
  • A less positive point, but one that I think is very important: own your mistakes. I fudged up some data pre-processing for our database evaluation. We figured this out after Micah and Jack had spent a non-trivial amount of time dealing with frustrating queries that didn’t match up with expectations (based on what they thought the preprocessing had been). In the end, we realized there had been a miscommunication, along with poor documentation on my part. This was a good lesson, and one that I’m happy worked out with no lasting negative impact on others or myself.
  • Related to the prior point: own your mistakes, but also move on. There is no point dwelling on something you messed up. You fix it, apologize, take note for the future, and move on. Anything else is just bad for productivity and (more importantly) well-being.
  • Asking colleagues for help when you’re stuck, and similarly providing help when you’re in a position to do so, is one of the greatest benefits of being part of an academic community. I’ve been beyond pleased to see how open and helpful others are. I like to think I’ve been of help to others as well, but for the coming year I’m making a conscious effort to offer my help whenever possible.
  • A bit more mundane: I’ve started using a time management application (Asana) to write out my TODOs by day and group them by project. It may be overkill for my purposes, but I have squeezed a lot of order out of my days this way. In general, I’ve realized that having two major TODOs per day, along with smaller tasks, is a good way to split up my time. I make an effort to have at least one of those major TODOs be research-related. Major TODOs in my mind are things like a significant amount of coding, writing up a problem, or setting up an evaluation benchmark.

Those are the main points, off the top of my head. I’m looking forward to this summer. I’ll be staying at school and getting started on a research project I’m very excited about. Hopefully I’ll get a chance to write up at least one blog post along the way.

Orderly: A Simple DSL (Code Generation/Execution)

In this last post in the series surrounding the upcoming Kx talk, we’ll discuss the step of code generation (in the case of the external DSL) and execution (in the case of the embedded one). The prior two posts, dealing with syntax and semantic analysis, respectively, can be found here and here.

We’ll discuss the embedded DSL first, given its simplicity. Orderly is a small language: it allows us to express market orders and their insertion into an order book in a straightforward syntax, with various semantic checks that might be relevant for our business purposes. Given this simplicity, the interpreting function, which takes our DSL and actually provides it with meaning, is equally simple. It is worth pointing out that the “interpreting” function need not actually be an interpreter; it can just as well generate code that is compiled down and then called. Just one thing worth keeping in mind.

So in our case, all we do with Orderly’s AST (represented as a q list) is extract the fields to create a table row representing the order and a table name into which to insert the order.

This can be succinctly expressed in q as:

/ build a one-row table from the order's fields
toTable:{flip `side`volume`units`ticker`px`cond`client!(),/:x}
/ build the row from the AST (appending `self so a missing client defaults to it) and upsert it into the order book named by the last AST element
add:{[ast] (last ast) upsert toTable 7#(-1 _ ast),`self}
interpret:add

For clarity (and pedagogical reasons), we’re dogmatic about what we name functions here, so our “interpreting” function is quite literally called interpret.

We add another useful function to the interpret.q file; although it is not used to perform any translation of the AST, it is useful when working with the resulting Orderly tables.

/ keep orders whose cond function holds for the given environment (errors from cond are trapped and treated as false)
satisfies:{[t;env] select from t where first each @[;env;0b] each cond }

It is meant to easily filter market orders that satisfy the requisite conditions. We wrap the condition calls in error-trapping logic for generality.

Here is an example execution:

[Image: internal-execution]

This brings us to some of the dos and don’ts for this part:

  • DO: add additional functions that may be of interest for the domain but don’t need to be represented syntactically.
  • DON’T: perform drastic AST transformations in this phase. Such changes should be pushed into prior stages.

Now, for the external implementation, we generate q code. The user is free to read/modify the generated code, which we can then feed as a normal q file to the interpreter for execution. This is the approach we take in AQuery. As usual, there is a series of tradeoffs here, some of which we’ll encounter as we explore the implementation below.

The insertions generate code fairly similar to what the internal implementation uses. The only meaningful difference is that the external implementation can optimize a series of insertions into a single bulk insertion. This means inter-table writes can potentially be reordered relative to the original file, but all intra-table writes take place in the same order as in the original.
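
To make the rewrite concrete, here is a rough sketch of the grouping idea in Scala (the Insert case class and the emitted upsert format are made up for illustration; this is not the actual generator code):

// Hypothetical sketch: collapse consecutive single-row inserts into the same table
// into one bulk insert, keeping per-table insertion order intact.
case class Insert(table: String, row: String)

def bulkify(inserts: List[Insert]): List[String] =
  inserts.foldRight(List.empty[List[Insert]]) {
    // extend the current run if it targets the same table
    case (i, (group @ (h :: _)) :: rest) if h.table == i.table => (i :: group) :: rest
    // otherwise start a new run
    case (i, acc) => List(i) :: acc
  }.map { group =>
    // emit a single q upsert covering the whole run
    s"${group.head.table} upsert (${group.map(_.row).mkString("; ")})"
  }

// bulkify(List(Insert("book", "r1"), Insert("book", "r2"), Insert("other", "r3")))
// => List("book upsert (r1; r2)", "other upsert (r3)")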

In generating code, we need some helper functions, mainly those related to the deferred semantic checks we discussed in the prior post. For readability, we choose to use functions rather than inserting the code inline. For maintainability, we keep these functions in a resource file in our source code directory and insert them as a form of “prelude” in each generated file.
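
A minimal sketch of the prelude idea (the resource name and exact wiring here are assumptions for illustration, not the actual code):

import scala.io.Source

// Hypothetical: read the helper functions shipped with the translator and prepend them
// to the generated program, so every emitted file is self-contained.
def withPrelude(generated: String): String = {
  val src = Source.fromFile("src/main/resources/prelude.q") // assumed location
  try src.mkString + "\n" + generated finally src.close()
}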

Here is an example of two generated Orderly files (written to stdout). The first doesn’t optimize (i.e. rewrite) to bulk inserts; the second does.

[Image: external-generation]

[Image: external-generation-optimized]

Finally, here is an example showing the deferred semantic checks in action:

[Image: external-delayed-error]

This brings us to the dos and don’ts for the external implementation:

  • DO: generate intelligible code. This can mean factoring out code that might otherwise be inlined (a smart option would be to make this a command-line option for the translator), and it may (will?) mean adding comments to the generated code.
  • DO: include any helper code as a prelude to the generated code. This allows easily movable files, which, for a small DSL, will be appreciated by your users.
  • DO: write the generator in an implementation language that lets you create clean and maintainable generator code, using constructs such as string interpolation to splice together generated code.

That’s it for this tutorial series. I’ll be discussing the code and slides (to come) at the Kx meetup coming up on September 22nd, 2016. Feel free to stop by, ask questions, chat or just say hi!

Orderly: A Simple DSL (Semantic Analysis)

In this second installment of the Orderly series (post 1 here), we’ll discuss semantic analysis. At this point we’ve already parsed our text and determined whether or not it satisfies our language’s grammar. If it does, we “accept” it; otherwise, we reject it (hopefully with some sort of error message describing what issues arose).

In the semantic analysis stage, we decide whether the given program makes “sense”. This may include tasks such as type analysis, which aims to protect users by making sure that all operations are defined and appropriate for the types of data they are applied to. It can also include analysis passes that perform transformations on the program representation; the line between this and later optimization stages is often blurred.

In our case, for Orderly, we’re not really performing type checking, since the only arbitrary code is found in the modifier clause (where we allow q code). However, we do perform a couple of checks, such as making sure that the price and volume of a market order are positive. We also check that the q code provided corresponds to a monadic (here meaning single-argument) function. Note that this is an arbitrary restriction we came up with, simply meant to be illustrative. Finally, we change modifiers that correspond to a date into a lambda that checks for that date. This is an example of “uniformization” of our AST and can lead to simpler code generation later on. There are cases where this is not ideal, but at least in my experience, it has allowed me to keep a small core of code generation functions and defer most constructs to some combination of them.
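
To illustrate the date “uniformization” step, here is a rough Scala sketch (the Modifier node names and the generated lambda body are hypothetical, not the actual Orderly AST or semantics):

// Hypothetical AST nodes for the modifier clause
sealed trait Modifier
case class DateMod(date: String) extends Modifier // e.g. "on 05/16/2016"
case class FunMod(qCode: String) extends Modifier // e.g. if "{[env] ...}"

// Rewrite a date modifier into a single-argument q lambda that checks the date,
// so later stages only ever see function modifiers.
def uniformize(m: Modifier): Modifier = m match {
  case DateMod(d) => FunMod(s"{[env] $d = env `date}") // assumed shape of the check
  case other      => other
}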

We’ll discuss a couple of interesting points for both the internal and external implementations.

For the internal implementation, we got a bit cheeky and added a simple assertion function to the .q namespace, which allows us to use it in infix form in our main checking function. However, we atone for this sin by dropping it from .q after defining our main check function. This means users can no longer use it infix (so we’re safe), but the main check function has already been defined, so it has served its purpose.

Here is the check function code in the internal implementation (one might say this reads as a DSL itself, although here the only convenience is syntactic):

check:{
  getVolume[x] should be ({x > 0};{"Volume should be positive"});
  getPrice[x] should be ({x > 0};{"Price should be positive"});
  x:setModifier[x;] wrapDate getModifier x;
  getModifier[x] should be (isUnary;{"Expected unary function"});
  x
 }

Here are two examples of the new semantic analysis in the internal implementation. Note how in the first, the language complains as “sum” is not a monadic function as expected (well…it is, but we explicitly rule out built-ins such as this, since it’s unclear what their meaning would be in an Orderly modifier-clause):

[Image: internal-analysis-error]

The second example shows how we transform a date-based modifier clause in Orderly into a single-argument lambda, which checks that the date corresponds to the original date provided.

[Image: internal-transform-date]

For the external implementation, there are two interesting points to make. First, consider that at least one of our checks (whether a function is monadic) cannot always be performed statically. We provide a simple solution: insert the check into the generated code (not unlike dynamic type checking). One important point here is that the runtime check should carry at least some context about the source location of the error to return to the user.
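
As a rough sketch, the generated check might look something like this (the exact shape of the emitted q snippet is an assumption; isUnary stands in for a helper like the one in the internal implementation, assumed to be available via the prelude):

// Hypothetical: emit a q-level runtime check for a modifier function, carrying the
// DSL source location so a failure at q runtime points back to the Orderly source.
def monadicCheck(qFun: String, line: Int, col: Int): String =
  s"""if[not isUnary $qFun; '"Expected unary function at line $line, col $col"];"""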

We can see in the picture below that we’ve added this runtime check in a Verbatim AST node (used in Orderly to capture q code meant for direct insertion into the generated code).

[Image: external-embed-error-message]

The second interesting point for the external implementation is that we have global knowledge of all the insertions into the various order books that will take place in this Orderly file. This allows us to perform a very easy but worthwhile rewrite: change a series of single market-order insertions into a series of bulk insertions. This modifies the order books in larger chunks, which should have a positive effect on runtime, since memory allocation for the constantly changing size of the order book can be reduced. Additionally, it lets the underlying language (q) take advantage of vector operations for these insertions. Note the speedup in the trivial example below, where the first timed operation is a series of single insertions and the second is a bulk insertion.

[Image: single-vs-bulk-insert]

So some general takeaways/advice for this stage are:

  • DO: provide useful error messages (this should probably just be a DO for all stages)
  • DO: check as much as possible before runtime
  • DO: take advantage of global knowledge to perform global transformations
  • DO: pick an implementation language (for external implementations) that allows nice rewrite notation (e.g. pattern matching in functional languages)
  • DON’T: write non-composable rewrites. If a pattern doesn’t match, then applying it to an operand should be equivalent to the identity. In Scala, we can define these rewrites as partial functions (defined only for certain patterns) and lift them to total functions that simply return the original operand for otherwise undefined operands.
  • DON’T: rely on later stages to perform transformations you can perform now and encode directly in the AST.

The code for both the internal and external implementations can be found here.

The next post will address code generation for our simple language.

Orderly: A Simple DSL (Syntax Phase)

As some of you may know, I’ve been given the chance to present some of my work with Dr. Shasha at the upcoming Kx user meetup (September 22nd, 2016). The talk is meant to address two main points:

  • AQuery as a language and its optimizer
  • Implementing new languages targeting q/kdb+ for execution

I’m scheduled to focus on (2), while Dr. Shasha will discuss (1). The idea for (2) is to structure it as a small tutorial, where we present dos and don’ts. To do so, we’ll discuss a DSL (domain-specific language) that is simple enough to handle in 20-30 minutes, yet provides enough discussion points.

Since I have to go ahead and prepare this material in advance, I figured it didn’t hurt to make some accompanying blog posts.

So in this blog post, we’ll start from the bottom: we’ll introduce our DSL, discuss various implementation approaches, and then provide the syntax analysis phase (i.e. parsing) for the two approaches.

Implementing DSLs

DSLs are commonly implemented in two main ways: as external languages or embedded within another general-purpose programming language. Although the exact definitions of these approaches vary by author, the general idea is that an external implementation has a standalone parser, analysis, and code generation/execution, while internal (aka embedded) implementations are developed within the context of an existing language. Embedded implementations can be roughly broken down into two categories (although we’ll see that these are not necessarily clear cut, nor should we always want them to be): shallow and deep embeddings.

Shallow embeddings do not construct their own AST but rather rely on the semantics of the host language to evaluate them directly (usually as function calls). The line between a shallow embedding and a well-crafted library API can be thin at times. A deep embedding, in contrast, constructs an AST, which it is free to analyze and transform, and then relies on an “interpreting” function to actually provide semantics for it.
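
To make the distinction concrete, here is a tiny made-up arithmetic DSL in Scala showing both flavors (purely illustrative, not Orderly code):

// Shallow embedding: the "DSL" is just host-language functions; evaluation happens
// directly and there is no AST to inspect or transform.
object Shallow {
  def lit(n: Int): Int = n
  def add(a: Int, b: Int): Int = a + b
  val result = add(lit(1), lit(2)) // 3, computed immediately
}

// Deep embedding: programs are data (an AST); a separate interpreting function gives
// them meaning, and we are free to analyze or rewrite the tree first.
object Deep {
  sealed trait Expr
  case class Lit(n: Int) extends Expr
  case class Add(a: Expr, b: Expr) extends Expr

  def interpret(e: Expr): Int = e match {
    case Lit(n)    => n
    case Add(a, b) => interpret(a) + interpret(b)
  }

  val program = Add(Lit(1), Lit(2)) // just a tree
  val result  = interpret(program)  // 3, only once interpreted
}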

We’ll see that in our case, our “external” implementation borrows from the embedded domain, as does our internal implementation.

Our Case Study: Orderly

“Orderly” is a simple language meant to encode (fictitiously simple) market orders and insert them into a main order book.

Some examples of Orderly code might be:

 sell 1000 shares of IBM at $50 on 05/16/2016 for BAML -> order_book

 // uses an embedded q fun
 buy $1e6 of AAPL at $100
   if "{[env] 10 > first exec avg close_price from env where sector=Tech}"
   -> order_book

Here we can see that Orderly allows us to use q functions directly (with some restrictions that we’ll see later on). Already, we can see that the barrier between our language and the host language is thinner than might be usual, even for an external implementation.

The Orderly grammar is quite simple. We reproduce it below in EBNF format for reference:

 <program>  ::= (<order> "->" <ident>)+

 <order>    ::= <side> <vol> "of" <ident> "at" <num> <modifier> ("for" <ident>)?

 <side>     ::= "buy" | "sell"

 <vol>      ::= <num> "shares" | "$" <num>

 <modifier> ::= "if" "<monadic-fun-in-q>" | "on" <date>

Implementations

For our external implementation, we use Scala and take advantage of the elegant parser combinator library to craft a simple parser, which provides decent error reporting.
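
To give a flavor of what this looks like, here is a simplified sketch of an Orderly-style parser using Scala’s parser combinators (the AST and the rules are simplified and hypothetical, not the actual implementation):

import scala.util.parsing.combinator.JavaTokenParsers

// Hypothetical, simplified AST node for an order
case class Order(side: String, vol: String, ticker: String, px: String,
                 modifier: Option[String], client: Option[String], book: String)

object OrderlyParser extends JavaTokenParsers {
  def side: Parser[String]     = "buy" | "sell"
  def vol: Parser[String]      = (floatingPointNumber <~ "shares") | ("$" ~> floatingPointNumber)
  def modifier: Parser[String] = ("if" ~> stringLiteral) | ("on" ~> """\d{2}/\d{2}/\d{4}""".r)

  def order: Parser[Order] =
    side ~ vol ~ ("of" ~> ident) ~ ("at" ~> "$" ~> floatingPointNumber) ~
      opt(modifier) ~ opt("for" ~> ident) ~ ("->" ~> ident) ^^ {
        case s ~ v ~ t ~ px ~ m ~ c ~ book => Order(s, v, t, px, m, c, book)
      }

  def program: Parser[List[Order]] = rep1(order)
}

// e.g. OrderlyParser.parseAll(OrderlyParser.program,
//   "sell 1000 shares of IBM at $50 on 05/16/2016 for BAML -> order_book")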

In our internal implementation, we write a simple parser in q, with a naive tokenizer (which splits tokens on whitespace), and build in some error reporting.

The main takeaways for the syntax analysis phase are:

  • Generate useful parser errors (with context/source location if possible). There is nothing more frustrating as a user than getting an error and having no clue how to fix it.
  • Parsing code should be clear and extensible. This is key for DSLs that might evolve (and most will). In the external case, the parser combinator library fits all these points. In our internal implementation, the code is clear, but its extensibility is debatable. I have hopes that someone will contribute a nice parser combinator library for q in the future.

Error Reporting Examples

The external implementation and the internal implementation both try to provide useful error messages for parse errors.

[Image: external-parse-error]

Searching in the q interpreter

(Disclaimer: I just realized this misses built-ins that are not defined in .q, e.g. max/min… I suspect this could be easily extended to catch those, but FYI!)

My memory is not particularly good. Unfortunately, this means I tend to forget what I’ve named variables/functions, etc., in programs. The same is true for the APIs of libraries that I use rarely. With IDEs, this is not usually an issue, as the vast majority of them offer autocomplete/search functionality. For some languages, this functionality is also built into the REPL (when available).

For example, in Scala, you can hit tab to find possible autocompletions.


scala> val l = List(1,2,3)
l: List[Int] = List(1, 2, 3)

scala> l.f
filter   filterNot   find   flatMap   flatten   fold   foldLeft   foldRight   forall   foreach

The autocompletion here is quite nice, as it is refined based on the type information.

While this is great, I would even settle for something much simpler when writing q code.

Let’s consider the following q code:


a:100;
ab:([]c1:1 2 3)
.f.g:{x}
.f.gh:{x}
.f.a:100
.f.ab:{x}
.f.abc.b:100
f:{x}
fg:{x}
mab:1000

In q, a somewhat similar functionality is offered by the kdb-autocomplete package for Atom. However, this still has some limitations (if I’m wrong on any of these please feel free to point out, as I’m not super familiar with the package):

  • it only seems to offer autocomplete for names when using the qualified name (i.e. including at least part of the namespace), unless they are built-ins

[Image: autocomplete-built-in]

  • it doesn’t provide type information for each of the possible autocompletions (although it does provide some nice comments for built-ins)

[Image: autocomplete-no-type-info]

  • it is restricted to the Atom editor

The last restriction, particularly, makes it hard for me to use it, as I usually write most of my code in the REPL (as I play with it), and then copy it back to the editor once I’ve polished it down to a reasonable version. So the autocomplete/search is missing where I need it the most!

Given this, I put together a small prototype in q that provides some search functionality. It takes the last term in the current input and searches for possible matches in the appropriate namespace. It then provides a table of possible matches with some type info (this info could be extended to be more helpful; ideally, we would follow the Scala example and refine searches based on type info, if possible, but that might be a bit nastier here since types are discovered at runtime).

To “search”, all you need to do is preface your code with the keyword search; we attach a simple function to .z.pi, which handles console input in q.

So let’s give it a try with the .f definitions from above:

q)search 2 * .f.
n  | tag  t
---| --------
a  | vars -7
ab | funs 100
abc| vars 99
g  | funs 100
gh | funs 100
q)search 2 * a
n | tag  t
--| -------
a | vars -7
ab| ts   98
q)2 * a // this evals normally
200

There we go, so it provides all the options, marked with an appropriate tag. Note that nested namespaces (such as .f.abc) show up as dictionaries (as expected) in their type.

We can of course use this to search built-ins, just by prefixing with the .q namespace. Otherwise, note that it searches the global namespace `.

q)search ma
n  | tag  t
---| -------
mab| vars -7
q)search .q.ma
n   | tag  t
----| --------
mavg| funs 100
maxs| funs 108

For those interested, the q code is available in the blog github repo.

Trampolines in q: Jumping around

A tail call is when a function calls another function as the last step right before returning. Tail call elimination refers to an optimization that avoids growing the stack by reusing the last stack frame for the next function call. Given that we have “taken everything we need from” a given call, we are free to overwrite its frame.

Now, tail call elimination is often implemented for tail-recursive calls (which is just the case where the two functions referenced above are one and the same). However, this means that mutually recursive functions (for example, if f calls g last and g calls f last, and f != g) can still overflow the stack.

This is elegantly addressed by this paper, which discusses a general fix and implements it in Scala: trampolines (and, more generally, the free monad). The idea isn’t new, but it is very well explained in the paper.
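
For reference, the core idea can be rendered in a few lines of Scala (a minimal sketch of the technique, not the paper’s exact code):

// A computation is either finished (Done) or knows how to take one more step (More).
sealed trait Trampoline[A]
case class Done[A](result: A) extends Trampoline[A]
case class More[A](step: () => Trampoline[A]) extends Trampoline[A]

// The driver bounces until it hits Done; it is a plain tail-recursive loop,
// so the stack never grows no matter how many bounces there are.
@annotation.tailrec
def run[A](t: Trampoline[A]): A = t match {
  case Done(a)    => a
  case More(step) => run(step())
}

// Mutually recursive even/odd, returning steps instead of calling each other directly.
def even(n: Long): Trampoline[Boolean] = if (n == 0) Done(true) else More(() => odd(n - 1))
def odd(n: Long): Trampoline[Boolean]  = if (n == 0) Done(false) else More(() => even(n - 1))

// run(even(1000000L)) == true, with constant stack usage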

I won’t discuss the latter, as the truth is I’m still learning about monads myself, but I will provide a simple implementation of a trampoline in q.

We can use the trampoline technique by creating two “datatypes”: more and done. In q, we can’t really create new datatypes, but we will settle for pairs of a type label and a value.

More has a thunk (i.e. an unevaluated function that takes zero arguments), while done contains a value.

Here they are:

// data constructors
more:{[f;x] (`more; {[f;a;d] f[a]}[f;x;])}
done:(`done; )

Now, we need our main run function (i.e. the interpreter for our datatypes). This function is tail-recursive, so the q interpreter takes care of the optimization for us.

// trampoline
run:{{x[1][]}/[{not `done~x 0};x][1]}

All it does is repeatedly evaluate thunks until we hit a done.

Now, let’s experiment. Similarly to the paper above, we pick a trivial problem that clearly shows the issue.

Consider the pair of functions even/odd. Even calls odd and vice versa.

even:{$[0=x;1b;odd x-1]}
odd:{$[0=x;0b;even x-1]}

A simple call shows that we run out of stack:

// this runs out of stack
q)even 1e6
'stack
@
{$[0=x;0b;even x-1]}
997999f

Meanwhile, our evenT/oddT, which employ the trampoline, will not run out of stack

evenT:{$[0=x;done 1b;more[oddT;x-1]]}
oddT:{$[0=x;done 0b;more[evenT;x-1]]}
q)run evenT 1e6
1b

If we check what evenT 1e6 is, we can see that it is really just “the recipe” for the next iteration, wrapped in our more constructor.

q)evenT 1e6
`more
{[f;a;d] f[a]}[{$[0=x;done 0b;more[evenT;x-1]]};999999f;]

The paper referenced before goes into more depth and creates a “trampoline monad”, which turns out to be a specific case of a more general tool (the free monad). It provides an outline/hints for how a compiler might automatically rewrite tail calls using this tool to avoid stack overflows. All in all, pretty neat, I think.

AQuery: Rewrites through Partial Functions

As mentioned before, one of the motivations behind rewriting the AQuery system was to provide a cleaner, more maintainable, and more extensible system. In doing so, I drew inspiration from a variety of sources. One of the main ones was Spark SQL’s Catalyst optimizer, which elegantly implements AST rewrites using partial functions. The main idea of a partial function for rewrites is that the RHS of a case is evaluated when the case matches, and otherwise “nothing” happens. In the background, we lift the partial function to return an option, yielding Some(rewrite) on a match or None otherwise, in which case we simply put back the original expression. I encourage you to take a look at the Spark SQL SIGMOD paper, which does a nice job of outlining their system.
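
The lifting trick itself is tiny. Here is a hedged sketch using a toy Int rewrite rather than AQuery’s RelAlg nodes:

// A rewrite defined only for the cases we care about...
val clampNegative: PartialFunction[Int, Int] = { case n if n < 0 => 0 }

// ...lifted to a total function: Some(result) on a match, None otherwise,
// in which case we simply keep the original value.
def applyRewrite(n: Int): Int = clampNegative.lift(n).getOrElse(n)

// applyRewrite(-5) == 0, applyRewrite(7) == 7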

In this quick blog post, I would like to show how you can do the same in AQuery to implement a simple rewrite: we would like to combine adjacent selections into a single selection. This might be the case if we wanted to generate q code for a selection, as the language already executes selections sequentially. Combining the selections in this case helps us reduce the size of the output code.

AQuery queries are a composition of various relational algebra operators (along with some extensions).

For example, we can create a query as follows:

import edu.nyu.aquery.ast._
import edu.nyu.aquery.parse.AqueryParser._

// projection
val p = Project(_:RelAlg, Nil)
// two sets of selections
val f1 = Filter(_:RelAlg, 
   parse(rep(predicate | expr), "c1 > 2 f(x) != 100 10 LIKE 100").get
)
val f2 = Filter(_:RelAlg, parse(rep(expr), "c0 * 2 > 2").get)
// a sort
val s = SortBy(_:RelAlg, List((Asc, "c1"), (Desc, "c2")))
// 2 more selections
val f3 = Filter(_:RelAlg, parse(rep(expr), "c0 c1 c2").get)
val f4 = Filter(_:RelAlg, parse(rep(expr), "h() = 100").get)
// source table
val t = Table("t1")

// compose our operators
val make: RelAlg => RelAlg = List(f1, s, f2, f3, f4, p).reduceLeft(_ andThen _)
// make a query by applying it to a source table
val q = make(t)

We can visualize our query by using the Dot object, and simply wrapping our query to make it a top-level language construct.

println(Dot.toGraph(Query(Nil, q) :: Nil))

digraph G {
0 [label="AQuery Entry"];
0 -> 1;
1 [label="full-query"];
1 -> 2;
2 [label="project "];
2 -> 3;
3 [label="where h()!=100"];
3 -> 4;
4 [label="where c0, c1, c2"];
4 -> 5;
5 [label="where c0*2>2"];
5 -> 6;
6 [label="sort: Asc c1, Desc c2"];
6 -> 7;
7 [label="where c1>2, f(x)!=100, LIKE(10,100)"];
7 -> 8;
8 [label="t1"];
}

Now, we can write our rewrite rule

val rewrite: PartialFunction[RelAlg, RelAlg] = {
  case Filter(Filter(src, fs1, _), fs2, _) => Filter(src, fs1 ++ fs2)
}

Our rewrite can be read off directly from the code: when there are two consecutive filters, combine them into a single one by concatenating their selection expressions.

We now apply the transformation and look at the DOT graph again

println(Dot.toGraph(Query(Nil, q.transform(rewrite)) :: Nil))

digraph G {
0 [label="AQuery Entry"];
0 -> 1;
1 [label="full-query"];
1 -> 2;
2 [label="project "];
2 -> 3;
3 [label="where c0*2>2, c0, c1, c2, h()!=100"];
3 -> 4;
4 [label="sort: Asc c1, Desc c2"];
4 -> 5;
5 [label="where c1>2, f(x)!=100, LIKE(10,100)"];
5 -> 6;
6 [label="t1"];
}

Note how the consecutive selections have now been merged.

As I begin working on the AQuery optimizer I think this approach will pay dividends in terms of simplicity. I’ll write up any particularly interesting cases as I encounter them.

AQuery: Soft Type Checking

During the last couple of weeks, I’ve been working with Dr. Dennis Shasha on a rewrite of the current AQuery translator. The original system is written in C and was in need of a clean up. For the task, we decided to use Scala for a variety of reasons, including (but not limited to):

  • efficient pattern matching
  • Java interoperability
  • expressive type system

In this blog post, I want to provide a discussion of the type checking portion of the translator.

AQuery’s main backend target is kdb+, an extremely efficient column-oriented database with an elegant functional programming language (q). The language is dynamically typed, meaning that types are resolved at runtime. With this comes the usual set of advantages (e.g. modularity, speed of development) and disadvantages (e.g. error detection requires running the program).

The AQuery translator provides a “soft” type checking layer that makes sure some egregious mistakes are ruled out. Note that in contrast to traditional soft type checking, we cannot actually remove any dynamic type checks, as we don’t have access to the back end execution. However, similarly to the traditional approach, we can rule out provably wrong cases.

For example, in the query

SELECT c1 * c2 FROM t WHERE avgs(c1)

we don’t need to know what type c1 is to be able to notify the user that there is an error here. We know avgs returns a vector of floats, while a WHERE clause requires boolean values to filter out undesired elements.

If we run this through AQuery with the -tc flag, indicating we want to type check, we get the following:

josecambronero aquery$ echo "SELECT c1 * c2 FROM t WHERE avgs(c1)" |
 java -jar target/scala-2.11/aquery-assembly-0.1.jar -tc
Type Error: expected boolean found numeric at [1,29]

In this post, I’d like to discuss how we went about implementing this feature, along with some of the advantages and disadvantages with our approach.

Type representation

We represent types using Scala’s case objects, rather than enumerations, as we don’t have any need to iterate over all types at a time (we’ll see in a second why we don’t care about this feature).

Location reporting

All nodes in our AST extend the scala.util.parsing.input.Positional trait, which allows us to report source code line and column. Scala’s parser combinator library makes this easy: we wrap the appropriate parsers with a call to positioned and combine them into the complete parser. I’ll elide further discussion of this here, as it is a whole other topic.
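
Just to sketch the mechanism (simplified, and not the actual AQuery parser rules):

import scala.util.parsing.combinator.JavaTokenParsers
import scala.util.parsing.input.Positional

// AST nodes mix in Positional so the parser can stamp them with line/column info.
case class IntLit(v: Int) extends Positional

object MiniParser extends JavaTokenParsers {
  // positioned(...) records the source position of whatever the wrapped parser builds,
  // which is what later lets us report errors like "at [1,29]".
  def intLit: Parser[IntLit] = positioned(wholeNumber ^^ (s => IntLit(s.toInt)))
}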

Function type representation

Let’s focus on functions, as these provide the most interesting portion of our analysis.

We maintain function information using a simple summary for each function. We won’t go into details here, but for the purposes of type checking, we mainly care about two members of our representation:

  def f: String
  def signature: PartialFunction[Seq[TypeTag], TypeTag]

signature represents a partial function from a sequence of type tags to a type tag. Partial functions, in contrast to total functions, are defined only over a portion of their domain. In this case, signature directly mimics the definition of a function signature: it returns a type tag (the return type) for a sequence of arguments that have the appropriate type tags. For sequences of type tags that do not fit our definition, applying the (lifted) signature returns None, while for those that do fit, it returns Some(return_type).

First, some background definitions and a quick overview of Scala syntax.

In our type checker, we say type x is consistent with a set of types Y iff x is unknown, Y contains unknown, or Y contains x.
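
In code, this relation is essentially a one-liner. Roughly (using assumed case objects rather than the exact AQuery definitions):

sealed trait TypeTag {
  // x.consistent(Y): true iff x is unknown, Y contains unknown, or Y contains x
  def consistent(ys: Set[TypeTag]): Boolean =
    this == TUnknown || ys.contains(TUnknown) || ys.contains(this)
}
case object TUnknown extends TypeTag
case object TBoolean extends TypeTag
case object TNumeric extends TypeTag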

In Scala, a partial function can be defined with case expressions (below in some rough EBNF):

{ (case <Pattern> <if-condition>? => expression)+ }

If an expression satisfies the pattern, and an optional if-guard, the RHS of => is evaluated. Note that when we match on a pattern, we can also extract values by associating variables with subexpressions in the pattern. This proves to be very useful.

Let’s consider two specific examples: max and fill. We define their type signatures as

// we can use v @ to refer to the entire seq as v in rest of code
{ case v @ x :: xs if v.forall(_.consistent(numAndBool)) =>
     if (v.forall(_.consistent(bool))) TBoolean else TNumeric 
} 

{ case a :: v :: Nil if a.consistent(Set(v)) => v }

respectively.

For max, our signature succinctly encodes the following:

  • `max` can be called with 1 or more arguments `(x :: xs)`, which allows us to consider `max` as a variadic function, so `max(some_column)` is valid, and so is `max(1, 2, 3, 4, ….)`.
  • `max` can be called as long as the arguments are either numeric or boolean (allowing boolean arithmetic)
  • the return type of `max` is boolean if all arguments are boolean, otherwise it is numeric

For fill, our signature encodes:

  • `fill` requires two arguments
  • the type of the first argument must be consistent with the second argument
  • the return type corresponds to the type of the second argument

Why do we represent a function signature using a partial function from Seq[TypeTag] to TypeTag rather than just as a tuple of formal argument types and return type, or a Set of such tuples in cases of overloading?

Here are some advantages that we can see just from the two examples provided above:

  • overloading is simple: We can compose partial functions using `orElse`, allowing us to easily express overloading (see the sketch after this list). We can also represent variadic functions easily, which would likely require special handling if we used something like a set of tuples.
  • variable binding from pattern matching: We can easily express polymorphic functions by associating variables in the pattern matching with specific types and using these in returns/checks. Note that we cannot bind the same variable to multiple locations, so we cannot really pretend these variable bindings are type variables (a la SML), but they’re close enough to yield simple definitions when combined with if guards.
  • pattern matching makes readable type definitions: As seen before, pattern matching, along with the ability to execute arbitrary Scala code in the resulting expression for each case, allow us to provide succinct definitions for our type restrictions.
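
For example, overloading via orElse (referenced in the first point above) might look roughly like this, reusing the TypeTag/consistent pieces from earlier (toy signatures, not the actual built-in table):

// Hypothetical: a one-argument numeric signature and a two-argument numeric signature,
// composed into a single overloaded signature with orElse.
val oneArg: PartialFunction[Seq[TypeTag], TypeTag] =
  { case x :: Nil if x.consistent(Set(TNumeric)) => TNumeric }

val twoArg: PartialFunction[Seq[TypeTag], TypeTag] =
  { case x :: y :: Nil if Seq(x, y).forall(_.consistent(Set(TNumeric))) => TNumeric }

val overloaded = oneArg orElse twoArg
// overloaded.isDefinedAt(List(TNumeric))           == true
// overloaded.isDefinedAt(List(TNumeric, TNumeric)) == true
// overloaded.isDefinedAt(Nil)                      == false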

Let’s walk through the example of typing a simple expression

max(c1 > 2)

There is some boilerplate code below, which parses the expression and type checks each subexpression in an environment containing only the built-in functions. The line of interest is the last one.

scala> import edu.nyu.aquery.parse.AqueryParser._
import edu.nyu.aquery.parse.AqueryParser._

scala> import edu.nyu.aquery.analysis.TypeChecker
import edu.nyu.aquery.analysis.TypeChecker

scala> val checker = TypeChecker(Nil)
checker: edu.nyu.aquery.analysis.TypeChecker = edu.nyu.aquery.analysis.TypeChecker@5b1357ec

scala> val e = parse(expr, "max(c1 > 2)").get
e: edu.nyu.aquery.ast.Expr = FunCall(MAX,List(BinExpr(Gt,Id(c1),IntLit(2))))

scala> val typed = e.allSubExpressions.map(e => e.dotify(1)._1 -> checker.checkExpr(e)._1)

scala> typed.foreach(println)
(c1,unknown)
(2,numeric)
(c1>2,boolean)
(MAX(c1>2),boolean)

Here we can see exactly what is going on. We don’t know the type of c1, but we do know that c1 > 2 yields a boolean, and given the definition of max above, we know that max of a boolean yields a boolean. More complicated examples follow the same reasoning, but the likelihood of ending up with TUnknown is of course much larger.

But not all is rosy. There are some disadvantages to using an “opaque” structure such as partial functions, rather than something we can explicitly inspect, like tuples and sets of them. Note that to some extent these disadvantages are inherent in all “shallow” embedded interpreters, which one might argue our type checker aligns with. I believe the main disadvantage is:

  • vague error messages (we don’t know exactly why a match failed!)

Consider,

josecambronero aquery$ echo "SELECT c1 * c2 FROM t WHERE max("a", 1)=2" |
 java -jar target/scala-2.11/aquery-assembly-0.1.jar -tc
Call Error: bad call to MAX at [1,29]

AQuery knows that the call to max in the WHERE clause is wrong, as the first argument is neither a boolean nor a numeric, but it cannot identify what exactly went wrong. This comes directly from not knowing why a partial function failed to match.

For example, the next example returns the same error message despite the different type of error. In this case, max is called with no arguments, which of course is not well-typed.

josecambronero aquery$ echo "SELECT c1 * c2 FROM t WHERE max()=2" |
 java -jar target/scala-2.11/aquery-assembly-0.1.jar -tc
Call Error: bad call to MAX at [1,29]

So, in essence, we have traded abstraction in the source code, which is incredibly helpful for our development purposes, for abstraction at reporting time, which yields ambiguous messages.

In the aggregate, I believe the advantages strongly outweigh the disadvantages, given that our location-based error message should provide enough of a hint for most cases to be fixed.

Note that type checking in the translator needs to be actively turned on (using -tc). This allows users to run code regardless of the output of our type checking process. The analysis performed can certainly be refined further, and I think it might be a good place for newcomers to contribute.

Telescope: Simulating lexical scope in q

(This post, and telescope, was inspired by a recent conversation with a friend).

One of the things that always bothered me with q, at least when starting out, was the lack of lexically scoped local variables in functions. To see why this might be annoying, let’s start from the ground up.

What is scope?
Scope can be roughly defined as the current name-value bindings visible at a given point. It can be defined statically (aka lexical scoping) or dynamically. To see the difference between the two, let’s consider a simple example in pseudo-code.


val a = 100
function f() { a }
function g() { val a = 200; f() }

g()

What value does g() return? Under lexical scoping, when f was defined, a was bound to the innermost binding available, which in this case was a = 100 on the line before it, so g() returns 100. Under dynamic scoping, we care about the latest value of a when the function is called, meaning that in this case the binding is a = 200, so g() returns 200. Things work similarly with nested scopes, with lexical scoping simply using the innermost definition available.
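
For reference, a language with lexical scoping, like Scala, gives the first answer:

val a = 100
def f() = a // a is bound to the definition in the enclosing scope, a = 100
def g() = { val a = 200; f() }

g() // 100: f ignores g's local a, since the binding was fixed where f was defined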

In q, things get a bit hairier. For example, f would have access to the global a, but if there were no such definition, then calling f would produce an error:


q)\a
`symbol$()
q)f:{a}
q)f[]
{a}
'a

In general, a function cannot access local variables in its enclosing scopes (the top-level scope and qualified namespaces being the exceptions). For example, the following case can yield quite a few headaches for the beginner q programmer:

q)a:100
q)f:{a:0;{a}[]}
q)f[]
100
q)a:200
q)f[]
200
q)delete a from `.
q)f[]
{a}
'a
q))

If a programmer wants the inner function in f to access a, then it has to explicitly pass in the value as a parameter to the function. So we can rewrite f as:


q)f:{a:10;{[a]a}[a]}
q)f[]
10
q)a:200
q)f[]
10

While simple, this can yield two issues:

1 – Less readable code (which means less maintainable in the long run)
2 – Running up against the limit of function parameters (though if this is the case you should probably rewrite your code anyways)

I’ve written up a simple proof-of-concept package called “telescope” that effectively performs the rewrite above for you. Telescope simulates lexical scoping by adding formal parameters and partially evaluating a function at definition time to guarantee static bindings of values.

The library takes a q function and returns the extended version. I also got clever with the name of the function and called it .tele.scope 🙂

It is quite rough around the edges still, but it was enough for a blog post.

For example, it has some limitations:

1 – The original function must be a valid q function, since telescope takes it as an argument. So while the following is tempting, it will not work:

q)a:100
q){a:a * 2}
'a
q)

2 – There can be no duplicate function definitions inside a function that is to be extended. So, for example, {a:10; {a}[] + {a}[]} won’t work. This definition is completely reasonable, and the limitation is something I want to fix. However, for now, having unique function definitions allows me to abuse q to find the variables defined in the immediate scope at the time of function definition.

3 – The extension of a function involves replacing the string representation of a function with a new function. This is done using q’s ssr. Unfortunately, this means that the function matched on must exactly match the one in the code, so extra spaces can sometimes be an issue. It is fairly easy to know if this is causing a problem, though, as the error should be akin to:


q).tele.scope {a:100; { a}}
k){$[99h=@y;sublist[x;!y]!sublist[x;. y];~0h>@x;$[.Q.qp y;.Q.ind[y];y]i+!"j"$0|x[1]&(#y)-i:*x;abs[x]<#y;x#y;y]}
'type
#
0N
(,;;:a100;,`{a})

Ok… all caveats aside, here are a couple of (in my opinion) nice examples:


// simple example, no global a
f:{a:100; g:{a * 2}; g[] + 4}
(.tele.scope f)[]~204

// bindings take place "statically"
q)a:1000
q)f:{a}
q)sf:.tele.scope f
q)a:2000
q)f[]
2000
q)sf[]
1000

// simple example, global a shadowed by local
a:1000
f[]~2004
(.tele.scope f)[]~204

// simple example, global and local a, lexically scoped (i.e. bound at def time)
f:{g:{a * 2}; a:100; h:{a}; (g[];h[])}
f[]~2000 1000
(.tele.scope f)[]~2000 100

// using namespaces, evaluates in normal q, but not the expected result
f:{.test.a:100;g:{.test.a};.test.a:200;h:{.test.a + 2};(g[];h[])}
f[]~200 202
(.tele.scope f)[]~100 202

// multiple levels of nesting
f:{a:100; {h:{a}; a:200; g:{{a+2}[]}; (h[];g[])}[]}
f[]~1000 1002
(.tele.scope f)[]~100 202

// works as expected with local variables of other types
f:{x+y}
{f:{x * y}; {f[x;y]}[100;200]}[]~300
.tele.scope[{f:{x * y}; {f[x;y]}[100;200]}][]~20000

And here is the code if you’re interested in taking a look

Putting together a simple multi-line input mode for q

As you may know (if you read symfun 🙂 ), I often program in a language called q.
It is a nice, very fast, terse, dynamically typed, interpreted language. One annoying thing, however, is that the interpreter doesn’t allow multi-line input. It does, though, let you set your own function to accept any user input… soooo, we can hack our own multi-line input mode 🙂

We can set up a very rudimentary (and indeed dangerous, since we don’t sanitize input and call eval parse on it) multi-line input mode. We set up a variable that accumulates input until an empty line is entered. Once that line is entered, we parse and eval the input, and finally re-initialize the input accumulator. The display of results is a bit clunky right now, but that is easy to fix as well (and besides the point of this post).


/ (re)initialize the input accumulator
.jc.init:{`.jc.cmd set ();}
/ evaluate the accumulated input (newlines replaced by spaces), printing any error and resetting
.jc.eval:{if[not x~(); :@[eval parse@;ssr[x;"\n";" "];{1"'",x,"\n"; .jc.init[]}]]}
.jc.init[]
/ input handler: on an empty line, evaluate and reset; otherwise keep accumulating
.z.pi:{if[x~(),"\n";0N!.jc.eval .jc.cmd;:.jc.init[];]; .jc.cmd,:x;}

And here it is in action:

[Image: Screen Shot 2016-06-04 at 8.23.06 PM]