First year takeaways

I’ve pretty much neglected this blog since last summer, but with the current school year over (yay!) I thought it would be a good time to do some retrospective blogging.

I had the pleasure of collaborating with a various great people on research projects this semester.

Jack FeserMicah Smith and I collaborated (under supervision of Sam Madden) on a database for dynamic missing value imputation and wrote up a paper, which after a round of review-based revisions, turned out quite nicely I think (we still have to hear back on this latest round of reviews).

I also worked with  Phillip Stanley-Marbell and Martin Rinard (my advisor) on two interesting projects related to color perception. The first project constitutes the first large-scale color perception study across different countries, based on iOS data, while the second used the same dataset to perform customized image compression. We wrote up the second of these and submitted to a small conference, and are in the process of polishing off the first before submitting to a conference venue.

I think I managed to get a lot out of these three research experiences. Working with great people pushed me to try to improve my own skills and bring more to the table. I wanted to sum up my thoughts and personal lessons in a list of observations. As a disclaimer, I’m certainly not trying to give advice to anyone else on this. I have no real qualifications to do so, and a year of phd program is hardly a long time as far as experience goes, but I thought I’d like to keep this for posterity for myself (and maybe it will help some other first-year in the future). So some takeaways from all of this (in bullet form, and in no particular order):

  • Revisiting ideas is extremely worthwhile. Our first stab at the image compression project flat out failed. I had the wrong approach to it. The second time around, rather than stabbing in the dark, we brainstormed about the core issue with what I was doing and our goals, and tried to find a solution that bridged the two. The second approach that came out of this worked.
  • Many systems make tradeoff decisions for the user. Exposing these tradeoffs and letting the user choose for themselves feels like a cleaner approach. Indeed, these tradeoffs themselves can be the most interesting part of your system. Our database work showed that formulating these tradeoffs in a rigorous way can be a novel contribution itself.
  • Writing up the paper, as you work through the problem, can help  direct your efforts. It is basically a concrete way of tackling the question: “what do I need to show to convince a colleague of the novelty/effectiveness/impact of my idea/technique/system?”. This isn’t my own observation, but rather the product of great advice from Phillip and a Simon Peyton-Jones video he recommended. I highly recommend doing this and plan on doing this going forward.
  • Related to the point above: spending time writing up your problem statement before diving in. This helps formalize the problem, which in turn can expose good approaches to tackling it, or point you in the right direction. This is an approach Martin emphasized from day one, and I think it has a lot of real benefits.
  • Have two projects going simultaneously. When I had a single project, getting stuck or not feeling excited about the problem was a big drag on my productivity. When I started working on two projects at a time, I could just switch to the other problem whenever I got tired of one. This made work a lot more enjoyable and productive. Some people may be able to manage more than two projects, but for me three started to get too hectic.
  • In terms of school work/classes, I had to make a big effort to switch the way I approached classes. In undergrad, and even in my masters, courses were a big focus, so it always made sense to spend additional time polishing homeworks/projects or studying for a test. Although there were diminishing returns on that time, it felt worthwhile. This time around, I’ve tried to switch my approach. I put in enough time to make sure I understand the core idea or takeaway, but after that, a couple of points are not really worth it, and that time is better spent on research.
  • Volunteer to organize at least one event for your department or lab. Advisors seem understanding of the fact that you are getting used to new work, a new environment, and potentially, a new field or research area. This means research output is not necessarily the right metric for the first year. This also means that you can volunteer for events that you might not want to (or be able to) once you are steeped deep in your research work. I had the pleasure of collaborating with a bunch of great people this semester on organizational events. In doing so, I met other students (junior and senior), faculty members, and community members that I wouldn’t have otherwise had the chance to meet as easily. Candace Ross and I helped put together the first diversity panel at the EECS admitted students visit weekend, and along with Ivan Kuraj, I helped plan the 2017 programming languages offsite.
  • Recognizing that different people have different research styles and tastes is incredibly useful. For example, I’ve learned that, at least at this point in my grad career, I like starting out with a concrete problem  (e.g. missing values in a database are a pain to work with) and then letting the problem steer me to the appropriate solution. Other people may like the idea of developing a core technique, and then finding a problem that it solves. While the latter sounds cool to me, I have a hard time conceptualizing solutions without concrete and motivating examples.
  • I’m making a conscious effort not to stress or place emphasis on producing “worthwhile” results. By that I mean: results aren’t just the final product of your research. Getting the database setup for your project, or finally setting up your VM for experiments, can also be “results”. This may seem like I’m simply redefining the meaning of results to suit me and feel positive. But it’s actually amazing how much of a psychological impact this redefinition had on me. During my first couple of months, I struggled with the fact that I wasn’t “producing results”. Eventually Martin pointed out that all of these small things were results, and that productivity took many forms. This insight was momentous for changing my whole mindset for the Spring semester.
  • A related point: don’t think about research as producing results for your advisor, produce them for yourself. This isn’t to say that you should ignore specific requests from your advisor. Clearly, they are (likely/hopefully) providing you with funding and the opportunity to work with them, so you do have a responsibility to deliver on that. However, the results you produce, and the work you do, should be for yourself. There is a large opportunity cost in attending graduate school, and you should make sure you do not short-change yourself by pursuing topics that you are not interested in. Martin has made this point to our group repeatedly, and is one of the points I think I have appreciated the most. I realize this is an enormous privilege to have, and I am working to deliver on that opportunity.
  • A less positive point, but one that I think is very important: own your mistakes. I fudged up some data pre-processing for our database evaluation. We figured this out after Micah and Jack had spent a non-trivial amount of time dealing with frustrating queries that didn’t match up with expectation (based on what they thought the preprocessing had been). In the end, we realized there had been a miscommunication, along with poor documentation on my part. This was a good lesson and one that I’m happy worked out with no lasting negative impact on others or myself.
  • Related to the prior point: own your mistakes, but also move on. There is no point in dwelling on something messed up. You fix it, apologize, take note for the future, and move on. Anything else is just bad for productivity and (more importantly) well-being.
  • Asking colleagues for help when you’re stuck, and similarly providing help when you’re in a position do so, is one of the greatest benefits of being part of an academic community. I’ve been beyond pleased to see how open and helpful others are. I like to think I’ve been of help to others as well, but for the coming year I’m making a conscious effort to offer my help whenever possible.
  • A bit more mundane: I’ve started using a time management application (Asana) to write out my TODOs by day and group them by project. It may be overkill for my purposes, but I have squeezed out a lot of order from my days this way. In general, I’ve realized that having two major TODOs per day, along with smaller tasks, is a good way to split up my time. I make an effort to make at least one of those major TODOs be research-related. Major TODOs in my mind are things like a significant amount of coding, writing up a problem, or setting up an evaluation benchmark etc.

Those are the main points, off the top of my head. I’m looking forward to this summer. I’ll be staying at school and getting started on a research project I’m very excited about. Hopefully I’ll get a chance to  write up at least one blog post along the way.

Orderly: A Simple DSL (Code Generation/Execution)

In this last post in the series surrounding the upcoming kx talk we’ll discuss the step of code generation (in the case of the external DSL) and execution  (in the case of the embedded one). The prior two posts, dealing with syntax and semantic analysis, respectively, can be found here and here.

We’ll discuss the embedded DSL first, given the simplicity in this case. Orderly is a simple language. It simply allows us to express market orders and their insertion into an order book in a simple syntax, with various semantic checks that might be relevant for our business purposes. Given its simplicity, the interpreting function, which takes our DSL and actually provides it with meaning,  is equivalently simple. In this case, it is worth pointing out that the “interpreting” function need not be actually an interpreter. It can just as well generate code that is compiled down and then called. Just one thing worth keeping in mind.

So in our case, all we do with Orderly’s AST (represented as a q list) is extract the fields to create a table row representing the order and a table name into which to insert the order.

This can be succinctly expressed in q as:

toTable:{flip `side`volume`units`ticker`px`cond`client!(),/:x}
add:{[ast] (last ast) upsert toTable 7#(-1 _ ast),`self}
interpret:add

For clarity (and pedagogical reasons), we’re dogmatic about what we name functions here, so our “interpreting” function is quite literally called interpret.

We add another useful function to the interpret.q file, which although not used to perform any translation of the AST, is useful for use with the resulting Orderly tables.

satisfies:{[t;env] select from t where first each @[;env;0b] each cond }

It is meant to easily filter market orders that satisfy certain the requisite conditions. We wrap with some error-trapping logic for generality.

Here is an example execution

internal-execution

 

This brings us to some of the dos and donts for this part

  • DO: add additional functions that may be of interesting for the domain but don’t need to be represented syntactically
  • DONT: perform drastic AST transformations in this phase. Such changes should be pushed into prior stages.

Now, for the external implementation, we generate q code. The user is free to read/modify the code. We can then feed the code as a normal q file to the interpreter for execution. This is the approach we take in AQuery. As usual, there are a series of tradeoffs here, some of which we’ll encounter as we explore the implementation below.

The insertions create code fairly similar to what the internal implementation uses. The only meaningful difference is when the external implementation optimizes a series of insertions into a single bulk insertion. This results in inter-table writes potentially being moved around relative to the original file, but all intra-table writes take place in the same order as the original.

In generating code, we need some helper functions, mainly those related to the deferred semantic checks we discussed in the prior post. For readability, we choose to use functions rather than insert the code-inline. For maintainability, we have chosen to keep these functions in a resource file in our source code directory and we insert these as a form of “prelude” in each generated file.

Here is an example of two generated orderly files (written to stdout). The first doesn’t optimize (i.e. rewrite) to bulk inserts, the second does.

external-generation

external-generation-optimized

Finally, here is an example showing the deferred semantic checks in action

external-delayed-error
This brings us to DOS and DONTS for the external implementation:

  • DO: generate intelligible code. This can mean factoring out code that might otherwise be inlined (a smart option would be to make this a command line option for the translator), it may (will?) mean adding comments to the generated code.
  • DO: include any helper code as a prelude to the original generated code. This allows easily movable files, which for a small DSL, will be appreciated by your users
  • DO: write the code using an implementation language that lets you create clean and maintainable generator code, using constructs such as string interpolation to splice together  generated code

That’s it for this tutorial series. I’ll be discussing the code and slides (to come) at the Kx meetup coming up on September 22nd, 2016. Feel free to stop by, ask questions, chat or just say hi!

Orderly: A Simple DSL (Semantic Analysis)

In this second installment of the Orderly series (post 1 here), we’ll discuss semantic analysis. At this point we’ve already parsed our text and determined whether or not it satisfies our language’s grammar. If it satisfies it, we “accept” it, otherwise, we reject it (hopefully with some sort of error messages describing what issues arose).

In the semantic analysis stage, we will decide whether the program given makes “sense”. This may include tasks such as type analysis, which aims to protect users by making sure that all operations on different types of data are defined and appropriate for that data type. It can also include things like analysis passes that perform  transformations on the program representation. The line between this and later optimization stages is often blurred.

In our case, for Orderly, we’re not really performing type checking, since the only arbitrary code is found in the modifier clause (where we allow q code). However, we do perform a couple of checks such as making sure that the prices and volume of a market order is nonzero. We also check that the q code provided corresponds to a monadic (here meant as a single argument) function. Note that this is an arbitrary restriction that we came up with, but is simply meant to be illustrative. Finally, we will change modifiers that correspond to a date to become a lambda that check for that date. This is an example of “uniformization” in our AST and can lead to simpler code generation later on. There are cases where this is not ideal, but at least in my experience, it has allowed me to keep a small core of code generation functions and defer most constructs to some combination of them.

We’ll discuss a couple of interesting points for both the internal and external implementations.

For the internal implementation, we got a bit cheeky and added a simple assertion function to  the .q namespace, which allows us to use it in infix form in our main checking function.  However, we atone for this sin by dropping it from .q after defining our main check function.  This then means users can no longer use it as infix (so we’re safe), but the main check function has been defined, so our purpose for it has been completed.

Here is the check function code in the internal implementation (one might say this reads as a DSL itself, although here the only convenience is syntactic):

check:{
  getVolume[x] should be ({x > 0};{"Volume should be positive"});
  getPrice[x] should be ({x > 0};{"Price should be positive"});
  x:setModifier[x;] wrapDate getModifier x;
  getModifier[x] should be (isUnary;{"Expected unary function"});
  x
 }

Here are two examples of the new semantic analysis in the internal implementation. Note how in the first, the language complains as “sum” is not a monadic function as expected (well…it is, but we explicitly rule out built-ins such as this, since it’s unclear what their meaning would be in an Orderly modifier-clause):

internal-analysis-error

The second example shows how we transform a date-based modifier clause in Orderly to
a single-argument lambda, which checks that the date corresponds to the original date
provided.

internal-transform-date

For the external implementation, there are two interesting points to make. First, consider that at least one of our checks (whether a function is monadic) cannot (always) be performed statically. We provide a simple solution: insert the check into the  code generated (which is not unlike dynamic type checking). However, one important point here is that the runtime check should have at least some context of the source location of the error to return to the user.

We can see in the picture below, that we’ve added this runtime check in a Verbatim AST node (used in Orderly to capture q-code meant for direct insertion into the code generated).

external-embed-error-message

The second interesting point for the external implementation, is that we have global knowledge of all the insertions into the various order books that will take place in this Orderly  file. This allows us to perform a very easy but worthwhile rewrite: change a series of single market order insertions into a series of bulk order insertions. This has the effect of modifying the order books in larger chunks, which of course should have a positive effect on runtime as any memory allocation for the constantly changing size of the order book can  be reduced. Additionally, we allow the underlying language (q) to take advantage of vector operations for these insertions. Note the speedup in the trivial example below, where the first timed operation is a series of single insertions and the second is a bulk insertion.

single-vs-bulk-insert

So some general takeaways/advice for this stage are:

  • DO: provide useful error messages (this should probably just be a DO for all stages)
  • DO: check as much before runtime as possible
  • DO: take advantage of global knowledge to perform global transformations
  • DO: pick an implementation language (for external implementations) that allows nice rewrite notation (e.g. pattern matching in functional languages)
  • DONT: write non-composable rewrites. If a pattern doesn’t match, then applying it to an operand should be equivalent to identity. In Scala, we can define these rewrites as partial functions (defined only for certain patterns) and lift it to a total function that simply returns the original operand for otherwise undefined operands.
  • DONT: rely on later stages to perform transformations you can perform now and encode in the AST directly.

The code for both the internal and external implementations can be found here.

The next post will address code generation for our simple language.

Orderly: A Simple DSL (Syntax Phase)

As some of you may know, I’ve been given the chance to present some of my work with Dr. Shasha at the upcoming Kx user meetup (September 22nd, 2016). The talk is meant to address two main points:

  • AQuery as a language and its optimizer
  • Discuss implementing new languages targeting q/kdb+ for execution.

I’m scheduled to focus on (2), while Dr. Shasha will discuss (1). The idea for (2) will be to structure it as a small tutorial, where we present dos and donts. To do so, we’ll be discussing a simple DSL (domain specific language) that should be simple enough to handle in 20-30min, and yet provide enough discussion points.

Since I have to go ahead and prepare this material in advance, I figured it didn’t hurt to make some accompanying blog posts.

So in this blog post, we’ll start from the bottom: we’ll introduce our DSL, discuss various implementation approaches, and then provide the syntax analysis phase (i.e. parsing) for the two approaches.

Implementing DSLs

DSLs are commonly implemented in two main ways: as external languages or embedded within another general purpose programming language. Although the exact definitions of these approaches varies by author, the general idea is that the external implementation has a  standalone parser, analysis, and code generation/execution. Meanwhile, internal implementations (aka embedded) are developed within the context of an existing language. Embedded implementations can be roughly broken down into two categories (although we’ll see that these are not necessarily clear cut, nor should we always want them to be so): shallow and deep embeddings.

Shallow embeddings do not construct their own AST but  rather rely on the semantics of the host language to directly evaluate them (usually as  function calls). The line between a shallow embedding and a well crafted library API can be thin at times. A deep embedding, in contrast, constructs an AST, which it is free to analyze and transform, and then relies on an “interpreting” function to  actually provide semantics to it.

We’ll see that in our case, our “external” implementation borrows from the embedded domain, as does our internal implementation.

Our Case Study: Orderly

“Orderly” is a simple language meant to encode (fictitiously simple) market orders and insert them into a main order book.

Some examples of orderly code might be:

 sell 1000 shares of IBM at $50 on 05/16/2016 for BAML -> order_book
  // uses an embedded q fun
 buy $1e6 of AAPL at $100 
 if 
"{[env] 10 > first exec avg close_price from env where sector=Tech}"
 -> order_book

Here we can see that Orderly allows us to use q functions directly (with some restrictions that we’ll see later on). Already, we can see that the barrier between our language and the host language is thinner than might be usual, even for an external implementation.

The orderly grammar is quite simple. We reproduce it below in EBNF format, for reference:

 <program> ::= (<order> "->" <ident>) +

 <order> ::= <side> <vol> "of" <ident> "at" <num> <modifier> ("for" <ident>)?

 <side> ::= "buy" |"sell"

 <vol> ::= <num> "shares" |"$" <num>

 <mod> ::= "if" "<monadic-fun-in-q>" | "on" <date>

 

Implementations

For our external implementation, we use Scala and take advantage of the elegant parser combinator library to craft a simple parser, which provides decent error reporting.

In our internal implementation, we write a simple parser in q, with a naive tokenizer (which splits tokens on whitespaces), and build in some error reporting.

The main takeaways for the syntax analysis phase are:

  • Generate useful parser errors (with context/source location if possible). There is nothing more frustrating as an user to get an error and have no clue how to fix it.
  • Parsing code should be clear and extendable. This is key for DSLs that might evolve (and most will). In the external case, the parser combinator library fits all these points. In our internal implementation, our code is clear, but its extensibility is debatable. I have hopes that someone will contribute a nice parser combinator library for q in the future.

 

Error Reporting Examples

The external implementation and the internal implementation both try to provide useful error messages for parse errors.

external-parse-error

Searching in the q interpreter

(Disclaimer: I just realized this misses built-ins that are not defined in .q, e.g. max/min… I suspect this could be easily extend to catch those, but fyi!).

My memory is not particularly good. Unfortunately, this means I tend to forget what I’ve named variables/functions etc in programs. The same is true with APIs for libraries that I use rarely. With IDEs, this is not usually an issue, as the vast majority of them offer autocomplete/search functionality. For some languages, this functionality is also built-in to the REPL (when available).

For example, in Scala, you can hit tab to find possible autocompletions.


scala> val l = List(1,2,3)
l: List[Int] = List(1, 2, 3)

scala> l.f
filter   filterNot   find   flatMap   flatten   fold   foldLeft   foldRight   forall   foreach

The autocompletion here is quite nice, as it is refined based on the type information.

While this is great, I would even settle for something much simpler when writing q code.

Let’s consider the follow q code


a:100;
ab:([]c1:1 2 3)
.f.g:{x}
.f.gh:{x}
.f.a:100
.f.ab:{x}
.f.abc.b:100
f:{x}
fg:{x}
mab:1000

In q, a somewhat similar functionality is offered by the kdb-autocomplete package for Atom. However, this still has some limitations (if I’m wrong on any of these please feel free to point out, as I’m not super familiar with the package):

  • it only seems to offer autocomplete for names when using the qualified
    name (i.e. including at least part of the namespace), unless they are built-ins

autocomplete-built-in

  • it doesn’t provide type information for each of the possible autocompletions (although it does provide some nice comments for built-ins)

autocomplete-no-type-info.jpg

  • it is restricted to the Atom editor

The last restriction, particularly, makes it hard for me to use it, as I usually write most of my code in the REPL (as I play with it), and then copy it back to the editor once I’ve polished it down to a reasonable version. So the autocomplete/search is missing where I need it the most!

Given this, I put together a small prototype in q, which provides some search functionality. It takes the last term in the current input and searches for possible matches in the appropriate namespace. It then provides a table of possible matches with some type info (this info could be extended to be more helpful, ideally, we would try to follow the Scala example and refine searches based on type info, if possible… but this might be bit nastier here since the types are discovered at runtime…but anyways).

To “search”, all you need to do is preface your code with the keyword: search, as we’ve attached a simple function to .z.pi, which handles input in q.

So let’s give it a try for our h:{2 + .f.} definition from before

q)search 2 * .f.
n  | tag  t
---| --------
a  | vars -7
ab | funs 100
abc| vars 99
g  | funs 100
gh | funs 100
q)search 2 * a
n | tag  t
--| -------
a | vars -7
ab| ts   98
q)2 * a // this evals normally
200

There we go, so it provides all the options, marked with an appropriate tag. Note that nested namespaces (such as .f.abc) show up as dictionaries (as expected) in their type.

We can of course use this to search builtins, just by using the .q. namespace. Otherwise note that it searches the global namespace `.

q)search ma
n  | tag  t
---| -------
mab| vars -7
q)search .q.ma
n   | tag  t
----| --------
mavg| funs 100
maxs| funs 108

For those interested, the q code is available in the blog github repo.

Trampolines in q: Jumping around

A tail call is when a function calls another function as the last step right before returning. Tail call elimination refers to an optimization that avoids growing the stack by reusing the last stack frame for the next function call. Given that we have “taken everything we need from” a given call, we are free to overwrite its frame.

Now, tail call elimination is often implemented for tail recursive calls (which is just a case where the two functions referenced above are one and the same). However, this means that mutually recursive functions (for example if f calls g last and g calls f last, and f != g) can still overflow the stack.

This is elegantly addressed by this paper which discusses a general fix for this
and implements it in scala: Trampolines (and more generally the Free monad). The idea isn’t new, but it is very well explained in the paper.

I won’t discuss the latter, as the truth is I’m still learning about monads myself, but I will provide a simple implementation of a trampoline in q.

We can use the trampoline technique by creating two “datatypes”: more and done. In q, we can’t really create new datatypes, but we will settle for pairs of type-label and value.

More has a thunk (i.e. an unevaluated function that takes zero arguments), while done contains a value.

Here they are:

// data constructors
more:{[f;x] (`more; {[f;a;d] f[a]}[f;x;])}
done:(`done; )

Now, we need our main run function (i.e. the interpreter for our datatypes). This function is tail-recursive, so the q interpreter takes care of the optimization for us.

// trampoline
run:{{x[1][]}/[{not `done~x 0};x][1]}

All it does is repeatedly evaluate thunks until we hit a done.

Now, let’s experiment. Similarly to the paper above, we pick a trivial problem that clearly shows the issue.

Consider the pair of functions even/odd. Even calls odd and viceversa.

even:{$[0=x;1b;odd x-1]}
odd:{$[0=x;0b;even x-1]}

A simple call shows that we run out of stack:

// this runs out of stack
q)even 1e6
'stack
@
{$[0=x;0b;even x-1]}
997999f

Meanwhile, our evenT/oddT, which employ the trampoline, will not run out of stack

evenT:{$[0=x;done 1b;more[oddT;x-1]]}
oddT:{$[0=x;done 0b;more[evenT;x-1]]}
q)run evenT 1e6
1b

If we check what evenT 1e6 is, we can see that it is really just “the recipe” for the next
iteration, wrapped in our more constructor.

q)evenT 1e6
`more
{[f;a;d] f[a]}[{$[0=x;done 0b;more[evenT;x-1]]};999999f;]

The paper referenced before goes into more depth and creates a “trampoline monad”, which turns out to be a specific case of a more general tool (free monad). It provides an outline/hints for how a compiler might automatically rewrite tail calls using this tool to avoid stack overflows. All in all, pretty neat I think.

AQuery: Rewrites through Partial Functions

As mentioned before, one of the motivations behind rewriting the AQuery system was to provide a cleaner, more maintainable, and extensible system. In doing so, I drew inspiration from a variety of sources. One of the main sources was Spark SQL’s Catalyst optimizer, which elegantly implements AST rewrites using partial functions. The main idea of a partial function for rewrites is that the RHS of a case is evaluated when the case matches, and otherwise “nothing” happens. In the background, we lift the partial function to return an option, returning Some(rewrite) or None if no match, in which case we simply put back the original expression. I encourage you to take a look at the Spark SQL SIGMOD paper, which does a nice job at outlining their system.

In this quick blog post, I would like to show how you can do the same in AQuery to implement a simple rewrite: we would like to combine adjacent selections into a single selection. This might be the case if we wanted to generate q code for a selection, as the language already executes selections sequentially. Combining the selections in this case helps us reduce the size of the output code.

AQuery queries are a composition of various relational algebra operators (along with some extensions).

For example, we can create a query as follows:

import edu.nyu.aquery.ast._
import edu.nyu.aquery.parse.AqueryParser._

// projection
val p = Project(_:RelAlg, Nil)
// two sets of selections
val f1 = Filter(_:RelAlg, 
   parse(rep(predicate | expr), "c1 > 2 f(x) != 100 10 LIKE 100").get
)
val f2 = Filter(_:RelAlg, parse(rep(expr), "c0 * 2 > 2").get)
// a sort
val s = SortBy(_:RelAlg, List((Asc, "c1"), (Desc, "c2")))
// 2 more selections
val f3 = Filter(_:RelAlg, parse(rep(expr), "c0 c1 c2").get)
val f4 = Filter(_:RelAlg, parse(rep(expr), "h() = 100").get)
// source table
val t = Table("t1")

// compose our operators
val make: RelAlg => RelAlg = List(f1, s, f2, f3, f4, p).reduceLeft(_ andThen _)
// make a query by applying it to a source table
val q = make(t)

We can visualize our query by using the Dot object, and simply wrapping our query to make it a top-level language construct.

println(Dot.toGraph(Query(Nil, q) :: Nil))

digraph G {
0 [label="AQuery Entry"];
0 -> 1;
1 [label="full-query"];
1 -> 2;
2 [label="project "];
2 -> 3;
3 [label="where h()!=100"];
3 -> 4;
4 [label="where c0, c1, c2"];
4 -> 5;
5 [label="where c0*2>2"];
5 -> 6;
6 [label="sort: Asc c1, Desc c2"];
6 -> 7;
7 [label="where c1>2, f(x)!=100, LIKE(10,100)"];
7 -> 8;
8 [label="t1"];
}

Now, we can write our rewrite rule

val rewrite: PartialFunction[RelAlg, RelAlg] = {
  case Filter(Filter(src, fs1, _), fs2, _) => Filter(src, fs1 ++ fs2)
}

Our rewrite can be read off directly from the code: when there are two consecutive filters, combine them into a single one by concatenating their selection expressions.

We now apply the transformation and look at the DOT graph again

println(Dot.toGraph(Query(Nil, q.transform(rewrite)) :: Nil))

digraph G {
0 [label="AQuery Entry"];
0 -> 1;
1 [label="full-query"];
1 -> 2;
2 [label="project "];
2 -> 3;
3 [label="where c0*2>2, c0, c1, c2, h()!=100"];
3 -> 4;
4 [label="sort: Asc c1, Desc c2"];
4 -> 5;
5 [label="where c1>2, f(x)!=100, LIKE(10,100)"];
5 -> 6;
6 [label="t1"];
}

Note how the consecutive selections have now been merged.

As I begin working on the AQuery optimizer I think this approach will pay dividends in terms of simplicity. I’ll write up any particularly interesting cases as I encounter them.