Hello there, my name is Jônatas Davi Paganini and this is my personal blog.
I'm a developer advocate at Timescale, and I also have a few open-source projects on GitHub.

Check my talks or connect with me via LinkedIn / Twitter / GitHub / Instagram / Facebook / Strava / Meetup.


Grepping SQL Code like a boss

Here's the outline of the talk I presented at Lambda Days 2024 in Poland.

My talk presents a research project I call "grepping SQL code like a boss", introducing a set of tooling that can not only search but also refactor SQL code.

This is most of the talk outline, for those folks who want to learn more about the topic.

It will be a mix of Ruby and SQL code examples. I will show you how to use the fast gem to build your own tools to search and refactor SQL code.

The fast gem is a tool that leverages ASTs to provide advanced search and refactoring capabilities. It already supports Ruby and recently I introduced support for SQL code.

My main intent is not to "sell" my library but the idea behind it. You can easily build something similar in your language or even for your SQL dialect.

2 decades in the terminal

Yes, not only was the presentation made in the terminal, but I built my entire career in the terminal. IO is simple and easy, so I can focus on the content and not the tools.

I’m already familiar with this toolset and I can build my own tools to help me and enhance my productivity.

Learn simple tools with a single purpose

I love how Unix tools are simple and have a single purpose. My favorite minimalist examples are cat, head, wc, sort, uniq, grep, sed, and echo.

Markdown is a great example of simplification. Look how easy it is to count the topics of the talk in the terminal:

grep '^# ' talk-sql.md | wc -l # => 60

Now, going deep in regular expressions we can also count the examples of Ruby, SQL and so on…

grep '```ruby' talk-sql.md | wc -l # => 29
grep '```sql' talk-sql.md | wc -l # => 5
grep '```bash' talk-sql.md | wc -l # => 9

Now, let me share a past project that encouraged me to follow and continue investing in this idea.

Killing the PR review checklist

I have very extensive experience onboarding developers: I onboarded 300+ backend developers in 5 years.

One of the most boring parts of onboarding is aligning quality expectations, and that's why the review process involves CI with lots of linters plus a huge PR review checklist. You can expect things like:

  • Check the code style
  • Check the usage of the framework components
  • Check the broken links in the documentation
  • Check the performance
  • Check the security
  • Check the architecture
  • Check the design
  • Check the business logic
  • Check the requirements

So, I was eager to reduce the friction of introducing newcomers by giving them fewer things to get ready before their first Pull Request. The plan started with automating item by item, so the newcomer does not have to face "Don't use A instead of B" while they're not even familiar with the codebase.

Introducing a set of custom linters related to the business domain was the way to properly guide developers to use the right framework components.

Automate the education

Pull Requests can carry an insane amount of bureaucracy during review. From CI to comments and reviews, the more automation we add, the less human interaction is needed, and no developer has to take the feedback personally.

It also allows developers to have more interesting conversations during the review, instead of a few driven by the checklist and other bureaucracies.

Build smart stuff with simple tools

The AST brings the most primitive stage of compilers onto the scene. Tagging word by word, or token by token, it simply structures code into a tree format.

You really don’t need advanced tools to build smart stuff. You can build most of it with simple tools.
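For instance, Ruby ships with Ripper in its standard library, so you can inspect the tree of any snippet without installing anything (Ripper's node layout differs from the one fast uses, but the idea is the same):

```ruby
require 'ripper'

# Ripper turns Ruby source into nested arrays of tagged nodes and tokens.
pp Ripper.sexp("1 + 2")
# => [:program, [[:binary, [:@int, "1", [1, 0]], :+, [:@int, "2", [1, 4]]]]]
```

Each token carries its type, value, and position, which is all you need to start building search tools on top of it.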

I’m a Rubyist

Yes, I’m a Rubyist. I love Ruby. I love the community. I love the language. I started the fast gem to grep Ruby code like a boss. Recently, I extended it to SQL.

You can check my previous talk on YouTube, in which I grep Ruby code like a boss. It was for RubyKaigi 2020, which I couldn't attend because of the lockdown, but I was able to record it and participate remotely.

I also maintain the timescaledb ruby gem which wraps all TimescaleDB SQL features into simple macros: github.com/jonatas/timescaledb

I’m becoming a Postgrist

I'm a Rubyist; I'm not sure the word "Postgrist" exists, but I'm coining it anyway. I'm learning a lot about PostgreSQL and TimescaleDB.

I split my time between community management and developer advocacy.

Adventures with fast

Now, several years after fast was born, I’m extending it to SQL. I’m building a base prototype for SQL linters or refactoring tools.

  • https://github.com/jonatas/fast
  • https://rubygems.org/gems/ffast

    $ gem install ffast

You can use exactly the same syntax for building node patterns on both SQL and Ruby. The trees are different but follow the same structure, so it's a matter of learning the new node types and you're ready to go.

The Value of Trying

I really believe in the value of trying. I’m a researcher, trying new things. I’m always scratching my own itch.

Experimenting = growth

Growth only happens when you learn from your experiments.

I built a tool named fast to grep Ruby & SQL code like a boss.

  • My journey with “compilers”, ASTs, and code analysis
  • My journey with SQL
  • Advanced search and refactoring

The Power of Regex

Regular expressions are a powerful tool for searching and manipulating text. They are widely used in text editors, programming languages, and other tools.

"1 + 2".scan(/\d+/) # => ["1", "2"]

Regex operators

A few operators as a reminder of how it works:

  • \d - digit
  • + - one or more
  • * - zero or more
  • ? - zero or one
  • | - or
  • () - capture group
  • [] - character class
  • {} - quantifier
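Here's a quick Ruby refresher tying a few of these operators together (the date string is just an arbitrary example):

```ruby
date = "2024-02-13"

date.scan(/\d+/)                  # + : one or more digits    => ["2024", "02", "13"]
date.match?(/\d{4}-\d{2}-\d{2}/)  # {} : exact quantifiers    => true
date[/(\d{4})-/, 1]               # () : capture group        => "2024"
"color colour".scan(/colou?r/)    # ? : optional letter       => ["color", "colour"]
"cat dog".scan(/cat|dog/)         # | : alternation           => ["cat", "dog"]
"bag".scan(/[abc]/)               # [] : character class      => ["b", "a"]
```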

Limitations of Regex

Regex can be difficult to use for complex searches and refactoring tasks.

If you need to build a "search and fix" scenario, for example, even when you find the target code, it may be in the wrong context. So, many times, regular expressions are not a silver bullet for targeting text.

Capturing and reusing previous information while ignoring undesired scenarios is hard. Nested contexts play a big role in refactoring, often requiring you to capture groups of information.

Sometimes, regex is not enough to solve complex problems. If you’re trying to parse a programming language, you need to deal with nested structures.
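Here's a tiny Ruby illustration of that limitation: a naive character-class regex cannot pair nested parentheses, so it captures the wrong span (the f(g(1), 2) input is just a made-up example):

```ruby
# Trying to grab "what's inside the parentheses" with a regex:
"f(g(1), 2)".scan(/\(([^)]*)\)/)
# => [["g(1"]]  -- it stops at the first ')', splitting the nested call
```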

Now, it's time to start exploring how to build a micro-engine for a derived regular expression language that can match elements directly from the AST.
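As a first taste of that idea, here's a toy matcher over sexp-like nested arrays. This is a hedged sketch of the concept, not fast's real implementation: :_ matches any non-nil value, an array matches a type plus children, and everything else must compare equal:

```ruby
# Toy node-pattern matcher over sexp-like nested arrays.
def node_match?(pattern, node)
  case pattern
  when :_ then !node.nil?                       # wildcard: anything not nil
  when Array                                    # (type *children)
    node.is_a?(Array) &&
      pattern.size == node.size &&
      pattern.zip(node).all? { |pat, child| node_match?(pat, child) }
  else
    pattern == node                             # literal: types and values
  end
end

sum = [:send, [:int, 1], :+, [:int, 2]]         # AST of "1 + 2"
node_match?([:send, :_, :+, :_], sum)           # => true
node_match?([:int, :_], sum)                    # => false
```

A dozen lines already gets you wildcard matching; captures and alternations are small extensions of the same recursion.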

The AST Advantage

Abstract Syntax Trees are made for programming language internals. They provide a more powerful and flexible way to search and manipulate code. In this example, I'm using the ffast gem, which uses the parser gem to parse the Ruby code.

Fast.ast("1").type # => :int
Fast.ast("1").children # => [1]

Every node in the AST has a type and children. The type is a symbol, and the children are an array of nested nodes or literal values.
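To make that concrete without the gem, here's a hypothetical Node struct with exactly those two fields, plus a recursive walk that collects every node type (an illustration, not ffast's actual class):

```ruby
# Minimal stand-in for an AST node: a type symbol plus children.
Node = Struct.new(:type, :children)

# AST of "1 + 2": a :send node with two :int children and the :+ selector.
tree = Node.new(:send, [Node.new(:int, [1]), :+, Node.new(:int, [2])])

# Walk the tree depth-first, accumulating each node's type.
def node_types(node, acc = [])
  return acc unless node.is_a?(Node)
  acc << node.type
  node.children.each { |child| node_types(child, acc) }
  acc
end

node_types(tree) # => [:send, :int, :int]
```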

The AST in Ruby

There's an interesting way to represent the AST in Ruby. It's called an S-expression, AKA symbolic expression. It's a simple way to represent the AST in a human-readable format.

puts Fast.ast("1")

Output:

(int 1)

The S-expression is the base inspiration for the node pattern we're going to use to search and refactor the code.
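To see why the format is so readable, here's a toy printer (an illustration, not ffast's actual code) that renders [type, *children] arrays as s-expressions:

```ruby
# Render a [type, *children] array as an s-expression string:
# the type is printed bare, literal children via inspect, nested
# arrays recursively.
def to_sexp(node)
  type, *children = node
  parts = children.map { |c| c.is_a?(Array) ? to_sexp(c) : c.inspect }
  "(#{[type, *parts].join(' ')})"
end

to_sexp([:int, 1])                          # => "(int 1)"
to_sexp([:send, [:int, 1], :+, [:int, 2]])  # => "(send (int 1) :+ (int 2))"
```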

Let’s explore more examples to get familiar:

AST of summing two numbers in Ruby

puts Fast.ast("1 + 2")

Outputs the S-expression, aka sexp, in Ruby:

(send
  (int 1) :+
  (int 2))

Search in the AST

The Abstract part of the Syntax Tree unifies the code syntax. It’s a common way to represent the code in a structured format.

Fast.ast("1 + 2.0").search("int")        # => [s(:int, 1)]
Fast.ast("1 + 2.0").search("(float _)")  # => [s(:float, 2.0)]

Similar to the sexp output, you can build nested searches for children using () or combine them with other operators.

Here's an example of a search combining "or" expressions with the {} operator:

Fast.ast("1 + 2.0").search("{int float}")  # => [s(:int, 1), s(:float, 2.0)]

Combining node patterns

This example matches integers or floats whose values are positive:

Fast.ast("1 + 2.0").search("({int float} .positive?)")
# => [s(:int, 1), s(:float, 2.0)]

Note that .positive? is called as a method on the values of the matched AST nodes.

Fast - Like regex but for the AST

So, the node pattern expressions can be composed with:

  • exp - which represents the full match with the content
  • (type *children) - a node with a type and children
  • _ - anything not nil
  • ... - a node with children
  • { this or that } - any of the options
  • [ this and that ] - all of the conditions
  • $ - for captures
  • ! - to negate
  • #method - to call a method with the node as param

Try fast .finders

SQL Support in ‘fast’

Fast can now also parse SQL from PostgreSQL. The AST conversion is very similar to what you have in the Ruby AST.

Fast.parse_sql('select 1')
s(:select_stmt,
  s(:target_list,
    s(:res_target,
      s(:val,
        s(:a_const,
          s(:ival,
            s(:ival, 1)))))))

And the SQL parser is just forwarding the pg_query AST, converting it into a format that Fast is already familiar with.

In reality, pg_query (the PostgreSQL C bindings) does the heavy work, and my library lets you reuse all this AST metadata with node patterns and refactoring methods.

Here’s the full implementation of the parsing:

# lib/fast/sql.rb:110
    def parse(statement, buffer_name: "(sql)")
      return [] if statement.nil?
      source_buffer = SQL::SourceBuffer.new(buffer_name, source: statement)
      tree = PgQuery.parse(statement).tree
      first, *, last = source_buffer.tokens
      stmts = tree.stmts.map do |stmt|
        from = stmt.stmt_location
        to = stmt.stmt_len.zero? ? last.end : from + stmt.stmt_len
        expression = Parser::Source::Range.new(source_buffer, from, to)
        source_map = Parser::Source::Map.new(expression)
        sql_tree_to_ast(clean_structure(stmt.stmt.to_h), source_buffer: source_buffer, source_map: source_map)
      end.flatten
      stmts.one? ? stmts.first : stmts
    end

The hard work was going through the most common queries and creating a good clean_structure for them, to give us an easy-to-comprehend AST for building simple node patterns.

Fastfile

The Fastfile helps to organize a dictionary of patterns as shortcuts.

I built it thinking about having one per project, while also loading generic searches shared across projects.

The previous Ruby snippet was pulled into this article using the following shortcut:

Fast.shortcut :sql_parser, "(def parse)", "lib/fast/sql.rb"

So, if I want to repeat such a search, I just run fast .sql_parser and it will reuse the shortcut name to output the matching results.

Refactoring operations

The most complex part of working with the AST is rewriting it, which allows you to refactor code from the AST with a simple method call.

Here's an example of a simple replace on relname, which is a node type:

Fast
  .parse_sql("SELECT * FROM customers")
  .replace("relname", "other_table")
  # => "SELECT * FROM other_table"

The replace method can also receive a block to build much more complex scenarios. The next example explores how to build a SQL formatter.

Format SQL

I wrote a full blog post on building a SQL formatter with fast, and the final shortcut looks like this:

Fast.shortcut :format_sql do
  require 'fast/sql'
  file = ARGV.last
  method = File.exist?(file) ? :parse_sql_file : :parse_sql
  ast = Fast.public_send(method, file)
  ast = ast.first if ast.is_a? Array
  eligible_kw = [:RESERVED_KEYWORD]
  eligible_tokens = [:BY]

  output = Fast::SQL.replace('_', ast) do |root|
    sb = root.loc.expression.source_buffer
    sb.tokens.each do |token|
      if eligible_kw.include?(token.keyword_kind) || eligible_tokens.include?(token.token)
        range = Parser::Source::Range.new(sb, token.start, token.end)
        replace(range, range.source.upcase)
      end
    end
  end
  require 'fast/cli'
  puts Fast.highlight(output, sql: true)
end

Anonymize SQL

Another cool example of manipulating SQL via the AST is anonymizing it.

Here's a simple anonymizer that transforms table names and references.

Fast.shortcut :anonymize_sql do
  require 'fast/sql'
  file = ARGV.last
  method = File.exist?(file) ? :parse_sql_file : :parse_sql
  ast = Fast.public_send(method, file)
  memo = {}

  relnames = search("(relname $_)", ast).grep(String).uniq
  pattern = "{relname (sval {#{relnames.map(&:inspect).join(' ')}})}"
  puts "searching with #{pattern}"

  content = Fast::SQL.replace(pattern, ast) do |node|
    new_name = memo[node.source.tr(%|"'|, '')] ||= "x#{memo.size}"
    new_name = "'#{new_name}'" if node.type == :sval
    replace(node.loc.expression, new_name)
  end
  puts Fast.highlight(content, sql: true)
end

My latest experiments with SQL AST

Now, let's talk about my latest exciting project, which is about TimescaleDB and recognizing framework standards and feature adoption.

Let’s dive into a simple example that allows me to explain my intention.

I'm going to track temperatures in different locations, but my database will be massive, so I'm going to use hypertables to automatically partition the table by month.

CREATE TABLE temperatures
( time TIMESTAMP NOT NULL,
  location varchar,
  value decimal);

-- automatic partitioning by month
SELECT create_hypertable('temperatures',
  by_range('time', INTERVAL '1 month'));

The create_hypertable function allows you to insert into temperatures while, behind the scenes, it's splitting and saving the data into monthly chunks.

Inserting data

The focus here is not on partitioning or inserting the data, but I'm going to show some examples to make the full scenario easier to understand:

INSERT INTO temperatures ( location, time, value)
VALUES
( 'kitchen',   '2000-01-01 12:20:00', 22.2),
( 'kitchen',   '2000-01-01 12:32:00', 22.8),
( 'bedroom',   '2000-01-01 12:22:00', 22.8),
( 'bedroom',   '2000-01-01 14:33:00', 24.2);

Querying aggregated data

Here's the tricky part of using time-series data: most of it will be consumed through statistics that aggregate the data somehow. Let's get the average hourly temperature:

SELECT time_bucket('1h', time) as time,
  location,
  avg(value) as value
  FROM temperatures
  GROUP BY 1,2;

The time_bucket function is offered by the timescaledb extension and works like date_trunc, but is a bit more advanced and prepared for Timescale features.

Creating a materialized view

Now, you can imagine that putting the most frequent queries into materialized views is a good idea, so you can make the values persistent and avoid reprocessing.

CREATE MATERIALIZED VIEW  avg_temperature_by_hour
WITH (timescaledb.continuous) AS
SELECT time_bucket('1h', time) as time,
  location,
  avg(value) as value
  FROM temperatures
GROUP BY 1,2
WITH DATA;

My experiment - a small linter

Now let's talk about the experiment. You can imagine a linter saying:

Hey buddy, I see you’re querying from the hypertable but you also have the materialized view. Let’s use the materialized view instead.

Learning the AST patterns

The first step in building your pattern is understanding the AST. I build a minimalistic expression that contains the fragment I'm looking for:

puts Fast.parse_sql("SELECT time_bucket('1h', now())")

Outputs:

(select-stmt
  (target-list
    (res-target
      (val
        (func-call
          (funcname
            (string
              (sval "time_bucket")))
          (args
            (a-const
              (sval
                (sval "1h")))
            (func-call
              (funcname
                (string
                  (sval "now")))
              (funcformat :COERCE_EXPLICIT_CALL)))
          (funcformat :COERCE_EXPLICIT_CALL))))))

Now, I can also build a pattern for detecting a select with time_bucket from a hypertable.

Fast.shortcut :check_query do 
  pattern = <<~FAST
  (select_stmt
    (target_list #call_time_bucket)
    (from_clause #from_hypertable))
  FAST
  search_all pattern, # ...
end

Now, you can see I'm introducing the #method matcher to allow building more readable patterns and delegating part of the matching to my favorite programming language. So, let's specify the methods:

The #call_time_bucket method will just pick the first result of the given expression:

def call_time_bucket(node)
  node.first('(func_call (funcname (string (sval "time_bucket"))))')
end

And the from_hypertable method will capture the relation name and double-check that the hypertables array contains it.

def from_hypertable(node)
  if (relname = node.capture('(relname $_)')[0])
    hypertables.include?(relname)
  end
end

def hypertables
  @hypertables ||=
    Timescaledb.hypertables.map(&:hypertable_name)
end

Now we can say we have a query that matches. This query can be running standalone or be the implementation of a materialized view. So, the next step is tracking which queries exist and which materialized views are already available.

Track @query and @materialized

    # Previous pattern
    search_all pattern, ARGV.last, parallel: false, on_result: ->(file, results) do
      puts "#{file}: #{results.size}"
      results.each do |node|
        report node
        if node.parent.nil?
          @query[node.capture('(relname $_)')[0]] = node
        else
          root = node.ancestors.last
          # ... next slide

To map @materialized with the view name, we need to check if this is a CREATE MATERIALIZED VIEW using the timescaledb.continuous feature.

case root.type
when :create_table_as_stmt
  view_name = <<~FAST
    (create_table_as_stmt
      (query ... )
      (into
        (rel
          (relname $_)
          (inh true)
          (relpersistence "p"))
        (options
          (def_elem
            (defnamespace "timescaledb")
            (defname "continuous"))))
        FAST
if (name=root.capture(view_name)[0])
  @materialized[name] = node
end

The cool part of the AST is that if we just run this shortcut over the previous SQL fragments, it will discover that:

@query["temperatures"] == @materialized["avg_temperature_by_hour"] # => true

So, the ASTs have the same representation, and we can build a recursive algorithm to check which queries are eligible to be replaced by a materialized view.
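The comparison relies on structural equality, which Ruby gives us for free on nested sexp-like values. An illustrative sketch (these arrays are made up, not pg_query output):

```ruby
# Two ASTs built from differently formatted sources compare equal
# as long as their structure matches, because Array#== recurses.
query_a = [:select_stmt, [:from_clause, [:relname, "temperatures"]]]
query_b = [:select_stmt, [:from_clause, [:relname, "temperatures"]]]
query_a == query_b # => true
```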

# ... previous context
      @query.each do |table_name, query_table|
        @materialized.each do |view_name, query_materialized|
          if query_table == query_materialized
            puts "The table #{table_name} is also tracked as a continuous aggregate of #{view_name}", ""
            report(query_table.source)
            puts "Query the data from the materialized view to get pre-computed results", ""
            report("SELECT * FROM #{view_name}")
          end
        end
      end

Demo

Running the demo with the previous SQL fragments saved in a demo.sql file:

 fast .check_query demo.sql
The table temperatures is also tracked as a continuous aggregate of avg_temperature_by_hour

SELECT time_bucket('1h', time) as time,
  location,
  avg(value) as value
  FROM temperatures
  GROUP BY 1,2

Query the data from the materialized view to get pre-computed results

SELECT * FROM avg_temperature_by_hour

Yay! I got the final output as expected! I was able to create a linter that uses database metadata combined with AST metadata to intercept and point a developer toward a new direction.

I love building these kinds of primitive tools 🫶.

Playing with toys from the compiler level is fun and very powerful!

The future

My plan for the future is to try to help the query planner at the ORM level: integrate with the timescaledb gem to replace queries before they reach the query planner.

I can't see myself doing such an implementation at the PostgreSQL core level, but I can build a small prototype that proves the concept.

Sources

Here are a few sources if you want to go deeper:

I built this article as a way to document my talk "Grepping SQL like a boss" at Lambda Days 2024, and you can find the gist with all the talk files and sources to run the examples here:

https://gist.github.com/jonatas/6ba56014185855b7f2efae74d6250016