A jq to bb Story

tl;dr Just because you can do it in Clojure doesn’t mean you should.

So I was looking for a regular expression in a bunch of files. You know, the standard use case for grep … or my currently preferred variant, ripgrep. Ripgrep is amazing software and I love it, but I couldn’t make it do quite what I wanted, which was to get a list of the unique matches. No file names, no line numbers, no context; just the matches.

For the sake of this blog post, I’ll say that instead of a sprawling tree of source files, I’ve got the Gettysburg address:

gettysburg

Four score and seven years ago our fathers brought forth on this continent, a
new nation, conceived in Liberty, and dedicated to the proposition that all men
are created equal.

Now we are engaged in a great civil war, testing whether that nation, or any
nation so conceived and so dedicated, can long endure. We are met on a great
battle-field of that war. We have come to dedicate a portion of that field, as
a final resting place for those who here gave their lives that that nation
might live. It is altogether fitting and proper that we should do this.

But, in a larger sense, we can not dedicate—we can not consecrate—we can not
hallow—this ground. The brave men, living and dead, who struggled here, have
consecrated it, far above our poor power to add or detract. The world will
little note, nor long remember what we say here, but it can never forget what
they did here. It is for us the living, rather, to be dedicated here to the
unfinished work which they who fought here have thus far so nobly advanced. It
is rather for us to be here dedicated to the great task remaining before
us—that from these honored dead we take increased devotion to that cause for
which they gave the last full measure of devotion—that we here highly resolve
that these dead shall not have died in vain—that this nation, under God, shall
have a new birth of freedom—and that government of the people, by the people,
for the people, shall not perish from the earth.

Maybe I want to find all the words that end with ion. I can invoke ripgrep like this:

rg '\b\w+ion\b' gettysburg
2:new nation, conceived in Liberty, and dedicated to the proposition that all men
5:Now we are engaged in a great civil war, testing whether that nation, or any
6:nation so conceived and so dedicated, can long endure. We are met on a great
7:battle-field of that war. We have come to dedicate a portion of that field, as
8:a final resting place for those who here gave their lives that that nation
18:us—that from these honored dead we take increased devotion to that cause for
19:which they gave the last full measure of devotion—that we here highly resolve
20:that these dead shall not have died in vain—that this nation, under God, shall

Remember, I want just the matching words. If a line has multiple matches (as in the first line of the results above), I want to know about both of ’em.

I discovered that ripgrep has a --json option, which produces output that looks like this:

{"type":"begin","data":{"path":{"text":"gettysburg"}}}
{"type":"match","data":{"path":{"text":"gettysburg"},"lines":{"text":"new nation, conceived in Liberty, and dedicated to the proposition that all men\n"},"line_number":2,"absolute_offset":78,"submatches":[{"match":{"text":"nation"},"start":4,"end":10},{"match":{"text":"proposition"},"start":55,"end":66}]}}
… other results …
{"data":{"elapsed_total":{"human":"0.017184s","nanos":17184422,"secs":0},"stats":{"bytes_printed":2159,"bytes_searched":1469,"elapsed":{"human":"0.003108s","nanos":3107974,"secs":0},"matched_lines":8,"matches":9,"searches":1,"searches_with_match":1}},"type":"summary"}

The first line and last line are bits of metadata about the search, but the rest of it includes quite a bit of information about each match. Here’s a prettified version of one of those lines:

{
  "type": "match",
  "data": {
    "path": {
      "text": "gettysburg"
    },
    "lines": {
      "text": "new nation, conceived in Liberty, and dedicated to the proposition that all men\n"
    },
    "line_number": 2,
    "absolute_offset": 78,
    "submatches": [
      {
        "match": {
          "text": "nation"
        },
        "start": 4,
        "end": 10
      },
      {
        "match": {
          "text": "proposition"
        },
        "start": 55,
        "end": 66
      }
    ]
  }
}

That submatches array has exactly what I want: each element is an object whose match entry has a text entry, whose value is—well—the matching text.

Got JSON? Sounds like a great place to drop in some jq. If I pipe ripgrep’s output into jq, I can get pretty close to what I was looking for:

rg … | jq -r '.data.submatches[]?.match.text'
nation
proposition
nation
nation
portion
nation
devotion
devotion
nation

And now, if I tack on | sort | uniq, I get exactly what I was looking for, a list of the unique matches:[1]

devotion
nation
portion
proposition

Unix pipes in action. O joy, o rapture!
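
The sort step is not optional: uniq only collapses duplicates that sit on adjacent lines. Here is a small sketch of that, using a few of the matches above as hand-typed input (the matches variable exists only for this illustration):

```shell
# A few of the matches from above, in the order ripgrep found them.
matches='nation
proposition
nation
portion
nation'

# uniq alone only collapses adjacent duplicates, so nothing is removed here.
printf '%s\n' "$matches" | uniq

# Sorting first makes the duplicates adjacent, so uniq can drop them.
printf '%s\n' "$matches" | sort | uniq

# sort -u does both steps in one command.
printf '%s\n' "$matches" | sort -u
```

The last two commands should print the same three words: nation, portion, proposition.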

Now, this is a perfectly cromulent solution. jq is a perfect fit; that .data.submatches[]?.match.text elegantly expresses what I’m extracting from each line.
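
For anyone new to jq, the ? is what keeps that filter from tripping over the records that have no submatches array, such as the begin and summary lines. A minimal sketch with two hand-abridged records (the records variable and the extract_matches helper are names I made up for this illustration):

```shell
# Two records in the shape ripgrep emits, abridged by hand for illustration.
records='{"type":"begin","data":{"path":{"text":"gettysburg"}}}
{"type":"match","data":{"submatches":[{"match":{"text":"nation"}},{"match":{"text":"proposition"}}]}}'

# Hypothetical helper wrapping the filter from the post.
# .data.submatches[] iterates over the array; the trailing ? suppresses the
# error jq would otherwise raise when .data.submatches is null.
extract_matches() {
  jq -r '.data.submatches[]?.match.text'
}

# The begin record contributes nothing; the match record yields two lines.
printf '%s\n' "$records" | extract_matches
```

Dropping the ? should make jq complain that it cannot iterate over null when it reaches the begin record.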

However, in more complex cases, jq can be a bit tricky to wrangle. I have spent an inordinate amount of time trawling its documentation looking for the right incantation. There’s nothing inherently wrong with that—powerful tools are worth learning—but this gave me a chance to try another powerful tool I’m trying to gain more familiarity with: Babashka, an interpreted Clojure dialect whose sweet spot is shell scripting.

Over the remainder of the post, I’ll build up an equivalent command for bb (what Babashka calls its executable). I’ll start with the simplest thing I can think of, which is simply dumping what I’m piping into bb:

rg … | bb '*input*'
clojure.lang.EdnReader$ReaderException: java.lang.RuntimeException: Invalid token: :

What happened? Babashka is a Clojure tool, so it was expecting that *input* (a dynamic variable bound to, well, the incoming data) would be a big blob of Clojure’s native EDN. We can change it to expect a sequence of plain strings with the -i flag:

rg … | bb -i '*input*'
("{\"type\":\"begin\",\"data\":{\"path\":{\"text\":\"gettysburg\"}}}" "{\"type\":\"match\",\"data\":{\"path\":{\"text\":\"gettysburg\"},\"lines\":{\"text\":\"new nation, conceived in Liberty, and dedicated to the proposition that all men\\n\"},\"line_number\":2,\"absolute_offset\":78,\"submatches\":[{\"match\":{\"text\":\"nation\"},\"start\":4,\"end\":10},{\"match\":{\"text\":\"proposition\"},\"start\":55,\"end\":66}]}}" "{\"type\":\"match\",\"data\":{\"path\":{\"text\":\"gettysburg\"},\"lines\":{\"text\":\"Now we are engaged in a great civil war, testing whether that nation, or any\\n\"},\"line_number\":5,\"absolute_offset\":178,\"submatches\":[{\"match\":{\"text\":\"nation\"},\"start\":62,\"end\":68}]}}" "{\"type\":\"match\",\"data\":{\"path\":{\"text\":\"gettysburg\"},\"lines\":{\"text\":\"nation so conceived and so dedicated, can long endure. We are met on a great\\n\"},\"line_number\":6,\"absolute_offset\":255,\"submatches\":[{\"match\":{\"text\":\"nation\"},\"start\":0,\"end\":6}]}}" "{\"type\":\"match\",\"data\":{\"path\":{\"text\":\"gettysburg\"},\"lines\":{\"text\":\"battle-field of that war. We have come to dedicate a portion of that field, as\\n\"},\"line_number\":7,\"absolute_offset\":332,\"submatches\":[{\"match\":{\"text\":\"portion\"},\"start\":53,\"end\":60}]}}" "{\"type\":\"match\",\"data\":{\"path\":{\"text\":\"gettysburg\"},\"lines\":{\"text\":\"a final resting place for those who here gave their lives that that nation\\n\"},\"line_number\":8,\"absolute_offset\":411,\"submatches\":[{\"match\":{\"text\":\"nation\"},\"start\":68,\"end\":74}]}}" "{\"type\":\"match\",\"data\":{\"path\":{\"text\":\"gettysburg\"},\"lines\":{\"text\":\"us—that from these honored dead we take increased devotion to that cause for\\n\"},\"line_number\":18,\"absolute_offset\":1100,\"submatches\":[{\"match\":{\"text\":\"devotion\"},\"start\":52,\"end\":60}]}}" "{\"type\":\"match\",\"data\":{\"path\":{\"text\":\"gettysburg\"},\"lines\":{\"text\":\"which they gave the last full measure of devotion—that we here highly resolve\\n\"},\"line_number\":19,\"absolute_offset\":1179,\"submatches\":[{\"match\":{\"text\":\"devotion\"},\"start\":41,\"end\":49}]}}" "{\"type\":\"match\",\"data\":{\"path\":{\"text\":\"gettysburg\"},\"lines\":{\"text\":\"that these dead shall not have died in vain—that this nation, under God, shall\\n\"},\"line_number\":20,\"absolute_offset\":1259,\"submatches\":[{\"match\":{\"text\":\"nation\"},\"start\":56,\"end\":62}]}}" "{\"type\":\"end\",\"data\":{\"path\":{\"text\":\"gettysburg\"},\"binary_offset\":null,\"stats\":{\"elapsed\":{\"secs\":0,\"nanos\":1197411,\"human\":\"0.001197s\"},\"searches\":1,\"searches_with_match\":1,\"bytes_searched\":1469,\"bytes_printed\":2159,\"matched_lines\":8,\"matches\":9}}}" "{\"data\":{\"elapsed_total\":{\"human\":\"0.003436s\",\"nanos\":3436109,\"secs\":0},\"stats\":{\"bytes_printed\":2159,\"bytes_searched\":1469,\"elapsed\":{\"human\":\"0.001197s\",\"nanos\":1197411,\"secs\":0},\"matched_lines\":8,\"matches\":9,\"searches\":1,\"searches_with_match\":1}},\"type\":\"summary\"}")

That’s better (at least I’m not getting an error), but it’s still one big list with every string in it. It would be nice if I could process one line at a time, like I was doing with jq. I can do just that with the --stream option:

rg … | bb --stream -i '*input*'
"{\"type\":\"begin\",\"data\":{\"path\":{\"text\":\"gettysburg\"}}}"
"{\"type\":\"match\",\"data\":{\"path\":{\"text\":\"gettysburg\"},\"lines\":{\"text\":\"new nation, conceived in Liberty, and dedicated to the proposition that all men\\n\"},\"line_number\":2,\"absolute_offset\":78,\"submatches\":[{\"match\":{\"text\":\"nation\"},\"start\":4,\"end\":10},{\"match\":{\"text\":\"proposition\"},\"start\":55,\"end\":66}]}}"
"{\"type\":\"match\",\"data\":{\"path\":{\"text\":\"gettysburg\"},\"lines\":{\"text\":\"Now we are engaged in a great civil war, testing whether that nation, or any\\n\"},\"line_number\":5,\"absolute_offset\":178,\"submatches\":[{\"match\":{\"text\":\"nation\"},\"start\":62,\"end\":68}]}}"
"{\"type\":\"match\",\"data\":{\"path\":{\"text\":\"gettysburg\"},\"lines\":{\"text\":\"nation so conceived and so dedicated, can long endure. We are met on a great\\n\"},\"line_number\":6,\"absolute_offset\":255,\"submatches\":[{\"match\":{\"text\":\"nation\"},\"start\":0,\"end\":6}]}}"
"{\"type\":\"match\",\"data\":{\"path\":{\"text\":\"gettysburg\"},\"lines\":{\"text\":\"battle-field of that war. We have come to dedicate a portion of that field, as\\n\"},\"line_number\":7,\"absolute_offset\":332,\"submatches\":[{\"match\":{\"text\":\"portion\"},\"start\":53,\"end\":60}]}}"
"{\"type\":\"match\",\"data\":{\"path\":{\"text\":\"gettysburg\"},\"lines\":{\"text\":\"a final resting place for those who here gave their lives that that nation\\n\"},\"line_number\":8,\"absolute_offset\":411,\"submatches\":[{\"match\":{\"text\":\"nation\"},\"start\":68,\"end\":74}]}}"
"{\"type\":\"match\",\"data\":{\"path\":{\"text\":\"gettysburg\"},\"lines\":{\"text\":\"us—that from these honored dead we take increased devotion to that cause for\\n\"},\"line_number\":18,\"absolute_offset\":1100,\"submatches\":[{\"match\":{\"text\":\"devotion\"},\"start\":52,\"end\":60}]}}"
"{\"type\":\"match\",\"data\":{\"path\":{\"text\":\"gettysburg\"},\"lines\":{\"text\":\"which they gave the last full measure of devotion—that we here highly resolve\\n\"},\"line_number\":19,\"absolute_offset\":1179,\"submatches\":[{\"match\":{\"text\":\"devotion\"},\"start\":41,\"end\":49}]}}"
"{\"type\":\"match\",\"data\":{\"path\":{\"text\":\"gettysburg\"},\"lines\":{\"text\":\"that these dead shall not have died in vain—that this nation, under God, shall\\n\"},\"line_number\":20,\"absolute_offset\":1259,\"submatches\":[{\"match\":{\"text\":\"nation\"},\"start\":56,\"end\":62}]}}"
"{\"type\":\"end\",\"data\":{\"path\":{\"text\":\"gettysburg\"},\"binary_offset\":null,\"stats\":{\"elapsed\":{\"secs\":0,\"nanos\":1256068,\"human\":\"0.001256s\"},\"searches\":1,\"searches_with_match\":1,\"bytes_searched\":1469,\"bytes_printed\":2159,\"matched_lines\":8,\"matches\":9}}}"
"{\"data\":{\"elapsed_total\":{\"human\":\"0.003883s\",\"nanos\":3882750,\"secs\":0},\"stats\":{\"bytes_printed\":2159,\"bytes_searched\":1469,\"elapsed\":{\"human\":\"0.001256s\",\"nanos\":1256068,\"secs\":0},\"matched_lines\":8,\"matches\":9,\"searches\":1,\"searches_with_match\":1}},\"type\":\"summary\"}"

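The difference between those two invocations is easiest to see with a tiny input. In this sketch (three made-up lines, nothing to do with ripgrep), -i alone binds *input* to the whole sequence of lines, while adding --stream evaluates the expression once per line:

```shell
# Without --stream, *input* is the full sequence of lines, so count sees 3 items.
printf 'a\nbb\nccc\n' | bb -i '(count *input*)'

# With --stream, the expression runs once per line and *input* is a single
# string, so count returns the length of each line in turn: 1, 2, 3.
printf 'a\nbb\nccc\n' | bb --stream -i '(count *input*)'
```
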
Okay, cool. Now I need to get these JSON strings into EDN so’s I can work with the data in Clojure-land. Babashka bundles in a bunch of libraries, including Cheshire for JSON support; it’s aliased as the json namespace. I’ll modify the command:

rg … | bb --stream -i '(-> *input* (json/parse-string true))'
{:type "begin", :data {:path {:text "gettysburg"}}}
{:type "match", :data {:path {:text "gettysburg"}, :lines {:text "new nation, conceived in Liberty, and dedicated to the proposition that all men\n"}, :line_number 2, :absolute_offset 78, :submatches [{:match {:text "nation"}, :start 4, :end 10} {:match {:text "proposition"}, :start 55, :end 66}]}}
{:type "match", :data {:path {:text "gettysburg"}, :lines {:text "Now we are engaged in a great civil war, testing whether that nation, or any\n"}, :line_number 5, :absolute_offset 178, :submatches [{:match {:text "nation"}, :start 62, :end 68}]}}
{:type "match", :data {:path {:text "gettysburg"}, :lines {:text "nation so conceived and so dedicated, can long endure. We are met on a great\n"}, :line_number 6, :absolute_offset 255, :submatches [{:match {:text "nation"}, :start 0, :end 6}]}}
{:type "match", :data {:path {:text "gettysburg"}, :lines {:text "battle-field of that war. We have come to dedicate a portion of that field, as\n"}, :line_number 7, :absolute_offset 332, :submatches [{:match {:text "portion"}, :start 53, :end 60}]}}
{:type "match", :data {:path {:text "gettysburg"}, :lines {:text "a final resting place for those who here gave their lives that that nation\n"}, :line_number 8, :absolute_offset 411, :submatches [{:match {:text "nation"}, :start 68, :end 74}]}}
{:type "match", :data {:path {:text "gettysburg"}, :lines {:text "us—that from these honored dead we take increased devotion to that cause for\n"}, :line_number 18, :absolute_offset 1100, :submatches [{:match {:text "devotion"}, :start 52, :end 60}]}}
{:type "match", :data {:path {:text "gettysburg"}, :lines {:text "which they gave the last full measure of devotion—that we here highly resolve\n"}, :line_number 19, :absolute_offset 1179, :submatches [{:match {:text "devotion"}, :start 41, :end 49}]}}
{:type "match", :data {:path {:text "gettysburg"}, :lines {:text "that these dead shall not have died in vain—that this nation, under God, shall\n"}, :line_number 20, :absolute_offset 1259, :submatches [{:match {:text "nation"}, :start 56, :end 62}]}}
{:type "end", :data {:path {:text "gettysburg"}, :binary_offset nil, :stats {:elapsed {:secs 0, :nanos 1393123, :human "0.001393s"}, :searches 1, :searches_with_match 1, :bytes_searched 1469, :bytes_printed 2159, :matched_lines 8, :matches 9}}}
{:data {:elapsed_total {:human "0.003764s", :nanos 3763784, :secs 0}, :stats {:bytes_printed 2159, :bytes_searched 1469, :elapsed {:human "0.001393s", :nanos 1393123, :secs 0}, :matched_lines 8, :matches 9, :searches 1, :searches_with_match 1}}, :type "summary"}

Now we’re talking! It’s Clojure’s maps and vectors … I know this. Getting down to the submatches should be easy:

rg … | bb --stream -i '(-> *input* (json/parse-string true)'\
'(get-in [:data :submatches]))'
[{:match {:text "nation"}, :start 4, :end 10} {:match {:text "proposition"}, :start 55, :end 66}]
[{:match {:text "nation"}, :start 62, :end 68}]
[{:match {:text "nation"}, :start 0, :end 6}]
[{:match {:text "portion"}, :start 53, :end 60}]
[{:match {:text "nation"}, :start 68, :end 74}]
[{:match {:text "devotion"}, :start 52, :end 60}]
[{:match {:text "devotion"}, :start 41, :end 49}]
[{:match {:text "nation"}, :start 56, :end 62}]
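
Note that the keyword path [:data :submatches] only works because of the true passed to json/parse-string, which asks Cheshire to turn JSON object keys into keywords. A quick sketch of the difference, on a made-up one-key object:

```shell
# Without the second argument, keys stay strings: {"type" "match"}
bb '(json/parse-string "{\"type\":\"match\"}")'

# With true, keys become keywords: {:type "match"}
bb '(json/parse-string "{\"type\":\"match\"}" true)'
```

With string keys, the path would have to be written as ["data" "submatches"] instead.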

All right, now things are gonna get a little funky with the threading macros to pull the match text out of the submatches:

rg … | bb --stream -i '(-> *input* (json/parse-string true)'\
'(get-in [:data :submatches])'\
'(->> (map #(get-in % [:match :text])))'\
')'
()
("nation" "proposition")
("nation")
("nation")
("portion")
("nation")
("devotion")
("devotion")
("nation")
()
()
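
If the mix of -> and ->> is hard to follow, it may help to see the same expression written inside-out, with no threading at all. This is a sketch of what the macros expand to, run against one hand-abridged record (the record variable is only for illustration):

```shell
# One match record, abridged by hand.
record='{"type":"match","data":{"submatches":[{"match":{"text":"nation"}},{"match":{"text":"proposition"}}]}}'

# Threaded form, as in the post: -> feeds the value in as the first argument
# of each call, and the inner ->> feeds it to map as the last argument.
printf '%s\n' "$record" | bb --stream -i \
  '(-> *input* (json/parse-string true) (get-in [:data :submatches]) (->> (map #(get-in % [:match :text]))))'

# The same computation with the calls nested by hand.
printf '%s\n' "$record" | bb --stream -i \
  '(map #(get-in % [:match :text]) (get-in (json/parse-string *input* true) [:data :submatches]))'
```

Both commands should print ("nation" "proposition").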

Almost there! I’ve got the data, I just need it a bit cleaner: no empty lists, no parens, and no quotation marks. Once again Babashka offers a flag to adjust that. Unsurprisingly, it’s -o:

rg … | bb --stream -io '(-> *input* (json/parse-string true) (get-in [:data :submatches]) (->> (map #(get-in % [:match :text]))))'
nation
proposition
nation
nation
portion
nation
devotion
devotion
nation

At this point, I can just pipe it to | sort | uniq, and get the same results as the jq version. But in for a penny, in for a pound, I say! Why not pipe this back into bb and do it with Clojure?

rg … | bb … | bb -io '(-> *input* set sort)'
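
Here set plays the part of uniq, and sort is, well, sort. Since a set does not care about adjacency, the order is the reverse of the shell version: deduplicate first, then sort. A sketch with hand-typed input standing in for the earlier stages of the pipeline:

```shell
# -i reads all the lines into *input*; -o prints each element of the result
# on its own line, without quotes.
printf 'nation\nproposition\nnation\nportion\n' | bb -io '(-> *input* set sort)'
```

That should print nation, portion, and proposition, one per line.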

So when I ran this in the big ol’ source tree (instead of the Gettysburg Address), it took about 500ms with Babashka. jq took about 80ms. Babashka was an order of magnitude slower! I tweaked it thusly:[2]

rg … | \
bb --stream -iO '(-> *input* (json/parse-string true) (get-in [:data :submatches]))' | \
bb --stream -o '(get-in *input* [:match :text])' | \
bb -io '(-> *input* set sort)'

I got rid of the (->> map), moving the match-text-grabbin’ into a separate invocation of Babashka. After that, it ran in around 100ms, only about 20ms slower than the jq version. I’m not sure what the reason for that is, but I’d guess that Babashka, as an interpreted version of Clojure, spends way more time grinding through loops than it takes Bash to shove data through a pipe.
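
The capital -O in the first stage is what lets the stages talk to each other: -o prints plain text lines, while -O prints each element as EDN that the next bb can read back in. A small sketch of the difference, using a made-up vector of maps in place of real ripgrep output:

```shell
# -o writes each element as a line of plain text (strings lose their quotes),
# which is fine for humans but not meant to be read back as data.
bb -o '[{:match {:text "nation"}} {:match {:text "portion"}}]'

# -O writes each element as an EDN value. --stream with no -i reads EDN
# values from stdin, so the second bb gets one map per iteration.
bb -O '[{:match {:text "nation"}} {:match {:text "portion"}}]' |
  bb --stream -o '(get-in *input* [:match :text])'
```

The second pipeline should print nation and portion, one per line.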

Clojure is an expressive language; you get a lot of mileage out of a small amount of code. In this instance, though, not only was the raw code longer, it was nowhere near as readable as the jq equivalent. Specialty tools ftw!

So is jq better than Babashka? Maybe for the case of simple JSON filtering and extraction! But Babashka is a much more general tool. You can write scripts to do all sorts of things with it, and Michiel Borkent is expanding its capabilities all the time. As usual, it’s a matter of picking the right tool for the job. Babashka and jq are both excellent, and I look forward to using both again.

Questions? Comments? Contact me!

  1. Actually, I only need to sort ’cause uniq only removes adjacent duplicates. C’est la vie.
  2. The observant reader may have noticed that the case of the o in the flag changed (from -o to -O). The input/output flags are a bit obscure, but Mr. Borkent compiled a handy table summarizing their usage.

Tools Used

Babashka  0.88
jq        1.6
ripgrep   12.0.1