Faster selector execution by pre-parsing fields by jcoglan · Pull Request #5869 · apache/couchdb

jcoglan · 2026-01-29T15:44:49Z

Overview

While working on mango_selector:match() to implement detailed match failure information, I noticed that every time a non-operator field is evaluated, we call mango_doc:get_field(), which calls mango_util:parse_field() if its input is a binary. We can avoid re-parsing fields so often by changing norm_fields so that it normalizes to the parsed form [<<"a">>, <<"b">>] instead of <<"a.b">>.

The changes have made the benchmarks I added about 30-40% faster. (I wasn't sure where benchmarks like this should live, if anywhere... they're mostly here for me to check the effect and I'm happy to move or remove them before merging.)

In the first commit here I make this change and then change all the other logic that interacts with selectors to also use this form. This ended up having a bigger blast radius than I would have liked, so I undid most of it and changed the mango_selector functions and a couple others so they hide how normalized selector fields actually look from other callers. There's only a couple of places where these normalized selectors need to be turned back into valid JSON, and the mango_util:join_keys() function turns the array keys back into binaries as needed.

There are a couple of places in the logic for choosing indexes for queries where we now have to parse fields to match the form they appear in in selectors, but I figure doing this once per query during index selection is better than doing it once per selector-field evaluation during index updates or filtering.

I assume the latter commit is the state we'd actually want to keep so I'm happy to squash all the other edits away. I also need to write proper commits messages; I'll tidy the history up once I know what should be kept in the history.

Related Issues or Pull Requests

Checklist

Code is written and works correctly
Changes are covered by tests
Any new configurable parameters are documented in rel/overlay/etc/default.ini
Documentation changes were made in the src/docs folder
Documentation changes were backported (separated PR) to affected branches

janl

This looks good overall, thanks!

We do not want to add erlperf and argparse as dependencies just yet, could you revert those bits? You can keep the benchmark, just disable it for now and leave instructions for how to run it.

nickva

Wondering if this would affect cases when selectors are sent between cluster nodes during a release upgrade. One way to check would be to run a dev cluster node on a version without the change and others with the change and then make mango filtered changes feed and mango index/selector queries.

jcoglan · 2026-03-12T16:58:50Z

I have tried the things @nickva suggest via the following steps. I believe the results look fine but would appreciate others double checking them.

Check out the repo at this PR's branch
Check out two other copies of the repo at admin/main, I used commit c570deff373d7f9b1c14db34e981312b09ff71aa
On each one, apply the following patch:

diff --git a/rel/overlay/etc/vm.args b/rel/overlay/etc/vm.args
index b4167be7b..ca8f562dd 100644
--- a/rel/overlay/etc/vm.args
+++ b/rel/overlay/etc/vm.args
@@ -43,6 +43,8 @@
 
 # Which interfaces should the node listen on?
 -kernel inet_dist_use_interface {127,0,0,1}
+-kernel inet_dist_listen_min 9100
+-kernel inet_dist_listen_max 9200
 
 # Tell kernel and SASL not to log anything
 -kernel error_logger silent

On each copy, run: git clean -ffdx ; ./configure --dev --disable-spidermonkey --js-engine quickjs ; make ; ./dev/run -a a:a -n1 --erlang-cookie biscuit --node-number-seed $n with n from 1 to 3 for each copy
Run the following script to connect them into a cluster:

#!/bin/bash

source ~/.shell/couchdb

username='a'
password='a'
CDB_AUTH="$username:$password"

pwhash='"-pbkdf2:sha256-761a5edaa8635ae1fce1185d477cff666e137e5f2fe4d65037a8e08f380f67ab,b0e91baaa60de5da82be8bcd7acb4ab9,600000"'

uuid1='"019ce2e0305a7a9c9c8b4b50e992c097"'
uuid2='"019ce2e0305a7cd0861784fd28c84d67"'

for n in $(seq 1 3) ; do
  CDB_PORT="${n}5984"
  cdb '/_node/_local/_config/admins/a' -X PUT -d "$pwhash"
  cdb '/_node/_local/_config/chttpd/bind_address' -X PUT -d '"0.0.0.0"'
  cdb '/_node/_local/_config/couchdb/uuid' -X PUT -d "$uuid1"
  cdb '/_node/_local/_config/chttpd_auth/secret' -X PUT -d "$uuid2"
  cdb '/_node/_local/_config/cluster/q' -X PUT -d '"1"'
  cdb '/_node/_local/_config/cluster/n' -X PUT -d '"3"'
done

CDB_PORT=15984

ip='127.0.0.1'

cdb '/_cluster_setup' -X POST -d @- <<JSON
  {
    "action": "enable_cluster",
    "bind_address": "0.0.0.0",
    "username": "$username",
    "password": "$password",
    "node_count": "3"
  }
JSON

for n in $(seq 2 3) ; do
  cdb '/_cluster_setup' -X POST -d @- <<JSON
    {
      "action": "enable_cluster",
      "bind_address": "0.0.0.0",
      "username": "$username",
      "password": "$password",
      "port": $CDB_PORT,
      "node_count": "3",
      "remote_node": "$ip",
      "remote_current_user": "$username",
      "remote_current_password": "$password"
    }
JSON

  cdb '/_cluster_setup' -X POST -d @- <<JSON
    {
      "action": "add_node",
      "host": "$ip",
      "port": ${n}5984,
      "username": "$username",
      "password": "$password"
    }
JSON
done

cdb '/_cluster_setup' -X POST -d '{ "action": "finish_cluster" }'

cdb '/_membership' | jq

cdb is a helper from my shell environment which you can simulate using:

cdb () { curl -gsu "$CDB_AUTH" -H 'content-type: application/json' "http://127.0.0.1:${CDB_PORT}$1" "${@:2}" ; }

Create a DB with some docs in it:

CDB_PORT=15984
cdb '/mydb' -X DELETE
cdb '/mydb' -X PUT
for n in {1..100} ; do cdb "/mydb/doc-$n" -X PUT -d '{ "value": '$n' }' ; done

Try using _find and _explain against each server:

CDB_PORT=15984 # or 25984, 35984
cdb '/mydb/_find' -X POST -d '{ "selector": { "value": { "$gt": 3, "$lt": 7 } } }'
cdb '/mydb/_explain' -X POST -d '{ "selector": { "value": { "$gt": 3, "$lt": 7 } } }'

Try filtered changes:

CDB_PORT=15984 # or 25984, 35984
cdb '/mydb/_changes?filter=_selector' -X POST -d '{ "selector": { "value": { "$gt": 3, "$lt": 7 } } }'

nickva · 2026-03-12T22:17:23Z

That looks great, James. Thank you. Let me also give it a try locally with similar setup to double-check it. And then it should be good to go

nickva

Looking at mixed cluster:

(I used a bit of a simpler way to do it)

OLD (node1, node2, ...)

./dev/run --admin=adm:pass -n3 --degrade-cluster=1

NEW (node3)

./dev/run --admin=adm:pass -n1 --no-join --node-number-seed=3

With this script

#!/bin/sh

http delete $DB/db1
http put $DB/db1
http put $DB/db1/a v:='1' type=x
http put $DB/db1/b v:='2' type=x
http put $DB/db1/c v:='3' type=y
http put $DB/db1/d v:='4' type=y
http put $DB/db1/e v:='3' type=z
http put $DB/db1/f v:='2' type=x
http put $DB/db1/g v:='1' type=z

http put $DB/db1/_design/q language=query views:='{"v": {"map": {"type":"x", "v":{"$ge":2}, "fields": {"v":"asc"}}}}'

Then querying from node3 I see this error:

http post $DB3/db1/_find selector:='{"v":{"$gt":2}}' fields:='["v", "_id"]' execution_stats:='true'
HTTP/1.1 500 Internal Server Error

{
    "error": "unknown_error",
    "reason": "undef",
    "ref": 3538619864
}

Querying from DB1 and DB2 does work though:

http post $DB1/db1/_find selector:='{"v":{"$gt":2}}' fields:='["v", "_id"]' execution_stats:='true'
HTTP/1.1 200 OK

{
    "bookmark": "g1AAAAAyeJzLYWBgYMpgSmHgKy5JLCrJTq2MT8lPzkzJBYozpoIkOGASEKEsAE9eDYE",
    "docs": [
        {
            "_id": "c",
            "v": 3
        },
        {
            "_id": "d",
            "v": 4
        },
        {
            "_id": "e",
            "v": 3
        }
    ],
    "execution_stats": {
        "execution_time_ms": 1.316,
        "results_returned": 3,
        "total_docs_examined": 8,
        "total_keys_examined": 8,
        "total_quorum_docs_examined": 0
    },
    "warning": "No matching index found, create an index to optimize query time."
}

The error in the log looks like:

[error] 2026-03-13T04:18:15.961351Z node3@127.0.0.1 <0.33714.0> 17120bdde3 req_err(3538619864) unknown_error : undef
    [<<"binary:join/2">>,<<"mango_idx_view:-field_ranges/1-lc$^0/1-0-/1 L327">>,<<"mango_cursor_special:create/4 L35">>,<<"mango_crud:find/5 L49">>,<<"mango_httpd:handle_find_req/2 L213">>
,<<"mango_httpd:handle_req/2 L34">>,<<"chttpd:handle_req_after_auth/2 L432">>,<<"chttpd:process_request/1 L410">>]
[notice] 2026-03-13T04:18:15.961542Z node3@127.0.0.1 <0.33714.0> 17120bdde3 127.0.0.1:35984 127.0.0.1 adm POST /db1/_find 500 ok 1
[error] 2026-03-13T04:18:20.567988Z node3@127.0.0.1 <0.33894.0> -------- Invalid Index Def [{<<"map">>,{[{<<"type">>,<<"x">>},{<<"v">>,{[{<<"$ge">>,2}]}},{<<"fields">>,{[{<<"v">>,<<"asc">>
}]}}]}}]. Error: error, Reason: {badmatch,undefined}

I didn't dive in yet to see why it's happening but figure we'd want to take a better look before merging

Currently, selectors like `{ "a.b": X }` or `{ "a": { "b": X } }` are normalized into the internal form: {[{<<"a.b">>, X}]} When the selector is evaluated, we call `mango_doc:get_field(Doc, <<"a.b">>)`, which then has to parse `<<"a.b">>` into the list `[<<"a">>, <<"b">>]` so it can follow the path to the field inside the doc. This path is re-parsed every time the document is evaluated. To save time, here we change the normalized form to: {[{[<<"a">>, <<"b">>], X}]} i.e. the field path is already parsed before the selector is evaluated. This has been shown to improve selector evaluation time by 30-40%. However, it does mean the normalized form is no longer a valid ejson structure as it contains a key that is not a binary. We therefore need to turn it back into `<<"a.b">>` in cases where the selector is serialized back to JSON, for example when it is returned as a part of `_explain` output. This is done using the `join_field` and `join_keys` functions in `mango_util`. We also need to update anywhere that hard-codes a particular document field so that the field is in the parsed array representation.

In the previous commit we changed lots of code to deal with normalized selector fields now being per-parsed lists, not binaries, i.e. instead of `<<"a.b">>` a field now looks like `[<<"a">>, <<"b">>]`. This commit winds most of that back and instead makes the functions in `mango_selector` hide the fact the field is now pre-parsed from callers. So for example, the code that deals with matching selectors to indexes can continue to pass the unparsed form into `mango_selector` functions, and `mango_selector` will take care of parsing/serializing inputs as necessary.

nickva

Look better, I'll still have to re-run my mixed cluster test a few times though. But for now, I wonder if we've switched to a less optimized join if we'd still see a speedup in indexing or filtering?

nickva · 2026-03-13T15:12:34Z

src/mango/src/mango_util.erl

+join_field(Field) ->
+    Field.
+
+% binary:join/2 is not available on all Erlang versions we support; it was


Since binary:join/2 is probably a faster implementation I wonder if we still get a noticeably indexing or filtering boost from this PR?

Try indexing a few million docs and see if notice a difference

nickva · 2026-03-13T15:15:54Z

src/mango/src/mango_util.erl

+% binary:join/2 is not available on all Erlang versions we support; it was
+% added in 28.0. For now, we use this function in its place, c.f.
+% https://www.erlang.org/doc/apps/stdlib/binary.html#join/2
+binary_join(Binaries, Separator) ->


Could add a guard that Binaries is a list and Separator is a binary. Sometimes those asserts could also help the compiler to optimize some snippets.

janl reviewed Feb 5, 2026

View reviewed changes

jcoglan marked this pull request as ready for review February 13, 2026 10:16

jcoglan force-pushed the faster-norm-fields branch from 6b1d6ef to 095aca7 Compare February 20, 2026 11:06

janl approved these changes Feb 20, 2026

View reviewed changes

jcoglan force-pushed the faster-norm-fields branch 4 times, most recently from 7ddba0f to d66b35c Compare February 23, 2026 14:30

nickva reviewed Feb 23, 2026

View reviewed changes

jcoglan force-pushed the faster-norm-fields branch 2 times, most recently from d66b35c to b03f59e Compare March 12, 2026 16:59

nickva requested changes Mar 13, 2026

View reviewed changes

jcoglan force-pushed the faster-norm-fields branch from b03f59e to 1f6cc39 Compare March 13, 2026 11:03

jcoglan added 2 commits March 13, 2026 11:55

jcoglan force-pushed the faster-norm-fields branch from 1f6cc39 to a91592f Compare March 13, 2026 11:55

nickva reviewed Mar 13, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster selector execution by pre-parsing fields#5869

Faster selector execution by pre-parsing fields#5869
jcoglan wants to merge 2 commits intoapache:mainfrom
neighbourhoodie:faster-norm-fields

jcoglan commented Jan 29, 2026 •

edited

Loading

Uh oh!

janl left a comment

Uh oh!

nickva left a comment

Uh oh!

jcoglan commented Mar 12, 2026 •

edited

Loading

Uh oh!

nickva commented Mar 12, 2026

Uh oh!

nickva left a comment

Uh oh!

nickva left a comment

Uh oh!

nickva Mar 13, 2026

Uh oh!

nickva Mar 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jcoglan commented Jan 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Related Issues or Pull Requests

Checklist

Uh oh!

janl left a comment

Choose a reason for hiding this comment

Uh oh!

nickva left a comment

Choose a reason for hiding this comment

Uh oh!

jcoglan commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nickva commented Mar 12, 2026

Uh oh!

nickva left a comment

Choose a reason for hiding this comment

Uh oh!

nickva left a comment

Choose a reason for hiding this comment

Uh oh!

nickva Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

nickva Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jcoglan commented Jan 29, 2026 •

edited

Loading

jcoglan commented Mar 12, 2026 •

edited

Loading