
Faster selector execution by pre-parsing fields #5869

Open

jcoglan wants to merge 2 commits into apache:main from neighbourhoodie:faster-norm-fields

Conversation

@jcoglan jcoglan commented Jan 29, 2026

Overview

While working on mango_selector:match() to implement detailed match failure information, I noticed that every time a non-operator field is evaluated, we call mango_doc:get_field(), which calls mango_util:parse_field() if its input is a binary. We can avoid re-parsing fields so often by changing norm_fields so that it normalizes to the parsed form [<<"a">>, <<"b">>] instead of <<"a.b">>.
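To illustrate the cost being avoided, here is a minimal sketch of dotted-path parsing and field lookup. This is an illustration only, not the actual `mango_util:parse_field/1` or `mango_doc:get_field/2` implementation (the real parser also handles escaped dots, which this sketch omits):

```erlang
%% field_demo.erl -- illustrative sketch only, NOT the real mango code.
-module(field_demo).
-export([parse_field/1, get_field/2]).

%% Split a dotted binary path into its components:
%% <<"a.b">> becomes [<<"a">>, <<"b">>].
parse_field(Field) when is_binary(Field) ->
    binary:split(Field, <<".">>, [global]).

%% Follow a pre-parsed path into a nested {Proplist} document. With the
%% old normalized form the split above ran on every evaluation; with the
%% new form the list is built once, at normalization time.
get_field(Value, []) ->
    Value;
get_field({Props}, [Name | Rest]) ->
    case lists:keyfind(Name, 1, Props) of
        {Name, Child} -> get_field(Child, Rest);
        false -> undefined
    end.
```

For example, `parse_field(<<"a.b">>)` returns `[<<"a">>, <<"b">>]`, and `get_field({[{<<"a">>, {[{<<"b">>, 42}]}}]}, [<<"a">>, <<"b">>])` returns `42`.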

The changes have made the benchmarks I added about 30-40% faster. (I wasn't sure where benchmarks like this should live, if anywhere... they're mostly here for me to check the effect and I'm happy to move or remove them before merging.)

In the first commit here I make this change and then update all the other logic that interacts with selectors to use this form as well. This ended up having a bigger blast radius than I would have liked, so I undid most of it and changed the mango_selector functions (and a couple of others) so they hide from other callers how normalized selector fields actually look. There are only a couple of places where these normalized selectors need to be turned back into valid JSON, and the mango_util:join_keys() function turns the array keys back into binaries as needed.

There are a couple of places in the logic for choosing indexes for queries where we now have to parse fields to match the form in which they appear in selectors, but I figure doing this once per query during index selection is better than doing it once per selector-field evaluation during index updates or filtering.

I assume the latter commit is the state we'd actually want to keep, so I'm happy to squash all the other edits away. I also need to write proper commit messages; I'll tidy up the history once I know what should be kept.

Related Issues or Pull Requests

Checklist

  • Code is written and works correctly
  • Changes are covered by tests
  • Any new configurable parameters are documented in rel/overlay/etc/default.ini
  • Documentation changes were made in the src/docs folder
  • Documentation changes were backported (separated PR) to affected branches

@janl janl (Member) left a comment:

This looks good overall, thanks!

We do not want to add erlperf and argparse as dependencies just yet, could you revert those bits? You can keep the benchmark, just disable it for now and leave instructions for how to run it.

@jcoglan jcoglan marked this pull request as ready for review February 13, 2026 10:16
@jcoglan jcoglan force-pushed the faster-norm-fields branch 4 times, most recently from 7ddba0f to d66b35c on February 23, 2026 14:30
@nickva nickva (Contributor) left a comment:

Wondering if this would affect cases where selectors are sent between cluster nodes during a release upgrade. One way to check would be to run one dev cluster node on a version without the change and the others with the change, then exercise mango-filtered changes feeds and mango index/selector queries.

jcoglan commented Mar 12, 2026

I have tried the things @nickva suggested via the following steps. I believe the results look fine, but I would appreciate others double-checking them.

  1. Check out the repo at this PR's branch
  2. Check out two other copies of the repo at admin/main, I used commit c570deff373d7f9b1c14db34e981312b09ff71aa
  3. On each one, apply the following patch:
diff --git a/rel/overlay/etc/vm.args b/rel/overlay/etc/vm.args
index b4167be7b..ca8f562dd 100644
--- a/rel/overlay/etc/vm.args
+++ b/rel/overlay/etc/vm.args
@@ -43,6 +43,8 @@
 
 # Which interfaces should the node listen on?
 -kernel inet_dist_use_interface {127,0,0,1}
+-kernel inet_dist_listen_min 9100
+-kernel inet_dist_listen_max 9200
 
 # Tell kernel and SASL not to log anything
 -kernel error_logger silent
  4. On each copy, run git clean -ffdx ; ./configure --dev --disable-spidermonkey --js-engine quickjs ; make ; ./dev/run -a a:a -n1 --erlang-cookie biscuit --node-number-seed $n, with n from 1 to 3 for each copy
  5. Run the following script to connect them into a cluster:
#!/bin/bash

source ~/.shell/couchdb

username='a'
password='a'
CDB_AUTH="$username:$password"

pwhash='"-pbkdf2:sha256-761a5edaa8635ae1fce1185d477cff666e137e5f2fe4d65037a8e08f380f67ab,b0e91baaa60de5da82be8bcd7acb4ab9,600000"'

uuid1='"019ce2e0305a7a9c9c8b4b50e992c097"'
uuid2='"019ce2e0305a7cd0861784fd28c84d67"'

for n in $(seq 1 3) ; do
  CDB_PORT="${n}5984"
  cdb '/_node/_local/_config/admins/a' -X PUT -d "$pwhash"
  cdb '/_node/_local/_config/chttpd/bind_address' -X PUT -d '"0.0.0.0"'
  cdb '/_node/_local/_config/couchdb/uuid' -X PUT -d "$uuid1"
  cdb '/_node/_local/_config/chttpd_auth/secret' -X PUT -d "$uuid2"
  cdb '/_node/_local/_config/cluster/q' -X PUT -d '"1"'
  cdb '/_node/_local/_config/cluster/n' -X PUT -d '"3"'
done

CDB_PORT=15984

ip='127.0.0.1'

cdb '/_cluster_setup' -X POST -d @- <<JSON
  {
    "action": "enable_cluster",
    "bind_address": "0.0.0.0",
    "username": "$username",
    "password": "$password",
    "node_count": "3"
  }
JSON

for n in $(seq 2 3) ; do
  cdb '/_cluster_setup' -X POST -d @- <<JSON
    {
      "action": "enable_cluster",
      "bind_address": "0.0.0.0",
      "username": "$username",
      "password": "$password",
      "port": $CDB_PORT,
      "node_count": "3",
      "remote_node": "$ip",
      "remote_current_user": "$username",
      "remote_current_password": "$password"
    }
JSON

  cdb '/_cluster_setup' -X POST -d @- <<JSON
    {
      "action": "add_node",
      "host": "$ip",
      "port": ${n}5984,
      "username": "$username",
      "password": "$password"
    }
JSON
done

cdb '/_cluster_setup' -X POST -d '{ "action": "finish_cluster" }'

cdb '/_membership' | jq

cdb is a helper from my shell environment which you can simulate using:

cdb () { curl -gsu "$CDB_AUTH" -H 'content-type: application/json' "http://127.0.0.1:${CDB_PORT}$1" "${@:2}" ; }
  6. Create a DB with some docs in it:
CDB_PORT=15984
cdb '/mydb' -X DELETE
cdb '/mydb' -X PUT
for n in {1..100} ; do cdb "/mydb/doc-$n" -X PUT -d '{ "value": '$n' }' ; done
  7. Try using _find and _explain against each server:
CDB_PORT=15984 # or 25984, 35984
cdb '/mydb/_find' -X POST -d '{ "selector": { "value": { "$gt": 3, "$lt": 7 } } }'
cdb '/mydb/_explain' -X POST -d '{ "selector": { "value": { "$gt": 3, "$lt": 7 } } }'
  8. Try filtered changes:
CDB_PORT=15984 # or 25984, 35984
cdb '/mydb/_changes?filter=_selector' -X POST -d '{ "selector": { "value": { "$gt": 3, "$lt": 7 } } }'

@jcoglan jcoglan force-pushed the faster-norm-fields branch 2 times, most recently from d66b35c to b03f59e on March 12, 2026 16:59
nickva commented Mar 12, 2026

That looks great, James. Thank you. Let me also give it a try locally with a similar setup to double-check. Then it should be good to go.

@nickva nickva (Contributor) left a comment:

Looking at mixed cluster:

(I used a bit of a simpler way to do it)

OLD (node1, node2, ...)

./dev/run --admin=adm:pass -n3 --degrade-cluster=1

NEW (node3)

./dev/run --admin=adm:pass -n1 --no-join --node-number-seed=3

With this script

#!/bin/sh

http delete $DB/db1
http put $DB/db1
http put $DB/db1/a v:='1' type=x
http put $DB/db1/b v:='2' type=x
http put $DB/db1/c v:='3' type=y
http put $DB/db1/d v:='4' type=y
http put $DB/db1/e v:='3' type=z
http put $DB/db1/f v:='2' type=x
http put $DB/db1/g v:='1' type=z

http put $DB/db1/_design/q language=query views:='{"v": {"map": {"type":"x", "v":{"$ge":2}, "fields": {"v":"asc"}}}}'

Then querying from node3 I see this error:

http post $DB3/db1/_find selector:='{"v":{"$gt":2}}' fields:='["v", "_id"]' execution_stats:='true'
HTTP/1.1 500 Internal Server Error

{
    "error": "unknown_error",
    "reason": "undef",
    "ref": 3538619864
}

Querying from DB1 and DB2 does work though:

http post $DB1/db1/_find selector:='{"v":{"$gt":2}}' fields:='["v", "_id"]' execution_stats:='true'
HTTP/1.1 200 OK

{
    "bookmark": "g1AAAAAyeJzLYWBgYMpgSmHgKy5JLCrJTq2MT8lPzkzJBYozpoIkOGASEKEsAE9eDYE",
    "docs": [
        {
            "_id": "c",
            "v": 3
        },
        {
            "_id": "d",
            "v": 4
        },
        {
            "_id": "e",
            "v": 3
        }
    ],
    "execution_stats": {
        "execution_time_ms": 1.316,
        "results_returned": 3,
        "total_docs_examined": 8,
        "total_keys_examined": 8,
        "total_quorum_docs_examined": 0
    },
    "warning": "No matching index found, create an index to optimize query time."
}

The error in the log looks like:

[error] 2026-03-13T04:18:15.961351Z node3@127.0.0.1 <0.33714.0> 17120bdde3 req_err(3538619864) unknown_error : undef
    [<<"binary:join/2">>,<<"mango_idx_view:-field_ranges/1-lc$^0/1-0-/1 L327">>,<<"mango_cursor_special:create/4 L35">>,<<"mango_crud:find/5 L49">>,<<"mango_httpd:handle_find_req/2 L213">>,<<"mango_httpd:handle_req/2 L34">>,<<"chttpd:handle_req_after_auth/2 L432">>,<<"chttpd:process_request/1 L410">>]
[notice] 2026-03-13T04:18:15.961542Z node3@127.0.0.1 <0.33714.0> 17120bdde3 127.0.0.1:35984 127.0.0.1 adm POST /db1/_find 500 ok 1
[error] 2026-03-13T04:18:20.567988Z node3@127.0.0.1 <0.33894.0> -------- Invalid Index Def [{<<"map">>,{[{<<"type">>,<<"x">>},{<<"v">>,{[{<<"$ge">>,2}]}},{<<"fields">>,{[{<<"v">>,<<"asc">>}]}}]}}]. Error: error, Reason: {badmatch,undefined}

I didn't dive in yet to see why it's happening, but I figure we'd want to take a closer look before merging.

@jcoglan jcoglan force-pushed the faster-norm-fields branch from b03f59e to 1f6cc39 on March 13, 2026 11:03
jcoglan added 2 commits March 13, 2026 11:55
Currently, selectors like `{ "a.b": X }` or `{ "a": { "b": X } }` are
normalized into the internal form:

    {[{<<"a.b">>, X}]}

When the selector is evaluated, we call `mango_doc:get_field(Doc,
<<"a.b">>)`, which then has to parse `<<"a.b">>` into the list
`[<<"a">>, <<"b">>]` so it can follow the path to the field inside the
doc. This path is re-parsed every time the document is evaluated.

To save time, here we change the normalized form to:

    {[{[<<"a">>, <<"b">>], X}]}

i.e. the field path is already parsed before the selector is evaluated.
This has been shown to improve selector evaluation time by 30-40%.

However, it does mean the normalized form is no longer a valid ejson
structure as it contains a key that is not a binary. We therefore need
to turn it back into `<<"a.b">>` in cases where the selector is
serialized back to JSON, for example when it is returned as a part of
`_explain` output. This is done using the `join_field` and `join_keys`
functions in `mango_util`.

We also need to update anywhere that hard-codes a particular document
field so that the field is in the parsed array representation.
In the previous commit we changed lots of code to deal with normalized
selector fields now being pre-parsed lists, not binaries, i.e. instead
of `<<"a.b">>` a field now looks like `[<<"a">>, <<"b">>]`.

This commit winds most of that back and instead makes the functions in
`mango_selector` hide the fact the field is now pre-parsed from callers.
So for example, the code that deals with matching selectors to indexes
can continue to pass the unparsed form into `mango_selector` functions,
and `mango_selector` will take care of parsing/serializing inputs as
necessary.
@jcoglan jcoglan force-pushed the faster-norm-fields branch from 1f6cc39 to a91592f on March 13, 2026 11:55
@nickva nickva (Contributor) left a comment:

Looks better. I'll still have to re-run my mixed-cluster test a few times though. But for now, I wonder: if we've switched to a less optimized join, would we still see a speedup in indexing or filtering?

join_field(Field) ->
    Field.

% binary:join/2 is not available on all Erlang versions we support; it was
A reviewer (Contributor) commented:

Since binary:join/2 is probably a faster implementation, I wonder if we still get a noticeable indexing or filtering boost from this PR?

Try indexing a few million docs and see if you notice a difference.

% binary:join/2 is not available on all Erlang versions we support; it was
% added in 28.0. For now, we use this function in its place, c.f.
% https://www.erlang.org/doc/apps/stdlib/binary.html#join/2
binary_join(Binaries, Separator) ->
A reviewer (Contributor) commented:

Could add guards asserting that Binaries is a list and Separator is a binary. Sometimes those assertions can also help the compiler optimize some snippets.
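A sketch of what the guarded version might look like — an assumption for illustration, not necessarily the PR's implementation, mirroring the semantics of OTP 28's binary:join/2:

```erlang
%% Sketch of binary_join/2 with the suggested guards. The guards reject
%% bad input early, and such type assertions can help the compiler
%% specialize the binary construction in the fold below.
binary_join(Binaries, Separator) when is_list(Binaries), is_binary(Separator) ->
    case Binaries of
        [] ->
            <<>>;
        [First | Rest] ->
            lists:foldl(
                fun(B, Acc) -> <<Acc/binary, Separator/binary, B/binary>> end,
                First,
                Rest
            )
    end.
```

For example, `binary_join([<<"a">>, <<"b">>], <<".">>)` yields `<<"a.b">>`, and calling it with a non-list first argument fails with a function_clause error instead of producing garbage downstream.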
