Skip to content

bench: aggregations over decimals#6669

Merged
sploiselle merged 5 commits into
MaterializeInc:mainfrom
sploiselle:dec-agg-test
May 7, 2021
Merged

bench: aggregations over decimals#6669
sploiselle merged 5 commits into
MaterializeInc:mainfrom
sploiselle:dec-agg-test

Conversation

@sploiselle
Copy link
Copy Markdown
Contributor

@sploiselle sploiselle commented May 6, 2021

A very simple test of an aggregation over a single decimal column; this will help us understand how much the rust-dec-backed decimal implementation degrades performance and increases memory usage.

This supersedes #6658 and #6643

Sean Loiselle and others added 3 commits May 5, 2021 15:10
The Avro spec provides conflicting/unclear guidance on
canonicalizing schemas that contain decimal data (i.e. decimal
values require the presence of precision and scale fields to
express equality, but neither field is documented in the
canonicalization process).

To solve this problem for kgen, we are no longer canonicalizing
schemas.
This is a scaffold for Sean to create a new benchmark type for testing
the performance of aggregations.
Copy link
Copy Markdown
Contributor

@cirego cirego left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me!

We need to figure out why the smoke test failed in CI.

@@ -0,0 +1 @@
data
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's creating the data file and/or directory?

FROM KAFKA BROKER 'kafka:9092'
TOPIC 'aggregationtest'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081'
ENVELOPE UPSERT
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might consider using INSERT instead of UPSERT, as the upsert operator is quite a bit more expensive. Using insert should make the performance delta between implementations more apparent.

@cirego
Copy link
Copy Markdown
Contributor

cirego commented May 6, 2021

It looks like this failed with the following error message:

CREATE SOURCE
psql:/data/aggregation_views.sql:21: ERROR:  column name "OuterRecord" is ambiguous
psql:/data/aggregation_views.sql:23: ERROR:  unknown catalog item 'aggregationtest'
$ docker wait 15c81efce39dd2dcf9575085d9e51cb288a90ebdba2f625bad69ffba79c26b47

@cirego
Copy link
Copy Markdown
Contributor

cirego commented May 6, 2021

I'm able to reproduce this locally:

[chris-work:~/materialize/materialize:(dec-agg-test[|])] docker logs aggregations_create-views_run_f4dc208bfa72 
CREATE SOURCE
psql:/data/aggregation_views.sql:21: ERROR:  column name "OuterRecord" is ambiguous
psql:/data/aggregation_views.sql:23: ERROR:  unknown catalog item 'aggregationtest'

@cirego
Copy link
Copy Markdown
Contributor

cirego commented May 6, 2021

I also see kgen exiting with the following errors:

Spawning 4 generator processes, writing 25000 messages each
thread 'main' panicked at 'max value of 1000000000 exceeds value expressable with precision 15', src/kafka-util/src/bin/kgen.rs:294:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at 'max value of 1000000000 exceeds value expressable with precision 15', src/kafka-util/src/bin/kgen.rs:294:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at 'max value of 1000000000 exceeds value expressable with precision 15', src/kafka-util/src/bin/kgen.rs:294:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at 'max value of 1000000000 exceeds value expressable with precision 15', src/kafka-util/src/bin/kgen.rs:294:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
0/4 processes finished: pid=7 returncode=101
1/4 processes finished: pid=8 returncode=101
2/4 processes finished: pid=9 returncode=101
3/4 processes finished: pid=10 returncode=101

@cirego
Copy link
Copy Markdown
Contributor

cirego commented May 6, 2021

I think the "OuterRecord" error may be a red-herring, when we really should have exited with error when kgen failed.

Edit: forget what I said about the red herring. I was contemplating that perhaps the schema didn't exist in the schema registry yet, but I can see that it was created:

[chris-work:~/materialize/materialize/test/bench/aggregations:(dec-agg-test[|])] curl -s http://localhost:8081/subjects/aggregationtest-value/versions/latest | jq .
{
  "subject": "aggregationtest-value",
  "version": 1,
  "id": 13,
  "schema": "{\"type\":\"record\",\"name\":\"aggregationtest\",\"namespace\":\"com.acme.avro\",\"fields\":[{\"name\":\"Key1Unused\",\"type\":\"long\"},{\"name\":\"OuterRecord\",\"type\":{\"type\":\"record\",\"name\":\"OuterRecord\",\"fields\":[{\"name\":\"DecValue\",\"type\":{\"type\":\"bytes\",\"logicalType\":\"decimal\",\"precision\":15,\"scale\":2}}]}}]}"
}

Once #6675 lands, we should pull those improvements into this benchmark too!

@sploiselle
Copy link
Copy Markdown
Contributor Author

@cirego tysm for checking into this; the root of the problem was just an inverted inequality that caused everything else to topped down around it. re: UPSERT vs. append-only envelope, I want the ability to control the total size of the view to examine memory consumption of the refactor. I guess it's possible to also throttle this with the number of messages we send, so idk––going to go with UPSERT because it also feels like a slightly more realistic.

@sploiselle sploiselle merged commit d1e9ec3 into MaterializeInc:main May 7, 2021
@sploiselle sploiselle deleted the dec-agg-test branch May 7, 2021 17:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants