Skip to content

avro+kafka: decimal data generation#6658

Closed
sploiselle wants to merge 2 commits into
MaterializeInc:mainfrom
sploiselle:dec-agg-bench-prep
Closed

avro+kafka: decimal data generation#6658
sploiselle wants to merge 2 commits into
MaterializeInc:mainfrom
sploiselle:dec-agg-bench-prep

Conversation

@sploiselle
Copy link
Copy Markdown
Contributor

@cirego helped create the scaffolding for benchmark for aggregations over decimal data (#6643), and in trying to bring it over the finish line ran into...

  • kgen lacking decimal data generation
  • An issue propagating decimal data declarations from an Avro schema into the schema registry, and ultimately into Materialize. I don't know if this implies that no one is sending us decimal data over Avro, or if I am really "holding it wrong."

The proposed commits resolve both of the points above.

Regarding the second point above re: decimals in Avro schemas, here's a snippet of the value schema I'm using, which is congruent with this example from our Avro tests:

{
  "name": "aggregationtest",
  "type": "record",
  "namespace": "com.acme.avro",
  "fields": [
    {
      "name": "Key1Unused",
      "type": "long"
    },
    {
      "name": "Key2Unused",
      "type": "long"
    },
    {
      "name": "OuterRecord",
      "type": {
        "name": "OuterRecord",
        "type": "record",
        "fields": [
          {
            "name": "Record1",
            "type": {
              "name": "Record1",
              "type": "record",
              "fields": [
                {
                  "name": "InnerRecord1",
                  "type": {
                    "name": "InnerRecord1",
                    "type": "record",
                    "fields": [
                      {
                        "name": "DecPoint",
                        "type": {
                          "type": "bytes",
                          "scale": 2,
                          "precision": 15,
                          "logicalType": "decimal"
                        }
                      }
                    ]
                  }
                },
...

This PR might be better suited to get folded into #6643, but wanted to get @umanwizard's eyes on the change before continuing to work off of these commits.

Previously, we elided fields necessary to declare decimal types in
Avro schemas during [canonical form parsing][pcf]. This PR simply
whitelists those fields so they can get propagated to Materialize.

[pcf]: https://github.com/MaterializeInc/materialize/blob/ddf26b13bf9185d7f9e1cb8dcb74deddc689b8bd/src/avro/src/schema.rs#L1978
@sploiselle sploiselle requested review from cirego and umanwizard May 5, 2021 16:07
Comment thread src/avro/src/schema.rs
"items" => 5,
"values" => 6,
"size" => 7,
// Supports decimals
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This violates https://avro.apache.org/docs/current/spec.html#Transforming+into+Parsing+Canonical+Form , which specifies that everything except name/type/fields/symbols/items/values/size should be stripped when putting a schema in canonical form.

Perhaps we just shouldn't be canonicalizing before putting things in schema registry?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, maybe. It seems like the canonicalization spec is incomplete because that same doc says:

For the purposes of schema resolution, two schemas that are decimal logical types match if their scales and precisions match.

The doc also implies that schema resolution is "valid" only for canonical schemas:

Parsing Canonical Form is a transformation of a writer's schema that let's us define what it means for two schemas to be "the same" for the purpose of reading data written against the schema.

So implicit in this is that scale and precision must be present, but they aren't explicitly part of canonicalization. I defer to your preference in handling this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, the doc is confusing. It seems to suggest that whether to even do resolution is determined by whether the canonical forms are equal, but resolution itself requires the non-canonicalized forms! I suspect this is a mistake (or at least that it could be clarified substantially) and I can write to the Avro mailing list about it.

I think the fix for now is not to change how we do canonicalization, but rather to just not canonicalize the schemas before we write them to CSR. I think we had to make the same fix in Materialize a while back to get decimals to work for a customer.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I.e., get rid of the canonical_form calls in kgen.rs

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ty for the pointer; removing canonical_form calls here fixed the issue.

Comment thread src/kafka-util/src/bin/kgen.rs Outdated
Comment on lines +291 to +292
assert!(10i64.pow(u32::try_from(precision).unwrap()) > max);
assert!(10i64.pow(u32::try_from(precision).unwrap()) > min.abs());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if 10^precision is larger than what can fit in an i64?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed with checked_pow

@sploiselle sploiselle force-pushed the dec-agg-bench-prep branch from 0a91378 to 2bf1c78 Compare May 5, 2021 17:39
@sploiselle
Copy link
Copy Markdown
Contributor Author

Closing in favor of #6669

@sploiselle sploiselle closed this May 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants