avro+kafka: decimal data generation #6658
Previously, we elided fields necessary to declare decimal types in Avro schemas during [canonical form parsing][pcf]. This PR simply whitelists those fields so they can get propagated to Materialize.

[pcf]: https://github.com/MaterializeInc/materialize/blob/ddf26b13bf9185d7f9e1cb8dcb74deddc689b8bd/src/avro/src/schema.rs#L1978
    "items" => 5,
    "values" => 6,
    "size" => 7,
    // Supports decimals
This violates https://avro.apache.org/docs/current/spec.html#Transforming+into+Parsing+Canonical+Form , which specifies that everything except name/type/fields/symbols/items/values/size should be stripped when putting a schema in canonical form.
Perhaps we just shouldn't be canonicalizing before putting things in schema registry?
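To make the stripping rule concrete, here is a minimal sketch of which attributes survive Parsing Canonical Form and in what order. The helper names `pcf_rank` and `canonical_attrs` are hypothetical and attribute values are simplified to strings; this is not the code in `schema.rs`, only an illustration of the rule quoted from the spec.

```rust
/// Rank of an attribute in Parsing Canonical Form, or `None` if the spec
/// says the attribute is stripped. Hypothetical helper for illustration.
fn pcf_rank(key: &str) -> Option<usize> {
    match key {
        "name" => Some(1),
        "type" => Some(2),
        "fields" => Some(3),
        "symbols" => Some(4),
        "items" => Some(5),
        "values" => Some(6),
        "size" => Some(7),
        // Everything else, including logicalType/precision/scale, is dropped.
        _ => None,
    }
}

/// Keep only canonical attributes and sort them into canonical order.
fn canonical_attrs<'a>(attrs: &[(&'a str, &'a str)]) -> Vec<(&'a str, &'a str)> {
    let mut kept: Vec<(usize, (&'a str, &'a str))> = attrs
        .iter()
        .filter_map(|&(k, v)| pcf_rank(k).map(|rank| (rank, (k, v))))
        .collect();
    kept.sort_by_key(|&(rank, _)| rank);
    kept.into_iter().map(|(_, kv)| kv).collect()
}
```

Passing the attributes of a fixed-size decimal through `canonical_attrs` leaves only `name`, `type`, and `size`, which is why the decimal declaration disappears from a canonicalized schema.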
Ah, maybe. It seems like the canonicalization spec is incomplete because that same doc says:
> For the purposes of schema resolution, two schemas that are decimal logical types match if their scales and precisions match.
The doc also implies that schema resolution is "valid" only for canonical schemas:
> Parsing Canonical Form is a transformation of a writer's schema that let's us define what it means for two schemas to be "the same" for the purpose of reading data written against the schema.
So implicit in this is that scale and precision must be present, but they aren't explicitly part of canonicalization. I defer to your preference in handling this.
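The tension can be sketched in a few lines. The `FixedDecimal` struct and both functions below are hypothetical, simplified stand-ins (not Materialize's Avro types): one compares what canonical form keeps, the other applies the spec's decimal matching rule.

```rust
/// Simplified stand-in for a fixed-size Avro schema carrying a decimal
/// logical type. Hypothetical type for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct FixedDecimal {
    name: String,
    size: usize,
    precision: usize,
    scale: usize,
}

/// What survives canonicalization for this schema: name and size only.
fn canonical_key(s: &FixedDecimal) -> (String, usize) {
    (s.name.clone(), s.size)
}

/// The spec's resolution rule for decimals: scales and precisions must match.
fn decimals_match(writer: &FixedDecimal, reader: &FixedDecimal) -> bool {
    writer.precision == reader.precision && writer.scale == reader.scale
}
```

Two schemas that differ only in `scale` produce equal `canonical_key` values while `decimals_match` returns `false`, so the canonical forms alone cannot decide the match.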
yep, the doc is confusing. It seems to suggest that whether to even do resolution is determined by whether the canonical forms are equal, but resolution itself requires the non-canonicalized forms! I suspect this is a mistake (or at least that it could be clarified substantially) and I can write to the Avro mailing list about it.
I think the fix for now is not to change how we do canonicalization, but rather to just not canonicalize the schemas before we write them to CSR. I think we had to make the same fix in Materialize a while back to get decimals to work for a customer.
I.e., get rid of the canonical_form calls in kgen.rs
ty for the pointer; removing canonical_form calls here fixed the issue.
    assert!(10i64.pow(u32::try_from(precision).unwrap()) > max);
    assert!(10i64.pow(u32::try_from(precision).unwrap()) > min.abs());
what happens if 10^precision is larger than what can fit in an i64?
Fixed with checked_pow
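For reference, a minimal sketch of how `checked_pow` avoids the overflow. The function name `fits_precision` and its signature are assumptions for illustration, not the code in this PR; `i64::checked_pow` and `i64::unsigned_abs` are standard library methods.

```rust
/// Returns true if every value in `[min, max]` has magnitude below
/// 10^precision. Hypothetical helper, not the PR's actual check.
fn fits_precision(precision: u32, min: i64, max: i64) -> bool {
    match 10i64.checked_pow(precision) {
        // 10^precision fits in an i64: compare against both bounds.
        // `unsigned_abs` avoids the overflow that `abs` hits on i64::MIN.
        Some(limit) => limit > max && (limit as u64) > min.unsigned_abs(),
        // 10^precision overflows i64, so it exceeds the magnitude of any i64.
        None => true,
    }
}
```

With `pow`, a precision of 19 or more would panic in debug builds or wrap in release builds; with `checked_pow` the overflow case is handled explicitly.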
0a91378 to 2bf1c78
Closing in favor of #6669
@cirego helped create the scaffolding for a benchmark for aggregations over decimal data (#6643), and in trying to bring it over the finish line ran into...
- kgen lacking decimal data generation

The proposed commits resolve both of the points above.
Regarding the second point above re: decimals in Avro schemas, here's a snippet of the value schema I'm using, which is congruent with this example from our Avro tests:
This PR might be better suited to get folded into #6643, but wanted to get @umanwizard's eyes on the change before continuing to work off of these commits.