-
Notifications
You must be signed in to change notification settings - Fork 41
apd: use gcassert to check that tmp big.Ints don't escape, add to CI #111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
39a00dc to
2d8f51f
Compare
.github/workflows/go.yml
Outdated
| - name: 'GCAssert' | ||
| # Only run gcassert on latest version of Go. Inlining heuristics change | ||
| # from version to version. | ||
| if: ${{ matrix.arch == 'x64' && matrix.go == '1.17' }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bleh, I feel like we won't remember to update this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe change to >= for now? So at least it'll fail if the heuristics change and we get a version update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
This was unsafe. As a result, tmp1 was escaping. The next commit will add static analysis using gcassert to catch this kind of bug.
2d8f51f to
cb81d05
Compare
yuzefovich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r5, 2 of 2 files at r6, all commit messages.
Reviewable status: 2 of 4 files reviewed, 1 unresolved discussion (waiting on @jordanlewis, @nvanbenschoten, and @yuzefovich)
cb81d05 to
645a85b
Compare
This commit adds `gcassert:noescape` directives to bigint.go to verify that the temporary big.Int values that are intended to be stack allocated do not escape to the heap. The commit then adds `gcassert` to CI.
645a85b to
c93f42c
Compare
|
TFTRs! |
This commit picks up the following changes to `cockroachdb/apd`: - cockroachdb/apd#103 - cockroachdb/apd#104 - cockroachdb/apd#107 - cockroachdb/apd#108 - cockroachdb/apd#109 - cockroachdb/apd#110 - cockroachdb/apd#111 Release note (performance improvement): The memory representation of DECIMAL datums has been optimized to save space, avoid heap allocations, and eliminate indirection. This increases the speed of DECIMAL arithmetic and aggregation by up to 20% on large data sets.
74590: colexec: integrate flat, compact decimal datums r=nvanbenschoten a=nvanbenschoten Replaces #74369 and #57593. This PR picks up the following changes to `cockroachdb/apd`: - cockroachdb/apd#103 - cockroachdb/apd#104 - cockroachdb/apd#107 - cockroachdb/apd#108 - cockroachdb/apd#109 - cockroachdb/apd#110 - cockroachdb/apd#111 Release note (performance improvement): The memory representation of DECIMAL datums has been optimized to save space, avoid heap allocations, and eliminate indirection. This increases the speed of DECIMAL arithmetic and aggregation by up to 20% on large data sets. ---- At a high-level, those changes implement the "compact memory representation" for Decimals described in cockroachdb/apd#102 (comment) and later implemented in cockroachdb/apd#103. Compared to the approach on master, the approach in cockroachdb/apd#103 is a) faster, b) avoids indirection + heap allocation, c) smaller. Compared to the alternate approach in cockroachdb/apd#102, the approach in cockroachdb/apd#103 is a) [faster for most operations](cockroachdb/apd#102 (comment)), b) more usable because values can be safely copied, c) half the memory size (32 bytes per `Decimal`, vs. 64). The memory representation of the Decimal struct in this approach looks like: ```go type Decimal struct { Form int8 Negative bool Exponent int32 Coeff BigInt { _inner *big.Int // nil when value fits in _inline _inline [2]uint } } // sizeof = 32 ``` With a two-word inline array, any value that would fit in a 128-bit integer (i.e. decimals with a scale-adjusted absolute value up to 2^128 - 1) fit in `_inline`. The indirection through `_inner` is only used for values larger than this. Before this change, the memory representation of the `Decimal` struct looked like: ```go type Decimal struct { Form int64 Negative bool Exponent int32 Coeff big.Int { neg bool abs []big.Word { data uintptr ---------------. len int64 v cap int64 [uint, uint, ...] // sizeof = variable, but around cap = 4, so 32 bytes } } } // sizeof = 48 flat bytes + variable-length heap allocated array ``` ---- ## Performance impact ### Speedup on TPC-DS dataset The TPC-DS dataset is full of decimal columns, so it's a good playground to test this change. Unfortunately, the variance in the runtime performance of the TPC-DS queries themselves is high (many queries varied by 30-40% per attempt), so it was hard to get signal out of them. Instead, I imported the TPC-DS dataset with a scale factor of 10 and ran some custom aggregation queries against the largest table (`web_sales`, row count = 7,197,566): Queries ```sql # q1 select sum(ws_wholesale_cost + ws_ext_list_price) from web_sales; # q2 select sum(2 * ws_wholesale_cost + ws_ext_list_price) - max(4 * ws_ext_ship_cost), min(ws_net_profit) from web_sales; # q3 select max(ws_bill_customer_sk + ws_bill_cdemo_sk + ws_bill_hdemo_sk + ws_bill_addr_sk + ws_ship_customer_sk + ws_ship_cdemo_sk + ws_ship_hdemo_sk + ws_ship_addr_sk + ws_web_page_sk + ws_web_site_sk + ws_ship_mode_sk + ws_warehouse_sk + ws_promo_sk + ws_order_number + ws_quantity + ws_wholesale_cost + ws_list_price + ws_sales_price + ws_ext_discount_amt + ws_ext_sales_price + ws_ext_wholesale_cost + ws_ext_list_price + ws_ext_tax + ws_coupon_amt + ws_ext_ship_cost + ws_net_paid + ws_net_paid_inc_tax + ws_net_paid_inc_ship + ws_net_paid_inc_ship_tax + ws_net_profit) from web_sales; ``` Here's the difference in runtime of these three queries before and after this change on an `n2-standard-4` instance: ``` name old s/op new s/op delta TPC-DS/custom/q1 7.21 ± 3% 6.59 ± 0% -8.57% (p=0.000 n=10+10) TPC-DS/custom/q2 10.2 ± 0% 9.7 ± 3% -5.42% (p=0.000 n=10+10) TPC-DS/custom/q3 21.9 ± 1% 17.3 ± 0% -21.13% (p=0.000 n=10+10) ``` ### Heap allocation reduction in TPC-DS Part of the reason for this speedup was that it significantly reduces heap allocations because most decimal values are stored inline. We can see this in q3 from above. Before the change, a heap profile looks like: <img width="1751" alt="Screen Shot 2022-01-07 at 7 12 49 PM" src="https://user-images.githubusercontent.com/5438456/148625159-9ceb470a-0742-4f75-a533-530d9944143c.png"> After the change, a heap profile looks like: <img width="1749" alt="Screen Shot 2022-01-07 at 7 17 32 PM" src="https://user-images.githubusercontent.com/5438456/148625174-629f4b47-07cc-4ef6-8723-2e556f7fc00d.png"> _(the dominant source of heap allocations is now `coldata.(*Nulls).Or`. #74592 should help here)_ ### Heap allocation reduction in TPC-E On the read-only portion of the TPC-E (77% of the full workload, in terms of txn mix), this change has a significant impact on total heap allocations. Before the change, `math/big.nat.make` was responsible for **51.07%** of total heap allocations: <img width="1587" alt="Screen Shot 2021-12-31 at 8 01 00 PM" src="https://user-images.githubusercontent.com/5438456/147842722-965d649d-b29a-4f66-aa07-1b05e52e97af.png"> After the change, `math/big.nat.make` is responsible for only **1.1%** of total heap allocations: <img width="1580" alt="Screen Shot 2021-12-31 at 9 04 24 PM" src="https://user-images.githubusercontent.com/5438456/147842727-a881a5a3-d038-48bb-bd44-4ade665afe73.png"> That equates to roughly a **50%** reduction in heap allocations. ### Microbenchmarks ``` name old time/op new time/op delta Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1024-10 65.6µs ± 2% 42.5µs ± 0% -35.15% (p=0.000 n=9+8) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1024-10 68.4µs ± 1% 48.4µs ± 1% -29.20% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32768-10 1.65ms ± 1% 1.20ms ± 1% -27.31% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1048576-10 51.4ms ± 1% 38.3ms ± 1% -25.59% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32-10 12.5µs ± 1% 9.4µs ± 2% -24.72% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32-10 12.5µs ± 1% 9.6µs ± 2% -23.24% (p=0.000 n=8+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1-10 10.5µs ± 1% 8.0µs ± 1% -23.22% (p=0.000 n=9+9) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32-10 12.4µs ± 1% 9.6µs ± 1% -22.70% (p=0.000 n=8+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1024-10 60.5µs ± 1% 47.1µs ± 2% -22.24% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1024-10 61.2µs ± 1% 47.7µs ± 1% -22.09% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1024-10 62.3µs ± 1% 48.7µs ± 2% -21.91% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32768-10 1.31ms ± 0% 1.03ms ± 1% -21.53% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1024-10 82.3µs ± 1% 64.9µs ± 1% -21.12% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1024-10 86.6µs ± 1% 68.5µs ± 1% -20.93% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1024-10 96.0µs ± 1% 77.1µs ± 1% -19.73% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1048576-10 41.2ms ± 0% 33.1ms ± 0% -19.64% (p=0.000 n=8+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32-10 17.5µs ± 1% 14.3µs ± 2% -18.59% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1-10 14.8µs ± 3% 12.1µs ± 3% -18.26% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32-10 20.0µs ± 1% 16.4µs ± 1% -18.04% (p=0.000 n=9+9) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32-10 20.9µs ± 1% 17.2µs ± 3% -17.80% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=32768-10 884µs ± 0% 731µs ± 0% -17.30% (p=0.000 n=10+9) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1048576-10 27.9ms ± 0% 23.1ms ± 0% -17.27% (p=0.000 n=9+9) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1024-10 218µs ± 2% 181µs ± 2% -17.23% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=32768-10 911µs ± 1% 755µs ± 1% -17.10% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32768-10 957µs ± 1% 798µs ± 0% -16.66% (p=0.000 n=9+9) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=32768-10 1.54ms ± 1% 1.29ms ± 1% -16.56% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1024-10 188µs ± 1% 157µs ± 2% -16.33% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1048576-10 28.8ms ± 0% 24.1ms ± 0% -16.14% (p=0.000 n=9+9) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1048576-10 30.4ms ± 0% 25.7ms ± 1% -15.26% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1048576-10 135ms ± 1% 114ms ± 1% -15.21% (p=0.000 n=10+9) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=32768-10 1.79ms ± 1% 1.52ms ± 1% -15.14% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32768-10 6.29ms ± 1% 5.50ms ± 1% -12.62% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1048576-10 62.2ms ± 0% 54.7ms ± 0% -12.08% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32768-10 2.46ms ± 1% 2.17ms ± 1% -11.88% (p=0.000 n=10+9) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32768-10 5.64ms ± 0% 4.98ms ± 0% -11.76% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1048576-10 354ms ± 2% 318ms ± 1% -10.18% (p=0.000 n=10+8) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1048576-10 91.8ms ± 1% 83.3ms ± 0% -9.25% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1048576-10 396ms ± 1% 369ms ± 1% -6.83% (p=0.000 n=8+8) name old speed new speed delta Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1024-10 125MB/s ± 2% 193MB/s ± 0% +54.20% (p=0.000 n=9+8) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1024-10 120MB/s ± 1% 169MB/s ± 1% +41.24% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32768-10 159MB/s ± 1% 219MB/s ± 1% +37.57% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1048576-10 163MB/s ± 1% 219MB/s ± 1% +34.39% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32-10 20.4MB/s ± 1% 27.2MB/s ± 2% +32.85% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1-10 764kB/s ± 2% 997kB/s ± 1% +30.45% (p=0.000 n=10+9) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32-10 20.5MB/s ± 1% 26.8MB/s ± 2% +30.28% (p=0.000 n=8+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32-10 20.7MB/s ± 1% 26.8MB/s ± 1% +29.37% (p=0.000 n=8+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1024-10 135MB/s ± 1% 174MB/s ± 2% +28.61% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1024-10 134MB/s ± 1% 172MB/s ± 1% +28.35% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1024-10 131MB/s ± 1% 168MB/s ± 2% +28.06% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32768-10 200MB/s ± 0% 255MB/s ± 1% +27.45% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1024-10 100MB/s ± 1% 126MB/s ± 1% +26.78% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1024-10 94.6MB/s ± 1% 119.6MB/s ± 1% +26.47% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1024-10 85.3MB/s ± 1% 106.3MB/s ± 1% +24.58% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1048576-10 204MB/s ± 0% 254MB/s ± 0% +24.44% (p=0.000 n=8+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32-10 14.6MB/s ± 1% 18.0MB/s ± 2% +22.83% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1-10 544kB/s ± 3% 664kB/s ± 2% +22.06% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32-10 12.8MB/s ± 1% 15.6MB/s ± 1% +22.02% (p=0.000 n=9+9) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32-10 12.3MB/s ± 1% 14.9MB/s ± 3% +21.67% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=32768-10 296MB/s ± 0% 358MB/s ± 0% +20.92% (p=0.000 n=10+9) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1048576-10 300MB/s ± 0% 363MB/s ± 0% +20.87% (p=0.000 n=9+9) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1024-10 37.5MB/s ± 2% 45.4MB/s ± 2% +20.82% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=32768-10 288MB/s ± 1% 347MB/s ± 1% +20.62% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32768-10 274MB/s ± 1% 329MB/s ± 0% +19.99% (p=0.000 n=9+9) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=32768-10 170MB/s ± 1% 204MB/s ± 1% +19.85% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1024-10 43.6MB/s ± 1% 52.1MB/s ± 2% +19.52% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1048576-10 292MB/s ± 0% 348MB/s ± 0% +19.25% (p=0.000 n=9+9) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1048576-10 276MB/s ± 0% 326MB/s ± 1% +18.00% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1048576-10 62.1MB/s ± 1% 73.3MB/s ± 1% +17.94% (p=0.000 n=10+9) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=32768-10 147MB/s ± 1% 173MB/s ± 1% +17.83% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32768-10 41.7MB/s ± 1% 47.7MB/s ± 1% +14.44% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1048576-10 135MB/s ± 0% 153MB/s ± 0% +13.74% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32768-10 106MB/s ± 1% 121MB/s ± 1% +13.48% (p=0.000 n=10+9) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32768-10 46.5MB/s ± 0% 52.7MB/s ± 0% +13.34% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1048576-10 23.7MB/s ± 2% 26.3MB/s ± 2% +11.02% (p=0.000 n=10+9) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1048576-10 91.3MB/s ± 0% 100.7MB/s ± 0% +10.27% (p=0.000 n=8+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1048576-10 21.2MB/s ± 1% 22.7MB/s ± 1% +7.32% (p=0.000 n=8+8) name old alloc/op new alloc/op delta Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32768-10 354kB ± 0% 239kB ± 0% -32.39% (p=0.000 n=9+9) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32768-10 348kB ± 0% 239kB ± 0% -31.23% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1024-10 251kB ± 0% 177kB ± 0% -29.44% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1024-10 246kB ± 0% 177kB ± 0% -28.28% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32768-10 275kB ± 0% 198kB ± 0% -28.06% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1024-10 243kB ± 0% 177kB ± 0% -27.15% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1024-10 242kB ± 0% 177kB ± 0% -27.09% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1024-10 242kB ± 0% 177kB ± 0% -27.06% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=32768-10 268kB ± 0% 198kB ± 0% -26.05% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=32768-10 264kB ± 0% 198kB ± 0% -25.04% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32-10 75.1kB ± 0% 56.9kB ± 0% -24.25% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32-10 74.9kB ± 0% 56.9kB ± 0% -24.12% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32-10 74.8kB ± 0% 56.9kB ± 0% -23.99% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1-10 69.6kB ± 0% 53.1kB ± 0% -23.66% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1-10 95.2kB ± 0% 75.9kB ± 0% -20.23% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32-10 102kB ± 0% 82kB ± 0% -20.04% (p=0.000 n=8+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32-10 103kB ± 0% 83kB ± 0% -19.95% (p=0.000 n=7+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32-10 100kB ± 0% 80kB ± 0% -19.90% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1048576-10 1.14MB ± 0% 0.92MB ± 0% -18.80% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1024-10 271kB ± 0% 227kB ± 0% -16.16% (p=0.000 n=9+9) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1048576-10 1.10MB ± 0% 0.92MB ± 0% -15.92% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1024-10 280kB ± 1% 235kB ± 1% -15.91% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1048576-10 1.09MB ± 1% 0.92MB ± 0% -15.67% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1024-10 291kB ± 0% 245kB ± 1% -15.53% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=32768-10 1.11MB ± 0% 0.95MB ± 0% -15.14% (p=0.000 n=8+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=32768-10 1.22MB ± 0% 1.04MB ± 0% -14.77% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32768-10 1.65MB ± 0% 1.42MB ± 0% -13.56% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1024-10 593kB ± 0% 513kB ± 0% -13.36% (p=0.000 n=9+8) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1024-10 520kB ± 0% 454kB ± 0% -12.82% (p=0.000 n=9+8) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1048576-10 1.04MB ± 0% 0.92MB ± 0% -11.06% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1048576-10 2.48MB ± 0% 2.25MB ± 0% -9.32% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1048576-10 967kB ± 0% 881kB ± 0% -8.89% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1048576-10 7.86MB ± 0% 7.36MB ± 0% -6.44% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32768-10 14.2MB ± 1% 13.4MB ± 1% -5.83% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32768-10 12.3MB ± 0% 11.7MB ± 0% -5.03% (p=0.001 n=7+7) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1048576-10 27.2MB ± 1% 25.9MB ± 1% -4.84% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1048576-10 465MB ± 0% 445MB ± 0% -4.32% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1048576-10 403MB ± 0% 390MB ± 0% -3.44% (p=0.000 n=10+10) name old allocs/op new allocs/op delta Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1024-10 1.07k ± 0% 0.05k ± 0% -95.70% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1048576-10 702k ± 0% 32k ± 0% -95.46% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1048576-10 489k ± 0% 28k ± 0% -94.33% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32768-10 4.40k ± 0% 0.30k ± 0% -93.15% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1024-10 1.11k ± 0% 0.09k ± 0% -92.02% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1024-10 561 ± 0% 46 ± 0% -91.80% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32768-10 3.45k ± 0% 0.30k ± 0% -91.28% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1024-10 1.19k ± 0% 0.15k ± 1% -87.31% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=32768-10 4.87k ± 0% 0.70k ± 0% -85.69% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32768-10 32.2k ± 0% 6.3k ± 0% -80.40% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32768-10 1.45k ± 3% 0.29k ± 0% -79.66% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1024-10 1.39k ± 0% 0.30k ± 1% -78.64% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32768-10 26.2k ± 0% 6.8k ± 1% -73.95% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=32768-10 6.64k ± 0% 1.95k ± 0% -70.67% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1024-10 3.44k ± 1% 1.12k ± 1% -67.48% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1048576-10 62.4k ± 0% 20.4k ± 0% -67.32% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1024-10 2.95k ± 1% 1.05k ± 1% -64.52% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32768-10 10.8k ± 0% 4.5k ± 0% -58.21% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=32768-10 628 ± 3% 294 ± 0% -53.21% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1048576-10 36.1k ± 0% 20.2k ± 0% -44.06% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1024-10 81.7 ± 3% 46.0 ± 0% -43.67% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1048576-10 14.4k ± 1% 8.2k ± 0% -42.97% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32-10 79.0 ± 0% 46.0 ± 0% -41.77% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1048576-10 13.7k ± 1% 8.2k ± 0% -40.05% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32-10 191 ± 1% 120 ± 1% -37.52% (p=0.000 n=7+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1048576-10 12.9k ± 2% 8.2k ± 0% -36.17% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32-10 176 ± 2% 115 ± 1% -34.33% (p=0.000 n=10+9) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1048576-10 12.3k ± 0% 8.2k ± 0% -33.21% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1048576-10 21.8k ± 0% 15.2k ± 0% -30.13% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32-10 118 ± 0% 84 ± 0% -28.81% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32-10 63.0 ± 0% 46.0 ± 0% -26.98% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1024-10 57.2 ±14% 46.0 ± 0% -19.58% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1048576-10 9.69k ± 1% 8.23k ± 0% -15.07% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=32768-10 340 ± 2% 294 ± 0% -13.43% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1-10 48.0 ± 0% 46.0 ± 0% -4.17% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32-10 48.0 ± 0% 46.0 ± 0% -4.17% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1024-10 48.0 ± 0% 46.0 ± 0% -4.17% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1-10 82.0 ± 0% 79.0 ± 0% -3.66% (p=0.000 n=10+10) ``` Co-authored-by: Nathan VanBenschoten <[email protected]>
Updated planProjectionOperators and planSelectionOperators to include case for tree.NotExpr
Replaced RunTests with RunTestsWithoutAllNullsInjection
Fixes for build & lint issues
Removed unused field
Addressing PR review comments
backupccl: use SpanGroup.Sub instead of CoveringMerge
Release note: none.
rowenc: reduce reliance on the entire TableDescriptor
This change reduces uses of the entire TableDescriptor.
Release note: None
descpb: add and use NoColumnID constant
This change adds a `descpb.NoColumnID` constant so we don't have to
use `descpb.ColumnID(encoding.NoColumnID)` everywhere.
Release note: None
rowenc: remove DatumAlloc.scratch
This commit removes the `DatumAlloc.scratch` field which is always
nil.
Release note: None
sql: move `DatumAlloc` to sem/tree
This commit moves `DatumAlloc` from `sql/rowenc` to `sql/sem/tree`.
This is a fairly low-level construct and is not used exclusively for
row encoding/decoding.
Release note: None
rowenc: move low-level key encoding functions to subpackage
The `rowenc` package contains a hefty mix of functions operating at
different conceptual levels - some use higher level constructs like
table and index descriptors, others are lower level utilities. The
naming of these functions doesn't help, for example `EncodeTableKey`
sounds like a top-level function but is actually a low level utility
that appends a single value to a key
This commit moves the lower level utilities for encoding datum values
into keys to the `rowenc/keyside` package.
Release note: None
rowenc: minor cleanup of array decoding code
The code around array decoding was confusing; we now have a single
`decodeArray` variant which is the inverse of `encodeArray`.
Release note: None
rowenc: move low-level value encoding functions to subpackage
This commit moves the lower level utilities for encoding datum values
into values to the `rowenc/valueside` package.
Release note: None
valueside: introduce ColumnIDDelta type
When encoding multiple values, we encode the differences between the
ColumnIDs. This difference is passed to `valueside.Encode`, but it is
passed as a `descpb.ColumnID`. This commit introduces a new type to
make this distinction more evident.
Release note: None
valueside: add Decoder helper
We have a few copies of the same logic of decoding multiple SQL values
from a Value, in particular in range feed handling code.
This commit adds a helper type that centralizes this logic, reducing
duplication.
Note that the query executione engines retain their own specialized
implementations.
Release note: None
sql: release BatchFlowCoordinator objects to pool
This commit adds `BatchFlowCoordinator` objects to their `vectorizedFlowCreator`'s
`releasables` slice, which ensures that their `Release` method is called and that
they are properly recycled.
Before this change, heap profiles and instrumentation both indicated that these
objects were not being recycled as intended. For example, heap profiles looked like:
```
Type: alloc_objects
Time: Dec 30, 2021 at 2:10pm (EST)
Active filters:
focus=Pool\).Get
Showing nodes accounting for 92189, 0.57% of 16223048 total
----------------------------------------------------------+-------------
flat flat% sum% cum cum% calls calls% + context
----------------------------------------------------------+-------------
71505 79.69% | github.com/cockroachdb/cockroach/pkg/sql/colflow.NewBatchFlowCoordinator /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/flow_coordinator.go:218
10923 12.17% | github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.newLockTableGuardImpl /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/lock_table.go:452
2048 2.28% | github.com/cockroachdb/cockroach/pkg/sql/catalog/nstree.(*Map).maybeInitialize /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/nstree/map.go:112
1170 1.30% | github.com/cockroachdb/cockroach/pkg/sql/colexec.newMaterializerInternal /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/materializer.go:179
1030 1.15% | github.com/cockroachdb/pebble.(*Iterator).Clone /Users/nathan/Go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/iterator.go:1228
585 0.65% | github.com/cockroachdb/cockroach/pkg/sql.MakeDistSQLReceiver /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:841
585 0.65% | github.com/cockroachdb/cockroach/pkg/sql/colfetcher.NewColBatchScan /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/colfetcher/colbatch_scan.go:223
585 0.65% | github.com/cockroachdb/cockroach/pkg/sql/span.MakeBuilder /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/sql/span/span_builder.go:61
585 0.65% | github.com/cockroachdb/cockroach/pkg/storage.mvccGet /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:859
585 0.65% | github.com/cockroachdb/cockroach/pkg/storage.mvccScanToBytes /Users/nathan/Go/src/github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:2357
128 0.14% | github.com/cockroachdb/pebble.(*DB).newIterInternal /Users/nathan/Go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/db.go:765
0 0% 0% 89729 0.55% | sync.(*Pool).Get /Users/nathan/Go/go/src/sync/pool.go:148
```
Notice that `sync.(*Pool).Get` is responsible for *0.55%* of heap allocations and
**79.69%** of these are from `colflow.NewBatchFlowCoordinator`.
After this change, that source of allocations goes away and we see the following
impact on micro-benchmarks:
```
name old time/op new time/op delta
KV/Scan/SQL/rows=1-10 95.1µs ± 7% 95.9µs ± 5% ~ (p=0.579 n=10+10)
KV/Scan/SQL/rows=10-10 100µs ± 3% 103µs ±12% ~ (p=0.829 n=8+10)
name old alloc/op new alloc/op delta
KV/Scan/SQL/rows=10-10 21.7kB ± 0% 21.5kB ± 0% -0.76% (p=0.000 n=10+10)
KV/Scan/SQL/rows=1-10 20.1kB ± 0% 19.9kB ± 0% -0.70% (p=0.000 n=10+9)
name old allocs/op new allocs/op delta
KV/Scan/SQL/rows=1-10 245 ± 0% 244 ± 0% -0.41% (p=0.000 n=10+10)
KV/Scan/SQL/rows=10-10 280 ± 0% 279 ± 0% -0.36% (p=0.001 n=8+9)
```
sql/catalog: restore fast-path in FullIndexColumnIDs
This commit restores a [fast-path](https://github.com/cockroachdb/cockroach/commit/c9e116e586f24c5f3a831ac653f14fd03f588b93#diff-19625608f4a6e23e6fe0818f3a621e716615765cb338d18fe34b43f0a535f06dL140)
in `FullIndexColumnIDs` which was lost in c9e116e. The fast-path avoided
the allocation of a `ColumnID` slice and a `IndexDescriptor_Direction`
slice in `FullIndexColumnIDs` when given a unique index. In such cases,
these slices are already stored on the `IndexDescriptor`.
```
name old time/op new time/op delta
KV/Scan/SQL/rows=1-10 94.9µs ±10% 94.9µs ± 8% ~ (p=0.739 n=10+10)
name old alloc/op new alloc/op delta
KV/Scan/SQL/rows=1-10 20.1kB ± 0% 20.1kB ± 1% ~ (p=0.424 n=10+10)
name old allocs/op new allocs/op delta
KV/Scan/SQL/rows=1-10 245 ± 0% 241 ± 0% -1.63% (p=0.000 n=10+8)
```
kv: protect Replica's lastToReplica and lastFromReplica fields with raftMu
This commit moves the Replica's lastToReplica and lastFromReplica from
under the `Replica.mu` mutex to the `Replica.raftMu` mutex. These are
strictly Raft-specific pieces of state, so we don't need fine-grained
locking around them. As a reward, we don't need to grab the `Replica.mu`
exclusively (or at all) when setting the fields in
`Store.withReplicaForRequest`.
The locking in `setLastReplicaDescriptors` showed up in a mutex profile
under a write-heavy workload. It was responsible for **3.44%** of mutex
wait time. Grabbing the mutex was probably also slowing down request
processing, as the exclusive lock acquisition had to wait for read locks
to be dropped.
coldata: operate on Nulls value, not reference
This commit changes `col.Vec.SetNulls` to accept a `Nulls` struct by
value instead of by pointer. This lets us avoid a heap allocation on
each call to `Nulls.Or`.
Releaes note: None
roachtest: fix silly bug in tlp roachtest
I thought that we'd get some basic coverage of roachtests in unit tests
at low duration or something, but I guess not.
Release note: None
sql: support readonly default_with_oids var
This is just one less error when importing PGDUMP.
Release note (sql change): We now support `default_with_oids` which
only accepts being false.
importccl,jobs: truncate all job system errors
In #73303 we truncated the row data in this error message to prevent
problems with large row data preventing the job status from being
saved.
However, truncating the row means that our
"experimental_save_rejected" option does not work as expected. This
feature previously saved entire rows and now it would have truncated
rows in some case.
Now, we only truncate the error when producing the error
string. Further, we also add truncation to the job system to prevent
other parts of the system from saving too much data to the job system.
Release note: None
importccl: remove unused argument
Release note: None
sql/logictest: add diff support to expectation output
This adds support for generating a unified diff for expectation
mismatches. This can make it easier to spot the differences between
two results when the result contains many rows.
For example:
```
../../sql/logictest/testdata/logic_test/system_namespace:54: SELECT * FROM system.namespace
expected:
0 0 defaultdb 50
0 0 postgres 51
0 0 system 1
0 0 test 52
1 0 public 29
1 29 comments 24
1 29 database_role_settings 44
1 29 descriptor 3
1 29 descriptor_id_seq 7
1 29 eventlog 12
1 29 jobs 15
1 29 join_tokens 41
1 29 lease 11
1 29 locations 21
1 29 migrations 40
1 29 namespace 30
1 29 protected_ts_meta 31
1 29 protected_ts_records 32
1 29 rangelog 13
1 29 replication_constraint_stats 25
1 29 replication_critical_localities 2
1 29 replication_stats 27
1 29 reports_meta 28
1 29 role_members 23
1 29 role_options 33
1 29 scheduled_jobs 37
1 29 settings 6
1 29 sql_instances 46
1 29 sqlliveness 39
1 29 statement_bundle_chunks 34
1 29 statement_diagnostics 36
1 29 statement_diagnostics_requests 35
1 29 statement_statistics 42
1 29 table_statistics 20
1 29 transaction_statistics 43
1 29 ui 14
1 29 users 4
1 29 web_sessions 19
1 29 zones 5
50 0 public 29
51 0 public 29
52 0 public 29
but found (query options: "rowsort" -> ignore the following ordering of rows) :
0 0 defaultdb 50
0 0 postgres 51
0 0 system 1
0 0 test 52
1 0 public 29
1 29 comments 24
1 29 database_role_settings 44
1 29 descriptor 3
1 29 descriptor_id_seq 7
1 29 eventlog 12
1 29 jobs 15
1 29 join_tokens 41
1 29 lease 11
1 29 locations 21
1 29 migrations 40
1 29 namespace 30
1 29 protected_ts_meta 31
1 29 protected_ts_records 32
1 29 rangelog 13
1 29 replication_constraint_stats 25
1 29 replication_critical_localities 26
1 29 replication_stats 27
1 29 reports_meta 28
1 29 role_members 23
1 29 role_options 33
1 29 scheduled_jobs 37
1 29 settings 6
1 29 sql_instances 46
1 29 sqlliveness 39
1 29 statement_bundle_chunks 34
1 29 statement_diagnostics 36
1 29 statement_diagnostics_requests 35
1 29 statement_statistics 42
1 29 table_statistics 20
1 29 transaction_statistics 43
1 29 ui 14
1 29 users 4
1 29 web_sessions 19
1 29 zones 5
50 0 public 29
51 0 public 29
52 0 public 29
Diff:
--- Expected
+++ Actual
@@ -20,3 +20,3 @@
1 29 replication_constraint_stats 25
- 1 29 replication_critical_localities 2
+ 1 29 replication_critical_localities 26
1 29 replication_stats 27
logic.go:3340:
../../sql/logictest/testdata/logic_test/system_namespace:100: error while processing
```
The diff output is only added when the `-show-diff` flag is provided
because in some cases it can produce more noise than it is worth.
The diff library used here is the same one already used by the testify
libraries we depend on.
Release note: None
teamcity-trigger: give `kvserver` package higher stress timeout
Closes https://github.com/cockroachdb/cockroach/issues/69519.
Release note: None
backupccl: allow columns of type array in EXPORT PARQUET
Previously, EXPORT PARQUET only supported CRDB relations whose columns were
scalars of type int, string, float, or boolean. This change allows columns of
type array whose values can be int, string, float, boolean. Following the CRDB
ARRAY documentation, a value in an exported array can be NULL as well.
Informs: #67710
Release note (sql change): EXPORT PARQUET can export columns of type array
backupccl: remove old makeImportSpans
Release note: none.
backupccl: expand some comments on covering code
Release note: none.
security: make the bcrypt cost configurable
Release note (security update): For context, when configuring
passwords for SQL users, if the client presents the password in
cleartext via ALTER/CREATE USER/ROLE WITH PASSWORD, CockroachDB is
responsible for hashing this password before storing it.
By default, this hashing uses CockroachDB's bespoke `crdb-bcrypt`
algorithm, itself based off the standard Bcrypt algorithm.
The cost of this hashing function is now configurable via the new
cluster setting
`server.user_login.password_hashes.default_cost.crdb_bcrypt`.
Its default value is 10, which corresponds to an approximate
password check latency of 50-100ms on modern hardware.
This value should be increased over time to reflect improvements to
CPU performance: the latency should not become so small that it
becomes feasible to bruteforce passwords via repeated login attempts.
Future versions of CockroachDB will likely update the default accordingly.
security: make `server.user_login.min_password_length` visible in doc gen
This cluster setting was meant to be exported for visibility in
auto-generated docs (we've documented it before). This was an oversight.
Release note: None
authors: add Fenil Patel to authors
Release note: None
bench: force the index join in BenchmarkIndexJoin
Without using the index hint, we now choose to perform the full scan
over the primary index and put the filter on top of that rather than
performing a limited scan of the secondary followed by an index join.
I confirmed that this is the case on 21.1 and 21.2 binaries, so I'll
backport to the latter (in order to make the comparison between releases
sane).
Release note: None
dev: a few improvements to the script
* Add `set -e` so if the `dev` build fails, it doesn't try to run the
binary anyway.
* Add a way to force recompilation of `dev` (useful for developers).
Release note: None
changefeedccl: Shutdown tenant before main server.
Shutdown tenant server before stopping main test server.
This elliminates some of the error messages we see in the test
output when tenant attempts to connect to some ranges which are no
longer accessible.
Release Notes: None
importccl: fix import pgdump target column bug
Previously, if a COPY FROM statement had less columns than
the CREATE TABLE schema defined in the dump file,
we would get a nil pointer exception. This is because
we were not filling the non-targeted columns with a NULL datum.
This change fixes that and aligns behvaiour with how INSERT
handles non-targeted columns.
Release note (bug fix): IMPORT TABLE ... PGDUMP with a
COPY FROM statement in the dump file that has less target columns
than the CREATE TABLE schema definition would result in a
nil pointer exception.
roachtest: add new passing tests for pgjdbc nightly
Release note: None
go.mod: fix ordering
The first `require` is for direct dependencies. The second one for
indirect dependencies. There's no need for a third one.
Release note: None
Makefile: specify that code needs to be generated for `roach{test,prod}`
Release note: None
setting: clean up slot indexes
The internal setting code passes around slot indexes as bare integers,
and there are confusingly two variants: one is 1-indexed, another is
0-indexed.
This change makes all slot indexes 0-indexed and adds a `slotIdx`
type.
Note: the 1-indexed choice was perhaps useful at the time to help find
code paths where the slot index is not initialized, but now the slot
index is set along with other mandatory fields by `init()`.
Release note: None
setting: prevent use of SystemOnly settings from tenant code
This change implements the `system-only` semantics in the RFC
(#73349).
All SystemOnly setting values are now invisible from tenant code. If
tenant code attempts to get or set a SystemOnly setting by handle, it
results in panics in test builds; in non-test builds, these settings
always report the default value.
Release note (sql change): System-only cluster settings are no longer
visible for non-system tenants.
build: fix Pebble nightly benchmarks
The Pebble nightly benchmarks stopped running due to a build error
stemming from the removal of the generated code from the codebase.
Release note: None
scpb: move Node to screl, introduce scpb.TargetState
This refactoring commit moves scpb.Node to screl and groups the targets
into a new scpb.TargetState. Separating the quasi-constant target
elements and statuses from the elements' intermediate statuses
ultimately makes for cleaner code in the declarative schema changer.
Release note: None
scplan: make child packages internal, including scgraph and scgraphviz
This commit moves scgraph and scgraphviz under scplan, and makes all its
child packages internal.
Release note: None
<pkg>: <short description - lowercase, no final period>
authors: add gtr to authors
Release note: None
sql: format statements for the UI
Previously, statements sent to the UI were not formatted, making them difficult
to read. With this change, statements are now formatted using a builtin
function that prettifies statements (using existing pretty-printing logic).
Statements are formatted using pre-determined pretty-printing configuration
options mentioned in this issue: #71544.
Resolves: #71544
Release note (sql change): statements are now formatted prior to being send to
the UI, this is done using a new builtin function that formats statements.
ui: added formatting to statements on the details pages
Previously, statements displayed on the statement/transaction/index details
pages were not formatted. Formatting was added to allow for better readability
of statements on these detail pages. Requested statements are now formatted on
the server using the existing pretty-printing logic. Statements returned from
the statement handler are now formatted.
Formatting is done via a new builtin function 'prettify_statement', a wrapper
function over the existing pretty-printing logic.
Resolves: #71544
Release note (ui change): added formatting to statements on the statement,
transaction and index details pages.
roachpb: InternalServer API for tenant settings
This commit introduces the API that the tenants will use to obtain and
list for updates to tenant setting overrides. The API was designed to
allow for maximum flexibility on the server side so that it can be
implemented as a range feed without any extra state (if necessary).
Release note: None
schemachanger: fully qualify object names inside event log entries
Previously, the original SQL was displayed inside the declarative schema
changers event logs, which was both missing redactions and full name resolutions.
This was inadequate because the existing schema changer always fully resolved
missing names within its entries. To address this, this patch adds new metadata
that contains the fully resolved and redacted text.
Release note: None
schemachanger: full resolve names inside the AST for event logs and errors
Previously, the declarative schema changer left the AST
untouched when resolving names. This was inadequate because
both event log entry generation and some error messages generated
expect the fully resolved names inside the AST. To address this,
this patch adds support for copying and altering the AST including
support for adding annotations. Additionally, all resolved names
are now propagated back into the AST and extra validation is introduced
to make sure that no unresolved names are left.
Release note: None
schemachanger: add statement tag to event log entries
Previously, the tag field inside the event log entries
generated by the declarative schema changer were empty.
This was inadequate because the existing schema changer
always populated these fields. To address this, this
patch will now populate the statement tag field inside
the element metadata and into event log entries.
Release note: None
rfc: token-based authentication for SQL session revival
Release note: None
vendor: upgrade cockroachdb/apd to v3
This commit picks up the following changes to `cockroachdb/apd`:
- https://github.com/cockroachdb/apd/pull/103
- https://github.com/cockroachdb/apd/pull/104
- https://github.com/cockroachdb/apd/pull/107
- https://github.com/cockroachdb/apd/pull/108
- https://github.com/cockroachdb/apd/pull/109
- https://github.com/cockroachdb/apd/pull/110
- https://github.com/cockroachdb/apd/pull/111
Release note (performance improvement): The memory representation of
DECIMAL datums has been optimized to save space, avoid heap allocations,
and eliminate indirection. This increases the speed of DECIMAL arithmetic
and aggregation by up to 20% on large data sets.
sql/sem: remove EvalContext.getTmpDec
`apd.Decimal` can now be entirely stack allocated during arithmetic, so
there's no longer any need for this.
With https://github.com/cockroachdb/apd/pull/104, this does not introduce
any new heap allocations:
```
➜ (cd pkg/sql/sem/tree && goescape . | grep moved | wc -l)
328
```
scripts: add sgrep
@aliher1911 showed me this handy awk trick to filter goroutines [here].
I find myself reaching for this frequently, but it always takes me a
minute or two to get it right. This helper will make it a lot easier
and will perhaps enable many others to reap the benefits as well.
[here]: https://github.com/cockroachdb/cockroach/pull/66761#pullrequestreview-691529838
Release note: None
sqlproxyccl: better test output
This moves the logging output to files and enables inspection of the
authentication results in logs crdb-side.
Release note: None
storage/metamorphic: update clearRangeOp to not conflict with in-flight writes
This change updates clearRangeOp to only run on key spans
that do not have any transactional writes in-progress. This more
closely matches behaviour in KV and avoids cases where we'd trample on
intents and other in-flight keys. Also makes a similar adjustment
to mvccClearTimeRangeOp.
Fixes #72476. Speculatively. Reproduction was nearly impossible.
Release note: None.
Revert "roachpb: change `Lease.String()`/`SafeFormat()` to support refs."
This reverts commit 33fe9d15b2660cc40d255a3d448d88bb35c2c39d.
storage,roachpb: Add Key field to WriteTooOldError
Currently, WriteTooOldErrors that happen as part of a
ranged put operation (eg. MVCCClearTimeRange) can be very
opaque, as we don't know what key we tried to "write too old"
on. This change addresses that by adding a Key field to WriteTooOldError
to store one of the keys the error pertains to.
Informs #72476.
Release note: None.
dev: enumerate explicit list of all generated `.go` files when hoisting
We should be able to `find _bazel/bin/go_path -type f -name '*.go'` and
list the generated files that way, but bazelbuild/rules_go#3041 is in
the way and I can't figure out how to resolve it. Instead we have to do
things the hard way: run `bazel aquery` and parse out the paths to all
the generated files. In this way we avoid accidentally hoisting out
stale symlinks from a previous build. If bazelbuild/rules_go#3041 is
resolved upstream then this change can be reverted.
Release note: None
sql: public schema long running migration
Release note: None
sql: insert missing public schema namespace entry
When restoring a database, a namespace entry for the public
schema was not created.
Release note: None
authors: add msirek to authors
Release note: None
sql: Add hints to CREATE/ALTER table errors for MR
Add some hints when trying to create (or alter to) a multi-region table which
indicate that if the database is not multi-region enabled, it should be
converted to a multi-region enabled database via a "ALTER DATABASE ... SET
PRIMARY REGION <region>" statement.
Release note: None
catalog: add Index*Columns methods to table descriptor
These methods return the columns referenced in indexes as
[]catalog.Column slices.
Release note: None
catalog: add Index*ColumnDirections methods to table descriptor
This complements the previous commit which added Index*Column methods to
catalog.TableDescriptor.
Release note: None
catalog: remove FullIndexColumnIDs
Calls to this function can now be replaced with recently-added
IndexFullColumns and IndexFullColumnDirections method calls on the table
descriptor.
Release note: None
tabledesc: improve memory-efficiency of Index*Column* methods
This commit removes IndexCompositeColumns (which wasn't actually used
and is unlikely to ever be) and generates the indexColumnCache with less
memory allocations. The current scheme is causing
a statistically-significant performance regression.
Release note: None
sql: allow the 1PC optimization in some cases in the extended protocol
A previous commit (7e2cbf51869fc326974a5665db80da8b29422631) fixed our
pgwire implementation so that it does not auto-commit a statement
executed in the extended protocol until a Sync message is received. That
change also had the undesired effect of disabling the 1PC ("insert fast
path") optimization that is critical for write-heavy workloads.
With this current commit, the 1PC optimization is allowed again, as long
as the statement execution is immediately followed by a Sync message.
This still has the correct bugfix semantics, but allows the optimization
for the common case of how the extended protocol is used.
No release note since this only affects unreleased versions.
Release note: None
execgen: remove temporary decimals from the helper
This commit removes two temporary decimals from `execgen.OverloadHelper`
since after the upgrade of the decimal library it can use
stack-allocated temporary objects without them escaping to the heap.
Release note: None
colexec: batch allocate datums for projectTupleOp
This has a profound impact on the amount of garbage generated by the delivery
query in TPC-C.
```
name old time/op new time/op delta
TPCC/mix=delivery=1-16 38.0ms ± 2% 35.8ms ± 1% -5.76% (p=0.000 n=9+8)
name old alloc/op new alloc/op delta
TPCC/mix=delivery=1-16 8.17MB ± 1% 7.97MB ± 1% -2.36% (p=0.000 n=9+10)
name old allocs/op new allocs/op delta
TPCC/mix=delivery=1-16 80.2k ± 0% 20.3k ± 1% -74.65% (p=0.000 n=10+9)
```
Leveraged https://github.com/cockroachdb/cockroach/pull/74443 to find this.
Release note: None
kvserver/loqrecovery: check full key coverage in quorum recovery
Previously when doing unsafe replica recovery, if some ranges are
missing or represented by stale replicas that were split or merged,
recovery will change cluster to inconsistent state with gaps or
overlaps in keyspace.
This change adds checks for range completeness as well as adds a
preference for replicas with higher range applied index.
Release note: None
sql: fix enum hydration in distsql expression evaluation
Fixes: #74442
Previously in some circumstances we could fail to hydrate enum types
used in join predicate expressions and possibly other situations. Now
types used in ExprHelper are always hydrated during Init phase when a
distsql type resolver is being used. Also add a test case for the lookup
semi join repro case.
Release note (bug fix): Fix panic's possible in some distributed queries
using enum's in join predicates.
backupccl: add libgeos to BUILD.bazel
This is required when using randgen so that libgeos can be initialized
when a geometric type is generated.
Fixes #73895
Release note: None
sql: refactor pg_builtin to use actual grant options
Refactor builtins.priv -> privilege.Privilege.
Replace privilege.Kind with privilege.Privilege in functions that need
access to privilege.Privilege.GrantOption.
Release note: None
sql: refactor pg_builtin to use actual grant options
The builtins has_table_privilege, has_column_privilege,
has_any_column_privilege now use privileges.Priv.GrantOption instead
of privileges.Kind.GRANT.
Release note: None
dev: introduce common benchmarking flags
Specifically:
--bench-mem (to print out memory allocations);
--bench-time (to control how long each benchmark is run for);
--count (how many times to run it, also added to `dev test`);
-v and --show-logs (similar to `dev test`)
We also add supports for args-after-double-dash-are-for-bazel within
`dev bench`. This commit is light on testing (read: there isn't any), so
doesn't bump DEV_VERSION to roll it out to everyone just yet.
Release note: None
execgen: skip encoding/decoding JSON when hashing it
Previously, in order to hash a JSON object we would encode and decode it
(due to the behavior of `coldata.JSONs.Get` and the setup of the
execgen). This commit refactors how the hash functions are handled in
the execgen which allows us to peek inside of `coldata.Bytes` (that is
under the `coldata.JSONs`) in order to get direct access to the
underlying `[]byte`. This should be a minor performance improvement but
also allows us to remove the need for the overload helper in the hashing
context.
It is possible (although I'm not certain) that the hash computation is
now different, so this commit bumps the DistSQL versions to be safe.
Release note: None
colexechash: fix BCEs in some cases
Previously, we were not getting (and not asserting) some of the bounds
check eliminations in `rehash` function because we were accessing
`.Sliceable` property in the wrong context. This is now fixed (by
accessing it via `.Global`) as well as assertions are added correctly.
Additionally, this commit pulls out the call to the cancel checker out
of the templated block to reduce the amount of code duplication.
Release note: None
execinfra: prefer leased table descriptors
This commit makes better use of the table descriptor protos in the
DistSQL specs by first checking if they've already been leased. If so,
then we use those instead of re-generating catalog.TableDescriptor
instances.
This has a statistically-significant impact on memory allocations, as
illustrated with this microbenchmark which I ran on my development
machine:
name old time/op new time/op delta
KV/Scan/SQL/rows=1-16 190µs ± 7% 184µs ± 4% -3.60% (p=0.016 n=10+8)
KV/Scan/SQL/rows=10-16 196µs ± 4% 198µs ±12% ~ (p=0.762 n=8+10)
name old alloc/op new alloc/op delta
KV/Scan/SQL/rows=1-16 19.5kB ± 1% 17.4kB ± 1% -11.12% (p=0.000 n=9+10)
KV/Scan/SQL/rows=10-16 21.0kB ± 1% 18.9kB ± 1% -10.20% (p=0.000 n=10+10)
name old allocs/op new allocs/op delta
KV/Scan/SQL/rows=1-16 222 ± 0% 210 ± 1% -5.59% (p=0.000 n=7+10)
KV/Scan/SQL/rows=10-16 256 ± 0% 244 ± 0% -4.84% (p=0.000 n=10+7)
This change opens us to the possibility of no longer shipping the whole
table descriptor proto in the DistSQL spec.
Release note: None
sql: add missing error check in GetHydratedZoneConfigForNamedZone
Closes #74606
Release note: None
sql: rename pg_cast_provolatile_dump.csv to pg_cast_dump.csv
Release note: None
sql: remove castInfo
`castMap` now contains all volatility information and `castInfo` is no
longer needed. `castInfo` has been removed.
For backwards compatibility, some invalid casts between VARBIT and
integer types have been added to `castMap`. These casts have unexpected
behavior before and after this commit. They are not supported by
Postgres. They should be disallowed in the future.
See #74577 and 74580
Casts between tuple and string types are now dynamically handled in
`lookupCast` support casts with named record types that have different
OIDs than `oid.T_record`.
Casts from OID and REG* types to INT2 are now allowed to maintain
backward compatibility.
Release note: None
sql: update pg_cast_dump.csv
This commit updates the `pg_cast_dump.csv` file with new rows for casts
from the JSONB type (OID 3802). It also adds the Postgres version as a
column to annotate the version of Postgres that the CSV was generated
from. This is important because the `pg_cast` table can change from
version to version. This CSV was generated from Postgres version 13.5.
Release note: None
sql: consistently order pg_cast_dump.csv
Release note: None
sql: check cast contexts against Postgres's pg_cast table
This commit includes the `pg_cast.castcontext` column in
`pg_cast_dump.csv` and uses it to validate the `maxContext` field in
`castMap`. Note that Postgres's `pg_cast` table does not include all
possible casts, such as automatic I/O conversions, so this test is not
comprehensive.
Release note: None
ci: bazelize nightly pebble ycsb, write-throughput benchmarks
We still have to take care of the metamorphic nightly.
Part of #67335.
Release note: None
authors: add bardin to authors
Release note: None
opt: fix corner case with lookup joins and provided orderings
This commit fixes a case where the lookup join was passing through a
provided ordering with unnecessary columns. This was caused by
imperfect FDs at the join level such that the ordering cannot be
simplified at the join level but it can be simplified at the level of
its input.
Note that the case causes an internal error in test builds but there
are no known examples of user-visible problems in non-test builds
(hence no release note).
Fixes #73968.
Release note: None
rpc: fix span use-after-Finish in internalClientAdapter
The internalClientAdapter performs some local "RPCs" by directly calling
the server method, without going through gRPC. The streaming RPC calls
are made on a different goroutine, and these goroutines were using the
callers tracing span. These goroutines could outlive the caller's span,
resulting in a use-after-Finish crash. This patch fixes them by creating
dedicated RPC spans, mimicking what our gRPC interceptors do.
Fixes #74326
Release note: None
vendor: Bump pebble to 3d0ff924d13a3d5fdf6e56a391c5c178c18ff196
Changes pulled in:
```
3d0ff924d13a3d5fdf6e56a391c5c178c18ff196 *: Add trySeekUsingNext optimization to SeekGE
0c503048eb0365981929177c30178add8a56ae3e sstable: add (*sstable.Writer).RangeKey{Set,Unset,Delete} methods
fe52b49cc28df62dce9b00c382a5ce217936be56 tool/logs: aggregate compaction logs by node and store ID
8ab4358bc59dfa62e5e34e4b0e5ce81a68f5fe91 sstable: return err from `(BlockPropertyCollector).FinishTable`
91c18ef0ee999980c2869d11e5ce468410acbe8d internal/keyspan: add FragmentIterator interface
953fdb078ff0585489206ae96e1d80ca9f6f90c7 internal/keyspan: implement SetBounds on Iter
aa376a819bf67cd6766ee827feed4bf0bd508f1f tool: add compaction log event aggregation tool
```
Release note: None.
storage: Use optimized SeekGE in CheckSSTConflicts
This nearly reverts #73514 by moving back to calling SeekGE on
the engine to skip past any empty spans on either the engine
or the SSTable. This is the more optimal approach on average,
and given optimizations in cockroachdb/pebble#1412 which this
change depends on, it also ends up performing better than
a SeekPrefixGE-driven appraoch and the pre-#73514 approach.
Improvement when running BenchmarkCheckSSTConflicts
against the pre-#73514 revision (vs. this one):
```
name old time/op new time/op delta
CheckSSTConflicts/keys=1000/versions=8/sstKeys=1000/overlap=false-24 72.6µs ±11% 66.3µs ± 1% -8.67% (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=8/sstKeys=1000/overlap=true-24 12.2ms ± 1% 1.7ms ± 1% -86.41% (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=8/sstKeys=10000/overlap=false-24 69.8µs ± 2% 67.4µs ± 1% -3.48% (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=8/sstKeys=10000/overlap=true-24 13.3ms ± 3% 2.8ms ± 1% -78.97% (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=64/sstKeys=1000/overlap=false-24 75.8µs ± 3% 63.8µs ± 1% -15.86% (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=64/sstKeys=1000/overlap=true-24 13.0ms ± 1% 1.9ms ± 1% -85.11% (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=64/sstKeys=10000/overlap=false-24 69.8µs ±11% 64.6µs ± 1% -7.45% (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/versions=64/sstKeys=10000/overlap=true-24 14.8ms ± 9% 3.1ms ± 2% -79.05% (p=0.008 n=5+5)
CheckSSTConflicts/keys=10000/versions=8/sstKeys=1000/overlap=false-24 66.1µs ± 2% 63.7µs ± 1% -3.65% (p=0.008 n=5+5)
CheckSSTConflicts/keys=10000/versions=8/sstKeys=1000/overlap=true-24 14.2ms ± 9% 1.9ms ± 1% -86.55% (p=0.008 n=5+5)
CheckSSTConflicts/keys=10000/versions=8/sstKeys=10000/overlap=false-24 72.3µs ±10% 64.5µs ± 0% -10.77% (p=0.008 n=5+5)
CheckSSTConflicts/keys=10000/versions=8/sstKeys=10000/overlap=true-24 122ms ± 2% 17ms ± 1% -86.03% (p=0.008 n=5+5)
CheckSSTConflicts/keys=10000/versions=64/sstKeys=1000/overlap=false-24 69.0µs ± 9% 62.4µs ± 1% -9.57% (p=0.032 n=5+5)
CheckSSTConflicts/keys=10000/versions=64/sstKeys=1000/overlap=true-24 14.0ms ± 1% 2.3ms ± 2% -83.46% (p=0.016 n=4+5)
CheckSSTConflicts/keys=10000/versions=64/sstKeys=10000/overlap=false-24 69.4µs ± 9% 62.7µs ± 1% -9.63% (p=0.016 n=5+5)
CheckSSTConflicts/keys=10000/versions=64/sstKeys=10000/overlap=true-24 140ms ± 5% 26ms ± 1% -81.70% (p=0.008 n=5+5)
CheckSSTConflicts/keys=100000/versions=8/sstKeys=1000/overlap=false-24 69.2µs ±10% 62.5µs ± 1% -9.66% (p=0.008 n=5+5)
CheckSSTConflicts/keys=100000/versions=8/sstKeys=1000/overlap=true-24 15.3ms ±11% 2.3ms ± 1% -85.21% (p=0.008 n=5+5)
CheckSSTConflicts/keys=100000/versions=8/sstKeys=10000/overlap=false-24 69.7µs ±12% 63.6µs ± 1% ~ (p=0.095 n=5+5)
CheckSSTConflicts/keys=100000/versions=8/sstKeys=10000/overlap=true-24 148ms ± 6% 28ms ± 2% -80.90% (p=0.008 n=5+5)
CheckSSTConflicts/keys=100000/versions=64/sstKeys=1000/overlap=false-24 67.1µs ±10% 61.1µs ± 2% -8.93% (p=0.016 n=5+5)
CheckSSTConflicts/keys=100000/versions=64/sstKeys=1000/overlap=true-24 14.4ms ± 2% 2.5ms ± 5% -82.45% (p=0.016 n=4+5)
CheckSSTConflicts/keys=100000/versions=64/sstKeys=10000/overlap=false-24 68.9µs ±21% 62.2µs ± 1% -9.76% (p=0.008 n=5+5)
CheckSSTConflicts/keys=100000/versions=64/sstKeys=10000/overlap=true-24 204ms ±14% 42ms ± 5% -79.44% (p=0.008 n=5+5)
```
Fixes #66410.
Release note: None.
execinfrapb: add a helper for index joins based on the JoinReaderSpec
Release note: None
rowexec: refactor the joinReader to not exceed the batch size
The joinReader operates by buffering the input rows until a certain size
limit (which is dependent on the strategy). Previously, the buffering
would stop right after the size limit is reached or exceeded, and this
commit refactors the code to not exceed the limit except in a case of
a single large row. This is what we already do for vectorized index
joins.
Release note: None
sql,kv: introduce Streamer API and use it for index joins in some cases
This commit introduces the Streamer API (see
https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20210617_index_lookups_memory_limits.md)
as well as its implementation for the simplest case - when requests are
unique and can be served in any order. It additionally hooks up the
implementation to be used by the index joins in both execution engines.
There are three main pieces that this commit adds:
1. the `Streamer` struct itself. It is largely the same as described in
the RFC. Some notable changes are:
- `Cancel` is renamed to `Close` and is made blocking to ensure that all
goroutines kicked off by the `Streamer` exit before `Close` returns.
- `Shrink` has been removed from the `budget` struct (see below for
more details).
- furthermore, the `budget` interface has been unexported and the
`streamer` has been tightly coupled with the `budget`'s implementation.
- the TODO about collecting DistSQL metadata is removed because we are
collecting the LeafTxnFinalState already when using the LeafTxn.
2. the limited `Streamer` implementation - only `OutOfOrder` mode is
supported when the requests are unique. Notably, buffering and caching
of the results is not implemented yet.
3. `TxnKVStreamer` component that sits below the SQL fetchers, uses the
`Streamer`, and is an adapter from `BatchResponse`s to key/value pairs
that fetchers understand. Although not used at the moment,
`TxnKVStreamer` is written under the assumption that a single result can
satisfy multiple requests.
The memory budget of the `Streamer` is utilized lazily. The RFC was
calling for making a large reservation upfront and then shrinking the
budget if we see that we don't need that large reservation; however,
while working on this I realized that lazy reservations are a better fit
for this. The `Streamer` can reserve up to the specified limit
(determined by `distsql_workmem` variable) against the root monitor (in
the degenerate case of a single large row more memory will be reserved).
The reservation then never shrinks under the assumption that if the
reservation has gotten large, it means it was needed for higher
concurrency (or large responses), and it is likely to be needed for the
same reasons in the future.
The layout of the main components of the `Streamer` implementation:
- in `Enqueue` we have a logic similar to what DistSender does in order
to split each request (that might span multiple ranges) into
single-range requests. Those sub-requests are batched together to be
evaluated by a single `BatchRequest`.
- `workerCoordinator.mainLoop` is responsible for taking single-range
batches, estimating the corresponding response size, and issuing
requests to be evaluated in parallel while adhering to the provided
memory budget.
- `workerCoordinator.performRequestAsync` is responsible for populating
the `BatchRequest` and then processing the results while updating the
memory budget.
Current known limitations that will be addressed in the follow-up work:
- at the moment a single row can be split across multiple BatchResponses
when TargetBytes limit is reached when the table has multiple column
families; therefore, we use the streamer only for single column family
cases. We will expand the KV API shortly to not split the rows in
multiple column family cases.
- manual refresh of spans when `ReadWithinUncertaintyIntervalError` is
encountered by a single streamer in a single flow is not implemented. It
is an optimization that is considered a must for the final
implementation in order to not regress on simple cases in terms of
retriable errors. This will be implemented shortly as a follow-up.
- I'm thinking that eventually we probably want to disable the batch
splitting done by the DistSender to eliminate unnecessary blocking when
the streamer's splitting was incorrect. This would give us some
performance improvements in face of range boundary changes, but it
doesn't seem important enough for the initial implementation.
Release note: None
sql: check equivalent constraint when creating hash index
Fixes #68031
Previously we only try to create constraint for shard column if
it's newly created. We check duplicate constraint for shard
column when `Alter Primary Key` and `Create Index`, however the
check is simply a name check. This pr adds logic to check
equivalent constraint by checking if the formatted expression
string is the same. With this logic we can try to create the
constraint no matter if a shard column is newly created or not.
With this fix, we also don't need to expose the constraint through
`SHOW CREATE TABLE` result since we make sure the constraint is
created or skipped if one already exists.
Release note (sql change): Before this change, the check constraint
on shard column used by hash sharded index was printed in the
corresponding `SHOW CREATE TABLE`. The constraint had been shown
because cockroach lacked logic to ensure that shard columns which
are part of hash sharded indexs always have the check constraint
which the optimizer relies on to achieve properly optimized plans
on hash sharded indexes. We no longer display this constraint in
`SHOW CREATE TABLE` as it is now implied by the `USING HASH` clause
on the relevant index.
colexec: clean up the usage of the binary overload helper
Previously, for projection and selection operators we would always pass
in `execgen.OverloadHelper`. Up until recently it served several
purposes, but now it is only used in order to pass down the binary
function (as well as the eval context) for the cases when we fallback
to the row-by-row computation.
This commit takes advantage of this observation and cleans up the
situation: now the helper is only passed when it is needed which allows
us to remove a lot of redundant code. Additionally, the helper itself
has been renamed from `OverloadHelper` to `BinaryOverloadHelper`.
Release note: None
build: Add PATH to .bazelrc for dev builds.
Release note: none
opt: intern provided physical properties
This commit changes the provided physical properties to be referenced
with a pointer inside bestProps, in preparation for adding more fields
to physical.Provided.
Release note: None
opt: add Distribution physical property and Distribute enforcer
This commit adds a new physical property called "Distribution" to
expressions in the optimizer. Currently, Distribution is just the
list of regions that are touched by the expression.
This commit also adds a new enforcer called "Distribute", which enforces
a particular Distribution on its input. This is similar to the way
Sort enforces a particular Ordering. Currently, Distribute is only
used to enforce that the results of a query end up in the same region
as the gateway node.
The Distribution property and Distribute enforcer will enable us to
perform better costing of query plans in future commits. This commit
adds a tiny cost for the Distribute enforcer, but otherwise does not
yet take advantage of distribution for costing. In the future, we can use
Distribution to better cost distributed joins and aggregations, and
accurately cost the Distribute enforcer so as to avoid scanning data
in remote regions if possible.
This commit represents a departure from the approach taken in the earlier
prototype in #43831. Instead of using a "neighborhood" abstraction to
represent arbitrary collections of nodes, this commit sticks to the
existing abstractions of region and locality. If we need a new abstration
in the future, we should be able to simply modify the representation of
Distribution.
Additionally, this commit does not make an attempt to represent exactly
how data is partitioned across regions (e.g., which column is the
partitioning key, whether the data is hash or range partitioned, etc.);
it only tracks *which* regions contain data. This greatly simplifies the
implementation, and I don't think it will significantly reduce the
accuracy of costing.
Informs #47226
Release note: None
cli: add mt cert create-tenant-signing command
This key/cert will be used to generate session revival tokens.
No release note since this is only intended to be used internally.
Release note: None
sql/sem/catid: make a low-level package which defines ID types
I'm not sure where exactly in the tree this belongs.
Release note: None
sql/schemachanger/scpb: adopt catid
Release note: None
sql/catalog/catpb: add package below scpb and descpb
This doesn't go all the way to being principled about what is in catpb
and what is in descpb, but it pulls the relevant symbols referenced in
scpb down into the new package.
Release note: None
kvserver: fix bug causing spurious rebalance attempts by the store rebalancer
When rebalancing a range, the store rebalancer also tries to move the lease to
the voter with the lowest QPS. However, previously, the store rebalancer was
doing this even when it found no better replacement targets for the existing
stores. This meant that we were redundantly transferring leases over to the
coldest voter, even when we weren't rebalancing the range. This was likely
contributing to the thrashing observed in
https://github.com/cockroachdb/cockroach/issues/69817.
Release note (bug fix): A bug that could previously cause redundant lease
transfers has now been fixed.
schemachanger: delete comments when dropping schemas
Previously, the declarative schema changer did not clean
comments for schema objects when dropping them. This was
inadequate because the system.comment table would have
rows for dropped schema objects left over when using
the declarative schema changer versus the legacy schema
changer. To address this, this patch will remove rows
from the system.comment table for schema objects that
are dropped.
Release note: None
sql: adopt CommentUpdater from declarative schema changer.
Previously, there was duplicate code for adding/deleting comments
on schema objects inside the legacy schema changer. So, each
comment type would have similar code for setting up an
upsert/delete statements. This patch adopts the CommentUpdater
interface in the legacy schema changer, so that logic to
update/delete comments can be shared with the declarative
schema changer, and the simplify code in the legacy schema
changer.
Release note: None
kvserver: don't use `ClearRange` point deletes with estimated MVCC stats
`ClearRange` avoids dropping a Pebble range tombstone if the amount of
data that's deleted is small (<=512 KB), instead dropping point
deletions. It uses MVCC statistics to determine this. However, when
clearing an entire range, it will rely on the existing range MVCC stats
rather than computing them.
These range statistics can be highly inaccurate -- in some cases so
inaccurate that they even become negative. This in turn can cause
`ClearRange` to submit a huge write batch, which gets rejected by Raft
with `command too large`.
This patch avoids dropping point deletes if the statistics are estimated
(which is only the case when clearing an entire range). Alternatively,
it could do a full stats recomputation in this case, but entire range
deletions seem likely to be large and/or rare enough that dropping a
range tombstone is fine.
Release note (bug fix): Fixed a bug where deleting data via schema
changes (e.g. when dropping an index or table) could fail with a
"command too large" error.
backup: add memory monitor to manifest loading
Release note: none.
sql: add assignment casts for UPSERTs
Assignment casts are now added to query plans for upserts, including
`UPSERT`, `INSERT .. ON CONFLICT DO NOTHING`, and
`INSERT .. ON CONFLICT DO UPDATE ..` statements.
Assignment casts are a more general form of the logic for rounding
decimal values, so the use of `round_decimal_values` in mutations is no
longer needed. This logic has been removed.
Fixes #67083
There is no release note because the behavior of upserts should not
change with this commit.
Release note: None
sql: add logic tests for assignment casts of ON UPDATE expressions
Release note: None
sql: schema changer not to validate shard column constraint
Fixes #67613
The shard column constraint is created internally and should
be automatically upheld. So not need to verify it when backfilling
hash sharded indexes.
Release not: None
schemachanger: fix error messages for drop dependencies.
1) When a view depends on another view the message text generated was
incorrect.
2) When a sequence is OWNED BY a table, it could be dropped even
if there were other dependencies.
3) When RESTRICT was not specified DROP DATABASE would generate the wrong
error.
4) If the database name is empty we would generate the wrong error.
Release note: None
sql: modify message for drop schema privilege errors
Previously, the legacy schema changer used a less clear message
for when a DROP SCHEMA failed due to a privilege error. This
patch will improve the message generated to be more clear
by using message from the declarative schema changer.
The new message will have the form "must be owner
of schema <schema name>".
Release note: None
schemachanger: avoid leased descriptors for event logging
Previously, when fully resolving descriptor names, we would
fetch leased descriptors which could cause the schema change
transaction to get pushed out. This was inadequate because in
certain scenarios we would perpetually retry transactions,
since attempts to generate event log entries acquire leases
pushing out the schema change transaction. To address this,
this patch intentionally avoids leasing descriptors for event logging.
Release note: None
build: put regular file at bazel-out by default
Release note: none.
bulk,backup,import: add Write-at-now settings for RESTORE and IMPORT
This adds a new arguemnt to BulkAdderOptions and MakeSSTBatcher control
the WriteAtBatchTS field in the AddSSTableRequests it sends.
This flag is then optionally set in RESTORE and IMPORT based on the the
new hidden cluster settigs 'bulkio.restore_at_current_time.enabled' and
'bulkio.import_at_current_time.enabled' respectively.
These settings default to off for now, for testing usage only as setting
this flag is known to _significantly_ reduce the performance of those
jobs until further work is done.
Release note: none.
sql/row: remove an allocation for Get responses
```
name old time/op new time/op delta
IndexJoin/Cockroach-24 8.02ms ± 9% 7.92ms ± 2% ~ (p=1.000 n=9+10)
name old alloc/op new alloc/op delta
IndexJoin/Cockroach-24 1.68MB ± 1% 1.62MB ± 1% -3.83% (p=0.000 n=10+9)
name old allocs/op new allocs/op delta
IndexJoin/Cockroach-24 11.1k ± 1% 10.1k ± 0% -8.85% (p=0.000 n=9+9)
```
Release note: None
kvstreamer: refactor memory tokens to reduce allocations
```
name old time/op new time/op delta
IndexJoin/Cockroach-24 7.72ms ± 7% 7.51ms ± 3% -2.75% (p=0.001 n=10+9)
name old alloc/op new alloc/op delta
IndexJoin/Cockroach-24 1.61MB ± 1% 1.60MB ± 1% -0.91% (p=0.001 n=10+9)
name old allocs/op new allocs/op delta
IndexJoin/Cockroach-24 10.1k ± 1% 9.1k ± 1% -10.24% (p=0.000 n=10+10)
```
Release note: None
spanconfig: drop 'experimental_' suffix from settings
We're going to start to use this infrastructure for realsies soon.
Release note: None
spanconfig/testcluster: plumb in testing knobs for tenants
We were previously using an empty one with no control at the caller to
override specific values. This comes in handy when looking to control
knobs for all tenants in these tests.
Release note: None
migrationsccl: disable reconciliation job creation in tests
In a future commit we'll enable the span configs infrastructure by
default for all crdb unit tests. Doing so will surface the need to have
the reconciliation job disabled for the specific migration tests that
assert on the contents of `system.span_configurations` (also written to
by the reconciliation job/reconciler).
Release note: None
spanconfig/sqltranslator: deflake datadriven test
This is a follow up to #73531, there we forgot to update package tests
to also use consistent reads when looking up descriptors. Looking at the
surrounding commentary in these datadriven files and comparing against
the actual results, there was a mismatch (now no longer so).
Release note: None
sql: future-proof TestTruncatePreservesSplitPoints
This test previously relied on range split decisions to happen
near-instantaneously (especially for the single node variant). This was
a fair assumption with the gossiped SystemConfigSpan, but is no longer
true with the span configs infrastructure where
(i) updated descriptors/zone configs make its way to
`system.span_configurations` asynchronously, and
(ii) KV learns about learns about `system.span_configurations` updates
asynchronously.
We update the test to be agnostic to either subsystem (tl;dr - throw in
a SucceedsSoon block at the right places).
Release note: None
sql: future-proof TestScatterResponse
This is Yet Another Test that made timing assumptions on how
instantaneously range split decisions appear, assumptions that no longer
hold under the span configs infrastructure. Adding compatibility is a
matter of waiting for splits to appear instead of just expecting it.
Release note: None
kvserver: future-proof TestElectionAfterRestart
This is Yet Another Test that made timing assumptions on how
instantaneously range split decisions appear, assumptions that don't
hold under the span configs infrastructure. Adding compatibility is a
matter of waiting for splits to appear instead of only expecting it.
Release note: None
multiregionccl: future-proof TestEnsureLocalReadsOnGlobalTables
This is Yet Another Test that made timing assumptions on how
instantaneously range split decisions appear, assumptions that don't
hold under the span configs infrastructure. Adding compatibility is a
matter of waiting for splits to appear instead of only expecting it.
Release note: None
server: future-proof TestAdminAPIDatabaseDetails
This is Yet Another Test that made timing assumptions on how
instantaneously range split decisions appear, assumptions that don't
hold under the span configs infrastructure. Adding compatibility is a
matter of waiting for splits to appear instead of only expecting it.
Release note: None
changefeedccl: future-proof TestChangefeedProtectedTimestamps
The low gc.ttlseconds in this test that applies to
system.{descriptor,zones}, when run with span configs enabled (in a
future commit), runs into errors introduced in #73086. The span configs
infrastructure makes use of rangefeeds against these tables within the
spanconfig.SQLWatcher process. These rangefeeds error out if the
timestamp they're started with is already GC-ed, something that's very
likely with low GC TTLs. To accommodate, we simply bump the TTL to a
more reasonable 100s.
Release note: None
kvfollowerreadsccl: future-proof TestBoundedStalenessDataDriven
This is Yet Another Test that made timing assumptions on how
instantaneously range config decisions are applied, assumptions that
don't hold under the span configs infrastructure. Adding compatibility
is a matter of waiting for the right configs to appear instead of only
expecting it.
Release note: None
changefeedccl: future-proof TestChangefeedBackfillCheckpoint
This testing knob was added in #68374 but I'm not sure that it was
necessary? Brief stress runs with an without this flag did not surface
anything. In a future commit where we enable span configs by default,
we'll actually rely on the reconciliation job running, so we get rid of
this flag now.
Release note: None
liveness: future-proof TestNodeLivenessStatusMap
Instead of using hooks that directly mutate the system config span,
using SQL statements to tweak zone configs future proofs this test for
compatibility with the span configs infrastructure.
Release note: None
sql: migrate has_database_privilege from evalPrivilegeCheck to
ctx.Planner.HasPrivilege
refs https://github.com/cockroachdb/cockroach/issues/66173
HasPrivilege is required to support WITH GRANT OPTION
Release note: None
roachtest: fix sequelize nightly
Upstream changed how imports are done, so this library had to be
updated.
Release note: None
kvstreamer: fix potential deadlock
We have two different locks in the `Streamer` infrastructure: one is for
the `Streamer` itself and another one for the `budget`. Previously,
there was no contract about which mutex needs to be acquired first which
led to the deadlock detector thinking that there is a potential deadlock
situation. This is now fixed by requiring that the `budget`'s mutex is
acquired first and by releasing the `Streamer`'s mutex in `Enqueue`
early in order to not overlap with the interaction with the `budget`.
I believe that it was a false positive (i.e. the deadlock cannot
actually occur) because without the support of pipelining, `Enqueue`
calls and asynchronous requests evaluation never overlap in time.
Still, it's good to fix the order of mutex acquisitions.
Release note: None
sql: remove PHYSICAL scrub code
The PHYSICAL scrub code is experimental and not considered production
ready. It complicates a lot of code paths involved in normal query
execution (it significantly overloads the semantics of TableReader and
of the Fetcher) and is getting in the way of some improvements in how
the fetchers work. In particular, we are trying to reduce the amount
of information passed to TableReader/Fetcher (which in the
non-scrubbing case should be a lot less than the full table
descriptor).
There are some proposals for a better design floating around, e.g.
provide a facility for returning KVs as results from DistSQL and have
some higher-level code run the scrub checks.
This change removes the code for the PHYSICAL scrub for now.
Release note (sql change): the experimental SCRUB PHYSICAL is no
longer implemented.
vars: add placeholder session variable for xmloption
Release note (sql change): The `xmloption` session variable is now
accepted, only taking in `content`. Note this does not do anything.
clusterversion: improve a version comment
Gets rid of a squiggly line in Goland.
Release note: None
kvserver: plumb in a context into (*Store).GetConfReader
We'll use it in a future commit.
Release note: None
spanconfig/reconciler: export the checkpoint timestamp
We'll make use of it in a future commit.
Release note: None
spanconfig: get rid of ReconciliationDependencies interfa…
This commit adds
gcassert:noescapedirectives tobigint.goto verifythat the temporary big.Int values that are intended to be stack-allocated
do not escape to the heap.
The commit then adds
gcassertto CI.