[opt][optimizer] optimize union all for colocated table.#61184
wuxueyang96 wants to merge 3 commits
@morrySnow @924060929 Hi, could you help review this pull request?
Hi @wuxueyang96! Thanks for submitting this PR. I have already submitted the same feature in #59006.
I'm not sure that local bucket shuffle is the same as this PR. This PR aims to eliminate the shuffle altogether, whether it is a local bucket shuffle or a global shuffle.
My PR includes eliminating the exchange under a set operation, because supporting bucket shuffle itself requires the other side to be distributed according to the storage hash algorithm: the base side does not need a shuffle, and if the other side does not meet that requirement, it has to use bucket shuffle. If both sides are colocated, neither side needs to shuffle, because both already satisfy the storage hash distribution. So my PR is a superset of yours, and more abstract.
Actually, I rebuilt the code from bf2e1c2, and I think I understand your point. But if you look at fragment 5, it still contains 2 exchanges below the union. I'm just wondering what final effect you want to achieve.
There do seem to be some scenarios that still need to be optimized, but the main idea of the optimization is still bucket shuffle: letting the Cascades framework automatically identify lower layers that satisfy the bucket distribution and skip the exchange.
What problem does this PR solve?
Currently, executing a SQL statement like:
The final plan will look like:
All of the tables mentioned above are in the same colocate group, and their distribution key is the same as the group key. The two exchanges in plan fragment 3 are therefore unnecessary. Because the distribution keys and the aggregation keys are identical, all rows of the two colocated tables that share an aggregation key live in the corresponding tablet, so the aggregation can be performed directly on a single machine after the union of the two tables and still produce the correct result.
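The argument above can be checked with a small model. The sketch below is not Doris code: the class name, the `bucketOf` function and the modeling of rows as bare integer keys are all assumptions made for illustration. It splits two tables into buckets with the same hash function, runs UNION ALL plus a `COUNT(*) ... GROUP BY key` inside each bucket without moving any row, and compares the result with aggregating everything in one place.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative model only: none of these names come from Doris.
class ColocatedUnionAggSketch {

    // Hypothetical bucketing function shared by every table in the colocate group.
    static int bucketOf(int key, int numBuckets) {
        return Math.floorMod(Integer.hashCode(key), numBuckets);
    }

    // Split a table (rows modeled as bare key values) into buckets, mimicking tablets.
    static List<List<Integer>> bucketize(int[] keys, int numBuckets) {
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++) {
            buckets.add(new ArrayList<>());
        }
        for (int k : keys) {
            buckets.get(bucketOf(k, numBuckets)).add(k);
        }
        return buckets;
    }

    // Equivalent of SELECT key, COUNT(*) ... GROUP BY key over one list of rows.
    static Map<Integer, Long> countByKey(List<Integer> rows) {
        Map<Integer, Long> counts = new HashMap<>();
        for (int k : rows) {
            counts.merge(k, 1L, Long::sum);
        }
        return counts;
    }

    // Per-bucket plan: UNION ALL matching buckets, aggregate locally, concatenate results.
    static Map<Integer, Long> localPlan(int[] t1, int[] t2, int numBuckets) {
        List<List<Integer>> b1 = bucketize(t1, numBuckets);
        List<List<Integer>> b2 = bucketize(t2, numBuckets);
        Map<Integer, Long> result = new TreeMap<>();
        for (int b = 0; b < numBuckets; b++) {
            List<Integer> unioned = new ArrayList<>(b1.get(b));
            unioned.addAll(b2.get(b));
            for (Map.Entry<Integer, Long> e : countByKey(unioned).entrySet()) {
                // A key maps to exactly one bucket, so it must never show up twice here.
                if (result.put(e.getKey(), e.getValue()) != null) {
                    throw new IllegalStateException("key seen in two buckets: " + e.getKey());
                }
            }
        }
        return result;
    }

    // Reference plan: gather all rows in one place, then aggregate.
    static Map<Integer, Long> globalPlan(int[] t1, int[] t2) {
        List<Integer> all = new ArrayList<>();
        for (int k : t1) all.add(k);
        for (int k : t2) all.add(k);
        return new TreeMap<>(countByKey(all));
    }

    public static void main(String[] args) {
        int[] t1 = {1, 2, 2, 7, 9};
        int[] t2 = {2, 3, 7, 7, 10};
        // Both plans should agree for any bucket count.
        System.out.println(localPlan(t1, t2, 3).equals(globalPlan(t1, t2)));
    }
}
```

The equivalence relies on both tables using the same bucketing function on the aggregation key; if one table were bucketed differently, rows with the same key could land in different buckets and the per-bucket results would no longer be final.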
The current implementation adds a `PhysicalDistribute` operator for operators that require a distribution spec of `DistributionSpecAny`, whose child nodes have a distribution spec of `DistributionSpecHash` and a shuffle type of `NATURAL`. The added operator has a distribution type of `DistributionSpecAny`, so the properties of `DistributionSpecHash` cannot be propagated up to the `SetOperation` (UNION/EXCEPT/INTERSECT) operator.
This PR revises the current logic: for such scenarios, the `PhysicalDistribute` operator will not be added if and only if all child nodes of the `SetOperation` belong to the same colocate group, have a distribution spec of `DistributionSpecHash`, and use a shuffle type of `NATURAL`.
For example, for SQL like:
It will produce a plan like:
But for SQL like:
It will use a plan like the one below:
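Separately from the plans above, the condition this PR describes for skipping `PhysicalDistribute` can be written as a small predicate. This is only a sketch of the stated rule, not the PR's actual code: `ChildDistribution`, `SpecKind`, the two-value `ShuffleType` enum and `canSkipDistribute` are hypothetical stand-ins, and the real Doris classes carry far more state.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative model only: these types are simplified stand-ins, not Doris classes.
class SetOperationDistributeSketch {

    enum SpecKind { ANY, HASH }

    // OTHER is a placeholder for every non-NATURAL shuffle type.
    enum ShuffleType { NATURAL, OTHER }

    // Hypothetical summary of the distribution one child of the set operation produces.
    static final class ChildDistribution {
        final SpecKind kind;
        final ShuffleType shuffleType;
        final Long colocateGroupId; // null when the table is not in a colocate group

        ChildDistribution(SpecKind kind, ShuffleType shuffleType, Long colocateGroupId) {
            this.kind = kind;
            this.shuffleType = shuffleType;
            this.colocateGroupId = colocateGroupId;
        }
    }

    // True only when every child is a NATURAL hash distribution in one shared colocate group.
    static boolean canSkipDistribute(List<ChildDistribution> children) {
        if (children.isEmpty()) {
            return false;
        }
        Long groupId = children.get(0).colocateGroupId;
        if (groupId == null) {
            return false;
        }
        for (ChildDistribution child : children) {
            boolean naturalHash = child.kind == SpecKind.HASH
                    && child.shuffleType == ShuffleType.NATURAL;
            if (!naturalHash || !groupId.equals(child.colocateGroupId)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ChildDistribution left = new ChildDistribution(SpecKind.HASH, ShuffleType.NATURAL, 1L);
        ChildDistribution right = new ChildDistribution(SpecKind.HASH, ShuffleType.NATURAL, 1L);
        ChildDistribution other = new ChildDistribution(SpecKind.HASH, ShuffleType.NATURAL, 2L);
        System.out.println(canSkipDistribute(Arrays.asList(left, right)));
        System.out.println(canSkipDistribute(Arrays.asList(left, other)));
    }
}
```

A single child that fails any of the three conditions makes the predicate return false, which matches the "if and only if all child nodes" wording: the exchange is kept unless the whole set operation can run bucket by bucket.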
Release note
None