Lately I have been working with Direct query in order to produce reports from Synapse, which will allow me to create dashboards without the need to cache potentially billions of rows into the dataset.
When Working with Direct Query there is a lesson I have learnt. This is it
"Dim Tables should be Dual Mode or Direct Query. Avoid Import Mode for Dim Tables ESPECIALLY calendars and other large dim tables!"
You will be surprised to see how complicated a query could look like for a simple visual that looks like this:
For starters, this chart has a value which is a COUNT on a dimension table. This sounds like a good idea as it avoids a COUNT DISTINCT on the entire fact query which is in direct query mode. This would avoid counting all the rows in the fact table right? also imports good as it will import all the data upfront making it even faster right?
Either way, you get the same number once you drag the measure into the visual
However, if you want to split this value across another dimension, then the dim table with the distinct count will need to have a 'cross filtered join. Otherwise you will get many bars with the same number as it wont be filtering back.
So lets add another imported dimension into the visual and see what happens
Below I have added Day Name to split my bar by Day of Week. Day name is situated in another dimension table.
And this is what happened!
The reason why this happens is because the cross join is being used for the dim count to filter for each day, forcing Power BI to load all the relevant data into memory at detailed level as apposed to grouping the data within Synapse. And as you see there is a default limit of 1,000,000 rows.
By ensuring that dimension with the count distinct is in Dual mode, makes Synapse aware of the full query, and can return the summarised aggregated result set as apposed to every single row.
Below is a subset of the query when both tables are referred to in synapse - either in Dual or Direct Query
As you can see Synapse is aware of the full picture, and therefore can select and group the appropriate numbers before returning to PowerBI.
However, remember I added 'Dayname' to the visual. This is derived from another imported table
As Synapse is unaware of the Calendar table, power bi implicitly tells exactly what dates is needed to select for the bar chart. Remember we only wanted 7 bars, yet power bi instructs a 'cherry picking' style of every date needed within each bar.
If the calendar is in Dual mode, then more efficient querying will be created.
The first select is a hand crafted statement made by myself in power query. I effectively used SQL to create a calendar. It would be even more efficient if this query was pre-defined inside the database itself, but at least now, Power BI is not asking for all the data on a date by date basis.
The same would apply to any other dim table. So the motto of the story is, if you choose to make use of Direct query, then try and ensure that the dim tables are in Dual Mode. This will ensure these tables are imported to get the speed, and at the same time, it will produce efficient queries when joined to the fact table which is in Direct Query Mode.
Below, the dashed lines indicate duel mode meaning that although the data is imported, the queries EXIST in synapse. The solid lines indicated that nothing is imported therefore the data needs to be queried within synapse. Dual mode allows Synapse to have the FULL picture of the query, therefore it will give you the best performing result.
댓글