How to Create a Box and Whisker Plot in Power BI Desktop
Microsoft Power BI Desktop is a comprehensive visualization tool. Sometimes while working with big data, one wants to understand the distribution of the data points. In this case, a box and whisker plot (known as a box plot) is used to get the desired outcome from the analysis. Not only this, but it also helps identify outliers. This article will highlight some basic principles of creating the box plot and, more specifically, how we can create a box plot in Power BI Desktop.
It is a fact that data visualization is one of the key prerequisites for a more optimal and successful business decision-making strategy. Data visualization enables monotonous arrays of data to be translated into more colorful and apparent graphical illustrations that offer highlighted patterns and trends, thus providing quick insights to the users.
One such data visualization technique is the box and whisker plot, otherwise known as the box plot. In this article, we will outline the steps to create box plots in Microsoft's Power BI, which is one of the most interactive and powerful data visualization platforms in the market.
A box plot diagram illustrates the distribution of a numerical variable. This is particularly useful when comparing the distributions of multiple numerical variables.
- The diagram comprises a rectangular box whose vertical edges represent the upper and lower quartiles.
- A vertical line within the box denotes the median value.
- Lines extend from this rectangle to capture the remaining range of the data.
- Data points beyond this horizontal line indicate the outliers in the dataset.
A typical boxplot diagram is illustrated below:
Components of Box and Whisker Plot
In short, a box plot summarizes data using five different statistics (five-number data summary) and includes:
- Minimum value: This represents the lower whisker of a box plot. Excluding any outliers, the minimum value reflects the 0th percentile or the lowest data point in the data set. The minimum value cannot fall below Q1 - (1.5 x IQR); otherwise, that would be counted as an outlier.
- Q1: The lower quartile reflects the 25th percentile. In other words, it is the value below which the lower 25 percent of the data is present. To calculate the Q1, it is the median (center value) of the observations in an ordered list left to the location of the median of the whole dataset.
- Q2: Median is the middle value of a dataset, and thus it will split the data in half. It can also be considered as the 50th percentile. To compute the median, we capture the middlemost value in the data set after the data is arranged in either an ascending or descending order.
- Q3: The upper quartile reflects the 75th percentile or the numerical value above which the upper 25 percent of the data is present. To calculate the 75th percentile, we calculate the median of the total observations in an ordered list to the right of the location of the overall median.
- Maximum value: It is the largest value in the data and is thus the 100th percentile. Any data point larger than the maximum value will be considered an outlier. The maximum value cannot exceed Q3 + (1.5 x IQR).
Importance of Box and Whisker Plot
Some other statistical concepts relevant to a box plot are interquartile range (IQR) and outliers. IQR is the difference between the upper quartile (Q3) and the lower quartile (Q2). This statistical metric computes the spread of the middle half of a given data distribution, which makes it useful as extreme values influence it less. In a box plot diagram, it is the horizontal length of the rectangular box. On the other hand, outliers are data points that appear extreme relative to the rest of the dataset. In a box plot, outliers are depicted as dots beyond the whiskers. It is important to scan data for outliers as it can help determine the skew in the distribution and, more significantly, determine the causality of a suspected outlier.
Now that we understand the basics of a boxplot, it is not difficult to interpret its usefulness to statisticians and analytical experts. Their importance is best reflected when we compare multiple variables of the same category. Since boxplots are compact and simplistic in illustrating data, it is much more suited for comparison-based use cases. Furthermore, they also provide a quick visual summary of statistics like the median value of data, spread, data symmetry, and signs of skewness.
Box and Whisker Diagram in Power BI
We can now finally move on to demonstrating this concept in Power BI. We will first set up our source tables to get started. For a practical demonstration, we will compare sales of two business branches.
We will first create our database in MySQL Workbench and access it through the following commands:
CREATE DATABASE carsales_by_branch; USE carsales_by_branch;
Now, let's create our tables for branch A and branch B. Input the following command:
CREATE TABLE sales_branch_A ( sales_person_ID int unsigned, customer_ID int unsigned, sales_in_thousand float, selldate date ); CREATE TABLE sales_branch_B ( sales_person_ID int unsigned, customer_ID int unsigned, sales_in_thousand float, selldate date );
To populate these tables with relevant inputs, do the following:
INSERT INTO sales_branch_A VALUES (36912151, 71115192, 44.225, "2018-07-05"), (61218243, 15192325, 45.705, "2018-07-13"), (36912151, 89025612, 54.005, "2018-07-15"), (71421283, 27161192, 48.002, "2018-07-19"), (81624324, 56272126, 51.456, "2018-07-23"), (71421283, 89237001, 53.334, "2018-07-24"), (71421283, 10002883, 47.004, "2018-07-27");
This will create the sales table for branch A:
We apply a similar treatment to create the sales dataset for branch B:
INSERT INTO sales_branch_B VALUES (24861012, 59131721, 65.450, "2018-07-02"), (48121620, 91317212, 39.340, "2018-07-06"), (51015202, 67785560, 28.001, "2018-07-07"), (91827364, 29901299, 26.895, "2018-07-12"), (48121620, 56729190, 26.895, "2018-07-13"), (51015202, 45567002, 21.62, "2018-07-15"), (51015202, 23488900, 24.46, "2018-07-17"), (48121620, 90437829, 9.010, "2018-07-22"), (48121620, 67738212, 11.345, "2018-07-25"), (24861012, 10029382, 23.456, "2018-07-27");
Here is the resultant sales table for branch B:
We are finally ready to import our sales source tables to Power BI and create our box plot visualizations. We will simultaneously illustrate two box plots in Power BI to compare car sales from branches A and B.
Two Box Plots Simultaneously
To get started, open Power BI Desktop and follow these steps:
To import our database from the MySQL server, click on "Get data" in the data section, as shown below, and then click on the "More…" option.
This will cause a new window to appear. Click on "Database" and locate "MySQL database," as shown below. Then click on "Connect" at the bottom of the window.
Enter the following relevant details about your MySQL Workbench server and database name and click "OK."
If your connection is successful, the following window will appear.
Select all the required tables and click "Load." Power BI not only allows us to preview our dataset at this stage but also enables the users to ensure that their dataset is complete and clean. The "Transform Data" option at the bottom of the window allows users to manipulate the dataset by deleting duplicate entries, accounting for any missing values and erroneous data entries, and so on.
In our case, our sales tables are clean and thus ready to use. So, we will be skipping this step.
Now we will be redirected back to the main interface of Power BI. Unfortunately, Power BI does not officially offer the box plot visualization tool. However, we can still acquire visualization extensions from third-party developers. Under the Visualizations section, click on the three dots following the various graph icons and select "Get more visuals," as shown below.
A new window will appear. Under the "Advanced Analytics" header, you will find "Box and Whisker chart by MAQ Software." Click Add. Note: Power BI specifically endorses this extension, as evident by the blue tick icon.
Once the installation has been successful, click on the new icon in the Visualizations section.
This will create an outline of your desired visual. We can also drag and reshape the dimensions of this outline to our desired liking.
Now it's time to populate the box plot. Under the Visualizations section, below the new box plot icon, we will find the "Axis" box, as shown below. Drag the sales_in_thousand column from the branch A table under the Fields section and insert it into the "Axis" section. This measure will define the axis of the box plot.
Below the "Axis" category, we will also find the "Value" box. Again, drag the sales_in_thousand column from branch A sales table and insert it into the box as shown below. This creates the box plot with different sales data points. Since we are only interested in illustrating the sales differences between the two branches, we will only use the sales column from both tables.
But wait, we are not done yet! This box plot visualization tool offers many customizations to ensure flexibility in building box plots. To get started, click on "Format your visual," as shown below.
We will see that there are several different editing categories. We can change the orientation of the box plot from vertical to horizontal, as shown below.
We can also alter the scale of the x-axis. Below the "X-Axis" tab, we can set the start to 0 and end it with 80 to accommodate all our data points. Users can also alter the decimal places and display units, as shown below.
Building upon what we have learned, we will move away from this tool's default criteria for creating box plots and setting our own. Under the "Box options," we can set the whisker type to custom, the lower percentile Q1 to 25, and the higher percentile Q3 to 75, as shown below. If we scroll down in the same list, we can also find the option to change the color of our boxplot from the monotonous gray.
Finally, we can change the box plot's heading. Under the "Title" section, we can insert and edit the text using different formatting tools available.
We have successfully plotted and edited the box plot in Power BI, illustrating the sales distribution of branch A.
Repeat steps 7-16 to create the second box plot, which presents the sales distribution of branch B. It's possible to create it directly below the first box plot, as shown below.
Conclusion and Interpretation
Let's interpret the results. Immediately we can tell that branch A had higher average car sales for October 2018 than branch B. In the diagram, the mean is illustrated by the white circle. We can also infer that there is a lot of variation in the car sales at branch B. Its interquartile range is much larger, and data is spread out more than the sales data of branch A. The variation in sales is a lot lower for branch A. We can also see that outliers are present in both box plots; however, branch B sales data have more outlier data points than branch A. Lastly, we can also interpret that our sales data from both branches are positively skewed.
- Check out all these other Power BI Tips
About the author
View all my tips
Article Last Updated: 2022-12-27