Chapter 6: Visualize Data
Early draft release: This chapter hasn’t been fully edited yet.
Data visualization is a vital step in any data engineering or analysis workflow. It allows you to see and understand complex datasets, turning raw numbers into visual stories that inform decisions. In Microsoft Fabric, you have multiple options for visualizing data, from built-in tools in notebooks to external Python libraries and integration with Power BI. This chapter will guide you through these options, showing how to create charts from PySpark data and share your findings with others. By the end, you will be able to perform exploratory data analysis (EDA) on big data and present results in a way that business users can easily grasp.
All code examples in this chapter assume that you have a “sales” table, created from the sales_data.csv file, in a Lakehouse attached to your notebook, and that you have read it into a DataFrame:
sales_df = spark.read.table("sales")
Please refer to Chapter 1 for instructions on how to set up the environment and load the data.
6.1 Visualization Options within Fabric
When working within Microsoft Fabric, you can visualize data using two broad approaches: built-in visualization capabilities and external Python libraries. It’s important to understand the benefits of each approach:
- Built-in Fabric tools: Fabric notebooks provide native charting options that require no code. When you output a DataFrame with the display() function, you can switch the result from a table to a chart view and configure the visualization through the UI. This is great for quick insights without writing any plotting code. Fabric is also tightly integrated with Power BI, which means you can leverage Power BI’s rich visualization engine directly within your notebooks (more on this later).
- External Python libraries: You can also use popular Python visualization libraries such as Matplotlib, Seaborn, or Plotly within a Fabric notebook to create custom charts. These libraries give you fine-grained control over the appearance and type of charts. They do require a bit of coding, but they enable complex or highly customized visuals beyond the built-in options.
In practice, you might use a combination of both. For quick exploratory charts or simple needs, the built-in charts or a Power BI integration can save time. For polished or unique visuals (e.g., a specific statistical plot), external libraries are invaluable. In the next sections, we’ll dive into how to use these tools in the context of PySpark data within Fabric.
6.2 The Built-in Chart View
The fastest way to visualize data in a Fabric notebook is the one you’ve been using since Chapter 1 to look at tables: the display() function.
display(sales_df)
The output is an interactive widget, not just a static table. Three features make it a genuine exploration tool:
- The table view lets you sort columns, search, and select which columns to show, which is convenient for a first look without writing any select() or orderBy().
- The Inspect pane (also available by running display(sales_df, summary=True)) shows a data profile for each column: distribution histogram, count of missing values, min/max/mean for numeric columns, distinct counts for categorical ones. Before plotting anything deliberately, a minute spent in this pane tells you which columns are worth plotting at all.
- The chart view turns the result into a chart with no code. Select Chart (the + New chart button in the output toolbar), and Fabric suggests a visualization for your data; you can then configure everything through the UI: chart type (bar, line, area, scatter, pie, and more), the columns for the X and Y axes, an optional series column for grouping, and the aggregation method (sum, average, count…). You can add up to five charts on a single display() output, each configured independently, which is handy for looking at the same result from several angles.
For example, run the following aggregation and switch the output to a bar chart of total_sales by store_location:
from pyspark.sql.functions import sum
sales_by_store = sales_df.groupBy("store_location").agg(
    sum("sale_amount").alias("total_sales")
)
display(sales_by_store)
A couple of practical notes. First, display() renders a limited number of rows (1,000 by default, and you can raise the limit to 10,000 in the output toolbar). For charts over a large table, aggregate first with Spark, as we did above, so the chart sees the complete picture rather than a truncated one. Second, the chart configuration is saved with the notebook output, so a colleague opening your notebook sees the chart exactly as you configured it.
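To see why the order of operations matters, here is a minimal sketch in plain pandas. It uses synthetic data, and head(1000) merely stands in for a 1,000-row preview; that is an assumption made for illustration, not a description of how display() selects rows. The names demo_pd, truncated_totals, and full_totals are hypothetical.

```python
import pandas as pd

# Synthetic stand-in for a large sales table: 5,000 rows, 3 stores, $10 per sale.
stores = ["North", "South", "West"]
demo_pd = pd.DataFrame({
    "store_location": [stores[i % 3] for i in range(5000)],
    "sale_amount": [10.0] * 5000,
})

# Totals computed from a 1,000-row preview only (head() is an illustrative stand-in).
truncated_totals = demo_pd.head(1000).groupby("store_location")["sale_amount"].sum()

# Totals computed over every row, i.e. aggregating before anything is truncated.
full_totals = demo_pd.groupby("store_location")["sale_amount"].sum()

print("Preview only:", truncated_totals.to_dict())
print("All rows:    ", full_totals.to_dict())
```

The preview-based totals cover only a fifth of the revenue here. Aggregating in Spark first reduces the result to one row per store, so no row limit can distort the chart.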
The built-in chart view is perfect for everyday exploration. Its limits are customization (you choose among preset chart types and options) and reproducibility (the configuration lives in the UI, not in code). When you need full control, or charts that are generated programmatically, it’s time for the Python libraries.
6.3 Integrating Python Visualization Libraries
One of the strengths of Microsoft Fabric notebooks is that the most common Python visualization libraries — Matplotlib, Seaborn, Plotly, Bokeh — are preinstalled and ready to use. However, using these libraries with PySpark requires understanding how to bridge between the distributed Spark world and the plotting functions, which typically expect local data (like a pandas DataFrame or NumPy array).
6.3.1 Using Matplotlib for Basic Charts
Matplotlib is the fundamental plotting library in Python. It allows you to create static charts like line graphs, bar charts, histograms, pie charts, and more. If you’ve never used it, think of Matplotlib as the Python equivalent of creating a chart in Excel or Power BI, but through code. It might feel low-level, but it’s very powerful and flexible.
To use Matplotlib in a PySpark notebook, follow these general steps:
- Import Matplotlib – Usually you’ll import it as import matplotlib.pyplot as plt. This is the standard convention.
- Convert the Spark DataFrame to pandas – Matplotlib cannot directly plot a Spark DataFrame. You first need to collect the data to the notebook’s driver memory. Typically, this means converting the Spark DataFrame to a pandas DataFrame using .toPandas(). Be careful: this operation should only be done on a small amount of data, because it moves data out of Spark’s distributed memory into a single node’s memory. Often, you’ll filter, aggregate, or sample your Spark DataFrame before converting, to avoid running out of memory.
- Create the plot – Use Matplotlib functions or pandas plotting methods to create a chart. For example, you can use plt.plot() for line charts or plt.hist() for histograms. If you have a pandas DataFrame pdf, you can also call methods like pdf.plot(kind='hist') or pdf.plot(kind='bar') to quickly plot using pandas’ built-in wrappers for Matplotlib.
- Show the plot – In a Jupyter-style environment, simply creating the plot is often enough to display it. However, it’s good practice to call plt.show() to ensure the figure renders. In Fabric notebooks, the Matplotlib output appears right below the cell.
Let’s walk through an example with our sales data. We want to visualize the distribution of sale amounts — are most sales small, with a few large outliers, or evenly spread? A histogram answers that. If the table were huge we would sample it first; here we demonstrate the full, safe pattern:
import matplotlib.pyplot as plt
# Sample the Spark DataFrame (10% of rows) and collect it to pandas.
# On a small table you could convert directly; on billions of rows you must reduce first.
sales_pd = sales_df.sample(fraction=0.1, seed=42).toPandas()
# Plot a histogram of sale amounts
ax = sales_pd['sale_amount'].plot(kind='hist', bins=25, color='lightblue')
ax.set_title('Sale Amount Distribution')
ax.set_xlabel('Sale Amount ($)')
ax.set_ylabel('Frequency')
plt.show()
The resulting histogram shows the frequency of sales by amount, typically skewed toward smaller amounts, with a long tail of larger transactions.
In this code, we downsampled the data and used toPandas() to get a pandas DataFrame. As you can see, plotting with Matplotlib is straightforward once the data is in pandas form. We used pandas’ .plot(kind='hist') method for convenience, which is a wrapper around Matplotlib. Alternatively, we could have used plt.hist(sales_pd['sale_amount'], bins=25) to achieve a similar result. The key takeaway is that Matplotlib requires local data, so always ensure you’ve aggregated or sampled appropriately before plotting.
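For comparison, the direct plt.hist() variant mentioned above might look like the sketch below. It runs on synthetic, right-skewed numbers rather than on the real sales_pd, so the array name amounts and the lognormal parameters are assumptions chosen only to produce a plausible shape.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, right-skewed amounts standing in for sales_pd['sale_amount'].
rng = np.random.default_rng(42)
amounts = rng.lognormal(mean=3.0, sigma=0.8, size=1000)

plt.figure(figsize=(6, 3))
# plt.hist returns the per-bin counts, the bin edges, and the drawn bar patches.
counts, edges, _ = plt.hist(amounts, bins=25, color="lightblue")
plt.title("Sale Amount Distribution (synthetic data)")
plt.xlabel("Sale Amount ($)")
plt.ylabel("Frequency")
plt.show()

print("Bins:", len(counts), "- values counted:", int(counts.sum()))
```

Unlike the pandas wrapper, plt.hist() hands back the bin counts and edges, which is useful when you want to inspect or reuse the numbers behind the bars.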
Matplotlib can create many other chart types. For example, here is a line chart of revenue over time, where the heavy lifting (grouping by day) is done by Spark and only the small aggregated result is collected:
from pyspark.sql.functions import sum
daily_pd = (
    sales_df.groupBy("date")
    .agg(sum("sale_amount").alias("daily_revenue"))
    .orderBy("date")
    .toPandas()
)
plt.figure(figsize=(8, 3))
plt.plot(daily_pd['date'], daily_pd['daily_revenue'])
plt.title('Daily Revenue')
plt.xlabel('Date')
plt.ylabel('Revenue ($)')
plt.show()
This “aggregate in Spark, plot in pandas” pattern is the single most important habit of this chapter, and we’ll keep applying it.
6.3.2 Using Seaborn for Statistical Visualizations
Seaborn is a library built on top of Matplotlib that provides a high-level interface for drawing attractive statistical graphics. It integrates well with pandas DataFrames and can produce complex plots with relatively little code. For beginners, Seaborn is often appreciated for its sensible default styles and the ease of creating visuals like distribution plots and categorical comparisons.
When working with Seaborn in a Fabric notebook, you follow the same process as with Matplotlib:
- Import Seaborn (import seaborn as sns).
- Convert your data to pandas (Seaborn, like Matplotlib, works with arrays or pandas data).
- Use an appropriate Seaborn function to plot. Seaborn has specialized functions like sns.histplot, sns.boxplot, sns.barplot, sns.scatterplot, etc., which typically take the DataFrame and the columns to use for x, y, and hue (for grouping).
Continuing our sales example, let’s examine how the distribution of sale amounts varies by store location. A box plot is a good way to show the distribution (median, quartiles, and outliers) of a numeric variable across different categories. We’ll use Seaborn’s boxplot:
import seaborn as sns
plt.figure(figsize=(7, 4))
ax = sns.boxplot(x="store_location", y="sale_amount", data=sales_pd, showfliers=False)
ax.set_title('Sale Amount Distribution per Store Location')
ax.set_xlabel('Store Location')
ax.set_ylabel('Sale Amount ($)')
plt.xticks(rotation=45)
plt.show()
In the code above, sns.boxplot groups the sale_amount values by store location and draws one box per group. Each box shows the median sale and interquartile range for that store, allowing easy comparison; for instance, we might observe that one store’s typical sale is larger, or more variable, than the others’. Because sales_pd is the 10% sample collected earlier, the boxes approximate the distributions in the full table. By setting showfliers=False, we chose not to display individual outlier points, to focus on the bulk of the distribution. Seaborn takes care of a lot of the work (computing quartiles, etc.), so you can get insights quickly with a single function call.
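To make the ingredients of a box concrete, the following sketch computes the quartiles by hand with pandas on a tiny, hand-made DataFrame (demo_pd and its values are hypothetical, not real sales). It also applies the conventional 1.5 × IQR rule, the default whisker setting in Matplotlib and Seaborn, to show which points showfliers=False would hide.

```python
import pandas as pd

# Tiny hand-made stand-in for sales_pd; the 500 is a deliberate outlier.
demo_pd = pd.DataFrame({
    "store_location": ["North"] * 5 + ["South"] * 5,
    "sale_amount": [10, 20, 30, 40, 50, 5, 15, 25, 35, 500],
})

# The three values that define each box: lower quartile, median, upper quartile.
box_stats = (
    demo_pd.groupby("store_location")["sale_amount"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
box_stats.columns = ["q1", "median", "q3"]
box_stats["iqr"] = box_stats["q3"] - box_stats["q1"]

# Points above q3 + 1.5 * IQR are drawn as fliers (or hidden with showfliers=False).
box_stats["upper_fence"] = box_stats["q3"] + 1.5 * box_stats["iqr"]
fences = demo_pd["store_location"].map(box_stats["upper_fence"])
fliers = demo_pd[demo_pd["sale_amount"] > fences]

print(box_stats)
print(fliers)
```

Note how the single 500 sale barely moves the South store's median and quartiles. That robustness is why a box plot is a good first view of skewed sales data.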
Seaborn can also be used for other complex visualizations. For example, a scatter plot with a regression line can be drawn with sns.regplot, and a matrix of values with sns.heatmap (often used for correlation matrices; here, for total sales by store and month):
from pyspark.sql.functions import month, sum
# Aggregate in Spark: total sales per store and month, as a matrix
sales_matrix_pd = (
    sales_df.withColumn("month", month("date"))
    .groupBy("store_location").pivot("month").sum("sale_amount")
    .toPandas()
    .set_index("store_location")
)
sns.heatmap(sales_matrix_pd, annot=True, fmt=".0f", cmap='Blues')
plt.title('Total Sales by Store and Month')
plt.show()
Seaborn is a rich library for data exploration; browse the Seaborn example gallery for inspiration.
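A note on the matrix passed to sns.heatmap in the example above: it must be in wide form, with one row per store and one column per month. The sketch below shows that layout on a tiny hand-made table (long_pd and its values are hypothetical), including a store/month combination with no sales. Whether to fill such gaps with 0 or leave them empty is a judgment call that depends on what a missing cell means in your data.

```python
import pandas as pd

# Hand-made long-format totals; "West" has no sales in month 2 on purpose.
long_pd = pd.DataFrame({
    "store_location": ["North", "North", "South", "South", "West"],
    "month": [1, 2, 1, 2, 1],
    "total_sales": [100.0, 120.0, 80.0, 90.0, 60.0],
})

# One row per store, one column per month: the layout a heatmap expects.
matrix_pd = long_pd.pivot(index="store_location", columns="month", values="total_sales")

# A missing store/month combination appears as NaN after pivoting.
missing_cells = int(matrix_pd.isna().sum().sum())

# Here we treat "no rows" as "no sales"; skip this step to keep the cell empty.
matrix_filled = matrix_pd.fillna(0)

print(matrix_filled)
print("Cells that were missing:", missing_cells)
```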
6.3.3 Interactive Charts with Plotly
Matplotlib and Seaborn produce static images. Plotly, also preinstalled in Fabric, produces interactive HTML charts: the reader can hover over points to see exact values, zoom into a region, and toggle series in the legend. This interactivity costs you nothing extra in code — the high-level plotly.express API mirrors what you already know:
import plotly.express as px
from pyspark.sql.functions import month, sum
# Aggregate in Spark, as usual (deriving the month from the date column)
monthly_pd = (
    sales_df.withColumn("month", month("date"))
    .groupBy("month", "store_location")
    .agg(sum("sale_amount").alias("total_sales"))
    .orderBy("month")
    .toPandas()
)
fig = px.line(
    monthly_pd,
    x="month", y="total_sales", color="store_location",
    title="Monthly Sales by Store Location",
    labels={"total_sales": "Total Sales ($)", "month": "Month"}
)
fig.show()
Each store location becomes a line; hovering shows the exact value for that month, and clicking a legend entry hides or isolates that store. For charts destined to be read by others, in a shared notebook for example, this interactivity often makes Plotly the better choice over a static image.
Matplotlib, Seaborn, and Plotly all work best with reasonably small DataFrames. If your Spark data is huge, summarize or sample it before plotting. A common strategy is to use Spark to compute aggregated results (e.g., average values per category, time series of totals), collect that summary to pandas, and then plot the summary. This way you leverage Spark for the heavy lifting and Python libraries for visualization.
6.3.4 Handling Large Data in Visualizations
As mentioned, bringing a large dataset directly into a plotting library can be problematic. Here are some strategies when dealing with big data in visualization:
- Aggregation: Instead of plotting millions of raw data points, aggregate your data. For example, if you have a log of events, aggregate counts per day and plot the daily counts rather than every single event. An aggregated chart is usually also easier to read than a cloud of raw points.
- Sampling: If the pattern or distribution is what you care about and the data is too large, take a random sample (as we did with sales_df.sample()). A small representative sample can often approximate the full data’s distribution, and it’s much faster to plot. Use a seed for reproducible figures.
- pandas API on Spark: The pyspark.pandas API (which lets you treat Spark DataFrames like pandas ones) has a .plot() method that produces Plotly charts. However, under the hood it still must collect data to the driver for plotting. It’s usually clearer to do the aggregation and conversion yourself, as shown above, so the data-reduction step is explicit in your code.
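The claim that a sample can approximate the full distribution is easy to check on synthetic data. The sketch below compares the quartiles of 100,000 generated values with those of a 10% sample, using pandas only; the names full_pd and sample_pd and the lognormal parameters are illustrative assumptions. One difference from Spark is worth keeping in mind: pandas returns exactly 10% of the rows, whereas Spark's sample(fraction=0.1) returns approximately that share.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "sale amounts"; a stand-in for a table too large to plot.
rng = np.random.default_rng(0)
full_pd = pd.DataFrame({"sale_amount": rng.lognormal(mean=3.0, sigma=1.0, size=100_000)})

# Fixed random_state so the sample, and any figure built on it, is reproducible.
sample_pd = full_pd.sample(frac=0.1, random_state=42)

# Compare the quartiles of the full data with those of the sample.
quartiles = [0.25, 0.5, 0.75]
summary = pd.DataFrame({
    "full": full_pd["sale_amount"].quantile(quartiles),
    "sample": sample_pd["sale_amount"].quantile(quartiles),
})
summary["relative_diff"] = (summary["sample"] - summary["full"]).abs() / summary["full"]

print(summary)
```

Quartiles are stable under sampling; extreme values are not. If the question is about the largest transactions or rare categories, aggregate in Spark instead of sampling.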
By integrating these libraries into your Fabric workflow, you can perform exploratory data analysis right within your Spark notebook. You might start by writing a Spark SQL query or DataFrame transformation to get the data of interest, then visualize it to spot trends or outliers. In a single environment, you’re able to transform big data and then immediately visualize patterns — a very powerful combination for data engineers and analysts.
6.4 Using Power BI within Fabric Notebooks
While Matplotlib, Seaborn, and Plotly are excellent for coding custom visuals, Microsoft Fabric offers a unique advantage: tight integration with Power BI. Power BI is a business intelligence tool well-known for creating interactive dashboards and reports. In Fabric notebooks, you can embed Power BI visuals directly or even generate new Power BI reports on the fly. This means you have the option to leverage Power BI’s rich visualization capabilities alongside your PySpark code.
There are two primary ways to integrate Power BI into your notebook:
- Embed an existing Power BI report – This allows you to bring an already-built Power BI report (with all its interactive charts and slicers) into your notebook view. This is useful if you want to reference or display up-to-date dashboards as part of your analysis.
- Quickly create a new report from a DataFrame – Using the powerbiclient library, you can take a DataFrame and instantly generate a quick Power BI report without leaving the notebook. This report will automatically pick some visualizations to showcase your data, which you can then customize or save as a regular Power BI report.
Both scenarios use the powerbiclient Python package, which is preinstalled in the Fabric notebook runtime. Even better, when you run it inside a Fabric notebook, no extra authentication setup is needed — the library uses your current Fabric identity automatically. (If you use powerbiclient outside of Fabric — for example in a local Jupyter environment — you would need to authenticate explicitly, typically with DeviceCodeLoginAuthentication from powerbiclient.authentication.)
6.4.1 Embedding an Existing Power BI Report
Suppose your team already has a Power BI report in your Fabric workspace (for example, a sales dashboard built on the very tables you produce). Instead of switching contexts to Power BI, you can embed that report in your notebook. This allows you to view and even interact with the report right next to your code.
The powerbiclient package provides a Report class for this purpose:
from powerbiclient import Report
# IDs are visible in the report URL:
# https://app.powerbi.com/groups/<workspace_id>/reports/<report_id>/...
report = Report(group_id="your-workspace-id", report_id="your-report-id")
report
If the report lives in the same workspace as your notebook, you can simply pass group_id=None. Once this cell runs, the Power BI report appears directly in the notebook output. It looks and behaves just like it does in the Power BI service: you can interact with filters, hover over visuals for tooltips, and navigate the pages of the report. The embedding is live, meaning if the underlying data or report is updated, you’ll see the latest content when you rerun the cell.
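If you prefer pasting the report URL over copying the two IDs by hand, a small helper can extract them. The sketch below relies only on the URL pattern shown in the comment of the example above; parse_report_url is a hypothetical helper, not part of powerbiclient, and the IDs in the example are placeholders.

```python
import re

def parse_report_url(url):
    """Return (workspace_id, report_id) from a Power BI report URL.

    Assumes the .../groups/<workspace_id>/reports/<report_id>/... pattern.
    """
    match = re.search(r"/groups/([^/]+)/reports/([^/?#]+)", url)
    if match is None:
        raise ValueError(f"Not a recognizable report URL: {url}")
    return match.group(1), match.group(2)

# Placeholder IDs for illustration only.
workspace_id, report_id = parse_report_url(
    "https://app.powerbi.com/groups/11111111-aaaa/reports/22222222-bbbb/ReportSection"
)
print(workspace_id, report_id)
```

The two values can then be passed as group_id and report_id to Report.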
One powerful aspect of embedding a report is that you can combine it with your analysis. For example, you might run some PySpark code to compute a result and display it above, and right below, show a related Power BI chart for context or for further exploration. This bridges the gap between code-driven analysis and dashboard consumption.
6.4.2 Quick Visualize: Generating a Power BI Report from Data
What if you don’t have an existing report, but you want to create a quick visualization of a DataFrame using Power BI’s capabilities? The powerbiclient package provides the Quick Visualize feature for this. With a single function call, you can turn a DataFrame into a Power BI “quick report” embedded in your notebook.
A quick report is an automatically generated report that Power BI creates by analyzing your DataFrame. It includes some default visuals (like a bar chart of a categorical column against a numeric sum) to help you tell the story of your data quickly. The report is temporary by default — it won’t appear in your workspace’s list of reports unless you choose to save it.
To use Quick Visualize:
from powerbiclient import QuickVisualize, get_dataset_config
from pyspark.sql.functions import sum
# Prepare a (reasonably small) pandas DataFrame.
# Aggregate with Spark first if the source data is large.
sales_summary_pd = (
    sales_df.groupBy("store_location")
    .agg(sum("sale_amount").alias("TotalSales"))
    .toPandas()
)
# Generate and embed a quick Power BI report
pbi_visualize = QuickVisualize(get_dataset_config(sales_summary_pd))
pbi_visualize
When you run this, Power BI generates a report and embeds it in the output. You might see a couple of charts appear, for example a bar chart of TotalSales by store_location; the actual visuals depend on your data’s characteristics.
Power BI’s quick report is meant for instant exploration. It’s not saved automatically — if you rerun the cell, it regenerates a fresh report. However, if you find the visualization useful, you can use the Save button in the embedded report to save it to your workspace as a regular Power BI report, which you can later open and edit in the Power BI interface like any other report.
One great aspect of using Power BI in notebooks is interactivity. Even the auto-generated quick report is interactive: you can click on parts of visuals, apply filters, and edit the visuals, all within the notebook. You could, for instance, filter that quick sales report to only a particular store and see the charts update, then write notes in a Markdown cell below about the observation.
By integrating Power BI this way, data engineers and analysts who are already familiar with Power BI’s visuals get the best of both worlds: the power of PySpark for data prep and the rich visuals of Power BI for data exploration and presentation, all in one place.
A simple decision guide for the options seen so far:
| Need | Tool |
|---|---|
| Quick look at a result while developing | display() chart view |
| Custom static chart, full control | Matplotlib / Seaborn |
| Interactive chart in a notebook | Plotly |
| Visuals for business users, reusing Power BI skills | powerbiclient / Quick Visualize |
| A durable, refreshable dashboard | A real Power BI report on your Lakehouse tables |
For production reporting, remember that the best “visualization” output of a data engineer is often not a chart at all, but a clean, well-modeled Lakehouse table that Power BI can consume in Direct Lake mode.
6.5 Interactive Visualizations with Jupyter Widgets
So far, we’ve covered static visualizations (with Matplotlib/Seaborn), interactive charts (Plotly), and Power BI reports. Another way to enhance interactivity in your Fabric notebooks is by using Jupyter widgets. Widgets allow you to add UI controls (like sliders, dropdowns, and checkboxes) to your notebook that can dynamically control your code. This can turn a notebook from a static report into a simple application that lets users explore data by adjusting parameters — useful for both your own exploratory analysis and for sharing with others.
For example, imagine you want to let a user (or yourself) choose a subset of the data to visualize, without rewriting code each time. You could use a dropdown widget to select a category, and have the chart update based on that selection.
Jupyter widgets are provided by the ipywidgets library, which is supported in Fabric notebooks. Common ways to use widgets include the interact function for quick use cases, or manually creating widget objects for more control.
Let’s demonstrate with our sales data: a dropdown that selects a store location and redraws the sale amount histogram for that store only. (We reuse sales_pd, the sampled pandas DataFrame collected in the Matplotlib section earlier — if you jumped straight to this section, run that cell first.)
import ipywidgets as widgets
import matplotlib.pyplot as plt
# Build the list of stores once, from Spark
store_locations = [row['store_location']
                   for row in sales_df.select("store_location").distinct().collect()]
def plot_sales_for_store(store):
    # Filter the pandas DataFrame for the chosen store
    subset = sales_pd[sales_pd['store_location'] == store]
    plt.figure(figsize=(5, 3))
    subset['sale_amount'].hist(bins=20, color='orange')
    plt.title(f'Sale Amount Distribution - {store}')
    plt.xlabel('Sale Amount ($)')
    plt.ylabel('Frequency')
    plt.show()
# Tie a dropdown to the plotting function
widgets.interact(
    plot_sales_for_store,
    store=widgets.Dropdown(options=sorted(store_locations), description='Store:')
)
When this cell is executed, it displays a dropdown listing the store locations. Selecting one calls plot_sales_for_store with that value and redraws the histogram instantly, without rerunning the cell manually. The widget and function work together to update the output dynamically. Other useful controls include IntSlider (e.g., to vary the number of histogram bins), SelectionRangeSlider (for a date range), and Checkbox (to toggle options).
Interactive widgets can greatly enhance exploratory analysis:
- You can adjust parameters (like the number of bins in a histogram, a date range, or a category filter) and see the chart update immediately.
- This encourages what-if analysis. For instance, “What if I focus on weekends instead of weekdays?” could be a dropdown of days, updating a chart or calculation.
- For business users, a notebook with widgets can feel like a lightweight interactive report or app, where they can self-serve some questions by manipulating the controls you provided.
Note that widget interactivity requires a live notebook session: the Python function behind the widget has to run somewhere. When a notebook is viewed without an active session, readers see the last rendered output but moving a control won’t recompute results. Also notice a performance detail in our example: the callback filters the already-collected pandas DataFrame rather than launching a Spark job on every dropdown change — keep widget callbacks cheap, and do the Spark work once, before the widget.
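One way to keep a callback cheap is to split the collected data per store once, so that each widget event becomes a dictionary lookup. The sketch below illustrates the idea on a tiny hand-made DataFrame and returns the histogram counts instead of drawing them. The names demo_pd, amounts_by_store, and histogram_for_store are hypothetical. In a notebook you would plot the counts inside the function and hand it to widgets.interact with a Dropdown for store and an IntSlider for bins.

```python
import numpy as np
import pandas as pd

# Tiny hand-made stand-in for the already-collected sales_pd.
demo_pd = pd.DataFrame({
    "store_location": ["North", "South", "North", "West", "South", "North"],
    "sale_amount": [12.0, 30.0, 18.0, 45.0, 22.0, 60.0],
})

# Done once, before any widget exists: one NumPy array of amounts per store.
amounts_by_store = {
    store: group["sale_amount"].to_numpy()
    for store, group in demo_pd.groupby("store_location")
}

def histogram_for_store(store, bins=5):
    """Cheap callback body: a dictionary lookup plus an in-memory histogram."""
    counts, edges = np.histogram(amounts_by_store[store], bins=bins)
    return counts, edges

# Simulate two widget events: same store, different bin counts.
counts, edges = histogram_for_store("North", bins=3)
counts_fine, _ = histogram_for_store("North", bins=6)
print(counts, edges)
```

No Spark job runs inside histogram_for_store, so moving a slider stays responsive however large the source table is.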
6.7 Chapter Summary
In this chapter, we explored how to visualize data within Microsoft Fabric, using both built-in and external tools:
- The built-in display() widget offers a table view, a data-profiling Inspect pane, and a configurable chart view (up to five charts per output). It is the fastest path from query result to chart, with no code.
- We learned how to use Python visualization libraries to create custom charts, always following the same pattern: aggregate or sample with Spark, convert the small result to pandas, then plot. With Matplotlib, we made histograms and line charts; with Seaborn, box plots and heatmaps for statistical comparisons; with Plotly, interactive charts that readers can hover over and zoom into.
- We discovered the Power BI integration in Fabric notebooks via the preinstalled powerbiclient package, which works without any extra authentication inside Fabric. You now know how to embed existing Power BI reports in your notebook, and how to use Quick Visualize to generate new Power BI quick reports straight from your DataFrames.
- We introduced interactive widgets (ipywidgets) as a way to add user controls to notebooks, enabling interactive data exploration without any coding by the end user.
- Finally, we covered how to share notebooks with business users in a safe, read-only manner using Fabric’s sharing permissions, presenting your analysis like a story where others can explore the outputs but not alter the underlying content.
At this point, you have a toolbox of visualization techniques at your disposal. For early-career data engineers or analysts, these skills are crucial. They enable you not only to crunch large datasets with PySpark but also to make sense of the results and communicate them. By practicing creating charts and interactive reports, you’ll become proficient in both the engineering side (data processing) and the analytics side (insight generation and communication).
In the next chapter, we will move on to engineering your code — where we discuss best practices to write efficient, maintainable PySpark code in Fabric, so that your data pipelines and analyses are not just effective, but also robust and reusable. Until then, take some time to experiment with visualizing your own data; it’s one of the most rewarding parts of the data journey.