Chapter 7: Engineer Your Code

Note

Early draft release: This chapter hasn’t been fully edited yet.

In this chapter, we focus on engineering your PySpark code in Microsoft Fabric for maintainability, performance, and reusability. We’ll cover how to configure Spark settings and manage different environments, how to organize and reuse code (including calling external .py files and using custom libraries), and techniques for measuring performance and logging. The goal is to ensure you can develop code that is efficient, modular, and production-ready within the Fabric ecosystem.

7.1 Spark Configuration and Environments

A Microsoft Fabric environment consolidates your Spark runtime, library packages, and configuration settings in one place. By attaching a notebook to a specific environment, you control the Spark version, default compute settings, and available libraries for that session. Environments allow you to manage different setups (e.g. development vs. production) easily.

  • Spark compute settings: When creating an environment, you choose a Spark runtime (which determines the Spark version and preinstalled packages) and can select or configure the Spark pool (cluster) resources. For example, you might use a smaller pool for development and a larger one for production jobs.
  • Spark properties: You can set Spark configuration properties either via the environment (in the UI or environment JSON) or in code. Environment-level Spark properties (like executor memory, shuffle partitions, etc.) are applied when the environment is published and attached. You can also tweak properties at runtime using PySpark. For instance, to change the default shuffle partitions for your session:
# Check the current setting
print("Default shuffle partitions:", spark.conf.get("spark.sql.shuffle.partitions"))

# Set a custom shuffle partition count
spark.conf.set("spark.sql.shuffle.partitions", "50")

This setting will apply to any new Spark operations in this session. Use configuration adjustments sparingly and only when needed — for example, increasing spark.sql.shuffle.partitions for very large joins, or tuning memory settings for a specific job.

  • Managing multiple environments: Fabric lets you create multiple environments for different purposes. For instance, you might have an environment with a newer Spark runtime or different library versions for testing. You can easily switch a notebook’s environment from the toolbar, or set a workspace default environment for all notebooks. Keep in mind that changing an environment (for example, updating the Spark runtime or libraries) requires publishing those changes, which will restart the Spark session with the new configuration. This publish step can take several minutes if libraries are being installed, so plan environment changes accordingly.
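
When a runtime property tweak such as the shuffle-partition change above is needed for only one heavy step, it can help to restore the previous value afterwards so later cells are unaffected. The following is a minimal sketch, not a Fabric feature: temporary_conf is a hypothetical helper name, and it assumes only the get, set, and unset methods of spark.conf.

```python
from contextlib import contextmanager


@contextmanager
def temporary_conf(spark, key, value):
    """Set a Spark conf key for the duration of a block, then restore it.

    `temporary_conf` is a hypothetical helper; `spark` is the session object
    that a notebook provides.
    """
    previous = spark.conf.get(key, None)  # None if the key was never set explicitly
    spark.conf.set(key, value)
    try:
        yield
    finally:
        if previous is None:
            spark.conf.unset(key)  # fall back to the default again
        else:
            spark.conf.set(key, previous)


# Usage inside a notebook (big_df and other_df are placeholders):
# with temporary_conf(spark, "spark.sql.shuffle.partitions", "400"):
#     joined_count = big_df.join(other_df, "id").count()
```

Because Spark is lazy, the action that triggers the shuffle must run inside the with block; otherwise the setting has already been restored when the job executes.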
Tip

Each Spark runtime in Fabric comes with a set of built-in libraries (pandas, NumPy, Spark, and many others). You can find the list of preinstalled packages for each runtime in the official documentation — this helps you know what’s available out of the box, so you don’t reinstall those. You can verify some key libraries and versions in a notebook:

import pyspark
import pandas as pd
import numpy as np

print("PySpark version:", pyspark.__version__)
print("Pandas version:", pd.__version__)
print("NumPy version:", np.__version__)

7.2 Using External Code and Libraries

Effective PySpark development often requires reusing code across notebooks and including external libraries. Microsoft Fabric provides several ways to incorporate code from outside your current notebook.

7.2.1 Calling Local Python Files or Notebooks

If you have common utility code, you can store it in a separate notebook or a Python file and invoke it when needed.

Using %run for notebooks or scripts. A quick way to reuse code is the %run magic command, which executes another notebook or a local .py script. For example, if you created a notebook named Utils with some function definitions, you could include it at the top of another notebook with:

%run Utils

This executes the entire Utils notebook in the current context, so you can use the functions and variables it defines. This approach is simple and works well in development. You can also %run a Python script file stored in your notebook’s built-in resources or in the environment’s resources (for example, %run builtin/myscript.py).

Using Python’s importlib or runpy. Alternatively, you can execute a Python file using the runpy module:

import runpy
runpy.run_path("myscript.py", run_name="__main__")

This runs the script as if it were a standalone program. If the script contains function definitions that you want to call, a cleaner way is to import it as a module:

import importlib.util

spec = importlib.util.spec_from_file_location("myscript", "myscript.py")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

# Now use functions from the module
# (some_function, arg1, and arg2 are placeholders for your own names)
result = module.some_function(arg1, arg2)

The above is more advanced; in many cases, %run is sufficient for running a local script.

Organizing code in separate notebooks or scripts helps avoid copy-pasting code. However, be mindful that %run essentially inlines the code, which can become slow if you chain many %run calls or include a very large notebook. It also re-runs that code each time, which can be repetitive.
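
One way to soften the "re-runs that code each time" drawback is to wrap the importlib pattern shown earlier in a small helper that loads a file once and returns the already-loaded module on later calls. This is a sketch under assumptions: load_module_from_path is a hypothetical name, and the file path in the usage comment is a placeholder.

```python
import importlib.util
import sys
from pathlib import Path


def load_module_from_path(module_name, file_path):
    """Import a .py file as a module, executing it only on the first call."""
    if module_name in sys.modules:
        return sys.modules[module_name]  # already loaded: no re-execution

    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such script: {path}")

    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot build an import spec for {path}")

    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module  # register first, as the import system does
    try:
        spec.loader.exec_module(module)
    except Exception:
        sys.modules.pop(module_name, None)  # do not leave a half-initialized module
        raise
    return module


# Usage (placeholder path):
# utils = load_module_from_path("myscript", "myscript.py")
```

To pick up edits to the file during development, remove the entry from sys.modules (or restart the session) before calling the helper again.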

7.2.2 Working with Libraries in Fabric

As noted, each Fabric Spark environment includes a set of built-in libraries that are ready to use. When you need additional libraries (for example, a Python package from PyPI), you have two main options in Fabric:

  1. Add the library to the Fabric environment (persistent). You can install a public library (PyPI or Conda) into the environment via the environment’s UI. Go to the Public Libraries section of your environment, search for the package name and select the version, then save and publish the environment. Once published, the library is available in any notebook attached to that environment, and it is loaded every time the Spark session starts. This approach ensures the library is reliably present for all runs (great for scheduled jobs or pipelines), but remember that publishing changes can take some time (several minutes to resolve dependencies and restart).

  2. Use inline installation in a notebook (session-scoped). During interactive development, you can install a package on the fly using the %pip (or %conda) magic:

%pip install requests

This quickly downloads and installs the requests library into the current Spark session — on both the driver and the executors. After running it, you can import and use the package in the notebook:

import requests
resp = requests.get("https://httpbin.org/ip")
print(resp.json())

Inline installation is convenient for experimentation, but note that it only applies to the current session and notebook. If you restart the session, you’ll have to reinstall — or attach an environment that includes the library. For scheduled pipeline runs, prefer environment-managed libraries: they guarantee a consistent, reproducible setup, whereas inline installs add startup time to every run and may be restricted in automated contexts. If you do use inline installation, put all %pip commands at the very beginning of your notebook, because installing packages restarts the Python interpreter.

Note

Why use %pip instead of !pip?

In Fabric notebooks, %pip is a special command that ensures the package is available on all Spark executor nodes and handles dependency resolution gracefully. In contrast, !pip install (a shell command) would only affect the driver node and can lead to mismatches. Always prefer the %pip magic for installing Python packages in Spark notebooks.
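
Before adding a %pip install line, it can be worth checking whether the runtime already ships the package, since the built-in list is long. A minimal sketch using only the standard library follows; installed_version is a hypothetical helper name, and the check inspects only the Python process it runs in (the driver), not the executors.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report on a few packages you might be about to install.
for name in ["pandas", "numpy", "requests"]:
    print(name, "->", installed_version(name) or "not installed")
```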

7.2.3 Creating and Using Custom Python Libraries

Often, you’ll develop your own Python modules or utilities specific to your data engineering tasks. There are a few ways to integrate custom code libraries in Fabric:

Using environment resources for code. Each Fabric environment has a Resources section — a small file storage accessible by all notebooks attached to that environment. You can upload Python files here (like mylib.py or even a folder of modules). These files are immediately accessible to your notebooks, with no publish step needed. When the notebook is attached to the environment, you can browse the resources from the notebook’s explorer and drag a file into a code cell to generate a reference snippet. From there, you can %run the file, or add its folder to sys.path and import it as a regular module.

Environment resources are great for sharing code during development. However, files there are not automatically distributed to executor nodes — code that runs inside Spark tasks (for example, in a UDF) must be made available to executors explicitly (see addPyFile below) or installed as a library.

Uploading custom libraries to the environment (.whl or .py). For a more production-grade solution, package your Python code into a library and attach it to the environment as a custom library. Fabric supports uploading .whl, .jar, .tar.gz, and even single .py files. For example, if you have a Python package, build a wheel distribution and upload it in the environment’s Custom Libraries section, then publish the environment. Once done, any notebook using that environment has the library installed and can simply import it.

For example, suppose you created a simple library called data_cleaner with this structure:

data_cleaner/
    __init__.py
    cleaner.py        # contains functions like clean_dataframe, etc.
pyproject.toml

After building a wheel (e.g., with python -m build), you upload data_cleaner-0.1-py3-none-any.whl to the environment. After publishing, you can use it in notebooks:

import data_cleaner

df = spark.read.csv("Files/raw.csv")
df_clean = data_cleaner.cleaner.clean_dataframe(df)

If your custom code is just one or two Python files, you don’t necessarily need to create a full .whl: uploading a single .py as a custom library also lets you import it in your notebooks, and Fabric handles making it available on all nodes once the environment is published.

Dynamic code distribution with SparkContext (advanced). PySpark provides a method to add code files to all executors at runtime: spark.sparkContext.addPyFile(). This can distribute a .py file or a .zip archive of Python code to the cluster. If you prefer not to go through environment publishing for every code change, this approach is useful. For example, you could zip up a folder of utility modules into utils.zip, store it in the Lakehouse Files area, and do:

spark.sparkContext.addPyFile("abfss://.../Files/utils.zip")
import my_utils  # assuming utils.zip contains my_utils.py

This makes my_utils available on the executors for use in your Spark functions. Keep in mind that addPyFile is session-scoped; you’ll need to call it each time a new Spark session starts if you rely on it.
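
The archive itself can be produced with the standard library. The sketch below assumes a local folder of utility modules; build_pyfile_archive and the folder name in the usage comment are hypothetical. Modules are stored relative to the folder root so that a file like my_utils.py is importable by its plain name once the zip is on the Python path.

```python
import zipfile
from pathlib import Path


def build_pyfile_archive(source_dir, archive_path):
    """Zip every .py file under source_dir, keeping paths relative to it."""
    source = Path(source_dir)
    archive = Path(archive_path)
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for py_file in sorted(source.rglob("*.py")):
            # Store "my_utils.py", not "/full/local/path/my_utils.py".
            zf.write(py_file, py_file.relative_to(source).as_posix())
    return archive


# Usage (placeholder folder name):
# build_pyfile_archive("utils_src", "utils.zip")
```

Uploading the resulting file to the Lakehouse Files area and passing its location to addPyFile remains a separate step.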

Choosing the right approach. For rapid development of shared code, environment resources or %run on notebooks are simplest (no external tools needed, quick iteration). For enterprise scenarios or production pipelines, packaging your code into libraries ensures better version control and stability. Note that attaching an environment with many libraries increases Spark session startup time (the cluster must resolve and install those libraries). If your library code is changing frequently, you might delay packaging until it stabilizes, to avoid the overhead of publishing new environments for every change. In such cases, use the quick methods during development, then package for production.

7.2.4 Organize Your Code for Clarity and Reuse

As you build PySpark projects in Fabric, organizing code becomes important:

  • Modularize your code: Break complex tasks into reusable functions or classes. Keep your notebook focused on the high-level workflow, while the heavy lifting (e.g., complex transforms, validations) resides in functions, ideally in an imported module. This makes the notebook easier to read and debug.
  • Use version control: Fabric workspaces integrate with Git, so your notebooks can be versioned alongside the rest of your code. For Python modules or packaged libraries, maintain them in a repository as well, so your custom library can be versioned and developed collaboratively using standard tools.
  • Follow best practices: Treat your data engineering code like software. Add comments, write docstrings for your functions, and include basic tests for critical logic (even if just simple assertions within notebooks). Logging (discussed below) is also a key part of making your pipeline code understandable and maintainable.
  • Manage configuration separately: Avoid hard-coding environment-specific details (like file paths or workspace names) in your code. Use a config file or notebook parameters (described in Section 7.4). This lets you run the same code in dev/test/prod by changing configuration, not code.

By organizing your code well, you make it easier for others (and future you) to follow the logic, and for yourself to extend or debug the pipeline.
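
To make the "manage configuration separately" advice concrete, here is one possible shape for a per-stage configuration. It is a sketch, not a Fabric convention: the PipelineConfig fields, the stage names, and the paths are all illustrative assumptions, and in practice the JSON text would come from a file rather than a string literal.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """Settings that differ per stage (field names are illustrative)."""
    stage: str
    input_path: str
    output_table: str


def load_config(config_text, stage):
    """Parse JSON holding one object per stage and return the requested stage."""
    all_stages = json.loads(config_text)
    if stage not in all_stages:
        raise KeyError(f"Unknown stage {stage!r}; expected one of {sorted(all_stages)}")
    return PipelineConfig(stage=stage, **all_stages[stage])


# Hypothetical configuration; normally stored in a file next to the code.
EXAMPLE_CONFIG = """
{
  "dev":  {"input_path": "Files/dev/raw/",  "output_table": "sales_dev"},
  "prod": {"input_path": "Files/prod/raw/", "output_table": "sales"}
}
"""

config = load_config(EXAMPLE_CONFIG, "dev")
print(config.input_path, "->", config.output_table)
```

The pipeline code then reads config.input_path instead of a literal path, and the stage name is the only value that changes between runs.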

7.2.5 Version Control and Deployment

Two Fabric features turn the “use version control” advice into a concrete workflow:

  • Git integration: A Fabric workspace can be connected to a Git repository (Azure DevOps or GitHub). Once connected, your notebooks, environments, Lakehouse definitions, and other items are synchronized with the repository: you commit changes from the workspace, review them as pull requests, and can roll back to any earlier state. For notebooks, the synchronized format is source code (a .py representation), which makes diffs reviewable — a major improvement over emailing notebook exports around.
  • Deployment pipelines: Fabric deployment pipelines promote content between workspaces representing your stages — typically Development → Test → Production. Combined with parameterized notebooks (no hard-coded paths or workspace names, as recommended above), this lets the exact code you tested be the code that runs in production, with only the configuration differing per stage.

Even a single-person project benefits from connecting the workspace to Git from day one: it gives you history, backup, and a safe way to experiment on branches.

7.3 Testing Your PySpark Code

Treating data pipelines like software also means testing them. You don’t need a heavyweight setup to start — a few well-chosen tests on your transformation functions catch most regressions.

The key enabler is the modularization advice from the previous section: logic that lives in a function taking a DataFrame and returning a DataFrame is easy to test, because you can feed it a tiny, hand-crafted DataFrame and check the result. Logic buried in a long notebook cell is not.

Suppose your shared module defines this transformation:

from pyspark.sql.functions import col, when

def add_sales_category(df):
    """Add a 'category' column: 'high' for amounts above 100, else 'low'."""
    return df.withColumn(
        "category",
        when(col("sale_amount") > 100, "high").otherwise("low")
    )

You can test it with a small in-memory DataFrame and PySpark’s built-in testing utilities (available in Spark 3.5 and later, which Fabric Runtime 1.3 provides):

from pyspark.testing import assertDataFrameEqual

def test_add_sales_category():
    input_df = spark.createDataFrame(
        [(1, 50.0), (2, 150.0)],
        ["sale_id", "sale_amount"]
    )
    expected_df = spark.createDataFrame(
        [(1, 50.0, "low"), (2, 150.0, "high")],
        ["sale_id", "sale_amount", "category"]
    )

    result_df = add_sales_category(input_df)
    assertDataFrameEqual(result_df, expected_df)

test_add_sales_category()
print("All tests passed")

assertDataFrameEqual compares schemas and rows (ignoring row order by default) and raises a clear error showing the differing rows when the test fails. A few practical guidelines:

  • Test the logic, not Spark. Don’t test that groupBy works — test that your business rules (categorizations, deduplications, edge cases like nulls and empty input) produce the expected output.
  • Keep test data tiny. Two to five rows per case is enough; the point is correctness, not scale. Small data also keeps the tests fast enough to run on every change.
  • Run tests where it suits your maturity. The simplest setup is a dedicated test notebook that imports your module, runs all test functions, and is executed (manually or via notebookutils.notebook.run) before you promote changes. Teams that package their code as a wheel can go further and run pytest locally or in a CI pipeline on every commit — your transformation functions are plain Python, so they run anywhere a Spark session is available.
  • Complement tests with runtime validation. Unit tests verify the code; the data-validation checks from Chapter 5 verify the data. Production pipelines need both.

7.4 Orchestrating and Running Notebooks Programmatically

In Fabric, you can chain notebooks together or run notebooks from code. This is useful for orchestrating multi-step workflows — or even a full DAG (directed acyclic graph) of tasks — entirely from notebooks. The tool for this is NotebookUtils.

Note

mssparkutils has been renamed to notebookutils. The old namespace still works for backward compatibility, but new features only land in notebookutils — use it in new code.

Running a notebook with notebookutils.notebook.run(). Fabric provides notebookutils.notebook.run() to call one notebook from another, similar to how Azure Synapse or Databricks notebooks can be orchestrated:

result = notebookutils.notebook.run("ETL_Step2", 600, {"input_path": "Files/processed/step1/"})

The arguments are the notebook name (or path), a timeout in seconds, and an optional dictionary of parameters. The call blocks until the notebook finishes or the timeout is reached.

Passing parameters. In the called notebook, designate one cell as a parameters cell (toggle it in the cell menu) and define default values there:

# Parameters cell of ETL_Step2
input_path = "Files/default/"

When the notebook is invoked through notebookutils.notebook.run() (or from a pipeline), the values passed by the caller override these defaults. This is the standard way to make notebooks reusable across environments and datasets.

Returning a result. The called notebook can end with notebookutils.notebook.exit("some value") to send a result string back to the caller; run() returns this value. This is useful for passing a status or a small piece of data — not for large data transfer (for large data, write to a Lakehouse table in one notebook and read it in the next).
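
Because the exit value is a single string, a common way to pass more than a bare status is to serialize a small dictionary as JSON. The following sketch shows only the encoding and decoding; the helper names and the rows_written field are illustrative assumptions, and the notebookutils calls appear in comments because they exist only inside a Fabric session.

```python
import json


def encode_exit_value(status, **details):
    """Build the string a child notebook passes to notebookutils.notebook.exit()."""
    return json.dumps({"status": status, **details})


def decode_exit_value(raw):
    """Parse a child's exit string; fall back to UNKNOWN for anything unexpected."""
    try:
        payload = json.loads(raw)
    except (TypeError, ValueError):
        return {"status": "UNKNOWN", "raw": raw}
    if not isinstance(payload, dict) or "status" not in payload:
        return {"status": "UNKNOWN", "raw": raw}
    return payload


# In the child notebook:
#   notebookutils.notebook.exit(encode_exit_value("SUCCESS", rows_written=1250))
# In the caller:
#   outcome = decode_exit_value(notebookutils.notebook.run("ETL_Step2", 600))
#   if outcome["status"] == "SUCCESS": ...

print(decode_exit_value(encode_exit_value("SUCCESS", rows_written=1250)))
```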

Sequential and conditional workflows. By combining multiple run calls with standard Python flow control, a parent notebook can act as a controller for a simple pipeline:

status = notebookutils.notebook.run("Data_Ingestion", 300, {"table": "customers"})

if status == "SUCCESS":
    notebookutils.notebook.run("Data_Processing", 300, {"table": "customers"})
else:
    print("Ingestion failed, skipping processing.")

Each child notebook runs within the same Spark session as the caller by default. This is efficient — there’s no new cluster to spin up for each step, and Spark-level state such as temporary views or cached data is shared. Python variables, however, are not shared between the caller and the callee: communicate through parameters, exit values, or data written to storage.

Parallel execution and DAGs with runMultiple(). When steps don’t depend on each other, you don’t have to run them one by one. notebookutils.notebook.runMultiple() runs several notebooks concurrently within the session, and can even execute a full DAG where notebooks declare dependencies on one another:

# Run two independent notebooks in parallel
notebookutils.notebook.runMultiple(["Load_Customers", "Load_Products"])

# Or define a DAG with dependencies and per-notebook parameters
dag = {
    "activities": [
        {"name": "Load_Customers", "path": "Load_Customers", "timeoutPerCellInSeconds": 300},
        {"name": "Load_Products", "path": "Load_Products", "timeoutPerCellInSeconds": 300},
        {
            "name": "Build_Sales_Mart",
            "path": "Build_Sales_Mart",
            "args": {"refresh_date": "2025-06-01"},
            "dependencies": ["Load_Customers", "Load_Products"]
        }
    ]
}
notebookutils.notebook.runMultiple(dag)

In this example, Load_Customers and Load_Products run in parallel, and Build_Sales_Mart starts only after both succeed. For larger orchestration needs — scheduling, retries, mixing notebooks with other activities like dataflows or copy jobs — use Fabric Data Pipelines, which can invoke your notebooks as pipeline activities.
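
The DAG passed to runMultiple() is plain Python data, so it can be generated and sanity-checked before it is submitted. The sketch below builds the same structure as the example above from a compact mapping and rejects unknown or circular dependencies early. build_dag is a hypothetical helper; it uses only the keys shown in this chapter and assumes each notebook's name doubles as its path.

```python
def _find_cycle(steps):
    """Return a list of names forming a dependency cycle, or None."""
    state = {}  # name -> "visiting" or "done"

    def visit(name, trail):
        if state.get(name) == "done":
            return None
        if state.get(name) == "visiting":
            return trail + [name]
        state[name] = "visiting"
        for dep in steps[name].get("dependencies", []):
            cycle = visit(dep, trail + [name])
            if cycle:
                return cycle
        state[name] = "done"
        return None

    for name in steps:
        cycle = visit(name, [])
        if cycle:
            return cycle
    return None


def build_dag(steps, timeout_per_cell=300):
    """Turn {name: {"args": ..., "dependencies": [...]}} into a runMultiple DAG."""
    activities = []
    for name, spec in steps.items():
        deps = list(spec.get("dependencies", []))
        unknown = [d for d in deps if d not in steps]
        if unknown:
            raise ValueError(f"{name} depends on unknown notebooks: {unknown}")
        activity = {"name": name, "path": name, "timeoutPerCellInSeconds": timeout_per_cell}
        if spec.get("args"):
            activity["args"] = dict(spec["args"])
        if deps:
            activity["dependencies"] = deps
        activities.append(activity)

    cycle = _find_cycle(steps)
    if cycle:
        raise ValueError("Circular dependency: " + " -> ".join(cycle))
    return {"activities": activities}


dag = build_dag({
    "Load_Customers": {},
    "Load_Products": {},
    "Build_Sales_Mart": {
        "args": {"refresh_date": "2025-06-01"},
        "dependencies": ["Load_Customers", "Load_Products"],
    },
})
# In a Fabric notebook: notebookutils.notebook.runMultiple(dag)
```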

Note

Fabric also supports %run, which differs from notebookutils.notebook.run(). %run inlines the target notebook’s code into the current notebook (like a copy-paste at runtime), sharing all variables. notebookutils.notebook.run() executes the target notebook as a separate, parameterized run with a clear start and end, an exit value, and its own run record (a snapshot you can inspect afterwards). Use %run for bringing in shared functions or setup code, and notebook.run for treating notebooks as pipeline tasks.

7.4.1 From Notebooks to Spark Job Definitions

Notebooks are not the only way to run Spark code in Fabric. A Spark job definition (SJD) is a Fabric item that runs one or more plain Python files (or a packaged application) as a batch Spark job — no cells, no outputs, just your main.py, optional library files, and command-line arguments.

When should you use one instead of a notebook?

  • Notebooks excel at interactive development, exploration, and pipelines where you want the run to leave a readable, cell-by-cell record with outputs. They are also the only place where magics (%run, %pip) and display() exist.
  • Spark job definitions suit mature, non-interactive batch workloads: the code is a standard Python application that your team develops in an IDE, tests with pytest, versions in Git, and builds in CI. There’s no notebook-specific syntax to strip out, and the job runs the exact artifact you built.

A common trajectory for a pipeline is exactly the path of this chapter: start in a notebook, extract the logic into modules, add tests. Once the pipeline is stable and scheduled, the step to an SJD is small, because the notebook has already shrunk to a thin orchestration layer. Both notebooks and Spark job definitions can be scheduled directly or invoked as activities in a Fabric Data Pipeline, so the choice doesn’t constrain your orchestration.

7.5 Performance Tips and Monitoring

As your data flows grow, performance becomes critical. PySpark, backed by Spark, is designed for large-scale data, but how you write your code can significantly impact speed. Here are some hands-on tips.

7.5.1 Measuring Execution Time

When developing in notebooks, you can use the IPython magic commands %%time and %%timeit to measure execution time of code:

%%time reports how long a cell takes to execute. This is useful for timing a Spark operation or a function call:

%%time
big_df = spark.range(0, 10000000)  # Create a DataFrame with 10 million numbers
print(big_df.count())              # Trigger computation by counting

The output shows CPU times and wall-clock time for the cell. The CPU figures cover only the driver’s Python process, while Spark does the heavy work in parallel on remote executors, so the wall time is the relevant number.

%%timeit runs the code multiple times to give an average runtime (7 runs by default). This is not always practical for expensive Spark jobs — you don’t usually want to run a heavy job multiple times — but you can tweak it, for example %%timeit -r1 -n1 to run just once. Generally, %%timeit is most useful for small Python code snippets, or to compare two approaches quickly (e.g., a pure Python loop vs. a pandas vectorized operation).

Use these tools to pinpoint slow parts of your code. For a deeper look at where a Spark job spends its time (skewed tasks, shuffles, serialization), use Fabric’s built-in Spark monitoring: each cell that runs a Spark job shows an inline progress indicator with per-job details, and from the notebook’s Run menu or the workspace Monitor hub you can open the full Spark UI (for live sessions) and the Spark history server (for completed runs) to inspect jobs, stages, and tasks.
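
The cell magics time a whole cell. To time individual steps inside a cell, or in code that runs outside a notebook where magics are unavailable, a small context manager built on time.perf_counter does the same job. This is a minimal sketch; timed is a hypothetical helper name, and the sink parameter (default print) is an assumption that lets you pass a logger method instead.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, sink=print):
    """Measure the wall-clock time of a block and report it through `sink`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        sink(f"{label}: {elapsed:.3f} s")  # reported even if the block raises


# Pure-Python demonstration; in a notebook the block would contain a Spark
# action such as big_df.count().
with timed("sum of squares"):
    total = sum(i * i for i in range(100_000))
```

As with %%time, the measurement is only meaningful for Spark code if the block contains an action; timing a chain of lazy transformations measures almost nothing.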

7.5.2 Lazy Evaluation and Caching

Remember that Spark evaluates transformations lazily. An operation like df.filter(...).select(...) does nothing until an action is called (e.g., .count(), a write such as df.write.save(...), or display()). This means if you write code like:

df = spark.read.parquet("Files/large.parquet")
print(df.count())       # Action 1
first_row = df.first()  # Action 2

you have two actions on the same DataFrame. Spark recomputes the DataFrame for each, which is inefficient. To improve this:

Cache the DataFrame. If you need to reuse the results of a computation for multiple actions, use caching:

df = spark.read.parquet("Files/large.parquet").cache()

total = df.count()      # Triggers full computation; data is cached
print("Total records:", total)

first_row = df.first()  # Uses cached data, much faster
print("First record:", first_row)

Here, the second action doesn’t recompute from the source, because the data was cached in memory after the first action. Remember to call df.unpersist() when you’re done, to free executor memory.
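
Forgetting unpersist() is easy, especially when a step fails partway through. One way to make the cleanup automatic is a context manager that caches on entry and unpersists on exit. The sketch below relies only on the cache() and unpersist() methods of a DataFrame; cached is a hypothetical helper name.

```python
from contextlib import contextmanager


@contextmanager
def cached(df):
    """Cache a DataFrame for the duration of a block, then release it."""
    cached_df = df.cache()  # lazy: data is materialized by the first action
    try:
        yield cached_df
    finally:
        cached_df.unpersist()  # runs even if an action inside the block fails


# Usage in a notebook:
# with cached(spark.read.parquet("Files/large.parquet")) as df:
#     total = df.count()      # materializes the cache
#     first_row = df.first()  # served from the cache
```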

Plan your actions smartly. If you only need the first row, calling first() alone suffices — Spark can optimize knowing you only need a single element. In contrast, doing count() just to check if data exists and then first() is overkill. A pattern like:

if df.isEmpty():
    print("No data")
else:
    print(df.first())

avoids a full count of the dataset just to know whether it’s empty.

Avoid unnecessary collect() or large show(). It’s tempting to call df.show(1000) to see data, or df.collect() to pull everything into Python. But this can be extremely slow (and memory-intensive) for large datasets. Where possible, use df.limit(n).collect() to fetch a few records, or use Spark’s sampling (e.g., df.sample(0.01) for 1% of the data) if you need a representative subset to inspect. Always think about the volume of data being moved to the driver.

The key is to minimize the number of actions and the amount of data transferred to the driver. Combine transformations and run a single action at the end, when feasible. If you need to perform multiple actions, consider caching intermediate results.

7.5.3 Basic Performance Tuning

Some general Spark performance tips that you can apply in Fabric:

  • Partitioning: Ensure your data is partitioned in a way that is optimal for your operations. For example, if you know you’ll frequently filter by a date column, partition your table by that column when writing (see Chapter 5). This reduces data scanned in future reads.
  • Shuffle parallelism: The configuration spark.sql.shuffle.partitions (200 by default) controls the number of partitions after shuffle operations (joins, aggregations). For small datasets, decreasing it avoids unnecessary overhead; for huge datasets, increasing it can improve parallelism. We showed above how to adjust it with spark.conf.set. Alternatively, use df.repartition(n) or df.coalesce(n) for specific cases.
  • Broadcast joins: If one of your DataFrames is small enough to fit in memory on each executor, use a broadcast join hint to avoid a heavy shuffle: df.join(broadcast(small_df), ...), with broadcast imported from pyspark.sql.functions. Spark often does this automatically when it detects a table below the broadcast threshold, but you can control it explicitly when needed.
  • Persist to disk when needed: If caching in memory is not possible due to size, you can use .persist(StorageLevel.DISK_ONLY) or a similar level (StorageLevel is imported from pyspark) to avoid recomputation at the cost of writing to local disk. This is useful for a very expensive computation that you need to reuse but that doesn’t fit in RAM.

These topics can get very deep, but even as a beginner, understanding the lazy nature of Spark and avoiding common pitfalls (like redundant actions or excessive data collection) will go a long way.
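
To put rough numbers on the shuffle-parallelism bullet above, the arithmetic below turns an estimated shuffle size into a partition count. The 128 MiB target per partition is a rule-of-thumb assumption made for this sketch, not a value taken from this chapter or from Fabric guidance, and suggest_shuffle_partitions is a hypothetical helper name.

```python
import math


def suggest_shuffle_partitions(shuffle_bytes, target_partition_bytes=128 * 1024**2):
    """Estimate a shuffle partition count from data size.

    The default target of 128 MiB per partition is an assumed rule of thumb.
    """
    if shuffle_bytes < 0 or target_partition_bytes <= 0:
        raise ValueError("sizes must be non-negative and the target positive")
    return max(1, math.ceil(shuffle_bytes / target_partition_bytes))


print(suggest_shuffle_partitions(2 * 1024**3))    # 2 GiB of shuffle data   -> 16
print(suggest_shuffle_partitions(500 * 1024**3))  # 500 GiB of shuffle data -> 4000
```

Under that assumption, the default of 200 is generous for the 2 GiB case and low for the 500 GiB case. Treat the result as a starting point to measure against, not as a setting to apply blindly.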

7.5.4 Logging and Diagnostics

Logging is essential in data engineering. Good logs help you understand the flow of your pipeline and quickly pinpoint issues. In Spark, logging has some special considerations:

Use a logging framework, not prints. In production code, avoid using print() for status messages. Instead, use Python’s built-in logging module:

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("my_app")

logger.info("Starting transformation X")

This way, you can control log levels and formats easily. Printing — especially inside Spark transformations that run on executors — is hard to trace and can overwhelm the output. And remember: don’t flood your logs by logging every record; log high-level progress or summary statistics.

Log before and after important steps. A good practice is to log an informative message before a complex operation runs — for example, “About to join sales and customers on customer_id” right before the join. That way, if the job hangs or errors, you know which operation was last attempted. Similarly, log when an operation completes, perhaps including how long it took or how many records were produced. This creates a trace of the execution sequence in your logs.

Capture metadata and metrics. Logging can include row counts or schema information at key points. Be cautious though: df.count() purely for logging is an extra action that costs a full job. If you compute it anyway (for logic or validation), logging the result is free.

Don’t lose your logs. Ensure that your logs go somewhere persistent and visible even if the job fails or the cluster shuts down. In interactive notebooks, driver logs are captured and can be reviewed in the monitoring view after the session. For production runs, consider centralizing logs externally: Fabric provides the Apache Spark diagnostic emitter, which can send Spark logs, events, and metrics to destinations like Azure Event Hubs, Azure Storage, or Azure Log Analytics. With that enabled, your driver and executor logs are available for analysis and alerting long after the session ends — exactly what you need to debug an overnight pipeline failure. As a simpler approach, your job can also write important events to a Lakehouse table or file as a custom audit trail (as we sketched at the end of Chapter 5).

Use a consistent log format. For significant logging, configure a formatter such as [timestamp] [level] [module]: message. Structured logging (e.g., JSON) helps if logs will be parsed by tools.
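
As an illustration of structured logging, the sketch below emits each record as one JSON object with the timestamp, level, module, and message fields mentioned above. JsonFormatter and get_json_logger are hypothetical names and the field layout is an assumption; the handler check prevents duplicate output when a notebook cell that configures logging is re-run.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "module": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def get_json_logger(name, level=logging.INFO, stream=None):
    """Return a logger that writes JSON lines (to stderr unless a stream is given)."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler(stream)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.propagate = False  # keep records from also printing via the root logger
    return logger


logger = get_json_logger("my_app")
logger.info("Joined sales and customers: %d rows", 1250)
```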

Monitoring and alerts. If you run pipelines regularly, integrate with monitoring. With the diagnostic emitter sending logs and metrics to Log Analytics, you can query your Spark logs centrally and set up alerts on failures or anomalies. Within Fabric itself, the Monitor hub shows the status, duration, and logs of all your Spark applications — make checking it part of your operational routine.

In summary, logs are your lifeline when things go wrong. Add meaningful logs around key steps, include crucial variables in the messages (but not giant data dumps), and make sure they persist beyond the life of the session. Also, test your logging by inducing a small error or running in debug mode, to ensure you can see the messages as expected.

7.6 Chapter Summary

By engineering your PySpark code thoughtfully, you set yourself up for success in managing big data workflows:

  • We learned how Fabric environments encapsulate Spark runtimes, configurations, and libraries, enabling consistent setups across notebooks. Use them to manage dependencies and Spark settings cleanly.
  • You discovered ways to reuse code, from simply running a common notebook or script with %run, to creating and packaging your own libraries (environment resources, custom .whl libraries, addPyFile). This modular approach avoids duplication and errors.
  • We explored how to add external libraries — quickly with %pip for a session, or persistently via environment settings — so you can leverage the rich Python ecosystem in your Spark jobs.
  • We emphasized code organization and best practices, treating data pipelines with the same care as software projects: modularity, configuration management, Git integration and deployment pipelines for promoting code through dev/test/prod.
  • We saw how to test transformation functions with tiny in-memory DataFrames and assertDataFrameEqual, catching regressions before they reach your data.
  • For orchestration, you can call notebooks from notebooks with notebookutils.notebook.run(), pass parameters and exit values, and even run whole DAGs in parallel with runMultiple() — or step up to Fabric Data Pipelines for full scheduling. For mature batch workloads, Spark job definitions run plain Python applications without notebooks at all.
  • Finally, we delved into performance tuning and logging: understanding Spark’s lazy evaluation to avoid redundant work, caching wisely, timing your code to find bottlenecks, using the Spark UI and Monitor hub, and implementing robust logging — including Fabric’s diagnostic emitter for shipping logs and metrics to external systems.

With these techniques, you can write PySpark code in Microsoft Fabric that not only works, but is maintainable, efficient, and ready for production. As you move on to real-world projects, keep these principles in mind — they will save you time and headaches when your data grows or when something goes wrong. Happy coding and engineering!

7.7 Further Reading
