5 Chapter 5: Store Data

Note

Early draft release: This chapter hasn’t been fully edited yet.

In the previous chapter, we learned how to query data efficiently with PySpark. Now it’s time to look at the other side of the coin: persisting the results of your work. Whether you’re building an ETL pipeline, preparing a curated dataset for analysts, or simply checkpointing intermediate results, knowing how to store data correctly in Microsoft Fabric is an essential skill.

In this chapter, we’ll cover the storage options available in Microsoft Fabric, how to write data as files and as tables, how Delta Lake powers the Fabric Lakehouse, how to write to a Fabric Warehouse, and how to use notebookutils for file system operations. We’ll close with best practices for partitioning, optimization, and data validation.

Setup the environment

All code examples in this chapter assume that you have a “Sales” table, loaded from the sales_data.csv file, available in a Lakehouse attached to your notebook. Please refer to Chapter 1 for instructions on how to set up the environment and load the data.

5.1 Understanding Storage in Microsoft Fabric

Before writing any data, it helps to understand where the data actually goes. All Microsoft Fabric workloads store their data in OneLake, a single, unified, logical data lake for your whole organization. OneLake is built on top of Azure Data Lake Storage Gen2, which means every file you write is accessible through standard ADLS APIs and the abfss:// URI scheme.

Within a Lakehouse, your data lives in one of two areas:

Tables: the managed area of the Lakehouse. Everything stored here is registered in the Lakehouse catalog and stored in Delta Lake format. Tables are discoverable by other Fabric experiences: the SQL analytics endpoint, Power BI semantic models (including Direct Lake mode), and other notebooks.
Files: the unmanaged area. This is a general-purpose file storage where you can put anything: CSV exports, JSON documents, images, raw landing data, and so on. Files placed here are not automatically registered as tables.

A common pattern — often called the medallion architecture — is to land raw data in the Files area (bronze), then progressively clean and shape it into Delta Tables (silver and gold) for consumption.

When referencing locations from a notebook with a default Lakehouse attached, you can use relative paths:

# Relative paths, resolved against the default Lakehouse
df = spark.read.format("csv").option("header", "true").load("Files/raw/sales_data.csv")

# Absolute OneLake path (works without a default Lakehouse)
df = spark.read.format("csv").option("header", "true").load(
    "abfss://<workspace_name>@onelake.dfs.fabric.microsoft.com/<lakehouse_name>.Lakehouse/Files/raw/sales_data.csv"
)

Relative paths (Files/... and Tables/...) are convenient for interactive work; absolute abfss:// paths are more explicit and are required when accessing a Lakehouse that isn’t attached to the notebook.

5.2 Write Modes

Every write operation in PySpark accepts a mode that defines what happens when the destination already exists. There are four modes, and choosing the right one matters for both correctness and safety:

Mode	Behavior when the destination exists
`error` / `errorifexists` (default)	Fails with an error. The safest option.
`ignore`	Silently skips the write. No data is changed.
`append`	Adds the new rows to the existing data.
`overwrite`	Replaces the existing data entirely.

# The default mode fails if the destination already exists
sales_df.write.format("parquet").save("Files/output/sales")

# Append new rows to existing data
sales_df.write.format("parquet").mode("append").save("Files/output/sales")

# Replace the destination completely
sales_df.write.format("parquet").mode("overwrite").save("Files/output/sales")

Note that we specify the format explicitly. If you call save() without a format(), Spark falls back to the session default (spark.sql.sources.default), which may not be what you expect — being explicit avoids surprises about what actually lands on disk.

Warning

overwrite deletes the existing data before writing. There is no undo for plain file formats like Parquet or CSV. As we’ll see later in this chapter, this is one of the reasons Delta Lake — which keeps a transaction history — is the preferred format in Fabric.

5.3 Writing Files to the Lakehouse

5.3.1 Choosing a File Format

PySpark can write to many formats. In Fabric, the most common choices are:

Delta (the default and recommended format): a Parquet-based format with ACID transactions, schema enforcement, and time travel. Use it for all curated tables.
Parquet: a columnar, compressed format. Excellent for analytical workloads and data exchange with non-Delta systems.
CSV: a text format. Use it only for interoperability with external tools or for small exports; it has no schema, no compression by default, and poor read performance.
JSON: useful for semi-structured data exchange, with the same performance caveats as CSV.

5.3.2 Writing Parquet Files

# Write the DataFrame as Parquet files
sales_df.write.mode("overwrite").parquet("Files/output/sales_parquet")

# Equivalent, using the generic API
sales_df.write.format("parquet").mode("overwrite").save("Files/output/sales_parquet")

# Control the compression codec (snappy is the default)
sales_df.write.mode("overwrite") \
    .option("compression", "snappy") \
    .parquet("Files/output/sales_parquet")

Note that the path you provide is a directory, not a single file. Spark is a distributed engine: each task writes its own file, so the output is a folder containing one or more part-*.parquet files. This surprises many newcomers — if you need a single output file, we’ll see how to control that later in this chapter.

5.3.3 Writing CSV and JSON Files

# CSV with a header row and a custom delimiter
sales_df.write.mode("overwrite") \
    .option("header", "true") \
    .option("delimiter", ";") \
    .csv("Files/output/sales_csv")

# JSON (one JSON object per line)
sales_df.write.mode("overwrite").json("Files/output/sales_json")

Useful CSV options include header (write column names), delimiter, quote, escape, encoding (for non-UTF-8 destinations), and compression (e.g. gzip). For JSON, each row is written as a single JSON document per line (the JSON Lines convention), which is what most big-data tools expect.

5.3.4 Writing to Multiple Destinations

Sometimes you need the same data in more than one format — for example, a Delta table for analytics and a CSV export for a partner system. Simply chain multiple writes, and consider caching the DataFrame first so the source isn’t recomputed for each write:

sales_df.cache()

# Curated Delta table for analytics
sales_df.write.format("delta").mode("overwrite").save("Tables/sales_curated")

# CSV export for an external partner
sales_df.write.mode("overwrite").option("header", "true").csv("Files/exports/sales_csv")

sales_df.unpersist()

5.4 Saving Data as Lakehouse Tables

Writing files is flexible, but writing tables is what makes your data discoverable across Fabric. A table saved to the Lakehouse is immediately visible in the Lakehouse explorer, queryable through the SQL analytics endpoint, and usable from Power BI in Direct Lake mode.

5.4.1 Managed Tables with saveAsTable()

The simplest way to create a table is saveAsTable(). In Fabric, tables are written in Delta format by default — you don’t even need to specify it:

# Save as a managed table in the default Lakehouse
sales_df.write.mode("overwrite").saveAsTable("sales_curated")

# Reference a specific Lakehouse explicitly
sales_df.write.mode("overwrite").saveAsTable("MyLakehouse.sales_curated")

A managed table means the Lakehouse manages both the metadata and the data files (stored under the Tables/ area). If you drop the table, the underlying data files are deleted too.

You can verify the result immediately:

# List the tables in the default Lakehouse
spark.catalog.listTables()

# Read the table back
df_check = spark.read.table("sales_curated")
display(df_check.limit(5))

5.4.2 External Tables

An external table keeps its data at a location you choose (for example, a folder under Files/ or even another OneLake location via a shortcut), while still registering the table in the catalog. You create one by specifying a path — use an absolute abfss:// URI so there is no ambiguity about where the data lands:

sales_df.write.mode("overwrite") \
    .option(
        "path",
        "abfss://<workspace_name>@onelake.dfs.fabric.microsoft.com/"
        "<lakehouse_name>.Lakehouse/Files/external/sales_external"
    ) \
    .saveAsTable("sales_external")

With an external table, dropping the table removes only the catalog entry — the data files remain. External tables are useful when the data lifecycle is managed outside the Lakehouse, or when several tables or systems need to share the same underlying files. For most scenarios within Fabric, though, managed tables are simpler and integrate better with the rest of the platform.

5.4.3 Schema Enforcement and Evolution

Delta tables enforce their schema: an append with mismatched columns fails instead of silently corrupting your table. When the incoming schema legitimately changes — say a new column was added upstream — you can opt in to schema evolution:

# Allow new columns to be added to the table schema during append
new_sales_df.write.mode("append") \
    .option("mergeSchema", "true") \
    .saveAsTable("sales_curated")

# Replace the schema entirely during an overwrite
new_sales_df.write.mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("sales_curated")

Use these options deliberately, not as a default: schema enforcement is a feature that protects downstream consumers of your tables.

5.5 Working with Delta Lake

Since every Lakehouse table is a Delta table, it’s worth understanding what Delta Lake gives you beyond plain Parquet:

ACID transactions: writes either fully succeed or fully fail, even with concurrent readers and writers. A reader never sees a half-written table.
Transaction log: every change to the table is recorded in the _delta_log folder, which enables auditing and time travel.
Time travel: query the table as it existed at a previous version or timestamp.
Upserts and deletes: MERGE, UPDATE, and DELETE operations, which plain Parquet cannot do.

5.5.1 Inspecting Table History

%%sql
DESCRIBE HISTORY sales_curated

This returns one row per operation (writes, merges, optimizations…) with the version number, timestamp, user, and operation details.

5.5.2 Time Travel

# Read a previous version of the table by version number
df_v1 = spark.read.format("delta") \
    .option("versionAsOf", 1) \
    .table("sales_curated")

# Or by timestamp
df_yesterday = spark.read.format("delta") \
    .option("timestampAsOf", "2025-06-01") \
    .table("sales_curated")

Time travel is invaluable for debugging (“what did this table look like before last night’s load?”), for auditing, and for recovering from accidental overwrites — remember the warning about mode("overwrite") earlier? With Delta, the previous version is still there.

5.5.3 Upserts with MERGE

A very common pattern in data engineering is the upsert: update rows that already exist, insert the ones that don’t. Delta Lake supports this through the MERGE operation:

from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "sales_curated")

target.alias("t").merge(
    updates_df.alias("u"),
    "t.sale_id = u.sale_id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

The same operation is available in Spark SQL with the MERGE INTO statement, which you may find more readable for complex matching logic:

%%sql
MERGE INTO sales_curated AS t
USING sales_updates AS u
  ON t.sale_id = u.sale_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

5.5.4 Table Maintenance: OPTIMIZE and VACUUM

Over time — especially with frequent small appends or streaming writes — a Delta table accumulates many small files, which slows down reads. Two maintenance commands keep tables healthy:

%%sql
-- Compact small files into larger ones
OPTIMIZE sales_curated;

-- Optionally co-locate data on frequently filtered columns
OPTIMIZE sales_curated ZORDER BY (store_location);

-- Remove data files no longer referenced by the table
-- (default retention: files older than 7 days)
VACUUM sales_curated;

OPTIMIZE rewrites small files into fewer, larger files (targeting around 1 GB per file), dramatically improving scan performance.
ZORDER BY clusters related data together in the same files, so queries filtering on those columns skip more files.
VACUUM physically deletes files that are no longer referenced by recent table versions. Be aware that vacuuming reduces how far back you can time travel — by default, only files older than the 7-day retention period are removed.

You can also trigger maintenance from the Fabric UI: in the Lakehouse explorer, the contextual menu of each table offers a Maintenance command that runs OPTIMIZE and VACUUM for you.

V-Order: a Fabric-specific optimization

Microsoft Fabric supports V-Order, a write-time optimization applied to Parquet files that enables very fast reads from Power BI (Direct Lake mode) and the SQL analytics endpoint. V-Order applies special sorting and compression to the Parquet files while keeping them fully compatible with any Parquet reader.

In recent Fabric runtimes (1.3 and later), V-Order is controlled by the session setting spark.sql.parquet.vorder.default and is disabled by default in newly created workspaces, because it adds some write overhead. If your tables are read-heavy — typically anything consumed by Power BI — it’s usually worth enabling:

spark.conf.set("spark.sql.parquet.vorder.default", "true")

You can also enable it per table with the table property delta.parquet.vorder.default, or apply it during OPTIMIZE ... VORDER.

5.6 Partitioning Your Data

Partitioning splits the stored data into subfolders based on the values of one or more columns. When queries filter on the partition column, Spark reads only the matching folders (partition pruning), which can drastically reduce I/O.

# Partition by a single column
sales_df.write.mode("overwrite") \
    .partitionBy("store_location") \
    .saveAsTable("sales_by_location")

# Partition by multiple columns (creates nested folders: year=.../month=...)
sales_df.write.mode("overwrite") \
    .partitionBy("year", "month") \
    .save("Files/output/sales_partitioned")

On disk, this produces a folder structure like:

Tables/sales_by_location/store_location=Houston/part-....parquet
Tables/sales_by_location/store_location=Seattle/part-....parquet
...

5.6.1 Partitioning Best Practices

Partitioning is a double-edged sword. Done well, it speeds up reads; done poorly, it creates the small files problem and slows everything down.

Choose low-cardinality columns that are frequently used in filters: dates (at day or month granularity), regions, departments. Never partition by a high-cardinality column like a customer ID.
Aim for partitions of at least a few hundred megabytes. A useful rule of thumb: don’t partition tables smaller than a few gigabytes at all — Delta’s file statistics and OPTIMIZE already give you good performance.
Avoid over-partitioning. Partitioning by year, month, day, hour on a modest dataset creates thousands of tiny files. Each file has fixed overhead, and listing and opening many small files often costs more than scanning fewer large ones.
Match your query patterns. Partitioning only helps queries that filter on the partition columns.

5.6.2 Controlling the Number of Output Files

Independently of partitioning, you can control how many files Spark writes by changing the number of in-memory partitions before the write:

# Reduce the number of output files (no shuffle, fast)
sales_df.coalesce(1).write.mode("overwrite") \
    .option("header", "true") \
    .csv("Files/exports/sales_single_file")

# Redistribute into a specific number of evenly-sized files (full shuffle)
sales_df.repartition(8).write.mode("overwrite").parquet("Files/output/sales_8_files")

Use coalesce(n) to reduce the file count cheaply (it avoids a shuffle), and repartition(n) when you need evenly sized output or want to increase parallelism. Be careful with coalesce(1): it funnels all the data through a single task, which defeats Spark’s parallelism — only use it for small exports.

5.7 Writing to a Fabric Warehouse

The Lakehouse is the natural home for Spark workloads, but some teams serve their curated data through a Fabric Warehouse, which offers full T-SQL capabilities (multi-table transactions, T-SQL DML, granular SQL security). Fabric provides a built-in Spark connector for Data Warehouse that lets PySpark read and write Warehouse tables directly, using your Microsoft Entra identity — no connection strings or passwords required.

import com.microsoft.spark.fabric
from com.microsoft.spark.fabric.Constants import Constants

# Write the DataFrame to a Warehouse table
sales_df.write.mode("overwrite") \
    .synapsesql("MyWarehouse.dbo.SalesCurated")

# Append to an existing table
new_sales_df.write.mode("append") \
    .synapsesql("MyWarehouse.dbo.SalesCurated")

# Read a Warehouse table back into Spark
wh_df = spark.read.synapsesql("MyWarehouse.dbo.SalesCurated")

A few things to know about the connector:

The table name uses three parts: <warehouse name>.<schema name>.<table name>.
The supported save modes are errorifexists (default), ignore, append, and overwrite.
Under the hood, the write happens in two phases: the connector first stages the data in intermediate storage, then ingests it into the Warehouse table with a COPY INTO statement — an efficient bulk-load path.
You can also read from the SQL analytics endpoint of a Lakehouse with the same syntax, but that endpoint is read-only: writes only target Warehouse tables.
To access a Warehouse in a different workspace, add .option(Constants.WorkspaceId, "<workspace id>") to the read or write.

Lakehouse or Warehouse?

A frequent question: when should data land in a Warehouse instead of a Lakehouse? As a rule of thumb, keep your Spark-produced data in the Lakehouse — it’s the native target, and the SQL analytics endpoint already gives analysts read access through T-SQL. Use the Warehouse when consumers need to write with T-SQL, need multi-table transactions, or when your organization’s serving layer is standardized on the Warehouse. Remember that both store their data in OneLake in Delta format — moving between them is cheap.

5.8 File Operations with notebookutils

Beyond DataFrame writes, you’ll often need plain file system operations: copying files, creating folders, checking whether a path exists. Fabric notebooks include NotebookUtils (formerly known as mssparkutils), a built-in utility package for these tasks.

Note

mssparkutils has been renamed to notebookutils. The old mssparkutils namespace still works for backward compatibility but will eventually be retired, and new features only land in notebookutils — use notebookutils in new code.

import notebookutils

# List the contents of a directory
files = notebookutils.fs.ls("Files/raw")
for f in files:
    print(f.name, f.size)

# Check whether a path exists
if notebookutils.fs.exists("Files/raw/sales_data.csv"):
    print("Source file found")

# Create a directory (including intermediate directories)
notebookutils.fs.mkdirs("Files/staging/2025")

# Copy a file
notebookutils.fs.cp("Files/raw/sales_data.csv", "Files/staging/2025/sales_data.csv")

# Copy a whole directory tree — fastcp is optimized for large copies
notebookutils.fs.fastcp("Files/raw/", "Files/backup/raw/", True)

# Move (or rename) a file
notebookutils.fs.mv("Files/staging/2025/sales_data.csv", "Files/archive/sales_data.csv")

# Write a small text file (overwrite if it exists)
notebookutils.fs.put("Files/logs/run_status.txt", "Load completed", True)

# Append to an existing text file
notebookutils.fs.append("Files/logs/run_status.txt", "\n100 rows written", True)

# Delete a file or directory
notebookutils.fs.rm("Files/staging", True)  # True = recursive

A few practical notes:

notebookutils.fs works against OneLake paths (relative Files/... paths or absolute abfss:// URIs), so you can use it across Lakehouses and even across workspaces you have access to.
For copying large directories, prefer fastcp over cp — it parallelizes the copy.
put and append are designed for small text content like logs, markers, or configuration files — not for writing datasets. Use DataFrame writes for data.
Run notebookutils.fs.help() in a notebook cell to see the full list of available methods.

5.9 Validating Data Before You Store It

A pipeline is only as trustworthy as the data it produces. Before persisting data — especially with destructive modes like overwrite — it’s good practice to run a few validation checks:

from pyspark.sql.functions import col

def validate_sales(df) -> list[str]:
    """Return a list of validation errors; empty list means the data is valid."""
    errors = []

    if df.isEmpty():
        errors.append("DataFrame is empty")

    # Key columns must not contain nulls
    null_keys = df.filter(col("sale_id").isNull()).count()
    if null_keys > 0:
        errors.append(f"{null_keys} rows have a null sale_id")

    # Business rule: sale amounts must be positive
    negative_amounts = df.filter(col("sale_amount") <= 0).count()
    if negative_amounts > 0:
        errors.append(f"{negative_amounts} rows have a non-positive sale_amount")

    # No duplicate keys
    if df.count() != df.select("sale_id").distinct().count():
        errors.append("Duplicate sale_id values detected")

    return errors

errors = validate_sales(sales_df)
if errors:
    raise ValueError("Validation failed: " + "; ".join(errors))

sales_df.write.mode("overwrite").saveAsTable("sales_curated")
print(f"Wrote {sales_df.count()} rows to sales_curated")

This pattern — validate, then write, then log — costs a few extra Spark actions but saves hours of debugging downstream. Each validation count() triggers a job, so for large datasets consider caching the DataFrame during validation, or combining several checks into a single aggregation. For more sophisticated needs (rule catalogs, data quality reports), look at dedicated libraries such as Great Expectations, which can be installed in your Fabric environment.

Note also that writing to a Delta table gives you a free safety net: even if bad data slips through, DESCRIBE HISTORY and time travel let you identify and roll back the offending write.

5.10 Putting It All Together: a Small ETL Pipeline

Let’s combine what we’ve covered into a compact, realistic example: read raw CSV files, clean them, validate, and store the result as an optimized Delta table, with a small log file recording the run.

from pyspark.sql.functions import col, to_date, year, month
import notebookutils

# 1. Extract: read the raw landing files
raw_df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("Files/raw/sales/*.csv")

# 2. Transform: clean and shape
clean_df = (
    raw_df
    .dropDuplicates(["sale_id"])
    .filter(col("sale_amount") > 0)
    .withColumn("date", to_date(col("date")))
    .withColumn("year", year(col("date")))
    .withColumn("month", month(col("date")))
)

# 3. Validate before writing
errors = validate_sales(clean_df)
if errors:
    raise ValueError("Validation failed: " + "; ".join(errors))

# 4. Load: write a partitioned, managed Delta table
clean_df.write.mode("overwrite") \
    .partitionBy("year") \
    .saveAsTable("sales_curated")

# 5. Maintain and log
spark.sql("OPTIMIZE sales_curated")
row_count = spark.read.table("sales_curated").count()
notebookutils.fs.put(
    "Files/logs/sales_curated_last_run.txt",
    f"OK - {row_count} rows",
    True
)

In a production setting you would schedule this notebook (or a Spark job definition) from a Fabric data pipeline, add error handling and alerting, and route logs to a monitoring destination — topics we’ll touch on in Chapter 7.

5.11 Conclusion

In this chapter, we explored the full range of options for storing data with PySpark in Microsoft Fabric:

All Fabric data lives in OneLake; within a Lakehouse, the Tables area holds managed Delta tables while the Files area holds arbitrary files.
PySpark writes files in Parquet, CSV, JSON, and — most importantly — Delta, the default format for Lakehouse tables, controlled by the four write modes (error, ignore, append, overwrite).
Delta Lake brings ACID transactions, schema enforcement, time travel, MERGE upserts, and maintenance commands (OPTIMIZE, VACUUM) to your tables — plus the Fabric-specific V-Order optimization for read-heavy workloads.
Partitioning accelerates queries that filter on partition columns, but over-partitioning creates the small files problem; coalesce() and repartition() give you control over output file counts.
The Spark connector for Data Warehouse (synapsesql) writes Spark DataFrames to Warehouse tables using your Microsoft Entra identity.
notebookutils.fs covers file system operations (copy, move, list, small text writes) that fall outside the DataFrame API.
Validating before writing — and logging what you wrote — turns a script into a trustworthy pipeline.

With your data ingested, shaped, queried, and now reliably stored, the next step is to make it speak. In the next chapter, we’ll explore how to visualize data in Microsoft Fabric, from quick charts in notebooks to full Power BI integration.