- Read and write CSV and JSON files using both the standard library and pandas
- Create, inspect, and manipulate pandas DataFrames using selection, filtering, and groupby operations
- Produce line, bar, scatter, and histogram plots using matplotlib
- Use NumPy arrays to perform vectorised arithmetic and explain why it is faster than Python loops
- Determine when a project's data needs have outgrown flat files and implement basic SQLite queries
Most programs are, at their core, data plumbing. They read data from somewhere, transform it, and write it somewhere else. A web scraper pulls HTML and extracts structured records. A scientific analysis loads measurements, computes statistics, and produces charts. A business report reads a spreadsheet, aggregates numbers, and generates a summary. Python has become the dominant language for data work not because it is the fastest language — it is not — but because its ecosystem of libraries makes the journey from raw data to useful insight remarkably short.
Reading CSV Files
CSV (comma-separated values) is the lingua franca of tabular data. Python's standard library includes a csv module:
import csv
with open("temperatures.csv") as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(f"{row['city']}: {row['temperature']}°C")
csv.DictReader maps each row to a dictionary using the header row as keys. This works well for simple tasks. But the moment you need to filter rows, compute averages, join two files, or handle missing values, the csv module starts to feel like building a house with a screwdriver. You need a power tool.
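The standard library's json module plays the same role for JSON data. A minimal sketch, parsing a JSON string containing the same kind of hypothetical temperature records (json.load and json.dump do the same for file objects):

```python
import json

# A JSON array of objects, as it might arrive from a file or an API
text = '[{"city": "London", "temperature": 24.5}, {"city": "Paris", "temperature": 27.1}]'

# json.loads turns JSON arrays/objects into Python lists/dicts
records = json.loads(text)

for rec in records:
    print(f"{rec['city']}: {rec['temperature']}°C")
```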
pandas: The Data Analysis Library
pandas is Python's most important data analysis library. Install it with pip install pandas. Its core data structure is the DataFrame — a two-dimensional table with labelled rows and columns, like a spreadsheet that understands data types:
import pandas as pd
# Read a CSV file into a DataFrame
df = pd.read_csv("temperatures.csv")
print(df.head()) # first 5 rows
print(df.shape) # (rows, columns)
print(df.dtypes) # data type of each column
print(df.describe()) # summary statistics
pandas can read CSV, Excel, JSON, SQL databases, and dozens of other formats. The read_csv function handles headers, data types, missing values, and encoding automatically. A single line replaces dozens of lines of manual parsing.
A Series is a single column of a DataFrame — a one-dimensional labelled array:
temps = df["temperature"] # a Series
print(temps.mean()) # average
print(temps.max()) # maximum
print(temps.min()) # minimum
Selecting and Filtering Data
Selecting columns uses bracket notation; dot access (df.city) also works, but only when the column name is a valid Python identifier and does not shadow a DataFrame method:
# Single column (returns a Series)
cities = df["city"]
# Multiple columns (returns a DataFrame)
subset = df[["city", "temperature"]]
Filtering uses boolean indexing — you write a condition that produces True or False for each row, then use it to select rows:
# Rows where temperature exceeds 30
hot = df[df["temperature"] > 30]
# Multiple conditions (use & for and, | for or, ~ for not)
hot_humid = df[(df["temperature"] > 30) & (df["humidity"] > 80)]
# String filtering
london = df[df["city"] == "London"]
This syntax looks odd at first — df[df["temperature"] > 30] reads like a stutter. But it is consistent and powerful. The inner expression df["temperature"] > 30 produces a Series of booleans, and the outer brackets use that Series to filter rows. Once you internalise this pattern, you can slice data with remarkable precision.
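Two other filtering idioms come up constantly: isin for membership tests and the .str accessor for string conditions. A sketch using a small hypothetical DataFrame in place of temperatures.csv:

```python
import pandas as pd

# Hypothetical sample data standing in for the CSV used in the text
df = pd.DataFrame({
    "city": ["London", "Paris", "Cairo", "Oslo"],
    "temperature": [24.5, 31.0, 38.2, 12.1],
})

# isin selects rows whose value appears in the given collection
europe = df[df["city"].isin(["London", "Paris", "Oslo"])]

# .str methods apply string operations element-wise, yielding a boolean Series
l_cities = df[df["city"].str.startswith("L")]
```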
Grouping and Aggregation
groupby splits a DataFrame into groups based on a column, then applies a function to each group:
# Average temperature per country
df.groupby("country")["temperature"].mean()
# Multiple aggregations at once
df.groupby("country")["temperature"].agg(["mean", "min", "max", "count"])
# Group by multiple columns
df.groupby(["country", "season"])["temperature"].mean()
This is the pandas equivalent of a SQL GROUP BY. It is one of the most frequently used operations in data analysis — any time you want to answer "what is the average X per Y?", groupby is the answer.
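A runnable sketch of the pattern, using a small hypothetical DataFrame in place of the CSV:

```python
import pandas as pd

# Hypothetical readings standing in for temperatures.csv
df = pd.DataFrame({
    "country": ["UK", "UK", "France", "France"],
    "temperature": [14.0, 18.0, 21.0, 25.0],
})

# Average temperature per country: a Series indexed by country
avg = df.groupby("country")["temperature"].mean()

# sort_values orders the result; reset_index turns it back into a DataFrame
top = avg.sort_values(ascending=False).reset_index()
```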
You can also create new columns from existing ones:
# Fahrenheit conversion
df["temp_f"] = df["temperature"] * 9/5 + 32
# Categorical binning
df["category"] = pd.cut(df["temperature"], bins=[0, 15, 25, 40],
                        labels=["cold", "mild", "hot"])
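To see pd.cut's behaviour concretely: the bins are right-inclusive by default, so a value of exactly 15 falls into the (0, 15] "cold" bin. A self-contained sketch:

```python
import pandas as pd

temps = pd.Series([5.0, 15.0, 18.0, 31.0])

# Bins are (0, 15], (15, 25], (25, 40] with right=True (the default)
category = pd.cut(temps, bins=[0, 15, 25, 40],
                  labels=["cold", "mild", "hot"])
```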
Visualisation with matplotlib
Numbers in tables are hard to interpret. A chart reveals patterns instantly. matplotlib is Python's foundational plotting library:
import matplotlib.pyplot as plt
# Line plot
plt.figure(figsize=(10, 6))
plt.plot(df["date"], df["temperature"])
plt.xlabel("Date")
plt.ylabel("Temperature (°C)")
plt.title("Daily Temperature")
plt.tight_layout()
plt.savefig("temperature.png")
plt.show()
The four essential plot types cover most needs:
# Bar chart
df.groupby("city")["temperature"].mean().plot(kind="bar")
# Scatter plot
plt.scatter(df["humidity"], df["temperature"])
plt.xlabel("Humidity (%)")
plt.ylabel("Temperature (°C)")
# Histogram
df["temperature"].hist(bins=20)
plt.xlabel("Temperature (°C)")
plt.ylabel("Frequency")
matplotlib's API is verbose but flexible — you can customise every element of a plot. seaborn (built on matplotlib) provides a higher-level interface with better defaults and statistical plots. plotly creates interactive, web-based charts. For quick exploration, pandas has built-in plotting: df.plot(), df.plot.bar(), df.plot.scatter(x="col1", y="col2") — all of which call matplotlib under the hood.
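For side-by-side charts, plt.subplots returns a figure and a grid of axes you can draw on individually. A minimal sketch, using the non-interactive Agg backend and an in-memory buffer so it renders without a display (the sample numbers are made up):

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen; no display required
import matplotlib.pyplot as plt

# One row, two panels
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].plot([1, 2, 3], [24.5, 26.0, 23.1])
axes[0].set_title("Line")

axes[1].hist([24.5, 26.0, 23.1, 25.2, 24.8], bins=3)
axes[1].set_title("Histogram")

fig.tight_layout()

# savefig accepts any file-like object, not just a filename
buf = io.BytesIO()
fig.savefig(buf, format="png")
```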
NumPy: The Numerical Foundation
NumPy is the library beneath pandas, matplotlib, scikit-learn, and most of Python's scientific stack. Its core object is the ndarray — a multi-dimensional array of numbers stored in contiguous memory:
import numpy as np
a = np.array([1, 2, 3, 4, 5])
print(a * 2)     # Output: [ 2  4  6  8 10]
print(a + 10) # Output: [11 12 13 14 15]
print(a.mean()) # Output: 3.0
print(a.std()) # Output: 1.4142135623730951
The key concept is vectorised operations. When you write a * 2, NumPy multiplies every element in the array by 2 in a single optimised operation — no Python loop, no per-element overhead. This is why NumPy is fast: the loop happens in compiled C code, not in Python.
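A quick way to see the difference is to time both approaches on the same array. The exact numbers depend on your machine, but the vectorised version is typically orders of magnitude faster:

```python
import time

import numpy as np

a = np.arange(1_000_000, dtype=np.float64)

# Pure-Python loop: one interpreter iteration per element
start = time.perf_counter()
looped = [x * 2 for x in a]
loop_time = time.perf_counter() - start

# Vectorised: a single call, with the loop running in compiled C
start = time.perf_counter()
vectorised = a * 2
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.4f}s, vectorised: {vec_time:.4f}s")
```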
Broadcasting lets NumPy operate on arrays of different shapes:
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
row = np.array([10, 20, 30])
print(matrix + row)
# Output: [[11 22 33]
#          [14 25 36]]
NumPy "broadcasts" the smaller array across the larger one, adding element-wise without explicit loops. This works whenever the shapes are compatible: comparing dimensions from the trailing end, each pair must either be equal or have one of them be 1.
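The same rule lets a column vector broadcast across columns. A sketch with a (2, 1) array, which stretches along the size-1 dimension to match the (2, 3) matrix:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

# Shape (2, 1): one value per row, broadcast across the three columns
col = np.array([[100], [200]])

result = matrix + col
print(result)
```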
Creating arrays is flexible:
zeros = np.zeros((3, 4)) # 3x4 array of zeros
ones = np.ones((2, 3)) # 2x3 array of ones
sequence = np.arange(0, 10, 0.5) # 0.0, 0.5, 1.0, ..., 9.5
linear = np.linspace(0, 1, 5) # 5 evenly spaced points from 0 to 1
random = np.random.randn(1000) # 1000 standard normal samples
The rule of thumb: if you find yourself writing a for loop over numbers, there is probably a NumPy way to do it that is both faster and more readable.
When to Reach for a Database
CSV files and DataFrames work beautifully for datasets that fit in memory. But some problems need a different tool. If your data exceeds available RAM, if multiple processes need to read and write simultaneously, or if you need complex queries that join multiple tables, you need a database.
Python's standard library includes sqlite3 — a full SQL database engine that stores everything in a single file:
import sqlite3
conn = sqlite3.connect("weather.db")
cursor = conn.cursor()
# Create a table
cursor.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        city TEXT,
        date TEXT,
        temperature REAL
    )
""")
# Insert data
cursor.execute(
    "INSERT INTO readings VALUES (?, ?, ?)",
    ("London", "2024-07-15", 24.5)
)
conn.commit()
# Query data
cursor.execute("SELECT city, AVG(temperature) FROM readings GROUP BY city")
for row in cursor.fetchall():
    print(f"{row[0]}: {row[1]:.1f}°C")
conn.close()
SQLite requires no server, no configuration, and no separate installation. It is ideal for local analysis, prototyping, and applications with moderate data volumes. For production systems with heavy concurrent access, you would graduate to PostgreSQL or MySQL — but the SQL knowledge transfers directly.
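For bulk loading, executemany runs one parameterised statement per tuple. A self-contained sketch against an in-memory database, so it leaves no file behind (the rows are made up):

```python
import sqlite3

# ":memory:" creates a throwaway database that lives only in RAM
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE readings (city TEXT, date TEXT, temperature REAL)")

rows = [
    ("London", "2024-07-15", 24.5),
    ("London", "2024-07-16", 26.0),
    ("Paris", "2024-07-15", 28.3),
]
# One parameterised INSERT per tuple; placeholders prevent SQL injection
cursor.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
conn.commit()

cursor.execute(
    "SELECT city, AVG(temperature) FROM readings GROUP BY city ORDER BY city"
)
results = cursor.fetchall()
conn.close()
```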
pandas can read from and write to SQL databases in one line:
df = pd.read_sql("SELECT * FROM readings WHERE city = 'London'", conn)
df.to_sql("summary", conn, if_exists="replace")
The Data Analysis Workflow
Real data analysis follows a consistent pattern. You load the data, inspect it, clean it, analyse it, and present the results. The tools map neatly to each stage:
- Load with pd.read_csv() or pd.read_sql().
- Inspect with df.head(), df.info(), df.describe().
- Clean with df.dropna(), df.fillna(), df.rename(), df.astype().
- Analyse with df.groupby(), df.merge(), df.pivot_table().
- Visualise with df.plot() or matplotlib.
- Export with df.to_csv(), df.to_excel(), or df.to_sql().
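The whole pipeline fits in a few lines. A sketch with synthetic in-memory data standing in for pd.read_csv:

```python
import numpy as np
import pandas as pd

# Load: synthetic records stand in for pd.read_csv("readings.csv")
df = pd.DataFrame({
    "city": ["London", "London", "Paris", "Paris"],
    "temperature": [24.5, np.nan, 28.3, 30.1],
})

# Clean: drop rows with a missing temperature
clean = df.dropna(subset=["temperature"])

# Analyse: average temperature per city
summary = clean.groupby("city")["temperature"].mean().reset_index()

# Export would be summary.to_csv("summary.csv", index=False)
print(summary)
```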
The beauty of this ecosystem is that each tool is designed to work with the others. pandas DataFrames sit on top of NumPy arrays. matplotlib plots pandas data directly. SQLite integrates with both. You are not learning isolated tools — you are learning a coherent stack.
Data is the raw material of every useful program. Files, databases, APIs, sensors — the sources are different, but the workflow is the same: load, clean, transform, analyse, present. Python's data libraries do not make this work trivial — real data is always messier than you expect — but they reduce the mechanical effort so you can focus on the questions that actually matter.