Strings and Text Processing — Textbook of Python

Learning Objectives

Explain why strings are immutable sequences and apply indexing and slicing to extract substrings
Use common string methods to split, join, strip, replace, and search text
Format strings using f-strings and compare them to older formatting approaches
Write basic regular expressions using the re module to match, search, and replace patterns
Describe Unicode and UTF-8 encoding and apply encode/decode to handle text and bytes

Text is the universal interface. Configuration files, web pages, log messages, user input, API responses, source code itself — it is all text. If programming is the art of transforming data, then string manipulation is the craft you will practise most often. Python was designed with text in mind, and its string handling is among the best of any mainstream language — rich, readable, and deeply integrated into the language itself.

String Basics

A string in Python is an immutable sequence of Unicode characters. You can create one with single quotes, double quotes, or triple quotes:

name = 'Alice'
greeting = "Hello, world!"
paragraph = """This is a
multi-line string."""

Single and double quotes are interchangeable. The only difference is convenience: use double quotes if your string contains an apostrophe, single quotes if it contains a double quote, triple quotes for anything spanning multiple lines.

Strings are sequences, which means they support indexing and slicing just like lists:

word = "Python"
print(word[0])      # Output: P
print(word[-1])     # Output: n
print(word[1:4])    # Output: yth
print(word[::-1])   # Output: nohtyP

But strings are immutable — you cannot change a character in place:

word[0] = "J"    # TypeError: 'str' object does not support item assignment

To "change" a string, you create a new one. This is not as wasteful as it sounds; Python's memory management handles short-lived strings efficiently.
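
As a small sketch of what "creating a new one" can look like in practice (the variable name new_word is illustrative only), slicing and concatenation build a fresh string while the original stays untouched:

```python
word = "Python"

# Build a new string from a replacement character plus a slice of the old one.
new_word = "J" + word[1:]

print(new_word)  # Jython
print(word)      # Python (the original is unchanged)
```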

Common String Methods

Python strings come with dozens of methods. You do not need to memorise them all, but a core set appears in nearly every program.

split() breaks a string into a list of substrings:

sentence = "the quick brown fox"
words = sentence.split()
print(words)    # Output: ['the', 'quick', 'brown', 'fox']

csv_line = "Alice,30,London"
fields = csv_line.split(",")
print(fields)   # Output: ['Alice', '30', 'London']

join() is the inverse — it glues a list of strings together:

words = ["Hello", "world"]
print(" ".join(words))     # Output: Hello world
print("-".join(words))     # Output: Hello-world

Note that join() is called on the separator, not on the list. This feels backwards at first, but it is consistent: the separator is a string, and join() is a string method.
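
One consequence worth seeing in code: join() accepts only strings, so other values need converting first. The following is a minimal sketch with a made-up list of numbers:

```python
numbers = [1, 2, 3]

# join() raises TypeError on non-string items, so convert each one first.
joined = ",".join(str(n) for n in numbers)
print(joined)             # 1,2,3

# Splitting on the same separator recovers the pieces, now as strings.
print(joined.split(","))  # ['1', '2', '3']
```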

strip() removes leading and trailing whitespace (or specified characters):

raw = "   hello   \n"
print(raw.strip())       # Output: hello
print(repr(raw.lstrip()))      # Output: 'hello   \n'
print(repr(raw.rstrip()))      # Output: '   hello'
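
The "specified characters" form is easy to misread, so here is a brief sketch using invented sample strings. The argument is treated as a set of characters to remove from both ends, not as a whole prefix or suffix:

```python
tagged = "***urgent***"
print(tagged.strip("*"))    # urgent

# Any mix of 'x' and 'y' is stripped from each end until another character appears.
padded = "xyxhelloyx"
print(padded.strip("xy"))   # hello
```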

replace() substitutes one substring for another:

text = "Hello, world!"
print(text.replace("world", "Python"))    # Output: Hello, Python!
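
By default replace() substitutes every occurrence, and an optional third argument limits how many are replaced. A short sketch with an invented sentence:

```python
text = "one fish, two fish, red fish"

print(text.replace("fish", "cat"))     # one cat, two cat, red cat
print(text.replace("fish", "cat", 1))  # one cat, two fish, red fish

# Like every string method, replace() returns a new string.
print(text)                            # one fish, two fish, red fish
```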

find() returns the index of the first occurrence of a substring, or -1 if not found. startswith() and endswith() test the beginning and end:

filename = "report_2024.csv"
print(filename.find("2024"))          # Output: 7
print(filename.endswith(".csv"))      # Output: True
print(filename.startswith("report"))  # Output: True

Other methods you will reach for regularly include upper(), lower(), title(), count(), isdigit(), isalpha(), and zfill().
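
The methods just listed can be tried out quickly. The sample values below are invented for illustration:

```python
title = "the quick brown fox"

print(title.upper())      # THE QUICK BROWN FOX
print(title.title())      # The Quick Brown Fox
print("Hello".lower())    # hello
print(title.count("o"))   # 2
print("2024".isdigit())   # True
print("abc1".isalpha())   # False
print("42".zfill(5))      # 00042
```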

String Formatting with f-strings

Python has three main ways to format strings. The modern, preferred approach is the f-string (formatted string literal), introduced in Python 3.6:

name = "Alice"
age = 30
print(f"My name is {name} and I am {age} years old.")
# Output: My name is Alice and I am 30 years old.

Any valid Python expression can go inside the curly braces:

print(f"2 + 2 = {2 + 2}")                   # Output: 2 + 2 = 4
print(f"{'hello'.upper()}")                  # Output: HELLO
print(f"Pi is approximately {3.14159:.2f}")  # Output: Pi is approximately 3.14

The :.2f after the expression is a format specification: it formats the number as a float with two decimal places. Other common specs include a comma (:,) for thousands separators, :>10 for right-alignment in a field ten characters wide, and :% for percentages.
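
A quick sketch of these format specifications in action follows. The variable names and values are made up for the example:

```python
population = 1234567
share = 0.256
label = "total"

print(f"{population:,}")   # 1,234,567
print(f"[{label:>10}]")    # [     total]
print(f"{share:%}")        # 25.600000%
print(f"{share:.1%}")      # 25.6%
```

Note that :% on its own uses six decimal places, so it is usually combined with a precision such as .1.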

The older .format() method works similarly but is more verbose:

print("My name is {} and I am {} years old.".format(name, age))

The oldest approach, % formatting, uses C-style placeholders:

print("My name is %s and I am %d years old." % (name, age))

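Use f-strings for new code. They are typically faster, more readable, and more Pythonic. You will encounter .format() and % in older codebases, so recognising them is useful. In new work there is rarely a reason to prefer them; the main exceptions are templates that are defined separately from their values, such as the %-style messages passed to the logging module.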

Raw Strings and Escape Sequences

Backslashes in strings introduce escape sequences: \n is a newline, \t is a tab, \\ is a literal backslash. This causes problems when you write Windows paths or regular expressions:

# This does not do what you want
path = "C:\new_folder\test"
print(path)    # Output: C:<newline>ew_folder<tab>est

A raw string, prefixed with r, treats backslashes as literal characters:

path = r"C:\new_folder\test"
print(path)    # Output: C:\new_folder\test

Raw strings are essential when writing regular expressions, where backslashes are pervasive.
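
One way to convince yourself of the difference is to compare lengths. This is a minimal sketch; the variable names are arbitrary:

```python
normal = "\n"   # one character: a newline
raw = r"\n"     # two characters: a backslash followed by the letter n

print(len(normal))    # 1
print(len(raw))       # 2
print(raw == "\\n")   # True
```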

Multi-line Strings

Triple-quoted strings preserve line breaks and indentation:

poem = """Roses are red,
Violets are blue,
Python is lovely,
And so are you."""
print(poem)

They are commonly used for docstrings and any text that spans multiple lines. If you need a long string without embedded newlines, you can use implicit string concatenation — Python automatically joins adjacent string literals:

message = ("This is a very long message "
           "that spans multiple lines "
           "in the source code but is one string.")
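
To check the difference between the two approaches, the sketch below tests each string for embedded newlines. It reuses the message example and a shortened version of the poem:

```python
message = ("This is a very long message "
           "that spans multiple lines "
           "in the source code but is one string.")
poem = """Roses are red,
Violets are blue,"""

print("\n" in message)     # False
print("\n" in poem)        # True
print(poem.splitlines())   # ['Roses are red,', 'Violets are blue,']
```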

String Comparison and Sorting

Strings are compared lexicographically — character by character, using Unicode code points:

print("apple" < "banana")    # Output: True
print("Zoo" < "apple")       # Output: True (uppercase Z is 90, lowercase a is 97)

The surprise is that uppercase letters sort before lowercase letters because their Unicode values are lower. For case-insensitive sorting, pass a key function:

names = ["alice", "Bob", "Charlie"]
print(sorted(names))                          # Output: ['Bob', 'Charlie', 'alice']
print(sorted(names, key=str.lower))           # Output: ['alice', 'Bob', 'Charlie']

For locale-aware sorting (where "ä" sorts near "a" in German), you need the locale module — but for English text, str.lower as a sort key covers most cases.
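
The code points behind these comparisons can be inspected with ord(). The sketch below also uses str.casefold, a more aggressive relative of str.lower intended for caseless comparison; it is offered here as an alternative sort key, not as something the examples above require:

```python
print(ord("Z"), ord("a"))   # 90 97

names = ["alice", "Bob", "Charlie"]
print(sorted(names, key=str.casefold))   # ['alice', 'Bob', 'Charlie']

# casefold() also handles cases that lower() leaves alone, such as the German sharp s.
print("Straße".casefold() == "STRASSE".casefold())   # True
print("Straße".lower() == "STRASSE".lower())         # False
```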

Regular Expressions

When simple string methods are not enough, regular expressions provide a powerful pattern-matching language. Python's re module is the gateway:

import re

text = "My phone number is 020-7946-0958"
match = re.search(r"\d{3}-\d{4}-\d{4}", text)
if match:
    print(match.group())    # Output: 020-7946-0958

The pattern \d{3}-\d{4}-\d{4} means "three digits, a hyphen, four digits, a hyphen, four digits". The r prefix makes it a raw string, so \d is passed to the regex engine as-is.

Key re functions:

re.search(pattern, string) — finds the first match anywhere in the string.
re.match(pattern, string) — matches only at the beginning.
re.findall(pattern, string) — returns all non-overlapping matches as a list.
re.sub(pattern, replacement, string) — replaces all matches.

text = "Contact us at info@example.com or admin@example.com"
emails = re.findall(r"[\w.]+@[\w.]+", text)
print(emails)    # Output: ['info@example.com', 'admin@example.com']

cleaned = re.sub(r"\s+", " ", "too   many    spaces")
print(cleaned)   # Output: too many spaces
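
The difference between re.match and re.search is easiest to see side by side. The sample text below is invented for illustration:

```python
import re

text = "Order 66 shipped on 2024-05-01"

# match() only looks at the start of the string, which here is a letter.
print(re.match(r"\d+", text))            # None

# search() scans forward until it finds the first run of digits.
print(re.search(r"\d+", text).group())   # 66

print(re.findall(r"\d+", text))          # ['66', '2024', '05', '01']
print(re.sub(r"\d", "#", text))          # Order ## shipped on ####-##-##
```

Because both match() and search() return None when nothing matches, it is safest to test the result before calling .group() on it.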

Regular expressions are a deep subject; entire books have been written about them. For everyday Python work, knowing search, findall, and sub covers the large majority of use cases. When you find yourself writing increasingly convoluted patterns, consider whether a simple string method or a proper parser would serve you better.

Encoding and Unicode

Every string in Python 3 is a sequence of Unicode characters. Unicode assigns a unique number (a code point) to every character in every writing system — English letters, Chinese characters, emoji, mathematical symbols, and much more.

When strings need to be stored in files or sent over a network, they must be converted to bytes using an encoding. UTF-8 is the dominant encoding on the modern web and the default for encode() and decode() in Python 3:

text = "café"
encoded = text.encode("utf-8")
print(encoded)          # Output: b'caf\xc3\xa9'
print(type(encoded))    # Output: <class 'bytes'>

decoded = encoded.decode("utf-8")
print(decoded)          # Output: café

The b'...' prefix indicates a bytes object — a sequence of raw byte values, not characters. The character "é" takes two bytes in UTF-8 (\xc3\xa9), which is why the encoded form is longer than the original string.
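
The character count and the byte count can be compared directly. This short sketch reuses the "café" example and adds the euro sign as a further, illustrative case:

```python
text = "café"
encoded = text.encode("utf-8")

print(len(text))       # 4 (characters)
print(len(encoded))    # 5 (bytes)

# Iterating over bytes yields integers in the range 0-255.
print(list(encoded))   # [99, 97, 102, 195, 169]

# Characters further from ASCII can need more bytes.
print(len("€".encode("utf-8")))   # 3
```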

Most of the time, Python handles encoding transparently. But when you read binary files, communicate with external systems, or process text in multiple languages, understanding the distinction between str (text) and bytes (binary data) will save you from the dreaded UnicodeDecodeError.
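
To make that error concrete, the following sketch decodes UTF-8 bytes with the wrong codec. The choice of ASCII and Latin-1 as the "wrong" codecs is purely for illustration:

```python
data = "café".encode("utf-8")   # b'caf\xc3\xa9'

# ASCII has no meaning for byte values above 127, so decoding fails.
try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    print("decode failed:", exc)

# errors="replace" substitutes U+FFFD for each undecodable byte instead of raising.
print(data.decode("ascii", errors="replace"))

# Latin-1 never fails, but it silently produces the wrong text.
print(data.decode("latin-1"))   # cafÃ©
```

The last case is often the more dangerous one in practice, because no exception is raised to signal that the text has been garbled.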

Strings might seem like a simple topic — just characters in a row. But the depth beneath that surface is extraordinary: Unicode handles every human writing system, regular expressions match patterns that would take pages of procedural code, and formatting turns raw data into something humans can read. Mastering strings is not glamorous work, but it is the work that makes everything else presentable.