pandas is a Python library providing fast, flexible data structures for data analysis and manipulation. Its primary data structure is the DataFrame — a two-dimensional, labelled table (like a spreadsheet or SQL table) with typed columns. pandas also provides the Series (a one-dimensional labelled array) and powerful tools for reading data from CSV, Excel, JSON, SQL, Parquet, and many other formats.
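As a minimal sketch of these two structures, the example below builds a small DataFrame from a dictionary, pulls one column out as a Series, and reads a second DataFrame from CSV text. The column names and values are invented for illustration, and the CSV is held in an in-memory string rather than a real file.

```python
import io

import pandas as pd

# Hypothetical sample data, invented for this example.
df = pd.DataFrame(
    {
        "name": ["Ada", "Grace", "Linus"],
        "age": [36, 45, 28],
        "city": ["London", "New York", "Helsinki"],
    }
)

# Selecting a single column returns a Series with its own dtype.
ages = df["age"]
print(df.dtypes)            # one dtype per column
print(type(ages).__name__)  # Series

# read_csv accepts a path or any file-like object; StringIO stands in
# for a file on disk here.
csv_text = "name,age\nAda,36\nGrace,45\n"
from_csv = pd.read_csv(io.StringIO(csv_text))
print(from_csv.shape)       # (2, 2)
```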
Key pandas operations include: selecting data with .loc[] (by label) and .iloc[] (by integer position); filtering with boolean indexing (df[df['age'] > 30]); grouping with .groupby() followed by an aggregation (.sum(), .mean(), .count()); combining DataFrames with pd.merge() (SQL-style joins on keys) and pd.concat() (stacking along an axis); pivoting with .pivot_table(); and reshaping with .melt() and .stack(). pandas represents missing data as NaN (or NaT for datetimes) and provides methods for detecting (.isna()), filling (.fillna()), and dropping (.dropna()) missing values.
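Several of these operations can be sketched on one small table. The data, the index labels (p1 to p4), and the teams lookup table are all hypothetical, and the example covers only selection, filtering, grouping, merging, and missing-data handling, not pivoting or reshaping.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age.
people = pd.DataFrame(
    {
        "name": ["Ada", "Grace", "Linus", "Guido"],
        "age": [36, 45, 28, np.nan],
        "team": ["a", "b", "a", "b"],
    },
    index=["p1", "p2", "p3", "p4"],
)

# The same cell, selected by label and by integer position.
by_label = people.loc["p2", "age"]
by_position = people.iloc[1, 1]

# Boolean indexing keeps rows where the mask is True.
# Comparisons against NaN are False, so the row with a missing age is excluded.
over_30 = people[people["age"] > 30]

# Group by team and aggregate; mean() skips NaN by default.
mean_age = people.groupby("team")["age"].mean()

# SQL-style left join against a hypothetical lookup table.
teams = pd.DataFrame({"team": ["a", "b"], "office": ["Oslo", "Lima"]})
merged = pd.merge(people, teams, on="team", how="left")

# Detect, fill, and drop missing values.
missing = people["age"].isna()
filled = people["age"].fillna(people["age"].mean())
dropped = people.dropna(subset=["age"])

print(over_30["name"].tolist())  # ['Ada', 'Grace']
print(mean_age.to_dict())        # {'a': 32.0, 'b': 45.0}
```

Note that pd.merge() returns a new DataFrame with a fresh integer index, so the p1 to p4 labels do not carry over into merged.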
pandas is built on top of NumPy and is the cornerstone of Python's data science ecosystem. It works closely with matplotlib for visualisation, scikit-learn for machine learning, statsmodels for statistical analysis, and Jupyter notebooks for interactive exploration. It is conventionally imported with the statement import pandas as pd, a convention so universal that seeing pd. in Python code immediately signals "pandas."
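The NumPy relationship can be illustrated with a short sketch: a DataFrame converts to a NumPy array, and NumPy functions can be applied to a Series directly. The column names x and y are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data.
df = pd.DataFrame({"x": [1.0, 4.0, 9.0], "y": [2.0, 4.0, 6.0]})

# Convert the whole table to a plain NumPy array.
arr = df.to_numpy()
print(type(arr).__name__, arr.shape)  # ndarray (3, 2)

# A NumPy ufunc applied to a Series returns a Series with the same index.
roots = np.sqrt(df["x"])
print(roots.tolist())  # [1.0, 2.0, 3.0]
```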
Related terms: DataFrame, NumPy, matplotlib
Discussed in:
- Chapter 16: Working with Data — pandas: The Data Analysis Library