Glossary

DataFrame

A DataFrame is the central data structure of the pandas library: a two-dimensional, size-mutable, labelled table. Each column is a Series with a name and a consistent data type (integer, float, string, datetime, etc.), and each row has an index label. You can think of a DataFrame as a dictionary of Series that share the same index, or as the Python equivalent of a spreadsheet or SQL table.
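The "dictionary of Series that share the same index" view can be sketched in a few lines. The column names, index labels and values below are made up for illustration, not taken from any particular dataset.

```python
import pandas as pd

# Two Series with the same index labels (illustrative data).
ages = pd.Series([34, 28, 45], index=["a", "b", "c"], name="age")
cities = pd.Series(["Leeds", "Oslo", "Lima"], index=["a", "b", "c"], name="city")

# Combining them by name yields a two-dimensional labelled table.
df = pd.DataFrame({"age": ages, "city": cities})

age_column = df["age"]      # selecting a column gives back a Series
row_b = df.loc["b"]         # selecting a row by its index label
n_rows, n_cols = df.shape   # (3, 2)

# Size-mutable: a new column can be added after creation.
df["age_next_year"] = df["age"] + 1
```

Each column keeps its own data type, while the row labels "a", "b" and "c" are shared by every column.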

DataFrames can be created from dictionaries (pd.DataFrame({'name': [...], 'age': [...]})), lists of lists, NumPy arrays, CSV files (pd.read_csv()), SQL queries (pd.read_sql()), and many other sources. Most operations on DataFrames are vectorised: they act on entire columns at once, with the looping done inside the library rather than in explicit Python loops, which typically makes them far faster than iterating row by row.
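As a rough sketch of these construction routes, the example below builds small frames from a dictionary, a list of lists, a NumPy array and CSV text. The data is invented, and the CSV is read from an in-memory string (io.StringIO) purely so the example is self-contained; in practice pd.read_csv() would normally be given a file path. The last lines contrast a vectorised column operation with the equivalent explicit loop.

```python
import io

import numpy as np
import pandas as pd

# From a dictionary of columns.
from_dict = pd.DataFrame({"name": ["Ann", "Bob"], "age": [31, 27]})

# From a list of rows, with the column names supplied separately.
from_rows = pd.DataFrame([["Ann", 31], ["Bob", 27]], columns=["name", "age"])

# From a NumPy array of shape (3, 2).
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])

# From CSV text (an in-memory stand-in for a file on disk).
csv_text = "name,age\nAnn,31\nBob,27\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# Vectorised: one expression adds the two columns element by element.
from_array["total"] = from_array["a"] + from_array["b"]

# The explicit row-by-row equivalent, shown only for comparison.
loop_total = [row.a + row.b for row in from_array.itertuples()]
```

Both approaches give the same totals here; on large frames the vectorised form is usually much quicker because the per-row work is not done in Python.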

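The DataFrame API is very large. Common operations include selecting columns (df['col'], or df.col when the name is a valid Python identifier that does not clash with an existing DataFrame attribute or method), filtering rows (df[df['x'] > 5]), adding columns (df['new'] = df['a'] + df['b']), sorting (df.sort_values('col')), grouping (df.groupby('col').mean(), which in pandas 2.0 and later raises an error on non-numeric columns unless they are excluded, for example with numeric_only=True), merging (pd.merge(df1, df2, on='key')), pivoting, reshaping, window functions, string methods (.str accessor), datetime methods (.dt accessor), and more. Fluency with DataFrames is among the most valuable skills for data analysis in Python.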
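A short sketch of several of the operations listed above, run on a small invented table. The column names (team, a, b, city) and the lookup table are assumptions made for the example only.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["x", "y", "x", "y"],
    "a": [1, 6, 3, 8],
    "b": [10, 20, 30, 40],
})

# Selecting a column and filtering rows with a boolean mask.
col_a = df["a"]
filtered = df[df["a"] > 5]

# Adding a column computed from two existing ones.
df["new"] = df["a"] + df["b"]

# Sorting by a column, largest first.
sorted_df = df.sort_values("new", ascending=False)

# Grouping: numeric columns are selected explicitly before taking the mean.
means = df.groupby("team")[["a", "b"]].mean()

# Merging with a second table on a shared key column.
lookup = pd.DataFrame({"team": ["x", "y"], "city": ["Leeds", "Oslo"]})
merged = pd.merge(df, lookup, on="team")

# String methods are reached through the .str accessor.
upper_city = merged["city"].str.upper()
```

Selecting [["a", "b"]] before .mean() keeps the aggregation to numeric columns, which avoids version-dependent behaviour when a frame also holds text columns.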

Related terms: pandas, NumPy, Series

