Mastering Pandas read_csv() with Examples — A Tutorial by

Codes With Pankaj
4 min read · Dec 9, 2023

Introduction:

Pandas, a powerful data manipulation library in Python, has become an essential tool for data scientists and analysts. One of its key functions is read_csv(), which allows users to read data from CSV (Comma-Separated Values) files into a Pandas DataFrame. In this tutorial, brought to you by CodesWithPankaj.com, we will explore the intricacies of read_csv() with clear examples to help you harness its full potential.

Understanding read_csv():

read_csv() is a versatile function with many parameters for handling different situations when reading CSV files. Whether your dataset uses a non-comma delimiter, contains missing values, or needs custom column names, Pandas provides options to accommodate it.
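As a minimal sketch of those three options, the snippet below reads a small in-memory sample instead of a real file. The sample rows, the column names, and the "?" missing-value marker are assumptions made up for illustration; `sep`, `header`, `names`, and `na_values` are standard read_csv() parameters.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited data with no header row.
# "NA" and "?" stand in for missing values.
raw = io.StringIO(
    "Asha;29;F\n"
    "Ravi;NA;M\n"
    "Meena;41;?\n"
)

df = pd.read_csv(
    raw,
    sep=";",                        # custom delimiter
    header=None,                    # the data has no header row
    names=["name", "age", "sex"],   # custom column names
    na_values=["?"],                # treat "?" as missing, on top of defaults such as "NA"
)

print(df)
print(df.isna().sum())  # number of missing values per column
```

With a real file, the same keyword arguments would be passed alongside the file path instead of the StringIO object.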

Importing a CSV file using the read_csv() function

# Importing the Pandas library
import pandas as pd

# Specify the file path or URL of the CSV file
file_path = 'DataSet/p4n_emp.csv'

# Use the read_csv() function to read the CSV file into a DataFrame
df = pd.read_csv(file_path)

# Display the first few rows of the DataFrame
print(df.head())

Setting a column as the index

# Importing the Pandas library
import pandas as pd

# Specify the file path of the CSV file
file_path = 'DataSet/p4n_emp.csv'

# Use the read_csv() function to read the CSV file into a DataFrame and set "name" as the index
df = pd.read_csv(file_path, index_col='name')

# Display the first few rows of the DataFrame
print(df.head())
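To see what setting an index buys you, here is a small sketch that uses in-memory data as a stand-in for the employee file. The rows and columns are hypothetical and may not match the real dataset.

```python
import io

import pandas as pd

# Hypothetical stand-in for the employee file
raw = io.StringIO(
    "name,age,sex\n"
    "Asha,29,F\n"
    "Ravi,35,M\n"
)

df = pd.read_csv(raw, index_col="name")

# Rows can now be looked up by label with .loc
ravi_age = df.loc["Ravi", "age"]
print(ravi_age)

# reset_index() moves "name" back into an ordinary column
flat = df.reset_index()
print(flat.columns.tolist())
```

Label-based lookups like this assume the index values are unique; with duplicate names, .loc returns every matching row.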

Selecting specific columns to read into memory

# Importing the Pandas library
import pandas as pd

# Specify the file path of the CSV file
file_path = 'DataSet/p4n_emp.csv'

# Specify the columns you want to select
selected_columns = ['name', 'age', 'sex']

# Use the read_csv() function to read only the specified columns into a DataFrame
df = pd.read_csv(file_path, usecols=selected_columns)

# Display the first few rows of the DataFrame
print(df.head())
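One detail worth knowing: usecols controls which columns are read, not the order they appear in. The sketch below uses made-up in-memory data (the "city" column is hypothetical) to show that the result follows the file's column order, and how to reorder afterwards.

```python
import io

import pandas as pd

# Hypothetical data with one more column than we want to load
raw_text = (
    "name,age,sex,city\n"
    "Asha,29,F,Pune\n"
    "Ravi,35,M,Delhi\n"
)

selected_columns = ["sex", "name"]

# Only the listed columns are parsed; the others are skipped
df = pd.read_csv(io.StringIO(raw_text), usecols=selected_columns)
print(df.columns.tolist())  # follows the file order, not the list order

# Index with the list to get the columns in the order you asked for
df_ordered = df[selected_columns]
print(df_ordered.columns.tolist())
```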

DataFrame Methods:

| Method | Description | Example |
|--------|-------------|---------|
| `head(n)` | Returns the first n rows of the DataFrame (5 by default). | `df.head(10)` |
| `tail(n)` | Returns the last n rows of the DataFrame (5 by default). | `df.tail(8)` |
| `info()` | Prints a concise summary: column names, non-null counts, and dtypes. | `df.info()` |
| `describe()` | Generates descriptive statistics, for numerical columns by default. | `df.describe()` |
| `unique()` | Returns an array of the unique values in a column (a Series method). | `unique_values = df['column'].unique()` |
| `value_counts()` | Returns a Series with the count of each unique value in a column. | `value_counts = df['column'].value_counts()` |
| `sort_values()` | Sorts the DataFrame by the specified column(s). | `df_sorted = df.sort_values(by='column')` |
| `groupby()` | Groups the DataFrame by the specified column(s) for aggregation. | `grouped_data = df.groupby('column').mean(numeric_only=True)` |
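Here is a short sketch that runs a few of these methods on a tiny hand-made DataFrame. The names and values are invented for illustration and are not taken from the employee dataset.

```python
import pandas as pd

# Small made-up DataFrame for trying the methods
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena", "John"],
    "sex": ["F", "M", "F", "M"],
    "age": [29, 35, 41, 31],
})

# Count how often each value appears in a column
counts = df["sex"].value_counts()
print(counts)

# Sort rows by age, oldest first
oldest_first = df.sort_values(by="age", ascending=False)
print(oldest_first)

# Average age per group, selecting the column explicitly
mean_age = df.groupby("sex")["age"].mean()
print(mean_age)

# numeric_only=True skips text columns such as "name"
mean_all = df.groupby("sex").mean(numeric_only=True)
print(mean_all)
```

Selecting the column before aggregating, or passing numeric_only=True, avoids errors in recent Pandas versions when the DataFrame also contains text columns.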

DataFrame Attributes:

| Attribute | Description | Example |
|-----------|-------------|---------|
| `shape` | Returns the dimensions of the DataFrame as a (rows, columns) tuple. | `print(df.shape)` |
| `columns` | Returns an Index object with the column labels. | `print(df.columns)` |
| `index` | Returns the index (row labels) of the DataFrame. | `print(df.index)` |
| `values` | Returns a two-dimensional NumPy array of the DataFrame's values. | `print(df.values)` |
| `dtypes` | Returns a Series with the data type of each column. | `print(df.dtypes)` |
| `size` | Returns the total number of elements (rows × columns). | `print(df.size)` |
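Attributes are accessed without parentheses, unlike methods. The following sketch checks a few of them on a small invented DataFrame; the data is illustrative only.

```python
import pandas as pd

# Small made-up DataFrame: 4 rows, 3 columns
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena", "John"],
    "sex": ["F", "M", "F", "M"],
    "age": [29, 35, 41, 31],
})

n_rows, n_cols = df.shape      # tuple of (rows, columns)
print(n_rows, n_cols)

print(df.size)                 # total number of cells: rows * columns
print(df.columns.tolist())     # column labels as a plain list
print(df.index)                # default row labels: 0, 1, 2, 3
print(df.dtypes)               # one dtype per column
```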

Exporting the DataFrame to a CSV File

# Assuming df is your DataFrame
import pandas as pd

# Specify the file path for the CSV output
output_file_path = 'output_data.csv'

# Use the to_csv() method to export the DataFrame to a CSV file
df.to_csv(output_file_path, index=False)

# Display a message indicating the successful export
print(f"DataFrame exported to {output_file_path}")
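The index=False argument matters when you read the file back. As a sketch, the snippet below writes the same made-up DataFrame twice into a temporary directory (the file names are hypothetical) and compares the columns that come back.

```python
import os
import tempfile

import pandas as pd

# Small made-up DataFrame
df = pd.DataFrame({"name": ["Asha", "Ravi"], "age": [29, 35]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    df.to_csv(with_index)                   # default: row labels are written too
    df.to_csv(without_index, index=False)   # row labels are left out

    cols_with_index = pd.read_csv(with_index).columns.tolist()
    cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)      # includes an extra "Unnamed: 0" column
print(cols_without_index)   # matches the original columns
```

If the index carries real information, such as names set with index_col, keep it when exporting and pass index_col again when reading the file back.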

Putting it together: filtering, grouping, and exporting

# Importing the Pandas library
import pandas as pd

# Specify the file path for the CSV input
file_path = 'DataSet/p4n_emp.csv'

# Load the dataset into a DataFrame
df = pd.read_csv(file_path)

# Display the first few rows of the DataFrame
print("Original DataFrame:")
print(df.head())

# Filtering: Selecting employees aged 30 or younger
young_employees = df[df['age'] <= 30]

# Display the first few rows of the filtered DataFrame
print("\nYoung Employees:")
print(young_employees.head())

# Grouping: Calculate the average salary by job title
average_salary_by_job = df.groupby('job.title')['annual.salary'].mean().reset_index()

# Display the average salary by job title
print("\nAverage Salary by Job Title:")
print(average_salary_by_job)

# Summary Statistics: Displaying overall summary statistics of the dataset
summary_statistics = df.describe(include='all')

# Display the summary statistics
print("\nSummary Statistics:")
print(summary_statistics)

# Export the filtered DataFrame to a new CSV file
output_filtered_path = 'DataSet/young_employees.csv'
young_employees.to_csv(output_filtered_path, index=False)
print(f"\nFiltered DataFrame exported to {output_filtered_path}")

Conclusion:

The read_csv() function in Pandas is a versatile tool for importing data from CSV files. In this tutorial, we've covered basic usage, setting a column as the index, reading only selected columns, inspecting the result with common DataFrame methods and attributes, and exporting data back to CSV with to_csv(). Armed with this knowledge, you'll be better equipped to tackle diverse datasets in your data science work.

Visit CodesWithPankaj.com for more in-depth tutorials and coding insights to enhance your Python and Pandas skills. Happy coding!
