Pandas Applications and Regression Preview

Pandas Applications

Pandas is one of the most powerful and widely used Python libraries for data analysis.
Beyond basic operations such as loading CSV files or selecting rows and columns, Pandas provides advanced functionalities that make it suitable for real-world applications. These applications help researchers, data scientists, and engineers handle large datasets, transform them into meaningful structures, and prepare them for further machine learning tasks.

  • Working with Databases
    Pandas can directly interact with relational databases such as SQLite, MySQL, or PostgreSQL. This allows us to read tables into DataFrames, perform transformations in Python, and save the results back to the database. It is especially useful when datasets are too large to handle with plain CSV or Excel files.

  • Grouping and Aggregation
    With functions like groupby and agg, Pandas allows us to summarize data based on categories. For example, we can compute the average salary per department, the total sales per region, or the maximum score per student group.

  • Combining Data
    Real-world data is often split across multiple tables. Pandas provides merge, concat, and join functions to combine datasets seamlessly. This resembles SQL joins and is essential when integrating information from different sources.

  • Pivoting and Reshaping
    Using pivot, pivot_table, and melt, Pandas can reshape datasets into more convenient formats. This is especially useful for producing summary tables or preparing data for visualization and statistical modeling.

  • Applying Custom Functions
    Pandas integrates well with Python functions. The apply method and lambda expressions allow us to create new variables, transform existing ones, and apply complex logic to entire datasets with minimal code.

In practice, these applications make Pandas more than just a data manipulation tool. They transform it into a bridge between raw data and meaningful analysis, enabling us to prepare datasets effectively before applying machine learning algorithms.

Working with Database Files

Pandas can work not only with CSV or Excel files but also with relational databases such as SQLite.
This is especially useful when data is stored in structured tables. By combining Pandas with the sqlite3 library, we can create a database, save a DataFrame into it, read data back with SQL queries, and update it after processing.

  • Create a sample SQLite database file

This step demonstrates how to create a small database file and store a Pandas DataFrame inside it.
Instead of keeping all data in CSV or Excel files, we often need to use relational databases such as SQLite, which allow us to manage data more efficiently. Pandas integrates seamlessly with SQLite through the sqlite3 module.

SQLite is a lightweight, serverless relational database engine.

  • Unlike large database systems (e.g., MySQL or PostgreSQL), SQLite does not require a separate server process.

  • Instead, it stores the entire database in a single file (e.g., sample.db).

  • This makes it very convenient for learning, prototyping, and handling small to medium-sized datasets.

  • Because it follows the SQL (Structured Query Language) standard, you can still use familiar SQL commands such as SELECT, INSERT, and UPDATE.

sqlite3 is the Python built-in module that lets you work with SQLite databases.

  • With sqlite3.connect("sample.db"), Python opens a connection to the database file.

  • If sample.db does not exist, it will be automatically created.

  • If the file already exists, it simply opens the existing database file.

  • You can then use this connection to run SQL queries or, in combination with Pandas, to store and retrieve DataFrames.

Note

Pandas can also connect to other relational databases such as MySQL and PostgreSQL.

  • Instead of sqlite3, you would typically use connectors such as pymysql (for MySQL) or psycopg2 (for PostgreSQL).

  • Combined with SQLAlchemy, Pandas can use the same read_sql() and to_sql() methods to read from or write to those databases as well (a minimal sketch follows this list).

  • This makes Pandas a very versatile tool for integrating with various database systems.
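
As a minimal sketch of this setup (the connection string below is a placeholder, and SQLAlchemy plus a driver such as pymysql must be installed; there is no real server behind this example):

from sqlalchemy import create_engine
import pandas as pd

# Placeholder credentials: replace user, password, host, and database name
engine = create_engine("mysql+pymysql://user:password@localhost:3306/mydb")

df = pd.DataFrame({"id": [1, 2], "name": ["Kim", "Lee"]})

# The same Pandas methods work as with sqlite3
df.to_sql("students", engine, if_exists="replace", index=False)
df_back = pd.read_sql("SELECT * FROM students", engine)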

The workflow is:

  1. Use Pandas to create a DataFrame with 10 rows of student data (ID, name, score).

  2. Open a connection to sample.db using the sqlite3 module.

  3. Save the DataFrame into the database as a table named students using Pandas’ to_sql() method.

  4. If the table already exists, we can choose to replace or append data (a short append sketch follows the code block below).

  5. Close the connection to finalize the operation.

This process illustrates how Pandas can interact directly with a database, turning Python objects into a persistent SQL table that can be queried later.

import sqlite3
import pandas as pd

# Create a DataFrame with 10 rows of sample data
df = pd.DataFrame({
    "id": list(range(1, 11)),
    "name": ["Kim","Lee","Park","Choi","Jung","Han","Yoon","Lim","Shin","Kang"],
    "score": [90, 85, 88, 92, 75, 80, 95, 78, 84, 91]
})

# Save the DataFrame into an SQLite database
conn = sqlite3.connect("sample.db")
df.to_sql("students", conn, if_exists="replace", index=False)
conn.close()
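
The block above used if_exists="replace", which drops and recreates the table. To add rows to an existing table instead, pass if_exists="append"; a short sketch with hypothetical extra students (reusing the sqlite3 and pandas imports above):

# Two hypothetical additional rows
extra = pd.DataFrame({
    "id": [11, 12],
    "name": ["Oh", "Seo"],
    "score": [88, 93]
})

conn = sqlite3.connect("sample.db")
extra.to_sql("students", conn, if_exists="append", index=False)
conn.close()
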
  • Read data from the database using Pandas
    Once the table is created, we can read it back into a DataFrame using pd.read_sql().
    This allows us to query the database using SQL and immediately work with the results in Pandas.

# Reconnect to the database
conn = sqlite3.connect("sample.db")

# Read the table into a DataFrame
df2 = pd.read_sql("SELECT * FROM students", conn)
print(df2)

conn.close()
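
Because read_sql() accepts any SELECT statement, filtering can happen inside the database before the result ever reaches Pandas. For example, a query that returns only the high scorers from the table created above:

# Only rows with score >= 90 are transferred into the DataFrame
conn = sqlite3.connect("sample.db")
top = pd.read_sql("SELECT name, score FROM students WHERE score >= 90", conn)
print(top)
conn.close()
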
  • Modify and save the DataFrame back into the database
    After making transformations, such as adding a new column, we can save the updated data back into the database using to_sql() again.

# Add a new column "grade" based on scores
df2["grade"] = df2["score"].apply(lambda x: "A" if x >= 90 else "B")

# Save the updated DataFrame back into the database
conn = sqlite3.connect("sample.db")
df2.to_sql("students", conn, if_exists="replace", index=False)
conn.close()

Grouping and Aggregation

In many data analysis tasks, we need to group data by categories and compute summary statistics.
Pandas provides the groupby method along with agg to calculate measures such as mean, sum, or standard deviation for each group.

# Create a DataFrame with teams and performance data
data = {
    "team": ["A","A","B","B","C","C","A","B","C","A"],
    "score": [85,90,78,88,95,92,89,84,91,87],
    "hours": [5,6,4,5,7,6,8,3,4,7]
}
df = pd.DataFrame(data)

# Group data by team and calculate mean and standard deviation
result = df.groupby("team").agg({"score":"mean", "hours":"std"})
print(result)
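
agg is flexible: passing a list of function names computes several statistics per group at once. A small variant of the example above:

# Mean, standard deviation, and maximum of "score" for each team
summary = df.groupby("team")["score"].agg(["mean", "std", "max"])
print(summary)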

Combining DataFrames

Real-world datasets often come from multiple sources.
Pandas offers several ways to combine DataFrames:

  • merge: SQL-style join on keys

  • concat: Concatenate DataFrames along rows or columns

  • join: Merge DataFrames by index

# Create two separate DataFrames
df1 = pd.DataFrame({
    "id": [1,2,3,4,5],
    "name": ["Kim","Lee","Park","Choi","Jung"]
})
df2 = pd.DataFrame({
    "id": [1,2,3,4,5],
    "score": [90,85,88,92,75]
})

# Merge DataFrames on the "id" column
merged = pd.merge(df1, df2, on="id")

# Concatenate DataFrames along columns
# (note: both inputs keep their own "id" column, so it appears twice in the result)
concat = pd.concat([df1, df2], axis=1)

# Join DataFrames using index
joined = df1.set_index("id").join(df2.set_index("id"))
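
Note that pd.merge performs an inner join by default; the how parameter selects other SQL-style joins. For example, a left join keeps every row of df1 and fills missing matches with NaN:

# Left join: all rows of df1 are preserved
left_joined = pd.merge(df1, df2, on="id", how="left")
print(left_joined)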

Pivoting and Reshaping

Pandas makes it easy to reshape data into more convenient formats.

  • pivot_table summarizes data in a table, similar to Excel pivot tables.

  • melt transforms wide-format data into long format, making it easier for analysis or visualization.

# Pivot table to calculate average score per team
pivot_table = df.pivot_table(values="score", index="team", aggfunc="mean")

# Reshape DataFrame using melt
melted = pd.melt(df, id_vars="team", value_vars=["score","hours"])
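
pivot_table can also aggregate several columns at once, and the long-format result of melt can be inspected directly:

# Average score and hours per team in a single summary table
summary = df.pivot_table(values=["score", "hours"], index="team", aggfunc="mean")
print(summary)

# Long format: columns "team", "variable", "value" (one row per original row per variable)
print(melted.head())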

Applying Custom Functions

We can extend Pandas functionality by applying our own logic to data.
The apply method combined with lambda expressions allows flexible transformations, such as converting numeric scores into letter grades.

# Apply a lambda function to categorize scores into grades
df["grade"] = df["score"].apply(lambda x: "A" if x>=90 else "B")
print(df)
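
For logic too long to fit in a lambda, an ordinary named function works the same way with apply. A sketch using assumed three-tier thresholds (the cutoffs are illustrative, not from the examples above):

# Grade thresholds below are illustrative assumptions
def to_grade(score):
    if score >= 90:
        return "A"
    elif score >= 80:
        return "B"
    else:
        return "C"

df["grade3"] = df["score"].apply(to_grade)
print(df[["score", "grade3"]])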

Regression Preview

What is Regression?

Regression is one of the simplest machine learning algorithms.

Its goal is to predict a continuous output variable (y) from one or more input variables (X).

Common examples include:

  • Study hours → Exam scores

  • Advertising expenses → Sales revenue

The mathematical form of a simple linear regression model is:

\[ y = \beta_0 + \beta_1 x + \epsilon \]

where:

  • \( y \): dependent variable (target to predict)

  • \( x \): independent variable (input feature)

  • \( \beta_0 \): intercept (constant term)

  • \( \beta_1 \): slope (effect of x on y)

  • \( \epsilon \): error term
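
For example, with illustrative values \( \beta_0 = 45 \) and \( \beta_1 = 4 \), a student who studies 5 hours is predicted to score \( 45 + 4 \times 5 = 65 \), plus whatever deviation \( \epsilon \) the model cannot explain.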

Using Pandas with Other Libraries

One of the biggest strengths of Pandas is that it integrates seamlessly with other Python libraries such as NumPy and scikit-learn (sklearn).

  • With NumPy, we can easily perform numerical operations, fit simple regression lines, and calculate error metrics.

  • With scikit-learn, we can build more advanced regression models, evaluate them with standard tools, and extend to more complex machine learning workflows.

This integration makes Pandas a central hub for managing datasets before passing them to numerical or machine learning packages.

Example: Regression with NumPy

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Sample data
hours = np.array([1,2,3,4,5,6,7,8,9,10])
scores = np.array([50,55,60,65,70,72,75,78,82,85])

# Fit a regression line using NumPy
coef = np.polyfit(hours, scores, 1)  # slope, intercept
line = np.poly1d(coef)

plt.scatter(hours, scores, label="Data")
plt.plot(hours, line(hours), color="red", label="Regression line")
plt.xlabel("Study Hours")
plt.ylabel("Exam Score")
plt.legend()
plt.show()

In this example, we used NumPy to fit a simple regression line between study hours (X) and exam scores (y). The function np.polyfit(hours, scores, 1) calculated the slope and intercept of the best-fitting line.

  • Scatter Plot of Data
    The scatter plot shows each student’s study hours on the x-axis and exam scores on the y-axis. As expected, there is a clear upward trend: students who studied more hours generally achieved higher scores.

  • Regression Line
    The red regression line represents the predicted relationship. It is drawn based on the slope and intercept computed by NumPy. The positive slope indicates a positive correlation: as study hours increase, exam scores also rise.

  • Interpretation

    • The slope value (β₁) tells us how many additional points in the exam score are expected for each extra hour of study.

    • The intercept (β₀) represents the predicted exam score when study hours = 0.

    • Together, these values give us a simple but useful model to estimate exam performance from study time.
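
These values can be read directly from the NumPy fit; a short snippet reusing coef from the code block above:

# np.polyfit returns coefficients from the highest degree down: [slope, intercept]
slope, intercept = coef
print("Slope (beta_1):", round(slope, 2))
print("Intercept (beta_0):", round(intercept, 2))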

Example: Regression with scikit-learn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Sample data
hours = np.array([1,2,3,4,5,6,7,8,9,10]).reshape(-1,1)  # X must be 2D for sklearn
scores = np.array([50,55,60,65,70,72,75,78,82,85])

# Create a Pandas DataFrame
df = pd.DataFrame({"hours": hours.flatten(), "scores": scores})

# Build and fit a linear regression model
model = LinearRegression()
model.fit(df[["hours"]], df["scores"])

# Predict values
pred = model.predict(df[["hours"]])

# Display coefficients
print("Intercept:", model.intercept_)
print("Slope:", model.coef_[0])

# Plot regression line
plt.scatter(df["hours"], df["scores"], label="Data")
plt.plot(df["hours"], pred, color="green", label="Sklearn Regression line")
plt.xlabel("Study Hours")
plt.ylabel("Exam Score")
plt.legend()
plt.show()

This example shows how Pandas works smoothly with scikit-learn.
We can use Pandas DataFrames as inputs to fit() and predict(), making the workflow much simpler and more intuitive.

Evaluating Prediction Performance

We can measure how well the regression line fits the data using Mean Squared Error (MSE).

# Predictions from the regression line fitted with NumPy earlier
# (reuses `line` from the NumPy example and `df` from the scikit-learn example)
pred = line(df["hours"])

# Calculate Mean Squared Error
mse = np.mean((df["scores"] - pred)**2)
print("MSE:", mse)

  • A smaller MSE value indicates a better fit between the model and the data.
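
scikit-learn ships the same metric as a ready-made function; a short check reusing df and pred from above should give the identical value:

from sklearn.metrics import mean_squared_error

mse_sklearn = mean_squared_error(df["scores"], pred)
print("MSE (sklearn):", mse_sklearn)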

Exercises

  1. Use Pandas groupby to calculate the average salary by department.

  2. Combine an employee information table and a salary table using merge.

  3. Summarize sales by region and gender using pivot_table.

  4. Using study hours and exam scores, draw a simple regression line and confirm the statement:
    "As the number of study hours increases, the exam score also increases."

