Human Vs Bot Event Classification Analysis and Visualization with Python (Part I)

Introduction

In the world of digital marketing, it is important to distinguish between human and bot interactions in order to understand user behavior, optimize campaigns, and prevent fraudulent activities. This blog post explores the analysis of marketing events to identify patterns that differentiate human from bot interactions. We will use a dataset of marketing events, perform various data preprocessing steps, and apply classification criteria to identify human and bot activities. Finally, we will visualize the results to gain deeper insights.

Strategies and Tools

Data Preprocessing:
- Loading the Dataset: We start by loading the dataset using the pandas library, which is a powerful tool for data manipulation and analysis.
- Converting Timestamps: We convert the ts column to datetime format to facilitate time-based operations.
- Sorting the Data: Sorting the data by recipientId and ts ensures that the time differences are calculated correctly.
- Calculating Time Differences: We calculate the time difference between consecutive events for each recipient to identify patterns in interaction times.
- Filtering Out Rows with NaN in time_diff: Removing rows with NaN values in the time_diff column ensures that our analysis is based on complete data.
Outlier Detection:
- Calculating IQR: We use the Interquartile Range (IQR) to identify outliers in the time_diff column. The IQR is a robust measure of statistical dispersion and helps us identify unusual patterns.
- Defining Outlier Bounds: We define the lower and upper bounds for outliers using the IQR.
- Identifying Outliers: We filter the data to identify rows that fall outside the defined bounds.
- Filtering Large Outliers: We further filter outliers to focus on events with time differences greater than two days (172,800 seconds).
Classification Criteria:
- Defining Classification Function: We define a function that classifies events as human or bot based on whether a link is present and how long it is. Events with no link or a short link (fewer than 100 characters) are classified as human, while events with a long, complex link (100 characters or more) are classified as bot.
- Applying Classification Function: We apply the classification function to the large outliers to identify human and bot interactions.
Visualizations:
- Time Difference Distribution:
  - Histogram: We use a histogram to visualize the distribution of time differences between consecutive events.
  - Boxplot: We use a boxplot to highlight outliers in the time difference distribution for human and bot events.
- Link Length Distribution:
  - Histogram: We use a histogram to visualize the distribution of link lengths for human and bot events.
  - Boxplot: We use a boxplot to highlight the differences in link lengths between human and bot events.
- Event Type Distribution:
  - Countplot: We use a countplot to visualize the distribution of event types for human and bot events.
- Time of Day Analysis:
  - Histogram: We use a histogram to visualize the distribution of events by hour of the day.
  - Heatmap: We use a heatmap to visualize the distribution of events by hour of the day and day of the week.
- Event Count Distribution:
  - Histogram: We use a histogram to visualize the distribution of event counts per recipient.
  - Boxplot: We use a boxplot to highlight outliers in the event count distribution for human and bot events.
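The event count per recipient that feeds the last pair of plots can be sketched on toy data. This is a minimal illustration rather than the post's exact code: the `events` frame and its values are made up, and `transform("size")` is one possible alternative to building the counts separately and merging them back.

```python
import pandas as pd

# Hypothetical toy events; only the recipientId column matters here.
events = pd.DataFrame({
    "recipientId": ["a", "a", "b", "a", "c", "b"],
    "event": ["open", "click", "open", "open", "open", "click"],
})

# Attach each recipient's total number of events to every one of its rows.
events["event_count"] = events.groupby("recipientId")["recipientId"].transform("size")

# One row per recipient, useful when each recipient should count once in a plot.
per_recipient = events.drop_duplicates("recipientId")[["recipientId", "event_count"]]

print(events)
print(per_recipient)
```

A histogram drawn from per_recipient gives every recipient equal weight, whereas a histogram over the row-level event_count column counts a recipient once per event.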

Tools and Libraries

Pandas:
- Purpose: Data manipulation and analysis.
- Why: Pandas provides a powerful and flexible data structure (DataFrames) and a wide range of functions for data manipulation, making it ideal for handling and preprocessing the dataset.
- Alternative: NumPy can be used for numerical operations, but Pandas is better suited to tabular data.
Matplotlib:
- Purpose: Data visualization.
- Why: Matplotlib is a versatile plotting library that supports a wide range of plot types, including histograms, boxplots, and heatmaps. It provides fine-grained control over the visual elements of the plots.
- Alternative: Seaborn is another popular data visualization library that builds on Matplotlib and provides a higher-level interface for more complex visualizations. Plotly is also an excellent alternative for interactive visualizations.
Seaborn:
- Purpose: Data visualization.
- Why: Seaborn is built on top of Matplotlib and provides a high-level interface for drawing attractive and informative statistical graphics. It simplifies the process of creating complex visualizations.
- Alternative: Plotly offers interactive visualizations and is suitable for web-based applications.
NumPy:
- Purpose: Numerical operations.
- Why: NumPy provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays. It is essential for numerical computations.
- Alternative: SciPy is another library that provides additional scientific computing tools and functions.
OS:
- Purpose: File system operations.
- Why: The os module provides a way to interact with the operating system, including getting the path to the Downloads folder and saving files.
- Alternative: pathlib, also part of the standard library, is a more modern, object-oriented way to handle file paths and directories.
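As a rough sketch of the pathlib alternative, the path to the CSV can be built from the user's home directory instead of being written out by hand. The helper name `downloads_file` is hypothetical, and the code assumes the Downloads folder sits directly under the home directory, which may not hold on every system.

```python
from pathlib import Path
from typing import Optional


def downloads_file(filename: str, home: Optional[Path] = None) -> Path:
    """Return the path to `filename` inside a Downloads folder.

    Assumption: Downloads lives directly under the home directory.
    """
    base = home if home is not None else Path.home()
    return base / "Downloads" / filename


csv_path = downloads_file("classified_marketing_events.csv")
print(csv_path)
```

The resulting Path object can be passed to pd.read_csv directly, so the rest of the script does not need to change.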

Data Preprocessing

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Load the dataset
file_path = r"C:\Users\LENOVO\Downloads\classified_marketing_events.csv"
df = pd.read_csv(file_path)

# Convert timestamp to datetime with mixed format and dayfirst=True
df['ts'] = pd.to_datetime(df['ts'], format='mixed', dayfirst=True)

# Sort the data by recipientId and timestamp
df = df.sort_values(by=['recipientId', 'ts'])

# Calculate the time difference between consecutive events for each recipient
df['time_diff'] = df.groupby('recipientId')['ts'].diff().dt.total_seconds()

# Filter out rows with NaN in time_diff
df = df.dropna(subset=['time_diff'])

# Calculate the IQR for time_diff
Q1 = df['time_diff'].quantile(0.25)
Q3 = df['time_diff'].quantile(0.75)
IQR = Q3 - Q1

# Define the lower and upper bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers
outliers = df[(df['time_diff'] < lower_bound) | (df['time_diff'] > upper_bound)]

# Filter outliers with time differences greater than two days (172,800 seconds)
large_outliers = outliers[outliers['time_diff'] > 172800]

# Define a function to classify events as human or bot:
# no link or a link shorter than 100 characters -> 'human', otherwise -> 'bot'
def classify_event(row):
    if pd.isna(row['link']):
        return 'human'
    elif len(row['link']) < 100:
        return 'human'
    else:
        return 'bot'

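# Apply the classification function to the large outliers.
# Note: this labels large_outliers only. The plots below read 'human/automated'
# from df, so that column is assumed to already be present in the loaded CSV.
large_outliers = large_outliers.copy()
large_outliers['human/automated'] = large_outliers.apply(classify_event, axis=1)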

# Create a new column for link length
df['link_length'] = df['link'].apply(lambda x: len(x) if pd.notna(x) else 0)

# 1. Time Difference Distribution (Analysis and Visualization)
plt.figure(figsize=(12, 6))
sns.histplot(data=df, x='time_diff', bins=50, kde=True)
plt.title('Time Difference Distribution')
plt.xlabel('Time Difference (seconds)')
plt.ylabel('Frequency')
plt.show()

plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='human/automated', y='time_diff')
plt.title('Time Difference Distribution (Human vs Bot)')
plt.xlabel('Human/Automated')
plt.ylabel('Time Difference (seconds)')
plt.show()

# 2. Link Length Distribution (Analysis and Visualization)
plt.figure(figsize=(12, 6))
sns.histplot(data=df, x='link_length', hue='human/automated', bins=50, kde=True)
plt.title('Link Length Distribution (Human vs Bot)')
plt.xlabel('Link Length')
plt.ylabel('Frequency')
plt.show()

plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='human/automated', y='link_length')
plt.title('Link Length Distribution (Human vs Bot)')
plt.xlabel('Human/Automated')
plt.ylabel('Link Length')
plt.show()

# 3. Event Type Distribution (Analysis and Visualization)
plt.figure(figsize=(12, 6))
sns.countplot(data=df, x='event', hue='human/automated')
plt.title('Event Type Distribution (Human vs Bot)')
plt.xlabel('Event Type')
plt.ylabel('Count')
plt.show()

# 4. Time of Day Analysis (Analysis and Visualization)
df['hour'] = df['ts'].dt.hour
plt.figure(figsize=(12, 6))
sns.histplot(data=df, x='hour', hue='human/automated', bins=24, kde=True)
plt.title('Event Distribution by Hour of the Day (Human vs Bot)')
plt.xlabel('Hour of the Day')
plt.ylabel('Frequency')
plt.show()

df['day_of_week'] = df['ts'].dt.dayofweek
plt.figure(figsize=(12, 6))
sns.heatmap(pd.crosstab(df['hour'], df['day_of_week']), annot=True, cmap='viridis')
plt.title('Event Distribution by Hour of the Day and Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Hour of the Day')
plt.show()

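# 5. Event Count Distribution (Analysis and Visualization)
# Note: counts are taken after the NaN time_diff rows were dropped, so each
# recipient's first event is not included.
event_counts = df['recipientId'].value_counts().reset_index()
event_counts.columns = ['recipientId', 'event_count']
df = df.merge(event_counts, on='recipientId', how='left')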

plt.figure(figsize=(12, 6))
sns.histplot(data=df, x='event_count', bins=50, kde=True)
plt.title('Event Count Distribution')
plt.xlabel('Event Count')
plt.ylabel('Frequency')
plt.show()

plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='human/automated', y='event_count')
plt.title('Event Count Distribution (Human vs Bot)')
plt.xlabel('Human/Automated')
plt.ylabel('Event Count')
plt.show()

Explanation of the Code Structure
Data Preprocessing:
- Load the Dataset: Load the dataset from the specified file path.
- Convert Timestamps: Convert the ts column to datetime format.
- Sort the Data: Sort the data by recipientId and ts.
- Calculate Time Differences: Calculate the time difference between consecutive events for each recipient.
- Filter Out Rows with NaN in time_diff: Remove rows where time_diff is NaN.
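The preprocessing steps can be traced on a tiny, made-up frame. The recipient IDs and timestamps below are illustrative assumptions, not rows from the real dataset; the column names recipientId, ts, and time_diff follow the post.

```python
import pandas as pd

# Hypothetical events for two recipients, deliberately out of order.
toy = pd.DataFrame({
    "recipientId": ["a", "b", "a", "b", "a"],
    "ts": [
        "2024-01-01 10:00:00",
        "2024-01-01 09:00:00",
        "2024-01-01 10:00:30",
        "2024-01-04 09:00:00",
        "2024-01-01 10:05:00",
    ],
})

toy["ts"] = pd.to_datetime(toy["ts"])
toy = toy.sort_values(by=["recipientId", "ts"])

# diff() runs within each recipient, so the first event of every recipient
# has no predecessor and gets NaN.
toy["time_diff"] = toy.groupby("recipientId")["ts"].diff().dt.total_seconds()
first_events = int(toy["time_diff"].isna().sum())

# Dropping NaN therefore removes exactly one row per recipient.
toy = toy.dropna(subset=["time_diff"])
print(first_events)
print(toy)
```

Sorting before the grouped diff matters: without it, differences would be taken between rows in file order and could come out negative.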
Outlier Detection:
- Calculate IQR: Calculate the Interquartile Range (IQR) for time_diff.
- Define Outlier Bounds: Define the lower and upper bounds for outliers.
- Identify Outliers: Identify rows that are outliers based on the IQR.
- Filter Large Outliers: Filter outliers with time differences greater than two days (172,800 seconds).
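A small worked example may make the IQR rule concrete. The numbers are invented for illustration, and the helper `iqr_bounds` is a hypothetical name; the 1.5 multiplier and the two-day cutoff are the ones used in the post.

```python
import pandas as pd

TWO_DAYS_SECONDS = 2 * 24 * 60 * 60  # 172,800 seconds


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) outlier bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical time differences in seconds.
time_diff = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 500.0, 200000.0])

# Here Q1 = 30 and Q3 = 70, so IQR = 40 and the bounds are -30 and 130.
lower_bound, upper_bound = iqr_bounds(time_diff)

outliers = time_diff[(time_diff < lower_bound) | (time_diff > upper_bound)]
large_outliers = outliers[outliers > TWO_DAYS_SECONDS]

print(lower_bound, upper_bound)
print(outliers.tolist())
print(large_outliers.tolist())
```

In this example the lower bound is negative, so it cannot flag anything: time differences computed on sorted timestamps are never below zero. Only the upper bound and the two-day cutoff do any filtering.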
Classification Criteria:
- Define Classification Function: Define a function that classifies events as human or bot based on whether a link is present and how long it is.
- Apply Classification Function: Apply the classification function to the large outliers.
Link Length:
- Create a New Column: Store the length of each link in link_length, using 0 when no link is present.
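The same link-length rule can also be written in vectorized form, which is one way to label every row of a frame at once if the loaded CSV does not already carry a human/automated column. This is a sketch under assumptions: `classify_links` is a hypothetical helper, the toy links are made up, and only the 100-character threshold comes from the post.

```python
import numpy as np
import pandas as pd

LINK_LENGTH_THRESHOLD = 100  # threshold taken from the classification rule


def classify_links(links: pd.Series, threshold: int = LINK_LENGTH_THRESHOLD) -> pd.Series:
    """Label rows 'human' when the link is missing or short, 'bot' otherwise."""
    # A missing link is treated as length 0, which falls on the 'human' side.
    link_length = links.fillna("").astype(str).str.len()
    labels = np.where(link_length < threshold, "human", "bot")
    return pd.Series(labels, index=links.index, name="human/automated")


# Hypothetical links: missing, short, and long.
toy = pd.DataFrame({
    "link": [
        None,
        "https://example.com/offer",
        "https://example.com/t?" + "x" * 120,
    ]
})

toy["link_length"] = toy["link"].str.len().fillna(0).astype(int)
toy["human/automated"] = classify_links(toy["link"])
print(toy[["link_length", "human/automated"]])
```

Because the label is derived from link length alone, plots of link length split by label will separate at the threshold by construction; they describe the rule rather than test it independently.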
Visualizations:
- Time Difference Distribution:
  - Histogram: Show the distribution of time differences.
  - Boxplot: Highlight outliers in the time difference distribution for human and bot events.
- Link Length Distribution:
  - Histogram: Show the distribution of link lengths for human and bot events.
  - Boxplot: Highlight the differences in link lengths between human and bot events.
- Event Type Distribution:
  - Countplot: Show the distribution of event types for human and bot events.
- Time of Day Analysis:
  - Histogram: Show the distribution of events by hour of the day.
  - Heatmap: Show the distribution of events by hour of the day and day of the week.
- Event Count Distribution:
  - Histogram: Show the distribution of event counts per recipient.
  - Boxplot: Highlight outliers in the event count distribution for human and bot events.
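The table behind the hour-by-day heatmap can be built so that it always has 24 rows and 7 named columns, even when some hours or days have no events. The helper `hour_by_day_table` and the sample timestamps are hypothetical; the sketch only prepares the data and leaves the plotting call as in the post.

```python
import pandas as pd

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def hour_by_day_table(ts: pd.Series) -> pd.DataFrame:
    """Count events per (hour, weekday), filling empty cells with 0."""
    hours = ts.dt.hour.rename("hour")
    days = ts.dt.dayofweek.rename("day_of_week")  # Monday=0 ... Sunday=6
    table = pd.crosstab(hours, days)
    # Make sure every hour and every weekday appears, even with no events.
    table = table.reindex(index=range(24), columns=range(7), fill_value=0)
    table.index.name = "hour"
    table.columns = DAY_NAMES
    return table


# Hypothetical timestamps: two on a Monday morning, one late on a Saturday.
ts = pd.to_datetime(pd.Series([
    "2024-01-01 09:15:00",
    "2024-01-01 09:45:00",
    "2024-01-06 23:05:00",
]))

table = hour_by_day_table(ts)
print(table.loc[[9, 23]])
```

The returned frame can be handed to sns.heatmap in place of the raw crosstab, which keeps the axes the same size across datasets and labels the days by name instead of by number.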

Conclusion

By analyzing the marketing events dataset, we were able to find patterns that distinguish human from bot interactions. The presence or absence of links, the length of links, and the time differences between consecutive events were the most important factors in our classification. Visualizations gave us insights into the distribution of these features and helped validate our classification criteria.

This analysis can be improved further by adding features and applying machine learning techniques. However, the simple rules-based approach used here serves as a solid basis for understanding and classifying human vs. bot interactions in marketing events.
