This project was conducted as part of the Graph Analytics course and analyzes user behavior on MOOC platforms such as Coursera. The primary objective is to track user activity and detect patterns that indicate whether the registered user is the one actually interacting with the platform, or whether fraudulent activity (e.g., unauthorized users completing courses or earning certificates) is occurring.
To achieve this, we track actions such as click events, session frequency, and course type, and leverage graph-based modeling techniques, including Graph Neural Networks (GNNs) and Autoencoders, to detect complex behavioral patterns in the dataset.
Notebooks: data_preprocessing.ipynb and preprocessing__with_pyspark.ipynb
The raw dataset contains multiple attributes, some of which are irrelevant to our study. Because our primary goal is anomaly detection in user behavior, we excluded user demographic details (gender, birth year, education), which are more relevant to tasks such as community detection or course recommendation than to fraud detection. We also did not use the dropout labels, since our objective is not dropout prediction.
To ensure data consistency and reliability, we applied the following preprocessing steps:
- Data Cleaning:
  - Removed unnecessary attributes.
  - Handled missing values.
  - Ensured session-based interactions remained intact.
- Feature Selection:
  - Retained only essential attributes related to user activity, session behaviors, and course interactions, aligning with our graph-based modeling.
- Dataset Integration:
  - Merged multiple datasets (Dropout Prediction, User Profile, and Course Information) into a single dataset that accurately represents meaningful learning interactions.
To analyze user behavior dynamics, we engineered additional engagement-based features:
- `session_duration`: Total time a user spends within a session, measuring engagement.
- `session_gap`: Time difference between consecutive sessions, analyzing learning continuity.
- `action_count`: Number of interactions per session, representing session intensity.
- `action_frequency`: Rate of user interactions per second, helping to detect engagement variations.
These additional features help us track behavioral anomalies such as:
- Unusual session lengths (e.g., a session lasting hours could indicate automated activity).
- Sudden drops in engagement (e.g., long session gaps without interactions).
- Unrealistically high action frequencies, which may suggest bot-like behavior.
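The four engineered features described above can be sketched in plain Python as follows. This is only an illustration, not the code from the notebooks: the `events` list, the `session_features` helper, and the choice to measure `session_gap` from the end of a user's previous session to the start of the next one are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical event log: one (username, session_id, ISO timestamp) per action.
events = [
    ("u1", "s1", "2015-10-01T08:00:00"),
    ("u1", "s1", "2015-10-01T08:00:30"),
    ("u1", "s1", "2015-10-01T08:01:00"),
    ("u1", "s2", "2015-10-01T09:00:00"),
    ("u1", "s2", "2015-10-01T09:00:10"),
]


def session_features(events):
    """Return {session_id: feature dict} with the four engineered features."""
    times_by_session = defaultdict(list)
    owner = {}
    for user, sid, ts in events:
        times_by_session[sid].append(datetime.fromisoformat(ts))
        owner[sid] = user

    feats = {}
    for sid, times in times_by_session.items():
        start, end = min(times), max(times)
        duration = (end - start).total_seconds()
        count = len(times)
        feats[sid] = {
            "username": owner[sid],
            "start": start,
            "end": end,
            "session_duration": duration,
            "action_count": count,
            # Assumption: zero-length sessions get frequency 0 to avoid dividing by zero.
            "action_frequency": count / duration if duration > 0 else 0.0,
            # Assumption: a user's first session has no predecessor, so its gap is 0.
            "session_gap": 0.0,
        }

    # session_gap: seconds between the end of the previous session and the
    # start of the current one, per user, in chronological order.
    sessions_by_user = defaultdict(list)
    for sid, f in feats.items():
        sessions_by_user[f["username"]].append(sid)
    for sids in sessions_by_user.values():
        sids.sort(key=lambda s: feats[s]["start"])
        for prev, cur in zip(sids, sids[1:]):
            gap = feats[cur]["start"] - feats[prev]["end"]
            feats[cur]["session_gap"] = gap.total_seconds()
    return feats


features = session_features(events)
```

On a dataset of the size described here, the same logic would more naturally be expressed as grouped aggregations in pandas or PySpark.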
The final preprocessed dataset contains the following attributes:
| Attribute | Description |
|---|---|
| `username` | Unique identifier for each user. |
| `enroll_id` | Unique enrollment ID per user-course pair. |
| `session_id` | Unique identifier for each session. |
| `course_id` | Unique course identifier. |
| `action` | Type of action performed (e.g., play video, seek video, click info, submit assignment). |
| `time` | Timestamp of the action. |
| `category` | Course category (e.g., Science, Business, Engineering). |
| `session_duration` | Total time spent in a session. |
| `session_gap` | Time difference between consecutive sessions of a user. |
| `action_count` | Total number of actions performed in a session. |
| `action_frequency` | Number of actions performed per second in a session. |
Due to the large size of the dataset, we faced memory constraints. To address this:
- We used Apache Spark and PySpark to process and optimize the dataset efficiently.
- We selected the top 2% most active users, producing the `top_02percent_most_sessions.csv` dataset.
This optimized dataset forms the foundation for our graph-based modeling.
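As a rough illustration of the user selection step, the sketch below ranks users by their number of sessions and keeps the top 2%. Ranking by session count is an assumption suggested by the output file name, and `top_active_users` and the synthetic `session_owners` list are hypothetical names used only for this example.

```python
import math
from collections import Counter


def top_active_users(session_owners, fraction=0.02):
    """Return the most active users, ranked by number of sessions.

    session_owners: iterable of usernames, one entry per session.
    fraction: share of distinct users to keep (0.02 keeps the top 2%).
    """
    counts = Counter(session_owners)
    # Keep at least one user even for very small inputs.
    k = max(1, math.ceil(len(counts) * fraction))
    return [user for user, _ in counts.most_common(k)]


# Synthetic data: 100 users, where user i owns i sessions.
session_owners = [f"user{i}" for i in range(1, 101) for _ in range(i)]
top_users = top_active_users(session_owners)
```

The filtered dataset is then the set of rows whose `username` appears in `top_users`.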
Notebook: Graph_construction_and_modeling_training.ipynb
In this approach, each user session is modeled as a graph, where:
- Nodes represent individual user actions, enriched with contextual features.
- Edges capture the temporal and logical relationships between actions.
- Global session attributes, such as session duration and click patterns, are incorporated to enhance the model.
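A minimal sketch of how one session could be turned into such a graph is shown below. It assumes that each node carries a one-hot encoding of its action type plus the time elapsed since the session started, and that edges simply link each action to the next one in time. The notebook may use richer node features and edge rules; `build_session_graph` and `ACTION_TYPES` are hypothetical names.

```python
# Hypothetical action vocabulary shared by all sessions.
ACTION_TYPES = ["play_video", "seek_video", "click_info"]


def build_session_graph(actions, action_types=ACTION_TYPES):
    """Build (nodes, edges, session_attrs) for one session.

    actions: list of (timestamp_in_seconds, action_type) tuples.
    """
    if not actions:
        raise ValueError("a session needs at least one action")
    ordered = sorted(actions, key=lambda a: a[0])
    type_index = {name: i for i, name in enumerate(action_types)}
    t0 = ordered[0][0]

    # Node features: one-hot action type + seconds since session start.
    nodes = []
    for t, action in ordered:
        one_hot = [0.0] * len(action_types)
        one_hot[type_index[action]] = 1.0
        nodes.append(one_hot + [float(t - t0)])

    # Temporal edges: each action points to the action that follows it.
    edges = [(i, i + 1) for i in range(len(ordered) - 1)]

    # Global (graph-level) attributes of the session.
    session_attrs = {
        "session_duration": float(ordered[-1][0] - t0),
        "action_count": len(ordered),
    }
    return nodes, edges, session_attrs


session = [(100, "play_video"), (105, "seek_video"), (112, "play_video")]
nodes, edges, session_attrs = build_session_graph(session)
```

The `nodes` and `edges` lists map directly onto the node feature matrix and edge index expected by common GNN libraries.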
To process these session graphs, we implemented a Graph Embedder based on GCN layers, which generates latent representations (embeddings) for each session. These embeddings are then passed into a Siamese network that classifies user similarities, leveraging techniques such as:
- Concatenation & absolute difference of session embeddings.
- Multi-layer classifier with ReLU activations and dropout for robust generalization.
- Pairwise classification to differentiate between sessions from the same or different users.
This methodology allows us to detect anomalous sessions and flag potential fraud cases.
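The pairwise setup can be illustrated without any deep learning library. The sketch below builds the classifier input for a pair of session embeddings and labels each pair as same-user (1) or different-user (0). The exact layout of the input vector (both embeddings followed by their element-wise absolute difference) is an assumption, and `pair_features` and `make_pairs` are hypothetical helpers.

```python
from itertools import combinations


def pair_features(emb_a, emb_b):
    """Combine two session embeddings into one classifier input vector."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    abs_diff = [abs(a - b) for a, b in zip(emb_a, emb_b)]
    # Assumed layout: [emb_a, emb_b, |emb_a - emb_b|].
    return list(emb_a) + list(emb_b) + abs_diff


def make_pairs(sessions):
    """Build labeled pairs from a list of (username, embedding) tuples.

    Label 1 means both sessions belong to the same user, 0 otherwise.
    """
    pairs = []
    for (user_a, emb_a), (user_b, emb_b) in combinations(sessions, 2):
        pairs.append((pair_features(emb_a, emb_b), int(user_a == user_b)))
    return pairs


# Toy 2-dimensional embeddings standing in for the Graph Embedder output.
sessions = [
    ("alice", [1.0, -2.0]),
    ("alice", [0.5, 1.0]),
    ("bob", [3.0, 0.0]),
]
pairs = make_pairs(sessions)
```

A session whose embedding is repeatedly classified as "different user" against the account owner's known sessions is a candidate for review.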
Notebook: User_based_modeling_Autoencoder.ipynb
An alternative approach considers one graph per user rather than per session. We selected the top 20 most active users and constructed individual graphs where:
- Nodes represent sessions, actions, and courses.
- Edges model relationships between these entities.
We trained a Graph Autoencoder (GAE) to learn compact latent representations of user behaviors. The autoencoder consists of:
- Encoder: Compresses graph node features using Graph Convolutional Networks (GCN).
- Decoder: Reconstructs the original features from the latent space.
- MSE Loss Function: Measures the reconstruction error to detect anomalies.
Once trained, the model computes an error threshold. When a new user logs in, we compare their graph’s reconstruction error against the threshold to determine if:
- It is the real account owner.
- Someone else is fraudulently accessing the account.
This method is particularly effective in detecting irregular usage patterns, as fraudsters often exhibit different behavioral signatures compared to genuine users.
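The thresholding step could look like the sketch below. How the threshold is actually derived is not stated here, so the rule used (mean of the training reconstruction errors plus `k` standard deviations) is purely an assumption, as are the function names and the example error values.

```python
from statistics import mean, pstdev


def mse(original, reconstructed):
    """Mean squared error between two equally shaped feature matrices."""
    flat_o = [v for row in original for v in row]
    flat_r = [v for row in reconstructed for v in row]
    if len(flat_o) != len(flat_r):
        raise ValueError("matrices must have the same shape")
    return sum((o - r) ** 2 for o, r in zip(flat_o, flat_r)) / len(flat_o)


def fit_threshold(training_errors, k=3.0):
    """Assumed rule: mean + k * standard deviation of the training errors."""
    return mean(training_errors) + k * pstdev(training_errors)


def is_suspicious(error, threshold):
    """Flag a graph whose reconstruction error exceeds the threshold."""
    return error > threshold


# Hypothetical reconstruction errors observed on the owner's known graphs.
training_errors = [0.10, 0.12, 0.11, 0.09, 0.13]
threshold = fit_threshold(training_errors)

# Reconstruction error for a new graph: node features vs. decoder output.
new_error = mse([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [3.0, 2.0]])
flagged = is_suspicious(new_error, threshold)
```

A percentile of the training errors is a common alternative to the mean-plus-deviation rule when the error distribution is skewed.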
- The Mean Squared Error (MSE) loss steadily decreased over 50 epochs, indicating the model successfully learned user behavior patterns.
- The final loss value stabilized, suggesting the model reached a meaningful representation of user activities.
- Successfully differentiated legitimate and suspicious sessions.
- Demonstrated effectiveness in fraud detection by identifying unexpected user behavior.
- Showed strong generalization ability by detecting deviations from learned patterns.
The reconstruction error is computed using the Mean Squared Error (MSE) loss function.
The lower the MSE loss, the better the autoencoder has learned the user behavior. A higher reconstruction error suggests an anomalous behavior, indicating a potential fraudulent activity.
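For reference, the standard definition of MSE applied to node features can be written as follows. The notation is introduced here for illustration only: $X \in \mathbb{R}^{N \times F}$ is the matrix of original node features for a graph with $N$ nodes and $F$ features per node, and $\hat{X}$ is the decoder's reconstruction.

$$
\mathrm{MSE}(X, \hat{X}) = \frac{1}{N F} \sum_{i=1}^{N} \sum_{j=1}^{F} \left( X_{ij} - \hat{X}_{ij} \right)^{2}
$$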
Below is the MSE loss curve over the training epochs:
This project highlights the potential of Graph Neural Networks (GNNs) and Autoencoders for user behavior analysis and fraud detection on MOOC platforms. By modeling interactions as graphs, we can identify deviations from normal activity, which makes these techniques valuable tools for security and authentication in online learning platforms.
Our work extends beyond MOOC fraud detection. We are currently working on a larger project focused on user and host behavior analysis in cybersecurity. This research aims to detect anomalous activities in networks, identify potential threats, and enhance security mechanisms using graph-based AI models.
