For this project, I used a dataset of over 9.5 GB available on Kaggle, containing login records from more than 3 million users over the course of a year.
Originally, this dataset was created for research on Risk-Based Authentication (RBA) systems and synthesized from real-world login behavior of users on a large-scale online service in Norway. The data is fully synthetic, designed to reflect plausible usage patterns without exposing sensitive user information.
While its initial purpose was cybersecurity research, I leveraged the richness and granularity of this dataset to perform frequency and recency analyses, as well as calculate daily, weekly, and monthly active users, adapting it to a context of user analytics and adoption and retention metrics.
This work is based on the dataset published by Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono, under a Creative Commons Attribution 4.0 (CC BY 4.0) license.
To prepare the dataset and adapt it for user analytics, I followed the steps below in Python.
Initial filtering to exclude failed login records, resulting in 12,541,442 records.
Filtering out records associated with potential attacks, leaving 11,736,951 records.
Filtering out records related to account takeovers, obtaining 11,308,156 records.
Selecting only the relevant columns for analysis, reducing the dataset size from 3.1 GB to 2.7 GB.
Grouping by user and date (since only daily logins are relevant), obtaining 8,815,307 records, and calculating Daily Active Users (DAU), distinguishing first-time logins from returning users.
Exploratory data analysis:
Unique users: 2,933,249
Minimum date: 2020-02-03
Maximum date: 2021-02-11
Number of unique dates: 375
Creation of a weekly calendar table based on the detected date range. A daily calendar was avoided due to the enormous volume of records it would have generated (~1,099,968,375 records).
Re-grouping users and their logins by week, generating a per-user weekly calendar (one row per user per week), and merging it with login data. Certain data normalization was applied to facilitate metric calculations.
Calculation of frequency and recency statistics required for the adoption and retention indicators.
Work in progress - project will include:
Total users KPIs
Filters by date period, country, region, device type, browser, OS, etc.
DAU (daily active users) distinguishing "first login" (of a certain user) vs. "previously active".
Recency Categories (logged in this week, this month, this quarter, this year, ever, never) for each user (pie charts).
Weekly and monthly evolution of each category for each user (complete user-calendar analyisis; bar charts pero week/month/etc), as a sofistication of WAU and MAU analysis.