Spotify Listening History App - Ingestion Pipeline

Building a simple Spotify Listening History Ingestion/Aggregation Pipeline

Hello! Welcome back to my personal blog. In this post, I'll be going over how I implemented an app that ingests my Spotify listening history, allowing me to see detailed analytics on what I've been listening to!

I know there are many apps out there that do this, but I wanted the practice, as well as a personalized version with a lot of features that those other sites don't have. For example, these personalized weekly charts that mimic the Hot 100/Billboard 200!

(Note: All UI is in progress😅)

I've built the entire application using AWS, keeping costs as low as I could. Though the project is solely for my own listening history, with a bit of refactoring, I could potentially expand the app to allow other users to create their own accounts and use the app.

For fun, I also added this AI chatbot, which I can ask questions about my listening history, ask it to modify playback/add songs to queue, or even generate playlists for me based on my listening history:

We'll cover that in a later article; this one focuses on how I set up the listening history ingestion pipeline to get the data in the first place. Stay tuned, though, as there are many articles to come about all of the cool features I've implemented so far!

NOTE: This article isn't really meant to be a follow-along tutorial; it's more of a high-level overview of what I did.

Ingestion

There are two main steps to the pipeline - ingestion and aggregation.

AWS has a great service called AWS EventBridge. EventBridge is a serverless event bus service that makes it easy to connect applications using data from your own apps, third-party Software-as-a-Service (SaaS) apps, and AWS services. Think of it as a central hub for routing events (like a scheduled timer, a file being uploaded, or an API call failing) to different targets, like AWS Lambda functions. This allows for the creation of decoupled, event-driven architectures.

Pairing EventBridge with Lambda functions makes for some very easy-to-implement cron jobs. In this case, I set up an ingestion job that runs every hour and calls Spotify's recently-played API endpoint to fetch my recent listening history.
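
Here's a minimal sketch of what that hourly fetch might look like, assuming a hypothetical getAccessToken() helper that handles the Spotify OAuth refresh (the recently-played endpoint requires the user-read-recently-played scope and returns at most 50 plays per call):

// Hypothetical helper that refreshes and returns my stored Spotify access token.
declare function getAccessToken(): Promise<string>;

interface RecentlyPlayedItem {
  played_at: string; // ISO timestamp of the play
  track: { id: string; name: string; artists: { id: string; name: string }[] };
}

async function fetchRecentlyPlayed(afterMs: number): Promise<RecentlyPlayedItem[]> {
  const token = await getAccessToken();
  const url = new URL("https://api.spotify.com/v1/me/player/recently-played");
  url.searchParams.set("limit", "50");            // Spotify caps this endpoint at 50 items
  url.searchParams.set("after", String(afterMs)); // only plays after this Unix ms timestamp

  const res = await fetch(url, { headers: { Authorization: `Bearer ${token}` } });
  if (!res.ok) throw new Error(`Spotify API error: ${res.status}`);
  const body = (await res.json()) as { items: RecentlyPlayedItem[] };
  return body.items;
}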

From there, after a little bit of data cleanup (for example, filtering out songs played through local files, as I only want official artists in my stats, and normalizing artist/genre names), the job uses DynamoDB's BatchWriteItemCommand to write the listening history to my recent-listening-history table, where it will wait to be aggregated.
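
The write itself looks roughly like this (a sketch rather than my exact code, and the item shape is an assumption); BatchWriteItem accepts at most 25 items per request, so larger batches get chunked:

import { DynamoDBClient, BatchWriteItemCommand } from "@aws-sdk/client-dynamodb";
import { marshall } from "@aws-sdk/util-dynamodb";

// Shape of a cleaned-up play after the filtering/normalization described above.
type PlayRecord = {
  played_at: string; // ISO timestamp from Spotify
  track_id: string;
  track_name: string;
  artist_ids: string[];
};

const ddb = new DynamoDBClient({});

async function writeListeningHistory(plays: PlayRecord[]): Promise<void> {
  // BatchWriteItem takes at most 25 write requests per call, so write in chunks.
  for (let i = 0; i < plays.length; i += 25) {
    const chunk = plays.slice(i, i + 25);
    await ddb.send(new BatchWriteItemCommand({
      RequestItems: {
        "recent-listening-history": chunk.map((play) => ({
          PutRequest: {
            Item: marshall({
              played_at: play.played_at,
              track_id: play.track_id,
              track_name: play.track_name,
              artist_ids: play.artist_ids,
            }),
          },
        })),
      },
    }));
  }
}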

DynamoDB is relatively cheap, but to save on costs, it's important to understand the lifecycle of the items in your table. The items in my recent-listening-history table don't need to stay there longer than a month, so I can enable TTL (Time to Live) on the table, set a ttl attribute on each item, and DynamoDB will automatically delete each item shortly after its ttl timestamp passes, with no need for another cron job to scan the table and delete them.
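
Concretely, the ttl value is just an epoch-seconds timestamp set roughly a month out, and telling DynamoDB which attribute holds the expiry is a one-time call. A sketch, using the attribute and table names from above:

import { DynamoDBClient, UpdateTimeToLiveCommand } from "@aws-sdk/client-dynamodb";

// Value stored on each item; DynamoDB expects epoch *seconds*, not milliseconds.
const THIRTY_DAYS_IN_SECONDS = 30 * 24 * 60 * 60;
const ttl = Math.floor(Date.now() / 1000) + THIRTY_DAYS_IN_SECONDS;

// One-time table configuration: tell DynamoDB which attribute to treat as the TTL.
const ddb = new DynamoDBClient({});
await ddb.send(new UpdateTimeToLiveCommand({
  TableName: "recent-listening-history",
  TimeToLiveSpecification: { AttributeName: "ttl", Enabled: true },
}));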

To ensure that the ingestion job can retry and re-fetch items if things go wrong, and to ensure it doesn't ingest the same songs multiple times, I keep a last_ingestion_timestamp in a status DynamoDB table, representing the last time the ingestion job ran successfully. The ingestion job always starts by reading this timestamp and filtering out any plays from before it, and upon successful ingestion into the recent-listening-history DynamoDB table, it updates the timestamp.
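
A simplified sketch of that checkpoint pattern is below; the status table's key name (job) is an assumption on my part:

import { DynamoDBClient, GetItemCommand, PutItemCommand } from "@aws-sdk/client-dynamodb";
import { marshall, unmarshall } from "@aws-sdk/util-dynamodb";

const ddb = new DynamoDBClient({});
const STATUS_TABLE = "status";

// Read the checkpoint; default to 0 so a first run ingests everything available.
async function getLastIngestionTimestamp(): Promise<number> {
  const res = await ddb.send(new GetItemCommand({
    TableName: STATUS_TABLE,
    Key: marshall({ job: "ingestion" }),
  }));
  return res.Item ? (unmarshall(res.Item).last_ingestion_timestamp as number) : 0;
}

// Advance the checkpoint only after the batch write succeeds, so a failed run
// simply retries the same window on its next invocation.
async function updateLastIngestionTimestamp(timestampMs: number): Promise<void> {
  await ddb.send(new PutItemCommand({
    TableName: STATUS_TABLE,
    Item: marshall({ job: "ingestion", last_ingestion_timestamp: timestampMs }),
  }));
}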

Aggregation - Data Modeling

It's important to know the access patterns of your data in order to properly model it. In my case, I want to be able to query my aggregation data, for example, in these ways:

  • What were my most listened to songs by Drake over the last 30 days?

  • How many unique songs did I listen to from October 2025 to November 2025?

  • What does my listening history look like for this specific song over the last year, month by month?

There are obviously a lot of calculations that can be involved with queries like this. For example, since I want to be able to display graphs on the frontend that will show the listening history for specific songs over a time period, I would need to somehow calculate how many times I've listened to that song for each day in the graph. Calculations like these aren't really feasible using a NoSQL database like Dynamo, but they are made a lot simpler using a relational database. In my case, I used PostgreSQL, on AWS RDS.

RDS, or Relational Database Service, is a managed service that makes it easy to set up, operate, and scale a relational database in the AWS Cloud. It automates common administrative tasks like backups, patching, and failure detection. Essentially, it takes care of the infrastructure so you can focus on working with your data, supporting engines like PostgreSQL, MySQL, and MariaDB. In my case, I chose PostgreSQL.

An important thing to note if you are making a small side project: RDS is much more expensive for small loads than Dynamo. This is because it isn't a serverless option; you must provision an instance that runs at all times. It is also a lot more complicated to set up than Dynamo, as you must create a VPC (Virtual Private Cloud) and handle all of the networking for the database itself as well as any compute services that interact with it. This typically ends up costing even more if those compute services also need outbound network access. For example, if the compute services within the private subnets needed to call the Spotify API for any reason, I would have to set up a NAT Gateway, which is a provisioned charge that is quite expensive for a small side project.

For small fun side projects, I try to use Dynamo when I can, as it is much cheaper and much simpler to set up.

In this case, I decided to go with RDS to make the querying much more feasible. I set up a bastion host in a public subnet to allow myself to connect to the RDS database from my local machine. I did this by associating the SSH key pair on my machine with the EC2 instance, and configuring the instance's security group to allow inbound access only on port 22 and outbound access only to the database on port 5432. I also make sure to tear down the instance whenever I'm not using it.
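
If you manage the networking as code, those security group rules amount to something like this CDK sketch (the scope, vpc, dbSecurityGroup, and myIp values are stand-ins, not from my actual setup):

import { Construct } from "constructs";
import * as ec2 from "aws-cdk-lib/aws-ec2";

// Stand-ins for constructs defined elsewhere in the stack.
declare const scope: Construct;
declare const vpc: ec2.Vpc;
declare const dbSecurityGroup: ec2.SecurityGroup;
declare const myIp: string; // e.g. "203.0.113.10/32"

const bastionSg = new ec2.SecurityGroup(scope, "BastionSg", { vpc, allowAllOutbound: false });
bastionSg.addIngressRule(ec2.Peer.ipv4(myIp), ec2.Port.tcp(22), "SSH from my machine only");
bastionSg.addEgressRule(dbSecurityGroup, ec2.Port.tcp(5432), "Outbound only to Postgres");

// The database's security group lets the bastion in on the Postgres port.
dbSecurityGroup.addIngressRule(bastionSg, ec2.Port.tcp(5432), "Allow bastion to reach RDS");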

Using the bastion host, I can connect to the database using a PostgreSQL client like SQLectron, and run some commands like:

-- Table to store daily aggregated play counts for tracks

CREATE TABLE IF NOT EXISTS daily_track_aggregates (
    date DATE NOT NULL,
    track_id VARCHAR(255) NOT NULL,
    artist_id VARCHAR(255) NOT NULL,
    album_id VARCHAR(255) NOT NULL,
    daily_play_count INTEGER DEFAULT 0,
    PRIMARY KEY (date, track_id)
);

-- Index for fast queries by artist within a date range
CREATE INDEX IF NOT EXISTS idx_daily_track_artist ON daily_track_aggregates (artist_id, date);
-- Index for fast queries by album within a date range
CREATE INDEX IF NOT EXISTS idx_daily_track_album ON daily_track_aggregates (album_id, date);

-- Table to store daily aggregated play counts for artists

CREATE TABLE IF NOT EXISTS daily_artist_aggregates (
    date DATE NOT NULL,
    artist_id VARCHAR(255) NOT NULL,
    daily_play_count INTEGER DEFAULT 0,
    PRIMARY KEY (date, artist_id)
);

-- Index for fast queries by artist over a date range
CREATE INDEX IF NOT EXISTS idx_daily_artist_id ON daily_artist_aggregates (artist_id, date);

-- Table to store daily aggregated play counts for albums
CREATE TABLE IF NOT EXISTS daily_album_aggregates (
    date DATE NOT NULL,
    album_id VARCHAR(255) NOT NULL,
    artist_id VARCHAR(255) NOT NULL,
    daily_play_count INTEGER DEFAULT 0,
    PRIMARY KEY (date, album_id)
);

These aren't the final schemas for the tables, as I added more fields like genre later down the line, but this should give an idea of how the data is modeled. I also made similar tables for monthly and yearly aggregates.

An important thing to note - if your compute services are running in a private subnet and don't require outbound internet access (no NAT Gateway), but do require access to DynamoDB or S3 (in my case, my aggregation function needs to access the recent-listening-history table), you should create a VPC Gateway Endpoint. This allows traffic to travel directly from your Lambda function over the private AWS network to the DynamoDB service. It's important to use the VPC Gateway Endpoint because it is free! AWS only offers gateway endpoints for DynamoDB and S3; for other services like SNS or Secrets Manager, you must use a VPC Interface Endpoint, which costs money to provision.
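
With the AWS CDK, for instance, adding the free DynamoDB gateway endpoint to a VPC is a one-liner (a sketch; vpc stands in for your existing VPC construct, and the S3 endpoint works the same way):

import * as ec2 from "aws-cdk-lib/aws-ec2";

declare const vpc: ec2.Vpc; // stand-in for the VPC defined elsewhere in the stack

// Adds route table entries so private-subnet Lambdas reach DynamoDB over the
// AWS network, with no NAT Gateway and no per-hour endpoint charge.
vpc.addGatewayEndpoint("DynamoEndpoint", {
  service: ec2.GatewayVpcEndpointAwsService.DYNAMODB,
});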

Aggregation Job

Using EventBridge once again, I configured a cron job to run every 4 hours to aggregate all of the data ingested by the ingestion job.
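
As a sketch, the schedule is just an EventBridge rule targeting the aggregation Lambda (shown here with the AWS CDK; scope and aggregationFn are stand-ins, and the hourly ingestion rule looks the same with a one-hour rate):

import { Duration } from "aws-cdk-lib";
import { Construct } from "constructs";
import * as events from "aws-cdk-lib/aws-events";
import * as targets from "aws-cdk-lib/aws-events-targets";
import * as lambda from "aws-cdk-lib/aws-lambda";

declare const scope: Construct;
declare const aggregationFn: lambda.IFunction; // the aggregation Lambda handler

// EventBridge invokes the aggregation Lambda on a fixed 4-hour schedule.
new events.Rule(scope, "AggregationSchedule", {
  schedule: events.Schedule.rate(Duration.hours(4)),
  targets: [new targets.LambdaFunction(aggregationFn)],
});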

Similar to the ingestion job, it also uses a last-aggregation-timestamp to make sure it doesn't re-aggregate data.

The aggregation job reads everything in the recent-listening-history table since the last aggregation timestamp, rolls the plays up into per-day counts, and then adds/updates the corresponding rows in the RDS aggregation tables for all associated songs, artists, and albums.
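
Here's a simplified sketch of that upsert using the node-postgres client (the in-memory grouping and names like DailyTrackCount are my assumptions, not the actual code); the ON CONFLICT clause is what lets a later run add to an existing day's count instead of violating the (date, track_id) primary key:

import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the standard PG* environment variables

type DailyTrackCount = {
  date: string;      // "YYYY-MM-DD"
  trackId: string;
  artistId: string;
  albumId: string;
  playCount: number; // plays counted for this track on this day since the last run
};

async function upsertDailyTrackAggregates(rows: DailyTrackCount[]): Promise<void> {
  for (const row of rows) {
    await pool.query(
      `INSERT INTO daily_track_aggregates (date, track_id, artist_id, album_id, daily_play_count)
       VALUES ($1, $2, $3, $4, $5)
       ON CONFLICT (date, track_id)
       DO UPDATE SET daily_play_count = daily_track_aggregates.daily_play_count + EXCLUDED.daily_play_count`,
      [row.date, row.trackId, row.artistId, row.albumId, row.playCount]
    );
  }
}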

At this point, my backend API can now query the RDS instance to get clean, aggregated data about the listening history. For example:

SELECT
    dua.date,
    dua.daily_play_count
FROM
    daily_artist_aggregates dua
WHERE
    dua.artist_id = '<drake_artist_id>' -- Artist ID for Drake
    AND dua.date >= current_date - interval '7 days'
ORDER BY
    dua.date;

Ingestion Pipeline Done!

Now that the core feature of the app is done, I can show cool graphs like this:

Some of the apps I use for Spotify listening history analysis, like last.fm, can't provide this kind of time-series listening history for specific songs or artists. So this is cool!

There are a lot of other cool features I added for this application that are unique to this app (like the personalized song chart/Hot 100 feature you see above), and in the next articles we'll go over how I did those things. Stay tuned and thanks for reading!