
Anatomy of a recommendation system


At a previous workplace, we created a recommendation engine to feed into an adaptive learning system for school students. This descriptive post was never polished enough to go on the technical blog, so here is the draft.

Background

The client operates a learning platform for mid to high school students, with more than a million active users and growing. The platform lets them browse video and text content, ask doubts or practice lesson exercises. A big chunk of students use the app for test practice, which amounts to as much as half of the total engagement. We want to improve this experience.

Students pick a specific lesson they want to work through and the platform recommends questions for them to solve one after the other. Picking questions at random doesn't work well, since questions that are far too easy or far too hard do little for either engagement or learning.

We conjecture that adapting the exercise experience to each student based on their prior history can increase both engagement and learning outcomes. To that end, we designed an adaptive question recommendation system: given a user and a list of questions to choose from, it suggests the ones they should attempt next. The API specification is kept general, and the system is deployed as a separate service, so that it can be reused beyond this particular exercise flow.
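
For concreteness, a request and response might look roughly like the sketch below; the field names are illustrative, not the actual contract.

# Illustrative payloads; field names are assumptions, not the real API contract.
request = {
    "user_id": "student-123",
    "candidate_question_ids": ["q1", "q2", "q3"],  # questions in the chosen lesson
}
response = {
    "recommended_question_ids": ["q2", "q3"],  # best suggestions first
}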

Recommendation system

We implement a web API in Python that the main platform backend calls repeatedly as a student works through an exercise session. Since the service does not run on device, the time budget for generating suggestions is tight, and we aim to keep the average latency around 100 ms. The lifecycle of a single request splits into three components, developed with little coupling between them; a sketch of how they fit together follows the list.

  1. Fetching data
    • Pull the list of questions to select from, plus all the relevant user history and context.
    • We settled on a Postgres database fronted by a Redis cache, which has served us well.
  2. Recommendation model
    • Compute/provide enough information to evaluate how likely the user is to solve each question.
    • We train a collaborative filtering model on all previous student attempts, and make partial updates to it during a session.
  3. Recommendation strategy
    • Use the model output and possibly more contextual data for the user to generate the final output.
    • We converged on a policy of suggesting questions where the predicted probability of the student solving them correctly is close to 50%, with fallbacks.

We go over these one at a time.
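
Before diving into each piece, here is a minimal sketch of how the three components fit together in the request handler. All names and placeholder bodies below are illustrative, not the actual service code.

from dataclasses import dataclass, field

@dataclass
class RequestContext:
    user_id: str
    candidate_question_ids: list[str]
    attempts: list[dict] = field(default_factory=list)  # relevant attempt history

def fetch_context(user_id: str, question_ids: list[str]) -> RequestContext:
    # 1. Fetching data: candidate questions plus the user's relevant history.
    return RequestContext(user_id, question_ids)

def score_questions(ctx: RequestContext) -> dict[str, float]:
    # 2. Recommendation model: P(user answers correctly) for each candidate.
    return {qid: 0.5 for qid in ctx.candidate_question_ids}

def recommend(user_id: str, question_ids: list[str]) -> list[str]:
    # 3. Recommendation strategy: rank candidates by closeness to the 50% target.
    ctx = fetch_context(user_id, question_ids)
    scores = score_questions(ctx)
    return sorted(scores, key=lambda qid: abs(scores[qid] - 0.5))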

Fetching data

The incoming API request from the platform's backend specifies the active user and the content to select questions from. We store all the relevant information in a separate datastore managed by us and make no other API calls.

We experimented with a bunch of different storage choices and data models before selecting Postgres.
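
As a rough sketch of what a read looks like with that combination, using a cache-aside pattern (the table, column, and cache-key names here are assumptions, not our actual schema):

import json

import psycopg  # assumes psycopg 3
import redis

cache = redis.Redis()
CACHE_TTL_SECONDS = 300

def fetch_attempts(conn: psycopg.Connection, question_ids: list[str]) -> list[dict]:
    # Cache-aside: serve from Redis when possible, fall back to Postgres.
    key = "attempts:" + ",".join(sorted(question_ids))
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    with conn.cursor() as cur:
        cur.execute(
            "SELECT user_id, question_id, is_correct"
            " FROM attempts WHERE question_id = ANY(%s)",
            (question_ids,),
        )
        rows = [
            {"user_id": u, "question_id": q, "is_correct": c}
            for (u, q, c) in cur.fetchall()
        ]
    cache.set(key, json.dumps(rows), ex=CACHE_TTL_SECONDS)
    return rows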

ML model

The objective here is to score each of the questions available for the user to attempt. We model the score as the probability of the user getting the question correct.

Input data

To make a prediction for each student-question pair, we have to select relevant data to feed in. Right now, we use the previous attempts made by users on the platform. The attempt history is useful since it indicates both how strong a student is and how difficult a question is.

Parsing through all the attempts ever made would take forever, so we want to work with only a subset of the attempt logs for each student-question pair. We introduce the constraint that this filter can only be specified as a set of question units. In pseudo code,

filter_mapping: question -> list[QuestionUnit]  # each QuestionUnit is a set of question ids

def filter_attempts(attempts, question):
    relevant_ids = {
        question_id
        for unit in filter_mapping[question]
        for question_id in unit
    }
    return [a for a in attempts if a.question_id in relevant_ids]

An important thing to note here is that the set of questions in the API request can differ from the set of questions the model learns from. The same question can come up in different contexts, and we would still score it the same way for a given user. This gives us flexibility in training the recommendation model: for questions belonging to a new concept with little history available, we can train over all attempts made in the full chapter, while for sections that already have a large pool of questions to learn from, we can prune the attempt logs.

This setup is flexible enough for us right now, though we might want to tweak it in the future: for example, using only attempts made in the last year for inference, or grouping only users studying the same curriculum together, and so forth.
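
As a toy illustration of this flexibility (the ids and units below are made up), a sparsely attempted question can map to a broad unit while a well-covered one maps to a narrow unit:

# Toy units: each QuestionUnit is just a set of question ids.
kinematics_chapter = {"q1", "q2", "q3", "q_new"}  # broad unit: the whole chapter
graph_questions = {"q2", "q3"}                    # narrow unit: one exercise set

filter_mapping = {
    "q_new": [kinematics_chapter],  # little history yet: learn from the full chapter
    "q2": [graph_questions],        # plenty of history: prune to the narrow unit
}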

Model training

We want to train a model that adapts as a student progresses through a concept.

Real time prediction

Going from attempt logs to the probability of a user solving a question is still not very tractable. For instance, a chapter typically has a few thousand questions, and each question has been attempted by tens of thousands of users, so a single API request could mean processing tens of millions of attempts.

Learn periodically?

To get around this, one option is to train the ML model periodically, say every morning, and cache the results to use through the day. However, this puts us at odds with making the learning experience adaptive. For instance, if the student has solved three consecutive physics problems about pulleys and blocks, you want to challenge them by asking something like: what about the same problem, but now a monkey is pulling down on the rope at one end and the pulley is inside a spaceship leaving Earth at one tenth the speed of light?

This is to say, to keep the student engaged, the model output must adjust in real time as the platform learns more about them.

Compromise

After some iterations, we converged on models of a particular form: embedding models. Treating attempts as interactions between students and questions, the signature of such a model is,

Embedding model
  • train input: interactions between M users × N items
  • train output: M + N outputs, one for each user and each item
  • inference for pair (user A, item B): f(output-A, output-B)

Embedding models produce output of linear size, M + N, compared to the general case where it would be quadratic, M × N. To be clear, the prediction for each student-question pair is still unique; the model just imposes some structure on it. The per-user and per-question outputs are called user embeddings and question embeddings, respectively. This lets us balance caching part of the model output against making real time updates to the prediction in an interesting way.
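
For instance, one common choice for f in such models (not necessarily the exact one we use) is a dot product between the two embeddings, squashed into a probability:

import numpy as np

def predict_p_correct(user_emb: np.ndarray, question_emb: np.ndarray) -> float:
    # f(output-A, output-B): dot-product affinity turned into a probability.
    logit = float(user_emb @ question_emb)
    return 1.0 / (1.0 + np.exp(-logit))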

Consider students attempting the chapter on Kinematics, and an embedding model trained on the corresponding attempt history. The relevant attempt logs for predicting student A's performance on question B from the chapter can be split into two sets

  1. performance of student A in this chapter
  2. performance of all other students on question B

There is a clear asymmetry here: questions are persistent while users are not. We conjecture that once question B has accumulated considerable history, each additional student attempt affects its embedding only a little. However, as student A works through the chapter, their embedding should evolve rapidly. Thus, while the information content in set-2 is stable, that in set-1 changes during a user session, and our model needs to adapt to it.

Model workflow

Armed with this assumption, we posit that the question embeddings can be learnt periodically while the user embeddings are evaluated live. The full sequence of steps for any given question unit is:

  1. Periodically run a full training pass over the unit's attempt history and cache the resulting question embeddings.
  2. When a recommendation request shows up for a student working on the unit, incrementally re-fit that student's embedding on their latest attempts, keeping the cached question embeddings fixed.
  3. Score each candidate question as f(user embedding, question embedding) and pass the scores on to the recommendation strategy.

The embedding model setup ensures that incremental training is inexpensive and can be done in real time, while approximating the output of a full training fairly well. By running the full training step at least once every 24 hours, we further ensure that the deviation stays small. This lets us neatly capture a student's progress through the day.
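
As a sketch of that incremental step (the function name, loss, and hyperparameters are illustrative): hold the cached question embeddings fixed and re-fit only this student's embedding on their latest attempts with a few gradient steps.

import numpy as np

def refit_user_embedding(question_embs: dict[str, np.ndarray],
                         attempts: list[tuple[str, bool]],  # (question_id, was_correct)
                         lr: float = 0.1, steps: int = 50) -> np.ndarray:
    # Question embeddings stay fixed; only the user's vector is updated,
    # so this fits comfortably within the request's latency budget.
    dim = len(next(iter(question_embs.values())))
    u = np.zeros(dim)
    for _ in range(steps):
        grad = np.zeros(dim)
        for qid, correct in attempts:
            q = question_embs[qid]
            p = 1.0 / (1.0 + np.exp(-(u @ q)))  # predicted P(correct)
            grad += (p - float(correct)) * q     # gradient of log-loss w.r.t. u
        u -= lr * grad / max(len(attempts), 1)
    return u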

Recommendation strategy

This component takes the student-question scores computed by the ML model and selects which question to recommend. We experimented with different policies by running A/B tests and assessing the impact on student engagement. However, gathering feedback from students as well as the content creators working on the platform turned out to be more useful, and it indicated a clear preference for exercises in the mid difficulty range.
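
A minimal version of this policy might look like the sketch below; the specific fallback shown is only an illustration of the idea, not our actual rule.

def choose_next_question(scores: dict[str, float],
                         already_attempted: set[str],
                         target: float = 0.5) -> str | None:
    # Prefer unseen questions whose predicted P(correct) is closest to 50%.
    candidates = {q: p for q, p in scores.items() if q not in already_attempted}
    if not candidates:
        return None
    # Illustrative fallback: if everything looks too hard, serve the easiest one.
    if all(p < 0.2 for p in candidates.values()):
        return max(candidates, key=candidates.get)
    return min(candidates, key=lambda q: abs(candidates[q] - target))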

The strategy module also gives us room to add new behavior without tweaking the rest of the response pipeline.

System diagram

The recommendation system lives across three process boundaries.

reco-system-flowchart.svg

Other considerations