The advertising dataset

Feature-Rich Recommender Systems

CTR Prediction

Pure (user, item) collaborative filtering breaks down at cold start: new users and new items have no interaction history. Real recommenders integrate side features: item attributes, user profiles, time, device, query context, …

This deck sets up the online advertising click-through rate (CTR) prediction problem: predict click probability from a sparse vector of categorical features. This is feature-rich recommendation in its purest form. The next two decks (FM and DeepFM) train models on this loader.

from collections import defaultdict
from d2l import torch as d2l
import torch
import os

The file is tab-separated despite its .csv extension: each row holds a binary click label in the first column, followed by 34 categorical fields. Once one-hot encoded, the features are extremely sparse; think “1 of 10000 in each field”:

d2l.DATA_HUB['ctr'] = (d2l.DATA_URL + 'ctr.zip',
                       'e18327c48c8e8e5c23da714dd614e390d369843f')

data_dir = d2l.download_extract('ctr')
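
To make the row layout concrete, here is a minimal sketch that splits one row the same way the loader in this deck does. The row and its values (`site_17` and so on) are made up for illustration and are not taken from the dataset; only the layout (label first, then tab-separated categorical fields) follows the text above.

```python
# Hypothetical row: the values are invented; only the layout (label first,
# then tab-separated categorical fields) follows the loader in this deck.
NUM_FIELDS = 3  # the real dataset uses 34 fields
line = "1\tsite_17\tdevice_2\tslot_5\n"

values = line.rstrip("\n").split("\t")
# The loader skips any row that does not have 1 label + NUM_FIELDS values.
assert len(values) == NUM_FIELDS + 1

label = float(values[0])  # click label, stored as a float
fields = values[1:]       # raw categorical strings, one per field

print(label, fields)
```

The fields stay as raw strings at this point; turning them into integer indices is the job of the dataset wrapper.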

Dataset wrapper

Build per-field vocabularies (values seen fewer than min_threshold times share one default index per field), encode each row as a vector of global feature indices, yield (features, label) pairs:

class CTRDataset(torch.utils.data.Dataset):
    """Encode tab-separated CTR rows as global categorical feature indices."""

    def __init__(self, data_path, feat_mapper=None, defaults=None,
                 min_threshold=4, num_feat=34):
        self.NUM_FEATS, self.count, self.data = num_feat, 0, {}
        # feat_cnts[field][value] = number of rows containing that value
        feat_cnts = defaultdict(lambda: defaultdict(int))
        self.feat_mapper, self.defaults = feat_mapper, defaults
        self.field_dims = torch.zeros(self.NUM_FEATS, dtype=torch.long)
        with open(data_path) as f:
            for line in f:
                instance = {}
                values = line.rstrip('\n').split('\t')
                # Skip malformed rows: expect 1 label + NUM_FEATS fields
                if len(values) != self.NUM_FEATS + 1:
                    continue
                instance['y'] = [float(values[0])]  # click label
                for i in range(1, self.NUM_FEATS + 1):
                    feat_cnts[i][values[i]] += 1
                    instance.setdefault('x', []).append(values[i])
                self.data[self.count] = instance
                self.count += 1
        # Build vocabularies only when none are passed in, so that another
        # split can reuse an existing feat_mapper and defaults
        if self.feat_mapper is None and self.defaults is None:
            # Keep values that occur at least min_threshold times
            feat_mapper = {i: {feat for feat, c in cnt.items() if c >=
                               min_threshold} for i, cnt in feat_cnts.items()}
            self.feat_mapper = {
                i: {feat_v: idx
                    for idx, feat_v in enumerate(sorted(feat_values))}
                for i, feat_values in feat_mapper.items()}
            # Rarer or unseen values map to one extra index per field
            self.defaults = {i: len(feat_values)
                             for i, feat_values in feat_mapper.items()}
        for i, fm in self.feat_mapper.items():
            self.field_dims[i - 1] = len(fm) + 1  # +1 for the default index
        # offsets[i] = total size of all fields before field i
        self.offsets = torch.tensor(
            (0, *torch.cumsum(self.field_dims, dim=0).numpy()[:-1]))

    def __len__(self):
        return self.count

    def __getitem__(self, idx):
        # Per-field index (or the field's default), shifted to a global index
        feat = torch.tensor([self.feat_mapper[i + 1].get(v, self.defaults[i + 1])
                             for i, v in enumerate(self.data[idx]['x'])])
        return feat + self.offsets, torch.tensor(self.data[idx]['y'])

train_data = CTRDataset(os.path.join(data_dir, 'train.csv'))
train_data[0]
(tensor([ 143,  144,  227,  231,  957, 1250, 1471, 1566, 1624, 1736, 2008, 2061,
         2225, 2304, 2305, 2360, 2745, 2746, 2747, 2748, 2892, 2988, 3165, 3182,
         3194, 3195, 3279, 3651, 3687, 3708, 3722, 3751, 3786, 3801]),
 tensor([1.]))
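
The indices in this output increase from left to right because each field's local index is shifted by the total size of the fields before it. The sketch below reproduces that logic on a toy example in plain Python, without torch. The rows, the two-field layout, and the threshold of 2 are assumptions chosen for illustration; the names `vocabs`, `defaults`, `field_dims`, `offsets`, and `encode` loosely mirror the class above but are not part of it.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical toy rows: two categorical fields per row (labels omitted).
rows = [
    ["a", "x"],
    ["a", "y"],
    ["b", "x"],
    ["a", "x"],
]
min_threshold = 2  # keep values seen at least this many times

num_fields = len(rows[0])
counts = [Counter(row[i] for row in rows) for i in range(num_fields)]

# Per-field vocabulary of frequent values, indexed in sorted order.
vocabs = [
    {v: idx for idx, v in enumerate(
        sorted(v for v, c in cnt.items() if c >= min_threshold))}
    for cnt in counts
]
# Rare or unseen values share one extra "default" slot at the end of a field.
defaults = [len(vocab) for vocab in vocabs]
field_dims = [len(vocab) + 1 for vocab in vocabs]
# Offset of field i = total size of all fields before it.
offsets = [0, *accumulate(field_dims)][:-1]


def encode(row):
    """Map raw field values to global indices (local index + field offset)."""
    return [vocabs[i].get(v, defaults[i]) + offsets[i]
            for i, v in enumerate(row)]


print(field_dims, offsets)   # [2, 2] [0, 2]
print(encode(["a", "x"]))    # [0, 2]
print(encode(["b", "y"]))    # [1, 3]  (both values are rare -> default slots)
```

Because the global indices of different fields never collide, a downstream model can keep a single embedding table of size `sum(field_dims)` rather than one table per field.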

Recap

  • CTR prediction = binary classification on sparse categorical features.
  • Side features handle cold start; pure collaborative filtering can’t.
  • Output of this deck: indexed-categorical mini-batches the FM / DeepFM decks consume.
  • Real-world systems extend this with continuous features, multi-task heads, and embedding tables on the order of billions of entries.
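
As a minimal sketch of the "indexed-categorical mini-batches" mentioned in the recap, a standard `torch.utils.data.DataLoader` can batch (features, label) pairs. This example uses a small stand-in `TensorDataset` with made-up indices instead of `CTRDataset`, so that it runs without downloading the data; the batch size of 2 is arbitrary.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for CTRDataset: 5 rows, 3 fields, values already global indices.
# All numbers here are invented for illustration.
features = torch.tensor([[0, 2, 5],
                         [1, 3, 5],
                         [0, 4, 6],
                         [1, 2, 6],
                         [0, 3, 5]])
labels = torch.tensor([[1.], [0.], [0.], [1.], [0.]])
toy_data = TensorDataset(features, labels)

# batch_size=2 is an arbitrary choice for this sketch.
loader = DataLoader(toy_data, batch_size=2, shuffle=False)

X, y = next(iter(loader))
print(X.shape, y.shape)  # torch.Size([2, 3]) torch.Size([2, 1])
```

Each batch is an integer tensor of shape (batch size, number of fields) plus a float label tensor of shape (batch size, 1), which matches the per-example shapes that `CTRDataset.__getitem__` returns.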