Deep Factorization Machines

DeepFM

DeepFM (Guo et al., 2017) combines a factorization machine and a deep MLP that share one embedding table.

FM branch — linear + pairwise bilinear interactions (same as the previous deck).
Deep branch — concat all field embeddings, feed to an MLP. Captures high-order nonlinear interactions that the bilinear FM misses.
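A minimal sketch of the two branches' use of the same embeddings, in plain NumPy rather than MXNet. The function names and toy sizes are illustrative assumptions, not part of the deck. It checks that the explicit sum over field pairs equals the square-of-sum shortcut, and shows that the deep branch simply flattens the same embeddings.

```python
import numpy as np

def fm_pairwise_naive(embed):
    """Sum of <v_i, v_j> over all field pairs i < j, for one example.

    embed: array of shape (num_fields, num_factors).
    """
    num_fields = embed.shape[0]
    total = 0.0
    for i in range(num_fields):
        for j in range(i + 1, num_fields):
            total += float(embed[i] @ embed[j])
    return total

def fm_pairwise_fast(embed):
    """Same quantity via 0.5 * sum((sum_i v_i)^2 - sum_i v_i^2)."""
    square_of_sum = embed.sum(axis=0) ** 2
    sum_of_square = (embed ** 2).sum(axis=0)
    return 0.5 * float((square_of_sum - sum_of_square).sum())

def deep_input(embed):
    """Concatenate all field embeddings into one vector for the MLP."""
    return embed.reshape(-1)

rng = np.random.default_rng(0)
embed = rng.normal(size=(4, 3))  # 4 fields, 3 factors (toy sizes)

naive = fm_pairwise_naive(embed)
fast = fm_pairwise_fast(embed)
mlp_in = deep_input(embed)
print(np.isclose(naive, fast), mlp_in.shape)
```

The shortcut costs O(fields x factors) instead of O(fields^2 x factors), which is why the FM branch stays cheap even with many fields.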

Final prediction: \sigma(\hat y^{(FM)} + \hat y^{(DNN)}). The whole model is trained end to end. This became a widely used template for CTR models after 2017: explicit interaction terms plus learned nonlinear feature mixing.

Architecture

Shared embeddings feed both the FM head and the deep MLP head:

Figure: DeepFM architecture, with shared field embeddings feeding both an FM branch and a deep MLP branch.

\hat y = \sigma(\hat y^{(FM)} + \hat y^{(DNN)})
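The two terms can be expanded as follows. This is a sketch under assumed notation that the deck does not define: f is the number of fields, x_i is the active feature id in field i, \mathbf{e}_i is its row in the shared embedding table, w_{x_i} is its first-order weight, and L is the number of hidden layers. Dropout is omitted, and the implementation on the next slide additionally passes the first-order sum through a one-unit dense layer.

```latex
\hat y^{(FM)} = w_0 + \sum_{i=1}^{f} w_{x_i}
              + \sum_{i=1}^{f} \sum_{j=i+1}^{f} \langle \mathbf{e}_i, \mathbf{e}_j \rangle

\mathbf{a}^{(0)} = [\mathbf{e}_1; \mathbf{e}_2; \dots; \mathbf{e}_f]

\mathbf{a}^{(l)} = \mathrm{ReLU}\left(\mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\right),
\quad l = 1, \dots, L

\hat y^{(DNN)} = \mathbf{w}^{\top} \mathbf{a}^{(L)} + b
```

Both \hat y^{(FM)} and \hat y^{(DNN)} read the same \mathbf{e}_i, so gradients from both branches update the same embedding table.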

Implementation

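from mxnet import np, npx
from mxnet.gluon import nn

npx.set_np()  # enable the NumPy-compatible interface used below

class DeepFM(nn.Block):
    def __init__(self, field_dims, num_factors, mlp_dims, drop_rate=0.1):
        super(DeepFM, self).__init__()
        num_inputs = int(sum(field_dims))
        # One embedding table shared by the FM and deep branches.
        self.embedding = nn.Embedding(num_inputs, num_factors)
        # First-order (linear) weight for every feature value.
        self.fc = nn.Embedding(num_inputs, 1)
        self.linear_layer = nn.Dense(1, use_bias=True)
        input_dim = self.embed_output_dim = len(field_dims) * num_factors
        self.mlp = nn.Sequential()
        for dim in mlp_dims:
            self.mlp.add(nn.Dense(dim, 'relu', True, in_units=input_dim))
            self.mlp.add(nn.Dropout(rate=drop_rate))
            input_dim = dim
        self.mlp.add(nn.Dense(in_units=input_dim, units=1))

    def forward(self, x):
        # x: (batch, num_fields) integer feature ids.
        embed_x = self.embedding(x)  # (batch, num_fields, num_factors)
        # Pairwise term via the square-of-sum minus sum-of-squares identity.
        square_of_sum = np.sum(embed_x, axis=1) ** 2
        sum_of_square = np.sum(embed_x ** 2, axis=1)
        # Flatten the same embeddings as input to the MLP.
        inputs = np.reshape(embed_x, (-1, self.embed_output_dim))
        # Linear + pairwise + deep terms, each of shape (batch, 1).
        x = self.linear_layer(self.fc(x).sum(1)) \
            + 0.5 * (square_of_sum - sum_of_square).sum(1, keepdims=True) \
            + self.mlp(inputs)
        # Pre-sigmoid logit; the sigmoid is applied inside the training loss.
        return x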
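The single table of size sum(field_dims) only works if ids from different fields never collide. The sketch below, in plain NumPy, assumes the dataset shifts each field's local ids by a cumulative offset before they reach the model; the variable names and toy sizes are hypothetical. It then traces the tensor shapes that the forward pass above produces.

```python
import numpy as np

# Vocabulary size per field (toy values, not from the CTR dataset).
field_dims = np.array([3, 5, 2])
# Cumulative offsets so each field owns a disjoint slice of the table.
offsets = np.concatenate(([0], np.cumsum(field_dims)[:-1]))

# Two examples with per-field local ids.
raw = np.array([[2, 4, 1],
                [0, 0, 0]])
x = raw + offsets  # global row indices into the shared table

num_factors = 4
rng = np.random.default_rng(0)
table = rng.normal(size=(field_dims.sum(), num_factors))

embed_x = table[x]  # (batch, num_fields, num_factors)

# FM pairwise term, one scalar per example.
square_of_sum = embed_x.sum(axis=1) ** 2
sum_of_square = (embed_x ** 2).sum(axis=1)
fm_term = 0.5 * (square_of_sum - sum_of_square).sum(axis=1, keepdims=True)

# Deep branch input: the same embeddings, flattened per example.
mlp_inputs = embed_x.reshape(-1, len(field_dims) * num_factors)

print(x.tolist(), embed_x.shape, fm_term.shape, mlp_inputs.shape)
```

With these toy sizes the global ids are [[2, 7, 9], [0, 3, 8]], and the shapes are (2, 3, 4), (2, 1) and (2, 12) respectively.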

Training

Same CTR pipeline as the FM deck — only the model changes:

import os
from mxnet import gluon, init
from d2l import mxnet as d2l

batch_size = 2048
data_dir = d2l.download_extract('ctr')
train_data = d2l.CTRDataset(os.path.join(data_dir, 'train.csv'))
test_data = d2l.CTRDataset(os.path.join(data_dir, 'test.csv'),
                           feat_mapper=train_data.feat_mapper,
                           defaults=train_data.defaults)
field_dims = train_data.field_dims
train_iter = gluon.data.DataLoader(
    train_data, shuffle=True, last_batch='rollover', batch_size=batch_size,
    num_workers=d2l.get_dataloader_workers())
test_iter = gluon.data.DataLoader(
    test_data, shuffle=False, last_batch='rollover', batch_size=batch_size,
    num_workers=d2l.get_dataloader_workers())
devices = d2l.try_all_gpus()
net = DeepFM(field_dims, num_factors=10, mlp_dims=[30, 20, 10])
net.initialize(init.Xavier(), ctx=devices)
lr, num_epochs, optimizer = 0.01, 30, 'adam'
trainer = gluon.Trainer(net.collect_params(), optimizer,
                        {'learning_rate': lr})
loss = gluon.loss.SigmoidBinaryCrossEntropyLoss()
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)

loss 0.011, train acc 0.997, test acc 0.934
39974.0 examples/sec on [gpu(0)]
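The model's forward pass returns the pre-sigmoid sum, and the sigmoid in the prediction formula is applied inside the loss (Gluon's SigmoidBinaryCrossEntropyLoss expects logits by default). A minimal NumPy sketch of what such a loss computes is shown below, using the standard numerically stable form. The helper names are illustrative assumptions, not Gluon's API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_logits(logits, labels):
    """Mean binary cross-entropy computed directly from logits.

    Uses max(z, 0) - z * y + log(1 + exp(-|z|)), which avoids
    overflow for large |z|.
    """
    z = np.asarray(logits, dtype=np.float64)
    y = np.asarray(labels, dtype=np.float64)
    per_example = np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))
    return per_example.mean()

logits = np.array([2.0, -1.0, 0.0])   # stand-ins for y_FM + y_DNN
labels = np.array([1.0, 0.0, 1.0])    # click / no click

loss_stable = bce_from_logits(logits, labels)

# Reference: apply the sigmoid first, then the textbook cross-entropy.
p = sigmoid(logits)
loss_naive = -(labels * np.log(p) + (1 - labels) * np.log(1 - p)).mean()

print(np.isclose(loss_stable, loss_naive))
```

Applying a sigmoid in the model as well would squash the logits twice, so the model output is left unsquashed during training and the sigmoid is applied only when a click probability is needed.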

Recap

DeepFM = FM (low-order) + deep MLP (high-order), sharing the same embedding table.
Same input format as FM; one extra branch.
A member of the wide/deep interaction-model family: explicit low-order terms plus a nonlinear feature mixer and a sigmoid head.
Unlike retrieval architectures that score independently encoded users and items, DeepFM fuses all impression features before scoring one candidate.