%matplotlib inline
from d2l import torch as d2l
import torch
from torch import nn
import pandas as pd
House Prices: Advanced Regression Techniques. Predict sale prices of Ames, Iowa houses from 80 numeric and categorical features. A small but realistic end-to-end ML exercise.
What makes it interesting:
The MLP is 5 lines. The lesson is the pipeline around it — preprocessing, the right loss, CV, submission.
Kaggle hosts open ML competitions. Download train + test CSVs, train locally, upload predictions, get scored on a held-out portion of the test set.
Kaggle competition page.
The competition’s data tab — download and inspect.
Real-world ML practice: data isn’t preprocessed for you, the leaderboard tells you instantly if you’re better than baseline, and the public/private split keeps people honest about overfitting.
A reusable hash-checked download helper we’ll keep using throughout the book:
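The helper itself is not reproduced here. As a rough sketch of the idea only (not d2l's actual implementation), the function below reuses a cached file when its SHA-1 digest matches the expected one and otherwise fetches it again. The names `sha1_of`, `cached_download`, and the `fetch` callback are hypothetical; a real helper would download the URL over HTTP instead of calling a stand-in.

```python
import hashlib
import os
import tempfile


def sha1_of(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def cached_download(url, folder, sha1_hash, fetch):
    """Return a local path for `url`, fetching only if the cache is stale.

    `fetch(url, path)` is a hypothetical callback that writes the file;
    in practice it would perform an HTTP GET.
    """
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, url.split('/')[-1])
    if os.path.exists(path) and sha1_of(path) == sha1_hash:
        return path  # cache hit: the file is present and intact
    fetch(url, path)
    if sha1_of(path) != sha1_hash:
        raise ValueError(f'SHA-1 mismatch for {path}')
    return path


# Demo with a fake fetcher so the sketch runs offline.
payload = b'Id,SalePrice\n1,208500\n'
expected = hashlib.sha1(payload).hexdigest()
calls = []


def fake_fetch(url, path):
    calls.append(url)
    with open(path, 'wb') as f:
        f.write(payload)


with tempfile.TemporaryDirectory() as tmp:
    url = 'https://example.com/train.csv'  # placeholder URL, never contacted
    first = cached_download(url, tmp, expected, fake_fetch)
    second = cached_download(url, tmp, expected, fake_fetch)

print(len(calls))  # number of times the fetcher actually ran
```

The second call finds the file already on disk with the right digest, so the fetcher runs only once.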
Wrap train and test CSVs in a KaggleHouse(d2l.DataModule):
class KaggleHouse(d2l.DataModule):
    def __init__(self, batch_size, train=None, val=None):
        super().__init__()
        self.save_hyperparameters()
        if self.train is None:
            self.raw_train = pd.read_csv(d2l.download(
                d2l.DATA_URL + 'kaggle_house_pred_train.csv', self.root,
                sha1_hash='585e9cc93e70b39160e7921475f9bcd7d31219ce'))
            self.raw_val = pd.read_csv(d2l.download(
                d2l.DATA_URL + 'kaggle_house_pred_test.csv', self.root,
                sha1_hash='fa19780a7b011d9b009e8bff8e99922a8ee2eb90'))
   Id  MSSubClass MSZoning  LotFrontage SaleType SaleCondition  SalePrice
0   1          60       RL         65.0       WD        Normal     208500
1   2          20       RL         80.0       WD        Normal     181500
2   3          60       RL         68.0       WD        Normal     223500
3   4          70       RL         60.0       WD       Abnorml     140000
Mixed numeric and categorical columns; lots of missing values; final column is the target SalePrice. Models eat tensors, not DataFrames — preprocessing is mandatory.
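To make the required transformation concrete, here is a minimal sketch on a made-up four-row frame (the values and the choice of three "training" rows are assumptions for illustration, not taken from the dataset). It standardizes the numeric column with training-row statistics, fills missing numbers with 0, and one-hot encodes the categorical column with a separate indicator for missing values.

```python
import pandas as pd

# Hypothetical toy rows; the first 3 play the role of the training set.
features = pd.DataFrame({
    'LotFrontage': [60.0, 80.0, None, 70.0],
    'MSZoning': ['RL', 'RM', 'RL', None],
})
n_train = 3
numeric = features.select_dtypes(include='number').columns

# Standardize with training-row statistics, then fill missing values with 0,
# which is the training mean after standardization.
mean = features[numeric].iloc[:n_train].mean()
std = features[numeric].iloc[:n_train].std()
features[numeric] = ((features[numeric] - mean) / std).fillna(0)

# One-hot encode categorical columns; dummy_na=True adds a "missing" column.
encoded = pd.get_dummies(features, dummy_na=True)
print(encoded.columns.tolist())
print(encoded.shape)
```

Every row ends up with exactly one active `MSZoning` indicator, including the row whose zoning was missing.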
Preprocess the train and test features together so both end up with the same columns, but compute the standardization statistics from the training rows only.
For categorical columns, NaN becomes its own category (missing-as-a-signal):
@d2l.add_to_class(KaggleHouse)
def preprocess(self):
    # Remove the ID and label columns
    label = 'SalePrice'
    features = pd.concat(
        (self.raw_train.drop(columns=['Id', label]),
         self.raw_val.drop(columns=['Id'])))
    # Standardize numerical columns using training-set statistics only
    # (to avoid leaking test-set information into the normalization).
    numeric_features = features.select_dtypes(include='number').columns
    n_train = self.raw_train.shape[0]
    train_mean = features[numeric_features].iloc[:n_train].mean()
    train_std = features[numeric_features].iloc[:n_train].std()
    features[numeric_features] = (
        features[numeric_features] - train_mean) / train_std
    # Replace missing numerical values with 0 (the training mean after
    # standardization)
    features[numeric_features] = features[numeric_features].fillna(0)
    # Replace discrete features with their one-hot encoding
    features = pd.get_dummies(features, dummy_na=True)
    # Save preprocessed features
    self.train = features[:n_train].copy()
    self.train[label] = self.raw_train[label]
    self.val = features[n_train:].copy()
Plain squared error penalizes a $10k mistake on a $70k house the same as on a $700k house. Relative error is a fairer measure, so we predict the logarithm of the price:
\text{RMSLE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \big(\log y_i - \log \hat y_i\big)^2}.
The official Kaggle metric for this competition. Mistakes are measured as percentages, not dollars.
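A small worked example may help here. The sketch below evaluates the formula above on single-house "datasets" using only the standard library; the function name `rmsle` and the prices are illustrative assumptions.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared error between log-prices."""
    sq = [(math.log(t) - math.log(p)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(sq) / len(sq))


# The same $10k overestimate on a cheap and on an expensive house.
cheap = rmsle([70_000], [80_000])
expensive = rmsle([700_000], [710_000])
print(f'cheap: {cheap:.4f}, expensive: {expensive:.4f}')
```

For a single house the metric reduces to |log(ŷ/y)|, so the $10k error costs log(8/7) on the cheaper house but only log(71/70) on the expensive one.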
@d2l.add_to_class(KaggleHouse)
def get_dataloader(self, train):
    label = 'SalePrice'
    data = self.train if train else self.val
    # The test set has no labels, so there is nothing to load
    if label not in data: return
    get_tensor = lambda x: d2l.tensor(x.values.astype(float),
                                      dtype=d2l.float32)
    # Logarithm of prices
    tensors = (get_tensor(data.drop(columns=[label])),  # X
               d2l.reshape(d2l.log(get_tensor(data[label])), (-1, 1)))  # Y
    return self.get_tensorloader(tensors, train)
With ~1500 training examples, a single 80/20 split is noisy. K-fold CV splits the data into K folds, trains K times with a different fold held out each time, and averages the scores.
fold 1: [ val ][ train ][ train ][ train ][ train ]
fold 2: [train][ val ][ train ][ train ][ train ]
fold 3: [train][ train ][ val ][ train ][ train ]
fold 4: [train][ train ][ train ][ val ][ train ]
fold 5: [train][ train ][ train ][ train ][ val ]
Costs K× more compute; gives a stable estimate of generalization error.
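The splitting step can be sketched in plain Python. `k_fold_indices` is a hypothetical helper written for illustration; it is not the `k_fold_data` function that the training loop calls, whose definition is not shown here. It assumes contiguous, equally sized folds.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) lists for k contiguous folds of n rows.

    Rows left over when n is not divisible by k stay in every training split.
    """
    fold_size = n // k
    for j in range(k):
        start, stop = j * fold_size, (j + 1) * fold_size
        val = list(range(start, stop))
        train = [i for i in range(n) if i < start or i >= stop]
        yield train, val


folds = list(k_fold_indices(n=10, k=5))
for train_idx, val_idx in folds:
    print('val:', val_idx, 'train:', train_idx)
```

Each row is used for validation exactly once, and never appears in the training split of the fold that validates on it.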
def k_fold(trainer, data, k, lr):
    val_loss, models = [], []
    for i, data_fold in enumerate(k_fold_data(data, k)):
        model = d2l.LinearRegression(lr)
        model.board.yscale='log'
        if i != 0: model.board.display = False
        trainer.fit(model, data_fold)
        val_loss.append(float(model.board.data['val_loss'][-1].y))
        models.append(model)
    print(f'average validation log mse = {sum(val_loss)/len(val_loss)}')
    return models
average validation log mse = 0.1828871488571167
In practice you’d grid- or random-search over learning rate, hidden size, weight decay, dropout. Same loop, different hyperparameters. Pick the config with the lowest average val score.
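A minimal grid-search sketch follows, under the assumption that some function returns the average validation loss for a configuration. Here that function is a stand-in (`fake_cv_score`) so the example runs without training anything; in practice it would wrap a K-fold run. All names and grid values are hypothetical.

```python
import itertools


def grid_search(evaluate, grid):
    """Return (best_config, best_score) over the Cartesian product of `grid`.

    `evaluate(config)` is assumed to return an average validation loss
    (lower is better), e.g. from a K-fold run.
    """
    names = list(grid)
    best_config, best_score = None, float('inf')
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Stand-in scorer; by construction it is smallest at lr=0.01, weight_decay=0.0.
def fake_cv_score(config):
    return abs(config['lr'] - 0.01) + config['weight_decay']


grid = {'lr': [0.001, 0.01, 0.1], 'weight_decay': [0.0, 1e-4]}
best_config, best_score = grid_search(fake_cv_score, grid)
print(best_config, best_score)
```

Random search has the same structure, with sampled configurations replacing the Cartesian product.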
Final step: predict the test set and write the Kaggle-format CSV. The code below averages the predictions of the K fold models; re-fitting once on all training data (no validation hold-out) is a common alternative:
preds = [model(d2l.tensor(data.val.values.astype(float), dtype=d2l.float32))
         for model in models]
# Exponentiate the log-scale predictions, then average across models
ensemble_preds = d2l.reduce_mean(d2l.exp(d2l.concat(preds, 1)), 1)
submission = pd.DataFrame({'Id':data.raw_val.Id,
                           'SalePrice':d2l.numpy(ensemble_preds)})
submission.to_csv('submission.csv', index=False)
Upload the CSV; Kaggle scores it instantly against the held-out half of the test set.
Same shape works for almost any tabular ML competition:
GBDTs (XGBoost / LightGBM) usually win tabular; nets shine on images, text, audio. The pipeline is the same.