Data Preprocessing

Dive into Deep Learning · §1.2

From a messy file to a model-ready tensor
load · inspect · clean · encode · scale.

Raw data never arrives model-ready

Motivation

Real data arrives as messy files, not tensors. Turning a raw file into model input takes a sequence of decisions: which values are missing and how to fill them, how to turn categories into numbers, whether to rescale.

Get these wrong and even a perfect model learns nothing useful.

Load & look

read a CSV, then understand it

read_csv: named columns, NaN where data is missing

read_csv loads a comma-separated file into a DataFrame: rows × named columns. Empty fields become NaN, pandas’s marker for a missing value:

import pandas as pd

data = pd.read_csv(data_file)
data

   NumRooms RoofType  Area   Price
0       3.0    Slate  1500  210000
1       NaN     Tile  2100  290000
2       2.0      NaN   850  127500
3       4.0    Slate  1940  258000
4       NaN      NaN  1200  168000
5       3.0     Tile  1650  225000
6       5.0    Slate  2600  375000
7       2.0      NaN   900  142000
8       NaN     Tile  1750  240000
9       4.0      NaN  2050  295000
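The snippet above relies on a data_file path that is defined elsewhere. As a self-contained sketch, the same ten rows can be read from an in-memory string instead; csv_text and the use of io.StringIO are assumptions made for illustration only:

```python
import io

import pandas as pd

# Hypothetical stand-in for data_file: the same ten rows as an in-memory CSV.
# An empty field (nothing between two commas) is a missing value.
csv_text = """NumRooms,RoofType,Area,Price
3,Slate,1500,210000
,Tile,2100,290000
2,,850,127500
4,Slate,1940,258000
,,1200,168000
3,Tile,1650,225000
5,Slate,2600,375000
2,,900,142000
,Tile,1750,240000
4,,2050,295000
"""

# read_csv accepts any file-like object, so StringIO works like a file path.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)         # rows x columns
print(data.isna().sum())  # empty fields were parsed as NaN
```

Reading from a string keeps the example reproducible without touching the file system; with a real file, only the argument to read_csv changes.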

Look before you transform

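dtypes shows how each column was read: numeric vs. string (a string column is a cue that the data is categorical). describe summarizes the numeric columns and surfaces their ranges; its count row also hints at missing values (7 of 10 for NumRooms):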

data.dtypes

NumRooms    float64
RoofType        str
Area          int64
Price         int64
dtype: object

data.describe()

       NumRooms         Area          Price
count  7.000000    10.000000      10.000000
mean   3.285714  1654.000000  233050.000000
std    1.112697   556.421104   76128.710455
min    2.000000   850.000000  127500.000000
25%    2.500000  1275.000000  178500.000000
50%    3.000000  1700.000000  232500.000000
75%    4.000000  2022.500000  282000.000000
max    5.000000  2600.000000  375000.000000

The three numeric columns live on very different scales.
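One way to act on this inspection is to split the column names by dtype and compare their ranges directly. The sketch below rebuilds the table by hand so that it runs on its own; the variable names numeric_cols, categorical_cols, and ranges are illustrative choices:

```python
import numpy as np
import pandas as pd

# The same ten rows as in the table above, rebuilt by hand for this sketch.
data = pd.DataFrame({
    "NumRooms": [3, np.nan, 2, 4, np.nan, 3, 5, 2, np.nan, 4],
    "RoofType": ["Slate", "Tile", None, "Slate", None,
                 "Tile", "Slate", None, "Tile", None],
    "Area": [1500, 2100, 850, 1940, 1200, 1650, 2600, 900, 1750, 2050],
    "Price": [210000, 290000, 127500, 258000, 168000,
              225000, 375000, 142000, 240000, 295000],
})

# Numeric columns are candidates for scaling; the rest for encoding.
numeric_cols = data.select_dtypes(include="number").columns.tolist()
categorical_cols = data.select_dtypes(exclude="number").columns.tolist()

# max - min per numeric column makes the scale mismatch explicit.
ranges = data[numeric_cols].max() - data[numeric_cols].min()

print(numeric_cols, categorical_cols)
print(ranges)
```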

Clean & encode

missing values · categoricals

Measure the gaps before choosing a remedy

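First split the frame into features (inputs) and the Price target (targets), then measure the problem (how many gaps, and where):

inputs, targets = data.drop(columns='Price'), data['Price']
inputs.isna().sum()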

NumRooms    3
RoofType    4
Area        0
dtype: int64

Three responses, each with a cost: delete the rows, impute an estimate, or add an indicator flag.
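Deletion and imputation are worked through in the steps that follow. The indicator option for a numeric column might look like the sketch below; the column name NumRooms_missing is a hypothetical choice, and pairing the flag with mean imputation is one common arrangement rather than the only one:

```python
import numpy as np
import pandas as pd

# Only the NumRooms column is needed to illustrate the idea.
inputs = pd.DataFrame({
    "NumRooms": [3, np.nan, 2, 4, np.nan, 3, 5, 2, np.nan, 4],
})

flagged = inputs.copy()

# Record where the value was missing BEFORE filling it in,
# so the model can still see that the entry was imputed.
flagged["NumRooms_missing"] = flagged["NumRooms"].isna().astype(float)

# Fill the gap with the column mean; the flag preserves the information.
flagged["NumRooms"] = flagged["NumRooms"].fillna(flagged["NumRooms"].mean())

print(flagged)
```

The cost is one extra column per flagged feature, which is why the flag is a tradeoff rather than a default.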

Deletion keeps only 4 rows of 10

The simplest response is deletion, but dropping every row with a gap throws away most of this dataset:

len(inputs.dropna()), len(inputs)

(4, 10)

That is too wasteful for a dataset this small, so we impute instead.

Imputation fills gaps without discarding rows

Imputation fills each gap with an estimate. For a numerical column the simplest estimate is the column mean (or the median when the column is skewed). The categorical RoofType is left for the encoding step that follows:

inputs = inputs.fillna(inputs.mean(numeric_only=True))
inputs

   NumRooms RoofType  Area
0  3.000000    Slate  1500
1  3.285714     Tile  2100
2  2.000000      NaN   850
3  4.000000    Slate  1940
4  3.285714      NaN  1200
5  3.000000     Tile  1650
6  5.000000    Slate  2600
7  2.000000      NaN   900
8  3.285714     Tile  1750
9  4.000000      NaN  2050
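For a skewed column, the median variant mentioned above is a one-word change. This is a sketch on the same rows, rebuilt here so that it runs on its own; filled_median is an illustrative name:

```python
import numpy as np
import pandas as pd

inputs = pd.DataFrame({
    "NumRooms": [3, np.nan, 2, 4, np.nan, 3, 5, 2, np.nan, 4],
    "RoofType": ["Slate", "Tile", None, "Slate", None,
                 "Tile", "Slate", None, "Tile", None],
    "Area": [1500, 2100, 850, 1940, 1200, 1650, 2600, 900, 1750, 2050],
})

# median(numeric_only=True) returns one value per numeric column;
# fillna matches those values to columns by name, so RoofType is untouched.
filled_median = inputs.fillna(inputs.median(numeric_only=True))

print(filled_median)
```

Here the median of the seven observed NumRooms values is 3.0, so the filled entries stay whole numbers, unlike the mean of 3.285714.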

One-hot encoding: numbers without a false order

Models consume numbers, so the RoofType strings get one-hot encoded: one 0/1 column per category. dummy_na=True makes “missing” its own category (the indicator strategy again):

inputs = pd.get_dummies(inputs, dummy_na=True, dtype=float)
inputs

Integer codes (Slate=0, Tile=1) would invent an ordering the model would wrongly exploit.
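To make the contrast concrete, the sketch below applies both encodings to the same rows. pd.factorize stands in for "integer codes" here as an assumption; the text does not name a specific function for that alternative:

```python
import numpy as np
import pandas as pd

inputs = pd.DataFrame({
    "NumRooms": [3, np.nan, 2, 4, np.nan, 3, 5, 2, np.nan, 4],
    "RoofType": ["Slate", "Tile", None, "Slate", None,
                 "Tile", "Slate", None, "Tile", None],
    "Area": [1500, 2100, 850, 1940, 1200, 1650, 2600, 900, 1750, 2050],
})

# One-hot: one 0/1 column per category, plus one for "missing".
encoded = pd.get_dummies(inputs, dummy_na=True, dtype=float)
roof_cols = [c for c in encoded.columns if c.startswith("RoofType_")]

# Exactly one of the RoofType_* columns is 1.0 in every row.
print(encoded[roof_cols].sum(axis=1).tolist())

# Integer codes, for comparison: categories are numbered in order of
# appearance and missing values get -1, which implies an order and
# distances between categories that do not exist in the data.
codes, categories = pd.factorize(inputs["RoofType"])
print(codes, list(categories))
```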

Scale & convert

standardize, then hand off to tensors

NumRooms ≈ 3, Area ≈ 1700: standardize

The continuous columns span very different scales, which makes optimization ill-conditioned. Standardize each to ~zero mean, unit variance:

continuous = ['NumRooms', 'Area']
inputs[continuous] = (
    (inputs[continuous] - inputs[continuous].mean())
    / inputs[continuous].std()
)
inputs

Compute these statistics on the training split only; using the whole dataset leaks the test set into training.
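A minimal sketch of that rule, assuming a hypothetical split in which the first 8 rows are training data and the last 2 are test data (the original does not define a split). The imputation mean is taken from the training rows as well, since it is also a statistic learned from data:

```python
import numpy as np
import pandas as pd

inputs = pd.DataFrame({
    "NumRooms": [3, np.nan, 2, 4, np.nan, 3, 5, 2, np.nan, 4],
    "Area": [1500.0, 2100.0, 850.0, 1940.0, 1200.0,
             1650.0, 2600.0, 900.0, 1750.0, 2050.0],
})

# Hypothetical split: first 8 rows for training, last 2 for testing.
train, test = inputs.iloc[:8], inputs.iloc[8:]

# Impute BOTH splits with the training mean.
train_fill = train.mean()
train_filled = train.fillna(train_fill)
test_filled = test.fillna(train_fill)

# Fit the scaling statistics on the training rows only...
mean, std = train_filled.mean(), train_filled.std()

# ...and apply the same mean and std to both splits.
train_scaled = (train_filled - mean) / std
test_scaled = (test_filled - mean) / std

print(train_scaled.mean().round(6).tolist())
print(test_scaled)
```

The test rows will generally not have exactly zero mean or unit variance after this transform, and that is expected: they are being measured on the training data's scale.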

The payoff: a (10, 5) tensor, ready for a model

Every column is numeric, so .to_numpy() hands the frame to the tensor constructor:

from jax import numpy as jnp

X = jnp.array(inputs.to_numpy(dtype=float))
y = jnp.array(targets.to_numpy(dtype=float))
X, y

X is a (10, 5) float tensor and y a (10,) target: five decisions turned a gap-ridden CSV into model input. (Real pipelines cast to float32; JAX already does this by default unless 64-bit mode is enabled.)
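Before handing arrays to a tensor library, it can help to check shape, dtype, and the absence of NaN on the NumPy side. The sketch below repeats the imputation and encoding steps on the same rows (standardization is omitted for brevity) and requests float32 explicitly; the names X_np and y_np are illustrative:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "NumRooms": [3, np.nan, 2, 4, np.nan, 3, 5, 2, np.nan, 4],
    "RoofType": ["Slate", "Tile", None, "Slate", None,
                 "Tile", "Slate", None, "Tile", None],
    "Area": [1500, 2100, 850, 1940, 1200, 1650, 2600, 900, 1750, 2050],
    "Price": [210000, 290000, 127500, 258000, 168000,
              225000, 375000, 142000, 240000, 295000],
})

inputs, targets = data.drop(columns="Price"), data["Price"]
inputs = inputs.fillna(inputs.mean(numeric_only=True))       # impute
inputs = pd.get_dummies(inputs, dummy_na=True, dtype=float)  # encode

# Ask for float32 at conversion time rather than relying on a later cast.
X_np = inputs.to_numpy(dtype=np.float32)
y_np = targets.to_numpy(dtype=np.float32)

# Sanity checks: no gaps survived, and the shapes line up row for row.
assert not np.isnan(X_np).any()
assert X_np.shape[0] == y_np.shape[0]

print(X_np.shape, X_np.dtype, y_np.shape)
```

These arrays can then be passed to the tensor constructor exactly as in the conversion step above.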

Recap

Wrap-up

Inspect before you transform (dtypes, describe).
Missing: delete (here: 4 rows of 10 survive), impute, or flag; each is a tradeoff, not a default.
Categoricals: one-hot, not integer codes.
Numericals: standardize; fit stats on training data only.
Convert: to_numpy → tensor, and on to the model.