Data Preprocessing

DataFrames to Tensors

Real datasets rarely arrive as tensors. The usual path:

  1. Read raw rows (CSV, JSON, Parquet, …) → a DataFrame.
  2. Preprocess — fill missing values, encode categoricals.
  3. Split into inputs and targets.
  4. Convert the numeric columns into a tensor.

This chapter walks through that pipeline on a tiny toy dataset.

Reading the data

First, dump a CSV to disk so we have something to load:

import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')

pandas reads CSVs into a DataFrame. Note the NaNs: read_csv treats the string NA as missing by default and stores it as NaN, pandas’s sentinel for missing values:

import pandas as pd

data = pd.read_csv(data_file)
print(data)
   NumRooms RoofType   Price
0       NaN      NaN  127500
1       2.0      NaN  106000
2       4.0    Slate  178100
3       NaN      NaN  140000
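How the NA strings turn into NaN can be checked directly. The sketch below parses the same CSV text from memory, once with the defaults and once with keep_default_na=False, which switches off pandas’s built-in list of missing-value markers. The variable names csv_text, parsed, and raw are illustrative only.

```python
import io

import pandas as pd

# Same rows as house_tiny.csv, kept in memory so nothing is written to disk.
csv_text = """NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000"""

# Default parsing: the string "NA" is read as a missing value (NaN).
parsed = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False: "NA" stays an ordinary string.
raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(parsed.isna().sum())       # missing entries per column
print(raw["NumRooms"].tolist())  # strings, including the literal "NA"
```

With the defaults, NumRooms becomes a float column because its remaining values are numeric; with the markers switched off it stays a column of strings and would need explicit cleaning.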

Splitting inputs and targets

Conventionally the last column is the target (y); the rest are inputs (X). iloc slices by integer position, and pd.get_dummies then one-hot encodes the categorical RoofType column:

inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
inputs = pd.get_dummies(inputs, dummy_na=True)
print(inputs)
   NumRooms  RoofType_Slate  RoofType_nan
0       NaN           False          True
1       2.0           False          True
2       4.0            True         False
3       NaN           False          True

Categorical columns with missing values often benefit from treating NaN as its own category — pd.get_dummies does that when dummy_na=True.
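To see what dummy_na changes, the sketch below encodes the same toy inputs with and without it. The frame is rebuilt inline so the snippet runs on its own; the names toy, without_na, and with_na are illustrative, not part of the walkthrough above.

```python
import numpy as np
import pandas as pd

# Rebuild the toy inputs inline so the snippet is self-contained.
toy = pd.DataFrame({
    "NumRooms": [np.nan, 2.0, 4.0, np.nan],
    "RoofType": [np.nan, np.nan, "Slate", np.nan],
})

# dummy_na=False: rows with a missing RoofType get zeros in every indicator column.
without_na = pd.get_dummies(toy, dummy_na=False)

# dummy_na=True: one extra indicator column flags the missing entries.
with_na = pd.get_dummies(toy, dummy_na=True)

print(without_na.columns.tolist())
print(with_na.columns.tolist())
```

Without the extra column, a missing roof type is indistinguishable from "none of the known categories", which discards the information that the value was absent.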

Imputing missing numbers

For numeric columns, the simplest fill is the column mean:

inputs = inputs.fillna(inputs.mean())
print(inputs)
   NumRooms  RoofType_Slate  RoofType_nan
0       3.0           False          True
1       2.0           False          True
2       4.0            True         False
3       3.0           False          True

This is mean imputation: fast and simple, but it biases the variance downward, because every filled entry sits exactly at the mean. More principled fills (median, KNN, model-based) live in sklearn.impute.
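The variance effect is easy to check on a small example. The sketch below uses a hypothetical column (not the housing data) and compares the sample variance of the observed values with the variance after mean imputation; it also shows that a median fill follows the same fillna pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
rooms = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, np.nan])

# Sample variance of the observed values only (NaNs are skipped by default).
var_observed = rooms.var()

# Mean imputation: every gap is filled with the observed mean (3.0 here).
mean_filled = rooms.fillna(rooms.mean())
var_mean_filled = mean_filled.var()

# Median imputation uses the same pattern and is less sensitive to outliers.
median_filled = rooms.fillna(rooms.median())

print(var_observed, var_mean_filled)
```

The filled entries add nothing to the sum of squared deviations while still increasing the sample count, so the variance estimate can only shrink.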

Conversion to a tensor

Once every entry is numeric, pass the DataFrame’s .to_numpy() output to the framework’s tensor constructor (here, jnp.array):

from jax import numpy as jnp

X = jnp.array(inputs.to_numpy(dtype=float))
y = jnp.array(targets.to_numpy(dtype=float))
X, y
(Array([[3., 0., 1.],
        [2., 0., 1.],
        [4., 1., 0.],
        [3., 0., 1.]], dtype=float32),
 Array([127500., 106000., 178100., 140000.], dtype=float32))
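The dtype=float argument does real work in this step. As a minimal sketch on a hypothetical frame shaped like the encoded inputs, the NumPy half of the conversion can be inspected on its own; the name encoded is illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing a float column with boolean indicator columns.
encoded = pd.DataFrame({
    "NumRooms": [3.0, 2.0, 4.0, 3.0],
    "RoofType_Slate": [False, False, True, False],
    "RoofType_nan": [True, True, False, True],
})

# dtype=float requests one homogeneous float64 array; True/False become 1.0/0.0.
as_float = encoded.to_numpy(dtype=float)

print(as_float.dtype, as_float.shape)
print(as_float)
```

NumPy produces float64 here. JAX stores arrays as float32 unless 64-bit mode is enabled, which is why the output above reports dtype=float32.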

From here on we live in tensor-land — gradients, GPUs, the works.

Recap

  • pd.read_csv → DataFrame.
  • iloc[:, …] to slice columns into inputs and targets.
  • fillna(mean) for numeric; get_dummies(dummy_na=True) for categorical.
  • .to_numpy() then jnp.array(...) to leave pandas.
  • For anything beyond toy CSVs, reach for sklearn.preprocessing and sklearn.impute.