Big data enables machine learning

What is machine learning?

“Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed.”

— Arthur Samuel (1959)

What is machine learning used for?

  • gain insights
  • predict events

in order to

  • provide a quantitative basis for decisions (actionable insights)
  • influence the data generating process

Build a food classifier

Structure of the learning problem

Model and learning

Model: A model is a mathematical, statistical, or logical representation that describes the relationship between variables and can be used to make predictions or understand patterns in data.

Learning/Training: Machine Learning employs adaptive models, which are configured and parameterised automatically based on the training data.
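To make the idea of an automatically parameterised model concrete, here is a minimal sketch. It assumes a straight-line model with two parameters; the data values and the names `model` and `theta` are illustrative and not taken from the slides. The parameters are not set by hand but chosen by least squares from the training data.

```python
import numpy as np

def model(x, theta):
    """Straight-line model: theta[0] is the intercept, theta[1] the slope."""
    return theta[0] + theta[1] * x

# Illustrative training data (made up for this sketch): y is roughly 2x + 1
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# "Learning": choose theta by least squares instead of setting it by hand
A = np.column_stack([np.ones_like(x_train), x_train])
theta, *_ = np.linalg.lstsq(A, y_train, rcond=None)

print("Learned parameters:", theta)
print("Prediction for x = 5:", model(5.0, theta))
```

Changing the training data changes `theta`, while the code of `model` stays the same; that is the sense in which the model is adaptive.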

Three Main Paradigms

  1. Supervised Learning
  2. Unsupervised Learning
  3. Reinforcement Learning

Each paradigm addresses different tasks and uses different learning strategies.

Supervised Learning

  • Training Data: Labeled samples (input features + output values)
  • Goal: A model that predicts output values for new input features

Data Matrix and Labels

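Design Matrix

            Feature 1     Feature 2     ...   Feature N     Output
Sample 1    \(x_{11}\)    \(x_{12}\)    ...   \(x_{1N}\)    \(y_1\)
Sample 2    \(x_{21}\)    \(x_{22}\)    ...   \(x_{2N}\)    \(y_2\)
Sample 3    \(x_{31}\)    \(x_{32}\)    ...   \(x_{3N}\)    \(y_3\)
...
Sample M    \(x_{M1}\)    \(x_{M2}\)    ...   \(x_{MN}\)    \(y_M\)
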
  • Dimension of \(X\): \(M\text{ (samples)} \times N \text{ (features)}\)
  • Dimension of \(y\): \(M\) (one output value per sample)
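As a rough illustration of these dimensions, the following sketch builds a small design matrix and label vector in NumPy. The numbers are hypothetical and only serve to show the shapes.

```python
import numpy as np

# Hypothetical data set: M = 4 samples, N = 3 features each
X = np.array([
    [1500, 3, 10],
    [2000, 4, 5],
    [2500, 4, 20],
    [3000, 5, 2],
])
y = np.array([300, 400, 500, 600])  # one output value per sample

M, N = X.shape
print("X shape:", X.shape)
print("y shape:", y.shape)
print("Features of sample 2:", X[1, :])  # row index 1, since NumPy counts from 0
```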

Supervised Learning

\(f(X_{m,:};\mathbf{\theta})\rightarrow y_{m}\)

  • Process: Algorithm adapts parameters \(\mathbf{\theta}\) of a model \(f\) to predict the correct outputs for the known training samples
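One common way to adapt the parameters is gradient descent on the mean squared error. The sketch below assumes a linear model without intercept, made-up data, and a hand-picked learning rate; it is one possible instance of the process, not the only training algorithm.

```python
import numpy as np

# Illustrative training data (made up): y = 3 * x exactly, single feature
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 6.0, 9.0, 12.0])

def f(X, theta):
    """Linear model without intercept: one weight per feature."""
    return X @ theta

theta = np.zeros(X.shape[1])   # start with an uninformed guess
learning_rate = 0.01           # step size, chosen by hand for this sketch

for step in range(500):
    error = f(X, theta) - y                 # prediction error per sample
    gradient = 2 * X.T @ error / len(y)     # gradient of the mean squared error
    theta -= learning_rate * gradient       # move theta to reduce the error

print("Learned theta:", theta)
```

Each pass through the loop nudges `theta` in the direction that lowers the error on the known training samples, so the learned weight approaches the slope used to generate the data.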

Classification vs. Regression

Task             Output Variable Type     Applications
Classification   Categorical              Spam detection; credit approval
Regression       Numerical, continuous    Predict prices; probability of default
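To contrast the two output types, here is a small classification sketch using scikit-learn's `LogisticRegression`. The spam data (number of suspicious words per e-mail) is invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: number of suspicious words in an e-mail -> spam (1) or not (0)
X_train = np.array([[0], [1], [2], [3], [7], [8], [9], [10]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X_train, y_train)

new_emails = np.array([[1], [9]])
labels = clf.predict(new_emails)                  # categorical output (class label)
spam_prob = clf.predict_proba(new_emails)[:, 1]   # numerical output in [0, 1]

print("Predicted classes:", labels)
print("Estimated spam probabilities:", spam_prob)
```

`predict` returns a category, whereas `predict_proba` returns a continuous value, which shows how a quantity such as a probability of default can be treated as a numerical output.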

Supervised Learning Example

# Simple example: Predicting house prices
from sklearn.linear_model import LinearRegression
import numpy as np

# Training data (house size in sq ft, price in $1000s)
X_train = np.array([[1500], [2000], [2500], [3000]])
y_train = np.array([300, 400, 500, 600])

# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict price for new house
new_house_size = [[2200]]
predicted_price = model.predict(new_house_size)
print(f"Predicted price: ${predicted_price[0]:.0f}k")

Unsupervised Learning

Learning without a Teacher

  • Input: Unlabeled data (features only, no target)
  • Goal: Discover hidden patterns or structures
  • Process: Algorithm finds patterns without knowing “correct” answers

Common Applications:

  • Clustering: Grouping similar items
  • Dimensionality Reduction: Simplifying complex data
  • Anomaly Detection: Finding unusual patterns

Examples:

  • Customer segmentation
  • Data compression
  • Fraud detection
  • Gene analysis
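As a sketch of dimensionality reduction, the following example applies principal component analysis (PCA) from scikit-learn to synthetic data. The data is generated so that the third feature is the sum of the first two, which means two dimensions are enough to describe it.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 6 samples, 3 features; the third is the sum of the first two
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 2))
X = np.column_stack([base, base[:, 0] + base[:, 1]])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)   # project onto the 2 main directions

print("Original shape:", X.shape)
print("Reduced shape:", X_reduced.shape)
print("Variance kept:", pca.explained_variance_ratio_.sum())
```

No labels are involved: PCA finds the directions of largest variance from the features alone.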

Unsupervised Learning Example

# Customer segmentation using K-means clustering
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import numpy as np

# Customer data: [age, income]
customers = np.array([
    [25, 40000], [30, 60000], [35, 80000],
    [45, 90000], [50, 100000], [55, 120000],
    [22, 35000], [28, 50000], [40, 85000]
])

# Standardize features so that income (large numbers) does not dominate age
customers_scaled = StandardScaler().fit_transform(customers)

# Cluster customers into 3 groups
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(customers_scaled)

print("Customer segments:", clusters)
# Prints one cluster index (0, 1, or 2) per customer; the numbering itself is arbitrary

Reinforcement Learning

Learning through Trial and Error

  • Input: Environment with states, actions, and rewards
  • Goal: Learn optimal actions to maximize cumulative reward
  • Process: Agent explores environment, receives feedback, improves strategy

Key Components:

  • Agent: The learner/decision maker
  • Environment: The world the agent interacts with
  • Actions: What the agent can do
  • Rewards: Feedback from the environment
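The interplay of these components can be sketched with tabular Q-learning, one widely used reinforcement learning algorithm. The corridor environment, the reward of 1 at the goal, and the values of `alpha`, `gamma`, and `epsilon` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states = 5            # corridor cells 0..4; reaching cell 4 ends an episode
actions = [-1, +1]      # action 0: step left, action 1: step right
Q = np.zeros((n_states, len(actions)))   # agent's estimate of future reward

alpha, gamma, epsilon = 0.5, 0.9, 0.2    # learning rate, discount, exploration

def step(state, action):
    """Environment: move the agent, reward 1 only on reaching the goal."""
    next_state = min(max(state + actions[action], 0), n_states - 1)
    done = next_state == n_states - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

for episode in range(200):
    state, done = 0, False
    while not done:
        if rng.random() < epsilon:
            # Explore: pick a random action
            action = int(rng.integers(len(actions)))
        else:
            # Exploit: pick a best-known action, breaking ties at random
            best = np.flatnonzero(Q[state] == Q[state].max())
            action = int(rng.choice(best))
        next_state, reward, done = step(state, action)
        # Q-learning update: move the estimate towards
        # reward + discounted value of the best next action
        target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state

policy = np.argmax(Q[:-1], axis=1)   # best action in each non-goal cell
print("Learned policy (0 = left, 1 = right):", policy)
```

Here the agent is the loop that picks actions and updates `Q`, the environment is the `step` function, and the reward is the only feedback the agent receives about how good its actions were.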

Reinforcement Learning Applications

Real-world Examples:

  • Game Playing: Chess, Go, video games
  • Robotics: Robot navigation, manipulation
  • Autonomous Vehicles: Self-driving cars
  • Trading: Algorithmic trading strategies
  • Recommendation Systems: Personalized content
  • Resource Management: Cloud computing optimization

Famous Success Stories:

  • AlphaGo (Google DeepMind)
  • OpenAI Five (Dota 2)
  • Autonomous drone racing

Choosing the Right Paradigm

Learning Paradigm   When to Use                                 Data Requirements
Supervised          Predict values based on labeled examples    Labeled dataset
Unsupervised        Discover patterns or structure in data      Unlabeled data
Reinforcement       Learn optimal actions through interaction   Agent-Environment interaction

Summary

  1. Machine Learning enables computers to learn from data without explicit programming
  2. Supervised Learning uses labeled data to make predictions
  3. Unsupervised Learning discovers hidden patterns in unlabeled data
  4. Reinforcement Learning learns optimal actions through trial and error
  5. Each paradigm suits different types of problems and data scenarios