The Machine Learning Landscape

Machine learning is often good for:

Problems for which existing solutions require a lot of hand-tuning or long lists of rules

Complex problems for which there is no good solution at all using a traditional approach

Fluctuating environments, which require constant adaptation to new data

Types of Machine Learning Systems

Things to check on

Are they trained with human supervision?
- supervised, unsupervised, semisupervised, reinforcement learning
Is the model capable of learning incrementally on the fly?
- online learning, batch learning
Comparing new data points to the known data points vs detecting new patterns to build a predictive model
- instance-based vs model-based

Are they trained with human supervision?

Supervised ones

k-Nearest Neighbors
Linear Regression
Logistic Regression
SVM
Decision Trees / Random Forests
Neural networks (only some of them; other architectures are unsupervised or semisupervised)
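
A minimal sketch of supervised training with one of the algorithms listed above (logistic regression). The iris dataset and the `max_iter` value are choices made for this illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Labeled data: X holds the features, y holds the human-provided labels.
X, y = load_iris(return_X_y=True)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)                      # learn from (features, label) pairs
predictions = clf.predict(X[:5])   # predicted class for the first five flowers
train_accuracy = clf.score(X, y)   # accuracy on the data it was trained on
```

Accuracy measured on the training data is optimistic; it says little about new instances.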

Unsupervised ones

Clustering
- k-Means
- DBSCAN
- Hierarchical Cluster Analysis
Anomaly detection and novelty detection
- One-class SVM
- Isolation Forest
Visualization and dimensionality reduction
- PCA
- Kernel PCA
- Locally Linear Embedding
- t-distributed Stochastic Neighbor Embedding (t-SNE)
Association rule learning
- Apriori
- Eclat
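
A small sketch of two of the unsupervised tasks listed above, clustering with k-Means and dimensionality reduction with PCA. The data is synthetic and the cluster count is assumed to be known, which is rarely true in practice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two well-separated groups of unlabeled 3-D points (made up for illustration).
group_a = rng.normal(loc=0.0, scale=0.5, size=(50, 3))
group_b = rng.normal(loc=5.0, scale=0.5, size=(50, 3))
X = np.vstack([group_a, group_b])

# Clustering: no labels are given, the algorithm groups similar points.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Dimensionality reduction: project 3-D points to 2-D, e.g. for plotting.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
```

The cluster ids are arbitrary (0 or 1); only the grouping carries meaning.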

Semisupervised ones

a lot of unlabeled data and a little bit of labeled data (partially labeled)

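Example: Google Photos (it groups photos of the same person without labels, then one label per person is enough to name that person in every photo)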
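
One possible way to combine an unsupervised step with a handful of labels, sketched on synthetic data. This is an illustrative approach, not a description of how any real product works; the `-1` marker for "unlabeled" is a convention chosen here.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

# Only two instances carry a label; -1 marks "unlabeled".
y_partial = np.full(100, -1)
y_partial[0] = 0
y_partial[50] = 1

# Step 1 (unsupervised): group similar instances.
cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 2 (uses the few labels): give every member of a cluster the label
# of the labeled instance that fell into that cluster.
y_full = y_partial.copy()
for idx in np.flatnonzero(y_partial != -1):
    y_full[cluster_ids == cluster_ids[idx]] = y_partial[idx]
```

After step 2, all 100 instances carry a label even though only two were labeled by hand.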
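
Reinforcement learning

an agent observes the environment, selects and performs actions, and gets rewards (or penalties) in return; it learns the policy that earns the most reward over time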

Is the model capable of learning incrementally on the fly?

Batch Learning

System is trained, then launched into production, where it runs without learning any more (offline learning)
Training on the full set of data often takes many hours, so it adapts poorly to rapidly changing data

Online Learning

systems that receive data as a continuous flow
need to adapt to change rapidly or autonomously
environments with limited computing resources, or datasets too large to fit in memory (out-of-core learning)
learning rate: how fast they should adapt to changing data
- high learning rate: rapid adaptation to new data, tend to quickly forget the old data
- low learning rate: learn more slowly, less sensitive to noise (outliers)
If bad data is fed to the system
- gradual decline in the overall performance
- need to monitor your system closely and promptly switch learning off (+revert) if you detect a drop in performance
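
A rough sketch of incremental learning on a stream of mini-batches, using `SGDRegressor.partial_fit`. The stream is simulated here, and the constant learning rate `eta0=0.05` is an arbitrary value picked for the toy data.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)

# eta0 is the learning rate: higher adapts faster but forgets old data sooner.
model = SGDRegressor(learning_rate="constant", eta0=0.05, random_state=0)

for step in range(200):
    # Simulated mini-batch arriving from a stream: y = 3x + 2 plus noise.
    X_batch = rng.uniform(0.0, 1.0, size=(20, 1))
    y_batch = 3.0 * X_batch[:, 0] + 2.0 + rng.normal(0.0, 0.1, size=20)
    model.partial_fit(X_batch, y_batch)  # update the model; the batch can then be discarded

slope, intercept = model.coef_[0], model.intercept_[0]
```

In a real system, each batch's error would also be logged so that learning can be switched off (and the model reverted) when performance drops.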

Comparing new data points to the known data points vs detecting new patterns to build a predictive model

Instance-based learning

System learns the examples by heart, and generalizes to new cases by comparing them to the learned examples (similarity measure)
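
A bare-bones sketch of the idea: there is no training step, prediction just looks up the most similar stored examples. The helper name `knn_predict` and the toy data are hypothetical; Euclidean distance is the similarity measure assumed here.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict by averaging the targets of the k stored examples closest to x_new."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # distance to every stored example
    nearest = np.argsort(distances)[:k]                  # indices of the k closest
    return y_train[nearest].mean()

# The "learned" examples are simply kept in memory.
X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([10.0, 20.0, 30.0, 100.0])

# Neighbors of 2.1 are 2.0, 3.0 and 1.0, so the prediction is mean(20, 30, 10) = 20.
prediction = knn_predict(X_train, y_train, np.array([2.1]), k=3)
```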

Model-based learning

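e.g. a linear regression model: the system fits the model's parameters to the training data, then uses the fitted model to make predictions

import sklearn.linear_model, sklearn.neighbors
model = sklearn.linear_model.LinearRegression()  # model-based: fits a line to the data
# instance-based alternative: swapping in this line replaces the model above
model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)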
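
To make the contrast concrete, here is a small sketch that fits both estimators named above on made-up data lying exactly on the line y = 2x + 1, then predicts far outside the training range.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Toy data on an exact line y = 2x + 1 (made up for illustration).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 2.0 * X[:, 0] + 1.0
X_new = np.array([[10.0]])

linear = LinearRegression().fit(X, y)               # model-based
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)  # instance-based

linear_pred = linear.predict(X_new)[0]  # follows the fitted line: close to 21
knn_pred = knn.predict(X_new)[0]        # mean target of x = 2, 3, 4: (5 + 7 + 9) / 3 = 7
```

The linear model extrapolates the pattern it found, while k-nearest neighbors can only average targets it has already seen.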

Challenges of Machine Learning

Bad data
- Insufficient Quantity of Training data
- Nonrepresentative Training Data (Sampling Bias)
  - With too small a sample, the trained model is unlikely to make accurate predictions about boundary cases
- Poor-Quality Data
  - Outliers: possibly a better choice to discard them / fix the errors manually
  - Instances missing a few features: ignore the feature or those instances / fill in the missing values (median, avg, etc.) / train one model with the feature and one model without it
- Irrelevant Features
  - Feature selection: select the most useful features to train
  - Feature extraction: combine existing features to produce a more useful one (e.g. dimensionality reduction)
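
A short sketch of the "fill in the missing values" option, using scikit-learn's `SimpleImputer` with the median strategy. The small array is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Second column has one missing value (np.nan).
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0],
              [100.0, 20.0]])

imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)  # nan replaced by the column median, 20.0
```

The median is less affected by outliers (such as the 100.0 in the first column) than the mean would be.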
Bad algorithms
- Overfitting (Overgeneralizing)
  - If the training set itself is noisy or too small, the model is likely to detect patterns in the noise itself.
  - Model is too complex relative to the amount and noisiness of the training data
  - Simplify the model by reducing the number of attributes in the training data or constrain the model (regularization)
  - Gather more training data / reduce the noise in the training data (fix errors, remove outliers)
- Underfitting
  - Model is too simple to learn the underlying structure of the data
  - Select a more powerful model, with more parameters
  - Feed better features to the learning algorithm (feature engineering)
  - Reduce the constraints on the model (e.g., reduce the regularization hyperparameter)
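
A sketch of the three situations on synthetic data: a model that is too simple, one that is too complex for 15 noisy points, and the same complex model constrained by ridge regularization. The polynomial degrees and `alpha=1e-3` are arbitrary values chosen for this toy problem.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of sin(2*pi*x) on [0, 1] (made up for illustration)."""
    x = rng.uniform(0.0, 1.0, size=(n, 1))
    y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(0.0, 0.2, size=n)
    return x, y

X_train, y_train = make_data(15)   # small, noisy training set
X_test, y_test = make_data(200)    # fresh data from the same source

candidates = {
    "underfit": make_pipeline(PolynomialFeatures(1), LinearRegression()),
    "overfit": make_pipeline(PolynomialFeatures(12), LinearRegression()),
    "regularized": make_pipeline(PolynomialFeatures(12), Ridge(alpha=1e-3)),
}

errors = {}  # name -> (training MSE, test MSE)
for name, model in candidates.items():
    model.fit(X_train, y_train)
    errors[name] = (
        mean_squared_error(y_train, model.predict(X_train)),
        mean_squared_error(y_test, model.predict(X_test)),
    )
```

Comparing the two numbers per model is the point: a large gap between training and test error suggests overfitting, while a high training error suggests underfitting.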

Testing and Validating

training set + test set
- Does your model perform well on instances that it has never seen before?
- low training error + high generalization(out-of-sample) error -> overfitting
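
A minimal sketch of the split-and-compare check, on synthetic data. An unconstrained decision tree is used here only because it memorizes the training set, which makes the gap easy to see; the 80/20 split is a common but arbitrary choice.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0.0, 10.0, size=(200, 1))
y = 0.5 * X[:, 0] + rng.normal(0.0, 1.0, size=200)  # linear trend plus noise

# Keep 20% of the data aside; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)

train_error = mean_squared_error(y_train, tree.predict(X_train))  # error on seen data
test_error = mean_squared_error(y_test, tree.predict(X_test))     # estimate of generalization error
```

Here the training error is essentially zero while the test error is not, which is the overfitting signature described above.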
Hyperparameter Tuning and Model Selection
- Avoid picking the model and hyperparameters that only look best on the particular test set (measuring the generalization error on the test set many times adapts them to that set)
- Holdout validation: hold out part of the training set (validation set) to evaluate several candidate models
  - train the model with reduced training set
  - choose the best model, and train the best model on full training set
  - evaluate the model on the test set
- Size of validation set
  - too small: imprecise model evaluation
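  - too large: remaining training set gets too small
  - fix: perform repeated cross-validation, using many small validation sets (each model is evaluated once per validation set after training on the rest of the data, and the scores are averaged; training time grows with the number of validation sets)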
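
The holdout procedure above, followed by the cross-validation alternative, sketched on synthetic data. The candidate models are ridge regressions that differ only in `alpha`; the candidate values, split sizes, and `cv=5` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0.0, 0.5, size=300)

# Test set is put aside first and touched only once, at the very end.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# Holdout validation: carve a validation set out of the training set.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=0
)

alphas = [0.01, 1.0, 100.0, 10000.0]  # candidate hyperparameter values
val_scores = {a: Ridge(alpha=a).fit(X_train, y_train).score(X_val, y_val)
              for a in alphas}
best_alpha = max(val_scores, key=val_scores.get)

# Retrain the winner on the full training set, then evaluate once on the test set.
final_model = Ridge(alpha=best_alpha).fit(X_train_full, y_train_full)
test_score = final_model.score(X_test, y_test)  # R^2 on unseen data

# Alternative: 5-fold cross-validation averages over several small validation sets.
cv_scores = {a: cross_val_score(Ridge(alpha=a), X_train_full, y_train_full, cv=5).mean()
             for a in alphas}
```

Cross-validation gives a steadier estimate per candidate, at the cost of training each one five times here.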
Data Mismatch
- Some datasets do not perfectly represent the data that will be used in production
- Validation set and test set must be as representative as possible of the data you expect to use in production
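- Hold out part of the training data as a train-dev set
  - train the model on the rest of the training set
  - evaluate the model on the train-dev set
  - if good: the model is not overfitting the training set; move on to the validation set
    - if good on validation set: good
    - if not good on validation set: data mismatch -> preprocess the training data to look more like production data
  - if not good: the model overfit the training set -> simplify or regularize it, get more training data, or clean the data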
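
The decision procedure above can be written down as a small helper. The function name `diagnose` and the `tolerance` threshold are hypothetical; what counts as a "bad" jump in error depends on the task.

```python
def diagnose(train_error, train_dev_error, validation_error, tolerance=0.05):
    """Rough diagnosis from three error rates in [0, 1].

    tolerance is an arbitrary gap above which a jump in error counts as "not good".
    """
    # Train-dev data comes from the same source as the training data,
    # so a jump here cannot be blamed on mismatch.
    if train_dev_error - train_error > tolerance:
        return "overfitting the training set"
    # Good on train-dev but bad on validation: the two data sources differ.
    if validation_error - train_dev_error > tolerance:
        return "data mismatch"
    return "ok"

print(diagnose(0.02, 0.20, 0.22))  # prints: overfitting the training set
print(diagnose(0.02, 0.03, 0.25))  # prints: data mismatch
print(diagnose(0.02, 0.03, 0.04))  # prints: ok
```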