Machine learning is often good for:

  • Problems for which existing solutions require a lot of hand-tuning or long lists of rules
  • Complex problems for which there is no good solution at all using traditional algorithms
  • Fluctuating environments, which require constant adaptation to new data

Types of Machine Learning Systems

Things to check on


Are they trained with human supervision?

Supervised ones

  • k-Nearest Neighbors
  • Linear Regression
  • Logistic Regression
  • SVM
  • Decision Trees / Random Forests
  • Neural networks (only some of them; others are unsupervised or semisupervised)

Unsupervised ones

  • Clustering
    • k-Means
    • DBSCAN
    • Hierarchical Cluster Analysis
  • Anomaly detection and novelty detection
    • One-class SVM
    • Isolation Forest
  • Visualization and dimensionality reduction
    • PCA
    • Kernel PCA
    • Locally Linear Embedding
    • t-distributed Stochastic Neighbor Embedding (t-SNE)
  • Association rule learning
    • Apriori
    • Eclat
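
To make the clustering entry concrete, here is a minimal sketch of the k-Means idea (alternate between assigning points to the nearest centroid and moving each centroid to the mean of its members). The function name `k_means`, the evenly spaced initialization, and the toy points are assumptions made for illustration; scikit-learn's `KMeans` uses a smarter initialization (k-means++) by default.

```python
def k_means(points, k, iterations=10):
    """Cluster 2-D points with the basic assign/update loop; returns (centroids, labels)."""
    # Simplified, deterministic initialization: pick k evenly spaced data points.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid (squared distance).
        labels = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels


# Toy data (hypothetical): two well-separated groups, no labels given.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = k_means(points, k=2)
```

No labels are supplied anywhere: the grouping comes only from the distances between points, which is what makes this unsupervised.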

Semisupervised ones

The training data is mostly unlabeled, with a small labeled portion (partially labeled)

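  • Example: Google Photos (groups similar faces without labels, then needs only one label per person)

Reinforcement Learning

  • An agent observes its environment, takes actions, and receives rewards or penalties; it learns the policy that earns the most reward over time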

Is the model capable of learning incrementally on the fly?

Batch Learning

  • System is trained, then launched into production, where it runs without learning any further (offline learning)
  • Training on the full set of data can take many hours, so batch learning copes poorly with rapidly changing data

Online Learning

  • Suited to systems that receive data as a continuous flow
  • Suited to systems that need to adapt to change rapidly or autonomously
  • Also suited to environments with limited computing resources, and to datasets too large to fit in memory (out-of-core learning)
  • Learning rate: how fast the system adapts to changing data
    • High learning rate: rapid adaptation to new data, but the system tends to quickly forget old data
    • Low learning rate: slower learning, but less sensitive to noise and outliers
  • If bad data is fed to the system
    • Overall performance declines gradually
    • Monitor the system closely and promptly switch learning off (and possibly revert to an earlier state) if you detect a drop in performance
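
The learning-rate trade-off above can be sketched with a tiny online learner. This is a minimal illustration, not a production recipe: the one-weight model y ≈ w * x, the function name `online_fit`, the drifting stream, and the outlier are all made up for the example. In scikit-learn, the comparable incremental API is `partial_fit` on estimators such as `SGDRegressor`.

```python
def online_fit(stream, learning_rate, w=0.0):
    """Update the weight of y ≈ w * x one instance at a time (stochastic gradient descent)."""
    for x, y in stream:
        error = w * x - y                   # prediction error on this single instance
        w -= learning_rate * error * x      # step against the squared-error gradient
    return w


# Hypothetical stream: the true relation is y = 2x for 200 instances,
# then drifts to y = 3x for the last 30 instances.
xs = [0.1 * (i % 10 + 1) for i in range(230)]
drift_stream = [(x, (2.0 if i < 200 else 3.0) * x) for i, x in enumerate(xs)]

fast_w = online_fit(drift_stream, learning_rate=0.5)    # adapts quickly to the drift
slow_w = online_fit(drift_stream, learning_rate=0.05)   # still between the old and new slope

# Feed one bad instance (an outlier) to both learners.
outlier = [(1.0, 30.0)]
fast_after = online_fit(outlier, learning_rate=0.5, w=fast_w)
slow_after = online_fit(outlier, learning_rate=0.05, w=slow_w)
```

With the high learning rate the weight tracks the new slope almost immediately, but a single outlier throws it far off; with the low learning rate the weight lags behind the drift and barely reacts to the outlier.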

Do they compare new data points to known data points, or detect patterns in the training data to build a predictive model?

Instance-based learning

  • System learns the examples by heart, then generalizes to new cases by comparing them to the learned examples using a similarity measure

Model-based learning

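  • System builds a model from the examples, then uses that model to make predictions (e.g. a linear regression model)
import sklearn.linear_model, sklearn.neighbors
model = sklearn.linear_model.LinearRegression()  # model-based
# Instance-based alternative: swap in k-Nearest Neighbors regression instead
model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)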
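
To show what the two estimators above do differently, here is a dependency-free sketch of both ideas on the same toy data. The helper names `fit_line` and `knn_predict` and the data values are assumptions for illustration only; they are simplified stand-ins, not scikit-learn's implementations.

```python
def fit_line(xs, ys):
    """Model-based: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


def knn_predict(xs, ys, x_new, k=3):
    """Instance-based: average the targets of the k stored examples closest to x_new."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k


# Toy data (hypothetical), roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

slope, intercept = fit_line(xs, ys)           # the training data is reduced to two parameters
line_prediction = slope * 3.4 + intercept     # predict from the parameters alone
knn_prediction = knn_predict(xs, ys, 3.4)     # predict by looking up similar stored examples
```

The model-based learner can discard the training data once `slope` and `intercept` are known, while the instance-based learner has to keep every example around to compare against at prediction time.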

Challenges of Machine Learning

  • Bad data
    • Insufficient Quantity of Training data
    • Nonrepresentative Training Data (Sampling Bias)
      • With too small a sample, the trained model is unlikely to make accurate predictions about boundary cases
    • Poor-Quality Data
      • Outliers: it may help to discard them or fix the errors manually
      • Instances missing a few features: ignore the feature or those instances / fill in the missing values (median, mean, etc.) / train one model with the feature and one model without it
    • Irrelevant Features
      • Feature selection: select the most useful features to train
      • Feature extraction: combine existing features to produce a more useful one (e.g. dimensionality reduction)
  • Bad algorithms
    • Overfitting (Overgeneralizing)
      • If the training set itself is noisy or too small, the model is likely to detect patterns in the noise itself.
      • Model is too complex relative to the amount and noisiness of the training data
      • Fix: simplify the model by reducing the number of attributes in the training data, or constrain the model (regularization)
      • Fix: gather more training data, or reduce the noise in the training data
    • Underfitting
      • Model is too simple to learn the underlying structure of the data
      • Select a more powerful model, with more parameters
      • Feed better features to the learning algorithm
      • Reduce the constraints on the model (e.g., reduce the regularization hyperparameter)
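
The effect of the regularization hyperparameter mentioned above can be sketched with one-feature ridge regression, where the penalty has a closed form. This is a minimal illustration under assumptions: the function name `ridge_fit`, the name `alpha` (chosen to mirror the `alpha` parameter of scikit-learn's `Ridge`), and the toy data are all made up for the example.

```python
def ridge_fit(xs, ys, alpha):
    """Fit y = slope * x + intercept, penalizing slope**2 with weight alpha (intercept unpenalized)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / (s_xx + alpha)   # alpha = 0 gives ordinary least squares
    return slope, mean_y - slope * mean_x


# Toy data (hypothetical), roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

# Larger alpha constrains the model more: the slope shrinks toward 0.
slopes = {alpha: ridge_fit(xs, ys, alpha)[0] for alpha in (0.0, 10.0, 1000.0)}
```

A very large `alpha` flattens the line so much that it can no longer follow the data (underfitting), which is why lowering it is listed as a remedy; raising it from zero is the corresponding remedy when a flexible model overfits.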

Testing and Validating

  • training set + test set
    • Does your model perform well on instances that it has never seen before?
    • low training error + high generalization (out-of-sample) error -> overfitting
  • Hyperparameter Tuning and Model Selection
    • Avoid tuning the model and hyperparameters against the test set: doing so produces the best model for that particular test set, which is unlikely to perform as well on new data
    • Holdout validation: hold out part of the training set (validation set) to evaluate several candidate models
      • train the model with reduced training set
      • choose the best model, then retrain it on the full training set (including the validation set)
      • evaluate the model on the test set
    • Size of validation set
      • too small: imprecise model evaluation
      • too large: remaining training set gets too small
      • Solution: perform repeated cross-validation, using many small validation sets
  • Data Mismatch
    • Some training sets do not perfectly represent the data that will be used in production
    • The validation set and test set must be as representative as possible of the data you expect to use in production
    • Hold out part of the training data, e.g. the training pictures (train-dev set)
      • train the model on the training set
      • evaluate the model on the train-dev set
      • if good: the model is not overfitting the training set, so continue to the validation set
        • if good on validation set: good
        • if not good on validation set: data mismatch -> preprocess the training data to look more like the production data?
      • if not good: the model has overfit the training set
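
The model-selection workflow above (keep a test set aside, pick hyperparameters by cross-validation on the rest, touch the test set once at the end) can be sketched as follows. Everything here is an assumption made for illustration: the helper names, the k-nearest-neighbors candidate models, the synthetic data, and the simple strided folds. scikit-learn offers `train_test_split` and `cross_val_score` in `sklearn.model_selection` for the same purpose.

```python
def knn_predict(train, x_new, k):
    """Average the targets of the k training examples closest to x_new."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mse(train, eval_set, k):
    """Mean squared error on eval_set of a k-NN model that stores train."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in eval_set) / len(eval_set)


def cross_val_mse(data, k, n_folds=4):
    """Average validation error over n_folds small validation sets."""
    scores = []
    for fold in range(n_folds):
        validation = data[fold::n_folds]   # every n_folds-th instance
        training = [pair for i, pair in enumerate(data) if i % n_folds != fold]
        scores.append(mse(training, validation, k))
    return sum(scores) / n_folds


# Synthetic data (hypothetical): y = 2x with alternating noise of +1 / -1.
data = [(float(x), 2.0 * x + (1.0 if x % 2 == 0 else -1.0)) for x in range(24)]

# Keep a test set aside; it plays no part in choosing the hyperparameter.
test_set = data[::6]
train_set = [pair for i, pair in enumerate(data) if i % 6 != 0]

# Compare candidate hyperparameter values using cross-validation only.
cv_scores = {k: cross_val_mse(train_set, k) for k in (1, 2, 4)}
best_k = min(cv_scores, key=cv_scores.get)

# Final, one-time estimate of the generalization error.
test_mse = mse(train_set, test_set, best_k)
```

Because `best_k` is chosen without looking at `test_set`, `test_mse` remains an honest estimate of how the selected model behaves on unseen data.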