The Machine Learning Landscape
Machine learning is often good for:
- Problems for which existing solutions require a lot of hand-tuning or long lists of rules
- Complex problems for which traditional algorithms offer no good solution at all
- Fluctuating environments, which require constant adaptation to new data
Types of Machine Learning Systems
Questions to ask
- Are they trained with human supervision?
- supervised, unsupervised, semisupervised, reinforcement learning
- Is the model capable of learning incrementally on the fly?
- online learning vs batch learning
- Comparing new data points to the known data points vs detecting new patterns to build a predictive model
- instance-based vs model-based
Are they trained with human supervision?
Supervised ones
- k-Nearest Neighbors
- Linear Regression
- Logistic Regression
- SVM
- Decision Trees / Random Forests
- Neural networks (only some of them; other architectures are unsupervised or semisupervised)
Unsupervised ones
- Clustering
- k-Means
- DBSCAN
- Hierarchical Cluster Analysis
- Anomaly detection and novelty detection
- One-class SVM
- Isolation Forest
- Visualization and dimensionality reduction
- PCA
- Kernel PCA
- Locally Linear Embedding
- t-distributed Stochastic Neighbor Embedding (t-SNE)
- Association rule learning
- Apriori
- Eclat
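As a rough illustration of two of the unsupervised tasks listed above, the sketch below clusters unlabeled points with k-Means and projects them to 2D with PCA. The data is synthetic and the parameter values (two clusters, two components) are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two well-separated synthetic blobs in 5 dimensions (made-up data, no labels).
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 5))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 5))
X = np.vstack([blob_a, blob_b])

# Clustering: group the instances without ever seeing a label.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Dimensionality reduction: keep 2 components, e.g. for a scatter plot.
X_2d = PCA(n_components=2).fit_transform(X)

print(cluster_ids[:5], cluster_ids[-5:])
print(X_2d.shape)
```

Which blob gets cluster id 0 is arbitrary; only the grouping is meaningful.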
Semisupervised ones
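- Training data is partially labeled: a lot of unlabeled data and a little bit of labeled data
- e.g. Google Photos: it clusters the faces in your photos (unsupervised), then needs only one label per person to name that person in every photo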
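Reinforcement Learning
- An agent observes the environment, performs actions, and gets rewards or penalties in return; it learns a policy that maximizes reward over time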
Is the model capable of learning incrementally on the fly?
Batch Learning
- The system is trained first, then launched into production, where it runs without learning any further (offline learning)
- Training on the full set of data can take many hours, so batch learning adapts poorly to rapidly changing data
Online Learning
- good for systems that receive data as a continuous flow
- good for systems that need to adapt to change rapidly or autonomously
- good for environments with limited computing resources, or datasets too large to fit in memory (out-of-core learning)
- learning rate: how fast they should adapt to changing data
- high learning rate: rapid adaptation to new data, tend to quickly forget the old data
- low learning rate: learns more slowly, but is less sensitive to noise (outliers)
- If bad data is fed to the system
- gradual decline in overall performance
- need to monitor your system closely and promptly switch learning off (+revert) if you detect a drop in performance
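A minimal sketch of online learning, assuming scikit-learn's `SGDRegressor` and a synthetic data stream. The helper `make_batch` and the value `eta0=0.05` are hypothetical choices for the example; `eta0` plays the role of the learning rate described above.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(42)

def make_batch(n=32):
    # Hypothetical stream: y = 3x + 2 plus a little noise.
    X = rng.uniform(-1, 1, size=(n, 1))
    y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.1, size=n)
    return X, y

# A constant learning rate: higher values adapt faster but forget faster.
model = SGDRegressor(learning_rate="constant", eta0=0.05, random_state=0)

# Feed the data in small batches; the full stream is never held in memory.
for _ in range(500):
    X_batch, y_batch = make_batch()
    model.partial_fit(X_batch, y_batch)

print(model.coef_, model.intercept_)  # should end up near 3 and 2
```

In a real system you would also score each incoming batch before learning from it, so that a drop in performance can trigger switching learning off and reverting to a saved copy of the model.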
Comparing new data points to the known data points vs detecting new patterns to build a predictive model
Instance-based learning
- System learns the examples by heart, and generalizes to new cases by comparing them to the learned examples (similarity measure)
Model-based learning
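- System builds a model from the examples, then uses that model to make predictions
- e.g. linear regression model
import sklearn.linear_model
import sklearn.neighbors
model = sklearn.linear_model.LinearRegression()  # model-based
model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)  # instance-based alternative, swap in to compare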
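To make the difference concrete, here is a small sketch that fits both kinds of model on the same made-up data (exactly `y = 2x + 1`) and asks each for a prediction far outside the training range. The data and the query point are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Made-up training data that follows y = 2x + 1 exactly.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
X_new = np.array([[10.0]])

# Model-based: learns the parameters of a line, then applies the line.
linear = LinearRegression().fit(X, y)
linear_pred = linear.predict(X_new)[0]

# Instance-based: stores the examples and averages the 3 most similar ones.
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)
knn_pred = knn.predict(X_new)[0]

print(linear_pred)  # about 21, the line extrapolated to x = 10
print(knn_pred)     # 9.0, the mean target of the neighbors x = 3, 4, 5
```

The linear model extrapolates the learned trend, while the k-nearest-neighbors model can only return an average of targets it has already seen.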
Challenges of Machine Learning
- Bad data
- Insufficient Quantity of Training data
- Nonrepresentative Training Data (Sampling Bias)
- With too small a sample size, the trained model is unlikely to make accurate predictions about boundary cases
- Poor-Quality Data
- Outliers: it may help to discard them or fix the errors manually
- Instances missing a few features: ignore the feature or those instances / fill in the missing values (median, avg, etc.) / train one model with the feature and one model without it
- Irrelevant Features
- Feature selection: select the most useful features to train
- Feature extraction: combine existing features to produce a more useful one (e.g. dimensionality reduction)
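One of the options above for missing values, filling them with the median, can be sketched with scikit-learn's `SimpleImputer`. The tiny array is made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data: the second feature is missing for one instance.
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [4.0, 50.0],
])

# Learn each column's median from the observed values, then fill the gaps.
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)

print(imputer.statistics_)  # per-column medians: 2.5 and 30.0
print(X_filled[1])          # the missing value is now 30.0
```

The medians should be learned on the training set only and then reused to transform validation, test, and production data.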
- Bad algorithms
- Overfitting (Overgeneralizing)
- If the training set itself is noisy or too small, the model is likely to detect patterns in the noise itself.
- Model is too complex relative to the amount and noisiness of the training data
- Simplify the model by reducing the number of attributes in the training data, or constrain the model (regularization)
- Gather more training data, or reduce the noise in the training data
- Underfitting
- Model is too simple to learn the underlying structure of the data
- Select a more powerful model, with more parameters
- Feed better features to the learning algorithm
- Reduce the constraints on the model (e.g., reduce the regularization hyperparameter)
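The effect of the regularization hyperparameter can be sketched with ridge regression on polynomial features. Everything here is an assumption for illustration: the synthetic quadratic data, the polynomial degree of 8, the two `alpha` values, and the helper `fit_poly_ridge`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)

# Small, noisy synthetic training set drawn from a quadratic curve.
X = rng.uniform(-3, 3, size=(20, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + rng.normal(scale=1.0, size=20)

def fit_poly_ridge(alpha, degree=8):
    # Hypothetical helper: a flexible polynomial model constrained by alpha.
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        Ridge(alpha=alpha),
    )
    return model.fit(X, y)

weak = fit_poly_ridge(alpha=1e-4)   # barely constrained, free to chase noise
strong = fit_poly_ridge(alpha=10.0)  # heavily constrained

weak_norm = np.linalg.norm(weak.named_steps["ridge"].coef_)
strong_norm = np.linalg.norm(strong.named_steps["ridge"].coef_)
weak_train_mse = mean_squared_error(y, weak.predict(X))
strong_train_mse = mean_squared_error(y, strong.predict(X))

print(weak_norm, strong_norm)
print(weak_train_mse, strong_train_mse)
```

A larger `alpha` shrinks the coefficients and raises the training error. Whether that trade helps is decided on held-out data: too little regularization risks overfitting, too much risks underfitting.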
Testing and Validating
- training set + test set
- Does your model perform well on instances that it has never seen before?
- low training error + high generalization(out-of-sample) error -> overfitting
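The overfitting signature above (low training error, high generalization error) can be reproduced with a deliberately unconstrained decision tree. The noisy sine data and the 80/20 split are assumptions made for this sketch.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Synthetic data: a sine curve plus noise.
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=200)

# Hold out 20% of the data as a test set the model never sees in training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# With no depth limit, the tree can memorize the training set, noise included.
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_mse = mean_squared_error(y_train, tree.predict(X_train))
test_mse = mean_squared_error(y_test, tree.predict(X_test))

print(train_mse)  # essentially zero
print(test_mse)   # clearly larger: the generalization error
```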
- Hyperparameter Tuning and Model Selection
- Do not tune the model and hyperparameters against the test set: that produces the best model for that particular test set, which is unlikely to perform as well on new data
- Holdout validation: hold out part of the training set (validation set) to evaluate several candidate models
- train the model with reduced training set
- choose the best model, and train the best model on full training set
- evaluate the model on the test set
- Size of validation set
- too small: imprecise model evaluation
- too large: remaining training set gets too small
- Solution: perform repeated cross-validation, using many small validation sets
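The holdout procedure and the cross-validation alternative can be sketched as follows. The synthetic data, the candidate `n_neighbors` values, and the split sizes are assumptions chosen for the example, not recommendations.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

# 1. Set the test set aside, then hold out a validation set from the rest.
X_full_train, X_test, y_full_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_full_train, y_full_train, test_size=0.25, random_state=0
)

# 2. Train each candidate on the reduced training set, score on validation.
candidates = [1, 5, 15, 50]  # hypothetical n_neighbors values
val_errors = {}
for k in candidates:
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    val_errors[k] = mean_squared_error(y_val, model.predict(X_val))
best_k = min(val_errors, key=val_errors.get)

# 3. Retrain the winner on the full training set, then evaluate once on test.
final_model = KNeighborsRegressor(n_neighbors=best_k)
final_model.fit(X_full_train, y_full_train)
test_mse = mean_squared_error(y_test, final_model.predict(X_test))

# Alternative: 5-fold cross-validation averages over several validation sets.
cv_scores = cross_val_score(
    KNeighborsRegressor(n_neighbors=best_k),
    X_full_train, y_full_train,
    cv=5, scoring="neg_mean_squared_error",
)
cv_mse = -cv_scores.mean()

print(best_k, test_mse, cv_mse)
```

Cross-validation gives a more precise estimate than a single small validation set, at the cost of training time multiplied by the number of folds.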
- Data Mismatch
- Some data sets do not perfectly represent the data that will be used in production
- Validation set and test set must be as representative as possible of the data you expect to use in production
- Hold out part of training pictures (train-dev set)
- train the model on train set
- evaluate the model on train-dev
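- if good on train-dev: the model is not overfitting the training set, so continue to the validation set
- if good on validation set: good
- if not good on validation set: data mismatch -> preprocess the training data to look more like the production data?
- if not good on train-dev: the model overfit the training set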
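The train-dev decision procedure above can be written as a small function. This is only a sketch: the function name `diagnose`, the use of error rates, and the `tolerance` threshold for "noticeably worse" are all hypothetical choices.

```python
def diagnose(train_error, train_dev_error, val_error, tolerance=0.05):
    """Guess the likely problem from three error rates (lower is better).

    `tolerance` is a hypothetical threshold for a "noticeably worse" error.
    """
    # Train-dev data comes from the same source as the training data, so a
    # gap here means the model memorized the training set.
    if train_dev_error - train_error > tolerance:
        return "overfitting the training set"
    # Fine on train-dev but worse on validation: the validation data differs
    # from the training data.
    if val_error - train_dev_error > tolerance:
        return "data mismatch"
    return "ok"


print(diagnose(0.02, 0.20, 0.22))  # overfitting the training set
print(diagnose(0.02, 0.03, 0.25))  # data mismatch
print(diagnose(0.02, 0.03, 0.04))  # ok
```

The sketch does not cover underfitting, where the training error itself is already high.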