Adding More Training Data Does Not Always Help
Data that offers no benefit should be left out of the training set for computational reasons. In the cat-detector example, scanned historical documents contain nothing resembling a cat and look completely unlike the dev/test distribution, so their benefit is negligible. Including them would waste computational resources and the neural network's representational capacity.
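As a rough illustration of this filtering decision, the sketch below keeps only the data sources judged to have some benefit and counts how many examples are skipped on each pass over the data. The `DataSource` class, the `expected_benefit` flag, the source names, and all example counts are hypothetical and are not taken from Machine Learning Yearning. The flag stands in for a human judgment made after inspecting samples from each source.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    name: str
    num_examples: int
    # Hypothetical flag: a person's judgment, after inspecting samples,
    # of whether this source could help on the dev/test distribution.
    expected_benefit: bool


def select_training_sources(sources):
    """Return the sources worth training on and the number of examples skipped."""
    kept = [s for s in sources if s.expected_benefit]
    skipped_examples = sum(
        s.num_examples for s in sources if not s.expected_benefit
    )
    return kept, skipped_examples


# Illustrative numbers only.
sources = [
    DataSource("mobile_app_cat_photos", 10_000, True),
    DataSource("internet_cat_images", 200_000, True),
    DataSource("scanned_historical_documents", 500_000, False),
]

kept, skipped = select_training_sources(sources)
print([s.name for s in kept])
print(f"Examples skipped per epoch: {skipped}")
```

In this made-up setup, dropping the scanned documents removes most of the examples the model would otherwise process each epoch, which is the computational saving the paragraph describes.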
References
Machine Learning Yearning (Deeplearning.ai)
Tags
Machine Learning
Deep Learning
Supervised Learning
Dive into Deep Learning @ D2L
Data Science
Machine Learning Strategy
Machine Learning Yearning @ DeepLearning.AI
Related
Bias (Informal Definition)
Variance (Informal Definition)
Total Error Equals Bias Plus Variance for Mean Squared Error
Estimating the Optimal Error Rate
Bias-Variance Tradeoff
Learning Curve for Dev-Set Error
Deciding Whether to Reduce Bias, Variance, or Data Mismatch
High Avoidable Bias with 10% Training, 11% Training-Dev, and 12% Dev Error
Algorithms Can Simultaneously Have Avoidable Bias, Variance, and Data Mismatch Problems
High Variance Bias-Variance Example for Cat Classification
High Bias Low Variance Bias-Variance Example
High Bias and High Variance Bias-Variance Example
Low Bias and Low Variance Bias-Variance Example
According to Machine Learning Yearning, what are the two major sources of error in machine learning?
Understanding bias and variance helps you decide whether adding more training data or other tactics to improve performance are a good use of time.
According to Machine Learning Yearning, the two major sources of error in machine learning are bias and _____.
Which two fundamental error components does Andrew Ng identify as targets for ML optimization?
Understanding bias and variance helps you decide whether adding more training data is a good use of time.
Machine Learning Yearning identifies _____ and variance as the two major sources of error in machine learning.
Match each term to its role in ML Yearning's two-major-sources-of-error framework.
Order the conceptual steps a practitioner follows when applying the bias-variance framework to guide improvement efforts.
What practical benefit does ML Yearning say comes from understanding bias and variance?
Machine Learning Yearning describes bias and variance as the only sources of error in machine learning.
Understanding bias and variance helps you decide whether _____ are a good use of time.
Match each child concept to the aspect of the bias-variance framework it addresses.
Order the reasoning steps a practitioner takes when deciding whether adding training data will improve performance.
Analyzing Error Sources to Direct Machine Learning Development Efforts
Evaluating Team Strategy for Improving an Image Classifier Using Error Analysis
Guiding Development Tactics Through Machine Learning Error Analysis
Special Challenges from Different Training and Dev/Test Distributions
Risk of Merging Training Data Sources Depends on Algorithm Flexibility
Shared Label Mapping Across Data Sources
Training and Dev/Test Sets from Different Distributions
Inconsistent Auxiliary Data Source
Approximating Future Dev/Test Data Before Launch
Updating Dev/Test Sets with Actual User Data After Launch
Risk of Starting with Website Images When Future-Like Data Is Unavailable
Development Investment for Dev and Test Sets Requires Judgment
According to Machine Learning Yearning, what is the primary criterion for choosing dev and test sets?
True or False: When building a dev/test set, it is safe to assume the training distribution is the same as the test distribution.
Dev and test sets should contain examples that reflect what you ultimately want to perform well on, rather than only the _____ you happen to have for training.
Why is using a simple 30% random split of available data as your test set problematic when future data differs from training data?
According to ML Yearning, it is generally safe to assume your training data distribution is the same as your test data distribution.
Dev and test sets should be chosen to reflect data you expect to get in the _____ and want to do well on.
Match each dev/test set concept from ML Yearning to its correct description.
Order the steps for correctly choosing dev and test sets according to ML Yearning's guidance.
According to ML Yearning, what should the examples in your dev and test sets primarily reflect?
According to ML Yearning, dev and test sets must always come from the same distribution as the training data.
ML Yearning warns that the test set should not simply be _____ of the available data when future data differs from the training set.
Match each data scenario to the correct dev/test set strategy decision according to ML Yearning.
Order the reasoning steps for deciding whether a proposed dev/test set is well-chosen, per ML Yearning.
Why Standard Data Splits Fail With Different Future Distributions
Dev and Test Set Design for Mobile Image Applications
The Core Criterion for Dev and Test Set Selection
Learn After
In the cat-detector example, why does Machine Learning Yearning recommend excluding scanned historical documents that look nothing like the dev/test distribution?
True or False: Adding more training data always improves model performance when training, dev, and test sets share the same distribution.
According to Machine Learning Yearning, if the dev error curve has _____ (i.e., flattened out), adding more training data will not help you reach your performance goal.
Why does Machine Learning Yearning recommend leaving out training data that has no benefit for your model?
True or False: According to Machine Learning Yearning, adding more training data can actually hurt model performance.
If the dev error curve has _____, you can immediately tell that adding more training data won't help reach your performance goal.
Match each concept to its correct description in Machine Learning Yearning's discussion of when adding training data does not help.
Order the steps for using a learning curve to decide whether to collect more training data, as described in Machine Learning Yearning.
In the cat-detector example, why should a large collection of scanned historical documents be excluded from training?
True or False: According to Machine Learning Yearning, inspecting the learning curve can prevent wasting months collecting data that turns out not to help.
According to Machine Learning Yearning, data that has no _____ should be left out of training for computational reasons.
Match each training data scenario to the recommended action from Machine Learning Yearning.
Order the reasoning steps for evaluating whether a new data source (e.g., scanned documents) should be added to training, per Machine Learning Yearning.
Analyzing Computational and Representational Costs of Unhelpful Training Data
Evaluating the Inclusion of Historical Document Scans in a Casual Cat-Detector Model
Neural Network Capacity and Irrelevant Training Data