Evaluating the Inclusion of Historical Document Scans in a Casual Cat-Detector Model
Case context: You are building a casual cat-detector system whose target dev/test sets consist of everyday photographs of people, places, landmarks, and animals. You are offered a massive dataset of scanned historical text documents to add to your training set; the scans contain no cats or cat-like features.
Question: Based on the principles in Machine Learning Yearning, decide whether to include these scanned historical documents in your training data, and explain how including them would affect model capacity and computational efficiency.
Sample answer: The scanned historical documents should be excluded from the training data. Because they look completely unlike the dev/test distribution and contain nothing resembling a cat, they offer negligible benefit. Including them would waste computational resources during training and consume neural-network representation capacity on features irrelevant to the target task.
Key points:
- Exclude the scanned historical documents from the training data.
- The scanned documents look nothing like the dev/test images and therefore offer negligible benefit.
- Including them wastes computational resources and neural-network representation capacity.
Rubric: The student should state that the documents must be excluded. They must identify that inclusion leads to wasted computational resources and wasted neural-network representation capacity on irrelevant features.
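Worked illustration: the computational cost named in the sample answer can be made concrete with rough arithmetic. The sketch below is only an estimate under a stated assumption (every training example costs about the same to process per epoch), and the dataset sizes and the function name `wasted_compute_fraction` are hypothetical, not taken from Machine Learning Yearning. It does not measure the representation-capacity cost, which is harder to quantify.

```python
def wasted_compute_fraction(n_relevant: int, n_irrelevant: int) -> float:
    """Share of per-epoch example passes spent on irrelevant data.

    Assumption: each example costs roughly the same to process,
    so compute is proportional to the number of examples.
    """
    total = n_relevant + n_irrelevant
    if total == 0:
        raise ValueError("dataset is empty")
    return n_irrelevant / total


# Hypothetical sizes, chosen only for illustration.
n_photos = 200_000      # everyday photographs resembling the dev/test sets
n_scans = 1_800_000     # scanned historical documents with no cats

fraction = wasted_compute_fraction(n_photos, n_scans)
print(f"{fraction:.0%} of each epoch would be spent on document scans")
```

With these made-up numbers, most of every epoch would go to images that cannot help the model recognize cats, which is the waste the sample answer describes.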
References
Machine Learning Yearning (Deeplearning.ai)
Tags
Machine Learning
Deep Learning
Supervised Learning
Dive into Deep Learning @ D2L
Data Science
Machine Learning Strategy
Related
In the cat-detector example, why does Machine Learning Yearning recommend excluding scanned historical documents that look nothing like the dev/test distribution?
True or False: Adding more training data always improves model performance when training, dev, and test sets share the same distribution.
According to Machine Learning Yearning, if the dev error curve has _____ (i.e., flattened out), adding more training data will not help you reach your performance goal.
Why does Machine Learning Yearning recommend leaving out training data that has no benefit for your model?
True or False: According to Machine Learning Yearning, adding more training data can actually hurt model performance.
If the dev error curve has _____, you can immediately tell that adding more training data won't help reach your performance goal.
Match each concept to its correct description in Machine Learning Yearning's discussion of when adding training data does not help.
Order the steps for using a learning curve to decide whether to collect more training data, as described in Machine Learning Yearning.
In the cat-detector example, why should a large collection of scanned historical documents be excluded from training?
True or False: According to Machine Learning Yearning, inspecting the learning curve can prevent wasting months collecting data that turns out not to help.
According to Machine Learning Yearning, data that has no _____ should be left out of training for computational reasons.
Match each training data scenario to the recommended action from Machine Learning Yearning.
Order the reasoning steps for evaluating whether a new data source (e.g., scanned documents) should be added to training, per Machine Learning Yearning.
Analyzing Computational and Representational Costs of Unhelpful Training Data
Neural Network Capacity and Irrelevant Training Data