1Cademy - Evaluating the Inclusion of Historical Document Scans in a Casual Cat-Detector Model

Learn Before

Adding More Training Data Does Not Always Help

Case Study

Case context: You are building a casual cat-detector system where your target dev/test sets consist of everyday photographs of people, places, landmarks, and animals. You are offered a massive dataset of scanned historical text documents containing no cats or cat-like features to add to your training set.

Question: Based on the principles in Machine Learning Yearning, decide whether you should include these scanned historical documents in your training data, and explain the consequences of doing so regarding model capacity and computational efficiency.

Sample answer: The scanned historical documents should be excluded from the training data. Because they look completely unlike the dev/test distribution and contain nothing resembling a cat, they offer negligible benefit. Including them would waste computational resources during training and consume neural-network representation capacity on features irrelevant to the target task.

Key points:

Exclude the scanned historical documents from the training data.
The scanned documents look completely unlike the dev/test distribution and contain nothing resembling a cat, so they offer negligible benefit.
Including them wastes computational resources and neural-network representation capacity.
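
The compute cost named in the key points can be made concrete with a little arithmetic. The sketch below is illustrative only: the dataset sizes are hypothetical, the function name `wasted_compute_fraction` is invented for this example, and it assumes every training example costs roughly the same to process and that each epoch visits every example once.

```python
def wasted_compute_fraction(relevant_examples: int, irrelevant_examples: int) -> float:
    """Return the share of per-epoch training examples (and so, roughly,
    of forward/backward passes) spent on data unlike the dev/test set.

    Assumes equal cost per example and one pass over all data per epoch.
    """
    total = relevant_examples + irrelevant_examples
    if total == 0:
        raise ValueError("training set is empty")
    return irrelevant_examples / total


# Hypothetical sizes, chosen only for illustration.
everyday_photos = 200_000     # resembles the dev/test distribution
document_scans = 1_800_000    # scanned historical documents, no cats

fraction = wasted_compute_fraction(everyday_photos, document_scans)
print(f"{fraction:.0%} of each epoch would be spent on document scans")
```

Under these assumed sizes, most of every epoch would go to images that cannot help the cat detector. The sketch measures only the compute side; the representation capacity spent on irrelevant features is not captured by this calculation.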

Rubric: The student should state that the documents must be excluded. They must identify that inclusion leads to wasted computational resources and wasted neural-network representation capacity on irrelevant features.

Updated 2026-05-26
