Case Study

Developing an Image-to-Text Model

Case context: A machine learning team wants to build a system that takes an image as input and generates a descriptive sentence as output. They are considering an end-to-end deep learning approach.

Question: Based on the concept of directly learning rich outputs, what specific data must the team possess to successfully train this end-to-end system, and how does ML Yearning classify the type of output this system generates?

Sample answer: The team needs a dataset of the right labeled (input, output) pairs: in this case, images that are each paired with a corresponding descriptive sentence. ML Yearning classifies the generated sentence as a 'rich output,' meaning an output that is much more complex than a single number.

Key points:

  • The team needs the right labeled input-output pairs.
  • The output (a sentence) is classified as a rich output.
  • A rich output is more complex than a single number.
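
The key points above can be made concrete with a minimal sketch of what a labeled (image, sentence) dataset might look like. Everything here is hypothetical and for illustration only: the names CaptionPair, is_rich_output, and usable_pairs, the file paths, and the captions are assumptions, not part of ML Yearning or of any particular library. The number-versus-sentence check is a rough heuristic, not a formal definition.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CaptionPair:
    """One labeled training example (hypothetical structure)."""
    image_path: str  # input x: path to an image file (made-up paths below)
    caption: str     # output y: the descriptive sentence, i.e. the rich output


def is_rich_output(label) -> bool:
    """Rough heuristic: a single number is not a rich output; a sentence is."""
    return not isinstance(label, (int, float))


def usable_pairs(pairs: List[CaptionPair]) -> List[CaptionPair]:
    """Keep only pairs that have both an image path and a non-empty caption."""
    return [p for p in pairs if p.image_path and p.caption.strip()]


# Illustrative data only; a real dataset would hold many such pairs.
dataset = [
    CaptionPair("images/0001.jpg", "A yellow bus driving down a road."),
    CaptionPair("images/0002.jpg", ""),  # missing caption: unusable for training
]

train_set = usable_pairs(dataset)
```

In this sketch, an image without a caption is dropped because an end-to-end model can only learn the image-to-sentence mapping from complete pairs.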

Rubric: The learner should state the requirement for labeled input-output pairs (specifically images paired with sentences) and classify the sentence output as a 'rich output' or complex output.

Updated 2026-05-27

Tags

Machine Learning

Deep Learning

Supervised Learning

Dive into Deep Learning @ D2L

Data Science

Machine Learning Strategy

Machine Learning Yearning @ DeepLearning.AI