1Cademy - Developing an Image-to-Text Model

Learn Before

Directly Learning Rich Outputs

Case Study

Developing an Image-to-Text Model

Case context: A machine learning team wants to build a system that takes an image as input and generates a descriptive sentence as output. They are considering an end-to-end deep learning approach.

Question: Based on the concept of directly learning rich outputs, what specific data must the team possess to successfully train this end-to-end system, and how does ML Yearning classify the type of output this system generates?

Sample answer: The team must possess a dataset consisting of the right labeled (input, output) pairs, specifically pairs of images and their corresponding descriptive sentences. ML Yearning classifies the generated sentence as a 'rich output,' meaning it is an output much more complex than a single number.

Key points:

The team needs the right labeled input-output pairs.
The output (a sentence) is classified as a rich output.
Rich outputs are more complex than predicting a number.

Rubric: The learner should state the requirement for labeled input-output pairs (specifically images paired with sentences) and classify the sentence output as a 'rich output' or complex output.

0

1

Updated 2026-05-27

Contributors are:

Who are from:

References

Learn Before

Related