# Code Explanation: Image Caption Generator

This document explains the **Image Caption Generator** app, which uses a ViT+GPT2 model to generate descriptive captions for uploaded images.

---

## Overview

**Purpose**
Upload an image and receive a concise, descriptive caption generated by a Vision Transformer (ViT) encoder combined with a GPT-2 decoder.

**Tech Stack**
- **Model**: `nlpconnect/vit-gpt2-image-captioning` (Vision Transformer + GPT-2)
- **Precision**: `torch_dtype=torch.bfloat16` for reduced memory usage and faster inference on supported hardware
- **Interface**: Gradio Blocks with `Image` and `Textbox` components

---

## ⚙️ Setup & Dependencies

Install the required libraries:

```bash
pip install transformers gradio torch torchvision pillow
```

---
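Before running the app, you can optionally confirm that PyTorch is importable and whether your hardware natively supports bfloat16 (a minimal sketch; on unsupported hardware the pipeline still runs, just more slowly):

```python
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

# bfloat16 is natively accelerated on recent GPUs (Ampere and newer)
# and recent CPUs; elsewhere it falls back to slower emulation.
if torch.cuda.is_available():
    print("GPU bf16 support:", torch.cuda.is_bf16_supported())
else:
    print("Running on CPU; bfloat16 tensors are still supported.")
```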

## Detailed Block-by-Block Code Explanation

```python
import torch
import gradio as gr
from transformers import pipeline

# 1) Load the image-to-text pipeline
captioner = pipeline(
    "image-to-text",
    model="nlpconnect/vit-gpt2-image-captioning",
    torch_dtype=torch.bfloat16
)

# 2) Caption generation function
def generate_caption(image):
    # Guard against the user clicking the button with no image uploaded
    if image is None:
        return "Please upload an image first."
    outputs = captioner(image)
    return outputs[0]["generated_text"]

# 3) Build Gradio interface
with gr.Blocks(theme=gr.themes.Default()) as demo:
    gr.Markdown(
        "## 🖼️ Image Caption Generator\n"
        "Upload an image to generate a descriptive caption using ViT+GPT2."
    )

    with gr.Row():
        input_image = gr.Image(type="pil", label="Upload Image")
        caption_output = gr.Textbox(label="Generated Caption", lines=2)

    generate_btn = gr.Button("Generate Caption")
    generate_btn.click(fn=generate_caption, inputs=input_image, outputs=caption_output)

    gr.Markdown(
        "---\n"
        "Built with 🤗 Transformers (`nlpconnect/vit-gpt2-image-captioning`) and Gradio"
    )

demo.launch()
```

**Explanation:**

1. **Imports**:
   - `torch` for tensor operations and bfloat16 support.
   - `gradio` for the web interface.
   - `pipeline` from Transformers to load the image-captioning model.
2. **Pipeline Loading**:
   - The `"image-to-text"` task pairs a ViT encoder with a GPT-2 decoder.
   - Loading in bfloat16 halves memory use relative to float32 and speeds up inference on supported hardware.
3. **Caption Function**:
   - Accepts a PIL image, runs the pipeline, and returns the generated caption text.
4. **Gradio UI**:
   - Uses **Blocks** and **Row** to lay out the uploader and output.
   - The **Image** component accepts uploaded images.
   - The **Textbox** displays the generated caption.
   - The **Button** triggers caption generation when clicked.
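For reference, an `image-to-text` pipeline returns a list of dictionaries, one per generated sequence, each with a `generated_text` key. A minimal sketch of how the caption function unpacks that structure, using a stubbed output instead of a real model call:

```python
# Stubbed pipeline output, mimicking the shape returned by an
# "image-to-text" pipeline: a list of dicts with "generated_text".
fake_outputs = [{"generated_text": "a dog sitting on a wooden bench"}]

def extract_caption(outputs):
    # Take the first (and by default only) generated sequence.
    return outputs[0]["generated_text"]

print(extract_caption(fake_outputs))  # a dog sitting on a wooden bench
```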

---

## Core Concepts

| Concept                    | Why It Matters                                              |
|----------------------------|-------------------------------------------------------------|
| Vision Transformer (ViT)   | Extracts visual features from images                        |
| GPT-2 Decoder              | Generates natural-language text from visual features        |
| bfloat16 Precision         | Lowers memory usage and speeds up inference on supported HW |
| Gradio Blocks & Components | Simplifies web app creation without frontend coding         |
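To make the bfloat16 row concrete, a small sketch comparing the memory footprint of the same tensor in float32 versus bfloat16:

```python
import torch

t_fp32 = torch.zeros(1000, 1000, dtype=torch.float32)
t_bf16 = torch.zeros(1000, 1000, dtype=torch.bfloat16)

# element_size() returns bytes per element: 4 for float32, 2 for bfloat16,
# so casting weights to bfloat16 halves their memory footprint.
print(t_fp32.element_size() * t_fp32.nelement())  # 4000000
print(t_bf16.element_size() * t_bf16.nelement())  # 2000000
```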

---

## Extensions & Alternatives

- **Alternate Captioning Models**:
  - `Salesforce/blip-image-captioning-base`
  - `microsoft/git-base-coco`

- **UI Enhancements**:
  - Allow batch upload of multiple images.
  - Display generated captions alongside thumbnails.
  - Add an option to download captions as a text file.

- **Advanced Features**:
  - Fine-tune the model on a custom image dataset for domain-specific descriptions.
  - Integrate with image galleries or social media platforms for auto-captioning.
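As one way to approach the "download captions as a text file" idea, a hedged sketch of a helper that writes filename/caption pairs to disk (the function name and file format are illustrative, not part of the original app):

```python
import os
import tempfile

def save_captions(captions, path):
    """Write (filename, caption) pairs to a tab-separated text file."""
    with open(path, "w", encoding="utf-8") as f:
        for name, caption in captions:
            f.write(f"{name}\t{caption}\n")
    return path

# Usage: save two captions and read the file back.
out_path = os.path.join(tempfile.mkdtemp(), "captions.txt")
save_captions([("dog.jpg", "a dog on a bench"),
               ("cat.jpg", "a cat sleeping on a sofa")], out_path)
with open(out_path, encoding="utf-8") as f:
    print(f.read(), end="")
```

In the Gradio UI, such a file could be exposed to the user through a `gr.File` output component.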