Applications and final projects#

Fig. 49 Image generated using OpenDALL-E#

Processing images and audio data is at the core of many applications. This chapter provides an overview of some of the most common applications in these fields and shows how they can be implemented using the tools presented in this book.

Computer vision#

Computer vision is the field of computer science that deals with the automatic extraction of information from images. It is a very broad field that includes many different tasks:

  • Image classification: assign a label to an image (e.g. which kind of plant is in a picture).

  • Object detection: detect specific objects in an image (e.g. cars, pedestrians, etc.).

  • Image segmentation: assign a label to each pixel of an image (e.g. which pixels belong to a car).

  • Image generation: generate new images (e.g. generate a picture given a text description).

  • Image captioning: generate a text description of an image (e.g. describe the content of a picture).

  • … many more!

Image classification#

Image classification is the task of assigning a label to an image. For example, given an image of a dog, the goal is to assign the label “dog” to it. Both CNNs and transformers can be used for image classification.

CNN implementation of image classification

The following code shows how to implement image classification using a CNN. The code is based on the PyTorch tutorial on image classification.

import torch
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from tqdm.notebook import tqdm

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Define a CNN
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1) # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Load the data
dataset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transforms.ToTensor())

trainset, valset = torch.utils.data.random_split(dataset, [40000, 10000])

trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)

valloader = torch.utils.data.DataLoader(valset, batch_size=4,
                                            shuffle=False, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                        download=True, transform=transforms.ToTensor())

testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                          shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
            'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

# Define the model
net = Net()

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

def train_one_epoch(net, trainloader, criterion, optimizer):
    running_loss = 0.0
    for i, data in enumerate(tqdm(trainloader)):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # to GPU
        inputs = inputs.to(device)
        labels = labels.to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # accumulate the loss over the epoch
        running_loss += loss.item()
    return running_loss / len(trainloader)

def evaluate(net, valloader, criterion):
    running_loss = 0.0
    correct = 0
    total = 0
    # disable gradient computation during evaluation
    with torch.no_grad():
        for data in tqdm(valloader):
            images, labels = data

            # to GPU
            images = images.to(device)
            labels = labels.to(device)

            # forward pass and loss
            outputs = net(images)
            loss = criterion(outputs, labels)
            running_loss += loss.item()

            # the predicted class is the one with the highest score
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    # average loss and accuracy over the whole set
    return running_loss / len(valloader), correct / total

# Train the model
net.to(device)

for epoch in range(10):  # loop over the dataset multiple times

    train_loss = train_one_epoch(net, trainloader, criterion, optimizer)
    val_loss, val_acc = evaluate(net, valloader, criterion)
    print(f"Epoch {epoch} - Train loss: {train_loss:.3f} - Val loss: {val_loss:.3f} - Val acc: {val_acc:.3f}")

print('Finished Training')

# Evaluate the model on the test set
test_loss, test_acc = evaluate(net, testloader, criterion)
print(f"Test loss: {test_loss:.3f} - Test acc: {test_acc:.3f}")

This code uses a standard CNN architecture for image classification. The model is trained on the CIFAR10 dataset, which contains 10 classes of images: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck. The model achieves an accuracy of ~58% on the test set.

Exercise - implement a Vision Transformer (ViT) for image classification - 45 min

Implement a Vision Transformer (ViT) for image classification. You can use a pre-trained ViT model from the HuggingFace model hub and fine-tune it on the CIFAR10 dataset.

You can use the tag google/vit-base-patch16-224 to load both the model and its feature extractor. ✋ Remember that each model has its own feature extractor. The model documentation is available here.
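
As a starting point, here is a minimal sketch (assuming the transformers library is installed) of how the pre-trained ViT and its feature extractor could be loaded and applied to a single CIFAR10 image; the fine-tuning loop itself is left to you. The num_labels and ignore_mismatched_sizes arguments are one possible way to replace the 1000-class ImageNet head with a 10-class one.

import torch
import torchvision
from transformers import ViTFeatureExtractor, ViTForImageClassification

# Load the pre-trained ViT and its feature extractor from the HuggingFace hub
model_tag = "google/vit-base-patch16-224"
feature_extractor = ViTFeatureExtractor.from_pretrained(model_tag)
model = ViTForImageClassification.from_pretrained(
    model_tag,
    num_labels=10,                 # CIFAR10 has 10 classes
    ignore_mismatched_sizes=True,  # replace the 1000-class ImageNet head
)

# The feature extractor resizes and normalizes images to the format
# expected by this checkpoint (224x224 RGB)
dataset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True)
image, label = dataset[0]  # a PIL image and its class index
inputs = feature_extractor(images=image, return_tensors="pt")

# Forward pass: passing labels makes the model also return a loss,
# which is what you would backpropagate during fine-tuning
outputs = model(**inputs, labels=torch.tensor([label]))
print(outputs.loss, outputs.logits.shape)  # logits have shape (1, 10)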

Solution

Speech Processing#

Speech processing is the field of computer science that deals with the automatic analysis and generation of speech signals. It is a very broad field that includes many different tasks:

  • Speech recognition: convert speech to text.

  • Speaker recognition: identify the speaker from a speech signal.

  • Speech synthesis: generate speech from text.

  • Speech translation: translate speech from one language to another.

  • … many more!

Audio signals can be analyzed using two different representations: the time-domain representation and the frequency-domain representation. The time-domain representation is the most intuitive one: it represents the amplitude of the signal as a function of time. The frequency-domain representation is obtained by applying a Fourier transform to the time-domain representation and represents the amplitude of the signal as a function of frequency. Combining the two, a time-frequency representation (e.g. a spectrogram) is obtained by applying the Fourier transform to short, overlapping windows of the signal, showing how its frequency content evolves over time.

Time-frequency representations are usually treated as images and can be processed using CNNs or transformers. Time-domain representations, on the other hand, are time series and can be processed using RNNs or transformers.
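
As an illustration, here is a minimal sketch (assuming torchaudio is installed, and using audio.wav as a placeholder for any audio file you have available) of how the two representations can be computed:

import torchaudio

# Time-domain representation: the raw waveform, shape (channels, num_samples)
# "audio.wav" is a placeholder path for any audio file available on disk
waveform, sample_rate = torchaudio.load("audio.wav")

# Time-frequency representation: a mel spectrogram, obtained by applying
# short-time Fourier transforms over sliding windows of the waveform
spectrogram = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=1024, hop_length=512, n_mels=64
)(waveform)

print(waveform.shape)     # e.g. (1, num_samples): a time series (RNNs / transformers)
print(spectrogram.shape)  # e.g. (1, 64, num_frames): image-like (CNNs / transformers)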

Keyword spotting#

Keyword spotting is the task of detecting specific words in an audio signal. For example, given an audio signal, the goal is to detect the word “yes” in it. Both CNNs and transformers can be used for the task. One practical application of keyword spotting is the detection of wake words in smart speakers. For example, the wake word “Alexa” is used to activate the Amazon Echo smart speaker.

Exercise - implement a transformer for keyword spotting - 45 min

Using the superb dataset, implement a transformer for keyword spotting. You can fine-tune a pre-trained transformer from the HuggingFace model hub, or implement your own transformer-based model from scratch using the PyTorch implementation of encoder layers.
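
As a starting point, here is a minimal sketch (assuming the datasets and transformers libraries are installed; facebook/wav2vec2-base is just one possible choice of pre-trained checkpoint) of how the keyword spotting subset of superb can be loaded and fed to a pre-trained speech model with a fresh classification head:

from datasets import load_dataset
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

# "ks" is the keyword spotting subset of the superb dataset
dataset = load_dataset("superb", "ks", split="train")
labels = dataset.features["label"].names

# Pre-trained speech encoder with a new classification head on top
model_tag = "facebook/wav2vec2-base"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_tag)
model = AutoModelForAudioClassification.from_pretrained(model_tag, num_labels=len(labels))

# Preprocess one example: raw waveform -> model inputs
example = dataset[0]
inputs = feature_extractor(
    example["audio"]["array"],
    sampling_rate=example["audio"]["sampling_rate"],
    return_tensors="pt",
)

# Logits over the keywords; fine-tune with cross-entropy as in the CNN example
logits = model(**inputs).logits
print(logits.shape)  # (1, number of keywords)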

Solution

Conclusion#

In this chapter, we have seen how to use CNNs and transformers for image and audio processing: we implemented a CNN for image classification, worked with transformers for keyword spotting, and used pre-trained models for both tasks.

Final project - 1 week

As a final assignment for the course, you are asked to submit a short report presenting an idea on how and where you would use the tools presented in this course in your research. Provide a brief description of the data, the model you would use, and the results you would expect to obtain. Your idea does not need to be directly related to image or audio processing, and you are free to choose any model (CNN, transformer, etc.). If you wish, you can provide a draft implementation of your idea, using the code from the previous exercises as a starting point.

Write your report in LaTeX. You can use Overleaf to edit it online and this template as a starting point. Submit your report by sending an email to moreno.laquatra@unikore.it.