Deep Learning and Computer Vision - Semantic Segmentation - Hand-Coded UNet Reimplementation and Project Practice

Published: 2025-07-16

Authors: 王语其, 蒋思卿, 师燕舒, 尤好, 曾俊达

Note: This is the project plan and theoretical explanation (in English) for the Deep Learning and Computer Vision course-design experiment report. It differs substantially from the formal experiment report (paper) and provides theoretical guidance and a framework reference for it. The copyright belongs solely to the first author of the course-design paper; please cite the source when reposting.

Part of the report's code has been uploaded to my personal space and is available for download; discussion is welcome.

Abstract: The project reconstructs the UNet model (proposed in 2015) by hand, training and evaluating it on the ISIC-2017 challenge dataset in order to: (1) master the architecture and principles of UNet and better understand how new ideas are born; (2) get a glimpse of deep learning applied to medical image segmentation while becoming familiar with the metrics commonly used to evaluate semantic segmentation; (3) learn the complete DL workflow, i.e., data preprocessing, model training, and model evaluation, as well as higher-level design such as data augmentation and training-process optimization (acceleration).

Keywords: Semantic Segmentation; UNet; Data Augmentation; Cached Training; Model Evaluation

The project is based on the DL framework torch (CPU or GPU version) and torchvision, with Python interpreter version 3.8, and runs in a miniconda virtual environment; packages such as matplotlib, numpy, and time serve as assistant tools for calculation, visualization, and evaluation.

Figure 1  Project Development Workflow

The report is organized according to the standard process of deep-learning model design and development and is divided into four parts: Data Preprocessing and Augmentation; Network Structure Analysis and Hyperparameter Optimization; Model Evaluation and Loss Function Settings; and Training Schedule.

Contents

1. Data Preprocessing and Augmentation

1.1 Downloading the original dataset

1.2 IO Setting, Integrity Checking and Size Matching

1.3 Data Transform, Data Augmentation and Cached Training

2. Network Structure Analysis and Hyperparameter Optimization

2.1 Component Analysis

2.2 Data Size Transform Analysis

2.3 Parameter Analysis and Initialization Setting

3. Model Evaluation and Loss Function Settings

3.1 Special Loss Functions and Accuracy Designs

3.2 Prediction Visualization

3.3 Learning Process Visualization

4. Training Schedule

4.1 Multiple platform parallelism

4.2 Shrinking network parameter size but maintaining network structure

4.3 Cached Training

4.4 How does main memory overflow occur?

4.4.1 BP review

4.4.2 A rough calculation

References


1. Data Preprocessing and Augmentation

The operations are encapsulated within the CustomDataset class. The dataset has four different attribute settings, intended respectively for:

hyperparameter choosing + training (2000), hyperparameter choosing + validation (150),

formal training and evaluation + training (2150),

formal training and evaluation + test (600).

The numbers in parentheses are the maximum supported counts. In practice they can be set to any smaller number, balancing training time, memory cost, and fast hyperparameter searching.
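A minimal sketch of how such a class might expose the four settings (the `stage`, `split`, and `max_count` parameter names are illustrative assumptions, not the project's actual signature):

```python
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    """Sketch only: exposes the four attribute settings described above."""
    # maximum supported counts per (stage, split)
    MAX_COUNTS = {
        ("hyper", "train"): 2000, ("hyper", "val"): 150,
        ("formal", "train"): 2150, ("formal", "test"): 600,
    }

    def __init__(self, root, stage="hyper", split="train", max_count=None):
        limit = self.MAX_COUNTS[(stage, split)]
        # any practical count <= the maximum is allowed, trading accuracy
        # for training time, memory cost, and hyperparameter-search speed
        self.count = limit if max_count is None else min(max_count, limit)
        self.root = root
        self.samples = []  # filled by the IO routine of section 1.2

    def __len__(self):
        return self.count
```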

1.1 Downloading the original dataset

For later convenience in coding and dataset reconstruction, put all of the data into one dataset directory named 'ISIC-2017'. The dataset is composed of 2000 training RGB images of non-uniform sizes, 150 validation images (non-uniform sizes), and 600 test images (non-uniform sizes), each with a corresponding label of identical size stored as a binary image.

The project uses the validation dataset to tune and fix better hyperparameters; that is, we treat the 2150 = 2000 + 150 images as the hyperparameter-choosing dataset, employing methods such as k-fold cross-validation or simply fixing the validation split, while treating the 2750 = 2150 + 600 images as the final training-and-evaluation dataset. In other words, in the first round train (marked as a green rectangle) contains 2000 images and validation (yellow) contains 150 images, used to search an appropriate hyperparameter space; in the second round the training dataset is enlarged to include the 150 validation images, and the model is evaluated on the test dataset of 600 images.

Figure 2  Round 1-Searching Hyperparameter Space

Figure 3  Round 2-Formal Training and Evaluation

Figure 4  Using k-fold cross-validation method to find better hyperparameters

Figure 5  Reference architecture 1 of dataset directory

Figure 6  Reference architecture 2 of dataset directory

Superpixel images are not used in this project, so a different dataset directory architecture can be chosen (see Figures 5 and 6).

Figure 7 Skin lesion images in ISIC-2017 test dataset

Figure 8         Superpixel images in ISIC-2017 training dataset

1.2 IO Setting, Integrity Checking and Size Matching

Vectors used in torch must be of Tensor type and Tensor type only, which imposes strict restrictions on sub-dimension alignment, as with ndarray.

cv2 reads images in as ndarray and hits serious limits as images get larger, so torchvision.io is chosen as the IO tool. The read-in images are already tensors, and the later transform pipeline simply drops .ToTensor().

The basic procedure is: read the matched names of data and labels → load all of the data into main memory (if configured to do so; a parameter in the CustomDataset class controls the amount of data loaded) → filter out eligible images and force a unified resizing or random-cropping transform on them.

UNet is trained only with a fixed input size (N, C, H, W): N is the batch size (a hyperparameter), C corresponds to RGB or GRAYSCALE (see 1.3), and H and W default to 572x572 as in the original design. The project keeps this default while providing different versions for comparison. Labels change simultaneously.
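A hedged sketch of this IO path, using torchvision.io so images arrive as tensors directly (the file paths are placeholders):

```python
from torchvision.io import read_image, ImageReadMode
from torchvision.transforms import InterpolationMode
from torchvision.transforms.functional import resize

# placeholder paths; read_image returns a uint8 tensor of shape (C, H, W),
# so the transform pipeline can drop .ToTensor() entirely
img = read_image("ISIC-2017/train/ISIC_0000000.jpg", mode=ImageReadMode.RGB)
lbl = read_image("ISIC-2017/train_labels/ISIC_0000000_segmentation.png",
                 mode=ImageReadMode.GRAY)

# force the unified default (H, W) = (572, 572); labels change simultaneously
img = resize(img, [572, 572])
lbl = resize(lbl, [572, 572], interpolation=InterpolationMode.NEAREST)
```

Nearest-neighbor interpolation is used for the label so that the resized mask stays binary.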

1.3 Data Transform, Data Augmentation and Cached Training

For training data, an augmented dataset can be obtained by random horizontal flips and color jittering (brightness, contrast, saturation, hue, and so on). Labels change simultaneously. The augmented dataset is exported to a same-level directory named "ISIC-2017_Aug". Here is one demonstration; labels are preserved under this kind of augmentation operation.

Figure 9  572x572 image and label after size matching and color jittering (image 12199)

Figure 10  3008x2000 image and label before size matching and color jittering (image 12199)
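A minimal sketch of the flip-plus-jitter augmentation described above (the jitter strengths are assumptions, not the project's exact values):

```python
import torch
from torchvision import transforms

# illustrative jitter strengths
jitter = transforms.ColorJitter(brightness=0.2, contrast=0.2,
                                saturation=0.2, hue=0.05)

def augment(img, lbl):
    # the flip is geometric, so image and label must change together
    if torch.rand(1).item() < 0.5:
        img, lbl = torch.flip(img, dims=[-1]), torch.flip(lbl, dims=[-1])
    # color jitter is photometric: image only, label is maintained
    return jitter(img), lbl
```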

A common normalization (division by 255) and standardization is applied in the overloaded __getitem__ method; alternatively, a container can store the normalized dataset to speed up data iteration, a technique called Cached Training. The technique can accelerate training by a factor of 10-100, and it generally has three implementations depending on dataset size and storage format.

Here is what the AI assistant Kimi says:

### 1. In-memory cache (fastest, simplest)

- Pros: the least code and the fastest speed.

- Cons: everything resides in **RAM**; large images or large datasets will blow up memory.

### 2. Disk-level cache (.pt / .npy)

Preprocess every image ahead of time and save it as `.pt` (or `.npy`):

- Pros: no memory cost at training time; the disk is read-only, and training is still far faster than online decoding.

- Cons: a preprocessing script must be run once; disk usage ≈ original image size × number of channels × 4 bytes.

### 3. Memory mapping (memory-map) or LMDB/HDF5

- **Concatenate all samples into one big .npy / .h5**, then read them zero-copy via `np.memmap` or `h5py.File(..., 'r')`.

- Suitable for **TB-scale** data, and still 5-10x faster than online decoding.
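As a minimal sketch of implementation 1 (the in-memory cache), assuming a map-style base dataset that yields uint8 image/label tensors:

```python
from torch.utils.data import Dataset

class CachedDataset(Dataset):
    """Implementation 1 above: decode and normalize once, keep all in RAM."""
    def __init__(self, base):            # base: any map-style dataset of tensors
        self.cache = []
        for i in range(len(base)):       # pay the preprocessing cost once
            img, lbl = base[i]
            self.cache.append((img.float() / 255.0, lbl.float() / 255.0))

    def __len__(self):
        return len(self.cache)

    def __getitem__(self, idx):          # later epochs are pure list lookups
        return self.cache[idx]
```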

For the ISIC-2017 dataset, images differ greatly in size; every image exceeds 2MB, and some big images take over 20MB of space!

Figure 11  Memory footprint of images and labels

That is to say, loading all images and labels into main memory requires at least 16GB of space, and in truth far more. Remember that base channels = 64 in UNet, and forward and backward propagation need far more storage than the input tiles. Hence the need for 32GB, 64GB, 128GB and larger main memories. The v2-8 TPU provided on Colab comes with about 233GB of main memory.

Compared with the batch size used in training, the number of instances (original images plus labels) held in CustomDataset is not a big problem when training on CPU. Training (FP and BP) needs the full space: BP requires gradient-recording storage (per batch) and intermediate-output storage (per batch, as matrices), with the former accounting for far less than the latter; it uses "l.backward()" and "net.train()". Testing (FP only) needs much less space and uses "with torch.no_grad()" and "net.eval()".

If you choose to train the model on a GPU with its own independent GPU memory, then main memory is not a problem anyway (the pressure moves to GPU memory; see 4.4).

For this reason, the project adopts the main-memory cached-training technique, sets a loading-number parameter on CustomDataset objects, and cuts the batch size to avoid main-memory overflow.

The cached-dataset sketch above demonstrates one way of cached-training storage; note, however, that random operations cannot be folded into the cache.

What's more, images can be loaded into the dataset object with the reading mode set to RGB or GRAYSCALE; the latter choice leads to a somewhat smaller storage cost and faster training while not causing an obvious precision drop (see the experimental results in 3. Model Evaluation). A more detailed training-acceleration schedule can be found in 4. Training Schedule.

2. Network Structure Analysis and Hyperparameter Optimization

2.1 Component Analysis

An easy UNet implementation can be obtained by downloading a ready-made unet package. This project instead decomposes the UNet model and implements it by hand to better understand the principles behind it.

Figure 12       UNet architecture

UNet is a network basically composed of down-sampling (resolution-reducing) and up-sampling processes. It has 27 layers in total (including pooling layers and the 1x1 conv layer). The basic modules are the double convolutional layer, the max pooling layer, the transposed convolutional layer, and the 1x1 ("FC") layer.

The project defines a 'Down' layer, an 'Up' layer, and a double-conv layer as submodules of UNet, so only 10 = 1 + 4 + 4 + 1 module objects are needed. Generally speaking, there are only convolutional and transposed convolutional layers; the FC layer of a classification CNN corresponds here to a 1x1 conv layer.
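A minimal sketch of the three submodules (the exact channel bookkeeping is an assumption reconstructed from the original paper's valid-padding design):

```python
import torch
from torch import nn

class DoubleConv(nn.Module):
    """Two 3x3 convolutions, each followed by ReLU (valid padding, as in the paper)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3), nn.ReLU(inplace=True))
    def forward(self, x):
        return self.block(x)

class Down(nn.Module):
    """Max pool then double conv: one depth step of the contracting path."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(nn.MaxPool2d(2), DoubleConv(c_in, c_out))
    def forward(self, x):
        return self.block(x)

class Up(nn.Module):
    """Transposed conv, concat with center-cropped skip feature, then double conv."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.up = nn.ConvTranspose2d(c_in, c_in // 2, kernel_size=2, stride=2)
        self.conv = DoubleConv(c_in, c_out)
    def forward(self, x, skip):
        x = self.up(x)
        dh = skip.shape[-2] - x.shape[-2]
        dw = skip.shape[-1] - x.shape[-1]
        skip = skip[..., dh // 2: dh // 2 + x.shape[-2],
                         dw // 2: dw // 2 + x.shape[-1]]   # center-crop the skip
        return self.conv(torch.cat([skip, x], dim=1))
```

The full model is then one input DoubleConv, four Down objects, four Up objects, and one 1x1 conv: the 10 objects counted above.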

2.2 Data Size Transform Analysis

The original paper uses a method called "overlap tile" to ensure full coverage at equal size:

First, the team uses mirror-padding to produce a 696x696 image from the original 512x512 medical image;

Second, the 696x696 image is cropped at the upper-left, upper-right, lower-left, and lower-right corners to obtain four 572x572 input tiles, which overlap by 572-(696-572) = 572-124 = 448 pixels along each axis;

Third, the four input tiles are propagated through UNet to obtain four 388x388 output predictions (output tiles);

Last, the second step is reversed, i.e., the four output tiles are placed with stride = 124 to reassemble one 512x512 binary mask image; values are averaged over the 388-124 = 264-pixel overlap regions.
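This bookkeeping can be sanity-checked in a few lines, using the valid-convolution size rules (each 3x3 conv loses 2 pixels, each 2x2 pool halves, each 2x2 up-conv doubles):

```python
def unet_output_size(s):
    """Trace (H, W) through the original UNet with valid 3x3 convolutions."""
    for _ in range(4):          # contracting path
        s = (s - 4) // 2        # double conv (-4), then 2x2 max pool (/2)
    s -= 4                      # bottleneck double conv
    for _ in range(4):          # expanding path
        s = s * 2 - 4           # 2x2 up-conv (x2), then double conv (-4)
    return s

assert unet_output_size(572) == 388
# overlap-tile bookkeeping from the paper's 512x512 setting:
pad, tile, out = 696, 572, 388
stride = pad - tile             # 124
assert tile - stride == 448     # input-tile overlap
assert out - stride == 264      # output-tile overlap, averaged on reassembly
assert out + stride == 512      # four output tiles reassemble the full mask
```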

Figure 13       Performance of UNet on three datasets[1]

However, the problems that this project faces have a very different setting: hardware resource limits and the sample sizes in the dataset.

Hardware resource limit will be discussed in Part IV (4. Training Schedule, 4.4).

Images in the ISIC-2017 dataset have non-uniform (H, W) sizes; the median size is perhaps about 2000x1000.

Input-tile adjusting: the data preprocessing in Part I does not use mirror-padding to enlarge the original images (that would suit originals only slightly smaller than 572x572); instead it unifies the (H, W) size by random cropping or resizing to the default (572, 572).

Output-tile adjusting: for images that differ greatly from 572x572, and especially for larger ones, it may be advisable to add interpolation (an unlearnable interpolation layer) before the 1x1 conv layer of the original UNet. Note that this operation alters the primitive structure of UNet and is not a data post-processing operation.
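A sketch of such an unlearnable interpolation layer (the default target size here is an assumption for illustration):

```python
import torch.nn.functional as F
from torch import nn

class Interpolate(nn.Module):
    """Unlearnable resize inserted before the final 1x1 conv (the 28th layer)."""
    def __init__(self, size=(572, 572)):
        super().__init__()
        self.size = size
    def forward(self, x):
        return F.interpolate(x, size=self.size, mode="bilinear",
                             align_corners=False)
```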

To conclude, the project employs an "adaptive" version of UNet that is almost identical to the original UNet but has 28 layers in total.

2.3 Parameter Analysis and Initialization Setting

Hand-coding UNet may lose some precision, but it provides much more flexible adjustment of UNet's parameters.

Additional adjustable hyperparameters:

[non-structural] (1) base channels (default: 64);

[structural] (2) depth of down-sampling and up-sampling (default: 4);

[structural] (3) stacking levels of the convolutional layer (default: 2);

[non-structural] (4) kernel size and padding.

Transposed convolution layers are initialized with a fixed bilinear-interpolation kernel for the given scaling ratio, while conv layers are initialized with the Xavier method.
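A sketch of these two initializations; the bilinear-kernel recipe follows the one in the textbook [8], adapted here (as an assumption) to unequal channel counts by filling only the matching channel pairs:

```python
import torch
from torch import nn

def bilinear_kernel(c_in, c_out, k):
    """Fixed bilinear up-sampling weights for a ConvTranspose2d."""
    factor = (k + 1) // 2
    center = factor - 1 if k % 2 == 1 else factor - 0.5
    og = (torch.arange(k).reshape(-1, 1), torch.arange(k).reshape(1, -1))
    filt = (1 - torch.abs(og[0] - center) / factor) * \
           (1 - torch.abs(og[1] - center) / factor)
    weight = torch.zeros((c_in, c_out, k, k))
    d = min(c_in, c_out)
    weight[range(d), range(d)] = filt   # identity map on matching channels
    return weight

def init_weights(m):
    if isinstance(m, nn.Conv2d):
        nn.init.xavier_uniform_(m.weight)          # Xavier for conv layers
    elif isinstance(m, nn.ConvTranspose2d):
        m.weight.data.copy_(bilinear_kernel(*m.weight.shape[:2],
                                            m.kernel_size[0]))

# usage: net.apply(init_weights)
```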

3. Model Evaluation and Loss Function Settings

3.1 Special Loss Functions and Accuracy Designs

UNet addresses the semantic segmentation problem. In the sheer quantity of labels the problem resembles regression, but it is essentially still a classification problem. So whether the task is binary or multi-class segmentation, the loss function and model evaluation are designed as for an image classification problem.

However, do not merely count matching pixels and compute precision or a confusion matrix over the global image space (our team calls that index "standard accuracy"), as one would for image classification.

The label is called the target T and has 3 dimensions here; the prediction is called the predict P (also 3-dimensional).

We use the Dice coefficient (with smoothing, similar to the F1-score in an explicit classification problem) and the Jaccard (IoU) coefficient to evaluate the adaptive UNet model(s); these metrics discount the count imbalance between positive and negative pixels in the spatial distribution. Note that, for convenience, the labels here are the cropped or resized labels with default size 572x572, not the original labels.

The larger the coefficients and the closer they are to 1, the more similar target and predict are. Adding a smoothing term makes the Dice coefficient a bit more stable. Dice and IoU can be converted into each other (IoU = Dice / (2 - Dice)).
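A minimal sketch of both coefficients under the definitions above (the smooth value is an assumption); 1 minus either value can serve directly as a loss in the sense of 3.1:

```python
import torch

def dice_coeff(pred, target, smooth=1.0):
    """Dice with smoothing; pred and target are 3-D (N, H, W) binary masks."""
    inter = (pred * target).sum(dim=(-2, -1))
    union = pred.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1))
    return ((2 * inter + smooth) / (union + smooth)).mean()

def iou_coeff(pred, target, smooth=1.0):
    """Jaccard/IoU; convertible to Dice via IoU = Dice / (2 - Dice)."""
    inter = (pred * target).sum(dim=(-2, -1))
    union = pred.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1)) - inter
    return ((inter + smooth) / (union + smooth)).mean()
```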

Accuracy evaluation and loss functions are often discussed as a whole. We use sigmoid plus cross-entropy loss to match the standard-accuracy evaluation mentioned above, and use 1 - Dice and 1 - IoU as loss functions to make training more targeted and efficient; the automatic backward-gradient machinery in torch lets us design loss functions freely.

3.2 Prediction Visualization

For the skin-lesion segmentation problem, only two pixel classes are involved, so we do not need an encoded mapping dictionary as with the Pascal VOC2012 dataset in the textbook [8].

3.3 Learning Process Visualization

We import matplotlib.pyplot to plot the training-loss curve per epoch and the per-epoch accuracy on the training and test datasets. When a large gap opens between the training and test accuracy curves, overfitting may be occurring.

The project uses BCE (binary cross-entropy) as the loss function while separately computing common test accuracy, IoU accuracy, and Dice accuracy. It uses the SGD optimization method with the learning-rate decay mode set to cosine decay.
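A sketch of this training setup; the momentum value and the use of BCEWithLogitsLoss (which folds the sigmoid into the loss) are assumptions consistent with, but not dictated by, the text:

```python
import torch
from torch import nn

def train(net, train_iter, num_epochs=3, lr0=0.01):
    """Sketch: SGD + cosine learning-rate decay + BCE loss."""
    loss_fn = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.SGD(net.parameters(), lr=lr0, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=num_epochs)
    losses = []
    for epoch in range(num_epochs):
        net.train()
        for X, y in train_iter:
            optimizer.zero_grad()
            l = loss_fn(net(X), y)
            l.backward()                 # gradients recorded per batch
            optimizer.step()
            losses.append(l.item())      # for the matplotlib loss curve
        scheduler.step()                 # cosine learning-rate decay per epoch
    return losses
```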

The following figures serve as a demonstration.

Figure 14       3 epochs, cosine type learning rate decay, Loss Curve

Figure 15       number=1000(training), number=50(validation), batch size=16, lr0=0.01, Console output

Figure 16       number=500(training), number=50(validation), batch size=16, lr0=0.01, loss curve

Figure 17       number=500(training), number=50(validation), batch size=16, lr0=0.01, Console output

From the two experiment records above, it can be inferred that:

(1) a larger dataset leads to a longer training process for the same model;

(2) common test accuracy, which only counts matching pixels against all pixels, has evaluation limits; compare common test accuracy with IoU/Dice at epochs 1 and 2 in Figures 14/16.

Initialization influences loss and optimization greatly; see Figures 16 and 17.

Figure 18 number=500(training), number=50(validation), batch size=16, lr0=0.01, another Console output

Figure 19       number=250(training), number=50(validation), base channels=32, batch size=16, lr0=0.01(cosine decay), GRAYSCALE, Loss Curve

Figure 20       number=250(training), number=50(validation), base channels=32, batch size=16, lr0=0.01(cosine decay), GRAYSCALE, Console output

About GRAYSCALE preprocessing

Figures 14-20 indicate that reading images in GRAYSCALE mode as data preprocessing not only barely affects model precision, but can even raise accuracy when the other settings are unchanged.

4. Training Schedule

Since training a model as heavy as UNet needs substantial hardware resources and long training time, the model should be saved promptly to preserve the better parameters. We accelerate the training process mainly in the following ways.

4.1 Multiple platform parallelism

Local device with an RTX 4060; Google Colaboratory with a free T4 GPU and a v2-8 TPU.

4.2 Shrinking network parameter size but maintaining network structure

Scale the input and output channels of the original UNet by 1/2 to accelerate training, i.e., change base channels from 64 to 32.

The total parameter size shrinks to about 1/4, though slightly larger than that, because the first and last layers shrink only to 1/2 while the other internal layers shrink to 1/4, and unlearnable layers such as the max pooling and interpolation layers keep their size.

Figure 21       1/4 size mini-UNet and full UNet

4.3 Cached Training

Cached training greatly reduces training time while maintaining model performance, because no online transform is needed when interacting with the data iterators. Readers can refer to a blog post I published several days ago, which compares theoretical and practical per-epoch training times.

从零开始搭建深度学习大厦系列-3.卷积神经网络基础(5-9)-CSDN博客

From 3.3 we can see that a 1000-sample training dataset takes 11 minutes per training epoch on an NVIDIA RTX 4060 GPU using the (main-memory) cached-training technique, which is relatively fast.

Increasing base channels back to 64 would roughly quadruple the storage cost, so the batch size should be set to one quarter or less of the current value; otherwise main-memory overflow will probably occur.

4.4 How does main memory overflow occur?

4.4.1 BP review

Figure 22       BP metrics in a MLP

To sum up, to update the weights of a given layer we need the error from the next layer in the output-flow direction (i.e., from upstream in the backward pass), the input feature maps of the current layer, and the weights of that next layer. Neither the practical implementation nor the theoretical analysis distinguishes between learnable and unlearnable layers (i.e., a pooling layer and a conv layer are treated the same); all layers of the model are included in one computation graph.

Figure 23       Practical implementation corresponds to theoretical derivation

4.4.2 A rough calculation

This section explains why main-memory overflow occurs through some explicit calculation. Note that when the (H, W) shrinkage of unpadded convolutions in the UNet setting is taken into account, the actual storage cost is less than the result derived in this section.

Hardware: assume 16GB of CPU main memory and 8GB of CUDA GPU memory.

Software 1: assume the full UNet architecture (about 120MB, Figure 21).

Data 1: 1000 samples, batch shape (16, 3, 572, 572)

Process 1: assume all 1000 images and labels have been loaded into main memory, taking about 3GB of CPU memory, with batch size = 16. For BP, the output of every internal layer must be recorded for every sample in the batch, finally materializing as matrices in memory. The input tiles have shape (16, 3, 572, 572); then → (16, 64, 570, 570) → (16, 64, 568, 568) ↓ (16, 64, 284, 284) → (16, 128, 282, 282) → (16, 128, 280, 280) ↓ (16, 128, 140, 140) → (16, 256, 138, 138) → (16, 256, 136, 136) ↓ (16, 256, 68, 68) → (16, 512, 66, 66) → (16, 512, 64, 64) ↓ (16, 512, 32, 32) → (16, 1024, 30, 30) → (16, 1024, 28, 28), after which the batch goes through the remaining 14 layers of the up-sampling path (see Figure 12, the UNet architecture; in this project UNet has 28 layers in total).

As depth increases by one, the 4-D activation size roughly halves (slightly smaller with no padding), because (H, W) shrinks to 0.25 while the number of output channels doubles. Exactly speaking, the feature map shrinks to 0.25 after max pooling and then to 0.5 of the pre-pooling size during the double convolutional layer.

A rough upper bound on the storage-space ratio can be derived. First taking 64/3 ≈ 21: Ru = 1 + 42 + 21 × (1.25 + 0.625 + 0.3125 + 0.15625) + 21 × (0.9375 + 1.875 + 3.75) + 21 + 0.667 = 251.698 ≈ 250.

Upper bound, Result 1 (rough): Ru = 250 is the scaling factor between the input images and the actual space required for BP. So N = 16 corresponds to M1 = 16 × 250 × 572 × 572 × 3 bytes ≈ 3.93GB for intermediate-output storage and M2 = 16 × 120MB ≈ 1.92GB for gradient storage. At least 120MB + 1.92GB + 3.93GB = 5.97GB of GPU memory and 3GB of CPU main memory are then occupied.

Software 2: assume the one-quarter UNet architecture (about 38MB, Figure 21).

Data 2: 1000 samples, batch shape (64, 3, 572, 572)

Upper bound, Result 2 (rough): Ru ≈ 127. So N = 64 corresponds to M1 = 64 × 127 × 572 × 572 × 3 bytes ≈ 7.98GB for intermediate-output storage and M2 = 64 × 38MB ≈ 2.43GB for gradient storage. At least 38MB + 2.43GB + 7.98GB = 10.45GB of GPU memory and 3GB of CPU main memory are then occupied.

Software 3: assume the one-quarter UNet architecture (about 38MB, Figure 21).

Data 3: 1000 samples, batch shape (16, 3, 572, 572)

Upper bound, Result 3 (rough): Ru ≈ 127. So N = 16 corresponds to M1 = 16 × 127 × 572 × 572 × 3 bytes ≈ 1.99GB for intermediate-output storage and M2 = 16 × 38MB ≈ 0.61GB for gradient storage. At least 38MB + 0.61GB + 1.99GB = 2.64GB of GPU memory and 3GB of CPU main memory are then occupied.
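The three rough results above can be reproduced in a few lines (decimal GB, as used in the text):

```python
def rough_memory(batch, ru, model_mb):
    """Reproduce the rough upper bounds above (decimal GB)."""
    m1 = batch * ru * 572 * 572 * 3 / 1e9   # intermediate-output storage
    m2 = batch * model_mb / 1e3             # gradient storage
    return round(m1, 2), round(m2, 2)

print(rough_memory(16, 250, 120))   # Result 1: (3.93, 1.92)
print(rough_memory(64, 127, 38))    # Result 2: (7.98, 2.43)
print(rough_memory(16, 127, 38))    # Result 3: (1.99, 0.61)
```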

Experimental verification: as reported in section 4.3, running the 1/4 UNet model on the local RTX 4060 (8GB of GPU memory) with the Data 3 input does not cause memory overflow, while the Data 2 input indeed does.

The actual storage cost also includes test-dataset storage, FP storage during the evaluation phase, and so on.

References

[1] Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation[C]//Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer, 2015: 234-241.

[2] Solutions to "ValueError: only one element tensors can be converted to Python scalars". CSDN blog.

[3] PyTorch in practice 9: a simple U-Net implementation based on PyTorch. CSDN blog.

[4] Data imbalance in semantic segmentation and the Dice metric. 51CTO blog.

[5] cs231n.stanford.edu/slides/2025/lecture_9.pdf

[6] Conversions among list, numpy, and torch.tensor in Python. CSDN blog.

[7] Basic usage of the Google Colab platform and reading your own dataset. CSDN blog.

[8] Dive into Deep Learning (《动手学深度学习》), 2.0.0 documentation.

