Review on the Recent Welding Research with Application of CNN-Based Deep Learning Part II: Model Evaluation and Visualizations

Application of Convolutional Neural Network-Based Deep Learning to Welding Research, Part II: Model Evaluation and Visualization

Article information

Publication date (electronic) : 2021 February 8
* Department of Mechanical and Materials Engineering, Portland State University, OR 97229, USA
** Department of Materials Science and Engineering, Inha University, Incheon, 22212, Korea
*** Joining R&D Group, KITECH, Incheon, 21999, Korea
Received 2020 November 10; Accepted 2021 January 18.


With the development of deep learning technology, research on classification and regression models of welding phenomena using convolutional neural networks (CNNs) is gradually increasing. Part I of this study introduced the characteristics of deep learning models using CNNs and their application to welding studies. In this paper, we review recent welding research papers to analyze how CNN models are evaluated and how the modeling output is visualized, and we explain the evaluation indices, comparison models, and visualization methods in detail.

1. Introduction

There has been a recent increase in the application of machine learning across a wide range of industrial sectors, and artificial neural networks, one of the machine learning techniques, have drawn much interest. Research adopting neural networks in welding was first introduced in the 1990s1-4), and a number of recent studies have applied deep learning and convolutional neural networks (CNNs) to welding. CNNs triggered a breakthrough in image recognition. Compared with earlier approaches, which extract features and then learn the relationship between the extracted features and the results, CNNs facilitate more generalized learning, and their application to welding research has increased drastically.

Part I5) of this review introduced the basic structure and learning methods of CNNs. It categorized the previous studies6-22) that applied CNNs to welding research into 5 groups according to supervised/unsupervised learning, the application of transfer learning, and the adoption of feature extraction or data augmentation in data preprocessing, and it introduced the characteristics and applications of the models used.

Because of the high dimensionality of a CNN, evaluating the model and visualizing its results is not as easy as for more traditional and intuitive models such as linear regression. In this review, we classify the applicable publications according to the evaluation indices, evaluation methods, and visualization methods used in the research, as shown in Table 1, and introduce the cases of application.

Summary of research papers reviewed in this study

2. Evaluation Metric

To evaluate the performance of classification and regression models developed through neural network modeling, evaluation indices such as accuracy and training time are used. In most cases, the evaluation indices are supplemented by comparative analysis against other models.

2.1 Classification model

The field with most active utilization of image-based CNN is classification by supervised learning. In a classification model, each result can be represented using a confusion matrix according to the prediction result23), and the confusion matrix of binary classification, which has the simplest form, is shown in Fig. 1.

Fig. 1

Confusion matrix for binary classification

The prediction results of the binary classification model are classified into True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN). Of these, true positives and true negatives represent the cases of good prediction performance, and false positives and false negatives represent errors. A false positive is an error that incorrectly classifies a negative as positive, and a false negative is an error that classifies a positive as negative.
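As a minimal sketch of how these four outcomes are tallied, the following Python snippet counts them from paired lists of actual and predicted labels. The helper name `confusion_counts` and the toy labels (1 = positive, 0 = negative) are illustrative assumptions and are not taken from any of the reviewed papers.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary classifier.

    Hypothetical helper for illustration; `positive` is the label
    treated as the positive class.
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # positive correctly predicted as positive
        elif pred == positive and truth != positive:
            fp += 1  # negative incorrectly predicted as positive
        elif pred != positive and truth != positive:
            tn += 1  # negative correctly predicted as negative
        else:
            fn += 1  # positive incorrectly predicted as negative
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


# Toy labels (assumed data, not from the reviewed studies)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(counts)
```

The four counts returned here are the cells of the binary confusion matrix.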

Some studies have adopted the true positive rate (TPR), false positive rate (FPR), true negative rate (TNR), and false negative rate (FNR). Each rate expresses one classification outcome as a fraction of the samples that actually belong to the corresponding class.
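In terms of the confusion-matrix counts, the true positive, false positive, true negative, and false negative rates (TPR, FPR, TNR, FNR) are conventionally written as follows. This is the standard textbook form, given as a reconstruction from the definitions in the text rather than as the original typeset equations.

```latex
\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad
\mathrm{FPR} = \frac{FP}{FP + TN}, \qquad
\mathrm{TNR} = \frac{TN}{TN + FP}, \qquad
\mathrm{FNR} = \frac{FN}{FN + TP}
```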

When the classification datasets are balanced, accuracy, which is the ratio of correct classifications to the total number of data, is often used as an evaluation index. Conversely, some studies employ the mean error rate, which is the ratio of incorrect classifications to the total number of data.
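Using the same confusion-matrix counts, the two indices can be written in their usual form as below. The expressions are a reconstruction from the verbal definitions above, assuming binary classification.

```latex
\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \qquad
\text{mean error rate} = \frac{FP + FN}{TP + FP + TN + FN} = 1 - \text{accuracy}
```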


However, if the datasets are imbalanced, these indices may be biased, so complementary indices such as precision and recall are needed. Precision is the ratio of actually positive cases among the data classified as positive by the model, and recall is the ratio of cases classified as positive among the data that are actually positive.
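In confusion-matrix notation, these two definitions correspond to the standard expressions below, shown as a reconstruction consistent with the text.

```latex
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}
```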


The precision and recall are complementary indices, and the model shows good classification performance when the values of both indices are high. In addition, the F1 index (F1 score), which is the harmonic mean of precision and recall, can be used.
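The short Python sketch below illustrates why accuracy alone can mislead on imbalanced data. It uses the usual harmonic-mean form F1 = 2 x precision x recall / (precision + recall). The function name `precision_recall_f1` and the counts (95 sound welds, 5 defective welds) are assumptions chosen for illustration only.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Assumed imbalanced example: 95 sound welds and 5 defective ("positive") welds.
# The model finds only 1 of the 5 defects and raises 1 false alarm.
tp, fn, fp, tn = 1, 4, 1, 94
accuracy = (tp + tn) / (tp + fp + tn + fn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} F1={f1:.3f}")
```

In this assumed case the accuracy is high because the negative class dominates, while recall and F1 show that most defects are missed.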


When a CNN was applied to a classification model, mean accuracy was used as an evaluation index in most cases6-8,10,13-16,18,20,21). Y. Yang et al.16) and Z. Zhang et al.21) used recall in addition to mean accuracy. In particular, C. V. Dung et al.10) used mean accuracy, recall, and F1 score together for evaluation. Z. Zhang et al.21) presented a confusion matrix along with mean accuracy. Y. Zhang et al.11) and Z. Guo et al.17) used mean error instead of mean accuracy. J.-K. Park et al.19) emphasized the importance of identifying defective products in the manufacturing of safety-critical parts such as engine transmissions, and used the TP ratio together with the TN ratio, which is the accuracy of defect detection, as evaluation indices.

2.2 Evaluation indices of the regression model

In the regression model, the mean square error (MSE) based on the difference between the actual and predicted values is used as a loss function, and the mean absolute error (MAE) is most frequently used as an evaluation index.
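With the notation used in this section (n data points, measured value y(i), predicted value y'(i)), the two quantities take their standard forms, given here as a reconstruction rather than the original typeset equations.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \bigl( y(i) - y'(i) \bigr)^{2}, \qquad
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \bigl| y(i) - y'(i) \bigr|
```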


Here, n represents the total number of data, i is the index number of data, y(i) is the measured value of the i-th data, and y’(i) is the predicted value of the i-th data by the regression model. The closer the mean absolute error is to 0, the higher the fit of the regression model. In addition, the mean absolute percentage error (MAPE) which represents the percentage of the mean absolute error is defined as in Eq. 8, and mean accuracy, the opposite concept, is defined as in Eq. 9.
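The usual form of the MAPE in the same notation is sketched below. It is written as a fraction so that it stays consistent with the definition of mean accuracy in Eq. 9; multiplying by 100 expresses it in percent. This is a reconstruction under that assumption, and it requires every measured value y(i) to be non-zero.

```latex
\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y(i) - y'(i)}{y(i)} \right| \tag{8}
```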

$$\text{mean accuracy} = 1 - \mathrm{MAPE} \tag{9}$$

The coefficient of determination (R²) is also used to indicate how well the values predicted by the regression model explain the measured values.


$\bar{y}'$ represents the mean of the predicted values. The coefficient of determination lies in the range 0 ≤ R² ≤ 1, and the closer the value is to 1, the higher the accuracy of the regression model. S. Choi et al.7) and T. Ashida et al.9), who applied CNN models to regression, used mean accuracy and mean error as evaluation indices, respectively.
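As a minimal sketch, the error-based indices of this section can be computed in plain Python as follows. The function name `regression_metrics` and the toy values are assumptions for illustration. MAPE is computed as a fraction, and mean accuracy as 1 - MAPE, following Eq. 9.

```python
def regression_metrics(y, y_pred):
    """Return MSE, MAE, MAPE (as a fraction) and mean accuracy = 1 - MAPE.

    `y` holds measured values and `y_pred` the model predictions;
    every measured value must be non-zero for MAPE to be defined.
    """
    n = len(y)
    errors = [yi - pi for yi, pi in zip(y, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mape = sum(abs(e / yi) for e, yi in zip(errors, y)) / n
    return {"MSE": mse, "MAE": mae, "MAPE": mape, "mean_accuracy": 1 - mape}


# Assumed toy data: measured vs. predicted values (e.g. a bead width in mm)
measured = [4.0, 5.0, 8.0, 10.0]
predicted = [4.4, 4.5, 8.0, 9.0]
metrics = regression_metrics(measured, predicted)
print(metrics)
```

Because MSE squares each error, the single 1.0 mm miss in this toy data dominates MSE more than it dominates MAE.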

3. Comparative evaluation with other models

Various regression and classification models have been developed for image-based prediction of welding phenomena. In a number of papers, the proposed CNN model is compared with traditional classifiers or other neural network models to verify its performance.

3.1 Comparison with traditional feature extraction method and classifiers

Beyond neural networks, a variety of methods are used for classification and regression. The conventional methods presented in this section are those employed before the introduction of neural networks. They include image processing algorithms such as the Hough transform and contour extraction, traditional classifiers, and visual inspection by skilled personnel, which has traditionally been used in manufacturing.

T. Ashida et al. compared the proposed CNN regression model with the traditional contour extraction image processing technique and confirmed that errors in the prediction of molten pool width and leading end width can be reduced9). Z. Guo et al. evaluated the performance of a CNN model with comparison against the visual inspection of the skilled worker17).

Traditional image classification models other than neural networks can be described in two parts: feature extraction and a classifier. Feature extraction methods include HOG (Histogram of Oriented Gradient), LBP (Local Binary Pattern), and the Haralick feature24), and SVM (Support Vector Machine) is often used as the classifier. To improve the performance of traditional feature extraction methods, image processing such as binarization or contour measurement is conducted first to highlight features. HOG is an algorithm that divides an image into cells of a fixed size and extracts features from the normalized distribution of gradient directions in each local cell. LBP is an algorithm that binarizes the 3x3 area around every pixel in an image to 0 or 1 according to brightness relative to the center pixel. LBP was originally developed to classify texture, but it is now widely used for image analysis such as face recognition. The Haralick feature is a feature extraction algorithm based on the co-occurrence of gray levels in adjacent pixels. SVM is a machine learning classification algorithm that separates data classes with hyper-planes in a high-dimensional space, choosing the maximum-margin hyper-plane that maximizes the distance between the classes.
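The LBP step described above can be sketched in a few lines of plain Python. The function names, the clockwise bit ordering starting at the top-left neighbor, and the "greater than or equal to the center" threshold are assumptions of this sketch; published LBP variants differ in these details. In a traditional pipeline the resulting histogram would then be passed to a classifier such as an SVM, which is not shown here.

```python
def lbp_code(image, row, col):
    """Return the 8-bit LBP code of the pixel at (row, col).

    Each of the 8 neighbours in the 3x3 window becomes 1 if it is at least
    as bright as the centre pixel and 0 otherwise. The bits are read
    clockwise from the top-left neighbour (an assumed ordering).
    """
    center = image[row][col]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for dr, dc in offsets:
        bit = 1 if image[row + dr][col + dc] >= center else 0
        code = (code << 1) | bit
    return code


def lbp_histogram(image):
    """Histogram of LBP codes over all interior pixels (the texture feature)."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    return hist


# Assumed 3x3 grayscale patch; the centre pixel has brightness 6
patch = [
    [6, 5, 2],
    [7, 6, 1],
    [9, 8, 7],
]
code = lbp_code(patch, 1, 1)
hist = lbp_histogram(patch)
print(code, format(code, "08b"))
```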

W. Hou et al.20), N. Yang et al.18), Z. Zhang et al.21), and J.-K. Park et al.19) compared traditional methods with the proposed CNN model. For the performance comparison, W. Hou et al. used the HOG feature and Haralick feature extraction methods20), and N. Yang et al. used SVM as a classifier18). Z. Zhang et al. compared various traditional methods such as HOG, LBP, HOG+LBP, and BOF with the proposed CNN model21). J.-K. Park et al. compared the Hough circle contour extraction algorithm with the CNN regression model to evaluate the performance of tracking the center point of the engine transmission weld, and also compared the CNN classification model with HOG+SVM and LBP+SVM19).

3.2 Comparison with other neural network models

For image recognition, CNNs and the SAE (Sparse Auto-Encoder) are mainly used. The SAE is a type of auto-encoder, an unsupervised neural network frequently used for image feature extraction. By flattening the image data, a fully connected neural network (FCN) can also be used for image feature extraction.

N. Yang et al.18), Y. Zhang et al.11), and D. Bacioiu et al.12) compared the proposed CNN classification model with the FCN model for performance evaluation. In addition, W. Hou et al.20) and Z. Zhang et al.21) conducted performance evaluation using stacked SAEs and an SAE, respectively.

CNN models vary widely in structure and in whether transfer learning is adopted, so their performance can be improved through structure optimization and modeling techniques. Modeling performance can therefore be evaluated across various CNN structures and modeling techniques, and more than half of the papers reviewed evaluated performance through comparison between CNN models10,12,13,15-21).

D. Bacioiu et al. compared the accuracy of the proposed CNN model as the number of classification classes increased from two to four to six12). C. V. Dung et al. used mean accuracy, recall, and F1 score as evaluation indices to compare three CNN models: SCNN (Shallow CNN), BN (Bottle Neck)-CNN, and FT (Fine Tuning)-CNN. The comparison between the SCNN and the CNN models with VGG-16-based transfer learning showed that the transfer learning models performed better. In addition, the comparison of the BN-CNN and FT-CNN models confirmed that fine-tuning can further improve accuracy even for the same transfer learning model10).

W. Jiao et al. compared the accuracy of the proposed 9-layer CNN model and a transfer learning CNN model. The comparison confirmed that transfer learning is effective for learning with a small amount of training data14).

H. Zhu et al. compared the accuracy of each classifier in the CNN model. The feature map was extracted through the CNN model, and when compared with softmax and SVM techniques, which are commonly used classification methods, the proposed random forest method showed superior performance13).

Y. Zhang et al. conducted performance evaluation by comparing the accuracy of the FCN and the proposed CNN model. In addition, the cause of the error was investigated through analysis of misclassified samples11).

4. Visualization

In the reviewed papers, evaluation of the model through visualization methods is largely divided into two types: a method of feature extraction of the classification model through the intermediate activations and a method of visualizing the classification through a 2D graph by using t-SNE.

4.1 Intermediate activations

A feature map that has been filtered by a convolutional filter in the CNN learning process can be represented as an image, and the highlighted features can be identified to some extent in lower layers. Intermediate activations are used as an evaluation method by 1) extracting feature map of lower layers and determining if the highlighted features are suitable for the purpose of the model, or 2) identifying the number of images with significant features at higher layers.
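To make the idea of a feature map concrete, the sketch below applies one 3x3 filter to a small image and passes the result through a ReLU activation, which is what a single channel of a first convolutional layer computes. The toy image and the hand-picked vertical-edge filter are assumptions standing in for real weld images and learned filters.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map.

    As in most CNN libraries, this is a cross-correlation: the kernel is not flipped.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    fmap = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            s = sum(image[r + i][c + j] * kernel[i][j]
                    for i in range(kh) for j in range(kw))
            row.append(s)
        fmap.append(row)
    return fmap


def relu(fmap):
    """Set negative responses to zero, keeping only positive activations."""
    return [[max(0, v) for v in row] for row in fmap]


# Assumed toy image: dark left half, bright right half (one vertical edge)
image = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
# Hand-picked vertical-edge filter standing in for a learned filter
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
activation = relu(conv2d_valid(image, kernel))
for row in activation:
    print(row)
```

The activation is non-zero only where the window straddles the edge, which is the kind of highlighted feature that is inspected when intermediate activations are rendered as images.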

Y. Yang et al. showed that by visualizing the image of the first layer of the optimized CNN model through intermediate activations, the features of the images suitable for the purpose of the model were properly highlighted16).

Z. Zhang et al. visualized and output a feature map at each CNN layer. Using the number of images with significant pixels (images in which values of all pixels are non-zero) in the feature map of each layer, it was confirmed that the proposed model shows effective performance in feature extraction15).

4.2 t-SNE

t-SNE (t-Distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction technique suitable for visualization of high-dimensional datasets. It represents the similarity of data as a distance in two or three dimensions. The method computes neighbor probabilities in the high-dimensional space, assuming a normal distribution, and in the low-dimensional space, using a Student t-distribution, and then minimizes the Kullback-Leibler divergence between the two sets of probabilities. The low-dimensional coordinates obtained from this minimization are what is visualized.
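For reference, the standard t-SNE formulation can be summarized as follows. Here x_i are the N high-dimensional feature vectors, z_i are their two- or three-dimensional map coordinates, and sigma_i is a per-point bandwidth set through the perplexity parameter. The symbols are chosen for this sketch and do not come from the reviewed papers.

```latex
% Neighbor probabilities in the high-dimensional space (Gaussian kernel)
p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^{2} / 2\sigma_i^{2}\right)}
               {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^{2} / 2\sigma_i^{2}\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

% Neighbor probabilities in the low-dimensional map (Student t kernel, one degree of freedom)
q_{ij} = \frac{\left(1 + \lVert z_i - z_j \rVert^{2}\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert z_k - z_l \rVert^{2}\right)^{-1}}

% Cost minimized with respect to the map coordinates z_i
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```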

W. Hou et al. extracted features using the Haralick feature, HOG and SSAE, and two CNN methods with different depths, and visualized these extracted features with a 2D map through t-SNE. In addition, the performance of the proposed model was evaluated by comparing the mean accuracy of classification results obtained from each method. It was confirmed that the proposed CNN model showed clear classification in the t-SNE 2D map compared to other methods, and the clearer the classification in the visualized 2D map, the higher the mean accuracy20).

Z. Zhang et al. extracted features from the optimized CNN model and visualized the features with a 2D map through t-SNE. The classification between each group was visually confirmed on the map, and the cause of the error was presented by analyzing the section where the classification was not clear21).

5. Auto-encoder

There are various types of auto-encoders, the unsupervised neural networks described above, and among them the convolutional auto-encoder is advantageous for image compression. The auto-encoder encodes (compresses) and then decodes (decompresses) each image, and performs unsupervised learning by comparing the original image with the restored image. Whether the compressed image retains the features required for classification can be evaluated through the performance of the final classification model.
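The comparison between original and restored images can be written as a reconstruction loss. The sketch below assumes a squared-error loss, which is one common choice and not necessarily the loss used in the reviewed work. The encoder f, decoder g, their parameters theta and phi, and the compressed code h are symbols introduced here for illustration.

```latex
h = f_{\theta}(x), \qquad \hat{x} = g_{\phi}(h), \qquad
L(\theta, \phi) = \frac{1}{N} \sum_{k=1}^{N}
    \bigl\lVert x_k - g_{\phi}\!\bigl(f_{\theta}(x_k)\bigr) \bigr\rVert^{2}
```

Training minimizes L over the N training images without using class labels, and the code h is the compressed representation that is passed on to the downstream classifier.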

Muniategui et al. used a deep convolutional auto-encoder for image compression and used the compressed images as inputs to the fuzzy model22). 225 vectors were extracted from the compressed 15x15-pixel image and grouped by prediction class of Very Good, Good, Regular, Bad, and Very Bad. To evaluate the compressed images, the mean of the grouped vectors was compared with each pattern. The patterns of the Very Good, Good, and Regular groups of the proposed four fuzzy classification models were similar, whereas the patterns of the Bad and Very Bad groups showed specificity depending on the classification model. These results confirmed that similar production parts yield similar compressed image vectors, and they also revealed specific patterns related to the differences between the classification models. Finally, the performance of the fuzzy classification model when images were compressed using a deep convolutional auto-encoder was evaluated using accuracy.

6. Summary and Outlook

With the rapid development of deep learning technology, it is being actively applied in welding research. Developing a model is an important part of research on applying deep learning, but evaluating the performance of the proposed model plays an equally key role. The quality of a model can be verified through visualization and performance indices in classification and regression models. If an error occurs in prediction by the developed model, it may be due to various causes such as lack of reproducibility and nonlinearity of the physical phenomena, errors in the measurement process, and the performance of the model itself. Although deep learning models are oriented toward end-to-end learning, it is necessary to identify the causes of such errors through the development of adequate evaluation indices and visualization methods.


This research was supported by the MOTIE (Ministry of Trade, Industry, and Energy) in Korea, under the Fostering Global Talents for Innovative Growth Program (P0008750) supervised by the Korea Institute for Advancement of Technology (KIAT).


1. Andersen K, Cook G. E, Karsai G, Ramaswamy K. Artificial Neural Networks Applied to Arc Welding Process Modeling and Control. IEEE Trans. Ind. Appl. 1990;26(5):824–830.
2. Bae K, Na S.-J. A Study of Vision-Based Measurement of Weld Joint Shape Incorporating the Neural Network. Proc. Inst. Mech. Eng., Part B: J. Eng. Manuf. 1994;208(1):61–69.
3. Cook G. E, Barnett R. J, Andersen K, Strauss A. M. Weld Modeling and Control Using Artificial Neural Networks. IEEE Trans. Ind. Appl. 1995;31(6):1484–1491.
4. Moon H. S, Na S. J. A Neuro-Fuzzy Approach to Select Welding Conditions for Welding Quality Improvement in Horizontal Fillet Welding. J. Manuf. Syst. 1996;15(6):392–403.
5. Lee K, Yi S, Hyun S, Kim C. Review on the Recent Welding Research With Application of CNN-Based Deep Learning - Part I: Models and Applications. J. Weld. Join. 2021;39(1):10–19.
6. Khumaidi A, Yuniarno E. M, Purnomo M. H. Welding Defect Classification Based on Convolution Neural Network (CNN) and Gaussian Kernel. International Seminar on Intelligent Technology and Its Applications (ISITIA), Surabaya, Indonesia. 2017:261–265.
7. Choi S, Hwang I, Kim Y, Kang B, Kang M. Prediction of the Weld Qualities Using Surface Appearance Image in Resistance Spot Welding. Met. 2019;9(8):831.
8. Zhang B, Hong K.-M, Shin Y. C. Deep-Learning-Based Porosity Monitoring of Laser Welding Process. Manuf. Lett. 2020;23:62–66.
9. Ashida T, Okamoto A, Ozaki K, Hida M, Yamashita T. Development of Image Sensing Technology for Automatic Welding (Image Recognition by Deep Learning). Kobelco Technol. Rev. 2019 April;37:77–81.
10. Dung C. V, Sekiya H, Hirano S, Okatani T, Miki C. A Vision-Based Method for Crack Detection in Gusset Plate Welded Joints of Steel Bridges Using Deep Convolutional Neural Networks. Autom. Constr. 2019;102:217–229.
11. Zhang Y, You D, Gao X, Zhang N, Gao P. P. Welding Defects Detection Based on Deep Learning with Multiple Optical Sensors During Disk Laser Welding of Thick Plates. J. Manuf. Syst. 2019;51:87–94.
12. Bacioiu D, Melton G, Papaelias M, Shaw R. Automated Defect Classification of Aluminium 5083 TIG Welding Using HDR Camera and Neural Networks. J. Manuf. Process. 2019;45:603–613.
13. Zhu H, Ge W, Liu Z. Deep Learning-Based Classification of Weld Surface Defects. Appl. Sci. 2019;9(16):3312.
14. Jiao W, Wang Q, Cheng Y, Zhang Y. End-To-End Prediction of Weld Penetration: A Deep Learning and Transfer Learning Based Method. J. Manuf. Process. 2020; In Press, available online 4 February 2020.
15. Zhang Z, Wen G, Chen S. Weld Image Deep Learning-Based On-Line Defects Detection Using Convolutional Neural Networks for Al Alloy in Robotic Arc Welding. J. Manuf. Process. 2019;45:208–216.
16. Yang Y, Pan L, Ma J, Yang R, Zhu Y, Yang Y, Zhang L. A High-Performance Deep Learning Algorithm for the Automated Optical Inspection of Laser Welding. Appl. Sci. 2020;10(3):933.
17. Guo Z, Ye S, Wang Y, Lin C. Resistance Welding Spot Defect Detection with Convolutional Neural Networks. Proceedings of International Conference on Computer Vision Systems, Thessaloniki, Greece. 2017:169–174.
18. Yang N, Niu H, Chen L, Mi G. X-ray Weld Image Classification Using Improved Convolutional Neural Network. AIP Conference Proceedings. 2018;1995:020035.
19. Park J. K, An W. H, Kang D. J. Convolutional Neural Network Based Surface Inspection System for Non-patterned Welding Defects. Int. J. Precis. Eng. Manuf. 2019;20(3):363–374.
20. Hou W, Wei Y, Jin Y, Zhu C. Deep Features Based on a DCNN Model for Classifying Imbalanced Weld Flaw Types. Meas. 2019;131:482–489.
21. Zhang Z, Li B, Zhang W, Lu R, Wada S, Zhang Y. Real-Time Penetration State Monitoring Using Convolutional Neural Network for Laser Welding of Tailor Rolled Blanks. J. Manuf. Syst. 2020;54:348–360.
22. Muniategui A, Hériz B, Eciolaza L, Ayuso M, Iturrioz A, Quintana P. lvarez, Spot Welding Monitoring System Based on Fuzzy Classification and Deep Learning. Proceedings of 2017 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), Naples, Italy. 2017:1–6.
23. Müller A. C, Guido S. Introduction to Machine Learning with Python: A Guide for Data Scientists. O'Reilly Media, Inc., California, USA. 2016.
24. Haralick R. M, Shanmugam K, Dinstein I. H. Textural Features for Image Classification. IEEE Trans. Syst. Man. Cybern. 1973;SMC-3(6):610–621.


Table 1

Summary of research papers reviewed in this study

Classification / Regression | Evaluation index | Comparison with conventional method | Comparison with neural networks | Visualization | 1st author (publication year) | Ref. No.
--- | --- | --- | --- | --- | --- | ---
Classification | Accuracy | - | - | - | A. Khumaidi (2017) | 6)
Regression / Classification | Accuracy | - | - | - | S. Choi (2019) | 7)
Classification | Accuracy, Confusion matrix | - | - | - | B. Zhang (2020) | 8)
Regression | MAPE* | Edge detection | - | - | T. Ashida (2019) | 9)
Classification | Accuracy, Recall, F1 score | - | SCNN*8 / CNN (pretrained and tuned models) | - | C. V. Dung (2019) | 10)
Classification | Mean error rate | - | FCN*9 | - | Y. Zhang (2019) | 11)
Classification | Accuracy, Training time | - | FCN, CNN (No. of classes) | - | D. Bacioiu (2019) | 12)
Classification | Accuracy | - | CNN (classification module) | - | H. Zhu (2019) | 13)
Classification | Accuracy | - | CNN (with and without transfer learning) | - | W. Jiao (2020) | 14)
Classification | Accuracy, Training time | - | CNN (reference) | Intermediate activations | Z. Zhang (2019) | 15)
Classification | Accuracy, Recall, Training time | - | AlexNet, VGG-16, ResNet-50, DenseNet-121, MobileNetV3-Large (pretrained and tuned models) | Intermediate activations | Y. Yang (2020) | 16)
Classification | Training time, Mean error rate | Visual inspection by expert | CNN (No. of layers) | - | Z. Guo (2017) | 17)
Classification | Accuracy | SVM*4 | CNN (proposed model, transfer learning), FCN | - | N. Yang (2018) | 18)
Regression / Classification | TPR**, TNR*3 | Hough circle / HOG*5+SVM, LBP*6+SVM | CNN | - | J.-K. Park (2019) | 19)
Classification | Accuracy | Haralick feature, HOG feature | SSAE*10, CNN | t-SNE*12 | W. Hou (2019) | 20)
Classification | Accuracy, Recall | HOG, LBP, HOG+LBP, BOF*7 | SAE*11, CNN (No. of layers) | t-SNE | Z. Zhang (2020) | 21)
Classification (Auto-encoder) | Accuracy | - | - | - | A. Muniategui (2017) | 22)

* Mean Absolute Percentage Error;
** True Positive Rate;
*3 True Negative Rate;
*4 Support Vector Machine;
*5 Histogram of Oriented Gradient;
*6 Local Binary Pattern;
*7 Bag of Features;
*8 Shallow ConvNet;
*9 Fully Connected Net;
*10 Stacked Sparse Autoencoder;
*11 Sparse Autoencoder;
*12 t-Distributed Stochastic Neighbor Embedding
