Skin Cancer Diagnose Using Deep Learning

Duc Haba
11 min read · Aug 9, 2022


Generated by the Stable Diffusion Generative AI (Duc Haba fork version)

Welcome new friends and fellow readers to a new demystified AI series. It is my second Deep Learning (DL) deployment project on Hugging Face, where you can test-drive a DL model with your photo.

Fun fact: The image above is generated by the Stable Diffusion Generative AI (Duc Haba fork version) from feeding the first two sentences of this article.

We are returning to the healthcare topic. The “Skin Cancer Diagnose” project combines two Deep Learning models. The dataset is from the International Skin Imaging Collaboration (ISIC) organization, which we will discuss in the technical section.

The United States Centers for Disease Control and Prevention (CDC) estimates that treating all kinds of skin cancer costs at least $8 billion annually. Several factors affect the cost of skin cancer treatment, and the most influential one is the stage of the disease at diagnosis. As a general rule of thumb, treating cancer at an early stage costs less than treating it at a later stage. Thus, using Deep Learning for early skin cancer detection could substantially lower the world’s annual treatment cost.

This article provides an insightful and entertaining overview of a Skin Cancer diagnosis. In particular, you will learn about the following:

  • Project description on HuggingFace
  • Technical implementation on Jupyter Notebook and Bessy
  • Data deep dive
  • Review results
  • Merging datasets challenges
  • Into the Bad Land (the solution)
  • Summary
  • Demystify AI Series Listing

Before we start,
let’s take a collective deep breath…1, 2, 3
…exhale, slow three, two, one. Begin.


There are two DL models on the “Skin Cancer Diagnose” website. The first DL model predicts whether a lesion is malignant or benign. The model did exceptionally well with a 92% F1-Score. For first-look triage, it’s comparable to a human dermatologist. There is no Kaggle competition for this dataset, but the 92% F1-Score would be in the top twenty compared to other skin cancer competitions.

  • The definition of malignant is “cancer cells grow uncontrolled and can invade nearby tissues and spread to other parts of the body through the blood and lymph system.”
  • Conversely, benign is a “condition, tumor, or growth that is not cancerous. It does not spread to other parts of the body.”

The donut graphs display the color orange for malignant and blue for benign.

Warning: Do NOT use this for any medical diagnosis. I am not a dermatologist, and NO dermatologist has endorsed it. This DL model is for my independent research. Please refer to the GPL 3.0 for usage and license.

The second DL model predicts the type of skin cancer. The eight skin cancer types are as follows.

  1. Bowen Disease (AKIEC)
  2. Basal Cell Carcinoma
  3. Benign Keratosis-like Lesions
  4. Dermatofibroma
  5. Melanoma
  6. Melanocytic Nevi
  7. Squamous Cell Carcinoma
  8. Vascular Lesions

The lovely donut graph displays only the top three predictions. As with the first model, the data is from the world-renowned International Skin Imaging Collaboration (ISIC) organization. However, the training data is a combination of three separate datasets. There are a few surprises in selecting and combining the datasets, and the model with the higher F1-Score is not necessarily the most satisfactory forecasting model. Please refer to the technical discussion for more detail.

The warning is worth restating.

Warning: Do NOT use this for any medical diagnosis. I am not a dermatologist, and NO dermatologist has endorsed it. This DL model is for my independent research. Please refer to the GPL 3.0 for usage and license.

The “Skin Cancer Diagnose” website has an easy-to-use interface. Click or touch the “drop image here” frame to take a photo or upload your picture.

In two delightful donut graphs, the DL will predict your skin cancer type and whether it is malignant or benign. Click the “Clear” button to reset and prepare for the next image. The example pictures are selected to show the range of accuracy with the model. You can click on the example pictures and see the result for yourself.

Technical implementation

For the techie in all of us, read onward to see how Bessy and I do it. With Hugging Face Space (HFS), the deployed code is in the file “” located in the “Files and versions” tab.

For starters, create “bessy” as an instance of the “ADA_SKIN” class, which holds Fastai’s predict methods. Bessy is our companion coder, and she is an imaginary French Bulldog. The beautiful double “donut chart” is Bessy’s invention: she takes the predicted output from the two Fastai DL models, imports it into a Pandas DataFrame, sorts it, and graphs it using Matplotlib. The relevant code for the methods “bessy.predict_donut()” and “bessy._draw_pred()” is here.


Fair warning: Bessy is a French Bulldog and a tad lazy in the deployment section. Thus she copies the majority of the code from Maxi. She could write it cleaner. Maybe after a vacation, I will ask or command her to rewrite it.
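To give a feel for the double donut idea, here is a minimal sketch. It is not Bessy’s actual “predict_donut()” or “_draw_pred()” code. It assumes you already have class labels and probabilities of the kind a Fastai predict call returns, and the function names “top_predictions” and “draw_double_donut”, plus all the numbers, are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def top_predictions(labels, probs, top_n=None):
    """Put labels and probabilities in a DataFrame, highest probability first."""
    df = pd.DataFrame({"label": list(labels), "prob": [float(p) for p in probs]})
    df = df.sort_values("prob", ascending=False).reset_index(drop=True)
    return df if top_n is None else df.head(top_n)


def draw_double_donut(df_binary, df_type):
    """Draw two donut charts side by side: malignant/benign and cancer type."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 5))
    titles = ("Malignant vs. Benign", "Top 3 skin cancer types")
    for ax, df, title in zip(axes, (df_binary, df_type), titles):
        # A pie chart whose wedges are narrower than the radius leaves a hole
        # in the middle, which is all a donut chart is.
        ax.pie(df["prob"], labels=df["label"], autopct="%1.0f%%",
               startangle=90, wedgeprops=dict(width=0.4))
        ax.set_title(title)
    return fig


# Hypothetical model outputs (made-up numbers, not real predictions).
binary = top_predictions(["Benign", "Malignant"], [0.81, 0.19])
types = top_predictions(
    ["Melanoma", "Melanocytic Nevi", "Dermatofibroma", "Vascular Lesions"],
    [0.10, 0.62, 0.23, 0.05],
    top_n=3,
)
fig = draw_double_donut(binary, types)
```

Because the second chart keeps only the top three rows, Matplotlib rescales those three slices so they fill the whole ring.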

That’s it for deploying a DL model. Bessy made it easy peasy lemon squeezy.

Data deep dive

Bessy and I would like to shout out a big “Thank You” to the old and new online colleagues who spent time reviewing Bessy’s Deep Learning and Fastai work on Jupyter Notebook. The following is a clean and orderly narration of the work. It may not strictly follow the Jupyter Notebook sequence.

Bessy and I are not dermatologists. We relied on the dataset correctness from the International Skin Imaging Collaboration (ISIC) organization. The data are labeled accurately, but the challenges lie in multiple datasets with different structures. Bessy spends over a month downloading, analyzing, training, verifying, and finding that most of the data is inadequate.

Bessy combined the Malignant and Benign data with the eight types of skin cancers data. She trained it and achieved a fair F1-Score, but the real-world testing was disastrous. The result of the four datasets combined is as follows.

Data Inspection

Notice the steep slope of the “10 Label, Vocab” graph. The data is not evenly distributed across the categories. There are more than 50 times as many BCC images as SCC images. BCC alone is 34.8% of the total images, and the top three image categories account for 73.2%.

The image size ranges from 275 to 4000+ pixels. Bessy was surprised that the model trained and converged at a loss of 0.34532, with an F1-Score of 63.2%. Bessy uses the ResNet34 model as her default base model for transfer learning. If it gives a reasonable result, she tries other base models, but this one did not.

Bessy is jumping ahead. Slow down, and start at the beginning. Many moons ago, Bessy thought it would be painless to use Image Classification with DL in the Fastai framework to predict skin cancer type. She found the MNIST: HAM10000 dataset on Kaggle. Here is the link to the Kaggle dataset:

Within one weekend, she got it done. Bessy is a working canine. Thus she does research late at night or on weekends. Here are her results.

I hope you remember Ada from the previous journey. She is an [imaginary] alpha dog. The ADA_SKIN class is a subclass of the ADA class. Technically speaking, Ada is Bessy’s mom.

The data analysis section yields a likable composite graph and image.

Inspect Data

The images are uniform in size: 600 pixels wide and 450 pixels high. There are seven skin cancer categories, but MN is badly over-represented, with 6,705 MN images versus 115 DF images.

Bessy looks at the charts, and she knows it is not a balanced dataset. MN images alone account for 67% of the total images. The slope of the “7 Labels, Vocab” graph is too steep.
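The imbalance check is easy to reproduce. With the counts quoted above, 6,705 MN images against 115 DF images, the largest category is about 58 times the smallest. The sketch below computes the same two numbers, per-category share and largest-to-smallest ratio, for any list of labels. The helper name “class_shares” and the toy labels are made up for illustration.

```python
from collections import Counter


def class_shares(labels):
    """Return ({label: share of total}, largest-to-smallest count ratio)."""
    counts = Counter(labels)
    total = sum(counts.values())
    # most_common() orders the categories from largest to smallest.
    shares = {label: n / total for label, n in counts.most_common()}
    ratio = max(counts.values()) / min(counts.values())
    return shares, ratio


# Toy labels only: 6 MN, 3 BCC, 1 DF. These are not the real dataset counts.
labels = ["MN"] * 6 + ["BCC"] * 3 + ["DF"] * 1
shares, ratio = class_shares(labels)
print(shares)
print(ratio)
```

A ratio near 1 means a balanced dataset. The further it climbs, the steeper the slope in a “Labels, Vocab” style graph.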

Review results

Still, she forges ahead with the training. Bessy chooses the default ResNet34 base model because fixing the data is more critical than selecting a more complex model. Her results after training the model are as follows.

  • The final loss rate is 0.117823, which is a fair value. It seems to imply an accuracy of about 89%, but that impression is misleading.
  • The confusion matrix does show a distinct diagonal line, but it heavily favors the MN type. It is a foreseeable outcome because the MNIST: HAM10000 data is skewed toward the MN photos.
Confusion Matrix

The precision is the accuracy of positive predictions.

  • Where: Precision = TP/(TP + FP)
  • TP = True Positive
  • TN = True Negative
  • FP = False Positive
  • FN = False Negative

Recall is the fraction of actual positives that were correctly identified.

  • Where: Recall = TP/(TP+FN)

The F1 score is the harmonic mean of precision and recall, such that the best score is 1.0 and the worst is 0.0.

  • Where: F1 Score = 2*(Recall * Precision) / (Recall + Precision)
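The three formulas above translate directly into a few lines of Python. This is a small sketch for a single class, with made-up counts, and the function name “precision_recall_f1” is only for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for one class from raw counts."""
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * (recall * precision) / (recall + precision)
    return precision, recall, f1


# Made-up counts for one class: 80 hits, 20 false alarms, 40 misses.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# prints: precision=0.800 recall=0.667 f1=0.727
```

Notice that TN never appears. None of the three scores rewards the model for correctly saying “not this class,” which is why they are more telling than plain accuracy on a skewed dataset.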
F1 Score

So, where are the problems?

  • The F1-Score graph shows a steep slope, which implies an uneven prediction. The MN F1-Score is 95%, while the DF, AKIEC, and Melanoma are 65%, 67%, and 68%, respectively. Because of MN, the average F1-Score is artificially high.
  • Bessy saved four images from each of the seven categories. These 28 images are in neither the training nor the validation dataset.
  • When using this “test” dataset, Bessy finds the model incorrectly classified 17 images as MN with 95% to 99% certainty. They are false positives. The four actual MN photos were identified correctly.
  • Bessy is devastated, but she has foreseen it from the input data.
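How much one big category can lift an average depends on how the average is taken. The notebook’s exact averaging method is not stated here, so treat the following as an illustration only. It contrasts a plain (macro) average with an average weighted by image count, using just two of the seven categories and the MN and DF figures quoted in this article. The function names are made up.

```python
def macro_f1(f1_by_class):
    """Plain average: every category counts the same."""
    return sum(f1_by_class.values()) / len(f1_by_class)


def weighted_f1(f1_by_class, support):
    """Average weighted by how many images each category has."""
    total = sum(support[c] for c in f1_by_class)
    return sum(f1_by_class[c] * support[c] for c in f1_by_class) / total


# Two categories only. F1 values and image counts are the ones quoted in the
# article; pairing them this way is an assumption made for illustration.
f1_by_class = {"MN": 0.95, "DF": 0.65}
support = {"MN": 6705, "DF": 115}

print(round(macro_f1(f1_by_class), 3))              # 0.8
print(round(weighted_f1(f1_by_class, support), 3))  # 0.945
```

The weighted figure sits almost on top of the MN score, which hides how poorly the small category is doing.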

Merging datasets challenges

As Bessy’s colleagues pointed out, she instructed the Fastai framework to take a random 20% of the images as the validation dataset. That is unwise when the data is skewed toward the top three categories.

She should write a function that randomly selects 20% of the photos from each category, rather than a random 20% of the whole dataset. Then the validation set will not lean toward the top-heavy skin cancer types.
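Such a per-category split could look roughly like the sketch below. It is plain Python, not a Fastai feature and not code from Bessy’s notebook. The function name “stratified_split”, the (file name, label) tuple layout, and the toy file names are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(items, valid_pct=0.2, seed=42):
    """Split (path, label) pairs so each label sends ~valid_pct to validation."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item in items:
        by_label[item[1]].append(item)

    train, valid = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        # Keep at least one validation image per category, however small.
        n_valid = max(1, round(len(group) * valid_pct))
        valid.extend(group[:n_valid])
        train.extend(group[n_valid:])
    return train, valid


# Toy, deliberately skewed data: 100 MN images versus 10 DF images.
items = [(f"mn_{i}.jpg", "MN") for i in range(100)]
items += [(f"df_{i}.jpg", "DF") for i in range(10)]
train, valid = stratified_split(items)
print(sum(1 for _, lbl in valid if lbl == "MN"))  # 20
print(sum(1 for _, lbl in valid if lbl == "DF"))  # 2
```

Every category now contributes its own 20%, so a rare type cannot vanish from the validation set by bad luck.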

Bessy has not implemented this per-category split. She might do it at a later time. Instead, she broke the work into two separate DL models. The first model is “Malignant versus Benign,” and the second is “Skin Cancer type Classification.”

The “Malignant versus Benign” data is a balanced dataset from Claudio Fanconi based on the ISIC-Archive data.

Since the data is clean and balanced, Bessy encounters no issues in analyzing it, as the graphs and images below show.

Image Spec.

The final training loss value is 0.075893, the confusion matrix looks fair, and the F1-Score graph is level with malignant at 91% and benign at 93%.

Confusion Matrix
F1 Score

Into The Bad Land (the solution)

For the “Skin Cancer type Classification,” Bessy ventured into the “bad land,” i.e., thinking outside the box. She artificially caps each category at 2,000 images.

She broke a cardinal rule of Deep Learning by deleting good data.

Bessy read many articles and scholarly papers about skin cancers and found that a few skin cancer types are more prevalent than the others. At the end of the day, she is confused, dazed, and paralyzed. Her head hurt.

Being a French Bulldog, she takes decisive action. Wisely or not, action is her middle name.

Capping the number of photos at 2,000 per category is a common-sense approach. The ratio of the median category to the largest one is reduced to 1 to 3. Since the metadata is in a Pandas DataFrame, Bessy makes some fancy moves to limit each category artificially. They are her “Pandas Kung Fu” moves. :-)

The three datasets are ISIC images consolidated on Kaggle:

Bessy’s new method is the “ada.df_norm2000_image_info()” function.
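For readers who want to try the capping idea, here is a minimal Pandas sketch. It is not the actual “ada.df_norm2000_image_info()” code. The function name “cap_per_category”, the column names “fname” and “label”, and the toy data are all assumptions.

```python
import pandas as pd


def cap_per_category(df, label_col="label", cap=2000, seed=42):
    """Keep at most `cap` randomly chosen rows for each value of label_col."""
    shuffled = df.sample(frac=1, random_state=seed)  # shuffle every row
    capped = shuffled.groupby(label_col).head(cap)   # first `cap` rows per label
    return capped.reset_index(drop=True)


# Toy metadata table: 5,000 MN rows and 300 DF rows (made-up file names).
df = pd.DataFrame({
    "fname": [f"img_{i}.jpg" for i in range(5300)],
    "label": ["MN"] * 5000 + ["DF"] * 300,
})
capped = cap_per_category(df)
print(capped["label"].value_counts().to_dict())
# prints: {'MN': 2000, 'DF': 300}
```

Shuffling first means the rows that survive the cap are a random sample rather than simply the first 2,000 in file order. Categories already under the cap are left untouched.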

  • The confusion matrix is confusing initially, and SCC has the most prominent issue. But that is anticipated from looking at the data distribution.
Confusion Matrix
  • The F1-Score average is lower at 0.52, but Bessy’s Test data shows that the correct skin cancer classification is among the top three predictions. Furthermore, there is NO False Positive from the Test dataset.
F1 Score


Bessy is worn out. It has been an arduous journey, and there are many lessons learned. The first and most salient lesson is that a Deep Learning model with a lower loss rate and a higher F1-Score is not necessarily the most practical model when tested on real-world data.

The second lesson is that selecting the “validation” dataset is not a simple task. Bessy has to dig deep into the data to unearth the truth.

I love the previous sentence. The double entendre is funny. :-) Do you get it? Bessy is a dog, and she has to dig.

The third lesson is don’t shy away from a subject you are not an expert on. Maybe a few dermatologists read this article and think, “Deep Learning might be a solution for reducing the cost of skin cancer treatment.”

According to Bessy, the fourth and final lesson is that one can develop and deploy an AI healthcare project in a few weeks. The same AI healthcare project might have required millions of dollars of research and development a few years ago. What you learned three years ago about AI may be outdated.

One dares dream that one day, a healthcare AI project is achievable with an imaginary canine coding companion laboring furiously in a few weeks. :-)

Bessy hopes you visit the “Skin Cancer Diagnose” website with your iPhone or Android phone [or desktop] and test-drive it. You will be surprised by how accurate it is.

There are many improvements Bessy could make so the model is more accurate, but her [real] human companion’s company needs more of his time. He likes working at the company, or so he told Bessy. But there are only 24 hours in a day. You can choose to spend time working on your job or with your imaginary canine coding friend.

That concludes our discussion on the “Skin Cancer Diagnose” article. Contact me directly on LinkedIn if you have questions. Ada, Bessy, and I are looking forward to reading your feedback. As always, I apologize for any unintentional errors. The intentional errors are mine and mine alone. :-)

Have a great day, and please give a “thumbs up, like, or heart.”



Duc Haba

AI Solution Architect at Prior Xerox PARC, Oracle, GreenTomato, Viant, CEO Swiftbot, CTO