July 6th — The new update, section “#15 — Update Blind Baseline,” covers the bias factor in the Blind Baseline and the NLP baseline using an unsupervised algorithm.
Welcome to the “AI Start Here” (A1SH) project. It is a new journey in the “Demystify AI” series.
The only thing these journeys have in common is that the problems are taken from real-world Artificial Neural Network (ANN) projects. I have worked on them, coded them, cried over them, and sometimes had a flash of insight or an original thought about them.
The journeys are a fun and fascinating insight into the daily work of an AI scientist and big-data scientist. They are for sharing concepts, experiences, and code with colleagues and AI students, but I hope that you, gentle reader, will enjoy them too.
The A1SH journey is the seventh in the “Demystify AI” series. It is a cautionary tale that serves as a definitive guide for “how to start” an AI project for businesses and nonprofits alike.
It was a dark and stormy night when a friend asked me to talk with the founder and CEO of an environmental foundation. I dream of starting a journey with a dramatic monologue, but it will have to wait. In reality, it was a typical sunny California weekend, and we met over video chat.
We had a long, informative talk about his environmental foundation and how they had spent over a year building an AI system but failed. What is more troubling is that the foundation spent a lot of money on two consulting companies, yet there was no AI baseline, the data was not clearly labeled, and, last but not least, the known data biases were not documented.
I spent the past six months working weekends on a pro bono basis because I believe this nonprofit foundation can positively impact our society. In doing so, I discovered why consulting companies with deep experience in mobile apps, or companies well versed in digital transformation strategy, do not understand how to start and manage an AI project.
Spoiler alert: in the end, I steered the foundation to employ an achievable AI strategy. This journey details, with a Jupyter notebook, how I did it.
The process starts with defining the three definitive steps for the AI project. The steps are establishing an AI baseline, collecting labeled data, and discovering intentional biases in the data, also known as the “A1SH” method. The recommended pronunciation is “Ash-e.”
Why are the first three steps essential?
There is no comprehensive report on why the majority of AI projects fail, but there is anecdotal evidence. A WSJ Pro AI article stated, “Companies with artificial-intelligence experience moved just 53% of their AI proof of concepts into production,” according to a recent report [in 2020]. The second article, from BMC, is titled “Why does Gartner predict up to 85% of AI projects will not deliver,” from 2018.
The reasons are not for lacking intelligent people nor advances in technology. The answer is in front of us if we look slightly behind us.
In the early days of web and mobile development, we had the same dismal success rate as AI. The clients defined the feature requirements, and the developers rushed to implement them. We know now that this two-step process is a recipe for disaster. The clients will not be happy with the app experience, and the developers are frustrated because they are not designers or user-experience experts.
Today, nearly all web and mobile app development follows three steps: one, the client defines the features; two, a design agency or UX experts envision the user experience; and three, the developers create the app. Sometimes, the clients choose to do steps one and two collaboratively. As a result, the mobile and web development success rate is north of 90%.
Taking the lesson learned from mobile development, before rushing into developing an AI model, we must follow the A1SH method. It will increase the success rate substantially. Furthermore, you will know whether your AI project is achievable before entering the AI development phase.
Before we begin, let’s talk about the foundation project. I have a written consent email from the foundation CEO to use the data and the institution’s name, but I chose not to use the data or the foundation’s name. It is because the first three steps are fundamental, and you can apply them to any AI project.
I will pattern the A1SH journey on the six months I spent with the foundation. The difference is that I chose a fictitious AI project, but it is similar to the actual AI project.
For the A1SH journey, the AI project identifies the top thirteen chicken diseases on a chicken farm.
I selected the chicken-sickness AI model for many reasons. One of the key reasons is that you are not likely to be a chicken domain expert, and therefore, it forces you to learn the process and not jump ahead to the conclusion.
The salient point is that the process applies to a wide range of AI projects. It gives you a solid foundation to build your AI model for business and nonprofit foundations. You will learn the first three steps to ground your AI project. Whether you are constructing a multi-million dollar AI or at the beginning of an idea, the first three must-have steps are the AI baseline, the labeled data, and the known biases.
If you are a chicken whisperer, you are already jumping ahead to the solution, but you must resist the call of the rooster and follow the process. It will be an enjoyable journey for the rest of us, where we will be among the first to construct an AI model for identifying the top thirteen chicken diseases. To my knowledge, no one has built it. It may not be possible, and that is the beauty of the process. You will definitively know whether your AI idea is viable by merely doing the first three steps.
The A1SH journey has mind maps and plenty of Python code. Furthermore, it is written for business and design professionals. If you are a UX practitioner, you will see the similarity between this approach and the design-thinking approach. Like all the “Demystify AI” series on Jupyter Notebook, you are welcome, and even encouraged, to hack the code if you are an AI professional or AI student. The code is written in the “River” style, so it is effortless to read and hack, even for novice programmers. :-)
It took a little over six months for this journey to come to light. I used the original Jupyter Notebook to collaborate with the consultants and present to the foundation’s executive members. I worked as an Enterprise Mobility Solutions Architect for many years, and I am well versed in the art of PowerPoint. Still, I chose to use the Jupyter Notebook as the visual aid in the presentations. I wanted to drive home the point that developing AI is not the same as creating a mobile app or pontificating about a digital transformation project.
I had fun writing and sharing the A1SH journey. It is a “sandbox” project. In other words, it is a project focused on solving one problem.
So if you are ready, let’s take a collective calming breath … … and begin.
2 & 3 — Setup and The Journey
The power of the Jupyter notebook is that it is interactive and individualized. The best way to learn is to make the journey your own.
The first step is to choose and name your canine companion. This project requires both a deep understanding of AI development and tact, and therefore, there is no better candidate than “Eggna.” She has no relation to the scientist Edna Marie “E” Mode from Disney’s “The Incredibles” movie. However, they are both great scientists and share a common trait. One is a fictitious cartoon character, while the other is an AI digital canine.
Typically, a dog name is chosen, e.g., “Beefy,” “Jerky,” or “Cakina,” but be warned: don’t name him or her after your cat, because a feline will not follow any commands.
If you are serious about learning, start hacking by changing the companion’s name to your preference. The Jupyter notebook allows you to add notes, add new code, and hack Eggna’s code.
Like a good little programmer, Eggna (or insert your companion’s name here) starts by creating an object, or class. By the way, Eggna will skip copying a few “setup” code cells here because they have no output.
Eggna will skip copying the code-cells, but she will copy the output here.
Please visit Eggna’s “AI_Start_Here_A1SH.ipynb” Jupyter notebook on GitHub to view the code.
When copying the code into an Atom project, Eggna would add the methods during the class definition, but here she will hack it and add new class functions throughout the notebook journey.
Eggna will skip copying the section “4.0 | The “River” Coding Style” here because it is meant for programmers. She fears the gentle readers would fall asleep or stop reading this incredible journey that she worked so diligently to write.
5 — ML Baseline
In the previous journey, Rocky uses the “GraphViz” library to draw mind-map diagrams. They are super effective in reinforcing the journey lessons, so Eggna adopts the same mind-map technique. The mind-map code is copied from Rocky’s Jupyter notebook. The full description is in Rocky’s “Fast.ai Book Study Group #G1FA” journey.
Why “drag and drop” when you can code? Writing code is more fun. :-)
Figure-1 illustrates the A1SH journey for Eggna. Her first stop is “Baseline,” then onward to “Labeled Data” and finishing up with “Known Data Biases.” Eggna loves using mind maps because she can trail off on a tangent and not get lost.
The definition of a baseline is a measurable and verifiable number, or metric, against which the AI-trained model’s accuracy is compared. In other words, what is the target accuracy for the Machine Learning (ML) model? Or how can Eggna tell if her trained ML model is a success or a failure?
It is an obvious first question, but Eggna found that it is not easy to answer accurately. Most clients’ knee-jerk response is 100% accuracy, because it is an artificial intelligence system, so it should be perfect.
Here is where most consulting companies fail. They rely on the customer to set the target accuracy without thoroughly reviewing the dataset. Eggna recognizes the difficulty: when the client gives a measurable goal, e.g., 95% to 99% accuracy, the consultant must accept it. However, a better method is to educate the client on how to reach an achievable ML baseline.
On the nonprofit foundation project, the opposite happened. The consultants convinced the CEO that ML is too complex to set a target accuracy goal before development. Unfortunately, the “too complex” excuse is prevalent in many real-world AI projects. The answer is that the A1SH process defines an achievable ML baseline, and Eggna will demonstrate it.
6 — Detour AI Classification
Before diving deep into calculating the baseline, Eggna will detour into the world of AI classification. It is an ugly picture out there. There are too many classifications, and most categories are a wishlist for future AI. Unfortunately, these optimistic classifications obfuscate AI instead of demystifying it.
For example, the popular classification of AI is (1) Reactive Machines, (2) Limited Memory, (3) Theory of Mind, and (4) Self Aware. These four categories are a subset of the Forbes article “7 Types Of Artificial Intelligence” by Naveen Joshi in June 2019.
Based on Joshi’s categories, there are no algorithms assigned to each of the categories. For example, “#4, Self Aware,” “#6, Artificial General Intelligence (AGI),” and “#7, Artificial Superintelligence (ASI)” are future AI categories. They may not be achievable.
Hue-mons {humans} can barely agree on the definition of “consciousness.” The closest description of artificial consciousness is the “Turing Test” by Alan Turing in 1950, which he named the “imitation game.” Turing’s original paper is “Computing Machinery and Intelligence,” Mind 49: 433–460.
As an analogy, imagine Eggna asks Mr. Neil Armstrong, Michael Collins, or Buzz Aldrin after they made the trip to the moon in 1969, “…so you have been to the moon; how long will it take you to go to the Proxima-B planet in the Proxima Centauri star system?”
In 2021, the iPhone has 120 million times more computing power than the “Apollo 11” spacecraft, and yet, on February 18, 2021, it took NASA 203 days to send a spacecraft rover, “Perseverance,” nicknamed “Percy,” to Mars. The distance between Mars and Earth is about 55 million kilometers. Therefore, a little bit of math later, the trip to “Proxima B,” 4.25 light-years from Earth, would take 137 thousand years. Clearly, engineers must create breakthrough technology before Mr. Armstrong could make the trip.
Traveling to the “Proxima B” planet is the same order of technical difficulty as AI reaching “consciousness.” Google, Facebook, the Department of Defense (DOD), and hundreds of AI companies create impressive AI today. It is comparable to the “landing on the moon” stage of the analogy, but achieving a sentient AI would take many unfathomable breakthroughs. Quantum computing could be one of those many breakthroughs.
In Figure-2, Eggna classifies AI into three groups: AI, ML, and DL. “Artificial Intelligence (AI)” is any technique, algorithm, or program that enables computers to mimic human behavior. “Machine Learning” (ML) is an AI technique that gives computers the ability to learn without being explicitly programmed to do so. ML is a subset of AI. “Deep Learning” (DL), also known as the Artificial Neural Network (ANN), is a subset of ML that makes the computation of multi-layer neural networks feasible.
Most people use the words AI and ML interchangeably, but they shouldn’t. Machine Learning is a subset of AI. Eggna can write a “checker game” algorithm in which she explicitly codes every possible outcome and move. The checker algorithm will always win or tie and never lose a game. Therefore, it mimics hue-mon {human} thinking, and it qualifies as AI, but it is not an ML system.
ML is when the algorithm learns by itself. For example, in facial recognition using the ANN algorithm, Eggna does not explicitly code how to recognize a face. In other words, she does not need to write code about eyes, nose, mouth, ears, hair, or smile. Using labeled images, the ANN learns how to identify faces without an explicit program, e.g., that image is Duc’s face.
Eggna likes to group ML based on the algorithm, but the algorithms are not easily aligned. One of the best methods is categorizing ML into (1) reinforcement learning and (2) supervised and unsupervised learning.
Supervised learning is the easiest to understand. It means the algorithms require data with labels, e.g., chicken photos and a disease label for each chicken. In some literature, the data and label are referred to as the independent and dependent variables. The supervised learning algorithm assigns a probability to which label goes with the image, e.g., this chicken photo has a 92.3% probability of “Coccidiosis disease.”
Unsupervised learning is very close to supervised learning, except that, as the name implies, the algorithms require only the data and not the labels. For example, Eggna takes all the people on Facebook and, using an unsupervised learning algorithm, groups or segments them into categories.
What are the categories? Eggna doesn’t know, and she does not explicitly code them, but she puts a label on each group after the unsupervised algorithm categorizes the Facebook people. She could label the emergent groups as “Democrat or Republican,” or “urban-trend-setter, midwestern-middle-class, or rural-farmer,” or “independent-teenager, young-and-impressionable, conspiratorial-adult-man, bible-thumper, or rich-old-man.”
A few algorithms can work with both supervised and unsupervised learning.
The ten ML supervised and unsupervised learning algorithms are as follows.
- Apriori Algorithm, (Unsupervised Learning)
- Deep Learning, aka. Artificial Neural Networks (ANN)
- Decision Trees, (Supervised Learning)
- K Means Clustering Algorithm, (Unsupervised Learning)
- K-Nearest Neighbors (KNN), (Supervised Learning)
- Linear Regression
- Logistic Regression
- Naive Bayes Classifier Algorithm, (Supervised Learning)
- Random Forests
- Support Vector Machine (SVM) Algorithm, (Supervised Learning)
The “Machine Learning Tutorial” blog from MindMajix in 2021 by Ravindra Savaram has an excellent short description of the above algorithms.
Reinforcement learning is where the algorithm continuously trains itself through trial and error. One of Eggna’s first ML programs, many years ago, used reinforcement learning to teach the AI how to beat the “Flappy Bird” game. “Flappy Bird” is a simple game where a bird has to fly through a maze of pipes. The bird flies forward, and Eggna presses the space bar to make the bird flap his wings and fly upward. That’s it.
Unlike the supervised or unsupervised learning algorithms, there are no data and no labels. Eggna creates 100 random flights and measures how long each flight lasts before the bird crashes into the pipes. She takes the top 10 flights, i.e., the longest times, varies the parameters slightly to generate another 100 flights, passes them through again, i.e., the next epoch, and repeats. After 4,300 epochs, Eggna beat the Flappy Bird game.
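Here is a minimal sketch of that trial-and-error loop. The scoring function is a hypothetical stand-in, since the actual game physics live inside Flappy Bird, and the parameter count and mutation scale are illustrative guesses, not Eggna’s original code.

```python
import random

def play_flight(params):
    # Stand-in scorer: the real project runs one Flappy Bird game with the
    # given flap-decision parameters and returns the survival time.
    return -sum((p - 0.5) ** 2 for p in params) + random.gauss(0, 0.01)

def mutate(params, scale=0.05):
    # Vary each parameter slightly to create a new candidate flight.
    return [p + random.gauss(0, scale) for p in params]

# Start with 100 random flights, each defined by 4 hypothetical parameters.
population = [[random.random() for _ in range(4)] for _ in range(100)]

for epoch in range(4300):  # Eggna needed about 4,300 epochs to win.
    # Score every flight and keep the top 10, i.e., the longest survivors.
    top10 = sorted(population, key=play_flight, reverse=True)[:10]
    # Each survivor spawns 10 slightly varied children -> the next 100 flights.
    population = [mutate(parent) for parent in top10 for _ in range(10)]

print(max(play_flight(p) for p in population))  # best score after training
```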
There are 13 well-known algorithms for reinforcement learning. The “Algorithms for Reinforcement Learning” book (2010) by Csaba Szepesvari is the definitive description of the algorithms.
- Monte Carlo, every-visit Monte Carlo
- Q-learning, State–action–reward–state
- SARSA, State–action–reward–state–action
- Q-learning — Lambda, State–action–reward–state with eligibility traces
- SARSA — Lambda, State–action–reward–state–action with eligibility traces
- DQN, Deep Q Network
- DDPG, Deep Deterministic Policy Gradient
- A3C, Asynchronous Advantage Actor-Critic Algorithm
- NAF, Q-Learning with Normalized Advantage Functions
- TRPO, Trust Region Policy Optimization
- PPO, Proximal Policy Optimization
- TD3, Twin Delayed Deep Deterministic Policy Gradient
- SAC, Soft Actor-Critic
The salient point is that Eggna focuses on the algorithms and not the AI-type. Any AI categorization is sufficient as long as the AI-type is mapped to algorithms. If there is an AI-type with no mapped algorithm, then it is not practical. Such types add chaos to the AI universe, so Eggna implores you not to create or propagate an AI-type without an algorithm.
For example, no algorithm exists for “artificial consciousness.” Therefore, there shouldn’t be an AI-type named “self-aware” or “Artificial Superintelligence.”
The DL, i.e., ANN, algorithm is Eggna’s specialty. She further classifies DL using concepts from the book “Deep Learning for Coders with Fastai and PyTorch: AI Applications Without a Ph.D.” (2020) by Jeremy Howard and Sylvain Gugger. The three categories for DL are “Vision, NLP, and Tabular.”
In conclusion, the A1SH project, “classify chicken sickness,” is in the category “ML, Deep Learning, Vision.” Eggna will proceed to show how to define a baseline for any ML project.
7 — Blind ML Baseline
Welcome back from the detour. Let’s dive right in. The “ML Blind Baseline” is the simplest method to estimate a baseline, i.e., the target accuracy. Figure-3 reminds Eggna where she is in the journey and what comes next.
Blind selection is calculating a simple probability. For example, if Eggna flips a coin, the likelihood of it landing heads up is 50%. That’s it.
It is too simple. Why do we need it?
Eggna encounters many real-world projects, such as the nonprofit environmental foundation previously described, claiming it is impossible to define a baseline. When confronted with the “impossible” excuse, Eggna answers by specifying the “Blind Baseline.” The Blind Baseline is not the most innovative method because the probability accuracy is the same as Eggna blindly picking an answer. Still, it is the quickest stop-gap answer while Eggna works on the “Research Baseline.”
For the chicken sickness ML project, there are 13 fowl diseases (the input, gamma), and therefore, the blind selection is 1/13, or 7.69%. In other words, there is a 7.69% chance of picking the correct answer.
The ML Blind Baseline (tau) is the base selection plus the midpoint of the remainder probability, i.e., tau = 1/gamma + (1 - 1/gamma) / 2. Eggna is more fluent in math than English, so Figure-4 shows the math equation for the ML Blind Baseline.
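Here is a minimal sketch of the Figure-4 equation in Python. The function name and the rounding are illustrative choices, not necessarily Eggna’s exact notebook cell.

```python
def blind_baseline(categories):
    # Blind selection: the chance of blindly picking the correct label.
    base = 1.0 / categories
    # Blind Baseline (tau): base selection plus the midpoint of the
    # remainder probability, per the Figure-4 equation.
    tau = base + (1.0 - base) / 2.0
    return round(tau * 100, 2)

print(blind_baseline(2))   # coin flip         -> 75.0
print(blind_baseline(5))   # 5 NLP ratings     -> 60.0
print(blind_baseline(13))  # 13 fowl diseases  -> 53.85
```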
Eggna tries the new method on the “flip a coin” problem, and the target ML Blind Baseline is 75%. That means if Eggna builds an ML model to predict which side the coin lands on for the next flip, it will have to reach 75% or higher accuracy.
75% accuracy is huge. Eggna would be rolling in dog-biscuit heaven using that ML model to play the casino roulette game, betting on “red” or “black.”
Here is another example. For a Natural Language Processing (NLP) model rating the sentiment of users’ posts or reviews, there are five ratings: “thumbs_up_5, thumbs_up_4, thumbs_up_3, thumbs_down_2, and thumbs_down_1.”
The NLP Blind Baseline above is 60.0%.
Finally, for the A1SH chicken sickness identification model, the ML Blind Baseline is 53.85%. It means the achievable target accuracy is 53.85% or better.
What if Eggna relaxed the target rules and changed them to “predict the top three sicknesses with probability,” e.g., Coccidiosis 78% confidence, Avian Influenza 8.1% confidence, and Salmonellosis 4.7% confidence? If the correct answer is in the top three predictions, then Eggna counts it as a success.
After a quick math substitution, Eggna finds the new equation in Figure-5: tau = k/gamma + (1 - k/gamma) / 2, where k is the number of top predictions.
The new ML Blind Baseline for the top three is 61.54% and for the top two is 57.69%.
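A minimal sketch of the Figure-5 variant, extending the earlier blind_baseline() sketch with a top-k parameter (the function name is hypothetical):

```python
def blind_baseline_top_k(categories, k=1):
    # Relaxed rule: success if the correct label is in the top-k predictions,
    # so the base selection becomes k out of gamma categories (Figure-5).
    base = float(k) / categories
    tau = base + (1.0 - base) / 2.0
    return round(tau * 100, 2)

print(blind_baseline_top_k(13, k=2))  # top two   -> 57.69
print(blind_baseline_top_k(13, k=3))  # top three -> 61.54
```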
Eggna has not researched the diseases and how often outbreaks occur per year. However, she did find that the top three diseases accounted for 75% of last year’s epidemics. Therefore, by blindly picking the top three, the baseline would be higher. But Eggna is jumping ahead; that is the next step.
So far, Eggna has only looked at the input, i.e., the number of categories or labels. The next step, the “Research Baseline,” is learning about chickens and why they get sick.
8 — Detour NLP, Prediction, Regression Blind Baseline
So far, Eggna has been focusing on images and labels, i.e., supervised learning algorithms. What about other algorithms and types, such as NLP, audio, multi-label, regression, prediction, video, and the generative adversarial network (GAN)?
How would Eggna develop a Blind Baseline for those types?
It is easy peasy lemon squeezy.
First, refer to Figure-2, the “Eggna AI Classification” illustration. Eggna will go through a few examples using the map as a guide.
NLP (Natural Language Processing) is a general term for using text as the input for a Machine Learning (ML) model. There are many different projects under NLP. A popular NLP project is “negative or positive sentiment,” e.g., “do the users like or dislike a movie?”
It has input text and labels. Therefore, Eggna can use any supervised learning algorithm to construct the ML model, including the ANN algorithm. There are two categories, “positive and negative,” so the Blind Baseline is 75%. It is the same math and code used in the previous section.
The second popular NLP project generates answers to customer questions, like a chatbot helping consumers shop for shoes. There are two ML models involved in solving this problem. The first ML model uses a reinforcement learning algorithm to predict the next character in a sentence. The second ML model uses a supervised learning algorithm to predict how good the responses are, e.g., rate the answers from 1 to 5 stars. The second ML model reuses the same vocabulary tokens from the first ML model.
Eggna’s friend, Rocky, wrote an excellent introduction to NLP, including Jupyter notebook code. It is the “Fast.ai Book Study Group #G1FA” journey. In addition, her mom, Henna, wrote the “Demystify Neural Network NLP Input-data and Tokenizer” article. The salient point is that the above articles explain the amazing world of NLP.
This is a hue-mon {human} interruption. Henna is Eggna’s mom. It is so funny :-)
In summary, if Eggna rates the chatbot response from 1 to 5 stars, then the Blind Baseline is 60%, or if the chatbot rating is 1 to 10, then the Blind Baseline is 55%.
Moving on to the second example, what if the ML model estimates the sales value for a retail store? The input data is most likely in a tabular format, i.e., data in a table or an array. If it is a sales prediction ML model, then the monthly or daily revenue is the dependent variable. These numerical predictive ML models are referred to as “Regression models.”
In the Figure-2 diagram, any ML with data and labels is a supervised learning type. Therefore, Eggna can use any supervised learning algorithm, but she will choose the ANN algorithm. The ANN is not only for image-vision ML projects. For example, if the sales prediction per store ranges from $50,000 to $100,000 a month, would Eggna have 50,000 categories, assuming the value is rounded to the dollar?
Absolutely not.
There are at least two methods of calculating the Blind Baseline for the sales prediction model. Assume Eggna has four years’ worth of daily sales values for 30 stores. She will save the last 90 days of sales data as the dependent variable.
The first, and simplest, is the “inverse ratio percentage.” For a ratio of 0.1 (10%), if the sales prediction is within plus or minus ten percent, +-10%, of the actual value, then the accuracy is 90%. If the sales prediction is within plus or minus twenty percent, +-20%, the accuracy is 80%.
Eggna will write a simple function to calculate the sales prediction Blind Baseline using the inverse ratio. The inputs for the method are the sales prediction low and high range, the number of stores, and the ratio.
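Eggna’s actual code cell is not copied here, so below is a minimal sketch of what such a function might look like, assuming the baseline accuracy is simply the inverse of the tolerance ratio and the range and store count are used only for reporting:

```python
def inverse_ratio_baseline(low, high, stores, ratio=0.1):
    # Blind Baseline accuracy as the inverse of the tolerance ratio:
    # a prediction within +-ratio of the actual value counts as correct.
    accuracy = (1.0 - ratio) * 100
    midpoint = (low + high) / 2.0
    print(f"{stores} stores, monthly sales range ${low:,} to ${high:,}")
    print(f"+-{ratio:.0%} tolerance around ${midpoint:,.0f} "
          f"is +-${midpoint * ratio:,.0f}")
    print(f"Blind Baseline: {accuracy:.1f}%")
    return accuracy

inverse_ratio_baseline(50_000, 100_000, stores=30, ratio=0.1)  # -> 90.0
```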
The second easy method for calculating the sales prediction Blind Baseline is “performance labeling.” For example, the mid-point of the range is the expected target sales for the month. Above the mid-point are high-performance stores, and below are the poor-performance stores. Therefore, Eggna can label the predicted sales value from 1 to 10 stars, where the five-star rating is the monthly target sales.
Eggna has ten categories, so she uses the Blind Baseline function from the previous section, and the Blind Baseline is 55%.
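A minimal sketch of the “performance labeling” idea, reusing the blind_baseline() sketch from section 7; the exact binning scheme is an assumption:

```python
def performance_label(sales, low=50_000, high=100_000, stars=10):
    # Bin a continuous sales value into 1..10 stars; values just below the
    # mid-point of the range land on the five-star monthly target.
    step = (high - low) / stars
    label = int((sales - low) // step) + 1
    return max(1, min(stars, label))

print(performance_label(74_000))   # just under the mid-point -> 5 stars
print(performance_label(95_000))   # high-performance store   -> 10 stars
print(blind_baseline(10))          # ten star categories      -> 55.0
```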
The “Inverse Ratio” and “Performance Labeling” methods are Eggna’s original thoughts.
There are few articles, blogs, or academic papers talking about the ML baseline, and therefore, the concepts of the “Blind, Research, and Verify” baselines are Eggna’s inventions. She is the most innovative digital canine north of the Appalachian mountains.
That is it for this detour. Please reach out to Eggna or the hue-mon {human} for more information on Blind Baseline calculation. Eggna might create a new journey just for Blind Baseline calculations.
9 — Research Baseline
Figure-6 tracks where Eggna is in her journey. She is at the “Research Baseline.”
The Research Baseline’s first step is to learn about the project and the client. The project is “identify 13 chicken diseases on a farm,” and the client is Mr. Cole MacDonal (a fictitious name).
Looking slightly behind, Eggna knows one of the best methods for understanding the project scope is creating the client’s persona. In the simplest definition, a “persona,” in user-centered design and marketing, is a fictional character created to represent a user type that uses the product daily.
There are numerous books, blogs, and videos about “what is” and “how to create” a client persona. Therefore, there are as many methods as there are clients. History traces the persona to Alan Cooper, a noted pioneer software developer, who first used it in 1983. He popularized it in his book “The Inmates Are Running the Asylum: Why High Tech Products Drive Us Crazy and How to Restore the Sanity” in 1999.
Requirement: “…an easy method for my farmers to [quickly] see which chicken might come down with a disease [one of the thirteen] in their daily chores.”
Goal: “…keeping my [chicken] flocks healthy and more profitable. If we can spot the disease early, we can stop the diseased chicken from spreading to the flock.”
Figure-7 is Eggna’s take on creating the client persona. It is one of the simplest formats, and it is sufficient for this project. Other projects might require a deeper and more extensive client persona.
After many interviews with Mr. Cole and the farmers, Eggna found the following details. The actual interviews were with the CEO, the foundation staff, the domain experts, and the two consulting companies. The information is factual. Only the names and the types of the diseases are different.
Initially, the goal was to identify thirteen chicken sicknesses. However, after the interviews, it turned out only five chicken sicknesses have plagued this region for the past three years. The Deep Learning ANN algorithm works equally well for two categories or 100 categories, but more categories translate into more effort in gathering images and labeling them.
If Eggna limits the number of chicken sicknesses, she will save time and expense in gathering the data, but the resulting ML model is also limited to California. For other regions, such as Virginia or South Carolina, the local chicken diseases are not in the training dataset. In other words, if limited to California, the ML model might give a “false positive.”
Furthermore, the goal is to isolate the sick chickens from the flock, so maybe the categories are simpler. There could be just two: “healthy or sick” chickens. In other words, it does not matter which disease afflicts the chicken. If the chicken is sick, remove it from the flock.
Without going into a complex mathematical digression on the Universal Approximation Theorem, Figure-8, there is no concept of “not” in an ANN. It means Eggna cannot use boolean logic to say, “…if the inference confidence level is low on the chicken sickness images, then label it as healthy.”
The solution is to add healthy chicken images to the training dataset. However, there are so many more images of healthy chickens, which might make the dataset lopsided. It is similar to the “sparse array” problem, where there are many more zeros (healthy) than ones (sick).
Moving forward, Eggna checks with the domain experts, such as the American Poultry Association (APA), for any current statistics or ML models. The goal is to find an existing prediction accuracy number. Sadly, there is no prediction accuracy, but there are statistics on chicken diseases in America. Eggna needs to consult with a statistician to deduce any relevance.
Next, Eggna does an online search for any existing ML model or project to “identify chicken disease.” She checks many popular ML competition sites, such as the Kaggle competitions. There is no luck. She does not find any ML project that comes even close to identifying chicken diseases.
Digging deeper, she finds that many commercially available ML platforms, e.g., Amazon SageMaker, Microsoft AI360, Google AI Engine, and Open.ai GPT-3, claim they can quickly build an ML model for Eggna. Still, without the data, i.e., the labeled images, she can’t use them. Therefore, it is impossible to use commercially available ML platforms to find a relevant prediction accuracy number.
Forging forward, Eggna interviews the field experts, the veterinarians. They are confident that they can spot a disease in a chicken, but it is next to impossible to quantify it without setting up a field test. For example, show a veterinarian 10,000 images of diseased chickens and ask him or her to name each disease (image above). Afterward, Eggna can calculate the accuracy percentage.
Putting it all together, Eggna’s goal is to deduce the Research Baseline, which should be higher than the “Blind Baseline.” Typically, Eggna pits the hue-mon {human} experts against the ML algorithm. The ML does not have to beat the hue-mon accuracy, but it has to be close.
Suppose no hue-mon {human} has quantifiable and verifiable expertise, i.e., no “chicken whisperer.” In that case, Eggna will try to find existing published ML academic papers to set the target accuracy, but there are none to be found. Eggna is wise not to use the marketing target accuracy from public or private companies, because without verifying the data, the claims cannot be trusted at face value.
The only available method left is the statistical analysis from statisticians. It is a similar method used by the Centers for Disease Control and Prevention (CDC) to predict the likelihood of which influenza (the common flu) strains will be widespread in the winter.
Before Eggna unveils the answer, she would like to review Figure-7, the persona. The target accuracy is based on a quick visual inspection of random chickens, not on sending the chickens’ blood and poop to the laboratory.
- For identifying “healthy versus sick,” using the top thirteen diseases, the Research Baseline is 77.14% accurate.
- For identifying “healthy versus sick,” using the top five diseases, the Research Baseline is 85.43% accurate.
- For identifying the top thirteen diseases (no “healthy” category), the Research Baseline is 69.23% accurate.
- For identifying the top five diseases (no “healthy” category), the Research Baseline is 71.43% accurate.
The above statistical equation is easy to program, but it demands rigorous analytical data analysis before deriving the equation.
Fair disclosure: Eggna did not hire a statistician for the “chicken disease” (A1SH) ML model. She reused the equation from the nonprofit foundation project with the A1SH parameters. Eggna did not include the “equation code cells” in the Jupyter notebook because of a non-disclosure clause.
10 — Verify Baseline
Once Eggna has the Research Baseline, which should have higher accuracy than the Blind Baseline, she will cross the road and establish the Verify Baseline.
As the name implies, the accuracy should be verifiable, but how can Eggna substantiate the accuracy before writing the first line of code?
There are three general methods for constructing the Verify Baseline. The first is to find a similar ML project that predicts a similar goal. For example, if Eggna found an ML model that classifies different automobiles, she could deduce the accuracy of classifying motorcycles.
Eggna searches online for ML models similar to categorizing chicken diseases, such as checking your cat’s health or identifying common childhood illnesses in a classroom.
Eggna found a “CNN for skin cancer detection” project on Kaggle.com, which shows an 82.88% accuracy.
The second method for extrapolating the Verify Baseline is to find the hue-mon {human} subject matter experts and compare the result to their expert predictions. For example, suppose Eggna found a chicken whisperer who can look at a chicken and say which disease afflicts it. What is his or her documented accuracy? If it is not established, Eggna could set up a fishbowl analysis with 20,000 images and ask the chicken whisperer or a veterinarian to name the diseases.
The resulting accuracy should be higher than the Research Baseline. If not, then the hue-mon {human} is not that good.
The third general method is the “smoke test.” If Eggna has the data, she uses the chosen algorithm and runs it with the default hyper-parameters, even before the data is transformed or cleansed. There could be errors in the data, the images may lack augmentation, and the hyper-parameters are not tuned. In fancier ML terminology, augmentation is a form of regularization.
If the smoke test produces a converging loss rate, then that is the Verify Baseline. In the “Fast.ai Book Study Group #G1FA” journey from Rocky, it takes only a handful of lines of code to create the ML model for the supervised learning type using the ANN algorithm. For classifying cat and dog breeds, using all the default hyper-parameters, the accuracy is 94.72%.
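Rocky’s exact cells are not reproduced here, but a condensed smoke test in the fastai style might look like the sketch below. The PETS dataset and the filename pattern follow the standard fastai tutorial, and the resulting accuracy will vary from run to run.

```python
from fastai.vision.all import *

# Smoke test: default hyper-parameters, no cleansing, no tuning.
path = untar_data(URLs.PETS)                      # cat and dog breed images
dls = ImageDataLoaders.from_name_re(
    path, get_image_files(path/"images"),
    pat=r"(.+)_\d+.jpg$",                         # breed name from filename
    item_tfms=Resize(224))
learn = cnn_learner(dls, resnet34, metrics=accuracy)
learn.fine_tune(1)                                # one pass with defaults
```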
Eggna would like to point out that the “Demystify Neural Network NLP Input-data and Tokenizer” section 9.1 in the journey by Henna has an excellent explanation of all eleven ANN hyper-parameters.
Eggna has no images of diseased chickens, so she can’t use the smoke-test option. In summary, there are at least three methods for establishing the Verify Baseline: “find a similar published ML model,” “compare to hue-mon {human} expert predictions,” and “use the smoke test.”
11 — Label Data
Figure-10 summarizes how far Eggna has traveled. Her next destination is “labeled data.”
Eggna has her client persona, the chosen ML algorithm (the ANN), and the baseline accuracy target. She will proceed to collect the data, i.e., the diseased chicken images. The initial thought is: is there an existing database with diseased chicken images?
Eggna checks with the “World’s Poultry Science Association (WPSA),” the “World Poultry Congress (WPC),” the “World’s Poultry Science Journal,” the “United Poultry Growers Association,” the “(USA) National Chicken Council,” the “American Pastured Poultry Producers Association (APPPA),” the “USA Poultry and Egg Export Council (USAPEEC),” and the “Poultry Industry Associations.”
There is plenty of literature about chicken diseases, but no one has a database with hundreds of thousands of diseased chicken images. Eggna’s client owns a chicken farm, but he does not have a database of diseased chicken photos. The last option is to create an image database. It’s easier than it sounds.
Before diving into collecting the data, Eggna defines a few guidelines and answers a few questions about the data, such as: how many photos are sufficient for the ML model to converge? And what types of images are acceptable, and which are not?
There are no general rules that govern how many labeled images are a prerequisite for an ANN model to converge. Working from the client’s persona, Eggna will estimate the number of photos required. She will start with the project’s goal.
“…keeping my [chicken] flocks healthy and more profitable. If we can spot the disease early, we can stop the diseased chicken from spreading to the flock.”
Eggna has tentatively chosen the ANN algorithm with Resnet-34 or Resnet-50 as the transfer learning model from the smoke test.
Since tens of millions of images trained the ResNet architecture, retraining the ANN may need from 5,000 to 8,000 images per category. With additional photos, Eggna can train the ML model to a lower loss rate.
With thirteen categories, the number of diseased chicken photos ranges from 65,000 to 104,000 images. It is a guesstimate. During the smoke test, the ANN model might need more pictures, or it might converge with a smaller number of photos.
If the ANN model needs more images, a good rule of thumb is to use a modified Fibonacci number, i.e., the next step in the sequence, which is 13,000 images per category. Using the technique described in the “Augmentation Data Deep Dive” journey by Wallaby, Eggna can increase the input data by 5, 8, or even 13 times during training. It means Eggna can achieve better results and help ensure the ANN model will not overfit. In other words, she may need 5,000 or fewer photos per category.
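A quick worked calculation of the photo budget, using the guesstimates above (the counts are this section’s estimates, not measured requirements):

```python
categories, low, high = 13, 5_000, 8_000       # images per category (guess)
print(categories * low, "to", categories * high)   # 65000 to 104000

# With 5x, 8x, or 13x augmentation, fewer original photos may suffice.
for mult in (5, 8, 13):
    print(f"{mult}x augmentation: {5_000 // mult:,} originals per category")
```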
For the second question, the client’s persona requirement clarifies what type of picture is acceptable.
“…an easy method for my farmers to [quickly] spot which chicken might come down with a disease in their daily chores.”
The farmers will take pictures with their mobile phones, so using professional photographers is not in scope. As seen in Figure-9, the professional images are clear and beautiful. The model might attain a lower loss rate using the curated photos, but in the real-world release, the ML system’s predictions will have many “false positives.”
The solution is straightforward, but the implementation hits a snag. The first step is using an iPhone or Android phone to videotape the chickens afflicted with Colibacillosis, Mycoplasmosis, Pullorum, or any of the top 13 diseases.
A helpful hint is to take the video from different angles, e.g., side view, top view, front view, close-up view, chicken in a flock, chicken in a barn, chicken on the roof, chicken on grass, chicken feeding, chicken plucking, but not chicken in a pot.
The second step is to convert the video into an image sequence. There are many video tools and online converters that can do the splicing job. A few examples are “VLC Media Player,” “VirtualDub,” and “FFmpeg,” among many others.
The third step is to manually choose the images that are not too similar and save them into a folder labeled with the disease, e.g., the “colibacillosis” folder.
In general, Eggna can expect to have 5, 8, or 13 usable photos for every minute of video. After quick math, for 5,000 images, Eggna would need about 10 to 11 hours of video. If Eggna asks two friends to help, they could have it done in about 3 to 4 hours per disease.
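Below is a minimal sketch of step two driven from Python with FFmpeg, plus the quick math. The video filename and the one-frame-per-second sampling rate are assumptions for illustration:

```python
import subprocess
from pathlib import Path

def extract_frames(video, out_dir, fps=1):
    # Step two: sample one frame per second from the video with FFmpeg.
    # Step three, the manual review, trims the ~60 frames per minute down
    # to the 5 to 13 usable photos per minute quoted above.
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video, "-vf", f"fps={fps}",
         f"{out_dir}/img_%05d.jpg"],
        check=True)

# Hypothetical filename for a Colibacillosis video (step one).
extract_frames("colibacillosis.mp4", "data/colibacillosis")

# Quick math: hours of video needed for 5,000 usable images.
for usable_per_minute in (5, 8, 13):
    hours = 5_000 / usable_per_minute / 60
    print(f"{usable_per_minute} usable/minute -> {hours:.1f} hours of video")
```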
That’s not too bad, but where can we find sick chickens?
Based on Eggna’s client, a chicken farmer in California, and the California Department of Food and Agriculture, there were 142 chicken disease outbreaks from 2018 to 2020. The “Virulent Newcastle” disease accounts for most of them. Therefore, waiting for outbreaks of all thirteen top chicken diseases could take upward of 2 to 4 years. However, if Eggna’s client agrees to limit the phase-1 release to the top five chicken diseases, the team can complete the task in one year.
The estimated cost for creating the labeled image database would be $5,200 per disease. So, the total cost of gathering data for five diseases would be around $26,000.
12 — Bias
Figure-12 shows that Eggna is at her last stop in the A1SH journey, the “biases.”
Eggna has the baseline accuracy and the data, i.e., the labeled photos, and now she must define or discover the obvious and the hidden biases lurking in the dataset. Unfortunately, she cannot skip the “bias step” and do it after training the ML model.
Knowing the dataset biases is essential for Eggna to compare the difference between the trained accuracy, which should be higher than the baseline accuracy, and the real-world test accuracy.
Before diving headlong into solving the data bias conundrum, Eggna will define what data biases are and give a few examples of real-world data bias resulting in disastrous consequences.
Data bias is similar to standard statistical bias. A statistic is biased if it is calculated to be systematically different from the population parameter being estimated. A dataset is biased if it discriminates against a feature, a group, or a segment of the population.
A dataset without bias is a myth. All datasets have biases, and it is next to impossible to remove all the known biases from a dataset. In addition, there are adventitious biases that will not be known until after the real-world release.
By now, everyone has heard about the famous “Kodak’s Shirley Cards Set Photography’s Skin-Tone Standard” bias from the 1970s. The “Shirley” card was used to calibrate Kodak’s printers. During that time, Kodak owned 90+% of the photo printer market in the world.
Jersson Garcia makes the “Shirley” cards; he works at Richard Photo Lab in Hollywood. He’s 31 years old. But, more importantly, he’s got a total crush on Shirley, who was an employee of Kodak.
The bias was not intentional, but the consequence was disastrous. For 30 years, all printed photos of darker-skinned people were slightly off their true-tone color. Since printed photos were available to the mass market for the first time, no one could recognize the bias setting in the printer. As a new technology, everyone thought, “this is how I look in a printed photo.”
The “State Farm Distracted Driver Detection” (SFDDD) Kaggle competition has one of the most biased datasets that Eggna has found. Kaggle is a recognized leader in the worldwide AI competition market space. The SFDDD ran in 2016, and the winning prize was $65,000. A total of 1,438 teams competed, and the winner was the “Jacobkie” team with a verifiable 0.08793 multi-class logarithmic loss.
The intention is to identify whether or not a driver is distracted while driving. A camera mounted on the rearview mirror takes photos of the driver.
Eggna wrote a quick program to download the dataset and display 32 random images every time she runs the method. There are 122,124 labeled images for training and 79,696 images for validation, marked as the “test” folder.
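Eggna’s quick program is not copied here, so below is a minimal sketch of one way to display 32 random images, assuming the Kaggle dataset has been downloaded into its standard “imgs/train” folder layout:

```python
import random
from pathlib import Path

import matplotlib.pyplot as plt
from PIL import Image

def show_random_images(folder, count=32, cols=8):
    # Pick `count` random labeled images and draw them in a grid; the class
    # label, e.g., "c0" (safe driving), comes from the parent folder name.
    files = list(Path(folder).rglob("*.jpg"))
    picks = random.sample(files, count)
    rows = count // cols
    fig, axes = plt.subplots(rows, cols, figsize=(16, 2 * rows))
    for ax, img_path in zip(axes.flat, picks):
        ax.imshow(Image.open(img_path))
        ax.set_title(img_path.parent.name, fontsize=8)
        ax.axis("off")
    plt.tight_layout()
    plt.show()

show_random_images("state-farm-distracted-driver-detection/imgs/train")
```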
As you can see in the images below, there are so many biases in the dataset. Where to begin?
Here is the first pass of a bias list from Eggna. The ML model could predict a “false positive or false negative” when encountering images that are not in the training set.
- Age bias — The photos are of adult people. There are no images of teenagers, young adults, or adults older than 50.
- Appearance bias — There are no photos of drivers in clothing other than “casual Western attire.”
- Diversity bias — The photos are of 9 people. There is not enough diversity in height, weight, and personal characteristics.
- Distracted activity bias — Many distracting activities are not included in the dataset, e.g., reading a newspaper, having a cat on your lap, eating a Big Mac with fries, dozing off, searching for a new radio station, or driving with your knees.
- Vehicle bias — The same car is used in all the photos. What about trucks, SUVs, convertibles, European sports cars, semi-trucks, etc.?
Eggna does not know whether State Farm deployed the winning ML model in a real-world test. If State Farm uses the ML model to set members’ monthly automobile coverage rates or copays, then there might be biases against specific ethnic or age groups.
Back to the “Chicken Disease” project: Eggna has not yet collected the data. However, she can deduce the biases and choose which ones to eliminate from the dataset. Unfortunately, Eggna knows that there are biases she cannot remove from the dataset due to cost or schedule.
- Location bias — The photos should include outdoor, i.e., for free-range chickens, and indoor, i.e., chickens raised inside a chicken barn.
- Camera Angle bias — The photos should have different angles, e.g., close up, top view, side view, front view, side-way view, back view, in a flock view, chasing view, etc.
- Gender bias — The photos should include the hens, female, and the roosters, male.
- Age bias — The photos should include chicks and full-grown chickens.
- Species bias — The photos should include many chicken breeds, e.g., Ameraucana, American Game, Brahma, Buckeye, Dominique, etc.
With the baseline, dataset, and biases, Eggna is ready to cross the road and begin building the Diseased Chicken ANN model.
13 — Wrap-Up
The A1SH journey was six months in the making. Eggna based it on the real-world nonprofit project, working as a volunteer, i.e., pro bono. There is freedom in working outside a consultancy or enterprise framework. For a nonprofit organization, Eggna can take the time to do it right, but, on the other side of the coin, it requires discipline. Weekends and late nights are a poor substitute for bright-eyed brainstorming sessions.
Eggna selected the “Diseased Chicken” project. It is a fictitious project, but it mirrors the actual nonprofit project. She chose a fabricated project because of legal constraints. Furthermore, the goal is learning the A1SH three-step process, not the result of the nonprofit project.
Having a visual guide throughout the journey helps Eggna focus on the process, as shown in Figure-12 above. The first step is to define a baseline accuracy for the ML model. The baseline is essential because it is used to judge whether the resulting trained model is successful. In other words, if the resulting trained ML model’s accuracy is lower than the baseline accuracy, then the project is a failure.
The three types of baseline are “Blind, Research, and Verify.” The Blind and Research baselines are relatively easy to do, and the Verify baseline is more complex, but it’s worth it. The salient point is that Eggna cannot skip the first step.
A baseline accuracy is a must-have to ensure a successful ML project.
The detour to understand how to categorize your project as “AI, ML, reinforcement learning, supervised, or unsupervised” is worth the time. It demystifies the tidal wave of misinformation out in the world.
The second step is to find or create the labeled data. Collecting, augmenting, and cleansing the dataset is an enormous task in most ML projects, except for the reinforcement learning algorithms. A good general rule of thumb is that an ML model is only as good as its data.
Finally, Eggna describes data biases and why documenting them is imperative before training the ML model.
The A1SH was a fun journey, and Eggna is looking forward to the next journey and hopes to see you again.
The end.
14 — Post Script
The A1SH journey took longer to write than the other journeys, and the topic is not as technical. Still, it is imperative for AI Scientists, Data Scientists, Product Managers, Data Analysts, UX Specialists, and Strategists to use the same common framework for defining and setting the goal of a Machine Learning (ML) project.
As more enterprise companies and nonprofit organizations look to use ML as a solution, they encounter more failure than success. Eggna mentioned the WSJ Pro AI and BMC articles in the introduction. The projects fail typically because they do not have a baseline accuracy, a sufficient dataset, or defined biases.
Eggna led you through a journey of how she solved the ML conundrum. She showed you how to calculate the baseline accuracy, gather the labeled images, and seek out the biases lurking in the dataset. She chose the “chicken disease” ML model because no one had done it before and, predominantly, because you thought it was impossible. Now, you know better.
Eggna described the A1SH three-step process, which applies to any ML project. In other words, it applies to Image Classification, NLP, Regression, Deep Learning, Prediction, Supervised, Unsupervised, Reinforcement Learning, etc.
I said this in the introduction, and I am repeating it here again. Do not accept the excuse of “…my AI project is too complex for a baseline accuracy,” and never, ever, willy-nilly pick a baseline accuracy without doing the research.
It took more than a month of research to draw Figure-2. The tidal wave of online information about AI, both real and fake, makes the task exponentially laborious. At the very least, you now know the difference between an Artificial Intelligence system (AI), a Machine Learning system (ML), and a Deep Learning system (DL).
In the world of AI, I choose to be an expert in DL, also known as the Artificial Neural Network (ANN) algorithm. That is one algorithm out of two dozen or more in ML. I can spend the next 20 years studying and researching the ANN algorithm, and there will be more to discover.
When you are interviewing, if a candidate tells you that he or she knows everything about AI, it is most likely that he or she is gaslighting you, or he is Lieutenant Commander Data from the Star Trek universe.
As the author, I lay out the journey’s beginning, middle, and ending. Still, through the Jupyter Notebook, readers can, and are even encouraged to, hack the story by adding notes, modifying the code, or writing a new detour.
Not all AI scientists work for Facebook, Twitter, YouTube, or the government’s Department of Defense (DOD) on large-scale omnipotent AI. If we were in the Star Trek universe, these would not be the Rear Admiral’s logs on the Eaglemoss space dock. Instead, they are the logs of a disposal maintenance engineer, who is exiled, ex-military, excommunicated, and extricated to the lower level.
I hope you enjoy reading the A1SH journey, sharing it, and giving it a thumbs-up. I will see you on the next Demystify AI journey.
15 — Update Blind Baseline — July 6, 2021
Thank you for all the positive feedback. There were many questions and requests to expand the “Blind Baseline” section. Eggna personally responds to all requests, and she would like to share two stories here.
The first is how to add a bias factor to the Blind Baseline when the categories are not weighted equally. For example, is that a harmful spider or a helpful spider? Or who would rate this news article favorably: Republican, Democrat, Socialist, Communist, Libertarian, or extreme Right-Winger?
There are more harmless spiders than poisonous spiders in an urban setting, so it’s not a 50-50 chance. The answer is that the benign-spider bias factor should not be in the Blind Baseline. It belongs in the “Research Baseline.” For example, after doing the research, Eggna found that 70% of urban spiders are harmless. Thus, the Blind Baseline is 75% accurate, and the Research Baseline is 85% accurate, using the formula from the previous section.
For the second group of questions, surrounding NLP, how do you define a Blind Baseline? It depends on a few factors. Eggna will address the easier ones first. When Eggna has access to (1) “data” and (2) “labels,” she can use any one of the supervised algorithms to train the model. The ANN is one of the supervised algorithms, and it is Eggna’s go-to solution.
After Eggna builds an ANN to generate the text, i.e., for an article, a tweet, or a Facebook post, she will create a second ANN to classify an input text as one of the six categories above. For training the second model, she will use the vocab-index from the first model, and she should have access to the data, e.g., the articles, and the labels, such as Democrat, Socialist, Communist, and so on.
For the easy case above, the answer for the Blind Baseline is simple. Six labels equal a 58.33% accuracy-rate target for the Blind Baseline, using the Figure-4 equation.
For a slightly more complex case, what happens if Eggna has NLP text data but no labels? An ML supervised algorithm requires both (1) data and (2) labels, and, as the name implies, an ML unsupervised algorithm has (1) data and no (2) labels. In some literature, (1) the data is referred to as the independent variable, and (2) the label as the dependent variable. Eggna can’t force the emergent groupings of an unsupervised algorithm into the desired six categories above.
As with any unsupervised algorithm, the emergent collections require a domain expert hue-mon {human} to place a label on each group. Eggna does not know how many assemblages will arise at the start of the training. It could be two, or it could be ten. Eggna could run the unsupervised algorithm repeatedly, and the emerging groupings might be different each time.
Nevertheless, Eggna must find a “baseline” accuracy as a target goal before programming the ML model. The “Research” and the “Verify” baseline processes would be better choices for calculating the baseline accuracy. Thus, if Eggna defined a “Blind” baseline, she would select a target of 3, 5, 8, or 13 surfacing collections for the domain experts to label.
Please keep the comments and feedback coming. It is the high point of Eggna’s day to read and respond to them.
Epilogue
“Do no harm by doing positive deeds.”
2020 was the year when fake news and misinformation became mainstream. Unfortunately, I have read too many highly polarized articles about mistrusting AI on social media and the mainstream news channels. These fears are misplaced and misinformed, and they fracture our society.
Doing nothing is not the same as doing no harm. We can’t opt out and do nothing, so I take baby steps. The notebooks are not about large-scale omnipotent AI. Instead, they demystify AI by showing the mundane problems facing AI scientists in a real-world project. It is like demystifying crab fishermen by watching the TV series “Deadliest Catch.”
“Do no harm by doing positive deeds” is the foundational reason why I write and share the Demystify AI series. The articles are for having fun sharing concepts and code with colleagues and AI students, but more than that, they are for building trust between AI scientists and social media.
I hope you enjoy reading it, sharing it, and giving it a thumbs-up.
<<I published this article on LinkedIn earlier.>>
Demystify AI Series by Duc Haba
- Hot off the press. “First 3 Steps To A Successful AI Project” — on GitHub (July 2021)
- “Deep Learning For Coder Features Image Classification, NLP, and Collaborative Filtering With Fast.ai Framework” on LinkedIn — on GitHub (January 2021)
- “Data Augmentation Deep Dive For Machine Learning” — on GitHub (December 2020)
- “Demystify Neural Network NLP Tokenizer” on LinkedIn | on GitHub (November 2020)
- “Python 3D Visualization” on LinkedIn | on GitHub (September 2020)
- “Demystify Python 2D Charts” on LinkedIn | on GitHub (September 2020)
- “Deep Learning From Concept to Deployment, The Norwegian Blue Parrot Project” on K2fa-Website (August 2020)
- “The Texas Two-Step, The Hero of Digital Chaos” on LinkedIn (February 2020)
- “Be Nice 2020” Movement, #benice2020 (January 2020)