Mole or cancer? The algorithm that misses one in three melanomas and overlooks patients with dark skin.

Time is money. Especially when it comes to melanoma, the most dangerous skin cancer: diagnosing this tumor as early as possible is crucial for saving lives, more so than with almost any other cancer. In Spain, an estimated 9,400 cases of melanoma will be diagnosed in 2025. The tumor is highly aggressive and can spread and metastasize in just a few months; when it does, the prognosis is often poor, so any error in detection can be fatal.
Precisely this urgency has led the Basque Country to invest in artificial intelligence (AI). The Basque Health Service, Osakidetza, is working to incorporate Quantus Skin, an algorithm designed to estimate the risk of skin cancer, including melanoma, into its public health centers and hospitals. In theory, it promises to streamline the process: family doctors in primary care will be able to send images of suspicious lesions to the hospital's dermatology department, along with the probability, calculated automatically by the algorithm, that they are malignant. The Basque Government's idea is that Quantus Skin, currently being tested, will help decide which patients should be seen first.
However, the data show a worrying reality. Transmural Biotech, the company that markets Quantus Skin, conducted an initial study with promising results, but it had significant limitations: it was carried out entirely online and was never published in an academic journal, meaning it did not undergo the usual quality control required in science.
Later, dermatologists from the Ramón y Cajal Hospital in Madrid and professors from the Complutense University conducted a second study, which was published, to evaluate Quantus Skin's clinical efficacy in real-life conditions. This work, funded and supported by Transmural Biotech, showed worse results: the algorithm missed one in three melanomas. Its sensitivity is 69%, meaning it fails to detect 31% of real cases of this potentially lethal cancer.
Asked by Civio about the second study, Transmural Biotech's CEO, David Fernández Rodríguez, responded evasively by email: "I don't know which one it is right now." When pressed by phone, he changed his story: "What we were doing was testing" to detect potential implementation problems. At the end of the call, Fernández Rodríguez acknowledged that Quantus Skin "didn't stop working, it worked much worse, but we had to figure out why."
The CEO of Transmural Biotech attributes these poorer results to deficiencies in image capture due to a failure to follow Quantus Skin's instructions. This is something they also observed in the Basque Country trials. "Primary care physicians are not well trained in taking images," he says, which highlights the need for "training physicians." However, the second study involved dermatologists who specialize specifically in photographing suspicious lesions for later diagnosis. According to Fernández Rodríguez, reliability improved after "the images were carefully cropped" because they "weren't exactly following" the instructions.
Criticized by independent sources

“For skin cancer, a sensitivity of 70% is very poor. It's very poor. If you give this to someone so they can take a photo and be told whether it could be melanoma, and it gets one in three wrong, it's not adequate for skin cancer screening in primary care; you have to demand more,” Dr. Josep Malvehy Guilera, director of the Skin Cancer Unit at the Hospital Clínic in Barcelona, explains to Civio. For Dr. Rosa Taberner Ferrer, a dermatologist at Son Llàtzer Hospital in Mallorca and author of the specialized blog Dermapixel, “31% false negatives sounds dangerous, to say the least. As a screening test, it's useless.”
However, the CEO of Transmural Biotech plays down the problem by focusing only on the data that favor his product, avoiding any mention of Quantus Skin's low sensitivity. According to the same study that analyzed its clinical efficacy, the system also errs in the opposite direction: its specificity of 80.2% implies a false positive rate of 19.8%, meaning it mistakes one in five benign moles for melanoma. Using Quantus Skin would therefore lead to unnecessary referrals for almost 20% of the patients examined.
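To make the reported figures concrete, here is an illustrative back-of-envelope calculation (not from the study itself) of what a 69.1% sensitivity and 80.2% specificity would mean for a hypothetical batch of 1,000 suspicious lesions. The 5% melanoma prevalence is an assumed value chosen purely for illustration:

```python
# Published figures from the second study on Quantus Skin:
SENSITIVITY = 0.691   # share of real melanomas the algorithm catches
SPECIFICITY = 0.802   # share of benign lesions it correctly clears

def triage_outcomes(n_lesions, melanoma_prevalence):
    """Estimate missed melanomas (false negatives) and unnecessary
    referrals (false positives) for a batch of suspicious lesions."""
    melanomas = n_lesions * melanoma_prevalence
    benign = n_lesions - melanomas
    missed_melanomas = melanomas * (1 - SENSITIVITY)  # false negatives
    false_alarms = benign * (1 - SPECIFICITY)         # false positives
    return round(missed_melanomas), round(false_alarms)

# Assuming, hypothetically, that 5% of photographed lesions are melanomas:
missed, alarms = triage_outcomes(1000, 0.05)
print(missed, alarms)  # 15 melanomas missed, 188 unnecessary referrals
```

Under this assumed prevalence, the false alarms vastly outnumber the melanomas actually caught, which is why the specialists quoted below argue that a screening tool should trade specificity for sensitivity, not the reverse.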
In the study, the authors—dermatologists at the Ramón y Cajal Hospital in Madrid and professors at the Complutense University of Madrid—argue that it is preferable for Quantus Skin to have high specificity (few false positives) even at the cost of low sensitivity (more false negatives), since it will not be used for definitive diagnosis but for screening, that is, to help filter cases from primary care. According to their hypothesis, this could prevent specialist consultations from becoming saturated and reduce waiting lists and associated medical costs.
The specialists consulted by Civio question the strategy behind the algorithm. Although there is no ideal standard for cancer diagnosis—partly because it depends on the aggressiveness of each tumor—what Quantus Skin has achieved is far from acceptable. “With errors in diagnosing melanoma in lesions that risk growing rapidly and even killing the patient, I have to be very intolerant. I have to demand sensitivities of 92%, 93%, 94% as a minimum,” says Malvehy Guilera.
"If they intend to use it for screening, then the system should have extremely high sensitivity at the expense of somewhat lower specificity," explains Taberner Ferrer. In other words, an algorithm like this should err on the side of caution: better to generate a few false alarms in healthy people than to miss a real case of cancer.
Dark skin, uncertain diagnosis

The problems with Quantus Skin go beyond its low sensitivity. The study only evaluated its clinical efficacy in diagnosing melanoma; it did not analyze other more common but less aggressive types of skin cancer, such as basal cell carcinoma and squamous cell carcinoma, to which the program can also be applied. Nor did the authors study how skin color affects the algorithm's performance, although they acknowledge this as one of the main limitations of their research.
Quantus Skin, based on neural networks, has learned to recognize skin cancer almost exclusively in white people. The algorithm was first fed just over 56,000 images from the International Skin Imaging Collaboration (ISIC), a public repository of medical photographs collected mainly by Western hospitals, where the majority correspond to patients with light skin. Quantus Skin was then tested on images of 513 patients from the Ramón y Cajal Hospital in Madrid, all of them white.
The dataset used to feed Quantus Skin includes images of "Caucasian men and women," confirms the general director of Transmural Biotech. "I don't want to get into the issue of ethnic minorities and all that, because the tool is used by the Basque Country, by Osakidetza. What I'm making available is a tool, with its limitations," says Fernández Rodríguez. Despite the lack of training on darker skin tones, the Basque Government says it is not necessary to "implement" any measures "to promote equality and non-discrimination," according to the Quantus Skin file included in the catalog of algorithms and artificial intelligence systems of the Basque Country. However, since the neural networks have been trained almost exclusively on images of white people, they are likely to fail more often on darker skin tones, such as those of people of Roma ethnicity or migrants from Latin America and Africa.
“It’s very easy to make algorithms fail,” Adewole Adamson, a professor of dermatology at the University of Texas, told Civio. In 2018 he warned of the discrimination that artificial intelligence could lead to if it was not developed in an inclusive and diverse way, a problem that goes beyond Quantus Skin.
His predictions have been confirmed. In dermatology, when algorithms are fed primarily images of white patients, “diagnostic reliability in darker skin tones” decreases, says Taberner Ferrer. The Skin Image Search algorithm from the Swedish company First Derm, trained mostly on photos of white skin, saw its accuracy drop from 70% to 17% when tested on people with darker skin. More recent research has confirmed that these types of algorithms perform worse on Black people, not because of technical issues but because of a lack of diversity in the training data.
Although melanoma is a cancer much more common in white people, people with darker skin have a significantly lower overall survival rate. American engineer Avery Smith is well aware of these figures. His partner, Latoya Smith, was diagnosed with melanoma just a year and a half after they were married. “I was really surprised by the survival rates by ethnicity. Latoya, being African American, was at the bottom. I didn't know that until it hit me like I'd been hit by a bus. It was terrifying,” he tells Civio. Some time after the diagnosis, in late 2011, Latoya died.
Since then, Avery Smith has been working to make dermatology more inclusive and to ensure that algorithms don't amplify inequalities. To emphasize the "impact" they can have, especially on vulnerable groups, Smith rejects referring to artificial intelligence as a "tool," as if it were simply "scissors": "It's a marketing term, a way to make people understand it. But it's much more."
Legal expert Anabel K. Arias, spokesperson for the Federation of Consumers and Users (CECU), also points to these effects: "When considering using it for early diagnosis, there may be a portion of the population that is underrepresented. In that case, the diagnosis could be erroneous and have an impact on the person's health. One might even consider harm."
Patients invisible to the eyes of an algorithm

“People tend to trust artificial intelligence a lot; we attribute to it qualities of objectivity that aren't real,” says Helena Matute Greño, professor of experimental psychology at the University of Deusto. Any AI uses the information it receives to make decisions. If that input data is poor or incomplete, the system may fail. When its mistakes are systematic, we call them biases. And if they disproportionately affect a certain group of people, because of their origin, skin color, gender, or age, we speak of discriminatory biases.
A review published in the Journal of Clinical Epidemiology showed that only 12% of studies on AI in medicine analyzed whether the systems were biased. When bias was found, the most common was racial bias, followed by gender and age bias, and the vast majority affected groups that have historically suffered discrimination. These errors can occur when the training data is not sufficiently diverse and balanced: if algorithms learn from only a portion of the population, they perform worse on different or minority groups.
The errors aren't limited to skin color. Commercial facial recognition technologies misclassify Black women far more often because they have historically been trained on images of white men. Something similar happens with algorithms that analyze chest X-rays or predict cardiovascular disease, whose diagnostic performance is worse in women when the training data is unbalanced. Meanwhile, one of the most widely used datasets for predicting liver disease is heavily skewed, with 75% of the data coming from men, so the algorithms that use it fail much more frequently with women. In the United Kingdom, the algorithm for prioritizing transplants discriminated against younger people. The reason? It had been trained on limited data that only took into account survival over the next five years, not the full lifespan that patients who received a new organ could gain.
"The data used for training must represent the entire population where it will later be used," explains Dr. Nuria Ribelles Entrena, spokesperson for the Spanish Society of Medical Oncology (SEOM) and an oncologist at the Virgen de la Victoria University Hospital in Malaga. "If I only train with a certain group of patients, it will be very effective in that group, but not in others," she adds.
Avoiding biases, an obstacle course

The way to avoid bias is known: "The training set has to be as broad as possible," explains López Rueda. But this cannot always be verified. So far, most of the artificial intelligence systems implemented in Spain that use medical images do not publish their training data. This is the case with two dermatology devices, whose names are unknown, that will be activated first in the Caudal health area and then expanded to the whole of the Principality of Asturias. It is also the case with the commercial application ClinicGram, for detecting diabetic foot ulcers, implemented at the University Hospital of Vic (Barcelona), and with various private radiology systems, such as BoneView and ChestView, or Lunit, operating in some hospitals in the Community of Madrid, the Principality of Asturias, and the Valencian Community.
When datasets are accessible, another obstacle is that they often lack metadata, such as origin, gender, age, or skin type, that would allow checking whether they are inclusive and balanced. In dermatology, most public datasets typically don't label patients' origin or skin tone. Where this information is included, studies consistently show that Black people are significantly underrepresented. "There is increasing awareness of the problem, and algorithm developers have attempted to address these shortcomings. However, there is still work to be done," says Professor Adamson.
In 2022, Osakidetza awarded Transmural Biotech a contract worth almost €1.6 million to implement "artificial intelligence algorithms in medical imaging," requiring a sensitivity and specificity of "at least" 85%. The company, a spin-off of the University of Barcelona and the Hospital Clínic, belongs to the private insurer Asisa. According to Osakidetza, although the specifications covered several algorithms, only two were ultimately chosen, among them Quantus Skin, for their "greater healthcare impact" and "greater health performance." As Civio has learned, the decision was made unilaterally, without consulting the relevant specialists. In February, Osakidetza stated that Quantus Skin had passed "the validation phases" and was "in the integration phase." In response to Civio's questions about its clinical efficacy, it now says the system is still being tested and that it will make decisions "based on the results obtained." However, it avoids answering whether it was aware that the published clinical efficacy figures for Quantus Skin (69.1% sensitivity and 80.2% specificity) fall below the 85% threshold required by the contract. Apart from the Basque award, Transmural Biotech holds only one other public contract, in Catalonia, for a much smaller amount (€25,000), to certify artificial intelligence algorithms in radiology.
This article was originally published by Civio, an independent nonprofit newsroom that conducts in-depth research on public affairs. You can find the full methodology there.
EL PAÍS