Data – practical guidelines for empowering the critical asset in the AI era


Twelve years ago, at the University of Illinois Urbana-Champaign, computer science professor Li Fei-Fei had an idea. In the AI industry, the common perception had been that the best algorithm is what matters, regardless of the data. But what if the paradigm should change, Li asked. What if it is not about the algorithm but about the datasets?

To look into this further, Li decided to map out the entire world of objects. A mission impossible, many colleagues said.

Three years later, in 2009, she and her co-authors published a database called ImageNet. And that database turned out to be the fuel that ran the AI engine.

The database spawned an annual machine learning competition, and by 2015 the best results in the competition matched human performance. Today the best image classification accuracy is above 97%, which exceeds the 95% that humans achieve.

Since then, it has become clear that algorithms get better when they are trained with large datasets.

Here are six guidelines for empowering data, the critical asset of the AI era, based on what we have learned so far.

Guideline 1. The amount of data matters

Having enough data makes things simpler. With data-hungry machine learning methods, such as deep learning, the amount of code needed for your algorithm is significantly smaller than in cases where data scarcity is the problem.

In fact, it is claimed that Google's machine translation from one language to another, built on legacy technologies, originally had a code base of 500,000 lines. Now, using deep learning, the code has 500 lines. The effort of writing the code has therefore been reduced to the point where it is no longer the principal barrier to entry. It is now possible for small companies, even individuals, to write algorithms.

Guideline 2. Connecting datasets matters

Sometimes collecting a dataset is hard, and may even be prevented by legislation. To see if self-driving cars are able to cope with heavy snow, it is necessary to go and test these cars in such conditions. Doing this is possible in a well-defined sandbox in which reliable data can be gathered. Such data gathering would not be possible by other means.

A sandbox also requires an ecosystem of multiple players, and in the best case, a culture of sharing the gathered data to ensure its full impact is felt in society. The Government of Finland has set up a sandbox for autonomous ships to bring AI to the seas, accompanied by an ecosystem of participating companies.

Public open datasets should also be made available for training purposes, at least the ones that are created using public funding. In many cases, data from one dataset can be used to annotate other data. For example, combining a dataset from satellite images of ozone concentration with a GPS trail from a mobile phone can provide a good estimate of pollution exposure. With such an estimate, one can ask whether there is any correlation between exposure levels and the prevalence of asthma.
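As a rough illustration of annotating one dataset with another, the sketch below attaches a satellite grid value to each GPS fix and computes a time-weighted average exposure. The grid resolution, the coordinates, the ozone values and all function names are assumptions invented for this example, not real measurements.

```python
import math

CELL_DEG = 0.5  # assumed grid resolution in degrees

# Hypothetical satellite grid: (lat index, lon index) -> ozone value
# in arbitrary units. Real products have their own grids and units.
ozone_grid = {
    (120, 49): 40.0,
    (123, 47): 60.0,
}

# Hypothetical GPS trail: (time in seconds, latitude, longitude)
trail = [
    (0, 60.17, 24.94),
    (7200, 61.55, 23.76),
    (10800, 61.55, 23.76),
]


def cell_of(lat, lon):
    """Map a coordinate to the index of the grid cell that contains it."""
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))


def annotate_trail(trail, grid):
    """Attach the grid value to each GPS fix (None where there is no coverage)."""
    return [(t, lat, lon, grid.get(cell_of(lat, lon))) for t, lat, lon in trail]


def time_weighted_exposure(annotated):
    """Average value, weighting each fix by the time until the next fix."""
    total, total_time = 0.0, 0.0
    for (t0, _, _, value), (t1, _, _, _) in zip(annotated, annotated[1:]):
        if value is None:
            continue  # skip intervals that start outside the grid
        total += value * (t1 - t0)
        total_time += t1 - t0
    return total / total_time if total_time else None


annotated = annotate_trail(trail, ozone_grid)
print(time_weighted_exposure(annotated))
```

Each interval is charged to the position where it starts, which is a simplification; a denser trail makes that approximation less of an issue.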

Guideline 3. Stand on the shoulders of giants

When we were studying retinopathy images using the data from Folkhälsan and the Central Hospital of Central Finland, we learned that retraining an existing model that classifies dogs, cats, cars, etc. performed the task better than a model trained from scratch. This is called transfer learning, and it is widely used in the community. The same trick helps with X-ray image classification, even though it seems odd that classifying cars, cranes and giraffes can help in finding broken bones.

So, there are lots of trained models available that help in getting started with less data. Andrej Karpathy, currently the director of AI at Tesla, has put it as “Don’t be a hero”: instead of trying to do everything yourself, just go and use what is available.
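The mechanics of transfer learning can be sketched in a few lines: keep the pretrained layer frozen and train only a small new "head" on your own data. The toy below uses plain Python so that it runs anywhere. The "pretrained" weights are made-up numbers standing in for a network trained on a large dataset, and the four training points are invented. In practice you would load a real pretrained model from a deep learning library instead.

```python
import math

# Frozen "pretrained" layer: made-up weights standing in for features
# learned elsewhere on a large dataset. Nothing below modifies them.
PRETRAINED_WEIGHTS = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]


def extract_features(x):
    """Apply the frozen layer: a linear map followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)))
            for row in PRETRAINED_WEIGHTS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, labels, epochs=500, lr=0.1):
    """Train only a new logistic-regression head on the frozen features."""
    feats = [extract_features(x) for x in samples]
    weights = [0.0] * len(feats[0])
    bias = 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            pred = sigmoid(sum(w * fi for w, fi in zip(weights, f)) + bias)
            err = pred - y  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * err * fi for w, fi in zip(weights, f)]
            bias -= lr * err
    return weights, bias


def predict(x, weights, bias):
    f = extract_features(x)
    return sigmoid(sum(w * fi for w, fi in zip(weights, f)) + bias) >= 0.5


# Invented target task with only four labelled examples.
samples = [(2.0, 0.5), (0.5, 2.0), (1.5, 0.2), (0.3, 1.2)]
labels = [1, 0, 1, 0]

head_weights, head_bias = train_head(samples, labels)
print([predict(x, head_weights, head_bias) for x in samples])
```

Only the head's few parameters are fitted, which is why a small dataset can be enough when the frozen features are already useful.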

Guideline 4. Create a good data strategy

A good data strategy has three components.

Firstly, develop products that automatically annotate data. Big internet businesses are ‘machines’ that collect knowledge: clicks on search results, ‘likes’ on social media updates, retweets on Twitter, etc. Just think: is it or is it not important to understand the customer?

Secondly, create internal and external processes with clear annotation steps that make proper AI-ready data gathering part of normal business. This way you can grow the degree of automation and gradually move people away from dull repetitive tasks that can be automated.

And finally, join forces, build ecosystems and negotiate how to procure data from third-party sources when you do not have the capability to collect it yourselves.

Guideline 5. Challenge your data and let the data challenge you

Be sure to gather data properly. As a computer is trained to do a task by being shown examples, it is of paramount importance that the training datasets are of good quality. For instance, if a self-driving car is trained on data on how people actually drive, the system will display the ethics of an average person. Sometimes it will run a red light, because people do. This is not what we are after. Hence, the training datasets have to be pruned and curated to contain unbiased, ethical examples. As with humans, giving a good example is the best way to teach and train.

We can turn this around and let the data teach us something about ourselves. Using datasets of human decisions (e.g. hiring), one can build a model of the current decision process and find out whether it contains evidence of bias relating to, say, gender, age or ethnic origin. Modern machine learning gives us a way to detect and root out bad undercurrents in our own decision-making.
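A first check, simpler than building a full model of the decision process, is to compare outcome rates between groups. The sketch below does that for a handful of invented hiring records; the field names "gender" and "hired" and the data are hypothetical. A large gap is a reason to look closer rather than proof of bias on its own, since it does not account for differences in qualifications.

```python
from collections import defaultdict


def selection_rates(records, group_key, outcome_key="hired"):
    """Return the share of positive outcomes for each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for record in records:
        tally = counts[record[group_key]]
        tally[0] += 1 if record[outcome_key] else 0
        tally[1] += 1
    return {group: pos / total for group, (pos, total) in counts.items()}


def rate_ratio(rates):
    """Lowest rate divided by highest rate; 1.0 means equal rates."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest else 1.0


# Invented historical decisions, for illustration only.
decisions = [
    {"gender": "m", "hired": True},
    {"gender": "m", "hired": True},
    {"gender": "m", "hired": True},
    {"gender": "m", "hired": False},
    {"gender": "f", "hired": True},
    {"gender": "f", "hired": False},
    {"gender": "f", "hired": False},
    {"gender": "f", "hired": False},
]

rates = selection_rates(decisions, "gender")
print(rates, rate_ratio(rates))
```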

Guideline 6. Use data to select AI solutions

You may be in a situation where you have to decide between many possible AI solution providers. The best way to validate a solution is to compare its predictions against the opinions of domain experts. For example, a panel of radiologists would judge a set of thorax X-rays, and their verdict would be compared with the prediction of the particular AI solution. This would be a rather expensive method, however.

A secret dataset of annotated X-ray images would be handy. Ask for a trial version of the code and run it on your secret dataset. You can then compare the results with the annotations and see if the code passes the test. However, make sure that the AI providers never get a glimpse of your dataset, so that they cannot train the code for stellar performance on your test data. Never send the test data to a provider for them to test it.
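The procedure can be sketched as a small scoring harness that you run yourself, so the annotated test set never leaves your hands. Everything here is a stand-in: the two "provider" functions imitate trial versions of vendor code, and the secret set is a few invented records rather than real X-rays.

```python
def accuracy(predict, examples):
    """Share of examples where the prediction equals the expert annotation."""
    if not examples:
        raise ValueError("the test set is empty")
    hits = sum(1 for features, label in examples if predict(features) == label)
    return hits / len(examples)


# Hypothetical secret test set: (features, expert annotation).
secret_set = [
    ({"opacity": 0.9}, "abnormal"),
    ({"opacity": 0.1}, "normal"),
    ({"opacity": 0.6}, "abnormal"),
    ({"opacity": 0.4}, "normal"),
]


# Hypothetical trial versions supplied by two providers.
def provider_a(features):
    return "abnormal" if features["opacity"] > 0.5 else "normal"


def provider_b(features):
    return "abnormal" if features["opacity"] > 0.8 else "normal"


candidates = {"provider_a": provider_a, "provider_b": provider_b}
scores = {name: accuracy(fn, secret_set) for name, fn in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

The providers only hand over code; the harness and the data stay on your side, which is the point of keeping the dataset secret.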

Actually, for self-driving cars, we should perhaps ask the systems to obtain driving licences, with virtual testing on huge datasets of tricky accidents and ethical dilemmas. And these would be kept secret as well.

About the author

Leo Kärkkäinen

Leader of Deep Learning Research Group, Nokia Bell Labs
