Where Will All The Data Come From?

All Artificial Intelligence and Machine Learning need is more data, or at least so we hear. Some of that data will need to be collected from the physical world, and this paper with Ahalya Prabhakar discusses methods to automate that data collection, optimizing for both quality and efficiency.
Where Will All The Data Come From?

We hear almost daily that soon we will have enough data for artificial intelligence (AI) to learn everything we need it to know. When researchers assert this, they of course mean that we will have as many images and sound bites as we could possibly want, that we will have completely mapped the surface of the earth, and that we will have specialized datasets (e.g., medical imagery) for highly prioritized applications.  That is, we can expect data in instances where people already happen to care about that data (e.g., imagery of cats doing cute things) and in settings where heavy investment has an estimated high payoff. These datasets will prioritize systems in their common configuration, versus uncommon configurations, since we observe common states of the world (e.g., cars going down the street) much more than we observe uncommon ones (e.g., cars crashing into each other). Moreover, this near-tautology is confounded by the fact that data collected by people will inherit the biases of those people—certainly caring about cute cats more than insects, but also biases in data representation (e.g., cats being centered rather than being partially obscured).  In this Nature Communications paper we ask how should autonomy reason about its learning goals and automatically collect data to support its learning.  

How good is our data now, and how good can we expect it to be? Even learning for images has its limits when working with pre-existing datasets.  One of the unacknowledged contributors to the success of machine learning in the context of vision is that we have a limitless supply of people who provide images and who potentially label both their own images and other images.  If they don’t do this for free, we can pay them to do it, as many companies do, so labeling datasets does not need to happen at the time of collection.  But vision datasets are then limited by the arbitrary preferences we have—images are taken with a still camera (robots will generally be in motion), subjects are centered or at least entirely contained in the image (rarely true for robots as they move), and images are taken in good illumination, often chosen specifically for the subject (rarely true for robots). These same considerations affect other datasets such as audio datasets, that by being anthropomorphic (modeled after human sensing), also are subject to human aesthetic choices.  Other, non-anthropomorphic sensing (e.g., electromyography and electroencephalogram measuring human biological signals, electrosense capabilities in electric fish) all require significant investment to create datasets. Moreover, labeling these datasets has to happen at the time of collection, increasing overhead. 

What does our paper accomplish?   Our paper automates data collection given a learning model.  That is, rather than solving a learning-for-control problem, where data-driven reasoning improves a feedback control process, we solve a control-for-learning problem, where control synthesis reasons about the state of the learning and optimizes its evolution, often called “active learning”. Our algorithmic/software pipeline is equally effective for vision (a far and mid field sensor) and electrosense (a near-field sensor), demonstrating the approach is independent of sensor physics.  An opportunity that this work creates is the development of non-anthropomorphic sensors—any device that records signals and for which this pipeline can create a generative model is a sensor, whether it maps to our sensor modalities (sight,hearing,touch) or not. 

Active Learning in Autonomy and Safety. Another opportunity in our software is understanding rare events, such as tripping in an exoskeleton or crashing a car.  These are examples of rare data, but are also a warning—we cannot use active learning without thinking about safety.  `Closing the loop’ on any dynamic process is simultaneously powerful and dangerous, as has been true since the early days of operational amplifiers, and this danger is no less true for systems that need to collect their own data during autonomous driving, minimally-invasive surgery, and other data-dependent robotic applications. 

Future of Data Collection.  If we want to obtain high-quality, unbiased data as machines learn about the world, we will need to automate that process.  This paper makes progress on doing so, enabling machines to reason about what to look at and how to look at it. 

Prabhakar, A., Murphey, T. Mechanical intelligence for learning embodied sensor-object relationships. Nat Commun 13, 4108 (2022). https://doi.org/10.1038/s41467-022-31795-2

Web: nature.com/articles/s4146

ePDF: rdcu.be/cRIOq