Integrating memristor arrays with CMOS circuitry to build efficient edge AI chips

A key factor that determines how “intelligent” artificial intelligence (AI) systems will become is how much computing capacity the systems possess. As AI models become more advanced and tasks become more complex, exponentially more data need to be processed. In a typical model, the input data are compared with stored features that are represented by weight values, generating an output that then becomes the input to the next layer in the model. The process repeats until the final output is obtained. Mathematically, these operations are implemented using multiply-accumulate (MAC) functions. The capability of an AI hardware system is then determined by how many MAC operations it can perform per second, and improving system performance is to a large extent a matter of improving the efficiency and parallelism of the MAC operations.
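To make the MAC terminology concrete, here is a minimal Python sketch of one layer computed with explicit multiply-accumulate steps. The function name `dense_layer`, the weight layout, and the small example values are illustrative assumptions, not taken from the paper.

```python
def dense_layer(inputs, weights):
    """Compute one layer's outputs using explicit multiply-accumulate steps.

    Assumed layout (hypothetical): weights[i][j] connects input i to output j.
    Returns the output vector and the number of MAC operations performed.
    """
    n_in, n_out = len(weights), len(weights[0])
    outputs = [0.0] * n_out
    mac_count = 0
    for j in range(n_out):
        acc = 0.0
        for i in range(n_in):
            acc += inputs[i] * weights[i][j]  # one multiply-accumulate
            mac_count += 1
        outputs[j] = acc
    return outputs, mac_count


# Illustrative values only: 3 inputs, 2 outputs.
x = [1.0, 2.0, 3.0]
W = [[0.5, -1.0],
     [0.25, 0.0],
     [1.0, 2.0]]
y, macs = dense_layer(x, W)
print(y, macs)
```

A layer with M inputs and N outputs needs M×N MAC operations, which is why the number of MACs per second is a natural measure of hardware capability.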

Additionally, since the number of weights in practical models is huge, e.g. tens to hundreds of millions, there is typically not enough memory to store them all on chip. Instead, the weights need to be stored off-chip, in dynamic random-access memory (DRAM), and loaded to the processor chip when the MAC operations are performed. Unfortunately, loading data from DRAM is much slower and consumes significantly more energy than on-chip operations. As a result, even if the circuit itself can achieve high peak throughput and energy efficiency, the throughput and energy efficiency measured for the complete system when running a model end to end will be significantly worse.

These problems can be fundamentally addressed by using a new hardware architecture that natively matches the flow of data in the model (software). Specifically, memristor crossbars offer high storage density, allowing all weights in a model to be stored on chip. Perhaps more importantly, these memory elements can also directly perform MAC operations through Ohm’s law and Kirchhoff’s current law, thus achieving very high throughput via high parallelism, and high efficiency by minimizing data movement. More information on memristors and memristor crossbars can be found in the reviews here and here.

However, prior demonstrations of memristor hardware implementations were based on discrete memristor arrays and separate processing elements connected through test boards. In the paper published in Nature Electronics, we aimed to build a complete, fully integrated system that can run AI models end to end, fully on chip. By demonstrating these capabilities, we hope this study can lead to future large-scale implementations.

Specifically, we integrated a 54×108 memristor crossbar directly on top of complementary metal-oxide-semiconductor (CMOS) periphery and control circuitry. The memristor array takes an input voltage vector and produces an output current vector equal to the product of the input and the weight matrix, thus performing M×N MAC operations in a single step, where M (N) is the number of rows (columns) of the weight matrix. The CMOS circuitry in our study provides all the necessary digital/analog conversion, routing, and control functions. A key decision we made was to make the system reprogrammable by integrating a processor, along with all necessary periphery circuitry, under the crossbar. As a result, different models can be mapped through simple software changes without modifying the underlying hardware fabric.
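The crossbar read described above can be sketched in a few lines of Python. This is an idealized model under stated assumptions: inputs are applied as voltages on the rows and currents are read on the columns, and wire resistance, device nonlinearity, and the digital/analog converters are ignored. The function name `crossbar_read` and the conductance values are hypothetical; only the 54×108 array size comes from the text.

```python
ROWS, COLS = 54, 108  # array size reported in the text


def crossbar_read(voltages, conductances):
    """Idealized crossbar read: I_j = sum_i V_i * G_ij.

    Ohm's law gives the current through each device (V_i * G_ij), and
    Kirchhoff's current law sums those currents on each column wire.
    Assumption: rows carry input voltages, columns carry output currents.
    """
    n_rows, n_cols = len(conductances), len(conductances[0])
    if len(voltages) != n_rows:
        raise ValueError("need one input voltage per row")
    currents = [0.0] * n_cols
    for i in range(n_rows):
        for j in range(n_cols):
            currents[j] += voltages[i] * conductances[i][j]
    return currents


# Small 2x3 example with made-up values (volts and siemens).
V = [0.2, 0.1]
G = [[1e-6, 2e-6, 0.0],
     [3e-6, 0.0, 4e-6]]
currents = crossbar_read(V, G)  # column currents in amperes

# MAC operations performed in one read of the full-size array.
macs_per_step = ROWS * COLS
print(currents, macs_per_step)
```

In software the two nested loops run sequentially; in the physical array all device currents flow at once, which is where the single-step parallelism comes from.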

The prototype chip we demonstrated can already perform tasks such as classification, sparse coding, and data clustering, through offline and online learning algorithms. For example, beyond forward propagation, the system supports backward propagation through the same array to obtain a reconstruction of the input from the network outputs. By incrementally changing the memristor conductance (weight) values, the chip is also capable of online learning through conventional and bio-inspired learning rules, and has been used to obtain trained features for breast cancer screening analysis.
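A rough sketch of how one stored weight matrix can serve both directions: the forward pass multiplies by the matrix, and the backward pass multiplies by its transpose, which in a crossbar corresponds to driving the opposite set of wires. The function names, the row/column assignment, and the weight values are assumptions for illustration, not details from the paper.

```python
def forward(x, W):
    """Forward pass: y_j = sum_i x_i * W[i][j] (inputs on rows, assumed)."""
    return [sum(x[i] * W[i][j] for i in range(len(W)))
            for j in range(len(W[0]))]


def backward(y, W):
    """Backward pass through the same weights: r_i = sum_j y_j * W[i][j].

    This is a multiplication by the transpose of W, so no second copy
    of the weights is needed.
    """
    return [sum(y[j] * W[i][j] for j in range(len(W[0])))
            for i in range(len(W))]


# Made-up weights: 3 inputs, 2 outputs.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
x = [1.0, 2.0, 3.0]
y = forward(x, W)        # network outputs
recon = backward(y, W)   # unnormalized reconstruction of the input
print(y, recon)
```

The reconstruction here is unnormalized; any scaling or thresholding applied in a real sparse-coding algorithm is outside this sketch.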

Looking into the future, we expect this study to stimulate continued interest in new devices and new computer architectures. In-memory computing systems based on what we demonstrated here can eventually find applications in edge computing and other AI use scenarios, after further optimization of the devices and digital/analog interfaces.

Wei Lu is a Professor in the Electrical Engineering and Computer Science department at the University of Michigan. He received a B.S. in physics from Tsinghua University, Beijing, China, in 1996, and a Ph.D. in physics from Rice University, Houston, TX, in 2003. From 2003 to 2005, he was a postdoctoral research fellow at Harvard University, Cambridge, MA. He joined the faculty of the University of Michigan in 2005. His research interests include resistive random-access memory (RRAM), memristor-based computing architectures and neuromorphic computing systems, aggressively scaled transistor devices, and electrical transport in low-dimensional systems. To date Prof. Lu has published over 100 journal articles with 23,000 citations and an h-index of 65. He is a co-founder of Crossbar Inc, a recipient of the NSF CAREER award, and an IEEE Fellow.
