Approximately 541 million years ago, during the Cambrian period in the Earth’s history, there was a tremendous increase in the types of lifeforms that resulted in the creation of some of the major families of creatures currently living and many that died out. This was a time of tremendous experimentation in the forms of living things and the ecological niches they could fill as creatures developed to live in a dramatically new and more energetic ecosystem.
Today, the slowdown in the growth of processing power described by Moore’s Law, coupled with increasing demand for processing to handle enormous volumes of data from IoT and Big Data applications, has created an environment that is spawning a generation of new logic, memory and storage designs that could be called a Compute Cambrian Explosion. These new designs include chiplets, multi-chip stacked heterogeneous devices and accelerator chips.
Intel and AMD are both breaking down monolithic processors into smaller special-purpose chips, called chiplets, that work together on a multi-chip module. Both companies are struggling to move to smaller lithographic features, because chips made with sub-10 nm features will have more errors and much poorer yields. For this reason, AMD and Intel are focusing their smallest lithographic features on special-purpose chips. AMD uses its 7 nm minimum features to create denser CPU cores, while various chiplets using 14 nm minimum features serve other uses.
Intel announced its Foveros Project, which will use 10 nm minimum feature processes for power-efficient activities and 14 nm minimum features for chiplets that serve other, higher-power functions. Intel has said that its Lakefield products, which use this design, are due in the second half of 2019.
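The yield argument for chiplets can be illustrated with a rough calculation. The sketch below uses the simple Poisson defect model, in which the fraction of defect-free dies is exp(-area × defect density). The die areas and defect density in the usage note are made-up illustrative values, not figures from AMD or Intel, and the function names are hypothetical.

```python
import math

def poisson_yield(area_mm2, defects_per_cm2):
    """Fraction of dies expected to have zero defects (simple Poisson model)."""
    area_cm2 = area_mm2 / 100.0          # 100 mm^2 per cm^2
    return math.exp(-area_cm2 * defects_per_cm2)

def yield_table(areas_mm2, defects_per_cm2):
    """Map each die area to its modeled yield at one defect density."""
    return {area: poisson_yield(area, defects_per_cm2) for area in areas_mm2}

def show(areas_mm2, defects_per_cm2):
    """Print the modeled yield for each die area."""
    for area, y in yield_table(areas_mm2, defects_per_cm2).items():
        print(f"{area:5d} mm^2 die: modeled yield {y:.1%}")
```

Calling `show([600, 75], 0.5)` compares one large monolithic die with a chiplet one-eighth its size at an assumed 0.5 defects per cm². In this model the smaller die always yields better at the same defect density, which is one reason to reserve an immature, defect-prone process for small dies.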
At the 2019 Salishan Conference on High Speed Computing, Arun Rodrigues from Sandia National Laboratories gave a talk on “Heterogeneous Accelerators of the Memory, by the Memory, and for the Memory”. He said that we are entering an era of extreme semiconductor heterogeneity with new possibilities and solutions using specialized processor chips, often called accelerators.
He pointed out that conventional approaches to computing don’t manage memory very well. Main memory (especially tiered memory) is slow, caches are inefficient and processors are distant from the data they process. With the slowing in Moore’s Law, breaking up processing from monolithic chips to specialized chips located in more places has become more popular. In addition, the infrastructure to support ARM or RISC-V processing has made this easier to do and less expensive than in the past.
The US National Labs have collaborated on ways to boost the use of accelerators in an effort they call Project 38. A key feature of this approach is what is called a scatter/gather architecture. Scatter/gather I/O is also called vectored I/O. It is a method of input and output in which a single procedure call sequentially reads data from multiple buffers and writes it to a single data stream, or reads data from a data stream and writes it to multiple buffers. Scatter/gather refers to the process of gathering data from, or scattering data into, the given set of buffers. Vectored I/O can be very efficient and convenient. The slide below shows this concept in practice.
Photo by Tom Coughlin
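As a software-level illustration of the vectored I/O definition above, here is a minimal sketch using Python's `os.writev` and `os.readv`, which are available on Unix-like systems. It shows the general operating system technique only, not the Project 38 hardware; the helper names `gather_write` and `scatter_read` are hypothetical.

```python
import os

def gather_write(fd, buffers):
    """Gather: write several separate buffers to one stream in a single call."""
    return os.writev(fd, buffers)        # returns total bytes written

def scatter_read(fd, sizes):
    """Scatter: read one stream into several pre-sized buffers in a single call."""
    buffers = [bytearray(n) for n in sizes]
    nread = os.readv(fd, buffers)        # fills the buffers in order
    return nread, buffers

def demo():
    """Round-trip three buffers through a pipe with one write and one read."""
    read_fd, write_fd = os.pipe()
    try:
        parts = [b"HDR:", b"payload", b":END"]
        gather_write(write_fd, parts)
        _, received = scatter_read(read_fd, [len(p) for p in parts])
        return [bytes(b) for b in received]
    finally:
        os.close(read_fd)
        os.close(write_fd)
```

The three buffers never have to be copied into one contiguous block before writing, and the reader gets the stream already split into separate buffers. That is the convenience and efficiency the definition refers to.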
The scatter/gather operations are done to a scratchpad (offload). The data in the scratchpad can be reused, and this offloads a lot of the integer operations on the data. This approach can be more efficient than putting data into cache memory. In practice, for analytical and simulation workloads, performance improvements of 15-28% are found, with reduced cache misses and consequently improved cache performance. This approach also makes it possible to do a lot of in-memory operations, and thus improve overall performance. Arun gave an example of this improvement for a Spiking Neural Network case study.
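The scratchpad idea can be sketched in software as follows. This is only an analogy for the access pattern, assuming a list stands in for main memory and another list for the scratchpad. It does not model the actual hardware or reproduce the reported performance numbers, and all names are hypothetical.

```python
def gather(memory, indices):
    """Copy scattered elements of `memory` into a dense scratchpad list."""
    return [memory[i] for i in indices]

def scatter(memory, indices, scratchpad):
    """Write dense scratchpad values back to their scattered home locations."""
    for i, value in zip(indices, scratchpad):
        memory[i] = value

def demo():
    memory = list(range(0, 100, 10))     # stands in for main memory
    indices = [7, 2, 9, 4]               # irregular access pattern
    scratchpad = gather(memory, indices)             # one gather pass
    for _ in range(3):                               # reuse the dense copy
        scratchpad = [v + 1 for v in scratchpad]
    scatter(memory, indices, scratchpad)             # one scatter pass
    return memory
```

The irregular index arithmetic happens once on the way in and once on the way out. The repeated work in between runs over a small, dense buffer rather than chasing scattered addresses through the cache.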
He discussed the advantages and trade-offs of multi-level memory, which can provide more effective bandwidth but requires effective management of the multiple memories to keep costs under control. He thought that automatic block-level swapping (a hardware-assisted memory management approach, see the slide below) could provide this memory management, and he presented evidence that this approach could be effective.
Multi-level memory management approach
Photo by Tom Coughlin
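To make block-level swapping between memory tiers more concrete, here is a toy software model. The policy is an assumption chosen purely for illustration: a block is promoted to the fast tier after a set number of slow-tier accesses, and the least recently used fast block is demoted. The talk's actual hardware mechanism is not described in enough detail here to reproduce, and the class and parameter names are hypothetical.

```python
from collections import OrderedDict

class TwoLevelMemory:
    """Toy model of a small fast tier backed by a large slow tier."""

    def __init__(self, fast_blocks, block_size=4096, threshold=2):
        self.fast_blocks = fast_blocks   # capacity of the fast tier, in blocks
        self.block_size = block_size     # bytes per block
        self.threshold = threshold       # slow-tier accesses before promotion
        self.fast = OrderedDict()        # resident block ids, oldest first
        self.slow_hits = {}              # block id -> accesses while in slow tier
        self.stats = {"fast": 0, "slow": 0, "promotions": 0}

    def access(self, address):
        """Record one access and return the tier that served it."""
        block = address // self.block_size
        if block in self.fast:
            self.fast.move_to_end(block)         # mark as most recently used
            self.stats["fast"] += 1
            return "fast"
        self.stats["slow"] += 1
        self.slow_hits[block] = self.slow_hits.get(block, 0) + 1
        if self.slow_hits[block] >= self.threshold:
            self._promote(block)
        return "slow"

    def _promote(self, block):
        """Move a hot block into the fast tier, demoting the LRU block if full."""
        if len(self.fast) >= self.fast_blocks:
            self.fast.popitem(last=False)        # demote least recently used
        self.fast[block] = None
        del self.slow_hits[block]
        self.stats["promotions"] += 1
```

Feeding a stream of addresses through `access` and reading `stats` afterwards gives a rough sense of how much traffic a given fast-tier size and threshold would capture. The software above the memory system never has to decide where each block lives.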
Arun pointed out that software will be the biggest barrier to this approach. There is already work being done to use GPUs (a specialized type of process accelerator) that may be extendable to other accelerators. Also, hardware assistance with synchronization, data marshalling and thread management will be required.
In addition to the growing number of specialized accelerator chips (GPUs, TPUs, IPUs and other specialized, often FPGA-enabled devices), there is a growing movement to improve the technology for stacking semiconductor chips, often with quite different types of devices on each chip and at increasing levels of interconnect density, particularly for embedded and high-performance computing applications. This has led to very interesting structures. As Robert Patti from NHanced Semiconductors, another Salishan speaker, showed, very complex systems can be built with such heterogeneity (which he called a LamdaFabric), as shown below. His goal was to build a synthetic quantum computing system.
System scale heterogeneous integration
Photo by Tom Coughlin
Limitations on the continued scaling of logic circuits have led to a Cambrian Explosion of new ways to design and use logic and memory circuits, leading to new generations of systems that can handle the avalanche of data that will be generated by industrial and consumer IoT, Smart Cities and big data for AI analysis.