Mythic, a leading analog AI processor company, has officially launched the M1108 AMP, the industry’s first analog matrix processor (Mythic AMP™). According to the company, the launch heralds a new era in AI: for the first time, an analog computing solution delivers best-in-class performance and power efficiency with accuracy comparable to digital devices.
The report points out that the M1108 AMP offers unparalleled possibilities for edge deployment, with application markets including smart homes, AR/VR, drones, video surveillance, smart cities, and automation on the factory floor. With its revolutionary technology, the M1108 AMP is at the forefront of major new trends in AI processing.
According to the report, the M1108 integrates 108 AMP tiles, each of which contains a Mythic Analog Compute Engine (Mythic ACE™) with an array of flash cells and ADCs, a 32-bit RISC-V processor, a SIMD vector engine, SRAM, and a high-throughput network-on-chip (NoC) router.
In addition, four control tiles provide a high-bandwidth PCIe 2.0 interface to the system host processor. With 108 AMP tiles, the M1108 provides up to 35 trillion operations per second (TOPS), enabling complex AI models (such as ResNet-50, YOLOv3, and OpenPose Body25) to run with high efficiency and low latency on a single chip. The typical power consumption of the M1108 is about 4W when running complex AI models at peak throughput. Built on mature 40nm technology and requiring no external DRAM or SRAM, the M1108 also has an inherent cost advantage. The M1108 AMP will be available as PCIe M.2 and PCIe cards, and M1108 PCIe evaluation kits are available upon request.
How does an analog AI chip work? Mythic explains in detail
When it comes to AI hardware, we’re all about the details. One company WikiChip is keeping a close eye on is Mythic. The company has yet to fully reveal their architecture and products, but some details are starting to slowly emerge. At the recent AI Hardware Summit, the company’s founder and CEO Mike Henry gave an update on the chip.
Founded in 2012 by Mike Henry and Dave Fick, the Austin-based startup recently closed a $70 million Series B round, bringing its total raised to just over $85 million. Today, the company has 110 employees.
Mythic is an analog company. Before we get into the details, however, it’s important to point out that they are currently focused only on inference, whether in the data center or at the edge. The TAM (total available market) of these two markets is expected to hit $25 billion by 2024, which is why there’s so much focus here. In the long run, Mythic plans to target mass-market consumer and automotive products.
Put everything on chip, in flash memory
As models become more accurate, they also grow in size. Today, models reach hundreds of millions of parameters, if not billions. On top of that, real-time applications often require deterministic behavior, such as a consistent frame rate and latency. That is the problem Mythic is tackling. Mythic’s argument is simple: pack enough storage on a chip, alongside a large number of parallel compute units, to maximize memory bandwidth and reduce the energy spent moving data. But there’s a twist, and this is where Mythic’s original approach shines: the company has ditched traditional SRAM in favor of much denser flash memory, and it plans to compute directly inside the memory, in the analog domain.
But why flash? The answer is simple: it is dense, low power, and cheap. Flash is almost two orders of magnitude denser than SRAM.
In theory, Mythic’s chips are closer to memory chips than to traditional CMOS logic. Looking further down the roadmap, as SRAM bit cell scaling becomes more difficult, the benefits of flash become more pronounced. Overall, this is a potentially huge win in terms of performance per dollar, density per unit cost, and performance per watt.
We’ve seen a whole bunch of roadmaps over the years, and when they start talking about the next 10 years, it’s easy to ignore them. But with Mythic, something is different. It is worth noting that Mythic is currently working on 40nm embedded flash. They have a fairly clear path to 28nm and 22nm, so roughly half of that roadmap is based on nodes that are already available today.
Beyond 22nm, some work is being done to continue scaling to the 16/14nm node, but it’s unclear whether that will reach the market. Mythic CEO Mike Henry seems to believe they can keep scaling.
Many people in the industry believe that embedded flash has hit a wall at 22nm. In a brief chat, Mythic told us that they are not wedded to embedded flash, and if one of the emerging technologies (such as multi-bit ReRAM, PCM, or NRAM) proves to be a strong alternative, they would certainly consider migrating to it.
Mythic’s chips are called IPUs, or Intelligent Processing Units. In terms of peripherals, the chip is very simple: it consists of four lanes of PCIe, a basic control processor responsible for overall chip management, and a grid of DNN tiles. Since the chip is designed to store the entire model, there is no DRAM.
Mythic says that since this is a tile-based design, they can customize it further by adding direct audio/video and various other interfaces if the need arises. At last year’s Hot Chips, Mythic was talking about an initial product with 50 million weights. At the recent AI Hardware Summit, Mike Henry said that the initial product will hold 120 million weights, far more than originally planned. On Fujitsu’s 40nm process, a near-reticle-size chip should have a capacity of about 300 million weights, so 120 million still implies a fairly large chip.
The IPU acts as a PCIe accelerator connected to a host. For large models or multiple models, multiple IPUs can be used. The model is loaded into the IPU once and then remains stationary. There is no DRAM, and programming flash is relatively slow, so a model has to fit on the chip. Multiple applications can be mapped to the same chip, which is typical for many edge applications. Under normal operation, the host CPU sends data to the IPU and receives the results over the PCIe port.
IPU overall design (WikiChip)
The chip consists of a grid of DNN tiles. Inside each tile is an analog matrix multiplier built on top of a large pool of embedded flash memory that stores the weights. Embedded flash cells use floating gates to store bits by holding charge, which controls the threshold voltage. Each transistor supports 256 levels of conductance (G = 1/R) between fully off and fully on, which Mythic uses to represent 8-bit values.
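To make the 256-level idea concrete, here is a minimal sketch of mapping an 8-bit weight code to a conductance and back. It assumes evenly spaced levels, and the constants G_MIN and G_MAX are invented for illustration; neither the spacing nor the values are device parameters disclosed by Mythic.

```python
# Hypothetical illustration only: G_MIN, G_MAX and the linear spacing are
# assumptions, not figures published by Mythic.
G_MIN = 0.0    # siemens, cell fully off (assumed)
G_MAX = 1e-6   # siemens, cell fully on (assumed)
LEVELS = 256   # 8-bit weight -> 256 conductance levels


def code_to_conductance(code: int) -> float:
    """Map an unsigned 8-bit weight code to one of 256 conductance levels."""
    if not 0 <= code < LEVELS:
        raise ValueError("weight code must fit in 8 bits")
    step = (G_MAX - G_MIN) / (LEVELS - 1)
    return G_MIN + code * step


def conductance_to_code(g: float) -> int:
    """Recover the nearest 8-bit code for a given conductance."""
    step = (G_MAX - G_MIN) / (LEVELS - 1)
    code = round((g - G_MIN) / step)
    return max(0, min(LEVELS - 1, code))
```

In a real device the levels would have to be programmed and held within a tolerance well below one step, which is part of why calibration matters for this kind of design.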
By mapping all of a neuron’s weights onto flash transistors, they can use Ohm’s law to do matrix multiplication naturally. The flash transistors act as variable resistors that represent the weights. They are programmed once, before any computation takes place. An 8-bit DAC then applies the input vector to the variable resistors as a set of voltages. According to Ohm’s law, the output current is the product of the input data and the weight (I = V x G). Finally, a set of ADCs converts the resulting currents back into digital values, which become the output vector. ReLU and various other non-linear operations are also performed by the ADC at that stage.
There is some additional logic wrapped around this component. The DAC/ADC wrapper performs compensation and calibration so that the 8-bit calculations stay accurate regardless of operating conditions, similar to what is done with today’s image sensors.
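The data flow described above (DAC, conductances, current, ADC) can be sketched as an idealized, noise-free model. This is only an illustration under stated assumptions: the function names, the normalized v_ref and g_max, and the choice of ADC full scale are all hypothetical, and the summing of currents on a shared output line follows Kirchhoff’s current law, which the text does not spell out.

```python
def dac(code, v_ref=1.0, bits=8):
    """Idealized DAC: convert an unsigned input code to a voltage."""
    return v_ref * code / (2**bits - 1)


def analog_matvec(input_codes, weight_codes, g_max=1.0):
    """Return one output current per weight row.

    Each cell contributes I = V * G (Ohm's law); the currents of all cells
    on the same output line are summed (Kirchhoff's current law).
    g_max is a normalized, assumed full-on conductance.
    """
    voltages = [dac(code) for code in input_codes]
    currents = []
    for row in weight_codes:
        conductances = [g_max * w / 255 for w in row]
        currents.append(sum(v * g for v, g in zip(voltages, conductances)))
    return currents


def adc(current, full_scale, bits=8):
    """Idealized ADC: quantize a current to an unsigned code.

    The clamp at zero stands in for the ReLU that the article attributes to
    the ADC; with the unsigned codes used here, currents are never negative.
    """
    code = round(current / full_scale * (2**bits - 1))
    return max(0, min(2**bits - 1, code))
```

For example, `analog_matvec([255, 0, 128], [[255, 255, 255], [0, 0, 0], [255, 0, 0]])` yields three currents that can each be passed to `adc` with a full scale of 3.0, the largest current three inputs could produce in this normalized model. How Mythic handles signed weights and ADC scaling is not described here, so the sketch leaves both out.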
Analog Matrix Multiplication (wikichip)
It should be noted that this scheme involves no actual memory accesses. The matrix multiplication is done in memory, using Ohm’s law, so no energy is spent accessing weights. Because the weights are fixed, there is also no need for batching or other special handling. The fixed capacity may present some problems, though. By the way, they support neuron sparsity, but not weight sparsity.
Interestingly, Mythic says that for their first generation, they won’t use a DAC on the input, in order to speed up development and time-to-market. Instead, they use a digital approximation circuit in which each input bit is computed separately and the results are accumulated. They plan to replace this with a DAC in the future, which will hopefully give them some nice improvements.
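The bit-at-a-time input scheme can be illustrated with a short sketch. It assumes the conventional bit-serial formulation, in which the partial result for input bit b is weighted by 2^b before accumulation; the article only says the per-bit results are accumulated, so that weighting and the function name are assumptions.

```python
def bit_serial_dot(input_codes, weight_codes, bits=8):
    """Dot product computed one input bit plane at a time.

    For each bit position, every input contributes only 0 or 1, so the
    array sees a binary input vector. The per-bit partial sums are shifted
    by their bit weight and accumulated (assumed scheme, see lead-in).
    """
    total = 0
    for b in range(bits):
        plane = [(x >> b) & 1 for x in input_codes]              # binary inputs
        partial = sum(p * w for p, w in zip(plane, weight_codes))
        total += partial << b                                     # weight by 2**b
    return total
```

With ideal arithmetic this matches the ordinary dot product of the 8-bit inputs and weights, at the cost of eight passes through the array per input vector instead of one, which is presumably where a true DAC input would bring the improvement.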