SiMa.ai’s Second-Gen Edge AI Chip Goes Multi-Modal

SANTA CLARA, California—Edge AI chip startup SiMa has built a new generation of its MLSoC silicon, reflecting its embedded and edge customers’ moves toward large multi-modal models (LMMs) and generative AI (GenAI), SiMa CEO Krishna Rangasayee told EE Times.

“Gen 1 has been in production for two years,” Rangasayee said. “We had the luxury of building something from a clean slate, and after [our team’s] 30 years of studying embedded, we took a very software-centric approach, because the ease-of-use element was so important for our customers. … From when we defined the chip four years ago to where we are now, the market is evolving at a rapid pace. Since we have a software-centric architecture, and we don’t know what the future is going to hold, we didn’t pick a particular point to resolve for; we picked a direction and said the transformer-based architecture is going to rule the day for the next few years.”

Unlike some of its competitors in the edge LLM space, SiMa offers a full SoC designed to host complete applications rather than a standalone accelerator. The second-gen chip is designed to run both convolutional neural networks (CNNs) and transformer networks, though the company will continue to make and sell its first-generation, vision/CNN-focused chip.

SiMa has doubled its number of customers in the last year, Rangasayee said, adding that traction is growing across all of SiMa’s target markets: robotics, industrial automation, smart vision systems, and aerospace and defense. The company is currently entering the medical device space. Even edge vision applications are coming to transformers, with customer requests for models such as ViT and LLaVA, he added.

“The other thing we’ve done is made it multi-modal,” Rangasayee said. “People think multi-modal and GenAI have to be the same, but it’s a very different problem, and I would say multi-modality has more value to the market than GenAI today.”

Hardware family

SiMa’s second-gen hardware, Modalix, will be a family of devices offering 25, 50, 100 and 200 TOPS of INT8 performance. Four members of the family will roll out at different times. The 50-TOPS version is coming first, sampling this quarter. The 25-TOPS version will use the same silicon but with performance capped, for applications that don’t need the full 50 TOPS. The 100- and 200-TOPS versions are still being developed; Rangasayee said the decision on whether to go with a chiplet solution has not yet been made.


“We have a lot of use cases, and people are getting clever about scaling down the size of transformers, and we do see transformer solutions coming in the 25-TOPS range,” Rangasayee said. “Clearly, for stereo vision, and for complex algorithms in robotics or automotive, you need 50 TOPS or more, but there are a lot of transformer-based approaches being deployed for better accuracy.”

The initial 50-TOPS version has a smaller die size than the first-gen chip but more features, thanks to migration from 16 nm to TSMC N6. Its power envelope is about 8 to 10 W (for the whole SoC, not just the accelerator), depending on the workload. SiMa is maintaining 100% software compatibility across the two generations.

Accelerator improvements

Modalix’s architecture is designed for large language models (LLMs) and LMMs, handling any modality of data. Like the first gen, it is a full SoC, with an improved and enlarged in-house-developed AI accelerator.

Accuracy was a big demand from customers, Srivi Dhruvanarayan, vice president of hardware engineering at SiMa.ai, told EE Times.

“CNNs were able to quantize to INT8 for Gen 1, but it’s a fine balancing act,” he said. “The easy answer would have been to throw all kinds of precisions at it: FP16, FP32, FP8. But then we would lose the advantage of a power-efficient chip. So we settled on BF16, which gives us enough floating point representation to serve us well for transformers, but at the same time to not lose the edge on power efficiency.”
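
BF16 keeps FP32’s sign bit and 8-bit exponent but truncates the mantissa from 23 bits to 7, which is why it preserves FP32’s dynamic range at half the storage cost. As a rough illustration of the format itself (not SiMa’s hardware path), a minimal FP32-to-BF16 round trip in Python might look like this:

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Convert an FP32 value to its 16-bit BF16 encoding.

    BF16 keeps FP32's sign bit and 8-bit exponent (same dynamic
    range) but truncates the mantissa from 23 to 7 bits. We apply
    round-to-nearest-even on the dropped bits, as hardware usually does.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # raw FP32 bits
    # Rounding bias: 0x7FFF plus the LSB of the surviving mantissa.
    bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + bias) >> 16) & 0xFFFF

def bf16_bits_to_fp32(b: int) -> float:
    """Decode BF16 back to FP32 by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

# BF16 spans FP32's range but keeps only ~3 decimal digits of precision.
print(bf16_bits_to_fp32(fp32_to_bf16_bits(3.14159265)))  # -> 3.140625
```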

In addition to floating-point support, the accelerator gains hardware acceleration for piecewise-polynomial activation functions and other nonlinear functions used in LLMs and LMMs. SiMa’s toolchain can automatically quantize to different precisions to optimize performance on a layer-by-layer basis.
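
The article doesn’t describe how the toolchain decides precision per layer; one common approach is to measure each layer’s quantization error on its weights (or calibration activations) and fall back to higher precision where INT8 loses too much. A minimal sketch of that idea, with a hypothetical error threshold rather than any actual SiMa compiler flag:

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor INT8 quantize-dequantize round trip."""
    scale = np.abs(w).max() / 127.0
    return np.round(w / scale).clip(-127, 127) * scale

def choose_precision(layers: dict[str, np.ndarray],
                     max_rel_error: float = 0.02) -> dict[str, str]:
    """Assign INT8 where the quantization error is tolerable, else BF16.

    `max_rel_error` is a hypothetical knob, not a SiMa toolchain flag;
    real compilers also use calibration data and sensitivity analysis.
    """
    plan = {}
    for name, w in layers.items():
        err = np.linalg.norm(w - quantize_int8(w)) / np.linalg.norm(w)
        plan[name] = "int8" if err <= max_rel_error else "bf16"
    return plan

rng = np.random.default_rng(0)
layers = {"conv1": rng.normal(size=(64, 64)),
          "attn_qkv": rng.normal(size=(64, 64)) * rng.pareto(2.0, (64, 64))}
# The heavy-tailed layer tends to fall back to BF16.
print(choose_precision(layers))
```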

DRAM bandwidth has been doubled and caching improved.

The result is a 50-TOPS second-gen accelerator that can run Llama2-7B at more than 10 tokens/second in a power envelope suitable for the edge.
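
That token rate squares with back-of-envelope arithmetic: autoregressive decode is typically memory-bound, with each generated token reading roughly every weight once. A quick sanity check, under illustrative assumptions (weight traffic only, ignoring KV cache and activations; these are not SiMa’s published figures):

```python
# Back-of-envelope: each generated token reads ~all 7B weights once.
# Assumptions: 1 byte/weight at INT8, 2 at BF16; KV-cache and
# activation traffic ignored. Illustrative numbers only.
params = 7e9
for name, bytes_per_weight in [("INT8", 1), ("BF16", 2)]:
    weight_bytes = params * bytes_per_weight
    for tok_s in (10, 20):
        gb_s = weight_bytes * tok_s / 1e9
        print(f"{name}: {tok_s} tok/s needs ~{gb_s:.0f} GB/s of weight bandwidth")
# INT8 at 10 tok/s -> ~70 GB/s; BF16 at 10 tok/s -> ~140 GB/s.
# This is why doubled DRAM bandwidth matters for LLM decode at the edge.
```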

Arm cores

SiMa has doubled the number of Arm Cortex-A65 CPU cores on its chip from four to eight.

“It would be a shame to increase everything else and get bogged down by the CPU,” Dhruvanarayan said. “We want to host the whole application on chip. … There’s a lot of decision-making that still needs to happen on the CPU.”

Layers that are quantized to formats not supported by the accelerator, or any functions that aren’t supported, can fall back to the Arm CPUs, he added.
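
A compiler typically implements this by partitioning the model graph into accelerator and CPU segments. A minimal sketch of that kind of fallback partitioning, with a made-up supported-op set rather than SiMa’s actual one:

```python
from dataclasses import dataclass

# Hypothetical supported-op set; SiMa's real compiler has its own list.
ACCEL_OPS = {"conv2d", "matmul", "softmax", "layernorm", "gelu"}

@dataclass
class Op:
    name: str
    kind: str

def partition(graph: list[Op]) -> list[tuple[str, list[Op]]]:
    """Split a linear op graph into alternating accelerator/CPU segments.

    Ops the accelerator supports run there; anything else falls back to
    the Arm cores. Consecutive ops with the same target are grouped to
    minimize costly data hops between the two blocks.
    """
    segments: list[tuple[str, list[Op]]] = []
    for op in graph:
        target = "accelerator" if op.kind in ACCEL_OPS else "cpu"
        if segments and segments[-1][0] == target:
            segments[-1][1].append(op)
        else:
            segments.append((target, [op]))
    return segments

graph = [Op("qkv", "matmul"), Op("attn", "softmax"),
         Op("topk_filter", "custom_topk"),  # unsupported -> CPU fallback
         Op("ffn", "matmul")]
for target, ops in partition(graph):
    print(target, [o.name for o in ops])
```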

“We tried to be smart about this,” Dhruvanarayan said. “We’ve not gone with the latest, greatest Arm cores and increased our area; we’ve stuck with what we had in Gen 1 because that’s more than sufficient. We just doubled the number … so the software doesn’t have to deal with new instructions.”

Rangasayee said that SiMa is targeting some more traditional applications where customers want to retain a significant Arm footprint on chip.

“We are building one architecture for everybody,” he said. “To a certain degree, we have done the tradeoff of whether to optimize the Arm complex for a narrower set of applications. From a die size perspective, it was justifiable to increase it.”

Rangasayee’s example is in-vehicle infotainment, where there is significant demand for new multi-modal and GenAI functionality while retaining a demanding CPU workload. Another emerging application with similar demands is embodied AI (humanoid robots).

“While the broad market categories are the same – automotive, robotics – the sub-applications within them are new for us,” he said. “But the table stakes remain: We need to continue to be better than everybody else on performance, but while we continue to do that, we need to be relevant in terms of market need and the networks and workloads they want to run.”

Modalix’s architecture now includes an image signal processor (ISP), following customer feedback from Gen 1. While some MIPI cameras already have ISPs, and some ISP tasks can be handled in the AI accelerator, including a hardware ISP gives customers choice, Rangasayee said. The ISP is typically used at the front end of the vision pipeline for pre-processing, Dhruvanarayan added, so whether these tasks are done in the ISP or the AI accelerator is a tradeoff between the performance gained and the latency incurred by moving data between blocks on the chip.
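
That tradeoff can be framed as simple latency arithmetic: offloading pre-processing to the ISP costs a hop between blocks but frees the accelerator, which may be slower at ISP-style kernels. A toy model with invented numbers, purely to illustrate the comparison:

```python
def pipeline_latency_ms(preproc_ms: float, infer_ms: float,
                        hop_ms: float, use_isp: bool) -> float:
    """Per-frame latency for one camera frame (illustrative model only).

    use_isp=True: the ISP handles pre-processing, then the frame hops
    to the accelerator. use_isp=False: the accelerator does both,
    saving the hop but spending its own cycles on pre-processing.
    """
    if use_isp:
        return preproc_ms + hop_ms + infer_ms
    # Assume (hypothetically) the accelerator is 2x slower at ISP kernels.
    return preproc_ms * 2.0 + infer_ms

print(pipeline_latency_ms(1.0, 8.0, hop_ms=0.5, use_isp=True))   # 9.5 ms
print(pipeline_latency_ms(1.0, 8.0, hop_ms=0.5, use_isp=False))  # 10.0 ms
```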

Modalix retains the first gen’s Synopsys EV74 DSP, though it runs 20% faster, as many potential big customers are holding onto their legacy DSP code, Rangasayee said.

“We don’t want to fight religion,” he said.

Other new hardware features include 4 × 4 MIPI lanes, adding support for more MIPI sensors, including cameras, lidar and radar. Ethernet cameras also get additional support, as Modalix moves from four 1G Ethernet ports to four 10G ports. PCIe has been upgraded to Gen 5, and root-complex-to-endpoint operation is now supported, to allow for a potential future chiplet solution.

Use cases

Software-centric SiMa has also learned a lot about its software toolchain.

“As a startup, it’s pretty ambitious for us to be world-class from Day 1 from a software perspective,” Rangasayee said. “But by architecture and construction, everybody has been very impressed with what we’ve done by feature set, completion, ruggedization. … There’s been a lot of learning, but that’s the nature of our business, you put something out there that’s mostly good, and you iterate and you learn.”

The team learned a lot from customers about their preferred quantization schemes and various aspects of the entire computer vision pipeline, he said. Key challenges included adding new hardware features without losing software compatibility.

GenAI in the data center has been criticized in some quarters of the industry for its lack of commercial success, but that problem doesn’t apply at the edge, he added.

“For LLMs and generative AI at the edge, there is really good meaning in doing it,” Rangasayee said. “There is revenue for the customers doing it. I worry that most of what AI is doing is not really going to make anybody any money – seeing Will Smith eat noodles will not really make a profit for anyone – but we see genuine real use cases at the edge where we’re making a difference.”

