Wouldn’t it be great to get emulation speeds at simulator prices? If you’ve been looking for relief from long simulation runs, come have lunch on us and see a demo of affordable, easy-to-use personal desktop emulation software, Semu™.
Semu accelerates verification for designs up to 14M ASIC gates. Run against real data, cover hard-to-reach corner cases, and zip through regression tests at speeds up to 50 MHz.
At our Lunch & Learn, you’ll see two live Semu demonstrations:
- How Semu enables debug iterations in minutes, not hours or days. See Semu's ability to set hardware breakpoints and configure register-state visibility in emulation at any time, with no re-instrumentation and no re-compilation required!
- How Semu enables initial setup in less than a day, rather than weeks or months. Semu is easy to use and lets you run with C++/SystemC or SystemVerilog test benches.
If you’re going to have lunch anyway, why not Lunch & Learn on us? Arrive by 11:45A on the dates and at the locations listed below. We’ll have you on your way by 1:00P:
- Silicon Valley: Tuesday, November 12, 2013
- Network Meeting Center, 5201 Great America Parkway, Santa Clara, CA 95054
- Austin, TX: Thursday, November 14, 2013
- Omni Barton Creek Country Club, 8212 Barton Club Dr, Austin, TX 78735
- Boston Area: Tuesday, November 19, 2013
- Westford Regency Inn and Conference Center, 219 Littleton Rd, Westford, MA 01886
Rishiyur S. Nikhil, CTO of Bluespec, is participating in the industry panel entitled "System-Level Design and High-Level Synthesis" at Embedded Systems Week in Montreal, Canada. Embedded Systems Week, held this week, comprises three conferences: CASES 2013, CODES+ISSS 2013, and EMSOFT 2013.
For your reference, here's the panel summary from the conference program. The panel is Monday, September 30 at 3:30P:
System-Level Design and High-Level Synthesis
Over the last years, high-level and system-level syntheses have increasingly made their way into industrial practice with continued or renewed interest also in academia. At the same time, there is skepticism about reaching widespread adoption and future growth potentials. This panel brings together practitioners both from the tool-vendor as well as the tool-user perspective to discuss experiences, opportunities and challenges going forward. Participants will first briefly present their views about the state-of-the-art and the road ahead, followed by a panel discussion involving questions and answers from the audience.
Organizer and Moderator: Andreas Gerstlauer (Univ of Texas).
Speakers: Yatin Hoskote (Intel), Rishiyur Nikhil (Bluespec), John Sanguinetti (Forte) and Andres Takach (Calypto)
At Hot Chips last week, MIT showed off a 110-core processor chip that uses a shared memory abstraction. The chip comprises 357 million transistors on a 100 mm² die, and is fabricated in 45 nm. The paper was titled "Hardware-level Thread Migration in a 110-core Shared-Memory Processor" by Mieszko Lis, Keun Sup Shim, Brandon Cho, Ilia Lebedev and Srinivas Devadas.
(Source: MIT's HotChips presentation)
There are a few particularly interesting aspects about this chip:
According to one of the designers, "the key innovation is blindingly fast core-to-core thread migration done entirely in hardware". One of the goals of the chip architecture is to reduce the number of memory copies and thereby reduce power, which it does through a shared memory architecture that essentially replaces the cache. With this architecture, they've seen up to a 14X reduction in on-chip traffic.
It may be the largest multiprocessor design, in terms of core count, that implements a shared memory abstraction (anyone know of a larger one?).
Aside from the SRAMs and PLLs, most of the rest of the chip was designed with BSV high-level synthesis! And, the entire chip development took only 18 man-months. (Stay tuned for an upcoming paper at ICCD. The paper includes comments on their experience using BSV for design.)
If you'd like to read more, there's a nice piece on the chip in PC World.
Most car buyers care about capabilities, not the internal minutiae that make cars run. The Tesla Model S is doing very well right now -- and, from reports, customers love the car's comfort, 0-60 acceleration times, and zero emissions. These are the things that matter to them. Does the Tesla use bleeding-edge combustion engine technology? No. But that certainly doesn't disqualify it from being a modern automobile. And, it certainly doesn't disqualify it from delivering the capabilities that customers care most about. It would be ridiculous for another auto manufacturer to criticize Tesla for missing this technology, unless this technology was required to deliver the capabilities that matter most to car buyers.
My attention was recently drawn to comments on SemiWiki by the guys behind NEC's CyberWorkBench (CWB): "Bluespec - OK for controller, but no scheduling capabilities, not so HLS." A comment like this is like criticizing Tesla for lacking fuel injection.
Please let me explain. What the CWB guys probably mean is this: C-based HLS solutions (including CWB) usually work on an intermediate form such as CDFGs (Control and Data Flow Graphs), and these are "scheduled" into clock cycles based on various constraints, using various heuristics (ASAP, ALAP, ...). And indeed they are right in saying that Bluespec does not do this.
But to say that, therefore, Bluespec is "not really HLS" is relying a bit on a self-serving circular definition. As far as the user is concerned, CDFG-based scheduling is a second-level detail; the first-order questions are, "Is the source language high level? Does it get synthesized to efficient hardware? How productive am I with this approach?". And on these more end-to-end metrics, we're happy to take on any HLS tool.
Bluespec actually does extremely sophisticated scheduling, just in a form that is not done by other HLS tools. Bluespec's scheduling is based on concurrent atomic transactions, which none of the C-based HLS tools do, and which are, in our opinion, far more important and useful to the user. This kind of scheduling has been at the core of Bluespec technology from day one (more than 10 years ago).
In our view, the biggest question is really about the ability to specify architecture abstractly (and at many layers of abstraction). In HW design, algorithm and architecture are fused. An algorithm's quality is based on the cost model of its operations; the cost model is directly a consequence of architecture; and architecture itself is a design outcome (unlike SW design, where the architecture and its cost model are a fixed input).
Unfortunately, C/C++/SystemC are not well suited for describing architectures and their massive, fine-grain, heterogeneous parallelism and concurrency (these languages were designed for sequential von Neumann architectures). Incidentally, this is one major reason behind another remark, that "'C for hardware' is actually very different from the 'normal' C code written for software": people struggle to fit C/C++ to this purpose. This is why Bluespec has chosen a very different computation model, atomic rewrite rules, which is much better suited to HW architectures and concurrency.
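To make the contrast concrete, here is a deliberately loose, software-only sketch (in Python, not BSV) of the guarded-atomic-rules idea: state is updated only by rules, each rule pairs a guard with an atomic action, and behavior emerges from which guards hold each "cycle". The FIFO, the rule names, and the fixed firing order are all invented for illustration.

```python
# Software-only analogy of guarded atomic rules -- illustrative, not BSV.
# State is mutated only inside rule actions; each rule pairs a guard
# (when may it fire?) with an atomic action (what does it do?).
state = {"fifo": [], "count": 0}

def can_enq(s): return len(s["fifo"]) < 4          # space available
def do_enq(s):  s["fifo"].append(s["count"]); s["count"] += 1

def can_deq(s): return len(s["fifo"]) > 0          # data available
def do_deq(s):  s["fifo"].pop(0)

rules = [(can_enq, do_enq), (can_deq, do_deq)]

def step(s):
    # Each "cycle", fire every enabled rule in a fixed order. That
    # order is one legal serialization of the atomic semantics; a real
    # scheduler chooses a conflict-free set of rules to fire concurrently.
    for guard, action in rules:
        if guard(s):
            action(s)

for _ in range(3):
    step(state)
# After three cycles: three enqueues and three dequeues have occurred.
```

The point of the analogy is that cycle-by-cycle behavior falls out of which guards hold on the current state, rather than from a CDFG schedule computed over a sequential program.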
If you think CDFG scheduling technology is critical to your needs, however, BSV is not for you.
In HW Emulation is Becoming Ubiquitous, we highlighted Jim Hogan's evaluation of the emulation market that was posted on DeepChip. John Cooley recently posted reader follow-up to Jim's piece: 37 engineers react to Jim Hogan's eval of the emulation market. There's lots of feedback here -- one thing that stood out to us was a pair of comments about transactors (to read the full comments, as well as what other readers have said, please head over to DeepChip):
"Hogan should have written more on transactors. They're the life's blood of a full emulation-based verification system. Get just a few cheap or poorly written TLM's thrown in the mix and you'll be chasing your tail for months trying to fix it. Not to mention your throughput will be in the toilet."
"I can also confirm transactors are never plug-and-play. Their care and feeding will devour 90% of your time after initial set-up is done."
Transactors are the key to connecting Virtual Platforms or verification testbenches on a workstation to RTL in an emulator or FPGA. These hybrid platforms (i.e. bridging virtual platforms and verification environments with FPGA prototyping) deliver the speed, accuracy and affordability needed for both pre-silicon software development and high-speed system-level validation.
For specific emulators, some transactors are available from the vendor. But, what if:
- You need your models or testbenches to be able to work with multiple emulation platforms (such as both a big-box emulator and an FPGA board or two different emulators)?
- You need to connect your models or testbenches to RTL in a third-party FPGA board? Or, your own, internally-designed FPGA board?
- Or, you need something else?
The Transactor Gateway from Bluespec provides a portable, plug-and-play and high-performance transactor and co-emulation solution for connecting host workstations to IP RTL in FPGA boards or emulators. The Transactor Gateway handles everything from TLM to your DUT: configuring highly parameterized transactors, providing a high-level TLM 2.0 API to the host, delivering a high-performance connection across SCE-MI, and connecting to your DUT in an FPGA board or emulator. The Transactor Gateway works with a variety of FPGA boards already -- and is portable to additional FPGA boards and emulators.
You can learn more on:
Jim Hogan, of Vista Ventures LLC and an industry luminary in the semiconductor space, just published his analysis of the emulation space on Deepchip yesterday. The complete analysis includes four separate pieces:
Hogan's core thesis is that emulation is becoming more ubiquitous.
From interactions with customers, that's evident. 10 years ago, emulation and FPGA prototyping were more the exception -- now they are almost the rule for chip development teams. This trend is being reflected in vendor results. In quarterly earnings calls, Mentor's Wally Rhines recently characterized the last two years of emulation growth as "really remarkable" -- and Cadence stated that 2011-2013 hardware sales, led by Palladium XP, grew 90% over the 2008-2010 period. Of course, Synopsys recently acquired EVE.
Not only are more teams using these solutions -- but additional groups within chip development teams are starting to as well. Many of these new groups are trying to leverage FPGA boards and open software to address their needs.
Some are doing this to validate IP and subsystems. Many are doing this to address the exploding demand for pre-silicon software development. Firmware engineers need pre-silicon models. Often, RTL is the only model accurate enough for their needs, or even the only one available. FPGA boards run RTL fast -- and they are cost-effective to deploy for firmware development, once you get them working and integrated with your simulations. But FPGA boards are difficult to use, especially when connecting them to software for models, test benches, and debug.
These engineers could leverage emulation, if only it could meet their budget and productivity needs. These users need solutions that:
- Connect FPGA boards with host-based open software, such as SystemC/C/C++
- Reduce the significant challenges of debugging RTL IP in FPGA boards
There's a new segment of emulation that addresses these needs. We're calling this new segment desktop emulation -- it combines commodity FPGA boards with C-based host software to deliver affordable emulation for individual use.
We're happy that Hogan included our desktop emulation solution, Semu, in his analysis. It starts at $9,500 and can be used with standard, low-cost Xilinx-based development boards by users who have been trying to leverage FPGA boards for applications that benefit from emulation:
- Pre-silicon firmware development with RTL IP
- High-speed validation of IP and subsystems
If you would like to learn more, please download the Semu datasheet, or contact Bluespec for a demo:
Interested in the rapid prototyping and architectural exploration of hardware algorithms? Please come join former MIT professor Rishiyur S. Nikhil, PhD, CTO of Bluespec, in Santa Clara, CA on November 2nd for an in-person, live seminar and lunch. He will be presenting a case study in the rapid prototyping and architectural exploration of the H.264 video algorithm.
What you'll learn about in this hour-and-a-half seminar:
- Algotecture: the critical role of architecture for optimal implementation of hardware algorithms (architecture + algorithm)
- Rapid architectural exploration: making rapid changes for performance, power and area, while still delivering optimal results across many tradeoff points
- Prototyping: the benefits of early FPGA prototyping for architectural exploration
At the culmination of Nikhil's presentation, the complete video compression algorithm will be demonstrated on a low-cost FPGA emulation system. Lunch will be served.
Click here to register and get details:
Schedule on November 2:
- Welcome Reception
- Seminar and Demo
About Rishiyur S. Nikhil
Rishiyur S. Nikhil is co-founder and CTO of Bluespec, Inc., which develops tools that dramatically improve correctness, productivity, reuse and maintainability in the design, modeling and verification of digital designs (ASICs and FPGAs). Earlier, from 2000 to 2003, he led a team inside Sandburst Corp. (later acquired by Broadcom) developing Bluespec technology and contributing to 10Gb/s enterprise network chip models, designs and design tools. Prior to that, he was at Cambridge Research Laboratory (DEC/Compaq), including one and a half years as Acting Director. He was a professor of Computer Science and Engineering at MIT. He has led research teams, published widely, and holds several patents in functional programming, dataflow and multithreaded architectures, parallel processing, compiling, and EDA. He received his Ph.D. and M.S.E.E. in Computer and Information Sciences from the Univ. of Pennsylvania, and his B.Tech in EE from IIT Kanpur.
In order to validate increasingly complex designs and deliver firmware earlier, emulation and FPGA prototyping are becoming imperatives for chip development. These hardware execution platforms deliver orders-of-magnitude higher speeds than simulation, which is too slow to handle the increased complexity and software content of today's designs. But when these platforms are connected to test benches or models on a host workstation, the performance can often significantly miss expectations, putting into question both the investment in this approach and project success. In order to avoid this, design teams need to architect and optimize their co-emulation environments to avoid the ramifications of Amdahl's Law. Transaction-level co-emulation and synthesizable testbenches are key tools for avoiding the bottlenecks that can kill emulation performance.
Emulation & Amdahl's Law
Amdahl's Law teaches us that the speedup of a system is governed by how large a portion of that system is improved and by how much. With emulation and FPGA prototyping, the bottlenecks are typically the host and the co-emulation link. While an emulator can run at MHz+ speeds, host-based simulation and the co-emulation link can severely impact the effective performance achieved. In a case study we performed of the co-emulation of an Ethernet switch test bench and DUT (Case Study: Using Synthesizable Transactors & Testbenches to Avoid Emulation Performance Disasters), the effective performance ranged from a low of 47 KHz to a high of 49 MHz -- a dynamic range of 1000X -- depending on the choices made in partitioning the testbench across the co-emulation link and on the abstraction level of the transactors.
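As a back-of-the-envelope illustration of this effect (our own toy model, not the case study's actual measurements), suppose a host interaction with round-trip link latency L occurs once every N emulated cycles on an emulator running at f MHz. The effective frequency is then N / (N/f + L), and it swings by orders of magnitude with N:

```python
def effective_mhz(emu_mhz, link_latency_us, cycles_per_interaction):
    """Toy model of effective co-emulation speed: the emulator runs
    `cycles_per_interaction` cycles, then waits one link round trip."""
    emu_time_us = cycles_per_interaction / emu_mhz   # time spent emulating
    total_us = emu_time_us + link_latency_us         # plus one round trip
    return cycles_per_interaction / total_us         # cycles/us == MHz

# Signal-level interface: a round trip every emulated cycle.
slow = effective_mhz(50.0, 20.0, 1)        # roughly 0.05 MHz

# Transaction-level interface: one round trip per 10,000 cycles.
fast = effective_mhz(50.0, 20.0, 10_000)   # roughly 45 MHz
```

With these made-up numbers (a 50 MHz emulator and a 20 µs round trip), the spread between the two interfaces is about 900X -- the same order of magnitude as the 1000X dynamic range observed in the case study.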
We've heard multiple first-emulation-experience anecdotes from teams seeing disappointing improvements over simulation. While you might get lucky simply dropping the DUT into emulation, typically you'll need to analyze and optimize performance to achieve your expectations -- and, in the process, you'll likely need to rework the test bench partitioning and transactors, and consider other items such as system reset/configuration and memory architectures.
Co-emulation links have limited throughput and considerable latency, which can bottleneck performance if:
- Models or a testbench, running on a host workstation, have tightly coupled interactions with a DUT running in emulation. This can happen if the interface across the co-emulation link is running at too low a level (for example, running a low-level, signal-level interface between host and emulator). It can also happen if there are round-trip, latency-sensitive interactions between host and emulator.
- Co-emulation communication, from host to emulation, requires too much bandwidth. For example, transmitting uncompressed, high-resolution, high-frame-rate video across the co-emulation link can easily swamp a link's bandwidth.
Transaction-level co-emulation, using higher-level communication abstractions that are ideally latency-decoupled, can help with both of these. Take, for example, a testbench on a host connected to an emulation-resident DUT comprising an AMBA AXI-based SoC, where the testbench connects to an AXI port over which traffic is sent and received. One option would be to bring the AXI interface itself across the co-emulation link, which would require cycle-by-cycle communication, with very tight latency coupling, at the signal level. This would result in terrible performance. Alternatively, one could send each AXI transaction, or even multiple transactions, as a single transmission, including both the data for the AXI transaction and its parameters (e.g. address). From a performance standpoint, this second approach is far preferable.
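A sketch of what "one transmission per transaction" might look like on the wire (the message layout here is entirely hypothetical -- not SCE-MI or any Bluespec format): the address and burst parameters travel in a small header, followed by the whole data payload, so the emulator-side transactor can replay the burst cycle by cycle locally.

```python
import struct

def pack_axi_write(addr, burst_len, data_words):
    """Pack one AXI-style write burst into a single co-emulation
    message: a hypothetical header (address, burst length, word
    count) followed by the full data payload."""
    header = struct.pack("<IHH", addr, burst_len, len(data_words))
    payload = struct.pack(f"<{len(data_words)}I", *data_words)
    return header + payload

# One 16-beat burst crosses the link as a single 72-byte message
# (8-byte header + 64 bytes of data) instead of 16+ round trips.
msg = pack_axi_write(0x8000_0000, 16, list(range(16)))
```

The design choice is the same one the paragraph above describes: pay the link's latency once per burst (or per group of bursts), not once per clock cycle.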
What's required to communicate across the co-emulation link with transactions? You need synthesizable transactors running on the emulator that take these high-level transactions from/to the host and convert them to low-level DUT interfaces. Transactors can be very complex, requiring considerable time to develop and significant verification effort, because they are typically done in RTL. Because of the types of designs (complex control/protocol) typically required for synthesizable transactors, C/C++ and SystemC don't offer much in terms of abstraction over RTL. In contrast, Bluespec BSV offers an effective, high-level approach for the development of transactors, and has been used in many, many transactor designs as a high-level alternative to RTL.
Synthesizable Test Benches & System Models
In addition to transactors, test benches can be re-partitioned, even fully migrated, into emulation. Likewise, system models required by your test bench (such as a disk drive model) and stand-ins for real lab equipment typically used only with FPGA prototyping (such as an Ethernet packet tester) can be modeled and synthesized for even higher performance. If you were developing these for simulation, you would write them at a high level using transaction-level SystemVerilog/'e'/C++/SystemC. With these tools available, why would you write them in RTL? But with emulation and FPGA prototyping, you are stuck with RTL, unless you use a powerful, high-level, synthesizable modeling environment like Bluespec.
With over 50 universities in the Bluespec University Program, it is not surprising to stumble upon fascinating new projects and courses using Bluespec. But, it is surprising when it is in your backyard, at MIT, where Bluespec technology was originally invented.
Peggy Aycinena, the editor of EDACafe, did a nice piece on July 17th, MIT: towards the 1000-core processor. This article is about a project that came as a surprise to many of us here. Here is a great quote by Prof. Srini Devadas from the article:
Also, our chip is being designed completely using Bluespec, which we want to shout from the rooftop of Stata! The primary designers involved swear there are huge benefits from using Bluespec versus coding directly – with the Verilog RTL done by Bluespec.
And we’re taking the academic attitude: Forget legacy designs! We’re not using any third-party IP. The RTL is all truly MIT home grown. We’re proving that a small group of 4 students can do this huge design – assuming it works, of course!
In a sense, we're not surprised by this kind of response -- Bluespec designers talk about Bluespec this way, and report similar results. Whether it's designing complex IP, exploring architectures, accelerating algorithms, expressing networks-on-chip, constructing highly parameterized IP, or putting models/testbenches/transactors into emulation/FPGAs, Bluespec designers can do it faster, with fewer errors, and don't want to go back.
Recently, Twitter user @avsm tweeted about a BSV-based MIPS model booting BSD UNIX at the University of Cambridge in the UK. Add to that Bluespec's own ARM Cortex ISS models that boot Linux and are included in its family of Synthesizable Virtual Platforms (SVP) for firmware development. Think Virtual Platform, but running blazingly fast in FPGAs, so they can integrate with RTL IP and still run at MHz speeds -- speed + accuracy.
Then add in a whole bunch more, including x86, PowerPC, SPARC, Alpha and Itanium models. These models run the gamut from architecturally pipelined models to ISSes. Most of them boot real OSes, and do it running really fast in FPGAs.
Given the breadth of real processor architectures booting real OSes in FPGAs, it's clear that no other HLS solution comes close!