Parallel Architecture Group at Northwestern (PARAG@N)
Headed by Nikos Hardavellas
The overarching research theme at the Parallel Architecture Group at Northwestern (PARAG@N) is energy-efficient computing. At the macro scale, computers consume inordinate amounts of energy, negatively impacting the economics and environmental footprint of computing. At the micro scale, power constraints prevent us from riding Moore's Law. We attack both problems by identifying sources of energy inefficiency and developing hardware/software techniques for cross-stack energy optimization.
Thus, our work extends from circuit and hardware design, through programming languages and OS optimizations, all the way to application software. In a nutshell, our work ranges from near-term solutions (minimizing data-transfer overheads with adaptive caching and DRAM management), to the medium term (eliminating computational overheads with specialized computing on dark silicon), to the medium-to-long term (minimizing energy in logic circuits through selective approximate computation), to the long term (pushing back the bandwidth and power walls by designing 1000+-core virtual macro-chips with nanophotonics, along with the runtime environment that goes with them). An overview of our research at PARAG@N was presented in invited talks at IBM T.J. Watson Research Center and Google Chicago in March 2012.
More specifically, we work on:
DRAM Thermal Management: More than a third of energy is consumed by memory, and thermal characteristics play an important role in overall DRAM power consumption and reliability, yet power/thermal optimizations for DRAM have been largely overlooked. Together with fellow faculty S.O. Memik and G. Memik, we recognize the importance of the problem and work, in the near term, on minimizing the power and thermal profile of DRAMs using OS-level optimizations. We have already published some of our results on DRAM thermal management at HPCA 2011 and are working on subsequent publications.
Elastic Caches: As a significant fraction of energy is consumed by data transfers and storage, we work on Elastic Caches in tandem with SeaFire. In this project we develop adaptive cache management policies that minimize the overheads of storing and communicating data among the cores. An incarnation of Elastic Caches for near-optimal data placement was published at ISCA 2009 and won an IEEE Micro Top Picks award in 2010, while a newer paper at DATE 2012 presents an instance of Elastic Caches that minimizes interconnect power. While Elastic Caches are independent of SeaFire (described below), their combination attains higher benefits than the sum of its parts.
SeaFire (Specialized Computing on Dark Silicon): While Elastic Fidelity (described below) cuts back on energy consumption, it does not push the power wall far enough. To gain another order of magnitude, we must minimize the overheads of modern computing. The idea behind the SeaFire project (targeting the medium term) is that instead of building conventional high-overhead multicores that we cannot power, we should repurpose the dark silicon for specialized energy-efficient cores. A running application will power up only the cores most closely matching its computational requirements, while the rest of the chip remains off to conserve energy. Preliminary results on SeaFire have appeared in an IEEE Micro article in July 2011, an invited USENIX ;login: article in April 2012, the ACLD workshop in 2010, the keynote at ISPDC in 2010, a TR in 2010, an invited presentation at the NSF Workshop on Sustainable Energy-Efficient Data Management in 2011, and an invited presentation at HPTS in 2011.
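The core-selection idea described for SeaFire can be illustrated with a minimal Python sketch. It is not the SeaFire implementation: the `Core` type, the `kind` labels, the wattages, and the cheapest-first greedy policy are all hypothetical, chosen only to show "power up the matching cores, leave the rest dark".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Core:
    name: str
    kind: str     # hypothetical label for what the core is specialized for
    watts: float  # hypothetical active power draw


def select_cores(cores, workload_kinds, power_budget_watts):
    """Power up only cores whose specialization matches the workload.

    Matching cores are considered cheapest first (an assumed policy) and
    are added while they still fit in the power budget. Every other core
    stays dark.
    """
    powered, used = [], 0.0
    candidates = sorted(
        (c for c in cores if c.kind in workload_kinds),
        key=lambda c: c.watts,
    )
    for core in candidates:
        if used + core.watts <= power_budget_watts:
            powered.append(core)
            used += core.watts
    return powered, used


# Illustrative chip with made-up cores and power numbers.
chip = [
    Core("fft0", "dsp", 1.5),
    Core("crypto0", "crypto", 2.0),
    Core("gp0", "general", 8.0),
    Core("fft1", "dsp", 1.5),
]
powered, used = select_cores(chip, {"dsp"}, power_budget_watts=4.0)
print([c.name for c in powered], used)
```

In this toy run only the two DSP cores are powered; the general-purpose and crypto cores remain off, which is the behavior the paragraph above describes.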
Elastic Fidelity: At the circuit level, shrinking transistor geometries and the race for energy-efficient computing result in significant error rates at smaller technology nodes due to process variation and low voltages (especially with near-threshold computing). Traditionally, these errors are handled at the circuit and architectural layers, as computations expect 100% reliability. Elastic Fidelity computing targets the near-to-medium term, and is based on the observation that not all computations and data require 100% fidelity; we can judiciously let errors manifest in the error-resilient data, and handle them higher in the stack. We develop programming language extensions that allow data objects to be instantiated with certain accuracy guarantees. The compiler records these guarantees and communicates them to the hardware, which then steers computations and data to separate ALU/FPU blocks and cache/memory regions that relax the guardbands and run at lower voltage to conserve energy. This is a relatively new project; we had a poster presentation on Elastic Fidelity at ASPLOS 2011, and a Technical Report in Feb 2011.
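As a rough illustration of the steering step described for Elastic Fidelity, the Python sketch below places each annotated data object in the lowest-energy region that still meets its declared accuracy requirement. The `Annotated` type, the region names, and every fidelity and energy number are assumptions made up for the example; they are not the project's language extensions or measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotated:
    name: str
    min_fidelity: float  # 1.0 means the object tolerates no errors


# Hypothetical regions: (name, delivered fidelity, relative energy per access).
# All numbers are invented for illustration only.
REGIONS = [
    ("nominal_voltage", 1.0, 1.0),
    ("low_voltage", 0.99, 0.6),
    ("near_threshold", 0.95, 0.35),
]


def place(obj, regions=REGIONS):
    """Return the cheapest region whose fidelity meets the object's needs."""
    feasible = [r for r in regions if r[1] >= obj.min_fidelity]
    if not feasible:
        raise ValueError(f"no region satisfies {obj.name}")
    return min(feasible, key=lambda r: r[2])[0]


loop_index = Annotated("loop_index", 1.0)  # control data: must be exact
audio = Annotated("audio_samples", 0.99)   # tolerates rare errors
pixels = Annotated("pixels", 0.95)         # error-resilient data

placements = {o.name: place(o) for o in (loop_index, audio, pixels)}
print(placements)
```

The design choice shown here is that the annotation travels with the data object, so error-sensitive data such as a loop index never leaves the fully guarded region while error-resilient data is free to use the cheaper ones.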
Galaxy (Optically-Connected Disintegrated Processors): The combined works above offer a respite of a few orders of magnitude, but a fundamental problem remains unaffected: chips are ultimately limited by bandwidth, power delivery and cooling constraints. In the Galaxy project (targeting the long term) we take a step towards pushing back the power, bandwidth, and yield walls. Instead of building monolithic chips, we advocate splitting them into several smaller chiplets connected with photonic interconnects (fiber optics across chiplets, silicon photonics within chiplets for long distances, electrical interconnects for short). The photonics allow communication at such high bandwidth that it breaks the bandwidth wall entirely (8 TBps/mm bandwidth density demonstrated in lab prototypes by IBM), and at such low latency that the virtual macro-chip behaves as a single chip. At the same time, power delivery is split among multiple chiplets, which solves the power delivery problem; the chiplets are spaced far apart so that they can be cooled efficiently with forced air, thereby pushing back the power wall; and they are sized optimally to maximize yield (another hurdle of technology scaling). While competing designs (e.g., the Oracle macrochip) have to resort to liquid cooling and microfluidics to cool 4 kW from a wafer-size device, our design allows us to space chiplets 8-10 cm apart and minimize heat transfer. Our preliminary results indicate that Galaxy scales seamlessly to 4000 cores. It is important to note that all the other research above is still relevant, as Elastic Caches are required to handle data accesses at such large scales, specialized computing on dark silicon can maximize performance and minimize energy consumption on a per-chiplet basis, and Elastic Fidelity and DRAM optimizations shave off significant overheads at the circuit layer and the main memory. A preliminary report on the Galaxy design was presented at WINDS 2010, and at a talk at Google Madison in March 2013.
Computation Affinity: The research steps presented above have gradually taken us from circuit-level optimizations for the near term to new architectures for the medium and long term that may allow us to realize 1000+ core virtual macro-chips. However, to make 1000+ core systems practical for more than just a narrow set of highly-optimized applications, we need to revisit the fundamentals of data access and sharing. For this last part of our vision we advocate Computation Affinity, an execution system that partitions the data objects among clusters of cores in the system, allowing the code to migrate from one cluster to another based on the data it accesses. Computation Affinity is a hybrid of active messages and processing-in-memory that minimizes data transfers and conserves enormous amounts of energy. The first incarnation of this idea was DORA, published in VLDB 2010 with database systems as a proof-of-concept. We specifically chose database systems because they are notoriously hard to optimize, as their arbitrarily complex access and sharing patterns resist most forms of architectural optimization. Our current work extends this paradigm to general-purpose programs.
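The "move the code to the data" principle behind Computation Affinity can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not DORA or the group's runtime: the `Cluster` and `AffinityRuntime` classes are hypothetical, keys are assumed to be integers, and a modulo partition stands in for whatever partitioning the real system uses.

```python
class Cluster:
    """A group of cores that owns one partition of the data objects."""

    def __init__(self, cid):
        self.cid = cid
        self.data = {}      # data objects owned by this cluster
        self.executed = []  # keys of the tasks that ran here


class AffinityRuntime:
    """Toy runtime that ships each task to the cluster owning its data."""

    def __init__(self, n_clusters):
        self.clusters = [Cluster(i) for i in range(n_clusters)]

    def owner(self, key):
        # Assumed partitioning: integer keys, modulo the cluster count.
        return self.clusters[key % len(self.clusters)]

    def put(self, key, value):
        self.owner(key).data[key] = value

    def run(self, key, fn):
        # The data object stays put; the function is applied at the owner.
        cluster = self.owner(key)
        cluster.data[key] = fn(cluster.data[key])
        cluster.executed.append(key)
        return cluster.cid


rt = AffinityRuntime(n_clusters=4)
for k in range(8):
    rt.put(k, 0)

ran_on = rt.run(5, lambda v: v + 1)  # key 5 is owned by cluster 5 % 4
print(ran_on, rt.clusters[ran_on].data[5])
```

Because each data object has exactly one owning cluster, an update never has to pull the object across the chip; only the (small) task description moves.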