Description of work for WP4: Hybrid countermeasures at the micro-architecture and system level

   Unless defect-tolerant and fault-tolerant techniques are introduced into their design, future systems will have very low yield and are likely to exhibit poor operational performance and reliability. Microarchitectural and system-level intervention offers a novel way to manage reliability and testing without significantly sacrificing cost or performance. Microarchitects have traditionally treated processor lifetime reliability as a manufacturing problem, best left to device and process engineers. However, the microarchitecture has unique knowledge of the logical operation of the hardware, whereas the system holds most of the information about the applications. Together, they offer an opportunity to reduce the impact of future technologies on design cost and overall revenue while, at the same time, increasing the performance and reliability of the resulting multicores. In this project, we will show how circuit, microarchitectural and system techniques can be combined to increase overall yield, performance and reliability.

   In this particular WP, we will work on the microarchitectural and system-level methodology. As the momentum behind chip multiprocessor (CMP) architectures continues to grow, future microprocessors are expected to have several cores sharing the on-die and off-die resources. The success of CMP platforms depends not only on the number of cores but also, heavily, on the platform resources available (cache, main memory, etc.) and their efficient usage. Therefore, we will focus on the two most important memory resources of a processor: cache memories and register files.

   Given a memory structure, its cells may have different delay and power properties, which strongly depend on process variations, usage (degradation) and environmental conditions (temperature). All these circumstances will be assessed in previous WPs and used as input for WP4. The net result of this variability is that the multicore system will be heterogeneous, since the caches and register files of each core may end up differing in capacity, latency and power consumption. Moreover, this heterogeneity will not be fixed, but will vary over time due to aging. We propose to investigate a methodology that adapts the multicore to three basic user requirements: performance, power and reliability. It is important to note that all three requirements may matter simultaneously, with different weights, which we also plan to address.
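To make the weighted combination of the three requirements concrete, the sketch below scores a candidate configuration under user-supplied weights. The metric names, the normalisation to [0, 1] and the example weight vectors are illustrative assumptions, not part of the project definition.

```python
# Hypothetical sketch: fold the three user requirements into one score.
# perf and reliability are benefits (higher is better); power is a cost,
# so it enters the score negatively. All values are assumed normalised.

def weighted_score(perf, power, reliability, weights):
    """Score a candidate configuration under the user's weight vector."""
    return (weights["perf"] * perf
            - weights["power"] * power
            + weights["rel"] * reliability)

# Illustrative user profiles (assumed, not from the project definition).
power_saver = {"perf": 0.2, "power": 0.6, "rel": 0.2}  # energy-conscious
critical = {"perf": 0.2, "power": 0.1, "rel": 0.7}     # safety-critical

# Two hypothetical operating configurations of the same memory.
fast_cfg = dict(perf=0.9, power=0.8, reliability=0.5)
slow_cfg = dict(perf=0.4, power=0.2, reliability=0.9)
```

Under both example profiles the slower, more reliable configuration wins; a performance-heavy weight vector would instead favour the fast one.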

   The figure below (high-level reconfiguration framework) shows our framework. It is articulated around three components: (i) memory and program characterization, (ii) reconfiguration policies for the memory structures, and (iii) application of dynamic reconfiguration based on the state of the memories, the program characteristics and the user demand.

   Memory characterization. In the first step, we will work on many different methodologies that will determine the main properties of each memory cell. Since cell behaviour strongly depends on the environment (e.g., temperature), processor configuration (e.g., frequency and voltage operating mode) and the processor utilization (i.e., wearout), we cannot rely on tests run at fabrication time; rather, we will need methodologies that run periodic checks. Furthermore, extreme conditions may cause a cell to behave in a faulty manner. Therefore, we will also need to have online error detection schemes that guarantee correctness.
[Figure: High-level reconfiguration framework]
   First, we will use the mechanisms proposed in WP3 to characterize memory properties (such as latency at different voltages and temperatures). Since degradation is a continuous effect, we will extend these mechanisms so that the granularity at which the online tests are executed can be chosen.
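As a concrete example of the kind of periodic online check involved, the sketch below runs a March C- test (a standard memory test that detects stuck-at and many coupling faults) over a simulated memory region. The FaultyMemory model and its stuck-at-0 fault injection are assumptions for illustration only.

```python
# Hedged sketch of an online March C- test over a simulated memory region.

class FaultyMemory:
    """Simulated word array; cells listed in `stuck_at_0` always read 0."""
    def __init__(self, size, stuck_at_0=()):
        self.cells = [0] * size
        self.stuck = set(stuck_at_0)
    def write(self, addr, val):
        self.cells[addr] = 0 if addr in self.stuck else val
    def read(self, addr):
        return 0 if addr in self.stuck else self.cells[addr]

def march_c_minus(mem, size):
    """Return the set of failing addresses (empty set: region is healthy)."""
    faults = set()
    def sweep(addrs, expect, write_val):
        for a in addrs:
            if expect is not None and mem.read(a) != expect:
                faults.add(a)
            if write_val is not None:
                mem.write(a, write_val)
    up, down = range(size), range(size - 1, -1, -1)
    sweep(up, None, 0)    # any order: w0
    sweep(up, 0, 1)       # ascending: r0, w1
    sweep(up, 1, 0)       # ascending: r1, w0
    sweep(down, 0, 1)     # descending: r0, w1
    sweep(down, 1, 0)     # descending: r1, w0
    sweep(down, 0, None)  # any order: r0
    return faults
```

In the WP, the granularity knob would decide how large a region is tested per invocation, so the test can be interleaved with normal operation.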

   Program characterization. Applications usually exhibit distinct phases with widely different behaviour. Our proposal is therefore to identify those phases, so that we can reconfigure the processor in the most suitable way. Phases will be identified based on their resource requirements in terms of performance, power consumption and vulnerability to errors. Based on this collected information and the user requirements, we will be able to apply the best possible reconfiguration. For instance, if a memory is no longer on the critical path of the application during a certain period, switching it to a slower mode will allow energy savings.
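A minimal sketch of interval-based phase identification: execution is sampled in fixed instruction intervals, each summarised by a small metric vector (here a hypothetical triple of cache-miss rate, IPC and a vulnerability proxy), and a new phase is declared when the Manhattan distance to every known phase signature exceeds a threshold. The metrics and the threshold value are illustrative assumptions.

```python
# Hedged sketch of online phase detection via interval signatures.

def detect_phases(intervals, threshold=0.5):
    """Assign a phase id to each interval's metric vector."""
    phases = []  # representative vector per discovered phase
    labels = []
    for vec in intervals:
        for pid, rep in enumerate(phases):
            # Manhattan distance to the phase representative.
            if sum(abs(a - b) for a, b in zip(vec, rep)) < threshold:
                labels.append(pid)
                break
        else:
            # No existing phase is close enough: open a new one.
            phases.append(vec)
            labels.append(len(phases) - 1)
    return labels
```

A production version would also smooth labels over time to avoid reconfiguring on transient spikes.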

   Memory reconfiguration. Cache memories and register files will have different operating points, each providing an energy-delay-capacity-reliability trade-off that can be exploited at run time to adapt them to the actual requirements. Some application phases may require a fast cache rather than a high-capacity one; other phases may require a large cache, even if it is slower. If the data accessed belongs to a video stream, we can relax the reliability requirements, since a wrong pixel is not that important. We will elaborate a set of microarchitectural reconfiguration policies covering these different scenarios, so that we can choose among them later on when we actually reconfigure the multicore.
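The selection among such operating points can be sketched as follows: given a phase's requirements, pick the lowest-energy point that satisfies them. The concrete operating points, field values and the relaxed-reliability mode are hypothetical placeholders, not measured design points.

```python
# Illustrative cache operating points: (name, latency in cycles,
# capacity in KB, relative energy, error protection enabled).
OPERATING_POINTS = [
    ("fast-small",    1,  32, 1.0, True),
    ("slow-large",    3, 256, 0.8, True),
    ("slow-large-lr", 3, 256, 0.6, False),  # reliability relaxed
]

def pick_point(max_latency, min_capacity_kb, needs_protection):
    """Cheapest operating point meeting the phase's requirements, or None."""
    candidates = [p for p in OPERATING_POINTS
                  if p[1] <= max_latency
                  and p[2] >= min_capacity_kb
                  and (p[4] or not needs_protection)]
    return min(candidates, key=lambda p: p[3])[0] if candidates else None
```

For a video-streaming phase (large footprint, reliability relaxed) this selects the unprotected large mode; a latency-critical protected phase gets the fast small mode.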

   Choosing the appropriate configuration. Due to process variability, the operating points differ across memories. Moreover, given the large spectrum of applications (and of phases within an application) and the shared resources in a multicore environment, the solution space will be large and there will be no “one configuration fits all”. This means that we will have to reconfigure the system at runtime and at fine granularity.

   We will explore different reconfiguration mechanisms that maximize performance and/or reliability and/or minimize power. We will combine solutions at coarse grain (mapping applications to the heterogeneous cores and selecting frequency/voltage) and at fine grain (single-core reconfiguration to exploit phase characteristics).
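The coarse-grain side can be illustrated with a simple greedy policy: pair the most demanding application with the fastest of the (variation-induced) heterogeneous cores. The core speeds, demand values and the greedy rule itself are illustrative assumptions; the WP will evaluate more elaborate mapping policies.

```python
# Sketch of coarse-grain mapping onto heterogeneous cores.

def map_apps_to_cores(app_demand, core_speed):
    """Greedy mapping: most demanding app gets the fastest remaining core.

    app_demand, core_speed: dicts of name -> relative value.
    Returns a dict app -> core.
    """
    apps = sorted(app_demand, key=app_demand.get, reverse=True)
    cores = sorted(core_speed, key=core_speed.get, reverse=True)
    return dict(zip(apps, cores))
```

Fine-grain reconfiguration (per-core operating-point selection) would then run on top of the mapping chosen here, once per detected phase.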