Description of work for WP3: Hardware countermeasures at the circuit level

   Due to the limited reliability of devices and components in late CMOS and emerging technologies, strategic hardware countermeasures are required at the circuit level in order to obtain blocks of reasonable quality for use at the architectural level. This is the objective of this section, where we focus on the circuit level. We address the countermeasures at two different levels: one corresponding to late CMOS technologies, with high parameter variability but a reasonable defect ratio, and a second for technologies with a very high defect ratio (worse than 10⁻⁶), where the yield and reliability of the system would approach zero if conventional design principles were used. For this second level, principles of highly redundant circuits are required.

Countermeasures for late CMOS technologies

   We propose a dynamic and concurrent technique to mitigate the negative impact of process variability on the energy consumption and performance of the system. We assume that purely functional issues in the various system modules can be solved during the circuit-level design phase. If the modules are functional after fabrication, the proposed approach ensures that the timing constraints at the system or application level are met despite any performance degradation at the module level due to process variability. It is based on the run-time reconfiguration of critical system modules that offer run-time configuration options (knobs) through which they can trade additional energy consumption for reduced delay in order to minimize timing violations.
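   As a minimal illustration of this idea, the sketch below models a controller that, given an application deadline, picks the lowest-energy combination of knob settings whose summed delay still meets it. The module names, the (delay, energy) points and the assumption that module delays simply add are illustrative placeholders, not measured data.

      from itertools import product

      # Hypothetical (delay_ns, energy_pJ) operating points per knob setting.
      MODULES = {
          "memory":  [(2.0, 10.0), (1.2, 18.0)],  # slow/low-energy vs fast/high-energy
          "decoder": [(1.0, 4.0), (0.7, 7.0)],
      }

      def configure(deadline_ns):
          """Exhaustively pick the lowest-energy knob assignment meeting the deadline."""
          best = None
          for settings in product(*MODULES.values()):
              delay = sum(d for d, _ in settings)    # assumes serial module delays
              energy = sum(e for _, e in settings)
              if delay <= deadline_ns and (best is None or energy < best[1]):
                  best = (settings, energy)
          return best

      print(configure(2.5))   # forces the fast memory setting, keeps the slow decoder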

   Memories are among the most variability-sensitive modules of a processing system. The reason is that most of the transistors in a memory are minimum-sized and therefore more prone to variability. Additionally, memories are dominated by parallel paths (word lines and bit lines), so timing can be severely degraded when a single worst-case path dominates the rest. In order to create a memory knob that scales to more advanced technology nodes, we resort to circuit-level techniques. This knob adds configuration capabilities to the memories, which can be switched at run-time between a slow, low-energy configuration and a fast, high-energy configuration.

   The configuration capabilities at the memory level can be provided by using configurable buffers for the large capacitive loads inside the memories. Such capacitances consist of long interconnect wires with large fan-outs; examples include the buses between the pre-decoder and post-decoder stages and the word line, which is connected to all the pass transistors in a single memory row. Driving large capacitive loads requires buffers, implemented as inverter chains, with transistor sizes chosen according to the metric to be optimized. Typically, large transistor sizes are used to minimize the delay needed to reach the correct logic value at the end of the wire, whereas for energy minimization the sizes are kept as small as possible. To create run-time configuration capabilities, a run-time configurable driver contains a parallel implementation of two or three different drivers and a simple control mechanism to choose which one is activated at each moment.
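   The following toy first-order model illustrates such a configurable word-line driver: two or three drivers of different sizes sit in parallel, and a select signal enables the smallest one that still meets the current delay target. All constants (load capacitance, drive strengths, the delay and energy expressions) are illustrative assumptions, not extracted circuit data.

      C_LOAD_FF = 200.0      # hypothetical word-line capacitance in fF

      DRIVERS = {            # name: (relative drive strength, relative size)
          "small":  (1.0, 1.0),
          "medium": (2.5, 2.5),
          "large":  (5.0, 5.0),
      }

      def delay_ps(driver):
          strength, _ = DRIVERS[driver]
          return 0.7 * C_LOAD_FF / strength       # first-order R_drv * C_load model

      def energy_fj(driver):
          _, size = DRIVERS[driver]
          return 0.5 * (C_LOAD_FF + 10.0 * size)  # switched load + driver self-load

      def select_driver(max_delay_ps):
          """Enable the lowest-energy parallel driver that meets the delay target."""
          feasible = [d for d in DRIVERS if delay_ps(d) <= max_delay_ps]
          return min(feasible, key=energy_fj) if feasible else "large"

      for target in (150.0, 60.0, 25.0):
          print(target, "->", select_driver(target))   # small, medium, large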

   A second strategy for mitigating variability is the use of compensating techniques. These techniques are based on the assumption that, in large complex systems, a significant part of the aggressions exhibits independent random behaviour. For such deviations, whether noise or variability, a specific circuit-level organization allows their effects to be compensated. These compensating techniques require the design of consumption, noise, temperature and timing-violation monitors.
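   A schematic sketch of such a compensation loop is given below: monitor readings feed a controller that nudges a compensation knob (for instance a body-bias level) until the readings return within bounds. The monitor interface, the thresholds and the knob itself are hypothetical placeholders.

      def compensate(read_monitors, apply_knob, knob, max_knob=7):
          """One regulation step: raise the knob on a timing violation,
          lower it when there is comfortable slack, otherwise hold."""
          m = read_monitors()             # e.g. {"violations": 2, "slack_ps": -15}
          if m["violations"] > 0 and knob < max_knob:
              knob += 1                   # compensate: speed the circuit up
          elif m["violations"] == 0 and m["slack_ps"] > 50 and knob > 0:
              knob -= 1                   # relax again to save energy
          apply_knob(knob)
          return knob

      # toy usage with stubbed monitor readings
      readings = iter([{"violations": 2, "slack_ps": -15},
                       {"violations": 0, "slack_ps": 80}])
      knob = 0
      for _ in range(2):
          knob = compensate(lambda: next(readings), lambda k: None, knob)
          print("knob =", knob)           # 1, then back to 0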

Countermeasures for technologies with very high defect ratios: High-redundancy requirements

   Current electronic systems are composed of 10⁸–10⁹ transistors. The latest predictions for the electronics industry indicate a goal of 10¹⁰–10¹² transistors per chip by the year 2020. Building such systems will require a large effort either in improving device reliability or, more probably, in including fault-tolerant techniques and dividing the system into submodules with a manageable number of devices. In fact, tolerant techniques are already widely used and required in current electronic systems. For example, consider that the current error probability for MOS devices is around 10⁻⁶. The probability of having no defect in a current system built with 10⁸ such devices is (1 − 10⁻⁶)^10⁸ ≈ 3.72 · 10⁻⁴⁴.
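   The quoted figure is easy to reproduce: with N = 10⁸ devices and a per-device error probability p = 10⁻⁶, the chance of a fully defect-free chip is (1 − p)^N ≈ e^(−pN), and the expected number of defective devices is pN.

      import math

      p, N = 1e-6, 1e8
      print((1 - p) ** N)      # ~3.72e-44, the value quoted above
      print(math.exp(-p * N))  # e^-100, the same to first order
      print(p * N)             # 100 expected defective devices per chip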

   As a matter of fact, in such a system we can expect to find an average of 100 defective devices. If the design had no tolerance capability at all, it would simply be impossible to build. Thus, today's electronic designs use tolerant techniques for internal memories, where error-correcting codes are normally employed to mask faults and defects. Besides, most memory modules include spare cells and reconfiguration capabilities to substitute for defective circuits. The number of processing elements (combinational and sequential logic) is 100 to 1000 times smaller than the number of memory devices, so the error probability for the active logic blocks is much lower. However, a thorough test analysis is performed on each chip to detect those where a defect produces a system malfunction.

First Tolerant Hierarchy Layer: The Averaging Cell (AC)

   The first tolerant layer should provide basic building structures of reasonable reliability. The Averaging Cell fits perfectly into this first layer: it builds logic cells out of redundant nanodevices at a low area cost, even for large redundancy factors. The resulting error probabilities are in a range where system modules can be implemented with reasonable reliability. On top of all this, the proposed implementation of the AC cells also bridges the gap between deep-nanoscale and fabrication dimensions, because the interconnections can be scaled as needed depending on the available technology.

   To further illustrate the improvements of the AC, we present the error probability ratio between a simple cell and an AC cell with N = 50. The error probability of the CMOS cell has been calculated considering the resolution and chemical error values. To calculate the error probability of the AC cell we have considered the same values for resolution and chemical error, and have also included internal and external noise sources with intensity σ/V = 0.05 V/V. The reliability improvement provided by AC cells against these aggression sources is clearly several orders of magnitude across the considered region.
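   The noise-averaging part of this gain can be sketched analytically: with independent Gaussian noise of relative intensity σ/V on each device, averaging N devices shrinks the effective noise to σ/√N, so the probability of crossing the mid-swing threshold collapses. The Gaussian model and the mid-swing threshold are simplifying assumptions; σ/V = 0.05 and N = 50 follow the text, and resolution and chemical errors are left out.

      import math

      def gauss_tail(x):
          """P(Z > x) for a standard normal variable."""
          return 0.5 * math.erfc(x / math.sqrt(2))

      SIGMA, N = 0.05, 50
      HALF_SWING = 0.5    # distance from the ideal logic level to the threshold

      p_single = gauss_tail(HALF_SWING / SIGMA)                # one device
      p_ac = gauss_tail(HALF_SWING / (SIGMA / math.sqrt(N)))   # averaged cell

      print(f"single cell error: {p_single:.3e}")   # ~7.6e-24
      print(f"AC (N=50) error  : {p_ac:.3e}")       # underflows to 0 in doubles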

Higher Tolerant Hierarchy Layers

   The AC cells are a promising solution for building nanoscale systems. However, they do not solve all the problems of the nanoscale environment. Area defects, whether static (particles) or transient upsets (cosmic rays), will induce errors in the redundant cells, and correlated noise affecting all the devices will also cause errors. It is therefore necessary to consider at least a second tolerant technique to cope with these remaining aggression sources. At this point, however, the error probability will already be much lower, so techniques such as retry combined with a detection mechanism, low-cost hardware redundancy, or a combination of both, tuned to an optimal cost-reliability point, should suffice.
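   As a back-of-the-envelope illustration, the model below combines a detection mechanism of coverage c with up to k retries on top of a residual error probability p: what remains is the chance of an undetected (silent) error plus the rare event that every attempt fails detectably. Independence between attempts and the example numbers are assumptions.

      def residual(p, c, k):
          """Silent-failure and retry-exhaustion probabilities with k retries."""
          silent = (1 - c) * p * (1 - (c * p) ** (k + 1)) / (1 - c * p)
          exhausted = (c * p) ** (k + 1)   # every attempt failed but was detected
          return silent, exhausted

      print(residual(p=1e-9, c=0.99, k=2))   # the silent term dominates, ~1e-11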
Fig_WP3.jpg: (a) Error probability ratio between a simple CMOS cell and an AC with N = 50 for different values of resolution and chemical error. (b) Resolution and chemical error versus the CMOS error probability for the comparison in (a).