As we approach the exascale era in supercomputing, designing a balanced computer system with a powerful computing ability and low power requirements has becoming increasingly important. The graphics processing unit(GPU) is an accelerator used widely in most of recent supercomputers. It adopts a large number of threads to hide a long latency with a high energy efficiency. In contrast to their powerful computing ability, GPUs have only a few megabytes of fast on-chip memory storage per streaming multiprocessor(SM). The GPU cache is inefficient due to a mismatch between the throughput-oriented execution model and cache hierarchy design. At the same time, current GPUs fail to handle burst-mode long-access latency due to GPU's poor warp scheduling method.Thus, benefits of GPU's high computing ability are reduced dramatically by the poor cache management and warp scheduling methods, which limit the system performance and energy efficiency. In this paper, we put forward a coordinated warp scheduling and locality-protected(CWLP) cache allocation scheme to make full use of data locality and hide latency. We first present a locality-protected cache allocation method based on the instruction program counter(LPC) to promote cache performance. Specifically, we use a PC-based locality detector to collect the reuse information of each cache line and employ a prioritised cache allocation unit(PCAU) which coordinates the data reuse information with the time-stamp information to evict the lines with the least reuse possibility. Moreover, the locality information is used by the warp scheduler to create an intelligent warp reordering scheme to capture locality and hide latency. Simulation results show that CWLP provides a speedup up to 19.8% and an average improvement of 8.8% over the baseline methods.
A signal probability and activity probability (SPAP) model was proposed firstly, to estimate the impacts of the negative bias temperature instability (NBTI) and positive bias temperature instability (PBTI) on power gated static random access memory (SRAM). The experiment results show that PBTI has significant influence on the read and write operations of SRAM with power gating, and it deteriorates the NBTI effects and results in a up to 39.38% static noise margin reduction and a 35.7% write margin degradation together with NBTI after 106 s working time. Then, a circuit level simulation was used to verify the assumption of the SPAP model, and finally the statistic data of CPU2000 benchmarks show that the proposed model has a reduction of 3.85% for estimation of the SNM degradation after 106 s working time compared with previous work.