Data speculation has been heavily exploited across many areas of architecture design. It bridges the time or space gap between a data producer and a data consumer, giving processors the opportunity to gain significant speedups. However, large instruction windows, deep pipelines, and the increasing latency of on-chip communication make data misspeculation very expensive in modern processors.

This paper proposes a Distributed Replay Protocol (DRP) that addresses data misspeculation in a distributed uniprocessor, TFlex. The partitioned nature of distributed uniprocessors aggravates the penalty of data misspeculation. After detecting misspeculation, DRP avoids squashing the pipeline; instead, it retains all instructions in the window and selectively replays the instructions that depend on the misspeculated data. As one possible use of DRP, we apply it to recovery from data dependence speculation. We also summarize the challenges of implementing a selective replay mechanism on distributed uniprocessors, and propose two variations of DRP that effectively address these challenges. The evaluation results show that without data speculation, DRP achieves 99% of the performance of perfect memory disambiguation. It speeds up diverse applications over the baseline TFlex (with a state-of-the-art data dependence predictor) by a geometric mean of 24%.
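To make the selective-replay idea in the abstract concrete, the sketch below illustrates the general technique (keep the window intact and re-execute only the transitive consumers of the misspeculated value), not the paper's actual DRP messages or TFlex microarchitecture; the window layout, instruction fields, and function names are illustrative assumptions.

```python
# Minimal sketch of selective replay after data misspeculation.
# NOT the paper's DRP: the Instr fields, window layout, and function
# names here are illustrative assumptions, not TFlex structures.
from dataclasses import dataclass

@dataclass
class Instr:
    iid: int              # position in the instruction window
    srcs: list            # iids of producer instructions
    replay: bool = False  # marked for re-execution?

def mark_for_replay(window, misspeculated_iid):
    """Instead of squashing every instruction younger than the
    misspeculated one, walk the dependence graph and mark only the
    transitive consumers of the bad value for re-execution."""
    consumers = {i.iid: [] for i in window}
    for instr in window:
        for src in instr.srcs:
            if src in consumers:
                consumers[src].append(instr.iid)

    worklist = [misspeculated_iid]
    while worklist:
        iid = worklist.pop()
        for dep in consumers[iid]:
            instr = window[dep]
            if not instr.replay:
                instr.replay = True   # re-dispatch later with correct data
                worklist.append(dep)  # its own consumers must replay too

# Example: instruction 0 (say, a mis-disambiguated load) produced a bad
# value; only its dependence chain (1 and 3) replays, while the
# independent instruction 2 keeps its completed result.
window = [Instr(0, []), Instr(1, [0]), Instr(2, []), Instr(3, [1, 2])]
mark_for_replay(window, misspeculated_iid=0)
print([i.iid for i in window if i.replay])  # -> [1, 3]
```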
%0 Conference Paper
%1 Mao:2012:DRP:2304576.2304580
%A Mao, Mengjie
%A An, Hong
%A Deng, Bobin
%A Sun, Tao
%A Wei, Xuechao
%A Zhou, Wei
%A Han, Wenting
%B Proceedings of the 26th ACM international conference on Supercomputing
%C New York, NY, USA
%D 2012
%I ACM
%K replay uniprocessor
%P 3--14
%R 10.1145/2304576.2304580
%T Distributed replay protocol for distributed uniprocessors
%X Data speculation has been heavily exploited across many areas of architecture design. It bridges the time or space gap between a data producer and a data consumer, giving processors the opportunity to gain significant speedups. However, large instruction windows, deep pipelines, and the increasing latency of on-chip communication make data misspeculation very expensive in modern processors. This paper proposes a Distributed Replay Protocol (DRP) that addresses data misspeculation in a distributed uniprocessor, TFlex. The partitioned nature of distributed uniprocessors aggravates the penalty of data misspeculation. After detecting misspeculation, DRP avoids squashing the pipeline; instead, it retains all instructions in the window and selectively replays the instructions that depend on the misspeculated data. As one possible use of DRP, we apply it to recovery from data dependence speculation. We also summarize the challenges of implementing a selective replay mechanism on distributed uniprocessors, and propose two variations of DRP that effectively address these challenges. The evaluation results show that without data speculation, DRP achieves 99% of the performance of perfect memory disambiguation. It speeds up diverse applications over the baseline TFlex (with a state-of-the-art data dependence predictor) by a geometric mean of 24%.
%@ 978-1-4503-1316-2
@inproceedings{Mao:2012:DRP:2304576.2304580,
abstract = {Data speculation has been heavily exploited across many areas of architecture design. It bridges the time or space gap between a data producer and a data consumer, giving processors the opportunity to gain significant speedups. However, large instruction windows, deep pipelines, and the increasing latency of on-chip communication make data misspeculation very expensive in modern processors. This paper proposes a Distributed Replay Protocol (DRP) that addresses data misspeculation in a distributed uniprocessor, TFlex. The partitioned nature of distributed uniprocessors aggravates the penalty of data misspeculation. After detecting misspeculation, DRP avoids squashing the pipeline; instead, it retains all instructions in the window and selectively replays the instructions that depend on the misspeculated data. As one possible use of DRP, we apply it to recovery from data dependence speculation. We also summarize the challenges of implementing a selective replay mechanism on distributed uniprocessors, and propose two variations of DRP that effectively address these challenges. The evaluation results show that without data speculation, DRP achieves 99% of the performance of perfect memory disambiguation. It speeds up diverse applications over the baseline TFlex (with a state-of-the-art data dependence predictor) by a geometric mean of 24%.},
acmid = {2304580},
added-at = {2012-11-07T14:59:45.000+0100},
address = {New York, NY, USA},
author = {Mao, Mengjie and An, Hong and Deng, Bobin and Sun, Tao and Wei, Xuechao and Zhou, Wei and Han, Wenting},
biburl = {https://www.bibsonomy.org/bibtex/262931e8fcf2d0bf96bbb31dcf187423d/ytyoun},
booktitle = {Proceedings of the 26th ACM international conference on Supercomputing},
doi = {10.1145/2304576.2304580},
interhash = {63384271b682a62fbb043adbe669b17a},
intrahash = {62931e8fcf2d0bf96bbb31dcf187423d},
isbn = {978-1-4503-1316-2},
keywords = {replay uniprocessor},
location = {San Servolo Island, Venice, Italy},
numpages = {12},
pages = {3--14},
publisher = {ACM},
series = {ICS '12},
timestamp = {2012-11-07T14:59:45.000+0100},
title = {Distributed replay protocol for distributed uniprocessors},
year = 2012
}