ProWAFT: A ROMA-LPD Instance for Workload-Aware and Dynamic Fault Tolerance in FPGA-Based CNN Accelerators
Quick Answer
ProWAFT is a proactive, workload-aware fault-tolerance framework for FPGA-based CNN accelerators. It uses partial reconfiguration to apply TMR selectively, choosing configurations that minimize a composite cost over latency, energy, and reliability risk.
Quick Take
ProWAFT targets transient faults in SRAM-based FPGAs running CNN inference at the network edge. Implemented on a Xilinx Zynq UltraScale+ ZCU104 with six reconfigurable regions and evaluated on a 500-task trace derived from ResNet-18, MobileNetV2, and EfficientNet-Lite under time-varying SEU injection, it achieves lower composite cost than static TMR and reactive reconfiguration while maintaining a high task success rate, near-baseline throughput, and low online decision overhead.
Key Points
- ProWAFT employs partial reconfiguration to apply TMR selectively across reconfigurable partitions.
- Achieves lower composite cost (over latency, energy, and reliability risk) than static TMR and reactive reconfiguration.
- Maintains a high task success rate and near-baseline throughput under time-varying SEU injection, with low online decision overhead.
- Evaluated on a 500-task trace derived from ResNet-18, MobileNetV2, and EfficientNet-Lite.
- Implemented on Xilinx Zynq UltraScale+ ZCU104 platform with six reconfigurable regions.
Article Content
From source RSS / original summary: arXiv:2607.01602v1. Announce Type: new.
Abstract: SRAM-based FPGAs provide an attractive platform for energy- and latency-constrained CNN inference at the network edge, yet transient faults can lead to silent errors that compromise reliability. Always-on redundancy (e.g., full TMR) improves correctness but incurs substantial performance and energy overhead, while reactive recovery may introduce unacceptable latency on the critical path.
We propose ProWAFT, a proactive workload-aware fault-tolerance framework for FPGA-based CNN accelerators that uses partial reconfiguration to selectively apply TMR across reconfigurable partitions. ProWAFT quantifies workload criticality, models fault propagation and reconfiguration overhead, and selects configurations that minimize a composite objective over latency, energy, and reliability risk.
Implemented on a Xilinx Zynq UltraScale+ ZCU104 platform with six reconfigurable regions and evaluated on a 500-task trace derived from ResNet-18, MobileNetV2, and EfficientNet-Lite under time-varying SEU injection, ProWAFT achieves lower composite cost than static TMR and reactive reconfiguration while maintaining high task success rate and near-baseline throughput with low online decision overhead.
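The selection step described in the abstract can be illustrated with a minimal sketch. It is not the paper's model: the abstract gives no formulas, so the cost terms, the weights, the overhead constants, and the names `composite_cost` and `select_config` are all assumptions made for illustration. The sketch treats each of the six reconfigurable regions as either TMR-protected (1) or unprotected (0) and exhaustively searches the 64 possible assignments for the lowest weighted sum of latency, energy, and risk.

```python
from itertools import product

NUM_REGIONS = 6  # the evaluated platform has six reconfigurable regions


def composite_cost(config, prev_config, criticality, fault_rate,
                   w_latency=1.0, w_energy=1.0, w_risk=10.0):
    """Hypothetical composite cost of one TMR assignment (illustrative only)."""
    protected = sum(config)
    # Assumption: every TMR-protected region adds a fixed latency and energy overhead.
    latency = 1.0 + 0.2 * protected
    energy = 1.0 + 0.5 * protected
    # Assumption: each region whose mode changes pays a partial-reconfiguration delay.
    latency += 0.1 * sum(new != old for new, old in zip(config, prev_config))
    # Assumption: risk is the criticality-weighted fault rate of unprotected regions.
    risk = sum(c * fault_rate for c, on in zip(criticality, config) if not on)
    return w_latency * latency + w_energy * energy + w_risk * risk


def select_config(prev_config, criticality, fault_rate):
    """Return the assignment with the lowest composite cost (exhaustive search)."""
    candidates = product((0, 1), repeat=NUM_REGIONS)
    return min(
        candidates,
        key=lambda cfg: composite_cost(cfg, prev_config, criticality, fault_rate),
    )


if __name__ == "__main__":
    unprotected = (0,) * NUM_REGIONS
    # Two critical regions (first and last) under a moderate fault rate.
    print(select_config(unprotected, [1, 0, 0, 0, 0, 1], fault_rate=0.1))
```

With only six regions, exhaustive enumeration is cheap; a real system with more partitions or richer fault-propagation models would likely need a different search strategy.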