

- Director, RIKEN Center for Computational Science
- 20191106 France-Germany-Japan Presentation @ Tokyo



#### Science of Computing, by Computing, for Computing

- International core research center in the science of high performance computing (HPC)
- Alliances with domestic and overseas universities and research institutes, including other research centers in RIKEN
- Alliances with industry
- Fostering of human resources in computational science
- Synergies and integration

**Science of computing**: foundational research on computing technologies essential for HPC

- Development of new computing technologies, architectures, and algorithms toward the "post-Moore" era
- Research on programming methods, software, and operational technologies
- Development of methodologies to handle big data and AI for the coming Society 5.0

**Science by computing**: research utilizing HPC to address issues in basic science and of public concern

- Research utilizing analysis and simulation with high resolution and high fidelity in life sciences, engineering, climate and environment, disaster prediction and prevention, material sciences, space and particle physics, and social sciences
- Development of machine learning applications

**Science for computing**: alliance with other scientific disciplines that contribute to the evolution of HPC

- Development of new electronic devices, and new materials to make them a reality, to enable new concepts of computing, such as photonic, neuromorphic, quantum, and reconfigurable devices
- New computer architectures and computational models utilizing new devices
- New algorithms and programming models for new computing technologies
- Acceleration of analysis and simulation to develop new devices




## **Challenges Ahead for R-CCS**

- 1. Launching, Operating, and Improving 'Fugaku' --- the first 'Exascale' Supercomputer for Simulation, Big Data and AI
- 2. Extreme improvements in convergence of HPC for AI
  - Improving processor performance for inference & training
  - Extreme data parallelism for extreme scaling
  - Incorporating model parallelism for performance and ultra large neural networks beyond tens of GBytes
  - ... and AI for HPC (challenges in apps & algorithms)
- 3. Big data with IoT and HPC convergence --- how to process data WITHOUT moving or storing them
  - Not just traditional compression, filtering...
- 4. Post-Moore computing towards 2030s --- sustainable future for HPC, Big Data, and AI (and Fugaku-Next)







The 'Fugaku' Supercomputer, Successor to the K-Computer



# The Next-Gen "Fugaku" 富岳 Supercomputer




Broad Base --- Applicability & Capacity

- Broad Applications: Simulation, Data Science, AI, ...
- Broad User Base: Academia, Industry, Cloud Startups, ...





- Fujitsu-Riken design A64fx ARM v8.2 (SVE), 48/52 core CPU
  - HPC optimized: extremely high on-package memory BW (1 TByte/s), on-die Tofu-D network BW (~400 Gbps), high SVE FLOPS (~3 Teraflops), various AI support (FP16, INT8, etc.)
  - General-purpose CPU: Linux, Windows (Word), other SCs/Clouds
  - Extremely power efficient: > 10x power/perf efficiency for CFD benchmark over current mainstream x86 CPU
- Largest and fastest supercomputer ever to be built, circa 2020
  - > 150,000 nodes, superseding LLNL Sequoia
  - > 150 PetaByte/s memory BW
  - Tofu-D 6D Torus NW, 60 Petabps injection BW (10x global IDC traffic)
  - 25~30 PB NVMe L1 storage
  - The first 'exascale' machine (not exa 64-bit flops => apps perf.)
  - Acceleration of HPC, Big Data, and AI to extreme scale
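The machine-level bandwidth figures above follow from the per-node figures by simple multiplication. The sketch below reproduces that arithmetic; it assumes exactly 150,000 nodes, 1 TByte/s and 400 Gbps per node (the slide gives these only as approximate lower bounds), and the function names are illustrative.

```python
# Per-node figures quoted on the slide (approximate, used here as exact values).
NODES = 150_000                # "> 150,000 nodes"
MEM_BW_PER_NODE_TB_S = 1.0     # ~1 TByte/s on-package memory bandwidth
INJ_BW_PER_NODE_GBPS = 400.0   # ~400 Gbps Tofu-D bandwidth per node


def aggregate_memory_bw_pb_s(nodes, tb_per_s):
    """Total memory bandwidth in PByte/s (1 PB = 1000 TB)."""
    return nodes * tb_per_s / 1000.0


def aggregate_injection_bw_pbps(nodes, gbps):
    """Total injection bandwidth in Petabit/s (1 Pb = 1e6 Gb)."""
    return nodes * gbps / 1e6


mem_bw = aggregate_memory_bw_pb_s(NODES, MEM_BW_PER_NODE_TB_S)
inj_bw = aggregate_injection_bw_pbps(NODES, INJ_BW_PER_NODE_GBPS)
print(f"aggregate memory BW    ~ {mem_bw:.0f} PB/s")
print(f"aggregate injection BW ~ {inj_bw:.0f} Pbps")
```

Both results match the "> 150 PetaByte/s" and "60 Petabps" entries, which suggests those entries are per-node peaks scaled by node count rather than measured system-wide rates.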





## **Brief History of R-CCS towards Fugaku**




![](_page_8_Picture_0.jpeg)

## **Co-Design Activities in Fugaku**

![](_page_8_Picture_2.jpeg)

![](_page_8_Figure_3.jpeg)

- Extremely tight collaborations between the Co-Design apps centers, Riken, Fujitsu, etc.
- Chose 9 representative apps as the "target application" scenario
- Achieve up to 100x speedup cf. K-Computer
- Also ease of programming, broad SW ecosystem, very low power, ...

# A64FX Leading-edge Si-technology

![](_page_9_Picture_1.jpeg)

- TSMC 7nm FinFET & CoWoS
  - Broadcom SerDes, HBM I/O, and SRAMs
  - 8.786 billion transistors.
  - 594 signal pins

![](_page_9_Picture_6.jpeg)

![](_page_9_Picture_7.jpeg)

## Fugaku's Fujitsu A64fx Processor is...

![](_page_10_Picture_2.jpeg)

### • a Many-Core ARM CPU...

- 48 compute cores + 2 or 4 assistant (OS) cores
- Brand new core design
- Near Xeon-Class Integer performance core
- ARM V8 --- 64bit ARM ecosystem
- Tofu-D + PCIe 3 external connection
- ...but also an accelerated GPU-like processor
  - SVE 512 bit x 2 vector extensions (ARM & Fujitsu)
    - Integer (1, 2, 4, 8 bytes) + Float (16, 32, 64 bits)
  - Cache + scratchpad-like local memory (sector cache)
  - HBM2 on-package memory: massive memory BW (Bytes/DP flop ~0.4)
    - Streaming memory access, strided access, scatter/gather etc.
  - Intra-chip barrier synch. and other memory enhancing features
- GPU-like high performance in HPC, AI/Big Data, Auto Driving...
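The peak-flops and bytes-per-flop figures can be reconstructed from the vector configuration listed above. This is a rough sketch, not vendor data: the 1.8 GHz clock is an assumption (it is the value that reproduces a 2.76 TFLOP/s per-node peak), each lane is assumed to issue one fused multiply-add per cycle (counted as 2 flops), and the memory bandwidth is taken as a round 1 TByte/s.

```python
def peak_flops(cores, pipelines, vector_bits, element_bits, clock_hz):
    """Peak flops/s assuming one fused multiply-add (2 flops) per lane per cycle."""
    lanes = vector_bits // element_bits
    return cores * pipelines * lanes * 2 * clock_hz


CORES = 48               # compute cores only, assistant cores excluded
PIPELINES = 2            # "SVE 512 bit x 2"
VECTOR_BITS = 512
CLOCK_HZ = 1.8e9         # ASSUMPTION: clock frequency is not stated on the slide
MEM_BW_BYTES_S = 1.0e12  # ~1 TByte/s HBM2

peak_dp = peak_flops(CORES, PIPELINES, VECTOR_BITS, 64, CLOCK_HZ)
peak_fp16 = peak_flops(CORES, PIPELINES, VECTOR_BITS, 16, CLOCK_HZ)
bytes_per_dp_flop = MEM_BW_BYTES_S / peak_dp

print(f"peak DP   ~ {peak_dp / 1e12:.2f} TFLOP/s")
print(f"peak FP16 ~ {peak_fp16 / 1e12:.2f} TFLOP/s")
print(f"bytes per DP flop ~ {bytes_per_dp_flop:.2f}")
```

Under these assumptions the ratio comes out a little under 0.4 bytes per double-precision flop, in line with the "~0.4" quoted on the slide.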

![](_page_10_Figure_17.jpeg)

# "Fugaku" CPU Performance Evaluation (2/3)

- Himeno Benchmark (Fortran90)
  - Stencil calculation to solve Poisson's equation by Jacobi method
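For readers unfamiliar with this class of kernel, the following is a minimal NumPy sketch of a Jacobi iteration for a 3D Poisson problem. It is not the Himeno benchmark itself (which is written in Fortran90 and uses a larger stencil with coefficient arrays); the 7-point stencil, the zero boundary condition, and all function names are simplifying assumptions.

```python
import numpy as np


def jacobi_poisson_3d(f, h, iterations):
    """Jacobi sweeps for -laplace(u) = f on a uniform grid, u = 0 on the boundary."""
    u = np.zeros_like(f)
    for _ in range(iterations):
        new = u.copy()
        # Each interior point becomes the average of its 6 neighbours plus the source term.
        new[1:-1, 1:-1, 1:-1] = (
            u[2:, 1:-1, 1:-1] + u[:-2, 1:-1, 1:-1]
            + u[1:-1, 2:, 1:-1] + u[1:-1, :-2, 1:-1]
            + u[1:-1, 1:-1, 2:] + u[1:-1, 1:-1, :-2]
            + h * h * f[1:-1, 1:-1, 1:-1]
        ) / 6.0
        u = new
    return u


def residual_norm(u, f, h):
    """2-norm of the residual f + laplace(u) over interior points."""
    lap = (
        u[2:, 1:-1, 1:-1] + u[:-2, 1:-1, 1:-1]
        + u[1:-1, 2:, 1:-1] + u[1:-1, :-2, 1:-1]
        + u[1:-1, 1:-1, 2:] + u[1:-1, 1:-1, :-2]
        - 6.0 * u[1:-1, 1:-1, 1:-1]
    ) / (h * h)
    r = f[1:-1, 1:-1, 1:-1] + lap
    return float(np.sqrt(np.sum(r * r)))


n = 16
h = 1.0 / (n - 1)
f = np.ones((n, n, n))
u_short = jacobi_poisson_3d(f, h, 10)
u_long = jacobi_poisson_3d(f, h, 200)
print(residual_norm(u_short, f, h), residual_norm(u_long, f, h))
```

Each sweep reads several neighbouring values per update and does very little arithmetic on them, which is why stencil codes like this tend to track memory bandwidth rather than peak flops.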

![](_page_11_Figure_3.jpeg)


# "Fugaku" CPU Performance Evaluation (3/3)


- WRF: Weather Research and Forecasting model
  - Vectorizing loops including IF-constructs is key optimization
  - Source code tuning using directives promotes compiler optimizations
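The point about IF-constructs can be illustrated with a small NumPy analogy: a loop with a branch is rewritten as a mask plus a select, which is conceptually what predicated vector instructions such as SVE allow a compiler to do. The "saturation adjustment" below is a hypothetical stand-in for a weather-physics loop, not code from WRF.

```python
import numpy as np


def clamp_to_saturation_scalar(q, q_sat):
    """Element-by-element loop containing an IF-construct."""
    out = np.empty_like(q)
    for i in range(q.size):
        if q[i] > q_sat[i]:
            out[i] = q_sat[i]   # excess is removed
        else:
            out[i] = q[i]
    return out


def clamp_to_saturation_masked(q, q_sat):
    """Same logic expressed as a mask (predicate) and a vector select."""
    mask = q > q_sat
    return np.where(mask, q_sat, q)


rng = np.random.default_rng(0)
q = rng.random(1000)
q_sat = rng.random(1000)
same = np.array_equal(clamp_to_saturation_scalar(q, q_sat),
                      clamp_to_saturation_masked(q, q_sat))
print("masked version matches scalar loop:", same)
```

In the masked form both branches are evaluated for every element and the predicate picks the result, so the loop body has no control flow left to block vectorization.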

![](_page_12_Figure_5.jpeg)

WRF v3.8.1 (48-hour, 12 km, CONUS) on 48 cores

# A64FX: Tofu interconnect D

![](_page_13_Picture_1.jpeg)

- Integrated w/ rich resources
  - Increased TNIs achieve higher injection BW & flexible comm. patterns
  - Increased barrier resources allow flexible collective comm. algorithms
- Memory bypassing achieves low latency
  - Direct descriptor & cache injection

|                              | TofuD        |
|------------------------------|--------------|
| Port bandwidth (spec)        | 6.8 GB/s     |
| Injection bandwidth (spec)   | 40.8 GB/s    |
| Put throughput (measured)    | 6.35 GB/s    |
| Ping-pong latency (measured) | 0.49~0.54 µs |
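Two quick ratios help in reading the table. The sketch below uses only the numbers in the table; interpreting the first ratio as "six interfaces injecting at once" is an inference from the arithmetic, not something the table states.

```python
PORT_BW_GB_S = 6.8          # spec, per port
INJECTION_BW_GB_S = 40.8    # spec, per node
PUT_THROUGHPUT_GB_S = 6.35  # measured, single put stream

# How many ports' worth of bandwidth a node can inject at once.
ports_worth_of_injection = INJECTION_BW_GB_S / PORT_BW_GB_S

# Fraction of the per-port spec achieved by a measured put.
put_efficiency = PUT_THROUGHPUT_GB_S / PORT_BW_GB_S

print(f"injection BW / port BW = {ports_worth_of_injection:.1f}")
print(f"put efficiency         = {put_efficiency:.1%}")
```

The injection bandwidth is six times the port bandwidth, and the measured put throughput reaches roughly 93% of the per-port specification.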

![](_page_13_Figure_8.jpeg)

![](_page_14_Picture_0.jpeg)

- 3-level hierarchical storage
  - 1st Layer: GFS Cache + Temp FS (25~30 PB NVMe)
  - 2nd Layer: Lustre-based GFS (a few hundred PB HDD)
  - 3rd Layer: Off-site Cloud Storage
- Full Machine Spec
  - > 150,000 nodes
  - ~8 million high-perf. Arm v8.2 cores
  - > 150 PB/s memory BW
  - Tofu-D: 10x global IDC traffic @ 60 Pbps
  - ~10,000 I/O fabric endpoints
  - > 400 racks
  - ~40 MegaWatts machine + IDC, PUE ~1.1, high-pressure DLC
  - NRE pays off: ~= 15~30 million state-of-the-art competing CPU cores for HPC workloads (both dense and sparse problems)

![](_page_14_Figure_13.jpeg)


## Fugaku Performance Estimate on 9 Co-Design Target Apps

![](_page_15_Picture_1.jpeg)


### Performance target goal

- ✓ 100 times faster than K for some applications (tuning included)
- ✓ 30 to 40 MW power consumption

### Peak performance to be achieved

|                            | PostK                  | K            |
|----------------------------|------------------------|--------------|
| Peak DP (double precision) | >400+ Pflops (34x +)   | 11.3 Pflops  |
| Peak SP (single precision) | >800+ Pflops (70x +)   | 11.3 Pflops  |
| Peak HP (half precision)   | >1600+ Pflops (141x +) |              |
| Total memory bandwidth     | >150+ PB/sec (29x +)   | 5,184 TB/sec |
Geometric Mean of Performance Speedup of the 9 Target Applications over the K-Computer

> 37x+

As of 2019/05/14

| Category | Priority Issue Area | Performance Speedup over K | Application | Brief description |
|----------|---------------------|----------------------------|-------------|-------------------|
| Health longevity | 1. Innovative computing infrastructure for drug discovery | 125x + | GENESIS | MD for proteins |
| Health longevity | 2. Personalized and preventive medicine using big data | 8x + | Genomon | Genome processing (Genome alignment) |
| Disaster prevention and Environment | 3. Integrated simulation systems induced by earthquake and tsunami | 45x + | GAMERA | Earthquake simulator (FEM in unstructured & structured grid) |
| Disaster prevention and Environment | 4. Meteorological and global environmental prediction using big data | 120x + | NICAM+LETKF | Weather prediction system using Big data (structured grid stencil & ensemble Kalman filter) |
| Energy issue | 5. New technologies for energy creation, conversion / storage, and use | 40x + | NTChem | Molecular electronic simulation (structure calculation) |
| Energy issue | 6. Accelerated development of innovative clean energy systems | 35x + | Adventure | Computational Mechanics System for Large Scale Analysis and Design (unstructured grid) |
| Industrial competitiveness enhancement | 7. Creation of new functional devices and high-performance materials | 30x + | RSDFT | Ab-initio simulation (density functional theory) |
| Industrial competitiveness enhancement | 8. Development of innovative design and production processes | 25x + | FFB | Large Eddy Simulation (unstructured grid) |
| Basic science | 9. Elucidation of the fundamental laws and evolution of the universe | 25x + | LQCD | Lattice QCD simulation (structured grid Monte Carlo) |
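The geometric mean quoted for the nine target applications can be checked directly from the per-application speedups in the table. The sketch below treats each "Nx +" entry as exactly N, which is an assumption since the slide gives them as lower bounds.

```python
import math

# Speedups over K from the table, with the "+" dropped.
speedups = {
    "GENESIS": 125,
    "Genomon": 8,
    "GAMERA": 45,
    "NICAM+LETKF": 120,
    "NTChem": 40,
    "Adventure": 35,
    "RSDFT": 30,
    "FFB": 25,
    "LQCD": 25,
}


def geometric_mean(values):
    """n-th root of the product, computed via logs for numerical stability."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


gm = geometric_mean(speedups.values())
print(f"geometric mean speedup ~ {gm:.1f}x")
```

The result lands between 37x and 38x, consistent with the "> 37x+" figure. A geometric mean is the natural choice here because it keeps a single large speedup (125x) from dominating the summary.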

![](_page_15_Picture_10.jpeg)

![](_page_16_Picture_0.jpeg)

## **Fugaku Programming Environment**

![](_page_16_Picture_2.jpeg)

- Programming Languages and Compilers provided by Fujitsu
  - Fortran2008 & Fortran2018 subset
  - C11 & GNU and Clang extensions
  - C++14 & C++17 subset and GNU and Clang extensions
  - OpenMP 4.5 & OpenMP 5.0 subset
  - Java
- Parallel Programming Language & Domain Specific Library provided by RIKEN
  - XcalableMP
  - FDPS (Framework for Developing Particle Simulator)
- Process/Thread Library provided by RIKEN
  - PiP (Process in Process)

- Script Languages provided by Linux distributor
  - E.g., Python+NumPy, SciPy
- Communication Libraries
  - MPI 3.1 & MPI 4.0 subset
    - Open MPI base (Fujitsu), MPICH (RIKEN)
  - Low-level Communication Libraries
    - uTofu (Fujitsu), LLC (RIKEN)
- File I/O Libraries provided by RIKEN
  - Lustre
  - pnetCDF, DTF, FTAR
- Math Libraries
  - BLAS, LAPACK, ScaLAPACK, SSL II (Fujitsu)
  - EigenEXA, Batched BLAS (RIKEN)
- Programming Tools provided by Fujitsu
  - Profiler, Debugger, GUI
- NEW: Containers (Singularity) and other Cloud APIs
- NEW: AI software stacks (w/ARM)
- NEW: DoE Spack Package Manager

![](_page_17_Picture_0.jpeg)

## **Fugaku Cloud Strategy**

![](_page_17_Picture_2.jpeg)


- Industry use of Fugaku via intermediary cloud SaaS vendors, Fugaku as IaaS
  - Industry Users 1-3 -> HPC SaaS Providers 1-3 -> Fugaku (富岳) and other cloud IaaS (Commercial Cloud Vendors 1-3)
- A64fx and other Fugaku technology being incorporated into the cloud
  - Fujitsu A64FX: extreme performance advantage
  - Cloud workload becoming HPC (including AI): significant performance advantage
  - Millions of units shipped to cloud
  - Rebirth of IP
- Various cloud service APIs for HPC: KVM/Singularity, Kubernetes, etc.

![](_page_18_Picture_0.jpeg)


## A64fx in upcoming Stony Brook Cray System


![](_page_18_Picture_2.jpeg)

National Science Foundation WHERE DISCOVERIES BEGIN

Award Abstract #1927880

Category II: Ookami: A high-productivity path to frontiers of scientific discovery enabled by exascale system technologies

| Field                   | Value                                                                                                         |
|-------------------------|---------------------------------------------------------------------------------------------------------------|
| NSF Org                 | OAC, Office of Advanced Cyberinfrastructure (OAC)                                                             |
| Initial Amendment Date  | July 11, 2019                                                                                                 |
| Latest Amendment Date   | August 29, 2019                                                                                               |
| Award Number            | 1927880                                                                                                       |
| Award Instrument        | Cooperative Agreement                                                                                         |
| Program Manager         | Robert Chadduck, OAC Office of Advanced Cyberinfrastructure (OAC), CSE Direct For Computer & Info Scie & Enginr |
| Start Date              | October 1, 2019                                                                                               |
| End Date                | September 30, 2024 (Estimated)                                                                                |
| Awarded Amount to Date  | \$2,780,373.00                                                                                                |

Investigator(s): Robert Harrison robert.harrison@stonybrook.edu (Principal Investigator) Barbara Chapman (Co-Principal Investigator) Matthew Jones (Co-Principal Investigator) Alan Calder (Co-Principal Investigator)


Sponsor: SUNY at Stony Brook WEST 5510 FRK MEL LIB Stony Brook, NY 11794-0001 (631)632-9949

NSF Program(s): Innovative HPC

Program Reference Code(s):

Program Element Code(s): 7619

#### ABSTRACT

The State University of New York proposes to procure and operate for at least four years the first computer outside of Japan with the A64fx processor developed by Fujitsu for the Japanese path to exascale computing (i.e., computers capable of 10^18 operations per second). The ARM-based, multi-core, 512-bit SIMD-vector processor with ultrahigh-bandwidth memory promises to retain familiar and successful programming models while achieving very high performance for a wide range of applications including simulation and big data. The testbed significantly extends current NSF-sponsored HPC technologies and will enable the community to evaluate and demonstrate the potential of this technology for deployment in multiple settings. Through integration with NSF's Extreme Science and Engineering Discovery Environment (XSEDE), the system will be widely accessible and fully leverage existing cyber infrastructure including the XDMoD monitoring system.

What does this mean for science? Compared with the best CPUs anticipated during the deployment period, A64fx offers 2-4x better performance on memory-intensive applications such as sparse-matrix solvers found in many engineering and physics codes.

Cray ARM-based 'Ookami' to Serve as Testbed for Computational Studies at Stony Brook

August 16, 2019

STONY BROOK, N.Y., August 16, 2019 – A \$5 million grant from the National Science Foundation (NSF) to the Institute of Advanced Computational Science (IACS) at Stony Brook University will enable researchers nationwide to test future supercomputing technologies and advance computational and data-driven research on the world's most pressing challenges.

Serving as a testbed for advanced computer technologies, the Ookami system is expected to signal a new generation of high-speed U.S. supercomputers. Using a Cray ARM-based system, Ookami will deliver remarkably high performance for scientific applications, in part due to its blazing-fast memory. Robert J. Harrison, PhD, professor of applied mathematics and statistics and director of IACS, expects that these advanced technologies will enable researchers to more quickly and effectively conduct computational investigations. The project is led by IACS faculty in partnership with co-PI Matt Jones, PhD at the State University of New York at Buffalo, whose team will lead the capture of detailed operational metrics and provision of extensive

![](_page_18_Picture_16.jpeg)

## Ookami

- Test bed for NSF researchers
  - First planned deployment of the Post-K processor outside of Japan
- Collaboration with Riken CCS
  - http://www.riken.jp/en/research/labs/r-ccs/
- Installation 3Q 2020
- \$5M award NSF OAC 1942140 for purchase and operations

| Node      |             |
|-----------|-------------|
| Processor | A64FX       |
| #Cores    | 48+4        |
| Peak DP   | 2.76 TOP/s  |
| Peak INT8 | 22.08 TOP/s |
| Memory    | 32GB@1TB/s  |
| System    |             |
| #Nodes    | 176         |
| Peak DP   | 486 TOP/s   |
| Peak INT8 | 3886 TOP/s  |
| Memory    | 5.6 TB      |
| Disk      | 0.5 PB      |
| Comms     | IB HDR-100  |
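The system rows of the table are the node rows multiplied by the node count. The sketch below reproduces that scaling using only the values in the table; the variable names are illustrative.

```python
NODES = 176
NODE_PEAK_DP_TOPS = 2.76
NODE_PEAK_INT8_TOPS = 22.08
NODE_MEMORY_GB = 32

system_peak_dp_tops = NODES * NODE_PEAK_DP_TOPS
system_peak_int8_tops = NODES * NODE_PEAK_INT8_TOPS
system_memory_tb = NODES * NODE_MEMORY_GB / 1000.0   # decimal TB

print(f"system peak DP   ~ {system_peak_dp_tops:.0f} TOP/s")
print(f"system peak INT8 ~ {system_peak_int8_tops:.0f} TOP/s")
print(f"system memory    ~ {system_memory_tb:.1f} TB")
print(f"INT8 / DP ratio  = {NODE_PEAK_INT8_TOPS / NODE_PEAK_DP_TOPS:.0f}x")
```

The INT8 peak is eight times the DP peak, matching the ratio of element widths (64 bits versus 8 bits) in a fixed-width vector unit.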

## Pursuing Convergence of HPC & AI (1)

![](_page_19_Picture_1.jpeg)

- Acceleration of Simulation (first principles methods) with AI (empirical method): AI for HPC
  - Interpolation & extrapolation of long trajectory MD
  - Reducing parameter space on Pareto optimization of results
  - Adjusting convergence parameters for iterative methods etc.
  - AI replacing simulation when exact physical models are unclear, or excessively costly to compute
- Acceleration of AI with HPC: HPC for AI
  - HPC processing of training data: data cleansing
  - Acceleration of (parallel) training: deeper networks, bigger training sets, complicated networks, high dimensional data...
  - Acceleration of inference: above + real time streaming data
  - Various modern training algorithms: Reinforcement learning, GAN, Dilated Convolution, etc.

![](_page_20_Picture_0.jpeg)

## **R-CCS Pursuit of Convergence of HPC & AI (2)**

![](_page_20_Picture_2.jpeg)

- Acceleration of Simulation (first principles methods) with AI (empirical method) : AI for HPC
  - *Most* R-CCS research & operations teams investigating use of AI for HPC
  - The 9 priority co-design issue area teams also have extensive plans
  - Essential to deploy AI/DL frameworks efficiently & at scale on A64fx/Fugaku
- Acceleration of AI with HPC: HPC for AI
  - New teams instituted in Science of Computing to accelerate AI
    - Kento Sato (High Performance Big Data Systems)
    - Satoshi Matsuoka (High Performance AI Systems)
    - Masaaki Kondo (Next Gen High Performance Architecture)
  - NEW: Optimized AI/DL Library via port of DNNL (MKL-DNN)
    - Arm Research + Fujitsu Labs + Riken R-CCS + others
    - First public ver. by Mar 2020, TensorFlow, PyTorch, Chainer, etc.

![](_page_21_Picture_0.jpeg)

### Large Scale simulation and AI coming together [Ichimura et al., Univ. of Tokyo, IEEE/ACM SC17 Best Poster, 2018 Gordon Bell Finalist]

![](_page_21_Picture_2.jpeg)

130 billion degrees of freedom earthquake simulation of entire Tokyo on K-Computer (2018 ACM Gordon Bell Prize Finalist; SC16, 17 Best Poster)

![](_page_21_Figure_4.jpeg)

![](_page_21_Figure_5.jpeg)

Figure label: generate candidate soft soil structure

## **Convergence of HPC & AI in Modsim**

![](_page_22_Picture_1.jpeg)

- Performance modeling and prediction with AI (empirical method): AI for modsim of HPC systems
  - Cf. GEM5 simulation: first principle perf. modeling
  - AI interpolation & extrapolation of system performance
  - Objective categorization of benchmarks
  - Optimizing system performance using machine learning
- Performance modeling of AI, esp. machine learning: HPC modsim techniques for AI
  - Perf. modeling of Deep Neural Networks on HPC machines
  - Large scaling of Deep Learning on large scale machines
  - Optimization of AI algorithms using perf modeling
  - Architectural survey and modeling of future AI systems

### Deep Learning Meets HPC: 6 orders of magnitude compute increase in 5 years [Slide Courtesy Rick Stevens @ ANL]

Exascale Needs for Deep Learning

- Automated Model Discovery
- Hyper Parameter Optimization
- Uncertainty Quantification
- Flexible Ensembles
- Cross-Study Model Transfer
- Data Augmentation
- Synthetic Data Generation
- Reinforcement Learning

![](_page_23_Figure_10.jpeg)

# 4 Layers of Parallelism in DNN Training

- Hyper Parameter Search
  - Searching optimal network configs & parameters
  - Parallel search, massive parallelism required
- Data Parallelism
  - Copy the network to compute nodes, feed different batch data, average => network reduction bound
  - TOFU: Extremely strong reduction, x6 EDR Infiniband
- Model Parallelism (domain decomposition)
  - Split and parallelize the layer calculations in propagation
  - Low latency required (bad for GPU) -> strong latency tolerant cores + low latency TOFU network
- Intra-Chip ILP, Vector and other low level Parallelism
  - Parallelize the convolution operations etc.
  - SVE FP16+INT8 vectorization support + extremely high memory bandwidth w/ HBM2
- Post-K could become world's biggest & fastest platform for DNN training!
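The data-parallel layer described above ("copy the network, feed different batch data, average") can be sketched on a toy model. This is a single-process illustration only: the linear least-squares model, the shard count, and the use of `np.mean` as a stand-in for a network allreduce are all assumptions made for the example.

```python
import numpy as np


def mse_gradient(w, x, y):
    """Gradient of 0.5 * mean((x @ w - y)**2) with respect to w."""
    residual = x @ w - y
    return x.T @ residual / len(y)


def data_parallel_gradient(w, x, y, num_workers):
    """Each 'worker' gets an equal shard of the batch; gradients are then averaged."""
    x_shards = np.split(x, num_workers)
    y_shards = np.split(y, num_workers)
    local_grads = [mse_gradient(w, xs, ys) for xs, ys in zip(x_shards, y_shards)]
    # On a real machine this average would be an allreduce over the interconnect.
    return np.mean(local_grads, axis=0)


rng = np.random.default_rng(0)
x = rng.normal(size=(64, 5))
y = rng.normal(size=64)
w = rng.normal(size=5)

full = mse_gradient(w, x, y)
parallel = data_parallel_gradient(w, x, y, num_workers=8)
print("max difference:", float(np.max(np.abs(full - parallel))))
```

With equal-sized shards the averaged gradient equals the full-batch gradient up to rounding, so the only extra cost of data parallelism is the reduction itself, which is why the slide calls this layer "network reduction bound".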

![](_page_24_Figure_14.jpeg)

![](_page_24_Figure_15.jpeg)

Massive amount of total parallelism, only possible via supercomputing

![](_page_25_Picture_2.jpeg)

**Fugaku Processor**

- High perf FP16 & INT8
- High mem BW for convolution
- Built-in scalable Tofu network

High Performance DNN Convolution (FFT+Winograd+GEMM)

## **Unprecedented DL scalability**

High Performance and Ultra-Scalable Network for massive scaling of model & data parallelism

![](_page_25_Figure_6.jpeg)

# A64FX technologies: Core performance

- High calc. throughput of Fujitsu's original CPU core w/ SVE
  - 512-bit wide SIMD x 2 pipelines and new integer functions

![](_page_26_Figure_3.jpeg)

![](_page_27_Picture_0.jpeg)

## "Isopower" Comparison with the Best GPU

![](_page_27_Picture_2.jpeg)

![](_page_27_Picture_3.jpeg)

![](_page_27_Picture_4.jpeg)

|                         | NVIDIA Volta V100                                    | Fujitsu A64fx (2 A0 chip nodes)        |
|-------------------------|------------------------------------------------------|----------------------------------------|
| Power                   | 400 W (incl. CPUs, HCAs DGX-1)                       | "similar"                              |
| Vectorized MACC Formats | FP 64/32/16, INT 32(?)                               | FP 64/32/16, INT 32/16/8 w/ INT32 MACC |
| Multi-node Linpack      | 5.9 TF / chip (DGX-1)                                | > 5.3 TF / 2 chip blade                |
| Flops/W Linpack         | 15.1 GFlops/W (DGX-2)                                | > 15 GFlops/W                          |
| Stream Triad            | 855 GB/s                                             | 1.68 TB/s                              |
| Memory Capacity         | 16 / 32 GB                                           | 64 GB (32 x 2)                         |
| AI Performance          | 125 (peak) / ~95 (measured) Tflops FP16 Tensor Cores | ~48 TOPS (INT8 MACC peak)              |
| Price                   | ~\$11,000 (SXM2 32GB board only)                     | Talk to Fujitsu 😌                     |

Large Scale Public AI Infrastructures in Japan

| System | Deployed | Purpose | AI Processor | Inference Peak Perf. | Training Peak Perf. (FP32/FP16) | Top500 Perf/Rank | Green500 Perf/Rank |
|--------|----------|---------|--------------|----------------------|---------------------------------|------------------|--------------------|
| Tokyo Tech. TSUBAME3 | July 2017 | HPC + AI, Public | NVIDIA P100 x 2160 | 45.8 PF (FP16) | 22.9 PF / 45.8 PF | 8.125 PF #22 | 13.704 GF/W #5 |
| U-Tokyo Reedbush-H/L | Apr. 2018 (update) | HPC + AI, Public | NVIDIA P100 x 496 | 10.71 PF (FP16) | 5.36 PF / 10.71 PF | (Unranked) | (Unranked) |
| U-Kyushu ITO-B | Oct. 2017 | HPC + AI, Public | NVIDIA P100 x 512 | 11.1 PF (FP16) | 5.53 PF / 11.1 PF | (Unranked) | (Unranked) |
| AIST-AIRC AICC | Oct. 2017 | AI, Lab Only | NVIDIA P100 x 400 | 8.64 PF (FP16) | 4.32 PF / 8.64 PF | 0.961 PF #446 | 12.681 GF/W #7 |
| Riken-AIP Raiden | Apr. 2018 (update) | AI, Lab Only | NVIDIA V100 x 432 | 54.0 PF (FP16) | 6.40 PF / 54.0 PF | 1.213 PF #280 | 11.363 GF/W #10 |
| AIST-AIRC ABCI | Aug. 2018 | AI, Public | NVIDIA V100 x 4352 | 544.0 PF (FP16) | 65.3 PF / 544.0 PF | 19.88 PF #7 | 14.423 GF/W #4 |
| NICT (unnamed) | Summer 2019 | AI, Lab Only | NVIDIA V100 x approx. 1700 | ~210 PF (FP16) | ~26 PF / ~210 PF | ???? | ???? |
| Cf. US ORNL Summit | Summer 2018 | HPC + AI, Public | NVIDIA V100 x 27,000 | 3,375 PF (FP16) | 405 PF / 3,375 PF | 143.5 PF #1 | 14.668 GF/W #3 |

Slide annotation: Inference 838.5 PF, Training 36.9 PF; vs. Summit: Inf. 1/4, Train. 1/5

## Predicting Statistics of Asynchronous SGD Parameters for a Large-Scale Distributed Deep Learning System on GPU Supercomputers

- Background: In large-scale Asynchronous Stochastic Gradient Descent (ASGD), mini-batch size and gradient staleness tend to be large and unpredictable, which increases the error of the trained DNN
- Proposal: We propose an empirical performance model for the ASGD deep learning system SPRINT which considers the probability distribution of mini-batch size and staleness
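To make "gradient staleness" concrete, the toy simulation below counts, for each applied gradient, how many other updates reached the shared parameters between the moment a worker read them and the moment its gradient was applied. It is not the SPRINT performance model from the paper; the exponential compute times, the event-queue structure, and the function name are assumptions chosen for illustration.

```python
import heapq
import random


def simulate_staleness(num_workers, num_updates, seed=0):
    """Return the staleness of each applied gradient in a toy ASGD event simulation."""
    rng = random.Random(seed)
    version = 0          # number of updates applied to the shared parameters
    in_flight = []       # heap of (finish_time, worker_id, version_read)
    for worker in range(num_workers):
        heapq.heappush(in_flight, (rng.expovariate(1.0), worker, version))

    staleness = []
    while len(staleness) < num_updates:
        finish_time, worker, read_version = heapq.heappop(in_flight)
        # Updates applied by others since this worker read the parameters.
        staleness.append(version - read_version)
        version += 1
        # The worker immediately reads the new parameters and starts again.
        next_finish = finish_time + rng.expovariate(1.0)
        heapq.heappush(in_flight, (next_finish, worker, version))
    return staleness


samples = simulate_staleness(num_workers=8, num_updates=1000)
mean_staleness = sum(samples) / len(samples)
print(f"mean staleness with 8 workers: {mean_staleness:.2f}")
print(f"max staleness observed:        {max(samples)}")
```

In this toy setting the mean staleness sits just below the number of other workers, while individual values vary widely, which is the kind of distribution the proposal sets out to predict.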

![](_page_29_Figure_3.jpeg)

 Yosuke Oyama, Akihiro Nomura, Ikuro Sato, Hiroki Nishimura, Yukimasa Tamatsu, and Satoshi Matsuoka, "Predicting Statistics of Asynchronous SGD Parameters for a Large-Scale Distributed Deep Learning System on GPU Supercomputers", in proceedings of 2016 IEEE International Conference on Big Data (IEEE BigData 2016), Washington

## **Pushing the Limits for 2D Convolution Computation On GPUs**

#### [To appear SC19]

### Background of 2D convolution

- Convolution on CUDA-enabled GPUs is essential for Deep Learning workload
- A typical memory-bound problem with regular access

![](_page_30_Figure_5.jpeg)

[1] Peng Chen, Mohamed Wahib, Shinichiro Takizawa, Satoshi Matsuoka. Pushing the Limits for 2D Convolution Computation On CUDA-enabled GPUs. 163rd High Performance Computing Research Meeting, Mar. 2018.

Also applicable to vector processors with shuffle ops, e.g. A64FX

## Applying Loop Transformations/Algorithm Optimizations to Deep Learning Kernels on cuDNN [1] and ONNX [2]

- **Motivation**: How can we use faster convolution algorithms (FFT and Winograd) with a small workspace memory for CNNs?
- **Proposal**: μ-cuDNN, a wrapper library for cuDNN, which applies loop splitting to convolution kernels based on DP and integer LP techniques
- **Results**: μ-cuDNN achieves significant speedups at multiple levels of deep learning workloads: an average speedup of 1.73x for DeepBench's 3×3 kernels and a 1.45x speedup for AlexNet on Tesla V100
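The loop splitting mentioned above relies on the fact that a convolution over a mini-batch can be computed one micro-batch at a time without changing the result. The NumPy sketch below demonstrates that equivalence with a plain direct convolution; it does not call cuDNN, and the function names and array shapes are assumptions for the example.

```python
import numpy as np


def conv2d_batch(x, k):
    """Direct 'valid' 2D cross-correlation; x has shape (N, H, W), k has shape (R, S)."""
    n, h, w = x.shape
    r, s = k.shape
    out = np.zeros((n, h - r + 1, w - s + 1))
    for i in range(r):
        for j in range(s):
            out += k[i, j] * x[:, i:i + h - r + 1, j:j + w - s + 1]
    return out


def conv2d_micro_batched(x, k, micro_batch):
    """Split the batch dimension into micro-batches and concatenate the results."""
    parts = [conv2d_batch(x[b:b + micro_batch], k)
             for b in range(0, x.shape[0], micro_batch)]
    return np.concatenate(parts, axis=0)


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 12, 12))
k = rng.normal(size=(3, 3))

full = conv2d_batch(x, k)
split = conv2d_micro_batched(x, k, micro_batch=3)
print("results identical:", bool(np.allclose(full, split)))
```

Because each micro-batch is smaller, a workspace-hungry algorithm such as FFT or Winograd may fit in memory for the micro-batch even when it would not fit for the whole mini-batch.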

![](_page_31_Figure_4.jpeg)

#### Convolution algorithms supported by cuDNN

- Motivation: How can we extend µ-cuDNN to support arbitrary types of layers, frameworks and loop dimensions?
- Proposal: Apply graph transformations on the top of the ONNX (Open Neural Network eXchange) format
- Results: 1.41x speedup for AlexNet on Chainer with graph transformation alone, and a further 1.2x average speedup for DeepBench's 3x3 kernels from multi-level splitting

![](_page_31_Figure_9.jpeg)

#### AlexNet before/after the transformation

[1] Yosuke Oyama, Tal Ben-Nun, Torsten Hoefler, Satoshi Matsuoka, Accelerating Deep Learning Frameworks with Micro-batches, In proceedings of IEEE Cluster 2018, Belfast UK, Sep. 10-13, 2018.

[2] (To appear) Yosuke Oyama, Tal Ben-Nun, Torsten Hoefler, Satoshi Matsuoka, Applying Loop Transformations to Deep Neural Networks on ONNX, IPSJ SIG Technical Reports, 2019-HPC-170. In Summer United Workshops on Parallel, Distributed and Cooperative Processing (SWoPP2019), Jul. 24-26, 2019.

## μ-cuDNN: Accelerating Deep Learning Frameworks with Micro-batches [1]

- Motivation: How can we use faster convolution algorithms (ex. FFT and Winograd) with a small workspace memory for Convolutional Neural Networks (CNNs)?
- Proposal: μ-cuDNN, a wrapper library for the math kernel library cuDNN which is applicable for most deep learning frameworks
  - μ-cuDNN applies loop splitting by using dynamic programming and integer linear programming techniques
- Results: µ-cuDNN achieves significant speedups in multiple levels of deep learning workloads
  - Average speedups of 1.16x and 1.73x for DeepBench's 3×3 kernels on Tesla P100 and V100, respectively
  - 1.45x speedup (1.60x w.r.t. convolutions alone) for AlexNet on V100
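The loop-splitting idea can be sketched as a small dynamic program. This is an illustration under stated assumptions, not the μ-cuDNN implementation: `best_time` is a hypothetical table giving, for each micro-batch size, the time of the fastest convolution algorithm whose workspace fits the memory limit (in practice such numbers would come from benchmarking). The DP then picks the split of the mini-batch that minimizes the summed kernel time.

```python
def split_minibatch(batch_size, best_time):
    """Return (total_time, micro_batch_sizes) minimizing the summed kernel time.

    best_time: dict mapping micro-batch size -> time of the fastest algorithm
    that fits the workspace limit at that size (hypothetical measurements).
    """
    INF = float("inf")
    cost = [0.0] + [INF] * batch_size   # cost[n]: best time to process n samples
    choice = [0] * (batch_size + 1)     # choice[n]: last micro-batch size used
    for n in range(1, batch_size + 1):
        for b, t in best_time.items():
            if b <= n and cost[n - b] + t < cost[n]:
                cost[n] = cost[n - b] + t
                choice[n] = b
    if cost[batch_size] == INF:
        raise ValueError("batch size cannot be composed from the given sizes")

    # Walk back through the choices to recover the split.
    sizes, n = [], batch_size
    while n > 0:
        sizes.append(choice[n])
        n -= choice[n]
    return cost[batch_size], sizes


if __name__ == "__main__":
    # Made-up timings: at size 4 only a slow algorithm fits the workspace.
    print(split_minibatch(4, {1: 1.0, 2: 1.5, 4: 4.0}))
```

In this made-up example, two micro-batches of size 2 beat one undivided batch of size 4, which is the effect the micro-batching exploits when a fast algorithm only fits the workspace at smaller sizes.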

![](_page_32_Figure_7.jpeg)

[1] Yosuke Oyama, Tal Ben-Nun, Torsten Hoefler, Satoshi Matsuoka, Accelerating Deep Learning Frameworks with Micro-batches, In proceedings of IEEE Cluster 2018, Belfast UK, Sep. 10-13, 2018.

## Training ImageNet in Minutes

Rio Yokota, Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Hiroki Naganuma, Shun Iwase, Kaku Linsho, Satoshi Matsuoka (Tokyo Institute of Technology/Riken)

+Akira Naruse (NVIDIA)

![](_page_33_Figure_3.jpeg)

![](_page_33_Figure_4.jpeg)

![](_page_33_Figure_5.jpeg)

| Organization                  | #GPU  | Time    |
|-------------------------------|-------|---------|
| Facebook                      | 512   | 30 min  |
| Preferred Networks            | 1024  | 15 min  |
| UC Berkeley                   | 2048  | 14 min  |
| Tencent                       | 2048  | 6.6 min |
| Sony (ABCI)                   | ~3000 | 3.7 min |
| Google (TPU/GCC)              | 1024  | 2.2 min |
| TokyoTech/NVIDIA/Riken (ABCI) | 4096  | ? min   |

Source: Ben-Nun & Hoefler https://arxiv.org/pdf/

Accelerating DL with 2<sup>nd</sup> Order Optimization and Distributed Training [Tsuji et al.] => Towards 100,000 nodes scalability

- Background
  - Large complexity of DL training.
  - Limits of data-parallel distributed training.
  - → How to accelerate the training further?
- Method
  - Integration of two techniques: 1) data- and model-parallel distributed training, and 2) K-FAC, an approximate 2<sup>nd</sup>-order optimization.
- Evaluation and Analysis
  - Experiments on ABCI supercomputer.
  - Up to 128K batch size w/o accuracy degradation.
  - Finish training in 35 epochs/10 min/1024 GPUs in 32K batch size.
  - Performance tuning / modeling.

![](_page_34_Figure_12.jpeg)

Design of our hybrid-parallel distributed K-FAC

|              | Batch size | # Iterations | Accuracy |
|--------------|------------|--------------|----------|
| Goyal et al. | 8K         | 14076        | 76.3%    |
| Akiba et al. | 32K        | 3519         | 75.4%    |
| Ying et al.  | 64K        | 1760         | 75.2%    |
| Ours         | 128K       | 978          | 75.0%    |

Comparison with related work (ImageNet/ResNet-50)
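The iteration counts in the table follow from epochs, dataset size, and batch size. The check below is a sketch under assumptions: the ImageNet-1k training set size of 1,281,167 images is used, a 90-epoch schedule is assumed for the three related works (the slide does not state it), and 100 epochs is used for the 128K run, as quoted in the arXiv abstract of this work.

```python
import math

IMAGENET_TRAIN_IMAGES = 1_281_167  # ImageNet-1k training set size


def iterations(epochs, batch_size, dataset_size=IMAGENET_TRAIN_IMAGES):
    """Number of optimizer steps needed to run `epochs` passes over the data."""
    return math.ceil(epochs * dataset_size / batch_size)


# (epochs, batch size); 90 epochs is an assumption for the related work,
# 100 epochs for the 128K run is the figure quoted in the abstract.
runs = {
    "Goyal et al.": (90, 8 * 1024),
    "Akiba et al.": (90, 32 * 1024),
    "Ying et al.": (90, 64 * 1024),
    "Ours": (100, 128 * 1024),
}

if __name__ == "__main__":
    for name, (epochs, batch) in runs.items():
        print(f"{name:14s} batch={batch:6d} iterations={iterations(epochs, batch)}")
```

Under these assumptions the computed values match the "# Iterations" column, which shows that the larger batch cuts the number of sequential steps by more than an order of magnitude.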

![](_page_34_Figure_16.jpeg)

Time prediction with the performance model

Osawa et al., Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks, CVPR 2019

# Fast ImageNet Training

![](_page_35_Figure_1.jpeg)


### **Top 10 Arxiv Papers Today in Computer Science**

#### #1. Second-order Optimization Method for Large Mini-batch: Training ResNet-50 on ImageNet in 35 Epochs

Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, Satoshi Matsuoka

Large-scale distributed training of deep neural networks suffer from the generalization gap caused by the increase in the effective mini-batch size. Previous approaches try to solve this problem by varying the learning rate and batch size over epochs and layers, or some ad hoc modification of the batch normalization. We propose an alternative approach using a second-order optimization method that shows similar generalization capability to first-order methods, but converges faster and can handle larger mini-batches. To test our method on a benchmark where highly optimized first-order methods are available as references, we train ResNet-50 on ImageNet. We converged to 75% Top-1 validation accuracy in 35 epochs for mini-batch sizes under 16,384, and achieved 75% even with a mini-batch size of 131,072, which took 100 epochs.

## Our measurements (following DeepBench specs) for Interconnect on Tsubame 2.5 and K computer

### Baidu's Allreduce Benchmark

![](_page_36_Figure_2.jpeg)

### Our "sleepy-allreduce" (modified Intel IMB)

- emulated DL training
- alternating 400MiB Allreduce and 0.1s sleep for compute

![](_page_36_Figure_6.jpeg)

#### Nvidia's Collective Comm. Library (NCCL) Tests

- benchmark GPU collectives for DL frameworks which use NCCL as backend
- Example visualization:

![](_page_36_Figure_10.jpeg)

- Others:
  - Tensorflow's allreduce benchmark (see Tf\_cnn\_benchmarks for details; needs very recent TF)
  - PFN has benchmark/data for ChainerMN / PFN-Proto (see their blogpost; unknown if open-source)

Fig. from Nvidia Devblog

## **Common/Generic Interconnect Benchmarks**

#### Intel MPI Benchmarks (IMB)

- IMB and OSU benchmarks very similar
- testing many P2P, collectives, MPI-I/O functions
- Default comm. size range from 0B→4MiB (power-2 steps; can be modified manually)
- MPI-Allreduce example for K:

![](_page_37_Figure_6.jpeg)

#### OSU Micro-Benchmarks (from Ohio-State Univ)

• MPI collectives relevant for DL training + p2p BMs:

![](_page_37_Figure_9.jpeg)

- SPEC MPI2007 (more application-centric)
- etc.

Optimizing Collective Communication in DL Training (1 of 3)

- Reducing training time of large-scale AI/DL on GPU systems.
  - Time for inference = O(seconds)
  - Time for training = O(hours or days)
- Computation is one of the bottleneck factors
  - Increasing the batch size and learning in **parallel**
    - Training ImageNet in 1 hour [1]
    - Training ImageNet in ~20 minutes [2]
- Communication also can become a bottleneck
  - Due to large message sizes

[1] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, "Accurate, large minibatch SGD: training imagenet in 1 hour," arXiv preprint arXiv:1706.02677, 2017.
[2] Y. You, Z. Zhang, C. Hsieh, J. Demmel, and K. Keutzer, "Imagenet training in minutes," CoRR, abs/

## Optimizing Collective Communication in DL Training (2 of 3) (Challenges of Large Message Size)

![](_page_39_Figure_1.jpeg)

## Example of Image Classification, ImageNet data set

| Model              | AlexNet<br>(2012) | GoogleNet<br>(2015) | ResNet<br>(2016) | DenseNet<br>(2017) |
|--------------------|-------------------|---------------------|------------------|--------------------|
| # of gradients [1] | 61M               | 5.5M                | 1.7 - 60.2M      | 15.3 - 30M         |
| Message size       | 244 MB            | 22MB                | 240 MB           | 120 MB             |

[1] T. Ben-Nun and T. Hoefler, "Demystifying parallel and distributed deep learning: An in-depth concurrency analysis," arXiv preprint arXiv:1802.09941, 2018.
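The message sizes in the table are consistent with exchanging every gradient as a 4-byte value. The short check below assumes single-precision (FP32) gradients and decimal megabytes; both are assumptions, since the slide does not state the data type.

```python
BYTES_PER_FP32 = 4  # assumption: gradients are exchanged in single precision


def message_size_mb(num_gradients, bytes_per_value=BYTES_PER_FP32):
    """Allreduce message size in MB (10^6 bytes) when all gradients are sent."""
    return num_gradients * bytes_per_value / 1e6


# Gradient counts from the table; for ranges the upper bound is used.
models = {
    "AlexNet": 61e6,
    "GoogleNet": 5.5e6,
    "ResNet": 60.2e6,
    "DenseNet": 30e6,
}

if __name__ == "__main__":
    for name, grads in models.items():
        print(f"{name:10s} {message_size_mb(grads):6.1f} MB")
```

Halving the bytes per value (for example with half-precision gradients) would halve each message, which is one reason reduced precision is attractive for communication-bound training.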

## **Optimizing Collective Communication in DL Training (3 of 3)**

**Proposal**: Separate intra-node and inter-node comm. → **multileader hierarchical algorithm** 

- Phase 1: Intra-node reduce to the node leader
- Phase 2: Inter-node all-reduce between leaders
- Phase 3: Intra-node broadcast from the leaders

**Key Results:** 

- Cuts communication time by up to 51%
- Reduces power consumption by up to 32%
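The three phases can be sketched as a sequential simulation on nested lists. This is not the published implementation: the way the vector is split (one contiguous chunk per leader) and all function and parameter names are assumptions made for illustration, and real communication is replaced by plain list arithmetic.

```python
def hierarchical_allreduce(data, leaders_per_node=1):
    """Sum-allreduce simulated on nested lists.

    data[node][rank] is the local vector (list of numbers) of one process.
    Returns the same nested layout with every process holding the global sum.
    """
    length = len(data[0][0])
    # Assumption: each leader owns one contiguous chunk of the vector.
    bounds = [length * j // leaders_per_node for j in range(leaders_per_node + 1)]
    chunks = list(zip(bounds[:-1], bounds[1:]))

    # Phase 1: intra-node reduce; leader j sums chunk j over all local ranks.
    node_partial = [
        [[sum(vec[i] for vec in node) for i in range(lo, hi)] for lo, hi in chunks]
        for node in data
    ]
    # Phase 2: inter-node all-reduce among the j-th leaders of all nodes.
    global_chunks = [
        [sum(node_partial[n][j][i] for n in range(len(data))) for i in range(hi - lo)]
        for j, (lo, hi) in enumerate(chunks)
    ]
    # Phase 3: intra-node broadcast; every rank receives all reduced chunks.
    result = [x for chunk in global_chunks for x in chunk]
    return [[list(result) for _ in node] for node in data]


if __name__ == "__main__":
    data = [[[1, 2, 3, 4], [10, 20, 30, 40]],
            [[100, 200, 300, 400], [1000, 2000, 3000, 4000]]]
    print(hierarchical_allreduce(data, leaders_per_node=2)[0][0])
```

Only the leaders take part in Phase 2, so the number of processes communicating across nodes drops from all ranks to `leaders_per_node` per node.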

![](_page_40_Figure_8.jpeg)

• Worse with inter-node comm.

![](_page_40_Figure_10.jpeg)

Multileader hierarchical algorithm

• Optimized for inter-node comm.

"Efficient MPI-Allreduce for Large-Scale Deep Learning on GPU-Clusters", Truong Thao Nguyen, Mohamed Wahib, Ryousei Takano, Journal of Concurrency and Computation: Practice and Experience (CCPE), Accepted: to appear in 2019.10

#### 1<sup>st</sup> large-scale Prototype – Motivation for HyperX

![](_page_41_Picture_1.jpeg)

![](_page_41_Picture_2.jpeg)

## **Evaluating the HyperX and Summary**

1:1 comparison (as fair as possible) of 672-node 3-level Fat-Tree and 12x8 2D HyperX

- NICs of 1<sup>st</sup> and 2<sup>nd</sup> rail even on same CPU socket
- Given our HW limitations (few "bad" links disabled)

### Wide variety of benchmarks and configurations

- 3x Pure MPI benchmarks
- 9x HPC proxy-apps
- 3x Top500 benchmarks
- 4x routing algorithms (incl. PARX)
- 3x rank-2-node mappings
- 2x execution modes

### **Primary research questions**

- Q1: Will reduced bisection BW (57% for HX vs. ≥100% for FT) impede performance?
- **Q2:** Two mitigation strategies against lack of AR? ( $\rightarrow$  e.g. placement vs. "smart" routing)

![](_page_42_Figure_14.jpeg)

![](_page_42_Figure_15.jpeg)

![](_page_42_Figure_16.jpeg)

Fig.4: Baidu's (DeepBench) Allreduce (4-byte float) scaled  $7 \rightarrow 672$  cn (vs. "Fat-tree / ftree / linear" baseline)

1. Placement mitigation can alleviate bottleneck
2. HyperX w/ PARX routing outperforms FT in HPL
3. Linear good for small node counts/msg. size
4. Random good for DL-relevant msg. size (- 1%)
5. "Smart" routing suffered SW stack issues
6. FT + ftree had bad 448-node corner case

Conclusion: HyperX topology is a promising and cheaper alternative to Fat-Trees (even w/o adaptive routing)!

#### Evaluating the HyperX Topology: A Compelling Alternative to Fat-Trees?[SC19]

![](_page_43_Picture_1.jpeg)

#### Fig.1: HyperX with n-dim. integer *lattice* (*d*<sub>1</sub>,...,*d*<sub>n</sub>) *base structure* fully connected in each dim.
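The lattice structure in Fig.1 can be built in a few lines, and doing so for the 12x8 HyperX reproduces the 57% bisection bandwidth quoted in this talk. The sketch rests on assumptions that are not stated on the slides: 7 compute nodes per switch (672 nodes / 96 switches), a single link of node-injection bandwidth between each switch pair, and only axis-aligned half/half cuts being considered.

```python
from itertools import product


def hyperx_links(dims):
    """Switches and links of a HyperX: fully connected along each dimension."""
    switches = list(product(*(range(d) for d in dims)))
    links = set()
    for s in switches:
        for axis, d in enumerate(dims):
            for v in range(d):
                if v != s[axis]:
                    t = s[:axis] + (v,) + s[axis + 1:]
                    links.add(frozenset((s, t)))
    return switches, links


def bisection_ratio(dims, nodes_per_switch):
    """Smallest (links cut) / (nodes per half) over axis-aligned bisections.

    Returns None if no dimension has an even size.
    """
    switches, links = hyperx_links(dims)
    nodes_per_half = len(switches) * nodes_per_switch / 2
    best = None
    for axis, d in enumerate(dims):
        if d % 2:
            continue  # this sketch only cuts even-sized dimensions in half
        half = d // 2
        crossing = sum(
            1 for link in links if len({sw[axis] < half for sw in link}) == 2
        )
        ratio = crossing / nodes_per_half
        best = ratio if best is None else min(best, ratio)
    return best


if __name__ == "__main__":
    print(f"12x8 HyperX bisection: {bisection_ratio((12, 8), 7):.1%}")
```

Cutting the dimension of size 8 severs 4 x 4 links in each of the 12 columns, i.e. 192 links for 336 nodes per half, while cutting the dimension of size 12 severs 288 links, so the smaller dimension determines the bisection.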

![](_page_43_Figure_3.jpeg)

Figure labels: Fat-Tree (14-ary-3-tree) and 12x8 HyperX


#### Advantages (over FT) assuming adaptive routing (AR)

- **Reduced HW cost** (AOC/switches)  $\rightarrow$  similar perf.
- **Lower latency** when scaling up (less hops)
- **Fits** rack-based **packaging** model for HPC/racks
- Only needs 50% bisection BW to provide 100% throughput for uniform random

#### **Q1:** Will reduced bisection BW (57% for HX vs. $\geq$ 100%) impede Allreduce?

![](_page_43_Figure_12.jpeg)

[1] Domke et al. "HyperX Topology: First at-scale Implementation and Comparison to the Fat-Tree" to be presented at SC'19 and HOTI'19

### Breaking the limitation of GPU memory for Deep Learning

Haoyu Zhang, Wahib Mohamed, Lingqi Zhang, Yohei Tsuji, Satoshi Matsuoka

Motivation: GPU memory is relatively small in comparison to recent DL work load

Analysis:

![](_page_44_Figure_4.jpeg)

## Breaking the limitation of GPU memory for Deep Learning

Haoyu Zhang, Wahib Mohamed, Lingqi Zhang, Yohei Tsuji, Satoshi Matsuoka

### Proposal:

### **OOC-Paleo**

![](_page_45_Figure_4.jpeg)

### Case Study & Discussion:

#### Memory Capacity:

Not so important as latency and throughput

#### Latency:

- Higher bandwidth makes no sense when buffer is too small
- Latency is decided by physical law

### **UM-Chainer**

#### prefetch()->explicit swap-in no explicit swap-out

**Bandwidth:** 

![](_page_45_Figure_13.jpeg)

#### **Processor:**

- Slower processor is acceptable
- Lower memory bandwidth

Higher connection bandwidth

### Breaking the limitation of GPU memory for Deep Learning

Haoyu Zhang, Wahib Mohamed, Lingqi Zhang, Yohei Tsuji, Satoshi Matsuoka

## Assuming we have higher Bandwidth...

ResNet-50, batch size = 128

| Bandwidth (GB/s) | Time        | Percentage (bandwidth not full) | Percentage (computation cannot overlap) |
|------------------|-------------|---------------------------------|-----------------------------------------|
| 16               | 967.8595572 | 0.409                           | 0.733                                   |
| 32               | 569.9550342 | 0.466                           | 0.642                                   |
| 64               | 407.2978908 | 0.574                           | 0.472                                   |
| 128              | 371.9318064 | 0.688                           | 0.438                                   |
| 256              | 362.5661138 | 0.835                           | 0.398                                   |
| 512              | 359.7637498 | 0.915                           | 0.398                                   |
| 1024             | 359.3012901 | 0.983                           | 0.386                                   |
| $\infty$         | 359.3012901 | 1.000                           | 0.386                                   |
| Original version | 306.9286403 | N/A                             | N/A                                     |

- 16GB/s->64GB/s: Training time can be halved
- 64GB/s->128GB/s: Only a little time reduced
- >128GB/s: use of the bandwidth
- >512GB/s: Time almost does not decrease
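The diminishing returns can be read directly from the table by taking ratios of the reported times. The snippet below only restates the table values; the helper name `speedup` is introduced here for illustration.

```python
# Times from the table above, keyed by interconnect bandwidth in GB/s.
time_by_bandwidth = {
    16: 967.8595572,
    32: 569.9550342,
    64: 407.2978908,
    128: 371.9318064,
    256: 362.5661138,
    512: 359.7637498,
    1024: 359.3012901,
}


def speedup(slow_bw, fast_bw):
    """Ratio of training times when moving from slow_bw to fast_bw."""
    return time_by_bandwidth[slow_bw] / time_by_bandwidth[fast_bw]


if __name__ == "__main__":
    for slow, fast in [(16, 64), (64, 128), (128, 512), (512, 1024)]:
        print(f"{slow:4d} -> {fast:4d} GB/s: {speedup(slow, fast):.3f}x")
```

Going from 16 to 64 GB/s more than halves the time, whereas each further doubling beyond 128 GB/s changes it by only a few percent or less.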

![](_page_47_Picture_0.jpeg)

#### Toward Training a Large 3D Cosmological CNN with Hybrid Parallelization

<sup>1</sup> Tokyo Institute of Technology, <sup>2</sup> Lawrence Livermore National Laboratory, <sup>3</sup> University of Illinois at Urbana-Champaign, <sup>4</sup> Lawrence Berkeley National Laboratory, <sup>5</sup> RIKEN Center for Computational Science, <sup>\*</sup> oyama.y.aa@m.titech.ac.jp

August 5, 2019

### The 1st Workshop on Parallel and Distributed Machine Learning 2019 (PDML'19) - Kyoto, Ja

Yosuke Oyama, Naoya Maruyama 2, Nikoli Dryden 3,2, Peter Harrington 4, Jan Balewski 4, Satoshi Matsuoka 5,1, Marc Snir 3, Peter Nugent 4, and Brian


This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

![](_page_47_Picture_6.jpeg)

## Background

CosmoFlow [1] is a project to estimate cosmological parameters from 3-dimensional universe data by using a 3D CNN

![](_page_48_Figure_2.jpeg)

- **Problem:** GPU memory is too small to process high-resolution universe data
  - $\rightarrow$  Another way to parallelize the model efficiently?

![](_page_48_Picture_6.jpeg)

![](_page_48_Picture_7.jpeg)

## Background

- Data-parallel training distributes data samples among GPUs
  - ✓ Good weak scalability (O(1000) GPUs)

![](_page_49_Figure_3.jpeg)

- Model-parallel training distributes the computation of a single sample (model) among GPUs
  - ✓ Can use more GPUs per sample
  - ✓ Can train larger models

![](_page_49_Figure_7.jpeg)

Data-parallelism + model-parallelism = Hybrid-parallelism

![](_page_49_Picture_10.jpeg)

![](_page_49_Picture_11.jpeg)

## **Proposal: Extending Distconv for 3D CNNs**

LBANN + Distconv [2]: A parallelized stencil computation-like hybrid-parallel CNN kernel library
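The "stencil computation-like" partitioning can be illustrated in one dimension. This is a plain-Python sketch, not Distconv's API: each partition takes a halo of neighbouring elements (zeros at the domain edge), and the concatenated per-partition results equal the undivided convolution. Function names are hypothetical, and the halo "exchange" is modelled by slicing the global input rather than by communication.

```python
def conv1d_same(x, kernel):
    """Zero-padded 1D cross-correlation with an odd-sized kernel."""
    r = len(kernel) // 2
    padded = [0.0] * r + list(x) + [0.0] * r
    return [
        sum(kernel[k] * padded[i + k] for k in range(len(kernel)))
        for i in range(len(x))
    ]


def conv1d_partitioned(x, kernel, parts):
    """Same result, computed on `parts` contiguous partitions with halos."""
    r = len(kernel) // 2
    n = len(x)
    bounds = [n * p // parts for p in range(parts + 1)]
    out = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        # Halo: r elements from each neighbouring side, zeros at the edges.
        left = [0.0] * max(0, r - lo) + list(x[max(0, lo - r):lo])
        right = list(x[hi:hi + r]) + [0.0] * max(0, hi + r - n)
        local = left + list(x[lo:hi]) + right
        out.extend(
            sum(kernel[k] * local[i + k] for k in range(len(kernel)))
            for i in range(hi - lo)
        )
    return out


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 6, 7, 8]
    k = [1, 0, -1]
    print(conv1d_partitioned(x, k, 4) == conv1d_same(x, k))
```

The halo width grows with the kernel radius, so the cost of exchanging halos between GPUs is what the layer-wise communication in a spatially partitioned CNN pays for.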

![](_page_50_Figure_2.jpeg)

![](_page_50_Picture_5.jpeg)

- Achieved a 111x speedup over 1 node by exploiting hybrid-parallelism, even though layer-wise communication is introduced
- 8-way partitioning is 1.19x faster than 4-way partitioning with a mini-batch size of 64

![](_page_51_Figure_3.jpeg)

![](_page_51_Figure_4.jpeg)

Figure: Weak scaling of the CosmoFlow network.

![](_page_51_Picture_6.jpeg)

![](_page_51_Picture_8.jpeg)

## **Evaluation: Strong scaling**

- Achieved a 2.28x speedup on 4 nodes (16 GPUs) compared to one node when N = 1
- The scalability limit here is 8 GPUs, and the main bottleneck is input data loading

![](_page_52_Figure_2.jpeg)

Figure: Breakdown of the strong scaling experiment when N = 1.

![](_page_52_Picture_4.jpeg)

![](_page_52_Picture_5.jpeg)

![](_page_52_Picture_6.jpeg)

### Machine Learning Models for Predicting Job Run Time-Underestimation in HPC system [SCAsia 19]

- Motivation & negative effects
  1. When submitting a job, users need to estimate their job runtime
  2. If job runtime is underestimated by the users
  3. Job will be terminated by HPC system upon reaching its time limit
  - Increasing time and financial cost for HPC users
  - Wasting time and system resources
  - Hindering the productivity of HPC users and machines
- Method
  - Apply machine learning to train models for predicting whether the user has underestimated the job run-time
  - Using data produced by TSUBAME 2.5
- Evaluating by Average Precision (AP)
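For reference, Average Precision can be computed as below. This is the common non-interpolated definition; the slide does not say which variant was used, so treat it as an assumption. Here a label of 1 stands for a job whose runtime was underestimated, and the score is the model's predicted confidence.

```python
def average_precision(labels, scores):
    """Mean of precision@k over the ranks k where a positive sample appears."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits = 0
    precisions = []
    for k, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / k)
    if not precisions:
        raise ValueError("no positive samples")
    return sum(precisions) / len(precisions)


if __name__ == "__main__":
    # Made-up example: two underestimated jobs among three.
    print(average_precision([1, 0, 1], [0.9, 0.8, 0.7]))
```

AP rewards rankings that place the underestimated jobs first, which suits a setting where such jobs are a minority of all submissions.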

![](_page_53_Figure_11.jpeg)

![](_page_53_Figure_12.jpeg)

- Runtime-underestimated jobs can be predicted with different accuracy and SLR at different checkpoint times
- Summing up the "Saved" time of all the applications at best SLRs checkpoints, 24962 hours can be saved in total with existing TSUBAME 2.5 data
- Helping HPC users to reduce time and financial loss
- Helping HPC system administrators free up computing resources

Guo, Jian, et al. "Machine Learning Predictions for Underestimation of Job Runtime on HPC System." Asian Conference on Supercomputing Frontiers. Springer, 2018

Many Core Era

Post Moore Cambrian Era

![](_page_54_Picture_2.jpeg)

- Flops-Centric Monolithic Algorithms and Apps
- Flops-Centric Monolithic System Software
- Hardware/Software System APIs
- Flops-Centric Massively Parallel Architecture

![](_page_54_Figure_6.jpeg)

- Transistor Lithography Scaling (CMOS Logic Circuits, DRAM/SRAM)

~2025 M-P Extinction Event

- Cambrian Heterogeneous Algorithms and Apps
- Cambrian Heterogeneous System Software
- Hardware/Software System APIs
- "Cambrian" Heterogeneous Architecture: Heterogeneous CPUs + Holistic Data; Reconfigurable Dataflow; Massive BW 3-D Package; Optical Computing; DNN & Neuromorphic; Non-Volatile Memory; Quantum Computing; Low Precision Error-Prone; Ultra Tightly Coupled w/Aggressive 3-D+Photonic Switching Interconnected
- Novel Devices + CMOS (Dark Silicon) (Nanophotonics, Non-Volatile Devices etc.)

![](_page_55_Picture_0.jpeg)

![](_page_55_Picture_2.jpeg)

- Basic Research on Post-Moore
  - Funded 2017: DEEP-AI CREST (Matsuoka)
  - Funded 2018: NEDO 100x 2028 Processor Architecture (Matsuoka, Sano, Kondo, SatoK)
  - Funded 2019: Kiban-S Post-Moore Algorithms (NakajimaK etc.)
  - Submitted: Neuromorphic Architecture (Sano etc. w/Riken AIP, Riken CBS (Center for Brain Science))
  - In preparation: Cambrian Computing (w/HPCI Centers)
- Author a Post-Moore Whitepaper towards Fugaku-next
  - All-hands BoF last week at annual SWoPP workshop
  - Towards official "Feasibility Study" towards Fugaku-next
  - Similar efforts as K => Fugaku started in 2012

![](_page_56_Picture_0.jpeg)

## **Basic Research #1: NEDO 100x Processor**

![](_page_56_Picture_2.jpeg)

### 2028: Post-Moore Era

- ~2015: ~25 Years Post-Dennard, Many-core Scaling era
- 2016~: Moore's Law Slowing Down
- 2025~: Post-Moore Era, end of transistor lithography (FLOPS) improvement

![](_page_56_Figure_7.jpeg)

### Research: Architectural investigation of perf. improvement ~2028

• 100x in 2028 c.f. mainstream high-end CPUs circa 2018 across applications

# Key to performance improvement: from FLOPS to Bytes – data movement architectural optimization

- CGRA Coarse-Grained Reconfigurable Vector Dataflow
- Deep & Wide memory architecture w/advanced 3D packaging & novel memory devices
- All-Photonic DWM interconnect w/high BW, low energy injection
- Kernel-specific HW optimization w/low # of transistors & associated system software, programming, and algorithms

![](_page_57_Picture_0.jpeg)

## **NEDO 100x Processor**

![](_page_57_Picture_2.jpeg)

### **Towards 100x processor in 2028**

- Various combinations of CPU architectures, new memory devices and 3-D technologies
- Perf. measurement/characterization/models for high-BW intra-chip data movement
- Cost models and algorithms for horizontal & hierarchical data movement
- Programming models and heterogeneous resource management

![](_page_57_Figure_8.jpeg)

12 Apr, 2019