

## **FPGA-based Simulated Bifurcation Machine**

Kosuke Tatsumura, Alexander R. Dixon, and Hayato Goto

FPL2019 @ Barcelona 2019.09.09

© 2019 Toshiba Corporation

#### Contents

# An ultra-fast solver for combinatorial optimization problems

01 Background & objective: Accelerator for Simulated Bifurcation

## 02 Design

**03** Implementation & Evaluation

**04** Discussion: practicality



#### **Combinatorial optimization problems**

### Economically valuable but computationally hard



Standard approach: Simulated annealing (SA)

<sup>© 2019</sup> Toshiba Corporation 3

#### New approach: Ising machine

# Map to an Ising problem, Special-purpose machine solves it quickly



#### **Ising machines**

### Very competitive, CIM deserves attention



\*4 https://www.ntt.co.jp/news2017/1711e/171120a.html

#### A recently proposed, quantum-inspired algorithm

### **Simulated Bifurcation (SB)**

classical

counterpart

VS.

[H. Goto, K. Tatsumura, A. R. Dixon, Sci. Adv. 5, eaau0823, '19

#### Simulated Bifurcation (SB)



#### **Quantum Bifurcation (QB) machine**

a quantum adiabatic optimization method [H. Goto, Sci. Rep 6, 21686, '16]



#### Plentiful parallelism



 $\rightarrow$ Substantial speedup by massively parallel processing

#### Simulated Annealing (SA)



#### **Objective of this work**

### Designing & evaluating massively-parallel custom accelerator for SB

#### **Motivation**:

#### Be the world's fastest Ising machine

 $\rightarrow$  Pursue a custom circuit that can fully exploit the parallelism in SB

Meet the diversity of problems (size, bit precision, I/F)

 $\rightarrow$  Describe the design in an HLS to make it flexible & scalable

#### **Toward "Intelligent real-time response systems"**

 $\rightarrow$  Be a component of ALL-in-HW system enabling sub-msec latency



7

Contents

# An ultra-fast solver for combinatorial optimization problems

01 Background & objective: Accelerator for Simulated Bifurcation

# 02 Design

**03** Implementation & Evaluation

**04** Discussion: practicality



#### How it works: Simulated Bifurcation (SB)

#### **N-body system dynamically searches for a good solution**

#### **Movement of the system in N-dimensional space**



#### If N is large,

imagine the process of finding the brightest "star" among 2<sup>N</sup> stars in the dark

#### How it works: Simulated Bifurcation (SB)

#### **Time evolution of N-body system**

Movements of N(=4000) spin-variables as a function of time



#### Algorithm of SB and it's parallelism

#### SB step: spin state at $t_{n+1} \leftarrow$ the previous state at $t_n$



Top-level parallelism: Simultaneous update of N spins is possible 11

#### **Design: Top-level circuit architecture**

#### **End-to-end HW implementation of SB**

- 1. SB step iteration
- 2. Block parallelism  $(P_b)$

- Circulative structure
- $P_b$  number of MMTEs (responsible for  $N_b = N/P_b$  spins)
- 3. Data dependency for  $\boldsymbol{x}$





#### **Design: the most computationally intensive part**

#### Matrix-vector multiplication module (MM)



#### Timing Design: spatial and temporal parallelization

#### Pc x Pr x Pb Speedup for MM & hiding TE part



Contents

# An ultra-fast solver for combinatorial optimization problems

01 Background & objective: Accelerator for Simulated Bifurcation

02 Design

**03** Implementation & Evaluation

**04** Discussion: practicality



Implementation for MAX-CUT benchmark problem

4,096-spin SB machine on Arria10 FPGA Massively Parallel, High Utilization, 4X larger vs CIM



#### Implementation for MAX-CUT benchmark problem

### 4,096-spin SB machine on Arria10 FPGA Massively Parallel, High Utilization, 4X larger vs CIM



|                       | 4,096-size |
|-----------------------|------------|
| Architecture          |            |
| Pr/Pc/Pb              | 32/32/8    |
| # of MAC<br>PEs       | 8,192      |
| Effective<br>activity | 92%        |
| Resource              |            |
| ALM                   | 40%        |
| BRAM                  | <b>56%</b> |
| DSP                   | 7%         |
| System<br>Clock       | [MHz]      |
| Fsys                  | 269        |

Spin-Size: limited by BRAM (for J matrix) MAX-CUT:  $J_{ij}$ ={-1,0,1},  $J_{4096x4096}$  $\rightarrow$ 16Mbit (1,024 BRAMs) Parallelism: limited by routing congestion

 $\ensuremath{\mathbb{C}}$  2019 Toshiba Corporation 17

#### **Evaluation: FPGA-SB vs. CIM**

#### 14X faster, 288X more energy efficient than CIM



#### **Evaluation: FPGA-SB vs. GPU-SB**

FPGA is computation-bound, GPU memory-bound -11X faster, 26X more energy efficient than GPU-SB



Contents

# An ultra-fast solver for combinatorial optimization problems

01 Background & objective: Accelerator for Simulated Bifurcation

## 02 Design

**03** Implementation & Evaluation

**04** Discussion: practicality



#### Hit record-high speed, meet diversity of problems

## Write fully(bit-by-bit/clock-by-clock) customized HWs in an HLS language (OpenCL)

HLS: high-level-synthesis

#### **Good size-scalability of our parameterized design**



| "all-to-all-connected MAX-CUT |            |            |           |  |  |
|-------------------------------|------------|------------|-----------|--|--|
| Problem                       | K2000*     | K4000*     | Ratio     |  |  |
| Complexity<br>per SB step     | 4M         | 16M        | <b>4X</b> |  |  |
| Solver                        | 2K-size SB | 4K-size SB | Ratio     |  |  |
| Time-to-<br>solution [ms]     | 0.055      | 0.211      | 3.9X      |  |  |
| Energy-for-<br>solution [mJ]  | 2.7        | 10.8       | 4.0X      |  |  |

\*all-to-all-connected MAX-CUT

#### Hit record-high speed, meet diversity of problems

## Write fully(bit-by-bit/clock-by-clock) customized HWs in an HLS language (OpenCL)

HLS: high-level-synthesis

#### Good size-scalability of our parameterized design



#### **FPGA+HLS** is a good choice for practical problems

| Ising problems             | J matrix          | h vector | size    |                            |  |
|----------------------------|-------------------|----------|---------|----------------------------|--|
| MAX-CUT                    | Ternary, {-1,0,1} | none     | various |                            |  |
| Arbitrage<br>(path search) | Ternary, {-1,0,1} | FP32     | various |                            |  |
| Portfolio<br>optimization  | FP32              | none     | various |                            |  |
| •••                        | various           | various  | various | © 2019 Toshiba Corporation |  |

#### **Toward "Intelligent real-time response systems"**

Interface (I/F) is important as well; Be flexible, wide, directly connected to external I/F



#### Summary

#### **FPGA-based Simulated Bifurcation Machine** An ultra-fast solver for combinatorial optimization problems

### Massively-parallel, fully-customized SB accelerator

-the world's fastest, most energy-efficient (14X faster, 288X more energy efficient than CIM)

#### **Very practical**

-No refrigerator, no laser, but just an FPGA -written in an HLS language, flexible & scalable to meet the diversity of problems

# Be a key component for intelligent real-time response systems

-End-to-End HW implementation that can be directly connected to external systems, enabling *optimal* responses in sub-msec