

#### Fletcher: A Framework to Efficiently Integrate FPGA Accelerators with Apache Arrow

@ FPL2019, Barcelona, September 11, 2019

Johan Peltenburg<sup>1</sup>, Jeroen van Straten<sup>1</sup>, Lars Wijtemans<sup>1</sup> Lars T.J. van Leeuwen<sup>1</sup>, Zaid Al-Ars<sup>1</sup>, H. Peter Hofstee<sup>2</sup>

> 1. Delft University of Technology, Netherlands 2. IBM, Austin, Texas, USA

Thanks to our supporters:

Fitoptivis European ECSEL project no. ECSEL2017-1-737451

Xilinx



## Outline

- The challenge of FPGA integration with Big Data Analytics
- Overcoming serialization bottlenecks with Apache Arrow
- Fletcher
- Mini-tutorial (if time)
- Results
- Conclusion & future work

#### **An FPGA Accelerator Dev. Perspective**



### A Big Data Analytics Dev. Perspective:

#### Ease of Use

Write applications quickly in Java, Scala, Python, R, and SQL.

Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it *interactively* from the Scala, Python, R, and SQL shells.

```
df = spark.read.json("logs.json")
df.where("age > 21")
   .select("name.first").show()
```

Spark's Python DataFrame API Read JSON files with automatic schema inference

Source: https://spark.apache.org/

- DataFrame: like a database table or excel spreadsheet, but...
- Huge. Typically in the order of GiBs to TiBs.
- **Distributed** over multiple worker nodes (also in storage).
- Operations on it build Directed Acyclic Graphs (DAGS) and are lazily evaluated.
- DAGs are optimized, planned and scheduled to exectue in parallel over a cluster.
- Resilient to node failure, provides automatic recovery and continuation.
- What is all that computer scientist magic that makes this possible?

### **Big Data Analytics SW Ecosystem**



## A string



## Serialization



- Iterate over all objects in collection (data is big)
- Traverse all object graphs (memory latency)
- Copy fields to some intermediate format both A and B understand (bandwidth lost)
- Reconstruct objects in B ((de)allocation overhead)

# I/O bandwidth catching up



[1] F. Kruger, "CPU Bandwidth The Worrisome 2020 Trend," Mar. 2016. [Online]. Available: https://blog.westerndial.com/cpu-bandwidth-theworrisome-2020-trend/

# **Relative impact on accelerators**

Original process on CPU:

Process on GPGPU/FPGA with serialization (potentially, but not necessarily, exaggerated)

#### **NON-FUNCTIONAL**

#### **NON-FUNCTIONAL**

#### Desired profile:





[2] J. Peltenburg, A. Hesam, and Z. Al-Ars, "**Pushing Big Data into Accelerators: Can the JVM Saturate Our Hardware?**" in International Conference on High Performance Computing. Springer, 2017, pp. 220–236.

#### **Overcoming serialization bottlenecks**

- In-memory formats determined by:
  - Programming languages
    - Run-time system design choices
    - Standard libraries
  - Algorithms
  - Programmers
- Increased heterogeneity  $\rightarrow$  more IPC  $\rightarrow$  more serialization overhead
- What if data is...
  - In a standardized format?
    - That every language can use (through libraries or otherwise).
  - As contiguous as possible?
    - We can move it using large bursts, no pointer chasing, less misalignment overhead

# **>>>>** Apache Arrow<sup>[3]</sup>



- Standardized representation in-memory Common Data Layer
- Columnar format
  - Hardware friendly while iterating over entries in single column (SIMD, caches, etc...)
  - Better for many algorithms, worse for some others.
- Libraries and APIs for 10+ languages to build and access data sets (zero-copy)

[3] The Apache Software Foundation, "Apache Arrow," 2018. [Online]. Available: https://arrow.apache.org



#### Arrow terminology: Schema:

Description of data types in a *RecordBatch* 

#### RecordBatch:

Tabular structure containing Arrow *arrays* 

#### Arrays:

A RecordBatch "column". Combination of Arrow *buffers*, can be nested

#### **Buffers:**

Contiguous C-like arrays



### Integrating FPGA and Arrow

- Arrow 'turns out' to be hardware-friendly
  - In-memory format clearly specified, to every bit
  - Highly contiguous & columnar format
    - Iterate over a column in streaming fashion
    - Useful for: maps, reductions, filters, etc...
  - Parallel accessible format
    - E.g. uses offsets, not lengths, for variable length data we can start anywhere
    - Useful for: maps, reductions, filters, etc...
- Backed by a large and ever growing community
- Integration in many BDA frameworks, even without official format stability
- Can we generate easy-to-use, high throughput hardware interfaces automatically?





#### Example: Interface for accelerator parsing strings

Typical:

#### Fletcher:



## Generated interface overview

- Architecture based on library with streaming primitives
- BufferReader/Writer : Basic unit to read (N) Arrow Buffer elements
- ArrayReader/Writer : Combination of BufferReaders/Writers [1]
  - Dictated by the schema field and format specification
  - Generated through pure HDL; vendor agnostic
- RecordBatchReader/Writer : Combination of ArrayReaders/Writers
- Mantle : Wraps multiple RecordBatchR/W + bus infrastructure

<sup>[4]</sup> J. Peltenburg, J. van Straten, M. Brobbel, H. P. Hofstee, and Z. Al-Ars, "Supporting Columnar In-memory Formats on FPGA: The Hardware Design of Fletcher for Apache Arrow", in Applied Reconfigurable Computing, Cham: Springer International Publishing, 2019, pp. 32–47.

#### **Combining BufferReaders into ArrayReaders**



| Index | Value |  | Index | Valid |
|-------|-------|--|-------|-------|
| 0     | 1.33f |  | 0     | 1     |
| 1     | 7.01f |  | 1     | 1     |
| 2     | Ø     |  | 2     | 0     |

| Index | Offset | Offset | Value |
|-------|--------|--------|-------|
| 0     | 0      | 0      | ο     |
| 1     | 3      | 1      | l     |
| 2     | 6      | 2      | а     |
| 3     | 10     | 3      | f     |
|       |        |        |       |
|       |        | ••••   |       |

| Index | Value | Index | Value |
|-------|-------|-------|-------|
| 0     | 1     | 0     | 3.14  |
| 1     | 5     | 1     | 1.41  |
| 2     | 3     | 2     | 1.61  |

- Arrow Schema & format spec dictate how to combine buffers.
- Passed to ArrayReaders through configuration string in HDL.
- Seeking the limits of synthesis tools :-)
- Over 10k+ random field types simulated.

#### High-level architecture generation: Fletchgen

Arrow support: ☑ RecordBatches ☑ Arrays ☑ Buffers



- Need syntactically pleasing interfaces
  - Grouping of ArrayReader/Writer interfaces for RecordBatches
  - Stream names must correspond to schema fields
  - Synthesizable HDL too limited
- Need kernel template generation for kernel implementation in HDL/HLS
- Need simulation
- Need platform integration
- High-level architecture generator: Fletchgen

## Fletcher run-time stack



- Reap the benefits of Arrow:
  - Create one accelerator.
  - Leverage in any supported language.
- Fletcher Generated Hardware Interface is platform agnostic – requires no IP, tcl scripts, etc...
- Top level with AXI4 interface available.

### Mini-tutorial: Fletcher "Hello, World!"



- Trivial example:
  - Sum a column of integers
- Get to know the toolchain
- More realistic applications:
  - Complex types
  - More Arrow Arrays
  - More input/output RecordBatches

#### Also on GitHub: https://github.com/abs-tudelft/fletcher

|                |         |                  | 💭 abs-tudelfi     | / fletcher |          |               |           |
|----------------|---------|------------------|-------------------|------------|----------|---------------|-----------|
|                | Code    | Issues (14)      | Pull requests (8) | Security   | Pulse    | Community     |           |
| Branch: devel  | lop -   |                  |                   |            |          | Find file     | Copy path |
| letcher / e:   | kample  | rs / sum / R     | EADME.md          |            |          |               |           |
| dad19d3 o      |         | e -no-tempdir fi | x vhdeps          |            |          |               |           |
| 3 contributors |         | 19               |                   |            |          |               |           |
| Raw Bi         | arre F  | listory          |                   |            |          |               | 1.1       |
| 216 lines      | (154 s  | loc) 10.7        | KB                |            |          |               |           |
| Get<br>Exa     | mp      | le               | ted: Sin          | nple (     | Colu     | mn Si         | um        |
| Drore          | squi    |                  |                   |            |          |               |           |
| Prere          |         |                  | d a few libraries | and tools  |          | this tutorial |           |
| Before         | ning up | to step 5 (      | simulation) can l | e done wi  | ith comp | letely free a | nd        |

### Step 1: Create an Arrow Schema

import pyarrow as pa

```
number_field = pa.field('number', pa.int64(), nullable=False)
schema = pa.schema([number_field])
```

# Step 2: Create a RecordBatch

(optional, for simulation)

data = [pa.array([1, -3, 3, -7])]
recordbatch = pa.RecordBatch.from\_arrays(data, schema)

writer = pa.RecordBatchFileWriter('recordbatch.rb', schema)
writer.write(recordbatch)
writer.close()



## Step 4: Implement the kernel

- Start from template.
- Use your favorite tools:
  - Custom HDL
  - Vivado HLS

. . .

- Kernel interfaces:
  - AXI4-lite MMIO
  - Command streams to generated interface
  - Data streams from generated interface

```
when RUNNING =>
  stat_done <= '0';
  stat_busy <= '1';
  stat_idle <= '0';
  -- Always ready to accept input
  ExampleBatch_number_ready <= '1';
  if ExampleBatch_number_valid = '1' then
    -- Sum the record to the accumulator
    accumulator_next <= accumulator + signed(ExampleBatch_number);
    -- Wait for last element from ArrayReader
    if ExampleBatch_number_last = '1' then
        state_next <= DONE;
    end if;
end if;</pre>
```





/home/user/fletcher/examples/sum/hardware/vhdl/SimTop\_tc.vhd:342:5:@1650ns:(report note): Stimuli done.

Final summary:
 \* PASSED simtop\_tc
Test suite PASSED

### Step 6: Write host-side software

import pyarrow as pa import pyfletcher as pf

```
• • •
```

. . .

```
platform = pf.Platform()
platform.init()
```

```
context = pf.Context(platform)
context.queue_record_batch(batch)
context.enable()
```

```
kernel = pf.Kernel(context)
kernel.start()
kernel.wait_for_finish()
```

## Step 7: Target a Platform

- Supported platforms:
  - OpenPOWER CAPI SNAP
    - If implementation allows, directly streamable from host-memory using virtual addresses on FPGA



 Requires copy to on-board memory



NIMBIX



#### **Regular Expression Matching**

- Given N strings
- Match M regular expressions
- Count matches for each regexp





## Regex throughput/speedup result

#### TABLE I

**REGULAR EXPRESSION MATCHING RESULTS** 

|           |          | Throughput (GB/s)         |               |           |             | Spe              | edup           |             |
|-----------|----------|---------------------------|---------------|-----------|-------------|------------------|----------------|-------------|
| System    | Language | Native data<br>set w/ CPU | Serialization | FPGA Copy | FPGA Kernel | w/ Serialization | Arrow/Fletcher | Improvement |
| AWS/F1    | C++      | 0.08                      | 0.55          | 7.13      | 14.27       | 6.18             | 59.73          | 9.67        |
| (16 regex | Python   | 0.04                      | 0.83          | 7.17      | 14.28       | 15.93            | 107.73         | 6.76        |
| units)    | Java     | 0.03                      | 0.27          | 7.13      | 14.27       | 8.24             | 152.91         | 18.56       |
| P9/SNAP   | C++      | 0.43                      | 0.81          | n/a       | 7.61        | 1.70             | 17.78          | 10.44       |
| (8 regex  | Python   | 0.11                      | 0.81          | n/a       | 7.61        | 6.77             | 70.72          | 10.45       |
| units)    | Java     | 0.16                      | 0.16          | n/a       | 7.61        | 0.95             | 46.49          | 48.69       |

### Regex on 1GiB of tweet-like strings



### Writing random length (0-255) strings

| TABLE II              |          |                   |             |      |              |  |  |  |  |  |
|-----------------------|----------|-------------------|-------------|------|--------------|--|--|--|--|--|
| STRING WRITER RESULTS |          |                   |             |      |              |  |  |  |  |  |
|                       |          | Throughput (GB/s) |             |      |              |  |  |  |  |  |
| System                | Longuaga | To native         | To Arrow    | FPGA | Total (Arrow |  |  |  |  |  |
| System                | Language | container         | RecordBatch | copy | RecordBatch) |  |  |  |  |  |
|                       | C++      | 0.85              | 2.53        | -    | 2.53         |  |  |  |  |  |
| AWS/F1                | Python   | 0.96              | 2.60        | -    | 2.60         |  |  |  |  |  |
| AW S/FT               | Java     | 0.59              | 1.81        |      | 1.81         |  |  |  |  |  |
|                       | FPGA     | -                 | 9.76        | 2.75 | 2.15         |  |  |  |  |  |
|                       | C++      | 0.76              | 7.52        | -    | 7.52         |  |  |  |  |  |
| P9/SNAP               | Python   | 1.60              | 7.68        | -    | 7.68         |  |  |  |  |  |
| F9/SINAF              | Java     | 0.28              | 3.96        | -    | 3.96         |  |  |  |  |  |
|                       | FPGA     | -                 | 9.76        | -    | 9.76         |  |  |  |  |  |

#### K-Means clustering, internal iteration bandwidth & total run-time

| TABLE III                  |                             |         |       |          |           |  |  |  |  |
|----------------------------|-----------------------------|---------|-------|----------|-----------|--|--|--|--|
| K-MEANS CLUSTERING RESULTS |                             |         |       |          |           |  |  |  |  |
|                            | Avg. GB/sTotal run-time (s) |         |       |          |           |  |  |  |  |
| Longuaga                   | CPU                         | FPGA CP | CDU   | FPGA     | FPGA      |  |  |  |  |
| Language                   | CFU                         |         | CFU   | (w/ ser) | (w/o ser) |  |  |  |  |
| C++                        | 1.40                        | 11.15   | 19.24 | 6.08     | 2.55      |  |  |  |  |
| Python                     | 1.29                        | 11.15   | 20.77 | 8.07     | 3.03      |  |  |  |  |
| Java                       | 1.00                        | 11.15   | 26.92 | 3.88     | 2.55      |  |  |  |  |

## Conclusion

- Big data analytics systems increasingly heterogeneous many different tools in many different technologies.
- Apache Arrow: one in-memory format for IPC through shared memory for most languages / runtimes / technologies.
- Fletcher: Arrow format allows us to generate high-throughput, easy-to-use hardware interfaces for FPGA.
- Streaming kernels benefit the most, more computationally oriented kernels less.
- Paves the way for more efficient FPGA accelerator integration with any of the supported big data analytics tools.

# Spin-off projects & future work

- **Dynamic Arrow Buffers in Hardware** to support buffer resizing (Lars Wijtemans)
- **Parquet-to-Arrow decoder and decompressor** (Lars van Leeuwen, Jian Fang)
- HLS integration for map, reduce, **SQL-defined filters** (Erwin de Haan)

- Data-defined architecture.
- Can we turn this into a closed-loop self-optimizing interface generation and profiling framework?
  - Long-running workload: plenty of time to synthesize.
  - We only need one node of a cluster to do it.
  - Are the gains worth the cost?

# Thank you for your attention!

#### **References:**

[1] F. Kruger, "CPU Bandwidth The Worrisome 2020 Trend," Mar. 2016. [Online]. Available: https://blog.westerndial.com/cpu-bandwidth-the-worrisome-2020-trend/ [2] J. Peltenburg, A. Hesam, and Z. Al-Ars, "Pushing Big Data into Accelerators: Can the JVM Saturate Our Hardware?" in International Conference on High Performance Computing. Springer. 2017. pp. 220–236.

[3] The Apache Software Foundation, "Apache Arrow," 2018. [Online]. Available: https://arrow.apache.org

[4] J. Peltenburg, J. van Straten, M. Brobbel, H. P. Hofstee, and Z. Al-Ars, "Supporting Columnar In-memory Formats on FPGA: The Hardware Design of Fletcher for Apache Arrow", in Applied Reconfigurable Computing, Cham: Springer International Publishing, 2019, pp. 32–47.

#### https://github.com/abs-tudelft/fletcher **FLETCHER**

#### Open-sourced example projects / existing applications:

| Regular Expression matching example / benchmark:                                       | Thanks to our supporters:         |
|----------------------------------------------------------------------------------------|-----------------------------------|
| https://github.com/abs-tudelft/fletcher-example-regexp                                 |                                   |
| <ul> <li>K-Means clustering example/benchmark:</li> </ul>                              | Fitoptivis European ECSEL project |
| https://github.com/abs-tudelft/fletcher-example-kmeans                                 | no. ECSEL2017-1-737451            |
| <ul> <li>Writing strings to Arrow format using CAPI 2.0 and SNAP @ 10 GB/s:</li> </ul> |                                   |
| https://github.com/abs-tudelft/fletcher/blob/develop/examples/stringwrite              | Xilinx                            |
| • Posit arithmetic on FPGA, BLAS and PairHMM accelerators by Laurens van Dam:          |                                   |
| https://github.com/lvandam/posit_blas_hdl                                              |                                   |
| https://github.com/lvandam/pairhmm posit hdl arrow                                     | 35                                |

### Area utilization

Tab. 2: Area utilization statistics for a Xilinx XCVU9P device

| True    | Deseures        | 117_0 | $W_{-16}$ | W_22  | W-CA  | W-199 | W-256 | $W_{-510}$ |
|---------|-----------------|-------|-----------|-------|-------|-------|-------|------------|
| Type    | Resource        |       |           |       |       |       |       |            |
| Column  | CLB LUTs        | 0.30% | 0.28%     | 0.26% | 0.24% | 0.22% | 0.20% | 0.21%      |
| Reader  | CLB Registers   | 0.20% | 0.20%     | 0.20% | 0.20% | 0.22% | 0.24% | 0.26%      |
| Prim(W) | Block RAM (B36) | 0.65% | 0.65%     | 0.65% | 0.65% | 0.65% | 0.65% | 0.65%      |
|         | Block RAM (B18) | 0.05% | 0.05%     | 0.05% | 0.05% | 0.05% | 0.05% | 0.05%      |
| Column  | CLB LUTs        | 2.34% | 1.81%     | 1.46% | 1.32% | 1.03% | 1.04% | 0.78%      |
| Reader  | CLB Registers   | 1.01% | 1.01%     | 1.01% | 1.01% | 1.00% | 1.00% | 1.00%      |
| List of | Block RAM (B36) | 1.30% | 1.30%     | 1.30% | 1.30% | 1.30% | 1.30% | 1.30%      |
| Prim(W) | Block RAM (B18) | 0.09% | 0.09%     | 0.09% | 0.09% | 0.09% | 0.09% | 0.09%      |
| Column  | CLB LUTs        | 0.20% | 0.19%     | 0.19% | 0.20% | 0.20% | 0.22% | 0.23%      |
| Writer  | CLB Registers   | 0.28% | 0.28%     | 0.28% | 0.28% | 0.29% | 0.31% | 0.33%      |
| Prim(W) | Block RAM (B36) | 0.37% | 0.37%     | 0.37% | 0.37% | 0.37% | 0.37% | 0.37%      |
|         | Block RAM (B18) | 0.02% | 0.02%     | 0.02% | 0.02% | 0.02% | 0.02% | 0.02%      |
| Column  | CLB LUTs        | 1.03% | 0.97%     | 0.91% | 0.87% | 0.80% | 0.78% | 0.52%      |
| Writer  | CLB Registers   | 1.18% | 1.12%     | 1.11% | 1.11% | 1.06% | 1.06% | 0.73%      |
| List of | Block RAM (B36) | 1.11% | 1.11%     | 1.06% | 1.06% | 1.06% | 1.06% | 0.74%      |
| Prim(W) | Block RAM (B18) | 0.07% | 0.05%     | 0.07% | 0.07% | 0.07% | 0.07% | 0.05%      |