Go and Apache Arrow: building blocks for data science
Today we will see how Apache Arrow could be useful for data science, or – really – a lot of analysis workloads.
In Data Science and in many scientific fields, the lingua franca is Python.
This means that a lot of the libraries are written in Python, with the CPU intensive parts leveraging either NumPy, SciPy or, directly, a C/C++ library wrapped with the CPython C-API. This also means that:
- the vast majority of the analysis pipelines are written in Python;
- it is not completely straightforward to migrate parts of these pipelines to other languages, in a smooth and piecewise process.
So why would anyone use Go for this?
Well… the usual reasons:
- Go is easier to deploy,
- Go is faster than pure Python,
- Go code is simpler when dealing with concurrency,
- Go code and results are easy to reproduce,
- Go code is more amenable to mechanical refactoring,
- Go code tends to be more robust at scale than Python code.
The real question is: do we have all the building blocks, all the libraries to write a modern, robust, efficient analysis pipeline in Go?
Most of the science-y things one might carry out with the Python scientific stack (NumPy/SciPy) can be performed with Gonum:
- linear algebra, matrices,
- statistics, probability distributions,
- integration, optimization,
- network creation,
- Fourier transformations,
There are also a few packages that enable file-level interoperability with many useful data science oriented or scientific “file formats”:
Many basic ingredients are already there, even if some are still missing.
But even if basic file-level interoperability is (somewhat) achieved, one still needs to implement readers, writers and converters from one file format to any another when integrating with a new analysis pipeline:
In many scenarios, this implies a large fraction of the computation is wasted on serialization and deserialization, reimplementing over and over the same features for converting from
format-n or from
What if we had a common data layer between all these analysis pipelines, a common tool that would also be efficient?
What is Apache Arrow? From the website:
1 2 3 4 5 6 7
Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication.
The idea behind Arrow is to describe and implement a cross-language open standard for representing structured data, and for efficiently sharing this data across languages and processes.
This is embodied as an
arrow::Array class in C++, and the
Vector class in Java, the reference implementations.
… and – of course – Go :)
- primitive arrays (
- parametric types (
- memory management
- typed metadata
- SIMD math kernels (via a nice automatic translation tool -
c2goasm- to extract vectorized code produced by CLang, see the original blog post for more details)
Since then, a few contributors provided implementations for:
- list arrays
- struct arrays
- time arrays
CSVdata to Arrow arrays,
- tables and records,
- tensors (a.k.a. n-dimensional arrays), the Swiss Army knife of machine learning.
Note that tensors are not part of the columnar format: they are actually built on top of Arrow arrays.
But what are those arrays?
From the Arrow perspective, an array is a sequence of values with known length, all having the same type.
Arrow arrays have support for “missing” values, or
null slots in the array.
The number of
null slots is stored in a bitmap, itself stored as part of the array.
Arrays with no
null slot can choose not to allocate that bitmap.
Let us consider an array of
int32s to make things more concrete:
[1, null, 2, 4, 8]
Such an array would look like:
1 2 3 4 5 6 7 8 9 10 11 12
* Length: 5, Null count: 1 * Null bitmap buffer: |Byte 0 (validity bitmap) | Bytes 1-63 | |-------------------------|-----------------------| | 00011101 | 0 (padding) | * Value Buffer: |Bytes 0-3 | 4-7 | 8-11 | 12-15 | 16-19 | 20-63 | |----------|-------------|-------|-------|-------|-------------| | 1 | unspecified | 2 | 4 | 8 | unspecified |
Arrow specifies the expected memory layout of arrays in a document.
How does it look like in Go? The Go implementation for the Apache Arrow standard is documented here:
- github.com/apache/arrow/go/arrow: data types, metadata, schemas, …
- github.com/apache/arrow/go/arrow/array: array types,
- github.com/apache/arrow/go/arrow/memory: support for low-level allocation of memory,
- github.com/apache/arrow/go/arrow/csv: support for reading CSV files into Arrow arrays,
- github.com/apache/arrow/go/arrow/tensor: types for n-dimensional arrays.
The main entry point is the
Most of the interface should be self-explanatory.
The careful reader might notice the
Retain/Release methods that are used to manage the memory used by arrays.
Gophers might be surprised by this kind of low-level memory management, but this is needed to allow for
mmap- or GPU-backed arrays.
Another interesting piece of this interface is the
Data() method that returns a value of type
array.Data type is where all the semantics of an Arrow arrays are encoded, as per the Arrow specifications.
The last piece to understand the memory layout of a Go Arrow array is
In a nutshell, Arrow arrays – and the
array.Interface – can be seen as the cousins of the buffer protocol from Python which has been instrumental in (pythonic) science.
Let us come back to the
It is implemented like so in Go:
Creating a new array value holding
int32s is done with
data holds a carefully crafted
array.Data value with the needed memory buffers and sub-buffers (for the
As this can be a bit unwiedly, a helper type is provided to create Arrow arrays from scratch:
Arrow arrays can also be sub-sliced:
Similar code can be written for arrays of
float32, unsigned integers, etc…
But not everything can be mapped or simply expressed in terms of these basic primitives.
What if we need a more detailed data type?
That is expressed with
Let us imagine we need to represent an array of “Person” type like:
Handling nested data and finding an efficient representation for this kind data are key features of the Apache Arrow project. Minute details of this efficient representation are reported in the memory layout document.
Structured types in Arrow can be seen as entries in a database, with a specified schema. While many databases are row oriented, data is column-oriented in Arrow:
The reasoning behind organizing data along columns instead of rows is that:
- compression should work better for data of the same type, (presumably, data should be relatively similar)
- many workloads only care about a few columns to perform their analyses, the other columns can be left at rest on disk,
- one can leverage vectorized instructions to carry operations.
This is the well-known structure of arrays (
SoA) vs arrays of structure (
With Arrow, the
Person type defined previously could be implemented as the following structure of arrays:
A complete example is provided here.
As hinted during this blog post, Arrow needs to be able to interoperate with industry provided or scientific oriented databases, to load all the interesting data into memory, as efficiently as possible.
Thus, it stands to reason for Arrow to provide tools that maps to operations usually performed on databases.
In Arrow, this is expressed as
Table interface allows to describe the data held by some database, and access it via the
array.Column type – a representation of chunked arrays: arrays that may not be contiguous in memory.
The code below creates a pair of records
rec2 (each record can be seen as a set of rows in a database.)
Each record contains two columns, the first one containing
int32 values and the second one
Once a few entries have been added to each of the two records, a
Table reader is created from these in-memory records.
The chunk size for this
Table reader is set to
5: each column will hold at most 5 entries during the iteration over the in-memory
Executing the code above would result in:
1 2 3 4 5 6 7 8 9
$> go run ./table-reader.go rec["f1-i32"]: [1 2 3 4 5] rec["f2-f64"]: [1 2 3 4 5] rec["f1-i32"]: [6 7 8 (null) 10] rec["f2-f64"]: [6 7 8 9 10] rec["f1-i32"]: [11 12 13 14 15] rec["f2-f64"]: [11 12 13 14 15] rec["f1-i32"]: [16 17 18 19 20] rec["f2-f64"]: [16 17 18 19 20]
Finally, to conclude with our quick whirlwind overview of what Go Arrow can provide now, let us consider the arrow/csv package.
As mentioned earlier, many analysis pipelines start with ingesting CSV.
The Go standard library already provides a package to decode and encode data using this comma-separated values (CSV) “format”.
But what the encoding/csv package exposes is slices of
strings, not values of a specific type.
string values to a specific type (
int, …) is left to the user of that package.
The arrow/csv package leverages the
array.Record types to provide a typed and (eventually) scalable+optimized API.
Running the code above will result in:
1 2 3 4 5 6 7 8 9 10
$> go run ./read-csv.go rec["i64"]: [0 1 2 3] rec["f64"]: [0 1 2 3] rec["str"]: ["str-0" "str-1" "str-2" "str-3"] rec["i64"]: [4 5 6 7] rec["f64"]: [4 5 6 7] rec["str"]: ["str-4" "str-5" "str-6" "str-7"] rec["i64"]: [8 9] rec["f64"]: [8 9] rec["str"]: ["str-8" "str-9"]
This post has only scratched the surface of what can be done with Go Arrow and how it works under the hood. For example, we have not talked about how the typesafe array builders and array types are generated: this is - BTW - an area where the Go2 draft proposal for generics would definitely help.
There also many features available in C++/Python Arrow that are yet to be implemented in Go Arrow. The main remaining one is perhaps implementing the IPC protocol that Arrow specifies. This would allow to create polyglot, distributed applications with an eye towards data science.
In the same vein, the Arrow
Schema types could be seen as building blocks for creating a dataframe package interoperable with the ones from Python, R, etc…
Finally, Go Arrow tensors could be used as an efficient vehicle to transfer data between various machine learning packages (ONNX, Gorgonia.)
Gophers, follow the arrows and send your pull requests!