Repurposing Binary Serialized Data Structures During the Process of Data Ingestion

There are a great many binary formats that data might live in. Everything very popular has grown good open-source libraries, but you may encounter some legacy or in-house format for which this is not true. Good general advice is that unless there is an ongoing and/or performance-sensitive need for processing an unusual format, try to leverage existing parsers. Custom formats can be tricky, and if one is uncommon, it is as likely as not also to be under-documented.

This article is an excerpt from the book, Cleaning Data for Effective Data Science by David Mertz – A comprehensive guide for data scientists to master effective data cleaning tools and techniques.

If an existing tool is only available in a language you do not wish to use for your main data science work, see whether it can nonetheless be leveraged simply as a means of exporting to a more easily accessed format. A fire-and-forget tool might be all you need, even if it is one that runs recurringly, but asynchronously with the actual data processing you need to perform.
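To make that concrete, here is a minimal sketch of such a fire-and-forget wrapper, assuming a hypothetical command-line exporter named legacy-export that writes CSV to stdout; the directory names are likewise placeholders. A script like this can run from cron or another scheduler, entirely apart from the analysis code.

import subprocess
from pathlib import Path

LEGACY_TOOL = "legacy-export"   # hypothetical CLI that understands the in-house format

def convert_new_files(src_dir="data/incoming", dst_dir="data/csv"):
    "Run the legacy exporter over any not-yet-converted binary files."
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for binpath in Path(src_dir).glob("*.bin"):
        csvpath = Path(dst_dir) / (binpath.stem + ".csv")
        if csvpath.exists():            # already converted on an earlier run
            continue
        with open(csvpath, "w") as out:
            subprocess.run([LEGACY_TOOL, str(binpath)], stdout=out, check=True)

if __name__ == "__main__":
    convert_new_files()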

For this section, let us assume that the optimistic situation is not realized, and we have nothing beyond some bytes on disk, and some possibly flawed documentation to work with. Writing the custom code is much more the job of a systems engineer than a data scientist, but we data scientists need to be polymaths, and we should not be daunted by writing a little bit of systems code. 

For this relatively short section on data ingestion, we look at a simple and straightforward binary format. Moreover, this is a real-world data format for which we do not actually need a custom parser. Having an actual well-tested, performant, and bullet-proof parser to compare our toy code with is a good way to make sure we do the right thing. Specifically, we will read data stored in the NumPy NPY format, which is documented as follows (abridged): 

  • The first 6 bytes are a magic string: exactly \x93NUMPY.
  • The next 1 byte is an unsigned byte: the major version number of the file format, e.g. \x01.
  • The next 1 byte is an unsigned byte: the minor version number of the file format, e.g. \x00. 
  • The next 2 bytes form a little-endian unsigned short int: the length of the header data HEADER_LEN.
  • The next HEADER_LEN bytes are an ASCII string that contains a Python literal expression of a dictionary. 
  • Following the header comes the array data.  

First, we read in some binary data using the standard reader, using Python and NumPy, to understand what type of object we are trying to reconstruct for data ingestion. It turns out that the serialization was of a 3-dimensional array of 64-bit floating-point values. A small size was chosen for this section, but of course, real-world data will generally be much larger. 

In [35]:
arr = np.load(open('data/binary-3d.npy', 'rb'))
print(arr, '\n', arr.shape, arr.dtype)

Out [35]:
[[[ 0.  1.  2.]
  [ 3.  4.  5.]]
 [[ 6.  7.  8.]
  [ 9. 10. 11.]]]
 (2, 2, 3) float64

Visually examining the bytes is a good way to have a better feel for what is going on with the data. NumPy is, of course, a clearly and correctly documented project, but for some hypothetical format this is an opportunity to potentially identify problems with the documentation not matching the actual bytes. More subtle issues may arise in the more detailed parsing; for example, the meaning of bytes in a particular location can be contingent on flags occurring elsewhere. Data science is, in surprisingly large part, a matter of eyeballing data.

In [36]:
%%bash
hexdump -Cv data/binary-3d.npy

Out [36]:
00000000  93 4e 55 4d 50 59 01 00  76 00 7b 27 64 65 73 63  |.NUMPY..v.{'desc|
00000010  72 27 3a 20 27 3c 66 38  27 2c 20 27 66 6f 72 74  |r': '<f8', 'fort|
00000020  72 61 6e 5f 6f 72 64 65  72 27 3a 20 46 61 6c 73  |ran_order': Fals|
00000030  65 2c 20 27 73 68 61 70  65 27 3a 20 28 32 2c 20  |e, 'shape': (2, |
00000040  32 2c 20 33 29 2c 20 7d  20 20 20 20 20 20 20 20  |2, 3), }        |
00000050  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
00000060  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
00000070  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 0a  |               .|
00000080  00 00 00 00 00 00 00 00  00 00 00 00 00 00 f0 3f  |...............?|
00000090  00 00 00 00 00 00 00 40  00 00 00 00 00 00 08 40  |.......@.......@|
000000a0  00 00 00 00 00 00 10 40  00 00 00 00 00 00 14 40  |.......@.......@|
000000b0  00 00 00 00 00 00 18 40  00 00 00 00 00 00 1c 40  |.......@.......@|
000000c0  00 00 00 00 00 00 20 40  00 00 00 00 00 00 22 40  |...... @......"@|
000000d0  00 00 00 00 00 00 24 40  00 00 00 00 00 00 26 40  |......$@......&@|
000000e0

As a first step, let us make sure the file really does match the type we expect in having the correct “magic string.” Many kinds of files are identified by a characteristic and distinctive first few bytes. In fact, the common utility on Unix-like systems, file, uses exactly this knowledge via a database describing many file types. For a hypothetical rare file type (i.e. not NumPy), this utility may not know about the format; nonetheless, the file might still have such a header. 

In [37]:
%%bash
file data/binary-3d.npy

Out [37]:
data/binary-3d.npy: NumPy array, version 1.0, header length 118

With that, let us open a file handle for the file and proceed with trying to parse it according to its specification. For this, in Python, we will simply open the file in bytes mode, so as not to convert to text, and read various segments of the file to verify or process portions. For this format, we will be able to process it strictly sequentially, but in other cases, it might be necessary to seek to particular byte positions within the file. The Python struct module will allow us to parse basic numeric types from bytestrings. The ast module will let us create Python data structures from raw strings without the security risk that eval() carries.

In [38]:
import struct, ast
binfile = open('data/binary-3d.npy', 'rb')

# Check that the magic header is correct
if binfile.read(6) == b'\x93NUMPY':
    vermajor = ord(binfile.read(1))
    verminor = ord(binfile.read(1))
    print(f"Data appears to be NPY format, "
          f"version {vermajor}.{verminor}")
else:
    print("Data in unsupported file format")
    print("*** ABORT PROCESSING ***")

Out [38]:
Data appears to be NPY format, version 1.0

Next, we need to determine how long the header is, and then read it in. The header is always ASCII in NPY version 1, but may be UTF-8 in version 3. Since ASCII is a subset of UTF-8, decoding does no harm even if we do not check the version. 

In [39]:
# Little-endian short int (tuple 0 element)
header_len = struct.unpack('<H', binfile.read(2))[0]
# Read specified number of bytes
header = binfile.read(header_len)
# Convert header bytes to a dictionary
# Use safer ast.literal_eval()
header_dict = ast.literal_eval(header.decode('utf-8'))
print(f"Read {header_len} bytes "
      f"into dictionary: \n{header_dict}")

Out [39]:
Read 118 bytes into dictionary:
{'descr': '<f8', 'fortran_order': False, 'shape': (2, 2, 3)}

While this dictionary stored in the header gives a nice description of the dtype, value order, and the shape, the convention used by NumPy for value types is different from that used in the struct module. We can define a (partial) mapping to obtain the correct spelling of the data type for the reader. We only define this mapping for some data types encoded as little-endian, but the big-endian versions would simply have a greater-than sign instead. The key for ‘fortran_order’ indicates whether the fastest or slowest varying dimension is contiguous in memory. Most systems use “C order” instead. 

We are not aiming for high efficiency here, but to minimize code. Therefore, I will expediently read the actual data into a simple list of values first, and then later convert that to a NumPy array. 

In [40]:
# Define spelling of data types and find the struct code
dtype_map = {'<i2': '<h', '<i4': '<l', '<i8': '<q',
             '<f2': '<e', '<f4': '<f', '<f8': '<d'}
dtype = header_dict['descr']
fcode = dtype_map[dtype]

# Determine number of bytes from dtype spec
nbytes = int(dtype[2:]) 

# List to hold values
values = [] 

# Python 3.8+ "walrus operator"
while val_bytes := binfile.read(nbytes):
    values.append(struct.unpack(fcode, val_bytes)[0]) 
print("Values:", values)

Out [40]:
Values: [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0]

Let us now convert the raw values into an actual NumPy array of appropriate shape and dtype. We will also look for whether to use Fortran- or C-order in memory. 

In [41]:
shape = header_dict['shape']
order = 'F' if header_dict['fortran_order'] else 'C'
newarr = np.array(values, dtype=dtype, order=order)
newarr = newarr.reshape(shape)
print(newarr, '\n', newarr.shape, newarr.dtype)
print("\nMatched standard parser:", (arr == newarr).all())

Out [41]:
[[[ 0.  1.  2.]
  [ 3.  4.  5.]]
 [[ 6.  7.  8.]
  [ 9. 10. 11.]]]
 (2, 2, 3) float64 

Matched standard parser: True 
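
Since each of the steps above is short, it can be convenient to fold the same logic into a single reusable helper. The sketch below simply consolidates the cells we just walked through; the function name read_npy_v1 is our own, and it handles only version 1 headers and the little-endian dtypes in the mapping above.

import ast, struct
import numpy as np

def read_npy_v1(path):
    "Consolidate the steps above: a minimal reader for NPY version 1 files."
    dtype_map = {'<i2': '<h', '<i4': '<l', '<i8': '<q',
                 '<f2': '<e', '<f4': '<f', '<f8': '<d'}
    with open(path, 'rb') as binfile:
        # Magic string and format version
        if binfile.read(6) != b'\x93NUMPY':
            raise ValueError(f"{path} does not appear to be NPY format")
        vermajor, verminor = binfile.read(1)[0], binfile.read(1)[0]
        if vermajor != 1:
            raise ValueError(f"Only NPY version 1 supported, got {vermajor}.{verminor}")
        # Header length, then the header dictionary itself
        header_len = struct.unpack('<H', binfile.read(2))[0]
        header = ast.literal_eval(binfile.read(header_len).decode('utf-8'))
        dtype = header['descr']
        nbytes, fcode = int(dtype[2:]), dtype_map[dtype]
        # Read one value at a time, exactly as in the cells above
        values = []
        while val_bytes := binfile.read(nbytes):
            values.append(struct.unpack(fcode, val_bytes)[0])
    order = 'F' if header['fortran_order'] else 'C'
    return np.array(values, dtype=dtype, order=order).reshape(header['shape'])

With that in hand, read_npy_v1('data/binary-3d.npy') should reproduce the same array that np.load() gave us for this file.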

Just as binary data can be oddball, so can text. 

Summary 

There are also formats that, while directly intended as a means of recording and communicating data, are not widely used, and tooling to read them directly may not be available to you. The specific example this article presented is a binary format.

About the Author

David Mertz, Ph.D. is the founder of KDM Training, a partnership dedicated to educating developers and data scientists in machine learning and scientific computing. He created a data science training program for Anaconda Inc. and was a senior trainer for them. With the advent of deep neural networks, he has turned to training our robot overlords as well.  

He previously worked for 8 years with D. E. Shaw Research and was also a Director of the Python Software Foundation for 6 years. David remains co-chair of its Trademarks Committee and Scientific Python Working Group. His columns, Charming Python and XML Matters, were once the most widely read articles in the Python world. 

