TensorFlow Data Reading: TFRecords

wangli

TFRecords

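TFRecords is a binary file format. It is less human-readable than other formats, but it makes better use of memory, is easier to copy and move, and does not need a separate label file. A TFRecords file contains tf.train.Example protocol buffers, each of which holds a Features field. To create one, write a small program that loads your data, fills it into an Example protocol buffer, serializes the protocol buffer to a string, and writes that string to a TFRecords file with tf.python_io.TFRecordWriter. To read the data back, use tf.TFRecordReader together with the tf.parse_single_example parser, which parses an Example protocol buffer into tensors.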
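The write-then-read round trip described above can be sketched as follows. This is a minimal illustration, not a definitive implementation. It assumes tf.compat.v1 is available (TensorFlow 1.13+ or 2.x). The feature names "image" and "label", the helper functions, and the placeholder byte strings are hypothetical. To avoid graph and session setup, the read side uses tf_record_iterator and the protocol buffer's own FromString method instead of tf.TFRecordReader and tf.parse_single_example.

```python
import os
import tempfile

import tensorflow as tf

# Assumption: tf.compat.v1 exists (TensorFlow 1.13+ or 2.x).
tf_v1 = tf.compat.v1


def make_example(image_bytes, label):
    """Pack image bytes and an integer label into one Example."""
    feature = {
        # "image" and "label" are hypothetical feature names.
        "image": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


def write_records(path, samples):
    """Serialize each (image_bytes, label) pair and write it to the file."""
    with tf_v1.python_io.TFRecordWriter(path) as writer:
        for image_bytes, label in samples:
            writer.write(make_example(image_bytes, label).SerializeToString())


def read_records(path):
    """Read the file back and return a list of (image_bytes, label) pairs."""
    results = []
    for serialized in tf_v1.python_io.tf_record_iterator(path):
        example = tf.train.Example.FromString(serialized)
        feature = example.features.feature
        results.append((feature["image"].bytes_list.value[0],
                        feature["label"].int64_list.value[0]))
    return results


# Placeholder payloads stand in for real image bytes.
samples = [(b"fake-jpeg-bytes-0", 0), (b"fake-jpeg-bytes-1", 1)]
record_path = os.path.join(tempfile.mkdtemp(), "demo.tfrecords")
write_records(record_path, samples)
restored = read_records(record_path)
```

Because the label travels inside the same Example as the image bytes, no separate label file is needed.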

Writing a TFRecord file

Reading the image

tf.gfile.FastGFile(image_path, 'rb').read()
PIL.Image.open(image_path).resize([{new_shape}]).tobytes()
Comparing the two approaches:

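The first approach, using TensorFlow, returns the file contents as stored on disk, that is, bytes still encoded in the format indicated by the file extension (for example JPEG). The second approach reads the decoded pixel values directly and converts them to bytes.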

# image_path points to an image of size 2316*2320*3 (height * width * channels)
import tensorflow as tf
from PIL import Image

image_data_tf = tf.gfile.FastGFile(image_path, 'rb').read()
print(type(image_data_tf))
# [output] str
print(len(image_data_tf))
# [output] 3179831

img = Image.open(image_path)
image_data_pil = img.tobytes()
print(type(image_data_pil))
# [output] str
print(len(image_data_pil))
# [output] 16119360
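# 2316 * 2320 * 3 = 16119360: the length is exactly the product of the dimensions,
# whereas the encoded bytes read through TensorFlow above are much shorter, which also saves a lot of storage space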

Reading a TFRecord file

If the TFRecord was written from the output of tf.gfile.FastGFile().read(), the image can be decoded with tf.image.decode_jpeg (for JPEG files).

# Same image_path as above
import numpy as np
image_data_tf_decode = tf.image.decode_jpeg(image_data_tf)
with tf.Session() as sess:
    image_data_tf_decode_sess = sess.run(image_data_tf_decode)
print(type(image_data_tf_decode_sess))
# [output] numpy.ndarray
print(np.shape(image_data_tf_decode_sess))
# [output] (2316, 2320, 3)
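For a record written with the second approach (PIL's tobytes()), there is no encoded image to decode. The bytes are bare pixel values, so the shape has to be known separately and reapplied. The sketch below illustrates this with NumPy only. The small in-memory image is a hypothetical stand-in for a real photo. Inside a TensorFlow 1.x graph the corresponding calls would likely be tf.decode_raw followed by tf.reshape.

```python
import numpy as np
from PIL import Image

# Hypothetical stand-in for a real photo: a small RGB image built in memory.
width, height = 4, 3
img = Image.new("RGB", (width, height), color=(255, 128, 0))

# Raw pixel bytes as produced by the PIL approach; no shape information survives.
raw = img.tobytes()

# Image.size is (width, height), while the array layout is
# (height, width, channels), so the shape must be supplied in that order.
pixels = np.frombuffer(raw, dtype=np.uint8).reshape(height, width, 3)
```

In practice this means the height, width, and channel count are usually stored next to the raw bytes, for example as extra integer features in the same Example.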