The metadata file of an Iceberg table is a JSON file. It contains the following information:

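// TableMetadata.java
// Location of this metadata file
private final String metadataFileLocation;
// format-version: format version of the table; the Iceberg implementation in use must support at least this version
private final int formatVersion;
// table-uuid: UUID generated when the table is created; it must never change
private final String uuid;
// location: base location of the table; determines where the table's files are stored
private final String location;
// last-sequence-number: highest sequence number assigned in the table, a monotonically increasing long that tracks the order of snapshots
private final long lastSequenceNumber;
// last-updated-ms: timestamp (in milliseconds) of the most recent metadata update
private final long lastUpdatedMillis;
// last-column-id: highest column id assigned so far; every column is assigned an id
private final int lastColumnId;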
// current-schema-id: id of the table's current schema
private final int currentSchemaId;
// schemas: list of schemas, each carrying its own schema-id
private final List<Schema> schemas;
// default-spec-id: id of the partition spec that writers use by default
private final int defaultSpecId;
// partition-specs: list of partition specs, looked up via defaultSpecId in the same way as schemas. Writers need it when writing; readers use the spec recorded in the manifest instead
private final List<PartitionSpec> specs;
// last-partition-id: highest partition field id assigned across all partition specs
private final int lastAssignedPartitionId;
// default-sort-order-id: id of the sort order that writers use by default; readers use the one recorded in the manifest
private final int defaultSortOrderId;
// sort-orders: list of sort orders
private final List<SortOrder> sortOrders;
// properties: table properties, used to control read and write settings
private final Map<String, String> properties;
// current-snapshot-id: id of the current snapshot; must match the current id of the main branch in refs
private final long currentSnapshotId;
// Helper structures for quickly looking up an object by its id
private final Map<Integer, Schema> schemasById;
private final Map<Integer, PartitionSpec> specsById;
private final Map<Integer, SortOrder> sortOrdersById;
// snapshot-log: history of changes to the current snapshot; each entry pairs a timestamp with a snapshot id
private final List<HistoryEntry> snapshotLog;
// metadata-log: previous metadata files of the table
private final List<MetadataLogEntry> previousFiles;
// statistics: table statistics files
private final List<StatisticsFile> statisticsFiles;
// partition-statistics: partition statistics files
private final List<PartitionStatisticsFile> partitionStatisticsFiles;
private final List<MetadataUpdate> changes;
private SerializableSupplier<List<Snapshot>> snapshotsSupplier;
// snapshots: list of valid snapshots
private volatile List<Snapshot> snapshots;
private volatile Map<Long, Snapshot> snapshotsById;
// refs: map from snapshot reference name to snapshot reference object. Even when refs is empty, there is always a main branch pointing to current-snapshot-id
private volatile Map<String, SnapshotRef> refs;
private volatile boolean snapshotsLoaded;
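
The rule noted on refs (an implicit main branch that always points to current-snapshot-id) can be sketched in plain Java. This is only an illustration, not Iceberg's implementation: `RefsSketch`, `effectiveRefs`, the `-1` "no snapshot" sentinel, and the use of a bare `Long` snapshot id in place of a `SnapshotRef` object are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these names are Iceberg APIs.
class RefsSketch {
    static final String MAIN_BRANCH = "main";
    // Assumed sentinel meaning "the table has no current snapshot yet".
    static final long NO_SNAPSHOT = -1L;

    // Returns the refs as a reader would see them: the stored refs plus an
    // implicit "main" entry pointing at current-snapshot-id when none is stored.
    static Map<String, Long> effectiveRefs(Map<String, Long> storedRefs, long currentSnapshotId) {
        Map<String, Long> refs = new HashMap<>();
        if (storedRefs != null) {
            refs.putAll(storedRefs);
        }
        if (currentSnapshotId != NO_SNAPSHOT) {
            // putIfAbsent returns the existing value, or null if "main" was added now.
            Long existingMain = refs.putIfAbsent(MAIN_BRANCH, currentSnapshotId);
            if (existingMain != null && existingMain != currentSnapshotId) {
                throw new IllegalStateException("main must point to current-snapshot-id");
            }
        }
        return refs;
    }
}
```

With a null or empty refs map and a current snapshot id of 42, the result contains a single `main` entry pointing to 42; a stored `main` that disagrees with current-snapshot-id is rejected.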

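From the field list above, it is clear that simple information is stored directly in metadata fields, whereas objects that are more complex or that need to evolve, such as schemas and partition specs, are managed through versioned ids.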

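The latest ids (the current schema id and the default spec and sort order ids) exist for writers to use when writing data. Readers instead use the id that was in effect when the data was written. This keeps historical data usable and unaffected by later changes.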
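
The writer/reader split described here can be sketched as follows. This is a minimal illustration under stated assumptions: `SpecResolutionSketch` and its methods are hypothetical names, partition specs are reduced to descriptive strings, and ids are simply assigned in insertion order.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; partition specs are represented as plain strings.
class SpecResolutionSketch {
    private final Map<Integer, String> specsById = new LinkedHashMap<>();
    private int defaultSpecId = -1;

    // Evolving the spec adds a new versioned entry and moves the default.
    // Earlier entries are kept so that existing data stays interpretable.
    int evolve(String newSpec) {
        int newId = specsById.size();
        specsById.put(newId, newSpec);
        defaultSpecId = newId;
        return newId;
    }

    // Writers always use the table's current default spec.
    String specForWrite() {
        return specsById.get(defaultSpecId);
    }

    // Readers use the spec id recorded with the data (in the manifest),
    // not the table's current default.
    String specForRead(int manifestSpecId) {
        String spec = specsById.get(manifestSpecId);
        if (spec == null) {
            throw new IllegalArgumentException("unknown spec id: " + manifestSpecId);
        }
        return spec;
    }
}
```

After evolving from `day(ts)` to `hour(ts)`, `specForWrite()` returns the new spec, while `specForRead` with the old id still returns `day(ts)`, so files written earlier are read with the spec they were written under.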

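There are also statistics files, namely table statistics files and partition statistics files. They are stored as separate files and are only tracked from the metadata file.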

Table Statistics

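Table statistics are stored as Puffin files. They hold information such as indexes and statistics about the Iceberg table. Readers do not need the statistics files in order to read the table data correctly.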

The metadata of each statistics file is recorded in the metadata file as a StatisticsFile, which has the following fields:

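// snapshot-id: id of the snapshot the statistics file is associated with
private final long snapshotId;
// statistics-path: path of the Puffin statistics file
private final String path;
// file-size-in-bytes: size of the statistics file in bytes
private final long fileSizeInBytes;
// file-footer-size-in-bytes: total size of the statistics file footer; specific to the file format
private final long fileFooterSizeInBytes;
// blob-metadata: list of metadata entries for the blobs in the statistics file
private final List<BlobMetadata> blobMetadata;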

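The blob-metadata above plays a role similar to that of the schema in the table metadata: a schema describes the structure of the table, while blob-metadata describes the contents of the statistics file. Each entry has the following fields: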

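// type: type of the blob; matches the blob type in the statistics file
private final String type;
// snapshot-id: id of the snapshot the blob's data was computed from
private final long sourceSnapshotId;
// sequence-number: sequence number of the snapshot the blob's data was computed from
private final long sourceSnapshotSequenceNumber;
// fields: list of field ids the statistics were computed on
private final List<Integer> fields;
// properties: additional properties of the blob, similar to properties in the table metadata
private final Map<String, String> properties;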
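
To show how these fields fit together, here is a rough sketch of looking up a statistics file for a given snapshot, blob type, and field id. It uses plain Java records rather than Iceberg's classes: `StatisticsLookupSketch`, `BlobMeta`, `StatsFile`, and `findStatsPath` are hypothetical names, and only a subset of the fields listed above is modeled.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; these records only mirror some of the fields listed above.
class StatisticsLookupSketch {
    record BlobMeta(String type, long sourceSnapshotId, List<Integer> fields) {}

    record StatsFile(long snapshotId, String path, List<BlobMeta> blobs) {}

    // Returns the path of the first statistics file registered for the snapshot
    // that has a blob of the requested type computed over the requested field.
    static Optional<String> findStatsPath(
            List<StatsFile> files, long snapshotId, String blobType, int fieldId) {
        return files.stream()
                .filter(file -> file.snapshotId() == snapshotId)
                .filter(file -> file.blobs().stream()
                        .anyMatch(blob -> blob.type().equals(blobType)
                                && blob.fields().contains(fieldId)))
                .map(StatsFile::path)
                .findFirst();
    }
}
```

An empty result is a normal outcome: since readers do not depend on statistics files, a caller would simply proceed without them.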

Partition Statistics

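These are similar to table statistics files, except that the file contents are partition-related. Likewise, readers do not require this file. A writer may optionally write a partition statistics file on each write, and the file can be used by readers only after it has been registered in the metadata. Its metadata is tracked with a PartitionStatisticsFile, which has the following fields: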

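snapshot-id: id of the snapshot the partition statistics file is associated with
statistics-path: path of the partition statistics file
file-size-in-bytes: size of the partition statistics file in bytes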
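
The "write optionally, register, then read" flow can be sketched like this. It is an illustration only: `PartitionStatsRegistrySketch` and its methods are hypothetical, and keeping at most one file per snapshot id is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of tracking partition statistics files in table metadata.
class PartitionStatsRegistrySketch {
    record PartitionStatsFile(long snapshotId, String path, long fileSizeInBytes) {}

    private final Map<Long, PartitionStatsFile> bySnapshotId = new HashMap<>();

    // A file becomes visible to readers only once it is registered here.
    // Assumption: registering again for the same snapshot replaces the old entry.
    void register(PartitionStatsFile file) {
        bySnapshotId.put(file.snapshotId(), file);
    }

    // Statistics are optional, so a missing entry yields an empty result
    // instead of an error.
    Optional<PartitionStatsFile> forSnapshot(long snapshotId) {
        return Optional.ofNullable(bySnapshotId.get(snapshotId));
    }
}
```

A writer that skips the statistics file for some snapshot leaves no entry, and a reader asking for that snapshot gets an empty `Optional` and carries on without partition statistics.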
