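Apache CarbonData is a new high-performance data storage format. Analytics scenarios in the big data domain have differing requirements, which often leads to redundant copies of the same data. CarbonData addresses this with a unified storage solution: a single copy of the data supports multiple scenarios, such as filter queries on arbitrary combinations of dimensions, fast scans, and detail-record queries. Multi-level indexing, dictionary encoding, and columnar storage improve I/O scan and compute performance, enabling second-level responses on tens of billions of records.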
Apache CarbonData 1.0.0 delivers 80+ features and 100+ bug fixes. The highlights are as follows:
1.New data loading solution
The old CarbonData load solution depended on the Kettle engine, but Kettle was not designed for big data workloads, and the code in this flow was hard to maintain. Version 1.0 therefore adds a new data loading solution with no Kettle dependency, which is more modular and performs better.
2.Support Spark 2.1 integration in CarbonData
Spark 2.1 added many features and improved performance. CarbonData benefits from these improvements after the upgrade.
3.Data update/delete SQL support
Users can now update and delete data in a carbon table using standard SQL syntax. This feature is currently supported in the Spark 1.5/1.6 integration and will be supported in the Spark 2.1 integration soon.
4.Support adaptive data compression for int/bigint/decimal to increase compression ratio
This feature adapts the data to the smallest data type that fits the values. It also supports delta compression to reduce the store size.
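The idea can be sketched in a few lines of Python. This is only an illustration of the technique, not CarbonData's actual encoder: the function names, the returned dictionary layout, and the choice of signed 1/2/4/8-byte widths are all assumptions made for the example.

```python
def _signed_width(lo, hi):
    """Return the narrowest signed width in bytes (1, 2, 4 or 8) that holds [lo, hi]."""
    for width in (1, 2, 4, 8):
        bits = 8 * width
        if -(1 << (bits - 1)) <= lo and hi <= (1 << (bits - 1)) - 1:
            return width
    raise ValueError("value range does not fit in 8 bytes")


def adaptive_encode(values):
    """Hypothetical encoder: store values as-is or as deltas from the minimum,
    whichever needs the narrower integer type."""
    if not values:
        raise ValueError("values must not be empty")
    lo, hi = min(values), max(values)
    plain_width = _signed_width(lo, hi)
    delta_width = _signed_width(0, hi - lo)
    if delta_width < plain_width:
        # Delta mode: subtract the minimum so only small offsets are stored.
        return {"mode": "delta", "base": lo, "width": delta_width,
                "data": [v - lo for v in values]}
    return {"mode": "plain", "base": 0, "width": plain_width,
            "data": list(values)}


def adaptive_decode(encoded):
    """Inverse of adaptive_encode."""
    return [v + encoded["base"] for v in encoded["data"]]
```

In this sketch, a bigint column whose values all lie within a narrow range of each other ends up in 1 byte per value in delta mode, while a column of already small values stays in plain mode.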
5.Support to define Date/Timestamp format for different columns
Users can now provide a Date/Timestamp format for each column while loading data. An option in the create table DDL defines the format for each Timestamp column, and defaults are provided so that users can create tables with Timestamp columns without always having to define the Date/Timestamp format.
6.Implement LRU cache for B-Tree
The B-tree in CarbonData keeps block and blocklet information for carbon tables in memory. As the number of tables or the data volume grows, this can exhaust available memory. The LRU cache for the B-tree now keeps only recently or frequently used block/blocklet information in memory and evicts unused or less used block/blocklet information.
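A minimal sketch of the eviction behaviour, assuming a cache bounded by entry count (a real cache may instead be bounded by memory size). The class name `BlockIndexCache` and the `loader` callback are hypothetical and are not CarbonData APIs.

```python
from collections import OrderedDict


class BlockIndexCache:
    """Hypothetical LRU cache mapping a block id to its in-memory index."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._entries = OrderedDict()  # ordered from least to most recently used

    def get(self, block_id, loader):
        """Return the cached index, loading it with loader(block_id) on a miss."""
        if block_id in self._entries:
            self._entries.move_to_end(block_id)  # mark as most recently used
            return self._entries[block_id]
        index = loader(block_id)
        self._entries[block_id] = index
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return index

    def cached_ids(self):
        """Block ids currently held, least recently used first."""
        return list(self._entries)
```

An evicted entry is not lost: the next `get` for that block calls `loader` again, so memory is traded for an occasional reload.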
7.CarbonData V2 format to improve first time query performance
The V2 format is better organized and maintains less metadata (it reads metadata on demand), so first-time queries are faster. It also has a lower I/O cost than V1. Several test cases show that first-time query response time is reduced by around 50%.
8.Vectorized reader support
It reads the data in batches, column by column. This feature reduces GC time and improves performance during data scans.
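As a rough illustration of batch-wise columnar reading, the sketch below yields one batch at a time, with each batch holding a slice of every column rather than a list of row objects. The function name `read_batches` and the in-memory dict-of-lists layout are assumptions for the example and do not reflect the actual reader.

```python
def read_batches(columns, batch_size):
    """Yield batches from a columnar table.

    columns is a dict of column name -> list of values, all of equal length.
    Each batch is a dict of column name -> slice of at most batch_size values.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    num_rows = len(next(iter(columns.values()))) if columns else 0
    for start in range(0, num_rows, batch_size):
        yield {name: values[start:start + batch_size]
               for name, values in columns.items()}


def sum_column(columns, name, batch_size):
    """Aggregate one column batch by batch, without building per-row objects."""
    return sum(sum(batch[name]) for batch in read_batches(columns, batch_size))
```

Because a consumer such as `sum_column` touches whole column slices, no per-row object is allocated, which is the property that helps reduce GC pressure.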
9.Fast join using bucket table
This feature enables bucket table support for CarbonData. It can improve join query performance by avoiding shuffling if both tables are bucketed on the same column with the same number of buckets. It is supported in Spark 2.1.
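Why matching buckets avoid a shuffle can be shown with a small in-memory sketch. The helper names and the use of Python's built-in `hash` are assumptions for illustration. In a real engine the bucketing happens once at load time, and the join then pairs bucket i of one table with bucket i of the other.

```python
def bucketize(rows, key, num_buckets):
    """Assign each row (a dict) to a bucket by hashing its join key."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[hash(row[key]) % num_buckets].append(row)
    return buckets


def bucketed_join(left_buckets, right_buckets, key):
    """Join two tables bucketed on the same key with the same bucket count.

    Rows with equal keys always land in buckets with the same number, so each
    bucket pair can be joined on its own and no row needs to be moved.
    """
    if len(left_buckets) != len(right_buckets):
        raise ValueError("both tables must have the same number of buckets")
    joined = []
    for left_rows, right_rows in zip(left_buckets, right_buckets):
        lookup = {}
        for right in right_rows:
            lookup.setdefault(right[key], []).append(right)
        for left in left_rows:
            for right in lookup.get(left[key], []):
                joined.append({**left, **right})
    return joined
```

If the bucket counts differ, equal keys may land in different bucket numbers, which is why the sketch rejects that case instead of joining.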
10.Leveraging off-heap memory to reduce GC
Leveraging off-heap memory improves both loading and reading performance. In data loading, it improves data sorting performance; in reading, it reduces GC overhead because the data is stored off-heap.
11.Support single-pass loading
Currently, data loading happens in two jobs (generate the dictionary first, then do the actual data loading). This feature enables a single job to finish the data loading, generating the dictionary on the fly. It can improve performance in scenarios where a load adds few incremental updates to the dictionary, which is usually the case after the initial data load.
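The difference from the two-job flow can be sketched as follows. This is a simplified single-process illustration: the function name `load_single_pass` and the convention of surrogate keys starting at 1 are assumptions, and it ignores how a distributed load would coordinate key assignment.

```python
def load_single_pass(values, dictionary=None):
    """Encode a column in one pass, extending the dictionary on the fly.

    dictionary maps a column value to an integer surrogate key. New values
    get the next free key as they are first seen, so no separate
    dictionary-generation pass over the data is needed.
    """
    dictionary = {} if dictionary is None else dictionary
    encoded = []
    for value in values:
        key = dictionary.get(value)
        if key is None:
            key = len(dictionary) + 1  # assumed convention: keys start at 1
            dictionary[value] = key
        encoded.append(key)
    return encoded, dictionary
```

After the initial load, most incoming values already have a key, so later loads mostly take the lookup path and rarely extend the dictionary.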
12.Support pre-generated dictionary for data loading
Users can load data with a pre-generated dictionary. This feature also supports user-customized dictionaries, which improves data load efficiency.
Apache CarbonData community:
- Source code on 码云 (Gitee): https://git.oschina.net/huawei_esdk/incubator-carbondata
- Source code on GitHub: https://github.com/apache/incubator-carbondata
- Mailing list: dev@carbondata.incubator.apache.org
- Apache JIRA: https://issues.apache.org/jira/browse/CARBONDATA/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel
- For more information, see:
https://cwiki.apache.org/confluence/display/CARBONDATA/CarbonData+Home
http://carbondata.apache.org