Apache Paimon: Ingesting Data with MySQL CDC

Paimon supports synchronizing changes from different databases via Change Data Capture (CDC). This feature requires Flink and its CDC connectors.

Prepare the CDC Bundled Jar Dependency

flink-sql-connector-mysql-cdc-*.jar
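
The bundled jar must be visible to the Flink client and cluster, typically by placing it under <FLINK_HOME>/lib/. A minimal sketch, assuming a standalone cluster; the connector version below is illustrative and must match your Flink and Paimon releases:

# Assumption: the exact connector version; pick the one matching your setup.
cp /path/to/flink-sql-connector-mysql-cdc-2.4.2.jar <FLINK_HOME>/lib/
# Restart the standalone cluster so the new jar is picked up.
<FLINK_HOME>/bin/stop-cluster.sh && <FLINK_HOME>/bin/start-cluster.sh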

Synchronizing Tables

By using MySqlSyncTableAction in a Flink DataStream job, or through flink run, you can synchronize one or more tables from MySQL into a single Paimon table.

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.7.0-incubating.jar \
    mysql_sync_table \
    --warehouse <warehouse-path> \
    --database <database-name> \
    --table <table-name> \
    [--partition_keys <partition_keys>] \
    [--primary_keys <primary-keys>] \
    [--type_mapping <option1,option2...>] \
    [--computed_column <'column-name=expr-name(args[, ...])'> [--computed_column ...]] \
    [--metadata_column <metadata-column>] \
    [--mysql_conf <mysql-cdc-source-conf> [--mysql_conf <mysql-cdc-source-conf> ...]] \
    [--catalog_conf <paimon-catalog-conf> [--catalog_conf <paimon-catalog-conf> ...]] \
    [--table_conf <paimon-table-sink-conf> [--table_conf <paimon-table-sink-conf> ...]]
--warehouse
    The path to Paimon warehouse.

--database
    The database name in Paimon catalog.

--table
    The Paimon table name.

--partition_keys
    The partition keys for the Paimon table. If there are multiple partition keys, connect them with a comma, for example "dt,hh,mm".

--primary_keys
    The primary keys for the Paimon table. If there are multiple primary keys, connect them with a comma, for example "buyer_id,seller_id".

--type_mapping
    Specifies how to map MySQL data types to Paimon types. Supported options:
    • "tinyint1-not-bool": maps MySQL TINYINT(1) to TINYINT instead of BOOLEAN.
    • "to-nullable": ignores all NOT NULL constraints (except for primary keys). This is used to solve the problem that Flink cannot accept the MySQL 'ALTER TABLE ADD COLUMN column type NOT NULL DEFAULT x' operation.
    • "to-string": maps all MySQL types to STRING.
    • "char-to-string": maps MySQL CHAR(length)/VARCHAR(length) types to STRING.
    • "longtext-to-bytes": maps MySQL LONGTEXT types to BYTES.
    • "bigint-unsigned-to-bigint": maps MySQL BIGINT UNSIGNED, BIGINT UNSIGNED ZEROFILL, and SERIAL to BIGINT. You should ensure overflow won't occur when using this option.

--computed_column
    The definitions of computed columns. The argument field is a MySQL table field name. See here for a complete list of configurations.

--metadata_column
    Specifies which metadata columns to include in the output schema of the connector. Metadata columns provide additional information about the source data, for example: --metadata_column table_name,database_name,op_ts. See its document for a complete list of available metadata.

--mysql_conf
    The configuration for the Flink CDC MySQL source. Each configuration should be specified in the format "key=value". hostname, username, password, database-name and table-name are required configurations; others are optional. See its document for a complete list of configurations.

--catalog_conf
    The configuration for the Paimon catalog. Each configuration should be specified in the format "key=value". See here for a complete list of catalog configurations.

--table_conf
    The configuration for the Paimon table sink. Each configuration should be specified in the format "key=value". See here for a complete list of table configurations.
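
For instance, the multi-valued flags above compose like the following hedged fragment (the chosen values are illustrative) when spliced into the full command; note the comma-separated type-mapping options:

    --type_mapping tinyint1-not-bool,to-nullable \
    --metadata_column table_name,database_name,op_ts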

If the specified Paimon table does not exist, it will be created automatically. Its schema will be derived from all of the specified MySQL tables. If the Paimon table already exists, its schema will be compared against the schemas of all specified MySQL tables.

Example 1: Synchronize Tables into One Paimon Table

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.7.0-incubating.jar \
    mysql_sync_table \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --table test_table \
    --partition_keys pt \
    --primary_keys pt,uid \
    --computed_column '_year=year(age)' \
    --mysql_conf hostname=127.0.0.1 \
    --mysql_conf username=root \
    --mysql_conf password=123456 \
    --mysql_conf database-name='source_db' \
    --mysql_conf table-name='source_table1|source_table2' \
    --catalog_conf metastore=hive \
    --catalog_conf uri=thrift://hive-metastore:9083 \
    --table_conf bucket=4 \
    --table_conf changelog-producer=input \
    --table_conf sink.parallelism=4

As the example above shows, the table-name in mysql_conf supports regular expressions, so a single job can monitor multiple tables that match the expression. The schemas of all matched tables will be merged into one Paimon table schema.
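
A hedged variant of the same flag (the table names are illustrative) that uses a regular expression instead of an explicit '|'-separated list:

    --mysql_conf table-name='source_table.*'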

Example 2: Synchronize Shards into One Paimon Table

You can set database-name with a regular expression to capture multiple databases. A typical scenario is that a table 'source_table' is split across the databases 'source_db1', 'source_db2', ..., and the data of every 'source_table' is synchronized into one Paimon table.

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.7.0-incubating.jar \
    mysql_sync_table \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --table test_table \
    --partition_keys pt \
    --primary_keys pt,uid \
    --computed_column '_year=year(age)' \
    --mysql_conf hostname=127.0.0.1 \
    --mysql_conf username=root \
    --mysql_conf password=123456 \
    --mysql_conf database-name='source_db.+' \
    --mysql_conf table-name='source_table' \
    --catalog_conf metastore=hive \
    --catalog_conf uri=thrift://hive-metastore:9083 \
    --table_conf bucket=4 \
    --table_conf changelog-producer=input \
    --table_conf sink.parallelism=4

Synchronizing Databases

By using MySqlSyncDatabaseAction in a Flink DataStream job, or through flink run, you can synchronize an entire MySQL database into one Paimon database.

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.7.0-incubating.jar \
    mysql_sync_database \
    --warehouse <warehouse-path> \
    --database <database-name> \
    [--ignore_incompatible <true/false>] \
    [--merge_shards <true/false>] \
    [--table_prefix <paimon-table-prefix>] \
    [--table_suffix <paimon-table-suffix>] \
    [--including_tables <mysql-table-name|name-regular-expr>] \
    [--excluding_tables <mysql-table-name|name-regular-expr>] \
    [--mode <sync-mode>] \
    [--metadata_column <metadata-column>] \
    [--type_mapping <option1,option2...>] \
    [--mysql_conf <mysql-cdc-source-conf> [--mysql_conf <mysql-cdc-source-conf> ...]] \
    [--catalog_conf <paimon-catalog-conf> [--catalog_conf <paimon-catalog-conf> ...]] \
    [--table_conf <paimon-table-sink-conf> [--table_conf <paimon-table-sink-conf> ...]]
--warehouse
    The path to Paimon warehouse.

--database
    The database name in Paimon catalog.

--ignore_incompatible
    Defaults to false; in this case, if a MySQL table name already exists in Paimon and their schemas are incompatible, an exception will be thrown. You can explicitly set it to true to ignore the incompatible tables and the exception.

--merge_shards
    Defaults to true; in this case, if some tables in different databases have the same name, their schemas will be merged and their records will be synchronized into one Paimon table. Otherwise, each table's records will be synchronized to a corresponding Paimon table, and that table will be named 'databaseName_tableName' to avoid potential name conflicts.

--table_prefix
    The prefix of all Paimon tables to be synchronized. For example, if you want all synchronized tables to have "ods_" as a prefix, specify "--table_prefix ods_".

--table_suffix
    The suffix of all Paimon tables to be synchronized. The usage is the same as "--table_prefix".

--including_tables
    Specifies which source tables are to be synchronized. You must use '|' to separate multiple tables. Because '|' is a special character, quoting is required, for example: 'a|b|c'. Regular expressions are supported; for example, specifying "--including_tables test|paimon.*" means to synchronize the table 'test' and all tables starting with 'paimon'.

--excluding_tables
    Specifies which source tables are not to be synchronized. The usage is the same as "--including_tables". "--excluding_tables" has higher priority than "--including_tables" if both are specified.

--mode
    Specifies the synchronization mode. Possible values:
    • "divided" (the default if none is specified): start one sink per table; synchronizing a newly added table requires restarting the job.
    • "combined": start a single combined sink for all tables; newly added tables will be synchronized automatically.

--metadata_column
    Specifies which metadata columns to include in the output schema of the connector. Metadata columns provide additional information about the source data, for example: --metadata_column table_name,database_name,op_ts. See its document for a complete list of available metadata.

--type_mapping
    Specifies how to map MySQL data types to Paimon types. Supported options:
    • "tinyint1-not-bool": maps MySQL TINYINT(1) to TINYINT instead of BOOLEAN.
    • "to-nullable": ignores all NOT NULL constraints (except for primary keys). This is used to solve the problem that Flink cannot accept the MySQL 'ALTER TABLE ADD COLUMN column type NOT NULL DEFAULT x' operation.
    • "to-string": maps all MySQL types to STRING.
    • "char-to-string": maps MySQL CHAR(length)/VARCHAR(length) types to STRING.
    • "longtext-to-bytes": maps MySQL LONGTEXT types to BYTES.
    • "bigint-unsigned-to-bigint": maps MySQL BIGINT UNSIGNED, BIGINT UNSIGNED ZEROFILL, and SERIAL to BIGINT. You should ensure overflow won't occur when using this option.

--mysql_conf
    The configuration for the Flink CDC MySQL source. Each configuration should be specified in the format "key=value". hostname, username, password and database-name are required configurations; others are optional. See its document for a complete list of configurations.

--catalog_conf
    The configuration for the Paimon catalog. Each configuration should be specified in the format "key=value". See here for a complete list of catalog configurations.

--table_conf
    The configuration for the Paimon table sink. Each configuration should be specified in the format "key=value". See here for a complete list of table configurations.
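
As a hedged illustration (the table names are hypothetical), the two filters can be combined; the exclusion wins wherever both match:

    --including_tables 'ods_.+' \
    --excluding_tables 'ods_tmp.+'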

Only tables with primary keys will be synchronized.

For each MySQL table to be synchronized, if the corresponding Paimon table does not exist, it will be created automatically. Its schema will be derived from all of the specified MySQL tables. If the Paimon table already exists, its schema will be compared against the schemas of all specified MySQL tables.

Example 1: Synchronize an Entire Database

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.7.0-incubating.jar \
    mysql_sync_database \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --mysql_conf hostname=127.0.0.1 \
    --mysql_conf username=root \
    --mysql_conf password=123456 \
    --mysql_conf database-name=source_db \
    --catalog_conf metastore=hive \
    --catalog_conf uri=thrift://hive-metastore:9083 \
    --table_conf bucket=4 \
    --table_conf changelog-producer=input \
    --table_conf sink.parallelism=4

Example 2: Synchronize Newly Added Tables in a Database

First, suppose a Flink job is synchronizing the tables [product, user, address] under the database source_db.

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.7.0-incubating.jar \
    mysql_sync_database \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --mysql_conf hostname=127.0.0.1 \
    --mysql_conf username=root \
    --mysql_conf password=123456 \
    --mysql_conf database-name=source_db \
    --catalog_conf metastore=hive \
    --catalog_conf uri=thrift://hive-metastore:9083 \
    --table_conf bucket=4 \
    --table_conf changelog-producer=input \
    --table_conf sink.parallelism=4 \
    --including_tables 'product|user|address'

Later, you want the job to also synchronize the tables [order, custom], which contain historical data. You can achieve this by recovering from a previous snapshot of the job, thereby reusing its existing state. The recovered job will first take a snapshot of the newly added tables, and then automatically continue reading the changelog from the previous position.
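
To obtain that snapshot, first stop the running job with a savepoint. A minimal sketch using Flink's standard stop-with-savepoint CLI (the job ID and savepoint directory are placeholders):

<FLINK_HOME>/bin/flink stop \
    --savepointPath /path/to/savepoints \
    $JOB_ID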

The command to recover from the previous savepoint and add the new tables to synchronize is as follows:

<FLINK_HOME>/bin/flink run \
    --fromSavepoint savepointPath \
    /path/to/paimon-flink-action-0.7.0-incubating.jar \
    mysql_sync_database \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --mysql_conf hostname=127.0.0.1 \
    --mysql_conf username=root \
    --mysql_conf password=123456 \
    --mysql_conf database-name=source_db \
    --catalog_conf metastore=hive \
    --catalog_conf uri=thrift://hive-metastore:9083 \
    --table_conf bucket=4 \
    --including_tables 'product|user|address|order|custom'

NOTE: You can start the job with --mode combined so that newly added tables are synchronized automatically without restarting the job.
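
A hedged sketch of Example 2 started in combined mode from the outset (the other flags are as before), so new source tables are picked up without a restart:

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.7.0-incubating.jar \
    mysql_sync_database \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --mode combined \
    --mysql_conf hostname=127.0.0.1 \
    --mysql_conf username=root \
    --mysql_conf password=123456 \
    --mysql_conf database-name=source_db \
    --catalog_conf metastore=hive \
    --catalog_conf uri=thrift://hive-metastore:9083 \
    --table_conf bucket=4 \
    --including_tables 'product|user|address|order|custom'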

Example 3: Synchronize and Merge Multiple Shards

Suppose there are multiple database shards db1, db2, ..., and each database contains the tables tbl1, tbl2, .... You can synchronize all of the 'db.+.tbl.+' tables into the tables test_db.tbl1, test_db.tbl2, ....

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.7.0-incubating.jar \
    mysql_sync_database \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --mysql_conf hostname=127.0.0.1 \
    --mysql_conf username=root \
    --mysql_conf password=123456 \
    --mysql_conf database-name='db.+' \
    --catalog_conf metastore=hive \
    --catalog_conf uri=thrift://hive-metastore:9083 \
    --table_conf bucket=4 \
    --table_conf changelog-producer=input \
    --table_conf sink.parallelism=4 \
    --including_tables 'tbl.+'

By setting database-name to a regular expression, the synchronization job will capture all tables under the matched databases and merge tables of the same name into one table.

Set --merge_shards false to prevent merging shards. Each synchronized table will then be named 'databaseName_tableName' to avoid potential name conflicts.
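
For instance, a hedged variant of Example 3 that keeps the shards separate (the resulting Paimon tables would be named like db1_tbl1, db2_tbl1, ...):

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.7.0-incubating.jar \
    mysql_sync_database \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --merge_shards false \
    --mysql_conf hostname=127.0.0.1 \
    --mysql_conf username=root \
    --mysql_conf password=123456 \
    --mysql_conf database-name='db.+' \
    --catalog_conf metastore=hive \
    --catalog_conf uri=thrift://hive-metastore:9083 \
    --table_conf bucket=4 \
    --including_tables 'tbl.+'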

FAQ

  1. Chinese characters in data extracted from MySQL are garbled
  • Set env.java.opts: -Dfile.encoding=UTF-8 in flink-conf.yaml (since Flink 1.17, this option has been renamed to env.java.opts.all).
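
A minimal sketch of the corresponding flink-conf.yaml entry (use the second form on Flink 1.17 and later):

# Flink < 1.17
env.java.opts: -Dfile.encoding=UTF-8
# Flink >= 1.17
env.java.opts.all: -Dfile.encoding=UTF-8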
