Automation of Chinese Support for Hive on AWS EMR
AWS EMR is a managed service with which you can launch a big-data cluster with supported open source tools such as Hadoop, Spark, Hudi, Tez, Tensorflow etc. Among a group of supported open source tools, Hive is involved in the list.
Hive can expedite the data manipulation on large datasets placed in distributed storage. To view specific datasets, you need to define the composition of the datasets as tables. And in some scenarios for end users, they might want to look up what columns mean in a table with readable languages. Like what Column
Attrition_Flag indicates in the table of
BankChurners where an end user can look up its native-language-friendly definition, for example, by executing
SHOW CREATE TABLE BankChurners; .
If you launch an EMR cluster without too much modification where Hive resides plus there are non-English descriptions in table comments and column comments, you will encounter a display issue similar as the following:
Attrition_Flag string ??????
Step by step
Modify metadata using SQL
The default metadata storage on EMR is MySQL. You need to modify specific columns in the MySQL database in order to define Hive components with non-English comments on columns and tables. Those modifications can be achieved with the following MySQL SQL commands.
Modify metadata using CLI
For executing the SQLs through CLI, you can utilize
mysql . Combining the above information, one can make non-English comments on columns and tables available, or can be displayed, with the following bash command.
Modify metadata using .sh
The information of the user name and password can be found at
hive-site.xml via the properties of
hive-site.xml is located at
/etc/hive/conf.dist/hive-site.xml on the master node of the EMR cluster. Without modifying
hive-site.xml, the alteration of the tables in the meta database can be done like the following.
Now, we only need to make it part of the launch of the EMR cluster. Placing the shell script executed during bootstrap actions is not proper since by the time point of the execution, Hive will haven’t be installed. We are going to put it into step actions.
With proper settings and combination of AWS Lambda and Amazon CloudWatch, you should be able to launch an EMR cluster automatically where Chinese comments (data definition) on Hive are available!
Solution in Brief
- Write a shell script where proper modification in metadata storage (MySQL) will be progressed.
- Place the script into a location where the launch of an EMR cluster can access (I put it on AWS S3).
- Make it run in Steps during the launch of the EMR cluster.
With the configuration without searching the information of metadata database from
hive-stie.xml is also feasible, you could try yourself.
- Ahmed, 2018. What Is The Correct Syntax For Running A Bash Script As A Step In EMR?. [online] Stack Overflow. Available at: <https://stackoverflow.com/questions/51803980/what-is-the-correct-syntax-for-running-a-bash-script-as-a-step-in-emr> [Accessed 20 November 2020].
- Boto3.amazonaws.com. 2020. EMR — Boto3 Docs 1.16.25 Documentation. [online] Available at: <https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.run_job_flow> [Accessed 27 November 2020].
- Cwiki.apache.org. 2020. Languagemanual DDL — Apache Hive — Apache Software Foundation. [online] Available at: <https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL> [Accessed 20 November 2020].
- Dev.mysql.com. n.d. Mysql :: Mysql 8.0 Reference Manual :: 4.5.1 Mysql — The Mysql Command-Line Client. [online] Available at: <https://dev.mysql.com/doc/refman/8.0/en/mysql.html> [Accessed 20 November 2020].