Display Chinese Characters for Comments in Hive

Automation of Chinese Support for Hive on AWS EMR

AWS EMR is a managed service with which you can launch a big-data cluster with supported open source tools such as Hadoop, Spark, Hudi, Tez, Tensorflow etc. Among a group of supported open source tools, Hive is involved in the list.

Hive can expedite the data manipulation on large datasets placed in distributed storage. To view specific datasets, you need to define the composition of the datasets as tables. And in some scenarios for end users, they might want to look up what columns mean in a table with readable languages. Like what Column Attrition_Flag indicates in the table of BankChurners where an end user can look up its native-language-friendly definition, for example, by executing SHOW CREATE TABLE BankChurners; .

If you launch an EMR cluster without too much modification where Hive resides plus there are non-English descriptions in table comments and column comments, you will encounter a display issue similar as the following:

Attrition_Flag string ??????

Modify metadata using SQL

The default metadata storage on EMR is MySQL. You need to modify specific columns in the MySQL database in order to define Hive components with non-English comments on columns and tables. Those modifications can be achieved with the following MySQL SQL commands.

Metadata Modification in MySQL for Hive

Modify metadata using CLI

For executing the SQLs through CLI, you can utilize mysql . Combining the above information, one can make non-English comments on columns and tables available, or can be displayed, with the following bash command.

Metadata Modification with CLI for Hive

Modify metadata using .sh

The information of the user name and password can be found athive-site.xml via the properties of javax.jdo.option.ConnecitonUserName and javax.jdo.option.ConnectionPassword. And hive-site.xml is located at /etc/hive/conf.dist/hive-site.xml on the master node of the EMR cluster. Without modifying hive-site.xml, the alteration of the tables in the meta database can be done like the following.

Shell Script for Metadata Modification

Now, we only need to make it part of the launch of the EMR cluster. Placing the shell script executed during bootstrap actions is not proper since by the time point of the execution, Hive will haven’t be installed. We are going to put it into step actions.

With proper settings and combination of AWS Lambda and Amazon CloudWatch, you should be able to launch an EMR cluster automatically where Chinese comments (data definition) on Hive are available!

  1. Write a shell script where proper modification in metadata storage (MySQL) will be progressed.
  2. Place the script into a location where the launch of an EMR cluster can access (I put it on AWS S3).
  3. Make it run in Steps during the launch of the EMR cluster.

With the configuration without searching the information of metadata database from hive-stie.xml is also feasible, you could try yourself.

  1. Ahmed, 2018. What Is The Correct Syntax For Running A Bash Script As A Step In EMR?. [online] Stack Overflow. Available at: <https://stackoverflow.com/questions/51803980/what-is-the-correct-syntax-for-running-a-bash-script-as-a-step-in-emr> [Accessed 20 November 2020].
  2. Boto3.amazonaws.com. 2020. EMR — Boto3 Docs 1.16.25 Documentation. [online] Available at: <https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.run_job_flow> [Accessed 27 November 2020].
  3. Cwiki.apache.org. 2020. Languagemanual DDL — Apache Hive — Apache Software Foundation. [online] Available at: <https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL> [Accessed 20 November 2020].
  4. Dev.mysql.com. n.d. Mysql :: Mysql 8.0 Reference Manual :: 4.5.1 Mysql — The Mysql Command-Line Client. [online] Available at: <https://dev.mysql.com/doc/refman/8.0/en/mysql.html> [Accessed 20 November 2020].

9 x AWS-certified Sr. data engineer in the 104 Corporation. An AWS Community Builder