Conversation

@vrozov (Member) commented Aug 22, 2025

What changes were proposed in this pull request?

Upgrade Hive dependency to 4.1.0

Why are the changes needed?

Apache Hive 1.x, 2.x, and 3.x are EOL and no longer supported by the Apache Hive community.

Does this PR introduce any user-facing change?

Yes, it drops support for the EOL Hive versions 1.x, 2.x, and 3.x.

Spark's behavior itself does not change, but Hive 4.x behaves differently from Hive 2.3 in several ways:

  • Hive managed tables are handled differently
  • Indexes are no longer supported
  • Invalid Gregorian dates in ORC are handled differently than in Spark and prior Hive versions

How was this patch tested?

A few tests required modification due to changed Hive behavior; otherwise, all existing Spark tests pass.

Was this patch authored or co-authored using generative AI tooling?

No

@sarutak (Member) commented Aug 27, 2025

@vrozov
Thank you for your contribution! I'll leave some high-level comments.

  • There seem to be a lot of non-trivial changes needed to upgrade Hive to 4.1, so it would be better to explain which changes are needed, for what purpose, and what you tested and how. That helps reviewers.
  • I'm concerned that upgrading Hive to 4.1 and dropping 2.3 could have a significant impact on users. Can we switch between 2.3 and 4.1 using a profile (-P flag) at build time and, for the time being, keep 2.3 as the default? (See the sketch after this list.)
  • This PR is large, and the changes span a wide range of files. To make review easier and avoid conflicts with other PRs, can we break this PR into smaller ones somehow? If we can switch between 4.1 and 2.3 as mentioned above, we don't need to complete the upgrade to 4.1 all at once.
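
As a rough illustration of the profile idea (the profile id and version values here are assumptions, not taken from this PR), the switch could look something like this in the root pom.xml:

```xml
<!-- Hypothetical sketch: select the Hive version at build time.
     The profile id and version values are illustrative assumptions. -->
<properties>
  <!-- Keep Hive 2.3 as the default for the time being. -->
  <hive.version>2.3.10</hive.version>
</properties>

<profiles>
  <profile>
    <!-- Opt in to Hive 4.1 with: ./build/mvn -Phive-4.1 ... -->
    <id>hive-4.1</id>
    <properties>
      <hive.version>4.1.0</hive.version>
    </properties>
  </profile>
</profiles>
```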

@dongjoon-hyun Do you have other suggestions, since you have experience from adding the 2.3 support?

@dongjoon-hyun (Member) commented:

To @sarutak, I have no opinion on the PR because it apparently hasn't passed CI yet. What I can say for now is that the Apache Spark community doesn't allow the kind of regression this PR introduces. I'm not sure why the Hive 4.1 upgrade proposal forces us to downgrade dependencies. I hope it's a mistake, and I want to make sure we avoid this kind of hidden change.

```diff
- <antlr4.version>4.13.1</antlr4.version>
+ <antlr4.version>4.9.3</antlr4.version>
```

@sarutak (Member) commented Aug 27, 2025

@dongjoon-hyun
I didn't intend to ask you to look into each part of the changes, since the PR hasn't passed CI, as you mentioned. I wanted to ask for high-level comments, if you had any, because upgrading Hive should be done carefully.

@dongjoon-hyun (Member) commented Aug 27, 2025

Sorry, but I'm not using Apache Hive at all these days, not even a HiveMetaStore. So it's a little hard for me to help with this PR.

Based on my previous experience, I can say that this PR will require significant verification effort even after CI passes.

May I ask if @vrozov and @sarutak use Hive 4.1 or this patch internally at the production level with AWS Glue? If so, it would be a great reference for the community.

@sarutak (Member) commented Aug 28, 2025

@dongjoon-hyun
Thank you for your comment. Yes, I agree that we need to verify very carefully, so I'm wondering if we can minimize the risk by transitionally allowing builds against both 2.3 (as the default) and 4.1, much as we previously supported building with both Scala 2.12 and 2.13.

> May I ask if @vrozov and @sarutak use Hive 4.1 or this patch internally at the production level with AWS Glue? If so, it would be a great reference for the community.

@vrozov Can you share the background if possible?

@vrozov (Member, Author) commented Aug 29, 2025

> To @sarutak, I have no opinion on the PR because it apparently hasn't passed CI yet. What I can say for now is that the Apache Spark community doesn't allow the kind of regression this PR introduces. I'm not sure why the Hive 4.1 upgrade proposal forces us to downgrade dependencies. I hope it's a mistake, and I want to make sure we avoid this kind of hidden change.
>
> ```diff
> - <antlr4.version>4.13.1</antlr4.version>
> + <antlr4.version>4.9.3</antlr4.version>
> ```

@dongjoon-hyun @sarutak Unfortunately, it is not a mistake. Hive uses ANTLR 4.9.3, which is not compatible with 4.10.x and above: "Mixing ANTLR 4.9.3 and 4.10 can lead to errors that point to a version mismatch. A very common Java error looks like this: Caused by: java.io.InvalidClassException: org.antlr.v4.runtime.atn.ATN; Could not deserialize ATN with version 3 (expected 4)." That is exactly the error I got without changing antlr4.version.

One possible solution is to shade antlr4 in Spark so that there is no conflict between the Spark and Hive versions (rough sketch below). Please let me know what you think; if that sounds reasonable, I'll open a separate PR for the shading.
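
For reference, a minimal sketch of what the shading could look like with the maven-shade-plugin (the relocated package name is an assumption; Spark relocates other shaded dependencies under org.sparkproject):

```xml
<!-- Rough sketch only: relocate Spark's ANTLR 4 classes so they cannot
     clash with the 4.9.3 runtime pulled in by Hive 4.1. The shaded
     package name is an illustrative assumption. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <pattern>org.antlr.v4</pattern>
        <shadedPattern>org.sparkproject.antlr.v4</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```

Since the shade plugin rewrites bytecode references in the shaded jar, Spark's generated parsers would use the relocated runtime while Hive keeps its own 4.9.3 classes.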
