spark
[PYTHON] Fix `assertDataFrameEqual` behavior with mixed DataFrame types
#46836
Closed


arminnh commented 341 days ago (edited 340 days ago)

What changes were proposed in this pull request?

  • Avoid an AttributeError (see examples below) when mixing a Spark DataFrame with a Pandas or Pandas-on-Spark DataFrame in assertDataFrameEqual, by routing into _assert_pandas_almost_equal/_assert_pandas_equal instead of assertAlmostEqual/assertEqual.
  • In PandasOnSparkTestUtils.assert_eq, apply the Pandas-on-Spark flow to both the left and right parameters instead of only left, and clarify the error to state that a Pandas or Pandas-on-Spark object is expected. This is not obvious from the current message, which lists the expected type as "DataFrame, DataFrame, Series, Series, IndexIndex".

Why are the changes needed?

  • assertDataFrameEqual raises an AttributeError when given a Spark DataFrame as the first argument and a Pandas or Pandas-on-Spark DataFrame as the second argument (when not inheriting from unittest).
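The failure mode can be reproduced without Spark. The following is a minimal sketch, assuming a hypothetical `HypotheticalTestUtils` class (a stand-in, not the real `PandasOnSparkTestUtils`) whose `assert_eq` calls `self.assertAlmostEqual`. That method is only available when the class is combined with `unittest.TestCase`.

```python
import unittest


class HypotheticalTestUtils:
    """Stand-in for a test-utility mixin; not the real PandasOnSparkTestUtils."""

    def assert_eq(self, left, right):
        # This method is provided by unittest.TestCase, not by this class.
        self.assertAlmostEqual(left, right)


class WithTestCase(HypotheticalTestUtils, unittest.TestCase):
    """The same utilities, now combined with unittest.TestCase."""


# Instantiated on its own, the utility class has no assertAlmostEqual.
try:
    HypotheticalTestUtils().assert_eq(1.0, 1.0)
    standalone_error = None
except AttributeError as exc:
    standalone_error = exc

# Combined with TestCase, the same call succeeds.
WithTestCase().assert_eq(1.0, 1.0)

print(type(standalone_error).__name__, standalone_error)
```

This is presumably why the error only shows up when `assertDataFrameEqual` is called outside a `unittest`-based test class.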

Does this PR introduce any user-facing change?

  • Better errors are raised in the situation described above:
    • A PySparkAssertionError with a descriptive message is raised instead of an AttributeError.
    • When mixing Spark and Pandas-on-Spark DataFrames, PandasOnSparkTestUtils.assert_eq consistently raises a PySparkAssertionError, regardless of which one is left or right.
    • The error message is clarified: "Expected type DataFrame, DataFrame, Series, Series, IndexIndex, for ..." becomes "Expected type Pandas or Pandas-on-Spark DataFrame, Series, or Index for ...".
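The garbled "IndexIndex" in the old message is consistent with how the old test expectation (quoted in the review of python/pyspark/sql/tests/test_utils.py) builds the string: adjacent f-string literals are concatenated without a separator, and one of them lacks a trailing ", ". The sketch below illustrates this with hypothetical stand-in classes that only share the names of the real types; it is an assumption that the library message was built the same way.

```python
# Stand-in classes that only mimic the names of the pandas and
# pandas-on-Spark types; they are hypothetical and carry no behavior.
class DataFrame: ...
class Series: ...
class Index: ...


# Python joins adjacent string literals with no separator. The fifth
# literal has no trailing ", ", so the two Index names fuse together.
old_expected_type = (
    f"{DataFrame.__name__}, "
    f"{DataFrame.__name__}, "
    f"{Series.__name__}, "
    f"{Series.__name__}, "
    f"{Index.__name__}"
    f"{Index.__name__}, "
)

# The replacement proposed in this PR is a single plain string.
new_expected_type = "Pandas or Pandas-on-Spark DataFrame, Series, or Index"

print(repr(old_expected_type))
print(repr(new_expected_type))
```

A fixed literal avoids both the fused names and the repeated type names that give no hint about which library each one comes from.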

Setup:

import pandas as pd
import pyspark.pandas as ps
from pyspark.testing import assertDataFrameEqual

df1 = spark.createDataFrame([(10,), (20,), (30,)], ["Numbers"])
df2 = pd.DataFrame(data=[10, 11, 13], columns=["Numbers"])
df3 = ps.DataFrame(data=[10, 11, 13], columns=["Numbers"])

Before:

>>> assertDataFrameEqual(df1, df2, ignoreColumnType=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...spark/python/pyspark/testing/utils.py", line 828, in assertDataFrameEqual
    return PandasOnSparkTestUtils().assert_eq(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...spark/python/pyspark/testing/pandasutils.py", line 483, in assert_eq
    self.assertAlmostEqual(lobj, robj)
    ^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'PandasOnSparkTestUtils' object has no attribute 'assertAlmostEqual'. Did you mean: 'assertPandasEqual'?

>>> assertDataFrameEqual(df2, df1, ignoreColumnType=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...spark/python/pyspark/testing/utils.py", line 828, in assertDataFrameEqual
    return PandasOnSparkTestUtils().assert_eq(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...spark/python/pyspark/testing/pandasutils.py", line 472, in assert_eq
    _assert_pandas_almost_equal(lobj, robj, rtol=rtol, atol=atol)
  File "...spark/python/pyspark/testing/pandasutils.py", line 314, in _assert_pandas_almost_equal
    raise PySparkAssertionError(
pyspark.errors.exceptions.base.PySparkAssertionError: [INVALID_TYPE_DF_EQUALITY_ARG] Expected type DataFrame, Series, Index,  for `right` but got type <class 'pyspark.sql.classic.dataframe.DataFrame'>.

>>> assertDataFrameEqual(df1, df3, ignoreColumnType=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...spark/python/pyspark/testing/utils.py", line 828, in assertDataFrameEqual
    return PandasOnSparkTestUtils().assert_eq(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...spark/python/pyspark/testing/pandasutils.py", line 483, in assert_eq
    self.assertAlmostEqual(lobj, robj)
    ^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'PandasOnSparkTestUtils' object has no attribute 'assertAlmostEqual'. Did you mean: 'assertPandasEqual'?

>>> assertDataFrameEqual(df3, df1, ignoreColumnType=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...spark/python/pyspark/testing/utils.py", line 828, in assertDataFrameEqual
    return PandasOnSparkTestUtils().assert_eq(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...spark/python/pyspark/testing/pandasutils.py", line 438, in assert_eq
    raise PySparkAssertionError(
pyspark.errors.exceptions.base.PySparkAssertionError: [INVALID_TYPE_DF_EQUALITY_ARG] Expected type DataFrame, DataFrame, Series, Series, IndexIndex,  for `expected` but got type <class 'pyspark.sql.classic.dataframe.DataFrame'>.

After:

>>> assertDataFrameEqual(df1, df2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...spark/python/pyspark/testing/utils.py", line 828, in assertDataFrameEqual
    return PandasOnSparkTestUtils().assert_eq(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...spark/python/pyspark/testing/pandasutils.py", line 476, in assert_eq
    _assert_pandas_almost_equal(lobj, robj, rtol=rtol, atol=atol)
  File "...spark/python/pyspark/testing/pandasutils.py", line 300, in _assert_pandas_almost_equal
    raise PySparkAssertionError(
pyspark.errors.exceptions.base.PySparkAssertionError: [INVALID_TYPE_DF_EQUALITY_ARG] Expected type DataFrame, Series, Index,  for `left` but got type <class 'pyspark.sql.classic.dataframe.DataFrame'>.

>>> assertDataFrameEqual(df2, df1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...spark/python/pyspark/testing/utils.py", line 828, in assertDataFrameEqual
    return PandasOnSparkTestUtils().assert_eq(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...spark/python/pyspark/testing/pandasutils.py", line 465, in assert_eq
    _assert_pandas_almost_equal(lobj, robj, rtol=rtol, atol=atol)
  File "...spark/python/pyspark/testing/pandasutils.py", line 311, in _assert_pandas_almost_equal
    raise PySparkAssertionError(
pyspark.errors.exceptions.base.PySparkAssertionError: [INVALID_TYPE_DF_EQUALITY_ARG] Expected type DataFrame, Series, Index,  for `right` but got type <class 'pyspark.sql.classic.dataframe.DataFrame'>.

>>> assertDataFrameEqual(df1, df3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...spark/python/pyspark/testing/utils.py", line 828, in assertDataFrameEqual
    return PandasOnSparkTestUtils().assert_eq(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...spark/python/pyspark/testing/pandasutils.py", line 426, in assert_eq
    raise PySparkAssertionError(
pyspark.errors.exceptions.base.PySparkAssertionError: [INVALID_TYPE_DF_EQUALITY_ARG] Expected type Pandas or Pandas-on-Spark DataFrame, Series, or Index for `left` but got type <class 'pyspark.sql.classic.dataframe.DataFrame'>.

>>> assertDataFrameEqual(df3, df1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...spark/python/pyspark/testing/utils.py", line 828, in assertDataFrameEqual
    return PandasOnSparkTestUtils().assert_eq(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...spark/python/pyspark/testing/pandasutils.py", line 438, in assert_eq
    raise PySparkAssertionError(
pyspark.errors.exceptions.base.PySparkAssertionError: [INVALID_TYPE_DF_EQUALITY_ARG] Expected type Pandas or Pandas-on-Spark DataFrame, Series, or Index for `right` but got type <class 'pyspark.sql.classic.dataframe.DataFrame'>.
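
Since mixed inputs are rejected either way, callers who do want to compare across DataFrame types need to convert to a common type first. The helper below is a rough sketch and not part of this PR: `to_pandas_if_possible` is a hypothetical name, and it relies on Spark DataFrames exposing `toPandas()` and Pandas-on-Spark objects exposing `to_pandas()`. The stand-in classes exist only so the sketch runs without a SparkSession.

```python
def to_pandas_if_possible(obj):
    """Return the result of a known pandas conversion method, else obj itself."""
    # Spark DataFrames expose toPandas(); pandas-on-Spark objects expose to_pandas().
    for method_name in ("toPandas", "to_pandas"):
        method = getattr(obj, method_name, None)
        if callable(method):
            return method()
    # Assumption: anything without those methods is already a pandas object.
    return obj


class FakeSparkFrame:
    """Hypothetical stand-in for a Spark DataFrame."""

    def toPandas(self):
        return "pandas-from-spark"


class FakePandasOnSparkFrame:
    """Hypothetical stand-in for a pandas-on-Spark DataFrame."""

    def to_pandas(self):
        return "pandas-from-pandas-on-spark"


print(to_pandas_if_possible(FakeSparkFrame()))
print(to_pandas_if_possible(FakePandasOnSparkFrame()))
print(to_pandas_if_possible("already-pandas"))
```

With real objects, the call might then look like `assertDataFrameEqual(to_pandas_if_possible(df1), to_pandas_if_possible(df2))`. Note that `toPandas()` collects all rows to the driver, so this only suits small test data.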

How was this patch tested?

  • Manually tested new behavior in local SparkSession.
  • Extended existing test case with Pandas-on-Spark DataFrame to confirm the correct error is raised when the parameters are flipped.
  • Added test case with Spark DataFrame & Pandas DataFrame.
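
The "flipped parameters" check described above can be sketched in isolation. This is an illustrative example only, not the test added in this PR: `SparkLikeFrame`, `PandasLikeFrame`, and `check_args` are hypothetical stand-ins, and a plain `TypeError` is used in place of `PySparkAssertionError`.

```python
import unittest


class SparkLikeFrame:
    """Hypothetical stand-in for a Spark DataFrame."""


class PandasLikeFrame:
    """Hypothetical stand-in for a pandas or pandas-on-Spark DataFrame."""


def check_args(left, right, allowed=(PandasLikeFrame,)):
    """Validate both arguments the same way and name the offending one."""
    for arg_name, obj in (("left", left), ("right", right)):
        if not isinstance(obj, allowed):
            raise TypeError(
                f"Expected a pandas-like object for `{arg_name}` "
                f"but got type {type(obj)}."
            )


class MixedTypeOrderTest(unittest.TestCase):
    def test_error_is_raised_for_either_order(self):
        spark_like, pandas_like = SparkLikeFrame(), PandasLikeFrame()
        # The same kind of error is expected whichever side is invalid.
        with self.assertRaisesRegex(TypeError, "`left`"):
            check_args(spark_like, pandas_like)
        with self.assertRaisesRegex(TypeError, "`right`"):
            check_args(pandas_like, spark_like)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(MixedTypeOrderTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Checking both orderings in one test guards against a validation path that only inspects the first argument.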

Was this patch authored or co-authored using generative AI tooling?

No

arminnh Add test cases that result in errors
ef4b0c16
github-actions added SQL
github-actions added PYTHON
arminnh More robust type checking and consistent errors in PandasOnSparkTestU…
9f21770d
arminnh force pushed from cb661269 to 9f21770d 341 days ago
github-actions added PANDAS API ON SPARK
HyukjinKwon
HyukjinKwon340 days ago
arminnh Fix lint error & test that mixed numpy array and pd.Series
1f6dfa35
arminnh force pushed from 564a8e28 to 1f6dfa35 340 days ago
arminnh changed the title from "[WIP][PYTHON] Fix `assertDataFrameEqual` behavior with mixed DataFrame types" to "[PYTHON] Fix `assertDataFrameEqual` behavior with mixed DataFrame types" 340 days ago
arminnh commented 340 days ago

@itholic @HyukjinKwon I didn't find an issue about this on Jira but I hope it's fine to propose a solution. It's a minor bug I encountered while setting up automated testing on Databricks at work.

HyukjinKwon commented 340 days ago

You can create an account and create a JIRA ticket.

itholic commented on 2024-06-04 (339 days ago)

Let's not change the existing tests. Otherwise, this looks fine.

python/pyspark/pandas/tests/indexes/test_indexing_adv.py
  self.assertEqual(psdf.at[3, "b"], 6)
  self.assertEqual(psdf.at[3, "b"], pdf.at[3, "b"])
  self.assert_eq(psdf.at[9, "b"], np.array([0, 0, 0]))
- self.assert_eq(psdf.at[9, "b"], pdf.at[9, "b"])
+ self.assert_eq(psdf.at[9, "b"], pdf.at[9, "b"].to_numpy())
itholic commented 339 days ago

I believe the previous one looks more reasonable.

python/pyspark/sql/tests/test_utils.py
  exception=pe.exception,
  error_class="INVALID_TYPE_DF_EQUALITY_ARG",
  message_parameters={
-     "expected_type": f"{ps.DataFrame.__name__}, "
-     f"{pd.DataFrame.__name__}, "
-     f"{ps.Series.__name__}, "
-     f"{pd.Series.__name__}, "
-     f"{ps.Index.__name__}"
-     f"{pd.Index.__name__}, ",
-     "arg_name": "expected",
+     "expected_type": "Pandas or Pandas-on-Spark DataFrame, Series, or Index",
itholic commented 339 days ago

+1, this one looks clearer to me.

github-actions commented 238 days ago

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions added Stale
github-actions closed this 237 days ago
