Running Apache Spark/.NET and getting data from Oracle DB
Running Apache Spark/.NET on Windows
Microsoft has announced .NET support for Apache Spark.
The main .NET for Apache Spark documentation is here.
In this article I will highlight the steps necessary to get .NET for Apache Spark running on Windows.
Download and install all the components mentioned in the README.md, initialize the necessary environment variables, and create the HelloSpark program.
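For reference, a minimal HelloSpark program only needs to obtain a SparkSession and act on a DataFrame. The sketch below assumes the program simply reads a local text file named input.txt and prints it; the file name is a placeholder, and the actual getting-started example in the README may differ:

using Microsoft.Spark.Sql;

namespace HelloSpark {
    class Program {
        static void Main(string[] args) {
            // Obtain the Spark session provided by the DotnetRunner.
            var spark = SparkSession.Builder().AppName("HelloSpark").GetOrCreate();

            // Load a local text file into a DataFrame and dump it to the console.
            var df = spark.Read().Text("input.txt");
            df.Show();
        }
    }
}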
Get winutils.exe (as described in step 3 of Spark 2: How to install it on Windows in 5 steps) and copy it into your Spark bin directory.
Define the environment variables SPARK_HOME and HADOOP_HOME as described in the article above.
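On Windows they can be set once from a command prompt with setx (the values take effect in newly opened prompts). The paths below are only placeholders for wherever you unpacked Spark; HADOOP_HOME can point to the same directory because winutils.exe was copied into Spark's bin:
setx SPARK_HOME C:\bin\spark-2.4.x-bin-hadoop2.7
setx HADOOP_HOME C:\bin\spark-2.4.x-bin-hadoop2.7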
Start your example in the HelloSpark deployment directory (use your version of microsoft-spark-XXX.jar):
spark-submit --class org.apache.spark.deploy.DotnetRunner --master local microsoft-spark-2.4.x-0.2.0.jar dotnet HelloSpark.dll
You may see a warning WARN SparkEnv: Exception while deleting Spark temp dir .. in your output, which is most likely related to the Spark issues SPARK-12216 and SPARK-8333. I did not find a way to get rid of this warning on Windows.
Get data from Oracle DB into Spark/.NET
Get ojdbc6.jar (ojdbc8.jar works too) from Oracle Database 11.2.0.4 JDBC Driver & UCP Downloads.
You can either copy ojdbc8.jar into $SPARK_HOME/jars or specify the path to the jar as an argument of spark-submit (see below).
Here is the C# code accessing data in Oracle:
using Microsoft.Spark.Sql;

namespace HelloSpark {
    class Program {
        static void Main(string[] args) {
            var spark = SparkSession.Builder().GetOrCreate();

            // Read an Oracle table through Spark's JDBC data source.
            var df = spark
                .Read()
                .Format("jdbc")
                .Option("url", "jdbc:oracle:thin:@<db-ip-addr>:<port>:<SID>")
                .Option("dbtable", "<table>")
                .Option("user", "<user>")
                .Option("password", "<pwd>")
                .Option("driver", "oracle.jdbc.driver.OracleDriver")
                .Load();

            // Print the inferred schema and the table contents.
            df.PrintSchema();
            df.Show();
        }
    }
}
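The dbtable option is not limited to a plain table name: Spark's JDBC source also accepts an aliased subquery, so the filtering runs inside Oracle rather than in Spark. A sketch of this variant, with the query, columns, and alias as placeholders:

var df = spark
    .Read()
    .Format("jdbc")
    .Option("url", "jdbc:oracle:thin:@<db-ip-addr>:<port>:<SID>")
    .Option("dbtable", "(SELECT <col1>, <col2> FROM <table> WHERE <condition>) t")
    .Option("user", "<user>")
    .Option("password", "<pwd>")
    .Option("driver", "oracle.jdbc.driver.OracleDriver")
    .Load();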
If you have copied ojdbc8.jar into $SPARK_HOME/jars, then run the test program as:
spark-submit --class org.apache.spark.deploy.DotnetRunner --master local microsoft-spark-2.4.x-0.2.0.jar dotnet HelloSpark.dll
Otherwise, add the --driver-class-path argument:
spark-submit --driver-class-path <path-to-ojdbc>\ojdbc8.jar --class org.apache.spark.deploy.DotnetRunner --master local microsoft-spark-2.4.x-0.2.0.jar dotnet HelloSpark.dll
In both cases you should see the table schema and the data dumped to the console.