Holden Karau is a transgender Canadian, Apache Spark committer, active open source contributor, and coauthor of Learning Spark and High Performance Spark. When not in San Francisco working as a software development engineer at IBM’s Spark Technology Center, Holden speaks internationally about Spark and holds office hours at coffee shops at home and abroad. She makes frequent contributions to Spark, specializing in PySpark and machine learning. Prior to IBM, she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She holds a bachelor of mathematics in computer science from the University of Waterloo.
Testing & Validating Apache Spark Jobs
As Spark continues to evolve, we need to revisit our testing techniques to support Datasets, streaming, and more. This talk expands on “Beyond Parallelize and Collect” (which you do not need to have seen) to discuss how to create large-scale test jobs while supporting Spark’s latest features. We will explore the difficulties of testing streaming programs, look at options for setting up Spark integration tests that go beyond local mode, and examine best practices for acceptance tests.