Jenkins and Machine Learning Plugins for Data Science

Project goal: Create a new plugin for integrating Jenkins with one of Machine Learning tools (e.g. Jupyter Python, TensorBoard, or Sacred)

Skills to study/improve: Java, Jenkins plugin, Apache Zeppelin, Jupyter Notebooks, Python, Machine Learning, Data Science

Details

Background

Machine learning (ML) has established itself as a key data science (DS) technology in finance, retail, marketing, science, and many other fields. Machine learning algorithms learn by analyzing features of training data sets that can then be applied to make predictions, estimations, and classifications in new test cases. The power of machine learning comes at the cost of complex development and production environment to support the data, algorithms, and models generated from machine learning [1, see figure below]

Project Proposal

It is envisioned that Jenkins and its plugin ecosystem can play a key role in supporting, facilitating and accelerating the integration of ML workflow components and orchestrating their reproducible execution. Currently, Python and interactive computational notebooks (such as Jupyter and Zeppelin) are the dominant software tools for teaching, composing and executing machine learning workflows. Interactive notebooks have proven valuable data exploration and teaching tools as they adopt a ‘literal programming’ paradigm where code fragments, results, instructions and documentation are integrated into a single UI, resembling a familiar paper ‘notebook’. However, interactive notebooks present significant limitations to the adoption of good software engineering principles (test-driven development, code versioning, reproducibility, etc) as well as scaling-up to data sizes typically used in DS production environments [2]. We propose the development of a Jenkins plugin that will integrate machine learning tasks and algorithms to build ML data systems that are easier to develop, test, deploy and maintain. The hidden power of interactive computational notebooks comes from back-end ‘polyglot’ interpreters (a.k.a kernels) that interpret the code fragments, and return the results of the computation is a useful display (graph, table or other). We propose to build a Jenkins plugin that will allow the integration of existing polyglot notebook kernels to support notebook - like computations as Jenkins build tasks. The goal is to allow data scientists to define production-grade ML workflows as Jenkins builds, integrating typical Jenkins build tasks with notebook code fragments as build steps. The plugin will support a new build task, to execute code via an appropriate language ‘kernel’ as currently supported by existing notebooks such as Zeppelin and Jupyter.

Technical Quickstart

The project involves interaction between a Jenkins node and an IPython Kernel. Jenkins must be able to connect to a running IPython Kernel - or start one if necessary - to be used to evaluate code used in a Jenkins plugin. Apache Zeppelin will be used for this interaction, as they provide existing Java code to interact with an IPython Kernel [3]. In Jenkins, we will use the simple Builder plugin from the Jenkins tutorial [4], modifying its original code to call the IPython instead of printing Hello World.

Here’s a step-by-step of what the student will be expected to achieve.

Use Apache Maven to create a Java project
Learn the API of the Python Maven module of Apache Zeppelin project
Use the Python Maven module of Apache Zeppelin (item 2) in the Java project created in item 1 to send Python code, and receive its output
Follow the Jenkins Plugin Tutorial [4] and create a simple Builder plug-in, that prints Hello World
Modify the plug-in (which is also a Maven project) to call the code created in item 3
Prove that we are able to call an IPython Kernel from a Jenkins job

Extra steps could involve investigate and document for future developers (which is a good practice in Software Engineering) what would be required to use other Kernels (e.g. Perl, R, Groovy, etc), where the code would have to be modified, and perhaps also see if we could use Apache Zeppelin Interpreters [5] from Jenkins.

References

Models in Minutes not Months: Data Science as Microservices
https://towardsdatascience.com/5-reasons-why-jupyter-notebooks-suck-4dc201e27086
https://github.com/apache/zeppelin/blob/master/python/README.md
link:/doc/developer/tutorial/
https://zeppelin.apache.org/docs/0.7.0/manual/interpreters.html

Inspiration or Evaluation

Other tools that researchers use for experiment tracking (should not be seen as a direct alternative! Some can be used together with Jenkins!) * https://colab.research.google.com/ * https://neptune.ml/ ( $99/user/mo) * https://www.comet.ml/ ( $39/user/mo) * https://polyaxon.com/ * https://github.com/ModelChimp/modelchimp * https://github.com/datmo/datmo * https://github.com/IDSIA/sacred + https://github.com/vivekratnavel/omniboard / https://github.com/chovanecm/sacredboard * Airflow * studio.ml * https://medium.com/polyaxon * Version Control System for Machine Learning Projects https://dvc.org/

TensorFlow

Jupyter-Zeppelin

https://www.dataquest.io/blog/jupyter-notebook-tutorial/
https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks
https://nextjournal.com/schmudde/how-to-version-control-jupyter
Existing Java integration with IPython Kernel (Jupyter), via Apache Zeppelin
- https://github.com/apache/zeppelin/blob/master/docs/interpreter/python.md#ipython-support
- https://github.com/apache/zeppelin/blob/master/python/README.md

Example Jenkins Pipeline

An example pipeline launching Tensorboard from a Jenkins pipeline can be found here

Newbie-friendly issues

We currently do not have any issues but can readily answer questions via the gitter channel here

Potential Mentors

Project Links

Organization Links

Jenkins GSoC page - documentation, application guidelines
Participate and contribute to Jenkins - landing page for newcomer contributors
Newbie-friendly issues - list of organization-wide newbie-friendly issues (use them if there is no links in the project idea)

> Go back to other GSoC 2020 project ideas