GSoC Project Intro: Usage Statistics Analysis
About myself
Hello, my name is Payal Priyadarshini. I am pursuing my major in Computer Science & Engineering at the Indian Institute of Technology Kharagpur, India. I am proficient in Python, C++, and Java, and I am currently getting familiar with Groovy as well.
I have interned at renowned institutions such as Google and VMware, where I worked with some exciting technologies; at Google, for example, these included Knowledge Graphs, BigTable, SPARQL, and RDF. I am a passionate computer science student who is always interested in learning and looking for new challenges and technologies. That’s how I came across Google Summer of Code, where I am working on some exciting data mining problems, which are described below in this post.
Project Overview
Jenkins has collected anonymous usage information from more than 100,000 installations, including the set of installed plugins and their versions, as well as release history information for upgrades. This data collection can be used for various data mining experiments. The main goal of this project is to perform various analyses and studies over the available dataset to discover trends in the usage data. This project will help us learn more about Jenkins usage by solving various problems, such as:
-
Plugin version installation trends, which will tell us how the different versions of a given plugin are being installed.
-
Spotting downgrades, which will warn us that something is wrong with the version from which the downgrade was performed.
-
Correlating what users are saying (community rating) with what users are doing (upgrades/downgrades).
-
Distribution of cluster size, where the cluster size is given by the job and node counts, which approximate the size of an installation.
-
Finding sets of plugins that are likely to be used together, which will lay the groundwork for a plugin recommendation system.
As part of Google Summer of Code 2016, I will be working on the above-mentioned problems. My mentors for the project are Kohsuke Kawaguchi and Daniel Beck. Some analyses have already been done over this data, but those are outdated, and the charts could be clearer and more interactive. This project aims to improve the existing statistics and generate the new ones discussed above.
Use Cases
This project covers a wide range of use cases derived from the problems mentioned above.
Use Case 1: Upgrade/Downgrade Analysis
Understanding the trends in upgrades and downgrades has many uses, some of which have already been mentioned: measuring popularity, spotting downgrades, quickly warning about bad versions, etc.
Use Case 1.1: Plugin version installation trends
Here we analyse the trends in the installation of different versions of a given plugin. This use case will help us learn about:
-
The trend in upgrades to the latest released version of a given plugin.
-
The trend in the declining popularity of previous versions after a new version is released.
-
The most popular plugin version at any given point in time.
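As a rough illustration of the last point, here is a minimal sketch of how the most popular version per month could be computed. The record format `(install_id, month, version)` and the function names are assumptions made for this example and do not reflect the actual layout of the Jenkins usage data.

```python
from collections import Counter, defaultdict


def version_counts_by_month(records):
    """Count installations per version for each month.

    `records` is an iterable of (install_id, month, version) tuples.
    This format is hypothetical and chosen only for illustration.
    """
    counts = defaultdict(Counter)
    # Deduplicate so an installation reporting twice in a month counts once.
    for install_id, month, version in set(records):
        counts[month][version] += 1
    return counts


def most_popular_version(counts, month):
    """Return the version with the most installations in `month`, or None."""
    month_counts = counts.get(month)
    if not month_counts:
        return None
    return month_counts.most_common(1)[0][0]


records = [
    ("a", "2016-05", "1.0"),
    ("b", "2016-05", "1.1"),
    ("c", "2016-05", "1.1"),
    ("c", "2016-05", "1.1"),  # duplicate report, counted once
    ("a", "2016-06", "1.1"),
]
counts = version_counts_by_month(records)
print(most_popular_version(counts, "2016-05"))
```

Plotting the per-month counts for each version over time would give the upgrade and decline trends described above.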
Use Case 1.2: Spotting downgrades
Here we are interested in how many installations are downgraded from a given version to a previously used version. The long-term goal of this analysis is to raise a warning when something goes wrong with a new release, which can be sensed from the downgrades performed by users. This analysis can be done by studying the monotonicity of the version number vs. timestamp graph for a given plugin: as long as an installation only upgrades, its version number never decreases over time, so any decrease indicates a downgrade.
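The monotonicity check can be sketched as follows. This is only an illustration: it assumes that the history of one plugin on one installation is available as `(timestamp, version)` pairs and that version numbers are purely dotted integers such as `1.10`; real plugin versions may need a more careful comparison. The function names are made up for this example.

```python
def parse_version(version):
    """Turn a dotted version string such as '1.10' into a comparable tuple.

    Assumes purely numeric components, which is a simplification.
    """
    return tuple(int(part) for part in version.split("."))


def find_downgrades(history):
    """Return (timestamp, from_version, to_version) for every downgrade.

    `history` is a list of (timestamp, version) pairs for one plugin on
    one installation (a hypothetical format).
    """
    ordered = sorted(history, key=lambda record: record[0])
    downgrades = []
    # Compare each observation with the one immediately before it.
    for (_, previous), (timestamp, current) in zip(ordered, ordered[1:]):
        if parse_version(current) < parse_version(previous):
            downgrades.append((timestamp, previous, current))
    return downgrades


history = [
    ("2016-03", "1.9"),
    ("2016-04", "1.10"),
    ("2016-05", "1.9"),   # version decreased: a downgrade from 1.10
    ("2016-06", "1.10"),
]
print(find_downgrades(history))
```

Counting such events per "from" version across all installations would show which releases users moved away from most often.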
Use Case 1.3: Correlation with the perceived quality of Jenkins release
To correlate what users are saying with what users are doing, we have the community ratings, which provide ratings and reviews of the releases and have the following parameters:
-
Used the release on production site w/o major issues.
-
Don’t recommend to others.
-
Tried but rolled it back to the previous version.
The first parameter can be calculated from the Jenkins usage data, and the third parameter is essentially spotting downgrades (use case 1.2). The second parameter, however, is an expression of opinion, which cannot be calculated from the usage data. This analysis is only meant to give a subjective idea of the correlation.
Use Case 2: Plugin Recommendation System
This section involves laying the groundwork for the plugin recommendation system. The idea is to find the sets of plugins that are most likely to be used together. Here we will follow both the content-based filtering and the collaborative filtering approaches.
Collaborative Filtering
This approach is based on analysing a large amount of information about the behaviour and activities of installations. We have implicit data about the plugins: for every install ID, we know the set of plugins installed. We can use this information to construct a plugin usage graph, where the nodes are the plugins and the weight of an edge between two plugins is the number of installations in which both are installed together.
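The plugin usage graph described above can be sketched as a table of weighted edges. In this minimal example, the input format (a mapping from install ID to a set of plugin names), the sample data, and the function names are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def build_usage_graph(installations):
    """Build weighted edges: (plugin_a, plugin_b) -> number of shared installs.

    `installations` maps an install ID to the set of plugins installed
    there (a hypothetical format). Edge keys are sorted pairs, so each
    undirected edge is stored exactly once.
    """
    edges = Counter()
    for plugins in installations.values():
        for a, b in combinations(sorted(plugins), 2):
            edges[(a, b)] += 1
    return edges


def recommend(edges, plugin, top_n=3):
    """Return the plugins most often installed together with `plugin`."""
    scores = Counter()
    for (a, b), weight in edges.items():
        if a == plugin:
            scores[b] += weight
        elif b == plugin:
            scores[a] += weight
    return [name for name, _ in scores.most_common(top_n)]


# Made-up sample data, not real usage statistics.
installations = {
    "i1": {"git", "maven", "junit"},
    "i2": {"git", "maven"},
    "i3": {"git", "junit"},
    "i4": {"git", "maven"},
}
edges = build_usage_graph(installations)
print(recommend(edges, "git"))
```

Raw co-installation counts favour plugins that are popular everywhere, so a real system would probably normalise the edge weights by each plugin's overall install count.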
Content-based Filtering
This method is based on the properties or content of an item, for example recommending items that are similar to those a user liked in the past or is examining at present. Here, we use the Jenkins plugin dependency graph to learn about the properties of a plugin. This graph tells us which plugins depend on a given plugin as well as which plugins it depends on. Here is an example of how this graph is used for content-based filtering: if a user is using “CloudBees Cloud Connector”, then we can recommend “CloudBees Registration Plugin” to them, as both plugins depend on “CloudBees Credentials Plugin”.
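The example above can be sketched as a search for plugins that share at least one dependency with the plugin a user already has. The dependency table below encodes only the relationship stated in the example, plus one clearly made-up entry for contrast; the function name and data layout are assumptions for illustration.

```python
def recommend_by_shared_dependencies(dependencies, plugin):
    """Recommend plugins that share at least one dependency with `plugin`.

    `dependencies` maps a plugin name to the set of plugins it depends on
    (a hypothetical format). Results are ordered by the number of shared
    dependencies, then by name.
    """
    own = dependencies.get(plugin, set())
    scored = []
    for other, deps in dependencies.items():
        if other == plugin:
            continue
        shared = own & deps
        if shared:
            scored.append((len(shared), other))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored]


dependencies = {
    "CloudBees Cloud Connector": {"CloudBees Credentials Plugin"},
    "CloudBees Registration Plugin": {"CloudBees Credentials Plugin"},
    # Made-up entry with no shared dependency, for contrast.
    "Hypothetical Plugin": {"Hypothetical Library Plugin"},
}
print(recommend_by_shared_dependencies(dependencies, "CloudBees Cloud Connector"))
```

Only direct dependencies are compared here; following transitive dependencies in the graph would be a natural extension.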
Additional Details
You can find the complete project proposal, along with the detailed design of the use cases and their implementation details, in the design document.
A complete version of use case 1 (Upgrade & Downgrade Analysis) should be available in late June, and a basic version of the plugin recommendation system will be available in late July.
I appreciate any kind of feedback and suggestions. You may add comments in the design doc. I will be posting updates about the statistics generation status on the jenkins-dev and jenkins-infra mailing lists.