Data extraction assignment (MSR course)


  • Assignment posted on 19 May 2015.
  • Scrum 2 June 2015.
  • Presentation 9 June 2015.


Students exercise the data extraction phase of MSR (mining software repositories) while combining a practical and a research attitude. Data extraction precedes any sort of data synthesis (e.g., metrics) and data analysis (e.g., distributions). Data extraction is the most software-engineering-oriented part of MSR; data synthesis and analysis are much closer to the general discipline of information retrieval. Ideally, students will continue their projects in subsequent synthesis- and analysis-related assignments.


To provide some common context for the students, let's focus on "developer profiling", defined here as extracting information from software repositories that allows us to compare or rank developers on the grounds of metrics, qualities, topics, or the like.


Students can work alone or in teams of two. If you work in a team, you must make an extra effort to convince everyone that both team members have made similar contributions to the assignment.


  • Formulate a preliminary research question related to developer profiling.
    • We will later look into data synthesis and analysis. The question may be revised then.
  • Identify a concrete data source to which you have access.
    • Please note that the data source needs to include "traces" of multiple developers.
    • Do not select a very general source such as "(all of) GitHub".
    • Use existing publications for inspiration; see below.
    • In the interest of limiting your effort, pick a relatively "small" data source or filter a larger one.
  • Identify data access technologies; e.g.:
    • If you access GitHub source code, familiarize yourself with the relevant API.
  • Implement raw data extraction.
    • For instance, you may dump your raw data into XML, JSON, RDF, or an SQL/NoSQL database.
  • Implement extra data extraction activities, where necessary.
    • This could be filtering, transformation, abstraction, e.g.:
      • If you plan to process text, familiarize yourself with stemming, e.g., using NLTK.
      • If you plan to analyze program identifiers, familiarize yourself with identifier splitting.
    • Use existing publications for inspiration; see below.
    • It is enough to dump the data produced by such extra activities; you need not dump the raw data as well.
  • Report on related data showcases at the MSR conference or elsewhere.
    • What was the data source?
    • What technologies were used?
  • Submit all source code and slides of your presentation to SVN.
  • Submit your data dumps, if feasible (< 1MB), to SVN.
    • If you want to use public or unikold Git, please submit the repo URL to SVN.
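
To make the extraction steps above more concrete, here is a minimal sketch in Python. It is only one possible shape for an extractor, not a prescribed solution. It assumes the raw data is the output of a command such as git log --pretty=format:'%H%x09%an%x09%aI%x09%s' on a local clone; the function names, the sample log, and the idea of counting terms from commit subjects are all illustrative assumptions. The identifier splitter uses a simple regular-expression heuristic for camelCase and snake_case.

```python
import json
import re
from collections import Counter

# Assumption: one commit per line, tab-separated, as produced by e.g.
#   git log --pretty=format:'%H%x09%an%x09%aI%x09%s'
# (commit hash, author name, ISO 8601 author date, subject).


def parse_git_log(text):
    """Turn tab-separated log lines into a list of commit records."""
    commits = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        sha, author, date, subject = line.split("\t", 3)
        commits.append(
            {"sha": sha, "author": author, "date": date, "subject": subject}
        )
    return commits


# Heuristic: acronym runs, capitalized or lowercase words, digit runs.
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")
# Anything that looks like a word or a program identifier.
_TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def split_identifier(identifier):
    """Split camelCase, PascalCase, and snake_case names into lowercase parts."""
    return [part.lower() for part in _WORD.findall(identifier)]


def terms_per_author(commits):
    """Extra extraction step: count split terms in commit subjects per author."""
    profile = {}
    for commit in commits:
        counter = profile.setdefault(commit["author"], Counter())
        for token in _TOKEN.findall(commit["subject"]):
            counter.update(split_identifier(token))
    return profile


def to_json(commits):
    """Serialize the extracted records as a JSON dump."""
    return json.dumps(commits, ensure_ascii=False, indent=2)


# Hypothetical sample data, standing in for real `git log` output.
SAMPLE_LOG = (
    "a1b2c3\tAlice Example\t2015-05-20T10:00:00+02:00\tFix parseHTTPResponse crash\n"
    "d4e5f6\tBob Example\t2015-05-21T11:30:00+02:00\tRename max_line_count\n"
    "0718aa\tAlice Example\t2015-05-22T09:15:00+02:00\tAdd HTTP retry\n"
)

if __name__ == "__main__":
    commits = parse_git_log(SAMPLE_LOG)
    print(to_json(commits))
    for author, terms in terms_per_author(commits).items():
        print(author, terms.most_common(3))
```

In a real extractor, the sample string would be replaced by the actual log output (or by responses from the API of your data source), and the JSON string would be written to a file for submission. Stemming the resulting terms, e.g., with NLTK, would be a further step that this sketch leaves out.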



If there are any questions, please contact Ralf Lämmel <laemmel@uni-koblenz.de>.

If you want to get your plan approved, please also contact Ralf Lämmel.


At the scrum, be prepared (no slides!) to briefly summarize your choices regarding the assignment parameters. Also, identify any open problems you have so that teaching staff or fellow students can help. Please commit a short README (as a summary of your standup comedy) to SVN.


  • Prepare a 10-13 minute talk with 15 slides or fewer.
  • Address the parameters from the assignment explicitly.
  • Demo your data extractor.
  • Try to give a good talk, as you were advised before.

Data sources

Also have a look at "Data show cases"

Data extraction activities


Thomas Bernau helped with collecting and tagging MSR papers with regard to categories of data sources and extra data extraction activities.