Ptt16 Assignment3

Mining software data on 101worker

Summary

Develop another 101worker module.

Logistics

  • 9-13 June (Preparation)
    • Make sure you are familiar with the methods of "Mining Software Data" lecture.
    • Prioritize approx. 3 of the options listed below so that you can speak up during the meeting on 13 June. In all likelihood, we will be able to assign one of your preferred options to your team.
    • If you want to propose another option (one shot per team), send an email to ed.znelbok-inu|gnaltfos#ed.znelbok-inu|gnaltfos with the subject "[course:ptt16-assignment3] proposal of an option". If the proposal is a good idea, we will reserve it for you; if it is considered redundant or suboptimal, it will be rejected. You will know only by 13 June (during the meeting).
  • 13 June (Option assignment during lab)
    • 1-2 (max. 3) representatives per team must attend the first lab slot (14:15-15:45).
    • Ask questions on your preferred options.
    • Participate in the audition for the available options.
    • You can get more consultation during the second lab slot.
  • 29 June (Deadline for assignment)

Options

This list may be revised by 13 June.

  • Document learning: The different namespaces on the wiki come with different expected sections. For instance, each page in the namespace "Contribution" is supposed to contain a section "Characteristics". The complete set of possible (actual) section names can be found by extracting all section names from the dump wiki-content.json. Apply machine learning to provide a prediction model with the namespace as the input data and section names as labels. Use some pages as a training set. By applying the prediction model to further pages, try to identify pages with missing or unexpected sections. A missing section means that prediction suggests that a section should be there for the page's namespace at hand; likewise for an unexpected section.
  • API clusters: Apply a suitable cluster analysis to source-code units considered as documents, with stems extracted from program identifiers as the terms to be clustered. Choose a setup so that you are likely to find sets of stems that correspond to APIs. For instance, if several contributions make use of JUnit or JAXB, then you should be able to find a cluster for JUnit or JAXB. Report and discuss your results. This one is tough and definitely qualifies for a bonus.
  • Documentation strength: Use IDF and/or TF-IDF, applied to the "Characteristics" section of contributions, to identify contributions that are less distinctive than others. The assumption here is that a distinctive "Characteristics" section should contain terms that are not common across all the contributions, which makes those terms rank high. A distinctive "Characteristics" section is an element of strong documentation of contributions. Report and discuss your results.
  • Cosine code similarity: We consider some form of "clone detection". Apply cosine similarity at the code level with one vector per contribution, where the vectors represent term frequency with terms extracted from program identifiers. Appropriate pre-processing and term selection need to be performed. Use the similarity measure to identify contributions of the same language and of different languages that are the most similar. In the case of two different languages such as Java and Haskell, the question would be which pair of Java and Haskell contributions is the most similar. Evaluate your findings by judging whether the found similarity makes sense. Report and discuss your results.
  • Spearman code similarity: Instead of using cosine similarity (see above), use Spearman correlation or some other correlation that you consider more appropriate.
  • Feature location: Define a collection of stems for a number of features of 101's system. The stems should relate to parts of program identifiers. For instance, you might hypothesize that "cut", if it appears as part of a program identifier (after preprocessing), implies the feature "cut". Generally, we assume that the occurrence of the stem in the source code implies that the source code implements the corresponding feature. Synthesize a map from contributions to located features and compare the located features with those declared on the wiki. Report your results.
  • Feature learning: This is tougher than the previous one; it more likely qualifies for a bonus. Apply machine learning to the pre-processed source-code identifiers, with 101's features, as declared on the wiki, as labels for a prediction model. Use some subset of contributions as a training set. Use some additional contributions to measure the precision and recall of the prediction. Report your results.
  • Comment sentiments: Apply sentiment analysis to the source-code comments. Synthesize data to facilitate some interesting comparison. For instance, how do Java-based and Haskell-based contributions differ in terms of sentiments? Report your results.
  • Documentation sentiments: Instead of sentiment analysis to comments (see above), apply it to the documentation on the wiki.
  • Wiki completeness: Apply TF-IDF to the wiki with contribution pages considered as documents. Check, for the top-5 ranking terms of each page, whether there is a corresponding page on the wiki. Report missing pages, if any.
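To give a flavor of the "Document learning" option, here is a minimal sketch that learns expected sections per namespace from a training set and flags missing or unexpected sections. The page data is a toy stand-in; the real input would be section names extracted from wiki-content.json, and a proper solution would use an actual machine-learning model rather than this simple frequency threshold.

```python
from collections import defaultdict

# Toy stand-in for (namespace, section names) pairs; the real pairs
# would be extracted from the wiki-content.json dump.
pages = [
    ("Contribution", {"Headline", "Characteristics", "Usage"}),
    ("Contribution", {"Headline", "Characteristics"}),
    ("Contribution", {"Headline", "Characteristics", "Issues"}),
    ("Language", {"Headline", "Details"}),
    ("Language", {"Headline"}),
]

def expected_sections(training, threshold=0.5):
    """Sections occurring in at least `threshold` of a namespace's pages."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for ns, sections in training:
        totals[ns] += 1
        for s in sections:
            counts[ns][s] += 1
    return {ns: {s for s, c in counts[ns].items() if c / totals[ns] >= threshold}
            for ns in totals}

def check_page(model, ns, sections):
    """Return (missing, unexpected) sections for one page."""
    expected = model.get(ns, set())
    return expected - sections, sections - expected

model = expected_sections(pages)
missing, unexpected = check_page(model, "Contribution", {"Headline", "Issues"})
# missing flags "Characteristics"; unexpected flags "Issues"
```

The threshold of 0.5 is an arbitrary illustration; a learned classifier would replace this heuristic.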
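For the "API clusters" option, a deliberately naive sketch follows: it groups stems that co-occur in some source-code unit (single-link grouping via union-find). The unit names and stems are invented; a serious solution would apply a real cluster analysis (e.g. k-means over TF-IDF vectors) rather than this co-occurrence heuristic.

```python
from collections import defaultdict

# Toy stem sets per source-code unit; real stems would come from
# pre-processed program identifiers of the contributions.
units = {
    "TestTotal.java": {"assert", "test", "runner", "total"},
    "TestCut.java": {"assert", "test", "cut"},
    "Schema.java": {"marshal", "unmarshal", "element", "company"},
    "Reader.java": {"unmarshal", "element", "reader"},
}

def cluster_stems(units):
    """Group stems that co-occur in some unit (single-link clustering)."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)
    for stems in units.values():
        stems = sorted(stems)
        for s in stems[1:]:
            union(stems[0], s)
    clusters = defaultdict(set)
    for s in parent:
        clusters[find(s)].add(s)
    return sorted(map(frozenset, clusters.values()), key=len, reverse=True)

clusters = cluster_stems(units)
# One cluster resembles a JUnit-like API, the other a JAXB-like API.
```

Note that single-link grouping merges aggressively; on real data one would need a similarity threshold or a proper clustering algorithm to keep unrelated APIs apart.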
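The "Documentation strength" option could start from a plain TF-IDF computation such as the sketch below, which ranks toy "Characteristics" texts by mean TF-IDF and picks out the least distinctive one. The contribution names and texts are made up; the real texts would come from the wiki dump, with proper tokenization and stop-word handling.

```python
import math
from collections import Counter

# Toy "Characteristics" sections; real text would come from the wiki.
docs = {
    "haskellParser": "parsing with parser combinators in haskell",
    "javaComposition": "object composition in plain java",
    "javaInheritance": "plain java with inheritance",
}

def tokenize(text):
    return text.lower().split()

def tfidf_scores(docs):
    """Mean TF-IDF per document; low scores suggest weak documentation."""
    n = len(docs)
    df = Counter()  # document frequency per term
    for text in docs.values():
        df.update(set(tokenize(text)))
    scores = {}
    for name, text in docs.items():
        tf = Counter(tokenize(text))
        total = sum(tf.values())
        s = sum((c / total) * math.log(n / df[t]) for t, c in tf.items())
        scores[name] = s / len(tf)
    return scores

scores = tfidf_scores(docs)
least_distinctive = min(scores, key=scores.get)
```

Terms shared by all documents ("in", "with") get an IDF of zero, so documents dominated by common terms score low, which is exactly the signal the option asks for.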
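For the "Cosine code similarity" option, the core computation is small, as sketched below on invented identifier stems; the real vectors would be built from pre-processed program identifiers of each contribution.

```python
import math
from collections import Counter

# Toy identifier stems per contribution (names are hypothetical).
contributions = {
    "javaComposition": ["company", "department", "employee", "salary", "cut"],
    "haskellComposition": ["company", "department", "employee", "salary", "total"],
    "javaParser": ["token", "lexer", "parse", "grammar"],
}

def cosine(a, b):
    """Cosine similarity of two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vectors = {name: Counter(stems) for name, stems in contributions.items()}
pairs = {}
names = sorted(vectors)
for i, x in enumerate(names):
    for y in names[i + 1:]:
        pairs[(x, y)] = cosine(vectors[x], vectors[y])
most_similar = max(pairs, key=pairs.get)
# The two "Composition" contributions share 4 of 5 stems (cosine 0.8).
```

Restricting the pair enumeration to same-language or cross-language pairs gives the two variants the option asks about.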
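The "Spearman code similarity" option replaces cosine with a rank correlation. A self-contained sketch of Spearman's rho (Pearson correlation of average ranks, with ties handled) is shown below; the inputs would be the aligned term frequencies of two contributions over a shared vocabulary.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0
```

Identical rankings give rho = 1.0 and reversed rankings give -1.0; on real term-frequency vectors, rho measures whether two contributions emphasize the same terms in the same order of importance.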
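The "Feature location" option boils down to a stem-to-feature map plus a set comparison against the wiki declarations, as this sketch shows. The map entries and feature names here are hypothetical; the team would need to define them for 101's actual features.

```python
# Hypothetical stem-to-feature map; the actual stems and 101 feature
# names would have to be defined by the team.
STEM_TO_FEATURE = {
    "cut": "Cut",
    "total": "Total",
    "depth": "Depth",
}

def locate_features(identifier_stems):
    """Features implied by stems found in a contribution's identifiers."""
    return {STEM_TO_FEATURE[s] for s in identifier_stems if s in STEM_TO_FEATURE}

located = locate_features(["company", "cut", "salary", "total"])
declared = {"Cut", "Total", "Depth"}       # as declared on the wiki
missing_in_code = declared - located       # declared but not located
undeclared = located - declared            # located but not declared
```

Iterating this over all contributions synthesizes the required map from contributions to located features, with the two set differences driving the comparison against the wiki.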
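The "Feature learning" option asks for precision and recall of the predicted feature sets; the standard set-based definitions are sketched below with an invented prediction and ground truth.

```python
def precision_recall(predicted, actual):
    """Precision and recall of a predicted set against the ground truth."""
    tp = len(predicted & actual)                       # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical example: the model predicts one feature too many.
p, r = precision_recall({"Cut", "Total", "Depth"}, {"Cut", "Total"})
```

Averaging these per-contribution values over the held-out contributions yields the overall measurement the option requires.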
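For the "Comment sentiments" option, a minimal lexicon-based sketch follows. The tiny lexicon and the sample comments are purely illustrative; a real module would use an established sentiment lexicon or library.

```python
import re

# A toy sentiment lexicon for illustration only.
LEXICON = {"good": 1, "nice": 1, "clean": 1, "bad": -1, "ugly": -1, "hack": -1}

def comment_sentiment(comments):
    """Average lexicon score over known tokens; 0.0 if none are known."""
    tokens = [t for c in comments for t in re.findall(r"[a-z]+", c.lower())]
    scored = [LEXICON[t] for t in tokens if t in LEXICON]
    return sum(scored) / len(scored) if scored else 0.0

# Hypothetical comments from two contributions:
java_score = comment_sentiment(["ugly hack, fix later", "bad workaround"])
haskell_score = comment_sentiment(["nice clean fold", "good use of monads"])
```

Aggregating such scores per language (or per contribution) gives the kind of comparison the option suggests; the same function applies unchanged to wiki documentation for the "Documentation sentiments" option.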

Submission

See 2nd assignment; the very same rules apply.

You may want to generate plots as part of your solution to better illustrate or visualize your results. Just commit these plots with your module, and possibly mention and explain them in your README.md file.

Bonus

This assignment is eligible for collecting a bonus for the exam, as explained on the course page.

A solution for this assignment qualifies for the bonus if:

  1. all team members leave traces of their work on gitlab/github;
  2. the work is explained well in the README.md file which comes with the work;
  3. the limitations / underlying assumptions of the solution are explained well;
  4. the work applies concepts of "mining software data" correctly and understandably.

No application for bonus (by email) is needed this time.

Further reading

  • See 2nd assignment.
  • See slide deck on "Mining Software Data".
  • See latest version of 101worker with these dumps: