Project Practical 101worker

Summary

101worker is the computational infrastructure of the http://101companies.org/ project ('101'). It is meant to compute data and visualizations from the contributions aggregated in the project. Contributions are open-source projects written in different languages, using diverse technologies and exercising different design options, while targeting a common feature model for an information system. '101' has been used heavily in the research of the Software Languages Team over the last few years. 101worker could eventually fill the role of a 'data science' infrastructure for '101'.

Alas, 101worker is in a somewhat sorry state. A previous version was discontinued because it was too hard to maintain. A new version has been implemented which is architecturally clean and simple, but it lacks an interesting suite of actual computations that show the power of the "data science" enabled by the 101 project. Thus, the overarching goal of this project is to mature a good suite of simple but representative modules that exercise data mining, information retrieval, and reverse engineering on the 101 data set in meaningful ways. This requires developing a convincing story as to what can and should be done with the data set, and then implementing actual modules that are "pearls" (wonderful, simple programs conveying the problem's solution in an accessible way, thereby encouraging more complex experiments by others). There is a lot of inspiration available in the papers of the project.

As this project may easily touch upon scientifically relevant subjects, students may also enroll in the project in the "research practical" mode. All enrolling students, though, must be prepared to "get their hands dirty" in the project.

Logistics

  • Enrollment until 21 Nov, 7pm (Late enrollment may be possible.)
  • Contact Ralf Lämmel (softlang@uni-koblenz.de) for enrollment.
  • Kickoff meeting 21 Nov in softlang meeting, 6:15pm, room B 013.
  • Milestone presentations approx. once per month in softlang meeting.
  • Andre Emmerichs (aemmerichs@uni-koblenz.de) is the manager of the project practical.
  • Students typically work 4-6 months on the project.

Detailed objectives

  • Development of showcase 101worker modules
  • Exploration of visualization: at this stage, 101worker demonstrates the derivation of resources for individual source files (e.g., .json files with LOC) as well as dumps (e.g., .json files with information aggregated over all contributions, such as metrics and languages). We would also like to add a systematic approach to generating data visualizations (such as tables, charts, plots, and tag clouds), for which some sort of lightweight HTML5 approach would be needed.
  • Improvement of documentation
    • Detailed explanation of wiki dump
    • More to be defined
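
To make the distinction between per-file resources and dumps more concrete, here is a minimal sketch in Python. It does not use or reflect the actual 101worker module API. The extension-to-language table, the '.loc.json' naming scheme, the function names, and the key names in the dump are all assumptions made purely for illustration.

```python
import json
from collections import Counter
from pathlib import Path

# Hypothetical mapping from file extension to language name (an assumption,
# not the project's actual language detection).
LANGUAGES = {".py": "Python", ".java": "Java", ".hs": "Haskell"}


def count_loc(path: Path) -> int:
    """Count the non-blank lines of a text file."""
    text = path.read_text(encoding="utf-8", errors="replace")
    return sum(1 for line in text.splitlines() if line.strip())


def derive_resources(root: Path) -> list[dict]:
    """Write a '<file>.loc.json' resource next to each recognized source file."""
    records = []
    # sorted() materializes the listing before any resource file is written.
    for source in sorted(root.rglob("*")):
        language = LANGUAGES.get(source.suffix)
        if not source.is_file() or language is None:
            continue
        record = {
            "file": source.relative_to(root).as_posix(),
            "language": language,
            "loc": count_loc(source),
        }
        resource = source.with_name(source.name + ".loc.json")
        resource.write_text(json.dumps(record, indent=2), encoding="utf-8")
        records.append(record)
    return records


def build_dump(records: list[dict]) -> dict:
    """Aggregate the per-file records into LOC totals per language."""
    loc_per_language = Counter()
    for record in records:
        loc_per_language[record["language"]] += record["loc"]
    return {"files": len(records), "locPerLanguage": dict(loc_per_language)}
```

Calling derive_resources on a checkout of the contributions and passing the result to build_dump would yield one small .json resource per source file plus a single aggregated dictionary that can be serialized as a dump. A visualization step (for example, a table of LOC per language) could then consume the dump alone, without touching the source files again.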

Challenges

Some of these are pros; others are possibly cons. (Well, challenges are always good.)

  • Many languages: at least some of the modules should meaningfully deal with a variety of languages and give insight into the differences in how these languages are used within the 101 data set.
  • Need for a simple setup: 101worker must remain simple to set up and run on, say, Ubuntu and macOS machines. Thus, helper technologies have to be selected very carefully. Deployment must be automated.
  • Reuse data science technology: information retrieval, data mining, etc. should take advantage of existing technology as opposed to implementing algorithms from scratch.
  • Suboptimal status of available documentation: Students on the project must be able to deal with the suboptimal state of documentation that is typical of real-world projects. Of course, the expectation is that the project practical improves the situation.
  • Lack of a visualization approach: While a visualization approach is clearly wanted and needed, neither a requirements analysis nor a design has been completed.
  • Complexity of story: Defining a good story of showcase modules requires very significant domain knowledge, specifically some familiarity with reverse engineering, data mining, and information retrieval.

Resources