Msr19 Assignment3

Technology Usage History Mining

ANTLR is one of the popular technologies for implementing a parser for a custom language. Even complex technologies such as Xtext depend on it. Many open-source projects maintained on GitHub use it. Many technical details are explained in the documentation that is part of the official GitHub repository.

Lists of Repositories (Preliminary)

In the following, two lists of repositories are provided that contain links to repositories with traces of ANTLR usage.

  1. Pre-final repository list with metadata (CSV)
  2. Final repository list, filtered by stargazer count and grammar file count (CSV)

The two lists still need to be merged and filtered. We will discuss this as a showcase for Pandas-based data analysis.
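As a starting point for that discussion, the merge-and-filter step could look roughly like the sketch below. The column names (`repo`, `stars`, `grammar_files`), the thresholds, and the sample values are all assumptions made for illustration; the real CSV files may use different headers, and in practice the two frames would be loaded with `pd.read_csv` rather than built inline.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV lists. Column names and values
# are invented for illustration; the real files may look different.
pre_final = pd.DataFrame({
    "repo": ["antlr/grammars-v4", "libgdx/libgdx", "example/tiny"],
    "stars": [5000, 15000, 2],
})
final = pd.DataFrame({
    "repo": ["antlr/grammars-v4", "libgdx/libgdx"],
    "grammar_files": [200, 1],
})


def merge_and_filter(meta, counts, min_stars=10, min_grammars=1):
    """Join the two lists on the repository identifier and keep only
    repositories that meet both (assumed) thresholds."""
    # An inner join keeps only repositories that appear in both lists.
    merged = meta.merge(counts, on="repo", how="inner")
    keep = (merged["stars"] >= min_stars) & (merged["grammar_files"] >= min_grammars)
    return merged[keep].reset_index(drop=True)


result = merge_and_filter(pre_final, final)
print(result)
```

An inner join is only one possible choice here; `how="outer"` together with `indicator=True` would instead show which repositories are missing from one of the lists.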


In this analysis, we are interested in insights provided by inspecting repositories' history.

  • How often do people actually copy grammars written by Terence Parr (e.g., here)?
  • What is the typical complexity of grammars (.g4 files)?
  • How does the use of semantic actions, the listener pattern, and the visitor pattern evolve? Do developers switch between them?
  • Which dependencies on technologies other than ANTLR are typically declared in Maven build files? How do they change over time?

The resulting table now contains the following default columns:

  • A repository identifier such as 'libgdx/libgdx' or 'antlr/grammars-v4'.
  • The SHA of the commit whose changed lines you analyze.
  • The timestamp of the commit.
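
To make the default columns concrete, the following sketch turns commit log text into rows with exactly these three fields. It assumes the log was produced with `git log --pretty=format:"%H|%cI"` (full commit hash and committer date in strict ISO 8601 format). The function name `commits_to_rows`, the keys `repo`, `sha`, and `timestamp`, and the sample SHAs are hypothetical placeholders, not part of the assignment.

```python
from datetime import datetime


def commits_to_rows(repo, log_text):
    """Turn 'sha|iso-timestamp' lines into rows with the default columns."""
    rows = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        sha, iso_time = line.split("|", 1)
        # Older Python versions do not accept a trailing 'Z' in
        # fromisoformat, so normalize it to an explicit UTC offset.
        if iso_time.endswith("Z"):
            iso_time = iso_time[:-1] + "+00:00"
        rows.append({
            "repo": repo,
            "sha": sha,
            "timestamp": datetime.fromisoformat(iso_time),
        })
    return rows


# Made-up sample input; real SHAs and dates come from the repository.
sample_log = """\
1111111111111111111111111111111111111111|2019-01-15T10:30:00+01:00
2222222222222222222222222222222222222222|2019-02-01T08:00:00Z
"""

rows = commits_to_rows("antlr/grammars-v4", sample_log)
for row in rows:
    print(row["repo"], row["sha"][:7], row["timestamp"].isoformat())
```

The resulting list of dictionaries can be passed directly to `pandas.DataFrame(rows)` if the table should be analyzed with Pandas; further columns, such as per-commit measurements, would be added as extra keys.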