Results

We extensively tested the BVAE and its baselines on datasets of C#, Java, and Python code.
We were quite satisfied with the results, which even outperformed the source paper on the summarization task. These results are detailed below.

First, let's briefly discuss the metrics used for evaluation.

Mean Reciprocal Rank (MRR) takes the reciprocal of the rank of the "correct" result in the algorithm's final ranking of all candidates, averaged over all test queries. MRR is used to evaluate retrieval.
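To make the computation concrete, here is a minimal sketch of MRR (the helper name and toy ranks are ours, purely for illustration):

```python
def mean_reciprocal_rank(ranks):
    """ranks[i] is the 1-based rank of the correct result for query i."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Three queries whose correct results were ranked 1st, 3rd, and 2nd:
print(mean_reciprocal_rank([1, 3, 2]))  # (1 + 1/3 + 1/2) / 3 ≈ 0.611
```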

The METEOR score computes word-level alignments (including stem and synonym matches) between the "correct" summary and the generated one, and scores their overlap. METEOR is used to evaluate summarization.
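For a feel of the metric, here is one common way to compute METEOR, via NLTK's implementation (a sketch; not necessarily the exact scorer we used, and the example sentences are made up):

```python
import nltk
from nltk.translate.meteor_score import meteor_score

# METEOR relies on WordNet for stem and synonym matching.
nltk.download("wordnet", quiet=True)

# NLTK's implementation expects pre-tokenized input.
reference = "returns the index of the first matching element".split()
hypothesis = "return the index of the first match".split()

print(meteor_score([reference], hypothesis))
```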

Given our evaluation metrics, let's see how the baselines performed.

As with the BVAE itself, our summarization baseline (IR) outperforms theirs on the shared dataset, likely due to our unusual IR implementation. Our RET-IR scores are on par with theirs. Note that a * indicates a result obtained through Shuffle Query Testing; for reference, the C# dataset attained an MRR of 0.3739 under shuffle query testing.
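We do not detail our IR implementation here, but to give a flavor of the general approach, below is a minimal sketch of a retrieval-style summarization baseline using TF-IDF similarity (the toy corpus, function names, and the choice of TF-IDF are illustrative assumptions, not our actual setup):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus: each code snippet is paired with its human-written summary.
snippets = ["int max(int a, int b) { return a > b ? a : b; }",
            "void sort(int[] xs) { Arrays.sort(xs); }"]
summaries = ["return the larger of two integers",
             "sort an array in place"]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(snippets)

def ir_summarize(code_query):
    # Return the summary paired with the most similar snippet in the corpus.
    sims = cosine_similarity(vectorizer.transform([code_query]), matrix)
    return summaries[sims.argmax()]

print(ir_summarize("int maximum(int x, int y) { return x > y ? x : y; }"))
```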

The following results show the BVAE's performance on each task for each of the three datasets.

By far the most exciting outcome was that our summarization results outperformed those of Chen & Zhou. Our retrieval results were still strong, but underperformed by comparison.

We were surprised to see both the model and the baseline perform poorly on Python summarization, given that Python is intuitively the closest of the three languages to natural language. Stranger still, the Python results were by far the best on the retrieval task.

Though we cannot be certain, we believe our outperformance of Chen & Zhou in summarization may be due to our use of subword tokenization: by breaking rare identifiers into more common subwords, it may outperform the C# parser Chen & Zhou used, while also generalizing more easily across languages.
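To illustrate the idea, the sketch below trains a small BPE tokenizer (using the Hugging Face tokenizers library; a hypothetical setup, not our actual pipeline) and shows a rare identifier decomposing into more frequent subwords:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Toy corpus of code-like text; a real run would train on a full dataset.
corpus = [
    "def get_user_name ( user ) : return user . name",
    "def get_file_name ( path ) : return path . stem",
    "def set_user_name ( user , name ) : user . name = name",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

# An identifier unseen in training splits into familiar subwords
# instead of becoming a single out-of-vocabulary token.
print(tokenizer.encode("get_group_name").tokens)
```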

We were quite satisfied with these results: the scores that did not outperform Chen & Zhou were still of the same order of magnitude, and the model(s) performed well enough that the GUI applet is quite functional.