Skip to content

Review4Repair/Review4Repair

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Review4Repair

This repository contains the dataset and source codes for "Review4Repair: Code Review Aided Automatic Program Repairing".

Pretrained Models

Index Model Name Link
1 best_model_cc_hard link
2 best_model_c_hard link
3 best_model_cc_soft link
4 best_model_c_soft link
5 best_model_c_10k_vocab link
6 best_model_c_20k_vocab link
7 best_pretrained_model_c_hard link

To run inference and evaluation, unzip the particular checkpoint. Sample tokenized test sets are available in "Inference" directory.
Evaluation: python Inference\evaluation.py <test_set_src> <test_set_tgt> <model_name>

Predict: python Inference\prediction.py <test_set_src> <model_name>

For example, to evaluate model_cc on test set,

python Inference\evaluate.py Inference\CC_sample_src-test.txt Inference\CC_sample_tgt-test.txt best_model_cc_hard.pt

To predict using model_cc on test set, python Inference\predict.py Inference\CC_sample_src-test.txt best_model_cc_hard.pt

Train

For details on model training, refer to OpenNMT documentation. To run training:

!onmt_preprocess -train_src training_data/<SRC_DATA>.txt \
    -train_tgt training_data/c/<TGT_DATA>.txt \
    -save_data <SAVE_DIR> \
    -src_vocab vocab/<SRC_VOCAB_FILE>.txt \
    -tgt_vocab vocab/<TGT_VOCAB_FILE>.txt \
    -src_vocab_size <SRC_VOCAB_SIZE> \
    -tgt_vocab_size <TGT_VOCAB_SIZE> \
    -src_seq_length <SRC_LEN> \
    -src_seq_length_trunc <SRC_LEN> \
    -tgt_seq_length 100 \
    -tgt_seq_length_trunc 100 \
    -dynamic_dict \
    -overwrite

<SRC_LEN> = 600 for model_cc, 400 for model_c

Dataset Details

Our database includes a total of 35 tables, providing useful information about changes codes, reviewer's details, review time, review and corresponding response etc. We highlight some major table from our database below. The full ERD diagram is attached here.

To reproduce a particular database <DB_NAME>, follow this following query.

cd PROJECT_NAME
mysql -u root -p <DB_NAME> < <PROJECT_NAME>.sql

Dataset Structure

The dataset follows the following architecture.

├───acumos
  ├───acumos.sql
  ├───gerrit_db_acumos.json
  └───Downloaded_Codes_acumos.zip
      ├───acumos_1
        ├───1.java
        ├───2.java
        └───....
      ├───....  
      └───acumos_MapJSON.json
├───android
├───....
├───unicorn
└───unified_with_date.json

We provide the database of each project seperately for modularity. The database and downloaded codes are provided in each project folder. The downloaded codes are provided in a zipped file. Under each folder in the zipped file, there are multiple versions of the same file at different commit times. Each zipped folder also contains a <PROJECT_NAME_>MapJSON file that maps the folder name with the corresponding one in the database and json file. Each project folder contains a Json file named gerrit_db_<PROJECT_NAME> which contains the code review with the Java code file name before and after the change. A sample example is given below:

{
		"comment_id" : "3a045101_dd7d55b3",
		"message" : "GROUP_INTERFACE string given is wrong.Please modify.",
		"file_name" : "android-android_api-base-src-main-java-org-iotivity-base-OcPlatform.java",
		"line_number" : 47,
		"base_patch_number" : 3,
		"changed_patch_number" : 5,
		"line_change" : 1
}

To create a Json from database for other programming language, for example c, execute the following query and export as a Json:

use <DB_NAME> ;
select c.comment_id, cu.message, c.base_code as file_name, c.line_number, 
c.base_patch_number, c.changed_patch_number, cu.line_change, cu.written_on 
from code c
inner join comment_usefulness cu
on c.comment_id = cu.comment_id
where c.changed_patch_number != -1
and c.base_code like "%.c"
and cu.in_reply_to is null;

The root directory contains a file named 'unified_with_date.json' which contains the combined query result of the fifteen projects. The entire dataset is also available in Mega.

Change Calculation Tool

Input Format:

Find Change Closer to a Specific Line:

Input: linediff file1 file2 line_number change_window_size

Output: change_type <|sep|> input_code <|sep|> output_code

Find All Diffs between Two Files:

Input: alldiff file1 file2 max_target_len max_input_len

Output:

If no one line change found: <|nochange|>

If change found:

input_code <|sep|> output_code <|datasep|> -- repeat

  1. Separate with <|datasep|> to get the changes
  2. Split one datapoint with <|sep|>

Find Line Numbers of All One Line Changes between Two Files:

Input: oneliner file1 file2

Output:

If no one line change found: <|nochange|>

Else:

original_pos1 revised_pos1

original_pos2 revised_pos2

original_pos3 revised_pos3

. . .

(split with newline, then split with space)

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published