Mostcomm

Mostcomm is a command-line tool that scans a directory of text files and reports repeated runs of lines across those files. It is useful for finding copy‑pasted blocks, spotting duplicated configuration, or simply getting an overview of common data in a large corpus.

The detection algorithm hashes each line and uses those hashes to build ranges that appear in more than one file. Results can be sorted and filtered to help focus on the most significant duplicates.
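
The detection logic lives in the mostcomm Go package. As a rough illustration of the idea described above (this is only a sketch, not the package's actual API or algorithm), the core approach can be expressed in a few dozen lines of Go: hash every line with a fast non-cryptographic hash, then extend a run forward wherever consecutive hashes match between two files.

package main

import (
    "bufio"
    "fmt"
    "hash/fnv"
    "os"
)

// hashLines reads a file and returns one 64-bit FNV-1a hash per line.
func hashLines(path string) ([]uint64, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    var hashes []uint64
    sc := bufio.NewScanner(f)
    for sc.Scan() {
        h := fnv.New64a()
        h.Write(sc.Bytes())
        hashes = append(hashes, h.Sum64())
    }
    return hashes, sc.Err()
}

// sharedRuns returns runs of at least minLines consecutive lines whose
// hashes match between two files, as (startA, startB, length) triples.
func sharedRuns(a, b []uint64, minLines int) [][3]int {
    var runs [][3]int
    for i := range a {
        for j := range b {
            if a[i] != b[j] {
                continue
            }
            // Only start a run at its first matching pair of lines.
            if i > 0 && j > 0 && a[i-1] == b[j-1] {
                continue
            }
            n := 0
            for i+n < len(a) && j+n < len(b) && a[i+n] == b[j+n] {
                n++
            }
            if n >= minLines {
                runs = append(runs, [3]int{i, j, n})
            }
        }
    }
    return runs
}

func main() {
    ha, err := hashLines("testdata/a.txt")
    if err != nil {
        panic(err)
    }
    hb, err := hashLines("testdata/b.txt")
    if err != nil {
        panic(err)
    }
    for _, r := range sharedRuns(ha, hb, 2) {
        fmt.Printf("a.txt:%d-%d matches b.txt:%d-%d (%d lines)\n",
            r[0], r[0]+r[2]-1, r[1], r[1]+r[2]-1, r[2])
    }
}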

Why Mostcomm?

  • Quickly discover duplicated content across any collection of plain‑text files.
  • Spot configuration drift or repeated boilerplate in large repositories.
  • Identify common patterns in datasets or logs.

Features

  • Works on any text files; simply point it at a directory.
  • Supports glob patterns (semicolon separated) to select which files are scanned.
  • Filter results by a minimum number of lines or a minimum percentage of the file.
  • Limit the maximum number of files that may share the same block.
  • Sort results by run length or by average file coverage.
  • Provides a Go library (mostcomm package) so the detection logic can be reused programmatically.

Installation

If you have Go installed you can build the binary directly from the source repository:

# clone the repository
git clone https://github.com/arran4/mostcomm.git
cd mostcomm

# install the command
go install ./cmd/mostcomm

This will place the mostcomm binary in $GOBIN (or $GOPATH/bin if GOBIN is not set).
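
If the module path matches the repository URL (an assumption here, though conventional for Go projects), it may also be possible to install without cloning:

go install github.com/arran4/mostcomm/cmd/mostcomm@latest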

The manual page distributed with the releases is generated from mostcomm.md using go-md2man. You can rebuild it locally with:

go install github.com/cpuguy83/go-md2man/v2@latest
go-md2man -in mostcomm.md -out mostcomm.1

Usage

mostcomm -dir DIRECTORY [-mask PATTERN] [options]

Common flags:

  • -dir  Directory to scan (default .).
  • -mask  Glob patterns to match files (default *.txt). Use ; to separate multiple patterns.
  • -sort  Sorting algorithm for the output (none, lines, average-coverage).
  • -sort-direction  ascending or descending (default ascending).
  • -percent-threshold  Only report duplicates covering at least this percent of a file.
  • -lines-threshold  Only report duplicates of at least this many lines.
  • -match-max-threshold  Exclude duplicates that occur in more than this many files.
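
For example, to scan Go and Markdown files, keep only runs of at least three lines, and list the longest runs first (the flag names are taken from the list above; exact value formats may differ slightly in practice):

mostcomm -dir . -mask '*.go;*.md' -lines-threshold 3 -sort lines -sort-direction descending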

Example

Mostcomm ships with a small example data set under testdata. The three text files contain simple numbered lines so you can easily see what the tool reports. They are structured as follows:

  1. a.txt lists the numbers 1 through 10, one per line.
  2. b.txt contains the same numbers but with a blank line after 5.
  3. c.txt repeats the range 6 through 10 twice.

# a.txt
1
2
3
4
5
6
7
8
9
10

# b.txt
1
2
3
4
5

6
7
8
9
10

# c.txt
6
7
8
9
10
6
7
8
9
10

Run the command against that directory:

mostcomm -dir ./testdata -mask '*.txt'

This will produce output similar to the following, where each duplicate run is listed as file:start-end with the run length in parentheses:

2023/02/27 22:57:22 Scanning a.txt
2023/02/27 22:57:22 Scanning b.txt
2023/02/27 22:57:22 Scanning c.txt
3 files scanned
11 unique lines scanned
32 total lines scanned
2 Duplicate runs
- a.txt:0-4 (5), b.txt:0-4 (5)
- a.txt:5-9 (5), b.txt:7-11 (5), c.txt:0-4 (5), c.txt:5-9 (5)

Ideas and Contributions

Mostcomm started as a small utility and there is plenty of room for improvement. Some ideas for future enhancements include:

  • Smarter range detection that handles reordered lines.
  • Output in machine-readable formats such as JSON.
  • Integration with code editors or CI pipelines.

Pull requests and issues are welcome. Feel free to open a discussion if you have suggestions or run into any problems.

License

This project is released under the MIT License.
