Remove duplicate lines using Groovy

A simple use of sort and uniq can remove duplicate lines. But what if you need to keep the existing line order, or the duplication is in only part of the line?

Traditional command shell scripting
Using the command shell you can run:

cat test.txt | sort | uniq 

Command shell scripting keeping line order
If you need to maintain the text order, you can use more powerful scripting tools like awk.
One discussion shows this script, which prints a line only the first time it is seen (x[$0] is still zero at that point, so !x[$0]++ is true):

awk '!x[$0]++'  

Another, more conventional and perhaps less elegant approach using awk is shown here:

awk '{
if ($0 in stored_lines) next   # already seen: skip this line
stored_lines[$0] = 1           # remember the line
print
}' filein > fileout

Line dedupe where subset of line is duplicated
The above approaches are fine when the duplication is per line. When the dupe is based on a subset of the line or multiple lines per record are involved, it gets more complicated. Then you reach for Perl, Python, Ruby and other scriptable languages.
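
When the duplicated part is a single delimited field, the awk one-liner still works if the array is indexed by that field instead of the whole line. A minimal sketch, assuming comma-separated input whose first field is the key; the function name dedupe_by_first_field is made up for illustration, and quoted fields containing commas are not handled:

```shell
# Keep the first line seen for each key, preserving input order.
# Assumption: fields are comma-separated and the key is field 1.
dedupe_by_first_field() {
  awk -F',' '!seen[$1]++' "$@"
}
```

Called as dedupe_by_first_field test.csv, or at the end of a pipe, it keeps only the first line for each distinct first field.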

Duplicate Keys in a Properties file
Recently, I had to identify duplicate keys in a Java properties file. Figuring out how to do it with a command line approach would have taken too long, since I don’t usually use a shell and I’m on Windows (yes, I use cygwin).

Note that creating a Properties object from the file and then writing out the unique entries would not have worked, since this would lose the comments in the file.

Groovy solution
This is where Groovy really shines, allowing any Java developer to create little helpers without the ponderous “noise”.
The dupe key detector:

def hist = [:]      // keys seen so far
def result = ""
// read the file from standard input (it is piped into the script)
new BufferedReader(new InputStreamReader(System.in))
     .eachLine(){ s ->

    result = s
    if(s ==~ /^\s*[#!].*/){ // comment?
        println result
        return  // continue
    }

    if(s ==~ /.*?[=:].*/){   // property?
        // get everything before the '=' or ':'
        def prop = ((s =~ /(.*?)[=:].*/)[0][1]).trim()
        if(hist.containsKey(prop)){   // key seen before: mark the line
            result = ("#*** DUPE *** $s")
        }else{
            hist[prop] = ''
        }
    }
    println result
}

Of course, this is not robust: many complications that would require fuller parsing per the Properties spec, such as line continuations and escaped separators, are not addressed. One problem is that commented-out lines in the file are not treated as such. But this quick solution should work for simple situations.

An alternative approach by extending the Properties class is shown in post: “Java Properties dupe key detect using subclass”.

Property file with duplicate key entries

!  one=one


type test.txt | groovy DedupeLines.groovy

Resulting output. Note that the duplicates are not removed, just marked. This allows testing before they are permanently removed.

#*** DUPE *** two=two
#*** DUPE *** one=one
!  one=one
#*** DUPE *** one:one
#*** DUPE *** one=une=35
#*** DUPE *** four=four
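
Once the marked output has been checked, the marked lines can be dropped in a second pass. A minimal sketch for cygwin or another POSIX shell, assuming no legitimate comment in the file starts with the exact marker; the function name strip_marked_dupes is made up for illustration:

```shell
# Drop every line carrying the "#*** DUPE ***" marker written by the script.
# Assumption: no legitimate comment in the file starts with that exact marker.
strip_marked_dupes() {
  grep -v '^#\*\*\* DUPE \*\*\*' "$@"
}
```

It reads the files named as arguments, or standard input when none are given, so it can sit at the end of the same pipe as the Groovy script.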

Large files
What about files so large that a map or list cannot hold all the lines?
- One possibility is to compute a hash of each line (or subset) and store that instead.
- Another possibility is to use something similar to external sorting, i.e., use external files for interim manipulation.
- Use an embedded database.
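
The external-sorting idea can be sketched with standard tools, since sort implementations such as GNU sort spill to temporary files when the input does not fit in memory. This is only an illustration under assumptions: a POSIX shell with awk, sort and cut available, whole-line duplicates, and a made-up function name dedupe_keep_order.

```shell
# Order-preserving dedupe that leans on sort(1) instead of an in-memory map.
# Assumption: a POSIX shell with awk, sort and cut available.
dedupe_keep_order() {
  tab=$(printf '\t')
  awk '{ print NR "\t" $0 }' "$@" |            # 1. tag each line with its number
    LC_ALL=C sort -t "$tab" -k2 -k1,1n |       # 2. group identical text, earliest first
    LC_ALL=C awk '{ text = substr($0, index($0, "\t") + 1) }
         NR == 1 || text != prev { print }     # 3. keep the first line of each group
         { prev = text }' |
    LC_ALL=C sort -t "$tab" -k1,1n |           # 4. restore the original order
    cut -f2-                                   # 5. drop the line numbers
}
```

Only the previous line is held in memory during step 3; the heavy lifting is left to the two sort passes, at the cost of reading the data several times.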


Further reading
Search for “remove duplicate lines”

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.

5 thoughts on “Remove duplicate lines using Groovy”

  1. While I certainly don’t want to criticize your solution, it might be handy to know that IntelliJ does this automatically for you: it marks duplicate entries in properties files as errors. You might already know it, but it’s quite a handy feature!

  2. No problem at all. I’m using Eclipse, and it doesn’t seem to offer that or is not configured.
    Edit: Forgot to mention. This is also just notes I make to myself. The approach may be valid for other types of files, like CSV. Properties file was just the instigating scenario.
