cleaning up email lists with ruby and regex

Posted by on Jan 22nd, 2007

lately, we have been getting a ton of email lists from clients. these usually consist of THOUSANDS of email addresses that have been gathered on their websites. in every case, there was no address validation in the code that gathered the addresses, so we end up with a bunch of gobbledygook. the list management software that we use ("php list":http://www.phplist.com/) will poop if there are any invalid addresses in the text file.. so i came up with this..

i am aware that the function that creates the text file from the array is clunky.. i just cut and pasted it from another program i wrote to do something entirely different.. if it turns out that this becomes a bottleneck, i will look at it..
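
for what it's worth, a leaner take on that file-writing helper could look something like the sketch below. it is not what the script uses: the name write_lines is made up for illustration, and it assumes each array element already ends with its own newline (which is how IO.foreach hands lines over)..

```ruby
# hypothetical stand-in for create_txt_from_array (the name is illustrative only)
def write_lines(lines, filename)
  # File.open with a block closes the file for us, even if a write raises
  File.open(filename, 'w') do |f|
    # each element is assumed to carry its own trailing newline
    lines.each { |line| f << line }
  end
end
```

this writes each line straight to the file instead of building one big string first, which should matter less for speed than for memory on very large lists..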

anyway.. hope this helps someone out..



if ARGV.length != 1
  puts "Usage: clean_addresses.rb <input_filename>\n"
  puts "Will output <input_filename>_cleaned.txt"
  exit
end

filename = ARGV[0]

# File.exist? is the spelling that works on current rubies (exists? was removed in Ruby 3.2)
unless File.exist?(filename)
  puts "Input file does not exist.."
  exit
end

successes = Array.new
failures = Array.new

def create_txt_from_array(array,filename)
  text_file = String.new
  array.each do |e|
    text_file << e
  end
  open(filename,'w') {|f| f << text_file }
end

filename_parsed = filename.split(".")
output_filename = filename_parsed[0].to_s+'_cleaned.txt'
fails_filename = filename_parsed[0].to_s+'_failed.txt'

output_file = File.new(output_filename,'w+')
fails_file = File.new(fails_filename,'w+')

puts "Outputting to: "+ output_filename

IO.foreach(filename) do |line|
  if line.chomp =~ /\A[\w\._%-]+@[\w\.-]+\.[a-zA-Z]{2,4}\z/
    successes << line
  else
    failures << line
  end
end

create_txt_from_array(successes,output_filename)
create_txt_from_array(failures,fails_filename)

puts "There were: "+successes.length.to_s+" successes.\n"
puts "There were: "+failures.length.to_s+" failures.\n"
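
to get a feel for what the pattern keeps and what it throws out, it can be tried by hand on a few addresses. the sketch below reuses the same regex as the script; the constant name ADDRESS_PATTERN, the helper looks_valid? and the sample addresses are all made up for illustration. note that the pattern is fairly strict: it only allows 2 to 4 letters after the last dot, and it has no "+" in the part before the @, so some real addresses will land in the failures file..

```ruby
# same pattern as in the script, pulled into a constant so it can be tried by hand
ADDRESS_PATTERN = /\A[\w\._%-]+@[\w\.-]+\.[a-zA-Z]{2,4}\z/

# hypothetical helper: true when the line (minus its line ending) matches the pattern
def looks_valid?(line)
  ADDRESS_PATTERN.match?(line.chomp)
end

# made-up sample lines, for illustration only
samples = [
  "bob@example.com\n",       # plain address
  "bob@example\n",           # no dot after the @ part
  "bob example.com\n",       # no @ at all
  "bob+news@example.com\n",  # "+" is not in the allowed character set
  "alice@example.museum\n"   # ending is longer than 4 letters
]

samples.each do |s|
  verdict = looks_valid?(s) ? "keep" : "drop"
  puts "#{verdict}  #{s.chomp}"
end
```

if the list contains addresses like the last two, the character set and the {2,4} length limit are the places to loosen things up..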
