HOW-TO Extract Lines from File A that Contain Words in File B?
#1
I have a large text file, over 1 GB in size, containing data line by line. This is text file A.txt.

I then have a second file, B.txt, that contains 30,000 unique words. I want to extract every line of A.txt in which any of these words is found, keeping the whole line and not just the word.

An example of this is:

--Text File A--

dog in house
cat at school
kid in playground
tom at oaks
so much stuff
inhouse cool stuff

--Text File B--

house
oaks

--Result File Output--

dog in house
tom at oaks
inhouse cool stuff


What would be the fastest way to do this? Is there any software on the market that specializes in this type of task?

I don't know any programming languages whatsoever, so if anyone knows a solution that requires writing code I would need newbie instructions on how to carry it out.

I've searched for hours and hours on Google in hopes of finding a solution to this but have come up with absolutely nothing meaningful.

Thanks in Advance
#2
I think you might be able to achieve this by playing around with some of the hashcat utilities, but here is a short Python snippet you can use as well:
Code:
#!/usr/bin/env python

fileA = 'fileA.txt'  # Your main input file
fileB = 'fileB.txt'  # Your file full of words to match against
fileC = 'fileC.txt'  # Output file we will save matching lines to

# Read all non-empty lines from fileB into a list of plain-text tokens
with open(fileB) as fb:
  tokens_to_match = [t.strip() for t in fb if t.strip()]

# The with block closes both files automatically when the loop is done
with open(fileA) as fr, open(fileC, 'w') as fw:
  # Iterate line by line in fileA
  for line in fr:
    text = line.strip()
    # Plain substring test (no regex, so '.' or '*' in a word is taken literally);
    # any() stops at the first hit, so each matching line is written only once
    if any(token in text for token in tokens_to_match):
      fw.write(text + "\n")

Just edit the filenames and paths for fileA, fileB (optionally fileC) and then run:
python scriptname.py

When it is done you should find fileC.txt in the same directory as the script with your matched lines.
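
With a word list as large as the 30,000 entries mentioned in the question, testing every word against every line one at a time may get slow on a 1 GB file. One possible variation, shown here only as a rough sketch, is to combine all the words into a single compiled pattern and search each line once. The function names (build_pattern, matching_lines, extract) are made up for this example, and how much time it actually saves is an assumption you would need to measure on your own data.

```python
import re


def build_pattern(words):
    """Compile one regex that matches any of the given words as plain text."""
    # re.escape keeps characters such as '.' or '+' from acting as regex syntax
    escaped = [re.escape(w.strip()) for w in words if w.strip()]
    if not escaped:
        return None  # nothing to search for
    return re.compile("|".join(escaped))


def matching_lines(lines, pattern):
    """Yield each line (without its newline) that contains at least one word."""
    if pattern is None:
        return
    for line in lines:
        text = line.rstrip("\n")
        if pattern.search(text):
            yield text


def extract(path_a, path_b, path_c):
    """Hypothetical helper: copy lines of path_a containing a word from path_b into path_c."""
    with open(path_b) as fb:
        pattern = build_pattern(fb)
    with open(path_a) as fa, open(path_c, "w") as fc:
        for text in matching_lines(fa, pattern):
            fc.write(text + "\n")
```

To use it you would add a final line such as extract('fileA.txt', 'fileB.txt', 'fileC.txt') and run the script the same way as the one above.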

Hope that helps a bit....
#3
Hi,

Thanks for your reply bro. I actually am paying blazer to make a program for me that will do what I described above. Once he is finished he will also release it to the public for anyone else who needs this type of task done.
#4
cool that he is going to share it but kind of lame you have to pay him to make it. If you just tell me what you want it to do that it isn't doing above I will gladly modify for you for free....
#5
The program you need was written over 40 years ago and it's free.

grep -Ff B.txt A.txt > C.txt

EDIT: This is better:

grep -wFf B.txt A.txt > C.txt

(And to make it case insensitive, use -iwFf)
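
For anyone wondering what those flags change, here is a rough Python sketch of the matching behaviour using the example data from the first post. It is only an approximation and not how grep is implemented: the function name grep_like and its parameters are invented for illustration, with whole_word standing in for -w and ignore_case for -i. Note that the expected result in the first post includes "inhouse cool stuff", which only the plain substring form (without -w) returns.

```python
import re


def grep_like(lines, words, whole_word=False, ignore_case=False):
    """Return the lines containing any of `words`, loosely mimicking grep -F with -w / -i."""
    # Escape every word so it is matched literally, as grep -F does
    escaped = [re.escape(w) for w in words if w]
    if not escaped:
        return []
    body = "(?:" + "|".join(escaped) + ")"
    if whole_word:
        # Require that no letter, digit or underscore touches the match on either side
        body = r"(?<!\w)" + body + r"(?!\w)"
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(body, flags)
    return [line for line in lines if pattern.search(line)]


sample_lines = [
    "dog in house",
    "cat at school",
    "kid in playground",
    "tom at oaks",
    "so much stuff",
    "inhouse cool stuff",
]
sample_words = ["house", "oaks"]

# Substring matching (like -Ff) keeps "inhouse cool stuff"
substring_hits = grep_like(sample_lines, sample_words)
# Whole-word matching (like -wFf) drops it, because "house" is only part of "inhouse"
whole_word_hits = grep_like(sample_lines, sample_words, whole_word=True)
```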
#6
one liner, even better indeed +1!
#7
always helps to know your tools.
#8
(01-04-2015, 05:57 PM)iRuser Wrote: cool that he is going to share it but kind of lame you have to pay him to make it. If you just tell me what you want it to do that it isn't doing above I will gladly modify for you for free....


Hi, I have no problem at all paying a software developer to create a program. It was my idea to pay blazer for the project.

I've been using ULM and ULM CCR every single day for months and months. It's my most used software and it's saved me hundreds and hundreds of hours in productivity. Giving a payment to blazer is the very least I can do for all he's already given to me and everyone else who benefits from ULM.

I appreciate you offering to help out, iRuser. If there's anything I can help you with as well, just hit me up :)
#9
(01-05-2015, 02:53 AM)magnum Wrote: The program you need was written over 40 years ago and it's free.

grep -Ff B.txt A.txt > C.txt

EDIT: This is better:

grep -wFf B.txt A.txt > C.txt

(And to make it case insensitive, use -iwFf)

You, my friend, are amazing! Thank you so much for that great info. grep works like it's literally some sort of magic; it boggles my mind how it finds all the matches and how it does it so fast. True genius.