May 29, 2018

Phishing: Scraping LinkedIn for Target Names

Alright. What do we have?

We’ve got a sockpuppet account on LinkedIn with more connections than our real account. We’ve connected to our target’s internal recruiter. She’s commented that our job history is exactly what she needs for six open positions. Probably a good reason for that... We might have to phish her just to get a foothold onto the network, but we’ll save that rabbit hole for another moment.

Right now we’re going to scrape LinkedIn to build out a name list to be massaged into an email list for a mass phishing campaign. And we’re going to do it with custom crafted scripts using some nasty Regular Expressions, along with a bit of grep/cut/sed.

Well Fucking A.

(::coughs:: terms of service ::cough cough:: banned accounts ::coughs:: something ::wheezes::)

Let’s get to it.

Approach and Methodology

I could kick over a script that works right now. Hell, I could even explain why it works. But Web App UI/UX developers are a fidgety bunch, always changing things around. As soon as you get comfortable with where that couch is, you go to sit in it only to find out (too late!) that it’s been moved. As the shock of having your tail bone hit the ground starts to dissipate you realize you’ve gotta write ANOTHER ERIS DAMNED SCRIPT!!!! to scrape that web site.

So. We’ll go through making one from scratch so that the day after I post this blog and the script no longer works you’ll be able to build a new one for yourself (that’ll last a go or two and then be worthless).

In the process of this, I’ll go over some Regular Expressions and how to use them. With that in mind...

A Note on Regular Expressions

As a young, naive, and idealistic script-kiddie, I believed that “Regular Expressions” described a stable and consistent concept. I thought once I learned “the” set of Regular Expressions, they could be applied everywhere that RegEx was accepted. I was wrong. Every two-bit application that takes RegEx uses a slightly different syntax or a slightly modified way of applying it.

So we aren’t going to learn RegEx as a concept. We’re going to learn RegEx as it applies to one application, ‘grep’, and even then only as far as we need it. Don’t ever commit RegEx concepts to memory. DuckDuckGo that shit when you need it and then forget it. I’ve seen developers and analysts buy those thick-ass O’Reilly “Mastering Regular Expressions” books, like a PolySci undergraduate buying a gilt-edged copy of Hobbes’s Leviathan, thinking they’re gonna read it. They’re not.

So fuck that noise! We’ll use just a bit, as we need it, and then move on.

Oh! One other thing.

My RegEx syntax sucks. So does my bash scripting. It works. Works well, in fact. But it’s ugly. There are a thousand ways to use RegEx and you’ll run into folks who’ll say “if you use this syntax you can save a couple character spaces or do more dynamic searches!” Punk! I’ve got plenty of RAM and enough hard drive space for my 30-word bash script and hacked-together regex that I can afford a nasty script with a few extra characters! I need to get this done and get on with hacking this gorram target!


Scraping Names from LinkedIn Searches

Alright. So you’ve logged into LinkedIn and enumerated a thousand target employees. Go to the first page of the search results and press Ctrl+S to save the page (HTML only, not the complete thing. We don’t need pictures.). Name the HTML page some gibberish and then put it in a folder of its own. Go to the second page of the search results and do the same thing with a different gibberish name.

Now bring up a bash terminal and navigate to the folder where you saved the page.

~/$ cd LinkedInRaw
~/LinkedInRaw/$ ls

Our goal is to pull out the first and last names of each of the search results, and nothing more. We’ll use this to populate a Full Name List and use that subsequent list to create email addresses. We want to be able to do that for one page or for a hundred pages, so we’ll keep this in mind when building out our RegEx.

Note: From here on out, I’ll be using an example from LinkedIn taken at the time of this writing. The following specific RegEx may not work for you. So follow the logic and build the RegEx out as you need it.

First. Let’s find a known First Name and see where it sits in the HTML. In our search results in the web browser we see a “Frank Hamilton”. So let’s look for the word “frank”. (You might find a different name than Frank Hamilton, so use whatever you see in your search result. Me? I've got it in for Frank. So we're scraping his shit.)

~/LinkedInRaw/$ grep -i 'frank' *

The -i is to make everything case insensitive. The asterisk “*” is to grep every one of the files in the folder. This will be important as we save 50 pages in one folder and want to go through them all at once.

Run that grep in the folder you saved the HTML files from LinkedIn. You should get something like:

< A lot of junk >
[&quot;backgroundImage&quot;],&quot;occupation&quot;:&quot;Machine Learning 
Blockchain Development God-
< more junk >

Take a look at the output of that grep. Holy shit, right? If you scroll up in the terminal you might see “frank” highlighted a couple of times in a sea of HTML.

Let’s use the -o flag to only bring out what we search for and the -e flag to use Regular Expressions to give us some flexibility in what we can find. We’ll use the RegEx period “.” before and after the name “frank” to see what the HTML looks like before and after where the name appears in the source code. Just throw a fuckton of periods before and after "frank". Just wanna look around.

~/LinkedInRaw/$ grep -i -o -e '............................frank...........................' *

You should get output that looks a bit like (yours will differ, but same idea):

&quot;firstName&quot;:&quot;Frank&quot;,&quot;lastName&quot;
Ok! We’re getting somewhere! You can see that "firstName&quot" appears just before "Frank"! It looks like there’s a "lastName" there as well! Yes! Let’s cue off of "firstName" and see everything after it. Notice that I’m removing the -i flag as I know exactly what the case structure is now.

~/LinkedInRaw/$ grep -o -e 'firstName............................' *

You should get many outputs along the lines of:

firstName&quot;:&quot;Frank&quot;,&quot;
Let’s clean up our RegEx just a bit, because these period strings are getting ridiculous. We'll use one period “.” and then stick {100} after it to tell grep that we want to see 100 characters out after the firstName variable. That should give us enough to cover even large names. Note we’ll stick escapes “\” in front of the brackets... Because... grep. Its default “basic” RegEx flavor wants the interval braces escaped. (Not bash’s fault this time; the single quotes keep bash’s mitts off the whole thing.)

~/LinkedInRaw/$ grep -o -e 'firstName.\{100\}' *

That should give us output similar to before, but running out to 100 characters past “firstName”. Let’s look at one line as an example:

firstName&quot;:&quot;Frank&quot;,&quot;lastName&quot;:&quot;Hamilton&quot;,&quot;occupation&quot;:&quot;Mach
Looking at the output, we can see that the name “Frank” comes a bit after the variable “firstName” and the last name “Hamilton” comes a bit after the variable “lastName”. The HTML also contains a consistent set of characters like semicolons, commas, ampersands, etc.
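Quick aside: if those backslashes offend you, grep’s -E flag switches to extended RegEx, where the interval braces work bare. Same match, fewer escapes. Here’s a tiny self-contained demo on a canned line (shaped like the HTML above, not pulled from a real saved page), shortened to a 20-character window so it fits:

```shell
# Extended RegEx (-E): the {N} interval needs no backslashes
line='firstName&quot;:&quot;Frank&quot;,&quot;lastName'
match=$(printf '%s\n' "$line" | grep -oE 'firstName.{20}')
printf '%s\n' "$match"   # firstName&quot;:&quot;Frank&q
```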

Let’s use ‘cut’ to clean this up a bit. Let’s specifically try to pull out the first and last names. Before each of those names I see a semicolon “;”. Let’s use that as a delimiter. With the semicolon as a field delimiter, it looks like the first name is in the 3rd field and the last name is in the 7th field. (I'm using my fingers and touching my screen as I count that out. I'm nasty.)

~/LinkedInRaw/$ grep -o -e 'firstName.\{100\}' * | cut -d";" -f 3,7

A sample of your output should look like:

Frank&quot;Hamilton&quot
Let’s take that "&quot;” in the middle and change it into a space using ‘sed’.

~/LinkedInRaw/$ grep -o -e 'firstName.\{100\}' * | cut -d";" -f 3,7 | sed 's/&quot;/ /'

And we get:

Frank Hamilton&quot

We’ll use ‘cut’ again to lop off that last bit…

~/LinkedInRaw/$ grep -o -e 'firstName.\{100\}' * | cut -d";" -f 3,7 | sed 's/&quot;/ /' | cut -d "&" -f 1

And voila, we have output that lists out the first and last name!

Frank Hamilton

Some folks on LinkedIn put certificate acronyms after their last names, often separated by a comma. Pipe your oneliner out to another cut and lop off the commas (e.g. cut -d "," -f 1) and massage your data a bit more if you need.
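To see that whole chain end to end, here it is run against a canned sample line instead of saved page files (hypothetical snippet shaped like the HTML we grepped out above, with a “, CISSP” stuck on the last name):

```shell
# Hypothetical saved-page snippet; &quot; is how quotes look in the HTML source
line='&quot;firstName&quot;:&quot;Frank&quot;,&quot;lastName&quot;:&quot;Hamilton, CISSP&quot;,&quot;occupation&quot;:&quot;Machine Learning Blockchain Development God&quot;'

# Same pipeline as above, plus the extra cut to drop the ", CISSP"
name=$(printf '%s\n' "$line" \
  | grep -o -e 'firstName.\{100\}' \
  | cut -d";" -f 3,7 \
  | sed 's/&quot;/ /' \
  | cut -d "&" -f 1 \
  | cut -d "," -f 1)
printf '%s\n' "$name"   # Frank Hamilton
```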

Once you’re happy, convert everything to lower case (pipe to ‘tr’) and output it to a file.

~/LinkedInRaw/$ grep -o -e 'firstName.\{100\}' * | cut -d";" -f 3,7 | sed 's/&quot;/ /' | cut -d "&" -f 1 | tr '[:upper:]' '[:lower:]' > firstlastnames.txt

Scraping Entire Corporate Employee Lists

At this point, you can go back through the search results by hand. Save each page with a Ctrl+S into the same folder and run that bash oneliner against it at the end.

If you like, you can further script this to automatically log into LinkedIn for you, search, pull files, and run the RegEx. If you search through github, you’ll find a lot of tools that do some form of LinkedIn scraping. Some better than others. Some maintained, most not.

I always say automate the things you do over and over again, but LinkedIn is fairly decent at detecting scraping activity. I still download each page by hand (Ctrl+S, then click next page, Ctrl+S, etc. Takes 5 minutes.) because I have established sockpuppets that I don’t want banned. But do as you like!

Once you have this list, you can build out nice email addresses based on your target's email schema!
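Turning the name list into addresses is one awk away. The schema below (first.last@target.example) is a made-up placeholder; swap in whatever schema you actually dug up for your target (flast@, f.last@, whatever):

```shell
# Hypothetical schema: first.last@target.example -- adjust to your target
emails=$(printf 'frank hamilton\njane doe\n' \
  | awk '{ print $1 "." $2 "@target.example" }')
printf '%s\n' "$emails"
# frank.hamilton@target.example
# jane.doe@target.example
```

In practice you’d feed it firstlastnames.txt instead of the printf, of course.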

Alright. My brain is hemorrhaging right now and I'm starting to bleed out of my nose. Hack the Planet.