Scraping LinkedIn profiles using Python and Selenium

Is it possible to scrape LinkedIn in 2023? It is a question many B2B marketers ask themselves. Most of the tools currently available for LinkedIn data extraction are cloud-based and fairly expensive.

In this post, we’ll explore an ethical approach to scrape LinkedIn profiles at scale using Selenium and headless Chrome. The provided Python code demonstrates how to extract key profile details while adhering to LinkedIn’s access limits.

Why Scrape LinkedIn Profiles?

LinkedIn is the world’s largest professional networking platform, with over 750 million members. As a rich source of professional data, LinkedIn profiles are highly valuable for recruitment, sales prospecting, market research, and more. However, LinkedIn employs strict scraping restrictions to prevent misuse of its data.

Scraping LinkedIn can be useful for numerous legitimate business purposes:

  • Recruiters can source candidate profiles matching specific skills, titles and companies. This allows targeting suitable candidates instead of relying on unsolicited applications.
  • Sales teams can enrich their lead data with social profiles to understand interests, skills and current roles. This helps personalize outreach and conversations.
  • Researchers can analyze profile data to identify hiring trends, skill gaps, salary ranges, demographics and more.
  • Product managers can research competitors’ employees to estimate company size, structure and growth.
  • Marketing agencies can identify industry influencers and potential partners or spokespersons.
  • Account-based marketing teams can build targeted contact lists for their key accounts.

The applications are vast. Manually browsing profiles is time-consuming and lacks scale. Automated scraping enables collecting large profile samples that would be impossible manually.

The Ethical Approach

However, scraping LinkedIn raises important ethical concerns. Excessive scraping can overload LinkedIn’s servers while indiscriminate copying of personal data infringes on privacy. As a result, LinkedIn employs technical safeguards like CAPTCHAs and blocking to prevent abuse.

LinkedIn profile scraping using Python & Selenium

You can copy the code below:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import csv
import time

webdriver_path = "C:/chromedriver/chromedriver.exe"
RETRY_LIMIT = 3

urls = []
failed_urls = []

# Read profile URLs from the input CSV
with open('profileurl.csv', 'r') as file:
    reader = csv.reader(file)
    next(reader)  # Skip the header row
    for row in reader:
        urls.append(row[0])  # The URL is assumed to be in the first column

# Open the output CSV file for writing
with open('output.csv', 'w', newline='', encoding='utf-8') as output_file:
    writer = csv.writer(output_file)
    writer.writerow(['URL', 'Name', 'Job Title', 'Location', 'Followers',
                     'Current Company', 'Current Company Profile URL', 'About'])

    # Iterate over the profile URLs
    for index, url in enumerate(urls, start=1):
        print(f"Crawling URL {index}/{len(urls)}: {url}")

        options = Options()
        options.add_argument("--incognito")
        options.add_argument("--headless")

        # Selenium 4 expects the driver path to be wrapped in a Service object
        driver = webdriver.Chrome(service=Service(webdriver_path), options=options)

        try:
            success = False

            for attempt in range(1, RETRY_LIMIT + 1):
                driver.get(url)
                time.sleep(5)  # Wait for the page to fully load

                # Locate each profile field by its CSS selector and extract the text
                try:
                    name = driver.find_element(
                        By.CSS_SELECTOR, '.top-card-layout__title').get_attribute('innerText').strip()
                    job_title = driver.find_element(
                        By.CSS_SELECTOR, '.top-card-layout__headline').get_attribute('innerText').strip()
                    location = driver.find_element(
                        By.CSS_SELECTOR, '.top-card__subline-item').get_attribute('innerText').strip()
                    followers = driver.find_element(
                        By.CSS_SELECTOR, '.top-card__subline-item:nth-child(2)').get_attribute('innerText').strip()
                    current_company = driver.find_element(
                        By.CSS_SELECTOR, '.top-card-link__description').get_attribute('innerText').strip()
                    current_company_profile_url = driver.find_element(
                        By.CSS_SELECTOR, '.top-card-link--link').get_attribute('href')
                    about = driver.find_element(
                        By.CSS_SELECTOR, 'h3.top-card-layout__first-subline').get_attribute('innerText').strip()

                    # Write the scraped data to the output CSV
                    writer.writerow([url, name, job_title, location, followers,
                                     current_company, current_company_profile_url, about])
                    success = True
                    break
                except NoSuchElementException as e:
                    print(f"Error processing {url} (attempt {attempt}/{RETRY_LIMIT}): {e}")

            # Record the URL for later inspection if every attempt failed
            if not success:
                failed_urls.append(url)

        finally:
            driver.quit()  # Close the browser window and clear all session data

        # Flush the output buffer so each profile is written immediately
        output_file.flush()

# Write the failed URLs to a separate CSV file
with open('failed.csv', 'w', newline='', encoding='utf-8') as failed_file:
    writer = csv.writer(failed_file)
    writer.writerows([[url] for url in failed_urls])

The responsible approach is to scrape respectfully and minimize impact. The sample code demonstrates several best practices:

  • Use headless browser automation instead of simple HTTP requests. This mimics real user behavior and makes blocking based on traffic patterns alone harder.
  • Implement delays and retry logic to throttle requests. This adheres to LinkedIn’s access limits and avoids overloading its infrastructure.
  • Scrape only the necessary profile fields instead of entire pages. This reduces data collection and privacy invasion.
  • Write to CSV files locally instead of storing data on external servers. This limits data retention and unnecessary copies.
  • Check for opt-outs such as noindex meta tags, which signal profiles that don’t want to be indexed (a short sketch follows below).

Such precautions allow gathering just enough data for legitimate purposes without crossing ethical boundaries.
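
As an illustration, here is a minimal sketch of how randomized delays and a noindex opt-out check could be layered onto the scraper. The helper names (polite_sleep, respects_noindex) and the 4–9 second range are illustrative choices, not part of the script above.

import random
import time

from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

def polite_sleep(min_seconds=4, max_seconds=9):
    # Pause for a random interval so requests don't follow a fixed, bot-like rhythm
    time.sleep(random.uniform(min_seconds, max_seconds))

def respects_noindex(driver):
    # True if the profile asks not to be indexed via a robots meta tag
    try:
        robots = driver.find_element(By.CSS_SELECTOR, 'meta[name="robots"]')
        return 'noindex' in (robots.get_attribute('content') or '').lower()
    except NoSuchElementException:
        return False

# Inside the crawl loop, after driver.get(url):
#     polite_sleep()
#     if respects_noindex(driver):
#         continue  # skip profiles that opt out of indexing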

Overview of the Selenium Web Scraping Process

The code implements a robust web scraping pipeline to extract profile information at scale. Let’s examine the key steps:

  • Load the list of profile URLs from a CSV file. This is the seed list to crawl.
  • Initialize headless Chrome using Selenium. Headless mode runs the browser without a visible window, which makes unattended batch runs practical.
  • Iterate through the profile URLs and open each one in the browser.
  • On each profile, use Selenium to locate and extract details like name, job title and location via their CSS selectors.
  • Write the scraped profile data to a CSV file for later analysis.
  • Retry failed profiles up to 3 times to handle transient errors.
  • Pause between page loads to avoid flooding LinkedIn with requests.
  • Save failed URLs to a separate file for debugging and analysis.

This structured process allows systematic data extraction from LinkedIn profiles. The headless browser and retries handle common scraping issues such as slow page loads and transient network errors.
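
If the fixed five-second sleep proves unreliable, an explicit wait is a more robust way to deal with slow page loads. The sketch below is an optional refinement rather than part of the script above, and it reuses the same .top-card-layout__title selector:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

def wait_for_profile(driver, timeout=15):
    # Block until the profile header appears, or report failure after the timeout
    try:
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, '.top-card-layout__title'))
        )
        return True
    except TimeoutException:
        return False

# In the crawl loop, time.sleep(5) can be replaced with:
#     if not wait_for_profile(driver):
#         ...count the attempt as failed and retry...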

Caveats and Ethical Usage

The scraper is powerful, but it remains vital to use it judiciously:

– Only scrape data needed for the defined business purpose, nothing more.

– Implement opt-out mechanisms for individuals and comply with requests.

– Do not sell, share or expose the scraped data externally.

– Rotate IPs and add delays to minimize your footprint when running at large volumes (a proxy sketch follows this list).

– Consult LinkedIn’s terms of service and stay aligned with their acceptable use policies.
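
For reference, routing headless Chrome through a proxy only takes one extra launch argument. The addresses below are placeholders from the documentation IP range; sourcing and rotating real proxies is up to whichever provider you use.

import random

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

webdriver_path = "C:/chromedriver/chromedriver.exe"

# Placeholder proxy endpoints; substitute your provider's list
PROXIES = ["203.0.113.10:8080", "203.0.113.11:8080"]

options = Options()
options.add_argument("--headless")
options.add_argument(f"--proxy-server=http://{random.choice(PROXIES)}")

driver = webdriver.Chrome(service=Service(webdriver_path), options=options)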

With responsible design and usage, automating LinkedIn scraping can unlock great benefits for businesses while respecting user rights and reducing disruption.

A step-by-step method to follow if you are new to the world of Python and data scraping

To use this LinkedIn scraping code, you will need to set up Python and Selenium on your system:

1. Download and install Python from python.org

2. Make sure pip is available, then install Selenium (csv and time are part of Python’s standard library and need no installation):

pip install selenium

3. Download the ChromeDriver build matching your Chrome version from chromedriver.chromium.org

4. Update the `webdriver_path` variable in the code to point to your ChromeDriver location

5. Copy the Python code provided above and save it as linkedin_scraper.py in a folder of your choice

5.1 Create a CSV file with the LinkedIn profile URLs to scrape and save it as profileurl.csv in the same folder as the script.
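
For reference, profileurl.csv needs a header row (the script skips the first line) followed by one profile URL per row. The URLs below are placeholders:

profile_url
https://www.linkedin.com/in/example-profile-1
https://www.linkedin.com/in/example-profile-2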

6. Run the code by typing this in the Terminal

python linkedin_scraper.py

7. Scraped profile data will be saved to output.csv

Follow these steps to configure the scraper, provide input URLs, and extract LinkedIn profiles programmatically. Customize it further to meet your specific use case.

Playground activities

Customizing the LinkedIn Profile Scraper

The code can be adapted to capture additional profile fields or to tweak the crawl parameters:

  • Edit the CSS selectors to extract other elements from the profiles, such as skills, education and certifications.
  • Adjust the retry limit and delays to tune performance vs. impact.
  • Enhance the logic to detect bans, CAPTCHAs and similar roadblocks, and handle them accordingly.
  • Generalize the code to support scraping other social networks such as Twitter or Facebook.
  • For large-scale scraping, the process can be parallelized across multiple machines. The list of URLs can be partitioned so that each machine crawls its own slice from a different IP (a small sketch follows).
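
As an illustration, here is a minimal way the input list could be split into per-machine chunk files. The chunk size and output file names are arbitrary choices for this sketch:

import csv

CHUNK_SIZE = 500  # profiles per machine; pick whatever suits your setup

with open('profileurl.csv', 'r') as file:
    reader = csv.reader(file)
    header = next(reader)
    urls = [row[0] for row in reader]

# Write one profileurl_part_N.csv per chunk, each with the same header row
for part, start in enumerate(range(0, len(urls), CHUNK_SIZE), start=1):
    with open(f'profileurl_part_{part}.csv', 'w', newline='') as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows([[u] for u in urls[start:start + CHUNK_SIZE]])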

Let me know if you have any other questions!
