Scraping LinkedIn profiles using Python and Selenium
Is it possible to scrape LinkedIn in 2023? This is a mysterious question that many B2B marketers ask themselves. Most of the tools currently available for LinkedIn data extraction are cloud-based and fairly expensive.
In this post, we’ll explore an ethical approach to scrape LinkedIn profiles at scale using Selenium and headless Chrome. The provided Python code demonstrates how to extract key profile details while adhering to LinkedIn’s access limits.
Why Scrape LinkedIn Profiles?
Did you know that LinkedIn is the world’s largest professional networking platform, with over 750 million members? As a rich source of professional data, LinkedIn profiles are highly valuable for recruitment, sales prospecting, market research, and more. However, LinkedIn employs strict scraping restrictions to prevent misuse of its data.
Scraping LinkedIn can be useful for numerous legitimate business purposes:
- Recruiters can source candidate profiles matching specific skills, titles, companies, etc. This allows them to target suitable profiles instead of relying on mass outreach.
- Sales teams can enrich their lead data with social profiles to understand interests, skills and current roles. This helps personalize outreach and conversations.
- Researchers can analyze profile data to identify hiring trends, skill gaps, salary ranges, demographics and more.
- Product managers can research competitors’ employees to estimate company size, structure and growth.
- Marketing agencies can identify industry influencers and potential partners or spokespersons.
- Account-based marketing teams can map and verify contacts at their target accounts.
The applications are vast. Manually browsing profiles is time-consuming and lacks scale. Automated scraping enables collecting large profile samples that would be impossible manually.
The Ethical Approach
However, scraping LinkedIn raises important ethical concerns. Excessive scraping can overload LinkedIn’s servers while indiscriminate copying of personal data infringes on privacy. As a result, LinkedIn employs technical safeguards like CAPTCHAs and blocking to prevent abuse.
LinkedIn profile scraping using Python & Selenium
You can copy the code below:
```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import csv
import time

webdriver_path = "C:/chromedriver/chromedriver.exe"
RETRY_LIMIT = 3

urls = []
failed_urls = []

# Read URLs from the input CSV
with open('profileurl.csv', 'r') as file:
    reader = csv.reader(file)
    next(reader)  # Skip the header row
    for row in reader:
        urls.append(row[0])  # Assuming the URL is in the first column

# Open the output CSV file for writing
with open('output.csv', 'w', newline='', encoding='utf-8') as output_file:
    writer = csv.writer(output_file)
    writer.writerow(['URL', 'Name', 'Job Title', 'Location', 'Followers',
                     'Current Company', 'Current Company Profile URL', 'About'])

    # Iterate over the URLs
    for index, url in enumerate(urls, start=1):
        print(f"Crawling URL {index}/{len(urls)}: {url}")

        options = Options()
        options.add_argument("--incognito")
        options.add_argument("--headless")
        driver = webdriver.Chrome(service=Service(webdriver_path), options=options)

        try:
            success = False
            for attempt in range(RETRY_LIMIT):
                driver.get(url)
                time.sleep(5)  # Wait for the page to fully load

                # Selenium locates elements by CSS selector, much like Puppeteer.
                try:
                    name = driver.find_element(
                        By.CSS_SELECTOR, '.top-card-layout__title').get_attribute('innerText').strip()
                    job_title = driver.find_element(
                        By.CSS_SELECTOR, '.top-card-layout__headline').get_attribute('innerText').strip()
                    location = driver.find_element(
                        By.CSS_SELECTOR, '.top-card__subline-item').get_attribute('innerText').strip()
                    followers = driver.find_element(
                        By.CSS_SELECTOR, '.top-card__subline-item:nth-child(2)').get_attribute('innerText').strip()
                    current_company = driver.find_element(
                        By.CSS_SELECTOR, '.top-card-link__description').get_attribute('innerText').strip()
                    current_company_profile_url = driver.find_element(
                        By.CSS_SELECTOR, '.top-card-link--link').get_attribute('href')
                    about = driver.find_element(
                        By.CSS_SELECTOR, 'h3.top-card-layout__first-subline').get_attribute('innerText').strip()

                    # Write the scraped data to the output CSV
                    writer.writerow([url, name, job_title, location, followers,
                                     current_company, current_company_profile_url, about])
                    success = True
                    break
                except NoSuchElementException as e:
                    print(f"Error processing {url} (attempt {attempt + 1}/{RETRY_LIMIT}): {e}")

            if not success:
                failed_urls.append(url)
        finally:
            driver.quit()  # Close the browser window and clear all session data

        output_file.flush()  # Flush so each row is written to disk immediately

# Write the failed URLs to a separate CSV file for debugging
with open('failed.csv', 'w', newline='', encoding='utf-8') as failed_file:
    writer = csv.writer(failed_file)
    writer.writerows([[url] for url in failed_urls])
```
The responsible approach is to scrape respectfully and minimize impact. The sample code demonstrates several best practices:
- Use headless browser automation instead of simple HTTP requests. This mimics real user behavior and prevents easy blocking based on traffic patterns.
- Implement delays and retry logic to throttle requests. This adheres to LinkedIn’s access limits and avoids overloading their infrastructure.
- Scrape only necessary profile fields instead of entire pages. This reduces data collection and privacy invasion.
- Write to CSV files locally instead of storing data on external servers. This limits data retention and unnecessary copies.
- Check for opt-outs like noindex meta tags that signal profiles that don’t want indexing.
Such precautions allow gathering just enough data for legitimate purposes without crossing ethical boundaries.
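The opt-out check in particular can be isolated in a small helper. This is a sketch, not part of the script above: the `wants_indexing` function is an assumption, and the `robots_contents` values would come from reading each `<meta name="robots">` tag’s `content` attribute via Selenium, as in the commented lines.

```python
def wants_indexing(robots_contents):
    """Given the content values of any <meta name="robots"> tags on a
    page, return False if any of them requests noindex."""
    return not any("noindex" in (c or "").lower() for c in robots_contents)

# With Selenium, the content values could be collected like this:
# contents = [m.get_attribute("content")
#             for m in driver.find_elements(By.CSS_SELECTOR, 'meta[name="robots"]')]
# if not wants_indexing(contents):
#     continue  # skip this profile
```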
Overview of the Selenium Web Scraping Process
The code implements a robust web scraping pipeline to extract profile information at scale. Let’s examine the key steps:
- Load list of profile URLs from a CSV file. This contains the starting seed list to crawl.
- Initialize headless Chrome using Selenium. Headless mode runs the browser without a visible window, so the crawl can run unattended.
- Iterate through the profile URLs, loading each one in the browser.
- On each profile, use Selenium to locate and extract details like name, job title, location, etc., via CSS selectors.
- Write the scraped profile data to a CSV file for later analysis.
- Retry failed profiles up to 3 times to handle transient errors.
- Process URLs sequentially with fixed delays between requests to avoid flooding LinkedIn.
- Save failed URLs to file for debugging and analysis.
This structured process allows systematic data extraction from LinkedIn profiles. The headless browser and retries handle common scraping issues like page timeouts, network errors etc.
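The retry step can also be factored out on its own. Here is a minimal sketch (the `with_retries` helper is illustrative, not part of the script above) that doubles the wait between attempts so transient errors have time to clear:

```python
import time

def with_retries(task, limit=3, base_delay=2.0):
    """Call task() up to `limit` times, doubling the delay between
    attempts so transient errors (timeouts, slow loads) can clear.
    Re-raises the last error if all attempts fail."""
    for attempt in range(1, limit + 1):
        try:
            return task()
        except Exception:
            if attempt == limit:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```

In the main loop, the per-profile extraction could be wrapped as `with_retries(lambda: scrape_profile(driver, url))`, with `scrape_profile` being whatever extraction function you define.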
Caveats and Ethical Usage
While powerful, using this scraper judiciously remains vital:
- Only scrape data needed for the defined business purpose, nothing more.
- Implement opt-out mechanisms for individuals and comply with removal requests.
- Do not sell, share, or expose the scraped data externally.
- Rotate IPs and add delays to minimize footprint if running at large volumes.
- Consult LinkedIn’s terms of service and stay aligned with their acceptable use policies.
With responsible design and usage, automating LinkedIn scraping can unlock great benefits for businesses while respecting user rights and reducing disruption.
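One way to add the delays mentioned above is a jittered pause between profiles, so requests don’t arrive at a fixed, bot-like cadence. This helper is a suggestion, not part of the original code:

```python
import random
import time

def polite_pause(base=5.0, jitter=3.0):
    """Sleep for `base` seconds plus a random amount up to `jitter`,
    and return the delay that was actually used."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay
```

Calling `polite_pause()` in place of the fixed `time.sleep(5)` in the main loop makes the crawl pattern less regular.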
A step-by-step method to follow if you are new to the world of Python and data scraping
To use this LinkedIn scraping code, you will need to set up Python and Selenium on your system:
1. Download and install Python from python.org
2. Ensure pip is available (it ships with modern Python installers), then run:
pip install selenium
(The `csv` and `time` modules are part of the standard library and need no installation.)
3. Download ChromeDriver (matching your installed Chrome version) from chromedriver.chromium.org/downloads
4. Update the `webdriver_path` variable in the code to point to your ChromeDriver location
5. Copy the Python code provided above and save it as linkedin_scraper.py in a folder of your choice.
5.1 Create a CSV file listing the LinkedIn profile URLs to scrape and save it as profileurl.csv in the same folder as the script.
6. Run the code by typing this in the Terminal
python linkedin_scraper.py
7. Scraped profile data will be saved to output.csv
Follow these steps to configure the scraper, provide input URLs, and extract LinkedIn profiles programmatically. Customize it further to meet your specific use case.
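For reference, profileurl.csv is expected to contain a header row followed by one profile URL per line (the URLs below are placeholders, not real profiles):

```
profile_url
https://www.linkedin.com/in/jane-doe-example/
https://www.linkedin.com/in/john-smith-example/
```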
Playground activities
Customizing the LinkedIn Profile Scraper
The code can be adapted to capture additional profile fields or tweak the crawl parameters:
- Edit the CSS selectors to extract other elements from the profiles, like skills, education, certifications, etc.
- Adjust the retry limit, concurrency limits, and delays to tune performance vs. impact.
- Enhance the logic to check for bans, CAPTCHAs, etc. and handle them accordingly.
- Generalize the code to support scraping other social networks like Twitter, Facebook, etc.
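When adding new selectors, a small wrapper keeps one missing field from failing the whole profile. A sketch (the `safe_text` helper and any extra selectors you pass it are assumptions to verify against the live page; the string `"css selector"` is the literal value of Selenium’s `By.CSS_SELECTOR`):

```python
def safe_text(driver, selector, default=""):
    """Return the innerText of the first element matching `selector`,
    or `default` if nothing matches. The broad `except` keeps this
    sketch dependency-free; in the real script, catch
    selenium.common.exceptions.NoSuchElementException instead."""
    try:
        element = driver.find_element("css selector", selector)
        return element.get_attribute("innerText").strip()
    except Exception:
        return default
```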
- For large-scale scraping, the process can be parallelized across multiple machines. The list of URLs can be partitioned to allow concurrent scraping from different IPs.
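The partitioning idea can be sketched in a few lines; each chunk would then be fed to a separate machine or worker process (the `partition` helper is an assumption, not part of the script above):

```python
def partition(urls, n_workers):
    """Split `urls` into `n_workers` round-robin chunks of roughly
    equal size, one chunk per scraping worker/IP."""
    return [urls[i::n_workers] for i in range(n_workers)]
```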
Let me know if you have any other questions!