Web Scraping using Jsoup, Selenium and Java



In this article we will see how to do web scraping in Java using Jsoup and Selenium.


You can download the Jsoup JAR from either of the links below:

https://jsoup.org/download

http://www.java2s.com/Code/Jar/j/Downloadjsoup160jar.htm


Below is the full code to scrape the data and write it to a text file.

In the following example, we navigate to a web page that has a dropdown with 12 values. For each value, we select it and click the search button, which opens a results page. The data we want is spread across multiple pages, so we scrape the data from all of those pages. We then navigate back to the home page (the page where we selected the dropdown value and clicked the search button), select the next dropdown value, and repeat the same procedure until we have processed all 12 dropdown values. Please leave a comment if this works for you.

import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.concurrent.TimeUnit;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.Select;
import org.testng.annotations.AfterTest;
import org.testng.annotations.BeforeTest;
import org.testng.annotations.Test;

public class JAVA_WEB_SCRAPPING_DEM0 {

    public WebDriver driver;
    String currentUrl = null;
    String url = "url of a webpage/website";

    @BeforeTest(alwaysRun = true)
    public void beforeTest() {
        System.setProperty("webdriver.chrome.driver", "location of a chromedriver");

        // Chrome preferences and flags related to pop-up handling.
        HashMap<String, Object> chromePrefs = new HashMap<String, Object>();
        chromePrefs.put("profile.default_content_settings.popups", 0);
        ChromeOptions options = new ChromeOptions();
        options.setExperimentalOption("prefs", chromePrefs);
        options.addArguments("disable-popup-blocking");

        driver = new ChromeDriver(options);
        driver.get(url);
        driver.manage().window().maximize();
        // The implicit wait applies to every findElement call, so it is set once here.
        driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
    }

    @Test(priority = 0, alwaysRun = true)
    public void selecteleDropdown() throws InterruptedException, IOException {
        // FileWriter creates the file if it does not exist and overwrites it otherwise.
        // try-with-resources closes the writer even if a step below throws.
        try (FileWriter myWriter = new FileWriter("path to a file with txt file name")) {
            int optionCount = new Select(driver.findElement(By.name("webelement"))).getOptions().size();

            // The loop starts at 1 on the assumption that index 0 is a placeholder entry;
            // start at 0 if the first option is a real value.
            for (int i = 1; i < optionCount; i++) {
                // Crude pause to let the home page settle before touching the dropdown.
                Thread.sleep(3000);

                // Look the dropdown up again on every pass: an element found before
                // navigating away may be stale once we come back.
                Select dropdown = new Select(driver.findElement(By.name("webelement")));
                dropdown.selectByIndex(i);
                System.out.println(dropdown.getOptions().get(i).getText());

                // Click the search button and record the URL of the results page.
                driver.findElement(By.xpath("webelement")).click();
                currentUrl = driver.getCurrentUrl();
                System.out.println(currentUrl);

                // Jsoup fetches the URL in its own HTTP session and reads only this one page;
                // results that continue on further pages need an additional loop.
                Document doc = Jsoup.connect(currentUrl).get();
                Element parentElement = doc.select("web element tag name for which you have to get all data").first();

                // first() returns null when nothing matches the selector.
                if (parentElement != null) {
                    myWriter.write(parentElement.text());
                    myWriter.write(System.lineSeparator());
                }
                myWriter.write("==========================================");
                myWriter.write(System.lineSeparator());

                // Return to the home page for the next dropdown value.
                driver.navigate().back();
            }
        }
    }

    @AfterTest(alwaysRun = true)
    public void afterTest() {
        // Close the browser and end the ChromeDriver session.
        if (driver != null) {
            driver.quit();
        }
    }
}
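
The walkthrough above mentions results that are spread over several pages, while the listing reads the single page at currentUrl. One way to cover the remaining pages is to keep following a "next page" link until there is none. The sketch below shows only that loop, using plain Java so it can run without a browser. The names PagedScrapeSketch, Page, collectAllPages and the fetcher parameter are hypothetical and not part of Jsoup or Selenium. In a real scraper the fetcher would probably wrap Jsoup.connect(url).get() together with a site-specific selector for the next link, which is an assumption about the target site.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch: gather text from a chain of result pages. */
final class PagedScrapeSketch {

    /** One fetched page: its extracted text and the URL of the next page (null on the last page). */
    static final class Page {
        final String text;
        final String nextUrl;

        Page(String text, String nextUrl) {
            this.text = text;
            this.nextUrl = nextUrl;
        }
    }

    /**
     * Follows "next" links starting at startUrl and returns each page's text in visit order.
     * The fetcher hides how a page is loaded (assumption: it returns null if the page is missing).
     */
    static List<String> collectAllPages(String startUrl, Function<String, Page> fetcher) {
        List<String> texts = new ArrayList<>();
        Set<String> visited = new HashSet<>(); // guards against pagination links that loop back
        String url = startUrl;
        while (url != null && visited.add(url)) {
            Page page = fetcher.apply(url);
            if (page == null) {
                break; // nothing could be loaded for this URL
            }
            texts.add(page.text);
            url = page.nextUrl;
        }
        return texts;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a three-page result set (made-up URLs and text).
        Map<String, Page> site = new HashMap<>();
        site.put("results?page=1", new Page("rows 1-10", "results?page=2"));
        site.put("results?page=2", new Page("rows 11-20", "results?page=3"));
        site.put("results?page=3", new Page("rows 21-25", null));

        List<String> texts = collectAllPages("results?page=1", site::get);
        System.out.println(String.join(System.lineSeparator(), texts));
    }
}
```

In the scraper, a call like collectAllPages(currentUrl, fetcher) could take the place of the single Jsoup fetch inside the dropdown loop, with each returned text written to the file before the separator line.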

