How to parse HTML and extract data from it using Java?

Java has many libraries for parsing HTML. Before we can extract data from an HTML page or manipulate it, we need to parse it.

jsoup provides an easy-to-use, powerful and compact API to parse HTML and extract data. It supports DOM traversal as well as CSS and jQuery-like selectors.

It is designed to handle all kinds of HTML and will parse even HTML that is not perfectly valid.
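
To illustrate the point about imperfect HTML, here is a minimal sketch, assuming jsoup is on the classpath. The class name LenientParsing, the helper countParagraphs and the sample markup are made up for this illustration. The input has no html or body wrapper and neither paragraph is closed, yet jsoup still builds a document that can be queried.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LenientParsing {

    // Hypothetical helper: counts the <p> elements jsoup finds
    // after it has parsed (and repaired) the given markup.
    static int countParagraphs(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("p").size();
    }

    public static void main(String[] args) {
        // Sample input (an assumption): no <html>/<body>, unclosed <p> tags.
        String messy = "<p>First paragraph<p>Second paragraph";

        System.out.println("Paragraphs found: " + countParagraphs(messy));

        // Print the body as jsoup rebuilt it, to see the repaired markup.
        System.out.println(Jsoup.parse(messy).body().html());
    }
}
```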


Example

In this example we fetch the BBC Sport page, parse it into a DOM and then select the headlines using a CSS selector.

Document doc = Jsoup.connect("http://www.bbc.co.uk/sport/0/").get();
Elements newsHeadlines = doc.select("#more-news-headlines li");


We can also pass the HTML to jsoup directly as a string instead of fetching it from a URL.
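
Parsing from a string could look roughly like the sketch below. It assumes jsoup is on the classpath. The class name ParseFromString, the helper extractHeadlines, the "headlines" id and the sample HTML are all invented for the example and do not come from any real page.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParseFromString {

    // Hypothetical helper: returns the text of every <li> inside
    // the element whose id is "headlines".
    static List<String> extractHeadlines(String html) {
        Document doc = Jsoup.parse(html);               // parse the string into a DOM
        Elements items = doc.select("#headlines li");   // CSS selector, as in the example above
        List<String> result = new ArrayList<>();
        for (Element item : items) {
            result.add(item.text());                    // text() drops the inner tags
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample markup (an assumption, not taken from a real site).
        String html = "<html><body><ul id=\"headlines\">"
                + "<li><a href=\"/one\">First story</a></li>"
                + "<li><a href=\"/two\">Second story</a></li>"
                + "</ul></body></html>";

        for (String headline : extractHeadlines(html)) {
            System.out.println(headline);
        }
    }
}
```

The same select call works on a document fetched with Jsoup.connect, so code written against a string can be reused for live pages.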

Detailed documentation is available on the jsoup website. The official cookbook is a good place to start.


