PulsarRPA

Basic Usage


In a nutshell, PulsarRPA comes down to implementing two things correctly: loading web pages and extracting data. To this end, nearly a million lines of code have been written and refined, and a range of cutting-edge technologies has been developed.

Loading Web Pages:

  1. Load a web page or resource.
  2. Load and render a web page using a browser, or load a single resource using the raw protocol.
  3. Load hundreds of thousands of web pages on a single machine.
  4. Load millions of web pages on ten machines.
  5. Load 10 x N million web pages on N machines.

Extracting Data:

  1. Extract data from web page content and from text resources.
  2. Extract data from the Document Object Model (DOM).
  3. Extract data after interacting with web pages.
  4. Extract data after taking screenshots of web pages.
  5. Automatically extract data from millions of web pages.

Combined:

  1. Use SQL to load web pages and extract data.
  2. Load and extract data through cloud services.
  3. Correctly load and extract data in extremely complex application scenarios.
  4. Other auxiliary methods around loading and extracting.

PulsarRPA implements the web-as-a-database paradigm, treating the external web like an internal database: if the required data is not in local storage, or the existing version does not meet analysis needs, the system collects the latest version of the data from the internet.

This course introduces the basic APIs for loading web pages and extracting data, all of which live in PulsarSession. PulsarSession provides a rich set of APIs to cover every need of the “load-parse-extract” process.

These APIs let us solve the “load-parse-extract” problem with just one line of code in most programming scenarios:

  1. Regular loading.
  2. Asynchronous loading.
  3. Batch loading.
  4. Batch loading of linked pages.
  5. Various parameter combinations.

Here’s a preview of these APIs, which we will explain in detail later.

  1. load()
  2. loadDeferred()
  3. loadAsync()
  4. submit()
  5. loadAll()
  6. submitAll()
  7. loadOutPages()
  8. submitForOutPages()
  9. loadResource()
  10. loadResourceDeferred()
  11. loadDocument()
  12. scrape()
  13. scrapeOutPages()

Let’s see what common usage looks like. First, create a Pulsar session, where all the important work happens:

val session = PulsarContexts.createSession()
val url = "https://www.amazon.com/dp/B0C1H26C46"

The fundamental idea, and the fundamental method, is load(), which first tries to load the web page from local storage. If the required page does not exist, has expired, or fails to meet other requirements, it fetches the page from the Internet:

val page = session.load(url)

A simple parameter specifies the page’s expiration time. If the required page is already stored locally and has not expired, the local version is returned:

// Returns the local version if it is still fresh (less than 100 days old)
val page2 = session.load(url, "-expires 100d")

In continuous crawling, we handle large-scale tasks asynchronously and in parallel, submitting large batches of URLs to the URL pool to be processed continuously in the crawling loop:

// Submit a URL to the URL pool; it will be processed in a crawling loop
session.submit(url, "-expires 10s")
// Submit a batch of URLs; they will be processed in a crawling loop
val urls = listOf(url) // in practice, a large batch of URLs
session.submitAll(urls, "-expires 30d")
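
Note that submitted URLs are only consumed while the crawling loop runs; in a standalone program, block the main thread until the pool is drained. A minimal sketch, assuming PulsarContexts.await() waits until all submitted tasks are finished:

// Block until the crawling loop has processed all submitted URLs
// (assumption: PulsarContexts.await() waits for pending tasks to finish)
PulsarContexts.await()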

Once a web page is successfully loaded or crawled, we usually need to parse the web page into a DOM and start the subsequent data extraction work:

// Parse the page content into a document
val document = session.parse(page)
// Use the document
val title = document.selectFirstText(".title")
// ...

Alternatively, the two “load-parse” steps can be combined into a single method call:

// Load and parse
val document = session.loadDocument(url, "-expires 10s")
// Use the document
val title = document.selectFirstText(".title")
// ...

In later chapters, we will introduce in detail how to use standard CSS selectors and extended CSS selectors to extract data.
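
As a quick preview, standard selectors work as in jsoup, while extended selectors add capabilities such as selecting by computed layout metrics. A hedged sketch; the `:expr` pseudo-selector below is an assumption about the extended syntax:

// Standard CSS, jsoup style
val price = document.selectFirstText("#price")
// Extended CSS (assumption): select elements by numeric expressions over layout features
val bigImages = document.select("img:expr(width >= 300 && height >= 300)")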

In many scenarios, we start from a list page and crawl the pages it links to; for example, we open a product list page and then crawl the detail page of each product. PulsarSession provides a set of methods to simplify this task:

// Load the portal page, then load all out pages specified by `-outLink`.
// Option `-outLink` specifies the CSS selector for the links on the portal page to load.
// Option `-topLinks` limits the number of links selected by `-outLink`.
val pages = session.loadOutPages(url, "-expires 10s -itemExpires 10s -outLink a[href~=/dp/] -topLinks 10")

// Load the portal page and submit the out links specified by `-outLink` to the URL pool.
// Option `-outLink` specifies the CSS selector for the links on the portal page to submit.
// Option `-topLinks` limits the number of links selected by `-outLink`.
session.submitForOutPages(url, "-expires 1d -itemExpires 7d -outLink a[href~=/dp/] -topLinks 10")
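
The pages returned by loadOutPages() are ordinary web pages, so the parse-and-extract steps shown earlier apply unchanged:

// Parse each detail page loaded above and print one field from it
pages.forEach { detailPage ->
    val doc = session.parse(detailPage)
    println(doc.selectFirstText("#title"))
}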

Web page crawling ultimately involves extracting fields from web pages. PulsarSession also provides a wealth of methods to simplify the “crawl-parse-extract” composite task, with both single-page and batch processing versions:

// Load page, parse page, and extract fields
val fields = session.scrape(url, "-expires 10s", "#centerCol", listOf("#title", "#acrCustomerReviewText"))

// Load page, parse page, and extract named fields
val fields2 = session.scrape(url, "-i 10s", "#centerCol",
    mapOf("title" to "#title", "reviews" to "#acrCustomerReviewText"))

// Load linked pages, parse linked pages, and extract named fields from linked pages
val fields3 = session.scrapeOutPages(url, "-i 10s -ii 10s -outLink a[href~=/dp/] -topLinks 10", "#centerCol",
    mapOf("title" to "#title", "reviews" to "#acrCustomerReviewText"))

In large-scale crawling projects, we usually do not write sequential code like load() -> parse() -> select(). Instead, we activate the parsing subsystem and register event handlers in it to perform document-related tasks, such as extracting fields, saving them to the database, and collecting more links:

// Add `-parse` option to activate the parsing subsystem
val page = session.load(url, "-parse")
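
A minimal sketch of this event-driven style follows. The options-based registration and the onHTMLDocumentParsed handler name are assumptions about the event API and may differ across versions:

// Assumption: load options expose event handlers, including one fired
// after the fetched page has been parsed into a document
val options = session.options("-parse -expires 1d")
options.eventHandlers.loadEventHandlers.onHTMLDocumentParsed.addLast { page, document ->
    val title = document.selectFirstText("#title")
    println("${page.url} -> $title")
    // save fields to the database, collect more links, etc.
}
val parsedPage = session.load(url, options)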

PulsarSession also provides a set of practical methods to ease day-to-day programming, such as asynchronous calls, coroutine support, and flexible parameter combinations:

// Kotlin suspend calls
val page = runBlocking { session.loadDeferred(url, "-expires 10s") }

// Java-style async calls
session.loadAsync(url, "-expires 10s").thenApply(session::parse).thenAccept(session::export)
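
Since loadAsync() returns a standard CompletableFuture, as the thenApply/thenAccept chain above shows, the usual composition utilities apply, for example fanning out a batch of loads:

// Fan out asynchronous loads; all requests proceed concurrently
val futures = urls.map { session.loadAsync(it, "-expires 10s") }
// Block until every load completes and collect the resulting pages
val loadedPages = futures.map { it.join() }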

Finally, let’s summarize the family of methods we provide to load pages:

Function Name           Description
load()                  Load a page
loadDeferred()          Load a page asynchronously; can be called from Kotlin coroutines
loadAsync()             Load a page asynchronously, in the Java async programming style
submit()                Submit a URL; the page will be loaded in the crawling loop
loadAll()               Load a batch of pages
submitAll()             Submit a batch of URLs; the pages will be loaded in the crawling loop
loadOutPages()          Load a batch of linked pages
submitForOutPages()     Submit the links to linked pages; they will be loaded in the crawling loop
loadResource()          Load a resource over plain HTTP, without browser rendering
loadResourceDeferred()  Load a resource asynchronously; can be called from Kotlin coroutines
loadDocument()          Load a page and parse it into a DOM document
scrape()                Load a page, parse it into a DOM document, and extract data from it
scrapeOutPages()        Load a batch of linked pages, parse them into DOM documents, and extract data from them

These methods come with a wealth of overloads to meet the majority of complex programming requirements. For example, when we are ready to start a new crawling project, the first step is to assess the difficulty of the crawl, which can be kicked off with a single method call:

fun main() {
    val fields = PulsarContexts.createSession().scrapeOutPages(
        "https://www.amazon.com/", "-outLink a[href~=/dp/]", listOf("#title", "#acrCustomerReviewText"))
    fields.forEach { println(it) }
}

The complete code for this course can be found here: kotlin, Chinese mirror. For more detailed usage, you can read the source code directly: PulsarSession, Chinese mirror.

In the next chapter, we will introduce Load Options in detail. By configuring load options, you can precisely define your crawling tasks.

