PulsarRPA

Top Project Practice

Exotic Amazon (Chinese mirror: exotic-amazon) is a complete, ready-to-use solution for crawling the entire amazon.com website. It covers most of Amazon's data types and will remain free and open source permanently.

The methods and processes for collecting data from other e-commerce platforms are broadly similar. You can adapt the business logic in this project to suit them, and its infrastructure already handles the difficulties of large-scale data collection.

Thanks to the comprehensive Web data management infrastructure provided by PulsarRPA, the entire solution consists of no more than 3500 lines of Kotlin code and fewer than 700 lines of X-SQL, which together extract more than 650 fields.

Data Introduction

Getting Started

git clone https://github.com/platonai/exotic-amazon.git
cd exotic-amazon && mvn

java -jar target/exotic-amazon*.jar
# Or on Windows:
java -jar target/exotic-amazon-{the-actual-version}.jar

Open System Glances to get a clear view of the system status.

Handling Extraction Results

Extraction Rules

All extraction rules (Chinese mirror: exotic-amazon) are written in X-SQL. Data type conversion and data cleaning are also handled inline in X-SQL, which is one of the main reasons we developed it. A good example is x-asin.sql (Chinese mirror: exotic-amazon), which extracts more than 70 fields from each product page.
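
To give a feel for what "data type conversion and data cleaning" means for a scraped field, here is a minimal plain-Java sketch. It is not X-SQL and does not use any PulsarRPA API. The class `FieldCleaning`, the method `firstNumber`, and the sample strings are hypothetical, and the sketch assumes US-style number formatting (comma as thousands separator).

```java
import java.math.BigDecimal;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of exotic-amazon. */
final class FieldCleaning {
    // A digit, then any mix of digits and commas, then an optional decimal part.
    private static final Pattern NUMBER = Pattern.compile("\\d[\\d,]*(?:\\.\\d+)?");

    /** Returns the first number found in the raw text, or null when there is none. */
    static BigDecimal firstNumber(String raw) {
        if (raw == null) {
            return null;
        }
        Matcher m = NUMBER.matcher(raw);
        if (!m.find()) {
            return null;
        }
        // Drop thousands separators before converting the text to a number.
        return new BigDecimal(m.group().replace(",", ""));
    }

    public static void main(String[] args) {
        // Sample raw strings as they might appear on a product page (assumed).
        System.out.println(firstNumber("$1,299.99"));
        System.out.println(firstNumber("4.5 out of 5 stars"));
        System.out.println(firstNumber("Currently unavailable"));
    }
}
```

In the project itself this kind of conversion lives inside the X-SQL rules, so the raw text and the cleaned value are produced in the same query.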

Saving Extraction Results in the Local File System

By default, results are written in JSON format to the local file system.
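
As a rough picture of what writing one result as JSON involves, the sketch below serializes a flat record and writes it to a file using only the Java standard library. It is an assumption-laden illustration, not the project's actual writer: the class `JsonResultWriter`, the one-file-per-record layout, and the field names are all hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper for illustration only; not part of PulsarRPA or exotic-amazon. */
final class JsonResultWriter {

    /** Escapes a string so it can be placed inside a JSON string literal. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Serializes a flat record. Finite numbers and booleans stay unquoted; null becomes JSON null. */
    static String toJson(Map<String, ?> record) {
        StringJoiner joiner = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, ?> e : record.entrySet()) {
            Object v = e.getValue();
            String value;
            if (v == null) value = "null";
            else if (v instanceof Number || v instanceof Boolean) value = v.toString();
            else value = "\"" + escape(v.toString()) + "\"";
            joiner.add("\"" + escape(e.getKey()) + "\":" + value);
        }
        return joiner.toString();
    }

    /** Writes the record to dir/name.json as UTF-8, creating the directory if needed. */
    static Path write(Path dir, String name, Map<String, ?> record) throws IOException {
        Files.createDirectories(dir);
        Path file = dir.resolve(name + ".json");
        Files.write(file, toJson(record).getBytes(StandardCharsets.UTF_8));
        return file;
    }

    public static void main(String[] args) throws IOException {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("asin", "B000TEST"); // made-up sample values
        record.put("price", 19.99);
        Path file = write(Files.createTempDirectory("results"), "B000TEST", record);
        System.out.println("Wrote " + file);
    }
}
```

A real deployment would more likely rely on an established JSON library; the hand-written serializer here only keeps the sketch free of dependencies.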

Saving Extraction Results to the Database

There are several ways to save results to the database:

  1. Serialize the results as key-value pairs and save them as a field of the WebPage object, the core data structure of the entire system. This option is enabled by default.
  2. Write the results to a JDBC-compatible database, such as MySQL, PostgreSQL, MS SQL Server, Oracle, etc.
  3. Write a few lines of code to save the results to any destination you wish.
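
The second and third options can be sketched as a small sink abstraction. Everything in the block below is hypothetical: `ResultSink`, `InMemorySink`, `JdbcSink`, and the table and column names are illustrative and are not PulsarRPA or exotic-amazon APIs. Only the `java.sql` calls are standard JDBC, and the sketch assumes the target table already exists with columns named after the extracted fields.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical destination for one extracted record. */
interface ResultSink {
    void save(Map<String, Object> record) throws Exception;
}

/** Keeps records in memory; a template for "any destination you wish". */
final class InMemorySink implements ResultSink {
    final List<Map<String, Object>> saved = new ArrayList<>();

    @Override
    public void save(Map<String, Object> record) {
        saved.add(new LinkedHashMap<>(record)); // defensive copy
    }
}

/** Writes each record as one row through any JDBC connection. */
final class JdbcSink implements ResultSink {
    private static final Pattern IDENT = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");
    private final Connection connection;
    private final String table;

    JdbcSink(Connection connection, String table) {
        this.connection = connection;
        this.table = table;
    }

    /** Builds a parameterized INSERT; identifiers are validated, values are bound later. */
    static String insertSql(String table, Collection<String> columns) {
        if (columns.isEmpty()) throw new IllegalArgumentException("no columns");
        List<String> names = new ArrayList<>(columns);
        names.add(table);
        for (String name : names) {
            if (!IDENT.matcher(name).matches()) {
                throw new IllegalArgumentException("unsafe identifier: " + name);
            }
        }
        String placeholders = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "INSERT INTO " + table + " (" + String.join(", ", columns) + ") VALUES (" + placeholders + ")";
    }

    @Override
    public void save(Map<String, Object> record) throws SQLException {
        try (PreparedStatement ps = connection.prepareStatement(insertSql(table, record.keySet()))) {
            int i = 1;
            for (Object value : record.values()) {
                ps.setObject(i++, value); // the driver maps Java types to SQL types
            }
            ps.executeUpdate();
        }
    }
}
```

A `JdbcSink` would be constructed with a connection from `DriverManager.getConnection(...)` for whichever database driver is on the classpath. Because values are bound as parameters and identifiers are checked against a strict pattern, extracted text cannot alter the SQL statement.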
