Sentiment Analysis with Rust, Rocket.rs & sqlx

Project Setup

Rust delivers the promise of performance in a strongly typed language, and Guards round it out into a complete coding environment. Guards allow for a new style of control flow for information passing through a program. Data can be expressed as a struct and/or enum in Rust. Combined with macros, the syntax for designing concise, discrete logic is, in my opinion, a great ergonomic experience when it comes to writing software.

In this article, we'll cover the basics of implementing a Command Line Interface using Clap. We'll explore how to use derive to provide multiple functions for an application to run. We'll design a basic software package that will launch a Rocket.rs service, as well as a command line interface that will insert information into a [PostgreSQL](https://hub.docker.com/_/postgres) database, which will in turn be rendered by a JSON API we've built.

Let's start by creating a binary package and adding crates to the Cargo.toml file.

$ cargo new cli-webservice
$ cd cli-webservice

Cargo.toml

[dependencies]
# Rust Struct serialization and deserialization 
serde = { version = "1.0.197", features = ["derive"] }
# Serialize and deserialize into and from JSON
serde_json = { version = "1.0.114" }
# Command Line Argument Parser for Rust: The derive feature provides procedural macros to configure the CLI interface/outputs
clap = { version = "4.5.1", features = ["derive"] }
# An async, pure Rust SQL crate featuring compile-time checked queries without a DSL. Supports PostgreSQL, MySQL, and SQLite. 
sqlx = { package = "sqlx", version="0.6.3", features=["runtime-tokio-rustls", "macros", "uuid", "postgres", "chrono"] }
# Application-level tracing for Rust.
tracing = { version = "0.1.40" }
tracing-subscriber = { version = "0.3.18", features=["fmt"] }
# A web framework for Rust that makes it simple to write fast, type-safe, secure web applications with incredible usability, productivity and performance.
rocket = { version = "=0.5.1", features = ["json", "secrets"]}
rocket_db_pools = { version = "=0.2.0", features=["sqlx_postgres"]}

What we'll build together - Architecture

Great, now that we have everything set up, let's go ahead and start designing the CLI interface. The program we're building is part web application and part machine learning pipeline. We'll serve an ML model that takes in one or more sentences and returns a sentiment for the words used in those sentences.

Topology of Services

Any number of programming languages might submit a Document to be parsed and analyzed for sentiment. To track these submissions and save on processing power, we'll reduce each document to a hash that tells us whether it has been processed before, keeping us from having to reprocess the information. If the document hasn't been processed before, we'll break it down into tokens and feed those tokens into the ML pipeline, which will produce a sentiment for each sentence.
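To make that concrete, here's a minimal sketch of reducing a document to a stable fingerprint, assuming the sha2 crate (not listed in the manifest above); any stable hash would do for the lookup.

use sha2::{Digest, Sha256};

// Reduce a document to a stable hex fingerprint we can look up before reprocessing.
fn document_fingerprint(document: &str) -> String {
  let mut hasher = Sha256::new();
  hasher.update(document.as_bytes());
  hasher
    .finalize()
    .iter()
    .map(|byte| format!("{:02x}", byte))
    .collect()
}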

Sentiment Analysis Architecture

Document Submission & Retrieval

Because the Rocket.rs JSON API accepts JSON, a client could be written in any number of languages capable of sending HTTP(S) requests. We'll write a client in Rust using clap that can run any number of programs from the command line interface (CLI), all within the same codebase. We'll use Cargo Workspaces to organize the code into internal crates. The submission & retrieval program will be written with reqwest.
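As a rough sketch of what that submission call could look like with reqwest, assuming reqwest with its json feature (not in the manifest above); the endpoint and payload match the API we'll spec out in the next section.

use serde_json::json;

// Hypothetical submission call against the POST /sa/submit endpoint described below.
async fn submit_documents(documents: Vec<String>) -> Result<(), reqwest::Error> {
  let client = reqwest::Client::new();
  let response = client
    .post("http://localhost:8200/sa/submit")
    .json(&json!({ "documents": documents }))
    .send()
    .await?;

  // 202 Accepted means the documents were queued for processing.
  println!("status: {}", response.status());
  Ok(())
}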

Rocket.rs JSON API

Rocket.rs provides us with the tools we need to accept JSON API requests and provide enough information back to the program sending requests. We'll create two endpoints: POST /sa/submit & GET /sa/retrieve.

POST /sa/submit
POST /sa/submit
HOST: localhost:8200
Content-Type: application/json
Accept: application/json

{
  "documents": [String, String, ...],
}

// Response
202: Accepted
{
  "identity": uuid4 -> String
}

400: Bad Request
{
  "message": String
}
GET /sa/retrieve
GET /sa/retrieve
HOST: localhost:8200
Accept: application/json

identity=uuid4 -> String

// 200: Processing Complete
{
  "sentences": {
    "sent": String,
    "sentiment": String,
    "tokens": [String, String, ...]
  }
}
// 204: Processing Pending - No Content

// 418: Processing Error
{
  "message": String
}
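To tie the spec above to code, here's a rough sketch of what the submit side could look like in Rocket.rs; the struct and handler names are placeholders, and the body returns a canned identity instead of generating a real UUID or persisting anything.

use rocket::http::Status;
use rocket::serde::json::Json;
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
pub struct SubmitRequest {
  pub documents: Vec<String>,
}

#[derive(Serialize)]
pub struct SubmitResponse {
  pub identity: String,
}

// POST /sa/submit — accept documents and answer with 202 Accepted plus an identity.
#[rocket::post("/submit", format = "json", data = "<payload>")]
pub async fn submit(payload: Json<SubmitRequest>) -> (Status, Json<SubmitResponse>) {
  // Placeholder: a real handler would hash the documents, queue them for the
  // pipeline, and return a freshly generated UUIDv4 as the identity.
  let _documents = payload.into_inner().documents;
  (Status::Accepted, Json(SubmitResponse { identity: String::from("<uuid4>") }))
}

If we mount this under a /sa base, e.g. rocket::build().mount("/sa", rocket::routes![submit]), it serves the POST /sa/submit endpoint described above.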

PostgreSQL

We'll need someplace to store information. PostgreSQL provides a number of features; for this document we'll use some of the more interesting RDBMS features, such as an UPSERT.
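To illustrate the kind of UPSERT we'll rely on, here's a hedged sqlx sketch; the documents table and its columns are hypothetical stand-ins rather than the schema from the entity diagram.

use sqlx::PgPool;

// Insert a document, or bump its timestamp if the same hash was already submitted.
async fn upsert_document(pool: &PgPool, hash: &str, body: &str) -> Result<(), sqlx::Error> {
  sqlx::query(
    "INSERT INTO documents (hash, body, updated_at)
     VALUES ($1, $2, now())
     ON CONFLICT (hash) DO UPDATE SET updated_at = now()",
  )
  .bind(hash)
  .bind(body)
  .execute(pool)
  .await?;
  Ok(())
}

ON CONFLICT (hash) DO UPDATE turns a repeat submission into a cheap timestamp bump instead of another trip through the pipeline.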

Sentiment Analysis Entity Diagram

Sentiment Analysis Pipeline

Writing and implementing production-quality MLOps is not a simple task. For this tutorial, we'll focus on the components that will let us run basic sentiment analysis on sentences. We'll need to clean the input text a bit so that we can feed consistently structured text into the pipeline.
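As a first pass, cleaning can be as simple as lowercasing and stripping anything that isn't alphanumeric before tokenizing; this naive sketch is a placeholder, not a production tokenizer.

// Naively normalize a sentence into lowercase alphanumeric tokens.
fn clean_tokens(sentence: &str) -> Vec<String> {
  sentence
    .to_lowercase()
    .split_whitespace()
    .map(|word| word.chars().filter(|c| c.is_alphanumeric()).collect::<String>())
    .filter(|word| !word.is_empty())
    .collect()
}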

Building the Software

With the architecture out of the way, it's time to begin implementing the software. Let's first create the command line interface that will launch the clients, Rocket.rs, and the ML pipelines.

src/main.rs

Tokio is great for IO-bound tasks. Annotate the async fn main function with #[tokio::main] to turn the program into async/await software.

use clap::Parser;

mod opt;

#[tokio::main]
async fn main() {
  // Dispatch to the component selected on the command line.
  match opt::SentimentCLI::parse() {
    opt::SentimentCLI::Client(options) => {},   // document submission & retrieval client
    opt::SentimentCLI::Rocket(options) => {},   // Rocket.rs JSON API
    opt::SentimentCLI::Pipeline(options) => {}  // sentiment analysis pipeline
  }
}
src/opt.rs

opt.rs provides the interface we'll be using to run operations and routines. From here we'll call into subcrates that'll be designed to perform specific operations.

use clap;

// Shared options for every subcommand; empty for now, filled in as the components grow.
#[derive(clap::Args, Clone, Debug)]
#[command(author, version, about, long_about = None)]
pub struct SOptions {}

// Top-level CLI: one subcommand per component (client, rocket, pipeline).
#[derive(clap::Parser, Clone)]
#[command(name="sentiment")]
#[command(bin_name="sentiment")]
pub enum SentimentCLI {
  Client(SOptions),
  Rocket(SOptions),
  Pipeline(SOptions),
}
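With clap's default derive behavior the variants become lowercased subcommands, so once this compiles we can drive the binary like so (the match arms are still empty, so nothing happens yet):

$ cargo run -- client
$ cargo run -- rocket
$ cargo run -- pipeline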

Organizing our workspace

There are three main components that should be separated out into internal crates: client, rocket, and pipeline. Put together, these components provide the complete functionality for processing documents into sentences, and then into a sentiment for each sentence to be returned by the JSON API.

Let's begin by altering Cargo.toml to provide dependencies to the workspace.

[package]
name = "sentiment"
version = "0.1.0"
edition = "2021"

[workspace]
members = [".", "components/client", "components/rocket", "components/pipeline"]

[workspace.dependencies]
tokio = { version = "1.37.0", features = ["rt", "macros", "rt-multi-thread"] }
tracing = { version = "0.1.40" }
tracing-subscriber = { version = "0.3.18", features=["fmt"] }
sqlx = { package = "sqlx", version="0.6.3", features=["runtime-tokio-rustls", "macros", "uuid", "postgres", "chrono"] }
serde = { version = "1.0.197", features = ["derive"] }
serde_json = { version = "1.0.114" }

[dependencies]
tokio = { workspace = true }
tracing = { workspace = true }
tracing-subscriber = { workspace = true }
sqlx = { workspace = true }
clap = { version = "4.5.1", features = ["derive"] }

Create the component modules we'll call into and check to make sure everything is set up correctly.

$ mkdir components
$ cd components
$ cargo new --lib client
$ cargo new --lib rocket
$ cargo new --lib pipeline
$ cd -
$ cargo check

If everything was set up correctly, the check should finish with three or so warnings. We'll go ahead and remove those warnings now.

client

client will scrape a subset of Wikipedia pages online and submit the bulk of the content to the Rocket.rs JSON API for sentiment analysis. The architecture and build of the program will be procedural; because tokio is used, IO will be managed asynchronously as long as we use the correct file-reading functions.
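A components/client/Cargo.toml along these lines would pull in the shared workspace dependencies plus reqwest; the reqwest version is only a suggestion, since it isn't pinned anywhere in this article.

[package]
name = "client"
version = "0.1.0"
edition = "2021"

[dependencies]
tokio = { workspace = true }
tracing = { workspace = true }
serde = { workspace = true }
serde_json = { workspace = true }
reqwest = { version = "0.11", features = ["json"] }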