If you are interested in extracting text and metadata from documents, you have probably heard about Apache Tika. It is a great tool that can extract a bunch of information from a wide range of file formats, including PDF, Word, Excel, PowerPoint, HTML, XML, and many more.
The problem is that it is a Java application, it is a pretty low-level component, and it is not always easy to integrate it into your application.
Let’s see how you could use the Nuclia Understanding API as a Tika-like service with no pain!
How to process your files
First, you need to create a Nuclia account. You can do it here. Once created, go to Manage account, then to the Understanding API keys section, where you can create your Understanding API key. This key allows you to authenticate your calls to the NU-API endpoints; it must be passed as a bearer token in the x-stf-nuakey header of each request.
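For example, with Python and the requests library, the key is passed like this (a minimal sketch; YOUR_NUA_KEY is a placeholder, and I am assuming the europe-1 region used later in this post):
import requests

NUA_KEY = "YOUR_NUA_KEY"  # placeholder for your Understanding API key

# Every NU-API call carries the key as a bearer token in the x-stf-nuakey header
response = requests.get(
    "https://europe-1.nuclia.cloud/api/v1/processing/pull",
    headers={"x-stf-nuakey": f"Bearer {NUA_KEY}"},
)
print(response.status_code)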
To process files, you have to use the following flow:
- upload the file to /processing/upload
- push it in the processing queue with /processing/push
- and then collect the results whenever they are ready by calling /processing/pull
The results are not returned directly by the /processing/push call. Files are put in a queue, and when their processing is complete, the corresponding results will be retrieved with your next call to /processing/pull.
Alternatively, you can define a callback endpoint at the time you push the file, and the results will be posted to this endpoint.
Processing can take a few minutes; it mostly depends on how many resources are already pending in the queue.
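If you go for the callback option, the idea is to provide your endpoint's URL in the push payload. Here is a minimal sketch; note that the webhook field name below is an assumption made for illustration, so check the API reference for the exact payload shape:
import requests

NUA_KEY = "YOUR_NUA_KEY"
FILE_TOKEN = "TOKEN_RETURNED_BY_UPLOAD"  # token returned by /processing/upload

response = requests.post(
    "https://europe-1.nuclia.cloud/api/v1/processing/push",
    headers={
        "content-type": "application/json",
        "x-stf-nuakey": f"Bearer {NUA_KEY}",
    },
    json={
        "filefield": {"my-file": FILE_TOKEN},
        # Hypothetical field name: the callback URL where results will be posted
        "webhook": "https://example.com/my-callback",
    },
)
print(response.status_code, response.text)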
What to expect in the results
The Nuclia Understanding API will return the file metadata provided by Tika. But that’s not all! It also retrieves the following:
- speech-to-text (if it is a video or audio file)
- nested texts (like text in an embedded image)
- paragraphs and sentences (defined by the position of their first and last characters, plus start time and end time for a video or audio file)
- links
- embedded files
- a summary
- a thumbnail
- all the named entities (people, dates, places, organizations, etc.)
Let’s take a video as an example. In the extracted_text, I get the full transcript of the video, extracted from its audio track:
These are Jack and Meg quite also known as The White Stripes They’re a band from Detroit. They make rock and roll without a bass guitarist.. This is Steve McDonald […]
This text has been split into paragraphs and sentences. For example, the 5th paragraph starts at character 734 and ends at character 793. It also tells me the corresponding times in the video, from 1’06 to 1’10. If you check the text, that’s the sentence “It can be that easy when you skip the intermediaries”.
Let’s dig a bit more into the data I got. I get vectors, which are the semantic representation of each sentence. I could use them to do semantic search. I also get entities:
- “Steve McDonald”, identified as a person
- “late 1980s”, identified as a date
- “Detroit”, identified as a place
- “Creative Commons”, identified as an organization
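To make these positions concrete, here is a tiny sketch showing how the character offsets reported for a paragraph map back to the extracted text (the 734/793 offsets are the ones mentioned above; the transcript string is just a placeholder):
# Placeholder: in practice this is the extracted_text returned by the API
extracted_text = "...full transcript of the video..."

# Offsets as reported in the results: the 5th paragraph spans characters 734 to 793
start, end = 734, 793

# Slicing the extracted text with those offsets gives back the corresponding sentence
# (with the real transcript: "It can be that easy when you skip the intermediaries")
paragraph = extracted_text[start:end]
print(paragraph)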
I also get a summary:
The White Stripes album called white blood cells and re-recorded it laying a base track down on every song then he released the results as MP3s on red cross’s website. McDonald began putting these songs copyrighted online without permission from the white stripes or their record label during the project. Creative Commons wanted to find an easy way to help people tell the world upfront that they want to allow some uses of their work.
And yes, this summary is far from perfect, but still, it gives a pretty accurate idea of what the video is talking about.
The thumbnail is not provided directly in the results; I just got a token that I can pass to the /processing/download endpoint to get the corresponding image.
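Downloading the thumbnail could look like the sketch below; passing the token as a token query parameter is an assumption made for illustration, so check the API reference for the exact contract:
import requests

NUA_KEY = "YOUR_NUA_KEY"
THUMBNAIL_TOKEN = "TOKEN_FROM_THE_RESULTS"  # the token found in the pulled results

# NOTE: the "token" query parameter is an assumption, not a documented contract
response = requests.get(
    "https://europe-1.nuclia.cloud/api/v1/processing/download",
    headers={"x-stf-nuakey": f"Bearer {NUA_KEY}"},
    params={"token": THUMBNAIL_TOKEN},
)
with open("thumbnail.jpg", "wb") as image_file:
    image_file.write(response.content)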
All of this is pretty amazing, and it has been done just by making a few HTTP calls!
How to do it with Python
Of course, making HTTP calls manually is not very convenient, so you will probably prefer to use a script. A typical Python script to upload a file would look like this (assuming your region is europe-1):
import mimetypes

import requests

NUA_KEY = "YOUR_NUA_KEY"
content_path = "/path/to/file"

with open(content_path, "rb") as source_file:
    print(f"Uploading {content_path}")
    # Upload the raw file content; the response body is the file token
    response = requests.post(
        "https://europe-1.nuclia.cloud/api/v1/processing/upload",
        headers={
            "content-type": mimetypes.guess_type(content_path)[0] or "application/octet-stream",
            "x-stf-nuakey": f"Bearer {NUA_KEY}",
        },
        data=source_file.read(),
    )
    if response.status_code != 200:
        print(f"Error uploading {content_path}: {response.status_code} {response.text}")
    else:
        print(f"Pushing {content_path} in queue")
        # Push the uploaded file into the processing queue
        response = requests.post(
            "https://europe-1.nuclia.cloud/api/v1/processing/push",
            headers={
                "content-type": "application/json",
                "x-stf-nuakey": f"Bearer {NUA_KEY}",
            },
            json={"filefield": {"my-file": response.text}},
        )
        if response.status_code != 200:
            print(f"Error pushing {content_path}: {response.status_code} {response.text}")
The results collected from the /processing/pull endpoint are in Protobuf format, so you will need the corresponding models to be able to parse them:
pip install nucliadb-protos
Your script would look like this:
import base64

import requests
from nucliadb_protos.writer_pb2 import BrokerMessage

NUA_KEY = "YOUR_NUA_KEY"

# Pull the next available result from the processing queue
res = requests.get(
    "https://europe-1.nuclia.cloud/api/v1/processing/pull",
    headers={
        "x-stf-nuakey": "Bearer " + NUA_KEY,
    },
).json()

if "payload" in res:
    # The payload is a base64-encoded, serialized BrokerMessage
    pb = BrokerMessage()
    pb.ParseFromString(base64.b64decode(res["payload"]))
    print(pb)
else:
    print("No payload")
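Since the results become available asynchronously, you will typically poll /processing/pull until a payload shows up. Here is a minimal sketch of such a loop, building on the script above (the 30-second interval is an arbitrary choice, not a recommendation from the API):
import base64
import time

import requests
from nucliadb_protos.writer_pb2 import BrokerMessage

NUA_KEY = "YOUR_NUA_KEY"

while True:
    res = requests.get(
        "https://europe-1.nuclia.cloud/api/v1/processing/pull",
        headers={"x-stf-nuakey": "Bearer " + NUA_KEY},
    ).json()
    if res.get("payload"):
        # A result is ready: decode the base64-encoded BrokerMessage and stop polling
        pb = BrokerMessage()
        pb.ParseFromString(base64.b64decode(res["payload"]))
        print(pb)
        break
    # Nothing ready yet: wait a bit before asking again
    time.sleep(30)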
How to do it with NodeJS
If Python is not your thing, you can also use NodeJS. The code to push a file in NodeJS will look like this:
import fetch from "node-fetch";
import fs from "fs";
const NUA_KEY = "YOUR_NUA_KEY";
const bufferContent = fs.readFileSync("/path/to/my-file");
const type = "video/mp4";
fetch("https://europe-1.nuclia.cloud/api/v1/processing/upload", {
method: "POST",
headers: {
"content-type": type,
"x-stf-nuakey": `Bearer ${NUA_KEY}`,
},
body: bufferContent,
})
.then((res) => res.text())
.then((fileId) =>
fetch("https://europe-1.nuclia.cloud/api/v1/processing/push", {
method: "POST",
headers: {
"content-type": "application/json",
"x-stf-nuakey": `Bearer ${NUA_KEY}`,
},
body: JSON.stringify({ filefield: { "my-file": fileId } }),
})
)
.then((res) => res.json())
.then((res) => console.log(res));
To read the results, here again you need the Protobuf models; they are provided by the @nuclia/protobuf package:
npm install @nuclia/protobuf
# OR
yarn add @nuclia/protobuf
The code will look like this:
import { NucliaProtobufConverter } from "@nuclia/protobuf";
import fetch from "node-fetch";
const NUA_KEY = "YOUR_NUA_KEY";
fetch("https://europe-1.stashify.cloud/api/v1/processing/pull", {
headers: {
"x-stf-nuakey": `Bearer ${NUA_KEY}`,
"content-type": "application/json",
},
method: "GET",
})
.then((res) => res.json())
.then((data) => {
if (data.status === "ok" && data.payload) {
return NucliaProtobufConverter(Buffer.from(data.payload, "base64"));
} else {
return data;
}
})
.then((res) => console.log(res));
How to do it with other technologies
The Nuclia Understanding API is a regular REST API, so you can use any technology you want to interact with it. The only specific thing is that the results are in Protobuf format, so you will need the corresponding models to be able to parse them. The models are available here: github.com/nuclia/nucliadb/tree/main/nucliadb_protos
The one used to parse the results in the examples above is writer.proto, and it depends on knowledgebox.proto, noderesources.proto, resources.proto, and utils.proto.
If you decide to implement a client for a new language, please let us know; we will be happy to add it to the existing list!
—
Photo by Vighnesh Dudani on Unsplash