Tutorial 2: Loading Massive Data
Quickstart Guide
This tutorial shows how to load files containing multiple users, items, or interactions; the three cases are very similar. As stated in Tutorial 1, users and items can be added to the recommender system one by one. However, it is often more convenient to load many entities with a single command, for instance when migrating an existing database.
In these examples, we assume that the structure has already been created (see Tutorial 1 for step-by-step instructions). We have sampled three of the files from the dataset Data Science for Good: DonorsChoose.org, in order to simplify the examples:
- Projects.csv: 12,804 projects (or items).
- Donors.csv: 2,000 donors (or users).
- Donations.csv: 4,975 donations (or interactions).
Note that, for the sake of clarity, we have omitted the headers required to authenticate the API user. Please refer to the documentation for the corresponding instructions.
File Formats
Currently, the Sherpa.ai Custom Content Recommendation API is capable of interpreting two file formats: JSON and CSV. You can choose the one that best fits your needs, as long as the contents are correctly structured:
Type of data | Mandatory fields | Optional fields | Observations |
---|---|---|---|
Items | itemId | Any item attribute | Item attributes have to be previously defined. Items have to be new; updates are not allowed. |
Users | userId | Any user attribute | User attributes have to be previously defined. Users have to be new; updates are not allowed. |
Interactions | itemId, userId, interactionId, timestamp | value | The interaction types have to be previously defined. User-item interactions have to be new; updates are not allowed. |
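Before registering an upload, it can be useful to check that a file carries the mandatory fields listed in the table above. The following is a minimal sketch, assuming the CSV file has a header row whose column names match the field names exactly (the tutorial does not state this); the function name and the sample row are hypothetical.

```python
import csv
import io

# Mandatory columns per data type, taken from the table above.
MANDATORY_FIELDS = {
    "items": {"itemId"},
    "users": {"userId"},
    "interactions": {"itemId", "userId", "interactionId", "timestamp"},
}


def missing_mandatory_fields(data_type, csv_text):
    """Return the sorted mandatory columns absent from the CSV header.

    Assumption: the first CSV row is a header naming the fields.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    header = set(reader.fieldnames or [])
    return sorted(MANDATORY_FIELDS[data_type] - header)


# Hypothetical sample; the real Donations.csv columns may differ.
sample = "userId,itemId,timestamp,value\nu1,i1,1503506976000,25\n"
print(missing_mandatory_fields("interactions", sample))  # ['interactionId']
```

An empty list means that every mandatory column is present. The check does not verify that the attributes or interaction types have been defined beforehand.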
Register an Upload Order
First, a batch upload order is required for each of the files. The files Projects.csv, Donors.csv and Donations.csv, which are already uploaded to https://recommender-tutorial-data-set.s3-eu-west-1.amazonaws.com/, contain the data we want to upload.
POST /v2/recomm/projects/items/batch HTTP/1.1
{
"url": "https://recommender-tutorial-data-set.s3-eu-west-1.amazonaws.com/Projects.csv",
"format": "csv"
}
POST /v2/recomm/users/batch HTTP/1.1
{
"url": "https://recommender-tutorial-data-set.s3-eu-west-1.amazonaws.com/Donors.csv",
"format": "csv"
}
POST /v2/recomm/projects/interactions/batch HTTP/1.1
{
"url": "https://recommender-tutorial-data-set.s3-eu-west-1.amazonaws.com/Donations.csv",
"format": "csv"
}
The JSON bodies returned in response to these commands include two fields: requestId and status. The former is needed to check the order status, which is indicated by the latter:
HTTP/1.1 202 Accepted
Content-Type: application/json
{
"requestId": "d9edf72a-3b0c-4ee1-a38a-eb8ecdbcc05e",
"status": "queued"
}
HTTP/1.1 202 Accepted
Content-Type: application/json
{
"requestId": "6ee4e1dd-f988-492f-8224-4575d95c7fb9",
"status": "queued"
}
HTTP/1.1 202 Accepted
Content-Type: application/json
{
"requestId": "a2ff670a-bfa9-4eb0-b9e7-420fd0e02bf9",
"status": "queued"
}
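The three upload orders above could be prepared from Python roughly as follows. This is only a sketch using the standard library: BASE_URL is a placeholder host (the real one is not given in this tutorial), the helper names are hypothetical, and the authentication headers are omitted, as in the examples above. The requests are built but not sent.

```python
import json
import urllib.request

# Assumption: placeholder host; take the real base URL from the API documentation.
BASE_URL = "https://api.example.com"
DATA_URL = "https://recommender-tutorial-data-set.s3-eu-west-1.amazonaws.com/"

# Endpoint paths and file names as used in the requests above.
UPLOADS = {
    "/v2/recomm/projects/items/batch": "Projects.csv",
    "/v2/recomm/users/batch": "Donors.csv",
    "/v2/recomm/projects/interactions/batch": "Donations.csv",
}


def build_batch_order(path, file_name, file_format="csv"):
    """Build (but do not send) the POST request for one upload order."""
    body = json.dumps({"url": DATA_URL + file_name, "format": file_format})
    return urllib.request.Request(
        BASE_URL + path,
        data=body.encode("utf-8"),
        # Authentication headers are omitted here, as in the tutorial.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_order_response(raw_body):
    """Extract (requestId, status) from the JSON body of a 202 response."""
    payload = json.loads(raw_body)
    return payload["requestId"], payload["status"]


orders = [build_batch_order(path, name) for path, name in UPLOADS.items()]
```

Each order could then be sent with urllib.request.urlopen once the authentication headers have been added, and the returned body passed to parse_order_response to keep the requestId for later status checks.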
Check Upload Status
The work orders are processed on demand. If the status of an order is checked immediately after posting it, the response will be the same as above:
GET /v2/recomm/projects/items/batch/d9edf72a-3b0c-4ee1-a38a-eb8ecdbcc05e HTTP/1.1
HTTP/1.1 200 OK
Content-Type: application/json
{
"requestId": "d9edf72a-3b0c-4ee1-a38a-eb8ecdbcc05e",
"status": "queued"
}
When the system starts to upload the files, the status will change to processing:
GET /v2/recomm/users/batch/6ee4e1dd-f988-492f-8224-4575d95c7fb9 HTTP/1.1
HTTP/1.1 200 OK
Content-Type: application/json
{
"requestId": "6ee4e1dd-f988-492f-8224-4575d95c7fb9",
"status": "processing"
}
Depending on the file sizes, the process can last anywhere from minutes to hours.
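Since an order can stay in the queued or processing status for a long time, a client will typically poll the status endpoint at intervals. Below is a minimal sketch of such a loop. The fetch_status callable is hypothetical: it stands for whatever code performs the GET request shown above and returns the decoded JSON body. The interval and the maximum number of checks are arbitrary defaults, not values recommended by the API.

```python
import time

# Statuses shown in this tutorial that mean the order is still pending.
PENDING_STATUSES = {"queued", "processing"}


def wait_for_order(fetch_status, request_id, interval_s=60.0,
                   max_checks=120, sleep=time.sleep):
    """Poll until the order leaves the pending statuses or checks run out.

    fetch_status is a hypothetical callable that performs the GET request
    for request_id and returns the decoded JSON body as a dict.
    """
    for _ in range(max_checks):
        body = fetch_status(request_id)
        if body["status"] not in PENDING_STATUSES:
            return body
        sleep(interval_s)
    raise TimeoutError(f"order {request_id} still pending after {max_checks} checks")


# Usage with canned responses instead of real HTTP calls.
responses = iter([{"status": "queued"}, {"status": "processing"}, {"status": "completed"}])
result = wait_for_order(lambda _id: next(responses),
                        "6ee4e1dd-f988-492f-8224-4575d95c7fb9",
                        sleep=lambda seconds: None)
print(result["status"])  # completed
```

Passing the sleep function as a parameter keeps the loop easy to test without waiting.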
Verify Completed Uploads
After the upload finishes, if no fatal error has occurred, there are two possible statuses: completed and completed_with_errors. The former asserts that all the elements in the file were uploaded correctly, whereas the latter is returned when some of them could not be saved. The latter is the case for all three files:
- The projects file includes the item saved in Add a New Item, so that duplicate is detected:
GET /v2/recomm/projects/items/batch/d9edf72a-3b0c-4ee1-a38a-eb8ecdbcc05e HTTP/1.1
HTTP/1.1 200 OK
Content-Type: application/json
{
"requestId": "d9edf72a-3b0c-4ee1-a38a-eb8ecdbcc05e",
"status": "completed_with_errors",
"entitiesNotAdded": [
{
"errorId": "69da11b15b82cf59c389ba81c444731e",
"code": "EntityAlreadyExistsException",
"message": "Item [69da11b15b82cf59c389ba81c444731e] could not be created because it already exists."
}
]
}
- Similarly, the donor saved in Add a New User is detected as a duplicate:
GET /v2/recomm/users/batch/6ee4e1dd-f988-492f-8224-4575d95c7fb9 HTTP/1.1
HTTP/1.1 200 OK
Content-Type: application/json
{
"requestId": "6ee4e1dd-f988-492f-8224-4575d95c7fb9",
"status": "completed_with_errors",
"entitiesNotAdded": [
{
"errorId": "5f24f7ece308e11c9e31a6b9ad53cf68",
"code": "EntityAlreadyExistsException",
"message": "User [5f24f7ece308e11c9e31a6b9ad53cf68] could not be created because it already exists."
}
]
}
- The interactions file contains more errors. There are many duplicates (including the donation that had already been saved in Register a New Interaction), because only one interaction is allowed per timestamp (with millisecond precision):
GET /v2/recomm/projects/interactions/batch/a2ff670a-bfa9-4eb0-b9e7-420fd0e02bf9 HTTP/1.1
HTTP/1.1 200 OK
Content-Type: application/json
{
"requestId": "a2ff670a-bfa9-4eb0-b9e7-420fd0e02bf9",
"status": "completed_with_errors",
"entitiesNotAdded": [
{
"id": "146f776eee7976f925afd40acad560e2_ee2bebfe4f2d2f080614026b18b63d5a_donate_1503506976000",
"errorId": "EntityAlreadyExistsException",
"message": "Interaction [donate on Item [ee2bebfe4f2d2f080614026b18b63d5a] of User [146f776eee7976f925afd40acad560e2] could not be created because it already exists."
},
{
"id": "1b24af21c9b9968d393bd943bf92eb32_012dd04d073c46ccff32c793e09cbdac_donate_1515358963000",
"errorId": "EntityAlreadyExistsException",
"message": "Interaction [donate on Item [012dd04d073c46ccff32c793e09cbdac] of User [1b24af21c9b9968d393bd943bf92eb32] could not be created because it already exists."
},
.................
...[continues]...
.................
{
"id": "f071b0ae20413ac9c8c28ca529899ac6_720a8c9c3b2d90921757debc2402e0f1_donate_1508052224000",
"errorId": "EntityAlreadyExistsException",
"message": "Interaction [donate on Item [720a8c9c3b2d90921757debc2402e0f1] of User [f071b0ae20413ac9c8c28ca529899ac6] could not be created because it already exists."
}
]
}
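When a response lists many rejected entities, as in the interactions case, a short summary is easier to read than the raw list. The sketch below counts the entries of entitiesNotAdded by exception name. It handles both shapes that appear in the responses above: the item and user entries carry the exception name in code, whereas the interaction entries carry it in errorId. The function name is hypothetical.

```python
from collections import Counter


def summarize_failures(body):
    """Count the entries in entitiesNotAdded by exception name."""
    counts = Counter()
    for entry in body.get("entitiesNotAdded", []):
        # Item/user entries above use "code" for the exception name;
        # interaction entries use "errorId" for it.
        counts[entry.get("code") or entry.get("errorId")] += 1
    return dict(counts)


# The item response shown earlier in this section.
item_response = {
    "requestId": "d9edf72a-3b0c-4ee1-a38a-eb8ecdbcc05e",
    "status": "completed_with_errors",
    "entitiesNotAdded": [
        {
            "errorId": "69da11b15b82cf59c389ba81c444731e",
            "code": "EntityAlreadyExistsException",
            "message": "Item [69da11b15b82cf59c389ba81c444731e] could not be created because it already exists.",
        }
    ],
}
print(summarize_failures(item_response))  # {'EntityAlreadyExistsException': 1}
```

A response with status completed and no entitiesNotAdded field yields an empty dictionary.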