URL classification is the task of determining the appropriate category for a given URL, based on the text it contains.
There are many URL classification API services. One that is highly accurate and offers a large number of categories is this URL classification service and API: websitecategorizationapi.com.
One valuable feature of this service is that it returns results not only based on the IAB taxonomy (the go-to taxonomy of most services), but also based on a custom e-commerce categorization with more than 1000 categories.
This is ideal for someone operating an online store who wants not only a broad categorization into, e.g., 20 main categories, but also better searchability of the store through categorization into more than 1000 niche categories.
More detailed categories not only help users find products more easily, but can also lead to improved rankings on search engines, because more refined categories act as relevancy signals for search engine ranking algorithms.
Also useful in that respect is another service, which offers automated tagging via API.
But back to our URL classification services.
What is the approach for URL classification
Most URL classification services use automated classification, due to the sheer number of domains and URLs to classify. Thus one first needs to build an appropriate machine learning model that will do the classification.
Prior to that, one also has to decide on the categories that URLs will be classified into.
This set of categories is also known as a URL taxonomy. There are many taxonomies available; the best-known ones are those from IAB and Google.
Once we have the taxonomy, we need to gather data for an appropriate training data set and then use it to train a machine learning model.
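To make these steps more concrete, here is a minimal sketch of training a text classifier on labeled examples. It uses a tiny naive Bayes model written in plain Python purely as a stand-in; the training texts, the two categories and the function names are invented for illustration, and nothing here is meant to describe the model that the service actually uses.

```python
import math
from collections import Counter, defaultdict

# Toy training set of (page text, category) pairs, invented for illustration.
TRAINING_DATA = [
    ("football match score league goal", "Sports"),
    ("tennis tournament player match win", "Sports"),
    ("laptop processor software update release", "Technology"),
    ("smartphone app software developer release", "Technology"),
]


def train(examples):
    """Count documents per category and word occurrences per category."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, category in examples:
        doc_counts[category] += 1
        for word in text.lower().split():
            word_counts[category][word] += 1
            vocabulary.add(word)
    return doc_counts, word_counts, vocabulary


def predict(model, text):
    """Return the category with the highest log-probability (Laplace smoothing)."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for category, n_docs in doc_counts.items():
        score = math.log(n_docs / total_docs)  # prior of the category
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[category] = score
    return max(scores, key=scores.get)


model = train(TRAINING_DATA)
print(predict(model, "new software release for the laptop"))
```

A production system would need a far larger training set covering every category of the chosen taxonomy, and most likely a stronger model, but the overall flow (fix the categories, collect labeled text, train, predict) stays the same.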
Next, let us look at some of the most common use cases of URL classification services and API tools.
Usage of URL classification
URL classification can be used either in an online, real-time setting or as part of an offline database.
By offline, we mean that the vendor uses the database locally as part of its own application and does not need to do classifications on the fly.
The other case, especially with new URLs, is to send the URL directly to the URL classification service, which then retrieves the page from the server on the fly, preprocesses its text (e.g. removing stop words, doing lemmatization), sends it to the URL classifiers and obtains a JSON payload back from the classifier.
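The preprocessing step mentioned above can be sketched roughly as follows. This is only an illustration: the stop-word list is deliberately tiny, and the crude_lemma helper is a hypothetical, very rough substitute for real lemmatization, which would normally come from an NLP library.

```python
import re

# Tiny illustrative stop-word list; real pipelines use much larger ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "for", "on"}


def crude_lemma(word):
    """Very rough stand-in for lemmatization: strip a trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words and normalize each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_lemma(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("The latest reviews of running shoes and boots"))
# prints ['latest', 'review', 'running', 'shoe', 'boot']
```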
What kind of service one needs really depends on the use case. For example, offline databases can be used by companies that want to set up an internal system for controlling which webpages their employees use.
Real-time classification of the full URL path, on the other hand, can be used by services that need to constantly verify new URLs for safety or for marketing purposes.
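The two modes can also be combined. The sketch below looks a domain up in a local database first and only falls back to a live classification when the domain is unknown. The database contents, the classify_live placeholder and the returned labels are all assumptions made for this example.

```python
from urllib.parse import urlparse

# Hypothetical offline database mapping domain -> category.
OFFLINE_DB = {
    "example.com": "Technology",
    "example.org": "News",
}


def classify_live(url):
    """Placeholder for a real-time API call; not implemented in this sketch."""
    return "Unknown"


def classify(url, live_classifier=classify_live):
    """Look the domain up locally first; fall back to the live classifier."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in OFFLINE_DB:
        return OFFLINE_DB[host], "offline"
    return live_classifier(url), "live"


print(classify("https://www.example.com/products/42"))
print(classify("https://new-site.example.net/page"))
```

Note that the offline lookup here works at domain level only, whereas the live path receives the full URL, which matches the distinction drawn above.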
How does the service classify websites
The service accepts either a URL or text. In the case of a URL, it first fetches the URL, extracts its text, pre-processes it and then passes it to the pipeline containing the ML model, which makes the classification and returns it in JSON format.
The result is a dictionary of probabilities for the supported categories.
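A response of this kind could be consumed along the following lines. The payload below is made up for the example; the actual field names, category labels and nesting returned by the service may well differ, so treat this as a sketch rather than a description of the real response.

```python
import json

# Hypothetical response body; the real field names and categories may differ.
payload = '{"Sports": 0.07, "Technology": 0.81, "Shopping": 0.12}'


def top_categories(raw_json, n=2, threshold=0.1):
    """Return up to n (category, probability) pairs at or above threshold, highest first."""
    probabilities = json.loads(raw_json)
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return [(name, p) for name, p in ranked if p >= threshold][:n]


print(top_categories(payload))
# prints [('Technology', 0.81), ('Shopping', 0.12)]
```

Keeping a probability threshold is a design choice: it avoids assigning a category when the classifier is not reasonably confident about any of them.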
You also get example code, e.g. in Python or a general one using curl, which you can use to start immediately with your classifications.
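For orientation, a request to such an API might be prepared as shown below. This is not the service's official example: the endpoint, the parameter names and the way the key is passed are placeholders, and the real values should be taken from the API documentation. The sketch only builds the request and does not send it.

```python
import json
import urllib.request

# Placeholder values: take the real endpoint, parameters and key from the API documentation.
API_ENDPOINT = "https://api.example.com/v1/classify"
API_KEY = "YOUR_API_KEY"


def build_request(url_to_classify):
    """Build (but do not send) a POST request carrying the URL as JSON."""
    body = json.dumps({"url": url_to_classify, "api_key": API_KEY}).encode("utf-8")
    return urllib.request.Request(
        API_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request("https://www.example.com/some/page")
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request, timeout=10), then json.load() the response.
```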
Neural Machine Translation for addressing non-English languages
As the ML models are built for the English language, texts in other languages need to be translated to English before they can be classified.
The service has an NMT engine built in, so you can also send it texts written in many languages and it translates them to English on the fly.
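The translate-then-classify flow can be sketched as below. The translator and the classifier are passed in as functions, and the two stand-ins used here (a lookup table and a keyword check) are invented for the example; they only show the order of the steps, not how the service's NMT engine or models work.

```python
def classify_text(text, language, translate, classify):
    """Translate non-English text to English, then classify the English text."""
    if language != "en":
        text = translate(text, source=language, target="en")
    return classify(text)


# Stand-ins for illustration only: a lookup-table "translator" and a keyword "classifier".
def fake_translate(text, source, target):
    return {"zapatos para correr": "running shoes"}.get(text, text)


def fake_classify(english_text):
    return "Shopping" if "shoes" in english_text else "Other"


print(classify_text("zapatos para correr", "es", fake_translate, fake_classify))
print(classify_text("running shoes", "en", fake_translate, fake_classify))
```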
What kind of Taxonomies are in use
The main taxonomy is the one from IAB, in 3 tiers. It is appropriate for general URLs.
Another important taxonomy is tailored to e-commerce domains and is useful for determining product categories, much like the Google Product Taxonomy.
You can learn more about IAB taxonomies here: https://iabtechlab.com/standards/content-taxonomy/
IAB often revises its website categories to keep them up to date with new developments in both services and products.
The service regularly updates its ML models to keep them in line with changes in the underlying taxonomies.
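Because the IAB taxonomy is tiered, fine-grained results can be aggregated into broader ones. The sketch below sums Tier 3 probabilities into Tier 1 or Tier 2 totals. The category paths and the " > " separator are illustrative assumptions, not entries copied from the official IAB list or from the service's output.

```python
from collections import defaultdict

# Illustrative category paths with " > " separating tiers; not taken from the official IAB list.
tier3_probabilities = {
    "Sports > Soccer > Leagues": 0.5,
    "Sports > Tennis > Tournaments": 0.25,
    "Technology > Computing > Laptops": 0.25,
}


def roll_up(probabilities, tier):
    """Sum probabilities of categories that share the same first `tier` levels."""
    totals = defaultdict(float)
    for path, p in probabilities.items():
        prefix = " > ".join(path.split(" > ")[:tier])
        totals[prefix] += p
    return dict(totals)


print(roll_up(tier3_probabilities, 1))
print(roll_up(tier3_probabilities, 2))
```

Rolling up like this lets one application work with broad categories and another with niche ones, using the same classification result.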
The key features of URL classification service
– it is among the most accurate URL classifiers available
– it supports more categories than many other services, e.g. IAB Tier 3 level and 1000+ categories for the e-commerce domain
– easy integration into your own products and services
– it has dedicated support and quick turnaround for custom solutions and resolving issues
– you can start using it right away thanks to many API code examples and well-documented API endpoints
– the service can be used in many use cases
– it uses live, real-time categorization on full-path URLs
– it has both monthly plans and support for custom API credits
URL Classification API service