FYP Interim Report
ABSTRACT
This report describes the progress the author has made on her Final Year Project from August 2012 to November 2012. It also briefly explains the project topic, the future work and the project plan for the following semester. The purpose of this project is to build an algorithm that evaluates a web page’s visual quality and appearance.
For this purpose, a database of webpages was collected: 113 for training and 101 for testing.
The webpages in the training dataset were classified as aesthetically appealing or non-appealing based on ground truth. Image features were extracted from the training dataset. From these features an image classifier is to be obtained using the Support Vector Machine (SVM) technique. New, unseen webpages will then be tested, and their aesthetic appearance predicted, using this classifier.
1. INTRODUCTION
The content of a website becomes information only when the brain cognitively processes it. However, the first impression of a website is based on its visual layout and colors.
A well-designed website not only attracts users but also encourages them to stay longer, which increases the traffic and, at times, the revenue of the website. This suggests that a website needs to be rich not only in content but also in aesthetics. However, Wu et al. [1] note that little research has been done on the overall visual aspects of webpages.
Furthermore, the possibility of treating a webpage as an image with various visual features opens up opportunities for major improvements in this field.
This implies that techniques from the computer vision field, which combine object recognition and data mining, can be applied here as well. This is the motivation behind this project: to explore existing techniques in this field and improve them. The following section explains the existing research and techniques.
2. LITERATURE REVIEW
As the webpage is assessed solely on its visual quality and not its content, the problem reduces to one of image recognition, computer vision and data mining.
One approach, used by Wu et al. [1], was to collect a database of training and test webpages.
All the training webpages were divided into blocks, named layout blocks, using the VLBE algorithm. Thirty features were selected from four categories of image features: layout visual features, text visual features, classical visual features such as color and texture, and the visual complexity feature. A group of 7 people labeled all the images in the training set in an iterative process on the basis of these features. These scores were used to train a classifier using the Support Vector Machine (SVM) algorithm.
This classifier predicts the visual quality of a new webpage (or of those in the test dataset) based on the training dataset. The results achieved using this method were said to show that features describing a web page’s structure yield better discrimination capability than other classical features based on visual appearance alone [1]. It was also mentioned that the results obtained were close to the assessments made manually by people, which Wu et al. interpreted as evidence that a computational classification method can be used to assess the visual quality of webpages.
FIGURE 1: COMPARISON BETWEEN THE PREDICTED AND HUMAN LABELLED SCORES FOR WEBPAGES COLLECTED FROM VARIOUS YEARS.
Although the initial and final stages of the method above are similar to those followed by Lazebnik et al. [2], the latter use a computational and mathematical approach to obtain the features and manipulate the data. The approach is to divide the image (in this case, the webpage image) into fine sub-regions and compute histograms of the local image features found inside each of these sub-regions.
A spatial pyramid is thus obtained, which is used to create the SVM classifier.
In short, it performs pyramid matching in the 2-D image space and uses traditional clustering methods in the feature space [2]. This so-called ‘sub-divide and disorder’ approach has been implemented in various computer vision applications. Multi-resolution histograms [3] vary the resolution at which the image features are computed while keeping the histogram resolution the same [2]; this method, by contrast, fixes the resolution at which the image features are computed and varies the histogram resolution. The method is said to have been tested successfully on three diverse large-scale datasets.
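As a minimal sketch of the idea described above (not the implementation of [2]), the Python below builds a spatial pyramid histogram from features that have already been quantized to code word indices. The function name `spatial_pyramid_histogram`, the input layout of (x, y, word) tuples with coordinates normalised to the range [0, 1), and the omission of the per-level weights used in pyramid matching are all assumptions made for illustration.

```python
def spatial_pyramid_histogram(features, vocab_size, levels):
    """Concatenate per-cell code word histograms for levels 0..levels-1.

    features: iterable of (x, y, word) with 0 <= x, y < 1 and
              0 <= word < vocab_size (hypothetical input layout).
    Level l splits the image into 2**l by 2**l cells.
    """
    pyramid = []
    for level in range(levels):
        cells = 2 ** level
        # One histogram of length vocab_size per cell, stored row by row.
        hist = [0] * (cells * cells * vocab_size)
        for x, y, word in features:
            col = min(int(x * cells), cells - 1)
            row = min(int(y * cells), cells - 1)
            hist[(row * cells + col) * vocab_size + word] += 1
        pyramid.extend(hist)
    return pyramid


# Three features, a vocabulary of 2 words, pyramid levels 0 and 1.
feats = [(0.1, 0.1, 0), (0.9, 0.1, 1), (0.9, 0.9, 1)]
vec = spatial_pyramid_histogram(feats, vocab_size=2, levels=2)
print(len(vec))  # 2 bins at level 0 plus 4 cells * 2 bins at level 1
```

The features themselves are computed once; only the grid over which they are counted becomes finer from level to level, which is the sense in which the histogram resolution varies while the feature resolution stays fixed.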
Hence, in this project the author will follow the spatial pyramid method of Lazebnik et al. [2].
3. METHODOLOGY
3.1. DATABASE COLLECTION
First, a collection of over 200 websites was made.
To suit the algorithm that will be used, JPEG images of these webpages were obtained. The collection was then divided into training and test data. A survey was created to collect ground truth, which will be used to separate the training data into good and bad webpages. The following sub-section describes the extraction of features from these webpage images.
3.2. FEATURE EXTRACTION
Features were extracted from these images using the SIFT descriptor algorithm, as in [2]. The scale-invariant feature transform (SIFT) is an algorithm used in computer vision to detect and describe local features in images [4]. Using the SIFT algorithm, a dense grid of features was obtained for each image. This is believed to be more efficient, as it can capture detailed information about the different features of an image (a webpage image, in the current case). The next sub-section explains how a codebook is obtained from these features.
3.3. OBTAINING THE CODEBOOK AND THE KERNEL MATRIX
K-means is applied to all the image features extracted in the previous section, with the number of center points optimized and set to 200. A codebook with 200 center points is thus obtained. Each feature descriptor is then quantized to its closest code word. The result is converted into a histogram with the codebook’s words on the horizontal axis and the number of features falling into each code word on the vertical axis.
This histogram gives a feature vector of uniform length for each image.
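The codebook and histogram steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code used in the project: the function names are hypothetical, the k-means is a basic Lloyd's algorithm seeded with the first k points, and the toy descriptors are 2-dimensional, whereas SIFT descriptors have 128 dimensions and the report uses 200 code words.

```python
def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; the first k points seed the centers."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old center if a cluster is empty
                centers[i] = [sum(dim) / len(group) for dim in zip(*group)]
    return centers


def bag_of_words(descriptors, codebook):
    """Histogram counting how many descriptors fall on each code word."""
    hist = [0] * len(codebook)
    for d in descriptors:
        hist[nearest(d, codebook)] += 1
    return hist


# Toy descriptors pooled from all training images (assumed data).
all_descriptors = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (1.0, 0.0)]
codebook = kmeans(all_descriptors, k=2)

# Descriptors of one image, quantized against the shared codebook.
image_descriptors = [(0.5, 0.5), (9.0, 10.0), (0.0, 0.2)]
histogram = bag_of_words(image_descriptors, codebook)
print(histogram)
```

Because every image is counted against the same codebook, each histogram has exactly as many bins as there are code words, regardless of how many descriptors the image produced.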
The Histogram Intersection Kernel (HIK) method was applied to the image vectors to obtain the similarity matrix. Each element of this matrix represents the similarity score between two image vectors. The following sub-section explains the procedures carried out on the testing data.
3.4. MANIPULATING THE TESTING DATA
The above procedures are applied to the testing data as well: features are extracted, a codebook is obtained and a kernel matrix is calculated. The following section describes the work still to be done.
4. FUTURE WORK
The training data has been collected and manipulated.
This information is to be used to find a generalized scientific method that can predict the visual quality of unknown webpages (test data). The Support Vector Machine (SVM) classification method is to be used for this. SVMs are supervised machine learning models that analyze data and recognize patterns. They can be used for classification and regression analysis [5]. Given a training data set whose items are labeled as belonging to one of two categories, an SVM algorithm builds a model that predicts which category a new data value belongs to.
In this project, SVM can be used to train the classifier with which future webpages can be assessed for their visual quality.
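For reference, the standard two-class SVM decision rule in kernel form is shown below. The symbols are notation introduced here rather than taken from the report: x_i and y_i are the training vectors and their labels (+1 or -1, which could stand for appealing and non-appealing), alpha_i and b are the coefficients learned during training, and K is the kernel function.

```latex
f(x) = \operatorname{sign}\!\left( \sum_{i=1}^{n} \alpha_i \, y_i \, K(x_i, x) + b \right)
```

Only the training vectors with a non-zero alpha_i (the support vectors) contribute to the sum, so predicting the class of a new image requires only its kernel values against the training images.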
To obtain the classifier, the training kernel matrix and the image vectors of the training data set obtained in Section 3.3 are to be input into the SVM algorithm. Once the classifier is obtained, the training stage has ended. Following that, the testing kernel matrix and the image vectors of the testing data set obtained in Section 3.4 are input to the algorithm.
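The kernel matrices referred to above can be sketched as follows, assuming each image is represented by its code word histogram. The function names and the toy histograms are hypothetical, and the matrix layout (one row per image being classified, one column per training image) is an assumption made for illustration rather than a detail taken from the report.

```python
def histogram_intersection(h1, h2):
    """Similarity of two equal-length histograms: sum of bin-wise minima."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def kernel_matrix(rows, cols):
    """K[i][j] = histogram intersection of rows[i] and cols[j]."""
    return [[histogram_intersection(r, c) for c in cols] for r in rows]


# Toy histograms standing in for the real image vectors.
train = [[2, 0, 1], [0, 3, 0]]
test = [[1, 1, 1]]

k_train = kernel_matrix(train, train)  # training images against themselves
k_test = kernel_matrix(test, train)    # test images against training images
print(k_train)
print(k_test)
```

The training matrix is square and symmetric, while the test matrix has one row per test image and one column per training image, because the classifier compares each new image with the training images it learned from.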
The ‘svmpredict’ function outputs the result based on the classifier.
These results are then to be tested for accuracy.
4.1. PLAN OF ACTION
Apart from the work proposed in the above section, the author plans to follow a trial-and-error approach to the parameters used in the algorithms, such as the number of code words (the k value of k-means) and the number of pyramid levels, to improve the algorithm and increase its accuracy.
The author would also like to apply dynamic clustering instead of k-means and compare the results. The Gantt chart below shows the timeline proposed by the author at the start of this project.
The project has progressed according to this timeline. Differing from the timeline, however, both the training and testing sets have been prepared for classification and prediction. The next step is to obtain the SVM model.
This stage has not yet been accomplished, as the author would like to study SVM classification in depth before obtaining the model. Later stages will follow the timeline.
FIGURE 2: PROJECT PLAN FOR SEMESTER 1. BLUE COLUMN REPRESENTS THE CURRENT PROGRESS IN THE PROJECT.
FIGURE 3: PROJECT PLAN FOR SEMESTER 2.

