Javad Azimi
Ph.D. Candidate


  • Bayesian Optimization

    Bayesian optimization (BO) has been widely used for experimental design problems where the goal is to optimize an unknown function f(.) that is costly to evaluate. Due to the high experimental cost, it is not practical to apply methods that rely on many function evaluations, such as stochastic search or empirical gradient methods. Bayesian optimization addresses this issue by maintaining a posterior over the unknown function based on all previous experiments, which allows BO to focus on a small number of carefully selected experiments.

  BO Big Picture  

As shown in the figure above, Bayesian optimization works in an iterative framework. In general, at each iteration, given a set of observed points, these methods select k>=1 unobserved points at which the function is to be evaluated. The results of those experiments are then added to the set of observations, and the procedure is repeated until some stopping criterion is met. There are two key components in the basic framework of Bayesian optimization: the posterior model and the selection criterion.
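As a rough illustration of this iterative framework (not the implementation used in this work), the loop can be sketched as follows for the sequential case k = 1 over a finite candidate set. The names `bo_loop`, `fit_posterior`, and `select` are hypothetical, and the two toy stand-ins at the bottom are placeholders for a real posterior model and selection criterion.

```python
def bo_loop(f, candidates, fit_posterior, select, n_init=2, n_iter=5):
    """Sequential (k = 1) optimization loop over a finite candidate set.

    f             : the costly function to maximize
    fit_posterior : hypothetical callable, observations dict -> model,
                    where model(x) returns (mean, variance)
    select        : hypothetical callable, (unobserved points, model) -> point
    """
    # Start from a few initial experiments.
    observed = {x: f(x) for x in candidates[:n_init]}
    for _ in range(n_iter):
        unobserved = [x for x in candidates if x not in observed]
        if not unobserved:          # stopping criterion: nothing left to try
            break
        model = fit_posterior(observed)      # posterior from all observations
        x_next = select(unobserved, model)   # pick the next experiment
        observed[x_next] = f(x_next)         # run it and record the result
    best = max(observed, key=observed.get)
    return best, observed


# Toy stand-ins, for illustration only (this is not a Gaussian process).
def nearest_neighbour_model(observed):
    def model(x):
        nearest = min(observed, key=lambda o: abs(o - x))
        return observed[nearest], abs(nearest - x)  # (mean, variance proxy)
    return model


def greedy_select(unobserved, model):
    # Favour points with a large predicted mean plus large uncertainty.
    return max(unobserved, key=lambda x: sum(model(x)))


best, observed = bo_loop(lambda x: -(x - 3) ** 2, list(range(7)),
                         nearest_neighbour_model, greedy_select)
```

Here the stopping criterion is simply a fixed budget of iterations; in practice it would depend on the experimental cost.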

The first component is a probabilistic model of the underlying function that is built from the prior information (i.e., the existing observed experiments). This process is also known as Kriging. Gaussian process (GP) regression has been widely used in the Bayesian optimization literature for this purpose. For any unobserved point, a GP models the function output as a normal random variable whose mean predicts the expected function output at that point and whose variance captures the uncertainty associated with the prediction.
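To make the mean and variance concrete, here is a minimal sketch of the GP posterior at a single unobserved point, assuming a zero-mean prior, scalar inputs, and an RBF kernel. The hyperparameter values (`length_scale`, `signal_var`, `noise_var`) are illustrative assumptions, and this is not the Matlab implementation linked from this page.

```python
import math


def rbf(x1, x2, length_scale=1.0, signal_var=1.0):
    """Squared-exponential (RBF) kernel for scalar inputs."""
    return signal_var * math.exp(-0.5 * ((x1 - x2) / length_scale) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def gp_posterior(x_obs, y_obs, x_new, noise_var=1e-6):
    """Posterior mean and variance of a zero-mean GP at one unobserved point."""
    n = len(x_obs)
    # Kernel matrix of the observations, with a small noise term on the diagonal.
    K = [[rbf(x_obs[i], x_obs[j]) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [rbf(x, x_new) for x in x_obs]
    alpha = solve(K, y_obs)    # K^{-1} y
    v = solve(K, k_star)       # K^{-1} k_star
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, v))
    return mean, max(var, 0.0)  # clip tiny negative values from round-off


mean, var = gp_posterior([0.0, 1.0, 2.0], [0.0, 1.0, 0.5], 1.5)
```

Near an observed point the variance shrinks toward the noise level, and far from all observations it returns to the prior variance, which is what lets a selection criterion distinguish explored from unexplored regions.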

The second key component of BO is the selection criterion, which determines what experiment to select based on the learned model. Various selection criteria have been proposed in the literature, and most of them combine exploring the unexplored input space of the function (i.e., areas of high variance) with exploiting promising areas (i.e., areas with a large mean). A selection criterion can be either sequential, in which only one experiment is requested at each iteration, or non-sequential, in which a batch of experiments is requested at each iteration.
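One common criterion that makes this trade-off explicit is the upper confidence bound (UCB), shown below as a sketch of a sequential selection step. It is only one example of such a criterion, not necessarily the one used in this work; the weight `beta` and the `posterior` callable (returning a mean and a variance for a point) are assumptions made for illustration.

```python
import math


def ucb(mean, variance, beta=2.0):
    """Upper confidence bound: exploitation (mean) plus exploration (std)."""
    return mean + beta * math.sqrt(max(variance, 0.0))


def select_next(candidates, posterior, beta=2.0):
    """Sequential selection: return the single candidate with the highest UCB.

    posterior is a hypothetical callable mapping a point to (mean, variance).
    """
    return max(candidates, key=lambda x: ucb(*posterior(x), beta=beta))


# Toy posterior over three candidate points: (mean, variance) per point.
toy = {0: (1.0, 0.0), 1: (0.5, 1.0), 2: (0.0, 0.01)}
chosen = select_next([0, 1, 2], toy.get)
```

With `beta = 0` the criterion is purely exploitative and picks the point with the largest mean; larger values of `beta` shift the choice toward points with high variance.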

Here is the link to my Gaussian Process implementation with RBF kernel function in Matlab.


  • Creative Visual Features in Performance Display Advertising

Display advertising has been a significant source of revenue for publishers and ad networks in the online advertising ecosystem. One of the main goals in display advertising is to maximize the user response rate for advertising campaigns, such as click-through rate (CTR) or conversion rate. Although it is widely believed in the online advertising industry that the visual appearance of ads (creatives) affects the propensity of users to respond, there is no published work so far that addresses this topic via a systematic data-driven approach. In this project, we quantitatively study the relationship between the visual appearance and the performance of creatives using large-scale data from the world's largest display ad exchange system, RightMedia. We designed a set of 43 visual features, some of which are novel and some of which are inspired by related work. We extracted these features from real creatives served on RightMedia. We also designed and conducted a series of experiments to evaluate the effectiveness of the visual features for CTR prediction, ranking, and performance classification. Based on the evaluation results, we selected the subset of features that have the greatest impact on CTR. We believe that the findings presented in this project will be very useful to the online advertising industry in designing high-performance creatives. The project also provides the research community with the first data set of this kind, initial insights into the effect of visual appearance on user response propensity, and evaluation benchmarks for further study.


    As a part of our results, we provide the following set of recommendations to designers for optimizing creative performance:

    • Creatives with higher gray level contrast achieve higher CTR.
    • A small number of salient components, with all components close to the center of the creative and the major component consistent with the rule of thirds, achieves higher CTR.
    • Creatives with good color harmony (those with small deviation from color harmony models) achieve higher CTR.
    • Average lightness, across both the whole image and the largest segment of the image, is positively correlated with CTR.
    • Cluttered creatives (those with a large number of connected components) are unlikely to achieve high CTR.
    • Creatives with a large number of characters cause textual clutter and are unlikely to achieve high CTR.
    • Too many different hues, in both the whole image and the largest component of the image, are not desirable.


  • Statistical Anomaly Detection

    In this project, we have a large number of features (more than 2000), some of which are known to be related to one another. All of the features are numeric, and some values are missing. The main difficulty is that we do not have a labeled training set. In contrast to previous anomaly detection methods, which assume that some good and bad patterns are known, we have no prior knowledge of what good or bad patterns look like. We only know that some modules differ from the other modules in some way.

    The data is the output of tester systems that measure different criteria of a module. When a module fails a test, we know that we have detected a bad module. However, we are looking for modules whose abnormal behavior the test systems could not detect, that is, modules that have a problem but still pass. We can cast this as a problem with the tester system or with the acceptance boundary of each test. Sometimes the acceptance boundary of a test is too wide and the measured value is very close to it. In addition, most of the time the problem appears in one test only: all of the other tests are completely normal, but the value of one test is slightly abnormal compared with the others. Note that the test in question still passes, because its value is within the acceptance boundary, yet that value is not normal when compared with the other tests. Since we are dealing with high-dimensional data, detecting such abnormal behavior is very difficult. Note also that we cannot make any assumption about the distribution of the data, since it does not follow a particular distribution.

    We found that the tests are highly correlated with one another. Therefore, instead of removing the highly correlated tests, as other approaches do, we keep them. When two tests are highly correlated, the value of one can be predicted from the value of the other with little error. We use polynomial regression to predict the value of one test from the other.
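As an illustration of predicting one test from another, the sketch below fits a least-squares polynomial through the normal equations. The polynomial degree is not stated in this description, so `degree=2` is an assumption, and `polyfit` and `predict` are hypothetical helper names rather than the code used in the project.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def polyfit(xs, ys, degree=2):
    """Least-squares coefficients [c0, c1, ...] of y ~ sum_p c_p * x**p."""
    m = degree + 1
    # Normal equations (V^T V) c = V^T y, where V[i][p] = xs[i] ** p.
    A = [[sum(x ** (p + q) for x in xs) for q in range(m)] for p in range(m)]
    b = [sum((x ** p) * y for x, y in zip(xs, ys)) for p in range(m)]
    return solve(A, b)


def predict(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** p for p, c in enumerate(coeffs))


# Values of test j (xs) and test i (ys) over a handful of made-up modules.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x + 0.5 * x ** 2 for x in xs]
coeffs = polyfit(xs, ys, degree=2)
```

Solving the normal equations directly is adequate for low degrees and well-scaled values; for higher degrees or badly scaled measurements a QR-based least-squares routine would be the more stable choice.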

    Our solution is as follows. First, we compute the correlation matrix of all tests. Suppose we have n tests; then the correlation matrix is an n*n matrix in which entry (i, j) is the correlation between test i and test j. An absolute correlation value close to 1 means that the two tests are highly correlated, and a value close to zero means that the value of one test carries little linear information about the other. Then, for each test and its 5 most highly correlated tests, we compute the regression coefficients of each pair using the previously collected data. Note that we have a large number of modules (more than 30,000) for which the results of all tests are available. We use this data to compute the correlation values and the regression coefficients.
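The first step can be sketched as follows, assuming the Pearson correlation coefficient and modules with no missing values (the description does not specify either). Each inner list holds the values of one test across all modules; the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long lists of measurements."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:     # a constant test has no defined correlation
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """n*n matrix whose entry (i, j) is the correlation of tests i and j."""
    n = len(columns)
    return [[pearson(columns[i], columns[j]) for j in range(n)]
            for i in range(n)]


def top_correlated(corr, i, k=5):
    """Indices of the k tests with the largest |correlation| to test i."""
    others = [j for j in range(len(corr)) if j != i]
    return sorted(others, key=lambda j: abs(corr[i][j]), reverse=True)[:k]


# Four made-up tests measured on four modules.
columns = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.1],
    [4.0, 3.0, 2.0, 1.0],
    [1.0, -1.0, 1.0, -1.0],
]
corr = correlation_matrix(columns)
partners = top_correlated(corr, 0, k=2)
```

Ranking by absolute value keeps strongly negative correlations, which are just as useful for prediction as strongly positive ones.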

    After computing the coefficients, we predict the value of each test from its highly correlated tests. For each test, we predict its value from a correlated test and compare the prediction with the measured value. We then compute the prediction error for each pair over the available data and, finally, the standard deviation of the prediction error for each pair. In our experiments, although the distribution of a test is not normal, the distribution of the prediction error is usually normal. Therefore, if the predicted value of a test, given one of its highly correlated tests, differs from the measured value by more than three times the standard deviation of the prediction error, we report an anomaly for that test. In other words, we report an anomaly if the measured value of a test i that is highly correlated with test j does not satisfy the equation below.
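The equation itself is not reproduced in this text. Read literally, the description above amounts to accepting a measured value y_i when |y_i - yhat_ij| <= 3 * sigma_ij, where yhat_ij is the prediction of test i from test j and sigma_ij is the standard deviation of that pair's prediction error. This notation is an assumption made here for illustration, and the sketch below is one way to implement such a check, not the project's code.

```python
import math


def residual_std(measured, predicted):
    """Sample standard deviation of the prediction errors for one test pair."""
    errors = [m - p for m, p in zip(measured, predicted)]
    mean_err = sum(errors) / len(errors)
    var = sum((e - mean_err) ** 2 for e in errors) / (len(errors) - 1)
    return math.sqrt(var)


def is_anomaly(measured_i, predicted_i, sigma_ij, k=3.0):
    """True when the measured value is more than k sigma from the prediction."""
    return abs(measured_i - predicted_i) > k * sigma_ij


# Made-up historical data for one (i, j) pair of tests.
history_measured = [1.0, 2.1, 2.9, 4.0, 5.1]
history_predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
sigma = residual_std(history_measured, history_predicted)

# A new module: the value predicted from test j is 6.0.
flag_far = is_anomaly(6.5, 6.0, sigma)    # far from the prediction
flag_near = is_anomaly(6.1, 6.0, sigma)   # close to the prediction
```

The three-sigma threshold relies on the observation above that the prediction errors are approximately normal, in which case only a small fraction of normal modules would be flagged for any given pair.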