
AI’s apparel eye

Deep learning based approach to recognize attributes of apparel


The “FashionAI Global Challenge 2018 – Attributes Recognition of Apparel” was held to advance the ability of AI to help the fashion industry recognize the attributes of clothing from a given image. This capability could be applied widely, for example in apparel image search, navigation tagging, and mix-and-match recommendations. The competition was hosted on Tianchi, the Alibaba Cloud competition site. The dataset released for the competition is the largest dataset available in the domain of apparel attribute recognition. We finished the competition in 30th position out of 2,950 contestants from across the world. In the sections that follow, we describe the dataset, our approach, other experiments, results, conclusions and further ideas, and references.


Apparel attributes are the basic knowledge of the fashion field, and they are numerous and complex. The competition provided us with a hierarchical attribute tree as a structured classification target that describes how apparel is perceived, which is shown below. The “subject” refers to an item of apparel. Our focus for the competition was on the characteristics of the apparel.

The data provided has eight categories, each representing a clothing type. Each category is further broken down into labels that define it in terms of design or length. If the design or length is not clearly visible in the image, an “invisible” label is assigned. Tables below show the categories, the labels in them and the total number of images inside each label:




  • In some images, the way the person is posing might obscure the design or length of the clothes. For example, if the person is sitting, a floor length skirt might look like an ankle length skirt.
  • The background in the image also added noise. In some images, it merged with the dress color, making it difficult for the model to distinguish the dress boundary from the background.
  • In some cases, the model couldn’t differentiate between clothes of similar length (for example, knee vs. midi length skirts).

The data was given to us in separate folders for each category. This eliminated the need to first predict the category and then the labels inside it; we only needed to predict the labels within each category. The test dataset has the same structure.


Convolutional neural networks (CNNs) are widely used to solve image classification problems. We used the same technique, and our approach can be divided into four parts:

A. Preprocessing the images and data augmentation

B. Choosing network architecture of CNN

C. Optimizing the parameters of the network

D. Test time augmentation


We normalized the pixel values of the images (0-255) by subtracting the mean, to suit the network architecture used. We applied transformations such as zooming (1-1.1x), adjusting the image contrast (randomly between 0 and 0.05), rotation (randomly between 0 and 10 degrees) and flipping the images. This helped make the model more invariant to orientation and illumination in the image. In every epoch, a random transformation was chosen, so that each epoch showed the network a different version of the same image, which keeps the network from overfitting to a fixed set of images.
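The per-epoch random choice of transformations described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code we ran: the function names, the 0.5 flip probability, and the use of a flat list of pixel values are assumptions made for the example.

```python
import random

def normalize(pixels, mean):
    """Subtract the mean from each pixel value (values in the 0-255 range)."""
    return [p - mean for p in pixels]

def sample_augmentation(rng):
    """Draw one random set of transformation parameters."""
    return {
        "zoom": rng.uniform(1.0, 1.1),           # zoom factor between 1x and 1.1x
        "contrast": rng.uniform(0.0, 0.05),      # contrast change between 0 and 0.05
        "rotation_deg": rng.uniform(0.0, 10.0),  # rotation between 0 and 10 degrees
        "flip": rng.random() < 0.5,              # flip probability of 0.5 is an assumption
    }

rng = random.Random(0)
# A fresh draw in every epoch means the network sees a different
# version of the same image each time.
per_epoch = [sample_augmentation(rng) for _ in range(3)]
centered = normalize([0, 255], mean=127.5)
```

The sampled parameters would then be handed to whatever image library applies the actual zoom, contrast, rotation and flip.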


We used transfer learning to solve this problem. Transfer learning means using a model that was trained for another task to help solve the problem at hand. This provides the initial base features and avoids training the model from scratch when data and computational resources are limited. We took a network trained on ImageNet data as the starting point. ImageNet is a large database of images, and every year many researchers try to improve the classification accuracy on ImageNet and submit their results to the Large Scale Visual Recognition Challenge (ILSVRC). This challenge has 1,000 categories to predict. To suit the problem at hand, the final output layer after the Fully Connected layers (FC layers) in the architecture was replaced with a layer whose size equals the number of labels in the given category.

We experimented with different types of residual networks, ResNet [1] and ResNeXt [2], and with the current state of the art architectures NASNet and SENet. In our experiments, ResNeXt gave better results than the other architectures when both accuracy and computational time were taken into account.



Choosing a starting value of the learning rate is highly important to ensure convergence of the network parameters to the optimal value.

Recent work by Leslie Smith, a researcher in the field of deep learning, titled “Cyclical Learning Rates for Training Neural Networks” [3], describes how to choose an initial learning rate for a given problem. In summary, the idea is to start with a very small learning rate and increase it gradually, in multiplicative steps (for example, powers of 2 or 10), at every iteration in the epoch. Initially, when the learning rate is too small, the error decreases very slowly. If you keep increasing it, at some point the learning rate becomes so high that the optimizer overshoots the minimum and the error starts shooting upwards. A learning rate beyond this point should not be chosen, as it is too high for the parameters to converge.
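A minimal sketch of this range test is shown below. It only builds the increasing learning rate sequence and flags where the recorded loss blows up; the function names and the "4 times the best loss so far" stopping rule are assumptions for illustration, and the losses would come from actually training one batch at each learning rate.

```python
def lr_range(start_lr, end_lr, num_steps):
    """Learning rates growing by a constant factor from start_lr to end_lr."""
    ratio = (end_lr / start_lr) ** (1.0 / (num_steps - 1))
    return [start_lr * ratio ** i for i in range(num_steps)]

def find_divergence_lr(lrs, losses, blowup=4.0):
    """Return the first learning rate whose loss exceeds `blowup` times the
    best loss seen so far (a hypothetical stopping rule), or None."""
    best = float("inf")
    for lr, loss in zip(lrs, losses):
        best = min(best, loss)
        if loss > blowup * best:
            return lr
    return None

lrs = lr_range(1e-5, 1e-1, num_steps=5)    # one value per power of 10
losses = [2.0, 1.9, 1.0, 1.2, 9.0]         # made-up losses, one per learning rate
too_high = find_divergence_lr(lrs, losses)
```

A learning rate somewhat below `too_high`, in the region where the loss was still falling quickly, is the candidate to train with.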

The image below shows the learning rate finder for the ResNeXt-101 architecture when trained on one category of clothing. The loss decreases sharply between 10^-4 and 10^-3, starts increasing from 10^-2, and rises drastically at 10^-1. Ideally, we should choose a learning rate between 10^-4 and 10^-3. We chose 10^-4 to accommodate another technique for adjusting the learning rate, described next.

Smith’s “Cyclical Learning Rates for Training Neural Networks” [3] also points out that instead of keeping a constant learning rate across the epochs, the learning rate can be made cyclical. In the simplest setup, the number of cycles equals the number of epochs, and at the start of each cycle the learning rate resets to the original learning rate (the one chosen with the learning rate finder above). Inside a cycle, the learning rate decreases gradually with each batch, following a cosine curve. This process helps the network escape narrow regions (local minima) of the error surface and favors wider regions.

The plots below show the cyclic learning rate schedule. After running a few cycles, we can lengthen the cycles so that the learning rate decreases more gradually, which helps the weights converge. When a cycle is two epochs long, the learning rate at the start of the second epoch continues from where it ended in the first epoch. A total of seven epochs were run in the case shown in the image. There are three cycles, and each cycle is twice as long as the previous one: the first cycle ran for one epoch, the second for two epochs, and the last for four epochs.
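The schedule described above (three cycles of one, two and four epochs, with cosine decay inside each cycle and a reset at every restart) can be reproduced with a short sketch. The function name and the figure of 100 batches per epoch are assumptions for illustration; the base learning rate of 10^-4 is the one chosen earlier.

```python
import math

def cosine_restart_schedule(base_lr, num_cycles, cycle_len=1, cycle_mult=2,
                            batches_per_epoch=100):
    """Per-batch learning rates: cosine decay inside each cycle,
    reset to base_lr at the start of every cycle."""
    lrs = []
    epochs = cycle_len
    for _ in range(num_cycles):
        steps = epochs * batches_per_epoch
        for step in range(steps):
            lrs.append(0.5 * base_lr * (1 + math.cos(math.pi * step / steps)))
        epochs *= cycle_mult  # the next cycle is cycle_mult times longer
    return lrs

# Three cycles of 1, 2 and 4 epochs: seven epochs in total.
schedule = cosine_restart_schedule(1e-4, num_cycles=3)
```

With these settings the restarts fall at batch 0, batch 100 and batch 300, and the learning rate decays continuously across the epoch boundaries inside the two- and four-epoch cycles.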

The error surface is not smooth, and cyclic learning rates can help us jump past its narrow regions. When we increased the number of cycles, the loss stayed constant, indicating that the network had not settled in a narrow region of the error surface.


Because we are using a pretrained architecture and performing transfer learning, not all layers require additional training. As the architectures are state of the art on ImageNet, they are already good at identifying low-level features such as boundaries and edges. Those are captured in the first few layers of the architecture, which therefore don’t require much retraining.

We chose different learning rates for different parts of the network, and the layers are grouped into three parts. The first part corresponds to the initial set of layers, the second part corresponds to the layers in the middle, and the third part corresponds to the last set of layers plus FC layers (Fully Connected layers).

Two steps that are used to train the network are listed below:

  • Initially, the network is frozen for all layers except the last Fully Connected layers. By frozen, we mean those layers are not trained; we only compute the activations up to the layers before the FC layers and tune the weights between the Fully Connected layers (two layers of size 512) and the output layer (whose size depends on the category we are predicting).
  • Next, the network is unfrozen, i.e., all the layers are made trainable. The rule of thumb for images similar to ImageNet is to set the learning rates of the three layer groups to [lr/100, lr/10, lr], in that order, but in our case [lr/100, lr, lr] proved to work well. This suggests that the information captured in the middle layers is as important as the information captured in the layers near the FC layers. (Here, “lr” refers to the learning rate.)

This concept of using different learning rates across different layer groups is termed differential learning rates [4].
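As a small illustration of the two training steps, the sketch below records which layer groups are trainable in the first step and which learning rate each group receives in the second. The group names and function names are hypothetical; only the [lr/100, lr, lr] and [lr/100, lr/10, lr] divisors come from the text above.

```python
GROUPS = ("early", "middle", "head")  # hypothetical names for the three layer groups

def stage_one():
    """Step 1: everything is frozen except the FC head."""
    return {"early": False, "middle": False, "head": True}

def stage_two(lr, divisors=(100, 1, 1)):
    """Step 2: all groups are trainable, each with its own learning rate.
    The default divisors give [lr/100, lr, lr]."""
    return {group: lr / d for group, d in zip(GROUPS, divisors)}

trainable = stage_one()
ours = stage_two(1e-4)                                 # [lr/100, lr, lr]
rule_of_thumb = stage_two(1e-4, divisors=(100, 10, 1)) # [lr/100, lr/10, lr]
```

In a deep learning framework, these per-group values would typically be passed to the optimizer as separate parameter groups.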


The images we received are mostly 512 pixels in size. We first resized them to 224 pixels (since most ImageNet models are trained at this size) for the initial tuning of weights. We then resized the images to 299 pixels and ran the same number of epochs, using the final weights obtained at 224 as the initial weights. Finally, we resized all the images to 512 pixels and used the weights obtained at 299 pixels as the initial weights.

The advantages were twofold:

  • We gain computational time, since larger images increase the time it takes to tune the weights of the network. If we start from the weights obtained on smaller images for the same problem, we give the network a good starting point, and it converges in less time than it otherwise would.
  • We get accuracy gains from providing the data at different sizes. Classes that are far apart (e.g., sleeveless vs. wrist length sleeves) are already handled well at the smaller sizes, while nearby classes are classified more accurately with a higher resolution input image.
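The size schedule above amounts to a simple loop in which each stage starts from the weights of the previous one. In the sketch below, `train_fn` is a hypothetical stand-in for a full training run (it takes an image size, a number of epochs and the starting weights, and returns the tuned weights); it is an assumption for illustration and not part of any library.

```python
def progressive_resize(train_fn, epochs_per_size, sizes=(224, 299, 512)):
    """Train at increasing image sizes, carrying the weights forward."""
    weights = None  # None stands for the ImageNet-pretrained starting point
    history = []
    for size in sizes:
        # Same number of epochs at every size, starting from the previous weights.
        weights = train_fn(size, epochs_per_size, weights)
        history.append((size, weights))
    return weights, history

# Stub training function that only records what it was asked to do.
def fake_train(size, epochs, init_weights):
    return {"size": size, "epochs": epochs, "init": init_weights}

final_weights, history = progressive_resize(fake_train, epochs_per_size=7)
```

The stub makes the chaining visible: the run at 512 pixels starts from the result of the run at 299, which in turn started from the result at 224.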


During prediction, we applied the same transformation parameters that we used during training and generated eight images. We chose four of them at random and predicted on these images as well as on the original image. We then averaged the prediction probabilities, which increased the accuracy of the predictions. We believe one possible reason is that some center cropping can happen while resizing the image, which can lose information from the sides. When we apply transformations, that information is captured in one or more of the transformed images, and averaging the probabilities increases the accuracy of the model.
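The averaging step can be sketched as below. `predict_fn` and the augmentation functions are hypothetical placeholders for the trained model and the training-time transformations; the counts (eight generated images, four used, plus the original) follow the description above.

```python
import random

def predict_with_tta(predict_fn, image, augment_fns, num_used=4, rng=None):
    """Average class probabilities over the original image and a random
    subset of its augmented versions."""
    rng = rng or random.Random()
    candidates = [augment(image) for augment in augment_fns]  # e.g. eight images
    chosen = rng.sample(candidates, num_used)                 # keep four of them
    views = [image] + chosen                                  # plus the original
    probs = [predict_fn(view) for view in views]
    num_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]

# Stub model and augmentations, only to show the call pattern.
def fake_predict(image):
    return [0.6, 0.4]

augmentations = [lambda img: img for _ in range(8)]
averaged = predict_with_tta(fake_predict, "image", augmentations,
                            rng=random.Random(0))
```

The predicted label is then the class with the highest averaged probability.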


As discussed earlier, gradually increasing the image size improves accuracy, but most of the original images are 512 pixels in size. We therefore applied super resolution (a deep learning based method for resizing images to a higher resolution). We resized the images to 1024 pixels and ran similar experiments, but accuracy did not improve, and computational time grew sharply from 720 pixels onwards.

During the semi-finals of the competition, we were given a dataset containing images of apparel hanging on a hanger or on a wall (i.e., not worn by humans). We used a similar approach, but used YOLO to separate the hanger images from the images with humans and built separate models for each. However, a combined model trained on both human and hanger images always gave better results than the separate models.


The results for all categories follow similar trends across the experiments. Hence, we present the results for one category: skirt length (the category for which we provided example images).


We started with the ResNet architecture and moved on to ResNeXt-50 and ResNeXt-101. ResNeXt-101 outperformed all the other architectures, as shown in the results below. Notation: epoch: the cycle index, starting from zero; trn_loss: log loss on training data; val_loss: log loss on validation data; accuracy: classification accuracy on validation data.

It can also be observed that test time augmentation (TTA) always improved the prediction accuracy compared with the accuracy at the last epoch.


Taking the best performing architecture, ResNeXt-101, we sequentially increased the size of the input image. Accuracy increased from 85.39% at 224 pixels to 88.25% at 512 pixels.


There is no confusion between short length and floor length (i.e., categories that we as humans can also distinguish very accurately). But the model struggles to classify nearby classes correctly. We tried other approaches, such as modeling the nearby categories separately and changing the loss function, but none of them solved the confusion between nearby classes.


How the learning rate is chosen is very important for the convergence of neural networks, and how the network is optimized is another important step in model building. However, we feel that the current state of the art architectures throw away a lot of information by the time the activations reach the FC layers. Current methods take the average value of each channel before the FC layers, which loses the detailed information captured up to that layer. Taking all the activations instead would make the parameter space much bigger, thereby causing overfitting and increasing the time needed to tune the network. We propose the following ideas to improve on this:

  • We tried XGBoost at the end of the competition, taking all activation values from all the filters in the layer before the FC layers, i.e., the final conv layer. We observed that we can get better results than by taking just the average values of those filters. Due to the time constraints of the competition, we haven’t been able to complete the experiment, and we will publish those results soon. In summary, using XGBoost on the activations of the filters just before the FC layers could help boost the accuracy on nearby classes by capturing some of the detailed information.
  • Bagging could also help: we selectively expose all the activations of a few important filters, chosen based on the importance of their weights, repeat this multiple times and take an average. This might help us capture the detailed information from some important filters.
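To make the trade-off between averaging each channel and keeping every activation concrete, the sketch below compares the two on a toy feature map and counts the weights needed by a first FC layer of size 512 in each case. The 2048 x 7 x 7 shape used in the count is an assumption for illustration (typical of a ResNet-style network at 224 pixel input), and the function names are hypothetical.

```python
def global_average_pool(feature_maps):
    """One value per channel: the mean of that channel's activations."""
    return [sum(sum(row) for row in fm) / (len(fm) * len(fm[0]))
            for fm in feature_maps]

def flatten(feature_maps):
    """Keep every activation from every channel."""
    return [value for fm in feature_maps for row in fm for value in row]

def first_fc_weights(channels, height, width, fc_size=512):
    """Weights between the conv output and the first FC layer,
    for pooled input vs. flattened input."""
    pooled_count = channels * fc_size
    flat_count = channels * height * width * fc_size
    return pooled_count, flat_count

# Toy example: 2 channels of 2x2 activations.
maps = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [0.0, 8.0]]]
pooled = global_average_pool(maps)   # 2 features, spatial detail averaged away
full = flatten(maps)                 # 8 features, spatial detail kept

# Assumed shape: 2048 channels of 7x7 activations.
pooled_count, flat_count = first_fc_weights(2048, 7, 7)
```

In the toy example, the second channel's single strong activation is diluted by the average, while the flattened version keeps it at the cost of a much larger input to whatever model consumes it.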



Data Scientist

Mathematics and Scientific Computing graduate from IIT Kanpur. Fascinated by the mathematics behind machine learning. Interested in applications of deep learning, especially in the field of computer vision.


Senior Data Scientist

Material Science graduate from IIT Roorkee. Enjoys problem solving and has worked on creating data driven solutions for various domains like Insurance, Supply Chain, CPG.


Data Scientist

Material Science graduate from NIT Warangal. Passionate about solving problems that create a meaningful impact to the society/business using machine learning and deep learning.


      1. ResNet. Deep Residual Learning for Image Recognition. https://arxiv.org/pdf/1512.03385.pdf
      2. ResNeXt. Aggregated Residual Transformations for Deep Neural Networks. https://arxiv.org/pdf/1611.05431.pdf
      3. Cyclical Learning Rates for Training Neural Networks. https://arxiv.org/pdf/1506.01186.pdf
      4. Lectures by fast.ai. http://www.fast.ai/


We would like to thank Jeremy Howard and Rachel Thomas for generously helping everyone learn state of the art deep learning techniques through their lectures at fast.ai [4]. Many of our ideas were inspired by those lectures.