tf.keras.applications is a set of built-in models in TensorFlow-Keras. They are pretrained on ImageNet-1k and are just one function call away, which makes life easier for ML practitioners who want ready-to-go models at their disposal. RegNets are highly efficient and scalable models proposed by Facebook AI Research (FAIR). They are used in works like SEER, which need models that can scale to billions of parameters. In a PR to Keras, I implemented and trained 24 RegNet models of varying complexity and added them to tf.keras.applications.
Even though I was responsible for the primary development work on the PR as well as training the models, I received great help from the community, which made this a truly collaborative effort.
I performed several experiments with these models because the reported accuracies could not be reproduced using the hyperparameters provided in the paper. This blog post is a record of the experiments I tried and how they panned out.
I sincerely thank the Keras team for allowing me to add these models. Huge thanks to the TPU Research Group (TRC) for providing TPUs for the entire duration of this project, without which this would not have been possible. Thanks a lot to Francois Chollet for allowing this and guiding me throughout the process. Thanks to Qianli Scott Zhu for his guidance in building Keras from source on TPU VMs. Thanks to Matt Watson for his support regarding grouped convolutions. Special thanks to Lukas Geiger for his contributions to the code. Last but not least, thanks a ton to Sayak Paul for his continuous guidance and encouragement.
The paper “Designing Network Design Spaces” aims to systematically deduce the best model population starting from an unconstrained model space. It seeks a best-performing model population, as opposed to a single best model as in works like NASNet. The outcome of this process is a family of networks comprising models with various computational costs, from which the user can choose a particular architecture based on their needs.
Every model in the RegNet family consists of four Stages. Each Stage consists of numerous Blocks. The architecture of the Block is fixed, and three major variants of it are available: the X Block, Y Block and Z Block. Other variants can be seen in the paper, and the authors state that the model deduction method is robust and that RegNets generalize well to these block types. The number of Blocks and their channel widths in each Stage are determined by a simple quantization rule put forth in the paper. More on that in this blog. RegNets have been the network of choice for self-supervised methods like SEER due to their remarkable scaling abilities.
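To make that quantization rule concrete, here is a rough sketch of how it generates per-block widths. The parameter names (w_0, w_a, w_m) follow the paper; the specific values in the example call are illustrative and do not correspond to any particular RegNet variant.

```python
import numpy as np

def regnet_widths(w_0, w_a, w_m, depth, divisor=8):
    """Sketch of the RegNet width quantization rule from
    "Designing Network Design Spaces": a linear width progression
    u_j = w_0 + w_a * j is snapped onto a geometric grid w_0 * w_m ** s,
    then rounded to a multiple of `divisor`."""
    j = np.arange(depth)
    u = w_0 + w_a * j                             # linear widths
    s = np.round(np.log(u / w_0) / np.log(w_m))   # nearest geometric step
    widths = w_0 * np.power(w_m, s)               # quantized widths
    widths = (np.round(widths / divisor) * divisor).astype(int)
    return widths

# Illustrative parameters, not taken from any specific variant:
print(regnet_widths(w_0=24, w_a=24.5, w_m=2.5, depth=13))
```

Consecutive blocks that end up with the same quantized width are grouped into the same Stage, which is how the four-Stage structure emerges.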
Before opening a pull request that requires large amounts of work, it is advisable to consult the team first so that there is no conflicting or duplicated effort. After getting a solid confirmation from the Keras team, I started working on the code. You can check out our discussion here. Below is a small snippet from our conversation:
François Chollet and the Keras team were super supportive and made merging the PR a smooth process. I express my heartfelt gratitude to the team for their help. Even though I had 24 models to implement, the basic code was fairly straightforward. Thus, I was able to create a PR with the code and get reviews from the team quickly. Check out the PR here.
I mainly used the TPUv3-8 Node for training. It has a 96-core VM with around 335 GB RAM, which handles heavy preprocessing with ease. After preprocessing, raw images were resized to 224x224 as mentioned in the paper. I used multiple TPU Nodes simultaneously, which allowed me to run many experiments in parallel and reduced the experimentation time considerably.
The code used for training is available here.
In this section, I simply jot down the takeaways and the methods I used for achieving this performance.

Input pipeline

I trained these models on the powerful TPUv3 (thanks to TRC). This meant that I had to employ a lightning-fast input pipeline. The input pipeline also had to be static, meaning there could be no abrupt changes in the preprocessing graph at runtime (since the preprocessing functions are optimized using AutoGraph). As per the requirements of TPUs, I stored the ImageNet-1k TFRecords in a GCS bucket and employed an interleaved dataset read.
Learning point: It is important to implement augmentations in the most efficient and stable way possible and minimize slow and redundant ops in the process.
Some chunks of the code are repeated, but this guarantees that the function remains pure. Here, being pure simply means the absence of break statements, which would otherwise cause the graph to change arbitrarily. One can also see that, for example, the variable w_crop is cast to tf.int32 exactly once in the entire function call. It is important to do such optimizations because we are working with a single image at a time and not a batch of images. You can check out the code here. The actual code is not included in this blog for the sake of brevity.
Apart from inception-style cropping, the implementation of the remaining input pipeline was fairly simple. I used inception cropping, channel-wise PCA jitter, horizontal flips and mixup.
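For reference, inception-style cropping samples a crop by area fraction and aspect ratio rather than by a fixed size. Below is a minimal NumPy sketch of the sampling logic only; the actual pipeline works on TF tensors, and the bounds used here are the commonly cited ones (the lower area bound corresponds to the area_factor values in the results tables).

```python
import numpy as np

def inception_crop_box(height, width, area_factor=0.08, rng=None):
    """Sample an inception-style crop box: pick a target area fraction in
    [area_factor, 1.0] and an aspect ratio in [3/4, 4/3], then place the
    crop at a random valid offset. Falls back to a center crop of the
    shorter side if no valid box is found (a common convention)."""
    rng = rng or np.random.default_rng()
    area = height * width
    for _ in range(10):
        target_area = rng.uniform(area_factor, 1.0) * area
        aspect = rng.uniform(3 / 4, 4 / 3)
        w = int(round(np.sqrt(target_area * aspect)))
        h = int(round(np.sqrt(target_area / aspect)))
        if w <= width and h <= height:
            y = rng.integers(0, height - h + 1)
            x = rng.integers(0, width - w + 1)
            return y, x, h, w
    # Fallback: center crop of the shorter side
    side = min(height, width)
    return (height - side) // 2, (width - side) // 2, side, side
```

The cropped region is then resized to the training resolution (224x224 here).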
PCA jitter:
```python
def _pca_jitter(self, image, target):
    """
    Applies PCA jitter to a batch of images.

    Args:
        image: Batch of images to apply PCA jitter to.
        target: Target tensor.

    Returns:
        Augmented example with batch of images and targets with same dimensions.
    """
    aug_images = tf.cast(image, tf.float32) / 255.
    # Sample per-image jitter strengths for the three principal components
    alpha = tf.random.normal((self.batch_size, 3), stddev=0.1)
    alpha = tf.stack([alpha, alpha, alpha], axis=1)
    rgb = tf.math.reduce_sum(
        alpha * self.eigen_vals * self.eigen_vecs, axis=2)
    rgb = tf.expand_dims(rgb, axis=1)
    rgb = tf.expand_dims(rgb, axis=1)
    aug_images = aug_images + rgb
    aug_images = aug_images * 255.
    aug_images = tf.cast(tf.clip_by_value(aug_images, 0, 255), tf.uint8)
    return aug_images, target
```
Mixup:
```python
def _mixup(self, image, label, alpha=0.2) -> Tuple:
    """
    Applies mixup augmentation. To be applied after
    one-hot encoding and batching.

    Args:
        image: Batch of images.
        label: Batch of one-hot encoded labels.
        alpha: Parameter of the Beta distribution used to sample the mixing ratio.

    Returns:
        Tuple with the same structure as the inputs.
    """
    image1, label1 = image, label
    # Mix each example with the reversed batch
    image2, label2 = tf.reverse(
        image, axis=[0]), tf.reverse(label, axis=[0])
    image1 = tf.cast(image1, tf.float32)
    image2 = tf.cast(image2, tf.float32)
    alpha = [alpha]
    # tfd is the tensorflow_probability.distributions module
    dist = tfd.Beta(alpha, alpha)
    l = dist.sample(1)[0][0]
    img = l * image1 + (1 - l) * image2
    lab = l * label1 + (1 - l) * label2
    img = tf.cast(tf.math.round(tf.image.resize(
        img, (self.crop_size, self.crop_size))), tf.uint8)
    return img, lab
```
Random horizontal flip:
```python
def random_flip(self, image: tf.Tensor, target: tf.Tensor) -> tuple:
    """
    Returns a randomly flipped batch of images. Only horizontal
    flips are applied.

    Args:
        image: Batch of images to randomly flip.
        target: Target tensor.

    Returns:
        Augmented example with batch of images and targets with same dimensions.
    """
    aug_images = tf.image.random_flip_left_right(image)
    return aug_images, target
```
PCA jitter and random horizontal flips were suggested in the paper, whereas the addition of mixup was inspired by the paper “Revisiting ResNets: Improved Training and Scaling Strategies”.
Weight decay is a regularization technique where we penalize the weights for being too large. It is a battle-tested method and is often used when training deep neural networks. A small note: I used decoupled weight decay and not the conventional implementation of weight decay.
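To make the distinction concrete: with decoupled weight decay (as in AdamW), the decay term shrinks the weights directly instead of being folded into the gradient as an L2 penalty. The NumPy sketch below uses plain SGD for clarity (the actual training used AdamW); with plain SGD the two forms coincide, but with adaptive optimizers like Adam they do not, which is the point of decoupling.

```python
import numpy as np

def sgd_l2_step(w, grad, lr, wd):
    """Conventional L2 weight decay: the penalty is added to the gradient,
    so it interacts with any gradient scaling the optimizer applies."""
    return w - lr * (grad + wd * w)

def sgd_decoupled_step(w, grad, lr, wd):
    """Decoupled weight decay: the weights are shrunk directly,
    independently of the gradient."""
    return w - lr * grad - lr * wd * w

w = np.array([1.0, -2.0])
g = np.array([0.5, 0.5])
# For plain SGD both updates are identical; an adaptive optimizer would
# rescale `grad` (and with it the folded-in L2 term), making them differ.
```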
I saw that increasing the weight decay too much made it difficult for the model to converge, whereas a small weight decay caused the model to have near-constant accuracy during the final epochs. These observations suggest that weight decay is a strong regularizer, especially for smaller models. Inspired by the paper “Revisiting ResNets: Improved Training and Scaling Strategies”, I kept the weight decay the same for large models, since mixup was increased simultaneously.
Learning point: Weight decay is a strong regularizer. It is advisable to reduce weight decay or keep it the same for large models, where other augmentations or regularizers are being used simultaneously.
Finally, I used a constant weight decay of 5e-5 for all models, which was suggested in the original paper.
Regularization as a function of model size

It is empirically known that increasing augmentation and regularization with model size results in better performance. Conforming to this, I gradually increased the strength of the mixup augmentation as the model size increased, and saw good results with this simple technique.
Learning point: Increase augmentations and regularization when increasing model size.
I had to train 12 variants each of RegNetX and RegNetY. This included small models which don’t have as many parameters as the larger ones. It is speculated that these models simply do not have enough capacity to hold the given information. They tend to underfit, and the solution is seldom as simple as adding augmentation. The best starting point in most cases was low regularization and medium augmentation; I could tune the rest of the hyperparameters from there. These models took a lot of time to fine-tune and train, whereas the larger models had more flexibility. Smaller models are sensitive to small changes in regularization or augmentation.
Learning point: Do a hyperparameter search for small models. Repeat the search as the size of models increases.
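A hypothetical sketch of such a search loop is shown below. The parameter grids are illustrative (loosely based on the ranges in the results tables), and train_and_evaluate is a placeholder standing in for a full training run.

```python
from itertools import product

# Illustrative grids, not the ones actually used
weight_decays = [1e-5, 5e-5]
mixup_alphas = [0.0, 0.1, 0.2]

def train_and_evaluate(weight_decay, mixup_alpha):
    # Placeholder for a full training run returning validation accuracy.
    return 0.0

results = {}
for wd, mix in product(weight_decays, mixup_alphas):
    results[(wd, mix)] = train_and_evaluate(wd, mix)

best = max(results, key=results.get)
```

For small models a full sweep like this pays off; for larger models the best configuration from the previous size is usually a good starting point, so only a reduced search is needed.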
Use multiple copies of the data if possible
It was observed that training would stop abruptly and get stuck at the end of an epoch. I used the tf.data.Dataset.interleave method, which reads data from multiple TFRecords simultaneously. During this read operation, the TFRecords are unavailable to other processes. Since I trained multiple models in parallel, they constantly needed to read data from the same bucket. To counter this, I created multiple copies of the TFRecords and saved them in different buckets. This reduced collisions, and the problem subsided substantially.
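The copies can be assigned to experiments with something as simple as a round-robin mapping. A minimal sketch (the bucket names below are hypothetical):

```python
# Hypothetical bucket names holding identical copies of the TFRecords
BUCKETS = [
    "gs://imagenet-copy-0",
    "gs://imagenet-copy-1",
    "gs://imagenet-copy-2",
]

def tfrecord_pattern(experiment_id: int) -> str:
    """Assign each experiment a bucket copy round-robin so that parallel
    training runs do not all read from the same TFRecords."""
    bucket = BUCKETS[experiment_id % len(BUCKETS)]
    return f"{bucket}/train-*.tfrecord"
```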
Dump logs at a single location

While training a number of models, maintaining the logs can get out of hand. The best way, in my opinion, is to dump all the raw logs in one location. In our case, I organized the logs and checkpoints of the models by the time and date of training. This made it easier to locate and use the checkpoints when needed. Following is a snapshot of the same:
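That naming scheme can be sketched as follows; the directory layout and names here are illustrative, not the exact ones used.

```python
from datetime import datetime
from pathlib import Path

def make_run_dir(base, model_name, now=None):
    """Create a per-run directory keyed by model name and start time so
    that logs and checkpoints from parallel runs never collide."""
    now = now or datetime.now()
    run_dir = Path(base) / model_name / now.strftime("%Y-%m-%d_%H-%M-%S")
    (run_dir / "checkpoints").mkdir(parents=True, exist_ok=True)
    (run_dir / "logs").mkdir(parents=True, exist_ok=True)
    return run_dir
```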
Use automation to reduce cognitive load

Managing many experiments simultaneously quickly becomes difficult. Automating the things you need to do repeatedly is extremely useful from the beginning. For example, one can use Weights and Biases (W&B) to automatically track all experiments. It is useful to log the hyperparameters along with the runs in W&B rather than entering them manually. These seemingly small things reduce a ton of cognitive load, so you can actually focus on what’s important: running experiments. Following is a snapshot of our runs:
Gain intuition about how the model might react to changes in hyperparameters

After working on a single architecture for days or months, you may notice patterns in the performance of the models. This helps in building intuition about how the model might react to different changes in hyperparameters, and from that intuition you can come up with ideas that may increase performance. For example, using a slightly higher weight decay for RegNetY004 led to a sudden increase followed by a decrease in accuracy at the end of the run, while using a lower weight decay flattened this out. This implies that, in this case, a more aggressive augmentation policy along with lower weight decay may help training. In similar fashion, one can spot changes in hyperparameters that lead to significant improvements.
Finally, here are the results. In the following tables, I compare our results with those reported in the paper. The last column lists the hyperparameters that differ from the original implementation.
X variant
Model | Paper | Ours | Diff | Comments |
---|---|---|---|---|
X002 | 68.9 | 67.15 | 1.75 | adamw, area_factor=0.25 |
X004 | 72.6 | 71.22 | 1.38 | adamw, area_factor=0.08 |
X006 | 74.1 | 72.37 | 1.73 | adamw, area_factor=0.08 |
X008 | 75.2 | 73.45 | 1.75 | adamw, area_factor=0.08 |
X016 | 77 | 75.55 | 1.45 | adamw, area_factor=0.08, mixup=0.2 |
X032 | 78.3 | 77.09 | 1.21 | adamw, area_factor=0.08, mixup=0.2 |
X040 | 78.6 | 77.87 | 0.73 | adamw, area_factor=0.08, mixup=0.2 |
X064 | 79.2 | 78.22 | 0.98 | adamw, area_factor=0.08, mixup=0.3 |
X080 | 79.3 | 78.41 | 0.89 | adamw, area_factor=0.08, mixup=0.3 |
X120 | 79.7 | 79.09 | 0.61 | adamw, area_factor=0.08, mixup=0.4 |
X160 | 80 | 79.53 | 0.47 | adamw, area_factor=0.08, mixup=0.4 |
X320 | 80.5 | 80.35 | 0.15 | adamw, area_factor=0.08, mixup=0.4 |
Y variant
Model | Paper | Ours | Diff | Comments |
---|---|---|---|---|
Y002 | 70.3 | 68.51 | 1.79 | adamw, WD=1e-5, area_factor=0.16, mixup=0.2 |
Y004 | 74.1 | 72.11 | 1.99 | adamw, WD=1e-5, area_factor=0.16, mixup=0.2 |
Y006 | 75.5 | 73.52 | 1.98 | adamw, area_factor=0.16, mixup=0.2 |
Y008 | 76.3 | 74.48 | 1.82 | adamw, area_factor=0.16, mixup=0.2 |
Y016 | 77.9 | 76.95 | 0.95 | adamw, area_factor=0.08, mixup=0.2 |
Y032 | 78.9 | 78.05 | 0.85 | adamw, area_factor=0.08, mixup=0.2 |
Y040 | 79.4 | 78.2 | 1.2 | adamw, area_factor=0.08, mixup=0.2 |
Y064 | 79.9 | 78.95 | 0.95 | adamw, area_factor=0.08, mixup=0.3 |
Y080 | 79.9 | 79.11 | 0.69 | adamw, area_factor=0.08, mixup=0.3 |
Y120 | 80.3 | 79.45 | 0.85 | adamw, area_factor=0.08, mixup=0.4 |
Y160 | 80.4 | 79.71 | 0.69 | adamw, area_factor=0.08, mixup=0.4 |
Y320 | 80.9 | 80.12 | 0.78 | adamw, area_factor=0.08, mixup=0.4 |
I trained a total of 24 models for this PR. It was an enriching learning experience, and I hope these models will be used by many developers. I hope to continue contributing to TensorFlow and Keras in the future.
This article originally appeared on Towards Data Science.
Keras has recently taken a big step towards improving the developer experience by hosting the codebase in a separate repository. As mentioned in the RFC, one of the main objectives is to eliminate the lengthy feedback loop caused by the long build times of the core TensorFlow library. Due to this change, it is now possible to run tests in a reasonable amount of time.
This blog post aims to serve as a beacon for budding developers who wish to contribute to Keras but are not familiar with the building and testing procedure. Let’s dive in!
For all the newcomers here, I’d like to take a moment to explain what “building from source” exactly means. This can mean many things (best explained here), but in our case it means the following:
“Compile the source code into an installable package and link all modules to their respective endpoints” [1].
Note that even after this migration, Keras is still accessed by calling from tensorflow import keras. This is enabled by something known as golden APIs: endpoints exposed by the Keras library for the TensorFlow library to pick up. Therefore, even though Keras is developed separately, for the user it still resides at tf.keras. You can learn more about this in this post. The code that enables this is available here.
I assume you are doing this on a Linux machine. As a bonus, this works flawlessly on TPU-enabled Cloud VMs.
All of the following commands were taken directly or inspired by the official Keras contributing guide. Please go through the same before opening a PR.
I’ve also created a Colab notebook, which you can use to build and test the code easily!
Just like TensorFlow, Keras uses Bazel [2], a graph-based build management system. This means you can build Keras once and the successive builds will reuse the parts which have not changed since the previous one. Due to this, the time required to rebuild decreases dramatically. Here’s what we do to set up the environment:
```shell
# Install the latest version of Bazel. At the time of writing, the latest version was 4.2.1.
wget https://github.com/bazelbuild/bazel/releases/download/4.2.1/bazel-4.2.1-installer-linux-x86_64.sh
chmod +x bazel-4.2.1-installer-linux-x86_64.sh
./bazel-4.2.1-installer-linux-x86_64.sh
export PATH="$PATH:$HOME/bin"
# Verify the installation
bazel
```
Next, we set up a Python virtual environment. This is recommended if you’re working on a development machine. In Colab, it’s fine to reinstall Keras in the base environment.
```shell
mkdir keras_installation
cd keras_installation
mkdir keras_env
python3 -m venv keras_env
source keras_env/bin/activate
```
Next, we clone our development fork. We also install the nightly version of TensorFlow, which ensures we’re in sync with the main TensorFlow repo.
```shell
# Replace BRANCH and USERNAME with your branch name and GitHub username respectively
git clone -b BRANCH https://github.com/USERNAME/keras.git
cd keras
pip install -r requirements.txt
pip uninstall -y keras-nightly
pip install --upgrade tf-nightly
```
This part applies only if you’ve added new files. You need to add their names to the following files, which ensures that your module will be built and accessible to users later on.
```
.
└── keras/
    ├── api/
    │   ├── BUILD
    │   └── api_init_files.bzl
    ├── ...
    └── <submodule name>/
        └── BUILD
```
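For instance, a new submodule’s BUILD file typically declares a py_library target along the lines of the following. The target and dependency names here are purely illustrative; check the neighboring BUILD files in the repo for the exact conventions used.

```python
# keras/<submodule name>/BUILD (illustrative sketch, not the actual file)
py_library(
    name = "my_submodule",
    srcs = ["my_submodule.py"],
    deps = [
        # Hypothetical dependency; real targets depend on what the module uses
        "//keras:backend",
    ],
)
```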
Now comes the crux of the process: building and installing our version of Keras. Use the following commands to do so:
```shell
# Make sure you run the following commands from the root of your Keras fork
bazel build //keras/tools/pip_package:build_pip_package
~/keras_installation/keras/bazel-bin/keras/tools/pip_package/build_pip_package ~/keras_installation/keras_pkg
pip3 install --force-reinstall --user ~/keras_installation/keras_pkg/keras-2.8.0-py2.py3-none-any.whl
# Note: the version 2.8.0 can change with changes in Keras versions.
```
After executing the above commands, Keras will be installed anew with your change.
If you only want to run tests, you can use bazel test instead of bazel build. In that case, you can make changes to the code and run bazel test again; you need not manually install the package as we do with bazel build.
Example:

```shell
bazel test keras/layers/convolutional_test
```
Here you can run as many tests as you like. You can run all tests using the following command:

```shell
bazel test --test_timeout 300,450,1200,3600 --test_output=errors --keep_going \
  --define=use_fast_cpp_protos=false --build_tests_only \
  --build_tag_filters=-no_oss --test_tag_filters=-no_oss keras/...
```
If the build fails with an error like the following:

```
ERROR: /home/jupyter/.cache/bazel/_bazel_jupyter/ebc81b3ee71ff9bb69270887ebdc0d7b/external/bazel_skylib/lib/unittest.bzl:203:27: name 'analysis_test_transition' is not defined
ERROR: error loading package '': Extension 'lib/unittest.bzl' has errors
ERROR: error loading package '': Extension 'lib/unittest.bzl' has errors
INFO: Elapsed time: 3.557s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)
```

Fix: restart the VM. This happens just after installation because the path is not updated across the system.
If you see the following while importing Keras:

```
ImportError: cannot import name 'saved_metadata_pb2' from 'keras.protobuf' (unknown location)
```

Fix: change directories and try again. This happens due to a mixture of local and global environments.
I thank TPU Research Cloud (TRC) [3] for supporting this project. TRC provided TPU access for the duration of this project. Google supported this work by providing Google Cloud credit. Thanks to Qianli Scott Zhu from the Keras team for guiding me through the process.
Keras is a versatile and flexible library for deep learning. It is used by thousands of developers and is a big open source project. If you find a bug or want a feature implemented in Keras, do it yourself! There’s no better joy than watching your code being used by countless people. And now that Keras can be built with ease, improving the codebase and making Keras a better product for everybody is within everyone’s reach.
[1] Greg Mattes’ answer on StackOverflow
[2] “Bazel - a fast, scalable, multi-language and extensible build system”