Controlling 3D Human Action with Transformer Variational Autoencoder in Latent Space
Overview
Action-conditioned transformer VAEs have shown the ability to generate realistic and diverse human motion sequences. Taking a step further, we want to control specific body parts of the generated motions, thereby achieving more degrees of freedom and diversity in human actions. To attain this part-wise control, we acquire attribute vectors through low-rank factorization and null-space projection. However, we empirically found that the attribute vectors of the individual parts are entangled with one another due to posterior collapse, a phenomenon in which VAE models ignore the latent variables, especially when using flexible generators. Among the various attempts to address posterior collapse, we mitigate the problem with a scheduling scheme for the KL-term weight (β). Furthermore, to enhance controllability, we propose a data augmentation that encourages the rate of change of motions to be diverse. We evaluate our approach on the UESTC and HumanAct12 datasets in the class-conditional setting. We also show that actions manipulated with our method are plausible and human-like. In addition, we show that our control can be applied to actions generated in the unconditional setting, which reveals potential for future research. To the best of our knowledge, our work is the first to directly control motions in the latent space without using any other modality.
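To make the pipeline concrete, below is a minimal sketch of the two ingredients named above: a LowRankGAN-style attribute direction (low-rank factorization of the Jacobian plus null-space projection) and one simple option for scheduling the KL weight β (cyclical annealing). The `decoder` interface, the joint index sets, and all shapes are illustrative assumptions, not our exact implementation.

```python
import torch

def part_direction(decoder, z, part_idx, rest_idx, energy=0.99):
    """Unit latent direction that moves the selected joints while,
    to first order, leaving the remaining joints unchanged."""
    # Jacobian of the flattened decoded motion w.r.t. the latent code.
    J = torch.autograd.functional.jacobian(
        lambda latent: decoder(latent).reshape(-1), z)       # (D, d_z)
    J_part, J_rest = J[part_idx], J[rest_idx]
    # Null space of the "rest" Jacobian: latent directions that barely
    # move the other body parts (rows of Vh past the dominant spectrum).
    _, S, Vh = torch.linalg.svd(J_rest, full_matrices=True)
    k = int(((S.cumsum(0) / S.sum()) < energy).sum())
    null_basis = Vh[k:]                                      # (q, d_z)
    # Low-rank factorization of the "part" Jacobian restricted to the
    # null space; its principal direction is the attribute vector.
    _, _, Vh_p = torch.linalg.svd(J_part @ null_basis.T)
    v = Vh_p[0] @ null_basis                                 # back to latent space
    return v / v.norm()

def beta_schedule(step, cycle=10_000, beta_max=1.0):
    """Cyclical annealing: ramp beta from 0 to beta_max during the
    first half of each cycle, then hold it there."""
    return beta_max * min(1.0, 2.0 * (step % cycle) / cycle)
```

Editing a motion then amounts to decoding z + αv for a part-specific direction v, which is what the arm and leg demos below show.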

Motivation for the work

Manually changing the pose parameters (θ) leads to problems such as motion beyond the normal human range or mesh penetration. The videos below show the different side effects that appear when the same modification is applied to various classes (class 0, class 22, class 5); a minimal sketch of such a naive edit follows the list.
[Class 0] For class 0, the motion seems to move more dynamically, but nothing has actually changed, nor does it show unnatural behavior.
[Class 22] For class 22, you can see that the person moves its arm beyond the normal range of an arm.
[Class 5] You can see the hand penetrating the stomach in the modified version. Since SMPL is just a mesh-based model that does not account for penetration, penetration is often observed after modification.
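For concreteness, the naive edit looks roughly like the sketch below; the joint index, file name, and (frames, joints, 3) axis-angle layout are illustrative assumptions, not our exact data format.

```python
import numpy as np

ARM_JOINT = 17                        # hypothetical shoulder joint index
theta = np.load("pose_params.npy")    # assumed (T, 24, 3) axis-angle array
theta[:, ARM_JOINT, 2] += 0.5         # rotate that joint by a fixed amount
# Nothing constrains the result: the edit can exceed joint limits or
# push the hand through the body, as the videos below show.
```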

[Video] Class 0: Normal motion after modification

[Video] Class 22: Out of the normal range of human motion

[Video] Class 5: Penetration of human body parts

Results

Conditional Results

In the conditional setting, we can perform part-wise control. Below, we show arm and leg control, respectively.

Arm

[Figure] Arm pulled up | Original z | Arm pulled down
For the class shown above, we can see that the leftmost SMPL model raises its arm higher than the one in the middle (original z). Likewise, when we move the latent z in the opposite direction, the rightmost figure barely raises its arm.

We can also see that body parts other than the arm are identical across the three figures.
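A hypothetical snippet for producing the three panels, reusing the `part_direction` sketch from the overview (`decoder`, the index sets, and the step size `alpha` are assumptions):

```python
v_arm = part_direction(decoder, z, arm_idx, rest_idx)  # arm attribute vector
alpha = 3.0                                            # assumed step size
motion_up   = decoder(z + alpha * v_arm)   # left panel: arm pulled up
motion_orig = decoder(z)                   # middle panel: original z
motion_down = decoder(z - alpha * v_arm)   # right panel: arm pulled down
```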

Leg

[Figure] Leg spread | Original z | Leg narrowed
In the figures above, comparing the left and right figures with the middle one (original z), we can see that the angle between the legs changes. The left figure shows the result of moving the latent in the leg-widening direction; the right figure shows the result of moving it in the opposite direction, which narrows the legs of the SMPL model.

Note that parts other than the legs are identical across the three figures.

Unconditional Results

In the unconditional setting, we can interpolate between two different classes to create a new class that is not contained in the training dataset. Furthermore, we can control the generated new class using the directions found by our method. The results are shown below.
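A minimal sketch of this interpolate-then-control step (the latents `z_a`, `z_b`, the direction `v_leg`, and `alpha` are illustrative assumptions):

```python
t = 0.5
z_new = (1 - t) * z_a + t * z_b            # blend two class latents
motion = decoder(z_new + alpha * v_leg)    # steer the interpolated motion
```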

[Video] Unconditional result 1

[Video] Unconditional result 2

Full Demo Video

Application of our method to related work [1]

We applied our part-wise control method to another motion generation network [1] to examine the effectiveness of our method.
(Top) We controlled only the arm while leaving the rest of the body almost unchanged. In the picture on the right, the person (rig) raises its arm more than in the original image (left).
(Bottom) We controlled only the leg while leaving the rest of the body almost unchanged. In the picture on the right, the person (rig) stretches its leg less than in the original image (left).

[Video] Controlling the arm (raising the arm more)

[Video] Controlling the legs (stretching the legs less)


[1] Lu, Q., Zhang, Y., Lu, M., & Roychowdhury, V. (2022, October). Action-conditioned On-demand Motion Generation. In Proceedings of the 30th ACM International Conference on Multimedia (pp. 2249-2257).
Related Work
Jiapeng Zhu, Ruili Feng, Yujun Shen, Deli Zhao, Zhengjun Zha, Jingren Zhou, Qifeng Chen
Low-Rank Subspaces in GANs.
Conference on Neural Information Processing Systems (NeurIPS) 2021.

Comment: Controls images in latent space by modifying only the region of interest while keeping the rest unchanged.
We also borrowed the project page format from the authors of this paper (thanks!).
Mathis Petrovich, Michael J. Black, and Gül Varol
Action-Conditioned 3D Human Motion Synthesis with Transformer VAE.
ICCV 2021.

Comment: Generates action-conditioned, realistic, and diverse human motion sequences using a Transformer-based VAE architecture.