FPGA Virtualization for Deep Learning: Achieving 3X Performance in the Cloud

This is a guest post from Song Yao, Xilinx Senior Director in AI Business.

Cloud computing has become the new computing paradigm. For cloud computing, virtualization is necessary to enable isolation between users, high flexibility and scalability, high security, and maximized utilization of hardware resources.

Since 2017, FPGAs have been widely adopted in cloud computing because of their programmability, low latency, and high energy efficiency. Amazon Web Services, Aliyun (Alibaba Cloud), Microsoft Azure, Huawei Cloud, and others all currently offer Xilinx FPGA instances. However, FPGA instances in the cloud are still physical instances aimed at single-task, static-workload scenarios: if there are multiple users, they can only share one FPGA instance through time-division multiplexing (TDM). Demand for virtualized FPGAs is therefore growing.

The research group of Professor Yu Wang at Tsinghua University has been working on FPGA virtualization for years and recently proposed a framework that enables node-level FPGA virtualization for deep learning acceleration. Performance isolation for multiple users is enabled through a two-level instruction dispatch module (IDM) and a multi-core hardware resource pool. The overhead of online re-compilation is reduced to about 1 ms by a tiling-based instruction frame package design combined with two-stage static-dynamic compilation. The baseline DNN accelerator design is based on Angel-Eye, a DNN accelerator on FPGA published by Prof. Wang’s group in 2017.

The paper has just been presented at the 28th FCCM, a premier academic conference in the field of programmable computing. A demo is also available for anyone who wants to try it:

https://github.com/annoysss123/FPGA-Virt-Exp-on-Aliyun-f3

General Introduction

As shown in Figure 1 (a), an FPGA instance provided in the cloud is usually a resource-rich device such as the Xilinx VU9P, which could support a large number of users. For the public cloud, there are two typical isolation methods between users: physical resource isolation and performance isolation. Physical resource isolation allocates different hardware resources to different users, while performance isolation means that the performance provided to each user is not disturbed by the tasks of other users. For the private cloud, virtualization aims to maximize overall system performance.

In this research, two baseline designs are used for comparison: a static single large-core design that supports multiple users through time-division multiplexing (TDM), and a static multi-core design with 16 small cores that assigns one core to each user. The virtualized design also has 16 small cores but uses space-division multiplexing (SDM) to support multiple tasks dynamically.

Figure 1. Virtualization method for ISA-based DNN accelerator on FPGA: (a) Hardware architecture for public cloud; (b) Two-stage compiler design for private cloud.

As shown in Figure 1 (b), a two-stage compilation flow is proposed to reduce online re-compilation overhead. The basic idea is to tile the output feature map into blocks and to package the series of instructions for each tile into an instruction frame package (IFP). During offline deployment, IFPs are generated based on the DNN model and the hardware configuration of the basic shareable units (small cores). During online deployment, only the pre-generated IFPs need to be re-allocated to the cores, according to the hardware resources re-allocated for new users. A simple latency simulator is also proposed to predict the latency of each IFP so that the workload can be balanced across all allocated cores.

Figure 2. Hardware architecture of the proposed virtualized FPGA DNN accelerator

Hardware and Compiler Design

To realize virtualization, the hardware architecture of the DNN accelerator, shown in Figure 2, differs from that of a traditional DNN accelerator. First, as shown in Figure 2 (a), the accelerator adopts a Hardware Resource Pool (HRP) of multiple small cores, which are allocated exclusively to different users. Second, unlike the original Instruction Distribution Module (IDM), which only implements instruction distribution and dependency management within a single core, a two-level IDM is designed to achieve multi-core sharing and synchronization, as shown in Figure 2 (b).

The first level IDM has four modules: Instruction Mem., Instruction Decoder, Context-Switch Controller, and Multi-Core Sync. Controller. Instruction Mem. fetches the instructions from DDR and caches them until the next reconfiguration. The Instruction Decoder decodes them and sends each instruction to the second level IDM of the corresponding core according to its core index. The Context-Switch Controller records the index of the DNN layer that has been executed, so that other cores can continue computing from the intermediate results. The Multi-Core Sync. Controller manages the layer-wise multi-core synchronization.

The second level IDM manages the computation inside each core. The Context-Switch Module can restart the computation based on the context information recorded by the first level IDM in the online reconfiguration stage. The System Synchronization Controller generates the local synchronization signal when the computation of the current DNN layer finishes and then waits for a valid global synchronization signal for the next layer to start.
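The two-level dispatch and layer-wise synchronization described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the `Instruction` fields, `dispatch`, and `run_layerwise` are hypothetical names invented for illustration, and the cores are simulated sequentially rather than in parallel hardware.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    core: int   # target small core (hypothetical field name)
    layer: int  # DNN layer this instruction belongs to
    op: str     # opaque payload, e.g. "CONV"


def dispatch(instructions):
    """First level: route each instruction to a per-core queue by core index."""
    queues = defaultdict(list)
    for inst in instructions:
        queues[inst.core].append(inst)
    return dict(queues)


def run_layerwise(queues):
    """Execute layer by layer with a global barrier between layers.

    Returns the execution trace and the index of the last finished layer,
    which is the context a newly allocated core would resume from.
    """
    layers = sorted({inst.layer for q in queues.values() for inst in q})
    trace, last_finished = [], None
    for layer in layers:
        local_sync = set()
        for core in sorted(queues):
            for inst in queues[core]:
                if inst.layer == layer:
                    trace.append((layer, core, inst.op))
            local_sync.add(core)  # second level: local sync signal raised
        # First level: the global sync is valid once every core has signalled.
        if local_sync == set(queues):
            last_finished = layer
    return trace, last_finished


program = [
    Instruction(core=0, layer=0, op="CONV"),
    Instruction(core=1, layer=0, op="CONV"),
    Instruction(core=0, layer=1, op="POOL"),
    Instruction(core=1, layer=1, op="POOL"),
]
trace, last_finished = run_layerwise(dispatch(program))
```

In this model no instruction of layer 1 appears in the trace before every core has finished layer 0, which mirrors the layer-wise barrier between the local and global synchronization signals.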

Figure 3. The compilation flow for virtualized FPGA DNN accelerator, including static compilation (left) and dynamic compilation (right).

Tiling is an important technique in DNN accelerator design: partitioning the feature maps into tiles enables massive parallelism and data reuse. Tiling can be applied along different dimensions, such as the height and width of the feature maps. However, since the compiler generates convolution instructions along the height dimension, the feature-map width and the output-channel dimension are selected as the tiling dimensions for generating IFPs.
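As a rough illustration of tiling along the width or output-channel dimension, the sketch below splits a layer's output into contiguous tiles, one per IFP. The function names and the dictionary layout of a tile descriptor are assumptions made for this example; the actual compiler emits instruction sequences, not index ranges.

```python
def tile_ranges(size, n_tiles):
    """Split range(size) into n_tiles contiguous, near-equal (start, stop) tiles."""
    if not 1 <= n_tiles <= size:
        raise ValueError("n_tiles must be between 1 and size")
    base, extra = divmod(size, n_tiles)
    ranges, start = [], 0
    for i in range(n_tiles):
        # The first `extra` tiles take one additional element each.
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def make_ifp_tiles(width, out_channels, n_tiles, dim):
    """Return one tile descriptor per IFP, tiled along 'width' or 'out_channel'."""
    if dim == "width":
        return [{"w": r, "oc": (0, out_channels)}
                for r in tile_ranges(width, n_tiles)]
    if dim == "out_channel":
        return [{"w": (0, width), "oc": r}
                for r in tile_ranges(out_channels, n_tiles)]
    raise ValueError("dim must be 'width' or 'out_channel'")


width_tiles = make_ifp_tiles(width=14, out_channels=64, n_tiles=4, dim="width")
channel_tiles = make_ifp_tiles(width=14, out_channels=64, n_tiles=4, dim="out_channel")
```

Each descriptor covers the full extent of the dimension that is not tiled, so the tiles of one layer together cover the whole output feature map exactly once.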

The left part of Figure 3 shows how to get the latency results for each IFP using different tiling dimensions. After tiling, instructions in each tile are integrated into an IFP and then a cycle-level latency simulator predicts the latency of each IFP (T_1 to T_N or T_M). In the dynamic compilation stage, the dynamic compiler fetches the latency predictions and finds the optimal allocation strategy for multi-core sharing to minimize the total latency of a DNN layer.
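To make the dynamic allocation step concrete, here is a minimal sketch that assigns IFPs to cores from their predicted latencies. It uses a greedy longest-first heuristic, which is a common load-balancing technique swapped in for illustration and not necessarily the strategy the paper's dynamic compiler uses; `allocate_ifps` and the latency values are hypothetical.

```python
import heapq


def allocate_ifps(latencies, n_cores):
    """Greedy balance: place each IFP, longest first, on the least-loaded core.

    latencies[i] is the predicted latency of IFP i. Returns the per-core
    assignment and the resulting layer latency (load of the slowest core).
    """
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    heap = [(0.0, core) for core in range(n_cores)]  # (current load, core index)
    assignment = {core: [] for core in range(n_cores)}
    order = sorted(range(len(latencies)), key=lambda i: latencies[i], reverse=True)
    for ifp in order:
        load, core = heapq.heappop(heap)  # least-loaded core so far
        assignment[core].append(ifp)
        heapq.heappush(heap, (load + latencies[ifp], core))
    layer_latency = max(load for load, _ in heap)
    return assignment, layer_latency


# Hypothetical predicted latencies (arbitrary units) for six IFPs on two cores.
assignment, layer_latency = allocate_ifps([4, 3, 3, 2, 2, 2], n_cores=2)
```

Because all cores synchronize at the end of a layer, the layer latency is set by the most heavily loaded core, which is why the sketch returns the maximum load rather than the sum.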

Experimental Results

To evaluate the performance of this virtualization design, the research team used a Xilinx Alveo U200, a Xilinx VU9P FPGA in Aliyun (Alibaba Cloud), and an NVIDIA Tesla V100 GPU to run four well-known DNN models for comparison: Inception v3, VGG16, MobileNet, and ResNet50. Xilinx SDAccel 2018.3 is used for hardware synthesis and software deployment. Hardware resource utilization results are shown in Table 1. The virtualized multi-core design uses 1% more logic and memory resources than the static multi-core design, since the two-level IDM costs slightly more resources. The multi-core designs consume almost double the resources of the single large-core design, since many modules, such as the data mover, are replicated for each small core.

Table 1. Hardware resource utilization (table image not reproduced).

First, the cost of context switching with different numbers of re-allocated cores is evaluated. As shown in Table 2, static compilation takes 14.7-46.8 s to generate IFPs during offline deployment, while dynamic compilation costs only 0.4-1.5 ms. The total online reconfiguration overhead is limited to 0.45-1.70 ms, including the time to transfer the instruction files. Compared with a non-virtualized design that takes tens of seconds to re-compile the whole DNN model, the online reconfiguration overhead of the virtualized design is negligible.

Table 2. Context-switching cost with different numbers of re-allocated cores (table image not reproduced).

A good virtualization framework should achieve good performance isolation, which means the performance of one user should not be influenced by other tasks. The second experiment assumes there are 4 possible users, gives one user a fixed share x of the resources (100%, 75%, 50%, or 25% of the total), lets the remaining users occupy the other (1 – x), and records the maximum and minimum performance that the fixed user obtains. As shown in Figure 4, when a user monopolizes all resources, there is no performance deviation. When a single user occupies 75%, 50%, and 25% of the total resources, the GPU virtualization solution shows 7.1-13.1%, 5.5-10.9%, and 6.5-8.1% performance deviations, respectively, while the FPGA virtualization design limits them to within 1%. The FPGA virtualization solution thus achieves much better isolation than the GPU while meeting the requirements for isolation.
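The post does not give the exact formula behind the deviation figures. One plausible reading, assumed here purely for illustration, is the relative spread between the best and worst throughput a user observes while the other users' workloads change. The function name and the sample numbers below are hypothetical, not measurements from the experiment.

```python
def performance_deviation(throughputs):
    """Relative spread (best - worst) / best over a user's observed throughputs.

    Assumed definition for illustration; the original experiment may
    normalize differently.
    """
    best, worst = max(throughputs), min(throughputs)
    if best <= 0:
        raise ValueError("throughputs must contain a positive value")
    return (best - worst) / best


# Hypothetical throughputs of one user under three co-tenant workloads.
deviation = performance_deviation([100.0, 99.0, 99.5])
```

Under this definition, a user whose throughput never changes has a deviation of zero, matching the case where one user monopolizes all resources.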

Figure 4. Performance isolation: The performance deviation for ideal situation for one user with different hardware resources when there are 4 users.

A good virtualization framework should also deliver performance that scales roughly linearly with the resources allocated to a user. Figure 5 shows the performance results with different DNN models and different tiling strategies. The red line represents the performance of a single large core, and the dark blue line represents the virtualized design with workload balancing.

For Inception v3 and VGG16, the performance loss of the virtualized design is just 0.95% and 3.93%, respectively. For MobileNet, however, the performance loss is 31.64%, since this compact DNN model requires much more memory bandwidth and the multi-core design further increases the demand for bandwidth.

For VGG16, the performance is nearly linear in the parallelism, since it is a compute-bound task. For Inception v3 and MobileNet, however, even the single large-core design falls well short of ideal linear scaling at high parallelism, because both are memory-bound.

Figure 5. The single-task throughput under different situations with different parallelism.

For the scenario with multiple users and multiple tasks, performance is measured as the total throughput of the FPGA chip. As shown in Figure 6, the four columns, from light blue to dark blue, represent the throughput of virtualized cores, virtualized cores optimized for a single task per core, the static multi-core design, and the large single-core design. A total of 16 cores are implemented on the FPGA, so at most 16 tasks can be supported simultaneously.

When there are only a few tasks (1, 2, or 4) and each task is sent to one small core, the static multi-core design cannot achieve high throughput, while the virtualized designs perform much better. When there are 8, 12, or 16 tasks, the throughput of the static single large-core design does not improve, since all tasks are executed in a TDM manner. For the virtualized design, however, the more tasks are executed, the higher the throughput. With a total of 16 tasks, the optimized virtualized design achieves the best throughput.

Figure 6. The multi-task throughput under different situations

In conclusion, the proposed FPGA virtualization framework provides excellent performance isolation, scalability, and flexibility. With an online reconfiguration overhead of about 1 ms and a 1.12% single-core performance loss, it achieves 1.07x – 1.69x and 1.88x – 3.12x performance improvements compared with the baseline designs. The virtualized FPGA design also achieves strong isolation and linearity with respect to hardware resources. This will help further reduce the TCO of deep learning applications in the cloud.

If you are interested in the original paper, you can download it from arXiv:

https://arxiv.org/abs/2003.12101

Source: https://forums.xilinx.com/t5/AI-and-Machine-Learning-Blog/FPGA-Virtualization-for-Deep-Learning-Achieving-3X-Performance/ba-p/1104520

A Face Too Sexy For Social Media

Fullmetal Magdalene

Being a 90s kid, I don’t remember a time when female sexual empowerment wasn’t a hot topic. When Madonna kissed both Britney Spears and Christina Aguilera in their 2003 VMA performance, people were shocked to see such behavior on TV. Fast forward to today, where WAP has over 400 million views on YouTube with no age restriction on the video, photos and videos of scantily clad women flood all social media platforms, and popular streaming service Netflix hosts Cuties, a film featuring what some critics describe as ‘soft core porn’ involving girls at the young age of eleven.

As a female artist who explores my own relationship to feminine sexual energy in my works, I began posting my art to my social media accounts with no concern that any of them would be viewed as obscene or breaking community guidelines. None of the women in my pieces are engaged in sexual acts and they were created with the intent of exploring the female experience rather than as visual aids for sexual gratification. In fact, I have been met with criticism from viewers that my work is not sexually explicit enough for their liking.

My current NFT art series titled ‘Crypto Sluts’ plays with tongue-in-cheek sexual innuendo but is some of my most demure work. Each piece in the series features a portrait of a beautiful woman, face flushed and eyes rolled back in ecstasy, with a round item on her tongue sporting her favorite cryptocurrency’s logo. ‘Crypto Slut’ is a self-descriptive term I use for myself, as I am not a maximalist for any crypto project but rather prefer to experiment with them all. The collection itself is a representation of the passion I have witnessed the crypto community showing for their favorite projects, so Crypto Sluts felt like a perfect title. Innuendo aside, each piece shows absolutely no sexual activity or adult theme. It’s drawn leaving interpretation completely up to the viewer, including what that round item on her tongue might be.

When the art reveal video for my Bitcoin Slut caught some traction on YouTube it was met with a near 50/50 like to dislike ratio and the comments were ‘WTF?’, ‘Why…’, and ‘I am utterly disgusted’. Unexpected but not terrible. Twitter flagged multiple posts of mine for using the word ‘Slut’ but I was able to resolve that. The most shocking was what happened on TikTok. TikTok is a platform where videos of girls under the age of 18 twerking in crop tops and booty shorts, strippers in the club dancing on stage, and women discussing working as Sugar Babies have hundreds of thousands of views. Artwork from my Crypto Sluts collection was flagged as ‘Adult Sexual Content’ so many times on TikTok that I was eventually restricted from posting. Each time my work was flagged I appealed and each time the team said that after review my posts were found to have in fact broken the platform rules of no adult sexual activity and were permanently removed. Compare this to Minds, a platform that requires users to flag NSFW content or face a channel strike. Not only am I not required to flag my Crypto Sluts as NSFW, but they are eligible for promotion across the site (Minds does not allow promotion of NSFW material and all promotions are reviewed by the platform).

I’ve experienced inconsistent censorship of my art before, but this is the most perplexing case. This and previous instances have blurred the line of what is too sexy for social media and leave me confused. Whereas I started publishing my work believing it would be met with general platform acceptance given the content already being hosted, I now lack confidence in my ability to continue sharing my art on social media without fear of losing my entire account.

Source: https://medium.com/@fullmetalmagdalene/a-face-too-sexy-for-social-media-1fbb8d181872?source=rss------cryptocurrency-5

Post-Bitcoin’s Mild Drop, El Salvador’s Bukele Reveals Excitement For A Larger Bitcoin Dip

Key Takeaways 

  • President Nayib Bukele is unfazed by Bitcoin’s price dip to $60k.
  • Bukele teases the need for lower lows in a bid to find the perfect entry point.
  • El Salvador shows no sign of disposing of its Bitcoin holdings in the long term.

The president of El Salvador is showing himself to be unshaken by Bitcoin’s price volatility. President Nayib Bukele is beginning to adopt the culture of calmness that many Bitcoin proponents have shown over the years when the market is hindered by a price drop.

In a recent tweet, the President is seen asking his followers whether to buy the dip or not. He then proceeds to tease the need for Bitcoin to drop even further, so as to allow him an opportunity to buy the asset at a much lower price.

“Should we buy the dip?

Or is it too small?

Come on guys, we need a better discount here!” said Bukele in a recent tweet.

The concept of buying low and holding till the price of Bitcoin goes higher is one that key players have preached and presented continuously as the least risky and most promising way to hold Bitcoin.

Because maximalists’ views are often tied to the belief that Bitcoin has more upside potential in every market, whether bearish or bullish, the act of holding regardless of how low prices drop is an indicator that the holders’ sentiments are bullish in the long term.

For President Nayib Bukele who has expressed similar views in the past, it is clear where he stands with Bitcoin at this time. Recall that back in September, El Salvador bought an additional 150 Bitcoins, following the selloff that caused Bitcoin to shed $5,000 and sent its price down to $45,000.

Although Bukele’s methods have attracted criticism from many onlookers, his pattern of buying the dip is a bet that could pay off greatly in the long term.

In the past, long-term holders have also seen the most success with Bitcoin. Reports from on-chain analytics platform Glassnode have recorded holders who have not sold their assets for more than three years.

Notably, last year, when the price of Bitcoin hit $20,000, these holders saw their asset value surge significantly. However, for institutions and traders, the culture of exiting to avoid a perceived bear trend is normal. But for El Salvador, the question of selling seems to be out of the picture for now.

Source: https://zycrypto.com/post-bitcoins-mild-drop-el-salvadors-bukele-reveals-excitement-for-a-larger-bitcoin-dip/

6 Common Mistakes of Crypto Beginners – Be Extremely Cautious!

It takes more than diamond hands to succeed in this space

Don’t let yourself be blinded by greed! Photo by Thought Catalog from Unsplash

First rule: Don’t lose money

Second rule: Make money.

And always follow this order.

You’ve heard of people having 100x gains (or more!) on their crypto assets, it sounds amazing, right…

Source: https://medium.com/yardcouch-com/6-common-mistakes-of-crypto-beginners-be-extremely-cautious-eb848e2c9ac8?source=rss------cryptocurrency-5
