FPGA Virtualization for Deep Learning: Achieving 3X Performance in the Cloud

This is a guest post from Song Yao, Senior Director of AI Business at Xilinx.

Cloud computing has become the new computing paradigm. For cloud computing, virtualization is necessary to enable isolation between users, high flexibility and scalability, high security, and maximized utilization of hardware resources.

Since 2017, thanks to its programmability, low latency, and high energy efficiency, the FPGA has been widely adopted in cloud computing. Amazon Web Services, Aliyun (Alibaba Cloud), Microsoft Azure, Huawei Cloud, and others all provide Xilinx FPGA instances on their clouds today. However, FPGA instances on the cloud are still physical instances aimed at single-task, static-workload scenarios: if there are multiple users, they can only share one FPGA instance in a time-division multiplexing (TDM) fashion. There is an increasing demand for virtualized FPGAs.

The research group of Professor Yu Wang at Tsinghua University has been working on FPGA virtualization for years, and recently proposed a framework that enables node-level FPGA virtualization for deep learning acceleration. Performance isolation between multiple users is achieved through a two-level instruction dispatch module (IDM) and a multi-core hardware resource pool. By adopting a tiling-based instruction frame package design and a two-stage static-dynamic compilation flow, the overhead of online re-compilation is reduced to about 1 ms. The baseline accelerator design is based on Angel-Eye, a DNN accelerator on FPGA published by Prof. Wang's group in 2017.

The paper has just been presented at the 28th FCCM, a premier academic conference in the programmable computing area. A demo is also provided for anyone who wants to try it:

https://github.com/annoysss123/FPGA-Virt-Exp-on-Aliyun-f3

General Introduction

As shown in Figure 1 (a), an FPGA instance provided on the cloud is usually one with plenty of resources, such as a Xilinx VU9P, which can support a large number of users. For a public cloud, there are two typical isolation methods between users: physical resource isolation and performance isolation. Physical resource isolation allocates different hardware resources to different users, while performance isolation means the performance provided to each user is not disturbed by the tasks of other users. For a private cloud, virtualization aims to maximize overall system performance.

In this research, two baseline designs are used for comparison: a static single large-core design that supports multiple users through time-division multiplexing (TDM), and a static multi-core design with 16 small cores, each serving one user. The virtualized design also has 16 small cores, but uses space-division multiplexing (SDM) to support multiple tasks dynamically.

Figure 1. Virtualization method for ISA-based DNN accelerator on FPGA: (a) Hardware architecture for public cloud; (b) Two-stage compiler design for private cloud.

As shown in Figure 1 (b), a two-stage compilation flow is proposed to reduce the online re-compilation overhead. The basic idea is to tile the output feature map into blocks and generate an instruction frame package (IFP), a series of instructions, for each tile. During offline deployment, IFPs are generated based on the DNN model and the hardware configuration of the basic shareable units (small cores). During the online deployment stage, only the pre-generated IFPs need to be re-allocated to each core, based on the hardware resources re-allocated for new users. A simple latency simulator is also proposed to predict the latency of each IFP and achieve workload balance among all allocated cores.

Figure 2. Hardware architecture of the proposed virtualized FPGA DNN accelerator

Hardware and Compiler Design

To realize virtualization, as shown in Figure 2, the hardware architecture of the DNN accelerator differs from that of a traditional DNN accelerator. First, as shown in Figure 2 (a), the accelerator adopts a Hardware Resource Pool (HRP) of multiple small cores, which are allocated to different users exclusively. In addition, unlike the original Instruction Distribution Module (IDM), which only implements instruction distribution and dependency management within a single core, a two-level IDM is designed to achieve multi-core sharing and synchronization, as shown in Figure 2 (b).

The first-level IDM has four modules: Instruction Mem., Instruction Decoder, Context-Switch Controller, and Multi-Core Sync. Controller. Instruction Mem. fetches instructions from DDR and caches them until the next reconfiguration. The Instruction Decoder decodes them and sends each instruction to the second-level IDM of the corresponding core according to the instruction's core index. The Context-Switch Controller records the index of the DNN layer that has been executed, so that other cores can continue computing on the intermediate results. The Multi-Core Sync. Controller manages the layer-wise multi-core synchronization.

The second-level IDM manages the computation inside each core. The Context-Switch Module can restart the computation based on the context information recorded by the first-level IDM during the online reconfiguration stage. The System Synchronization Controller generates a local synchronization signal when the computation of the current DNN layer finishes, then waits for a valid global synchronization signal before the next layer starts.

Figure 3. The compilation flow for the virtualized FPGA DNN accelerator, including static compilation (left) and dynamic compilation (right).

Tiling is an important idea in DNN accelerator design: partitioning the feature maps into tiles enables massive parallelism and data reuse. Tiling can be applied along different dimensions, such as the height and width of the feature maps. However, since the compiler generates convolution instructions along the height dimension, the feature-map width and output-channel dimensions are selected as the tiling dimensions for generating IFPs.
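As an illustration of this tiling scheme, the sketch below partitions an output feature map along the width and output-channel dimensions. The function name, tuple layout, and tile sizes are assumptions chosen for illustration, not the paper's actual compiler code.

```python
# Hypothetical sketch of IFP tiling: partition the output feature map
# along the width and output-channel dimensions, yielding one tile (and
# thus one instruction frame package) per (width, channel) block.

def tile_feature_map(width, out_channels, tile_w, tile_c):
    """Return (w_start, w_size, c_start, c_size) for each tile/IFP."""
    tiles = []
    for c in range(0, out_channels, tile_c):
        for w in range(0, width, tile_w):
            tiles.append((w, min(tile_w, width - w),
                          c, min(tile_c, out_channels - c)))
    return tiles

# A 224-wide layer with 64 output channels, tiled into 56-wide,
# 16-channel blocks, yields 4 x 4 = 16 tiles, i.e. 16 IFPs.
ifps = tile_feature_map(224, 64, 56, 16)
```

Each tile's instruction sequence would then be packaged into one IFP, so that online re-allocation can move whole IFPs between cores without re-compiling them.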

The left part of Figure 3 shows how the latency results for each IFP are obtained under different tiling dimensions. After tiling, the instructions in each tile are integrated into an IFP, and a cycle-level latency simulator predicts the latency of each IFP (T_1 to T_N or T_M). In the dynamic compilation stage, the dynamic compiler fetches these latency predictions and finds the optimal multi-core allocation strategy that minimizes the total latency of a DNN layer.
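The allocation step can be approximated with a classic longest-processing-time (LPT) greedy heuristic: walk the IFPs from the highest predicted latency down, always assigning to the currently least-loaded core, so the layer latency (the maximum per-core load) stays balanced. This is a sketch under assumed names, not the paper's actual dynamic compiler.

```python
import heapq

def allocate_ifps(latencies, num_cores):
    """Greedy LPT scheduling: assign each IFP (longest first) to the
    least-loaded core; returns (per-core loads, per-core IFP lists)."""
    loads = [(0.0, core) for core in range(num_cores)]
    heapq.heapify(loads)
    assignment = [[] for _ in range(num_cores)]
    # Sort IFP indices by predicted latency, longest first.
    for ifp, t in sorted(enumerate(latencies), key=lambda x: -x[1]):
        load, core = heapq.heappop(loads)   # least-loaded core
        assignment[core].append(ifp)
        heapq.heappush(loads, (load + t, core))
    per_core = [0.0] * num_cores
    while loads:
        load, core = heapq.heappop(loads)
        per_core[core] = load
    return per_core, assignment

# The layer's latency is the maximum per-core load after balancing.
per_core, assignment = allocate_ifps([4.0, 3.0, 3.0, 2.0, 2.0, 2.0], 2)
```

On these sample latencies both cores end up with a load of 8.0, which happens to be optimal; LPT is not always optimal, but it is cheap enough to run at online-reconfiguration time.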

Experimental Results

To evaluate the performance of this virtualization design, the research team used a Xilinx Alveo U200, a Xilinx VU9P FPGA on Aliyun (Alibaba Cloud), and an NVIDIA Tesla V100 GPU to run four well-known DNN models: Inception v3, VGG16, MobileNet, and ResNet50. Xilinx SDAccel 2018.3 is used for hardware synthesis and software deployment. Hardware resource utilization results are shown in Table 1. The virtualized multi-core design uses about 1% more logic and memory resources than the static multi-core design, since the two-level IDM costs slightly more resources. The multi-core designs consume almost twice the resources of the single large-core design, since many modules, such as the data mover, are replicated for each small core.

Table 1. Hardware resource utilization

First, the cost of context switching with different numbers of re-allocated cores is evaluated, as shown in Table 2. The static compilation takes 14.7-46.8 s to generate IFPs during offline deployment, while the dynamic compilation costs only 0.4-1.5 ms. The total online reconfiguration overhead is limited to 0.45-1.70 ms, including the time to transfer the instruction files. Compared with a non-virtualized design, which takes tens of seconds to re-compile the whole DNN model, the online reconfiguration overhead of the virtualized design is negligible.

Table 2. Context-switching cost with different numbers of re-allocated cores

A good virtualization framework should achieve strong performance isolation: the performance of one user should not be influenced by the tasks of others. The second experiment assumes there are 4 users, gives one user a fixed share x of the resources (100%, 75%, 50%, or 25%), adjusts the remaining users to occupy the other (1 - x), and measures the maximum and minimum performance that user can get. As shown in Figure 4, when a user monopolizes all resources, there is no performance deviation. When a single user occupies 75%, 50%, and 25% of the resources, the GPU virtualization solution shows 7.1-13.1%, 5.5-10.9%, and 6.5-8.1% performance deviations, while the FPGA virtualization design limits them to within 1%. The FPGA virtualization solution thus achieves much better isolation than the GPU while meeting all isolation requirements.
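The deviation numbers above can be reproduced with a one-line metric. The exact formula (relative spread between the best- and worst-case throughput a user observes) is an assumption about how such deviation is typically measured, not taken from the paper.

```python
def performance_deviation(throughputs):
    """Relative spread between max and min observed throughput, in percent.
    An assumed metric for illustration, not necessarily the paper's exact one."""
    hi, lo = max(throughputs), min(throughputs)
    return (hi - lo) / hi * 100.0

# A user whose throughput swings between 93 and 100 inferences/s across
# co-tenant workloads sees a 7% deviation; perfect isolation gives 0%.
dev = performance_deviation([100.0, 97.0, 93.0])
```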

Figure 4. Performance isolation: the performance deviation from the ideal for one user with different hardware resource shares when there are 4 users.

A good virtualization framework should also deliver performance that scales roughly linearly with the resources allocated to each user. Figure 5 shows the performance results for different DNN models and different tiling strategies. The red line presents the performance of a single large core, and the dark blue line presents the virtualized design with workload balancing.

For Inception v3 and VGG16, the performance loss of the virtualized design is just 0.95% and 3.93%, respectively; but for MobileNet there is a performance loss of 31.64%, since this compact DNN model requires much more memory bandwidth and the multi-core design further increases the bandwidth demand.

We can also see that for VGG16 the performance is nearly linear in the parallelism, since it is a computation-bound task. But for Inception v3 and MobileNet, even the single large-core design deteriorates considerably from ideal linearity at high degrees of parallelism, because both models are memory bound.

Figure 5. The single-task throughput under different situations with different parallelism.

For the situation with multiple users and multiple tasks, the performance is measured as the total throughput of the FPGA chip. As shown in Figure 6, the four columns from light blue to dark blue represent the throughput of the virtualized cores, the virtualized cores optimized for a single task per core, the static multi-core design, and the large single-core design. A total of 16 cores are implemented on the FPGA, so at most 16 tasks can be supported simultaneously.

When there are only a few tasks (1, 2, or 4), sending each task to one small core means the static multi-core design cannot achieve high throughput, while the virtualized designs perform much better. When there are 8, 12, or 16 tasks, the throughput of the static large single-core design does not improve, since all tasks are executed in a TDM manner. For the virtualized design, however, the more tasks are executed, the higher the achievable throughput. With a total of 16 tasks, the optimized virtualized design achieves the best throughput.

Figure 6. The multi-task throughput under different situations

In conclusion, the proposed FPGA virtualization framework provides excellent performance isolation, scalability, and flexibility. With an online reconfiguration overhead of about 1 ms and a single-core performance loss of 1.12%, it achieves 1.07x-1.69x and 1.88x-3.12x performance improvements over the two baseline designs. The virtualized FPGA design also achieves strong isolation and near-linear scaling with hardware resources. It will help further reduce the TCO of deep learning applications in the cloud.

If you are interested in the original paper, you can download it from arXiv:

https://arxiv.org/abs/2003.12101.

Source: https://forums.xilinx.com/t5/AI-and-Machine-Learning-Blog/FPGA-Virtualization-for-Deep-Learning-Achieving-3X-Performance/ba-p/1104520


JP Morgan: Put 1% In Bitcoin as a Hedge as Demand is ‘Massively Outstripping’ Supply

The narrative that investors should allocate 1% of their portfolio in bitcoin as a hedge has received support from strategists representing the giant US multinational investment bank – JPMorgan Chase & Co.

The analysts also highlighted the evaporating liquid supply, as giant institutions and corporations are purchasing substantial quantities rather rapidly.

JPM Suggests: Put 1% in BTC

Among the most popular topics of discussion within the community is how large a percentage of their portfolio investors should allocate to bitcoin. The narrative ranges from BTC maximalists saying that all eggs should go in one bitcoin basket to others advocating broader diversification.

However, very few outside the crypto community had suggested any BTC exposure until last year. Perhaps the first to go public with it was the legendary legacy-market investor Paul Tudor Jones, following the COVID-19-induced market crash.

Since then, more representatives of the traditional financial field have joined, and the latest ones are strategists from JPMorgan.

Cited by Bloomberg, they seemed somewhat cautious but still indicated that investors should look into BTC for a possible hedge.

“In a multi-asset portfolio, investors can likely add up to 1% of their allocation to cryptocurrencies in order to achieve any efficiency gain in the overall risk-adjusted returns of the portfolio.”

However, the analysts advised investors to explore other fiat currencies, such as the yen or the dollar, if they want to hedge a macro event and not cryptocurrencies as they are “investment vehicles and not funding currencies.”

BTC’s Declining Liquid Supply

JPM also touched upon another compelling topic, which has surged in popularity in the past several months – BTC’s decreasing liquid supply.

After all, numerous giant names have joined the BTC craze since the summer of 2020. As of now, MicroStrategy owns over 90,000 bitcoins, Grayscale is purchasing new coins at record levels, Tesla allocated $1.5 billion to the asset, and numerous institutions have bought in as well.

Simultaneously, the production rate of newly-created bitcoins was slashed in half in May 2020 following the third-ever halving. Consequently, the skyrocketing demand and the decreasing liquid supply affected the asset price, which is up by 50% since the start of the year – even after the latest massive correction.

“Through the insatiable buy-side pressure from exchange-traded fund issuers, close-ended funds, and large public corporations adding Bitcoin to their positions, demand is massively outstripping supply.” – concluded JPM’s strategists.

Source: https://cryptopotato.com/jp-morgan-put-1-in-bitcoin-as-a-hedge-as-demand-is-massively-outstripping-supply/



Monero, Ontology, Synthetix Price Analysis: 26 February

Monero was treading water around the $200-level, with the crypto likely to give way to a wave of selling pressure. Ontology fell under multiple levels of former support over the last few days and could break past one or two more. Finally, Synthetix saw a region of demand flipped to one of supply.

Monero [XMR]


Source: XMR/USDT on TradingView

The RSI fell below 50 and tested it as resistance on the hourly chart after XMR’s bulls attempted to keep the price above $200. This could be an uphill battle, especially if Bitcoin continues to drop.

Over the next few days, $220 and $180 are the levels to watch out for. Climbing above $220 would imply that a recovery has begun for XMR, while dropping below its previous local low of $180 would see XMR shed value further.

The Stochastic RSI was recovering from oversold territory over the past few hours. The trading volume rose as the price fell, pointing to the fact that strong bearish market sentiment was still in play.

Ontology [ONT]


Source: ONT/USDT on TradingView

The Directional Movement Index showed a strong bearish trend was in progress as the ADX (yellow) rose above 20 alongside the -DI (pink). The Awesome Oscillator also underlined southbound market momentum.

The next levels of interest for ONT were the $0.75 and $0.68 support levels. A sign of strength from the bulls, such as a double bottom, would be required before the coin can be considered on the road to recovery.

Synthetix [SNX]


Source: SNX/USDT on TradingView

On the hourly chart, fractals were used to highlight the points that formed the descending channel's boundaries. As can be seen, SNX closed a trading session under the channel and rose to retest the $18 region as one of supply, formerly demand.

Having confirmed this retest, the market's bears forced the price lower. The next levels of support for SNX lay at $16 and $14, representing drops of 10% and 21%, respectively, from where the price was trading at the time of writing.

The MACD noted strong bearish momentum, as did the 8-period and 20-period exponential moving averages (blue and white respectively).




Source: https://ambcrypto.com/monero-ontology-synthetix-price-analysis-26-february



This Bitcoin metric may be key to Gold’s flippening in the future

At the time of writing, Bitcoin’s price was falling again, with the cryptocurrency’s performance breaking from its rangebound behavior between $49,000 and $51,000 yesterday. And yet, despite the scale of the drop, many still expected recovery to come soon enough. In fact, a few signs were visible just before BTC’s latest fall below $47,000.

Consider this: at the time, volatility was up to 16%, rising by 2% after the dip from its ATH of $58,330. While it's almost a given that Bitcoin will soon bounce back, it's worth examining what will drive such a recovery. On CMC's latest podcast, Jeff Ross of Vailshire Cap spoke about the prevailing narrative during this market cycle. According to him, the narrative of Gold 2.0 is the one playing out.

Gold has been repeatedly mentioned in popular narratives since the flippening of gold is seen by most as a major event. Since a majority of Gold bugs are key investors and hedge fund managers, there is potential market capitalization to tap into. After crossing the $1 trillion-mark, Bitcoin is even closer to $10 trillion, with the price following the S2F model like clockwork.

Gold’s S2F ratio was 62 while Bitcoin’s S2F was 52, at press time, and this may be one of the reasons for following S2F, despite the fact that many gold bugs will still find a reason to criticize BTC’s price action.

Will the narrative of Gold 2.0 play out this market cycle?

Source: Digitalk

The fact that Bitcoin's annualized average daily volatility was observed to be above 120%, while Gold's was a little over 20%, highlighted how uncorrelated the two are. Despite the two assets not being correlated after the decoupling in November 2020, the Gold 2.0 narrative is driving institutional investment inflows into Bitcoin. When Bitcoin's S2F crosses 100, the flippening may occur and the comparisons between Gold and Bitcoin may cease to exist.

The cyclical movement of price, with volatility at 16% at press time, may continue for Bitcoin. In the last 24 hours alone, based on on-chain metrics, trade volume dropped by over 44% across exchanges. This drop in trade volume may be a response to the Bitcoin Options expiry on Deribit.

Previously, Options expiry events have had a significant impact on the price of the asset in the short-term. However, post the expiry, the price may sustain itself below its press time level, before recovering in a cyclical manner over the following month.

Will the narrative of Gold 2.0 play out this market cycle?

Source: Skew

Since this has emerged as a pattern in previous market cycles, it may repeat at least until the crypto's price recovers and trades above the $55,000 level. A few days ago, the aggregate daily volume in BTC Futures on top exchanges was close to $180 billion. With a hike in volatility expected in the near term, these figures are likely to grow even more, especially if a recovery is indeed underway.




Source: https://ambcrypto.com/this-bitcoin-metric-may-be-key-to-golds-flippening-in-the-future
