FPGA Virtualization for Deep Learning: Achieving 3X Performance in the Cloud

This is a guest post from Song Yao, Xilinx Senior Director in AI Business.

Cloud computing has become the new computing paradigm. For cloud computing, virtualization is necessary to enable isolation between users, high flexibility and scalability, high security, and maximized utilization of hardware resources.

Since 2017, FPGAs have been widely adopted in cloud computing because of their programmability, low latency, and high energy efficiency. Amazon Web Services, Aliyun (Alibaba Cloud), Microsoft Azure, Huawei Cloud, and others all currently offer Xilinx FPGA instances. However, FPGA instances on the cloud are still physical instances aimed at single-task, static-workload scenarios, which means that multiple users can share one FPGA instance only through time-division multiplexing (TDM). Demand for virtualized FPGAs is therefore increasing.

The research group of Professor Yu Wang at Tsinghua University has been working on FPGA virtualization for years and recently proposed a framework that enables node-level FPGA virtualization for deep learning acceleration. Performance isolation for multiple users is enabled through a two-level instruction dispatch module (IDM) and a multi-core hardware resource pool. A tiling-based instruction frame package design, combined with two-stage static-dynamic compilation, reduces the overhead of online re-compilation to about 1 ms. The baseline DNN accelerator design is based on Angel-Eye, a DNN accelerator on FPGA published by Prof. Wang’s group in 2017.

The paper has just been presented at the 28th FCCM, a premier academic conference in the programmable computing area. A demo is also provided for anyone who wants to try it:

https://github.com/annoysss123/FPGA-Virt-Exp-on-Aliyun-f3

General Introduction

As shown in Figure 1 (a), an FPGA instance provided on the cloud is usually a device with plentiful resources, such as the Xilinx VU9P, which could support a large number of users. For the public cloud, there are two typical methods of isolating users: physical resource isolation and performance isolation. Physical resource isolation allocates different hardware resources to different users, while performance isolation means that the performance provided to each user is not disturbed by the tasks of other users. For the private cloud, virtualization aims to maximize overall system performance.

In this research, two baseline designs are used for comparison: a static single large core that supports multiple users through time-division multiplexing (TDM), and a static multi-core design with 16 small cores in which each core supports one user. The virtualized design also has 16 small cores but uses space-division multiplexing (SDM) to support multiple tasks dynamically.

Figure 1. Virtualization method for ISA-based DNN accelerator on FPGA: (a) Hardware architecture for public cloud; (b) Two-stage compiler design for private cloud.

As shown in Figure 1 (b), a two-stage compilation flow is proposed to reduce online re-compilation overhead. The basic idea is to tile the output feature map into blocks and to package the series of instructions for each tile into an instruction frame package (IFP). During offline deployment, IFPs are generated from the DNN model and the hardware configuration of the basic shareable units (small cores). During online deployment, only the pre-generated IFPs need to be re-allocated to the cores according to the hardware resources re-allocated for new users. A simple latency simulator is also proposed to predict the latency of each IFP so that the workload can be balanced across all allocated cores.

Figure 2. Hardware architecture of the proposed virtualized FPGA DNN accelerator

Hardware and Compiler Design

To realize virtualization, the hardware architecture of the DNN accelerator, shown in Figure 2, differs from that of a traditional DNN accelerator. First, as shown in Figure 2 (a), the accelerator adopts a Hardware Resource Pool (HRP) of multiple small cores, which are allocated exclusively to different users. Second, unlike the original Instruction Distribution Module (IDM), which only implements instruction distribution and dependency management within a single core, a two-level IDM is designed to achieve multi-core sharing and synchronization, as shown in Figure 2 (b).

The first level IDM has four modules: Instruction Mem., Instruction Decoder, Context-Switch Controller, and Multi-Core Sync. Controller. Instruction Mem. fetches the instructions from DDR and caches them until the next reconfiguration. The Instruction Decoder decodes them and sends each instruction to the second level IDM of the corresponding core according to the instruction’s core index. The Context-Switch Controller records the index of the DNN layer that has been executed, so that other cores can continue computing from the intermediate results. The Multi-Core Sync. Controller manages the layer-wise multi-core synchronization.

The second level IDM manages the computation inside each core. The Context-Switch Module can restart the computation based on the context information recorded by the first level IDM in the online reconfiguration stage. The System Synchronization Controller generates the local synchronization signal when the computation of the current DNN layer finishes and then waits for a valid global synchronization signal for the next layer to start.
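The routing and synchronization behaviour described above can be pictured with a small software model. This is only an illustrative sketch, not the hardware design: the `Instruction` fields, the `FirstLevelIDM` class, and its method names are hypothetical, and the per-core queues merely stand in for the second level IDMs.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Deque, Dict, Iterable, Set


@dataclass(frozen=True)
class Instruction:
    core: int   # index of the target core (hypothetical field)
    layer: int  # DNN layer the instruction belongs to (hypothetical field)
    op: str     # opaque payload, e.g. "LOAD", "CONV", "SAVE"


class FirstLevelIDM:
    """Toy model: route instructions by core index and track layer-wise sync."""

    def __init__(self, cores: Iterable[int]) -> None:
        self.cores: Set[int] = set(cores)
        # One queue per core, standing in for that core's second level IDM.
        self.queues: Dict[int, Deque[Instruction]] = {c: deque() for c in self.cores}
        # layer -> cores that have raised their local synchronization signal
        self.done: Dict[int, Set[int]] = defaultdict(set)
        # Context kept so that a later re-allocation can resume from here.
        self.last_finished_layer = -1

    def dispatch(self, program: Iterable[Instruction]) -> None:
        """Send each instruction to the queue of the core it names."""
        for inst in program:
            if inst.core not in self.queues:
                raise KeyError(f"core {inst.core} is not allocated")
            self.queues[inst.core].append(inst)

    def local_sync(self, core: int, layer: int) -> bool:
        """Record that `core` finished `layer`; True once every core has."""
        self.done[layer].add(core)
        if self.done[layer] == self.cores:
            self.last_finished_layer = layer  # global sync: next layer may start
            return True
        return False
```

In this model, a layer is considered finished only after every allocated core has reported, which mirrors the local and global synchronization signals described in the text.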

Figure 3. The compilation flow for virtualized FPGA DNN accelerator, including static compilation (left) and dynamic compilation (right).

Tiling is an important technique in DNN accelerator design: partitioning the feature maps into tiles enables massive parallelism and data reuse. Tiling can be applied along different dimensions, such as the height and width of the feature maps. However, since the compiler generates Convolution instructions along the height dimension, the feature map width and the output channel dimension are selected as the tiling dimensions for generating IFPs.
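As a rough illustration of tiling along the width and output channel dimensions, the sketch below splits an output feature map into near-equal tiles, each of which would correspond to one IFP. The `Tile` structure, the function names, and the near-equal splitting rule are assumptions made for this example; the actual compiler may choose tile sizes differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Tile:
    """One output tile (hypothetical structure); ranges are half-open."""
    w_start: int
    w_end: int
    c_start: int
    c_end: int


def split_range(total: int, parts: int) -> List[range]:
    """Split [0, total) into contiguous ranges whose sizes differ by at most 1."""
    parts = min(parts, total)
    base, extra = divmod(total, parts)
    ranges, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def tile_output(width: int, out_channels: int, w_parts: int, c_parts: int) -> List[Tile]:
    """Tile along width and output channels; the height dimension is left whole."""
    return [
        Tile(w.start, w.stop, c.start, c.stop)
        for w in split_range(width, w_parts)
        for c in split_range(out_channels, c_parts)
    ]
```

For example, `tile_output(56, 64, 4, 4)` yields 16 tiles that together cover every (column, channel) pair exactly once.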

The left part of Figure 3 shows how to get the latency results for each IFP using different tiling dimensions. After tiling, instructions in each tile are integrated into an IFP and then a cycle-level latency simulator predicts the latency of each IFP (T_1 to T_N or T_M). In the dynamic compilation stage, the dynamic compiler fetches the latency predictions and finds the optimal allocation strategy for multi-core sharing to minimize the total latency of a DNN layer.
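To make the allocation step concrete, here is a minimal sketch that assigns IFPs to cores from their predicted latencies. It uses a greedy longest-processing-time-first heuristic, which is a stand-in chosen for brevity and is not guaranteed to be optimal; the article does not describe the search procedure the dynamic compiler actually uses. The function name and the input latencies are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def allocate_ifps(latencies: List[float], num_cores: int) -> Tuple[Dict[int, List[int]], float]:
    """Greedy assignment: give the next-longest IFP to the least-loaded core.

    Returns (core index -> list of IFP indices, predicted layer latency),
    where the layer latency is the load of the busiest core.
    """
    if num_cores <= 0:
        raise ValueError("num_cores must be positive")
    heap = [(0.0, core) for core in range(num_cores)]  # (current load, core)
    heapq.heapify(heap)
    assignment: Dict[int, List[int]] = {core: [] for core in range(num_cores)}
    # Visit IFPs from the longest predicted latency to the shortest.
    order = sorted(range(len(latencies)), key=lambda i: latencies[i], reverse=True)
    for ifp in order:
        load, core = heapq.heappop(heap)
        assignment[core].append(ifp)
        heapq.heappush(heap, (load + latencies[ifp], core))
    layer_latency = max(load for load, _ in heap)
    return assignment, layer_latency
```

Because the layer finishes only when the slowest core finishes, minimizing the maximum per-core load is the quantity that matters here; the heuristic keeps that maximum low without an exhaustive search.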

Experimental Results

To evaluate the performance of this virtualization design, the research team used a Xilinx Alveo U200, a Xilinx VU9P FPGA in Aliyun (Alibaba Cloud), and an NVIDIA Tesla V100 GPU to run four well-known DNN models for comparison: Inception v3, VGG16, MobileNet, and ResNet50. Xilinx SDAccel 2018.3 is used for hardware synthesis and software deployment. Hardware resource utilization results are shown in Table 1. The virtualized multi-core design uses about 1% more logic and memory resources than the static multi-core design, since the two-level IDM costs slightly more resources. The multi-core designs consume almost twice the resources of the single large core design, since many modules, such as the data mover, are replicated for each small core.

Table 1. Hardware resource utilization.

First, the cost of context switching with different numbers of re-allocated cores is evaluated. As shown in Table 2, static compilation takes 14.7-46.8 s to generate IFPs during offline deployment, while dynamic compilation costs only 0.4-1.5 ms. Including the time to transfer the instruction files, the total online reconfiguration overhead is limited to 0.45-1.70 ms. Compared with a non-virtualized design that takes tens of seconds to re-compile the whole DNN model, the online reconfiguration overhead of the virtualized design is negligible.
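The gap between the offline and online stages can be checked with a quick calculation on the ranges quoted above. Since the text does not say which static compilation time pairs with which reconfiguration time, the sketch only computes the smallest and largest possible ratios; the constant and function names are made up for this example.

```python
from typing import Tuple

# Ranges quoted in the text: static compilation in seconds,
# total online reconfiguration in milliseconds.
STATIC_COMPILE_S = (14.7, 46.8)
ONLINE_RECONFIG_MS = (0.45, 1.70)


def ratio_bounds(offline_s: Tuple[float, float], online_ms: Tuple[float, float]) -> Tuple[float, float]:
    """Return (smallest, largest) possible offline/online time ratio."""
    lowest = min(offline_s) / (max(online_ms) / 1000.0)
    highest = max(offline_s) / (min(online_ms) / 1000.0)
    return lowest, highest


low, high = ratio_bounds(STATIC_COMPILE_S, ONLINE_RECONFIG_MS)
print(f"offline stage is between {low:,.0f}x and {high:,.0f}x slower than online")
```

Even in the least favourable pairing, the offline stage is several thousand times slower than the online stage, which is why moving the expensive work offline matters.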

Table 2. Context-switching cost with different numbers of re-allocated cores.

A good virtualization framework should achieve good performance isolation, which means that the performance of one user should not be influenced by other users’ tasks. The second experiment assumes four possible users. It gives one user a fixed share x of the resources (100%, 75%, 50%, or 25% of the total), lets the remaining users occupy the other (1 – x), and measures the maximum and minimum performance that the first user obtains. As shown in Figure 4, when a user monopolizes all resources, there is no performance deviation. When a single user occupies 75%, 50%, and 25% of the total resources, the GPU virtualization solution shows performance deviations of 7.1-13.1%, 5.5-10.9%, and 6.5-8.1%, respectively, while the FPGA virtualization design keeps them within 1%. The FPGA virtualization solution thus achieves much better isolation than the GPU while meeting all the requirements for isolation.

Figure 4. Performance isolation: The performance deviation for ideal situation for one user with different hardware resources when there are 4 users.

A good virtualization framework should also deliver performance that scales nearly linearly with the resources allocated to a user. Figure 5 shows the performance results with different DNN models and different tiling strategies. The red line represents the performance of a single large core, and the dark blue line represents the virtualized design with workload balance.

For Inception v3 and VGG16, the performance loss of the virtualized design is just 0.95% and 3.93%, respectively. For MobileNet, however, the performance loss is 31.64%, since this compact DNN model requires much more memory bandwidth and the multi-core design further increases the bandwidth demand.

For VGG16, performance is nearly linear in parallelism, since it is a compute-bound task. For Inception v3 and MobileNet, however, even the single large core design falls well short of ideal linear scaling at high parallelism, because both are memory-bound.
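One common way to reason about compute-bound versus memory-bound behaviour is the roofline model, in which attainable throughput is the lower of the compute peak and the product of memory bandwidth and arithmetic intensity. The sketch below applies it to a growing number of cores. It is an idealized illustration only: the per-core peak, bandwidth, and intensity values are arbitrary placeholders, not measurements from the paper.

```python
from typing import List


def attainable_throughput(peak_ops_per_s: float, bandwidth_bytes_per_s: float, ops_per_byte: float) -> float:
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * ops_per_byte)


def scaling_curve(core_peak: float, bandwidth: float, ops_per_byte: float, core_counts: List[int]) -> List[float]:
    """Throughput bound as cores are added while memory bandwidth stays fixed."""
    return [attainable_throughput(n * core_peak, bandwidth, ops_per_byte) for n in core_counts]


# Placeholder numbers in arbitrary units, chosen only to show the two regimes.
high_intensity = scaling_curve(100.0, 40.0, 50.0, [1, 4, 16])  # stays on the compute roof
low_intensity = scaling_curve(100.0, 40.0, 10.0, [1, 4, 16])   # hits the memory roof
```

With the high-intensity setting the bound keeps growing with the core count, while with the low-intensity setting it flattens once the memory roof is reached, which is the qualitative pattern the text describes.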

Figure 5. The single-task throughput under different situations with different parallelism.

For the situation with multiple users and multiple tasks, performance is measured as the total throughput of the FPGA chip. As shown in Figure 6, the four columns from light blue to dark blue represent the throughput of virtualized cores, virtualized cores optimized for a single task per core, the static multi-core design, and the large single-core design. A total of 16 cores are implemented on the FPGA, so at most 16 tasks can be supported simultaneously.

When there are only a few tasks, such as 1, 2, or 4, the static multi-core design cannot achieve high throughput if each task is sent to one small core, while the virtualized designs perform much better. When there are 8, 12, or 16 tasks, the throughput of the static single large core design does not improve, since all tasks are executed in a TDM manner. For the virtualized design, however, the more tasks are executed, the higher the throughput. With a total of 16 tasks, the optimized virtualized design achieves the best throughput.

Figure 6. The multi-task throughput under different situations

In conclusion, the proposed FPGA virtualization framework provides excellent performance isolation, scalability, and flexibility. With an online reconfiguration overhead of about 1 ms and a 1.12% single-core performance loss, it achieves 1.07x – 1.69x and 1.88x – 3.12x performance improvements compared with the two static baseline designs. The virtualized FPGA design also achieves great isolation and linearity with respect to hardware resources. It will help further reduce the TCO of deep learning applications in the cloud.

If you are interested in the original paper, you can download it from arXiv:

https://arxiv.org/abs/2003.12101

Source: https://forums.xilinx.com/t5/AI-and-Machine-Learning-Blog/FPGA-Virtualization-for-Deep-Learning-Achieving-3X-Performance/ba-p/1104520

Blockchain

Members of WallStreetBets Forum Alleged in Telegram Crypto Scam Stealing $2M in BNB and ETH

Republished by Plato

Alleged members of the popular WallStreetBets Reddit forum are suspected of a cryptocurrency fraud that may have caused losses of at least $2 million. Through a dedicated Telegram group, they duped investors by promising remarkable returns from the recent crypto market rally.

The Core of the Hoax

Per a report by Bloomberg, alleged members of the WallStreetBets Reddit forum used the Telegram messaging service to execute a blatant scam. An account named “WallStreetBets – Crypto Pumps” offered users the chance to purchase a new token called WSB Finance before it was listed on crypto exchanges. Such an operation is known as a pre-mine sale.

The fraud exploited the recent cryptocurrency boom, in which bitcoin and most altcoins skyrocketed in value. With some digital assets gaining 1,000%, the alleged WSB members conned investors into sending money without asking questions, lured by the potential for huge profits.

The account also urged users to transfer popular cryptocurrencies such as Binance Coin (BNB) and Ethereum (ETH) to a designated crypto wallet and then to contact its “token bot” to receive WSB Finance coins.

However, the perpetrators never dispatched those coins. Furthermore, another message on Telegram revealed that the people who had already issued a payment had to send an equivalent amount again or they would risk losing their initial investment.


The Aftermath

After the hoax was executed, more than 3,451 Binance Coins were withdrawn on Tuesday (May 4) from the wallet listed in the Crypto Pumps messages.

Since the price of BNB at that point was approximately $625, the fraud caused losses of more than $2.1 million. Following the scam, thousands of people expressed their frustration and tried to expose the individuals behind the account. Moreover, the quantity of the other cryptocurrency – ether – still remains a mystery.

Two weeks ago, WSB admins warned about offers that might try to take advantage of the forum’s name to lure the crypto audience. The “WallStreetBets – Crypto Pumps” account has been removed from Telegram, but whoever managed it left a message that might stun the victims:

“Buying Lambo now.”

Source: https://cryptopotato.com/members-of-wallstreetbets-forum-alleged-in-telegram-crypto-scam-stealing-2m-in-bnb-and-eth/

Blockchain

South Korean Crypto Exchange Accused Of $1.5 Billion Scam

The South Korean cryptocurrency exchange V Global has been accused of luring 40,000 people into an illicit multi-level scheme. The entire scheme amounts to more than 1.7 trillion won, or about $1.5 billion.

The Investigation

According to Korean officials, the police raided many places in the country related to a virtual cryptocurrency exchange whose CEO, known as LEE, is alleged to have raised funds without regulatory permission. The authorities blocked the exchange’s cash deposits as part of the investigation.

In total, the Gyeonggi Nambu Police Agency reported that it searched the exchange’s headquarters in southern Seoul along with 21 other places and froze more than $214 million left in the account.

Another report from today shed more light on the developments. According to Yonhap News, the name of the organization is V Global. The Korean police are examining accusations of fraud against it under the Certain Economic Crimes Weighted Penalty Act, the Similar Receiving Act, and the law on door-to-door sales.

The main accusation against the exchange is gaining a deposit of 1.7 trillion won ($1.5 billion) from 40,000 members in the period between August 2020 and January 2021. The announcement revealed that most of the people were elderly or housewives with no experience in cryptocurrency trading.


Too Good To Be True

The investigation revealed that the exchange urged investors to entrust their funds to an account, luring them with the promise that the expected return would be three times higher than the initial investment. According to the authorities, there was a pyramid element in the scam, as the exchange promised to grant an introduction fee of 1.2 million won ($1,065) for every newly recruited member.

The report affirmed that the trading venue paid some members in the form of a block. Therefore, people who signed up earlier received funds from individuals who entered the exchange later.

Moreover, the Korean police seem confident in handling the fraud case, as they revealed their intention to confiscate 240 billion won ($214 million) left in the V Global account as of the 15th of last month, even before the prosecution process.

Source: https://cryptopotato.com/south-korean-crypto-exchange-accused-of-1-5-billion-scam/

Blockchain

Georgia’s central bank is exploring ‘Digital Gel’ CBDC

The National Bank of Georgia said that it is considering launching a central bank digital currency.

In an announcement today, the central bank hinted at the issuance of a central bank digital currency, or CBDC, in an effort “to enhance efficiencies of the domestic payment system and financial inclusion.” The National Bank of Georgia, or NBG, said it would be inviting fintech firms and other financial institutions to participate in the project, named Digital Gel after the symbol for the country’s fiat currency, the lari.

“CBDC holds the promise to unlock the tremendous value of innovative business models for the benefit of society,” said the announcement. “The introduction of CBDC could increase financial intermediation efficiency, help introduce new financial technologies, facilitate financial inclusion, and reach previously unbanked populations.”

However, the bank mentioned the possibility of risks in the launch of a CBDC in the Republic of Georgia given the “new and potentially disruptive technology.” The NBG said it may conduct extensive testing of the CBDC in a controlled environment to ensure a smooth rollout, but did not provide any details regarding a timeline for launch.

With a population of roughly 4 million and a gross domestic product of approximately $15 billion, a nation like Georgia falls at the smaller end of countries exploring CBDCs. The Bahamas officially rolled out its Sand Dollar central bank digital currency in October, while China has been piloting its digital yuan in select cities prior to a full-scale launch. In the United States, Fortune 500 company Accenture announced this week it would be partnering with the Digital Dollar Foundation to conduct CBDC trials.

Source: https://cointelegraph.com/news/georgia-s-central-bank-is-exploring-digital-gel-cbdc
