Conversation

@jsign jsign (Contributor) commented Mar 18, 2021

This PR:

  • Adds a workaround so we no longer have to rely on IpfsOnlineMode=true.
  • Adds a flag/env var to configure a hard timeout for stuck retrievals.
  • Improves UX by masking the behavior in which a miner claims the retrieval will be served by a worker address and not by the miner that made the deal.
  • Makes the StageCid method also pull the data from the IPFS network.

The main motivation for this change is: textileio/textile#533

jsign added 7 commits March 19, 2021 10:19 (each signed off by Ignacio Hagopian <jsign.uy@gmail.com>)
jsign added 2 commits March 19, 2021 12:44 (each signed off by Ignacio Hagopian <jsign.uy@gmail.com>)
FFSDealFinalityTimeout: time.Minute * 30,
FFSMaxParallelDealPreparing: 1,
FFSGCAutomaticGCInterval: 0,
FFSRetrievalNextEventTimeout: time.Hour,
jsign (Contributor Author):
A new config attribute to set a timeout for retrievals that might get stuck.
If we don't receive any data or event within this duration while a retrieval is in progress, we fail it.

Unfortunately, there are situations/bugs in Lotus in which a retrieval can get stuck, so this is a safety net.
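For readers unfamiliar with the pattern, here is a minimal generic sketch of such a safety net (placeholder channel/event types, not Powergate's actual retrieval loop): the timer is re-armed on every received event, so the call only fails when progress stalls for the whole duration.

package main

import (
	"fmt"
	"time"
)

// watchProgress is a hedged illustration of a "no event within d => fail" watchdog.
func watchProgress(events <-chan string, d time.Duration) error {
	for {
		select {
		case <-time.After(d): // a fresh timer each iteration: d counts from the last event
			return fmt.Errorf("didn't receive events for %s", d)
		case e, ok := <-events:
			if !ok {
				return nil // producer closed the channel: nothing left to wait for
			}
			fmt.Println("progress:", e)
		}
	}
}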

@@ -387,6 +389,7 @@ func setupFlags() error {
pflag.String("ffsminerselector", "reputation", "Miner selector to be used by FFS: 'sr2', 'reputation'.")
pflag.String("ffsminerselectorparams", "", "Miner selector configuration parameter, depends on --ffsminerselector.")
pflag.String("ffsminimumpiecesize", "67108864", "Minimum piece size in bytes allowed to be stored in Filecoin.")
pflag.Duration("ffsretrievalnexteventtimeout", time.Hour, "Maximum amount of time to wait for the next retrieval event before erroring it.")
jsign (Contributor Author):
It would be great to change all duration flags to .Duration in the future.
Created #803
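For context, a quick sketch of the difference (the flag names below are made up, not Powergate's): with pflag.String the duration has to be parsed and validated by hand, while pflag.Duration accepts values such as "45s", "30m" or "2h" directly and yields a time.Duration.

package main

import (
	"fmt"
	"time"

	"github.com/spf13/pflag"
)

func main() {
	// String flag: the raw value must be parsed manually afterwards.
	rawTimeout := pflag.String("sometimeout", "1h", "timeout as a plain string")
	// Duration flag: pflag does the parsing and type-checking itself.
	timeout := pflag.Duration("othertimeout", time.Hour, "timeout as a duration")
	pflag.Parse()

	parsed, err := time.ParseDuration(*rawTimeout)
	if err != nil {
		fmt.Println("invalid --sometimeout:", err)
		return
	}
	fmt.Println(parsed, *timeout)
}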

asutula (Member):
Yea, I've been using that lately, quite nice.

log.Infof("in progress retrieval errored: %s", err)
log.Infof("in progress retrieval errored: %s", e.Err)
jsign (Contributor Author):
Wrong error variable.

@@ -72,11 +72,15 @@ func (ci *CoreIpfs) Stage(ctx context.Context, iid ffs.APIID, r io.Reader) (cid.
return p.Cid(), nil
}

-// StageCid stage-pin a Cid.
+// StageCid pulls the Cid data and stage-pins it.
jsign (Contributor Author):
We're changing the meaning of StageCid to do something similar to what Stage does.
Before, it only tracked the Cid as "stage-pinned" (the temporary pin we create to allow multitenancy).
Now it also pins the data in the go-ipfs node, instead of assuming that's already the case.
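A hedged sketch of the "pull + pin" half of that idea, using the go-ipfs HTTP client's CoreAPI (illustration only: the function, API address and timeout are assumptions, and Powergate's real StageCid additionally records the stage-pin in its own pinstore for multitenancy, which isn't shown):

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/ipfs/go-cid"
	httpapi "github.com/ipfs/go-ipfs-http-client"
	"github.com/ipfs/interface-go-ipfs-core/path"
	ma "github.com/multiformats/go-multiaddr"
)

// pullAndPin asks a go-ipfs node to recursively pin c. Pinning forces the node
// to fetch any blocks it doesn't already have from the IPFS network.
func pullAndPin(ctx context.Context, apiAddr string, c cid.Cid) error {
	addr, err := ma.NewMultiaddr(apiAddr) // e.g. "/ip4/127.0.0.1/tcp/5001"
	if err != nil {
		return fmt.Errorf("parsing api multiaddr: %s", err)
	}
	ipfs, err := httpapi.NewApi(addr)
	if err != nil {
		return fmt.Errorf("creating ipfs client: %s", err)
	}
	// Bound the network fetch, in the same spirit as the AddTimeout config.
	ctx, cancel := context.WithTimeout(ctx, 15*time.Minute)
	defer cancel()
	return ipfs.Pin().Add(ctx, path.IpfsPath(c))
}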

Comment on lines +89 to +114
for {
	select {
	case <-time.After(fc.retrNextEventTimeout):
		return ffs.FetchInfo{}, fmt.Errorf("didn't receive events for %d minutes", int64(fc.retrNextEventTimeout.Minutes()))
	case e, ok := <-events:
		if !ok {
			break Loop
		}
		if e.Err != "" {
			return ffs.FetchInfo{}, fmt.Errorf("event error in retrieval progress: %s", e.Err)
		}
		strEvent := retrievalmarket.ClientEvents[e.Event]
		strDealStatus := retrievalmarket.DealStatuses[e.Status]
		fundsSpent = e.FundsSpent.Uint64()
		newMsg := fmt.Sprintf("Received %s, total spent: %sFIL (%s/%s)", humanize.IBytes(e.BytesReceived), util.AttoFilToFil(fundsSpent), strEvent, strDealStatus)
		if newMsg != lastMsg {
			fc.l.Log(ctx, newMsg)
			lastMsg = newMsg
		}
		lastEvent = e
	}
}
if lastEvent.Status != retrievalmarket.DealStatusCompleted {
	return ffs.FetchInfo{}, fmt.Errorf("retrieval failed with status %s and message %s", retrievalmarket.DealStatuses[lastEvent.Status], lastMsg)
}

jsign (Contributor Author):
TL;DR:

  • Use the timeout to fail if the retrieval gets stuck.
  • If the events channel is closed, check that the last received event is exactly the one meaning the retrieval ended successfully. If for some reason the channel gets closed in some non-final status, that's wrong and we should error.
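One Go detail worth calling out in the snippet above: break Loop targets a label declared just before the for (outside the quoted hunk). A bare break there would only exit the select, not the loop. A tiny generic illustration:

package main

import "fmt"

func main() {
	events := make(chan int)
	close(events)

Loop:
	for {
		select {
		case v, ok := <-events:
			if !ok {
				break Loop // exits the labeled for; a bare break would only leave the select
			}
			fmt.Println(v)
		}
	}
	fmt.Println("channel closed, loop finished")
}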

-AddTimeout: 480, // 8 min
+AddTimeout: 15 * 60, // 15 min
jsign (Contributor Author):
I believe this is a better default for big data. The user can change it anyway.

asutula (Member):
Worth updating everywhere in the db using the massive change tool?

jsign (Contributor Author):
Yep, I have that as a task in my pre-deployment steps. Using powcfg to do it :)

Comment on lines +231 to +241
// We want to avoid relying on Lotus running in online mode,
// so we take care of pulling the data from the IPFS network
// ourselves.
if !a.Cfg.Hot.Enabled && a.Cfg.Cold.Enabled {
	s.l.Log(ctx, "Automatically staging Cid from the IPFS network...")
	stageCtx, cancel := context.WithTimeout(ctx, time.Duration(a.Cfg.Hot.Ipfs.AddTimeout)*time.Second)
	defer cancel()
	if err := s.hs.StageCid(stageCtx, a.APIID, a.Cid); err != nil {
		return ffs.StorageInfo{}, nil, fmt.Errorf("automatically staging cid: %s", err)
	}
}
jsign (Contributor Author):
OK, so this is the gist of it all.
The main problem is that we were using IpfsOnlineMode=true in the Lotus node for the Hub case, but that mode has a bug affecting retrievals. So what I do here is simply force staging the Cid into Powergate's hot storage, and let Lotus use the go-ipfs HTTP API in offline mode, since the data will already be there thanks to Powergate.

TBH, IpfsOnlineMode=true was much more elegant than this pre-fetching, but this is mostly a workaround for that bug, not a fix for something wrong on our side.

-Miner: offer.Miner.String(),
+Miner: offer.MinerPeer.Address.String(),
jsign (Contributor Author):
Solves a situation in which miners claim that the retrieval will be served from a worker address and not their owner address. This might confuse the user: the deal was made with miner X, but the retrieval appears to be made from miner Y (and miner Y hasn't stored data on the network and looks mostly empty).

@jsign jsign marked this pull request as ready for review March 22, 2021 18:59
@jsign jsign requested a review from asutula March 22, 2021 18:59
@asutula asutula (Member) left a comment:
Looks good!

@jsign jsign merged commit 9c642ba into master Mar 22, 2021
@jsign jsign deleted the jsign/stagecid branch March 22, 2021 19:33